Prompt Quality (prompt_qual)
Metric Description
Prompt quality measures whether the prompt provided to the model is clear, specific, and actionable. A good prompt guides the model to generate relevant, well-structured responses and helps reduce hallucinations or distortions. The metric evaluates prompts across several dimensions—clarity, constraints, specificity, and category-specific criteria—depending on how the prompt is classified.
The score runs from 0 (poor prompt quality) to 100 (high prompt quality). The implementation blends (1) heuristic metrics for readability and clarity, (2) LLM-as-a-Judge for semantic analysis of constraints, examples, and structure, and (3) category-specific scoring tailored to the type of task.
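As a rough mental model, the overall score can be pictured as a weighted blend of those three components. The sketch below is purely illustrative; the function, the weights, and the component scores are assumptions, not the metric's actual implementation.

def blend_prompt_quality(heuristic: float, llm_judge: float, category: float) -> float:
    """Illustrative blend of three 0-100 component scores into one 0-100 score (assumed weights)."""
    weights = {"heuristic": 0.25, "llm_judge": 0.5, "category": 0.25}
    return (
        weights["heuristic"] * heuristic
        + weights["llm_judge"] * llm_judge
        + weights["category"] * category
    )

print(blend_prompt_quality(heuristic=70, llm_judge=85, category=90))  # 82.5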
How the metric works
The metric first classifies the prompt into one of five categories:
- Classification: Prompts for labeling, categorization, or classification tasks.
- Retrieval QA: Prompts for question-answering over retrieved context.
- Transformation: Prompts for rewriting, translating, or converting content.
- Creative generation: Prompts for creative writing, brainstorming, or content generation.
- Reasoning analysis: Prompts for step-by-step reasoning, analysis, or problem-solving.
Each category evaluates the prompt against a different combination of criteria.
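For illustration, the hypothetical prompts below show the kind of instruction that would typically fall into each category; they are examples of ours, not outputs of the classifier.

# Hypothetical example prompts per category (illustrative only).
EXAMPLE_PROMPTS = {
    "classification": "Label each support ticket as 'billing', 'technical', or 'other'.",
    "retrieval_qa": "Using only the context passages below, answer the question and cite the passage you used.",
    "transformation": "Translate the following paragraph into French, preserving the original tone.",
    "creative_generation": "Write a 100-word product description for a reusable water bottle.",
    "reasoning_analysis": "Work through the following logic puzzle step by step and state your final answer.",
}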
How to interpret the score
- Closer to 100: the prompt is clear, specific, well-structured, and provides appropriate guidance for its category.
- Closer to 0: the prompt is vague, contradictory, or poorly defined, or it lacks important elements (e.g., missing or unclear constraints).
This metric evaluates the prompt only, not the model’s output. For summarization tasks, consider excluding the text to be summarized from the prompt when measuring prompt quality.
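For example, with a summarization prompt you might score only the instruction portion and leave the source text out. The template and variable names below are just an illustration:

# Hypothetical summarization prompt: instructions followed by the document to summarize.
INSTRUCTIONS = (
    "Summarize the article below in three bullet points. "
    "Use neutral language and do not add facts that are not in the article."
)
article = "..."  # the text to be summarized

full_prompt = f"{INSTRUCTIONS}\n\nArticle:\n{article}"  # what you send to the model

# When measuring prompt quality, submit only the instruction portion:
row = {"prompt": INSTRUCTIONS}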
API usage
Prerequisites
Set the AEGIS_API_KEY and AEGIS_API_BASE_URL environment variables used by the example below. After the environment variables are configured, the next step is to create a JSON payload for the custom-runs request. For a field-by-field description of the payload (top-level keys, evaluations, and each row in data), see Custom run request body.
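The example script below reads its credentials with python-dotenv, so a minimal .env file looks like this (placeholder values; use your own key and base URL):

AEGIS_API_KEY=your-api-key
AEGIS_API_BASE_URL=https://your-aegis-host/api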
Shortname: prompt_qual
Default threshold: 80
Inputs (each object in data)
prompt (str, required): The prompt to evaluate (the instructions or text provided to the LLM).
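A minimal data row therefore contains only the prompt text, for example:

{"prompt": "Rewrite the sentence below in plain English suitable for a general audience."}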
Evaluation metadata
On successful evaluation, the metric returns eval_metadata describing how the prompt was scored:
prompt_category (str): The task category assigned to the prompt (for example classification, retrieval QA, transformation, creative generation, or reasoning analysis). Scoring weights and criteria depend on this category.
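As a hedged sketch of reading this back, assuming response is the object returned by the example script below; the exact field paths depend on the custom-run response schema and are assumptions here:

# Field paths below are assumptions; check the custom-run response documentation for the real layout.
result = response.json()
for eval_result in result.get("evaluations", []):
    for row_result in eval_result.get("results", []):
        print("prompt_category:", row_result.get("eval_metadata", {}).get("prompt_category"))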
Example
import json
import os

import requests
from dotenv import load_dotenv

load_dotenv(override=True)

_API_KEY = os.getenv("AEGIS_API_KEY")
_BASE_URL = os.getenv("AEGIS_API_BASE_URL")
_CUSTOM_RUN_URL = f"{_BASE_URL}/runs/custom"


def post_custom_run(payload: dict) -> requests.Response:
    """POST JSON payload to Aegis custom runs; returns the raw response."""
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {_API_KEY}",
    }
    return requests.post(
        _CUSTOM_RUN_URL,
        headers=headers,
        data=json.dumps(payload),
    )


if __name__ == "__main__":
    data = [
        {
            "prompt": "Classify each customer review into one of three labels: Positive, Negative, or Neutral. "
            "Instructions: 1. Give a reason under 15 words. 2. Use British English spelling. "
            "3. Provide your answer in JSON format. 4. Do not invent details not present in the review."
        },
    ]
    payload = {
        "threshold": 80,
        "model_slug": "o4-mini",
        "is_blocking": True,
        "data_collection_id": None,
        "evaluations": [
            {
                "metrics": ["prompt_qual"],
                "threshold": 80,
                "model_slug": "o4-mini",
                "data": data,
            }
        ],
    }

    response = post_custom_run(payload)
    response.raise_for_status()
    print(json.dumps(response.json(), indent=2))