Prompt Quality (prompt_qual)
Metric Description
Prompt quality measures whether the prompt provided to the model is clear, specific, and actionable. A good prompt guides the model to generate relevant, well-structured responses and helps reduce hallucinations or distortions. The metric evaluates prompts across several dimensions—clarity, constraints, specificity, output definition, and category-specific criteria—depending on how the prompt is classified.
The score runs from 0 (poor prompt quality) to 100 (high prompt quality). The implementation blends (1) heuristic readability metrics, (2) LLM-as-a-Judge for semantic analysis, and (3) category-specific weighting tailored to the type of task.
How the metric works
The metric first classifies the prompt into one of five categories:
- Classification: Prompts for labeling, categorization, or classification tasks.
- Retrieval QA: Prompts for question-answering over retrieved context.
- Transformation: Prompts for rewriting, translating, or converting content.
- Creative generation: Prompts for creative writing, brainstorming, or content generation.
- Reasoning analysis: Prompts for step-by-step reasoning, analysis, or problem-solving.
Each category uses a different combination of criteria against which the prompt is evaluated.
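As a rough illustration of how category-specific weighting could combine per-dimension scores into one 0 to 100 value, here is a minimal sketch. The weight values, the dimension keys, and the `weighted_score` helper are all hypothetical; the weights the metric actually uses are not documented here.

```python
# Hypothetical weights per category; the metric's real weights are not published here.
CATEGORY_WEIGHTS = {
    "classification": {
        "clarity": 0.25,
        "constraints": 0.15,
        "specificity": 0.20,
        "output_definition": 0.25,
        "category_specific": 0.15,
    },
    "creative_generation": {
        "clarity": 0.30,
        "constraints": 0.20,
        "specificity": 0.20,
        "output_definition": 0.10,
        "category_specific": 0.20,
    },
}


def weighted_score(category: str, dimension_scores: dict) -> float:
    """Combine 0-100 dimension scores using the weights for the given category."""
    weights = CATEGORY_WEIGHTS[category]
    total_weight = sum(weights.values())
    weighted_sum = sum(weights[dim] * dimension_scores[dim] for dim in weights)
    # Normalise so the result stays on the 0-100 scale even if weights do not sum to 1.
    return weighted_sum / total_weight
```

With weights like these, the same dimension scores produce different overall scores for different categories, which is why the assigned category matters when reading the result.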
How to interpret the score
- Closer to 100: the prompt is clear, specific, well-structured, and provides appropriate guidance for its category.
- Closer to 0: the prompt is vague, contradictory, poorly defined, or lacks important elements (e.g. unclear constraints or missing output format).
Use the returned explanation for a narrative summary of strengths and weaknesses, and eval_metadata.suggestions for concrete, dimension-specific fixes.
This metric evaluates the prompt only, not the model’s output. For summarization tasks, consider excluding the text to be summarized from the prompt when measuring prompt quality.
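One way to keep the source text out of the evaluated prompt is to substitute a short placeholder for it before submitting. The sketch below assumes the prompt is built from a template containing a literal `{document}` marker; the `prompt_for_evaluation` helper and the placeholder text are illustrative, not part of the API.

```python
def prompt_for_evaluation(template: str, placeholder: str = "[DOCUMENT]") -> str:
    """Return the instruction text with the source document replaced by a placeholder.

    Uses str.replace rather than str.format so that other braces in the
    template (for example, a JSON output schema) are left untouched.
    """
    return template.replace("{document}", placeholder)


template = (
    "Summarize the text below in three bullet points, each under 20 words.\n"
    "Text:\n{document}"
)
prompt_to_score = prompt_for_evaluation(template)
```

The resulting string can then be sent as the `prompt` field, so the score reflects the instructions rather than the material being summarized.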
API usage
Prerequisites
After the environment variables are configured, the next step is to create a JSON payload for the custom-runs request. For a field-by-field description of the payload (top-level keys, evaluations, and each row in data), see Custom run request body.
Shortname: prompt_qual
Default threshold: 80
Inputs (each object in data)
prompt (str, required): The prompt to evaluate (the instructions or text provided to the LLM).
Evaluation metadata
On successful evaluation, the metric returns eval_metadata describing how the prompt was scored:
prompt_category (str): The assigned task category (retrieval_qa, classification, transformation, creative_generation, or reasoning_analysis). Scoring weights and criteria depend on this value.
suggestions (list[str]): Actionable improvement suggestions for weak dimensions. Each item names the dimension, describes the issue, and proposes a specific fix (for example, a rewrite or a sentence to add). Empty when all evaluated dimensions score well.
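The following sketch shows one way to read these fields from a single metric result. It assumes the result has already been pulled out of the API response as a dict with `score`, `explanation`, and `eval_metadata` keys; that shape, the `score` key name, and the `summarize_result` helper are assumptions for illustration, so check them against an actual response.

```python
def summarize_result(result: dict, threshold: float = 80) -> str:
    """Build a short text summary from one prompt_qual result (assumed shape)."""
    meta = result.get("eval_metadata") or {}
    score = result.get("score")  # assumed key name for the 0-100 score
    if score is None:
        status = "unknown"
    else:
        # Assumption: a score at or above the threshold counts as passing.
        status = "pass" if score >= threshold else "below threshold"
    category = meta.get("prompt_category", "n/a")
    lines = [f"score={score} ({status}), category={category}"]
    # suggestions is empty when all evaluated dimensions score well.
    for suggestion in meta.get("suggestions") or []:
        lines.append(f"- {suggestion}")
    return "\n".join(lines)
```

Using `.get` throughout keeps the helper from failing when a field is missing, for example when an evaluation did not complete and no eval_metadata was returned.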
Example
import json
import os

import requests
from dotenv import load_dotenv

load_dotenv(override=True)

_API_KEY = os.getenv("AEGIS_API_KEY")
_BASE_URL = os.getenv("AEGIS_API_BASE_URL")
_CUSTOM_RUN_URL = f"{_BASE_URL}/runs/custom"


def post_custom_run(payload: dict) -> requests.Response:
    """POST JSON payload to Aegis custom runs; returns the raw response."""
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {_API_KEY}",
    }
    return requests.post(
        _CUSTOM_RUN_URL,
        headers=headers,
        data=json.dumps(payload),
    )


if __name__ == "__main__":
    data = [
        {
            "prompt": "Classify each customer review into one of three labels: Positive, Negative, or Neutral. "
            "Instructions: 1. Give a reason under 15 words. 2. Use British English spelling. "
            "3. Provide your answer in JSON format. 4. Do not invent details not present in the review."
        },
    ]
    payload = {
        "threshold": 80,
        "model_slug": "o4-mini",
        "is_blocking": True,
        "data_collection_id": None,
        "evaluations": [
            {
                "metrics": ["prompt_qual"],
                "threshold": 80,
                "model_slug": "o4-mini",
                "data": data,
            }
        ],
    }
    response = post_custom_run(payload)
    response.raise_for_status()
    print(json.dumps(response.json(), indent=2))