Prompt Quality (prompt_qual)
Metric Description
Prompt quality measures whether the prompt provided to the model is clear, specific, and actionable. A good prompt guides the model to generate relevant, well-structured responses and helps reduce hallucinations or distortions. The metric evaluates prompts across several dimensions—clarity, constraints, specificity, and category-specific criteria—depending on how the prompt is classified.
The score runs from 0 (poor prompt quality) to 100 (high prompt quality). The implementation blends (1) heuristic metrics for readability and clarity, (2) LLM-as-a-Judge for semantic analysis of constraints, examples, and structure, and (3) category-specific scoring tailored to the type of task.
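As a rough mental model, the overall score can be pictured as a weighted blend of those three components. The sketch below is purely illustrative; the function, the weights, and the component scores are assumptions, not the metric's actual implementation.

def blend_prompt_quality(heuristic: float, llm_judge: float, category: float) -> float:
    """Illustrative blend of three 0-100 component scores into one 0-100 score (assumed weights)."""
    weights = {"heuristic": 0.25, "llm_judge": 0.5, "category": 0.25}
    return (
        weights["heuristic"] * heuristic
        + weights["llm_judge"] * llm_judge
        + weights["category"] * category
    )

print(blend_prompt_quality(heuristic=70, llm_judge=85, category=90))  # 82.5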
How the metric works
The metric first classifies the prompt into one of five categories:
- Classification: Prompts for labeling, categorization, or classification tasks.
- Retrieval QA: Prompts for question-answering over retrieved context.
- Transformation: Prompts for rewriting, translating, or converting content.
- Creative generation: Prompts for creative writing, brainstorming, or content generation.
- Reasoning analysis: Prompts for step-by-step reasoning, analysis, or problem-solving.
Each category evaluates the prompt against a different combination of criteria.
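For illustration, the hypothetical prompts below show the kind of instruction that would typically fall into each category; they are examples of ours, not outputs of the classifier.

# Hypothetical example prompts per category (illustrative only).
EXAMPLE_PROMPTS = {
    "classification": "Label each support ticket as 'billing', 'technical', or 'other'.",
    "retrieval_qa": "Using only the context passages below, answer the question and cite the passage you used.",
    "transformation": "Translate the following paragraph into French, preserving the original tone.",
    "creative_generation": "Write a 100-word product description for a reusable water bottle.",
    "reasoning_analysis": "Work through the following logic puzzle step by step and state your final answer.",
}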
How to interpret the score
- Closer to 100: the prompt is clear, specific, well-structured, and provides appropriate guidance for its category.
- Closer to 0: the prompt is vague, contradictory, or poorly defined, or it lacks important elements (e.g., missing or unclear constraints).
This metric evaluates the prompt only, not the model’s output. For summarization tasks, consider excluding the text to be summarized from the prompt when measuring prompt quality.
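For example, with a summarization prompt you might score only the instruction portion and leave the source text out. The template and variable names below are just an illustration:

# Hypothetical summarization prompt: instructions followed by the document to summarize.
INSTRUCTIONS = (
    "Summarize the article below in three bullet points. "
    "Use neutral language and do not add facts that are not in the article."
)
article = "..."  # the text to be summarized

full_prompt = f"{INSTRUCTIONS}\n\nArticle:\n{article}"  # what you send to the model

# When measuring prompt quality, submit only the instruction portion:
row = {"prompt": INSTRUCTIONS}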
API usage
Prerequisites
Set the AEGIS_API_KEY and AEGIS_API_BASE_URL environment variables used by the example below. After the environment variables are configured, the next step is to create a JSON payload for the custom-runs request. For a field-by-field description of the payload (top-level keys, evaluations, and each row in data), see Custom run request body.
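The example script below reads its credentials with python-dotenv, so a minimal .env file looks like this (placeholder values; use your own key and base URL):

AEGIS_API_KEY=your-api-key
AEGIS_API_BASE_URL=https://your-aegis-host/api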
Shortname: prompt_qual
Default threshold: 80
Inputs (each object in data)
prompt (str, required): The prompt to evaluate (the instructions or text provided to the LLM).
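A minimal data row therefore contains only the prompt text, for example:

{"prompt": "Rewrite the sentence below in plain English suitable for a general audience."}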
Evaluation metadata
On successful evaluation, the metric returns eval_metadata describing how the prompt was scored:
prompt_category (str): The task category assigned to the prompt (for example classification, retrieval QA, transformation, creative generation, or reasoning analysis). Scoring weights and criteria depend on this category.
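As a hedged sketch of reading this back, assuming response is the object returned by the example script below; the exact field paths depend on the custom-run response schema and are assumptions here:

# Field paths below are assumptions; check the custom-run response documentation for the real layout.
result = response.json()
for eval_result in result.get("evaluations", []):
    for row_result in eval_result.get("results", []):
        print("prompt_category:", row_result.get("eval_metadata", {}).get("prompt_category"))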
Example
import json
import os

import requests
from dotenv import load_dotenv

load_dotenv(override=True)

_API_KEY = os.getenv("AEGIS_API_KEY")
_BASE_URL = os.getenv("AEGIS_API_BASE_URL")
_CUSTOM_RUN_URL = f"{_BASE_URL}/runs/custom"


def post_custom_run(payload: dict) -> requests.Response:
    """POST JSON payload to Aegis custom runs; returns the raw response."""
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {_API_KEY}",
    }
    return requests.post(
        _CUSTOM_RUN_URL,
        headers=headers,
        data=json.dumps(payload),
    )


if __name__ == "__main__":
    data = [
        {
            "prompt": "Classify each customer review into one of three labels: Positive, Negative, or Neutral. "
            "Instructions: 1. Give a reason under 15 words. 2. Use British English spelling. "
            "3. Provide your answer in JSON format. 4. Do not invent details not present in the review."
        },
    ]
    payload = {
        "threshold": 80,
        "model_slug": "o4-mini",
        "is_blocking": True,
        "data_collection_id": None,
        "evaluations": [
            {
                "metrics": ["prompt_qual"],
                "threshold": 80,
                "model_slug": "o4-mini",
                "data": data,
            }
        ],
    }

    response = post_custom_run(payload)
    response.raise_for_status()
    print(json.dumps(response.json(), indent=2))