Content Consistency (content_consist)

Metric Description

Content consistency measures the internal coherence of an AI-generated response and its stability across multiple runs with the same prompt. An output is consistent when (1) its sentences align with the overall message and stay on-topic, and (2) the model produces similar content when given the same input multiple times.

The score runs from 0 (poor consistency) to 100 (strong consistency). The implementation blends (1) heuristic measures of internal coherence and (2) LLM-as-a-judge evaluation of content similarity across multiple regenerated responses (see n_reruns). Use exact_output_match when responses should match verbatim across runs (normalized equality) instead of LLM-judged similarity.
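The normalized equality used by exact_output_match can be sketched as below. The helper name is hypothetical; the documented behavior is only that texts are lowercased and stripped before comparison:

```python
def normalized_equal(a: str, b: str) -> bool:
    # Normalized exact equality: compare after stripping
    # surrounding whitespace and lowercasing both texts.
    return a.strip().lower() == b.strip().lower()

print(normalized_equal("  Renewable energy is clean.  ",
                       "renewable energy is clean."))  # → True
```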

How to interpret the score

  • Closer to 100: Output is coherent internally (no off-topic sentences) and stable across runs (similar content when regenerated).
  • Closer to 0: Off-topic or inconsistent sentences in the output, and/or the model gives very different answers when rerun with the same prompt.
Important

Content consistency does not measure factual correctness or relevance to the question. Pair this with factfulness, answer relevancy, and faithfulness when those qualities matter.

API usage

Prerequisites

After the environment variables are configured, the next step is to create a JSON payload for the custom-runs request. For a field-by-field description of the payload (top-level keys, evaluations, and each row in data), see Custom run request body.

Shortname: content_consist

Default threshold: 80

Inputs (each object in data)

  • prompt or input (str, at least one required): The prompt and/or user input used to generate the output. If both are provided, their concatenation is used.
  • output (str required): The model-generated answer to evaluate.

metric_args

  • exact_output_match (bool optional): When True, cross-run similarity scores each pair of responses (output plus each rerun) using normalized exact equality (text compared lowercased and stripped). When False, pairwise similarity is computed with an LLM. Default: False. Must be a boolean; otherwise evaluation returns an error.
  • n_reruns (int optional): How many extra completions to generate from the same combined prompt/input for comparison with the evaluated output. Default: 2. Must be an integer greater than 0; otherwise evaluation returns an error.
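With n_reruns regenerations, cross-run similarity compares every pair drawn from the original output and the reruns, giving C(n_reruns + 1, 2) comparisons. A quick sketch of the pair labels (mirroring the labels reported in eval_metadata):

```python
from itertools import combinations

n_reruns = 2  # the default
# Label the evaluated output plus each regenerated response.
labels = ["output"] + [f"rerun_{i}" for i in range(1, n_reruns + 1)]
# Every unordered pair is compared for similarity.
pairs = list(combinations(labels, 2))
print(pairs)
# → [('output', 'rerun_1'), ('output', 'rerun_2'), ('rerun_1', 'rerun_2')]
```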

Evaluation metadata

On successful evaluation, the metric returns eval_metadata about internal coherence and cross-run stability:

  • off_topic_sentences (list[str]): Sentences in the evaluated output flagged as not consistent with the rest of the answer (off-topic or weakly related to the overall message).
  • cross_run_similarity_results (list[dict]): Pairwise comparisons between the original output and each regenerated response (and between reruns). Each item includes compared_entities (which two responses are compared, e.g. labels like output and rerun_1), similarity_score (numeric similarity), and reason (a short explanation of the comparison).
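An illustrative eval_metadata value might look like the dict below. All values are hypothetical, and the scale of similarity_score (0–100 here) is an assumption:

```python
example_eval_metadata = {
    "off_topic_sentences": [
        # Sentences flagged as off-topic relative to the overall answer.
        "By the way, the stock market closed higher today.",
    ],
    "cross_run_similarity_results": [
        {
            "compared_entities": ["output", "rerun_1"],
            "similarity_score": 93,  # hypothetical value
            "reason": "Both responses cover the same key benefits.",
        },
    ],
}
```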

Example

import json
import os

import requests
from dotenv import load_dotenv

load_dotenv(override=True)

_API_KEY = os.getenv("AEGIS_API_KEY")
_BASE_URL = os.getenv("AEGIS_API_BASE_URL")
_CUSTOM_RUN_URL = f"{_BASE_URL}/runs/custom"


def post_custom_run(payload: dict) -> requests.Response:
    """POST JSON payload to Aegis custom runs; returns the raw response."""
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {_API_KEY}",
    }
    return requests.post(
        _CUSTOM_RUN_URL,
        headers=headers,
        data=json.dumps(payload),
    )


if __name__ == "__main__":
    data = [
        {
            "prompt": "Summarize the key benefits of renewable energy.",
            "output": "Renewable energy offers clean power with minimal environmental impact. Solar and wind are leading sources. They reduce dependence on fossil fuels and create sustainable energy systems.",
        },
    ]

    payload = {
        "threshold": 80,
        "model_slug": "o4-mini",
        "is_blocking": True,
        "data_collection_id": None,
        "evaluations": [
            {
                "metrics": [
                    {
                        "metric": "content_consist",
                        "metric_args": {
                            "exact_output_match": False,
                            "n_reruns": 2,
                        },
                    },
                ],
                "threshold": 80,
                "model_slug": "o4-mini",
                "data": data,
            }
        ],
    }

    response = post_custom_run(payload)
    response.raise_for_status()
    print(json.dumps(response.json(), indent=2))