Content Consistency (content_consist)
Metric Description
Content consistency measures the internal coherence of an AI-generated response and its stability across multiple runs with the same prompt. An output is consistent when (1) its sentences align with the overall message and stay on-topic, and (2) the model produces similar content when given the same input multiple times.
The score runs from 0 (poor consistency) to 100 (strong consistency). The implementation blends (1) heuristic measures and (2) LLM-as-a-judge evaluation of content similarity across multiple regenerated responses (see n_reruns). When responses should match verbatim across runs, set exact_output_match so that normalized equality is used instead of LLM-judged similarity.
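For intuition, "normalized equality" can be sketched as the comparison below. This is an illustrative helper only (normalized_exact_match is not part of the API); it mirrors the lowercase-and-strip normalization described under metric_args.

def normalized_exact_match(a: str, b: str) -> bool:
    """Illustrative only: verbatim comparison after stripping and lowercasing."""
    return a.strip().lower() == b.strip().lower()

# normalized_exact_match("  Yes. ", "yes.") -> True
# normalized_exact_match("Yes.", "Yes!")    -> False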
How to interpret the score
- Closer to 100: Output is coherent internally (no off-topic sentences) and stable across runs (similar content when regenerated).
- Closer to 0: Off-topic or inconsistent sentences in the output, and/or the model gives very different answers when rerun with the same prompt.
Content consistency does not measure factual correctness or relevance to the question. Pair this with factfulness, answer relevancy, and faithfulness when those qualities matter.
API usage
Prerequisites
Set the AEGIS_API_KEY and AEGIS_API_BASE_URL environment variables with your API key and the API base URL; the example below loads them from a .env file via python-dotenv. After the environment variables are configured, the next step is to create a JSON payload for the custom-runs request. For a field-by-field description of the payload (top-level keys, evaluations, and each row in data), see Custom run request body.
Shortname: content_consist
Default threshold: 80
Inputs (each object in data)
- prompt or input (str, at least one required): The prompt and/or user input used to generate the output. If both are provided, a concatenation of both is used (see the sketch after this list).
- output (str, required): The model-generated answer to evaluate.
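The concatenation behavior can be pictured as follows. This is a sketch under stated assumptions: combined_generation_prompt is a hypothetical helper, and the separator used by the service is not documented here.

def combined_generation_prompt(prompt: str | None, user_input: str | None) -> str:
    # Assumed behavior: keep whichever fields were provided and join them.
    parts = [p for p in (prompt, user_input) if p]
    return "\n".join(parts)  # the actual separator is an assumption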
metric_args
- exact_output_match (bool, optional): When True, cross-run similarity scores each pair of responses (output plus each rerun) using normalized exact equality (text compared lowercased and stripped). When False, pairwise similarity is computed with an LLM. Default: False. Must be a boolean; otherwise evaluation returns an error.
- n_reruns (int, optional): How many extra completions to generate from the same combined prompt/input for comparison with the evaluated output. Default: 2. Must be an integer greater than 0; otherwise evaluation returns an error.
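A minimal client-side pre-check mirroring these constraints might look like the sketch below. validate_metric_args is a hypothetical helper; the service performs its own validation regardless.

def validate_metric_args(args: dict) -> None:
    exact = args.get("exact_output_match", False)
    if not isinstance(exact, bool):
        raise ValueError("exact_output_match must be a boolean")
    n_reruns = args.get("n_reruns", 2)
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(n_reruns, int) or isinstance(n_reruns, bool) or n_reruns <= 0:
        raise ValueError("n_reruns must be an integer greater than 0")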
Evaluation metadata
On successful evaluation, the metric returns eval_metadata about internal coherence and cross-run stability:
- off_topic_sentences (list[str]): Sentences in the evaluated output flagged as inconsistent with the rest of the answer (off-topic or only weakly related to the overall message).
- cross_run_similarity_results (list[dict]): Pairwise comparisons between the original output and each regenerated response (and between reruns). Each item includes compared_entities (which two responses are compared, e.g. labels like output and rerun_1), similarity_score (numeric similarity), and reason (a short explanation of the comparison).
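For illustration only, the metadata might look like the sketch below; the field values are invented and the similarity_score scale shown is an assumption. With n_reruns set to 2, there are three pairwise comparisons: output vs. rerun_1, output vs. rerun_2, and rerun_1 vs. rerun_2.

eval_metadata = {
    "off_topic_sentences": [
        "By the way, the stock market closed higher today.",  # hypothetical example
    ],
    "cross_run_similarity_results": [
        {
            "compared_entities": ["output", "rerun_1"],
            "similarity_score": 92,  # hypothetical value; scale not specified here
            "reason": "Both responses cover the same key points.",
        },
        # ...one entry per remaining pair, e.g. (output, rerun_2) and (rerun_1, rerun_2)
    ],
}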
Example
import json
import os

import requests
from dotenv import load_dotenv

load_dotenv(override=True)

# Read the API credentials and endpoint from the environment.
_API_KEY = os.getenv("AEGIS_API_KEY")
_BASE_URL = os.getenv("AEGIS_API_BASE_URL")
_CUSTOM_RUN_URL = f"{_BASE_URL}/runs/custom"


def post_custom_run(payload: dict) -> requests.Response:
    """POST JSON payload to Aegis custom runs; returns the raw response."""
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {_API_KEY}",
    }
    return requests.post(
        _CUSTOM_RUN_URL,
        headers=headers,
        data=json.dumps(payload),
    )


if __name__ == "__main__":
    # One row of evaluation data: the prompt and the model output to score.
    data = [
        {
            "prompt": "Summarize the key benefits of renewable energy.",
            "output": "Renewable energy offers clean power with minimal environmental impact. Solar and wind are leading sources. They reduce dependence on fossil fuels and create sustainable energy systems.",
        },
    ]
    payload = {
        "threshold": 80,
        "model_slug": "o4-mini",
        "is_blocking": True,
        "data_collection_id": None,
        "evaluations": [
            {
                "metrics": [
                    {
                        "metric": "content_consist",
                        # Default args shown explicitly: LLM-judged similarity
                        # across 2 regenerated responses.
                        "metric_args": {
                            "exact_output_match": False,
                            "n_reruns": 2,
                        },
                    },
                ],
                "threshold": 80,
                "model_slug": "o4-mini",
                "data": data,
            },
        ],
    }
    response = post_custom_run(payload)
    response.raise_for_status()
    print(json.dumps(response.json(), indent=2))
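To compare responses verbatim instead of judging cross-run similarity with an LLM, set exact_output_match to True in metric_args; each pair of responses (output plus each rerun) is then scored with normalized exact equality.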