Context Faithfulness (ctx_faith)
Metric Description
Context faithfulness measures whether the model’s output is faithful to the information provided in the context. It evaluates whether the output is consistent with the retrieved context, avoids hallucinations or distortions, and preserves the most significant concepts from the source material. This metric is best suited for RAG (Retrieval-Augmented Generation) tasks, where the output should be grounded in the retrieved documents without inventing or contradicting information.
The score runs from 0 (low faithfulness) to 100 (high faithfulness). The implementation blends (1) an LLM-as-a-Judge that extracts claims from the output and verifies them against the context and (2) heuristic metrics such as entity alignment and semantic similarity.
How to interpret the score
- Closer to 100: the output is well-aligned with the context; claims are supported, entities match what is present in the context, and the content is semantically consistent with the source.
- Closer to 0: the output contains unsupported or contradicted claims, invents entities not present in the context, or diverges significantly from the source meaning.
Context faithfulness focuses on whether the output stays true to the retrieved context, not whether facts are universally correct. Pair this with factfulness and context sufficiency for broader evaluation.
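The entity-alignment signal mentioned above can be illustrated with a toy heuristic. The sketch below is not the service's implementation: it simply treats capitalized words as entity candidates and reports the fraction of output entities that also occur in the context.

```python
import re


def entity_alignment(context: str, output: str) -> float:
    """Toy proxy for the entity-alignment signal: fraction of capitalized
    tokens in the output that also appear in the context."""

    def entities(text: str) -> set[str]:
        # Crude entity candidates: capitalized words.
        return set(re.findall(r"\b[A-Z][a-zA-Z]+\b", text))

    out_entities = entities(output)
    if not out_entities:
        return 1.0  # nothing to misalign
    return len(out_entities & entities(context)) / len(out_entities)
```

A fully aligned output scores 1.0; an output whose entities never appear in the context scores 0.0.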
API usage
Prerequisites
Configure the AEGIS_API_KEY and AEGIS_API_BASE_URL environment variables, then create a JSON payload for the custom-runs request. For a field-by-field description of the payload (top-level keys, evaluations, and each row in data), see Custom run request body.
Shortname: ctx_faith
Default threshold: 80
Inputs (each object in data)
- `context` (str or list, required): The retrieved context or source documents (e.g., chunks from a knowledge base).
- `output` (str, required): The model-generated output to evaluate (e.g., the answer generated from the context).
metric_args
- `max_n_claims` (int, optional): Maximum number of claims to extract from the output for verification. Must be an integer greater than 0. The optimal number of claims is calculated automatically; this parameter only sets the cap. Default = 50.
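The cap semantics can be sketched as follows. This is illustrative only: the naive sentence count stands in for the service's own claim estimator, which is not public.

```python
def effective_claim_count(output: str, max_n_claims: int = 50) -> int:
    """Illustrative only: estimate one claim per sentence, then apply the cap.
    The service computes its own optimum; max_n_claims is just an upper bound."""
    if max_n_claims <= 0:
        raise ValueError("max_n_claims must be an integer greater than 0")
    # Naive stand-in for the service's automatic claim estimation.
    estimated = max(1, output.count(".") + output.count("!") + output.count("?"))
    return min(estimated, max_n_claims)
```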
Evaluation metadata
On successful evaluation, the metric returns eval_metadata with structured signals about grounding and entity alignment:
- `unsupported_claims_reasons` (list[str]): Explanations for why specific output claims were not grounded in the context.
- `misaligned_entities` (list[str]): Named entities detected in the output that do not appear in the provided context.
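Once a run completes, these signals can be turned into a short human-readable report. The helper below assumes you have already extracted the `eval_metadata` dict from the response; its exact location in the response body may vary.

```python
def summarize_eval_metadata(eval_metadata: dict) -> str:
    """Render the ctx_faith grounding signals as a short report.
    Assumes eval_metadata has already been pulled out of the response."""
    unsupported = eval_metadata.get("unsupported_claims_reasons", [])
    misaligned = eval_metadata.get("misaligned_entities", [])
    lines = [f"{len(unsupported)} unsupported claim(s), {len(misaligned)} misaligned entity(ies)"]
    lines += [f"- claim: {reason}" for reason in unsupported]
    lines += [f"- entity: {name}" for name in misaligned]
    return "\n".join(lines)
```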
Example
```python
import json
import os

import requests
from dotenv import load_dotenv

load_dotenv(override=True)

_API_KEY = os.getenv("AEGIS_API_KEY")
_BASE_URL = os.getenv("AEGIS_API_BASE_URL")
_CUSTOM_RUN_URL = f"{_BASE_URL}/runs/custom"


def post_custom_run(payload: dict) -> requests.Response:
    """POST JSON payload to Aegis custom runs; returns the raw response."""
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {_API_KEY}",
    }
    return requests.post(
        _CUSTOM_RUN_URL,
        headers=headers,
        data=json.dumps(payload),
    )


if __name__ == "__main__":
    data = [
        {
            "context": "Paris is the capital of France. It has a population of about 2.1 million within its administrative limits.",
            "output": "Paris is the capital of France with roughly 2.1 million inhabitants in the city proper.",
        },
    ]
    payload = {
        "threshold": 80,
        "model_slug": "o4-mini",
        "is_blocking": True,
        "data_collection_id": None,
        "evaluations": [
            {
                "metrics": [
                    {"metric": "ctx_faith", "metric_args": {"max_n_claims": 5}},
                ],
                "threshold": 80,
                "model_slug": "o4-mini",
                "data": data,
            }
        ],
    }

    response = post_custom_run(payload)
    response.raise_for_status()
    print(json.dumps(response.json(), indent=2))
```