Context Faithfulness (ctx_faith)

Metric Description

Context faithfulness measures whether the model's output is faithful to the information provided in the context: it checks that the output is consistent with the retrieved material, avoids hallucinations or distortions, and preserves the most significant concepts from the source. The metric is best suited to RAG (Retrieval-Augmented Generation) tasks, where the output should be grounded in the retrieved documents without inventing or contradicting information.

The score runs from 0 (low faithfulness) to 100 (high faithfulness). The implementation blends (1) an LLM-as-a-Judge that extracts claims from the output and verifies them against the context and (2) heuristic metrics such as entity alignment and semantic similarity.
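
To make the mechanics concrete, the sketch below shows one way such a composite score could be assembled. It is illustrative only: the function name, the equal weighting, and the normalization are assumptions, not the service's documented implementation.

def blend_faithfulness(
    supported_claims: int,
    total_claims: int,
    aligned_entities: int,
    total_entities: int,
    semantic_similarity: float,  # assumed normalized to [0, 1]
) -> float:
    """Hypothetical blend of judge and heuristic signals into a 0-100 score."""
    # Fraction of extracted claims the LLM judge verified against the context.
    claim_score = supported_claims / total_claims if total_claims else 1.0
    # Fraction of output entities that also appear in the context.
    entity_score = aligned_entities / total_entities if total_entities else 1.0
    # Equal weighting is an assumption; the real blend is not documented.
    return 100 * (claim_score + entity_score + semantic_similarity) / 3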

How to interpret the score

  • Closer to 100: the output is well-aligned with the context; claims are supported, entities match what is present in the context, and the content is semantically consistent with the source.
  • Closer to 0: the output contains unsupported or contradicted claims, invents entities not present in the context, or diverges significantly from the source meaning.
Important

Context faithfulness focuses on whether the output stays true to the retrieved context, not whether facts are universally correct. Pair this with factfulness and context sufficiency for broader evaluation.

API usage

Prerequisites

The request authenticates with an API key and targets a base URL, both read from environment variables (AEGIS_API_KEY and AEGIS_API_BASE_URL in the script below). After the environment variables are configured, the next step is to create a JSON payload for the custom-runs request. For a field-by-field description of the payload (top-level keys, evaluations, and each row in data), see Custom run request body.
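
The example script loads these variables from a .env file via python-dotenv; assuming the variable names above, a minimal file would look like this (both values are placeholders):

AEGIS_API_KEY=your-api-key
AEGIS_API_BASE_URL=https://aegis.example.com/api/v1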

Shortname: ctx_faith

Default threshold: 80

Inputs (each object in data)

  • context (str or list, required): The retrieved context or source documents (e.g., chunks from a knowledge base).
  • output (str, required): The model-generated output to evaluate (e.g., the answer generated from the context).
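
Since context accepts a list as well as a string, a row carrying multiple retrieved chunks would presumably look like the following (the list-of-strings shape is an assumption based on the type above):

data_row = {
    "context": [
        "Paris is the capital of France.",
        "The city proper has about 2.1 million residents.",
    ],
    "output": "Paris, the French capital, has roughly 2.1 million inhabitants.",
}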

metric_args

  • max_n_claims (int, optional): Maximum number of claims to extract from the output for verification. Must be an integer greater than 0. The number of claims to extract is calculated automatically; this parameter only sets the cap. Default = 50.
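
In the request payload, metric_args sits alongside the metric shortname inside each metrics entry, as in the full example further below:

{"metric": "ctx_faith", "metric_args": {"max_n_claims": 10}}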

Evaluation metadata

On successful evaluation, the metric returns eval_metadata with structured signals about grounding and entity alignment:

  • unsupported_claims_reasons (list[str]): Explanations for why specific output claims were not grounded in context.
  • misaligned_entities (list[str]): Named entities detected in the output that do not appear in the provided context.
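
The response envelope is not reproduced here, but given the field descriptions above, an eval_metadata object for an unfaithful output might look like this (the claim text and entity are invented for illustration):

{
  "unsupported_claims_reasons": [
    "The output claims Paris was founded in 52 BC, but the context never mentions the city's founding."
  ],
  "misaligned_entities": ["Lyon"]
}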

Example

import json
import os

import requests
from dotenv import load_dotenv

load_dotenv(override=True)

_API_KEY = os.getenv("AEGIS_API_KEY")
_BASE_URL = os.getenv("AEGIS_API_BASE_URL")
_CUSTOM_RUN_URL = f"{_BASE_URL}/runs/custom"


def post_custom_run(payload: dict) -> requests.Response:
    """POST JSON payload to Aegis custom runs; returns the raw response."""
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {_API_KEY}",
    }
    return requests.post(
        _CUSTOM_RUN_URL,
        headers=headers,
        data=json.dumps(payload),
    )


if __name__ == "__main__":
    data = [
        {
            "context": "Paris is the capital of France. It has a population of about 2.1 million within its administrative limits.",
            "output": "Paris is the capital of France with roughly 2.1 million inhabitants in the city proper.",
        },
    ]

    payload = {
        "threshold": 80,
        "model_slug": "o4-mini",
        "is_blocking": True,
        "data_collection_id": None,
        "evaluations": [
            {
                "metrics": [
                    {"metric": "ctx_faith", "metric_args": {"max_n_claims": 5}},
                ],
                "threshold": 80,
                "model_slug": "o4-mini",
                "data": data,
            }
        ],
    }

    response = post_custom_run(payload)
    response.raise_for_status()
    print(json.dumps(response.json(), indent=2))
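
As a design note, passing json=payload to requests.post would serialize the body and set the Content-Type header automatically, making it equivalent to the explicit data=json.dumps(payload) used here. response.raise_for_status() turns any 4xx/5xx response into an HTTPError, so a failed run surfaces as an exception rather than a silently printed error body.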