Faithfulness (faith)

Metric Description

Faithfulness measures whether the model’s output is faithful to the information provided in the input. It evaluates whether the output is consistent with the input, avoids hallucinations or distortions, and preserves the text’s most significant concepts and conclusions. This metric is best suited for summarization tasks, where the output should reflect the source material without inventing or contradicting information.

The score runs from 0 (low faithfulness) to 100 (high faithfulness). The implementation blends (1) LLM-as-a-Judge assessments with (2) heuristic metrics.

How to interpret the score

  • Closer to 100: the output is well-aligned with the input.
  • Closer to 0: the output contains unsupported or contradicted claims, invents entities not present in the input, or diverges significantly from the source meaning.
Important

Faithfulness focuses on whether the output stays true to the input, not whether facts are universally correct. Pair this with factfulness and answer relevancy for broader evaluation.
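If those sibling metrics are enabled for your account, they can be requested alongside faith in a single evaluations entry. A sketch of the metrics array (the shortnames factfulness and answer_relevancy are assumptions here; check each metric's own page for its shortname):

```json
"metrics": [
  {"metric": "faith"},
  {"metric": "factfulness"},
  {"metric": "answer_relevancy"}
]
```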

API usage

Prerequisites

Configure the environment variables AEGIS_API_KEY and AEGIS_API_BASE_URL (the example below loads them from a .env file). Once they are set, create a JSON payload for the custom-runs request. For a field-by-field description of the payload (top-level keys, evaluations, and each row in data), see Custom run request body.
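A minimal .env file for the example below. The variable names match those read by the script; the values are placeholders, and the base URL shown is illustrative, not the real endpoint:

```shell
AEGIS_API_KEY=<your-api-key>
AEGIS_API_BASE_URL=https://api.example.com/v1
```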

Shortname: faith

Default threshold: 80

Inputs (each object in data)

  • input (str, required): The source text (e.g., document to be summarized).
  • output (str, required): The model-generated output to evaluate (e.g., summary).

metric_args

  • max_n_claims (int, optional): Upper bound on the number of claims extracted from the output for verification. An optimal number of claims is computed automatically; this parameter only caps it. Default = 50.
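For instance, to cap claim extraction at 10 claims, a metric entry inside evaluations[].metrics would look like this (the value 10 is illustrative):

```json
{"metric": "faith", "metric_args": {"max_n_claims": 10}}
```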

Evaluation metadata

On successful evaluation, the metric returns eval_metadata describing any unsupported claims:

  • unsupported_details (list[str]): Reasons for output claims not being supported by the input (contradicted, absent from the source, or otherwise ungrounded).
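The full shape of the run response is documented on the Custom run request body page. As a sketch, assuming each evaluated row exposes its eval_metadata as a dict, the unsupported-claim reasons can be read defensively like this:

```python
def unsupported_reasons(eval_metadata: dict) -> list[str]:
    """Return the reasons for unsupported claims, or [] when every claim is grounded."""
    return eval_metadata.get("unsupported_details") or []


# Fabricated eval_metadata payloads for illustration:
flagged = {"unsupported_details": ["The 2.5 million figure contradicts the source's 2.1 million."]}
clean: dict = {}

print(unsupported_reasons(flagged))  # one contradiction reason
print(unsupported_reasons(clean))    # -> []
```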

Example

import json
import os

import requests
from dotenv import load_dotenv

load_dotenv(override=True)

_API_KEY = os.getenv("AEGIS_API_KEY")
_BASE_URL = os.getenv("AEGIS_API_BASE_URL")
_CUSTOM_RUN_URL = f"{_BASE_URL}/runs/custom"


def post_custom_run(payload: dict) -> requests.Response:
    """POST the JSON payload to the Aegis custom-runs endpoint; return the raw response."""
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {_API_KEY}",
    }
    return requests.post(
        _CUSTOM_RUN_URL,
        headers=headers,
        data=json.dumps(payload),
        timeout=30,
    )


if __name__ == "__main__":
    data = [
        {
            "input": "Paris is the capital of France. It has a population of about 2.1 million within its administrative limits.",
            "output": "Paris is the capital of France with roughly 2.1 million inhabitants in the city proper.",
        },
    ]

    payload = {
        "threshold": 80,
        "model_slug": "o4-mini",
        "is_blocking": True,
        "data_collection_id": None,
        "evaluations": [
            {
                "metrics": [
                    {"metric": "faith", "metric_args": {"max_n_claims": 5}},
                ],
                "threshold": 80,
                "model_slug": "o4-mini",
                "data": data,
            }
        ],
    }

    response = post_custom_run(payload)
    response.raise_for_status()
    print(json.dumps(response.json(), indent=2))