Content Generation Faithfulness (content_gen_faith)

Metric Description

Content generation faithfulness measures whether the model's output stays faithful to the input for tasks where the answer extends what the user gave (for example, social posts, tweets, or articles derived from a brief). It checks that input entities are preserved, that the output's sentences cohere with the input and with each other, and that the output's content stays centered on the input's entities. It does not check whether every fact is cited from an external document; that concern is closer to context faithfulness.

The score runs from 0 (weak alignment with the input) to 100 (strong alignment). The implementation blends LLM-as-a-Judge steps (content assessment) with heuristic signals.

How to interpret the score

  • Closer to 100: the output tends to preserve input entities, stay on-topic in meaning, and keep content aligned with the input.
  • Closer to 0: missing or distorted input entities, incoherent or off-input sentences, or content that drifts from what the input is about.
Important

Faithfulness here focuses on the what, not the how. Use format alignment for more robust validation of formatting rules.

API usage

Prerequisites

After the environment variables are configured, the next step is to create a JSON payload for the custom-runs request. For a field-by-field description of the payload (top-level keys, evaluations, and each row in data), see Custom run request body.
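The example at the end of this page loads its credentials with python-dotenv, so a minimal .env file could look like the sketch below. The variable names are taken from that example; the values are placeholders to replace with your own:

# .env (placeholder values)
AEGIS_API_KEY=<your-api-key>
AEGIS_API_BASE_URL=https://<your-aegis-host>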

Shortname: content_gen_faith

Default threshold: 80

Inputs (each object in data)

  • input (str required): The source brief, instruction, or material the generated content should stay faithful to.
  • output (str required): The model-generated content to evaluate.

metric_args

  • max_n_claims (int optional): Maximum number of claims to extract from the output before centrality judging. The implementation automatically chooses how many claims to extract; this parameter only caps that count. Default = 50.
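For illustration, a metric entry that caps claim extraction at 10 claims takes the same shape as in the full example at the end of this page:

# One metric entry inside an evaluation's "metrics" list
metric_entry = {
    "metric": "content_gen_faith",
    "metric_args": {"max_n_claims": 10},
}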

Evaluation metadata

On successful evaluation, the metric returns eval_metadata with details about content that appears off-topic or weakly aligned with the input:

  • off_topic_reasons (list[str]): Generated reasons explaining why specific output claims are not aligned with the input.
  • off_topic_sentences (list[str]): Output sentences that appear off-topic or weakly aligned with the input.
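As a minimal sketch, using the response object from the example below and assuming each metric result nests its eval_metadata as shown here (the exact response structure is not documented on this page and may differ), the metadata could be read like this:

# Hypothetical traversal of the parsed response: only the eval_metadata field
# names come from this page; the surrounding structure is an assumption.
result = response.json()
eval_metadata = result["evaluations"][0]["metrics"][0].get("eval_metadata", {})
for reason in eval_metadata.get("off_topic_reasons", []):
    print("off-topic reason:", reason)
for sentence in eval_metadata.get("off_topic_sentences", []):
    print("off-topic sentence:", sentence)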

Example

import json
import os

import requests
from dotenv import load_dotenv

load_dotenv(override=True)

_API_KEY = os.getenv("AEGIS_API_KEY")
_BASE_URL = os.getenv("AEGIS_API_BASE_URL")
_CUSTOM_RUN_URL = f"{_BASE_URL}/runs/custom"


def post_custom_run(payload: dict) -> requests.Response:
    """POST JSON payload to Aegis custom runs; returns the raw response."""
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {_API_KEY}",
    }
    return requests.post(
        _CUSTOM_RUN_URL,
        headers=headers,
        data=json.dumps(payload),
    )


if __name__ == "__main__":
    data = [
        {
            "input": "Write a social media post about Python programming.",
            "output": "Python is a versatile programming language used for web development, data science, and automation.",
        },
    ]

    payload = {
        "threshold": 80,  # threshold on the run level
        "model_slug": "o4-mini",
        "is_blocking": True,
        "data_collection_id": None,
        "evaluations": [
            {
                "metrics": [
                    {"metric": "content_gen_faith", "metric_args": {"max_n_claims": 5}},
                ],
                "threshold": 80,  # threshold on the metric level
                "model_slug": "o4-mini",
                "data": data,
            }
        ],
    }

    response = post_custom_run(payload)
    response.raise_for_status()
    print(json.dumps(response.json(), indent=2))