Entity Faithfulness (entity_faith)

Metric Description

Entity faithfulness evaluates whether the entities in the model's output are present in the input text. It detects when the model hallucinates entities, introduces facts not supported by the input, or distorts entity representations. The metric is designed for use cases where the output is expected to extract or reference only entities that appear (explicitly or in equivalent form) in the source input.

The score ranges from 0 (many entities are not grounded in the input) to 100 (every extracted entity is present in the input). The implementation combines (1) LLM-as-a-Judge assessments, (2) heuristic measures, and (3) a word-boundary substring check, followed, depending on match_mode, by either strict regex-only rejection (exact) or LLM-based grounding (strict, or the default flexible).
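As a rough sketch of that first substring pass (illustrative only; the helper below is not the metric's internal implementation), the word-boundary check can be pictured like this:

import re

def word_boundary_pass(entities: list[str], source: str) -> tuple[list[str], list[str]]:
    """Split extracted entities into exact matches and leftovers."""
    matched, leftover = [], []
    for entity in entities:
        # Case-insensitive search for the entity as a whole word or phrase.
        if re.search(rf"\b{re.escape(entity)}\b", source, flags=re.IGNORECASE):
            matched.append(entity)
        else:
            leftover.append(entity)
    return matched, leftover

Entities left over after this pass are either rejected outright (exact) or handed to the LLM grounding step (strict / flexible).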

How to interpret the score

  • Closer to 100: All or nearly all entities in the output can be traced back to the input; little or no hallucination or unsupported references.
  • Closer to 0: Many entities in the output are not found in the input; the model is introducing or distorting entity information.
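For instance, under a simple grounded-fraction reading (the documentation does not publish the exact formula), an output in which 4 of 5 entities ground in the input would score 80, exactly at the default threshold.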
Important

Entity faithfulness checks that entities mentioned in the output exist in the input, but it does not assess correctness of the answer or semantic faithfulness of claims. Pair this with faithfulness, factfulness, or answer correctness when evaluating full response quality.

API usage

Prerequisites

Set the AEGIS_API_KEY and AEGIS_API_BASE_URL environment variables used by the example below. After the environment variables are configured, the next step is to create a JSON payload for the custom-runs request. For a field-by-field description of the payload (top-level keys, evaluations, and each row in data), see Custom run request body.
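The payload shape matches the full example at the bottom of this page; as a minimal skeleton (placeholder values):

{
  "threshold": 80,
  "model_slug": "o4-mini",
  "is_blocking": true,
  "data_collection_id": null,
  "evaluations": [
    {
      "metrics": [{"metric": "entity_faith", "metric_args": {}}],
      "threshold": 80,
      "model_slug": "o4-mini",
      "data": [{"input": "...", "output": "..."}]
    }
  ]
}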

Shortname: entity_faith

Default threshold: 80

Inputs (each object in data)

  • input (str, required): The source text against which entities are grounded (e.g., a document, note, or context).
  • output (str or dict, required): The model-generated output whose entities are evaluated. If a dict is provided, it is serialized to JSON.
  • prompt (str, optional): Extra instructions or formatting context. When provided, the metric may extract extra entities and instructions from it (e.g., abbreviation definitions) to help grounding; a sample row with all three fields follows this list.
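For example (input and output are taken from the full example below; the prompt value is a hypothetical instruction):

{
  "input": "Nokia's regional team met with Vodafone RO to discuss the 5G rollout in Iași.",
  "output": "Entities: Nokia; Vodafone Romania; fifth-generation network deployment; Iași.",
  "prompt": "Extract all named entities. Note: RO = Romania."
}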

metric_args

  • domain (str, optional): Domain hint for entity extraction (passed through to extraction prompts). Default: omitted (None). If set, it must be a string (non-string values are rejected).
  • match_mode (str, optional): How strictly leftover entities are matched after an initial word-boundary, case-insensitive exact pass against the input (plus any prompt-derived extra entities). Must be exactly one of exact, strict, or flexible; see the illustration after this list. Default: flexible.
    • exact: Any entity not matched exactly against the source corpus is counted missing (regex-only; no LLM matching step).
    • strict: LLM matching allows well-established proper-noun abbreviations (e.g., USA, NASA) and abbreviations explicitly defined in the prompt; casual shorthand is rejected.
    • flexible: LLM matching that also accepts common professional shorthand and widely understood equivalences (e.g., “2 wks” vs “2 weeks”, “temp” vs “temperature”).
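As an illustration, take the sample row from the example at the bottom of this page: the input says "Vodafone RO" and "5G", while the output says "Vodafone Romania" and "fifth-generation network deployment". Inferred from the definitions above (not verified output):

  • exact: both rewrites are counted missing, since neither matches the source verbatim.
  • strict: "Vodafone Romania" may be accepted if the prompt defines RO = Romania; the 5G paraphrase is still rejected.
  • flexible: "fifth-generation network deployment" can additionally pass as a widely understood equivalence of "5G".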

Evaluation metadata

On successful evaluation, the metric returns eval_metadata describing grounding failures (an illustrative example follows the field description):

  • entities_not_found_in_source (list[dict]): Entities appearing in the output that could not be grounded in the input. Each item has text (the entity surface form from the output) and reason (why it was not found or accepted in the source).
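An illustrative shape (values invented for the sample row at the bottom of this page):

{
  "entities_not_found_in_source": [
    {
      "text": "fifth-generation network deployment",
      "reason": "Paraphrase of '5G' not accepted under the configured match_mode."
    }
  ]
}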

Example

import json
import os

import requests
from dotenv import load_dotenv

load_dotenv(override=True)

_API_KEY = os.getenv("AEGIS_API_KEY")
_BASE_URL = os.getenv("AEGIS_API_BASE_URL")
_CUSTOM_RUN_URL = f"{_BASE_URL}/runs/custom"


def post_custom_run(payload: dict) -> requests.Response:
    """POST JSON payload to Aegis custom runs; returns the raw response."""
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {_API_KEY}",
    }
    return requests.post(
        _CUSTOM_RUN_URL,
        headers=headers,
        data=json.dumps(payload),
    )


if __name__ == "__main__":
    # Records must match the columns your metric expects (input, output).
    data = [
        {
            "input": "Nokia's regional team met with Vodafone RO to discuss the 5G rollout in Iași.",
            "output": "Entities: Nokia; Vodafone Romania; fifth-generation network deployment; Iași.",
        },
    ]

    payload = {
        "threshold": 80,
        "model_slug": "o4-mini",
        "is_blocking": True,
        "data_collection_id": None,
        "evaluations": [
            {
                "metrics": [
                    {
                        "metric": "entity_faith",
                        "metric_args": {
                            "domain": "telecom",
                            "match_mode": "flexible",
                        },
                    },
                ],
                "threshold": 80,
                "model_slug": "o4-mini",
                "data": data,
            }
        ],
    }

response = post_custom_run(payload)
response.raise_for_status()
print(json.dumps(response.json(), indent=2))
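Before running the script, define AEGIS_API_KEY and AEGIS_API_BASE_URL in a .env file (or export them in your shell); load_dotenv(override=True) reads them in, and raise_for_status() raises on any HTTP error before the run response is printed.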