Factfulness (factfulness)
Metric Description
Factfulness measures the degree to which the model's output contains verifiable, true claims based on the provided input (context). The evaluation relies solely on the model's knowledge, without additional web browsing.
The score ranges from 0 (many false or unsupported claims) to 100 (all claims are verifiable and true).
How to interpret the score
- Closer to 100: the answer contains mostly verifiable, true claims that are consistently supported by the input across multiple verification runs; high factual density indicates the output is rich with checkable information.
- Closer to 0: the answer contains many false claims, unsupported assertions, or a low proportion of checkable factual content.
Factfulness focuses on whether claims are verifiable against the provided input. Claims that are "non-checkable" (opinions, subjective statements, or general knowledge not requiring verification) are tracked but don't count against the score.
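For intuition, consider the Paris request from the Example section below. The metric's internal claim extraction is not exposed, so the decomposition and verdict labels in this sketch are assumptions, not actual metric output:

claims = [
    # Each claim in the output is checked against the input context.
    ("Paris is the capital of France", "true"),
    ("Paris has a population exceeding 2 million", "true"),
    ("Paris is the largest city in France", "true"),
    # A subjective aside such as "Paris is a beautiful city" would be
    # non-checkable: tracked, but not counted against the score.
]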
API usage
Prerequisites
The example below authenticates with two environment variables, AEGIS_API_KEY and AEGIS_API_BASE_URL, loaded from a .env file via python-dotenv. After the environment variables are configured, the next step is to create a JSON payload for the custom-runs request. For a field-by-field description of the payload (top-level keys, evaluations, and each row in data), see Custom run request body.
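A minimal .env file might look like this; the placeholder values are yours to fill in:

AEGIS_API_KEY=<your-api-key>
AEGIS_API_BASE_URL=<your-aegis-base-url>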
Shortname: factfulness
Default threshold: 70
Inputs (each object in data)
- input (str, required): The source context or information that serves as the ground truth for verification.
- output (str, required): The model-generated answer containing claims to be verified.
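For instance, a single row pairs a context with the answer whose claims should be checked against it; the Example section below shows rows like this inside a complete request:

row = {
    "input": "The Eiffel Tower is 330 metres tall.",
    "output": "The Eiffel Tower stands 330 metres high.",
}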
metric_args
- n_runs (int, optional): Number of parallel verification runs used for the knowledge-only verdict phase (and for the input-augmented retry when every run returns “idk” for a claim). Default: 3. Must be an integer greater than 0.
- idk_penalty_weight (float, optional): Weight applied to claims that remain undecidable (“idk”) across all runs when computing the true-claims score denominator. Default: 0.25. Must be a number in the inclusive range 0–1. At 0, claims that are all-idk are omitted from that denominator; otherwise each such claim contributes idk_penalty_weight instead of 1.
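The exact scoring formula is not documented here, but the denominator behavior above suggests the following sketch. Treat the function and the formula as an assumption for building intuition, not the metric's actual implementation:

def sketch_true_claims_score(
    n_true: int,
    n_false: int,
    n_all_idk: int,
    idk_penalty_weight: float = 0.25,
) -> float:
    """Assumed sketch: true claims over a weighted denominator.

    Per the description above, each all-idk claim contributes
    idk_penalty_weight to the denominator (and drops out entirely
    when the weight is 0).
    """
    denominator = n_true + n_false + idk_penalty_weight * n_all_idk
    if denominator == 0:
        return 100.0  # no checkable claims; actual edge-case behavior unknown
    return 100.0 * n_true / denominator

# 3 true, 1 false, 1 all-idk claim:
# 100 * 3 / (3 + 1 + 0.25 * 1) ≈ 70.6
print(round(sketch_true_claims_score(3, 1, 1), 1))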
Evaluation metadata
On successful evaluation, the metric returns eval_metadata with claim-level verification outcomes:
- false_details (list[dict]): Claims judged false against the input context. Each item has claim (the claim text) and reason (why it is considered false).
- unknown_details (list[dict]): Claims that could not be verified as true from the context (undecidable or unknown). Each item has claim and reason.
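Shaped after the field descriptions above, the metadata for a row might look like the following; the claim texts and reasons are illustrative placeholders, not real metric output:

eval_metadata = {
    "false_details": [
        {
            "claim": "The city has a population of 10 million.",
            "reason": "The input states the population is over 2 million, not 10 million.",
        },
    ],
    "unknown_details": [
        {
            "claim": "The city was founded in the 3rd century BC.",
            "reason": "The input context says nothing about a founding date.",
        },
    ],
}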
Example
import json
import os

import requests
from dotenv import load_dotenv

load_dotenv(override=True)

_API_KEY = os.getenv("AEGIS_API_KEY")
_BASE_URL = os.getenv("AEGIS_API_BASE_URL")
_CUSTOM_RUN_URL = f"{_BASE_URL}/runs/custom"


def post_custom_run(payload: dict) -> requests.Response:
    """POST JSON payload to Aegis custom runs; returns the raw response."""
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {_API_KEY}",
    }
    return requests.post(
        _CUSTOM_RUN_URL,
        headers=headers,
        data=json.dumps(payload),
        timeout=60,  # avoid hanging indefinitely on a stalled connection
    )


if __name__ == "__main__":
    # Records must match the columns your metric expects (e.g. input, output).
    data = [
        {
            "input": "Paris is the capital and largest city of France. It has a population of over 2 million people.",
            "output": "Paris is the capital of France with a population exceeding 2 million residents. It is also the largest city in the country.",
        },
    ]
    payload = {
        "threshold": 80,  # threshold on the run level
        "model_slug": "o4-mini",
        "is_blocking": True,
        "data_collection_id": None,
        "evaluations": [
            {
                "metrics": [
                    {
                        "metric": "factfulness",
                        "metric_args": {
                            "n_runs": 3,
                            "idk_penalty_weight": 0.25,
                        },
                    },
                ],
                "threshold": 80,  # threshold on the metric level
                "model_slug": "o4-mini",
                "data": data,
            }
        ],
    }

    response = post_custom_run(payload)
    response.raise_for_status()
    print(json.dumps(response.json(), indent=2))