Factfulness (factfulness)
Metric Description
Factfulness measures the degree to which the model's output contains verifiable, true claims based on the provided input (context). The evaluation relies solely on the model's knowledge, without additional web browsing.
The score ranges from 0 (many false or unsupported claims) to 100 (all claims are verifiable and true).
How to interpret the score
- Closer to 100: the answer contains mostly verifiable, true claims that are consistently supported by the input across multiple verification runs; high factual density indicates the output is rich with checkable information.
- Closer to 0: the answer contains many false claims, unsupported assertions, or a low proportion of checkable factual content.
Factfulness focuses on whether claims are verifiable against the provided input. Claims that are "non-checkable" (opinions, subjective statements, or general knowledge not requiring verification) are tracked but don't count against the score.
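For intuition, consider the Paris request from the Example section below. The metric's internal claim extraction is not exposed, so the decomposition and verdict labels in this sketch are assumptions, not actual metric output:

claims = [
    # Each claim in the output is checked against the input context.
    ("Paris is the capital of France", "true"),
    ("Paris has a population exceeding 2 million", "true"),
    ("Paris is the largest city in France", "true"),
    # A subjective aside such as "Paris is a beautiful city" would be
    # non-checkable: tracked, but not counted against the score.
]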
API usage
Prerequisites
The example below authenticates with two environment variables, AEGIS_API_KEY and AEGIS_API_BASE_URL, loaded from a .env file via python-dotenv. After the environment variables are configured, the next step is to create a JSON payload for the custom-runs request. For a field-by-field description of the payload (top-level keys, evaluations, and each row in data), see Custom run request body.
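A minimal .env file might look like this; the placeholder values are yours to fill in:

AEGIS_API_KEY=<your-api-key>
AEGIS_API_BASE_URL=<your-aegis-base-url>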
Shortname: factfulness
Default threshold: 70
Inputs (each object in data)
- input (str, required): The source context or information that serves as the ground truth for verification.
- output (str, required): The model-generated answer containing claims to be verified.
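For instance, a single row pairs a context with the answer whose claims should be checked against it; the Example section below shows rows like this inside a complete request:

row = {
    "input": "The Eiffel Tower is 330 metres tall.",
    "output": "The Eiffel Tower stands 330 metres high.",
}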
metric_args
- n_runs (int, optional): Number of parallel verification runs used for the knowledge-only verdict phase (and for the input-augmented retry when every run returns “idk” for a claim). Default: 3. Must be an integer greater than 0.
- idk_penalty_weight (float, optional): Weight applied to claims that remain undecidable (“idk”) across all runs when computing the true-claims score denominator. Default: 0.25. Must be a number in the inclusive range 0–1. At 0, claims that are all-idk are omitted from that denominator; otherwise each such claim contributes idk_penalty_weight instead of 1.
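The exact scoring formula is not documented here, but the denominator behavior above suggests the following sketch. Treat the function and the formula as an assumption for building intuition, not the metric's actual implementation:

def sketch_true_claims_score(
    n_true: int,
    n_false: int,
    n_all_idk: int,
    idk_penalty_weight: float = 0.25,
) -> float:
    """Assumed sketch: true claims over a weighted denominator.

    Per the description above, each all-idk claim contributes
    idk_penalty_weight to the denominator (and drops out entirely
    when the weight is 0).
    """
    denominator = n_true + n_false + idk_penalty_weight * n_all_idk
    if denominator == 0:
        return 100.0  # no checkable claims; actual edge-case behavior unknown
    return 100.0 * n_true / denominator

# 3 true, 1 false, 1 all-idk claim:
# 100 * 3 / (3 + 1 + 0.25 * 1) ≈ 70.6
print(round(sketch_true_claims_score(3, 1, 1), 1))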
Evaluation metadata
On successful evaluation, the metric returns eval_metadata with claim-level verification outcomes:
- false_details (list[dict]): Claims judged false against the input context. Each item has claim (the claim text) and reason (why it is considered false).
- unknown_details (list[dict]): Claims that could not be verified as true from the context (undecidable or unknown). Each item has claim and reason.
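Shaped after the field descriptions above, the metadata for a row might look like the following; the claim texts and reasons are illustrative placeholders, not real metric output:

eval_metadata = {
    "false_details": [
        {
            "claim": "The city has a population of 10 million.",
            "reason": "The input states the population is over 2 million, not 10 million.",
        },
    ],
    "unknown_details": [
        {
            "claim": "The city was founded in the 3rd century BC.",
            "reason": "The input context says nothing about a founding date.",
        },
    ],
}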
Example
import json
import os

import requests
from dotenv import load_dotenv

load_dotenv(override=True)

_API_KEY = os.getenv("AEGIS_API_KEY")
_BASE_URL = os.getenv("AEGIS_API_BASE_URL")
_CUSTOM_RUN_URL = f"{_BASE_URL}/runs/custom"


def post_custom_run(payload: dict) -> requests.Response:
    """POST JSON payload to Aegis custom runs; returns the raw response."""
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {_API_KEY}",
    }
    return requests.post(
        _CUSTOM_RUN_URL,
        headers=headers,
        data=json.dumps(payload),
        timeout=60,  # avoid hanging indefinitely on a stalled connection
    )


if __name__ == "__main__":
    # Records must match the columns your metric expects (e.g. input, output).
    data = [
        {
            "input": "Paris is the capital and largest city of France. It has a population of over 2 million people.",
            "output": "Paris is the capital of France with a population exceeding 2 million residents. It is also the largest city in the country.",
        },
    ]
    payload = {
        "threshold": 80,  # threshold on the run level
        "model_slug": "o4-mini",
        "is_blocking": True,
        "data_collection_id": None,
        "evaluations": [
            {
                "metrics": [
                    {
                        "metric": "factfulness",
                        "metric_args": {
                            "n_runs": 3,
                            "idk_penalty_weight": 0.25,
                        },
                    },
                ],
                "threshold": 80,  # threshold on the metric level
                "model_slug": "o4-mini",
                "data": data,
            }
        ],
    }

    response = post_custom_run(payload)
    response.raise_for_status()
    print(json.dumps(response.json(), indent=2))