Context Recall (ctx_recall)

Metric Description

Context recall measures the quality of the retrieved context compared to a golden answer (ideal reference). It evaluates whether all content in the golden answer can be attributed to the context chunks. In other words: does the context contain enough information to support everything the golden answer claims?

The score ranges from 0 (poor coverage) to 100 (full coverage). The implementation uses an LLM-as-a-Judge to parse and interpret the text.
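
Conceptually, the judge breaks the golden answer into sentences and checks whether each one can be attributed to at least one context chunk. The sketch below illustrates that intuition only; the sentence splitting, attribution step, and exact aggregation are assumptions for illustration, not the production implementation.

def context_recall_score(sentence_attributed: list[bool]) -> float:
    """Share of golden-answer sentences supported by the context, scaled to 0-100."""
    if not sentence_attributed:
        return 0.0
    return 100.0 * sum(sentence_attributed) / len(sentence_attributed)

# 4 of 5 golden-answer sentences attributable to the chunks -> 80.0
print(context_recall_score([True, True, True, True, False]))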

How to interpret the score

  • Closer to 100: the context covers most or all of the information present in the golden answer; little or nothing in the golden answer is unsupported by the chunks.
  • Closer to 0: the context misses significant information that the golden answer expects; much of the content cannot be attributed to any chunk.
Important

Context recall measures how well the retrieved context supports a reference answer—not how faithful an LLM’s output is to that context. For faithfulness of model outputs, use metrics like faithfulness or factfulness.

API usage

Prerequisites

Set the AEGIS_API_KEY and AEGIS_API_BASE_URL environment variables before calling the API. Once they are configured, the next step is to create a JSON payload for the custom-runs request. For a field-by-field description of the payload (top-level keys, evaluations, and each row in data), see Custom run request body.
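
The example script below reads both variables from the environment, typically via a .env file loaded with python-dotenv. The values here are placeholders; substitute your own API key and the base URL of your Aegis deployment.

AEGIS_API_KEY=<your-api-key>
AEGIS_API_BASE_URL=<your-aegis-base-url>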

Shortname: ctx_recall

Default threshold: 80

Inputs (each object in data)

  • context (str or list[str], required): The retrieved context chunks (documents or passages) to evaluate.
  • golden_answer (str, required): The ideal reference answer that the context is expected to support.
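
For instance, a single row pairs a few retrieved chunks with the reference answer they are expected to support (the values below are illustrative):

row = {
    "context": [
        "Battery life is up to 17 hours of mixed use.",
        "Storage starts at 512GB SSD.",
    ],
    "golden_answer": "It lasts up to 17 hours on a charge. Storage starts at 512GB SSD.",
}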

Evaluation metadata

On successful evaluation, the metric returns eval_metadata describing attribution gaps between the golden answer and the chunks:

  • unattributed_sentences_reasons (list[str]): Reasons from the judge for golden-answer sentences that could not be attributed to the provided context, explaining what is missing or unmatched.
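
For example, if a golden-answer sentence about Thunderbolt ports had no supporting chunk, the list might contain an entry like the one below. The wording of each reason comes from the judge and will vary; the dict is only an illustrative shape.

eval_metadata = {
    "unattributed_sentences_reasons": [
        "The golden answer mentions two Thunderbolt 4 ports, but no retrieved chunk covers ports.",
    ],
}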

Example

import json
import os

import requests
from dotenv import load_dotenv

load_dotenv(override=True)

_API_KEY = os.getenv("AEGIS_API_KEY")
_BASE_URL = os.getenv("AEGIS_API_BASE_URL")
_CUSTOM_RUN_URL = f"{_BASE_URL}/runs/custom"


def post_custom_run(payload: dict) -> requests.Response:
    """POST JSON payload to Aegis custom runs; returns the raw response."""
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {_API_KEY}",
    }
    return requests.post(
        _CUSTOM_RUN_URL,
        headers=headers,
        data=json.dumps(payload),
    )


if __name__ == "__main__":
    context = [
        "Battery life is up to 17 hours of mixed use.",
        "It includes a 10-core CPU and 16GB unified memory.",
        "Storage starts at 512GB SSD.",
        "There are two Thunderbolt 4 ports.",
        "The notebook ships with a 3024x1964 high-resolution display.",
    ]
    golden_answer = (
        "The notebook ships with a 3024x1964 high-resolution display. "
        "It lasts up to 17 hours on a charge. "
        "It includes a 10-core CPU. "
        "Storage starts at 512GB SSD. "
        "There are two Thunderbolt 4 ports."
    )

    data = [
        {"context": context, "golden_answer": golden_answer},
    ]

    payload = {
        "threshold": 80,
        "model_slug": "o4-mini",
        "is_blocking": True,
        "data_collection_id": None,
        "evaluations": [
            {
                "metrics": ["ctx_recall"],
                "threshold": 80,
                "model_slug": "o4-mini",
                "data": data,
            }
        ],
    }

    response = post_custom_run(payload)
    response.raise_for_status()
    print(json.dumps(response.json(), indent=2))