Context Relevancy (ctx_rel)
Metric Description
Context relevancy measures whether the retrieved context is relevant to the input (user question or instruction). It evaluates whether the context contains information that actually addresses what the user asked, and whether it avoids irrelevant content that might mislead or distract from the answer. In other words: is the retrieved context on-topic for the input?
The score runs from 0 (poor relevancy) to 100 (strong relevancy). The implementation blends (1) an LLM-as-a-Judge score of how relevant each statement extracted from the context is to the input, and (2) a heuristic semantic-similarity score between the context and the input.
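The exact blending is internal to the metric, but as a rough sketch of the idea (the equal weighting and 0-100 scaling here are assumptions for illustration, not the documented formula), the combination might look like:

```python
def blend_context_relevancy(judge_scores, semantic_similarity, judge_weight=0.5):
    """Blend per-statement judge scores (each 0-1) with a semantic-similarity
    score (0-1) into a single 0-100 relevancy score.

    The averaging and the 50/50 weighting are illustrative assumptions.
    """
    if not judge_scores:
        return 0.0
    # Fraction of extracted statements judged relevant to the input.
    judge_component = sum(judge_scores) / len(judge_scores)
    blended = judge_weight * judge_component + (1 - judge_weight) * semantic_similarity
    return round(blended * 100, 2)

# Two of three context statements judged relevant, similarity 0.8:
print(blend_context_relevancy([1.0, 1.0, 0.0], 0.8))
```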
How to interpret the score
- Closer to 100: the context tends to focus on the user’s input; little obvious filler or irrelevant information.
- Closer to 0: the context contains significant off-topic material, ignores the input, or only partially relates to what was asked.
High context relevancy does not guarantee the context is complete or sufficient for a good answer. Pair this with metrics like context recall and context waste when evaluating retrieval quality end-to-end.
API usage
Prerequisites
Configure your API credentials as environment variables (the example below reads AEGIS_API_KEY and AEGIS_API_BASE_URL). The next step is to create a JSON payload for the custom-runs request. For a field-by-field description of the payload (top-level keys, evaluations, and each row in data), see Custom run request body.
Shortname: ctx_rel
Default threshold: 80
Inputs (each object in data)
- input (str, required): The user's question or instruction (what the context should be relevant to).
- context (str or list[str], required): The retrieved context chunks (documents or passages) to evaluate.
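Because context accepts either a single string or a list of strings, a small client-side check can catch malformed rows before posting. This is a hypothetical helper, not part of the API:

```python
def validate_row(row: dict) -> list:
    """Return a list of problems with a ctx_rel data row (client-side sanity check)."""
    problems = []
    # `input` must be a non-empty string.
    if not isinstance(row.get("input"), str) or not row["input"].strip():
        problems.append("input must be a non-empty str")
    # `context` may be a single string or a non-empty list of strings.
    ctx = row.get("context")
    ctx_ok = isinstance(ctx, str) or (
        isinstance(ctx, list) and ctx and all(isinstance(c, str) for c in ctx)
    )
    if not ctx_ok:
        problems.append("context must be a str or a non-empty list[str]")
    return problems

# An empty list means the row looks well-formed.
print(validate_row({"input": "What is the capital of France?", "context": ["Paris is the capital."]}))
```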
metric_args
max_n_statements (int, optional): Maximum number of statements to extract from the context. Each statement is compared with the input to check relevancy. An optimal number of statements is calculated automatically; this parameter only caps it. Default = 50.
Evaluation metadata
On successful evaluation, the metric returns eval_metadata describing context statements that scored poorly against the input:
irrelevant_statements (list[dict]): Statements derived from the context that were judged not relevant to the input. Each object contains statement (str), the extracted context statement, and reason (str), an explanation of why it was judged weakly relevant to the input.
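For example, a weakly relevant cuisine chunk might surface in eval_metadata roughly like this. The field names match this page, but the statement and reason strings are invented for illustration, not actual API output:

```python
# Illustrative eval_metadata shape for a low-scoring context statement.
eval_metadata = {
    "irrelevant_statements": [
        {
            "statement": "France is known for its cuisine, including baguettes and cheese.",
            "reason": "Describes cuisine, which does not address the capital question.",
        }
    ]
}

# Surface the flagged statements, e.g. when debugging a retriever.
for item in eval_metadata["irrelevant_statements"]:
    print(f"- {item['statement']} ({item['reason']})")
```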
Example
```python
import json
import os

import requests
from dotenv import load_dotenv

load_dotenv(override=True)

_API_KEY = os.getenv("AEGIS_API_KEY")
_BASE_URL = os.getenv("AEGIS_API_BASE_URL")
_CUSTOM_RUN_URL = f"{_BASE_URL}/runs/custom"


def post_custom_run(payload: dict) -> requests.Response:
    """POST JSON payload to Aegis custom runs; returns the raw response."""
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {_API_KEY}",
    }
    return requests.post(
        _CUSTOM_RUN_URL,
        headers=headers,
        data=json.dumps(payload),
    )


if __name__ == "__main__":
    context = [
        "Paris is the capital of France and its largest city.",
        "The Eiffel Tower was completed in 1889 for the World's Fair.",
        "France is known for its cuisine, including baguettes and cheese.",
    ]
    data = [
        {"input": "What is the capital of France?", "context": context},
    ]
    payload = {
        "threshold": 80,
        "model_slug": "o4-mini",
        "is_blocking": True,
        "data_collection_id": None,
        "evaluations": [
            {
                "metrics": [
                    {"metric": "ctx_rel", "metric_args": {"max_n_statements": 5}},
                ],
                "threshold": 80,
                "model_slug": "o4-mini",
                "data": data,
            }
        ],
    }

    response = post_custom_run(payload)
    response.raise_for_status()
    print(json.dumps(response.json(), indent=2))
```