Answer Relevancy (ans_rel)
Metric Description
Answer relevancy measures whether the model’s answer (output) actually addresses what was asked (input), not whether every fact is grounded in a document (that is closer to faithfulness). An answer can be relevant while adding opinion or extra detail; it is irrelevant when it drifts off-topic, ignores the question, or focuses on something the user did not ask for.
The score runs from 0 (poor match to the question) to 100 (strong match). The implementation blends (1) an LLM-as-a-judge assessment of how well the answer stays on topic with (2) heuristic metrics.
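The blending step can be sketched as a weighted average of the two signals. The 50/50 weighting below is an illustrative assumption; the actual weighting is internal to the metric.

```python
def blend_relevancy(judge_score: float, heuristic_score: float,
                    judge_weight: float = 0.5) -> float:
    """Blend an LLM-judge score and a heuristic score into one 0-100 score.

    Both inputs are assumed to already be on a 0-100 scale; the equal
    weighting is illustrative, not the metric's documented behavior.
    """
    return judge_weight * judge_score + (1 - judge_weight) * heuristic_score
```

With a judge score of 80 and a heuristic score of 60, this sketch yields a blended score of 70.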
How to interpret the score
- Closer to 100: the answer tends to focus on the user’s question; little obvious drift or filler that does not serve the ask.
- Closer to 0: the answer ignores the question, wanders to unrelated topics, or only partially addresses what was asked.
High relevancy does not guarantee facts are correct or supported by a source. Pair this with other metrics like factfulness and faithfulness when that matters.
API usage
Prerequisites
Configure the required environment variables (the example below reads AEGIS_API_KEY and AEGIS_API_BASE_URL), then create a JSON payload for the custom-runs request. For a field-by-field description of the payload (top-level keys, evaluations, and each row in data), see Custom run request body.
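One way to provide the credentials the example client reads is to export them in your shell (or place the same two lines, without export, in a .env file for python-dotenv). The values shown are placeholders, not real endpoints or keys.

```shell
# Placeholder values -- substitute your own API key and base URL.
export AEGIS_API_KEY="your-api-key-here"
export AEGIS_API_BASE_URL="https://aegis.example.com/api/v1"
```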
Shortname: ans_rel
Default threshold: 70
Inputs (each object in data)
- input (str, required): The user's question or instruction (what the answer should respond to).
- output (str, required): The model-generated answer to evaluate.
metric_args
- max_n_statements (int, optional): Maximum number of statements to split the output (answer) into; each statement is compared with the input (user input) to check relevancy. The optimal number of statements is calculated automatically; this parameter only sets the cap. Default = 50.
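The split-and-cap step can be sketched as follows. The naive sentence-boundary split is an illustrative assumption; the metric's actual statement extraction is internal and LLM-driven.

```python
import re


def split_statements(output: str, max_n_statements: int = 50) -> list[str]:
    """Naively split an answer into statements and cap the count.

    The real metric chooses an optimal statement count automatically;
    max_n_statements only bounds it from above.
    """
    statements = [s.strip()
                  for s in re.split(r"(?<=[.!?])\s+", output)
                  if s.strip()]
    return statements[:max_n_statements]
```

Each resulting statement would then be compared against the input to decide whether it serves the question.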
Evaluation metadata
On successful evaluation, the metric returns eval_metadata with reasons for irrelevance:
- irrelevant_statements_reasons (list[str]): Explanations for why parts of the answer are not relevant to the input question.
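For orientation, an eval_metadata object might take the following shape. The reason string below is an invented example, not real API output.

```python
# Illustrative shape only -- the reason text is a made-up example of the
# kind of explanation the metric returns for an off-topic statement.
eval_metadata = {
    "irrelevant_statements_reasons": [
        "The statement about French cuisine does not address the "
        "question about the capital.",
    ],
}
```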
Example
import json
import os

import requests
from dotenv import load_dotenv

load_dotenv(override=True)

_API_KEY = os.getenv("AEGIS_API_KEY")
_BASE_URL = os.getenv("AEGIS_API_BASE_URL")
_CUSTOM_RUN_URL = f"{_BASE_URL}/runs/custom"


def post_custom_run(payload: dict) -> requests.Response:
    """POST a JSON payload to the Aegis custom-runs endpoint; return the raw response."""
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {_API_KEY}",
    }
    return requests.post(
        _CUSTOM_RUN_URL,
        headers=headers,
        data=json.dumps(payload),
    )


if __name__ == "__main__":
    # Records must match the fields the metric expects (here: input, output).
    data = [
        {"input": "What is the capital of France?", "output": "Paris is the capital of France."},
    ]
    payload = {
        "threshold": 70,  # threshold at the run level
        "model_slug": "o4-mini",
        "is_blocking": True,
        "data_collection_id": None,
        "evaluations": [
            {
                "metrics": [
                    {"metric": "ans_rel", "metric_args": {"max_n_statements": 5}},
                ],
                "threshold": 70,  # threshold at the metric level
                "model_slug": "o4-mini",
                "data": data,
            }
        ],
    }

    response = post_custom_run(payload)
    response.raise_for_status()
    print(json.dumps(response.json(), indent=2))