Answer Relevancy (ans_rel)

Metric Description

Answer relevancy measures whether the model’s answer (output) actually addresses what was asked (input), not whether every fact is grounded in a document (that is closer to faithfulness). An answer can be relevant while adding opinion or extra detail; it is irrelevant when it drifts off-topic, ignores the question, or focuses on something the user did not ask for.

The score runs from 0 (poor match to the question) to 100 (strong match). The implementation blends (1) an LLM-as-a-judge assessment of how well the answer stays on topic with (2) heuristic checks.
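
For intuition only, the minimal sketch below (not the actual implementation) splits an answer into statements, asks a judge callback whether each one addresses the question, and turns the fraction of relevant statements into a 0-100 score. The names split_into_statements, answer_relevancy, and the judge_statement callback are hypothetical.

# Illustrative sketch only -- not the Aegis implementation.
# `judge_statement` stands in for an LLM-as-a-judge call that returns
# True when a statement addresses the user's question.

def split_into_statements(answer: str) -> list[str]:
    # Naive split on sentences; the real metric picks the number of
    # statements itself (see max_n_statements below).
    return [s.strip() for s in answer.split(".") if s.strip()]


def answer_relevancy(question: str, answer: str, judge_statement) -> float:
    statements = split_into_statements(answer)
    if not statements:
        return 0.0
    relevant = sum(1 for s in statements if judge_statement(question, s))
    return 100.0 * relevant / len(statements)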

How to interpret the score

  • Closer to 100: the answer tends to focus on the user’s question; little obvious drift or filler that does not serve the ask.
  • Closer to 0: the answer ignores the question, wanders to unrelated topics, or only partially addresses what was asked.
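
As a rough illustration, the two rows below show the expected tendency only; actual scores depend on the judge model and heuristics.

# Illustrative input/output pairs (expected tendency, not guaranteed scores).
likely_relevant = {
    "input": "What is the capital of France?",
    "output": "Paris is the capital of France.",
}  # stays on the question -> score should land near 100

likely_irrelevant = {
    "input": "What is the capital of France?",
    "output": "France has excellent cheese and a long coastline.",
}  # ignores the question -> score should land near 0
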
Important

High relevancy does not guarantee facts are correct or supported by a source. Pair this with other metrics like factfulness and faithfulness when that matters.

API usage

Prerequisites

After the environment variables are configured, the next step is to create a JSON payload for the custom-runs request. For a field-by-field description of the payload (top-level keys, evaluations, and each row in data), see Custom run request body.
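
The example at the bottom of this page reads its credentials from the AEGIS_API_KEY and AEGIS_API_BASE_URL environment variables (loaded from a local .env file via python-dotenv). A minimal sanity check that they are set could look like this:

import os

from dotenv import load_dotenv

load_dotenv(override=True)  # picks up a local .env file, if present

for var in ("AEGIS_API_KEY", "AEGIS_API_BASE_URL"):
    if not os.getenv(var):
        raise RuntimeError(f"Environment variable {var} is not set")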

Shortname: ans_rel

Default threshold: 70

Inputs (each object in data)

  • input (str required): The user’s question or instruction (what the answer should respond to).
  • output (str required): The model-generated answer to evaluate.
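
Put together, a single row in data carries just the question and the answer to score, for example:

row = {
    "input": "What is the capital of France?",    # the user's question
    "output": "Paris is the capital of France.",  # the answer being evaluated
}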

metric_args

  • max_n_statements (int optional): Maximum number of statements the output (answer) is split into; each statement is compared against the input (the user's question) to check its relevancy. The optimal number of statements is calculated automatically; this parameter only sets the cap. Default = 50.
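
As in the full example below, metric_args sits next to the metric shortname inside evaluations[].metrics; a single metric entry could look like this:

metric_entry = {
    "metric": "ans_rel",
    # cap the answer at 5 statements; the metric may use fewer
    "metric_args": {"max_n_statements": 5},
}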

Evaluation metadata

On successful evaluation, the metric returns eval_metadata with reasons for irrelevance:

  • irrelevant_statements_reasons (list[str]): Explanations for why parts of the answer are not relevant to the input or question.
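
The exact response schema is not shown on this page, so the helper below only sketches how the reasons might be read from a parsed response body; the "results" key and the placement of eval_metadata are assumptions to adjust against the real schema.

def print_irrelevancy_reasons(run_result: dict) -> None:
    # Assumption: per-row results carry an `eval_metadata` dict; the
    # "results" key below is hypothetical -- adjust to the actual schema.
    for row_result in run_result.get("results", []):
        metadata = row_result.get("eval_metadata") or {}
        for reason in metadata.get("irrelevant_statements_reasons", []):
            print("irrelevant:", reason)

It would be called with the parsed body from the example below, e.g. print_irrelevancy_reasons(response.json()).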

Example

import json
import os

import requests
from dotenv import load_dotenv

load_dotenv(override=True)

_API_KEY = os.getenv("AEGIS_API_KEY")
_BASE_URL = os.getenv("AEGIS_API_BASE_URL")
_CUSTOM_RUN_URL = f"{_BASE_URL}/runs/custom"


def post_custom_run(payload: dict) -> requests.Response:
    """POST JSON payload to Aegis custom runs; returns the raw response."""
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {_API_KEY}",
    }
    return requests.post(
        _CUSTOM_RUN_URL,
        headers=headers,
        data=json.dumps(payload),
    )


if __name__ == "__main__":
    # Records must match the inputs this metric expects (input, output).
    data = [
        {"input": "What is the capital of France?", "output": "Paris is the capital of France."},
    ]

    payload = {
        "threshold": 70,  # threshold on the run level
        "model_slug": "o4-mini",
        "is_blocking": True,
        "data_collection_id": None,
        "evaluations": [
            {
                "metrics": [
                    {"metric": "ans_rel", "metric_args": {"max_n_statements": 5}},
                ],
                "threshold": 70,  # threshold on the metric level
                "model_slug": "o4-mini",
                "data": data,
            }
        ],
    }

    response = post_custom_run(payload)
    response.raise_for_status()
    print(json.dumps(response.json(), indent=2))