Summarization (summ)

Metric Description

Summarization evaluates whether an LLM’s output (summary) effectively condenses the input (source text) into a concise, coherent, and factually faithful summary. The metric checks that the model captures central points, removes superfluous information, and preserves the text’s main concepts and conclusions without distortion.

The score runs from 0 (poor summary) to 100 (high-quality summary). The implementation blends (1) LLM-as-a-Judge components and (2) heuristic metrics.

How to interpret the score

  • Closer to 100: The summary captures key information, aligns with the source, avoids redundancy, and stays concise.
  • Closer to 0: The summary omits important points, contradicts or adds unsupported facts, is redundant, or is disproportionately long.
Important

This metric focuses on summarization quality: coverage, alignment, compression, and length. If entities are crucial for your use case, use entity faithfulness for a more robust check.

API usage

Prerequisites

Once the environment variables are configured, create a JSON payload for the custom-runs request. For a field-by-field description of the payload (top-level keys, evaluations, and each row in data), see Custom run request body.
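
The example at the end of this page reads two environment variables via python-dotenv. A minimal .env file, with placeholder values, might look like this:

AEGIS_API_KEY=<your-api-key>
AEGIS_API_BASE_URL=<your-base-url>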

Shortname: summ

Default threshold: 80

Inputs (each object in data)

  • input (str required): The original source text to be summarized.
  • output (str required): The model-generated summary to evaluate.
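
For illustration, a single data row pairs the source text with its candidate summary (placeholder values):

{
    "input": "<original source text>",
    "output": "<model-generated summary to evaluate>"
}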

metric_args

  • n_key_points (int optional): Number of key points to extract from the input. If not provided, an optimal value is derived from the input length.
  • max_n_claims (int optional): Maximum number of claims to generate from the output for alignment checks. The number of claims is calculated automatically; this parameter only sets the cap. Default = 50.
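
Both arguments are optional. A sketch of the two common configurations, written as Python dicts matching the example at the end of this page (the override values are illustrative):

{"metric": "summ"}  # omit metric_args: both values are derived automatically
{"metric": "summ", "metric_args": {"n_key_points": 5, "max_n_claims": 10}}  # explicit overrides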

Evaluation metadata

On successful evaluation, the metric returns eval_metadata with coverage and alignment diagnostics:

  • missing_details (list[str]): Explanations of source details that are not fully represented in the summary.
  • misaligned_details (list[str]): Explanations of summary content that does not align with the source.
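
Where eval_metadata appears in the response body depends on the run schema. As a schema-agnostic sketch, the hypothetical helper below (find_eval_metadata is not part of the API) walks a parsed response and collects every eval_metadata object it finds:

def find_eval_metadata(node):
    # Recursively collect every "eval_metadata" object in a JSON-like structure.
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "eval_metadata":
                found.append(value)
            else:
                found.extend(find_eval_metadata(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_eval_metadata(item))
    return found

# Given a parsed response body (see the Example below):
for meta in find_eval_metadata(response.json()):
    print("missing:", meta.get("missing_details"))
    print("misaligned:", meta.get("misaligned_details"))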

Example

import json
import os

import requests
from dotenv import load_dotenv

load_dotenv(override=True)

_API_KEY = os.getenv("AEGIS_API_KEY")
_BASE_URL = os.getenv("AEGIS_API_BASE_URL")
_CUSTOM_RUN_URL = f"{_BASE_URL}/runs/custom"


def post_custom_run(payload: dict) -> requests.Response:
    """POST JSON payload to Aegis custom runs; returns the raw response."""
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {_API_KEY}",
    }
    return requests.post(
        _CUSTOM_RUN_URL,
        headers=headers,
        data=json.dumps(payload),
    )


if __name__ == "__main__":
    data = [
        {
            "input": "The quick brown fox jumps over the lazy dog. This is a classic pangram that contains every letter of the English alphabet at least once. It is often used for testing purposes in typography, printing, and typewriters.",
            "output": "The quick brown fox pangram contains all alphabet letters and is used for testing.",
        },
    ]

    payload = {
        "threshold": 80,
        "model_slug": "o4-mini",
        "is_blocking": True,
        "data_collection_id": None,
        "evaluations": [
            {
                "metrics": [
                    {"metric": "summ", "metric_args": {"n_key_points": 5, "max_n_claims": 5}},
                ],
                "threshold": 80,
                "model_slug": "o4-mini",
                "data": data,
            }
        ],
    }

    response = post_custom_run(payload)
    response.raise_for_status()
    print(json.dumps(response.json(), indent=2))