Format Consistency (format_consist)

Metric Description

Format consistency measures whether the model’s output stays internally consistent in how it is written across the whole text: same style and structure, similar depth and length, stable tone, uniform locale conventions (dates, numbers, spelling), a steady implied audience, and coherent brand-style signals. It does not compare the output to a prompt or user instructions (that is closer to format alignment); it only looks for drift or mixed patterns within the text itself.

The score runs from 0 to 100. Six categories are assessed over the entire output:

style_format — Writing style, structure, organization, formatting (headings, bullets, capitalization, punctuation).
length_scope — Sentence and paragraph length, detail level, and scope across sections.
tone_voice — Tone, personality, and emotional register (e.g. formal vs casual).
locale_formatting — Dates, numbers, currency, units, spelling variants (e.g. en-US vs en-GB).
audience_context — Implied audience, technical level, and how the text addresses the reader.
brand_guidelines — Brand voice, terminology, and style-guide-like consistency when inferable from the text.
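To make one of these categories concrete, here is a minimal local sketch of the kind of drift that locale_formatting refers to: more than one date style in the same text. It is an illustration only and is not how the metric computes its score; the pattern names, `date_styles_used`, and `has_mixed_date_styles` are hypothetical helpers invented for this example.

```python
import re

# Hypothetical patterns for illustration; they cover only three common date styles.
DATE_PATTERNS = {
    "iso": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),        # 2024-03-05
    "slashed": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),  # 3/12/2024
    "month_name": re.compile(                             # March 5, 2024
        r"\b(?:January|February|March|April|May|June|July|August|"
        r"September|October|November|December) \d{1,2}, \d{4}\b"
    ),
}


def date_styles_used(text: str) -> list[str]:
    """Return the names of the date styles that appear at least once in text."""
    return [name for name, pattern in DATE_PATTERNS.items() if pattern.search(text)]


def has_mixed_date_styles(text: str) -> bool:
    """True when the text uses more than one of the date styles above."""
    return len(date_styles_used(text)) > 1
```

A text such as "Launch on 2024-03-05. Review on 3/12/2024." mixes two styles and would be flagged by this sketch; a text that uses ISO dates throughout would not.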

How to interpret the score

Closer to 100: the text tends to read as one coherent piece—few sharp shifts in style, tone, locale, or audience.
Closer to 0: the text mixes incompatible styles, tones, or conventions in ways that read as inconsistent.

Important

High format consistency does not mean the content is correct, safe, or aligned with instructions. It only reflects internal uniformity. Pair with format alignment (instruction following), factfulness, content generation faithfulness, or other metrics when those matter.

API usage

Prerequisites

Set the AEGIS_API_KEY and AEGIS_API_BASE_URL environment variables (the example on this page reads both), then create a JSON payload for the custom-runs request. For a field-by-field description of the payload (top-level keys, evaluations, and each row in data), see Custom run request body.

Shortname: format_consist

Default threshold: 80

Inputs (each object in data)

output (str, required): The full model-generated text to evaluate for internal format and style consistency.
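Since each row in data needs only an output string, the rows can be built from a plain list of texts. The helper below is a sketch, not part of any client library: `build_data_rows` is a hypothetical name, and rejecting empty strings is an assumption made here (the source only states that output is a required str).

```python
def build_data_rows(outputs: list[str]) -> list[dict]:
    """Wrap each model output in the {"output": ...} row shape used in data."""
    rows = []
    for index, text in enumerate(outputs):
        # Assumption: an empty or non-string output is treated as a caller error.
        if not isinstance(text, str) or not text.strip():
            raise ValueError(f"output at index {index} must be a non-empty string")
        rows.append({"output": text})
    return rows
```

The returned list can be passed as the data value of an entry in evaluations.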

Evaluation metadata

On successful evaluation, the metric returns eval_metadata with extra detail about where the text looks internally inconsistent (across the categories described above):

format_inconsistency_details (list[str]): Short explanations for each category that did not look fully consistent end-to-end—drift in style, length, tone, locale, audience, or brand-style signals. If the output is consistent across all categories, this list is empty.
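A small sketch of reading this field once you have the eval_metadata object in hand. Where eval_metadata sits inside the full response is not described here, so the functions take that dictionary directly; `inconsistency_details` and `is_fully_consistent` are hypothetical helper names.

```python
def inconsistency_details(eval_metadata: dict) -> list[str]:
    """Return format_inconsistency_details, or an empty list if absent or null."""
    details = eval_metadata.get("format_inconsistency_details") or []
    return [str(item) for item in details]


def is_fully_consistent(eval_metadata: dict) -> bool:
    """True when no category was reported as inconsistent (the list is empty)."""
    return not inconsistency_details(eval_metadata)
```

An empty list corresponds to the documented case where the output is consistent across all categories.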

Example

import json
import os

import requests
from dotenv import load_dotenv

load_dotenv(override=True)

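_API_KEY = os.getenv("AEGIS_API_KEY")
_BASE_URL = os.getenv("AEGIS_API_BASE_URL")
# Fail early with a clear message instead of posting to "None/runs/custom".
if not _API_KEY or not _BASE_URL:
    raise RuntimeError("Set AEGIS_API_KEY and AEGIS_API_BASE_URL before running.")
_CUSTOM_RUN_URL = f"{_BASE_URL.rstrip('/')}/runs/custom"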


def post_custom_run(payload: dict) -> requests.Response:
    """POST JSON payload to Aegis custom runs; returns the raw response."""
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {_API_KEY}",
    }
    return requests.post(
        _CUSTOM_RUN_URL,
        headers=headers,
        data=json.dumps(payload),
    )


if __name__ == "__main__":
    data = [
        {
            "output": (
                "This report outlines the organization's migration from on-premises "
                "infrastructure to a secure, scalable cloud platform. The initiative aims "
                "to reduce operating costs, improve resilience, and accelerate product "
                "delivery. A phased approach minimizes risk and preserves continuity for "
                "critical services. Workloads are prioritized by business value and "
                "technical readiness."
            ),
        },
    ]

    payload = {
        "threshold": 80,  # threshold on the run level
        "model_slug": "o4-mini",
        "is_blocking": True,
        "data_collection_id": None,
        "evaluations": [
            {
                "metrics": ["format_consist"],  # metric shortname
                "threshold": 80,  # threshold on the metric level
                "model_slug": "o4-mini",
                "data": data,
            }
        ],
    }

    response = post_custom_run(payload)
    response.raise_for_status()
    print(json.dumps(response.json(), indent=2))
