Create custom run

Endpoint: POST /runs/custom

Description: Creates a run from the evaluations array in the JSON body. Each element of the array defines the metrics to apply and the rows to score.

Sharing rules

The new run is always owned by you (your account is the individual owner). Sharing cannot be set directly on this endpoint; visibility is derived from the linked data collection, if any. Runs created here are never org-only (org-only resources have no individual owner).

Without data_collection_id → the run is created private (org_id = null).
With data_collection_id:
- The linked collection is org-shared → the run is created shared with the collection's org_id. You must still be a member of that organization.
- The linked collection is private → the run is created private.
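
The rules above can be read as a small decision function. The following is only an illustrative sketch of the documented behaviour, not the server's implementation; derive_run_org_id and both of its parameters are hypothetical names introduced here.

```python
from typing import Optional


def derive_run_org_id(
    data_collection_id: Optional[int],
    collection_org_id: Optional[int],
) -> Optional[int]:
    """Return the org_id a new custom run would get under the rules above.

    collection_org_id is the org_id of the linked collection, or None when
    that collection is private. Both parameter names are hypothetical.
    """
    if data_collection_id is None:
        # No linked collection: the run is created private.
        return None
    # Linked collection: the run inherits the collection's visibility
    # (its org_id when org-shared, None when private).
    return collection_org_id
```

A return value of None corresponds to a private run (org_id = null in the response).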

Use PUT /runs/{run_id} to share, unshare, attach, detach, or rename a custom run after it has been created.

Parameters

Body — application/json:

{
  "threshold": "integer | null",
  "model_slug": "string | null",
  "is_blocking": false,
  "data_collection_id": "integer | null",
  "alias": "string | null",
  "evaluations": [
    {
      "metrics": [
        "string  (metric shortname)",
        {
          "metric": "string  (metric shortname)",
          "metric_args": "object | null",
          "threshold": "integer | null",
          "model_slug": "string | null"
        }
      ],
      "threshold": "integer | null",
      "model_slug": "string | null",
      "data": [
        {
          "prompt": "string | null",
          "input": "string | null",
          "context": "string | null",
          "output": "string | null",
          "golden_answer": "string | null"
        }
      ]
    }
  ]
}

Each item in metrics is either a metric shortname (string) or an object with metric plus optional metric_args, threshold, and model_slug. Use the object form when the metric accepts arguments (see the metric's doc page) or when you want to override the threshold or model for that metric only. Unknown argument names are rejected, as are missing required arguments that have no metric-level default.
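
If it helps to treat both forms uniformly on the client side, the string form can be expanded into the object form before sending. This is a minimal sketch under that assumption; normalize_metrics is a hypothetical helper, not part of the API, and the server accepts the mixed list as-is.

```python
from typing import Any, Dict, List, Union

MetricEntry = Union[str, Dict[str, Any]]


def normalize_metrics(metrics: List[MetricEntry]) -> List[Dict[str, Any]]:
    """Expand every entry into the object form with all four keys present."""
    normalized = []
    for entry in metrics:
        if isinstance(entry, str):
            # A bare shortname is equivalent to an object with only "metric".
            entry = {"metric": entry}
        elif not isinstance(entry, dict) or "metric" not in entry:
            raise ValueError(f"invalid metric entry: {entry!r}")
        normalized.append(
            {
                "metric": entry["metric"],
                "metric_args": entry.get("metric_args"),
                "threshold": entry.get("threshold"),
                "model_slug": entry.get("model_slug"),
            }
        )
    return normalized
```

This only checks the shape of each entry locally; argument names and required arguments are still validated by the server.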

Error responses

400 — empty evaluations; unknown/inactive metric shortnames; duplicate alias.
401, 402 — authentication or insufficient balance.
404 — model_slug is not a documented model slug; data_collection_id not found; no metrics resolved.
422 — request validation failed.
500 — failure while creating the run.
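
A client may want to turn these status codes into readable messages. The sketch below copies the meanings from the list above; explain_status and ERROR_MEANINGS are hypothetical names, and the format of the error response body is not assumed.

```python
# Meanings taken from the list above. 401 and 402 share one description
# there, so they share one here as well.
ERROR_MEANINGS = {
    400: "empty evaluations, unknown/inactive metric shortname, or duplicate alias",
    401: "authentication or insufficient balance",
    402: "authentication or insufficient balance",
    404: "unknown model_slug, data_collection_id not found, or no metrics resolved",
    422: "request validation failed",
    500: "failure while creating the run",
}


def explain_status(status_code: int) -> str:
    """Map a status code from POST /runs/custom to a short description."""
    if status_code == 201:
        return "run created"
    return ERROR_MEANINGS.get(status_code, "undocumented status")
```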

Responses

201 — same run object shape as Get run.

Example response (201)

{
  "id": 992,
  "user": "analyst@acme.com",
  "author_email": "analyst@acme.com",
  "run_type": "Custom",
  "run_source": "API Call",
  "dataset": null,
  "data_collection": "Customer Support",
  "org_id": null,
  "number_of_metrics": 1,
  "result": 100,
  "threshold": 70,
  "model_slug": "gpt-4o",
  "alias": "smoke-test",
  "aggregate_results": {
    "ans_corr": 100
  },
  "total_cost": "0.000400000000000",
  "started_at": "2026-04-01T09:15:01Z",
  "finished_at": "2026-04-01T09:15:03Z",
  "is_gte_threshold": true,
  "evaluations": []
}
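
As a rough illustration of reading this object, the sketch below pulls out a few fields. It assumes is_gte_threshold means the result met the threshold (as the field name suggests) and that total_cost is a decimal number encoded as a string, as in the example; summarize_run is a hypothetical helper.

```python
from decimal import Decimal
from typing import Any, Dict


def summarize_run(run: Dict[str, Any]) -> Dict[str, Any]:
    """Extract the headline fields from a run object."""
    return {
        "id": run["id"],
        "alias": run["alias"],
        # Assumption: True when result >= threshold.
        "passed": run["is_gte_threshold"],
        # total_cost arrives as a string; Decimal avoids float rounding.
        "cost": Decimal(run["total_cost"]),
        # org_id is null for private runs.
        "private": run["org_id"] is None,
    }
```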

Metric shortnames

Use these in metrics (or as metric in the object form).

Shortnames are grouped by category: Content generation, General, RAG, Structural, Safety, Security.

curl

curl -X POST "https://api.aegisevals.ai/api/v1/runs/custom" \
  -H "Authorization: Bearer sk_00000000000000000000000000000000" \
  -H "Content-Type: application/json" \
  -d '{"threshold":70,"model_slug":"gpt-4o","is_blocking":false,"alias":"smoke-test","evaluations":[{"metrics":["ans_corr"],"data":[{"prompt":"What is 2+2?","output":"4","golden_answer":"4"}]}]}'

Examples

Several rows, one metric

{
  "threshold": 75,
  "model_slug": "gpt-4o",
  "is_blocking": false,
  "alias": "support-batch-2025-03-27",
  "evaluations": [
    {
      "metrics": ["ans_corr"],
      "threshold": 75,
      "model_slug": "gpt-4o",
      "data": [
        {
          "prompt": "What is your refund policy for annual plans?",
          "output": "We refund unused months if you cancel within 14 days of renewal.",
          "golden_answer": "Annual plans are refundable for the unused portion within 14 days of the renewal charge."
        },
        {
          "prompt": "How do I export my data?",
          "output": "Open Settings → Data → Export; you will get a CSV within a few minutes.",
          "golden_answer": "Use Settings → Data → Export to download a CSV of your workspace."
        }
      ]
    }
  ]
}

RAG: context + answer metrics

Pass retrieved context with the model output. Here ctx_faith and ctx_rel run on the same rows.

{
  "threshold": 70,
  "model_slug": "gpt-4o-mini",
  "is_blocking": false,
  "evaluations": [
    {
      "metrics": ["ctx_faith", "ctx_rel"],
      "threshold": 70,
      "model_slug": "gpt-4o-mini",
      "data": [
        {
          "input": "When did the Acme Corp fiscal year end in 2024?",
          "context": "Acme Corp FY2024 ended on September 30, 2024. Revenue was $120M.",
          "output": "Acme’s 2024 fiscal year ended on September 30, 2024.",
          "golden_answer": null
        }
      ]
    }
  ]
}

Two evaluation blocks

Give each block its own threshold and model, for example a cheap model for broad screening and a stronger model with a stricter threshold for a smaller slice.

{
  "threshold": 80,
  "model_slug": "gpt-4o-mini",
  "is_blocking": false,
  "evaluations": [
    {
      "metrics": ["ans_rel"],
      "threshold": 60,
      "model_slug": "gpt-4o-mini",
      "data": [
        {
          "prompt": "Summarize our SLA in one sentence.",
          "output": "We target 99.9% monthly uptime excluding scheduled maintenance."
        }
      ]
    },
    {
      "metrics": ["faith"],
      "threshold": 85,
      "model_slug": "gpt-4o",
      "data": [
        {
          "prompt": "What guarantees does the SLA provide?",
          "context": "SLA: 99.9% uptime; credits apply if below target.",
          "output": "The SLA promises 99.9% uptime and service credits if we miss it."
        }
      ]
    }
  ]
}
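
To preview which threshold and model each metric would end up with, the settings can be resolved locally. The precedence used here (metric object, then evaluation block, then top level) is an assumption inferred from the override wording on this page, not confirmed server behaviour; effective_settings is a hypothetical helper.

```python
from typing import Any, Dict, List, Optional, Tuple


def effective_settings(
    payload: Dict[str, Any],
) -> List[Tuple[str, Optional[int], Optional[str]]]:
    """Return (metric, threshold, model_slug) for every metric in the payload.

    Assumed precedence: metric object > evaluation block > top level.
    """
    resolved = []
    for block in payload["evaluations"]:
        for entry in block["metrics"]:
            metric = {"metric": entry} if isinstance(entry, str) else entry
            values = {}
            for key in ("threshold", "model_slug"):
                # Take the first scope that sets a non-null value.
                values[key] = next(
                    (
                        scope[key]
                        for scope in (metric, block, payload)
                        if scope.get(key) is not None
                    ),
                    None,
                )
            resolved.append(
                (metric["metric"], values["threshold"], values["model_slug"])
            )
    return resolved
```

Under that assumed precedence, the payload above would resolve to ans_rel at threshold 60 on gpt-4o-mini and faith at threshold 85 on gpt-4o.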

Mixed metrics list + metric_args

Use strings when no options are needed, and objects when a metric accepts metric_args (see that metric’s doc page).

{
  "threshold": 100,
  "is_blocking": false,
  "evaluations": [
    {
      "metrics": [
        "exact_match",
        {
          "metric": "json_equal",
          "metric_args": {
            "ignore_extra_keys": true,
            "ignore_order": false
          }
        }
      ],
      "threshold": 100,
      "model_slug": "gpt-4o-mini",
      "data": [
        {
          "output": "{\"status\":\"ok\",\"items\":[1,2]}",
          "golden_answer": "{\"items\":[1,2],\"status\":\"ok\"}"
        }
      ]
    }
  ]
}

Python

import json
import os
import requests
from dotenv import load_dotenv

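# Load AEGIS_API_KEY and AEGIS_API_BASE_URL from a local .env file.
# The base URL is expected to include the version prefix,
# e.g. https://api.aegisevals.ai/api/v1
load_dotenv(override=True)
API_KEY = os.environ["AEGIS_API_KEY"]
BASE = os.environ["AEGIS_API_BASE_URL"].rstrip("/")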

payload = {
    "threshold": 75,
    "model_slug": "gpt-4o",
    "is_blocking": False,
    "alias": "python-example",
    "evaluations": [
        {
            "metrics": ["ans_corr"],
            "threshold": 75,
            "model_slug": "gpt-4o",
            "data": [
                {
                    "prompt": "Capital of France?",
                    "output": "Paris is the capital of France.",
                    "golden_answer": "Paris.",
                }
            ],
        }
    ],
}

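# json= serializes the payload and sets the Content-Type header;
# the timeout keeps the call from hanging indefinitely.
r = requests.post(
    f"{BASE}/runs/custom",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=30,
)
r.raise_for_status()  # raise on any 4xx/5xx response
print(json.dumps(r.json(), indent=2))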
