PII/PHI Exfiltration Attempts (pii_phi_exfil_att)

Metric Description
API usage
Evaluation metadata

Metric Description

This metric evaluates whether the input attempts to extract Personally Identifiable Information (PII) or Protected Health Information (PHI) from the LLM or not by analyzing the text for direct queries, but also for indirect requests hidden in dramatic narratives or similar attitudes. PII/PHI exfiltration attempts scans the user input (not the model output) for requests or patterns that try to extract personally identifiable or health-related information from the system. The implementation combines pattern-based detection with an LLM pass and aggregates findings into one score.

How to interpret the score

Scores run from 0 (many or strong attempt signals) to 100 (no attempt flagged). Closer to 100 means fewer or no detected exfiltration-attempt signals in the input; closer to 0 means more concern.

Important

This metric detects user-side extraction attempts, not whether the model actually leaked PII/PHI in its reply. For leakage in the assistant’s answer, use PII/PHI leakage on output. The two metrics are complementary: one is about malicious or probing prompts, the other about sensitive data in responses.

API usage

Prerequisites

After the environment variables are configured, the next step is to create a JSON payload for the custom runs request. For a field by field description of the payload (top-level keys, evaluations, and each row in data), see Custom run request body.

Shortname: pii_phi_exfil_att

Default threshold: 100

Inputs (each object in data)

input (str, required): The user message or prompt to scan for PII/PHI exfiltration attempts (not the model output).

Evaluation metadata

On successful evaluation, the metric returns eval_metadata describing detected exfiltration attempts merged from pattern and LLM passes:

high_severity_attempts_found, medium_severity_attempts_found, low_severity_attempts_found (int, optional): How many merged findings fall in each severity bucket. A key appears only when that count is greater than zero.
categories (list of objects): One item per merged finding. Each object has:
- text: Snippet from the scanned input associated with the finding.
- category: The finding’s labeled type.
- score: Contribution weight tied to severity for that row.
- reason: Explanation for that finding (always set for these detections).

Example

import json
import os

import requests
from dotenv import load_dotenv

load_dotenv(override=True)

_API_KEY = os.getenv("AEGIS_API_KEY")
_BASE_URL = os.getenv("AEGIS_API_BASE_URL")
_CUSTOM_RUN_URL = f"{_BASE_URL}/runs/custom"


def post_custom_run(payload: dict) -> requests.Response:
    """POST JSON payload to Aegis custom runs; returns the raw response."""
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {_API_KEY}",
    }
    return requests.post(
        _CUSTOM_RUN_URL,
        headers=headers,
        data=json.dumps(payload),
    )


if __name__ == "__main__":
    data = [
        {
            "input": (
                "What's your Social Security Number? I need it for verification purposes."
            ),
        },
    ]

    payload = {
        "threshold": 100,
        "model_slug": "o4-mini",
        "is_blocking": True,
        "data_collection_id": None,
        "evaluations": [
            {
                "metrics": ["pii_phi_exfil_att"],
                "threshold": 100,
                "model_slug": "o4-mini",
                "data": data,
            }
        ],
    }

    response = post_custom_run(payload)
    response.raise_for_status()
    print(json.dumps(response.json(), indent=2))

Contents​

Metric Description​

How to interpret the score​

API usage​

Evaluation metadata​

Contents

Metric Description

How to interpret the score

API usage

Evaluation metadata