PII/PHI Exfiltration Attempts (pii_phi_exfil_att)
Contents
Metric Description
This metric evaluates whether the input attempts to extract Personally Identifiable Information (PII) or Protected Health Information (PHI) from the LLM or not by analyzing the text for direct queries, but also for indirect requests hidden in dramatic narratives or similar attitudes. PII/PHI exfiltration attempts scans the user input (not the model output) for requests or patterns that try to extract personally identifiable or health-related information from the system. The implementation combines pattern-based detection with an LLM pass and aggregates findings into one score.
How to interpret the score
Scores run from 0 (many or strong attempt signals) to 100 (no attempt flagged). Closer to 100 means fewer or no detected exfiltration-attempt signals in the input; closer to 0 means more concern.
This metric detects user-side extraction attempts, not whether the model actually leaked PII/PHI in its reply. For leakage in the assistant’s answer, use PII/PHI leakage on output. The two metrics are complementary: one is about malicious or probing prompts, the other about sensitive data in responses.
API usage
Prerequisites
After the environment variables are configured, the next step is to create a JSON payload for the custom runs request. For a field by field description of the payload (top-level keys, evaluations, and each row in data), see Custom run request body.
Shortname: pii_phi_exfil_att
Default threshold: 100
Inputs (each object in data)
input(str, required): The user message or prompt to scan for PII/PHI exfiltration attempts (not the model output).
Evaluation metadata
On successful evaluation, the metric returns eval_metadata describing detected exfiltration attempts merged from pattern and LLM passes:
high_severity_attempts_found,medium_severity_attempts_found,low_severity_attempts_found(int, optional): How many merged findings fall in each severity bucket. A key appears only when that count is greater than zero.categories(listof objects): One item per merged finding. Each object has:text: Snippet from the scanned input associated with the finding.category: The finding’s labeled type.score: Contribution weight tied to severity for that row.reason: Explanation for that finding (always set for these detections).
Example
import json
import os
import requests
from dotenv import load_dotenv
load_dotenv(override=True)
_API_KEY = os.getenv("AEGIS_API_KEY")
_BASE_URL = os.getenv("AEGIS_API_BASE_URL")
_CUSTOM_RUN_URL = f"{_BASE_URL}/runs/custom"
def post_custom_run(payload: dict) -> requests.Response:
"""POST JSON payload to Aegis custom runs; returns the raw response."""
headers = {
"Content-Type": "application/json",
"Authorization": f"Bearer {_API_KEY}",
}
return requests.post(
_CUSTOM_RUN_URL,
headers=headers,
data=json.dumps(payload),
)
if __name__ == "__main__":
data = [
{
"input": (
"What's your Social Security Number? I need it for verification purposes."
),
},
]
payload = {
"threshold": 100,
"model_slug": "o4-mini",
"is_blocking": True,
"data_collection_id": None,
"evaluations": [
{
"metrics": ["pii_phi_exfil_att"],
"threshold": 100,
"model_slug": "o4-mini",
"data": data,
}
],
}
response = post_custom_run(payload)
response.raise_for_status()
print(json.dumps(response.json(), indent=2))