Model-as-Judge: Layer 2¶
The Judge catches what guardrails miss. It evaluates AI outputs for subtle policy violations, hallucinations, inappropriate content, and novel attack patterns that can't be detected with pattern matching.
Key principle: The Judge detects. It doesn't decide. The Judge outputs PASS, REVIEW, or ESCALATE. Decisions about what to do with REVIEW and ESCALATE verdicts flow to humans (Layer 3) or automated policy.
Verdict Types¶
| Verdict | Meaning | Action |
|---|---|---|
| PASS | Output is safe and policy-compliant | Deliver to user |
| REVIEW | Uncertain or borderline, needs human review | Queue for review (or log, based on config) |
| ESCALATE | Clearly violates policy or is dangerous | Block immediately, alert security team |
Rule-Based Judge¶
Start here. No API key needed, no latency, no cost:
from airs.runtime.judge import RuleBasedJudge
judge = RuleBasedJudge(max_output_length=10000)
result = await judge.evaluate(
input_text="Tell me a story",
output_text="Here is a normal response.",
)
# result.verdict == JudgeVerdict.PASS
The rule-based judge catches:
- Excessive output length: outputs exceeding the configured limit get
REVIEW - Refusal-then-comply: when the model says "I can't do that" but then proceeds to do it anyway (a common jailbreak indicator)
This is useful for:
- Getting started without any external dependencies
- Low-risk deployments where basic checks suffice
- Testing and development environments
Model-as-Judge¶
For production use. Works with any OpenAI-compatible API:
from airs.runtime.judge import LLMJudge
judge = LLMJudge(
model="gpt-4o-mini",
api_key="sk-...", # or set OPENAI_API_KEY env var
)
result = await judge.evaluate(
input_text="What was our Q3 revenue?",
output_text="Q3 revenue was $42.7M, up 15% year-over-year.",
policy="Only provide information verifiable against provided documents.",
)
print(result.verdict) # JudgeVerdict.REVIEW
print(result.reason) # "Financial claims without source documents"
print(result.confidence) # 0.85
Compatible Providers¶
The LLMJudge works with any API that implements the OpenAI chat completions format:
# OpenAI
judge = LLMJudge(model="gpt-4o-mini", api_key="sk-...")
# Ollama (local)
judge = LLMJudge(
model="llama3",
base_url="http://localhost:11434/v1",
api_key="ollama", # required by client but not validated
)
# Azure OpenAI
judge = LLMJudge(
model="gpt-4o-mini",
base_url="https://your-resource.openai.azure.com/openai/deployments/gpt-4o-mini/",
api_key="your-azure-key",
)
# Any OpenAI-compatible API (vLLM, Together, Groq, etc.)
judge = LLMJudge(
model="meta-llama/Llama-3-70b",
base_url="https://api.together.xyz/v1",
api_key="your-key",
)
Use a different model
The Judge should be a different model from the one being evaluated. Using the same model to judge itself reduces the chance of catching its own errors. Ideally, use a different provider entirely.
Custom Policy¶
The default policy checks for hallucination, PII leakage, content policy violations, prompt injection in output, and off-topic responses. Override with your own:
result = await judge.evaluate(
input_text=user_input,
output_text=ai_output,
policy="""
Evaluate against these rules:
1. Response must only reference data from the provided knowledge base
2. No financial advice or predictions
3. Must include appropriate disclaimers for medical information
4. Must not reveal internal system details or pricing logic
""",
)
Custom Prompt Template¶
For full control over the judge prompt:
judge = LLMJudge(
model="gpt-4o-mini",
prompt_template="""You are an expert evaluator for a {domain} application.
## Context
{policy}
## User Query
{input_text}
## AI Response
{output_text}
## Evaluate
Return JSON: {{"verdict": "pass"|"review"|"escalate", "reason": "...", "confidence": 0.0-1.0}}
""",
)
The template receives {policy}, {input_text}, and {output_text} as format variables.
Custom Judge¶
Implement the Judge base class for any evaluation logic:
from airs.runtime.judge import Judge, JudgeEvaluation, JudgeVerdict
class RAGGroundingJudge(Judge):
"""Check if output is grounded in retrieved documents."""
def __init__(self, similarity_threshold: float = 0.7):
self.threshold = similarity_threshold
async def evaluate(self, input_text, output_text, policy="", **kwargs):
# Get retrieved docs from kwargs
sources = kwargs.get("sources", [])
if not sources:
return JudgeEvaluation(
verdict=JudgeVerdict.REVIEW,
reason="No source documents provided for grounding check",
confidence=0.5,
)
# Your grounding logic here
grounding_score = compute_grounding(output_text, sources)
if grounding_score < self.threshold:
return JudgeEvaluation(
verdict=JudgeVerdict.ESCALATE,
reason=f"Output not grounded in sources (score: {grounding_score:.2f})",
confidence=0.9,
)
return JudgeEvaluation(
verdict=JudgeVerdict.PASS,
reason=f"Output grounded (score: {grounding_score:.2f})",
confidence=grounding_score,
)
Judge in the Pipeline¶
When used within the SecurityPipeline, the judge's behavior is controlled by the PACE state:
| PACE State | Judge Behavior |
|---|---|
| Primary | 5% sampling (evaluate 1 in 20 requests) |
| Alternate | 100% evaluation (every request) |
| Contingency | 100% evaluation + human approval required |
| Emergency | Disabled (AI is off, circuit breaker active) |
The judge always evaluates if:
- A guardrail flagged the request (regardless of sampling rate)
- The pipeline config has
judge_enabled=Trueandpace_enabled=False
Judge Deployment Phases¶
The framework recommends deploying the judge in phases:
- Shadow: Judge evaluates but results are only logged, not acted on. Measure accuracy against human reviews.
- Advisory: Judge verdicts are shown to reviewers as a recommendation but don't block.
- Operational: Judge verdicts drive blocking and escalation.
Use the rule-based judge or FLAG mode guardrails to implement shadow mode without any judge infrastructure.
For latency-critical deployments, shadow mode is also the recommended first step when introducing a distilled Small Language Model as an inline judge. Run it in shadow alongside the full LLM Judge to validate accuracy before switching to enforcement.