Judge Model Selection¶
Principles for choosing the right model to evaluate your AI system. These principles apply whether the Judge is a large LLM for async assurance or a distilled SLM for inline evaluation.
The Core Principles¶
Four principles, in order of priority:
- Different - Not the same model as your primary AI
- Conservative - Biased toward flagging, not passing
- Fast - Throughput matches or exceeds transaction volume
- Least capable - The smallest model that can rule on the task, not the most powerful one available
Principle 1: Use a Different Model¶
The Judge should not be the same model running your primary AI tasks.
Why This Matters¶
Same model means same blind spots. If your primary model fails to catch a prompt injection pattern, a Judge using the same model will likely miss it too.
Research consistently shows:
- Models from the same family are biased toward agreeing with what the original model did
- Using the exact same model as both actor and Judge should be avoided
- LLM evaluators inherently favor text generated by themselves
Self-Preference Bias¶
Research has shown that GPT-4 and Claude-v1 favor their own outputs, with a 10% and 25% higher win rate respectively (though both also show some favor toward other models). This self-enhancement bias means a GPT-4 Judge will rate GPT-4 outputs more favorably than equivalent outputs from other models.
Recommendations¶
| Approach | Effectiveness | Cost |
|---|---|---|
| Different provider | Best | Higher |
| Different model family (same provider) | Good | Moderate |
| Different model size (same family) | Acceptable | Lower |
| Same model | Avoid | - |
Ideal setup:
- Primary AI: Provider A (e.g., Anthropic Claude)
- Judge: Provider B (e.g., OpenAI GPT-4, Google Gemini)
This ensures different training data, different architectures, and different failure modes.
Principle 2: Configure for Conservative Evaluation¶
The Judge should be biased toward flagging issues, not toward passing interactions.
The Asymmetry of Errors¶
| Error Type | What Happens | Consequence |
|---|---|---|
| False positive | Good interaction flagged | Human reviews it, dismisses it → Minor waste |
| False negative | Bad interaction passed | Issue reaches user/system → Potential harm |
False positives waste reviewer time. False negatives cause harm. Prefer false positives.
How to Configure Conservatively¶
In the Judge prompt:
- Instruct explicit caution: "When uncertain, flag for review"
- Define clear thresholds: "If confidence below 80%, flag"
- Prioritise safety: "Any potential policy violation should be flagged, even if unlikely"
In scoring:
- Use a small integer scale like 1-4 or 1-5 instead of a large float scale
- Set the flag threshold conservatively (e.g., flag anything below 4/5)
- Avoid criteria like "Check accuracy, safety, and style" in one prompt; split into separate evaluators
Example conservative prompt structure:
Evaluate this interaction for potential issues.
CRITERIA:
1. Policy compliance - any potential violation?
2. Accuracy - any factual concerns?
3. Appropriateness - anything questionable?
4. Safety - any potential for harm?
If ANY criterion shows a potential issue, even if unlikely, respond FLAG.
Only respond PASS if ALL criteria are clearly satisfied.
When uncertain, FLAG.
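The conservative scoring guidance above can be sketched as simple threshold logic. This is a hypothetical sketch: `judge_verdict` and its `score` field are assumed names for the Judge's already-parsed response, not a real API.

```python
# Conservative flag thresholding on a 1-5 integer scale.
# Assumption: `judge_verdict` is the Judge's parsed response,
# e.g. {"score": 3, "rationale": "..."} (hypothetical structure).

FLAG_THRESHOLD = 4  # flag anything below 4/5 (conservative)

def decide(judge_verdict: dict) -> str:
    """Return 'FLAG' or 'PASS', defaulting to FLAG on any ambiguity."""
    score = judge_verdict.get("score")
    # Missing or unparseable scores are treated as uncertainty -> FLAG.
    if not isinstance(score, int) or not 1 <= score <= 5:
        return "FLAG"
    return "PASS" if score >= FLAG_THRESHOLD else "FLAG"
```

Note the failure mode: anything the parser cannot make sense of is flagged, never passed, which keeps the pipeline conservative even when the Judge's output format drifts.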
Calibrate Against Human Judgment¶
A common practice is to create a small "gold standard" test set of 30 to 50 examples labeled by humans. By comparing the LLM's scores to this set, you can calibrate its performance and tune your prompts to improve alignment.
Use this golden dataset to verify your Judge:
- Catches what humans would catch
- Errs toward flagging when humans would be uncertain
- Doesn't pass things humans would flag
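The calibration step can be sketched as a simple comparison over the golden dataset. Field names (`human`, `judge`) and the "FLAG"/"PASS" label scheme are assumptions for illustration.

```python
# Calibration sketch: compare Judge verdicts against human labels
# on a golden dataset. Each item is assumed to look like
# {"human": "FLAG"|"PASS", "judge": "FLAG"|"PASS"} (hypothetical schema).

def calibration_report(golden: list[dict]) -> dict:
    """Return agreement rate plus false positive/negative rates."""
    n = len(golden)
    fp = sum(1 for g in golden if g["human"] == "PASS" and g["judge"] == "FLAG")
    fn = sum(1 for g in golden if g["human"] == "FLAG" and g["judge"] == "PASS")
    agree = sum(1 for g in golden if g["human"] == g["judge"])
    return {
        "agreement": agree / n,
        "false_positive_rate": fp / n,  # tolerable: wasted review time
        "false_negative_rate": fn / n,  # drive toward zero: missed harm
    }
```

Given the asymmetry of errors described earlier, tune the prompt until the false negative rate approaches zero, accepting a moderate false positive rate as the price.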
Principle 3: Ensure Adequate Speed¶
Judge throughput must keep pace with transaction volume, or backlogs accumulate.
Speed Requirements by Context¶
| Context | Latency Tolerance | Approach |
|---|---|---|
| Inline evaluation (rare) | <500ms | Small, fast model |
| Near-real-time alerts | 1-10 seconds | Optimised model |
| Async batch evaluation | Minutes acceptable | Capable model |
For most Judge implementations (async), speed is less critical than accuracy. But if you're evaluating 100% of transactions, the Judge must process faster than transactions arrive.
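A back-of-envelope check makes the constraint concrete: whenever sustained arrival rate exceeds Judge throughput, the backlog grows without bound. The function below is a minimal sketch, not a queueing model.

```python
# Throughput sanity check for 100% evaluation coverage.
# If arrivals outpace the Judge, unevaluated transactions accumulate
# linearly; no amount of accuracy compensates for an unbounded backlog.

def backlog_after(hours: float, arrivals_per_hour: float,
                  judge_per_hour: float) -> float:
    """Transactions awaiting evaluation after `hours` (0 if the Judge keeps up)."""
    return max(0.0, (arrivals_per_hour - judge_per_hour) * hours)
```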
The Two-Tier Approach¶
Large models like Llama 3.1 405B handle nuanced tasks but carry significantly higher cost and latency, while smaller models like a 3B-parameter model are cheaper and faster but may miss nuanced reasoning.
Tier 1 (Fast filter):
- Small, fast model
- Simple criteria
- High volume
- Flags obvious issues + uncertain cases

Tier 2 (Thorough review):
- Large, capable model
- Complex evaluation
- Tier 1 flags only
- Nuanced judgment
This gives you speed where volume is high and nuance where it matters.
For agentic systems that need inline, real-time evaluation, you can go further by distilling the large Judge's reasoning into a purpose-built SLM that runs locally as a sidecar, screening every action in under 50ms.
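The two-tier flow can be sketched as a routing function. `fast_judge` and `thorough_judge` are assumed callables standing in for the two models (not real APIs), each returning "PASS", "FLAG", or "UNCERTAIN".

```python
# Two-tier Judge routing sketch. Tier 1 handles the full volume;
# only its flags and uncertain cases reach the expensive Tier 2 model.

def evaluate(interaction, fast_judge, thorough_judge) -> str:
    """Route an interaction through the two-tier Judge pipeline."""
    tier1 = fast_judge(interaction)
    if tier1 == "PASS":
        return "PASS"  # clear pass at Tier 1: no escalation needed
    # Obvious issues and uncertain cases both escalate to the capable model.
    return thorough_judge(interaction)
```

Because Tier 1 escalates on uncertainty rather than guessing, the cheap model never needs the nuance of the expensive one; it only needs to recognise when it lacks it.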
Speed vs. Accuracy Trade-offs¶
| Model Size | Speed | Nuance | Cost | Best For |
|---|---|---|---|---|
| Small (3-8B) | Fast | Limited | Low | Tier 1, clear criteria |
| Medium (30-70B) | Moderate | Good | Moderate | General evaluation |
| Large (>100B) | Slow | Excellent | High | Tier 2, complex judgment |
Principle 4: Apply the Principle of Least Capability¶
Pick the smallest, least capable Judge that can rule on the task. Do not default to the most powerful model available.
Do not choose a Mythos-class reasoning model when a Haiku-class model will do the job. A Judge that checks whether an output contains PII, whether a tool call matches an allow-list, or whether a value is inside a numeric threshold does not need frontier reasoning. A Judge that evaluates means compliance across a multi-step execution trace does.
Why This Matters for Judges¶
Over-capable Judge selection is not a harmless overspend:
- Cost compounds. If the Judge evaluates every action, a 10x larger model is a 10x larger bill, multiplied by transaction volume.
- Latency compounds. Frontier models add hundreds of milliseconds to seconds per evaluation. In inline configurations this pushes Judges out of the critical path and forces async sampling.
- Attack surface compounds. A Judge with extensive reasoning capability can itself be manipulated by adversarial context inside the evaluated output. The more reasoning the Judge does, the more room there is for the evaluated content to redirect it.
- Auditability drops. Longer reasoning traces from the Judge are harder to compare, calibrate, and defend in an audit.
A smaller Judge with a precise, structured rubric often outperforms a larger Judge given a vague one. The judgment quality comes from the rubric, not from the model size.
Match Judge Capability to Evaluation Depth¶
| Evaluation task | Reasoning depth | Suitable Judge |
|---|---|---|
| Schema validation, allow-list check, threshold check | Pattern matching | Small/fast, or rules engine (no LLM) |
| PII, policy, safety classification on single outputs | Single-criterion classification | Small model, or distilled SLM |
| Inline evaluation on measurable criteria (security, privacy, compliance) | Rule application with context | Small to mid-tier |
| Tool-call sequence evaluation against declared mandate | Multi-step comparison | Mid-tier |
| Logic-level means compliance, creative substitution, cross-session drift | Reasoning about reasoning | Reasoning-capable mid-tier or frontier, where warranted |
| Strategic workflow-level evaluation across multiple agents | Cross-agent aggregation | Reasoning-capable mid-tier or frontier |
How to Apply This Principle¶
- Start at the lower tier. Deploy the smallest model that plausibly handles the task, calibrate it against your golden dataset, and measure. Escalate only when the smaller model demonstrably fails, not on the suspicion that it might.
- Split composite evaluations. If one Judge is struggling with a multi-criterion prompt, the problem is usually the prompt, not the model. Break the evaluation into separate, single-criterion Judges running in parallel. Small models handle single-criterion tasks reliably.
- Distil where volume is high. For inline evaluation on well-defined criteria, a distilled SLM sidecar trained on the larger Judge's verdicts gives you the rubric of the large model at the cost and latency of the small one.
- Re-evaluate on model refresh. Smaller models improve faster than frontier models. A Judge that required a frontier model twelve months ago may now run correctly on a mid-tier model. Review on the same cadence as the Judge's own mandate.
- Document the choice. Record the Judge model, why it was selected, and the calibration evidence that supports the selection. This belongs in the Judge mandate, not in a config file.
Where This Interacts With Principle 1 (Different)¶
The Different principle forbids sharing the same model between agent and Judge. Least Capability says pick the smallest model that works. These can conflict: the smallest Judge in the agent's own family may be cheaper than a different-family equivalent. Resolve in favour of Different. Capability coupling between agent and Judge produces shared blind spots, which is a worse failure than moderate overspend.
Known Biases to Mitigate¶
Research has identified several biases in Model-as-Judge that affect reliability:
Position Bias¶
Position Bias: The Judge may prefer the first option it sees in a pairwise comparison, regardless of content.
Mitigation: When comparing options, run evaluations with options in different orders and aggregate results.
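This mitigation can be sketched as a swap-and-agree check. `compare` is an assumed callable wrapping a single pairwise Judge call, returning "A" for the first option shown or "B" for the second.

```python
# Position-bias mitigation sketch: run the pairwise comparison in both
# orders and accept a winner only when the two verdicts agree.

def debiased_compare(compare, a: str, b: str) -> str:
    """Return 'A', 'B', or 'TIE' (disagreement across orders)."""
    first = compare(a, b)   # a shown in first position
    second = compare(b, a)  # order swapped
    # Map the swapped verdict back to the original labels.
    second_mapped = "A" if second == "B" else "B"
    if first == second_mapped:
        return first
    return "TIE"  # order-dependent verdict -> treat as tie / escalate
```

A Judge that always picks whatever appears first will produce "TIE" on every pair, which surfaces the bias instead of silently encoding it into results.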
Verbosity Bias¶
Verbosity Bias: Judges often incorrectly favor longer, more detailed answers, equating length with quality even when the extra text adds no value.
Mitigation: Explicitly instruct the Judge that length is not a quality indicator. Include examples where shorter responses are preferred.
Self-Enhancement Bias¶
Self Enhancement Bias: A judge may give a slight edge to outputs generated by itself or a similar model, like an AI version of brand loyalty.
Mitigation: Use a Judge from a different model family than your primary AI.
Aesthetic Bias¶
LLMs might favor aesthetically pleasing text, potentially overlooking the accuracy or reliability of the content.
Mitigation: Separate style evaluation from accuracy evaluation. Don't let pretty writing mask factual problems.
Ensemble Approach for High-Stakes Evaluation¶
For HIGH and CRITICAL tier systems, consider using multiple Judges:
For high-stakes evaluations, an ensemble approach called "LLM-as-a-Jury" can be used: multiple independent LLM Judges evaluate the same output and their opinions are aggregated, typically by majority vote. This technique reduces the impact of any single model's bias and increases confidence in the final decision.
Implementation¶
Aggregation rules:
- Unanimous PASS → PASS
- Any FLAG → FLAG (conservative)
- Or: Majority vote (balanced)
Trade-offs:
- Higher cost (3x API calls)
- Higher latency
- Better coverage of blind spots
- Reduced single-model bias
Use for CRITICAL tier where cost is justified by risk reduction.
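The aggregation rules above can be sketched in a few lines. The "FLAG"/"PASS" labels and the `mode` parameter are illustrative assumptions.

```python
# Jury aggregation sketch over independent Judge verdicts.

def aggregate(verdicts: list[str], mode: str = "conservative") -> str:
    """Combine jury verdicts: 'conservative' (any FLAG wins) or 'majority'."""
    if mode == "conservative":
        # A single FLAG from any juror flags the interaction.
        return "FLAG" if "FLAG" in verdicts else "PASS"
    # Majority vote; exact ties resolve conservatively to FLAG.
    flags = verdicts.count("FLAG")
    return "FLAG" if flags * 2 >= len(verdicts) else "PASS"
```

Even in majority mode, resolving ties toward FLAG keeps the ensemble aligned with Principle 2: when the jury is split, a human should look.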
Model Selection by Risk Tier¶
| Tier | Judge Approach | Rationale |
|---|---|---|
| LOW | Single small model, sampled | Low stakes, cost-sensitive |
| MEDIUM | Single capable model, higher sampling | Balance of speed and nuance |
| HIGH | Capable model, 100% evaluation | Full coverage required |
| CRITICAL | Ensemble (jury), 100% evaluation | Maximum coverage, blind spot reduction |
Practical Recommendations¶
Starting Point¶
- Choose a Judge from a different provider than your primary AI
- Start with a capable model (GPT-4 class) to establish baseline
- Configure for conservative evaluation (prefer false positives)
- Create a golden dataset of 30-50 human-labeled examples
- Calibrate Judge against golden dataset
- If volume requires, experiment with smaller/faster models for Tier 1
Ongoing Validation¶
- Regularly compare Judge decisions to human reviewer decisions
- Track false positive and false negative rates
- Recalibrate when rates drift
- Update golden dataset with new edge cases
Red Flags¶
Watch for:
- Judge passing things reviewers would flag
- Declining agreement with human judgment
- Rising complaint rates despite "passing" Judge scores
- Patterns in what the Judge misses
Summary¶
Different: Use a model from a different provider/family than your primary AI to avoid shared blind spots and self-preference bias.
Conservative: Configure the Judge to flag when uncertain. False positives are recoverable; false negatives cause harm.
Fast: Ensure throughput matches volume. Use two-tier architecture if needed: fast filter + thorough review.
Least capable: Pick the smallest Judge that can rule on the task. Frontier reasoning models are for frontier reasoning problems, not for schema checks. Do not choose a Mythos-class model when a Haiku-class one will do.
The Judge is an assurance mechanism, not a gatekeeper. Its job is to surface concerns for human review, not to make final decisions. Optimise for catching issues, not for throughput alone.
References¶
Key research informing these recommendations:
- MT-Bench and Chatbot Arena (Zheng et al., 2023) - Established Model-as-Judge methodology
- CALM Framework - Systematic bias quantification
- Multiple practitioner sources on self-enhancement and position bias
- Industry guidance from Arize, Evidently, Patronus AI, and others