Controls: Guardrails, Judge, and Human Oversight

1. Guardrails

Real-time controls that block known-bad inputs and outputs.

Input Guardrails

Control	What It Catches
Injection detection	Attempts to override system prompt
Encoding detection	Obfuscated attacks (Base64, hex, Unicode)
PII detection	Personal data in prompts
Content policy	Prohibited request types
Rate limiting	Abuse, enumeration
Length limits	Context stuffing

Processing flow:

Figure: Input Processing Flow
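As a rough illustration of how several of the input controls in the table can be chained, here is a minimal sketch. The ordering (cheapest check first), the `MAX_CHARS` limit, the regex patterns, and the names `GuardrailResult`, `decode_base64_runs`, and `check_input` are all assumptions for illustration, not part of the framework. Only Base64 is decoded here; hex and Unicode obfuscation would need their own handling.

```python
import base64
import re
from dataclasses import dataclass

MAX_CHARS = 4000  # assumed length limit, tune per deployment

# Illustrative patterns only; real deployments need far broader coverage.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior) instructions", re.I),
    re.compile(r"reveal (your|the) system prompt", re.I),
]
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{24,}={0,2}")


@dataclass
class GuardrailResult:
    allowed: bool
    reason: str = ""


def decode_base64_runs(text: str) -> list[str]:
    """Return printable UTF-8 decodings of long Base64-looking runs."""
    decoded = []
    for match in BASE64_RUN.findall(text):
        try:
            candidate = base64.b64decode(match, validate=True).decode("utf-8")
        except ValueError:  # covers binascii.Error and UnicodeDecodeError
            continue
        if candidate.isprintable():
            decoded.append(candidate)
    return decoded


def check_input(text: str) -> GuardrailResult:
    # Length limit first: reject oversized inputs before any scanning.
    if len(text) > MAX_CHARS:
        return GuardrailResult(False, "length_limit")
    # Injection detection runs on the raw text and on decoded payloads.
    for view in [text, *decode_base64_runs(text)]:
        if any(p.search(view) for p in INJECTION_PATTERNS):
            return GuardrailResult(False, "injection")
    # PII detection (email only in this sketch).
    if EMAIL_PATTERN.search(text):
        return GuardrailResult(False, "pii")
    return GuardrailResult(True)
```

Returning a reason code alongside the decision makes blocks easier to review later, for example when a human checks whether a guardrail block was legitimate.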

Output Guardrails

Control	What It Catches
Content filtering	Harmful/inappropriate content
PII detection	Personal data leakage
Grounding check	Hallucination
Format validation	Malformed responses
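Two of the output controls, format validation and PII detection, are simple enough to sketch directly. The response schema (`answer` and `sources` keys) and the function names are hypothetical, and the email regex stands in for a real PII detector. Content filtering and grounding checks usually need a model or retrieval comparison and are not shown.

```python
import json
import re
from typing import Optional

EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
REQUIRED_KEYS = {"answer", "sources"}  # hypothetical response schema


def validate_format(raw: str) -> Optional[dict]:
    """Return the parsed response if it matches the assumed schema, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
        return None
    return data


def redact_pii(text: str) -> str:
    """Replace email addresses with a placeholder before the text leaves the system."""
    return EMAIL_PATTERN.sub("[REDACTED_EMAIL]", text)
```

Whether to redact or block on a PII hit is a policy choice; redaction is shown only because it keeps the example short.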

Limitations

Guardrails catch known patterns. They miss:

- Novel techniques
- Semantic variations
- Context-dependent violations
- Subtle policy violations

This is why the Judge provides the second layer.

For practical implementation guidance - international PII detection, RAG ingestion filtering, secrets scanning, alerting design, and guardrail exception governance - see Practical Guardrails.

Semantic Firewall

The semantic firewall is part of the Reviewing Controls layer, alongside the Judge. It provides intent-level enforcement and sits between Guardrails and Model-as-Judge in the request path. Where Guardrails block known-bad patterns and the Judge evaluates unknown-bad content, the semantic firewall catches known-bad intent expressed in wording neither has seen before: a request reworded, translated, or phrased indirectly to mean the same prohibited thing.

Layer	Catches	Speed
Guardrails	Known patterns	~10ms
Semantic Firewall	Known intent, novel wording	~15-30ms
Model-as-Judge	Unknown-bad, context-dependent	10ms-5s by tier

It classifies inbound requests against a declared taxonomy of authorised and prohibited topics/intents using an embedding classifier or distilled intent model - not a general-purpose LLM call. This makes it cheap enough to run on every request, narrowing what actually needs to reach the Judge.

It does not replace the Judge. It routes: pass straight through, escalate to Judge evaluation, or reject outright for high-confidence matches against the prohibited taxonomy.
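The three-way routing can be sketched as a pair of thresholds over a classifier score. The classifier itself is passed in as a callable, since the choice of embedding classifier or distilled intent model is a build decision. The threshold values and the names `Route` and `route_request` are assumptions for illustration.

```python
from enum import Enum
from typing import Callable


class Route(Enum):
    PASS = "pass"
    ESCALATE = "escalate_to_judge"
    REJECT = "reject"


def route_request(
    text: str,
    prohibited_score: Callable[[str], float],  # stand-in for the intent classifier
    reject_at: float = 0.9,    # assumed high-confidence threshold
    escalate_at: float = 0.5,  # assumed uncertainty threshold
) -> Route:
    """Map a prohibited-intent score in [0, 1] to one of three routes."""
    score = prohibited_score(text)
    if score >= reject_at:
        return Route.REJECT
    if score >= escalate_at:
        return Route.ESCALATE
    return Route.PASS
```

Keeping the classifier behind a plain callable means the thresholds can be tuned, and the model swapped, without touching the routing logic.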

→ Full detail, build options, and limitations: Semantic Firewall

2. Model-as-Judge

Evaluation of interactions for quality and policy compliance. The Judge can be a large LLM (for async assurance and complex reasoning) or a distilled SLM (for inline, real-time action screening). Both approaches can be combined: an SLM screens every action in under 50ms, while a large LLM audits a sample asynchronously.

→ For model selection guidance, see Model-as-Judge Selection
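For the asynchronous audit sample, one option is to pick interactions deterministically by hashing their ID, so the same interaction is always in or out of the sample regardless of which worker sees it. This is a sketch under that assumption; the function name and the use of SHA-256 are illustrative choices, not something the framework prescribes.

```python
import hashlib


def in_audit_sample(interaction_id: str, rate: float) -> bool:
    """Return True for roughly `rate` of all interaction IDs, deterministically."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(interaction_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate
```

A hash-based sample is reproducible, which helps when re-running an audit. Random sampling works too if reproducibility is not needed.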

What the Judge Does

Function	Description
Policy compliance	Did the AI follow guidelines?
Quality assessment	Accurate, helpful, appropriate?
Anomaly detection	Unusual patterns?
Risk flagging	What needs human review?

What the Judge Does NOT Do

- Block transactions in real-time
- Make final decisions
- Replace human judgment

The Judge surfaces findings. Humans decide actions.

Architecture

Figure: Judge Architecture - Simple and Two-Tier

Evaluation Criteria

Criterion	Scoring
Policy adherence	Pass / Minor / Major violation
Accuracy	Verified / Unverified / Incorrect
Appropriateness	Appropriate / Borderline / Inappropriate
Safety	Safe / Uncertain / Concerning

Output: PASS / REVIEW / ESCALATE
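The text does not say how the four criterion ratings combine into the single output, so the rule below is an assumption: the worst rating on any criterion decides the verdict. The criterion keys and the `verdict` function are hypothetical names.

```python
# Each scale is ordered from best to worst, matching the table above.
SCALES = {
    "policy_adherence": ["Pass", "Minor", "Major"],
    "accuracy": ["Verified", "Unverified", "Incorrect"],
    "appropriateness": ["Appropriate", "Borderline", "Inappropriate"],
    "safety": ["Safe", "Uncertain", "Concerning"],
}
VERDICTS = ["PASS", "REVIEW", "ESCALATE"]


def verdict(ratings: dict[str, str]) -> str:
    """Assumed rule: the worst rating across criteria sets the overall output."""
    missing = SCALES.keys() - ratings.keys()
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    # list.index raises ValueError for a label that is not on the scale.
    worst = max(SCALES[name].index(ratings[name]) for name in SCALES)
    return VERDICTS[worst]
```

A worst-case rule is conservative. A weighted scheme is an alternative if some criteria matter more than others in a given deployment.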

Deployment Phases

Phase	Action on Findings
Shadow	Log only, measure accuracy
Advisory	Surface to humans, learn from feedback
Operational	Findings drive workflows

Start in shadow mode. Validate accuracy before acting.

Accuracy

The Judge will make mistakes.

Error	Impact	Mitigation
False positive	Unnecessary review	Tune prompts
False negative	Missed violations	Human sampling

Target: >90% agreement with human reviewers.
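Agreement can be measured during shadow mode by comparing Judge flags with human decisions on the same interactions. The sketch below treats the human label as ground truth, which is an assumption (humans disagree too), and uses a hypothetical function name.

```python
def judge_agreement(judge_flags: list[bool], human_flags: list[bool]) -> dict[str, float]:
    """Compare Judge flags to human flags on the same interactions."""
    if len(judge_flags) != len(human_flags) or not judge_flags:
        raise ValueError("need two non-empty lists of equal length")
    pairs = list(zip(judge_flags, human_flags))
    agree = sum(j == h for j, h in pairs)
    false_pos = sum(j and not h for j, h in pairs)  # Judge flagged, human did not
    false_neg = sum(h and not j for j, h in pairs)  # human flagged, Judge did not
    human_neg = sum(not h for _, h in pairs)
    human_pos = len(pairs) - human_neg
    return {
        "agreement": agree / len(pairs),
        "false_positive_rate": false_pos / human_neg if human_neg else 0.0,
        "false_negative_rate": false_neg / human_pos if human_pos else 0.0,
    }
```

Raw agreement can look high when violations are rare, because agreeing on the many clean interactions dominates. Tracking the false negative rate separately keeps missed violations visible.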

3. Human Oversight (HITL)

Humans review findings, make decisions, and remain accountable.

Figure: HITL Architecture

Triggers

Trigger	Response
Judge flag	Review interaction
Guardrail block	Review if legitimate
User escalation	Human takes over
Sampling	Quality assurance
Threshold breach	Investigate pattern

Queue Design

Queue	SLA	Reviewer
Critical	1h	Senior + expert
High	4h	Domain expert
Standard	24h	Trained reviewer
Sampling	72h	QA team
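The SLAs in the table translate directly into review deadlines. A minimal sketch, assuming lowercase queue keys and timezone-aware timestamps; the function names are hypothetical.

```python
from datetime import datetime, timedelta

# SLA per queue in hours, taken from the table above.
SLA_HOURS = {"critical": 1, "high": 4, "standard": 24, "sampling": 72}


def review_deadline(queue: str, enqueued_at: datetime) -> datetime:
    """Latest time by which an item in the given queue should be reviewed."""
    return enqueued_at + timedelta(hours=SLA_HOURS[queue])


def is_breached(queue: str, enqueued_at: datetime, now: datetime) -> bool:
    """True once the item has waited longer than its queue's SLA."""
    return now > review_deadline(queue, enqueued_at)
```

Breach counts per queue are one natural input to the "Threshold breach" trigger listed earlier.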

Actions

Action	When
Approve	Interaction appropriate
Correct	Minor issue, fixable
Escalate	Needs senior review
Block user	Abuse detected
Tune	False positive

Prevent Rubber-Stamping

Control	Purpose
Canary cases	Verify reviewers catch known-bad
Time tracking	Flag too-fast reviews
Volume limits	Prevent fatigue
Inter-rater checks	Measure consistency
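Two of these controls, canary cases and time tracking, reduce to simple per-reviewer metrics. The sketch below assumes a hypothetical `Review` record and a `MIN_SECONDS` threshold chosen for illustration; a sensible threshold depends on how long a genuine review takes in your queues.

```python
from dataclasses import dataclass
from typing import Optional

MIN_SECONDS = 10.0  # hypothetical floor for a plausible review time


@dataclass
class Review:
    reviewer: str
    seconds_spent: float
    approved: bool
    is_canary: bool = False  # known-bad case seeded into the queue


def reviewer_report(reviews: list[Review]) -> dict[str, dict[str, Optional[float]]]:
    """Per reviewer: share of canaries rejected and share of too-fast reviews."""
    report: dict[str, dict[str, Optional[float]]] = {}
    for reviewer in sorted({r.reviewer for r in reviews}):
        mine = [r for r in reviews if r.reviewer == reviewer]
        canaries = [r for r in mine if r.is_canary]
        caught = sum(not r.approved for r in canaries)  # a canary should never be approved
        too_fast = sum(r.seconds_spent < MIN_SECONDS for r in mine)
        report[reviewer] = {
            "canary_catch_rate": caught / len(canaries) if canaries else None,
            "too_fast_share": too_fast / len(mine),
        }
    return report
```

A low canary catch rate combined with a high share of too-fast reviews is a stronger rubber-stamping signal than either metric alone.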

Going Deeper

Topic	Document
What these controls cost in production	Cost & Latency - latency budgets, sampling strategies, tiered evaluation cascade
Judge accuracy, drift, and adversarial failure	Judge Assurance · When the Judge Can Be Fooled
Semantic firewall detail, build options, and limitations	Semantic Firewall - intent-level boundary enforcement between Guardrails and the Judge
Practical guardrail configurations	Practical Guardrails - what to turn on first, encoding detection, international PII
When HITL doesn't scale	Humans in the Business Process - using existing business process checkpoints as a detection layer
Controls for multi-agent systems	MASO Framework - controls across 11 domains for agent orchestration
Controls for reasoning models (o1, etc.)	Reasoning Model Controls - trace scanning, instruction adherence, consistency checks
Session-level and pre-action evaluation	Output Evaluator - session-aware, pre-action evaluation architecture for agentic systems

Implementation Order

1. Logging - Can't evaluate what you don't capture
2. Basic guardrails - Block obvious attacks
3. Judge in shadow - Evaluate without action
4. HITL queues - Somewhere for findings
5. Judge advisory - Surface to humans
6. Enhanced guardrails - Add ML detection
7. Judge operational - Drive workflows
8. Continuous tuning - Improve from findings