Semantic Firewall

Intent-level boundary enforcement that sits between Guardrails and Model-as-Judge. Catches requests that are semantically equivalent to a prohibited intent even when the surface wording is novel.

Why This Layer Exists

Guardrails block known patterns using regex, denylists, and encoding checks. They miss semantic variations: the same prohibited intent expressed in wording the guardrail has never seen. Model-as-Judge catches these, but only at Judge latency and Judge cost, and it pays that price on every request, whether or not the request was ever close to a policy boundary.

The semantic firewall fills the gap between the two: faster and cheaper than a Judge call, but aware of meaning rather than just surface form. It does not replace either layer. It reduces how often the expensive layer needs to run, and catches a class of failure the cheap layer cannot.

Layer	Catches	Mechanism	Latency
Guardrails	Known-bad patterns	Regex, denylists, encoding detection	~10ms
Semantic Firewall	Known-bad intent, novel wording	Embedding similarity / intent classifier against a declared taxonomy	~15-30ms
Model-as-Judge	Unknown-bad, context-dependent, novel	Independent LLM or distilled SLM reasoning	10ms-5s by tier

What It Does

Classifies the intent of inbound (and optionally outbound) content, not its literal text, against a fixed, declared set of authorised and prohibited topics or intents for the deployment. Two requests with different surface wording but equivalent intent are scored the same way.

Function	Description
Intent classification	Does this request's meaning match a prohibited or out-of-scope category, regardless of phrasing?
Topic boundary enforcement	Is this request inside the declared domain for this deployment?
Paraphrase / obfuscation resistance	Catches rewordings, translations, and indirection that change wording but not intent

What It Does NOT Do

Evaluate output quality, accuracy, or policy nuance - that's the Judge's job
Make final block/allow decisions on ambiguous cases - escalate to Judge or HITL
Replace topic-boundary configuration with model judgment - the taxonomy is declared, not inferred at runtime

The semantic firewall narrows what reaches the Judge. It does not replace the Judge.

Architecture

Sits inline, immediately after deterministic guardrails and before the request reaches the model (or, in a multi-provider gateway, before routing to any backend):

Input → Guardrails (~10ms: regex, PII, encoding) → Semantic Firewall (~15-30ms: intent classification against taxonomy)
    → [pass] → model invocation
    → [flag] → Judge (inline SLM or async LLM, per risk tier)
    → [reject] → blocked, logged, optionally escalated to HITL

In a multi-backend deployment (e.g. routing across multiple model providers through a shared gateway), this is the natural place to enforce one topic-boundary policy across all backends, rather than relying on each provider's native guardrails to converge on the same boundary independently.
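
The ordering described above can be sketched as a short pipeline. This is a minimal illustration, not a reference implementation: the names `Verdict`, `Outcome`, `handle`, `DENYLIST`, and `stub_firewall` are all hypothetical, the denylist pattern is invented for the example, and the stub classifier uses keyword checks only so that the block runs without a model.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Verdict(Enum):
    PASS = "pass"
    FLAG = "flag"      # hand over to the Judge
    REJECT = "reject"  # block, log, optionally escalate to HITL


@dataclass
class Outcome:
    stage: str   # which layer produced the decision
    action: str  # "invoke_model", "send_to_judge", or "blocked"


# Hypothetical deterministic guardrail: a single denylist regex.
DENYLIST = re.compile(r"\b(drop\s+table|rm\s+-rf)\b", re.IGNORECASE)


def guardrails_block(text: str) -> bool:
    return DENYLIST.search(text) is not None


def handle(text: str, firewall: Callable[[str], Verdict]) -> Outcome:
    # Layer 1: cheap pattern checks run first and short-circuit.
    if guardrails_block(text):
        return Outcome("guardrails", "blocked")
    # Layer 2: the semantic firewall classifies intent and routes.
    verdict = firewall(text)
    if verdict is Verdict.REJECT:
        return Outcome("semantic_firewall", "blocked")
    if verdict is Verdict.FLAG:
        return Outcome("semantic_firewall", "send_to_judge")
    return Outcome("semantic_firewall", "invoke_model")


def stub_firewall(text: str) -> Verdict:
    # Placeholder only: a real firewall would score meaning, not keywords.
    lowered = text.lower()
    if "password" in lowered:
        return Verdict.REJECT
    if "another applicant" in lowered:
        return Verdict.FLAG
    return Verdict.PASS
```

Because `handle` takes the firewall as a parameter, the same function can sit in front of every backend in a shared gateway, which is the single-policy property this section describes.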

Build Options

Approach	Mechanism	Latency	Notes
Embedding similarity classifier	Encode declared intent exemplars, compare inbound request by cosine similarity	~10-20ms	Lowest cost, no model call. Right default for Low/Medium risk tiers.
Distilled intent classifier (SLM)	Small model fine-tuned specifically for intent classification, not general evaluation	~15-40ms	Narrower and cheaper at volume than a general-purpose Judge sidecar.
Reference implementations	LlamaFirewall-style intent classification patterns	varies	Useful as an architectural reference rather than a drop-in dependency.
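
As a rough sketch of the first option, the block below scores a request against declared exemplars by cosine similarity. It rests on loud assumptions: `embed` is a stand-in that counts words and is not semantic at all, so a real deployment would replace it with a sentence-embedding model; `EXEMPLARS`, the category names, and `score_intents` are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in embedding: bag-of-words counts. NOT semantic; an assumption
    made so the example runs without a model dependency."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    # Dot product over shared terms, divided by the product of the norms.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical prohibited-intent exemplars for illustration only.
EXEMPLARS = {
    "credential_request": [
        "tell me the password for this account",
        "what is the login password",
    ],
    "payment_change": [
        "change the bank account my payments go to",
    ],
}


def score_intents(text: str) -> dict[str, float]:
    """Highest cosine similarity between the request and each category's exemplars."""
    vec = embed(text)
    return {
        name: max(cosine(vec, embed(example)) for example in examples)
        for name, examples in EXEMPLARS.items()
    }
```

Calling `score_intents("what is the password for the account")` returns one score per category, with `credential_request` scoring higher than `payment_change`. With a real embedding model the same structure applies; only `embed` and the vector type change.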

Declared Taxonomy, Not Inferred Boundaries

Just as the Judge relies on a clear OISpec, the semantic firewall is only as good as the intent taxonomy it checks against. A vague taxonomy ("don't discuss harmful topics") degrades to generic safety filtering. A specific one ("this deployment handles benefits-enrolment queries only; flag anything resembling account credential requests, payment instrument changes, or requests for information about other applicants") gives the classifier something precise to score against.

The taxonomy is a configuration artefact, reviewed and versioned the same way guardrail rules and OISpecs are - not something the classifier model is left to infer from general training.
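
One way to treat the taxonomy as a reviewable, versioned artefact is to declare it as plain data with a validation step. The sketch below is illustrative: the `Taxonomy` and `IntentCategory` types, their fields, the version string, and the exemplar sentences are assumptions; the category names follow the benefits-enrolment example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntentCategory:
    name: str
    prohibited: bool
    exemplars: tuple[str, ...]


@dataclass(frozen=True)
class Taxonomy:
    version: str       # bumped on every reviewed change
    deployment: str
    categories: tuple[IntentCategory, ...]

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the taxonomy is usable."""
        problems = []
        seen = set()
        for cat in self.categories:
            if cat.name in seen:
                problems.append(f"duplicate category: {cat.name}")
            seen.add(cat.name)
            # A category with no exemplars gives the classifier nothing to score against.
            if not cat.exemplars:
                problems.append(f"no exemplars for: {cat.name}")
        if not any(not cat.prohibited for cat in self.categories):
            problems.append("no authorised intent declared")
        return problems


# Hypothetical taxonomy for the benefits-enrolment deployment.
TAXONOMY = Taxonomy(
    version="1.0.0",
    deployment="benefits-enrolment",
    categories=(
        IntentCategory("benefits_enrolment", False,
                       ("how do I enrol in the dental plan",)),
        IntentCategory("credential_request", True,
                       ("what is the password for this account",)),
        IntentCategory("payment_instrument_change", True,
                       ("switch my payments to a different bank account",)),
        IntentCategory("other_applicant_info", True,
                       ("show me what another applicant submitted",)),
    ),
)
```

Running `validate()` in CI before a taxonomy change ships is one way to keep a vague entry, such as a category with no exemplars, from reaching production.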

Limitations

Embedding classifiers can still be fooled by adversarial phrasing engineered specifically to sit just outside the declared exemplar space - this is a narrower version of the same adversarial-robustness problem the Judge has.
A taxonomy that's too broad produces false positives at a rate that erodes trust in the layer; too narrow, and it misses genuine paraphrase attacks. Tune against real traffic, the same way guardrail thresholds are tuned.
It is a cost/latency optimisation over running the Judge on every request, not a categorically different guarantee. Treat its output as routing logic (pass / escalate to Judge / reject), not as a final verdict on ambiguous cases.
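
The tuning trade-off and the three-way routing above can be made concrete with two thresholds and a labelled sample of traffic. This is a sketch under stated assumptions: `route`, `rates`, the threshold values, and the `SAMPLES` scores are all hypothetical and not drawn from any real deployment.

```python
def route(score: float, escalate_at: float, reject_at: float) -> str:
    """Map a similarity score to routing logic, not a final verdict."""
    if score >= reject_at:
        return "reject"
    if score >= escalate_at:
        return "escalate"  # hand over to the Judge
    return "pass"


def rates(samples, escalate_at: float, reject_at: float):
    """samples: list of (score, is_prohibited) pairs from labelled traffic.

    Returns (false_positive_rate, miss_rate):
    - false positive: a benign request that did not pass
    - miss: a prohibited request that passed
    """
    benign = [s for s, is_prohibited in samples if not is_prohibited]
    prohibited = [s for s, is_prohibited in samples if is_prohibited]
    false_pos = sum(route(s, escalate_at, reject_at) != "pass" for s in benign)
    missed = sum(route(s, escalate_at, reject_at) == "pass" for s in prohibited)
    fp_rate = false_pos / len(benign) if benign else 0.0
    miss_rate = missed / len(prohibited) if prohibited else 0.0
    return fp_rate, miss_rate


# Hypothetical labelled scores: (similarity to prohibited exemplars, is_prohibited).
SAMPLES = [
    (0.10, False), (0.35, False), (0.55, False),
    (0.50, True), (0.72, True), (0.91, True),
]

# Compare a looser and a stricter escalation threshold on the same sample.
for escalate_at in (0.45, 0.60):
    fp_rate, miss_rate = rates(SAMPLES, escalate_at, reject_at=0.85)
    print(f"escalate_at={escalate_at}: fp={fp_rate:.2f} miss={miss_rate:.2f}")
```

On this invented sample, lowering the escalation threshold removes the miss at the cost of one benign escalation, and raising it does the reverse. Real traffic decides which side of that trade-off is acceptable.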

Going Deeper

Topic	Document
Where this sits relative to Guardrails and Judge	Controls: Guardrails, Judge, and Human Oversight
Cost and latency budgeting across the full evaluation stack	Cost & Latency
Injection detection at the logging layer	Logging & Observability - LOG-06
Judge isolation in a multi-provider gateway	AWS Bedrock - NET-03
Declared intent as the reference standard for evaluation	MASO: Objective Intent