Semantic Firewall

Intent-level boundary enforcement that sits between Guardrails and Model-as-Judge. Catches requests that are semantically equivalent to a prohibited intent even when the surface wording is novel.

Why This Layer Exists

Guardrails block known patterns: regex, denylists, encoding checks. They miss semantic variations - the same prohibited intent expressed in wording the guardrail has never seen. Model-as-Judge catches these, but only by paying Judge latency and Judge cost on every request, whether or not that request was ever close to a policy boundary.

The semantic firewall fills the gap between the two: faster and cheaper than a Judge call, but aware of meaning rather than just surface form. It does not replace either layer. It reduces how often the expensive layer needs to run, and catches a class of failure the cheap layer cannot.

Layer | Catches | Mechanism | Speed
Guardrails | Known-bad patterns | Regex, denylists, encoding detection | ~10ms
Semantic Firewall | Known-bad intent, novel wording | Embedding similarity / intent classifier against a declared taxonomy | ~15-30ms
Model-as-Judge | Unknown-bad, context-dependent, novel | Independent LLM or distilled SLM reasoning | 10ms-5s by tier
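
One way to see the saving is a back-of-envelope latency estimate. The sketch below assumes a hypothetical 1000 ms Judge call, a 25 ms firewall, and a 5% escalation rate. These figures are illustrative picks, not measurements, and `mean_eval_latency_ms` is a name introduced here for illustration. Guardrail latency is left out because it is paid in both cases.

```python
def mean_eval_latency_ms(firewall_ms: float, judge_ms: float,
                         escalation_rate: float) -> float:
    """Mean evaluation latency when only a fraction of requests reach the Judge.

    Every request pays the firewall cost; only escalated requests pay the
    Judge cost on top of it.
    """
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be between 0 and 1")
    return firewall_ms + escalation_rate * judge_ms


# Hypothetical numbers, chosen only to illustrate the arithmetic.
judge_on_every_request = mean_eval_latency_ms(
    firewall_ms=0.0, judge_ms=1000.0, escalation_rate=1.0)
with_semantic_firewall = mean_eval_latency_ms(
    firewall_ms=25.0, judge_ms=1000.0, escalation_rate=0.05)
```

Under those assumed numbers the mean evaluation latency falls from about 1000 ms to about 75 ms. The real figure depends entirely on the escalation rate measured on production traffic.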

What It Does

Classifies the intent of inbound (and optionally outbound) content, rather than its literal text, against a fixed, declared set of authorised and prohibited topics or intents for the deployment. Two requests with different surface wording and equivalent intent are scored the same way.

Function | Description
Intent classification | Does this request's meaning match a prohibited or out-of-scope category, regardless of phrasing?
Topic boundary enforcement | Is this request inside the declared domain for this deployment?
Paraphrase / obfuscation resistance | Catches rewordings, translations, and indirection that change wording but not intent

What It Does NOT Do

  • Evaluate output quality, accuracy, or policy nuance - that's the Judge's job
  • Make final block/allow decisions on ambiguous cases - escalate to Judge or HITL
  • Replace topic-boundary configuration with model judgment - the taxonomy is declared, not inferred at runtime

The semantic firewall narrows what reaches the Judge. It does not replace the Judge.

Architecture

Sits inline, immediately after deterministic guardrails and before the request reaches the model (or, in a multi-provider gateway, before routing to any backend):

Input
  → Guardrails (~10ms: regex, PII, encoding)
  → Semantic Firewall (~15-30ms: intent classification against taxonomy)
      → [pass]   → model invocation
      → [flag]   → Judge (inline SLM or async LLM, per risk tier)
      → [reject] → blocked, logged, optionally escalated to HITL

In a multi-backend deployment (e.g. routing across multiple model providers through a shared gateway), this is the natural place to enforce one topic-boundary policy across all backends, rather than relying on each provider's native guardrails to converge on the same boundary independently.
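
The inline flow can be sketched as a single gateway function. This is a minimal illustration, not a reference implementation: `handle_request`, `Verdict`, and the four callables are hypothetical names, and the sketch assumes an inline Judge that gates model invocation (an async Judge would run alongside the model call instead).

```python
import logging
from enum import Enum
from typing import Callable

log = logging.getLogger("semantic_firewall")


class Verdict(Enum):
    PASS = "pass"
    FLAG = "flag"
    REJECT = "reject"


def handle_request(
    text: str,
    guardrails_ok: Callable[[str], bool],   # deterministic checks
    firewall: Callable[[str], Verdict],     # intent classification
    judge_allows: Callable[[str], bool],    # assumed inline Judge
    invoke_model: Callable[[str], str],     # any backend behind the gateway
) -> str:
    """Run one request through guardrails, firewall, and (if flagged) Judge."""
    if not guardrails_ok(text):
        log.warning("blocked by guardrails")
        return "blocked: guardrails"

    verdict = firewall(text)
    if verdict is Verdict.REJECT:
        log.warning("blocked by semantic firewall")
        return "blocked: semantic firewall"

    # Only flagged requests pay for a Judge call.
    if verdict is Verdict.FLAG and not judge_allows(text):
        log.warning("blocked by judge after firewall flag")
        return "blocked: judge"

    return invoke_model(text)
```

Because `invoke_model` is just a callable, the same function can front every backend in a multi-provider gateway, which is what keeps the topic boundary identical across providers.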

Build Options

Approach | Mechanism | Latency | Notes
Embedding similarity classifier | Encode declared intent exemplars, compare inbound request by cosine similarity | ~10-20ms | Lowest cost, no model call. Right default for Low/Medium risk tiers.
Distilled intent classifier (SLM) | Small model fine-tuned specifically for intent classification, not general evaluation | ~15-40ms | Narrower and cheaper at volume than a general-purpose Judge sidecar.
Reference implementations | LlamaFirewall-style intent classification patterns | varies | Useful as an architectural reference rather than a drop-in dependency.
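
A minimal sketch of the embedding similarity approach follows. It is an illustration under a loud assumption: `toy_embed` is a bag-of-words stand-in so the example runs with the standard library alone, and a real deployment would swap in a sentence-embedding model, since word counts do not capture paraphrase. The intent names and exemplars are hypothetical.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """ASSUMPTION: stand-in for a real embedding model (word counts only)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def build_index(exemplars: dict) -> list:
    """Embed every declared exemplar once, ahead of request time."""
    return [(intent, toy_embed(example))
            for intent, examples in exemplars.items()
            for example in examples]


def best_match(text: str, index: list) -> tuple:
    """Return (score, intent) for the most similar declared exemplar."""
    query = toy_embed(text)
    return max((cosine(query, vec), intent) for intent, vec in index)


# Hypothetical prohibited intents with one exemplar each.
index = build_index({
    "credential_request": ["what is the password for this account"],
    "payment_change": ["change the bank account for my payments"],
})
score, intent = best_match("tell me the password for this account", index)
```

Here the request shares most of its words with the `credential_request` exemplar, so that intent scores highest. With a real embedding model the same comparison would also score rewordings that share no words with the exemplar.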

Declared Taxonomy, Not Inferred Boundaries

Just as the Judge relies on a clear OISpec, the semantic firewall is only as good as the intent taxonomy it checks against. A vague taxonomy ("don't discuss harmful topics") degrades to generic safety filtering. A specific one ("this deployment handles benefits-enrolment queries only; flag anything resembling account credential requests, payment instrument changes, or requests for information about other applicants") gives the classifier something precise to score against.

The taxonomy is a configuration artefact, reviewed and versioned the same way guardrail rules and OISpecs are - not something the classifier model is left to infer from general training.
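
As a sketch of what a declared taxonomy might look like as a reviewable artefact, the example below encodes the benefits-enrolment case as frozen dataclasses with a version string and a validation step. The class names, field names, action labels, version value, and exemplar sentences are all hypothetical; nothing here is a prescribed schema.

```python
from dataclasses import dataclass

ALLOWED_ACTIONS = {"allow", "flag", "reject"}  # assumed action labels


@dataclass(frozen=True)
class IntentCategory:
    name: str
    action: str        # one of ALLOWED_ACTIONS
    exemplars: tuple   # example phrasings the classifier scores against


@dataclass(frozen=True)
class Taxonomy:
    version: str       # bumped on every reviewed change
    deployment: str
    categories: tuple

    def validate(self) -> None:
        """Reject taxonomies a reviewer should never approve."""
        names = [c.name for c in self.categories]
        if len(names) != len(set(names)):
            raise ValueError("duplicate category name")
        for c in self.categories:
            if c.action not in ALLOWED_ACTIONS:
                raise ValueError(f"unknown action {c.action!r} in {c.name}")
            if not c.exemplars:
                raise ValueError(f"category {c.name} has no exemplars")


TAXONOMY = Taxonomy(
    version="1.0.0",
    deployment="benefits-enrolment",
    categories=(
        IntentCategory("benefits_enrolment_query", "allow",
                       ("how do I enrol in the dental plan",)),
        IntentCategory("account_credential_request", "flag",
                       ("send me the login details for this account",)),
        IntentCategory("payment_instrument_change", "flag",
                       ("switch my payout to a different bank account",)),
        IntentCategory("other_applicant_information", "flag",
                       ("what did my neighbour put on their application",)),
    ),
)
TAXONOMY.validate()
```

Keeping the taxonomy as data means a change to the boundary shows up as a diff in review, and `validate` can run in CI before a new version is deployed.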

Limitations

  • Embedding classifiers can still be fooled by adversarial phrasing engineered specifically to sit just outside the declared exemplar space - this is a narrower version of the same adversarial-robustness problem the Judge has.
  • A taxonomy that's too broad produces false positives at a rate that erodes trust in the layer; too narrow, and it misses genuine paraphrase attacks. Tune against real traffic, the same way guardrail thresholds are tuned.
  • It is a cost/latency optimisation over running the Judge on every request, not a categorically different guarantee. Treat its output as routing logic (pass / escalate to Judge / reject), not as a final verdict on ambiguous cases.
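
The tuning and routing points above can be sketched with two thresholds and a small labelled sample. This is a minimal illustration: the function names, threshold values, and sample scores are hypothetical, and in practice the sample would be drawn from real, labelled traffic.

```python
def route(score: float, escalate_at: float, reject_at: float) -> str:
    """Map a similarity score to routing logic, not a final verdict."""
    if score >= reject_at:
        return "reject"
    if score >= escalate_at:
        return "escalate"  # hand the ambiguous case to the Judge
    return "pass"


def false_positive_rate(samples: list, escalate_at: float) -> float:
    """Share of benign samples that would be escalated or rejected."""
    benign = [s for s, prohibited in samples if not prohibited]
    if not benign:
        return 0.0
    return sum(s >= escalate_at for s in benign) / len(benign)


def miss_rate(samples: list, escalate_at: float) -> float:
    """Share of prohibited samples that would pass without review."""
    bad = [s for s, prohibited in samples if prohibited]
    if not bad:
        return 0.0
    return sum(s < escalate_at for s in bad) / len(bad)


# Hypothetical labelled traffic: (firewall score, is_prohibited).
samples = [(0.2, False), (0.55, False), (0.3, False),
           (0.6, True), (0.9, True), (0.4, True)]

for threshold in (0.35, 0.5, 0.65):
    print(threshold,
          round(false_positive_rate(samples, threshold), 2),
          round(miss_rate(samples, threshold), 2))
```

Sweeping the escalation threshold makes the trade-off visible: lowering it reduces missed paraphrases at the price of more benign requests sent to the Judge, and raising it does the reverse.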

Going Deeper

Topic | Document
Where this sits relative to Guardrails and Judge | Controls: Guardrails, Judge, and Human Oversight
Cost and latency budgeting across the full evaluation stack | Cost & Latency
Injection detection at the logging layer | Logging & Observability - LOG-06
Judge isolation in a multi-provider gateway | AWS Bedrock - NET-03
Declared intent as the reference standard for evaluation | MASO: Objective Intent