The framework

Three layers that act, one that contains.

Three independent control layers, each doing one job, with a circuit breaker behind them to contain what they can't hold. If one layer fails, the others still hold. Each layer starts in detect-only mode, watching and logging without blocking, and graduates to enforcing once you trust what it catches.
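A minimal sketch of the detect-only to enforcing progression described above. The `Control` and `Mode` names and the sample rule are assumptions made for illustration, not part of AIRS.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class Mode(Enum):
    DETECT_ONLY = "detect_only"  # log violations, never block
    ENFORCE = "enforce"          # log violations and block


@dataclass
class Control:
    name: str
    check: Callable[[str], bool]  # returns True when the text violates policy
    mode: Mode = Mode.DETECT_ONLY
    log: List[str] = field(default_factory=list)

    def allows(self, text: str) -> bool:
        """Return True if the text may proceed past this control."""
        if not self.check(text):
            return True
        self.log.append(f"{self.name}: violation ({self.mode.value})")
        # In detect-only mode the violation is recorded but not blocked.
        return self.mode is Mode.DETECT_ONLY


# Hypothetical rule, for illustration only.
no_secrets = Control("no_secrets", lambda t: "BEGIN PRIVATE KEY" in t)

shadow_result = no_secrets.allows("-----BEGIN PRIVATE KEY-----")    # logged, allowed
no_secrets.mode = Mode.ENFORCE                                      # graduate once trusted
enforced_result = no_secrets.allows("-----BEGIN PRIVATE KEY-----")  # logged, blocked
```

The log collected in detect-only mode is what tells you whether a control is ready to enforce.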

The pattern

Layered controls, end to end.

A request passes through prevention, review, and human judgement on its way to the user. Each layer is independent, so a gap in one does not become a hole in all.

AIRS reference architecture: input guardrails, the AI model, output guardrails, a reviewing layer, and human governance around them. Guardrails prevent, reviewing controls detect, humans decide, circuit breakers contain. The same shape holds whether you run one model or many.

Layer 01

~10ms

Guardrails

Fast, fixed boundaries: content policies, scope limits, tool permissions. These are the locked doors of the system: cheap, reliable, and deterministic, so there is nothing to talk your way past.

How guardrails are built →
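As a rough sketch of what a fixed boundary might look like in code, assuming a hypothetical `Guardrail` type with an allowlist of tools, an input size limit, and a banned-term list (all values below are made up):

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Guardrail:
    allowed_tools: FrozenSet[str]  # tool permissions
    max_input_chars: int           # scope limit
    banned_terms: FrozenSet[str]   # content policy, lowercase

    def permits(self, tool: str, user_input: str) -> bool:
        """Fixed checks only: no model is consulted, so the result is deterministic."""
        if tool not in self.allowed_tools:
            return False
        if len(user_input) > self.max_input_chars:
            return False
        lowered = user_input.lower()
        return not any(term in lowered for term in self.banned_terms)


# Hypothetical configuration, for illustration only.
guard = Guardrail(
    allowed_tools=frozenset({"search_docs", "get_order_status"}),
    max_input_chars=2000,
    banned_terms=frozenset({"drop table"}),
)
```

A request for a tool outside the allowlist is refused however it is phrased, which is the property the locked-door comparison points at.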

Layer 02

5ms–5s

Reviewing controls

A second opinion before anything reaches the user. Three controls of increasing depth, each catching what the one before it waves through:

Deterministic scanners. Pattern and signature checks on the response: secrets, PII, banned strings. Cheap, exact, and run on everything.
Semantic firewall. Catches known-bad intent expressed in wording that neither the scanners nor the model-as-judge has seen before: a prohibited request reworded, translated, or made indirect so that it still means the same thing. Inside the semantic firewall →
Contextual review (model-as-judge). A model weighing the response against policy, context, and declared intent, to surface the unknown-bad a fixed rule can't name. It is itself a model, so it is probabilistic and can be fooled: it informs the decision rather than making the final call, and never stands in for the deterministic guardrails beneath it. When the judge can be fooled →
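The three reviewing controls could be chained roughly as follows. This is a sketch under stated assumptions: the two regex patterns are illustrative only, the semantic firewall and the judge are passed in as stub callables, and the choice to block on a firewall match but escalate on a judge flag is one possible reading of "informs the decision rather than making the final call".

```python
import re
from dataclasses import dataclass
from typing import Callable, List

# Illustrative patterns only; real scanners need far broader coverage.
SCANNER_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "email_address": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


@dataclass
class Review:
    action: str         # "pass", "block", or "escalate"
    reasons: List[str]


def review_response(
    text: str,
    semantic_match: Callable[[str], bool],  # stub for the semantic firewall
    judge_flags: Callable[[str], bool],     # stub for the model-as-judge
) -> Review:
    # 1. Deterministic scanners: exact, cheap, run on everything.
    hits = [name for name, rx in SCANNER_PATTERNS.items() if rx.search(text)]
    if hits:
        return Review("block", hits)
    # 2. Semantic firewall: known-bad intent in unfamiliar wording.
    if semantic_match(text):
        return Review("block", ["semantic_firewall"])
    # 3. Model-as-judge: probabilistic, so it escalates to a human
    #    instead of blocking on its own authority.
    if judge_flags(text):
        return Review("escalate", ["judge"])
    return Review("pass", [])
```

Ordering the checks from cheapest to most expensive means the slower controls only run on responses the faster ones let through.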

Layer 03

as needed

Human oversight

Escalation paths, audit trails, and a real person on the high-stakes decisions. The scope of oversight scales with the consequence: the more that rides on a call, the closer a human watches.

Making oversight work →
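One way to picture oversight that scales with consequence is a small routing table. The tiers, the oversight modes, and the `route` helper below are hypothetical; the right thresholds depend on your own risk appetite.

```python
from enum import IntEnum
from typing import Dict, List


class Consequence(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical policy table: more consequence, closer human attention.
OVERSIGHT = {
    Consequence.LOW: "audit_log_only",
    Consequence.MEDIUM: "sampled_human_review",
    Consequence.HIGH: "human_approval_required",
}


def route(action: str, consequence: Consequence, audit_trail: List[Dict[str, str]]) -> str:
    """Pick the oversight mode and record the decision in the audit trail."""
    mode = OVERSIGHT[consequence]
    audit_trail.append(
        {"action": action, "consequence": consequence.name, "oversight": mode}
    )
    return mode
```

Every decision lands in the audit trail regardless of tier, so low-stakes actions remain reviewable after the fact.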

Failsafe

instant

Circuit breakers

Not a behavioural layer but a containment mechanism: the emergency stop. It halts the AI and fails over to a safe fallback when the three layers above are bypassed or overwhelmed, and maps to PACE Emergency. You rarely need it, and you're very glad it's there when you do.

PACE resilience & fail-safe →
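A minimal circuit breaker sketch, assuming a failure-count threshold and a static safe fallback. The class and its parameters are illustrative and not taken from AIRS; a production breaker would also need thread safety and alerting.

```python
from typing import Callable


class CircuitBreaker:
    def __init__(self, failure_threshold: int, fallback: Callable[[str], str]):
        self.failure_threshold = failure_threshold
        self.fallback = fallback
        self.failures = 0
        self.open = False  # open means the AI path is halted

    def trip(self) -> None:
        """Emergency stop, for example when an operator or monitor pulls it."""
        self.open = True

    def reset(self) -> None:
        """Deliberate, manual recovery; this sketch never closes on its own."""
        self.failures = 0
        self.open = False

    def call(self, ai_path: Callable[[str], str], request: str) -> str:
        if self.open:
            return self.fallback(request)
        try:
            return ai_path(request)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.open = True
            return self.fallback(request)
```

Once open, the breaker stops calling the AI path at all, so a compromised or overwhelmed model cannot keep acting while the problem is investigated.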

Three domains

The same idea, at three scales.

The same layered idea applies whether you are securing one agent, many agents, or the platforms beneath them.

Foundation

One agent doing one job. This is where everyone starts, and where the three control layers are easiest to see.

Secure a single agent →

MASO (Multi-Agent Security Operations)

Many agents working together. As they hand work to each other, trust gets complicated and the controls have to follow.

Secure multi-agent systems →

Infrastructure

The platforms, cloud, tools, and data underneath it all: the layer everything else depends on.

Secure the platform →

The wider boundary

The layers act on more than words in and out.

A request is the obvious thing to watch, but it is not the only one. The same three control layers also bound what an agent is allowed to be, remember, trust, and spend. Treat these as factors to weigh when you place your controls, not afterthoughts.

Identity & delegated authority

Who is the agent acting as, and how far does its borrowed authority reach? Scope it per action, not just at sign-up.
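Per-action scoping might be sketched like this, assuming a hypothetical `Delegation` record that names the principal, the delegated actions, and an illustrative per-action limit:

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Delegation:
    principal: str          # the human or service the agent acts for
    scopes: FrozenSet[str]  # actions the principal has delegated
    max_amount: float       # illustrative per-action limit


def authorize(d: Delegation, action: str, amount: float = 0.0) -> bool:
    """Checked on every action, not once when the session starts."""
    return action in d.scopes and amount <= d.max_amount


# Hypothetical delegation, for illustration only.
grant = Delegation("alice", frozenset({"read_invoice", "issue_refund"}), 50.0)
```

Because `authorize` runs on each action, a change to the delegation takes effect on the next call instead of at the next sign-in.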

Memory & context

What an agent stores and recalls is an attack surface. Poisoned memory can sit dormant and steer a later decision.

Supply chain & tool provenance

Models, tools, and their manifests arrive from somewhere. Trust what you can verify, not what claims to be safe.
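A small sketch of "trust what you can verify": compare a tool manifest against a digest pinned in advance. The manifest content and the pinning workflow are assumptions; signature verification would be a stronger check than a bare hash.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def verify_manifest(manifest_bytes: bytes, pinned_digest: str) -> bool:
    """Accept the manifest only if it matches the digest recorded at review time."""
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(sha256_hex(manifest_bytes), pinned_digest)


# Hypothetical manifest, pinned when it was first reviewed.
manifest = b'{"name": "search_docs", "version": "1.2.0"}'
pinned = sha256_hex(manifest)
```

Any later change to the manifest, even a single byte, fails the check and forces a fresh review.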

Cost & token governance

An agent that can loop or fan out can also burn a budget. Cap the blast radius before a runaway run does.
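Capping the blast radius can be as simple as a hard budget that every step must charge against. The `TokenBudget` class and its limits below are hypothetical.

```python
class BudgetExceeded(Exception):
    """Raised when a run would exceed its token or step cap."""


class TokenBudget:
    def __init__(self, max_tokens: int, max_steps: int):
        self.max_tokens = max_tokens
        self.max_steps = max_steps
        self.tokens_used = 0
        self.steps = 0

    def charge(self, tokens: int) -> None:
        """Call before each model or tool step; refuses the step that would overrun."""
        if self.steps + 1 > self.max_steps:
            raise BudgetExceeded("step cap reached")
        if self.tokens_used + tokens > self.max_tokens:
            raise BudgetExceeded("token cap reached")
        self.steps += 1
        self.tokens_used += tokens


# Hypothetical limits: a looping agent is stopped on its third step.
budget = TokenBudget(max_tokens=1000, max_steps=10)
stopped = False
try:
    for _ in range(5):
        budget.charge(400)
except BudgetExceeded:
    stopped = True
```

Checking before the spend, not after, means the cap is never overshot by the step that trips it.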

The thinking behind it

Why these layers, in this order.

Six short reads make the whole case for runtime security: the why before the how.

A runtime problem

Pre-deployment testing cannot prove future safety. Security has to be continuous.

Guardrails aren't enough

Novel attacks and semantic violations walk straight past fixed rules.

The judge detects

An evaluator surfaces unknown-bad against declared intent. It informs humans; it does not replace them.

Infrastructure beats instructions

Telling an agent what not to do fails. Make violations technically impossible.

Humans remain accountable

AI assists decisions; humans own outcomes. Oversight scales; it doesn't disappear.

Feedback loops

Four loops at different speeds turn the layers into a self-improving system.

Browse all insights →

Not sure where to begin?

Start with the plain-language explainer, or jump straight to the entry point that matches your role.

What is AIRS? → Find your role