AI Runtime Security - Single-Agent Controls¶
Most AI governance guidance assumes you can test your way to safety. You cannot.
AI systems are non-deterministic. Same prompt, same model, same parameters - and the response can still differ. Your test suite proves the system can behave correctly. It cannot prove it will on the next request.
- AI systems are non-deterministic - same input, different output, by design
- Guardrails fail silently - the most dangerous outputs look perfectly normal
- Multi-agent systems amplify errors - hallucinations compound, permissions propagate, failures correlate
Guardrails → Judge → Human Oversight → Circuit Breaker
Block known-bad → Detect unknown-bad → Decide edge cases → Fail safely
Start here:
- Read the 5-minute executive summary - the entire framework on one page
- Explore the technical architecture - three layers, what fails when
- Get working controls in 30 minutes - from zero to deployed
Architecture¶
Three layers, one principle: you can't fully test a non-deterministic system before deployment, so you continuously verify behavior in production. This is a closed-loop control system - not evaluate-once-and-deploy, but constrain-and-continuously-verify.
Layer 1 - Guardrails (containment boundaries) block known-bad inputs and outputs at machine speed (~10ms). Deterministic pattern matching: content filters, PII detection, topic restrictions, rate limits. Permissions derive from business intent - what the use case requires - not from evaluation of the model's capabilities. These are action-space constraints that leave the model's reasoning unconstrained. Every request passes through. No exceptions.
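A minimal sketch of what a deterministic guardrail layer looks like in code. The patterns, blocklist, and thresholds below are illustrative placeholders - production deployments use dedicated PII and content classifiers, not two regexes - but the shape is the point: fast, deterministic checks that every request passes through.

```python
import re
import time
from collections import defaultdict, deque

# Illustrative patterns only - real deployments use dedicated PII detectors.
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # US-SSN-like
    re.compile(r"\b\d{13,16}\b"),          # card-number-like
]
BLOCKED_TOPICS = {"wire transfer instructions", "account credentials"}

class RateLimiter:
    """Sliding-window request limit per caller."""
    def __init__(self, max_requests: int, window_s: float):
        self.max_requests = max_requests
        self.window_s = window_s
        self.calls = defaultdict(deque)

    def allow(self, caller: str) -> bool:
        now = time.monotonic()
        q = self.calls[caller]
        while q and now - q[0] > self.window_s:
            q.popleft()  # drop calls outside the window
        if len(q) >= self.max_requests:
            return False
        q.append(now)
        return True

def guardrail_check(caller: str, text: str, limiter: RateLimiter) -> tuple[bool, str]:
    """Deterministic pattern matching on every request: (allowed, reason)."""
    if not limiter.allow(caller):
        return False, "rate_limit"
    if any(p.search(text) for p in PII_PATTERNS):
        return False, "pii_detected"
    lowered = text.lower()
    if any(topic in lowered for topic in BLOCKED_TOPICS):
        return False, "blocked_topic"
    return True, "ok"

limiter = RateLimiter(max_requests=5, window_s=60)
print(guardrail_check("user-1", "My SSN is 123-45-6789", limiter))
# → (False, 'pii_detected')
```

Note that nothing here inspects the model's reasoning - the checks constrain the action space (what enters and leaves) exactly as the paragraph above describes.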
Layer 2 - Model-as-Judge catches unknown-bad through independent model evaluation. The Judge can be a large LLM running asynchronously (500ms to 5s) or a distilled SLM running inline as a sidecar (10ms to 50ms). Either way, enterprise-owned and configured, evaluating task agent outputs against policy, factual grounding, tone, and safety criteria. Catches what guardrails cannot pattern-match, including within-bounds adversarial behavior that containment alone cannot address.
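The Judge interface can be sketched as follows. The prompt wording, the `JudgeResult` type, and the stub model are all hypothetical - the real judge is an enterprise-configured LLM or distilled SLM - but the structure shows the key property: the judge is an independent model call that evaluates the task agent's output against policy criteria and returns a verdict.

```python
import json
from dataclasses import dataclass
from typing import Callable

# Illustrative evaluation prompt - real criteria come from enterprise policy.
JUDGE_PROMPT = """You are a policy evaluator. Score the candidate response.
Criteria: policy compliance, factual grounding, tone, safety.
Return JSON: {{"verdict": "pass" or "flag", "reasons": [...]}}

User request: {request}
Candidate response: {response}"""

@dataclass
class JudgeResult:
    verdict: str
    reasons: list

def evaluate(request: str, response: str,
             judge_llm: Callable[[str], str]) -> JudgeResult:
    """Send the task agent's output to an independent judge model.
    The judge detects and flags - it does not block inline."""
    raw = judge_llm(JUDGE_PROMPT.format(request=request, response=response))
    parsed = json.loads(raw)
    return JudgeResult(parsed["verdict"], parsed.get("reasons", []))

# Stub standing in for a real model call (e.g. a distilled SLM sidecar).
def stub_judge(prompt: str) -> str:
    flagged = "guaranteed returns" in prompt.lower()
    return json.dumps({
        "verdict": "flag" if flagged else "pass",
        "reasons": ["unsubstantiated financial claim"] if flagged else [],
    })

result = evaluate("Investment advice?", "This fund has guaranteed returns.", stub_judge)
print(result.verdict)  # → flag
```

Swapping `stub_judge` for an async call to a large LLM, or an inline call to a sidecar SLM, changes latency but not the contract.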
Layer 3 - Human Oversight provides the accountability backstop. Scope scales with risk: low-risk systems get spot checks, high-risk systems get human approval before commit. Humans decide edge cases. Humans own outcomes. Handles what neither containment nor the Judge can resolve autonomously.
Circuit Breaker stops all AI traffic and activates a non-AI fallback when any layer fails. Not a degradation - a full stop with a predetermined safe state.
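A minimal circuit-breaker sketch, assuming a simple failure-count trip condition (real deployments would also key on Judge alerts, anomaly signals, and manual triggers). Once open, every request routes to the predetermined non-AI fallback - a full stop, not a degradation.

```python
class CircuitBreaker:
    """Stops AI traffic and routes to a non-AI fallback on layer failure.
    Illustrative sketch - trip conditions here are failure counts only."""
    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold
        self.failures = 0
        self.open = False  # open circuit = all AI traffic stopped

    def record_failure(self, layer: str) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.open = True  # trip: full stop until incident review

    def handle(self, request, ai_path, fallback_path):
        if self.open:
            return fallback_path(request)  # predetermined safe state
        try:
            return ai_path(request)
        except Exception:
            self.record_failure("ai_path")
            return fallback_path(request)

breaker = CircuitBreaker(failure_threshold=1)

def failing_ai(req):
    raise RuntimeError("judge unavailable")

def fallback(req):
    return "Routed to human support queue."

print(breaker.handle("help", failing_ai, fallback))  # → Routed to human support queue.
print(breaker.open)  # → True: breaker stays open until deliberately reset
```

The deliberate design choice is that the breaker does not auto-close: recovery to Primary is a human decision, consistent with the E→P recovery testing required at the Critical tier.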
Each layer is specifically designed to catch what the previous layer misses. This is compound defence by design - not defence-in-depth by coincidence. For the full argument, see Why Containment Beats Evaluation.
This pattern already exists in production at major platforms: NVIDIA NeMo, AWS Bedrock, Azure AI, LangChain, Guardrails AI, and others. This reference provides the vendor-neutral implementation: risk classification, controls, fail postures, and tested fallback paths.
Get Started¶
| If you want to... | Go here |
|---|---|
| Get the whole framework on one page | Cheat Sheet / Decision Poster |
| Deploy low-risk AI fast | Fast Lane |
| Understand the concepts in 30 minutes | Quick Start |
| Implement controls with working code | Implementation Guide |
| Classify a system by risk | Risk Tiers |
| Deploy an agentic AI system | Agentic Controls |
| Understand what happens when controls fail | PACE Resilience |
| Enforce controls at the infrastructure layer | Infrastructure Controls |
| Track your implementation | Checklist |
| Secure a multi-agent system | MASO Framework |
Before You Build Controls¶
The First Control: Choosing the Right Tool
The most effective way to reduce AI risk is to not use AI where it doesn't belong. Before guardrails, judges, or human oversight - ask whether AI is the right tool for this problem.
If your deployment is internal, read-only, handles no regulated data, and has a human reviewing output - start with the Fast Lane. You may not need the rest.
Risk-Scaled Controls¶
Controls scale to risk so that low-risk AI moves fast and high-risk AI stays safe. The framework provides a menu of control patterns: select the ones you need and consciously deselect the ones you do not. The controls are designed to fit your organisation's context rather than impose a single way of working.
| Risk Tier | Controls Required | PACE Posture | Use Case Examples |
|---|---|---|---|
| Low | Fast Lane: minimal guardrails, self-certification | P only (fail-open with logging) | Internal chatbots, document summarisation, code assistance |
| Medium | Guardrails + Judge, periodic human review | P + A configured | Customer-facing content, recommendation engines, search |
| High | All three layers, human-in-the-loop for writes | P + A + C configured and tested | Financial advice, medical support, regulatory decisions |
| Critical | Full architecture, mandatory human approval | Full PACE cycle with tested E→P recovery | Autonomous actions on regulated data, safety-critical systems |
Classify your system: Risk Tiers
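Since classification is by use case rather than model capability, it can be expressed as a function of deployment attributes. The attribute names and the precedence rules below are a simplified illustration of the table above, not the framework's normative classification logic - see the Risk Tiers page for the full criteria.

```python
def classify_risk_tier(*, external_facing: bool, writes_data: bool,
                       regulated_domain: bool, autonomous_actions: bool) -> str:
    """Map deployment context (not model capability) to a risk tier.
    Illustrative precedence only - consult the Risk Tiers page."""
    if autonomous_actions and regulated_domain:
        return "Critical"   # autonomous actions on regulated data
    if regulated_domain or (writes_data and external_facing):
        return "High"       # regulatory exposure or external writes
    if external_facing:
        return "Medium"     # customer-facing, read-mostly
    return "Low"            # internal, read-only

# Internal document summariser with a human reviewing output:
print(classify_risk_tier(external_facing=False, writes_data=False,
                         regulated_domain=False, autonomous_actions=False))
# → Low
```

Note how the inputs are all properties of the deployment, none of the model - the same model can land in any tier depending on where and how it is used.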
PACE Resilience¶
Every control has a defined failure mode. The PACE methodology ensures that when a control layer degrades - and it will - the system fails safely rather than silently.
Primary: All layers operational. Normal production.
Alternate: One layer degraded. Backup activated. Scope tightened. Example: Judge layer is down → guardrails remain active, all outputs queued for human review.
Contingency: Multiple layers degraded. AI operates in supervised-only mode. Human approves every action. Reduced capacity, high assurance.
Emergency: Confirmed compromise or cascading failure. Circuit breaker fires. AI traffic stopped. Non-AI fallback activated. Incident response engaged.
Even at the lowest risk tier, there's a fallback plan. At the highest, there's a structured degradation path from full autonomy to full stop.
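The four PACE states can be sketched as a simple posture assessment over layer health. The mapping below (zero layers down → Primary, one → Alternate, more → Contingency, confirmed compromise → Emergency) is an illustrative reduction of the descriptions above, assuming health signals per layer are already available.

```python
from enum import Enum

class PaceState(Enum):
    PRIMARY = "P"      # all layers operational; normal production
    ALTERNATE = "A"    # one layer degraded; backup active, scope tightened
    CONTINGENCY = "C"  # multiple layers degraded; supervised-only mode
    EMERGENCY = "E"    # compromise/cascade; circuit breaker fired

def assess_pace(layers_up: dict[str, bool], compromised: bool) -> PaceState:
    """Derive the PACE posture from layer health signals.
    Illustrative mapping - real triggers are defined per system."""
    if compromised:
        return PaceState.EMERGENCY
    down = sum(1 for up in layers_up.values() if not up)
    if down == 0:
        return PaceState.PRIMARY
    if down == 1:
        return PaceState.ALTERNATE
    return PaceState.CONTINGENCY

# Judge down, guardrails and human oversight up → Alternate:
state = assess_pace({"guardrails": True, "judge": False, "oversight": True},
                    compromised=False)
print(state)  # → PaceState.ALTERNATE
```

In the Alternate example, the posture change would also carry the tightened scope described above - guardrails stay active and all outputs queue for human review until the Judge recovers.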
Core Documents¶
| Document | Purpose |
|---|---|
| Cheat Sheet | Entire framework on one page - classify, control, fail posture, test |
| Decision Poster | Visual one-page reference |
| Fast Lane | Pre-approved minimal controls for low-risk AI |
| Risk Tiers | Classify your system, determine control and resilience requirements |
| Risk Assessment | Quantitative control effectiveness, residual risk per tier, NIST AI RMF aligned |
| Controls | Guardrails, Judge, and Human Oversight implementation with per-layer fail postures |
| Agentic | Controls for single autonomous AI agents including graceful degradation path |
| PACE Resilience | What happens when controls fail |
| Checklist | Track implementation and PACE verification progress |
| Emerging Controls | Multimodal, reasoning, and streaming considerations (theoretical) |
Infrastructure Controls¶
This section defines what to enforce. The infrastructure section defines how - 80 technical controls across 11 domains, with standards mappings and platform-specific patterns.
Domains: Identity & Access Management (8), Logging & Observability (10), Network & Segmentation (8), Data Protection (8), Secrets & Credentials (8), Supply Chain (8), Incident Response (8), Tool Access (6), Session & Scope (5), Delegation Chains (5), Sandbox Patterns (6).
Standards mappings: Every control maps to the three-layer model, ISO 42001 Annex A, NIST AI RMF, and OWASP LLM/Agentic Top 10.
Platform patterns: AWS Bedrock, Azure AI, and Databricks reference architectures.
Defence in Depth Beyond the AI Layer¶
The three-layer model above - guardrails, judge, human oversight - addresses controls specific to non-deterministic AI behavior. It does not replace the security controls your organisation already has. It sits inside them.
Your existing DLP systems apply to data flowing into and out of AI systems - both preventing sensitive data from reaching models and catching leakage that AI-specific controls miss. API gateways validate requests and enforce schemas regardless of whether the caller is human or AI. Database access controls and parameterised queries prevent injection even if an agent constructs a malicious query. IAM governs who can invoke AI systems in the first place. SIEM correlates AI events with network, endpoint, and application events. Secure coding practices in the systems agents interact with still matter - arguably more, because the caller is now non-deterministic.
These controls are outside the scope of this reference, but they are part of your defence. When you assess your AI security posture, include them. When you threat-model, include them. When an AI-specific control misses something, these existing controls are your safety net.
When You Need Multi-Agent¶
When AI agents collaborate, delegate tasks, and take autonomous actions across trust boundaries, the single-agent controls on this page are necessary but not sufficient. The MASO Framework extends this architecture into multi-agent orchestration.
| What MASO adds | Why single-agent controls aren't enough |
|---|---|
| Inter-agent message bus security | Agents communicating directly create uncontrolled trust boundaries |
| Non-Human Identity per agent | Shared credentials between agents create lateral movement risk |
| Epistemic integrity controls | Hallucinations compound across agent chains; confidence inflates without evidence |
| Transitive authority prevention | Delegation creates implicit privilege escalation |
| Kill switch architecture | Multi-agent cascading failures require system-wide emergency stop |
| Dual OWASP coverage | Agentic Top 10 (2026) risks only exist when agents act autonomously |
| Document | Purpose |
|---|---|
| MASO Overview | Architecture, PACE integration, OWASP dual mapping, 7 control domains |
| Tier 1 - Supervised | Low autonomy: human approves all writes |
| Tier 2 - Managed | Medium autonomy: NHI, signed bus, Model-as-Judge, continuous monitoring |
| Tier 3 - Autonomous | High autonomy: self-healing PACE, adversarial testing, isolated kill switch |
| Red Team Playbook | 13 adversarial test scenarios for multi-agent systems |
| Integration Guide | LangGraph, AutoGen, CrewAI, AWS Bedrock implementation patterns |
| Worked Examples | Financial services, healthcare, critical infrastructure |
Extensions¶
| Folder | Contents |
|---|---|
| Regulatory | ISO 42001 and EU AI Act mapping |
| Technical | Bypass prevention, metrics |
| Industry Solutions | Guardrails, evaluators, and safety model reference |
| Templates | Risk assessment templates, implementation plans |
| Worked Examples | Per-tier implementation walkthroughs |
Insights¶
Foundational Arguments
| Article | Key Argument |
|---|---|
| The First Control: Choosing the Right Tool | Design thinking before technology selection |
| Why Your AI Guardrails Aren't Enough | Guardrails block known-bad; you need detection for unknown-bad |
| The Judge Detects. It Doesn't Decide. | Async evaluation beats real-time blocking for nuanced decisions |
| Infrastructure Beats Instructions | You can't secure AI systems with prompts alone |
| Risk Tier Is Use Case, Not Technology | Classification reflects deployment context, not model capability |
| Humans Remain Accountable | AI assists decisions; humans own outcomes |
Emerging Challenges
| Article | Key Argument |
|---|---|
| The Verification Gap | Current safety approaches can't confirm ground truth |
| Behavioral Anomaly Detection | Aggregating signals to detect drift from expected behavior |
| Multimodal AI Breaks Your Text-Based Guardrails | Images, audio, and video create new attack surfaces |
| When AI Thinks Before It Answers | Reasoning models need reasoning-aware controls |
| When Agents Talk to Agents | Multi-agent accountability gaps → see MASO |
| The Memory Problem | Long context and persistent memory introduce novel risks |
| You Can't Validate What Hasn't Finished | Real-time streaming challenges existing validation |
| Open-Weight Models Shift the Burden | Self-hosted models inherit the provider's control responsibilities |
| When the Judge Can Be Fooled | The Judge layer needs its own threat model |
Platforms Implementing This Pattern¶
This isn't a theoretical proposal. These platforms already implement variants of this pattern:
| Platform | Approach |
|---|---|
| NVIDIA NeMo Guardrails | Five rail types: input, dialog, retrieval, execution, output |
| LangChain | Middleware chains with human-in-the-loop |
| Guardrails AI | Open-source validator framework |
| Galileo | Eval-to-guardrail lifecycle |
| DeepEval | Model-as-Judge evaluation framework |
| AWS Bedrock Guardrails | Managed input/output filtering |
| Azure AI Content Safety | Content filtering and moderation |
Standards Alignment¶
| Standard | Relevance | Mapping |
|---|---|---|
| OWASP LLM Top 10 | Security vulnerabilities in LLM applications | OWASP mapping |
| OWASP Agentic Top 10 | Risks specific to autonomous AI agents | MASO mapping |
| NIST AI RMF | AI risk management framework | NIST mapping |
| ISO 42001 | AI management system standard | ISO 42001 mapping |
| NIST SP 800-218A | Secure development for generative AI | SP 800-218A mapping |
| MITRE ATLAS | Adversarial threat landscape for AI | MASO threat intelligence |
| DORA | Digital operational resilience | MASO regulatory alignment |
Scope¶
In scope: Custom LLM applications, AI decision support, document processing, single-agent systems - from deployment through incident response.
Out of scope: Vendor AI products (use vendor controls), model training (see MLOps security guidance), and pre-deployment testing. This framework is about reducing harm during live operation.
Pre-deployment complement: For secure development practices covering data sourcing, training, fine-tuning, and model release, see NIST SP 800-218A. This framework begins where SP 800-218A ends.
For multi-agent systems: See MASO.
Contributing¶
Feedback, corrections, and extensions welcome. See CONTRIBUTING.md.