Agentic AI Controls¶

Additional controls for AI systems that take autonomous actions.

What Makes Agents Different¶

Characteristic	Chatbot	Agent
Actions	Responds only	Takes real-world actions
Autonomy	Single turn	Multi-step, self-directed
Scope	Fixed	May expand based on goals
Failure mode	Bad answer	Bad action with consequences

Key risk: Agents can cause harm at machine speed without human review.

The Two Core Problems¶

Agentic AI security reduces to two problems:

Problem	Question	Failure Mode
1. System Access	Does the agent access only the right systems?	Reaches data/APIs it shouldn't
2. Request Integrity	Does the action match the user's actual intent?	Manipulated or misinterpreted requests

Problem 1: System Access¶

The agent should reach only the systems it needs, with the minimum necessary permissions. For the governance model, lifecycle, and threat landscape behind these controls, see IAM Governance for AI Systems.

Control	Implementation
Least-privilege credentials	Agent gets tokens scoped to specific resources
Network allowlists	Agent can only reach approved endpoints
Data views	Database exposes only permitted subset
Action allowlists	Only pre-approved action types permitted
Blast radius limits	Maximum records, funds, or scope per action

Test: If the agent is fully compromised, what's the worst it can do? Reduce that.

Problem 2: Request Integrity¶

The action the agent takes should match what the user actually wanted.

Threat	Control
Injection attacks	Input guardrails, tool output sanitisation
Instruction drift	Anchor to original request, not intermediate reasoning
Misinterpretation	Intent confirmation before irreversible actions
Manipulation via tools	Treat tool outputs as untrusted data

Test: Can you trace from the user's original request to the final action? Is the link intact?

Why Both Problems Matter¶

Scenario	Access OK?	Integrity OK?	Outcome
Normal operation	✓	✓	Correct action
Over-privileged agent	✗	✓	Correct action, but breach waiting to happen
Injection attack	✓	✗	Wrong action on right systems
Compromised agent	✗	✗	Catastrophic - wrong action, broad access

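Both problems must be solved. Access controls do not stop a manipulated request from misusing systems the agent is allowed to reach, and integrity controls do not limit what an over-privileged agent can touch once it is compromised.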

Core Principle¶

Infrastructure beats instructions.

Don't tell the agent "only access customer service data."
Give it credentials that can only access customer service data.

Bad (Instruction)	Good (Infrastructure)
"Only access CS data"	Database view exposes only CS data
"Don't send emails without approval"	Email API requires approval token
"Stay within budget"	Hard spending cap at API gateway
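
As an illustration of the "Email API requires approval token" row, here is a minimal sketch of a gateway-side check. The key, the function names, and the HMAC-based token format are assumptions made for this example, not a prescribed design; the point is that the check runs outside the agent and holds even if the prompt is ignored.

```python
import hashlib
import hmac
from typing import Optional

# Hypothetical secret shared by the approval service and the email gateway.
# The agent never holds it, so it cannot mint its own approvals.
APPROVAL_KEY = b"demo-key-not-for-production"


def issue_approval_token(message_id: str) -> str:
    """Called by the (hypothetical) approval service after a human approves."""
    return hmac.new(APPROVAL_KEY, message_id.encode(), hashlib.sha256).hexdigest()


def send_email(message_id: str, approval_token: Optional[str]) -> str:
    """Gateway-side enforcement: refuse the send unless the token verifies."""
    if approval_token is None:
        raise PermissionError("approval token required")
    expected = issue_approval_token(message_id)
    # Constant-time comparison avoids leaking token contents via timing.
    if not hmac.compare_digest(expected, approval_token):
        raise PermissionError("invalid approval token")
    return f"sent:{message_id}"
```

Because the token is bound to one message ID, an approval for one email cannot be replayed for another.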

Control Categories¶

1. Scope Enforcement¶

Limit what the agent can access and do - technically, not via prompts.

Control	Implementation
Network allowlist	Agent can only reach approved endpoints
Data views	Agent sees only authorised data subset
Action allowlist	Only permitted actions can execute
Resource caps	Hard limits on compute, API calls, cost
Time limits	Maximum execution duration
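
The scope controls above can be sketched as a single policy object that the execution layer consults before every call. This is a minimal sketch under stated assumptions: `ScopePolicy`, `ScopeViolation`, and the example host and action names are hypothetical, and a real deployment would enforce the network allowlist at the network layer as well.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


class ScopeViolation(Exception):
    """Raised when a proposed call falls outside the agent's scope."""


@dataclass
class ScopePolicy:
    allowed_hosts: frozenset    # network allowlist
    allowed_actions: frozenset  # action allowlist
    max_api_calls: int          # resource cap
    api_calls: int = 0

    def check(self, action: str, url: str) -> None:
        """Raise ScopeViolation unless the call is inside every limit."""
        if action not in self.allowed_actions:
            raise ScopeViolation(f"action not allowlisted: {action}")
        host = urlparse(url).hostname
        if host not in self.allowed_hosts:
            raise ScopeViolation(f"host not allowlisted: {host}")
        if self.api_calls >= self.max_api_calls:
            raise ScopeViolation("API call cap reached")
        # Only calls that pass every check count against the cap.
        self.api_calls += 1
```

The caller would treat any `ScopeViolation` as a hard stop rather than feeding it back to the agent as something to work around.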

2. Action Validation¶

Validate every action independently. Don't trust agent reasoning.

Validation flow: [Figure: Action Validator Flow]

Dry-run / simulation mode: For high-risk or first-time actions, execute in simulation mode before committing. The gateway routes the action to a sandbox or staging environment, captures the result, and presents it for review. Only after validation does the action execute against production systems. This is especially valuable during initial deployment when behavioral baselines have not yet been established.
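
The dry-run routing described above could look roughly like the following. This is a sketch under assumptions: the `Gateway` class, its callbacks, and the rule "dry-run anything high-risk or not seen before" are illustrative choices, not a reference design.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Action:
    name: str
    params: dict


@dataclass
class Gateway:
    execute_prod: Callable[[Action], str]     # commits against production
    execute_sandbox: Callable[[Action], str]  # runs against sandbox or staging
    review: Callable[[Action, str], bool]     # True if the dry-run result is accepted
    high_risk: frozenset = frozenset()
    seen: set = field(default_factory=set)

    def submit(self, action: Action) -> str:
        # High-risk and first-time actions go through simulation first.
        needs_dry_run = action.name in self.high_risk or action.name not in self.seen
        if needs_dry_run:
            preview = self.execute_sandbox(action)
            if not self.review(action, preview):
                return "rejected"
        self.seen.add(action.name)
        return self.execute_prod(action)
```

The `review` callback is deliberately independent of the agent: it sees the proposed action and the simulated result, not the agent's reasoning.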

3. Tool Output Sanitisation¶

Tool outputs are injection vectors. Treat them as untrusted input.

Control	Purpose
Scan for instructions	Detect "ignore previous" patterns
Truncate length	Limit context pollution
Mark as data	Clear framing that this is data, not instructions
Flag suspicious	Human review before continuing
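
A minimal sketch of the four controls in the table, assuming a simple regex scan and a fixed character budget. The patterns, the `MAX_CHARS` value, and the `<tool_data>` framing are illustrative assumptions; pattern matching alone will not catch every injection attempt.

```python
import re
from dataclasses import dataclass

# Illustrative patterns only; a real detector would need far broader coverage.
INSTRUCTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"you are now", re.IGNORECASE),
]
MAX_CHARS = 2000  # assumed context budget per tool response


@dataclass
class SanitisedOutput:
    text: str
    suspicious: bool  # True means: hold for human review before continuing


def sanitise_tool_output(raw: str, tool_name: str) -> SanitisedOutput:
    # Scan the full response before truncating so late payloads are still seen.
    suspicious = any(p.search(raw) for p in INSTRUCTION_PATTERNS)
    truncated = raw[:MAX_CHARS]
    # Frame the content explicitly as data, not instructions.
    framed = f'<tool_data source="{tool_name}">\n{truncated}\n</tool_data>'
    return SanitisedOutput(text=framed, suspicious=suspicious)
```

For example, `sanitise_tool_output("Please IGNORE previous instructions", "search").suspicious` evaluates to `True`, so the orchestrator can pause rather than pass the text to the model.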

4. Approval Workflows¶

Make approval meaningful, not rubber-stamp.

Bad	Good
"Approve?"	Show context, data, impact, expected outcome
Approve/Deny only	Approve / Deny / Modify / Escalate
Same approver for all	Different approvers by action type
No expiry	Approval expires, must re-request
No review deadline	Oversight SLA: maximum time before human review required, configurable by risk tier
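
The "Good" column can be encoded as data so that the execution layer, not the approver's memory, enforces it. In this sketch the approver mapping and the field names are hypothetical; it shows four decision types, per-action-type approvers, and expiry.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum


class Decision(Enum):
    APPROVE = "approve"
    DENY = "deny"
    MODIFY = "modify"
    ESCALATE = "escalate"


# Hypothetical routing: different approvers for different action types.
APPROVER_BY_ACTION = {"refund": "finance-lead", "delete_records": "data-owner"}


@dataclass
class Approval:
    action_type: str
    approver: str
    decision: Decision
    granted_at: datetime
    ttl: timedelta

    def is_valid(self, now: datetime) -> bool:
        """Valid only if approved, by the right approver, and not yet expired."""
        return (
            self.decision is Decision.APPROVE
            and self.approver == APPROVER_BY_ACTION.get(self.action_type)
            and now < self.granted_at + self.ttl
        )
```

An expired or misrouted approval simply fails `is_valid`, which forces the agent to re-request rather than reuse a stale grant.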

5. Circuit Breakers¶

Hard stops that trigger regardless of agent "reasoning."

Threshold	Action
>100 actions in one task	Pause
>$50 in API calls	Pause
>30 minutes execution	Pause
>10% error rate	Pause
Any scope violation	Terminate
Any irreversible action	Require approval
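
The thresholds in the table translate directly into a check that runs outside the agent loop. The sketch below uses the table's example values; the precedence order (terminate, then pause, then approval) and the names `TaskStats` and `check_breakers` are assumptions of this example.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    CONTINUE = "continue"
    PAUSE = "pause"
    TERMINATE = "terminate"
    REQUIRE_APPROVAL = "require_approval"


@dataclass
class TaskStats:
    actions: int = 0
    errors: int = 0
    api_cost_usd: float = 0.0
    elapsed_minutes: float = 0.0


def check_breakers(stats: TaskStats, scope_violation: bool = False,
                   irreversible: bool = False) -> Verdict:
    """Evaluate hard stops before the next action; agent reasoning is not consulted."""
    if scope_violation:
        return Verdict.TERMINATE
    error_rate = stats.errors / stats.actions if stats.actions else 0.0
    if (stats.actions > 100 or stats.api_cost_usd > 50
            or stats.elapsed_minutes > 30 or error_rate > 0.10):
        return Verdict.PAUSE
    if irreversible:
        return Verdict.REQUIRE_APPROVAL
    return Verdict.CONTINUE
```

In practice the thresholds would be configuration, tuned per agent and risk tier, rather than constants in code.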

Agent Risk Tiers¶

Agents are typically HIGH or CRITICAL tier. LOW/MEDIUM agents are rare.

Agent Type	Typical Tier	Key Controls
Read-only research	HIGH	Scope limits, output review
Internal automation	HIGH	Action allowlist, circuit breakers
Customer-facing	CRITICAL	Full approval workflow
Financial actions	CRITICAL	All controls, human approval

6. Session-Level Intent Analysis¶

Per-interaction controls have a structural blind spot: task decomposition attacks. Adversaries, including AI-orchestrated campaigns, break malicious goals into sub-tasks that individually pass every control layer. Each input clears the guardrails. Each output passes the Judge. No single action triggers a circuit breaker. The malicious intent exists only in the aggregate.

In November 2025, Anthropic disclosed that a state-sponsored group, first detected in mid-September 2025, had used Claude Code to execute 80–90% of a cyber espionage campaign autonomously by decomposing the campaign into individually benign sub-tasks. CrowdStrike documented adversaries building custom frameworks that decompose malicious tasks into innocent-looking components. Per-interaction controls alone do not catch this pattern.

Session-level intent analysis detects what per-interaction controls cannot: aggregate intent across a sequence of benign-looking actions.

What to Analyze¶

Signal	What It Reveals
Action sequence patterns	Individually benign actions that form a malicious workflow when combined (read → copy → exfiltrate)
Goal drift across steps	Agent's effective goal shifting away from declared intent over multiple interactions
Cumulative scope expansion	Each action slightly extends scope; aggregate exceeds authorization
Resource accumulation	Agent gathering information or access incrementally toward a prohibited objective
Cross-session correlation	Patterns that span multiple sessions from the same user or agent identity

Implementation¶

Level	Approach	Cost
Basic	Track action sequences per session; alert on known-bad combinations (e.g., read sensitive data + call external API)	Low
Intermediate	Build session-level behavioral baselines; flag sessions whose action-type distribution deviates from cohort	Medium
Advanced	Model-as-Judge evaluates the full session trajectory against declared intent at session close or at periodic checkpoints	High
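
The Basic level can be sketched as an ordered-subsequence match over the session's action log. The action-type labels and the `KNOWN_BAD` patterns below are hypothetical examples drawn from the combinations mentioned above; a real list would come from threat modelling.

```python
from typing import Iterable, List, Tuple

# Hypothetical known-bad combinations, expressed as ordered action types.
KNOWN_BAD = {
    "exfiltration": ("read_sensitive", "call_external_api"),
    "staged_exfiltration": ("read", "copy", "exfiltrate"),
}


def contains_subsequence(actions: Iterable[str], pattern: Tuple[str, ...]) -> bool:
    """True if pattern occurs in order within actions, with gaps allowed."""
    it = iter(actions)
    # Each `step in it` consumes the iterator up to and including the match,
    # so later steps can only match later actions.
    return all(step in it for step in pattern)


class SessionTracker:
    def __init__(self) -> None:
        self.actions: List[str] = []

    def record(self, action_type: str) -> List[str]:
        """Append an action and return the names of any patterns now matched."""
        self.actions.append(action_type)
        return sorted(
            name for name, pattern in KNOWN_BAD.items()
            if contains_subsequence(self.actions, pattern)
        )
```

Allowing gaps matters: a decomposition attack can interleave benign steps between the ones that form the pattern.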

Session-Level Circuit Breakers¶

Standard circuit breakers use per-action thresholds that decomposition attacks deliberately stay under. Add aggregate thresholds:

Threshold	Example
Cumulative data access	Session has accessed >N distinct data sources → pause
Action-type diversity	Session has used >N distinct tool types → flag for review
Scope surface area	Session's aggregate data access footprint exceeds role baseline → pause
Intent coherence	Session actions don't form a coherent path toward declared goal → flag
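
Two of the aggregate thresholds above, cumulative data access and action-type diversity, can be tracked with plain sets. This is a sketch only: the class name, the signal strings, and the choice of N are assumptions, and scope surface area and intent coherence need richer inputs than shown here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SessionAggregate:
    max_sources: int     # N for cumulative data access
    max_tool_types: int  # N for action-type diversity
    sources: set = field(default_factory=set)
    tool_types: set = field(default_factory=set)

    def observe(self, tool_type: str, data_source: Optional[str] = None) -> List[str]:
        """Record one action and return any session-level signals it triggers."""
        self.tool_types.add(tool_type)
        if data_source is not None:
            self.sources.add(data_source)
        signals = []
        if len(self.sources) > self.max_sources:
            signals.append("pause: cumulative data access")
        if len(self.tool_types) > self.max_tool_types:
            signals.append("flag: action-type diversity")
        return signals
```

Because the sets count distinct values, repeated access to the same source does not move the session closer to the threshold; only widening does.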

For the full behavioral anomaly detection architecture (signal collection, aggregation pipeline, ML detection, and UEBA parallels), see Behavioral Anomaly Detection. For trace-level evaluation methodology, see Process-Aware Evaluation.

7. Synchronous Pre-Action Evaluation¶

The standard Judge operates asynchronously, evaluating actions after they have been taken. For chatbot use cases this is sufficient. For agentic systems with tool access operating at machine speed, async evaluation means damage is done before detection fires.

CrowdStrike documented the fastest eCrime breakout time at 27 seconds. The standard async Judge evaluation cycle (500ms–5s per evaluation, plus queue time) cannot intercept actions within this window. When agents can execute irreversible actions at machine speed, post-action evaluation is a forensic tool, not a prevention layer.

When to Evaluate Synchronously¶

Not every action warrants synchronous evaluation, as that would be cost-prohibitive. Use risk-based routing:

Action Category	Evaluation Mode	Rationale
Read-only queries	Async (post-action)	Low blast radius, reversible
Data writes to internal systems	Sync (pre-action) for first occurrence; async for established patterns	Medium blast radius
External API calls	Sync (pre-action)	Data leaves controlled environment
Financial transactions	Sync (pre-action) + human approval	Irreversible, high impact
Destructive operations	Sync (pre-action) + human approval	Irreversible by definition
Actions matching decomposition patterns	Sync (pre-action)	Session-level flag triggered

Architecture¶

Agent proposes action
    ↓
Action classifier (rule-based, <5ms)
    ↓ [low-risk]          ↓ [elevated-risk]
Execute immediately    Pre-action Judge evaluation (500ms–2s)
    ↓                      ↓ [pass]        ↓ [flag]
Async Judge eval       Execute          Human approval queue
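
The routing table and the flow above can be combined into a small rule-based classifier. In this sketch the category labels, the fail-closed default for unknown categories, and the `judge` callback are assumptions; the Judge itself is represented only as a function that returns pass or flag.

```python
from enum import Enum
from typing import Callable


class Mode(Enum):
    ASYNC = "async"
    SYNC = "sync"
    SYNC_PLUS_HUMAN = "sync+human"


# Hypothetical category labels mirroring the routing table.
ROUTING = {
    "read_query": Mode.ASYNC,
    "internal_write": Mode.SYNC,
    "external_api_call": Mode.SYNC,
    "financial_transaction": Mode.SYNC_PLUS_HUMAN,
    "destructive_operation": Mode.SYNC_PLUS_HUMAN,
}


def classify(category: str, session_flagged: bool = False,
             seen_before: bool = False) -> Mode:
    # Assumption: unknown categories fail closed to the strictest mode.
    mode = ROUTING.get(category, Mode.SYNC_PLUS_HUMAN)
    if category == "internal_write" and seen_before:
        mode = Mode.ASYNC  # established pattern
    if session_flagged and mode is Mode.ASYNC:
        mode = Mode.SYNC   # a session-level flag upgrades low-risk actions
    return mode


def handle(category: str, judge: Callable[[str], bool], **kwargs) -> str:
    mode = classify(category, **kwargs)
    if mode is Mode.ASYNC:
        return "execute, then async eval"
    if not judge(category):  # pre-action Judge flagged the action
        return "human approval queue"
    return "human approval queue" if mode is Mode.SYNC_PLUS_HUMAN else "execute"
```

Keeping the classifier rule-based is what makes the sub-5ms budget plausible; only the elevated-risk branch pays for a model call.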

Cost Implications¶

Synchronous evaluation adds latency to the agent's action loop. Budget for it:

Metric	Async-Only	Sync for Elevated-Risk
Median action latency	+0ms (eval is background)	+800ms on ~20% of actions
P99 action latency	+0ms	+3s on flagged actions
Judge API cost	Baseline	+15–30% (more actions evaluated inline)
Actions completed before detection	All of them	Only low-risk actions

The tradeoff is explicit: latency for prevention. For systems exposed to breakout times as short as 27 seconds, the latency cost is justified.
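
To budget for this, a rough back-of-the-envelope calculation helps. The sketch below treats the table's +800ms as a flat overhead on each synchronously evaluated action and ~20% as the share of such actions; both are simplifications of the table's figures, and the 50-action task in the usage note is a made-up example.

```python
def added_latency(actions: int, sync_share_percent: int, sync_overhead_ms: int) -> dict:
    """Estimate the latency added to one task by synchronous pre-action evaluation."""
    sync_actions = actions * sync_share_percent // 100
    total_ms = sync_actions * sync_overhead_ms
    return {
        "sync_actions": sync_actions,
        "total_added_ms": total_ms,
        "mean_added_ms_per_action": total_ms / actions,
    }
```

Under these assumptions, `added_latency(50, 20, 800)` estimates 10 synchronously evaluated actions and 8 seconds of added wall-clock time for the task, or 160 ms per action on average.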

For detailed latency budgets and cost optimisation strategies, see Cost & Latency.

8. Tool and Integration Supply Chain¶

The integration layer (MCP servers, tool endpoints, agent frameworks, RAG data sources) is the primary attack surface for agentic systems. Cisco's 2025 threat research found that attackers increasingly target the surrounding components that feed information into models rather than the models themselves.

Reported findings and incidents as of 2025–2026: 43% of MCP servers tested had command injection vulnerabilities; CVE-2025-6514, a remote code execution flaw in the mcp-remote package, was rated CVSS 9.6; a fake npm MCP package silently copied emails; and GitHub issue injection via MCP enabled full repository takeover.

This is not an optional extension. For agentic systems, supply chain security is a prerequisite.

Minimum Controls¶

Control	Implementation
Tool provenance verification	Only load MCP servers and tools from verified, signed sources. Validate content hashes before execution.
Tool output sanitisation	All tool responses are untrusted input. Scan for instruction injection, truncate, mark as data. (See §3 above.)
Network-level tool isolation	Tools execute in sandboxed environments with no access to agent memory, credentials, or other tools' state.
Permission scoping per tool	Each tool gets minimum necessary permissions. A file-reading tool cannot write. A search tool cannot execute.
Tool behavior monitoring	Baseline tool response patterns. Alert on response size anomalies, new data types, or instruction-like content.
Dependency scanning	Automated vulnerability scanning for all tool dependencies, including transitive dependencies.
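
The "validate content hashes before execution" part of tool provenance verification can be sketched with the standard library. The pinned manifest and tool name here are hypothetical, and this covers only hash pinning; signature verification of the manifest itself is out of scope for the example.

```python
import hashlib
import hmac

# Hypothetical pinned manifest: tool name -> SHA-256 of the reviewed artifact.
PINNED_SHA256 = {
    "calendar-tool": hashlib.sha256(b"reviewed build v1").hexdigest(),
}


def verify_tool_artifact(name: str, artifact: bytes) -> bool:
    """Return True only if the artifact matches the hash pinned at review time."""
    expected = PINNED_SHA256.get(name)
    if expected is None:
        return False  # unknown tools are never loaded
    actual = hashlib.sha256(artifact).hexdigest()
    return hmac.compare_digest(expected, actual)
```

A loader would call this before importing or launching the tool and refuse to proceed on `False`, so a silently updated package fails closed.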

MCP-Specific Controls¶

Control	Purpose
Server allowlisting	Only approved MCP servers can connect. No dynamic discovery.
Schema validation	Validate all MCP tool definitions against expected schemas before registration.
Capability restriction	Restrict MCP server capabilities to declared scope. A calendar MCP server cannot access the filesystem.
Update verification	MCP server updates require the same review as application code changes.
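
Server allowlisting, schema validation, and capability restriction can be combined at registration time. The sketch below does not use any MCP SDK: tool definitions are plain dictionaries, the `name` and `inputSchema` field names are assumed to follow the shape of MCP tool definitions, and the approved-server table is hypothetical.

```python
from typing import Dict, List

# Hypothetical allowlist: approved servers and the tools each may expose.
APPROVED_SERVERS: Dict[str, dict] = {
    "calendar": {"allowed_tools": {"list_events", "create_event"}},
}


def register_tools(server_name: str, tool_definitions: List[dict]) -> List[str]:
    """Accept a server's tools only if the server and every tool are approved."""
    policy = APPROVED_SERVERS.get(server_name)
    if policy is None:
        raise PermissionError(f"server not allowlisted: {server_name}")
    accepted = []
    for tool in tool_definitions:
        # Minimal schema check: a string name and an object-valued input schema.
        if not isinstance(tool.get("name"), str) or not isinstance(tool.get("inputSchema"), dict):
            raise ValueError("malformed tool definition")
        # Capability restriction: a calendar server may not register other tools.
        if tool["name"] not in policy["allowed_tools"]:
            raise PermissionError(f"tool outside declared scope: {tool['name']}")
        accepted.append(tool["name"])
    return accepted
```

Rejecting the whole registration on the first bad tool, rather than skipping it, keeps a partially compromised server from being loaded at all.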

For the full treatment of MCP as an attack surface, including the AISI 5-level autonomy classification and SLSA-style provenance, see The MCP Problem and Supply Chain Controls.

Judge for Agents¶

Agent interactions need deeper evaluation than single-turn responses. Add these criteria to the Judge:

Additional Criteria	Question
Goal alignment	Did agent pursue stated goal?
Action appropriateness	Were actions proportionate?
Scope adherence	Did agent stay in bounds?
Reasoning quality	Was the reasoning sound?
Efficiency	Did agent take unnecessary steps?
Session coherence	Does the action sequence form a coherent path toward the declared goal?
Trace integrity	Does the reasoning chain support the conclusion without discarded contradictions?

Monitoring¶

Signal	Concern
Action volume spike	Runaway agent
Error rate increase	Agent confused or attacking
Novel action patterns	Unexpected behavior
Scope boundary probes	Attempted breakout
Cost anomalies	Resource abuse

Recovery and Rollback¶

When integrity is compromised, you need to undo the damage.

Capability	Purpose
Action logging	Full audit trail of what agent did (not just said)
Reversibility windows	Delay irreversible actions to allow intervention
Automated rollback	Undo actions when integrity breach detected
Blast radius tracking	Know exactly what was affected

Not all actions are reversible. For those that aren't, require human approval.
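
Action logging and automated rollback can share one structure: each logged action carries its own undo step, or none if it is irreversible. This is a minimal sketch; `ActionLog` and its methods are hypothetical names, and real rollback also has to handle undo steps that fail.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class LoggedAction:
    description: str
    undo: Optional[Callable[[], None]]  # None marks an irreversible action


@dataclass
class ActionLog:
    entries: List[LoggedAction] = field(default_factory=list)

    def record(self, description: str,
               undo: Optional[Callable[[], None]] = None) -> None:
        self.entries.append(LoggedAction(description, undo))

    def rollback(self) -> List[str]:
        """Undo reversible actions newest-first; return those that could not be undone."""
        not_undone = []
        while self.entries:
            entry = self.entries.pop()
            if entry.undo is None:
                not_undone.append(entry.description)
            else:
                entry.undo()
        return not_undone
```

The list returned by `rollback` is the residual blast radius: the actions a human still has to deal with, which is why irreversible actions should have required approval in the first place.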

Key Takeaways¶

Solve both problems - Access control AND integrity preservation
Enforce via infrastructure - Agents can ignore instructions
Validate every action - Independent of agent reasoning
Sanitise tool outputs - They're injection vectors
Use circuit breakers - Hard stops that can't be reasoned around, including session-level aggregate thresholds
Require approval for impact - Irreversible actions need humans
Enable rollback - Assume integrity will sometimes fail
Monitor aggressively - Agents can cause harm fast
Detect aggregate intent - Per-interaction controls miss task decomposition attacks; analyze session-level action sequences
Evaluate before acting - For elevated-risk actions, synchronous pre-action Judge evaluation prevents damage that async detection can only report
Secure the integration layer - Tool supply chain is the primary attack surface; treat it as a prerequisite, not an extension