Output Evaluator: Session-Aware, Pre-Action Evaluation Architecture¶

Closing the gap between per-interaction controls and campaign-level threats.

The Problem This Solves¶

The framework's three-layer pattern (Guardrails → Judge → Human Oversight) operates at the interaction level. Each input is checked. Each output is evaluated. Each action is individually validated.

This architecture has three structural blind spots exposed by 2025–2026 threat intelligence:

Blind Spot	Threat Evidence	Why Per-Interaction Controls Miss It
Task decomposition	Anthropic disclosed AI-orchestrated espionage campaign (80–90% autonomous) decomposed into individually benign sub-tasks; CrowdStrike documented adversary frameworks that split malicious tasks into innocent components	Each sub-task passes guardrails, Judge, and circuit breakers independently. Malicious intent exists only in the aggregate.
Speed asymmetry	CrowdStrike: fastest eCrime breakout at 27 seconds; agents execute tool calls in milliseconds	Async Judge evaluation (500ms–5s + queue) fires after the damage is done. Detection without prevention.
Integration layer as primary attack surface	43% MCP server vulnerability rate; CVE-2025-6514 CVSS 10.0 RCE; fake npm MCP packages; GitHub issue injection → repo takeover	Core controls don't cover the tool supply chain. Attack enters through the tools, not the model.

The Output Evaluator is a solution architecture that addresses the first two blind spots. (The third is addressed by elevating supply chain controls to core; see Agentic Controls §8.)

Architecture Overview¶

The Output Evaluator operates at three timescales simultaneously:

Output Evaluator Timescales

Layer 1: Real-Time Pre-Action Evaluation¶

For agentic systems, the Judge must be able to evaluate before an action executes, not only after.

Action classification (rule-based, <5ms):

Every proposed agent action is classified into a risk category before execution:

Category	Examples	Evaluation Mode
Low	Read-only queries, internal lookups	Execute → async Judge
Medium	Data writes, internal API calls	Execute → async Judge (first occurrence: sync)
Elevated	External API calls, multi-record operations	Sync pre-action Judge → execute on pass
High	Financial transactions, destructive operations	Sync pre-action Judge + human approval → execute
Flagged	Any action when session-level flag is active	Sync pre-action Judge regardless of action type

Pre-action Judge prompt structure:

You are evaluating a proposed agent action BEFORE it executes.

Declared user intent: {original_user_request}
Current session context: {session_summary}
Actions taken so far this session: {action_history}
Proposed action: {action_description}
Action parameters: {parameters}

Evaluate:
1. Does this action align with the declared user intent?
2. Is this action necessary to achieve the stated goal?
3. Does this action combined with previous session actions
   suggest scope beyond the declared intent?
4. Would this action be irreversible or have external impact?

Response: APPROVE / ESCALATE / BLOCK
Reasoning: [brief]

Layer 2: Session-Level Intent Coherence¶

This layer analyzes the aggregate trajectory of a session, not individual actions.

Session state tracking:

For each active session, maintain:

{
  "session_id": "sess_abc123",
  "declared_intent": "Summarise Q4 sales data for the board presentation",
  "start_time": "2026-03-09T14:00:00Z",
  "actions": [
    {"type": "read", "target": "sales_db", "scope": "Q4_2025", "time": "T+0s"},
    {"type": "read", "target": "hr_db", "scope": "employee_list", "time": "T+45s"},
    {"type": "read", "target": "finance_db", "scope": "salary_bands", "time": "T+90s"},
    {"type": "api_call", "target": "email_api", "scope": "draft", "time": "T+120s"}
  ],
  "aggregate_metrics": {
    "distinct_data_sources": 3,
    "distinct_tool_types": 2,
    "data_scope_surface": "sales + HR + finance",
    "intent_coherence_score": 0.3
  }
}

In this example, the declared intent is "summarise Q4 sales data" but the session has accessed HR and finance databases and is drafting an email. The intent coherence score is low. This triggers a checkpoint.

Session checkpoints:

Trigger	Action
Intent coherence score drops below threshold	Pause session, evaluate trajectory
Cumulative data scope exceeds declared intent	Flag for human review
Action sequence matches known decomposition pattern	Escalate immediately
Session duration exceeds baseline for task type	Log and increase monitoring
Periodic checkpoint (every N actions or T minutes)	Lightweight intent coherence check

Intent coherence scoring:

Score how well the session's actual actions align with its declared intent:

Signal	Weight	Measurement
Data source relevance	High	Are accessed data sources relevant to declared intent?
Action type appropriateness	High	Are the types of actions taken (read/write/call) expected for this task?
Scope accumulation	Medium	Is the aggregate data footprint proportionate to the stated goal?
Temporal pattern	Low	Does the pacing match expected workflow for this task type?
Tool diversity	Medium	Is the number of distinct tools used expected for this task?

Layer 3: Campaign-Level Pattern Correlation¶

This layer operates across sessions, detecting patterns that span multiple interactions or identities.

Cross-session signals:

Pattern	Detection Method
Same user, serial sessions, incrementally expanding scope	Track scope surface area across sessions for each identity
Different users, same action sequence	Cluster session trajectories; flag when distinct users follow identical unusual patterns
Low-and-slow data gathering	Aggregate data access volume per identity over rolling windows (24h, 7d, 30d)
Coordinated probing	Detect when multiple agents or users are testing boundaries around the same resources simultaneously

Integration with UEBA:

This layer maps directly to existing User and Entity Behavior Analytics. The agent's NHI (non-human identity) should be enrolled in the same UEBA pipeline as human accounts:

UEBA Dimension	Agent Application
Peer group baseline	Compare agent behavior against other agents with same role/config
Individual baseline	Track each agent's behavioral profile over time
Temporal profile	Expected activity hours, volume patterns
Access profile	Normal data sources, tool usage, API call patterns
Anomaly scoring	Composite risk score updated per action

See Behavioral Anomaly Detection for the full UEBA-to-agent mapping.

Implementation Tiers¶

Tier 1: Action Classification + Known-Bad Sequences¶

Effort: Low-Medium | Value: Blocks obvious decomposition patterns

Rule-based action classifier (<5ms per action)
Sync pre-action check for elevated/high-risk actions
Hardcoded known-bad action sequences (e.g., read_sensitive → call_external_api)
Session-level aggregate thresholds (max data sources, max tool types)

Tier 2: Intent Coherence + Statistical Baselines¶

Effort: Medium | Value: Detects novel decomposition patterns

Everything in Tier 1
Intent coherence scoring per session
Behavioral baselines per agent role (statistical, not ML)
Periodic session trajectory checkpoints
Cross-session scope tracking per identity

Tier 3: LLM-on-Trajectory + ML Anomaly Detection¶

Effort: High | Value: Catches sophisticated, novel campaigns

Everything in Tier 2
Model-as-Judge evaluates full session trajectory at checkpoints and close
ML-based anomaly detection on session trajectories
Cross-session pattern clustering
Peer group deviation analysis
Campaign-level correlation across identities

Cost and Latency Impact¶

Component	Added Latency	Added Cost (at 100K sessions/month)
Action classifier	<5ms per action	Negligible (rule-based)
Sync pre-action Judge	500ms–2s on ~20% of actions	$2K–8K (LLM API calls)
Session state tracking	None (async write)	$200–500 (storage + compute)
Intent coherence scoring	None (async, periodic)	$500–2K (compute)
Session trajectory Judge	None (async, at checkpoint/close)	$1K–5K (LLM API calls)
Cross-session correlation	None (batch)	$500–2K (compute)

Total overhead: $4K–18K/month at 100K sessions. Compare to the cost of a single successful decomposition attack (data breach, unauthorized financial transaction, infrastructure compromise).

What This Does NOT Solve¶

Limitation	Why
Novel attack patterns with normal-looking trajectories	If the decomposed sub-tasks genuinely look like normal work, no behavioral analysis will catch them until post-incident forensics
Single-action catastrophic attacks	If one action causes irreversible harm, session-level analysis is too late; this is what pre-action evaluation and infrastructure-level blast radius limits address
Insider threats with legitimate access patterns	A user doing exactly what they're authorized to do, for unauthorized purposes, produces a normal-looking trace
Evaluation evasion by the agent itself	If the primary model can detect that its trajectory is being evaluated, it may adapt; see Evaluation Integrity Risks

Relationship to Existing Framework Components¶

Component	Relationship
Controls: Guardrails	Output Evaluator does not replace guardrails. Guardrails remain Layer 1 for known-bad patterns.
Controls: Judge	Output Evaluator extends the Judge with synchronous pre-action mode and session-level trajectory evaluation. The async Judge continues to operate for post-action review.
Agentic Controls	§6 (Session-Level Intent Analysis), §7 (Synchronous Pre-Action Evaluation), and §8 (Tool Supply Chain) are the core control specifications. This document provides the solution architecture.
Behavioral Anomaly Detection	Output Evaluator's campaign-level layer uses the UEBA framework described there.
Process-Aware Evaluation	Output Evaluator operationalises process-aware evaluation for the pre-action and session-level layers.
Cost & Latency	Pre-action evaluation adds to the latency budget. Budget accordingly.

Implementation Order¶

Action classifier: Categorize every agent action by risk level (rule-based, fast)
Sync pre-action gate: Block elevated-risk actions until Judge approves
Session state tracking: Record action sequences per session
Known-bad sequence detection: Alert on hardcoded dangerous action combinations
Aggregate thresholds: Session-level circuit breakers (data sources, tool diversity, scope)
Intent coherence scoring: Measure alignment between actions and declared intent
Behavioral baselines: Statistical baselines per agent role for deviation detection
LLM trajectory evaluation: Judge evaluates full session trajectory at checkpoints
Cross-session correlation: Detect patterns spanning multiple sessions
Peer group analysis: Compare agent behavior against fleet baselines