Skip to content

AI Token Economics and MASO

Tokens are not just a pricing unit. They are the resource budget every agent operates within. In multi-agent systems, how that budget is consumed, protected, and wasted determines whether your AI deployment is economically viable.

The Unit of Cost

Every interaction with a language model is priced in tokens. Input tokens cover everything the model reads before responding: system prompts, conversation history, retrieved documents, tool call results, and inter-agent messages. Output tokens cover everything the model writes: responses, tool call parameters, reasoning chains, and inter-agent instructions.

Single-agent systems have predictable token economics. You send a prompt, you get a response, you pay for both. Agentic systems break that predictability in two ways.

First, context accumulates. Each step in an agent workflow adds to the context window: the original request, the plan, the tool results, the errors, the retries. A task that takes ten agent steps has ten times the context overhead of a single-step task, not counting the output tokens that each step generates.

Second, agents spawn agents. In a multi-agent workflow, the output of one agent becomes the input of the next. Agent A's response, which costs output tokens, becomes part of Agent B's context, which costs input tokens. The token cost of a message is paid twice: once when it is generated, once when it is read.

These two dynamics mean that multi-agent AI token costs are non-linear. A three-agent workflow does not cost three times a single-agent workflow. It costs more, because each agent inherits the full context of what came before it.

Where MASO Adds Token Overhead

MASO is a security framework, and security is not free. Every control layer has a token footprint.

Judge Evaluation

Each judge call is itself an LLM inference. The judge reads a structured prompt containing: its own system instructions, the OISpec it is evaluating against, the agent action or output being reviewed, and any relevant context. It produces a verdict and, at higher tiers, a reasoning chain explaining the ruling.

At Tier 3 with cloud judges on 100% of agent actions, the judge can consume more tokens than the agent it evaluates. A complex action requires a detailed evaluation. That evaluation requires a long output. For a 3-agent workflow with tactical, domain, and strategic judges, the evaluation stack can triple the total token consumption of the workflow.

OISpec Injection

Every agent and judge operates against a declared Objective Intent Specification. That specification must be present in the model's context to be effective. Long, verbose OISpecs injected wholesale into every agent call add significant input token overhead, particularly when specifications include extensive examples, edge cases, and constraint lists.

Inter-Agent Message Overhead

The secure inter-agent message bus adds structure to every agent-to-agent communication: signatures, metadata, routing information, and schema-validated payloads. That structure is verbose compared to raw text. In high-frequency multi-agent workflows, message overhead accumulates.

Flight Recorder Retrieval

Agents that need to review prior actions for goal integrity monitoring or context continuity may query the flight recorder. Each query returns structured log entries: action records, judge verdicts, PACE state transitions. That context adds to input token consumption.

Where MASO Saves Tokens

The honest framing is not whether MASO costs tokens. It does. The question is whether it saves more than it costs. In a well-implemented deployment, it usually does.

Loop Prevention

Runaway agent loops are the single largest source of token waste in agentic AI. An agent stuck in a reformulation cycle, retrying a failed tool call, or pursuing a goal it cannot achieve will continue consuming tokens until something stops it. Without MASO's loop detection and iteration caps, that something is often a budget overrun or a system timeout.

MASO's execution controls place hard limits on iterations per task, tool calls per session, and token budgets per agent. An agent that would have made 200 API calls before timing out is stopped at ten. The token saving is proportionate to how bad the loop would have been without the control. For production agentic systems, this is often the largest single cost reduction the framework delivers.

Blast Radius Containment

Without blast radius caps, a single misconfigured or manipulated agent can consume the full token budget of an entire workflow. A prompt injection that causes an agent to enter a reasoning spiral, or an adversarial input designed to maximise output verbosity, can exhaust the budget of a workflow before other agents have a chance to run.

Blast radius caps bound the damage. The token waste is still there, but it is bounded at the agent level rather than the workflow level.

SLM Sidecars: Evaluation Without API Tokens

The most significant token economics decision in a MASO deployment is whether to run judge evaluation through a cloud LLM API or through a locally-deployed distilled SLM.

Cloud judges consume API tokens for every evaluation. At 1M agent actions per month, even a small judge model running at 500 tokens per evaluation consumes 500M tokens, paid at per-token API rates. That cost scales linearly with volume.

A distilled SLM sidecar runs locally. It does not consume API tokens. The evaluation cost is infrastructure rather than consumption: fixed compute for the model, scaling only with concurrency rather than volume. At 1M evaluations per month, the economics flip entirely: the cloud judge approach costs tens of thousands of dollars; the SLM approach costs hundreds.

The critical insight: SLM evaluation is free at the token level. It does not add to your API token bill. The security evaluation layer, which can represent 100% overhead in a cloud-judge deployment, approaches zero marginal token cost with an SLM sidecar. See Distilling the Judge into a Small Language Model for the full architecture.

Mandate Specificity Reduces Agent Verbosity

Agents operating against vague instructions produce exploratory, hedged, verbose outputs. An agent that knows it should "handle customer requests" has no basis for concision. It reasons about what the request might mean, hedges against multiple interpretations, and produces long outputs that cover every possibility.

An agent operating against a specific OISpec knows exactly what it should do. The narrower the mandate, the shorter the output needed to satisfy it. A well-specified OISpec does not just improve security evaluation quality. It improves token efficiency across the board, because agents that know what they are doing produce tighter outputs.

FDoS Prevention

Adversarial token consumption is a real threat class. An attacker who can craft inputs that cause an agent to produce maximum-length outputs, trigger reasoning spirals, or enter retry loops can inflict economic harm without exfiltrating data or compromising systems. This is financial denial-of-service through token exhaustion.

MASO's input guardrails screen for characteristics associated with verbose-injection patterns before requests reach the model. The token cost of a blocked request is the guardrail evaluation. The token cost of an unblocked verbose injection is orders of magnitude higher. The economics of prevention are strongly favourable.

Risk as the Evaluation Gate

Token spend on evaluation should track risk, not request volume. This is the structural principle that separates disciplined MASO from naive MASO, and it has direct consequences for token economics.

Every agent action carries a risk classification at runtime. That classification is a function of three things: the consequence if the action is wrong, whether the action is reversible, and the authority level the agent is exercising at this step. A read operation against a public knowledge base is low risk regardless of which agent performs it. A write operation that modifies access controls is high risk regardless of how confident the agent appears.

The risk classification is not a property of the agent. It is a property of the action. The same agent can perform low-risk and high-risk actions within a single session, and each should be evaluated accordingly.

The Routing Decision

Risk classification routes each action to an evaluation path. The paths differ in token cost by orders of magnitude.

Action Risk Evaluation Path Evaluation Token Cost Notes
Low: read-only, no external state change, reversible Guardrails only, async SLM sample Near zero Guardrails are rule-based. Async SLM sample adds no latency or API tokens.
Medium: writes to internal state, reversible, limited scope SLM inline evaluation Near zero (local inference) SLM runs as a sidecar. No API token consumption. Adds 10-50ms latency.
High: external writes, difficult to reverse, broader scope SLM inline + cloud judge API tokens for cloud judge call Cloud judge is synchronous. Adds 500ms-2s latency. Use a small model unless the action demands it.
Critical: irreversible, high blast radius, regulatory consequence Synchronous cloud judge + human approval gate Highest API token cost Reserve this path. Every action routed here is expensive and slow by design.

The economic logic: most actions in a well-designed agentic workflow are low-to-medium risk. They should consume no cloud judge tokens at all. The cloud judge is reserved for the minority of actions where the consequence of a wrong call justifies the cost.

Why Naive MASO Violates This

Naive MASO applies the same evaluation intensity to every action. A read operation and an irreversible payment are evaluated by the same cloud judge, at the same cost. The result is that low-risk actions, which represent the bulk of volume in most workflows, consume the bulk of the evaluation budget.

The problem is not using a cloud judge. The problem is using it indiscriminately. A cloud judge on a read-only lookup is not more secure than guardrails plus an SLM sample. It is just more expensive.

Why Action Risk Classification Is Not Optional

If risk classification is absent, the system has two choices: evaluate everything at the highest tier (expensive, slow) or evaluate everything at the lowest tier (cheap, inadequate). Neither is correct. Risk classification is what makes proportionate evaluation possible, and proportionate evaluation is what makes MASO economically viable.

This is also why the Objective Intent and Execution Control domains exist. OISpecs declare not just what an agent should do but the risk profile of each action type it may take. Execution controls encode that risk profile into the routing logic. Without declared intent, there is no basis for classification. Without classification, there is no basis for routing. Without routing, every action costs as much as the worst action.

What the Routing Logic Looks Like

# Risk-gated evaluation routing
def route_evaluation(action, oispec):
    risk = classify_action_risk(action, oispec)

    # Guardrails run on everything, always
    guardrail_result = guardrails.evaluate(action)
    if guardrail_result.blocked:
        return block(action, guardrail_result)

    if risk == RiskLevel.LOW:
        # Async SLM sample only: no synchronous judge tokens
        async_slm_sample(action, sample_rate=0.05)
        return allow(action)

    if risk == RiskLevel.MEDIUM:
        # SLM inline: local inference, no API tokens
        slm_result = slm_sidecar.evaluate(action, oispec)
        if slm_result.flagged:
            return escalate(action, slm_result)
        return allow(action)

    if risk == RiskLevel.HIGH:
        # SLM inline first; cloud judge if SLM flags or action warrants it
        slm_result = slm_sidecar.evaluate(action, oispec)
        judge_result = cloud_judge.evaluate(action, oispec)
        if slm_result.flagged or judge_result.flagged:
            return escalate(action, judge_result)
        return allow(action)

    if risk == RiskLevel.CRITICAL:
        # Cloud judge synchronous + human gate
        judge_result = cloud_judge.evaluate(action, oispec)
        return human_approval_gate(action, judge_result)

The cloud judge is called for HIGH and CRITICAL risk actions only. For a typical workflow where 70% of actions are LOW or MEDIUM risk, this immediately eliminates cloud judge token consumption for the majority of the volume.

Classify at the action type level, not the session level

Risk classification should apply to each action individually, not to a session or agent as a whole. A session that starts with low-risk reads and escalates to high-risk writes should escalate its evaluation intensity at the point the writes begin, not retroactively from the start of the session. This keeps low-risk phases cheap and reserves evaluation cost for the actions that warrant it.

The Compounding Factor

The relationship between MASO implementation quality and token economics is not linear. Naive MASO makes token costs significantly worse. Disciplined MASO keeps costs manageable while providing genuine security.

Naive MASO: What It Looks Like

A naive MASO deployment applies controls without optimising their token footprint:

  • Cloud judge on 100% of agent actions at every boundary
  • Full OISpec injected into every context window, including sections irrelevant to the current action
  • Strategic evaluator running per action rather than per phase boundary
  • Multiple domain judges running sequentially, each receiving full context
  • No SLM distillation
  • No loop detection or iteration caps

The result: for a 3-agent workflow processing 1M actions per month, the evaluation stack alone (tactical, domain, and strategic judges, all cloud) can cost $30K-$150K per month. The generator cost for the same workflow might be $10K-$30K. Security overhead exceeds generator cost by 3-5x.

This is not MASO being expensive. This is MASO being applied without regard for token economics. The controls are real. The waste is also real.

Disciplined MASO: What It Looks Like

A disciplined MASO deployment achieves the same security posture with substantially lower token cost:

  • SLM sidecar for inline tactical evaluation (zero API tokens per evaluation)
  • OISpec summary injected at runtime (50-100 tokens vs. 500-2,000 for full spec)
  • Domain judges consolidated into a single multi-criteria SLM evaluation call
  • Strategic evaluator triggered at phase boundaries (100K phases vs. 3M actions for the same workflow)
  • Cloud judge at 1% sample for calibration and drift detection only
  • Loop detection terminating runaway tasks at iteration 10 rather than 200
  • Blast radius caps preventing any single agent from consuming more than its allocated share

The result: for the same 3-agent workflow at 1M actions per month, the evaluation stack costs $3K-$5K per month. Generator cost is unchanged. Security overhead is 10-15% of generator cost rather than 300-500%.

The Numbers Side by Side

Approach Generator cost (1M actions/month) Evaluation overhead Total
No security controls $10K-30K $0 $10K-30K
Naive MASO (cloud judge, 100%) $10K-30K $30K-150K $40K-180K
Disciplined MASO (SLM + sampling) $10K-30K $3K-5K $13K-35K

Naive MASO is harder to justify to finance than no controls at all, because it makes the cost case against security. Disciplined MASO adds 10-15% overhead, which is a reasonable security cost that most organisations will accept.

Reasoning Tokens

Reasoning models introduce a token category that does not exist in standard LLMs. Before producing a response, models like Claude with extended thinking, OpenAI o3/o4, and Gemini with thinking spend tokens on an internal reasoning chain. These reasoning tokens are charged at the same rate as regular tokens, sometimes at a premium, and they can dwarf the cost of the final output.

A standard generation request might produce 500 output tokens. The same request sent to a reasoning model might consume 8,000 reasoning tokens before producing those 500 output tokens. The reasoning tokens are the real cost. The visible output is the smaller number.

This creates two problems for MASO.

The Visibility Problem

Reasoning tokens are often partially or fully hidden. The model thinks, then responds. The thinking process is not always exposed in the API response, and when it is, it may be summarised rather than verbatim. This means the judge evaluating a reasoning model's output is evaluating the conclusion, not the process that produced it.

This matters because a reasoning model can arrive at a correct-looking output through a problematic reasoning path. An agent that was prompted to reason carefully about a task might reason its way around a constraint: "the OISpec says not to access the payments database directly, but I can achieve the same outcome by retrieving the data through the customer service API which has access to payment records." The output action might not immediately look like a policy violation. The reasoning chain that produced it definitely does.

Without visibility into reasoning tokens, the judge cannot evaluate means compliance. It can only evaluate output compliance. The Anti-Mythos judge is designed to address exactly this, but it depends on having access to the reasoning chain, which provider-level reasoning token visibility settings must be configured to expose.

The Cost Amplification Problem

Reasoning token costs compound in multi-agent workflows in the same way output tokens do, but at a higher base rate. If each agent in a 3-agent workflow uses extended thinking, and each thinking chain averages 5,000 tokens, the reasoning token bill for 1M workflow executions is 15 billion reasoning tokens before a single output token is counted.

The compounding effect is more severe than for standard tokens because reasoning is not easily sampled. You cannot decide that only 10% of agent steps need to reason carefully. If the task requires reasoning, all steps do. Sampling reasoning usage is a meaningful quality trade-off, not just an economics lever.

MASO Controls for Reasoning Models

Concern Control
Reasoning chain not visible to judge Configure provider reasoning visibility to expose chain-of-thought. Require reasoning token exposure as a deployment prerequisite for Tier 2+.
Reasoning path circumvents constraints Anti-Mythos judge evaluates means compliance at the reasoning level, not just output level. Flagging constraint workarounds in reasoning chains is a distinct evaluation criterion.
Reasoning tokens inflate cost unpredictably Set budget_tokens limits on extended thinking where the API supports it. Treat reasoning token consumption as a monitored metric with its own alert thresholds.
Judge itself uses extended thinking Reasoning judges are more accurate but dramatically more expensive. Use reasoning-mode judges for CRITICAL risk actions only. Use standard judges for HIGH and below.
Long reasoning chains displace OISpec from context Inject OISpec after rather than before the reasoning section where possible. Reasoning content in context occupies middle positions; OISpec at the end receives stronger attention.

Budget tokens, not just max tokens

Most reasoning model APIs expose a budget_tokens or thinking.budget_tokens parameter that caps the internal reasoning chain length separately from the output length. Use it. An uncapped reasoning model will spend as many tokens as it decides the task warrants, which on complex agent tasks can be an order of magnitude more than expected. Setting a reasoning budget is not a quality cut: research shows that overly long reasoning chains can reduce accuracy through overthinking. Set a budget appropriate to the task complexity.

Prompt Caching

Prompt caching is the most commonly overlooked token cost optimisation in MASO deployments, and the returns are substantial. Most major providers offer caching mechanisms that allow repeated context, system prompts, and large documents to be served from cache rather than re-processed on every request. Cached tokens are typically charged at 10-25% of standard input token rates, or free for retrieval after an initial write cost.

In a MASO deployment, the cache hit candidates are everywhere: OISpecs, judge system prompts, solution mandates, agent mandates, tool schemas, and reference documents. These are long, stable, and repeated on every call. Without caching, every agent invocation re-sends and re-processes the same thousands of tokens of boilerplate context.

What Qualifies for Caching

Content Size Cache Candidate Notes
Agent system prompt 500-2,000 tokens Strong Stable across all requests for that agent
Judge system prompt 1,000-3,000 tokens Strong Same judge prompt used for all evaluations of that type
OISpec (full) 500-5,000 tokens Strong Changes rarely; version-controlled
Solution mandate 300-1,500 tokens Strong Stable per deployment
Tool schemas 200-2,000 tokens Strong Defined at deploy time; rarely changes
RAG retrieved documents Variable Conditional Cacheable if the same document is retrieved across multiple requests
Conversation history Variable Weak Changes every turn; low cache hit rate
Current action / user input Variable None Always unique

Cache Economics in a MASO Context

For a single agent with a 2,000-token system prompt and a 1,500-token OISpec, every uncached invocation pays for 3,500 input tokens of boilerplate. At 1M invocations per month, that is 3.5 billion boilerplate input tokens.

With caching, the first request writes the cache. Every subsequent request retrieves it at 10-25% cost. For a deployment that was previously spending $35,000/month on boilerplate input tokens, caching reduces that line item to $3,500-$8,750. The optimisation requires no architectural changes, no SLM distillation, and no change to evaluation logic.

For judge invocations, the savings are larger. A judge system prompt is longer than an agent system prompt (it includes evaluation criteria, OISpec, scoring rubrics, and output format instructions), it is identical for every evaluation of the same action type, and it is called at high frequency. Without caching, the judge's boilerplate context is the most expensive repeated token cost in the stack.

Cache Interaction with MASO Controls

Prompt caching is not entirely free of security considerations. There are two interactions worth knowing.

OISpec version drift. If a cached OISpec becomes stale while the live OISpec has been updated, agents and judges will operate against the old version until the cache expires. For Tier 2 and above, OISpec updates should explicitly invalidate the relevant cache entries rather than waiting for TTL expiry. Treat OISpec version mismatches as a control failure, not a minor inconsistency.

Cache poisoning. Cached prompts that include dynamic content (user-provided strings that were included in a system prompt and cached) create a path for injected content to persist across sessions. A user who successfully injects content into a cacheable prompt segment can have that injection served to subsequent users who receive the poisoned cache entry. The mitigation is structural: cache only the static sections of prompts, never the dynamic sections. The split should be explicit in your prompt architecture.

Cache Security Rule Why
Cache only static prompt sections Dynamic content in cache is a persistence vector for injection
Invalidate cache on OISpec update, not on TTL Stale OISpecs mean judges evaluate against outdated criteria
Verify cache hit content matches current OISpec version Cache hit confirmation should include a version check, not just a hit/miss flag
Log cache misses as observability events Unexpected cache misses may indicate cache invalidation by an attacker

Cache Hit Rate as a Health Metric

In a well-architected MASO deployment, prompt cache hit rates for system prompts and OISpecs should be above 90%. A hit rate significantly below that indicates either that the cache is misconfigured, that OISpecs are being modified too frequently, or that prompt structure is including dynamic content in cacheable sections.

Cache hit rate belongs on the same dashboard as evaluation token ratio and loop amplification factor. A drop in cache hit rate is a signal: something about the prompt structure changed, and that change may be intentional (a deployment update) or unintentional (a prompt injection that modified a previously-stable section).

Token Ratios Worth Tracking

Standard cost dashboards track spend. Token economics requires tracking ratios, because spend alone does not reveal whether the token budget is being used well or wasted.

Evaluation token ratio: Judge tokens consumed divided by generator tokens consumed. With a well-implemented SLM sidecar, this ratio can approach zero for API billing purposes. With a naive cloud judge on 100% of requests, it can exceed 3.0. Target below 0.2 for API-billable evaluation.

Loop amplification factor: Actual agent iterations per task divided by the expected iterations for a correctly-functioning agent. A factor of 1.0 means no unnecessary iterations. A factor above 3.0 signals a loop problem that is costing real money.

Context bloat coefficient: Actual input tokens per agent call divided by the minimum input tokens necessary to complete the task. A coefficient of 1.0 means perfect efficiency. Coefficients above 2.0 often indicate OISpec injection of irrelevant content, accumulated context that could be summarised, or tool results being passed in full when summaries would suffice.

Security token efficiency: The cost of security controls divided by the number of confirmed policy violations detected. A security layer that catches no violations at high token cost should be reviewed. A security layer that catches frequent violations at low token cost is earning its keep.

Reasoning token ratio: Reasoning tokens consumed divided by output tokens produced. A ratio above 20:1 signals that agents are spending far more on thinking than on doing, which may indicate overly complex tasks, insufficient constraint specificity, or uncapped extended thinking. Set a baseline per task type and alert on deviations.

Prompt cache hit rate: Cache hits as a percentage of cacheable requests. Should be above 90% for system prompts, OISpecs, and tool schemas in a stable deployment. Rates below 70% indicate prompt architecture problems or excessive OISpec churn. A sudden drop may indicate prompt structure modification.

Token Exhaustion and MASO Performance Degradation

Token exhaustion is not just an economic problem. It is a quality problem, and for MASO specifically, it is a security problem. When a model approaches or hits token limits, its outputs degrade in ways that are predictable, measurable, and often invisible to the system around it.

There are two distinct failure modes: context window exhaustion, where the model's context is full and it can no longer attend to all relevant information, and budget exhaustion, where a hard token cap terminates inference mid-task. Both matter. They cause different failures.

Context Window Exhaustion

Language models do not process context uniformly. Attention degrades as context grows: content near the beginning and end of a context window receives more reliable attention than content in the middle. This is the "lost in the middle" effect, documented in research by Liu et al. (2023) and replicated across model families. For short contexts it is negligible. For long agentic sessions, it becomes a primary driver of quality degradation.

The practical consequences for agentic workflows:

OISpec displacement. The OISpec is typically injected at the start of an agent's context. In a long session, it moves progressively further from the active attention window. An agent that reliably followed its mandate at step 5 may begin to drift from it at step 50, not because the agent was compromised but because the mandate has been displaced from effective attention. The agent still "knows" the OISpec in the sense that the tokens are present, but the model no longer weighs them heavily when generating the next action.

Hallucination rate increase. As context fills, models increasingly generate content from training knowledge rather than from the grounded context. Early tool call results, earlier agent reasoning, and prior conversation history fade from effective attention. The model fills the gaps with plausible-sounding content. In agentic workflows where downstream agents treat prior agent output as ground truth, hallucinations compounding across a long session are one of the most dangerous failure modes MASO must detect.

Instruction following degradation. Instructions embedded mid-context, including constraint updates, PACE state transitions, and mid-session mandate revisions, lose reliability as the context grows. A constraint injected at position 80K in a 128K context window is less reliably followed than the same constraint injected at position 5K.

Budget Exhaustion

Budget exhaustion is different. The model does not degrade gradually: inference stops when the token cap is reached. The output is truncated at a semantically arbitrary point, producing incomplete reasoning chains, partial tool call parameters, or cut-off responses.

For a generator agent, a truncated response is a bad user experience. For a judge, it is a security failure. A judge that reaches its token limit mid-verdict may have produced a partial approval where a full evaluation would have flagged the action. The fact that the verdict was cut off rather than deliberately permissive does not change the outcome: the action proceeds without complete evaluation.

Budget exhaustion in judges is particularly hazardous because the truncation is often invisible. The judge produces output up to the limit. If the verdict fields are populated before the reasoning chain completes, the system may read the verdict as valid. The reasoning that would have supported a different conclusion never materialises.

How Token Exhaustion Specifically Degrades MASO

MASO Component Effect of Context Exhaustion Effect of Budget Exhaustion
Judge evaluation OISpec displaced from active attention; evaluation quality degrades as context grows; verdicts drift from declared criteria Verdict truncated mid-reasoning; partial approvals recorded as valid; complete evaluation never produced
Guardrails (ML-based) Attention-based classifiers degrade on long inputs; patterns near the start of long inputs may be missed Hard token limits prevent evaluation of long inputs entirely; long inputs may bypass checks
Goal integrity monitoring Early goal state displaced from attention; drift from original objective becomes harder to detect Monitoring output incomplete; partial results misread as full assessments
OISpec adherence Constraint compliance degrades as OISpec moves out of active attention range If OISpec is injected near the token cap, it may be truncated before reaching the model
Strategic evaluator Long workflow histories exceed evaluator context; early phase outcomes missed Multi-phase summary cut off before all phases assessed
Inter-agent communication Agent A's output degraded by context pressure before being passed to Agent B; degraded content propagates downstream Truncated inter-agent messages arrive at Agent B malformed or incomplete

The PACE Implication

Context window pressure and budget exhaustion are both failure modes that should trigger PACE escalation. An agent that is approaching context limits is not operating normally: its instruction-following reliability is degraded, its hallucination rate is elevated, and its judge is evaluating against increasingly displaced criteria.

MASO treats this as a Contingency trigger, not a routine condition. The indicators:

  • Context fill rate above 70%: Increase evaluation sampling rate. Begin context summarisation.
  • Context fill rate above 85%: Halt new task intake for this agent. Complete current task, then reset context with summarised history.
  • Context fill rate above 95%: Trigger PACE Alternate. Route new requests to a fresh agent instance. Preserve full context snapshot for forensic review.
  • Judge budget exhaustion detected: Flag the evaluation as incomplete. Do not record the partial verdict as a valid approval. Escalate the action to human review.

The monitoring requirement is specific: the framework needs visibility into context utilisation as a runtime metric, not just token spend. Context fill percentage should be a first-class observable alongside cost and latency.

Mitigations

Context summarisation at boundaries. At natural phase boundaries, summarise accumulated context into a compact representation and replace the full history with the summary. The summary preserves facts and outcomes while freeing context space. The risk: summarisation itself can lose nuance, particularly for constraint details and prior judge verdicts. Summarise conservatively, and always retain the full OISpec and current mandate rather than summarising them.

Judge context isolation. The judge should not inherit the full agent context window. It should receive a structured, minimal evaluation prompt: the action being assessed, the relevant OISpec constraints for this action type, and any directly relevant prior context. Passing the full 80K-token agent session to a judge is wasteful and counterproductive. The judge needs focused context, not complete history.

Sliding window evaluation. For long-running agents, maintain a sliding evaluation window that always contains the most recent N actions plus the OISpec, rather than accumulating the full session history. The trade-off is loss of long-range pattern detection, which is why the strategic evaluator running against full session summaries at phase boundaries is a complement rather than a replacement.

Generous judge token budgets. Judge token budgets should be set significantly above the expected evaluation length, not at the expected length. A judge budget set exactly at the average evaluation length will regularly be hit for above-average cases. Set judge budgets at the 99th percentile of expected evaluation length. The marginal cost of unused budget capacity is zero; the cost of a truncated verdict is a security failure.

Explicit context budget monitoring. Treat context utilisation as a monitored metric. Set PACE thresholds on context fill rate the same way you set them on error rates and cost rates. An agent silently drifting toward context exhaustion without a PACE trigger is an unmonitored failure mode.

A degraded judge is worse than no judge

A judge that is producing unreliable verdicts due to context pressure may provide false assurance. An action that a fresh-context judge would have flagged may pass through a context-exhausted judge with an approval. The system records a positive evaluation. The action proceeds. The failure is invisible. Monitoring context fill rate and treating high fill rate as a quality degradation signal, not just a cost signal, is essential at Tier 2 and above.

Practical Recommendations

Distill the judge before you scale. The token economics of cloud judge evaluation are manageable at low volumes and unsustainable at high volumes. If you are evaluating more than 5% of requests synchronously and processing more than 100K requests per month, model the SLM sidecar. The break-even is typically around 50,000 evaluations per month. Above that, local inference is almost always cheaper. See Cost & Latency for the detailed cost model.

Summarise, don't inject in full. OISpecs are detailed documents. They do not need to be present in their entirety for every agent call. Extract the constraints and evaluation criteria relevant to the current action type and inject only those. A targeted 80-token constraint summary achieves better evaluation focus than a 2,000-token full specification, and costs 96% fewer input tokens.

Move strategic evaluation to phase boundaries. The strategic evaluator assesses combined workflow outputs against the solution mandate. It does not need to run on every agent action. Trigger it at natural phase boundaries: end of research phase, end of drafting phase, before final output. This reduces strategic evaluation calls by 10-100x without reducing coverage of the cases that matter.

Instrument loop detection before anything else. Loop detection is the highest-ROI token governance control in agentic AI. It is cheap to implement, immediate in effect, and addresses the largest single source of runaway token consumption. Implement it before any other optimisation.

Treat the token budget as a first-class agent input. Google's Budget Tracker research demonstrated that agents given explicit budget awareness make more efficient decisions. Rather than enforcing budget limits as external constraints applied after the fact, pass remaining token budget as a runtime variable that agents can observe and reason about. Agents that know they are running low on budget shift behaviour accordingly, resolving tasks with fewer iterations. See Economic Governance for the full treatment.

Enable prompt caching for all static context before optimising anything else. OISpecs, agent system prompts, judge system prompts, and tool schemas are long, stable, and called at high frequency. Caching them requires no architectural changes and typically cuts input token costs by 30-60% for well-structured MASO deployments. Do this first, before SLM distillation, before sampling rate tuning, before anything else. The return is immediate and the implementation risk is low.

Cap reasoning budgets explicitly. If any agent or judge in your deployment uses a reasoning model, set budget_tokens (or the equivalent parameter for your provider) on every call. An uncapped reasoning model on a complex agentic task can consume 20,000-50,000 reasoning tokens for a single action. That is not a pathological case — it is what reasoning models do when given room to think. Cap reasoning budgets at the 95th percentile of what the task actually needs, verify that quality holds at that cap, then enforce it in production.

Expose reasoning chains to the judge. If agents use extended thinking, configure provider settings to expose the reasoning trace. A judge that can only evaluate the output of a reasoning model cannot assess means compliance. The reasoning chain is where constraint workarounds, policy evasions, and goal substitutions are visible. Without it, the judge is evaluating the surface of the decision, not the decision itself.

Separate security token costs from generator token costs in your dashboards. When token costs are reported as a single number, cost pressure lands on whatever is easiest to cut. Security controls are easy to cut and hard to justify in the abstract. When security token costs are reported separately, alongside the policy violations they detect and the incidents they prevent, the ROI becomes visible. This is not just good accounting. It is how security controls survive budget pressure.

Never cut security controls to meet a token budget

If the token budget does not support the required evaluation intensity for the risk tier, the correct response is to reduce the system's scope or autonomy, not to weaken controls. A Tier 3 system running without adequate evaluation is not a Tier 3 system. It is a Tier 1 system with Tier 3 consequences. See Economic Governance for the governance decision framework.

References