Cost and Latency¶
The three-layer pattern is not free. Budget for it.
The Problem¶
Each layer adds cost and latency:
| Layer | Latency Added | Cost Per Request | At 1M requests/month |
|---|---|---|---|
| Guardrails (rule-based) | 5–20ms | ~$0 (compute only) | Negligible |
| Guardrails (ML classifier) | 20–100ms | $0.001–0.005 | $1K–5K |
| Distilled SLM (local model) | 10–50ms | ~$0 (compute only) | Infrastructure cost only |
| Judge (LLM evaluation) | 500ms–5s | $0.01–0.05 | $10K–50K |
| Human Oversight (per review) | Minutes–hours | $5–50 per review | Depends on sample rate |
For a Tier 3 system running the full pattern on every request, the Judge alone can cost more than the generator. A distilled SLM can eliminate that cost for routine screening while maintaining 100% inline coverage.
Sampling Strategies¶
You don't have to judge every request. Match evaluation density to risk.
By Risk Tier¶
| Risk Tier | Guardrails | Judge | Human Review |
|---|---|---|---|
| Tier 1 (Low) | 100% of requests | 5–10% sample | 1% or anomaly-triggered |
| Tier 2 (Medium) | 100% of requests | 25–50% sample | 5% + all judge flags |
| Tier 3 (High) | 100% of requests | 100% of requests | 10% + all judge flags |
Adaptive Sampling¶
Increase judge evaluation rate when signals indicate elevated risk:
| Trigger | Sampling Adjustment |
|---|---|
| Guardrail block rate above baseline | Increase judge rate by 2x |
| New user (first 50 requests) | Judge 100% |
| After-hours usage (if unusual for your environment) | Increase judge rate by 2x |
| Prompt attack detected | Judge 100% for that user for 24 hours |
| Model provider change notification | Judge 100% for 48 hours |
Stratified Sampling¶
Not all requests carry equal risk. Sample by category:
| Request Type | Judge Rate | Rationale |
|---|---|---|
| FAQ / simple lookup | 5% | Low risk, repetitive |
| Creative generation | 25% | More variable, higher guardrail miss rate |
| Data analysis / summarisation | 50% | Accesses user data, exfiltration risk |
| Decision support | 100% | Consequential output |
| Actions / tool use | 100% | Real-world impact |
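Taken together, the stratified base rates and the adaptive triggers above amount to a small sampling policy. A minimal sketch in Python, where the category keys, trigger flags, and the `judge_sample_rate` helper are all illustrative names invented for this example:

```python
import random

# Base judge rates from the stratified table above; category keys are illustrative.
BASE_RATES = {
    "faq": 0.05,
    "creative": 0.25,
    "data_analysis": 0.50,
    "decision_support": 1.00,
    "tool_use": 1.00,
}

def judge_sample_rate(category: str, *, guardrail_spike: bool = False,
                      new_user: bool = False, attack_flagged: bool = False) -> float:
    """Combine the stratified base rate with the adaptive triggers."""
    if new_user or attack_flagged:
        return 1.0                      # judge 100% while risk is elevated
    rate = BASE_RATES[category]
    if guardrail_spike:
        rate = min(1.0, rate * 2)       # "increase judge rate by 2x"
    return rate

def should_judge(category: str, **signals) -> bool:
    """Bernoulli draw against the computed rate."""
    return random.random() < judge_sample_rate(category, **signals)
```

A tier-based policy looks the same with the risk-tier table's percentages as the base rates.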
Latency Budgets¶
Design your latency budget before adding controls.
Example: Customer-Facing Chat (Tier 2, Streaming)¶
| Component | Budget | Actual |
|---|---|---|
| Input guardrails | 20ms | 15ms (rule-based) |
| LLM generation (first token) | 500ms | 400ms |
| Buffer evaluation (per chunk) | 50ms | 30ms (rule-based) |
| Total to first visible token | 570ms | 445ms |
| Post-stream judge evaluation | N/A (async) | 2s |
Example: Document Processing (Tier 3, Non-Streaming)¶
| Component | Budget | Actual |
|---|---|---|
| Input guardrails | 100ms | 50ms |
| LLM generation (complete) | 10s | 8s |
| Output guardrails | 100ms | 60ms |
| Judge evaluation | 5s | 3s |
| Total before delivery | 15.2s | 11.1s |
Example: Agentic Tool Calls (Tier 3, SLM Sidecar)¶
For agentic systems where a distilled SLM screens every action inline:
| Component | Budget | Actual |
|---|---|---|
| Input guardrails | 20ms | 15ms |
| LLM generation (tool call) | 500ms | 400ms |
| SLM sidecar evaluation | 50ms | 25ms |
| Tool execution | 200ms | 150ms |
| Total per action | 770ms | 590ms |
| Large Judge (async, 1% sample) | N/A | 2s |
The SLM adds negligible latency compared to a cloud Judge call, making 100% inline evaluation feasible without breaking the latency budget.
What Breaks the Budget¶
| Problem | Cause | Mitigation |
|---|---|---|
| Judge adds 5s to every request | Using large model for judge | Use smaller model (Haiku-class) for routine evaluation, or distill into an SLM for sub-50ms inline checks |
| Guardrail latency spikes | ML classifier cold start | Pre-warm classifiers, use rule-based for latency-critical path |
| Multiple judge calls per request | Evaluating multiple dimensions separately | Batch evaluations into a single prompt |
| Human review blocks delivery | Synchronous human review on all flags | Async review for medium flags; synchronous only for high/critical |
| SLM cold start on first request | Model not loaded into memory | Pre-load the SLM at pod/process start, keep it resident |
Cost Optimisation¶
Judge Model Selection¶
| Judge Model Tier | Cost (per 1K eval tokens) | Accuracy | When to Use |
|---|---|---|---|
| Small (Haiku, GPT-4o-mini) | ~$0.001 | 80–85% | Tier 1, high-volume screening |
| Medium (Sonnet, GPT-4o) | ~$0.01 | 88–93% | Tier 2, balanced cost/accuracy |
| Large (Opus, GPT-4) | ~$0.05 | 93–97% | Tier 3, consequential decisions |
Tiered Evaluation¶
Run cheap evaluation first; escalate to expensive evaluation only when needed:
```text
Request → Rule-based guardrails (free, fast)
              ↓ (passed)
Request → Small model judge (cheap, fast)
              ↓ (flagged or uncertain)
Request → Large model judge (expensive, accurate)
              ↓ (flagged)
Request → Human review (most expensive)
```
This reduces cost by 60–80% compared to running the large model on everything.
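The cascade above can be sketched as a short escalation function. This is a minimal illustration, assuming each layer is injected as a callable; the names and return conventions are invented for the example:

```python
def tiered_evaluate(request, rule_check, small_judge, large_judge, human_queue):
    """Escalate to the next (more expensive) layer only when needed."""
    if not rule_check(request):                 # rule-based guardrails: free, fast
        return "blocked_by_guardrail"
    verdict, confident = small_judge(request)   # cheap model: most traffic stops here
    if verdict == "pass" and confident:
        return "passed"
    if large_judge(request) == "pass":          # expensive model: flagged/uncertain only
        return "passed"
    human_queue.append(request)                 # most expensive layer: human review
    return "escalated_to_human"
```

The cost saving comes from the first two branches: in a healthy system, the large model sees only the small judge's flagged or uncertain residue.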
For even greater savings, consider distilling the large Judge into a Small Language Model that runs locally.
SLM Cost Profile¶
A distilled SLM deployed as a sidecar fundamentally changes the cost model. Instead of paying per-token API costs for every evaluation, you pay a fixed infrastructure cost regardless of volume.
| Cost Component | Cloud Judge (API) | Distilled SLM (Local) |
|---|---|---|
| Per-evaluation cost | $0.01–0.05 | ~$0 (compute only) |
| At 1M evaluations/month | $10,000–$50,000 | Infrastructure only |
| Infrastructure | None (API) | ~$50–200/month per node (CPU, 1GB RAM for INT4 model) |
| Teacher verification (1% sample) | N/A | $100–500/month |
| Initial distillation | N/A | One-time: $500–2,000 (Teacher labelling + training compute) |
| Retraining (monthly) | N/A | $200–500 per cycle |
Break-even point: At roughly 50,000 evaluations per month, the SLM approach becomes cheaper than even the smallest cloud Judge model. Above 500,000 evaluations per month, the savings are substantial.
| Monthly Evaluations | Cloud Judge (Small Model) | SLM + 1% Teacher Verification |
|---|---|---|
| 100K | $100–500 | ~$250–400 (fixed) |
| 500K | $500–2,500 | ~$300–500 (fixed) |
| 1M | $1,000–5,000 | ~$350–700 (fixed) |
| 10M | $10,000–50,000 | ~$500–1,000 (fixed) |
The SLM cost stays nearly flat as volume grows because the marginal cost per evaluation is compute only. The cloud Judge cost scales linearly with volume.
**When an SLM makes financial sense**
If you need to evaluate more than 5% of requests with a Judge, and your volume exceeds 100K requests per month, model the cost of an SLM sidecar. For agentic systems that require 100% inline evaluation, the SLM is almost always the cheaper option at scale. See Distilling the Judge into a Small Language Model for the full architecture.
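The break-even arithmetic behind the comparison table is simple: the SLM wins once its fixed monthly cost, divided by the cloud judge's per-evaluation price, falls below your monthly volume. A sketch, using the upper-end small-judge price from the table ($0.005/eval) and an assumed ~$250/month SLM floor:

```python
def breakeven_volume(slm_fixed_monthly: float, cloud_cost_per_eval: float) -> float:
    """Monthly evaluation volume above which the local SLM is cheaper."""
    return slm_fixed_monthly / cloud_cost_per_eval

# ~$250/month SLM floor vs. $0.005/eval small cloud judge -> ~50,000 evals/month,
# matching the break-even point stated above.
```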
Caching¶
Judge evaluations on identical or near-identical inputs can be cached:
| Cache Type | Hit Rate | Risk |
|---|---|---|
| Exact match (same input hash) | Low (5–10%) | None |
| Semantic similarity (embedding distance < threshold) | Medium (15–30%) | Adversarial inputs designed to be semantically similar but functionally different |
Only cache for Tier 1. For Tier 2–3, the risk of cache-based bypass outweighs the cost saving.
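For Tier 1, an exact-match cache is straightforward: hash the input and store the verdict, so only byte-identical requests can ever hit. A minimal sketch (class and method names are illustrative):

```python
import hashlib

class ExactMatchJudgeCache:
    """Cache judge verdicts keyed by a hash of the exact input (Tier 1 only)."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get(self, text: str):
        return self._store.get(self._key(text))    # None on miss

    def put(self, text: str, verdict: str) -> None:
        self._store[self._key(text)] = verdict
```

Because the key is an exact hash, adversarial near-duplicates miss the cache and get judged fresh, which is why this variant carries none of the semantic cache's bypass risk.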
Total Cost of Evaluation (Multi-Agent Workflows)¶
The per-layer costs above model a single judge. In a multi-agent workflow, the full evaluation stack includes multiple evaluation roles. This section models the compound cost so you can budget for the complete architecture, not just one layer.
Evaluation Stack Components¶
| Component | What It Does | Evaluation Rate | Cost Driver |
|---|---|---|---|
| Tactical judge | Evaluates each agent action against its OISpec | Per-action (100% at Tier 3) | Highest volume. Dominates cost unless using SLM. |
| Domain judges (fraud, security, compliance) | Evaluates actions from specific policy perspectives | Per-action or per-phase | Can be consolidated into a single multi-criteria evaluation call, or run as separate SLM sidecars. |
| Strategic evaluator | Assesses combined outputs against workflow intent | Per-phase + post-execution | Low volume. Single LLM call per evaluation. |
| Meta-evaluator | Monitors judge drift via calibration test cases | Scheduled (daily/weekly) | Negligible at scale. Fixed cost regardless of transaction volume. |
| Observer | Anomaly scoring, PACE triggers | Continuous (metrics pipeline) | Infrastructure cost. No per-evaluation LLM calls. |
Compound Cost Model: Fraud Detection Example¶
A fraud detection workflow processing 1M transactions/month with 3 task agents:
Scenario A: Cloud Judge (all evaluation via API)
| Component | Volume | Cost per eval | Monthly cost |
|---|---|---|---|
| Tactical judge (3 agents × 1M actions) | 3M | $0.01–0.05 | $30,000–150,000 |
| Domain judges (fraud + compliance, sampled 25%) | 750K | $0.01–0.03 | $7,500–22,500 |
| Strategic evaluator (per-phase, ~100K phases) | 100K | $0.01–0.05 | $1,000–5,000 |
| Meta-evaluator (daily calibration, 100 test cases) | 3K | $0.05 | $150 |
| Observer | Continuous | Infrastructure | $500–1,000 |
| Total | | | $39,150–178,650 |
Scenario B: SLM Sidecar + Sampled Cloud Judge
| Component | Volume | Cost per eval | Monthly cost |
|---|---|---|---|
| SLM tactical judge (3 agents × 1M actions) | 3M | ~$0 (compute only) | $150–600 (infrastructure) |
| SLM domain evaluation (multi-criteria, 100%) | 3M | ~$0 (compute only) | Included in SLM infrastructure |
| Cloud Judge teacher verification (1% sample) | 30K | $0.05 | $1,500 |
| Strategic evaluator (per-phase, ~100K phases) | 100K | $0.01 | $1,000 |
| Meta-evaluator (daily calibration) | 3K | $0.05 | $150 |
| Observer | Continuous | Infrastructure | $500–1,000 |
| Total | | | $3,300–4,250 |
The SLM approach reduces compound evaluation cost by 95–97% at this volume. The break-even point is even lower for multi-agent workflows than for single-agent systems because evaluation volume multiplies with agent count.
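The compound totals above can be reproduced with simple arithmetic. A sketch of the calculation (parameter names are illustrative); plugging in Scenario A's lower-bound figures yields its $39,150 total:

```python
def compound_monthly_cost(actions_per_agent: int, agents: int,
                          tactical_per_eval: float,
                          domain_sample_rate: float, domain_per_eval: float,
                          strategic_evals: int, strategic_per_eval: float,
                          meta_evals: int, meta_per_eval: float,
                          observer_infra: float) -> float:
    """Sum the full evaluation stack for one month."""
    tactical_evals = actions_per_agent * agents          # 1M actions x 3 agents = 3M
    return (tactical_evals * tactical_per_eval           # tactical judge
            + tactical_evals * domain_sample_rate * domain_per_eval  # domain judges
            + strategic_evals * strategic_per_eval       # strategic evaluator
            + meta_evals * meta_per_eval                 # meta-evaluator calibration
            + observer_infra)                            # observer pipeline

# Scenario A, lower bound (~$39,150):
# compound_monthly_cost(1_000_000, 3, 0.01, 0.25, 0.01,
#                       100_000, 0.01, 3_000, 0.05, 500)
```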
**Budget the full stack, not one layer**
When presenting evaluation costs to finance, model the complete stack (tactical + domain + strategic + meta + observer), not just the tactical judge. Then show both cloud and SLM scenarios. The SLM scenario is almost always the right answer for high-volume multi-agent workflows.
Critical-Path Latency for Time-Sensitive Workflows¶
The per-layer latency budgets above show each component in isolation. For time-sensitive workflows (fraud detection, trading, real-time safety), what matters is the critical path: the synchronous components that must complete before the action executes.
Synchronous vs. Asynchronous Evaluation¶
Not every evaluation component sits on the critical path. Most run asynchronously, adding zero latency to the transaction.
| Component | On Critical Path? | Latency Added | Notes |
|---|---|---|---|
| Input guardrails | Yes (synchronous) | 5–20ms | Always inline. Rule-based is fastest. |
| SLM tactical judge | Yes (synchronous) | 10–50ms | Sidecar evaluation. Sub-50ms is achievable. |
| Cloud tactical judge | Usually no (async) | 500ms–5s | Only synchronous for CRITICAL risk at Tier 3. |
| Domain judges (SLM) | Yes if SLM (sync) | 10–50ms | Can be batched into the tactical SLM call. |
| Domain judges (cloud) | No (async) | 500ms–5s | Async with "most restrictive wins" default. |
| Strategic evaluator | No (async) | 1–5s | Runs at phase boundaries, not per-action. |
| Meta-evaluator | No (scheduled) | 0ms | Runs on a schedule. Not per-transaction. |
| Observer | No (background) | 0ms | Metrics pipeline. Non-blocking. |
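The sync/async split in the table reduces to a simple dispatch pattern: only the SLM verdict blocks the action, while the sampled cloud judge fires on a background thread. A sketch (function and judge names are illustrative):

```python
import random
from concurrent.futures import ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=4)   # background pool for async judges

def screen_action(action, slm_judge, cloud_judge, sample_rate=0.01):
    """Screen one agent action, keeping only the SLM on the critical path.

    slm_judge: fast local evaluator (sync, ~10-50ms), returns a verdict string.
    cloud_judge: expensive evaluator, fired asynchronously for a sampled subset.
    """
    verdict = slm_judge(action)                 # synchronous: gates execution
    if verdict != "pass":
        return verdict                          # blocked before the action runs
    if random.random() < sample_rate:
        _executor.submit(cloud_judge, action)   # asynchronous: adds no latency
    return "pass"
```

The cloud judge's verdict arrives after the action has executed, so it feeds monitoring and escalation rather than inline blocking; for CRITICAL-risk actions you would instead call it synchronously, as in the trading example below.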
Example: Fraud Detection (Tier 2, SLM Sidecar)¶
```text
Transaction arrives
  → Input guardrails                       5–20ms    [sync]
  → Agent processes transaction            50–200ms  [sync]
  → SLM tactical + domain evaluation       10–50ms   [sync]
  → Action executes (flag/approve/block)   5–10ms    [sync]
  ────────────────────────────────────────────────
  Total critical path:                     70–280ms

  → Cloud Judge (1% sample)                2–5s      [async, non-blocking]
  → Strategic evaluator (end of batch)     1–5s      [async, non-blocking]
```
Critical-path overhead from evaluation: 10–50ms. That is the SLM sidecar. Everything else is asynchronous.
Example: Trading Compliance (Tier 3, CRITICAL Risk)¶
```text
Trade request arrives
  → Input guardrails                       5–20ms     [sync]
  → Agent generates trade recommendation   100–500ms  [sync]
  → SLM tactical evaluation                10–50ms    [sync]
  → Cloud Judge (synchronous, CRITICAL)    500ms–2s   [sync, CRITICAL only]
  → Compliance domain judge (sync)         500ms–2s   [sync, CRITICAL only]
  ─────────────────────────────────────────────────
  Total critical path:                     1.1–4.6s

  → Strategic evaluator (post-trade)       1–5s       [async]
```
For CRITICAL risk, the cloud judge runs synchronously. This is the cost of assured evaluation for irreversible, high-consequence actions.
**Do not put the full stack on the critical path**
The most common latency mistake is making every evaluation component synchronous. Only the tactical judge (SLM sidecar) and guardrails belong on the critical path for most workflows. Strategic evaluation, meta-evaluation, and observer scoring are asynchronous by design. If your workflow has a latency budget under 500ms, use the SLM approach and keep the cloud judge async.
Budgeting Template¶
| Line Item | Monthly Estimate |
|---|---|
| Generator LLM API costs | $ ___ |
| Input guardrails (if ML-based) | $ ___ |
| Output guardrails (if ML-based) | $ ___ |
| SLM infrastructure (if using distilled model) | $ ___ |
| SLM retraining (amortised monthly) | $ ___ |
| Judge LLM API costs (at sampling rate ___%) | $ ___ |
| Teacher verification of SLM (if applicable, ___% sample) | $ ___ |
| Human review (estimated ___ reviews × $___/review) | $ ___ |
| Monitoring infrastructure (SIEM, dashboards) | $ ___ |
| Total security overhead | $ ___ |
| As % of generator cost | ____% |
Rule of thumb: Security overhead is typically 15–40% of generator cost for Tier 2, and 40–100% for Tier 3. Replacing the cloud Judge with a distilled SLM for routine screening can reduce that overhead by 60–90% at high volumes, bringing Tier 3 costs closer to the Tier 2 range.
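The template's bottom two rows are simple arithmetic. A one-liner to sanity-check against the rule of thumb, with purely illustrative figures:

```python
def security_overhead_pct(generator_cost: float, *security_items: float) -> float:
    """Total security overhead expressed as a percentage of generator spend."""
    return 100.0 * sum(security_items) / generator_cost

# e.g. $10K/month generator spend; guardrails $1.5K, judge $1K, human review $0.5K
# -> 30%, inside the 15-40% range the rule of thumb gives for Tier 2
```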