Cost and Latency

The three-layer pattern is not free. Budget for it.

The Problem

Each layer adds cost and latency:

| Layer | Latency Added | Cost Per Request | At 1M requests/month |
| --- | --- | --- | --- |
| Guardrails (rule-based) | 5–20ms | ~$0 (compute only) | Negligible |
| Guardrails (ML classifier) | 20–100ms | $0.001–0.005 | $1K–5K |
| Distilled SLM (local model) | 10–50ms | ~$0 (compute only) | Infrastructure cost only |
| Judge (LLM evaluation) | 500ms–5s | $0.01–0.05 | $10K–50K |
| Human Oversight (per review) | Minutes–hours | $5–50 per review | Depends on sample rate |

For a Tier 3 system running the full pattern on every request, the Judge alone can cost more than the generator. A distilled SLM can remove the per-request Judge API cost for routine screening, leaving only infrastructure cost, while maintaining 100% inline coverage.
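The monthly figures in the table are per-request cost multiplied by volume. A minimal sketch of that arithmetic, assuming the per-request ranges listed above; the function name, dictionary keys, and `sample_rate` parameter are illustrative and not part of any library:

```python
# Per-request cost ranges in USD, copied from the table above.
LAYER_COST_PER_REQUEST = {
    "guardrails_ml": (0.001, 0.005),
    "judge_llm": (0.01, 0.05),
}


def monthly_cost(layer: str, requests_per_month: int,
                 sample_rate: float = 1.0) -> tuple[float, float]:
    """Return the (low, high) monthly cost in USD for one layer.

    sample_rate is the fraction of requests the layer evaluates
    (1.0 means every request).
    """
    low, high = LAYER_COST_PER_REQUEST[layer]
    evaluated = requests_per_month * sample_rate
    return (round(low * evaluated, 2), round(high * evaluated, 2))


if __name__ == "__main__":
    # Judge on every request at 1M requests/month.
    print(monthly_cost("judge_llm", 1_000_000))
    # Judge on a 10% sample of the same traffic.
    print(monthly_cost("judge_llm", 1_000_000, sample_rate=0.10))
```

Passing a `sample_rate` below 1.0 shows how quickly sampling brings the Judge line item down.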

Sampling Strategies

You don't have to judge every request. Match evaluation density to risk.

By Risk Tier

| Risk Tier | Guardrails | Judge | Human Review |
| --- | --- | --- | --- |
| Tier 1 (Low) | 100% of requests | 5–10% sample | 1% or anomaly-triggered |
| Tier 2 (Medium) | 100% of requests | 25–50% sample | 5% + all judge flags |
| Tier 3 (High) | 100% of requests | 100% of requests | 10% + all judge flags |
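One way to apply these rates is deterministic sampling on a request identifier, so the same request always gets the same decision and the sample can be reproduced during an audit. The sketch below is an illustration only: the rates are the upper bounds of the ranges above, and `should_judge` and `request_id` are hypothetical names.

```python
import hashlib

# Judge sampling rate per risk tier (upper bound of each range above).
JUDGE_SAMPLE_RATE = {1: 0.10, 2: 0.50, 3: 1.00}


def should_judge(request_id: str, tier: int) -> bool:
    """Decide whether a request is sent to the Judge.

    The request ID is hashed into a value in [0, 1); the request is
    sampled when that value falls below the tier's rate.
    """
    rate = JUDGE_SAMPLE_RATE[tier]
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


if __name__ == "__main__":
    sampled = sum(should_judge(f"req-{i}", 1) for i in range(10_000))
    print(f"Tier 1: {sampled} of 10000 requests sampled")
```

Hash-based sampling avoids shared random state across workers. A random draw per request would work equally well if reproducibility is not needed.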

Adaptive Sampling

Increase judge evaluation rate when signals indicate elevated risk:

| Trigger | Sampling Adjustment |
| --- | --- |
| Guardrail block rate above baseline | Increase judge rate by 2x |
| New user (first 50 requests) | Judge 100% |
| After-hours usage (if unusual for your environment) | Increase judge rate by 2x |
| Prompt attack detected | Judge 100% for that user for 24 hours |
| Model provider change notification | Judge 100% for 48 hours |
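The triggers above can be combined into a single rate calculation. The following is a sketch under stated assumptions: the `RiskSignals` fields are hypothetical, and the choice to stack the two 2x multipliers (capped at 100%) is an assumption, since the table does not say how triggers combine.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RiskSignals:
    """Hypothetical per-request risk signals; field names are illustrative."""
    guardrail_block_rate: float
    baseline_block_rate: float
    prior_requests_from_user: int
    unusual_after_hours: bool = False
    hours_since_prompt_attack: Optional[float] = None
    hours_since_provider_change: Optional[float] = None


def adjusted_judge_rate(base_rate: float, signals: RiskSignals) -> float:
    """Return the judge sampling rate after applying the triggers."""
    # Triggers that force 100% evaluation.
    if signals.prior_requests_from_user < 50:
        return 1.0
    attack = signals.hours_since_prompt_attack
    if attack is not None and attack < 24:
        return 1.0
    change = signals.hours_since_provider_change
    if change is not None and change < 48:
        return 1.0

    # Triggers that double the rate (assumed to stack, capped at 100%).
    rate = base_rate
    if signals.guardrail_block_rate > signals.baseline_block_rate:
        rate *= 2
    if signals.unusual_after_hours:
        rate *= 2
    return min(rate, 1.0)


if __name__ == "__main__":
    elevated = RiskSignals(0.05, 0.02, prior_requests_from_user=500)
    print(adjusted_judge_rate(0.25, elevated))
```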

Stratified Sampling

Not all requests carry equal risk. Sample by category:

| Request Type | Judge Rate | Rationale |
| --- | --- | --- |
| FAQ / simple lookup | 5% | Low risk, repetitive |
| Creative generation | 25% | More variable, higher guardrail miss rate |
| Data analysis / summarisation | 50% | Accesses user data, exfiltration risk |
| Decision support | 100% | Consequential output |
| Actions / tool use | 100% | Real-world impact |
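With stratified rates, the overall Judge volume depends on the traffic mix. The sketch below computes the blended rate for a given mix. The per-type rates come from the table above; the traffic shares in the example are invented for illustration.

```python
# Judge rate per request type, from the table above.
JUDGE_RATE_BY_TYPE = {
    "faq": 0.05,
    "creative": 0.25,
    "analysis": 0.50,
    "decision_support": 1.00,
    "tool_use": 1.00,
}


def blended_judge_rate(traffic_mix: dict[str, float]) -> float:
    """Return the overall fraction of requests sent to the Judge.

    traffic_mix maps request type to its share of traffic; shares must
    sum to 1.
    """
    if abs(sum(traffic_mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * JUDGE_RATE_BY_TYPE[kind]
               for kind, share in traffic_mix.items())


if __name__ == "__main__":
    # Hypothetical traffic mix, not taken from the source.
    mix = {"faq": 0.60, "creative": 0.15, "analysis": 0.15,
           "decision_support": 0.05, "tool_use": 0.05}
    print(f"Blended judge rate: {blended_judge_rate(mix):.2%}")
```

Multiplying the blended rate by monthly volume and per-evaluation cost gives the Judge budget for that mix.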

Latency Budgets

Design your latency budget before adding controls.

Example: Customer-Facing Chat (Tier 2, Streaming)

| Component | Budget | Actual |
| --- | --- | --- |
| Input guardrails | 20ms | 15ms (rule-based) |
| LLM generation (first token) | 500ms | 400ms |
| Buffer evaluation (per chunk) | 50ms | 30ms (rule-based) |
| Total to first visible token | 570ms | 445ms |
| Post-stream judge evaluation | N/A (async) | 2s |

Example: Document Processing (Tier 3, Non-Streaming)

| Component | Budget | Actual |
| --- | --- | --- |
| Input guardrails | 100ms | 50ms |
| LLM generation (complete) | 10s | 8s |
| Output guardrails | 100ms | 60ms |
| Judge evaluation | 5s | 3s |
| Total before delivery | 15.2s | 11.1s |

Example: Agentic Tool Calls (Tier 3, SLM Sidecar)

For agentic systems where a distilled SLM screens every action inline:

| Component | Budget | Actual |
| --- | --- | --- |
| Input guardrails | 20ms | 15ms |
| LLM generation (tool call) | 500ms | 400ms |
| SLM sidecar evaluation | 50ms | 25ms |
| Tool execution | 200ms | 150ms |
| Total per action | 770ms | 590ms |
| Large Judge (async, 1% sample) | N/A | 2s |

The SLM adds negligible latency compared to a cloud Judge call, making 100% inline evaluation feasible without breaking the latency budget.
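A latency budget like the ones above can be checked automatically, for example in a load test. This is a minimal sketch using the agentic tool-call figures; the `Component` class and `check_budget` function are hypothetical helpers, not part of any framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    budget_ms: float
    actual_ms: float
    synchronous: bool = True  # async components do not count toward the total


def check_budget(components: list[Component]) -> tuple[float, float, list[str]]:
    """Return (total budget, total actual, names of components over budget).

    Only synchronous components are summed, because asynchronous ones
    do not delay the response.
    """
    sync = [c for c in components if c.synchronous]
    total_budget = sum(c.budget_ms for c in sync)
    total_actual = sum(c.actual_ms for c in sync)
    over = [c.name for c in sync if c.actual_ms > c.budget_ms]
    return total_budget, total_actual, over


# Figures from the agentic tool-call example above.
AGENTIC = [
    Component("input guardrails", 20, 15),
    Component("LLM generation (tool call)", 500, 400),
    Component("SLM sidecar evaluation", 50, 25),
    Component("tool execution", 200, 150),
    Component("large Judge (1% sample)", 0, 2000, synchronous=False),
]

if __name__ == "__main__":
    print(check_budget(AGENTIC))
```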

What Breaks the Budget

| Problem | Cause | Mitigation |
| --- | --- | --- |
| Judge adds 5s to every request | Using large model for judge | Use smaller model (Haiku-class) for routine evaluation, or distill into an SLM for sub-50ms inline checks |
| Guardrail latency spikes | ML classifier cold start | Pre-warm classifiers, use rule-based for latency-critical path |
| Multiple judge calls per request | Evaluating multiple dimensions separately | Batch evaluations into a single prompt |
| Human review blocks delivery | Synchronous human review on all flags | Async review for medium flags; synchronous only for high/critical |
| SLM cold start on first request | Model not loaded into memory | Pre-load the SLM at pod/process start, keep it resident |

Cost Optimisation

Judge Model Selection

| Judge Model Tier | Cost (per 1K eval tokens) | Accuracy | When to Use |
| --- | --- | --- | --- |
| Small (Haiku, GPT-4o-mini) | ~$0.001 | 80–85% | Tier 1, high-volume screening |
| Medium (Sonnet, GPT-4o) | ~$0.01 | 88–93% | Tier 2, balanced cost/accuracy |
| Large (Opus, GPT-4) | ~$0.05 | 93–97% | Tier 3, consequential decisions |

Tiered Evaluation

Run cheap evaluation first; escalate to expensive evaluation only when needed:

Request → Rule-based guardrails (free, fast)
  ↓ (passed)
Request → Small model judge (cheap, fast)
  ↓ (flagged or uncertain)
Request → Large model judge (expensive, accurate)
  ↓ (flagged)
Request → Human review (most expensive)

This can reduce Judge cost by 60–80% compared to running the large model on everything. The exact saving depends on how many requests escalate past the small model.
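The saving from tiered evaluation follows from the escalation rate. As a rough sketch, assume every request that passes guardrails goes to the small judge and a fraction of those escalates to the large judge. The prices are the small and large figures from the Judge Model Selection table, treated as per-evaluation costs (an assumption of about 1K tokens per evaluation). Human review cost is left out.

```python
def cascade_cost_per_request(small_cost: float, large_cost: float,
                             escalation_rate: float) -> float:
    """Expected Judge cost per request for a two-stage cascade.

    Every request pays for the small judge; only the escalated fraction
    also pays for the large judge.
    """
    return small_cost + escalation_rate * large_cost


def saving_vs_large_only(small_cost: float, large_cost: float,
                         escalation_rate: float) -> float:
    """Fraction saved compared with running the large judge on everything."""
    cascade = cascade_cost_per_request(small_cost, large_cost, escalation_rate)
    return 1.0 - cascade / large_cost


if __name__ == "__main__":
    # Escalation rates here are hypothetical.
    for rate in (0.10, 0.20, 0.38):
        saving = saving_vs_large_only(0.001, 0.05, rate)
        print(f"escalation {rate:.0%}: saving {saving:.0%}")
```

Under these assumed prices, a 60–80% saving corresponds to roughly 18–38% of requests escalating to the large model.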

For even greater savings, consider distilling the large Judge into a Small Language Model that runs locally.

SLM Cost Profile

A distilled SLM deployed as a sidecar fundamentally changes the cost model. Instead of paying per-token API costs for every evaluation, you pay a fixed infrastructure cost regardless of volume.

| Cost Component | Cloud Judge (API) | Distilled SLM (Local) |
| --- | --- | --- |
| Per-evaluation cost | $0.01–0.05 | ~$0 (compute only) |
| At 1M evaluations/month | $10,000–$50,000 | Infrastructure only |
| Infrastructure | None (API) | ~$50–200/month per node (CPU, 1GB RAM for INT4 model) |
| Teacher verification (1% sample) | N/A | $100–500/month |
| Initial distillation | N/A | One-time: $500–2,000 (Teacher labelling + training compute) |
| Retraining (monthly) | N/A | $200–500 per cycle |

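Break-even point: At roughly 50,000 evaluations per month, the SLM approach becomes cheaper than a cloud Judge priced at $0.01–0.05 per evaluation. Against a small cloud Judge model ($0.001–0.005 per evaluation, as in the table below), the two cost ranges overlap at 100,000 evaluations per month, and the SLM is reliably cheaper only from around 500,000. Above that volume, the savings are substantial.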

| Monthly Evaluations | Cloud Judge (Small Model) | SLM + 1% Teacher Verification |
| --- | --- | --- |
| 100K | $100–500 | ~$250–400 (fixed) |
| 500K | $500–2,500 | ~$300–500 (fixed) |
| 1M | $1,000–5,000 | ~$350–700 (fixed) |
| 10M | $10,000–50,000 | ~$500–1,000 (fixed) |

The SLM cost stays nearly flat as volume grows because the marginal cost per evaluation is compute only. The cloud Judge cost scales linearly with volume.
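The break-even volume can be estimated by dividing the SLM's fixed monthly cost by what each evaluation saves. The sketch below assumes a simple model: the cloud Judge costs a flat amount per evaluation, and the SLM has a fixed monthly cost plus a sampled teacher verification. The function and parameter names are illustrative.

```python
import math
from typing import Optional


def break_even_evaluations(slm_fixed_monthly: float,
                           cloud_cost_per_eval: float,
                           verify_rate: float = 0.01,
                           teacher_cost_per_eval: float = 0.05) -> Optional[int]:
    """Monthly volume above which the SLM is cheaper than the cloud Judge.

    Returns None if the sampled teacher verification alone costs as much
    per evaluation as the cloud Judge, in which case there is no break-even.
    """
    saving_per_eval = cloud_cost_per_eval - verify_rate * teacher_cost_per_eval
    if saving_per_eval <= 0:
        return None
    return math.ceil(slm_fixed_monthly / saving_per_eval)


if __name__ == "__main__":
    # Assumed $500/month fixed SLM cost against three cloud price points.
    for price in (0.01, 0.005, 0.001):
        print(price, break_even_evaluations(500, price))
```

The result is sensitive to the cloud price per evaluation, so it is worth running with your own negotiated rates and infrastructure costs.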

When an SLM makes financial sense

If you need to evaluate more than 5% of requests with a Judge, and your volume exceeds 100K requests per month, model the cost of an SLM sidecar. For agentic systems that require 100% inline evaluation, the SLM is almost always the cheaper option at scale. See Distilling the Judge into a Small Language Model for the full architecture.

Caching

Judge evaluations on identical or near-identical inputs can be cached:

| Cache Type | Hit Rate | Risk |
| --- | --- | --- |
| Exact match (same input hash) | Low (5–10%) | None |
| Semantic similarity (embedding distance < threshold) | Medium (15–30%) | Adversarial inputs designed to be semantically similar but functionally different |

Only cache for Tier 1. For Tier 2–3, the risk of cache-based bypass outweighs the cost saving.
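A minimal sketch of an exact-match cache that follows the Tier 1 restriction. The `JudgeCache` class is hypothetical, and `judge` stands in for whatever function calls your Judge model. A production cache would also need expiry and a size bound, which are omitted here.

```python
import hashlib
from typing import Callable


class JudgeCache:
    """Exact-match verdict cache that is bypassed for Tier 2 and Tier 3."""

    def __init__(self, judge: Callable[[str], str]) -> None:
        self._judge = judge
        self._verdicts: dict[str, str] = {}
        self.hits = 0

    def evaluate(self, text: str, tier: int) -> str:
        # Higher tiers always get a fresh evaluation.
        if tier != 1:
            return self._judge(text)
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in self._verdicts:
            self.hits += 1
            return self._verdicts[key]
        verdict = self._judge(text)
        self._verdicts[key] = verdict
        return verdict


if __name__ == "__main__":
    cache = JudgeCache(lambda text: "pass")  # stand-in judge
    cache.evaluate("What are your opening hours?", tier=1)
    cache.evaluate("What are your opening hours?", tier=1)
    print(f"cache hits: {cache.hits}")
```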

Total Cost of Evaluation (Multi-Agent Workflows)

The per-layer costs above model a single judge. In a multi-agent workflow, the full evaluation stack includes multiple evaluation roles. This section models the compound cost so you can budget for the complete architecture, not just one layer.

Evaluation Stack Components

| Component | What It Does | Evaluation Rate | Cost Driver |
| --- | --- | --- | --- |
| Tactical judge | Evaluates each agent action against its OISpec | Per-action (100% at Tier 3) | Highest volume. Dominates cost unless using SLM. |
| Domain judges (fraud, security, compliance) | Evaluates actions from specific policy perspectives | Per-action or per-phase | Can be consolidated into a single multi-criteria evaluation call, or run as separate SLM sidecars. |
| Strategic evaluator | Assesses combined outputs against workflow intent | Per-phase + post-execution | Low volume. Single LLM call per evaluation. |
| Meta-evaluator | Monitors judge drift via calibration test cases | Scheduled (daily/weekly) | Negligible at scale. Fixed cost regardless of transaction volume. |
| Observer | Anomaly scoring, PACE triggers | Continuous (metrics pipeline) | Infrastructure cost. No per-evaluation LLM calls. |

Compound Cost Model: Fraud Detection Example

A fraud detection workflow processing 1M transactions/month with 3 task agents:

Scenario A: Cloud Judge (all evaluation via API)

| Component | Volume | Cost per eval | Monthly cost |
| --- | --- | --- | --- |
| Tactical judge (3 agents × 1M actions) | 3M | $0.01–0.05 | $30,000–150,000 |
| Domain judges (fraud + compliance, sampled 25%) | 750K | $0.01–0.03 | $7,500–22,500 |
| Strategic evaluator (per-phase, ~100K phases) | 100K | $0.01–0.05 | $1,000–5,000 |
| Meta-evaluator (daily calibration, 100 test cases) | 3K | $0.05 | $150 |
| Observer | Continuous | Infrastructure | $500–1,000 |
| Total | | | $39,150–178,650 |

Scenario B: SLM Sidecar + Sampled Cloud Judge

| Component | Volume | Cost per eval | Monthly cost |
| --- | --- | --- | --- |
| SLM tactical judge (3 agents × 1M actions) | 3M | ~$0 (compute) | $150–600 (infrastructure) |
| SLM domain evaluation (multi-criteria, 100%) | 3M | ~$0 (compute) | Included in SLM infra |
| Cloud Judge teacher verification (1% sample) | 30K | $0.05 | $1,500 |
| Strategic evaluator (per-phase, ~100K phases) | 100K | $0.01 | $1,000 |
| Meta-evaluator (daily calibration) | 3K | $0.05 | $150 |
| Observer | Continuous | Infrastructure | $500–1,000 |
| Total | | | $3,300–4,250 |

The SLM approach reduces compound evaluation cost by roughly 92–98% at this volume (comparing the low ends and the high ends of the two totals). The break-even volume is even lower for multi-agent workflows than for single-agent systems because evaluation volume multiplies with agent count.
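The two scenario totals and the resulting reduction can be reproduced from the line items. The sketch below hard-codes the monthly cost ranges from the two tables above. The dictionary keys are shorthand labels, and the reduction compares low end with low end and high end with high end.

```python
# (low, high) monthly cost in USD per component, from the tables above.
SCENARIO_A = {
    "tactical_judge": (30_000, 150_000),
    "domain_judges": (7_500, 22_500),
    "strategic_evaluator": (1_000, 5_000),
    "meta_evaluator": (150, 150),
    "observer": (500, 1_000),
}
SCENARIO_B = {
    "slm_infrastructure": (150, 600),
    "teacher_verification": (1_500, 1_500),
    "strategic_evaluator": (1_000, 1_000),
    "meta_evaluator": (150, 150),
    "observer": (500, 1_000),
}


def total(scenario: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum the low and high ends of every line item."""
    low = sum(item[0] for item in scenario.values())
    high = sum(item[1] for item in scenario.values())
    return low, high


def reduction(before: tuple[int, int], after: tuple[int, int]) -> tuple[float, float]:
    """Fractional cost reduction, low end vs low end and high vs high."""
    return 1 - after[0] / before[0], 1 - after[1] / before[1]


if __name__ == "__main__":
    a, b = total(SCENARIO_A), total(SCENARIO_B)
    low, high = reduction(a, b)
    print(f"A: {a}, B: {b}, reduction: {low:.1%} to {high:.1%}")
```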

Budget the full stack, not one layer

When presenting evaluation costs to finance, model the complete stack (tactical + domain + strategic + meta + observer), not just the tactical judge. Then show both cloud and SLM scenarios. The SLM scenario is almost always the right answer for high-volume multi-agent workflows.

Critical-Path Latency for Time-Sensitive Workflows

The per-layer latency budgets above show each component in isolation. For time-sensitive workflows (fraud detection, trading, real-time safety), what matters is the critical path: the synchronous components that must complete before the action executes.

Synchronous vs. Asynchronous Evaluation

Not every evaluation component sits on the critical path. Most run asynchronously, adding zero latency to the transaction.

| Component | On Critical Path? | Latency Added | Notes |
| --- | --- | --- | --- |
| Input guardrails | Yes (synchronous) | 5–20ms | Always inline. Rule-based is fastest. |
| SLM tactical judge | Yes (synchronous) | 10–50ms | Sidecar evaluation. Sub-50ms is achievable. |
| Cloud tactical judge | Usually no (async) | 500ms–5s | Only synchronous for CRITICAL risk at Tier 3. |
| Domain judges (SLM) | Yes if SLM (sync) | 10–50ms | Can be batched into the tactical SLM call. |
| Domain judges (cloud) | No (async) | 500ms–5s | Async with "most restrictive wins" default. |
| Strategic evaluator | No (async) | 1–5s | Runs at phase boundaries, not per-action. |
| Meta-evaluator | No (scheduled) | 0ms | Runs on a schedule. Not per-transaction. |
| Observer | No (background) | 0ms | Metrics pipeline. Non-blocking. |

Example: Fraud Detection (Tier 2, SLM Sidecar)

Transaction arrives
  → Input guardrails                     5-20ms  [sync]
  → Agent processes transaction          50-200ms [sync]
  → SLM tactical + domain evaluation     10-50ms  [sync]
  → Action executes (flag/approve/block) 5-10ms   [sync]
                                         ─────────
  Total critical path:                   70-280ms

  → Cloud Judge (1% sample)              2-5s     [async, non-blocking]
  → Strategic evaluator (end of batch)   1-5s     [async, non-blocking]

Critical-path overhead from evaluation: 10-50ms. That is the SLM sidecar. Everything else is asynchronous.
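The critical-path total is the sum of the synchronous stages only. A small sketch using the fraction detection figures above, with each stage represented as a tuple of name, minimum latency, maximum latency, and a synchronous flag (a layout chosen here for illustration):

```python
# (stage, min latency ms, max latency ms, synchronous?)
FRAUD_PIPELINE = [
    ("input guardrails", 5, 20, True),
    ("agent processes transaction", 50, 200, True),
    ("SLM tactical + domain evaluation", 10, 50, True),
    ("action executes", 5, 10, True),
    ("cloud Judge (1% sample)", 2000, 5000, False),
    ("strategic evaluator", 1000, 5000, False),
]


def critical_path_ms(stages: list[tuple[str, int, int, bool]]) -> tuple[int, int]:
    """Return the (min, max) latency of the synchronous stages.

    Asynchronous stages are skipped because they do not block the action.
    Stages are assumed to run one after another.
    """
    sync = [stage for stage in stages if stage[3]]
    return sum(s[1] for s in sync), sum(s[2] for s in sync)


if __name__ == "__main__":
    low, high = critical_path_ms(FRAUD_PIPELINE)
    print(f"critical path: {low}-{high}ms")
```

Flipping the cloud Judge to synchronous in this list shows immediately how much it would add to every transaction.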

Example: Trading Compliance (Tier 3, CRITICAL Risk)

Trade request arrives
  → Input guardrails                     5-20ms   [sync]
  → Agent generates trade recommendation 100-500ms [sync]
  → SLM tactical evaluation              10-50ms  [sync]
  → Cloud Judge (synchronous, CRITICAL)  500ms-2s [sync, CRITICAL only]
  → Compliance domain judge (sync)       500ms-2s [sync, CRITICAL only]
                                         ──────────
  Total critical path:                   1.1-4.6s

  → Strategic evaluator (post-trade)     1-5s     [async]

For CRITICAL risk, the cloud judge runs synchronously. This is the cost of assured evaluation for irreversible, high-consequence actions.

Do not put the full stack on the critical path

The most common latency mistake is making every evaluation component synchronous. Only the tactical judge (SLM sidecar) and guardrails belong on the critical path for most workflows. Strategic evaluation, meta-evaluation, and observer scoring are asynchronous by design. If your workflow has a latency budget under 500ms, use the SLM approach and keep the cloud judge async.

Budgeting Template

| Line Item | Monthly Estimate |
| --- | --- |
| Generator LLM API costs | $ ___ |
| Input guardrails (if ML-based) | $ ___ |
| Output guardrails (if ML-based) | $ ___ |
| SLM infrastructure (if using distilled model) | $ ___ |
| SLM retraining (amortised monthly) | $ ___ |
| Judge LLM API costs (at sampling rate ___%) | $ ___ |
| Teacher verification of SLM (if applicable, ___% sample) | $ ___ |
| Human review (estimated ___ reviews × $___/review) | $ ___ |
| Monitoring infrastructure (SIEM, dashboards) | $ ___ |
| Total security overhead | $ ___ |
| As % of generator cost | ____% |

Rule of thumb: Security overhead is typically 15–40% of generator cost for Tier 2, and 40–100% for Tier 3. Replacing the cloud Judge with a distilled SLM for routine screening can reduce that overhead by 60–90% at high volumes, bringing Tier 3 costs closer to the Tier 2 range.
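The last two rows of the template are a sum and a ratio. A minimal sketch follows; every dollar figure in the example is a placeholder invented for illustration, not a benchmark.

```python
def security_overhead(generator_cost: float,
                      line_items: dict[str, float]) -> tuple[float, float]:
    """Return (total security overhead in USD, overhead as % of generator cost)."""
    if generator_cost <= 0:
        raise ValueError("generator_cost must be positive")
    total = sum(line_items.values())
    return total, 100.0 * total / generator_cost


if __name__ == "__main__":
    # Placeholder monthly figures; replace with your own estimates.
    items = {
        "slm_infrastructure": 200.0,
        "slm_retraining": 300.0,
        "judge_api": 1_500.0,
        "teacher_verification": 500.0,
        "human_review": 2_000.0,
        "monitoring": 500.0,
    }
    total, pct = security_overhead(20_000.0, items)
    print(f"total overhead ${total:,.0f} = {pct:.0f}% of generator cost")
```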