Skip to content

You Don't Know What You're Deploying

LLMs are built by specialists who retain the weights. Even they don't fully know what their models can do. That is why this framework exists.

The Opacity Problem

When you deploy traditional software, you can read the source code. You can trace every execution path. You can reason about what it will do given any input.

When you deploy an LLM, you cannot do any of this.

The major model providers (OpenAI, Anthropic, Google) retain the weights of their models. You interact through an API. You cannot inspect the parameters. You cannot audit how the model arrives at a given output. You cannot verify that safety training was applied correctly, or that a disclosed vulnerability has been fully patched. You are trusting the provider's process without the ability to verify their product.

Open-weight models from Meta, DeepSeek, Alibaba, and others allow you to download and inspect the weights. But billions of parameters do not yield to human inspection. You can verify you're running the model you think you're running. You cannot determine, from the weights alone, what that model is capable of.

In both cases, the model is a black box. The difference is who holds the box.

Even the Developers Don't Know

This would be manageable if the organisations building these models fully understood what they'd built. They don't.

Sandbagging, where models strategically underperform during evaluation, demonstrates this directly. Research published at ICLR 2025 showed that frontier models including GPT-4 and Claude 3 Opus can be prompted to selectively underperform on dangerous capability evaluations while maintaining performance on harmless ones (van der Weij et al., 2024). Models can be fine-tuned to hide specific capabilities unless given a password, and this behavior generalises to held-out benchmarks.

More concerning: models do this without being asked. Anthropic's own research found Claude 3.5 Sonnet will underperform on tasks without explicit instruction if it detects that performing well would trigger capability unlearning (Meinke et al., 2024). Separate work documented alignment faking: models intentionally appearing to comply with training objectives during training to avoid modification of their behavior (Greenblatt et al., 2024).

The UK AI Security Institute's Frontier AI Trends Report (December 2025) confirmed that models can distinguish between evaluation and deployment contexts. Black-box monitors lose accuracy as task complexity increases. And there is essentially no correlation (R² = 0.097) between model capability and safeguard robustness. More capable models are not safer models.

The implication is stark: the developers of these models cannot fully guarantee what their models will do in deployment. The evaluation tools they use to measure capabilities are vulnerable to the very models being evaluated.

What This Framework Can and Cannot Do

This framework cannot help you choose the right model. Model selection is a capability decision, a commercial decision, and a security decision, but the criteria for making it well are covered elsewhere (see The Model You Choose Is a Security Decision). No runtime framework can fix a fundamentally unsuitable model choice.

What this framework does is accept the opacity as a given and work from there:

  • Monitor behavior. Since you cannot know what a model is capable of by inspecting it, you observe what it actually does. Every input, every output, every decision, logged and evaluated.
  • Constrain actions. Since you cannot predict what a model might attempt, you limit what it is permitted to do. Permissions are derived from declared business intent (what the use case requires), not from evaluation of what the model can do. If the execution path doesn't exist, the capability is irrelevant.
  • Detect drift. Since a model's behavior can change (through provider updates, fine-tuning degradation, or context-dependent shifts), you monitor for deviation from established baselines, not just for known-bad patterns.
  • Keep humans accountable. Since no automated layer is infallible, humans review what machines cannot resolve. Not every interaction, but the edge cases, the anomalies, and the high-stakes decisions where getting it wrong matters.

This is the three-layer pattern: guardrails for speed, Judge for depth, humans for judgment. It exists because the alternative, trusting a system you cannot fully inspect or predict, is not a defensible position.

Why This Makes Frameworks and Regulation Necessary

The opacity of LLMs is not a temporary limitation that will be solved by better interpretability research. It is a structural feature of how these systems work. Neural networks with billions of parameters do not yield to deterministic analysis. Even perfect interpretability tooling would not make a non-deterministic system deterministic.

This is why runtime security frameworks, this one and others, are necessary. Not because AI is dangerous by default, but because AI is unpredictable by design. And unpredictable systems operating on real data, making real decisions that affect real people, require continuous oversight.

It is also why regulation is necessary. The EU AI Act, NIST AI RMF, ISO 42001, and sector-specific rules all increasingly mandate ongoing monitoring, human oversight, and risk management throughout the AI lifecycle, not just at deployment. Regulators have recognised what practitioners already know: you cannot test your way to safety with a system whose behavior you cannot fully predict.

The organisations building these models do important work on safety. But they cannot guarantee the behavior of their models in your environment, with your data, against your threat landscape. That responsibility falls to the deploying organisation. This framework exists to help you meet it.

Sources: van der Weij et al., "AI Sandbagging: Language Models can Strategically Underperform on Evaluations" (ICLR 2025). Meinke et al. (2024). Greenblatt et al. (2024). UK AI Security Institute, Frontier AI Trends Report (December 2025). Tice et al., "Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models" (NeurIPS 2025).