Support Systems Risk¶

The things most likely to break your AI aren't AI problems.

Support Systems Iceberg

Why This Matters¶

The industry focuses here:

AI-Specific Threat	Probability	Industry Attention
Prompt injection	Low-Medium	Very High
Model manipulation	Very Low	High
Training data poisoning	Very Low	High
Hallucination	Medium	Very High

But failures actually come from here:

Support System Failure	Probability	Industry Attention
API key rotation breaks production	Medium	Low
Model provider silently changes behavior	Medium	Low
Logging pipeline fails	Medium	Very Low
Key person leaves	High	Very Low
Stale RAG data serves wrong answers	High	Low
Cost blowout from misconfigured agent	Medium	Low
Guardrail config pushed without review	Medium	Low

The mundane risks are more likely to hurt you than the exotic ones.

Risk Categories¶

1. Dependency Risk¶

Your AI system depends on things you don't control.

1.1 Model Provider Dependency¶

Risk	Scenario	Impact
Provider outage	OpenAI / Anthropic / Azure API goes down	AI system unavailable
Silent model change	Provider updates model, outputs shift	Quality degradation, compliance drift
Deprecation	Model version retired with notice	Forced migration under pressure
Rate limiting	Hit API limits during peak	Customer-facing errors
Pricing change	Provider increases costs	Budget overrun
Terms change	Provider changes data handling terms	Compliance violation
Region unavailability	Provider exits a region	Data residency breach

Controls:

Control	Implementation	Framework Mapping
Multi-provider capability	Ability to switch models	AI.15 Continuity
Version pinning	Pin to specific model versions	AI.14 Configuration
Behavior monitoring	Detect output drift after provider changes	AI.8 Judge
Contractual protections	SLAs, deprecation notice, data handling	AI.13 Supplier
Capacity planning	Understand rate limits vs demand	AI.15 Continuity

1.2 Infrastructure Dependency¶

Risk	Scenario	Impact
Cloud region failure	Primary region outage	Full system outage
DNS failure	DNS resolution fails	Can't reach model APIs
Network partition	Connectivity between services lost	Partial or full failure
Compute exhaustion	GPU/CPU capacity exceeded	Degraded performance or outage
Storage failure	Vector DB or logging storage fails	RAG failure, blind spot

Controls:

Control	Implementation	Framework Mapping
Multi-region	Deploy across regions where justified	AI.15 Continuity
Health checks	Monitor all dependencies continuously	AI.11 Monitoring
Graceful degradation	System continues with reduced function	AI.15 Continuity
Capacity alerts	Alert before limits hit	AI.11 Monitoring

1.3 Data Supply Chain¶

Risk	Scenario	Impact
Stale data	RAG knowledge base not updated	AI gives outdated answers confidently
Corrupted data	Bad data injected into knowledge base	AI gives wrong answers confidently
Missing data	Embedding pipeline fails, new docs not indexed	AI can't find information that exists
Schema change	Upstream data format changes	Pipeline breaks silently
Source unavailable	Data source goes offline	RAG returns no context, AI hallucinates

This is arguably the highest-probability, highest-impact category. When RAG data is wrong, the AI is confidently wrong - and users trust it.

Controls:

Control	Implementation	Framework Mapping
Data freshness monitoring	Alert when knowledge base staleness exceeds threshold	AI.5 Data Governance
Data validation	Validate data before indexing	AI.5 Data Governance
Pipeline monitoring	Monitor ingestion pipelines for failures	AI.11 Monitoring
Source health checks	Monitor upstream data sources	AI.11 Monitoring
Freshness metadata	Include "last updated" in AI responses where relevant	AI.7 Guardrails

2. Configuration Risk¶

AI systems are extraordinarily sensitive to configuration. A one-word change in a system prompt can fundamentally alter behavior.

2.1 Prompt and Parameter Configuration¶

Risk	Scenario	Impact
Prompt regression	System prompt changed, quality drops	Degraded output quality
Parameter drift	Temperature/token settings changed	Inconsistent behavior
Guardrail misconfiguration	Rule disabled or threshold changed	Security control gap
Judge criteria change	Evaluation criteria modified without review	Assurance gap
Circuit breaker override	Limits relaxed "temporarily"	Agent runaway

This is the AI equivalent of a firewall rule change. It should be treated with the same rigour.

Controls:

Control	Implementation	Framework Mapping
Version control	All prompts, parameters, configs in git	AI.14 Configuration
Change approval	Config changes require review (by tier)	AI.14 Configuration
Drift detection	Alert on configuration differences from baseline	AI.14 Configuration
Rollback capability	Instant revert to last known good	AI.14 Configuration
Pre-production testing	Test config changes before prod	AI.4 Development

2.2 Secrets Management¶

Risk	Scenario	Impact
Leaked API key	Key committed to repo or exposed in logs	Unauthorised model access, cost
Key rotation failure	Key rotated, dependent services not updated	Production outage
Over-privileged keys	Key has broader access than needed	Excessive blast radius
No key rotation	Same key used for months/years	Increased compromise window
Shared credentials	Multiple services use same key	Can't isolate or revoke

Controls:

Control	Implementation	Framework Mapping
Secrets vault	HashiCorp Vault, AWS Secrets Manager, etc.	AI.15 Cryptography
Automated rotation	Rotate keys on schedule	AI.15 Cryptography
Per-service credentials	Unique keys per service/environment	AI.15 Cryptography
Secret scanning	Scan repos and logs for leaked keys	AI.4 Development
Access logging	Log all API key usage	AI.11 Logging

3. Observability Risk¶

If you can't see it, you can't govern it. Observability failures are silent and dangerous.

3.1 Logging Failures¶

Risk	Scenario	Impact
Log pipeline failure	Logging service goes down	Judge has no data, audit trail gap
Log volume overflow	Storage fills, logs dropped	Missing evidence
Log format change	Upstream changes log schema	Parsing breaks, Judge can't evaluate
Selective logging	Some interactions not captured	Biased Judge evaluation
Log tampering	Logs modified after the fact	Compromised audit trail

If logging fails, the Judge is blind, HITL has nothing to review, and your audit trail has a gap. This is a governance failure, not just an ops issue.

Controls:

Control	Implementation	Framework Mapping
Log pipeline monitoring	Alert within minutes of logging failure	AI.11 Monitoring
Dead letter queue	Capture failed log writes for retry	AI.11 Logging
Log completeness checks	Reconcile expected vs actual log volume	AI.11 Logging
Tamper protection	Immutable logs for CRITICAL tier	AI.11 Logging
Redundant logging	Write to multiple destinations for CRITICAL	AI.11 Logging

3.2 Monitoring and Alerting¶

Risk	Scenario	Impact
Alert fatigue	Too many alerts, team ignores them	Real issues missed
Missing alerts	Key failure modes not monitored	Blind spot
Stale dashboards	Dashboards show cached/old data	False confidence
Threshold drift	Thresholds not updated as system evolves	Alerts no longer meaningful

Controls:

Control	Implementation	Framework Mapping
Alert review	Quarterly review of alert relevance	AI.11 Monitoring
Alert testing	Test that alerts fire correctly	AI.11 Monitoring
Escalation paths	Defined escalation when alerts ignored	AI.12 Incident
Dashboard freshness	Show "last updated" on all dashboards	AI.11 Monitoring

4. People Risk¶

The H in HITL is a single point of failure.

4.1 Key Person Dependency¶

Risk	Scenario	Impact
Sole prompt engineer	One person wrote all system prompts	Can't maintain or debug if they leave
Judge expertise	Only one person understands Judge config	Assurance capability at risk
HITL reviewer	Domain expertise concentrated in one person	Review quality drops
Platform knowledge	One engineer knows the platform	Operational risk

Controls:

Control	Implementation	Framework Mapping
Documentation	Document all prompts, configs, decisions, rationale	AI.3 Documentation
Cross-training	Minimum 2 people for every critical function	AI.14 Training
Runbooks	Documented procedures for common operations	AI.14 Training
Succession planning	Identified backup for key roles	AI.1 Governance

4.2 HITL Sustainability¶

Risk	Scenario	Impact
Reviewer fatigue	High volume, repetitive reviews	Quality drops, rubber-stamping
Insufficient capacity	More findings than reviewers can handle	Queue backlog, SLA breach
Skill decay	Reviewers lose domain sharpness over time	Review effectiveness drops
Turnover	Reviewers leave, replacements untrained	Assurance gap during transition
Complacency	Low finding rate leads to inattention	Miss the one that matters

This is a known failure mode in every human oversight system. Aviation, nuclear, financial services - all have struggled with this.

Controls:

Control	Implementation	Framework Mapping
Workload management	Monitor reviewer load, redistribute	AI.9 HITL
Quality checks	Sample-check HITL decisions	AI.9 HITL
Rotation	Rotate reviewers to maintain attention	AI.9 HITL
Calibration	Regular calibration exercises	AI.9 HITL
Canary reviews	Inject known findings to test reviewer attention	AI.9 HITL

5. Financial Risk¶

AI costs are variable, opaque, and can spike without warning.

5.1 Cost Control¶

Risk	Scenario	Impact
Agent cost blowout	Agentic loop calls API repeatedly	Unexpected bill
Judge cost overrun	100% sampling on high-volume system	Budget exceeded
Shadow AI	Teams spin up AI services without oversight	Uncontrolled spend
Pricing surprise	Provider changes pricing model	Budget assumptions invalid
No budget allocation	AI costs absorbed into general cloud spend	No visibility or control

Controls:

Control	Implementation	Framework Mapping
Budget caps	Hard spending limits per system/team	AG.2.2 Circuit Breakers
Cost alerts	Alert at 50%, 80%, 100% of budget	AI.11 Monitoring
Cost attribution	Tag all AI spend to system/team/use case	AI.3 Inventory
Forecast vs actual	Monthly comparison	AI.1 Governance
Shadow AI detection	Monitor for unauthorised AI API usage	AI.3 Inventory

6. Vendor Concentration Risk¶

Risk	Scenario	Impact
Single model provider	One provider for all AI capabilities	Total loss if provider fails
Correlated failure	Model + guardrails + logging all on same platform	Single event takes out controls and system
Vendor lock-in	Deep integration makes switching costly	No leverage, no alternatives
Geopolitical	Provider restricted in your jurisdiction	Forced migration

Controls:

Control	Implementation	Framework Mapping
Multi-provider strategy	At least evaluate alternatives	AI.13 Supplier
Abstraction layer	Don't hard-code to one provider's API	AI.4 Development
Exit planning	Documented migration path	AI.13 Supplier
Control separation	Don't run guardrails on same platform as AI	AI.15 Continuity

7. Change Management Risk¶

Risk	Scenario	Impact
Library upgrade	Python package update breaks integration	Production outage
SDK change	Provider SDK breaking change	Build failures or runtime errors
Upstream model update	Provider updates model behind same version string	Behavior change
Framework migration	LangChain / LlamaIndex major version change	Significant rework
Undocumented change	Someone changes something, doesn't tell anyone	Mysterious behavior shift

Controls:

Control	Implementation	Framework Mapping
Dependency pinning	Pin all library versions	AI.4 Development
Change log	All changes documented with rationale	AI.14 Configuration
Staged rollout	Dev → staging → canary → production	AI.4 Development
Automated testing	Tests run on every change	AI.4 Development
Change freeze	Freeze periods for critical business times	AI.4 Development

Risk Heat Map¶

Support Systems Risk Heat Map

Category	Probability	Impact	Attention It Gets
Data supply chain	High	High	Low
People / key person	High	High	Very Low
Configuration	Medium-High	High	Low
HITL sustainability	High	Medium-High	Very Low
Secrets management	Medium	High	Medium
Logging failure	Medium	High	Very Low
Cost blowout	Medium	Medium	Low
Model provider outage	Medium	High	Medium
Vendor concentration	Medium	Very High	Low
Change management	Medium-High	Medium-High	Low
Infrastructure	Low-Medium	High	Medium

The pattern: high-probability, high-impact risks get the least attention because they're not exciting. Prompt injection is interesting. API key rotation is not. But API key rotation is more likely to take you down.

Framework Integration¶

These risks map to existing control families, but most need strengthening:

Support System Risk	Primary Control Family	Gap
Model provider dependency	AI.13 Supplier, AI.15 Continuity	Needs operational runbooks
Infrastructure dependency	AI.15 Continuity	Needs AI-specific BCP
Data supply chain	AI.5 Data Governance	Needs freshness monitoring
Prompt/config changes	AI.14 Configuration	Needs change approval rigour
Secrets management	AI.15 Cryptography	Needs AI-specific secrets policy
Logging failures	AI.11 Logging	Needs pipeline monitoring
Monitoring gaps	AI.11 Monitoring	Needs alert effectiveness review
Key person dependency	AI.14 Training, AI.3 Documentation	Needs succession planning
HITL sustainability	AI.9 HITL	Needs workload management
Cost control	AG.2.2 Circuit Breakers, AI.11 Monitoring	Needs budget caps and attribution
Vendor concentration	AI.13 Supplier	Needs exit planning
Change management	AI.4 Development, AI.14 Configuration	Needs AI-specific change process

Implementation Priority¶

Do First (Week 1-2)¶

Action	Why
Monitor logging pipeline health	If logging fails, everything downstream is blind
Set budget alerts on AI spend	Prevent surprise bills
Document who knows what	Identify key-person risks
Pin model versions	Prevent silent behavior changes
Put prompts in version control	If they're not in git, they're not controlled

Do Next (Month 1)¶

Action	Why
Implement data freshness monitoring	Stale RAG data is a high-probability risk
Set up secrets rotation	Reduce credential compromise window
Cross-train at least one backup per function	Reduce key-person risk
Define change approval process for configs	Prevent unreviewed changes
Test failover for model provider	Know what happens before it happens

Do Later (Quarter 1)¶

Action	Why
HITL workload analysis	Prevent reviewer fatigue
Multi-provider evaluation	Reduce vendor concentration
Full dependency mapping	Understand all failure paths
Canary reviews for HITL	Test reviewer attention
Exit planning for primary vendor	Have the plan before you need it

Operational Checklist¶

Use this for each production AI system:

Check	Status
Dependencies
Model provider SLA understood?	☐
Fallback if provider unavailable?	☐
Model version pinned?	☐
Data pipeline monitored?	☐
Data freshness threshold defined?	☐
Configuration
All prompts in version control?	☐
Config changes require approval?	☐
Rollback procedure documented?	☐
Secrets
API keys in vault (not code)?	☐
Rotation schedule defined?	☐
Per-service credentials?	☐
Observability
Logging pipeline monitored?	☐
Alerts tested and relevant?	☐
Dashboard data fresh?	☐
People
At least 2 people can maintain?	☐
Runbooks documented?	☐
HITL capacity sufficient?	☐
Cost
Budget cap set?	☐
Cost alerts configured?	☐
Spend attributed to system?	☐
Vendor
Exit plan documented?	☐
Alternative evaluated?	☐