AI Endpoint Hardening¶
Securing model inference endpoints, API gateways, and tool servers against adversarial and operational threats.
This guide consolidates endpoint-specific security controls that are distributed across the framework's network, IAM, secrets, and data protection domains into a single implementation reference. If you're the engineer deploying model endpoints, start here.
Why AI Endpoints Are Different¶
Traditional API hardening focuses on authentication, rate limiting, and input validation. AI inference endpoints introduce additional concerns:
| Concern | Traditional API | AI Inference Endpoint |
|---|---|---|
| Input validation | Schema validation, type checking | Schema validation + prompt injection detection + content policy enforcement |
| Output control | Response schema | Response schema + PII scanning + hallucination detection + system prompt leak prevention |
| Authentication | OAuth/API key per client | Per-client + per-agent identity + session-scoped credentials |
| Rate limiting | Requests per second | Requests + tokens per second + cost capping per consumer |
| Resource consumption | CPU/memory bounds | GPU/TPU bounds + context window limits + max output tokens |
| Attack surface | OWASP Top 10 | OWASP Top 10 + OWASP LLM Top 10 + OWASP Agentic Top 10 |
The core difference: a traditional API processes structured data deterministically. An AI endpoint processes unstructured natural language non-deterministically. Every input is potentially adversarial, and every output is potentially wrong.
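To make "every input is potentially adversarial" concrete, here is a minimal sketch of a layered input check: structural limits first, then content heuristics. The specific regex patterns and character limit are illustrative assumptions - a real deployment uses trained classifiers and maintained rule sets, not a handful of regexes.

```python
import re

# Illustrative patterns only - not a production injection detector.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.I),
    re.compile(r"reveal your system prompt", re.I),
]

def screen_input(prompt: str, max_chars: int = 8000) -> tuple[bool, str]:
    """Layered check: cheap structural limits first, then content heuristics."""
    if not isinstance(prompt, str) or not prompt.strip():
        return False, "empty_or_invalid"
    if len(prompt) > max_chars:
        return False, "too_long"
    for pattern in INJECTION_PATTERNS:
        if pattern.search(prompt):
            return False, "possible_injection"
    return True, "ok"
```

The ordering matters: structural checks are cheap and deterministic, so they run before any content analysis, and everything runs before the request consumes GPU time.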
Endpoint Architecture¶
AI endpoints must be deployed behind the framework's zone architecture:
```
Consumer → Zone 1 (API Gateway + WAF)
         → Zone 2 (Input Guardrails → Model Endpoint → Output Guardrails)
         → Zone 3 (Judge - async evaluation)
         → Zone 6 (Logging)
```
The model endpoint is never directly addressable. All access routes through the API gateway (NET-07) and guardrails (NET-02). This is the non-negotiable foundation.
Gateway Hardening¶
The API gateway is the single entry point and first line of defence. Harden it as you would any security-critical reverse proxy - then add AI-specific controls.
Standard Controls¶
| Control | Implementation | Notes |
|---|---|---|
| TLS termination | TLS 1.2+ with strong cipher suites. Disable TLS 1.0/1.1. | HSTS headers on all responses. |
| mTLS (Tier 3+) | Mutual TLS for service-to-service calls within Zone 2. | Certificate pinning for known internal clients. |
| Authentication | OAuth 2.0 / OIDC token validation. No API-key-only auth for Tier 2+. | Propagate authenticated identity downstream (IAM-01). |
| Rate limiting | Per-consumer limits on requests, tokens, and cost. | See Token-Based Rate Limiting below. |
| WAF integration | Standard OWASP CRS rules for SQLi, XSS, path traversal. | AI-specific WAF rules for common injection patterns (optional - guardrails are the primary defence). |
| Schema validation | Reject malformed requests before they reach the model. | Validate JSON structure, required fields, max prompt length. |
| Request size limits | Cap request body size. | Prevent context window stuffing attacks. |
| Timeout enforcement | Set max request duration. | Prevents long-running inference from consuming GPU indefinitely. |
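The schema validation and request size controls above can be sketched as a single pre-model check. The field names, size cap, and character limit below are assumptions for illustration - substitute your own gateway's request contract.

```python
import json

MAX_BODY_BYTES = 64 * 1024    # assumed request body cap
MAX_PROMPT_CHARS = 32_000     # rough proxy for a max-input-token limit
REQUIRED_FIELDS = {"prompt", "model"}

def validate_request(body: bytes) -> tuple[bool, str]:
    """Reject malformed or oversized requests before they reach the model."""
    if len(body) > MAX_BODY_BYTES:
        return False, "body_too_large"          # context-stuffing defence
    try:
        payload = json.loads(body)
    except (ValueError, UnicodeDecodeError):
        return False, "malformed_json"
    if not isinstance(payload, dict) or not REQUIRED_FIELDS <= payload.keys():
        return False, "missing_fields"
    if not isinstance(payload["prompt"], str) or len(payload["prompt"]) > MAX_PROMPT_CHARS:
        return False, "prompt_invalid"
    return True, "ok"
```

In practice this logic lives in the gateway's request pipeline (a plugin, policy, or middleware), not in application code - the point is that rejection happens before inference is scheduled.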
AI-Specific Gateway Controls¶
| Control | Implementation |
|---|---|
| Max input tokens | Enforce a hard limit on prompt length at the gateway. Requests exceeding the limit are rejected before reaching the model. |
| Max output tokens | Set max_tokens at the gateway level, not just in the prompt. Prevents runaway generation. |
| Correlation ID injection | Generate and inject a unique request_id / trace_id into every request. Propagate through all downstream components. Essential for SOC correlation. |
| Consumer identification | Log the authenticated identity, source IP, user agent, and session ID for every request. Required for identity correlation. |
| Endpoint routing | Route to specific model versions via gateway config, not consumer choice. Consumers don't get to select which model version serves their request. |
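Correlation ID injection from the table above can be sketched as a small header-enrichment step. The `x-request-id` header name is an assumption - use whatever trace header your gateway and SIEM agree on.

```python
import uuid

def inject_correlation_id(headers: dict) -> dict:
    """Attach a unique request ID if the caller did not supply a valid one,
    so every downstream component logs against the same identifier."""
    enriched = dict(headers)
    incoming = enriched.get("x-request-id", "")
    try:
        # Trust an inbound ID only if it is well-formed; otherwise mint one.
        uuid.UUID(incoming)
    except ValueError:
        enriched["x-request-id"] = str(uuid.uuid4())
    return enriched
```

Validating rather than blindly trusting the inbound header prevents a consumer from injecting attacker-chosen strings into your log pipeline.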
Token-Based Rate Limiting¶
Traditional request-per-second rate limiting is insufficient for AI endpoints because:
- A single request can consume vastly different amounts of compute depending on prompt length and output length.
- A rate-limited attacker can send one request with a massive prompt that consumes more resources than 100 normal requests.
- Token cost varies by model - a request to a large model costs more than the same request to a smaller model.
Rate Limiting Strategy¶
| Dimension | Limit Type | Example |
|---|---|---|
| Requests per minute | Hard cap per consumer | 60 rpm for standard, 200 rpm for premium |
| Input tokens per minute | Hard cap per consumer | 100K tokens/min |
| Output tokens per minute | Hard cap per consumer | 50K tokens/min |
| Cost per hour | Soft cap with alert, hard cap with block | $10/hr soft, $50/hr hard |
| Concurrent requests | Per consumer and global | 5 concurrent per consumer, 100 global |
Burst Protection¶
Allow short bursts above the per-minute limit, but enforce a sliding window. This prevents legitimate batch operations from being blocked while still catching sustained abuse.
Model Endpoint Hardening¶
The model inference endpoint itself - whether self-hosted, managed, or third-party - requires specific hardening.
Self-Hosted Endpoints (vLLM, TGI, Ollama, TorchServe)¶
| Control | Detail |
|---|---|
| Network binding | Bind to internal interface only. Never expose on 0.0.0.0 or public IP. |
| Authentication | Require authentication even for internal access. No unauthenticated inference. |
| Health check endpoint | Expose /health or /ready on a separate port or path. Return minimal information - model name and status only. Do not expose model version, configuration, or system details. |
| Metrics endpoint | Expose Prometheus metrics (/metrics) on a separate port, accessible only from Zone 6 (Logging). Do not expose on the inference port. |
| Resource limits | Set container memory and GPU limits. Use Kubernetes resource quotas or equivalent. Prevent a single request from exhausting GPU memory. |
| Process isolation | Run inference in a dedicated container/pod. Do not co-locate with other services. |
| Image hardening | Use minimal base images. Remove unnecessary packages, shells, and debugging tools from production images. |
| Read-only filesystem | Mount the container filesystem as read-only. Model weights and config are mounted as read-only volumes. |
| Non-root execution | Run the inference process as a non-root user. |
| No outbound access | Model inference endpoints should have no outbound network access (except for agent tool calls via egress proxy - NET-04). |
Managed Endpoints (Azure OpenAI, Bedrock, Vertex AI)¶
| Control | Detail |
|---|---|
| Private endpoints | Use private endpoints / VPC endpoints. Do not call managed AI services over the public internet from production. |
| Network restrictions | Apply IP allowlisting or VNet/VPC rules to restrict which networks can reach the managed endpoint. |
| Content filtering | Enable the provider's built-in content filtering as a baseline. Layer framework guardrails on top - provider filters are not sufficient alone. |
| Data residency | Deploy in regions aligned with data residency requirements. Verify that the provider does not route inference to other regions. |
| Model version pinning | Pin to a specific model version. Do not use "latest" or auto-updating model references in production. Provider model updates can change behaviour without notice. |
| Diagnostic logging | Enable full diagnostic logging to your SIEM. Azure: Diagnostic Settings → Log Analytics. AWS: Bedrock invocation logging → CloudWatch. GCP: Cloud Logging integration. |
| Customer-managed keys | For Tier 3+, use customer-managed encryption keys for data at rest on the managed service. |
| Quota management | Set provider-level quotas per deployment to prevent cost runaway. Alert at 80% quota consumption. |
Tool Server Hardening (MCP and Native)¶
When AI agents invoke tools - via MCP servers, native function calls, or API integrations - the tool endpoints require their own hardening.
| Control | Detail |
|---|---|
| Allowlisting | Only pre-approved tool endpoints are reachable by agents. Enforced at the egress proxy (network layer), not by prompt instructions. |
| Input validation | Validate all parameters the agent sends to the tool. Type checking, range checking, allowlisted values. The agent's output is untrusted input to the tool. |
| Output sanitisation | Strip credentials, internal metadata, and excessive data from tool responses before they enter the agent's context window (SEC-01). |
| Authentication | Tool servers require authentication. Credentials injected by the authorization gateway, not passed through the model context. |
| Idempotency | Tool operations that create, modify, or delete should be idempotent where possible. Agents may retry on failure and should not create duplicates. |
| Read/write separation | Distinguish read-only tools from tools that take actions. Read-only tools require lower approval thresholds. Write tools require explicit control (IAM-05 for Tier 3+). |
| Scope constraints | Tool access is scoped to the minimum necessary data and operations. A "query customer" tool returns only the fields relevant to the task, not the full record. |
| MCP server signing | MCP servers must be signed with verified manifests (SC-2.2). Unsigned or modified MCP servers are blocked. |
| Timeout enforcement | Set max execution time per tool call. An agent waiting indefinitely for a tool response is a denial-of-service vector. |
| Sandboxing | Code execution tools must run in sandboxed environments with no network access, no filesystem persistence, and hard resource limits (SAND-01 through SAND-06). |
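The allowlisting, input validation, and scope-constraint rows above can be combined into one pre-dispatch check. The `query_customer` tool, its parameters, and the permitted field set are hypothetical examples, not part of any real tool registry.

```python
# Hypothetical tool registry: name -> {param: (required type, allowed values or None)}
TOOL_SCHEMAS = {
    "query_customer": {
        "customer_id": (str, None),
        "fields": (list, {"name", "tier", "region"}),  # scope-constrained fields
    },
}

def validate_tool_call(tool: str, params: dict) -> tuple[bool, str]:
    """Treat the agent's tool arguments as untrusted input."""
    schema = TOOL_SCHEMAS.get(tool)
    if schema is None:
        return False, "tool_not_allowlisted"
    if set(params) != set(schema):
        return False, "unexpected_or_missing_params"
    for name, (expected_type, allowed) in schema.items():
        value = params[name]
        if not isinstance(value, expected_type):
            return False, f"bad_type:{name}"
        if allowed is not None:
            items = value if isinstance(value, list) else [value]
            if not set(items) <= allowed:
                return False, f"value_not_allowed:{name}"
    return True, "ok"
```

Note that the check rejects unexpected parameters, not just missing ones - an agent that invents extra arguments is a signal worth logging, not silently ignoring.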
Deployment Security¶
Pre-Deployment Checklist¶
Before exposing any AI endpoint to traffic:
- Model endpoint is not directly addressable from outside Zone 2
- All traffic routes through API gateway with authentication and rate limiting
- Input and output guardrails are inline on the request path (not bypassable)
- Health check and metrics endpoints are on separate ports, not externally accessible
- TLS 1.2+ enforced on all connections
- mTLS enforced for internal service-to-service calls (Tier 3+)
- Correlation ID generated and propagated for every request
- Max input tokens, max output tokens, and request timeout are enforced at gateway
- Model version is pinned - no auto-updates in production
- Container runs as non-root with read-only filesystem
- No outbound network access from model endpoint (except via egress proxy for agents)
- Diagnostic logs flowing to SIEM with identity correlation
- Rate limits configured per consumer: requests, tokens, and cost
- Credential scanning active on model I/O
- Network policy tests automated in deployment pipeline
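The last checklist item - automated network policy tests - can be sketched as a simple reachability assertion run in the deployment pipeline. Hostnames and ports here are placeholders for your environment; the point is asserting both what should be reachable and, critically, what should not.

```python
import socket

def port_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """True if a TCP connection succeeds. For the model endpoint tested
    from outside Zone 2, the expected answer is False."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_network_policy(checks: list[tuple[str, int, bool]]) -> list[str]:
    """Each check is (host, port, expected_reachable). Returns violations."""
    violations = []
    for host, port, expected in checks:
        if port_reachable(host, port) != expected:
            state = "reachable" if expected else "NOT reachable"
            violations.append(f"{host}:{port} should be {state}")
    return violations
```

Run this from a vantage point outside Zone 2 in CI: the gateway port should be the only expected-reachable entry, and the model, metrics, and health ports should all be expected-unreachable.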
Canary Deployment¶
When deploying model updates or configuration changes:
- Route a small percentage of traffic (1–5%) to the new deployment.
- Monitor guardrail block rates, Judge flag rates, and output token distributions.
- Compare against baseline metrics from the existing deployment.
- If anomaly metrics exceed thresholds, roll back automatically.
- Maintain the previous deployment as a hot standby for immediate rollback.
Model updates are not code updates. A new model version can change behaviour in ways that bypass existing guardrails. Treat model deployment with the same caution as infrastructure changes.
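The automatic rollback decision in the canary steps can be sketched as a baseline comparison. The metric names and the 3x anomaly threshold below are illustrative assumptions - tune them against your own guardrail and Judge baselines.

```python
def canary_verdict(baseline: dict, canary: dict,
                   max_ratio: float = 3.0) -> tuple[bool, list[str]]:
    """Return (should_rollback, breached_metrics) by comparing canary
    metrics against the stable deployment's baseline."""
    breaches = []
    for metric in ("guardrail_block_rate", "judge_flag_rate"):
        base = max(baseline.get(metric, 0.0), 1e-6)   # avoid divide-by-zero
        if canary.get(metric, 0.0) / base > max_ratio:
            breaches.append(metric)
    return bool(breaches), breaches
```

A ratio-based threshold is deliberate: absolute block rates vary by workload, but a canary that blocks or flags several times more often than the stable version is misbehaving regardless of the absolute numbers.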
PACE Resilience for Endpoints¶
Apply the PACE resilience model to endpoint availability:
| State | Condition | Endpoint Behaviour |
|---|---|---|
| Primary | All layers operational | Full inference with guardrails, Judge evaluation, and human oversight |
| Alternate | Guardrails degraded | Switch to cached/static responses for high-risk queries. Continue serving low-risk queries with Judge-only evaluation. |
| Contingency | Model endpoint degraded | Return graceful error messages. Queue requests for delayed processing. Activate backup model if available. |
| Emergency | Confirmed compromise | Kill switch. All traffic rejected with maintenance response. Incident response activated. |
The kill switch must be infrastructure-level (gateway config or DNS), not dependent on the model endpoint itself being responsive.
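The PACE table above maps naturally to a small state selector. The health signals are assumed to come from your monitoring stack; the ordering encodes the priority in the table - confirmed compromise always wins, then model health, then guardrail health.

```python
from enum import Enum

class PaceState(Enum):
    PRIMARY = "primary"
    ALTERNATE = "alternate"
    CONTINGENCY = "contingency"
    EMERGENCY = "emergency"

def select_pace_state(guardrails_healthy: bool, model_healthy: bool,
                      compromise_confirmed: bool) -> PaceState:
    """Map component health to the PACE state table."""
    if compromise_confirmed:
        return PaceState.EMERGENCY      # kill switch: reject all traffic
    if not model_healthy:
        return PaceState.CONTINGENCY    # graceful errors / backup model
    if not guardrails_healthy:
        return PaceState.ALTERNATE      # cached responses for high-risk queries
    return PaceState.PRIMARY
```

Keeping this decision in gateway-level code (not in the model path) is what preserves the infrastructure-level kill switch: the Emergency branch must not depend on the model endpoint responding.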
Monitoring and Alerting¶
Endpoint-specific signals to feed into the SOC Content Pack:
| Signal | Threshold | Severity |
|---|---|---|
| Error rate (5xx) | >5% sustained for 5 minutes | High |
| Latency (p99) | >2x baseline for 10 minutes | Medium |
| GPU memory utilisation | >90% sustained | Medium |
| Token throughput drop | >50% below baseline | High |
| Guardrail block rate spike | >3x baseline in 15 minutes | Medium |
| Authentication failure spike | >10 failures from single IP in 1 minute | High |
| Request from unknown consumer | Any | Medium |
| Quota approaching limit | >80% of allocated quota | Low |
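A subset of the threshold table can be sketched as a signal-evaluation pass. The metric field names are assumptions, and a real implementation would also track the sustain windows (e.g. ">5% for 5 minutes") rather than a point-in-time reading as shown here.

```python
def evaluate_signals(metrics: dict, baseline: dict) -> list[tuple[str, str]]:
    """Return (signal, severity) pairs per the threshold table.
    Point-in-time sketch: sustain windows are omitted for brevity."""
    alerts = []
    if metrics.get("error_rate_5xx", 0.0) > 0.05:
        alerts.append(("error_rate_5xx", "high"))
    if metrics.get("p99_latency_ms", 0.0) > 2 * baseline.get("p99_latency_ms", float("inf")):
        alerts.append(("p99_latency", "medium"))
    if metrics.get("gpu_mem_util", 0.0) > 0.90:
        alerts.append(("gpu_memory", "medium"))
    if metrics.get("quota_used_fraction", 0.0) > 0.80:
        alerts.append(("quota_approaching", "low"))
    return alerts
```

In production this logic belongs in your alerting platform (Prometheus rules, SIEM correlation searches), with the severity labels mapped onto the SOC Content Pack's triage tiers.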
Three-Layer Mapping¶
| Hardening Area | Guardrails | Model-as-Judge | Human Oversight |
|---|---|---|---|
| Gateway | Guardrails enforced at gateway - cannot be bypassed | Gateway logs feed Judge evaluation | Humans configure gateway policies |
| Rate limiting | Token-aware limits prevent resource abuse | Judge detects cost anomalies that bypass rate limits | Humans set per-consumer quotas |
| Model endpoint | Guardrails inline on request/response path | Judge evaluates output quality independently | Humans approve model version changes |
| Tool servers | Parameter validation on tool inputs | Judge evaluates tool invocation patterns | Humans define tool allowlists |
| Deployment | Canary deployment monitors guardrail effectiveness | Judge metrics compared between old and new versions | Humans approve production rollout |
| PACE | Guardrail degradation triggers Alternate mode | Judge degradation triggers Contingency escalation | Human-activated Emergency kill switch |
OWASP Mapping¶
| Hardening Area | OWASP LLM Risk | OWASP Agentic Risk |
|---|---|---|
| Gateway rate limiting | LLM10: Unbounded Consumption | - |
| Input validation | LLM01: Prompt Injection | ASI01: Agent Goal Hijack |
| Output scanning | LLM02: Sensitive Information Disclosure | ASI02: Tool Misuse and Exploitation |
| Token limiting | LLM10: Unbounded Consumption | - |
| Model version pinning | LLM04: Data and Model Poisoning | ASI04: Agentic Supply Chain Compromise |
| Tool server hardening | LLM07: System Prompt Leakage | ASI02: Tool Misuse and Exploitation |
| Private endpoints | LLM06: Excessive Agency | ASI05: Unexpected Code Execution |
| Kill switch | - | ASI08: Cascading Agent Failures |
Related¶
- Network & Segmentation - Zone architecture and traffic rules
- Identity & Access Management - Authentication and authorization controls
- Secrets & Credentials - Credential management for AI components
- Data Protection - Data classification and protection in AI pipelines
- SOC Content Pack - Detection rules for endpoint monitoring
- SOC Integration - SOC architecture and triage procedures
- Bypass Prevention - Guardrail bypass taxonomy and defences