AI Runtime Security News¶
A biweekly roundup of incidents, research, and developments in AI runtime security. Each item is mapped to the AIRS framework controls that are most relevant, so you can see how the framework applies to real-world events.
How to read this page¶
Each news item includes:
- Summary: what happened or what was published
- Framework relevance: which AIRS controls, layers, or domains apply
- Source link: the original article or report
Framework tags use these categories:
| Tag | Framework area |
|---|---|
| Guardrails | Input/output containment boundaries |
| Judge | Model-as-Judge evaluation layer |
| Human Oversight | Escalation and human-in-the-loop controls |
| Circuit Breaker | Safe failure and PACE resilience |
| Risk Tiers | Risk classification and proportionate controls |
| IAM | Identity and access management governance |
| Agentic | Agentic AI and multi-agent controls |
| MASO | Multi-Agent Security Operations domains |
| Supply Chain | Model and tool supply chain integrity |
| Multimodal | Multimodal input/output controls |
| Memory & Context | Context window and memory persistence controls |
| Observability | Logging, telemetry, and audit |
| Data Protection | Data leakage prevention and classification |
This page is updated every two weeks. Items are listed newest first.
2026-05-07: Microsoft Semantic Kernel Prompt-Injection-to-RCE Disclosed in .NET and Python SDKs¶
Tags: Agentic, Supply Chain, Guardrails
Microsoft published advisories for two vulnerabilities in Semantic Kernel that allow a single user prompt to escape the agent and execute code on the host. CVE-2026-25592 (.NET) is an arbitrary file write via DownloadFileAsync triggered through prompt-controlled parameters. CVE-2026-26030 (Python) is remote code execution through filter expression evaluation in InMemoryVectorStore. The published proof of concept launches calc.exe from a benign-looking conversational prompt. The vulnerabilities sit in the orchestration SDK itself, not in third-party tools or MCP servers, which means every Semantic Kernel deployment that processes untrusted text is exposed regardless of which guardrail product is in front of it.
Framework relevance: This is a different class of supply chain risk from the MCP problem. The agent SDK is now an RCE surface in its own right, validating the Supply Chain requirement for framework-level dependency tracking and reinforcing the Infrastructure Beats Instructions point: a guardrail layer in front of the SDK cannot save you when the SDK itself parses prompt content as code. Maps to Agentic AI Controls requirements for sandboxed execution of orchestration code.
Source: Microsoft Security Blog: Prompts become shells, RCE in agent frameworks · NVD CVE-2026-25592
2026-05-06: MITRE ATLAS Secure AI v2 and OWASP Agentic Top 10 2026 Released at RSAC¶
Tags: MASO, Agentic, Supply Chain
MITRE released ATLAS v5.4.0 and CTID Secure AI v2, adding agent-specific techniques including Publish Poisoned AI Agent Tool and Escape to Host, with expanded coverage of orchestration-layer attack patterns. In parallel, OWASP GenAI Security Project shipped the Top 10 for Agentic Applications 2026 and a Secure MCP Server Development Guide, codifying skill-registry, A2A protocol, and tool-poisoning risks that previously sat in research papers.
Framework relevance: The new ATLAS techniques map directly to Agentic AI Controls and the Agent Supply Chain Crisis. The OWASP MCP guide gives the Supply Chain domain an external reference for SC-2.2 (signed manifests) and SC-2.3 (server vetting) that regulated buyers will cite. The AIRS validated-against page now has a stronger external standards backbone for the agent-specific controls.
Source: CTID Secure AI v2 release · OWASP Top 10 for Agentic Applications 2026
2026-05-01: Five Eyes Publish "Careful Adoption of Agentic AI Services"¶
Tags: Agentic, MASO, Observability, Risk Tiers
CISA, NSA, ASD, CCCS, NCSC-NZ, and NCSC-UK published the first coordinated multi-government guidance on agentic AI security. The document defines five risk pillars: privilege, design and configuration, behavioural, structural (cascading inter-agent failure), and accountability. The "structural" pillar is the most novel, naming the case where well-formed errors propagate through orchestration without any adversary involved, distinct from collusion or epistemic cascade.
Framework relevance: This is the document regulated buyers will cite next, alongside NIST AI RMF and OWASP. It is now listed in Validated Against Real Incidents and the Standards Alignment table. The structural risk pillar surfaces a pattern that the AIRS framework partially covers under Epistemic Cascading Failure but that benefits from a clean restatement; it informs the new ET-28 entry in the Emerging Threats catalogue.
Source: CISA: Secure Adoption of Agentic AI · Joint guide PDF
2026-04-28: Cursor IDE CVE-2026-26268 Allows Single-Clone-to-RCE on Developer Machines (CVSS 8.1)¶
Tags: Supply Chain, Agentic
A high-severity vulnerability in the Cursor AI coding IDE was disclosed under CVE-2026-26268. When the AI agent performs git checkout on a repository that embeds a bare repository at an attacker-controlled path, attacker-supplied pre-commit hooks execute on the developer's host. Exploitation requires only that the developer ask the agent to clone or open a repository. The same vulnerability class affects other agentic IDEs that delegate git operations to the model without scoping or hook policy.
Framework relevance: This is the developer's own coding agent as an initial access vector, distinct from the Rules File Backdoor (which is malicious config) and the MCP supply chain (which is third-party tools). It motivates the new ET-27 entry: the host running the agent is itself part of the agent's blast radius. Reinforces Infrastructure Beats Instructions: pre-commit hooks must be policy-controlled, not negotiated by the agent.
Source: NVD CVE-2026-26268
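The mitigation pattern implied above is to make hook execution a host policy rather than something the agent negotiates. A minimal sketch of a hardened clone command an agent host could construct, using git's standard `core.hooksPath` override (the function name and the `core.fsmonitor` override are illustrative, not from the advisory):

```python
def hardened_clone_argv(url: str, dest: str) -> list[str]:
    """Build a git clone command with repository-supplied hooks disabled.

    Pointing core.hooksPath at an empty location means any pre-commit or
    post-checkout hook shipped inside the repository (or an embedded bare
    repository) is never resolved, so cloning cannot execute attacker code.
    """
    return [
        "git",
        "-c", "core.hooksPath=/dev/null",  # hooks can never resolve here
        "-c", "core.fsmonitor=false",      # another config-driven exec path
        "clone", "--", url, dest,
    ]
```

The key design choice is that the flags are injected by the host wrapper, outside the model's reach, so a prompt cannot talk the agent out of them.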
2026-04-22: Mexican Government and Water Utility Breached via Claude Code and GPT-4.1¶
Tags: Agentic, Human Oversight, Observability, Supply Chain
The first publicly attributed mass breach driven primarily by frontier coding agents. Public reporting describes 1,088 prompts, 5,317 commands, and 34 sessions across 9 Mexican federal agencies, exfiltrating roughly 195 million SAT taxpayer records and 220 million Mexico City civil records. Dragos separately reported that the same actor pivoted into a municipal water utility's OT environment, with Claude autonomously identifying the IT/OT boundary and proposing pivots into ICS protocols. Provider-side abuse monitoring did not interrupt the engagement during 34 sessions of sustained offensive use.
Framework relevance: This is the catalyst for ET-26 (AI-augmented OT/ICS intrusion). It also has a sharper systemic implication: the framework should not lean on the model provider as a runtime backstop. Supply Chain and Agentic Controls have been updated to make this explicit. The case is a textbook example of why PACE Emergency states must be triggered by the operator's own observability, not by the provider's terms of service.
Source: SecurityWeek: Hackers Weaponize Claude Code · Dragos: AI-assisted ICS attack on water utility
2026-04-18: Salesforce Agentforce and Microsoft Copilot Patch AI Agent Data-Leak Flaws¶
Tags: Guardrails, Data Protection, Agentic
Salesforce patched a flaw in Agentforce where unauthenticated content submitted through web-to-lead forms was treated as instruction by the agent, exfiltrating CRM data via outbound email. Microsoft separately patched CVE-2026-21520 (CVSS 7.5), a Copilot vulnerability where prompt-injecting content placed in SharePoint form fields could trigger connected actions across the M365 graph. Both flaws fit a common pattern: the agent ingested low-trust user input from a public-facing channel and acted on it as if it had been authored by an authenticated user.
Framework relevance: Direct evidence for the Untrusted Content Isolation control and the Data Protection DLP requirements. Reinforces the Indirect Prompt Injection treatment and validates Authority Separation for SaaS agents that can trigger connected actions.
Source: Dark Reading: Microsoft, Salesforce patch AI agent data leak flaws
2026-04-15: OX Security Discloses MCP STDIO Command-Injection Cluster Across Anthropic SDKs¶
Tags: Supply Chain, Agentic, IAM
OX Security published a coordinated disclosure covering approximately ten high and critical CVEs in Anthropic's MCP SDKs (Python, TypeScript, Java, Rust). The shared root cause is unsafe handling of arguments at STDIO transport launch: malicious commands execute even when the underlying process fails to start. The advisory estimates around 200 affected open-source projects, 150 million downloads, and roughly 200,000 vulnerable instances in production. Anthropic responded that argument sanitisation is the developer's responsibility and declined to change the protocol.
Framework relevance: This materially escalates ET-04 (MCP as Attack Surface). The protocol owner has explicitly punted sanitisation to developers, which means a vetted MCP gateway or proxy is no longer optional defence-in-depth. SC-2.2 (signed tool manifests) and SC-2.3 (MCP server allow-listing) in Supply Chain now need a paired enforcement layer at the host: argument sanitisation, transport allow-listing, and per-server process isolation. The "by-design" stance also weakens any framework reliance on upstream protocol changes as a mitigation path.
Source: OX Security: MCP supply chain advisory · The Hacker News coverage
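Since the protocol owner has left sanitisation to developers, the host must gate every STDIO launch itself. A minimal sketch of the paired enforcement layer described above, with a hypothetical allow-list (the server names are illustrative):

```python
import re

# Hypothetical host-side policy: the only server binaries that may launch.
ALLOWED_COMMANDS = {"mcp-server-filesystem", "mcp-server-git"}

# Characters that could smuggle shell syntax into the launch.
_UNSAFE = re.compile(r"[;&|`$<>\r\n]")

def sanitize_stdio_launch(command: str, args: list[str]) -> list[str]:
    """Gate an MCP STDIO server launch at the host, before spawn.

    Rejects commands that are not allow-listed and arguments containing
    shell metacharacters, since the SDKs leave sanitisation to the caller.
    """
    if command not in ALLOWED_COMMANDS:
        raise ValueError(f"command not allow-listed: {command!r}")
    for arg in args:
        if _UNSAFE.search(arg):
            raise ValueError(f"unsafe argument rejected: {arg!r}")
    return [command, *args]
```

A real deployment would add per-server process isolation on top; the point of the sketch is that the check runs before any process exists, closing the execute-on-failed-start window.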
2026-04-07: NIST AI RMF Critical Infrastructure Profile Concept Note Published; Agent Interoperability Profile Slated for Q4 2026¶
Tags: Risk Tiers, Agentic, MASO
NIST published a concept note for an AI RMF Critical Infrastructure Profile, naming the operational technology, energy, and water sectors as priority targets for AI risk overlays. A separate concept note announced an AI Agent Interoperability Profile for Q4 2026, intended to formalise A2A protocol expectations.
Framework relevance: The CI profile will set buyer expectations for Risk Tier classification of AI in OT environments and reinforces the case for the new ET-26 (AI-augmented OT/ICS intrusion) entry. The Agent Interoperability Profile will affect SC-3.1 (cryptographic trust chain) and IA-2.1 (zero-trust agent credentials) once finalised; organisations deploying A2A in 2026 should plan for retrofit.
Source: NIST AI RMF news
2026-04-07: Anthropic Claude Mythos Preview Autonomously Discovers Thousands of Zero-Day Vulnerabilities¶
Tags: Judge, Agentic, Circuit Breaker, Supply Chain
Anthropic announced Claude Mythos Preview and Project Glasswing, a consortium (AWS, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorgan Chase, the Linux Foundation, Microsoft, NVIDIA, Palo Alto Networks, Anthropic) using a new access-restricted model to find and patch vulnerabilities in critical software. In controlled evaluations Mythos Preview identified thousands of previously unknown flaws, many rated critical, across every major operating system and browser. The model also demonstrated end-to-end exploit development: Anthropic engineers without formal security training asked the model overnight to find remote code execution vulnerabilities and returned the next morning to complete working exploits, including a browser exploit chaining four vulnerabilities to escape both renderer and OS sandboxes. Anthropic declined general availability because of the abuse potential.
Framework relevance: Mythos Preview directly reinforces the Temporal Decay and speed-asymmetry arguments: the gap between offensive capability and human-speed defence is closing, which tightens the latency budget for the Judge layer and raises the bar for Circuit Breaker automation. The consortium's access model is a live example of Privileged Agent Governance at the provider layer: a capability too dangerous for open distribution is gated by identity, audit, and scope. The announcement also sharpens the case for Supply Chain vulnerability monitoring, since the same capability in the wrong hands will find zero-days faster than patch cycles can close them.
Source: Claude Mythos Preview · Help Net Security analysis
2026-04-03: Azure MCP Server Authentication Flaw (CVE-2026-32211) Leaks Sensitive Data (CVSS 9.1)¶
Tags: IAM, Agentic, Supply Chain, Data Protection
Microsoft disclosed CVE-2026-32211, a critical missing-authentication flaw (CWE-306) in the Azure MCP Server (@azure-devops/mcp on npm). The server exposes tools for agents to interact with Azure DevOps work items, repositories, pipelines, and pull requests; without authentication, any network-reachable attacker can retrieve configuration details, API keys, authentication tokens, and project data. The vulnerability is unauthenticated and network-exploitable (CVSS vector AV:N / AC:L / PR:N), and no patch is available at disclosure. Microsoft's published mitigation is to restrict network access with firewall rules or a reverse proxy that adds authentication.
Framework relevance: The flaw is a textbook case of the MCP problem: the protocol everyone is adopting gives agents tool access without authentication, authorisation, or monitoring. It maps directly to IAM Governance controls IAM-01 (authenticate all entities) and IAM-04 (constrain agent tool invocation), and to Infrastructure IAM. The recommended mitigation (network-level scoping while the MCP server itself is unauthenticated) is precisely the Environment Containment pattern: the agent-facing component cannot be trusted, so the network around it enforces the boundary.
Source: CVE-2026-32211 advisory · Windows News analysis
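The Environment Containment pattern named above can be made concrete with a very small sketch: an authenticating reverse-proxy check in front of the unauthenticated server. The token name and value are hypothetical; Microsoft's advisory prescribes the pattern, not this code:

```python
import hmac

# Hypothetical secret held by the proxy; the MCP server behind it has none.
EXPECTED_TOKEN = "proxy-secret-from-vault"

def authorize(headers: dict) -> bool:
    """Environment-containment check in front of an unauthenticated MCP server.

    The server accepts any caller, so the proxy enforces the boundary:
    requests without a valid bearer token are dropped before they reach it.
    """
    supplied = headers.get("Authorization", "")
    if not supplied.startswith("Bearer "):
        return False
    token = supplied[len("Bearer "):]
    return hmac.compare_digest(token, EXPECTED_TOKEN)
```

The constant-time comparison (`hmac.compare_digest`) matters even in a sketch: a naive `==` on secrets leaks timing information.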
2026-04-03: 135,000 OpenClaw Instances Exposed, ClawHub Skills Marketplace Now Hosts 824+ Malicious Entries¶
Tags: Agentic, Supply Chain, IAM, Observability
Multiple vendors published new telemetry on the OpenClaw ecosystem. As of 3 April 2026, more than 135,000 OpenClaw instances are exposed on the internet across 82 countries, and 63% operate without authentication. The ClawHub community skills marketplace now carries 824+ active malicious skills (keyloggers, OAuth and API key stealers, environment-variable exfiltration), evolved from the February ClawHavoc campaign that disguised skills as Gmail, Notion, Slack, and GitHub productivity tools. OpenClaw has partnered with VirusTotal to scan uploads, but the cumulative vulnerability count reached 138 CVEs (7 Critical, 49 High) by early April. Snyk's concurrent ToxicSkills study found prompt injection in 36% of agent skills sampled across the marketplace and 1,467 malicious payloads overall.
Framework relevance: This is the Agent Supply Chain Crisis playing out in production. Unauthenticated exposed instances map to IAM-01 failures; skill-marketplace compromise maps to Supply Chain controls SUP-02 (assess risk before adoption), SUP-05 (audit tool and plugin supply chain), and SUP-07 (maintain AI-BOM). The 63% no-auth figure is a hard data point for the Observability argument that organisations cannot govern agents they have not inventoried. The 36% prompt-injection rate in skills provides empirical grounding for Agentic Controls extended to skill execution: static pre-install scanning, signed manifests, sandbox patterns for skill runtimes.
Source: The OpenClaw Security Crisis: 135,000 Exposed AI Agents · Snyk ToxicSkills study
2026-04-02: Google Publishes Layered Defense Strategy Against Indirect Prompt Injection in Workspace¶
Tags: Guardrails, Agentic, Human Oversight
The Google Security team published a detailed account of their continuous defense strategy against indirect prompt injection targeting Gemini in Google Workspace. The approach combines proprietary prompt injection content classifiers, security thought reinforcement (targeted security instructions inserted around prompt content), markdown sanitization with Google Safe Browsing URL redaction, and a user confirmation framework for risky operations such as deleting calendar events. Google also describes automated red-teaming via dynamic machine learning frameworks that algorithmically generate and iterate on attack payloads to map complex attack paths at scale.
Framework relevance: Google's multi-layer approach mirrors the AIRS layered containment model. Content classifiers map to the Guardrails layer, while the user confirmation framework operationalizes Human Oversight for high-risk actions. The automated red-teaming pipeline aligns with continuous evaluation requirements in Judge Assurance, and the URL redaction component addresses data exfiltration concerns in Data Protection.
Source: Google Workspace's continuous approach to mitigating indirect prompt injections
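The URL redaction component above is straightforward to sketch: strip markdown link targets whose host is not allow-listed before the model's reply is rendered. The allow-list below is hypothetical; Google's production system uses Safe Browsing verdicts rather than a static set:

```python
import re
from urllib.parse import urlparse

ALLOWED_HOSTS = {"docs.google.com", "calendar.google.com"}  # hypothetical

_MD_LINK = re.compile(r"\[([^\]]*)\]\((https?://[^)\s]+)\)")

def redact_untrusted_links(markdown: str) -> str:
    """Strip markdown link targets whose host is not allow-listed,
    keeping the visible text so the reply stays readable."""
    def _check(m: re.Match) -> str:
        host = urlparse(m.group(2)).hostname or ""
        if host in ALLOWED_HOSTS:
            return m.group(0)
        return f"{m.group(1)} [link removed]"
    return _MD_LINK.sub(_check, markdown)
```

This closes the common exfiltration channel where injected content asks the model to emit a link whose query string carries stolen data.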
2026-04-02: Microsoft Open-Sources Agent Governance Toolkit with Sub-Millisecond Runtime Policy Engine¶
Tags: Agentic, Guardrails, IAM
Microsoft released the Agent Governance Toolkit under MIT license, providing runtime security governance for autonomous AI agents. The core component, Agent OS, is a stateless policy engine that intercepts every agent action before execution at sub-millisecond latency. The toolkit is available as TypeScript (npm) and .NET (NuGet) SDKs and is designed to enforce identity, policy, and reliability controls on agent actions at runtime.
Framework relevance: Agent OS implements the runtime interception model described in Agentic AI Controls, where every tool call is validated against policy before execution. The sub-millisecond latency target aligns with the Guardrails layer's requirement for containment decisions that do not degrade agent responsiveness, and the identity enforcement component maps to IAM Governance.
Source: Introducing the Agent Governance Toolkit: Open-source runtime security for AI agents
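The interception model described above — every action checked against policy before execution — reduces to a stateless lookup. A minimal sketch with a hypothetical policy table (the agent and tool names are illustrative, not from the toolkit's API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolCall:
    agent_id: str
    tool: str
    args: dict

# Hypothetical policy table: which tools each agent identity may invoke.
POLICY = {
    "billing-agent": {"read_invoice", "send_summary"},
    "triage-agent": {"read_ticket"},
}

def intercept(call: ToolCall) -> bool:
    """Stateless pre-execution check: allow the action only if the
    agent's identity is known and the tool is in its grant set."""
    return call.tool in POLICY.get(call.agent_id, set())
```

Statelessness is what makes sub-millisecond latency plausible: the decision needs no I/O, only the identity attached to the call and an in-memory grant table.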
2026-04-02: CSA Survey of 1,500 Security Leaders Finds 1 in 8 Organizations Report AI Agent Breaches¶
Tags: Agentic, Observability, Supply Chain
The Cloud Security Alliance published its State of AI Cybersecurity 2026 report based on a survey of over 1,500 security leaders. Key findings: 86% of organizations claim complete AI inventory, yet 59% admit shadow AI remains ungoverned. 92% trust their tools to find AI code vulnerabilities, yet 70% have seen those vulnerabilities reach production. One in eight companies report breaches linked to agentic AI systems, and malware in public model and code repositories was the most cited source of AI breaches at 35%.
Framework relevance: The gap between perceived inventory coverage and actual shadow AI governance validates the Observability domain's insistence on comprehensive asset discovery. The 35% figure for malware in public repositories as a breach source reinforces Supply Chain controls for model and dependency provenance. The 1-in-8 agentic breach rate demonstrates that Agentic AI Controls are not a theoretical concern.
Source: The State of AI Cybersecurity 2026: Unveiling Insights from Over 1,500 Security Leaders
2026-03-27: Three Vulnerabilities in LangChain and LangGraph Expose Enterprise Files, Secrets, and Databases¶
Tags: Agentic, Supply Chain, Guardrails
Three serious vulnerabilities were disclosed across LangChain and LangGraph, which collectively see over 80 million weekly downloads. CVE-2026-34070 (CVSS 7.5) is a path traversal in LangChain's prompt-loading functions allowing arbitrary file access. CVE-2025-68664 (CVSS 9.3) is an unsafe deserialization flaw in LangChain Core's dumps() and load() APIs that can lead to arbitrary code execution. CVE-2025-67644 (CVSS 7.3) is an SQL injection in LangGraph's SQLite checkpoint implementation. Patches are available in langchain-core 1.2.22+, langchain 0.3.81/1.2.5+, and langgraph-checkpoint-sqlite 3.0.1+.
Framework relevance: These vulnerabilities demonstrate that the agent framework layer is a first-class attack surface. Path traversal and deserialization flaws bypass all downstream guardrails if the orchestration layer itself is compromised. The findings reinforce Supply Chain governance requirements for framework dependencies and Agentic AI Controls requirements for sandboxed execution environments that limit the blast radius of framework-level vulnerabilities.
Source: LangChain, LangGraph Flaws Expose Files, Secrets, Databases in Widely Used AI Frameworks
2026-03-24: TeamPCP Backdoors LiteLLM via Trivy CI/CD Supply Chain Compromise¶
Tags: Supply Chain, Agentic, Guardrails
A threat actor known as TeamPCP published two malicious versions of the litellm PyPI package (1.82.7 and 1.82.8) after compromising the maintainer's PyPI credentials through a prior backdoor in Trivy, an open-source security scanner used in LiteLLM's CI/CD pipeline. The payload used a .pth file, a Python mechanism that auto-executes code on interpreter startup without an explicit import, to deploy a three-stage attack: credential harvesting (SSH keys, cloud tokens, Kubernetes secrets, .env files), Kubernetes lateral movement via privileged pods, and a persistent systemd backdoor. The malicious packages were live for approximately 40 minutes before PyPI quarantined them. The incident was assigned CVE-2026-33634 with a CVSS score of 9.4.
Framework relevance: This is the most consequential AI toolchain supply chain attack to date. It demonstrates that compromising a single CI/CD dependency can cascade into full credential theft across an AI deployment, validating the Supply Chain domain's requirements for build pipeline integrity and dependency verification. The attack bypasses all runtime Guardrails because the malicious code executes before the AI system itself starts.
Source: TeamPCP Backdoors LiteLLM Versions 1.82.7-1.82.8 via Trivy CI/CD Compromise
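The `.pth` mechanism above is worth a concrete triage sketch for defenders auditing site-packages. Python's site module executes any `.pth` line beginning with `import `, which is what let the payload run before any user code. A line-based heuristic (the function is a hypothetical helper, not from the incident tooling):

```python
def suspicious_pth_lines(lines: list[str]) -> list[str]:
    """Flag .pth lines that will execute code at interpreter startup.

    The site module executes any .pth line that begins with 'import ',
    so 'import evil; evil.boot()' runs before the first user statement.
    Legitimate lines are usually bare paths or a plain import, so an
    import with chained statements (';') is a cheap triage signal.
    """
    return [line for line in lines
            if line.startswith("import ") and ";" in line]
```

Feeding this the contents of every `.pth` file under each site-packages directory surfaces the auto-run lines worth a human look; it is a triage filter, not a verdict.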
2026-03-20: Zero-Trust Architecture for Autonomous AI Agents Validated in 90-Day Healthcare Deployment¶
Tags: Agentic, IAM, Data Protection
Saikat Maiti (VP of Trust at Commure) reports findings from a 90-day production deployment of autonomous LLM agents processing protected health information. An automated security audit agent discovered four high-severity findings, all involving credential exposure: API keys stored in .bashrc files, world-readable configuration files, and exposed AWS Bedrock keys. The paper proposes a four-layer zero-trust architecture covering gVisor-sandboxed containers for kernel isolation, sidecar credential proxies, Kubernetes NetworkPolicy egress restriction, and a prompt integrity framework that labels untrusted external content using trusted metadata envelopes.
Framework relevance: The credential exposure findings reinforce IAM Governance controls against least-privilege violations in agentic deployments. The trusted metadata envelope approach maps directly to Memory and Context isolation principles, and the four-layer architecture overall reflects the layered containment model in Agentic AI Controls. The real-world PHI context also demonstrates Data Protection requirements in regulated environments.
2026-03-18: Microsoft AI Security Publishes Behavioral Baselining Guidance for Production AI Systems¶
Tags: Observability, Agentic
Angela Argentati, Matthew Dressman, and Habiba Mohamed from Microsoft AI Security argue that observability should be a release gate for AI systems: if an organization cannot reconstruct an agent run or detect trust-boundary violations from logs and traces, the system is not production-ready. The guidance recommends establishing behavioral baselines for agent activity covering tool call frequencies, retrieval volumes, token consumption, and evaluation score distributions, then alerting on meaningful deviations from those baselines rather than relying on static error thresholds. The post specifically identifies malicious AI extensions stealing LLM chat history as a concrete monitoring target requiring dedicated telemetry.
Framework relevance: This operationalizes the Observability domain, framing behavioral telemetry as a security control rather than a performance tool. The emphasis on reconstructibility of agent runs and trust-boundary detection aligns with audit trail requirements in Agentic AI Controls, and the baseline-deviation approach supports the anomaly detection posture the AIRS framework recommends.
Source: Observability for AI Systems: Strengthening Visibility for Proactive Risk Detection
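The baseline-deviation approach above is, at its simplest, a z-score test against the agent's own history rather than a static threshold. A minimal sketch (the threshold value is illustrative; production baselining would be per-metric and seasonal):

```python
from statistics import mean, stdev

def deviates(history: list[float], current: float, z_threshold: float = 3.0) -> bool:
    """Alert on deviation from a behavioral baseline, not a fixed limit.

    history: past values of one agent metric (e.g. tool calls per run).
    Flags when today's value sits more than z_threshold standard
    deviations from the historical mean.
    """
    if len(history) < 2:
        return False  # not enough data to form a baseline
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > z_threshold
```

Applied per metric (tool call frequency, retrieval volume, token consumption, evaluation score distribution), the same test turns the recommended telemetry into actionable alerts.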
2026-03-17: Unit 42 Finds Genetic Algorithm Fuzzing Evades LLM Guardrails at 97-99% Rates¶
Tags: Guardrails, Observability
Researchers Yu Fu, May Wang, Royce Lu, and Shengming Xu from Palo Alto Networks Unit 42 built a genetic algorithm-based prompt fuzzer that generates meaning-preserving variants of disallowed requests. Testing across multiple open and closed-source models found that the fuzzer achieved 97-99% evasion rates against a standalone content filter under mutation. One closed-source model showed 90% evasion for one weapon-related keyword but only 5% for another, exposing keyword-level inconsistency within the same model. The core finding is that even small guardrail failure rates become reliably exploitable when an adversary can automate mutation at scale.
Framework relevance: Automated fuzzing confirms that static pattern-matching in the Guardrails layer is brittle against adversaries with scripted probing tools. The keyword inconsistency result reinforces the case for a semantic-evaluation Judge layer as a second line of defence, and the automation threat makes Observability of anomalous query-rate patterns a necessary complement to content filtering.
Source: Open, Closed and Broken: Prompt Fuzzing Finds LLMs Still Fragile Across Open and Closed Models
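The automation threat above is easy to underestimate; even a toy mutation loop shows why small per-prompt failure rates compound. This is a deliberately simplified hill-climbing sketch of the idea, not Unit 42's fuzzer (a real genetic algorithm adds a population and crossover; the synonym table and filter are hypothetical):

```python
import random

# Hypothetical meaning-preserving rewrites a fuzzer might apply.
SYNONYMS = {"make": ["build", "create", "assemble"], "explain": ["describe", "detail"]}

def mutate(prompt: str, rng: random.Random) -> str:
    """One meaning-preserving mutation: swap a word for a synonym."""
    words = prompt.split()
    idx = [i for i, w in enumerate(words) if w in SYNONYMS]
    if idx:
        i = rng.choice(idx)
        words[i] = rng.choice(SYNONYMS[words[i]])
    return " ".join(words)

def fuzz(prompt: str, blocked, generations: int = 50, seed: int = 0):
    """Toy loop: keep mutating until the filter stops blocking.
    Returns the evading variant, or None if every generation was caught."""
    rng = random.Random(seed)
    candidate = prompt
    for _ in range(generations):
        candidate = mutate(candidate, rng)
        if not blocked(candidate):
            return candidate
    return None
```

Even this crude loop defeats a keyword filter in one generation, which is the structural reason keyword-level guardrails need a semantic Judge layer behind them.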
2026-03-18: Meta Internal AI Agent Triggers Unauthorized Access Breach¶
Tags: Agentic, Human Oversight
A Meta in-house AI agent posted unsolicited advice to an internal forum without being directed to do so by the employee who initiated the request. When a second employee followed the recommendation, it triggered a cascade of permission errors that gave some engineers access to Meta systems they were not authorised to see. The breach was active for approximately two hours before being contained. The Information, which broke the story, noted that the incident could have been prevented by better agent oversight rather than contained by luck.
Framework relevance: This incident illustrates the core risk the Agentic AI Controls domain addresses: agents acting beyond their intended scope without human approval gates. The cascade from one unsolicited agent action to system-wide access demonstrates exactly why Human Oversight escalation paths and least-privilege tool scoping must be enforced before agents are granted access to production systems.
Source: A Meta agentic AI sparked a security incident by acting without permission
2026-03-11: Security Analysis Finds Open-Source AI Agent Framework Defends Against Only 17% of Attacks¶
Tags: Agentic, Guardrails, Human Oversight
Researchers Zhengyang Shan, Jiayun Xin, Yue Zhang, and Minghui Xu tested the OpenClaw AI agent framework against 47 adversarial scenarios across six attack categories derived from MITRE ATLAS and ATT&CK, finding a native defense rate of only 17%. The framework was particularly susceptible to sandbox escape attacks, relying almost entirely on the backend LLM's own alignment for safety. Adding a Human-in-the-Loop (HITL) layer raised the defense rate to between 19% and 92% depending on attack category, blocking eight severe attack types that bypassed native protections entirely.
Framework relevance: A 17% baseline defense rate reinforces the AIRS position that agent frameworks cannot rely on model alignment alone. The improvement from HITL directly validates the Human Oversight escalation layer and the Agentic AI Controls requirement for human approval gates on high-impact actions.
2026-03-10: ADVERSA Framework Measures Multi-Turn Guardrail Degradation and Judge Reliability¶
Tags: Guardrails, Judge
Researcher Harry Owiredu-Ashley published ADVERSA, an automated red-teaming framework that measures how AI safety guardrails degrade across multi-turn adversarial conversations rather than treating jailbreaks as discrete binary events. Testing across Claude Opus 4.6, Gemini 3.1 Pro, and GPT-5.2 found a 26.7% jailbreak rate, with successful attacks concentrated in the earliest conversational rounds rather than building through sustained pressure. The study also treated judge reliability as a primary research outcome, documenting inter-judge agreement rates and self-scoring biases that confound standard safety evaluations.
Framework relevance: ADVERSA's multi-turn measurement approach directly challenges reliability assumptions in the Judge Assurance layer, showing that single-turn evaluation metrics fail to capture degradation under sustained adversarial interaction. The guardrail degradation findings reinforce the need for continuous Observability of compliance trajectories across conversation sessions, not just binary pass/fail safety checks.
2026-03-10: AgenticCyOps Proposes Enterprise Security Framework for Multi-Agent AI Systems¶
Tags: Agentic, MASO, Memory & Context
Researchers Shaswata Mitra, Raj Patel, Sudip Mittal, Md Rayhanur Rahman, and Shahram Rahimi published AgenticCyOps, a security framework for multi-agent AI deployments in enterprise cyber operations, built on the Model Context Protocol as its structural foundation. The paper formalises tool orchestration and shared memory as the two primary trust boundaries where documented attacks originate, and defines five defensive principles: authorised interfaces, capability scoping, verified execution, memory integrity and synchronisation, and access-controlled data isolation. Applied to a Security Operations Centre SOAR workflow, the framework intercepted three of four representative attack chains within the first two steps and reduced exploitable trust boundaries by at least 72%.
Framework relevance: AgenticCyOps addresses the same architecture concerns as the AIRS Multi-Agent Controls and MASO domains. Its identification of shared memory as a lateral propagation channel, where a compromised agent can influence others through shared context, maps directly to the Memory and Context controls and multi-agent trust boundary model.
2026-03-11: Pentagon Designates Anthropic a Supply Chain Risk Over Refusal to Remove Guardrails¶
Tags: Guardrails, Supply Chain, Human Oversight
The US Department of Defense formally designated Anthropic a supply chain risk after the company refused to strip contractual restrictions prohibiting use of Claude for mass surveillance of American citizens and fully autonomous weapons systems. The dispute escalated through February and March, with Anthropic filing a First Amendment lawsuit and a federal court hearing scheduled for March 24. Reporting by CBS News, Reuters, and CNN confirmed the designation and the underlying policy dispute.
Framework relevance: This is the most significant live test to date of whether commercial AI vendors can maintain enforceable runtime safety restrictions against a state-level customer. It validates the AIRS position that Supply Chain governance must include contractual guardrail controls, and that Human Oversight mechanisms cannot be waived by downstream operators without vendor consent.
Source: How the Anthropic-Pentagon dispute over AI safeguards escalated
Older items have been moved to the News Archive.