Google DeepMind publishes AI agent security roadmap

Google DeepMind published its "AI Control Roadmap" on June 18, 2026 — a detailed framework for securing internal systems against AI agent behaviors that may deviate from intended goals. The document was authored by researchers from DeepMind's responsibility division: Rohin Shah, Mary Phuong, Erik Jenner, and collaborators.

Key takeaways

Document published June 18, 2026 by Google DeepMind — authors: Rohin Shah, Mary Phuong, and others
Framework treats internal AI agents as potential "insider threats," drawing on standard cybersecurity threat modeling
Google DeepMind analyzed one million coding agent trajectories to build the monitoring system
Protection levels scale with model capability: D1-D4 (detection) and R1-R3 (prevention and response)
Framework includes a companion policy document "Three Layers of Agent Security" directed at policymakers

Architecture: defense in depth

The starting point for the AI Control Roadmap is an assumption DeepMind states plainly: what if a highly capable AI agent is not fully aligned with the operator's goals and acts in unexpected ways? Rather than relying exclusively on training models for safe behavior (alignment), the document proposes an additional layer of system-level security.

The document's analogy: a driving instructor with dual controls. The instructor trusts the student but is ready to take the wheel or hit the brakes at any moment. The system grants AI agents permissions based on their verified behavior — building trust through controlled, incremental access.

The threat taxonomy is grounded in the industry standard MITRE ATT&CK framework — a globally accessible knowledge base of adversary tactics and techniques based on real-world observations. Breaking potential attacks into smaller components allows for methodical risk tracking and controlled attack-simulation drills.

Detection, prevention, and response — three axes of protection

Detection evasion capability scales from simple monitoring of visible chain-of-thought to analyzing a model's internal representations when it learns to hide its reasoning. Levels D1–D4 describe this axis from "passive monitoring" to "inspection of internal mechanisms."

Harm execution capability determines the required response speed. For low-risk actions, asynchronous review suffices — analyzing transcripts with a delay. For high-risk actions (e.g., potential cyberattacks), synchronous prevention is required: real-time blocking before harm occurs. Levels R1–R3 describe this axis from delayed response to immediate blocking.

In practice, DeepMind has already deployed a prototype asynchronous monitoring system that reviews coding agent trajectories. The system processes events and classifies them against the roadmap's threat taxonomy. This formed the basis for a live monitor for the Gemini Spark agent — enabling real-time response to events such as unintentional data deletion.

One million trajectories and an unexpected diagnosis

The key finding from analyzing one million coding agent tasks concerns the nature of flagged events. As the document states, "the majority of flagged events do not stem from adversarial intent — they often result from agent misinterpretation or overeagerness to achieve a user's goal." In other words: the main source of problems is not model malice but model over-helpfulness.

This shifts perspective significantly compared to scenarios focused on deliberate AI sabotage. The monitoring system must distinguish actual threats from an agent that interprets an instruction too broadly and, for example, deletes files it considers redundant because the "spirit" of the task seemed to require it.

Industry context and the policymaker document

DeepMind is not the first company to address the AI agent control problem — OpenAI, Anthropic, and several research institutes have published their own frameworks. The specificity of the AI Control Roadmap lies in its grounding in production data (one million trajectories), not just theory. This distinguishes it from purely academic documents.

Alongside the roadmap, DeepMind published a companion policy document, "Three Layers of Agent Security," directed at policymakers and regulators. It describes three levels requiring separate approaches: security of individual agents, security of multi-agent systems, and building resilience across the broader ecosystem.

Why this matters

The AI Control Roadmap from Google DeepMind is one of the first publicly available documents that attempts to close the gap between alignment theory and the operational security requirements for AI agents deployed at industrial scale. Technology companies are increasingly running autonomous agents in production environments — writing code, managing resources, handling correspondence. A security standard for these systems did not previously exist.

The document's value comes from two things simultaneously: its specificity (MITRE ATT&CK-based taxonomy, coverage/recall/time-to-response metrics) and its transparency toward the industry — DeepMind is publishing the framework as a potential model for the entire sector, not just an internal procedure.

The open question is how quickly other companies will adopt analogous approaches, and whether a unified industry standard will emerge before AI agents take over critical processes at more organizations.

What's next

Google DeepMind announced it will expand the framework as models advance to higher D and R levels
The "Three Layers of Agent Security" document is directed at policymakers and may serve as a foundation for sector-specific regulation
The deployed coding agent monitoring prototype has the potential to be extended to other types of agents within Google's infrastructure