AI Agents as Invisible Failure Initiators: Enterprises Don't Track These Incidents Yet

A category of production incident is emerging in enterprise systems that engineering teams aren't tracking yet. The pattern: an AI agent detects an anomaly and takes a technically correct action with incomplete context, the infrastructure cascades, and the postmortem ends in a three-team argument about whose fault it was. The next wave of major production incidents is coming neither from agents alone nor from chaos engineering alone (chaos engineering being the controlled injection of failure into production to verify system resilience), but from the space between them.

Key takeaways

79% of organizations have AI agents in production, 96% plan expansion (PwC 2026)
Gartner: 40% of agentic AI projects will be canceled by 2027 due to poor risk controls
AI-related incidents grew 21% from 2024 to 2025 per the AI Incidents Database
Remediation agents act like chaos engineering experiments — but without SLO burn rate checks, blast radius calculations, or humans in the loop
Author proposes a "resilience budget" model — a shared absorb capacity resource updated in real time by both experiments and agent actions

The judgment call agents skip

In mature organizations, chaos engineering is a structured process: an engineer checks dashboards, looks at the error budget burn rate, assesses dependency stability — and only then decides whether right now is the right moment to inject failure. It's a human judgment call, imperfect and intuitive, but it asks one key question: does the system have the capacity to absorb additional stress right now?

When an autonomous remediation agent (an AI system that fixes incidents without a human in the loop) enters the system — capable of restarting services, rerouting traffic, scaling resources — that question disappears. The agent sees an anomaly, takes an action. The action is a chaos engineering event. Without checking SLO burn rate (the rate at which the error budget defined by SLOs is being consumed). Without blast radius calculation (the systemic reach of the action’s consequences). Without human judgment about whether this is the right moment.
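To make the skipped check concrete, here is a minimal sketch of a burn rate calculation using the common definition: observed error rate divided by the error budget (1 minus the SLO target). The function name and the sample numbers are illustrative assumptions, not taken from Patil's analysis.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Return how fast the error budget is being consumed.

    A value of 1.0 means the budget is being spent exactly at the rate
    the SLO allows; values above 1.0 mean it is being spent faster.
    """
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    error_budget = 1.0 - slo_target               # allowed failure fraction
    observed_error_rate = bad_events / total_events
    return observed_error_rate / error_budget


# Hypothetical window: 50 failed requests out of 10,000 against a 99.9% SLO.
rate = burn_rate(50, 10_000, 0.999)
print(f"burn rate is roughly {rate:.1f}x the sustainable pace")
```

In this hypothetical window the budget is being consumed at about five times the sustainable pace, which is the kind of reading that would make a human engineer postpone an experiment.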

The concrete failure pattern

Sayali Patil, author of the analysis (formerly Cisco and Splunk, patent holder on intent-based chaos engineering methodology), describes a typical scenario: a remediation agent detects elevated latency on a microservice and restarts the cluster — a rational action given its training data. What the agent doesn't know: three other services are handling peak traffic, the shared connection pool is at 87% utilization, a dependent database is running a background index rebuild. The restart triggers a thundering herd against the recovering service.

The resulting blast radius doesn't cover just the service restart. It covers everything downstream of that restart, in a system state the agent never had a complete picture of. No chaos engineering program tested that specific combination. No blast radius calculation included the agent as an actor.
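The topological part of that blast radius can be sketched as a breadth-first walk over a dependency graph. The service names and graph shape below are hypothetical, and the sketch deliberately covers only static reachability; it says nothing about live state such as pool saturation or an index rebuild, which is exactly the context the agent in the scenario lacked.

```python
from collections import deque


def blast_radius(dependents: dict[str, list[str]], target: str) -> set[str]:
    """Return every service transitively downstream of `target`.

    `dependents` maps a service to the services that depend on it.
    The target itself is not included in the result.
    """
    seen: set[str] = set()
    queue = deque([target])
    while queue:
        service = queue.popleft()
        for downstream in dependents.get(service, []):
            if downstream != target and downstream not in seen:
                seen.add(downstream)
                queue.append(downstream)
    return seen


# Hypothetical topology: key -> services that depend on the key.
dependents = {
    "orders-db": ["orders-api"],
    "orders-api": ["checkout", "notifications"],
    "checkout": ["web-frontend"],
}
affected = blast_radius(dependents, "orders-api")
print(sorted(affected))
```

Restarting the hypothetical orders-api touches three further services, none of which appear in the agent's local view of "one service with elevated latency".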

Resilience budget as a solution

Based on primary research with SRE and platform engineering practitioners at organizations including Intuit and GPTZero, Patil proposed a resilience budget model — absorb capacity treated not as a static threshold but as a real-time consumable resource. Each agent action and each chaos engineering experiment draws from this resource. In multi-team organizations, the budget is shared.
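One way to picture a shared, consumable budget is a small ledger that both chaos experiments and agent actions must draw from before acting. This is a sketch under stated assumptions: the class name, the unit of "capacity", and the cost values are all hypothetical, and the source does not describe an implementation.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class ResilienceBudget:
    """Shared absorb capacity that experiments and agents both draw from."""

    capacity: float  # remaining absorb capacity, in arbitrary illustrative units
    ledger: list[tuple[str, float]] = field(default_factory=list)
    _lock: threading.Lock = field(
        default_factory=threading.Lock, init=False, repr=False
    )

    def try_draw(self, actor: str, cost: float) -> bool:
        """Reserve `cost` units for `actor`; refuse if the budget cannot cover it."""
        if cost < 0:
            raise ValueError("cost must be non-negative")
        with self._lock:
            if cost > self.capacity:
                return False
            self.capacity -= cost
            self.ledger.append((actor, cost))
            return True

    def update_capacity(self, new_capacity: float) -> None:
        """Reset remaining capacity from live signals (never below zero)."""
        with self._lock:
            self.capacity = max(0.0, new_capacity)


budget = ResilienceBudget(capacity=10.0)
experiment_ok = budget.try_draw("chaos-experiment:latency-injection", 6.0)
agent_ok = budget.try_draw("remediation-agent:restart", 6.0)
print(experiment_ok, agent_ok, budget.capacity)
```

Because the experiment already consumed most of the hypothetical budget, the agent's restart is refused rather than stacked on top of it. The lock matters because in a multi-team setup several actors may try to draw at the same moment.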

The model relies on four signal classes: SLO burn rate (the primary signal — directly encodes distance from committed SLOs), P99 latency trend (more important than absolute value), dependency saturation state (the most commonly missed signal — a connection pool at 87% is a different context than 30%), and application behavioral signals (session completion rates, API call pattern shifts, conversion degradation — visible before infrastructure metrics fire).
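The point about P99 trend mattering more than absolute value can be illustrated with a least-squares slope over recent samples. The function name, sampling interval, and latency values are assumptions for illustration only.

```python
def p99_trend(samples: list[tuple[float, float]]) -> float:
    """Least-squares slope of P99 latency over time.

    `samples` holds (timestamp_seconds, p99_ms) pairs; the result is in
    milliseconds per second. Positive means latency is climbing.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    numerator = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    denominator = sum((t - mean_t) ** 2 for t, _ in samples)
    if denominator == 0:
        raise ValueError("timestamps must not all be identical")
    return numerator / denominator


# Hypothetical one-minute samples for two services.
steady_but_high = [(0, 400.0), (60, 400.0), (120, 400.0), (180, 400.0)]
low_but_climbing = [(0, 120.0), (60, 140.0), (120, 160.0), (180, 180.0)]

print(p99_trend(steady_but_high), p99_trend(low_but_climbing))
```

The second service has the lower absolute P99 but the rising slope, so under this reading it is the one closer to trouble.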

LLMs for hypothesis generation, not execution decisions

Several organizations are already testing language models to generate chaos engineering hypotheses from dependency graphs and postmortem history. Results are directionally useful: LLMs surface plausible failure modes faster than manual processes. The hard limit is dependency graph staleness. The model doesn't know a service was extracted last month or a new shared library was added two sprints ago. It will be confidently wrong about the blast radius of an action on a system boundary that no longer exists.

Stanford's Trustworthy AI Research Lab adds a further caution: it reports that model-level guardrails were bypassed in the majority of tested fine-tuning attacks. A model that cannot maintain its own safety boundaries should not be trusted to accurately model the blast radius of actions it has never seen, least of all against an unverified dependency graph.

Why this matters

Patil's analysis surfaces an important conceptual gap: the industry deployed remediation agents into infrastructure without extending the risk governance models that until now covered only human engineers. The effect is analogous to installing an autopilot in a car without updating the ABS so it knows there are now two drivers. The data is alarming: 79% of organizations have agents in production, AI incidents grew 21% year-over-year, and yet there are no standard postmortem templates that include the agent as the cascade-initiating actor. This isn't a future problem. It is a present one, which most organizations classify under misleading technical labels. The solution doesn't require an architectural revolution. It requires connecting agents to the same live signal layer that already governs chaos engineering experiments.

What's next

Author recommends auditing every agent touching infrastructure: mapping its action surface against live SLO burn rate and defining floor conditions below which the agent must wait or escalate
Gartner projects 33% of enterprise software will include agentic AI by 2028 — without a resilience budget governance layer, that scale means proportional growth in invisible incidents
AI Incidents Database plans to expand classification to include agent actions as cascade initiators — enabling comparable safety benchmarking for agentic systems
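The first recommendation above, defining floor conditions below which an agent must wait or escalate, could look something like the following gate. This is a minimal sketch: the threshold values, the names, and the choice of burn rate plus connection pool utilization as inputs are assumptions for illustration, not values from the analysis.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    PROCEED = "proceed"
    WAIT = "wait"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class FloorConditions:
    # All thresholds are hypothetical and would be tuned per service.
    max_burn_rate: float = 2.0          # above this, the agent holds off
    max_pool_utilization: float = 0.80  # fraction of the shared pool in use
    escalate_burn_rate: float = 10.0    # at or above this, page a human


def gate_action(
    burn_rate: float,
    pool_utilization: float,
    floor: FloorConditions = FloorConditions(),
) -> Decision:
    """Decide whether an agent may act now, given live signals."""
    if burn_rate >= floor.escalate_burn_rate:
        return Decision.ESCALATE
    if (
        burn_rate > floor.max_burn_rate
        or pool_utilization > floor.max_pool_utilization
    ):
        return Decision.WAIT
    return Decision.PROCEED


# Same burn rate, different dependency saturation.
print(gate_action(1.2, 0.87))  # pool at 87% utilization
print(gate_action(1.2, 0.30))  # pool at 30% utilization
```

With these illustrative thresholds, the restart from the earlier scenario would be held back at 87% pool utilization and allowed at 30%, which mirrors the distinction the four-signal model draws.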