In November 2024, Google Project Zero and Google DeepMind announced that their joint AI agent, Big Sleep, had autonomously discovered an exploitable security vulnerability in widely used production software: the SQLite database engine. The teams described it as the first public example of an AI agent finding such a flaw. The bug was reported to the SQLite developers and fixed the same day, before it reached any official release.
Key takeaways
- Big Sleep is a collaboration between Google Project Zero and DeepMind — an evolution of Project Naptime, powered by Gemini 1.5 Pro
- Vulnerability found: exploitable stack buffer underflow in SQLite's seriesBestIndex function — reported and fixed in October 2024
- The bug never reached an official SQLite release — no users were exposed to risk
- 150 CPU-hours of traditional AFL fuzzing failed to find the same bug
- AI-powered OSS-Fuzz simultaneously found 26 open source vulnerabilities, including CVE-2024-9143 in OpenSSL — a bug likely present for two decades
From Project Naptime to Big Sleep
Big Sleep evolved from an earlier Google Project Zero research effort called Naptime, which assessed whether large language models could replicate offensive security techniques. After Naptime demonstrated improved performance on the CyberSecEval2 benchmark, Project Zero joined forces with Google DeepMind. The result — Big Sleep — is a full-fledged agent capable of variant analysis on real production code.
The agent works with a toolset that includes a debugger, a code browser, and a runtime environment for executing test cases. Its task is variant analysis: searching for bugs similar to ones previously found and patched. The underlying model is Gemini 1.5 Pro.
The bug that fuzzing missed
The discovered vulnerability is in the seriesBestIndex function within SQLite's series.c extension. The function handles queries using the generate_series virtual table. The root cause: the iColumn field in the sqlite3_index_constraint struct can hold the value -1, a special marker for ROWID constraints. The function failed to handle this edge case — it subtracted the SERIES_COLUMN_START constant, yielding -2, then used that value as an index into the aIdx[7] array, writing below the stack buffer.
In production builds, depending on compiler and optimization settings, this corrupted the pConstraint pointer, leading to a dereference at an invalid address in the next loop iteration — a potentially exploitable condition.
Big Sleep found the bug through variant analysis: it received a specific SQLite commit as a starting point, analyzed the diff, formed hypotheses, generated test cases, triggered a crash, and produced a report ready for disclosure. Large language models have a natural advantage in this kind of analysis: they carry extensive knowledge of known vulnerability classes, enabling faster hypothesis generation and validation.
Traditional AFL fuzzing ran 150 CPU-hours against the same target without finding the bug. The reason: standard OSS-Fuzz configurations for SQLite do not enable the generate_series extension. Code coverage is not the same as state coverage — this lesson is central to understanding what AI brings to security research.
A parallel front: AI-powered OSS-Fuzz
Concurrently, the Google Open Source Security team deployed AI for fuzz target generation in OSS-Fuzz. Code coverage expanded from 160 to 272 C/C++ projects, adding over 370,000 lines of new coverage. The effort uncovered 26 new vulnerabilities, including CVE-2024-9143 in OpenSSL — a bug likely present in the codebase for roughly two decades, undetectable by human-written fuzz targets.
Compared to Big Sleep, OSS-Fuzz takes a different approach: the LLM generates fuzz targets, fixes compilation errors, and triages crashes, but does not perform full variant analysis. Big Sleep operates as an autonomous AI agent capable of planning its own research steps. Both tools have different strengths: fuzzing excels at broad input-space exploration, Big Sleep at deep code logic analysis.
Why it matters
Big Sleep's discovery marks a concrete milestone: an AI agent autonomously found an exploitable vulnerability in widely deployed production software. This was not a proof of concept on a synthetic benchmark — it was a real bug in SQLite, an engine embedded in billions of devices and applications, from web browsers to embedded systems and cloud backends.
For defenders, this represents a potential tool for getting ahead of attackers: a vulnerability found and fixed by AI before release leaves attackers nothing to exploit. Variant analysis is an area where LLMs hold a structural advantage over fuzzing, since they reason about code semantics rather than just chasing branch coverage. Limitations remain clear, however: Big Sleep is still research-stage, and one found bug is a promising result, not proof of broad reliability.
What's next
- The Big Sleep team is working on the fifth pipeline step: automated patch generation — this capability was not ready at the time of the November 2024 publication
- OSS-Fuzz plans to expand into a full agentic pipeline with LLM access to debugger tooling, enabling automated triage and direct reporting to project maintainers
- SQLite and OpenSSL, as projects of critical importance to global internet infrastructure, remain priority targets for AI-assisted security research
Sources
- Google Project Zero — From Naptime to Big Sleep: Using Large Language Models To Catch Vulnerabilities In Real-World Code
- Google Online Security Blog — Leveling Up Fuzzing: Finding more vulnerabilities with AI
- SQLite — Vulnerability fix commit (41d58a014ce89356)