Retrieval

DiskANN

2019 · Active · Published: 20 May 2026 · Updated: 20 May 2026

Key innovation

Efficient billion-scale ANN search on a single node with the index residing on SSD, enabled by the Vamana graph that is specifically designed to minimize disk reads per query.

How it works

Indexing: (1) A Vamana graph is built. Each node is a vector with at most R out-edges to chosen neighbors. Construction is iterative, and an alpha parameter (>1.0) controls how aggressively long-range "shortcut" edges between distant clusters are kept during pruning. (2) Full-precision vectors and adjacency lists are written to SSD in fixed-size records aligned to 4 KB pages, so that one page read returns both a node's vector and its neighbor list. (3) RAM holds Product-Quantized (PQ) codes of all vectors plus a small cache of nodes around the entry point.

Query: (a) The search starts from the graph medoid. (b) At each step, the SSD pages of the closest unexpanded candidates (a small beam of W nodes, read in parallel) are fetched. Each page yields that node's full-precision vector, kept for final re-ranking, and its neighbor IDs; the neighbors are then scored with cheap PQ distances computed from RAM, without further disk reads. (c) Candidates are kept in a priority queue of size L (the search list). (d) The search terminates when every candidate in the list has been expanded; the top-k results are then re-ranked by their full-precision distances.

The FreshDiskANN variant adds an in-memory "delta" layer for fresh inserts and periodically merges it with the SSD index, allowing updates without full rebuilds.
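The query loop can be sketched in Python as a simplified, single-threaded model. The names are hypothetical: `read_page(node)` stands in for one 4 KB SSD read returning the node's full vector and neighbor IDs, and `approx_dist(query, node)` stands in for a PQ distance computed from RAM. For clarity it expands one node per step instead of a parallel beam, and it trims the search list by sorting rather than with an incremental priority queue.

```python
import heapq


def l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def disk_search(query, medoid, read_page, approx_dist, k, L):
    """Best-first graph search with a search list of size L (simplified model).

    read_page(node) -> (full_vector, neighbor_ids) models one SSD page read.
    approx_dist(query, node) models a cheap in-RAM PQ distance.
    Returns (top_k_ids, number_of_page_reads).
    """
    candidates = {medoid: approx_dist(query, medoid)}  # search list: id -> approx distance
    expanded = set()
    exact = {}  # full-precision distances of expanded nodes, used for re-ranking
    reads = 0
    while True:
        # Closest candidate in the list that has not been expanded yet.
        frontier = [(d, n) for n, d in candidates.items() if n not in expanded]
        if not frontier:
            break  # every node in the search list has been expanded
        _, node = min(frontier)
        expanded.add(node)
        vec, neighbors = read_page(node)  # one "SSD read"
        reads += 1
        exact[node] = l2(query, vec)
        for nb in neighbors:
            if nb not in candidates:
                candidates[nb] = approx_dist(query, nb)  # PQ estimate, no disk I/O
        if len(candidates) > L:
            # Keep only the L closest candidates by approximate distance.
            candidates = dict(heapq.nsmallest(L, candidates.items(), key=lambda kv: kv[1]))
    # Re-rank the expanded nodes with the exact distances obtained from their pages.
    return sorted(exact, key=exact.get)[:k], reads
```

With a complete graph over a few points and `approx_dist` equal to the exact distance, the result matches brute-force top-k. With real PQ distances the candidate order is only approximate, which is why the final re-ranking on full vectors matters.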

Problem solved

Exact k-NN in spaces with hundreds or thousands of dimensions is too slow at billion scale, since every query would have to be compared against all 10⁹ vectors. Classical graph-based ANN indexes such as HNSW avoid that scan but require the whole index to fit in RAM, which at 10⁹ vectors translates into hundreds of GB to TBs of memory per node. DiskANN tackles this economic problem: it achieves recall and latency comparable to in-memory indexes while keeping the index on much cheaper SSD storage with only a small RAM footprint, so a single server can serve billions of vectors instead of an entire cluster.
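A back-of-the-envelope estimate shows where the "hundreds of GB" come from. The figures are illustrative assumptions, not numbers from the text: 128-dimensional float32 vectors and 32 neighbor IDs of 4 bytes per node for an in-memory graph index, versus 32-byte PQ codes for the part of DiskANN that stays in RAM.

```latex
\begin{aligned}
\text{in-memory graph index:}\quad & 10^9 \times (128 \times 4\,\text{B} + 32 \times 4\,\text{B}) = 10^9 \times 640\,\text{B} = 640\,\text{GB} \\
\text{DiskANN, PQ codes in RAM:}\quad & 10^9 \times 32\,\text{B} = 32\,\text{GB}
\end{aligned}
```

The full vectors and the graph still exist in the DiskANN case, but on SSD, which costs far less per GB than DRAM.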

Components

Vamana graph · Proximity index

Directed proximity graph with bounded maximum out-degree R, built iteratively; an alpha parameter controls how many long-range shortcut edges between clusters survive pruning. Stored on SSD together with the full vectors.
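The alpha-based pruning can be sketched in Python after the RobustPrune procedure from the DiskANN paper; the function name and signature here are hypothetical. A candidate c is dropped once an already-chosen neighbor s satisfies alpha · d(s, c) ≤ d(p, c), so a larger alpha drops fewer candidates and leaves more long edges.

```python
import math


def robust_prune(p, candidates, points, alpha, R):
    """Choose at most R out-neighbors for node p from `candidates`.

    points maps node id -> coordinates. Greedily takes the closest remaining
    candidate, then discards every candidate c that the new neighbor s already
    "covers", i.e. alpha * d(s, c) <= d(p, c).
    """
    def dist(a, b):
        return math.dist(points[a], points[b])

    pool = set(candidates) - {p}
    chosen = []
    while pool and len(chosen) < R:
        best = min(pool, key=lambda c: dist(p, c))
        chosen.append(best)
        pool.discard(best)
        # Keep only candidates that `best` does not cover.
        pool = {c for c in pool if alpha * dist(best, c) > dist(p, c)}
    return chosen
```

For points 0, 1, 2, 3, 4 on a line with p = 0, alpha = 1 keeps only the edge to node 1, while alpha = 2 also keeps the longer edge to node 3.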

PQ compressed vectors (in-memory) · Cheap distance estimator

All vectors are compressed by Product Quantization down to tens of bytes per vector and kept in RAM to quickly estimate distances when picking candidates to expand during the search.
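A minimal PQ sketch in pure Python, assuming m subspaces with at most 256 centroids each (so each code is m bytes) and plain k-means for training. The function names are hypothetical, and production implementations train on a sample and vectorize all of this. The query-time trick is the lookup table: distances from each query subvector to every centroid are computed once per query, after which each candidate costs only m table lookups.

```python
import random


def sqdist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(x, centroids):
    """Index of the centroid closest to x."""
    return min(range(len(centroids)), key=lambda i: sqdist(x, centroids[i]))


def train_pq(vectors, m, ks=256, iters=10, seed=0):
    """Learn m sub-codebooks of at most ks centroids each with Lloyd's k-means."""
    sub = len(vectors[0]) // m  # assumes the dimension is divisible by m
    rng = random.Random(seed)
    codebooks = []
    for j in range(m):
        parts = [tuple(v[j * sub:(j + 1) * sub]) for v in vectors]
        cents = rng.sample(parts, min(ks, len(parts)))
        for _ in range(iters):
            groups = {}
            for x in parts:
                groups.setdefault(nearest(x, cents), []).append(x)
            # Move each centroid to the mean of its group; empty groups keep their centroid.
            cents = [
                tuple(sum(col) / len(groups[i]) for col in zip(*groups[i])) if i in groups else c
                for i, c in enumerate(cents)
            ]
        codebooks.append(cents)
    return codebooks


def pq_encode(v, codebooks):
    """Compress v to one byte per subspace (requires ks <= 256)."""
    sub = len(v) // len(codebooks)
    return bytes(nearest(v[j * sub:(j + 1) * sub], cb) for j, cb in enumerate(codebooks))


def adc_table(query, codebooks):
    """Per-query table: squared distance from each query subvector to every centroid."""
    sub = len(query) // len(codebooks)
    return [[sqdist(query[j * sub:(j + 1) * sub], c) for c in cb] for j, cb in enumerate(codebooks)]


def pq_distance(table, code):
    """Approximate squared distance to an encoded vector: m lookups, no decompression."""
    return sum(table[j][c] for j, c in enumerate(code))
```

With ks = 256 and m = 32, a 128-dimensional float32 vector (512 bytes) becomes a 32-byte code.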


SSD storage layout · Persistent storage for graph and full vectors

Pages of 4 KB pack a node's vector together with its neighbor list, so that a single I/O request returns everything one search step needs.
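The arithmetic behind the 4 KB layout can be sketched as follows, assuming each record stores the full vector, a 4-byte neighbor count and R neighbor IDs of 4 bytes each, padded to the maximum degree. This is a modeled layout, not necessarily the byte-exact format of any implementation. It also covers the case where a record exceeds 4 KB (high-dimensional float32 embeddings): the node then spans several pages and each hop costs more than one read.

```python
PAGE_BYTES = 4096  # one aligned SSD read


def node_record_bytes(dim, bytes_per_component, R):
    """Full vector + 4-byte neighbor count + R neighbor IDs (4 bytes each)."""
    return dim * bytes_per_component + 4 + 4 * R


def page_layout(dim, bytes_per_component, R):
    """Return (record_bytes, nodes_per_page, pages_per_node) for the modeled layout."""
    rec = node_record_bytes(dim, bytes_per_component, R)
    if rec <= PAGE_BYTES:
        return rec, PAGE_BYTES // rec, 1
    return rec, 0, -(-rec // PAGE_BYTES)  # ceiling division: node spans several pages
```

For 128-dimensional float32 vectors with R = 64, five records fit in one page; at 1,536 dimensions each node needs two pages per hop.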

Beam search with priority queue (L) · Query algorithm

The search keeps a priority queue of size L (search list size) — larger L means higher recall at the cost of latency and more SSD reads.

FreshDiskANN delta index · Update handling

Optional in-memory layer for fresh inserts/deletes, periodically merged with the SSD index, enabling mutable collections without full rebuilds.
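The read path of such a delta layer can be sketched as a toy model: the base index is treated as immutable, fresh inserts are searched exhaustively in memory, and deletes become tombstones filtered out of base results. All names are hypothetical, and `merge` simply swaps in a new base built by a caller-supplied function; FreshDiskANN's own merge updates the SSD graph without a full rebuild, which this sketch does not model.

```python
import heapq
import math


class DeltaIndex:
    """Toy updatable index: immutable base + in-memory delta + tombstones.

    base_search(query, n) is a hypothetical callback returning [(distance, id), ...]
    for the n nearest vectors of the SSD-resident base index; distances are assumed
    to be Euclidean so they compare directly with math.dist on the delta.
    """

    def __init__(self, base_search):
        self.base_search = base_search
        self.fresh = {}          # id -> vector inserted since the last merge (searched exhaustively)
        self.tombstones = set()  # ids deleted since the last merge

    def insert(self, vid, vec):
        # A fresh copy shadows any older copy of the same id in the base.
        self.fresh[vid] = vec

    def delete(self, vid):
        self.fresh.pop(vid, None)
        self.tombstones.add(vid)

    def search(self, query, k):
        # Over-fetch from the base so that filtered ids cannot leave fewer than k results.
        base = self.base_search(query, k + len(self.tombstones) + len(self.fresh))
        hits = [(d, vid) for d, vid in base if vid not in self.tombstones and vid not in self.fresh]
        hits += [(math.dist(query, v), vid) for vid, v in self.fresh.items()]
        return [vid for _, vid in heapq.nsmallest(k, hits)]

    def merge(self, rebuild_base):
        """Swap in a new base built from the delta (rebuild_base is a hypothetical callback)."""
        self.base_search = rebuild_base(self.fresh, self.tombstones)
        self.fresh, self.tombstones = {}, set()
```

Keeping the delta small is what makes the exhaustive scan of `fresh` affordable, which is why the merge has to run periodically.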


Implementation

Reference implementations

microsoft/DiskANN

C++ · Microsoft Research · Official

Milvus DiskANN index

C++ / Go · Zilliz / Milvus

Azure Cosmos DB DiskANN

managed service · Microsoft Azure

Implementation pitfalls

Using HDD instead of NVMe SSD · Critical

DiskANN is built around low-latency 4 KB random reads. On HDDs or slow SATA SSDs latency degrades by 1–2 orders of magnitude and the algorithm loses its advantage.

Fix: Require datacenter-class NVMe SSD; measure 4 KB random-read IOPS and p99 read latency under load.

Search list L set too small → low recall · High

L controls the frontier size and must be at least k, since the final top-k are taken from the search list. Values below ~50 sharply reduce recall on hard datasets (>1M vectors, high dimensionality).

Fix: Tune L per dataset and recall target; keep separate profiles for different SLOs (low-latency vs high-recall).
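Tuning L against a recall target can be sketched as a sweep: compute exact neighbors for a sample of queries by brute force once, then pick the smallest L whose mean recall@k meets the target. `search(query, k, L)` is a hypothetical wrapper around whatever index is in use, and the grid of L values is illustrative. Running the sweep once per SLO yields the separate low-latency and high-recall profiles.

```python
def recall_at_k(found, truth, k):
    """Fraction of the true k nearest neighbors that appear in the returned top-k."""
    return len(set(found[:k]) & set(truth[:k])) / k


def smallest_L(search, queries, ground_truth, k, target, grid=(10, 20, 50, 100, 200)):
    """Return the smallest L in `grid` whose mean recall@k reaches `target`, else None."""
    for L in sorted(grid):
        if L < k:
            continue  # the search list must hold at least k results
        recalls = [recall_at_k(search(q, k, L), gt, k) for q, gt in zip(queries, ground_truth)]
        if sum(recalls) / len(recalls) >= target:
            return L
    return None
```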

Mutations without FreshDiskANN → index degradation · High

Vanilla DiskANN is semi-static. Frequent inserts/deletes without a FreshDiskANN delta layer degrade graph quality and force full rebuilds.

Fix: Use FreshDiskANN or an engine that manages it (Milvus / Cosmos); schedule periodic rebuilds.

Underestimating RAM needed for PQ codes · Medium

Even though the index is "on disk", the PQ codes of every vector, together with the node cache and per-query search buffers, must fit in RAM. For 1B vectors × 64 B of PQ code that is ~64 GB, which is easy to overlook in capacity planning.

Fix: Budget RAM as N × b_PQ + cache; keep a 1.5–2× margin over the theoretical minimum.
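The budgeting rule as a small helper, assuming the cache is counted as cached nodes times bytes per node (one 4 KB page each by default). The function name is hypothetical, and the estimate leaves out the OS, query buffers and co-located processes, which the margin is meant to absorb.

```python
def ram_budget_gb(n_vectors, pq_bytes, cache_nodes=0, node_bytes=4096, margin=1.5):
    """RAM estimate in decimal GB: (N * b_PQ + cache) * margin.

    The cache is modeled as cache_nodes records of node_bytes each (an assumption).
    """
    pq_codes = n_vectors * pq_bytes
    cache = cache_nodes * node_bytes
    return (pq_codes + cache) * margin / 1e9
```

With the default margin of 1.5, the 64 GB minimum for 1B vectors × 64 B becomes 96 GB before any cache is added.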

No warm-up of cache → very high p99 latency · Medium

Right after startup every query reads from SSD, so p99 latency can be 5–10× the steady-state value until the OS page cache warms up.

Fix: Run a synthetic warm-up (preload the medoid neighborhood and replay N hot queries) before adding the instance to the load balancer.
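The query-replay part of the warm-up can be sketched as a loop that replays representative queries until p99 latency stops changing between rounds. `search(query)` is a hypothetical callback into the freshly started instance, and the stopping rule (round-over-round p99 change within 10%) is an illustrative assumption; preloading the medoid neighborhood is engine-specific and not shown.

```python
import time


def warm_up(search, queries, max_rounds=10, tolerance=0.1):
    """Replay `queries` until p99 latency changes by at most `tolerance` between rounds.

    Returns the number of rounds executed (0 if there are no queries).
    """
    queries = list(queries)
    if not queries:
        return 0
    prev_p99 = None
    for rnd in range(1, max_rounds + 1):
        latencies = []
        for q in queries:
            start = time.perf_counter()
            search(q)
            latencies.append(time.perf_counter() - start)
        latencies.sort()
        p99 = latencies[min(len(latencies) - 1, int(0.99 * len(latencies)))]
        if prev_p99 is not None and abs(p99 - prev_p99) <= tolerance * prev_p99:
            return rnd  # latency has stabilized; the instance can join the load balancer
        prev_p99 = p99
    return max_rounds
```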