Robots Atlas>ROBOTS ATLAS
Retrieval

DiskANN

2019ActivePublished: 20 May 2026Updated: 20 May 2026Published
Key innovation
Efficient billion-scale ANN search on a single node with the index residing on SSD, enabled by the Vamana graph that is specifically designed to minimize disk reads per query.
Category
Retrieval
Abstraction level
Building block
Operation level
RetrievalDataInference
Use cases
Semantic search over billions of embeddingsScaling RAG beyond RAM limits (single-node billion-scale)Vector databases (Azure Cosmos DB, Milvus)Content / product recommendations over large catalogsDeduplication and near-duplicate detection on huge corporaImage-to-image similarity searchANN for multimodal embeddings (text + image + audio)

How it works

Indexing: (1) a Vamana graph is built โ€” each node is a vector and edges connect it to R chosen neighbors; construction is iterative, with an alpha parameter (>1.0) controlling how aggressively long-range "shortcut" edges between distant clusters are added. (2) Full vectors and adjacency lists are written to SSD in a layout aligned to 4 KB pages, so that a single page contains both a node's vector and its neighbors. (3) RAM holds Product-Quantized compressed codes of all vectors plus a small cache of nodes around the entry points. Query: (a) starts from the graph medoid, (b) at each step the SSD page of the current node is read to get its neighbors and their full vectors, but candidates for frontier expansion are scored with cheap PQ distances from RAM, (c) the frontier is kept as a priority queue of size L (search list), (d) the search terminates when none of the R closest candidates improves the distance. The FreshDiskANN variant adds an in-memory "delta" layer for fresh inserts and periodically merges it with the SSD index, allowing updates without full rebuilds.

Problem solved

Exact k-NN in spaces with hundreds or thousands of dimensions is infeasible at billion-scale, and classical ANN graph indexes (HNSW) require the whole index to fit in RAM, which at 10โน vectors translates into hundreds of GB to TB of memory per node. DiskANN tackles this economic problem: it achieves comparable recall and latency while keeping the index on much cheaper SSD storage and only a small footprint in RAM, so a single server can serve billions of vectors instead of an entire cluster.

Components

Vamana graphProximity index

Directed proximity graph with bounded maximum out-degree R, built iteratively with an alpha parameter that controls how long the shortcut edges between clusters can be; stored on SSD together with full vectors.

PQ compressed vectors (in-memory)Cheap distance estimator

All vectors are compressed by Product Quantization down to tens of bytes per vector and kept in RAM to quickly estimate distances when picking candidates to expand during the search.

Official

SSD storage layoutPersistent storage for graph and full vectors

Pages of 4 KB pack a node's vector together with its neighbor list, so that a single I/O request returns everything one search step needs.

Beam search with priority queue (L)Query algorithm

The search keeps a priority queue of size L (search list size) โ€” larger L means higher recall at the cost of latency and more SSD reads.

FreshDiskANN delta indexUpdate handling

Optional in-memory layer for fresh inserts/deletes, periodically merged with the SSD index, enabling mutable collections without full rebuilds.

Official

Implementation

Implementation pitfalls
Using HDD instead of NVMe SSDCritical

DiskANN is built around low-latency 4 KB random reads. On HDDs or slow SATA SSDs latency degrades by 1โ€“2 orders of magnitude and the algorithm loses its advantage.

Fix:Require datacenter-class NVMe SSD; measure p99 4 KB random-read IOPS under load.
Search list L set too small โ†’ low recallHigh

L controls the frontier size. Values below ~50 sharply reduce recall on hard datasets (>1M vectors, high D).

Fix:Tune L per dataset and recall target; keep separate profiles for different SLOs (low-latency vs high-recall).
Mutations without FreshDiskANN โ†’ index degradationHigh

Vanilla DiskANN is semi-static. Frequent inserts/deletes without a FreshDiskANN delta layer degrade graph quality and force full rebuilds.

Fix:Use FreshDiskANN or an engine that manages it (Milvus / Cosmos); schedule periodic rebuilds.
Underestimating RAM needed for PQ codebooksMedium

Even though the index is "on disk", PQ codebooks and the frontier slice must fit in RAM. For 1B vectors ร— 64 B PQ that is ~64 GB โ€” easy to overlook in capacity planning.

Fix:Budget RAM as N ร— b_PQ + cache; keep a 1.5โ€“2ร— margin over the theoretical minimum.
No warm-up of cache โ†’ very high p99 latencyMedium

Right after startup every query reads from SSD โ€” p99 latency can be 5โ€“10ร— the steady-state value until the OS page cache warms up.

Fix:Run a synthetic warm-up (preload medoid + N hot queries) before adding the instance to the load balancer.

Evolution

Original paper ยท 2019 ยท NeurIPS 2019 ยท Suhas Jayaram Subramanya
DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node
Suhas Jayaram Subramanya, Devvrit, Harsha Vardhan Simhadri, Ravishankar Krishnaswamy, Rohan Kadekodi
2019
DiskANN published at NeurIPS
Inflection point

Microsoft Research presents DiskANN and the Vamana graph โ€” the first ANN that handles a billion vectors on a single node with the index on SSD.

2021
FreshDiskANN โ€” support for fresh updates

Extension that supports inserts and deletes without full index rebuilds via a two-tier RAM + SSD architecture.

2022
DiskANN ships in production vector databases

DiskANN implementations appear in Milvus and Microsoft Azure services (Cosmos DB, Bing).

2024
DiskANN in Azure Cosmos DB for MongoDB vCore

Microsoft announces DiskANN as the default vector engine for Cosmos DB for MongoDB vCore โ€” production usage at very large scale.

Hyperparameters (configurable axes)

Max graph degree (R)Critical

Maximum out-degree of a node in the Vamana graph. Larger R means higher recall but more SSD space and slower build.

32
64
128
Search list size (L)Critical

Priority queue size during query. Directly controls the recall vs latency trade-off.

50
100
200+
Vamana alphaHigh

Edge-pruning parameter (>1.0) during graph construction โ€” larger alpha keeps more long-range shortcut edges, yielding shorter search paths.

1.0
1.2
PQ footprint per vectorHigh

How many bytes to allocate per vector after Product Quantization in RAM. Trades recall (more bytes = better distance estimates) against RAM footprint.

32 B
64 B
96 B

Computational complexity

Time complexity: O(log N) per query (empirical). Space complexity: RAM: O(N ยท b_PQ); SSD: O(N ยท (D ยท 4B + R ยท 4B)).

Compute bottleneck

Random 4 KB reads from SSD

Each search step is a fresh random 4 KB read from SSD. Performance is bound by drive IOPS, not by CPU/RAM โ€” hence the strong preference for NVMe SSD.

Execution paradigm

Primary mode
sparse

Only a small subset of graph nodes is visited per query โ€” a "sparse exploration" paradigm.

Activation pattern
subset_active

Parallelism

Parallelism level
fully_parallel

Each query is independent; throughput scales linearly with CPU cores and the number of concurrent I/O operations the SSD can sustain.

Scope
inferenceacross_devices

Hardware requirements

Primary

Search is dominated by PQ distance computations (small dot products) and graph logic โ€” a perfect fit for vector CPU instructions (AVX2/AVX-512). GPU is not cost-effective because of the small per-query batch sizes.

Good fit

DiskANN implementations run on ARM (Graviton, Apple Silicon) and x86. What matters is NVMe SSD and memory bandwidth, not a specific ISA.

Limited

Tensor Cores designed for dense GEMM offer no advantage for random reads and sparse graph traversal. GPU can speed up index construction but not the query path.