DiskANN
How it works
Indexing: (1) a Vamana graph is built โ each node is a vector and edges connect it to R chosen neighbors; construction is iterative, with an alpha parameter (>1.0) controlling how aggressively long-range "shortcut" edges between distant clusters are added. (2) Full vectors and adjacency lists are written to SSD in a layout aligned to 4 KB pages, so that a single page contains both a node's vector and its neighbors. (3) RAM holds Product-Quantized compressed codes of all vectors plus a small cache of nodes around the entry points. Query: (a) starts from the graph medoid, (b) at each step the SSD page of the current node is read to get its neighbors and their full vectors, but candidates for frontier expansion are scored with cheap PQ distances from RAM, (c) the frontier is kept as a priority queue of size L (search list), (d) the search terminates when none of the R closest candidates improves the distance. The FreshDiskANN variant adds an in-memory "delta" layer for fresh inserts and periodically merges it with the SSD index, allowing updates without full rebuilds.
Problem solved
Exact k-NN in spaces with hundreds or thousands of dimensions is infeasible at billion-scale, and classical ANN graph indexes (HNSW) require the whole index to fit in RAM, which at 10โน vectors translates into hundreds of GB to TB of memory per node. DiskANN tackles this economic problem: it achieves comparable recall and latency while keeping the index on much cheaper SSD storage and only a small footprint in RAM, so a single server can serve billions of vectors instead of an entire cluster.
Components
Directed proximity graph with bounded maximum out-degree R, built iteratively with an alpha parameter that controls how long the shortcut edges between clusters can be; stored on SSD together with full vectors.
All vectors are compressed by Product Quantization down to tens of bytes per vector and kept in RAM to quickly estimate distances when picking candidates to expand during the search.
Official
Pages of 4 KB pack a node's vector together with its neighbor list, so that a single I/O request returns everything one search step needs.
The search keeps a priority queue of size L (search list size) โ larger L means higher recall at the cost of latency and more SSD reads.
Optional in-memory layer for fresh inserts/deletes, periodically merged with the SSD index, enabling mutable collections without full rebuilds.
Official
Implementation
DiskANN is built around low-latency 4 KB random reads. On HDDs or slow SATA SSDs latency degrades by 1โ2 orders of magnitude and the algorithm loses its advantage.
L controls the frontier size. Values below ~50 sharply reduce recall on hard datasets (>1M vectors, high D).
Vanilla DiskANN is semi-static. Frequent inserts/deletes without a FreshDiskANN delta layer degrade graph quality and force full rebuilds.
Even though the index is "on disk", PQ codebooks and the frontier slice must fit in RAM. For 1B vectors ร 64 B PQ that is ~64 GB โ easy to overlook in capacity planning.
Right after startup every query reads from SSD โ p99 latency can be 5โ10ร the steady-state value until the OS page cache warms up.
Evolution
Microsoft Research presents DiskANN and the Vamana graph โ the first ANN that handles a billion vectors on a single node with the index on SSD.
Extension that supports inserts and deletes without full index rebuilds via a two-tier RAM + SSD architecture.
DiskANN implementations appear in Milvus and Microsoft Azure services (Cosmos DB, Bing).
Microsoft announces DiskANN as the default vector engine for Cosmos DB for MongoDB vCore โ production usage at very large scale.
Hyperparameters (configurable axes)
Maximum out-degree of a node in the Vamana graph. Larger R means higher recall but more SSD space and slower build.
Priority queue size during query. Directly controls the recall vs latency trade-off.
Edge-pruning parameter (>1.0) during graph construction โ larger alpha keeps more long-range shortcut edges, yielding shorter search paths.
How many bytes to allocate per vector after Product Quantization in RAM. Trades recall (more bytes = better distance estimates) against RAM footprint.
Computational complexity
Time complexity: O(log N) per query (empirical). Space complexity: RAM: O(N ยท b_PQ); SSD: O(N ยท (D ยท 4B + R ยท 4B)).
Compute bottleneck
Each search step is a fresh random 4 KB read from SSD. Performance is bound by drive IOPS, not by CPU/RAM โ hence the strong preference for NVMe SSD.
Execution paradigm
Only a small subset of graph nodes is visited per query โ a "sparse exploration" paradigm.
Parallelism
Each query is independent; throughput scales linearly with CPU cores and the number of concurrent I/O operations the SSD can sustain.
Hardware requirements
Search is dominated by PQ distance computations (small dot products) and graph logic โ a perfect fit for vector CPU instructions (AVX2/AVX-512). GPU is not cost-effective because of the small per-query batch sizes.
DiskANN implementations run on ARM (Graviton, Apple Silicon) and x86. What matters is NVMe SSD and memory bandwidth, not a specific ISA.
Tensor Cores designed for dense GEMM offer no advantage for random reads and sparse graph traversal. GPU can speed up index construction but not the query path.