Robots Atlas>ROBOTS ATLAS
Retrieval

HNSW

2016ActivePublished: 20 May 2026Updated: 20 May 2026Published
Key innovation
A multi-layer "small-world" proximity graph: upper layers carry sparse long-range shortcuts for fast global navigation, lower layers carry dense local neighborhoods for precise refinement. Queries descend layer by layer in time logarithmic in N.
Category
Retrieval
Abstraction level
Building block
Operation level
RetrievalDataInference
Use cases
Semantic search over 10⁶–10⁸ embeddings (RAG, search)Default index in vector databases: Qdrant, Pinecone, Weaviate, pgvector, MilvusImage / product similarity searchReal-time recommendations (sub-millisecond)Deduplication and near-duplicate detectionMobile / edge ANN (with smaller M)Hybrid search with BM25 + ANN (Elasticsearch, OpenSearch)

How it works

Construction: (1) For each new vector, sample its maximum layer l ~ Geometric(1/ln(M)) — most end up in l=0, only a few reach higher layers. (2) The vector is inserted from layer l downwards. At each layer, an expanded greedy search collects ef_construction candidate neighbors, then M (or M_max0 for layer 0) final neighbors are picked using the pruning heuristic — points lying in "different directions" from the new node are preferred, not just the closest ones. (3) Edges are bidirectional; when a neighbor exceeds M_max connections its edges are re-pruned too. Search: (a) start from the entry point at the highest existing layer, (b) greedy search at each layer until a local minimum, with fixed ef=1 above layer 0, (c) at layer 0 run an expanded search with a size-ef priority queue (search-time), (d) return top-k from that queue. Parameters M (typically 16–48), ef_construction (100–500) and ef (50–500) control the build time / memory / recall / latency trade-off. Filtering (e.g. WHERE category='X') is usually implemented as filtered search with dynamic ef or as separate per-category graphs.

Problem solved

Exact k-NN in high-dimensional spaces costs O(N · D) per query — infeasible for a million vectors under millisecond SLOs. IVF and KD/ball-tree degrade in high dimensions due to the "curse of dimensionality". HNSW solves this graph-theoretically: it empirically delivers O(log N) query time and ≥ 95% recall at very low latency, which makes it the de facto standard for vector search over AI embeddings at single-node scale.

Components

Hierarchical layered graphIndex structure

Layers 0..L: higher are sparse (long-range shortcuts), lower are dense. The number of layers per vector is sampled geometrically with parameter 1/ln(M).

Greedy beam search per layerQuery algorithm

Above layer 0: ef=1 (pure navigation). At layer 0: a size-ef priority queue (search-time) controls recall/latency.

Diversity pruning heuristicNeighbor selection during construction

Instead of taking the M closest candidates, HNSW prefers candidates spread across different directions (relative-neighborhood-graph style). The key to high recall.

Official

Adjacency lists (per layer)Edge representation

Each node has at most M (M_max0 in layer 0) edges per layer. Bidirectional — insertion updates both sides.

Implementation

Implementation pitfalls
ef (search) too small → low recallHigh

Default ef=10–32 yields <90% recall on real AI embeddings (D≥384). This is the most common reason behind "HNSW is bad" benchmark blogs.

Fix:Tune ef on a validation set to hit the target recall; ef is a per-query knob and needs no rebuild.
Deletes / updates degrade the graphHigh

Vanilla HNSW is append-mostly. Deletes are tombstoned, and inserting new versions doesn't always rewire edges optimally — after many mutations recall drops and a full rebuild becomes necessary.

Fix:Schedule periodic rebuilds; use variants that support deletes with edge repair (Qdrant, pgvector ≥ 0.6).
Filtered search — empty-queue pitfallMedium

Post-hoc filters like WHERE category='X' applied on top-ef results can return fewer than k matching vectors. You must dynamically grow ef or use filtered-HNSW (predicate checked during traversal).

Fix:Use engines with native filtered HNSW (Qdrant, Milvus, Weaviate); for rare categories consider per-category graphs.
RAM blow-up at large M and high DHigh

HNSW keeps full vectors + edges in RAM. For 100M vectors at D=1536 (OpenAI embeddings) with M=32 that is ~700 GB RAM — quickly becomes economically unreasonable.

Fix:For N > ~50M consider HNSW-PQ / HNSW-INT8 or move to DiskANN / IVF-PQ.
Bad entry point placement after rebuildLow

The HNSW entry point is a single highest-layer node. After a full rebuild a suboptimal entry point inflates the number of steps per query — sometimes visible as a latency regression with no recall change.

Fix:Monitor mean hop count; some implementations support warm-up and entry-point selection via sampling.

Evolution

Original paper · 2018 · IEEE TPAMI 2018 (preprint 2016) · Yury A. Malkov
Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs
Yury A. Malkov, Dmitry A. Yashunin
2014
Malkov et al. — NSW (Navigable Small World)

HNSW's predecessor: a single-layer "small world" graph with both short and long edges. Already competitive, but slower than HNSW.

2016
Malkov & Yashunin — HNSW preprint (arXiv:1603.09320)
Inflection point

First HNSW paper: adds hierarchical layers and the diversity pruning heuristic. A breakthrough in ANN quality and speed.

2017
nmslib — reference C++ implementation

Open-source HNSW in nmslib (Non-Metric Space Library) — the first widely used implementation, basis for many ports.

2018
TPAMI publication; hnswlib as standalone
Inflection point

Official IEEE TPAMI publication. Yury Malkov releases hnswlib — a standalone, fast C++/Python library that remains the reference implementation.

2022
HNSW becomes the default index in vector databases

Qdrant, Pinecone, Weaviate, pgvector (≥ 0.5), Elasticsearch (8.0+), OpenSearch and Milvus — all expose HNSW as their primary or only ANN index.

2024
HNSW-PQ / HNSW-INT8 variants in mainstream use

Qdrant, Milvus, Lucene and pgvector add vector compression on top of HNSW (PQ, scalar quantization, INT8) to reduce RAM without giving up the graph.

Hyperparameters (configurable axes)

Max neighbors per node (M)Critical

Maximum neighbors per node in layers >0 (in layer 0 usually M_max0 = 2·M). Larger M means higher recall, more RAM, slower build.

8
16
32–48
ef_constructionHigh

Dynamic candidate list size during construction. Larger means better graph quality (higher recall) at the cost of longer build.

100
200–400
ef (search-time)Critical

Priority queue size during query at layer 0. The main runtime recall vs latency knob — can be changed per-query without rebuilding.

32
128
500
M_max0 (layer 0 cap)Medium

Max neighbor count in layer 0 (the densest). Typically 2·M; raising it improves recall but doubles layer-0 RAM.

2·M

Computational complexity

Time complexity: Query: O(log N) (empirical); Insert: O(log N · ef_construction · D). Space complexity: O(N · (D · 4B + M · 8B · L_avg)).

Compute bottleneck

Distance computations during graph traversal

Most query time is spent on dot products / L2 distances between the query and visited node vectors. SIMD implementations (AVX-512, NEON) are critical for performance.

Execution paradigm

Primary mode
sparse

Only a small subset of graph nodes is visited per query (sparse traversal). This contrasts with PQ (dense scan) and is closer to DiskANN.

Activation pattern
subset_active

Parallelism

Parallelism level
fully_parallel

Queries are independent — throughput scales linearly with cores. Index construction is partly parallel (per-node lock during edge updates).

Scope
inferenceacross_devices

Hardware requirements

Primary

HNSW search is dominated by distance computations (dot product / L2) over small vector sets per step. AVX2/AVX-512 provides the needed throughput. GPU has no edge due to small per-query batches and irregular memory access.

Good fit

HNSW implementations run efficiently on ARM (Graviton, Apple Silicon, NEON) and x86. What matters is memory bandwidth and SIMD, not a specific ISA.

Limited

Irregular graph reads and small per-query batches don't map well to Tensor Cores designed for dense GEMM. GPU can speed up index construction and extreme throughput scenarios, but not the typical single-query path.