HNSW
How it works
Construction: (1) For each new vector, sample its maximum layer l ~ Geometric(1/ln(M)) — most end up in l=0, only a few reach higher layers. (2) The vector is inserted from layer l downwards. At each layer, an expanded greedy search collects ef_construction candidate neighbors, then M (or M_max0 for layer 0) final neighbors are picked using the pruning heuristic — points lying in "different directions" from the new node are preferred, not just the closest ones. (3) Edges are bidirectional; when a neighbor exceeds M_max connections its edges are re-pruned too. Search: (a) start from the entry point at the highest existing layer, (b) greedy search at each layer until a local minimum, with fixed ef=1 above layer 0, (c) at layer 0 run an expanded search with a size-ef priority queue (search-time), (d) return top-k from that queue. Parameters M (typically 16–48), ef_construction (100–500) and ef (50–500) control the build time / memory / recall / latency trade-off. Filtering (e.g. WHERE category='X') is usually implemented as filtered search with dynamic ef or as separate per-category graphs.
Problem solved
Exact k-NN in high-dimensional spaces costs O(N · D) per query — infeasible for a million vectors under millisecond SLOs. IVF and KD/ball-tree degrade in high dimensions due to the "curse of dimensionality". HNSW solves this graph-theoretically: it empirically delivers O(log N) query time and ≥ 95% recall at very low latency, which makes it the de facto standard for vector search over AI embeddings at single-node scale.
Components
Layers 0..L: higher are sparse (long-range shortcuts), lower are dense. The number of layers per vector is sampled geometrically with parameter 1/ln(M).
Above layer 0: ef=1 (pure navigation). At layer 0: a size-ef priority queue (search-time) controls recall/latency.
Instead of taking the M closest candidates, HNSW prefers candidates spread across different directions (relative-neighborhood-graph style). The key to high recall.
Official
Each node has at most M (M_max0 in layer 0) edges per layer. Bidirectional — insertion updates both sides.
Implementation
Default ef=10–32 yields <90% recall on real AI embeddings (D≥384). This is the most common reason behind "HNSW is bad" benchmark blogs.
Vanilla HNSW is append-mostly. Deletes are tombstoned, and inserting new versions doesn't always rewire edges optimally — after many mutations recall drops and a full rebuild becomes necessary.
Post-hoc filters like WHERE category='X' applied on top-ef results can return fewer than k matching vectors. You must dynamically grow ef or use filtered-HNSW (predicate checked during traversal).
HNSW keeps full vectors + edges in RAM. For 100M vectors at D=1536 (OpenAI embeddings) with M=32 that is ~700 GB RAM — quickly becomes economically unreasonable.
The HNSW entry point is a single highest-layer node. After a full rebuild a suboptimal entry point inflates the number of steps per query — sometimes visible as a latency regression with no recall change.
Evolution
HNSW's predecessor: a single-layer "small world" graph with both short and long edges. Already competitive, but slower than HNSW.
First HNSW paper: adds hierarchical layers and the diversity pruning heuristic. A breakthrough in ANN quality and speed.
Open-source HNSW in nmslib (Non-Metric Space Library) — the first widely used implementation, basis for many ports.
Official IEEE TPAMI publication. Yury Malkov releases hnswlib — a standalone, fast C++/Python library that remains the reference implementation.
Qdrant, Pinecone, Weaviate, pgvector (≥ 0.5), Elasticsearch (8.0+), OpenSearch and Milvus — all expose HNSW as their primary or only ANN index.
Qdrant, Milvus, Lucene and pgvector add vector compression on top of HNSW (PQ, scalar quantization, INT8) to reduce RAM without giving up the graph.
Hyperparameters (configurable axes)
Maximum neighbors per node in layers >0 (in layer 0 usually M_max0 = 2·M). Larger M means higher recall, more RAM, slower build.
Dynamic candidate list size during construction. Larger means better graph quality (higher recall) at the cost of longer build.
Priority queue size during query at layer 0. The main runtime recall vs latency knob — can be changed per-query without rebuilding.
Max neighbor count in layer 0 (the densest). Typically 2·M; raising it improves recall but doubles layer-0 RAM.
Computational complexity
Time complexity: Query: O(log N) (empirical); Insert: O(log N · ef_construction · D). Space complexity: O(N · (D · 4B + M · 8B · L_avg)).
Compute bottleneck
Most query time is spent on dot products / L2 distances between the query and visited node vectors. SIMD implementations (AVX-512, NEON) are critical for performance.
Execution paradigm
Only a small subset of graph nodes is visited per query (sparse traversal). This contrasts with PQ (dense scan) and is closer to DiskANN.
Parallelism
Queries are independent — throughput scales linearly with cores. Index construction is partly parallel (per-node lock during edge updates).
Hardware requirements
HNSW search is dominated by distance computations (dot product / L2) over small vector sets per step. AVX2/AVX-512 provides the needed throughput. GPU has no edge due to small per-query batches and irregular memory access.
HNSW implementations run efficiently on ARM (Graviton, Apple Silicon, NEON) and x86. What matters is memory bandwidth and SIMD, not a specific ISA.
Irregular graph reads and small per-query batches don't map well to Tensor Cores designed for dense GEMM. GPU can speed up index construction and extreme throughput scenarios, but not the typical single-query path.