Training (offline): (1) collect a representative sample of database vectors, (2) choose hyperparameters — number of sub-quantizers m (must divide D) and bits per code (typically 8, i.e. k=256 centroids per sub-space), (3) split each sample vector into m segments of D/m dims, (4) run k-means independently on each of the m segment sets — yielding m codebooks of 256 centroids each. Database encoding: each vector v is split into (v₁,…,v_m), each segment is matched to its closest centroid c_j in the corresponding codebook, and the vector is stored as m bytes (centroid ids). Query (ADC): (a) split q into (q₁,…,q_m), (b) for each sub-space j compute a LUT_j[i] = dist(q_j, c_j^i) for i=0..255 — m × 256 operations total, (c) for each database vector with code (i₁,…,i_m) the approximate distance = Σ LUT_j[i_j] — m additions plus m cache lookups, (d) keep a top-k heap. The OPQ variant prepends PQ with an orthogonal rotation R chosen to minimize quantization error — substantially improves recall on data with correlated dimensions. The RQ (Residual Quantization) variant quantizes residuals iteratively — successive codebooks learn the difference between the vector and its current approximation.
Embeddings from AI models (BERT, CLIP, OpenAI) typically have D=384–1536 dimensions at 4 bytes (float32), giving 1.5–6 KB per vector. For one billion vectors that is 1.5–6 TB of RAM — infeasible on a single server. PQ solves this: the same billion vectors after PQ take 16–64 GB of RAM (50–100× compression) with minimal recall loss, enabling huge indexes on a single node and dramatically cheaper infrastructure.
A D-dimensional vector is split into m disjoint segments of D/m dims each. m must divide D; choosing m controls the compression vs quality trade-off.
m independent codebooks, each typically with k=256 centroids (8 bits). Trained with k-means on the corresponding sub-space sample.
Official
Each vector is represented as an m-tuple of centroid indices (m bytes when k=256). This is what indexes hold in RAM/SSD.
Precompute LUT_j[i] = dist(q_j, c_j^i) per sub-space, then scan the database as m lookups + m additions per vector. Symmetric DC also quantizes the query (faster but lower recall).
OPQ learns an orthogonal matrix R that rotates vectors so variance is spread evenly across sub-spaces — typically 2–5 p.p. higher recall at the same compression.
Official
Plain PQ requires D to be divisible by m. E.g. for D=384 valid m: 1, 2, 3, 4, 6, 8, 12, 16, 24, 32, 48, 64, 96, 128, 192. Picking m=10 or m=20 throws an error or forces padding.
PQ is metric-agnostic, but for cosine you must L2-normalize embeddings before training codebooks — otherwise centroids capture norm differences instead of directional ones.
k-means codebooks are frozen after training. If the training sample doesn't cover the production distribution, recall drops dramatically (drift effect).
For AI embeddings (BERT/CLIP/OpenAI) dimensions are often correlated. Vanilla PQ then yields significantly worse recall than OPQ for the same byte budget.
ADC produces approximate distances. Without re-ranking top-N candidates with full vectors from disk, recall@10 can be much worse than recall@100 after re-ranking.
The original paper introduces PQ together with ADC and two modes: SDC (symmetric) and ADC (asymmetric). Experiments on SIFT/GIST show 32× compression with a small recall drop.
OPQ learns an orthogonal rotation that minimizes quantization error. Became the default PQ variant in FAISS.
A family of methods (AQ, RQ, LSQ) generalizing PQ — vector as a sum of centroids from multiple codebooks. Higher recall at the cost of more expensive encoding.
Meta AI releases FAISS, including PQ4 Fast Scan with AVX-512 — a breakthrough in PQ CPU performance.
ScaNN improves PQ with a dot-product-aware (anisotropic) loss — higher recall for cosine/IP, widely used in Google production.
Time complexity: Encoding: O(D · k); Query (ADC): O(m · k) LUT + O(m) per database vector. Space complexity: O(N · m) per dataset + O(m · k · D/m) = O(m · k · D/m) codebooks.
The bottleneck is the dense LUT lookup + accumulation per vector. Modern implementations (FAISS, ScaNN) use SIMD/AVX-512 to sum LUTs for multiple vectors at once (SIMD-based ADC, "PQ Fast Scan").
Unlike IVF/HNSW, PQ at query time usually scans ALL vectors in the (sub)set — its strength is the cheap per-vector op (LUT lookup), not sparsity.
Encoding and scan are trivially data-parallel; k-means training can be parallelized per sub-space.
Number of sub-spaces / bytes per vector after PQ. Larger m means higher recall and more memory; m must divide D.
Bits per centroid index (k = 2^bits). 8 bits (k=256) is the standard due to byte alignment and SIMD; 4 bits → 16 centroids (PQ4) — extreme compression, lower recall.
Number of vectors used to train the codebooks. Recommended ≥ 30 · k per sub-space (FAISS); too small → unrepresentative centroids.
Choice: no rotation (vanilla PQ), OPQ (learned orthogonal), random rotation (cheap compromise). OPQ almost always adds 2–5 p.p. recall.
PQ Fast Scan is designed for AVX2/AVX-512 — parallel LUT summing for many vectors per instruction. LUTs stay in L1, giving hundreds of M comparisons/s per core.
FAISS GPU accelerates IVF-PQ and PQ — large query × database batches fit GPU memory bandwidth; Tensor Cores are only partially leveraged (lookup + add, not GEMM).
PQ runs efficiently on ARM (NEON) and x86 (AVX). What matters is memory bandwidth and cache locality, not a specific ISA.