Retrieval

SIDs

2023 · Active · Published: 25 June 2026 · Updated: 25 June 2026

Key innovation

Representing items (e.g. videos, products) as semantically meaningful tuples of discrete codewords that can be predicted autoregressively instead of being retrieved via nearest neighbor search in a dense embedding space.

How it works

The pipeline has three phases.

Phase 1 (offline, once per catalogue): for each item, a content encoder (e.g. Sentence-T5) computes an embedding from textual attributes; an RQ-VAE with N hierarchical codebooks quantises the embedding into an N-tuple of discrete codes. This tuple is the item's SID.

Phase 2 (offline, training): a sequence-to-sequence Transformer (e.g. T5) is trained on user sessions represented as SID sequences, with a next-SID prediction loss.

Phase 3 (online, inference): for a given user, the system takes their recent interactions (as a SID sequence), feeds them to the Transformer and beam-searches the top-K candidate next-SIDs, each of which is decoded back to a concrete item from the catalogue.

Problem solved

Classical recommender systems rely on atomic IDs (arbitrary, opaque item identifiers) or on dense embeddings with nearest neighbor search. Atomic IDs carry no intrinsic semantics (item_42851 tells you nothing about the item), require a separate embedding-table entry for each item, and scale poorly to catalogues of hundreds of millions of items. Embedding retrieval with approximate nearest neighbor (ANN) search requires maintaining huge indexes and does not expose the semantic structure between items to the model. Both also handle cold-start poorly: a representation learned from interactions is unavailable for an item that has none yet.

Key mechanisms

Content encoder (e.g. Sentence-T5) — generates dense semantic embeddings from descriptive item attributes

Residual-Quantized VAE (RQ-VAE) — hierarchical quantisation of the embedding into N codes: each layer encodes the residual of the previous one

N hierarchical codebooks (typically 4 layers × 256–8,192 entries per layer) with learned centroid vectors

Sequence-to-sequence Transformer (e.g. T5) trained with a next-SID prediction loss on SID sequences from user sessions

Beam-search decoding at inference: the model autoregressively generates the top-K candidate next-SIDs

Tie-breaking suffix for collisions (rare but possible — different items with identical code tuples)

Hierarchical semantics: semantically close items share SID prefixes, improving cold-start and generalisation
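
The beam-search step listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not TIGER's implementation: `toy_logprobs` is a hypothetical stand-in for the trained Transformer, `SID_TO_ITEM` is a made-up lookup table, and the probabilities are invented for the example.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Prefix = Tuple[int, ...]


def beam_search_sids(
    next_logprobs: Callable[[Prefix], Sequence[float]],
    num_layers: int,
    beam_width: int,
) -> List[Tuple[Prefix, float]]:
    """Generate the `beam_width` most likely SIDs, one code per layer."""
    beams: List[Tuple[Prefix, float]] = [((), 0.0)]
    for _ in range(num_layers):
        candidates = []
        for prefix, score in beams:
            # Extend every surviving prefix with every possible next code.
            for code, logprob in enumerate(next_logprobs(prefix)):
                candidates.append((prefix + (code,), score + logprob))
        # Keep only the best-scoring partial SIDs.
        candidates.sort(key=lambda cand: cand[1], reverse=True)
        beams = candidates[:beam_width]
    return beams


def toy_logprobs(prefix: Prefix) -> List[float]:
    """Hypothetical model: a fixed distribution per layer (assumption)."""
    table = [[0.7, 0.3], [0.4, 0.6]]
    return [math.log(p) for p in table[len(prefix)]]


# Hypothetical SID -> item table; generated SIDs with no entry are dropped.
SID_TO_ITEM: Dict[Prefix, str] = {(0, 1): "item_A", (1, 1): "item_B", (0, 0): "item_C"}

top = beam_search_sids(toy_logprobs, num_layers=2, beam_width=2)
items = [SID_TO_ITEM[sid] for sid, _ in top if sid in SID_TO_ITEM]
print(items)
```

In a real system `next_logprobs` would condition on the user's history as well as the prefix; the final lookup is where a generated tuple that matches no catalogue item would be discarded.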

Strengths & limitations

Strengths

✓ Dramatically reduces the recommendation vocabulary size (from hundreds of millions of atomic IDs to thousands of codes × N layers)

✓ Natural cold-start handling: new items with similar semantics inherit SID prefixes from existing ones

✓ Eliminates the need to maintain a separate ANN index: candidates are generated directly by the model

✓ Hierarchical structure facilitates LLM-based recommendation (SID tokens are a narrow vocabulary, similar to BPE tokens)

✓ Scales well with model size: larger Transformers predict SIDs more accurately

✓ Modality-agnostic: only the content encoder changes (e.g. Sentence-T5 for text, Qwen2.5-VL for video)

✓ Reported state-of-the-art results on standard benchmarks (Amazon Beauty, Sports and Toys in the original TIGER paper)

Limitations

✗ Requires separate RQ-VAE training and periodic retraining when the catalogue changes significantly (embedding distribution drift)

✗ SID collisions (different items with identical code tuples) need handling via a tie-breaking suffix or augmentation

✗ Recommendation quality is bounded by the underlying content encoder: a weak embedding means a weak SID

✗ Combining behavioural signal (collaborative filtering) with purely semantic SIDs is hard and needs additional mechanisms

✗ The first codebook dominates (encodes most variance), leading to uneven utilisation of subsequent layers without special regularisation

✗ Beam search returns only the top-K sequences, so the model may fail to surface suitable items for users with extremely long-tail interests

Components

Content Encoder: Mapping an item to a dense semantic representation

Pre-trained encoder model (originally Sentence-T5) that takes item attributes (e.g. title, description, category) and returns a dense embedding vector in semantic space. For modalities other than text, analogous multimodal encoders are used (e.g. CLIP, Qwen2.5-VL for video).

Residual-Quantized VAE (RQ-VAE): Converting a dense embedding into a tuple of discrete codes (SID)

A hierarchical quantisation mechanism that turns an embedding into N discrete codes. The first layer finds the nearest centroid in the first codebook; the second layer quantises the residual left by the first; the third quantises the residual left by the second; and so on. Each layer has its own codebook (typically 256–8,192 entries). The codebooks are trained jointly with the RQ-VAE's own encoder and decoder networks (in TIGER, the pre-trained content encoder is not part of this training) using a reconstruction loss plus a commitment loss.
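
The layer-by-layer residual lookup described above can be sketched in a few lines. This covers only the quantisation step with fixed codebooks, not RQ-VAE training; the function names and the toy codebooks are hypothetical and chosen so the arithmetic is easy to follow.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest_index(vector: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the centroid closest to `vector` (squared L2)."""
    def sq_dist(centroid: Vector) -> float:
        return sum((v - c) ** 2 for v, c in zip(vector, centroid))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def residual_quantise(
    embedding: Vector, codebooks: Sequence[Sequence[Vector]]
) -> Tuple[Tuple[int, ...], List[float]]:
    """Quantise `embedding` layer by layer; return (codes, final residual)."""
    residual = list(embedding)
    codes = []
    for codebook in codebooks:
        idx = nearest_index(residual, codebook)
        codes.append(idx)
        # The next layer only sees what this layer failed to explain.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tuple(codes), residual


# Hypothetical toy setup: 2 layers, 2 centroids each, 2-D embeddings.
TOY_CODEBOOKS = [
    [[0.0, 0.0], [4.0, 4.0]],    # coarse layer
    [[0.5, 0.0], [-0.5, 0.5]],   # fine layer, models the residual
]

codes, leftover = residual_quantise([4.5, 4.0], TOY_CODEBOOKS)
print(codes, leftover)  # (1, 0) [0.0, 0.0]
```

Here the coarse layer picks centroid 1, and the fine layer exactly absorbs the remaining `[0.5, 0.0]`, so the SID is `(1, 0)` with zero reconstruction error. Real embeddings leave a non-zero residual after the last layer.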

Hierarchical codebooks: Vocabulary of discrete units that SIDs are built from

N separate codebooks (one per layer), each with C learned centroid vectors (typically C=256 or C=8,192). They are learned during RQ-VAE training, jointly with the RQ-VAE encoder. Total SID vocabulary size is N × C, far smaller than the number of items in the catalogue (potentially hundreds of millions), while the N-tuples can still address up to C^N distinct code combinations.
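
To make the size argument concrete, the short calculation below compares the token vocabulary (N × C) with the number of distinct tuples that vocabulary can address (C^N). The helper names are hypothetical, and the two configurations simply reuse the layer count and codebook sizes quoted above.

```python
def sid_vocab_size(num_layers: int, codes_per_layer: int) -> int:
    """Number of distinct tokens the sequence model must embed."""
    return num_layers * codes_per_layer


def sid_address_space(num_layers: int, codes_per_layer: int) -> int:
    """Upper bound on the number of distinct SID tuples."""
    return codes_per_layer ** num_layers


for n, c in [(4, 256), (4, 8192)]:
    print(f"N={n}, C={c}: vocab={sid_vocab_size(n, c)}, "
          f"tuples={sid_address_space(n, c)}")
```

Even the smaller configuration addresses billions of tuples with about a thousand tokens. How many of those tuples are actually used depends on how evenly the RQ-VAE spreads items over the codes.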

Sequence-to-Sequence Transformer: Autoregressive generation of SIDs as Generative Retrieval

In the original work (TIGER) this is a T5 model. It takes a sequence of SIDs representing the user's interactions within a session and autoregressively predicts the SID of the next item the user is likely to engage with. Trained with standard next-token prediction loss on code sequences.
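
One way to feed SID sequences to a standard token-based model is to flatten each tuple into token ids, giving every layer its own id range. The sketch below assumes that layout (token id = layer × C + code); the source does not specify the tokenisation, so the offset scheme and the function names are illustrative assumptions.

```python
from typing import List, Sequence, Tuple

SID = Tuple[int, ...]


def sid_to_tokens(sid: SID, codes_per_layer: int) -> List[int]:
    """Map each code to a token id in a layer-specific range (assumed layout)."""
    return [layer * codes_per_layer + code for layer, code in enumerate(sid)]


def tokens_to_sid(tokens: Sequence[int], codes_per_layer: int) -> SID:
    """Inverse of `sid_to_tokens` for one item's tokens."""
    return tuple(tok - layer * codes_per_layer for layer, tok in enumerate(tokens))


def build_example(
    history: Sequence[SID], target: SID, codes_per_layer: int
) -> Tuple[List[int], List[int]]:
    """Encoder input = flattened history; decoder labels = next item's SID."""
    inputs = [tok for sid in history for tok in sid_to_tokens(sid, codes_per_layer)]
    labels = sid_to_tokens(target, codes_per_layer)
    return inputs, labels


inputs, labels = build_example(
    history=[(3, 1, 7), (3, 1, 2)], target=(3, 0, 5), codes_per_layer=256
)
print(inputs)  # [3, 257, 519, 3, 257, 514]
print(labels)  # [3, 256, 517]
```

With separate id ranges per layer, code 3 in the first codebook and code 3 in the second get different embeddings, and the next-token loss over `labels` is exactly a next-SID prediction loss.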

Implementation

Implementation pitfalls

SID collisions · Medium

Finite codebook sizes mean different items can end up with identical code tuples (collision). The collision probability grows with the catalogue size.

Fix: Add a tie-breaking suffix (e.g. an extra unique token) or augment with randomised permutations of code order to give the model the ability to disambiguate.
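
A tie-breaking suffix can be assigned with a simple counter per code tuple. The sketch below is one possible scheme, not necessarily the one used in any particular system: every item gets one extra code, which is 0 unless an earlier item already holds the same tuple. The function name and the item ids are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Tuple

Codes = Tuple[int, ...]


def add_tiebreak_suffix(item_codes: Dict[str, Codes]) -> Dict[str, Codes]:
    """Append a per-tuple counter so that every final SID is unique."""
    seen: Dict[Codes, int] = defaultdict(int)
    result: Dict[str, Codes] = {}
    # Sort item ids so the assignment is deterministic across runs.
    for item_id in sorted(item_codes):
        codes = item_codes[item_id]
        result[item_id] = codes + (seen[codes],)
        seen[codes] += 1
    return result


quantised = {"a": (1, 2, 3), "b": (1, 2, 3), "c": (4, 0, 9)}
sids = add_tiebreak_suffix(quantised)
print(sids)  # {'a': (1, 2, 3, 0), 'b': (1, 2, 3, 1), 'c': (4, 0, 9, 0)}
```

The suffix carries no semantics, so the shared prefix `(1, 2, 3)` still signals that items `a` and `b` are similar.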

Embedding distribution drift after adding new items · Medium

When a significant share of the catalogue changes (new categories, new trends), the original RQ-VAE codebooks may no longer cover the embedding space well, leading to worse quantisation.

Fix: Periodically retrain RQ-VAE on fresh embeddings, or use online clustering (e.g. incremental RQ-K-means) to update centroids without full retraining.
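
As a rough illustration of the online option, the sketch below applies the classic sequential k-means update to a single codebook layer: the nearest centroid moves toward each new vector by a step of 1/count. This is an assumption about what an incremental update could look like, not a description of a specific RQ-K-means implementation; for deeper layers the same update would be applied to residuals rather than raw embeddings.

```python
from typing import List


def online_centroid_update(
    centroids: List[List[float]], counts: List[int], embedding: List[float]
) -> int:
    """Move the nearest centroid toward `embedding`; return its index.

    `centroids` and `counts` are updated in place.
    """
    def sq_dist(centroid: List[float]) -> float:
        return sum((x - c) ** 2 for x, c in zip(embedding, centroid))

    idx = min(range(len(centroids)), key=lambda i: sq_dist(centroids[i]))
    counts[idx] += 1
    step = 1.0 / counts[idx]  # running-mean step size
    centroids[idx] = [c + step * (x - c) for c, x in zip(centroids[idx], embedding)]
    return idx


# Hypothetical layer with two centroids, each backed by one earlier vector.
centroids = [[0.0, 0.0], [10.0, 10.0]]
counts = [1, 1]
moved = online_centroid_update(centroids, counts, [2.0, 0.0])
print(moved, centroids[moved])  # 0 [1.0, 0.0]
```

Moving centroids changes which code some existing items map to, so any scheme like this needs a policy for when SIDs are re-assigned and when the sequence model is refreshed.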

Dominance of the first codebook layer · Medium

Without special regularisation, the first codebook absorbs most embedding variance, while subsequent layers only approximate small residuals — leading to uneven vocabulary utilisation.

Fix: Apply a per-layer weighted commitment loss, entropy regularisation, or uniform code sampling during RQ-VAE training.
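
Before choosing a regulariser it helps to measure how unevenly each layer's codes are used. The sketch below computes a per-layer usage perplexity (the exponential of the entropy of code frequencies): a value near the codebook size means uniform usage, a value near 1 means the layer has collapsed onto a single code. The function name and sample SIDs are hypothetical, and this diagnostic is a common practice rather than something prescribed by the source.

```python
import math
from collections import Counter
from typing import Sequence, Tuple


def layer_perplexity(sids: Sequence[Tuple[int, ...]], layer: int) -> float:
    """Effective number of codes in use at `layer` (exp of usage entropy)."""
    counts = Counter(sid[layer] for sid in sids)
    total = sum(counts.values())
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    return math.exp(entropy)


# Hypothetical assignments: layer 0 uses four codes evenly, layer 1 only one.
sample_sids = [(0, 0), (1, 0), (2, 0), (3, 0)]
for layer in range(2):
    print(layer, round(layer_perplexity(sample_sids, layer), 3))
```

Tracking this number per layer during RQ-VAE training shows whether a chosen fix is actually spreading items across the later codebooks.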

Evolution

Original paper · 2023 · NeurIPS 2023 (Google Research, arXiv 2305.05065) · Shashank Rajput

Recommender Systems with Generative Retrieval

Shashank Rajput, Nikhil Mehta, Anima Singh, Raghunandan H. Keshavan, Trung Vu, Lukasz Heldt, Lichan Hong, Yi Tay, Vinh Q. Tran, Jonah Samost, Maciej Kula, Ed H. Chi, Maheswaran Sathiamoorthy

2017

VQ-VAE — foundation of discrete embedding quantisation

Van den Oord et al. introduce the Vector Quantized Variational Autoencoder, the foundation of all later techniques for quantising continuous embeddings into discrete codes.

2022

RQ-VAE — Residual Quantization in generative audio and image

Lee et al. (Autoregressive Image Generation using Residual Quantization, CVPR 2022) and Zeghidour et al. (SoundStream) popularise hierarchical residual quantisation — the precursor to SIDs.

2023

TIGER — Semantic IDs as Generative Retrieval for recommendation

Inflection point

Rajput et al. (Google Research, NeurIPS 2023) publish Recommender Systems with Generative Retrieval — the first use of SIDs as targets for autoregressive generation in recommendation. SOTA on Amazon Beauty/Sports/Toys.

2025

Industrial-scale GRM with SIDs (Deng et al., Xue et al.)

First deployments of generative recommendation models based on SIDs in production environments with hundreds of millions of users.

2026

Disentangled SIDs (D-SIDs) in the RaG paradigm (Kuaishou)

Inflection point

Kuaishou extends SIDs into Disentangled SIDs that factor video content and creative semantics. D-SIDs are the key latent interface of the Recommendation-as-Generation paradigm, generating personalized videos on demand instead of retrieving from a static pool.
