The pipeline has three phases. Phase 1 (offline, once per catalogue): for each item a content encoder (e.g. Sentence-T5) computes an embedding from textual attributes; an RQ-VAE with N hierarchical codebooks quantises the embedding into an N-tuple of discrete codes, which is the item's SID. Phase 2 (offline, training): a sequence-to-sequence Transformer (e.g. T5) is trained on user sessions represented as SID sequences, with a next-SID prediction loss. Phase 3 (online, inference): for a given user the system takes their recent interactions (as a SID sequence), feeds them to the Transformer and beam-searches the top-K candidate next-SIDs, each decoded back to a concrete item from the catalogue.
Classical recommender systems rely on atomic IDs (arbitrary, unique item identifiers) or on dense embeddings with approximate nearest-neighbour (ANN) search. Atomic IDs carry no intrinsic semantics (item_42851 tells you nothing), require a separate embedding lookup for each item and scale poorly across catalogues of hundreds of millions of items. Embedding retrieval with ANN search requires maintaining huge indexes and loses semantic structure between items. Both approaches also handle cold-start items poorly.
Pre-trained encoder model (originally Sentence-T5) that takes item attributes (e.g. title, description, category) and returns a dense embedding vector in semantic space. For modalities other than text, analogous multimodal encoders are used (e.g. CLIP, Qwen2.5-VL for video).
A hierarchical quantisation mechanism that turns an embedding into N discrete codes. The first layer finds the nearest centroid in the first codebook; the second quantises the residual left by the first; the third quantises the residual left by the second; and so on. Each layer has its own codebook (typically 256 to 8,192 entries). The codebooks are trained jointly with the RQ-VAE's own encoder and decoder networks via a reconstruction loss plus a commitment loss.
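The layer-by-layer residual lookup described above can be sketched in a few lines of NumPy. This is a minimal illustration of the inference-time quantisation step only, not the RQ-VAE training procedure: the function name `residual_quantise`, the toy sizes and the random centroids are all assumptions made for the example, and a real system would use learned codebooks.

```python
import numpy as np

def residual_quantise(embedding, codebooks):
    """Return one code index per layer plus the summed reconstruction."""
    residual = np.asarray(embedding, dtype=np.float64).copy()
    reconstruction = np.zeros_like(residual)
    codes = []
    for codebook in codebooks:  # each codebook has shape (C, dim)
        # Nearest centroid to the current residual (squared Euclidean distance).
        distances = np.sum((codebook - residual) ** 2, axis=1)
        index = int(np.argmin(distances))
        codes.append(index)
        reconstruction += codebook[index]
        residual -= codebook[index]  # the next layer quantises what is left
    return tuple(codes), reconstruction

rng = np.random.default_rng(0)
dim, num_layers, codebook_size = 8, 3, 16  # toy sizes chosen for the example

# Random centroids stand in for learned ones (assumption for illustration).
codebooks = [
    rng.normal(scale=0.5 ** layer, size=(codebook_size, dim))
    for layer in range(num_layers)
]
embedding = rng.normal(size=dim)

sid, approx = residual_quantise(embedding, codebooks)
print(sid)  # an N-tuple of code indices, i.e. the item's SID
```

The reconstruction is simply the sum of the selected centroids, which is why each further layer can only refine what the earlier layers left unexplained.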
N separate codebooks (one per layer), each with C learned centroid vectors (typically C=256 or C=8,192). Learned during RQ-VAE training, jointly with the RQ-VAE encoder. Total SID vocabulary size is N × C, far smaller than the number of items in the catalogue (hundreds of millions).
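A quick calculation shows why the token vocabulary stays small while the space of tuples grows fast. The values of N below are hypothetical; C=256 is one of the sizes quoted above.

```python
def sid_sizes(num_layers, codebook_size):
    """Return (token vocabulary N*C, number of addressable N-tuples C**N)."""
    token_vocabulary = num_layers * codebook_size
    addressable_tuples = codebook_size ** num_layers
    return token_vocabulary, addressable_tuples

print(sid_sizes(3, 256))  # (768, 16777216)
print(sid_sizes(4, 256))  # (1024, 4294967296)
```

The model only has to learn N × C output tokens, while the number of distinct tuples it can address grows as C to the power N.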
In the original work (TIGER) this is a T5 model. It takes a sequence of SIDs representing the user's interactions within a session and autoregressively predicts the SID of the next item the user is likely to engage with. Trained with standard next-token prediction loss on code sequences.
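To make the training setup concrete, here is a sketch of how one session could be turned into an input/target pair for next-SID prediction. The per-layer token offset (layer × C + code) and the helper names `sid_to_tokens` and `build_example` are assumptions for illustration; the text above does not specify how codes are mapped to token ids.

```python
def sid_to_tokens(sid, codebook_size):
    """Map an N-tuple of codes to token ids, giving each layer its own id range."""
    return [layer * codebook_size + code for layer, code in enumerate(sid)]

def build_example(session_sids, codebook_size):
    """Use all but the last item as model input and the last item as the target."""
    history, target = session_sids[:-1], session_sids[-1]
    input_tokens = [
        token
        for sid in history
        for token in sid_to_tokens(sid, codebook_size)
    ]
    target_tokens = sid_to_tokens(target, codebook_size)
    return input_tokens, target_tokens

# A hypothetical session of three items, each with a 3-code SID.
session = [(5, 17, 200), (5, 3, 41), (90, 8, 8)]
inputs, targets = build_example(session, codebook_size=256)
print(inputs)   # [5, 273, 712, 5, 259, 553]
print(targets)  # [90, 264, 520]
```

With this mapping the token vocabulary has exactly N × C entries, and the decoder is trained to emit the target tokens one at a time.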
Finite codebook sizes mean different items can end up with identical code tuples (collision). The collision probability grows with the catalogue size.
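One possible way to handle collisions, shown here as an assumption rather than something the text above prescribes, is to append an extra disambiguation index to every SID so that items sharing a code tuple still receive distinct identifiers. The helper name `disambiguate` and the item ids are hypothetical.

```python
from collections import defaultdict

def disambiguate(item_to_sid):
    """Append a running index so that colliding SIDs become unique."""
    seen = defaultdict(int)  # how many items have used each raw SID so far
    result = {}
    for item_id in sorted(item_to_sid):  # sorted for a deterministic assignment
        sid = tuple(item_to_sid[item_id])
        result[item_id] = sid + (seen[sid],)
        seen[sid] += 1
    return result

raw = {
    "item_a": (5, 17, 200),
    "item_b": (5, 17, 200),  # collides with item_a
    "item_c": (9, 1, 3),
}
unique = disambiguate(raw)
print(unique["item_a"])  # (5, 17, 200, 0)
print(unique["item_b"])  # (5, 17, 200, 1)
print(unique["item_c"])  # (9, 1, 3, 0)
```

The extra position carries no semantics; it only restores a one-to-one mapping between SIDs and items.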
When a significant share of the catalogue changes (new categories, new trends), the original RQ-VAE codebooks may no longer cover the embedding space well, leading to worse quantisation.
Without special regularisation, the first codebook absorbs most embedding variance, while subsequent layers only approximate small residuals, leading to uneven vocabulary utilisation.
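Uneven utilisation can be checked directly from the assigned codes. The sketch below reports, per layer, the fraction of codebook entries that are used at all and the perplexity of the code distribution (which equals C when usage is perfectly uniform). The function name `codebook_usage` and the toy data are assumptions for illustration.

```python
import numpy as np

def codebook_usage(codes, codebook_size):
    """codes: integer array of shape (num_items, N). Returns per-layer statistics."""
    codes = np.asarray(codes)
    stats = []
    for layer in range(codes.shape[1]):
        counts = np.bincount(codes[:, layer], minlength=codebook_size)
        probs = counts / counts.sum()
        nonzero = probs[probs > 0]
        # Perplexity = exp(entropy); low values mean a few codes dominate.
        perplexity = float(np.exp(-np.sum(nonzero * np.log(nonzero))))
        stats.append({
            "used_fraction": float(np.mean(counts > 0)),
            "perplexity": perplexity,
        })
    return stats

# Toy data: layer 0 uses all 4 codes evenly, layer 1 mostly uses code 0.
codes = np.array([[0, 0], [1, 0], [2, 0], [3, 1]])
stats = codebook_usage(codes, codebook_size=4)
for layer, layer_stats in enumerate(stats):
    print(layer, layer_stats)
```

Tracking these numbers per layer during RQ-VAE training gives an early signal that some codebooks are underused.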
Van den Oord et al. introduce the Vector Quantized Variational Autoencoder, the foundation of all later techniques for quantising continuous embeddings into discrete codes.
Lee et al. (Autoregressive Image Generation using Residual Quantization, CVPR 2022) and Zeghidour et al. (SoundStream) popularise hierarchical residual quantisation, the precursor to SIDs.
Rajput et al. (Google Research, NeurIPS 2023) publish Recommender Systems with Generative Retrieval, the first use of SIDs as targets for autoregressive generation in recommendation. SOTA on Amazon Beauty/Sports/Toys.
First deployments of generative recommendation models based on SIDs in production environments with hundreds of millions of users.
Kuaishou extends SIDs into Disentangled SIDs that factor video content and creative semantics. D-SIDs are the key latent interface of the Recommendation-as-Generation paradigm, generating personalized videos on demand instead of retrieving from a static pool.
The autoregressive model is dense in computation, but routing through discrete SID codes conditionally activates specific paths in the item space.
Inference proceeds via beam search over the hierarchical code vocabulary: the model autoregressively selects the top-K candidates at each layer, ultimately generating the top-K complete SID tuples as recommendations.
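The beam search over code layers can be sketched as follows. The `toy_model` function is a hypothetical stand-in for the Transformer decoder: it returns log-probabilities over the C codes of the next layer given the codes chosen so far. Everything here (names, sizes, the scoring rule) is an assumption for illustration, not the production decoding loop.

```python
import numpy as np

def beam_search_sids(next_code_log_probs, num_layers, beam_width):
    """Return up to beam_width (sid_tuple, log_prob) pairs, best first."""
    beams = [((), 0.0)]
    for _ in range(num_layers):
        candidates = []
        for prefix, score in beams:
            log_probs = next_code_log_probs(prefix)
            for code, log_prob in enumerate(log_probs):
                candidates.append((prefix + (code,), score + float(log_prob)))
        # Keep only the highest-scoring partial SIDs for the next layer.
        candidates.sort(key=lambda candidate: candidate[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

def toy_model(prefix, codebook_size=4):
    """Hypothetical decoder: deterministic logits that depend on the prefix."""
    logits = np.array(
        [(7 * len(prefix) + 3 * sum(prefix) + 5 * code) % 11
         for code in range(codebook_size)],
        dtype=float,
    )
    return logits - np.log(np.sum(np.exp(logits)))  # log-softmax

top = beam_search_sids(toy_model, num_layers=3, beam_width=2)
for sid, log_prob in top:
    print(sid, round(log_prob, 3))
```

In a full system each returned tuple would then be looked up in the SID-to-item table; a generated tuple that maps to no catalogue item would have to be discarded.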
SID generation is sequential: each layer's code depends on the previous ones. Training can be partially parallelised with teacher forcing, as in a typical Transformer, but inference per session is sequential.
Training the content encoder and RQ-VAE, and training the sequence-to-sequence Transformer, are both typical GPU workloads. Autoregressive inference benefits from optimisations such as KV caching.
The original work (Google Research) used TPUs for T5 training; TPUs are a natural choice for this family of architectures.
The SID concept itself is hardware-independent: SIDs can be generated and predicted on any platform supporting Transformers and quantisation operations.