The DLRM forward pass has four stages: (1) dense features pass through the bottom MLP (several FC layers with ReLU), producing a vector of fixed dimension D; (2) each sparse feature is looked up in its own embedding table, which returns a vector of the same dimension D; (3) in the feature interaction layer, all N vectors (bottom MLP output plus embeddings) are treated as rows of a matrix and their pairwise dot products are computed, yielding a symmetric N×N matrix from which the N(N-1)/2 entries of one strict triangle (diagonal excluded) are kept; (4) the top MLP takes the concatenation of the interaction output and the bottom MLP output and applies several FC layers ending in a sigmoid for CTR prediction. Training minimises binary cross-entropy against observed clicks using SGD or Adagrad; gradients of the replicated MLPs are synchronised across devices, while each embedding shard is updated on the device that holds it.
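The four stages can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation: the layer widths, vocabulary sizes, random weights, and the helper names (`mlp`, `init_layers`, `dlrm_forward`, `binary_cross_entropy`) are all assumptions chosen for readability.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 4                        # shared dimension of embeddings and bottom-MLP output
num_dense = 6                # number of dense input features (illustrative)
vocab_sizes = [10, 20, 30]   # categories per sparse feature (illustrative)

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_layers(sizes):
    """Random (weight, bias) pairs for consecutive layer sizes."""
    return [(rng.normal(0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(x, layers, final_activation):
    """FC layers with ReLU in between and a chosen activation on the last layer."""
    for i, (W, b) in enumerate(layers):
        x = x @ W + b
        x = final_activation(x) if i == len(layers) - 1 else relu(x)
    return x

bottom = init_layers([num_dense, 8, D])
tables = [rng.normal(0, 0.1, (v, D)) for v in vocab_sizes]
N = 1 + len(tables)                  # bottom-MLP vector plus one vector per table
num_pairs = N * (N - 1) // 2
top = init_layers([num_pairs + D, 8, 1])

def dlrm_forward(dense, sparse_ids):
    x = mlp(dense, bottom, relu)                                 # (1) -> (B, D)
    embs = [t[sparse_ids[:, j]] for j, t in enumerate(tables)]   # (2) each (B, D)
    Z = np.stack([x] + embs, axis=1)                             # (B, N, D)
    G = Z @ Z.transpose(0, 2, 1)                                 # (3) (B, N, N)
    iu, ju = np.triu_indices(N, k=1)                             # strict triangle
    inter = G[:, iu, ju]                                         # (B, N(N-1)/2)
    top_in = np.concatenate([x, inter], axis=1)                  # (4) concat
    return mlp(top_in, top, sigmoid)[:, 0]                       # click probability

def binary_cross_entropy(p, y, eps=1e-12):
    return float(-np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))

dense = rng.normal(size=(5, num_dense))
sparse_ids = np.stack([rng.integers(0, v, size=5) for v in vocab_sizes], axis=1)
clicks = rng.integers(0, 2, size=5)

probs = dlrm_forward(dense, sparse_ids)
loss = binary_cross_entropy(probs, clicks)
```

With three sparse features, N is 4, so the interaction layer contributes 6 values and the top MLP input has 6 + D = 10 columns.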
Classical recommendation approaches (e.g. pure matrix factorization) scale poorly to the high-dimensional inputs found in large production systems, where each example carries hundreds of features, both dense numerical and sparse categorical. Early neural recommendation networks (Wide & Deep, NeuralCF, DeepFM) differed in architectural details but lacked a unified, open reference implementation with a dedicated parallelisation strategy for terabyte-scale embedding tables. DLRM addresses this with a unified architectural pattern, a reference implementation, and system/algorithm co-design.
A multi-layer perceptron processing dense (numerical, continuous) features. Typically 2–4 FC layers with ReLU, ending with a vector of fixed dimensionality D — the same dimension as the embeddings. Computationally light and replicated across devices (data parallel).
Huge learned tables (E_i: V_i × D, where V_i is the number of categories of the i-th feature and D is the embedding dimension) with indexed lookup. In production each table can have billions of entries. Sharded model-parallel across devices: different devices hold different columns (D dimensions) or different row segments.
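The lookup and the two sharding layouts mentioned above can be illustrated on a toy table. This is a sketch under stated assumptions: the sizes are tiny, the "devices" are just list entries, and `lookup_row_sharded` / `lookup_col_sharded` are hypothetical helper names.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, num_devices = 12, 4, 3
table = rng.normal(size=(V, D))        # E_i with shape V_i x D

# Row-wise sharding: each device holds a contiguous segment of rows.
rows_per_shard = V // num_devices
row_shards = [table[d * rows_per_shard:(d + 1) * rows_per_shard]
              for d in range(num_devices)]

def lookup_row_sharded(idx):
    """Route the index to the device that owns the row, then read locally."""
    device, local = divmod(idx, rows_per_shard)
    return row_shards[device][local]

# Column-wise sharding: each device holds a slice of the D dimensions.
col_shards = np.array_split(table, 2, axis=1)

def lookup_col_sharded(idx):
    """Every device returns its slice of the row; the slices are concatenated."""
    return np.concatenate([shard[idx] for shard in col_shards])
```

Row-wise sharding sends each lookup to a single device, whereas column-wise sharding involves every device in every lookup but spreads the load of frequently accessed rows evenly.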
A layer that explicitly models cross-feature interactions via pairwise dot products. For N vectors (bottom MLP output + N-1 embeddings) it produces N(N-1)/2 scalar values — second-order interactions inspired by factorization machines. The result is concatenated with the bottom MLP output and passed to the top MLP.
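A small worked example makes the N(N-1)/2 count concrete. The function name `pairwise_interactions` and the three input vectors are illustrative assumptions; the computation is the Gram matrix followed by extraction of one strict triangle.

```python
import numpy as np

def pairwise_interactions(Z):
    """Z has shape (N, D); returns the N(N-1)/2 distinct pairwise dot products."""
    G = Z @ Z.T                               # (N, N) symmetric Gram matrix
    i, j = np.triu_indices(Z.shape[0], k=1)   # pairs with i < j, diagonal excluded
    return G[i, j]

# Row 0 stands in for the bottom MLP output, rows 1-2 for two embeddings.
Z = np.array([[1.0, 0.0],
              [2.0, 1.0],
              [0.0, 3.0]])

out = pairwise_interactions(Z)   # pairs (0,1), (0,2), (1,2) -> [2., 0., 3.]
```

Because the Gram matrix is symmetric, taking the strict upper or the strict lower triangle yields the same set of values; only their order differs.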
A multi-layer perceptron taking the concatenation of bottom MLP output + pairwise interaction values. Several FC layers with ReLU + sigmoid output for CTR prediction (or another score metric). Also replicated data-parallel across devices.
A parallelisation scheme combining data parallel (for MLPs) with model parallel (for embedding tables sharded across devices). A critical all-to-all operation moves embedding lookup results between devices before the interaction layer. This pattern became the standard for scalable recommendation training.
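The role of the all-to-all can be simulated on a single machine with plain dictionaries. This is a sketch, assuming table-wise placement (table t lives on device t mod P) and an evenly divisible batch; real systems would use a collective such as an all-to-all in a communication library, which is not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)
P, T, D, B, V = 2, 4, 3, 6, 10   # devices, tables, embedding dim, global batch, rows
tables = [rng.normal(size=(V, D)) for _ in range(T)]
ids = rng.integers(0, V, size=(B, T))   # one categorical index per sample and table
owner = [t % P for t in range(T)]       # assumed table-wise placement
local_batch = B // P

# Step 1 (model parallel): each device looks up the whole global batch,
# but only in the tables it owns.
send = {d: {t: tables[t][ids[:, t]] for t in range(T) if owner[t] == d}
        for d in range(P)}

# Step 2 (all-to-all): each device sends every other device the rows that
# belong to that device's slice of the batch.
recv = {d: {} for d in range(P)}
for src in range(P):
    for t, emb in send[src].items():
        for dst in range(P):
            recv[dst][t] = emb[dst * local_batch:(dst + 1) * local_batch]

# Step 3 (data parallel): each device now holds all T embeddings for its
# local samples and can run the interaction layer and top MLP on them.
per_device = {d: np.stack([recv[d][t] for t in range(T)], axis=1)
              for d in range(P)}

# Reference: what a single device holding every table would compute.
reference = np.stack([tables[t][ids[:, t]] for t in range(T)], axis=1)
```

Before the exchange the data is partitioned by table; afterwards it is partitioned by sample, which is the layout the replicated MLPs expect.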
In production, embedding tables quickly grow to terabytes (billions of rows × tens to hundreds of columns). Tables of that size do not fit in the memory of a single GPU, so training requires model-parallel sharding across many devices.
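A back-of-the-envelope calculation shows how quickly the sizes add up. The row count, embedding dimension, 4-byte values, and 80 GB device capacity below are hypothetical figures picked for illustration, and the estimate ignores optimizer state and activations.

```python
def table_bytes(num_rows, dim, bytes_per_value=4):
    """Memory needed to store one embedding table's weights."""
    return num_rows * dim * bytes_per_value

rows = 2_000_000_000           # hypothetical: 2 billion categories
dim = 128                      # hypothetical embedding dimension
size = table_bytes(rows, dim)  # 1_024_000_000_000 bytes, about 1 TB

gpu_memory = 80 * 10**9        # assumed 80 GB of device memory
min_devices = -(-size // gpu_memory)   # ceiling division -> 13
```

Under these assumptions a single table already needs at least 13 devices for its weights alone.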
Hybrid parallelism requires an all-to-all exchange of embedding outputs among all devices in every iteration; at hundreds or thousands of GPUs this communication becomes the main bottleneck.
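The volume of that exchange can be estimated with simple arithmetic. The sketch below assumes one D-dimensional vector per sample and table, 4-byte values, and uniform placement so that a fraction 1/P of the data is already on the right device; the function name and the example figures are illustrative.

```python
def all_to_all_bytes(global_batch, num_tables, dim, num_devices, bytes_per_value=4):
    """Bytes crossing device boundaries in one forward all-to-all."""
    total = global_batch * num_tables * dim * bytes_per_value
    # Each device keeps the 1/P share that belongs to its own samples.
    return total * (num_devices - 1) // num_devices

# Hypothetical job: 65,536 samples, 26 tables, D = 128, 64 devices.
fwd = all_to_all_bytes(65_536, 26, 128, 64)   # 858_783_744 bytes per iteration
```

The backward pass moves gradients of the same shape in the opposite direction, so the per-iteration traffic is roughly double this figure.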
Embeddings of new items start from random initialisation. The model has no semantic prior about a new item, so recommendation quality for it is initially very low.
The original DLRM models only second-order interactions (pairwise dot products). Higher-order interactions (e.g. a third-order user×item×context cross) are not represented explicitly and require additional mechanisms.
Matrix factorization (e.g. SVD++) as the dominant recommendation technique — the foundational intuition for feature interaction in later deep learning models.
First production neural recommendation networks: Wide & Deep combines memorisation (wide) with generalisation (deep); YouTube Deep NN introduces two-stage candidate generation + ranking.
DeepFM combines factorization machines (second-order) and deep neural networks (higher orders) in a single end-to-end model — an important DLRM precursor.
Naumov et al. publish DLRM as a unified architectural pattern with an open-source implementation in PyTorch and Caffe2 and a dedicated hybrid parallelisation scheme. DLRM is later adopted as the recommendation benchmark in MLPerf.
TorchRec (PyTorch): a library that evolves the DLRM pattern into a production-ready framework, with modular sharded embeddings and support for model-parallel and hybrid-parallel training of industrial-scale recommendation models.
Rajput et al. (Google, NeurIPS 2023) introduce Semantic IDs and Generative Retrieval — a paradigm shift away from DLRM-style dot-product retrieval toward autoregressive SID generation. DLRM remains a strong baseline for comparisons.
The RaG paradigm (Kuaishou, arXiv 2606.25496) explicitly compares its production results (400M+ DAU) to the DLRM baseline: +5.46% ad revenue for RaG vs DLRM, confirming DLRM's role as the standard against which progress is measured.
A dense (MLP) + conditional (embedding lookup) hybrid — which is why DLRM requires a specialised training infrastructure.
Embedding lookup is a form of input-dependent routing: for each sample only a small subset of embedding table rows (corresponding to active categories) is actually read. This contrasts with dense MLPs in which all weights are active.
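The sparsity of the access pattern is easy to see on a toy table. In this sketch the table size, batch size, and the all-ones upstream gradient are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, B = 1000, 8, 32
table = rng.normal(size=(V, D))
ids = rng.integers(0, V, size=B)       # active category per sample

rows_read = np.unique(ids)             # rows actually touched by this batch
fraction_touched = rows_read.size / V  # at most B / V

# The backward pass is equally sparse: only the rows that were read
# receive a gradient. An upstream gradient of ones is assumed here.
grad = np.zeros_like(table)
np.add.at(grad, ids, 1.0)              # accumulates correctly for repeated ids
rows_with_grad = int(np.count_nonzero(grad.any(axis=1)))
```

A dense layer of the same size would read and update all V × D weights for every batch, whereas here at most B rows are involved.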
DLRM training is one of the most heavily parallelised workloads in deep learning: hundreds/thousands of GPUs work simultaneously. Inference can be data-parallel replicated but requires huge embedding tables in serving memory.
The DLRM MLPs are typical GEMMs that run on tensor cores, while embedding lookups are memory-bound and rely on high-bandwidth memory (HBM). GPUs such as NVIDIA A100/H100 and AMD MI250 are the standard target.
Google runs DLRM-style models on TPUs for its recommendation systems. SparseCore, a dedicated unit for embedding-heavy workloads found in TPU generations such as v4 and v5p, was designed specifically for this kind of recommendation model.
DLRM inference can be served on CPU (e.g. Intel AVX-512) — Meta historically used CPU for parts of recommendation inference due to its memory capacity advantage over GPU for embedding tables.