Retrieval

DLRM

2019 · Active · Published: 25 June 2026 · Updated: 25 June 2026

Key innovation

A unified architectural pattern for deep learning recommender systems that combines dense (numerical) and sparse (categorical) features via a bottom MLP, embedding tables, an explicit pairwise interaction layer and a top MLP, with a dedicated parallelisation strategy: data parallel for the FC layers and model parallel for the huge embedding tables.

How it works

The DLRM forward pass has four stages: (1) dense features → bottom MLP (several FC layers with ReLU), producing a vector of fixed dimensionality D; (2) sparse features → embedding lookup in the respective tables, each returning a vector of the same dimensionality D; (3) feature interaction layer: the N vectors (bottom MLP output + N-1 embeddings) are treated as rows of an N×D matrix and all pairwise dot products are computed, yielding a symmetric N×N matrix of which only the strictly upper triangle (N(N-1)/2 values, diagonal excluded) is kept; (4) top MLP: the interaction output is concatenated with the bottom MLP output and passed through several FC layers with a sigmoid at the end for CTR prediction. Training: binary cross-entropy loss against observed clicks, optimised with SGD or Adagrad; MLP gradients are synchronised across the data-parallel replicas, while embedding gradients are routed back to the device that owns each table shard.
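The four stages can be traced in a small NumPy sketch. This is an illustration under stated assumptions, not the reference implementation: the layer sizes, vocabulary sizes, random initialisation and the helper names `mlp`, `init_mlp` and `dlrm_forward` are all hypothetical, and everything runs on a single device.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, layers, final_sigmoid=False):
    """Apply FC layers with ReLU; optionally use a sigmoid on the last layer."""
    for i, (w, b) in enumerate(layers):
        x = x @ w + b
        if final_sigmoid and i == len(layers) - 1:
            x = 1.0 / (1.0 + np.exp(-x))
        else:
            x = np.maximum(x, 0.0)
    return x

def init_mlp(sizes):
    """Toy (weight, bias) pairs for consecutive layer sizes (assumed init)."""
    return [(rng.normal(0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def dlrm_forward(dense, sparse_ids, bottom, tables, top):
    # (1) dense features -> bottom MLP -> (batch, D)
    d = mlp(dense, bottom)
    # (2) one embedding lookup per categorical feature -> each (batch, D)
    embs = [table[ids] for table, ids in zip(tables, sparse_ids.T)]
    # (3) stack into (batch, N, D) and take all pairwise dot products
    z = np.stack([d] + embs, axis=1)
    gram = z @ z.transpose(0, 2, 1)             # (batch, N, N)
    iu, ju = np.triu_indices(z.shape[1], k=1)   # strictly upper triangle
    pairs = gram[:, iu, ju]                     # (batch, N(N-1)/2)
    # (4) concatenate with bottom MLP output -> top MLP -> click probability
    return mlp(np.concatenate([d, pairs], axis=1), top, final_sigmoid=True)[:, 0]

D, batch = 4, 3
vocab_sizes = [10, 20, 30]                      # three categorical features
N = 1 + len(vocab_sizes)
bottom = init_mlp([5, 8, D])                    # 5 dense input features
tables = [rng.normal(0, 0.1, (v, D)) for v in vocab_sizes]
top = init_mlp([D + N * (N - 1) // 2, 8, 1])

dense = rng.normal(size=(batch, 5))
sparse_ids = np.stack([rng.integers(0, v, size=batch) for v in vocab_sizes], axis=1)
p = dlrm_forward(dense, sparse_ids, bottom, tables, top)

# Binary cross-entropy against (made-up) click labels
clicks = np.array([1.0, 0.0, 1.0])
bce = -np.mean(clicks * np.log(p) + (1 - clicks) * np.log(1 - p))
```

With N = 4 vectors of dimension D = 4, the top MLP input has D + N(N-1)/2 = 10 features, which is why the first top layer is sized that way.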

Problem solved

Classical recommendation approaches (e.g. pure matrix factorization) model only user-item interactions and do not naturally handle the hundreds of heterogeneous features (both dense numerical and sparse categorical) found in large-scale production environments. Early neural recommendation networks (Wide & Deep, NeuralCF, DeepFM) differed in architectural details but lacked a unified, open reference implementation with a dedicated parallelisation strategy for terabyte-scale embedding tables. DLRM addresses this with a unified architectural pattern + reference implementation + system/algorithm co-design.

Key mechanisms

Bottom MLP: several ReLU-activated FC layers that map dense numerical features to a vector of fixed dimensionality D

Sparse embedding tables: large learned tables (potentially billions of entries × tens/hundreds of columns) with indexed lookup for categorical features

Pairwise feature interaction: explicit dot products over all N(N-1)/2 pairs of feature vectors, inspired by factorization machines

Top MLP: CTR prediction from the concatenation of bottom MLP output + pairwise interactions

Hybrid parallelisation: data parallel for MLPs + model parallel (sharded) for embedding tables, with an all-to-all exchange between the two phases

Reference implementations: originally PyTorch and Caffe2, later evolving into TorchRec

MLPerf: the standard benchmark model for recommendation workloads (DLRM-DCNv2 in newer editions)

Strengths & limitations

Strengths

✓ Open, well-documented reference implementation: easy to reproduce and compare against

✓ Unified pattern for dense + sparse features: covers the typical production recommendation pipeline

✓ Scalability to terabyte-scale embedding tables thanks to hybrid parallelisation

✓ De-facto MLPerf standard: used to evaluate GPUs/TPUs on recommendation workloads

✓ Stable and predictable: easier to tune and debug than more exotic architectures

✓ Strong base for hybrids with more advanced mechanisms (e.g. adding attention over user history), with DLRM as the 'backbone'

Limitations

✗ Explicitly models only pairwise (second-order) interactions: higher-order interactions require additional mechanisms (DCN, DCNv2, AutoInt)

✗ Terabyte-scale embedding tables are very hard to maintain (memory, parameter distribution networks, checkpoints, serving)

✗ No modelling of sequential user context (behaviour over time): requires extensions such as DIN/DIEN

✗ Retrieve-and-rank paradigm is limited to a pre-built candidate pool: it does not generate content for long-tail intents (addressed by RaG/SIDs)

✗ Cold-start for new items is weak: the model learns embeddings from scratch without semantic transfer (addressed by Semantic IDs)

✗ High compute cost in production training (GPU-hours for billions of entries); inference requires keeping huge tables in serving memory

Components

Bottom MLP: Mapping dense features to a unified embedding space

A multi-layer perceptron processing dense (numerical, continuous) features. Typically 2–4 FC layers with ReLU, ending with a vector of fixed dimensionality D — the same dimension as the embeddings. Computationally light and replicated across devices (data parallel).


Sparse Embedding Tables: Representing sparse categorical features as dense vectors

Huge learned tables (E_i: V_i × D, where V_i is the number of categories of the i-th feature and D is the embedding dimension) with indexed lookup. In production each table can have billions of entries. Sharded model-parallel across devices: different devices hold different columns (D dimensions) or different row segments.
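To make the scale concrete, the sketch below computes the memory of one table and maps a global row index to a device under contiguous row-wise sharding. The table size (one billion rows, D = 128, fp32), the device count and the helper names `table_bytes` and `row_shard` are illustrative assumptions, not figures from a specific deployment.

```python
def table_bytes(num_rows, dim, bytes_per_value=4):
    """Memory of one V x D embedding table (fp32 by default)."""
    return num_rows * dim * bytes_per_value

def row_shard(index, num_rows, num_devices):
    """Map a global row index to (device, local_row) for contiguous row-wise sharding."""
    rows_per_device = -(-num_rows // num_devices)  # ceiling division
    return index // rows_per_device, index % rows_per_device

size = table_bytes(1_000_000_000, 128)     # 1e9 rows x 128 dims x 4 bytes
print(size / 1e9, "GB")                    # 512.0 GB
print(row_shard(750_000_001, 1_000_000_000, 8))  # (6, 1)
```

A single table of this assumed size already exceeds the memory of any one accelerator, which is why the lookup has to be routed to whichever device owns the row.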

Feature Interaction Layer: Explicit modelling of second-order feature interactions

A layer that explicitly models cross-feature interactions via pairwise dot products. For N vectors (bottom MLP output + N-1 embeddings) it produces N(N-1)/2 scalar values — second-order interactions inspired by factorization machines. The result is concatenated with the bottom MLP output and passed to the top MLP.
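In symbols, the layer can be written as follows. The notation is an assumption introduced here for illustration: v_1 denotes the bottom MLP output, v_2, ..., v_N the embeddings, and x_top the input to the top MLP.

```latex
z_{ij} = \langle v_i, v_j \rangle = \sum_{k=1}^{D} v_{i,k}\, v_{j,k},
\qquad 1 \le i < j \le N,
\qquad \bigl|\{ z_{ij} \}\bigr| = \binom{N}{2} = \frac{N(N-1)}{2}

x_{\mathrm{top}} = \bigl[\, v_1 \;\Vert\; z_{12},\, z_{13},\, \dots,\, z_{N-1,N} \,\bigr]
\in \mathbb{R}^{\,D + N(N-1)/2}
```

As a worked example not taken from this page: with 26 categorical features, as in the Criteo click logs, N = 27 and the layer emits 27 · 26 / 2 = 351 interaction values.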


Top MLP: Final CTR/score prediction from interacted features

A multi-layer perceptron taking the concatenation of bottom MLP output + pairwise interaction values. Several FC layers with ReLU + sigmoid output for CTR prediction (or another score metric). Also replicated data-parallel across devices.


Hybrid Parallelism Scheme: Scaling DLRM to terabyte embedding tables and billions of training samples

A parallelisation scheme combining data parallel (for MLPs) with model parallel (for embedding tables sharded across devices). A critical all-to-all operation moves embedding lookup results between devices before the interaction layer. This pattern became the standard for scalable recommendation training.
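The data movement of the all-to-all step can be simulated in a single process. The sketch below assumes the simplest layout, one whole table per device (table-wise sharding) and an evenly split batch; the "devices" are just list positions, and no real communication library is involved.

```python
import numpy as np

rng = np.random.default_rng(0)
num_devices, batch, D = 2, 4, 3

# Assumption: table t lives entirely on device t.
tables = [rng.normal(size=(10, D)) for _ in range(num_devices)]
# ids[b, t] is the row of table t requested by sample b.
ids = rng.integers(0, 10, size=(batch, num_devices))

# Model-parallel phase: each device looks up its own table for the WHOLE batch.
looked_up = [tables[t][ids[:, t]] for t in range(num_devices)]   # each (batch, D)

# All-to-all: device t sends batch slice s of its result to device s.
slices = np.array_split(np.arange(batch), num_devices)
received = [
    np.stack([looked_up[t][slices[s]] for t in range(num_devices)], axis=1)
    for s in range(num_devices)
]  # received[s] has shape (local_batch, num_tables, D)

# Data-parallel phase: device s now holds every table's embedding for its own
# batch slice and can run the interaction layer and top MLP locally.
```

After the exchange, the layout switches from "all samples, my tables" to "my samples, all tables", which is what the data-parallel MLPs need.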

Implementation

Reference implementations

DLRM (Meta) — official PyTorch implementation

Python (PyTorch) · Meta AI (Facebook Research)

Official

TorchRec — successor of DLRM in productionizable form

Python (PyTorch) · PyTorch / Meta

Official

NVIDIA Merlin — alternative implementation with TensorRT acceleration

Python (PyTorch, TensorFlow) · NVIDIA

Implementation pitfalls

Embedding table memory blowup · Critical

In production, embedding table sizes quickly grow to terabytes (billions of entries × tens/hundreds of columns). Tables of that size do not fit in the memory of a single GPU, so training is impossible without model-parallel sharding across devices.

Fix: Sharded model-parallel embedding tables (as in TorchRec), compression via the hashing trick (accepting hash collisions), low-bit quantization (8-bit), pruning of rarely used entries.
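Two of these fixes are easy to sketch in NumPy. The example below is a simplified illustration, not any library's API: `hashed_row` uses a plain modulo in place of a real hash function, and `quantize_rows` applies symmetric row-wise int8 quantization with one fp32 scale per row. Both function names and the table size are hypothetical.

```python
import numpy as np

def hashed_row(category_id, num_rows):
    """Hashing trick: fold a large id space into num_rows rows (collisions accepted)."""
    return category_id % num_rows

def quantize_rows(table):
    """Symmetric row-wise int8 quantization with one scale per row."""
    scale = np.abs(table).max(axis=1, keepdims=True) / 127.0
    scale[scale == 0] = 1.0                      # avoid division by zero on empty rows
    q = np.round(table / scale).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize_rows(q, scale):
    """Recover an fp32 approximation of the original rows."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
table = rng.normal(size=(1000, 16)).astype(np.float32)
q, scale = quantize_rows(table)

row = hashed_row(123_456_789, num_rows=1000)     # 789
approx = dequantize_rows(q, scale)[row]
max_err = np.abs(approx - table[row]).max()      # bounded by about scale / 2
```

The int8 table takes a quarter of the fp32 storage (plus one scale per row), at the cost of a rounding error of roughly half a quantization step per value.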

All-to-all communication overhead · High

Hybrid parallelism requires an all-to-all exchange of embedding outputs between all devices in every iteration; at hundreds or thousands of GPUs this becomes the main bottleneck.

Fix: High-bandwidth interconnects (NVLink, NVSwitch, InfiniBand), batching many samples per exchange, overlapping communication with computation (pipelining).

Cold-start for new items/users · Medium

Embeddings of new items start from random initialisation, so the model has no semantic prior about what a new item is and recommendation quality for new items is initially very low.

Fix: Hybrid with Semantic IDs (semantic content transfer through shared prefixes), use of metadata features (category, brand) as sparse features, periodic re-training on current data.

Limited to pairwise interactions · Medium

The original DLRM explicitly models only second-order interactions (pairwise dot products). Higher-order interactions (third-order crosses, e.g. user×item×context) require explicitly added mechanisms.

Fix: Use the DLRM-DCNv2 variant with a Cross Network (interaction order grows with the number of cross layers), AutoInt (multi-head self-attention over features), or a deep & cross hybrid.
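For intuition on the Cross Network option, here is a minimal NumPy sketch of a DCN-V2 style cross layer, x_{l+1} = x_0 ⊙ (W x_l + b) + x_l. The dimensions, random weights and the function name `cross_layer` are assumptions for illustration; this is not the MLPerf DLRM-DCNv2 code.

```python
import numpy as np

def cross_layer(x0, xl, w, b):
    """One cross layer: elementwise product of the input x0 with a linear map of xl,
    plus a residual connection to xl."""
    return x0 * (xl @ w + b) + xl

rng = np.random.default_rng(0)
batch, dim, depth = 2, 6, 3
x0 = rng.normal(size=(batch, dim))   # concatenated feature vector fed to the cross network

x = x0
for _ in range(depth):
    w = rng.normal(0, 0.1, size=(dim, dim))
    b = np.zeros(dim)
    x = cross_layer(x0, x, w, b)
```

Each layer multiplies by x_0 once more, so a stack of `depth` layers can represent feature crosses up to degree depth + 1, whereas the dot-product layer stops at degree two.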

Evolution

Original paper · 2019 · arXiv 1906.00091 (Facebook AI, May 2019) · Maxim Naumov

Deep Learning Recommendation Model for Personalization and Recommendation Systems

Maxim Naumov, Dheevatsa Mudigere, Hao-Jun Michael Shi, Jianyu Huang, Narayanan Sundaraman, Jongsoo Park, Xiaodong Wang, Udit Gupta, Carole-Jean Wu, Alisson G. Azzolini, Dmytro Dzhulgakov, Andrey Mallevich, Ilia Cherniavskii, Yinghai Lu, Raghuraman Krishnamoorthi, Ansha Yu, Volodymyr Kondratenko, Stephanie Pereira, Xianjie Chen, Wenlin Chen, Vijay Rao, Bill Jia, Liang Xiong, Misha Smelyanskiy

2009

Matrix Factorization in the Netflix Prize

Matrix factorization (e.g. SVD++) as the dominant recommendation technique — the foundational intuition for feature interaction in later deep learning models.

2016

Wide & Deep (Google) + YouTube Deep NN (Covington)

First production neural recommendation networks: Wide & Deep combines memorisation (wide) with generalisation (deep); YouTube Deep NN introduces two-stage candidate generation + ranking.

2017

DeepFM (Huawei) — joint factorization + deep

DeepFM combines factorization machines (second-order) and deep neural networks (higher orders) in a single end-to-end model — an important DLRM precursor.

2019

DLRM (Meta) — open reference + co-design

Inflection point

Naumov et al. publish DLRM as a unified architectural pattern + open-source implementation in PyTorch and Caffe2 + dedicated hybrid parallelisation. The model is later adopted as the MLPerf recommendation benchmark.

2021

TorchRec — Meta open-sources production-grade recommendation

TorchRec (PyTorch): a library that evolves the DLRM pattern into a productionizable framework with modular sharded embeddings and model-parallel and hybrid sharding for industrial-scale recommendation.

2023

TIGER / Semantic IDs — paradigm shift to Generative Retrieval

Inflection point

Rajput et al. (Google, NeurIPS 2023) introduce Semantic IDs and Generative Retrieval — a paradigm shift away from DLRM-style dot-product retrieval toward autoregressive SID generation. DLRM remains a strong baseline for comparisons.

SIDs (concept)

2026

Recommendation-as-Generation (Kuaishou) — DLRM as comparison baseline

The RaG paradigm (Kuaishou, arXiv 2606.25496) explicitly compares its production results (400M+ DAU) to the DLRM baseline: +5.46% ad revenue for RaG vs DLRM, confirming DLRM's role as the standard against which progress is measured.

RaG (Recommendation-as-Generation) (concept)