Training

Causal Language Modeling (CLM)

2018 · Active · Published: 28 May 2026 · Updated: 28 May 2026

Key innovation

Framing language modeling as autoregressive next-token prediction conditioned only on previous tokens (left context), enabling self-supervised training of generative sequence models on massive unlabeled text corpora.

How it works

Text is first tokenized (e.g. with BPE or SentencePiece) into a sequence x = (x₁,…,x_T). The model, typically a decoder-only Transformer, passes the token embeddings through a stack of layers in which self-attention uses a causal mask: position t may attend only to positions ≤ t. At the output, a linear projection maps hidden states to vocabulary logits, and a softmax turns them into the next-token distribution P(x_t | x_<t).

The loss is the mean cross-entropy between this distribution and the ground-truth next token. Training uses teacher forcing: the context fed to the model is always the gold x_<t, never the model's own predictions. The whole procedure is self-supervised, because the target for each prediction is simply the next token in the corpus, so no manual annotation is required.

At inference, generation is autoregressive: the next token is chosen from the distribution (greedy argmax, or sampling with temperature, top-k, or top-p), appended to the context, and the step is repeated until an end-of-sequence token is produced or a length limit is reached.
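The training objective described above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function names `log_softmax` and `clm_loss` are made up for this example, and the logits are assumed to come from some model that was fed the gold prefix at every position.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vector of vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def clm_loss(logits, token_ids):
    """Mean next-token cross-entropy under teacher forcing.

    Assumption: logits[t] is the model output at position t, computed from
    the gold prefix token_ids[: t + 1]; its target is token_ids[t + 1].
    """
    if len(logits) != len(token_ids):
        raise ValueError("expected one logit vector per input token")
    # The last position has no next token in this sequence, so it is dropped.
    pairs = list(zip(logits[:-1], token_ids[1:]))
    if not pairs:
        raise ValueError("need at least two tokens to form a target")
    nll = [-log_softmax(row)[target] for row, target in pairs]
    return sum(nll) / len(nll)

# Toy vocabulary of 3 tokens and a 3-token sequence.
uniform = [[0.0, 0.0, 0.0]] * 3
print(clm_loss(uniform, [0, 1, 2]))  # uniform predictions give log(3)
```

A model that knows nothing scores log(vocabulary size); training pushes the loss below that baseline.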

Problem solved

How to train generative language models on unlabeled text corpora and how, at inference time, to generate new sequences token by token. CLM provides a simple self-supervised objective that scales with data and model size, while naturally matching autoregressive generation at deployment.
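Token-by-token generation can be sketched as a short loop. The callable `next_token_logits` and the toy model below are hypothetical stand-ins for a trained network, and greedy decoding is used only because it is the simplest strategy to show.

```python
from typing import Callable, List, Sequence

def generate_greedy(
    next_token_logits: Callable[[Sequence[int]], Sequence[float]],
    prompt: Sequence[int],
    eos_id: int,
    max_new_tokens: int,
) -> List[int]:
    """Greedy autoregressive decoding: pick the argmax token, append, repeat."""
    context = list(prompt)
    for _ in range(max_new_tokens):
        logits = next_token_logits(context)
        next_id = max(range(len(logits)), key=lambda i: logits[i])
        context.append(next_id)
        if next_id == eos_id:
            break  # stop at end-of-sequence
    return context

# Hypothetical stand-in for a trained model: always prefers (last token + 1),
# wrapping around a 5-token vocabulary.
def toy_model(context):
    logits = [0.0] * 5
    logits[(context[-1] + 1) % 5] = 1.0
    return logits

print(generate_greedy(toy_model, prompt=[0], eos_id=4, max_new_tokens=10))
# → [0, 1, 2, 3, 4]
```

The loop conditions each step on everything generated so far, which is the same left-context conditioning the training objective uses.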

Components

Tokenizer · Input / output

Converts raw text into discrete token IDs from the model vocabulary (typically BPE / WordPiece / SentencePiece).

Causal attention mask · Self-attention constraint

Triangular mask that sets the attention scores for positions greater than the current one to −∞ before the softmax, so their attention weights are zero. This is the key mechanism enforcing conditioning on the left context only.
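A small sketch of the mask and its effect on attention weights, in plain Python. The helper names `causal_mask` and `masked_softmax` are illustrative, and real implementations operate on batched tensors rather than nested lists.

```python
import math

def causal_mask(seq_len):
    """mask[t][s] is True when position t may attend to position s (s <= t)."""
    return [[s <= t for s in range(seq_len)] for t in range(seq_len)]

def masked_softmax(scores, mask):
    """Row-wise softmax where disallowed positions get weight exactly zero."""
    weights = []
    for row, allowed in zip(scores, mask):
        # Equivalent to setting disallowed scores to -inf before the softmax.
        kept = [v for v, ok in zip(row, allowed) if ok]
        m = max(kept)  # the diagonal is always allowed, so kept is never empty
        exps = [math.exp(v - m) if ok else 0.0 for v, ok in zip(row, allowed)]
        z = sum(exps)
        weights.append([e / z for e in exps])
    return weights

scores = [[0.0] * 3 for _ in range(3)]  # uniform raw attention scores
for row in masked_softmax(scores, causal_mask(3)):
    print(row)
```

With uniform scores, the first position attends only to itself, the second splits its weight evenly over two positions, and the third over all three.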

Decoder-only Transformer stack · Context modeling

Stack of masked self-attention + FFN layers that maps token embeddings to hidden states.

LM head (linear projection + softmax) · Model output

Linear projection from hidden states to vocabulary logits; softmax produces the distribution P(x_t | x_<t).
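To illustrate how the logits from the LM head become a sampled token, here is a hedged sketch of temperature scaling combined with top-k sampling. The function names and the `rng` parameter are assumptions made for this example, not part of any particular library.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Softmax over vocabulary logits, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def sample_top_k(logits, k, temperature=1.0, rng=random):
    """Sample a token ID from the k highest-probability tokens."""
    probs = softmax(logits, temperature)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # random.choices renormalizes the weights of the kept tokens.
    return rng.choices(top, weights=[probs[i] for i in top], k=1)[0]

logits = [2.0, 1.0, 0.1, -1.0]
print(sample_top_k(logits, k=1))                        # k=1 is greedy: 0
print(sample_top_k(logits, k=2, rng=random.Random(0)))  # either 0 or 1
```

Lower temperatures concentrate probability on the top token, while higher temperatures flatten the distribution and increase diversity.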

Cross-entropy next-token loss · Loss function

Mean negative log-likelihood of the true next token at each position. Defines the CLM optimization objective.
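Written out with the notation used in this entry, the objective takes the following form. The parameter symbol θ is an assumption added here for clarity, and x_{<1} is taken to be the empty context (or a beginning-of-sequence token).

```latex
\mathcal{L}_{\mathrm{CLM}}(\theta)
  = -\frac{1}{T} \sum_{t=1}^{T} \log P_\theta\!\left(x_t \mid x_{<t}\right)
```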

Implementation

Reference implementations

Hugging Face Transformers — Causal language modeling tutorial

Python · Hugging Face

Official

nanoGPT

Python (PyTorch) · Andrej Karpathy

minGPT

Python (PyTorch) · Andrej Karpathy

Implementation pitfalls

Missing / incorrect causal mask · Critical

If the look-ahead mask is missing or misapplied in self-attention, the model "peeks" at future tokens. Cross-entropy loss drops to near zero because the model simply copies the answer, but generation is useless because no future tokens exist at inference.

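Fix: Unit-test that, for every position t, the logits do not change when the future tokens x_{>t} are altered. Prefer vetted attention implementations, e.g. PyTorch's scaled_dot_product_attention with is_causal=True or FlashAttention with its causal option enabled.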
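Such a unit test might look like the sketch below. The checker `is_causal` and both toy "models" are hypothetical; with a real network, `logits_fn` would wrap the forward pass, and the exact equality check should be replaced by a floating-point tolerance.

```python
def is_causal(logits_fn, tokens, vocab_size):
    """Return True if the output at each position t ignores every token after t.

    logits_fn is a hypothetical callable mapping a token list to one output
    vector per position; substitute a real model's forward pass.
    """
    baseline = logits_fn(tokens)
    for t in range(len(tokens) - 1):
        # Change every token after position t to a different vocabulary entry.
        future = [(x + 1) % vocab_size for x in tokens[t + 1:]]
        changed = logits_fn(tokens[: t + 1] + future)
        # Exact comparison is fine for this toy; use a tolerance for real floats.
        if changed[: t + 1] != baseline[: t + 1]:
            return False
    return True

# Toy "models": a running sum only sees the prefix; a total sum sees everything.
def prefix_sum_model(tokens):
    return [[float(sum(tokens[: t + 1]))] for t in range(len(tokens))]

def leaky_model(tokens):
    return [[float(sum(tokens))] for _ in tokens]

print(is_causal(prefix_sum_model, [3, 1, 4, 1], vocab_size=10))  # True
print(is_causal(leaky_model, [3, 1, 4, 1], vocab_size=10))       # False
```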

Train/inference mismatch (exposure bias) · Medium

Teacher forcing always feeds ground truth during training, but inference conditions on the model's own predictions. Errors compound autoregressively.

Fix: Model scaling, better tokenization, sampling strategies (top-p, temperature), RLHF and instruction tuning mitigate this; scheduled sampling is rarely used in modern LLMs.

Padding tokens included in loss · Medium

Without masking pad tokens in the loss, the model "learns" to predict PAD tokens, distorting metrics and wasting capacity.

Fix: Set the label to -100 (the default ignore_index of PyTorch's CrossEntropyLoss) at pad positions, or derive the loss mask from the attention mask and apply it consistently.
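A minimal sketch of this fix in plain Python, assuming per-token losses have already been computed. The helper names are illustrative; in PyTorch the same effect comes from passing labels containing -100 to the cross-entropy loss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def mask_pad_labels(labels, attention_mask):
    """Replace labels at padded positions (mask == 0) with IGNORE_INDEX."""
    return [lab if keep else IGNORE_INDEX
            for lab, keep in zip(labels, attention_mask)]

def masked_mean(per_token_loss, labels):
    """Average per-token losses over positions whose label is not ignored."""
    kept = [loss for loss, lab in zip(per_token_loss, labels)
            if lab != IGNORE_INDEX]
    return sum(kept) / len(kept) if kept else 0.0

labels = mask_pad_labels([11, 12, 13, 0, 0], attention_mask=[1, 1, 1, 0, 0])
print(labels)                                          # [11, 12, 13, -100, -100]
print(masked_mean([2.0, 1.0, 3.0, 9.0, 9.0], labels))  # 2.0
```

Masking by attention mask rather than by token ID matters when the pad token shares an ID with the end-of-sequence token, since masking by ID would also hide real end-of-sequence targets.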

Off-by-one shift between input and labels · High

CLM requires that the label at position t be token x_{t+1}. If the shift is omitted, the model learns an identity mapping (copying its input); if it is applied twice, the model is trained to predict x_{t+2}.

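Fix: Apply the shift in exactly one place. In Hugging Face Transformers, DataCollatorForLanguageModeling with mlm=False copies input_ids into labels, and causal LM models shift the labels internally when computing the loss, so the labels must not be shifted again by hand.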
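For training loops that compute the loss directly on aligned (inputs, labels) pairs, the shift can be applied once at data-preparation time. The sketch below is one way to do that; `make_examples` and the token IDs are hypothetical.

```python
def make_examples(token_stream, block_size):
    """Split a token stream into (inputs, labels) pairs shifted by one token."""
    examples = []
    # Each example needs block_size + 1 tokens: block_size inputs plus one
    # extra token so that the last input still has a target.
    for start in range(0, len(token_stream) - block_size, block_size):
        chunk = token_stream[start : start + block_size + 1]
        inputs, labels = chunk[:-1], chunk[1:]
        examples.append((inputs, labels))
    return examples

stream = list(range(10, 19))  # hypothetical token IDs 10..18
for inputs, labels in make_examples(stream, block_size=4):
    print(inputs, labels)
# [10, 11, 12, 13] [11, 12, 13, 14]
# [14, 15, 16, 17] [15, 16, 17, 18]
```

If the model already shifts labels internally, pass unshifted labels instead of using a helper like this one.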

Evolution

Original paper · 2018 · OpenAI Technical Report 2018 · Alec Radford

Improving Language Understanding by Generative Pre-Training

Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever

1951

Shannon — prediction and entropy of English

Claude Shannon formalizes next-letter / next-word prediction as a measure of language entropy — the information-theoretic root of language modeling.

2003

Bengio et al. — Neural Probabilistic Language Model

Inflection point

First neural language model based on word embeddings and a feedforward network, predicting the next word from an n-gram context.

A Neural Probabilistic Language Model (paper)

2010

Mikolov — RNN Language Model

Recurrent language models (RNN LM, later LSTM) dominate CLM until the Transformer era. They are fully autoregressive, but their computation is sequential even during training.

2017

Transformer (Vaswani et al.) enables parallel CLM training

Inflection point

Masked self-attention computes the CLM loss for an entire sequence in a single forward pass — eliminating the recurrent bottleneck of RNNs.

Attention Is All You Need (paper)

2018

GPT-1 — CLM pretraining as a general-purpose objective

Inflection point

Radford et al. show that a decoder-only Transformer pretrained with CLM and then fine-tuned outperforms task-specific architectures, improving the state of the art on 9 of the 12 tasks studied.

Improving Language Understanding by Generative Pre-Training (paper)

2020

GPT-3 — scaling CLM to 175B parameters

Inflection point

Brown et al. demonstrate emergent few-shot abilities by extreme CLM scaling, establishing it as the default LLM pretraining objective.

Language Models are Few-Shot Learners (paper)

2023

LLaMA, Mistral, Qwen — open-weight CLM-based LLMs

CLM remains the dominant pretraining objective in open model families; variants (RWKV, Mamba) experiment with architecture but keep the CLM objective.