Text is first tokenized (e.g. with BPE or a SentencePiece model) into a sequence x = (x₁,…,x_T). The model, typically a decoder-only Transformer, passes the input token embeddings through a stack of layers in which self-attention uses a causal mask: position t may attend only to positions ≤ t. At the output, a linear projection maps hidden states to vocabulary logits, and a softmax yields the distribution P(x_t | x_<t). The loss is the mean cross-entropy between this distribution and the ground-truth token x_t. Training uses teacher forcing: the context fed to the model is always the gold x_<t, never the model's own predictions. The whole procedure is self-supervised: the label at position t is simply the corpus token x_t, so no manual annotation is required. At inference, generation is autoregressive: a token x_t is chosen from the distribution (greedy argmax, or sampling with temperature, top-k, or top-p), appended to the context, and the step is repeated until an end-of-sequence token is produced or a length limit is reached.
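The decoding options named above can be sketched in plain Python. This is a minimal illustration, not a production sampler: the four-token vocabulary, the logit values, and the helper names (`softmax`, `greedy`, `top_k_filter`, `top_p_filter`, `sample`) are all assumptions made for the example.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; temperature < 1 sharpens, > 1 flattens."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits):
    """Pick the single most likely token id (no randomness)."""
    return max(range(len(logits)), key=lambda i: logits[i])

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cum = [], 0.0
    for i in order:
        keep.append(i)
        cum += probs[i]
        if cum >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def sample(dist, rng):
    """Draw one token id from a {token_id: probability} mapping."""
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]            # hypothetical logits, 4-token vocabulary
probs = softmax(logits, temperature=0.8)
rng = random.Random(0)                    # seeded only to make the demo repeatable
print("greedy:", greedy(logits))
print("top-k :", sample(top_k_filter(probs, k=2), rng))
print("top-p :", sample(top_p_filter(probs, p=0.9), rng))
```

Greedy decoding is deterministic; the other strategies trade determinism for diversity by sampling from a truncated, renormalized distribution.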
How to train generative language models on unlabeled text corpora and how, at inference time, to generate new sequences token by token. Causal language modeling (CLM) provides a simple self-supervised objective that scales with data and model size and naturally matches autoregressive generation at deployment.
Converts raw text into discrete token IDs from the model vocabulary (typically with BPE, WordPiece, or Unigram, often via the SentencePiece toolkit).
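Triangular (look-ahead) mask that blocks attention to positions after the current one, usually by setting those attention scores to −∞ before the softmax so that their weights become exactly zero. It is the key mechanism enforcing conditioning on the left context only.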
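A minimal sketch of such a mask and its effect on attention weights, in plain Python to stay dependency-free. The score values and the function names `causal_mask` and `masked_softmax` are illustrative assumptions; real implementations apply the same idea to batched tensors.

```python
import math

def causal_mask(T):
    """mask[t][s] is True when query position t may attend to key position s (s <= t)."""
    return [[s <= t for s in range(T)] for t in range(T)]

def masked_softmax(scores, mask):
    """Row-wise softmax in which disallowed entries are treated as -inf."""
    weights = []
    for row, allowed in zip(scores, mask):
        masked = [z if ok else -math.inf for z, ok in zip(row, allowed)]
        m = max(masked)  # always finite: the diagonal entry is allowed
        exps = [math.exp(z - m) for z in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

T = 4
scores = [[0.1 * (t + s) for s in range(T)] for t in range(T)]  # made-up raw scores
attn = masked_softmax(scores, causal_mask(T))
for row in attn:
    print([round(w, 3) for w in row])  # lower-triangular rows, each summing to 1
```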
Stack of masked self-attention + FFN layers that maps token embeddings to hidden states.
Linear projection from hidden states to vocabulary logits; softmax produces the distribution P(x_t | x_<t).
Mean negative log-likelihood of the true next token at each position. Defines the CLM optimization objective.
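The last two components (softmax over logits, then mean negative log-likelihood) can be sketched together. The logits and targets below are invented for a three-token vocabulary, and `log_softmax` / `clm_loss` are hypothetical helper names.

```python
import math

def log_softmax(logits):
    """Numerically stable log of the softmax distribution."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def clm_loss(logits_per_position, targets):
    """Mean negative log-likelihood (in nats) of the target token at each position."""
    nll = [-log_softmax(logits)[y] for logits, y in zip(logits_per_position, targets)]
    return sum(nll) / len(nll)

# One row of logits per position; targets are the true next-token ids.
logits = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 0.0]]
targets = [0, 1, 2]
loss = clm_loss(logits, targets)
print(round(loss, 4))
```

A uniform prediction over V tokens gives a per-position loss of log V, which is a useful sanity check at initialization.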
If the look-ahead mask is misapplied in self-attention, the model "peeks" at future tokens: training cross-entropy drops to near zero because the answer is visible in the input (perfect cheating), but generation is useless because no future tokens exist at inference.
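One way to catch such a leak is a perturbation check: change the inputs after position t and confirm that outputs up to t are unchanged. The sketch below uses two toy stand-in "layers" (a running mean and a deliberately buggy global mean); both are assumptions for illustration, not real attention layers.

```python
def causal_mean(xs):
    """Toy causal layer: output at t is the mean of inputs at positions <= t."""
    out, running = [], 0.0
    for t, x in enumerate(xs):
        running += x
        out.append(running / (t + 1))
    return out

def leaky_mean(xs):
    """Buggy variant: every position sees the whole sequence, including the future."""
    m = sum(xs) / len(xs)
    return [m for _ in xs]

def is_causal(layer, xs, t):
    """Perturb inputs after position t and check that outputs up to t do not change."""
    perturbed = xs[: t + 1] + [x + 1.0 for x in xs[t + 1 :]]
    return layer(xs)[: t + 1] == layer(perturbed)[: t + 1]

xs = [0.5, -1.0, 2.0, 0.25]
print(is_causal(causal_mean, xs, t=1))
print(is_causal(leaky_mean, xs, t=1))
```

The same check applies to a real model by comparing logits at positions ≤ t before and after editing later tokens, with a small tolerance for floating-point noise.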
Teacher forcing always feeds ground truth during training, but inference conditions on the model's own predictions. Errors compound autoregressively.
Without masking pad tokens in the loss, the model "learns" to predict PAD tokens, distorting metrics and wasting capacity.
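A minimal sketch of excluding pad positions from the loss. `PAD_ID = 0` and the logits are assumptions for the example; tensor libraries typically expose the same idea as an "ignore index" option on their cross-entropy function (for instance the `ignore_index` argument of PyTorch's `cross_entropy`).

```python
import math

PAD_ID = 0  # hypothetical id reserved for padding

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def masked_clm_loss(logits_per_position, targets, pad_id=PAD_ID):
    """Mean negative log-likelihood over positions whose target is not pad_id."""
    total, count = 0.0, 0
    for logits, y in zip(logits_per_position, targets):
        if y == pad_id:
            continue  # padded position: contributes neither loss nor count
        total += -log_softmax(logits)[y]
        count += 1
    return total / max(count, 1)

# Two real targets followed by two pad targets the model predicts confidently.
logits = [[0.0, 1.0, 2.0], [0.0, 2.0, 1.0], [5.0, 0.0, 0.0], [5.0, 0.0, 0.0]]
targets = [2, 1, PAD_ID, PAD_ID]
loss_masked = masked_clm_loss(logits, targets)
loss_unmasked = masked_clm_loss(logits, targets, pad_id=None)  # nothing is skipped
print(round(loss_masked, 4), round(loss_unmasked, 4))
```

In this toy case the unmasked loss is lower only because the easy pad positions dilute the average, which is the metric distortion described above.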
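CLM requires that the output at input position t be scored against token x_{t+1}. If the shift is omitted, so that the label at input position t is x_t itself, the model learns an identity mapping (it simply copies its input).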
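The shift itself is a one-token offset between inputs and labels. A minimal sketch, with made-up token ids and a hypothetical helper name:

```python
def make_inputs_and_labels(token_ids):
    """Split one sequence into model inputs and next-token labels.

    inputs[t] is x_t and labels[t] is x_{t+1}, so both lists have length T - 1.
    """
    return token_ids[:-1], token_ids[1:]

tokens = [11, 12, 13, 14, 15]  # made-up token ids
inputs, labels = make_inputs_and_labels(tokens)
print(inputs)
print(labels)
```

Some libraries perform this shift internally when given labels equal to the input ids (Hugging Face causal LM classes are one example), so it is worth checking the library's convention before shifting again by hand.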
Claude Shannon formalizes next-letter / next-word prediction as a measure of language entropy — the information-theoretic root of language modeling.
First neural language model based on word embeddings and a feedforward network, predicting the next word from an n-gram context.
Recurrent language models (RNN LM, later LSTM) dominate CLM for the next decade — fully autoregressive but sequential even during training.
Masked self-attention computes the CLM loss for an entire sequence in a single forward pass — eliminating the recurrent bottleneck of RNNs.
Radford et al. show that a decoder-only Transformer pretrained with CLM and then fine-tuned beats task-specific architectures.
Brown et al. show that few-shot abilities emerge when CLM is scaled to very large models and datasets, establishing it as the default LLM pretraining objective.
CLM remains the dominant pretraining objective in open model families; non-Transformer architectures (RWKV, Mamba) change the sequence-mixing layer but keep the CLM objective.
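Time complexity: O(T² · d) per layer for self-attention, where T is the sequence length and d the model width; the projections and FFN add O(T · d²) per layer. Space complexity: O(T² + T · d) per layer (attention matrix plus activations) when the full attention matrix is materialized.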
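To make the quadratic term concrete, the sketch below counts multiply-adds for the two attention matrix products (QKᵀ and the weighted sum over V) in a single layer. The sizes T = 2048 and d = 4096 are arbitrary example values, and the count deliberately ignores projections, the FFN, and the savings from skipping masked entries.

```python
def attention_madds(T, d):
    """Multiply-adds for QK^T plus attention-weighted V in one layer, summed over heads."""
    return 2 * T * T * d

def score_matrix_entries(T):
    """Entries in one head's T x T attention score matrix."""
    return T * T

T, d = 2048, 4096  # example sizes, not taken from any specific model
print(attention_madds(T, d))
print(score_matrix_entries(T))
print(attention_madds(2 * T, d) // attention_madds(T, d))  # effect of doubling T
```

Doubling the context length quadruples both the attention compute and the size of the score matrix, while doubling d only doubles the attention compute.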
A standard CLM Transformer is a dense model — all parameters are active for every token. MoE variants combine CLM with conditional computation but are not part of CLM's core definition.
Training is massively parallel — all sequence positions are computed simultaneously thanks to teacher forcing (the causal mask lets us compute logits for every position t in one pass and sum the loss). Autoregressive inference, however, is sequential in time — x_t must be produced before x_{t+1}.
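The sequential nature of inference shows up as an explicit loop in which each step depends on the previous one. In this sketch the "model" is a hypothetical lookup table standing in for a forward pass plus greedy decoding; `EOS = 0` and all token ids are assumptions.

```python
EOS = 0  # hypothetical end-of-sequence id

def toy_next_token(context):
    """Stand-in for a forward pass plus greedy decoding: a lookup on the last token."""
    table = {1: 2, 2: 3, 3: EOS}
    return table.get(context[-1], EOS)

def generate(prompt, next_token_fn, max_new_tokens=10, eos_id=EOS):
    """Autoregressive loop: each new token is appended before the next step runs."""
    context = list(prompt)
    for _ in range(max_new_tokens):
        nxt = next_token_fn(context)  # needs every token produced so far
        context.append(nxt)
        if nxt == eos_id:
            break                     # stop at end-of-sequence
    return context                    # otherwise stops at the length limit

print(generate([1], toy_next_token))
print(generate([1], toy_next_token, max_new_tokens=1))
```

During training no such loop is needed, because the gold context for every position is already known and all positions can be scored in one pass.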
Maximum context window length on which the model conditions its predictions — a fundamental parameter of CLM Transformers.
Number of unique tokens in the tokenizer vocabulary; determines the size of the embedding matrix and the LM head.
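As a rough arithmetic sketch, the embedding matrix and the LM head each hold vocabulary size × model width parameters. The sizes below are example values, and the `tied` flag models the common (but here assumed) option of sharing one matrix between the two.

```python
def vocab_params(vocab_size, d_model, tied=False):
    """Parameters in the token embedding matrix plus the LM head projection."""
    embedding = vocab_size * d_model
    lm_head = 0 if tied else vocab_size * d_model  # tied weights reuse the embedding
    return embedding + lm_head

# Example sizes only: a 32,000-token vocabulary and model width 4,096.
print(vocab_params(32_000, 4_096))
print(vocab_params(32_000, 4_096, tied=True))
```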
Choice of subword algorithm (BPE, WordPiece, Unigram; SentencePiece is a toolkit that implements BPE and Unigram) affects how compactly text is tokenized and, in turn, learning quality.
Standard CLM training always uses teacher forcing (context = ground truth). Alternatives such as scheduled sampling, which feed the model some of its own predictions during training, are experimental attempts to reduce the train/inference mismatch.
CLM Transformer training and inference are dominated by dense matrix multiplications (attention + MLP) that map well onto tensor cores (FP16/BF16/FP8).
Google TPU pods were historically used to train LMs (T5, PaLM, Gemini); the systolic array excels at matmul in dense Transformers.
Small CLM LLMs (e.g. 1–7B parameters, quantized to 4-bit and run with llama.cpp) can run on CPUs using AVX/AVX-512 vector instructions, though more slowly than on a GPU.
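A back-of-the-envelope estimate shows why 4-bit quantization makes CPU inference practical. The sketch counts weight storage only and assumes exactly the stated number of bits per parameter; real quantized formats add per-block scale factors, and the KV cache and activations need further memory.

```python
def weight_bytes(n_params, bits_per_param):
    """Bytes needed to store the weights alone at a given precision."""
    return n_params * bits_per_param // 8

n = 7_000_000_000  # a 7B-parameter model, the upper end of the range above
print(weight_bytes(n, 16) / 1e9, "GB at 16-bit")
print(weight_bytes(n, 4) / 1e9, "GB at 4-bit")
```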