Architecture

AR Generation

2003 · Active · Published: 8 June 2026 · Updated: 8 June 2026

Key innovation

The factorization p(x) = ∏ₜ p(xₜ | x_<t) reduces sequence generation to a "next-element" prediction loop, enabling maximum-likelihood training and sequential sampling.
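The chain-rule factorization can be checked numerically on a toy model. The sketch below assumes a hypothetical two-token vocabulary and a first-order (bigram) conditional table, which is far simpler than a real AR model that conditions on the whole prefix. The names `START`, `NEXT`, and `sequence_log_prob` are illustrative only.

```python
import itertools
import math

# Hypothetical toy model: a bigram table over a two-token vocabulary.
VOCAB = ["a", "b"]
START = {"a": 0.6, "b": 0.4}            # p(x_1)
NEXT = {"a": {"a": 0.7, "b": 0.3},      # p(x_t | x_{t-1})
        "b": {"a": 0.2, "b": 0.8}}

def sequence_log_prob(seq):
    """log p(x) = sum_t log p(x_t | x_<t) under the toy bigram table."""
    log_p = math.log(START[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        log_p += math.log(NEXT[prev][cur])
    return log_p

# The product of conditionals defines a proper distribution: summing p(x)
# over every length-3 sequence gives 1 up to floating-point rounding.
total = sum(math.exp(sequence_log_prob(s))
            for s in itertools.product(VOCAB, repeat=3))
```

Because each conditional is normalized, the joint distribution is normalized as well, which is why maximum-likelihood training needs no separate partition function.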

How it works

Training: for each training sequence x = (x₁, …, x_T), the model receives the entire sequence and predicts the shifted targets (teacher forcing). A causal mask in self-attention ensures that the prediction of xₜ depends only on positions <t. Loss: ℒ = −∑ₜ log p_θ(xₜ | x_<t). All positions in the batch are processed in parallel.

Inference: generation starts from a context (or BOS). In a loop, the model computes p(xₜ | x_<t), selects xₜ via a chosen strategy (greedy / temperature / top-k / top-p / beam), appends it to the sequence, and re-runs with the extended context. A KV-cache eliminates recomputation of keys and values for already-processed tokens, reducing the per-step attention cost from O(t²) to O(t). Speculative decoding parallelizes inference by drafting several tokens with a small model and verifying them with the large one. Generation ends at a stop / EOS token or upon reaching the maximum length.
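The training loss and the sampling loop described above can be sketched in a few lines. This is a minimal illustration, not a real model: `next_token_probs` is a hypothetical hand-written rule standing in for p_θ(xₜ | x_<t), and the three-token vocabulary is assumed for the example.

```python
import math
import random

def next_token_probs(prefix):
    """Hypothetical stand-in for p_theta(x_t | x_<t); a fixed rule, not a trained model."""
    if not prefix:
        return {"<eos>": 0.0, "a": 0.5, "b": 0.5}
    if prefix[-1] == "a":
        return {"<eos>": 0.2, "a": 0.2, "b": 0.6}
    return {"<eos>": 0.2, "a": 0.6, "b": 0.2}

def teacher_forced_nll(seq):
    """L = -sum_t log p(x_t | x_<t), always conditioning on the ground-truth prefix."""
    return -sum(math.log(next_token_probs(seq[:t])[seq[t]])
                for t in range(len(seq)))

def generate(max_len, rng):
    """Sample one token at a time until EOS is drawn or max_len is reached."""
    seq = []
    while len(seq) < max_len:
        probs = next_token_probs(seq)
        token = rng.choices(list(probs), weights=list(probs.values()))[0]
        if token == "<eos>":
            break
        seq.append(token)  # the model's own output becomes the next context
    return seq

loss = teacher_forced_nll(["a", "b", "<eos>"])
sample = generate(max_len=10, rng=random.Random(0))
```

Note the asymmetry: the loss conditions on ground-truth prefixes, whereas `generate` conditions on tokens the model itself produced.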

Problem solved

Modeling high-dimensional data distributions in a single step is hard. Autoregressive factorization reduces the problem to a sequence of easy next-element prediction subproblems, for which cross-entropy with teacher forcing provides a stable, well-scaling training signal.

Components

Causal mask: Enforces autoregressive dependency at training time

Triangular mask in self-attention that blocks each position from attending to later positions, so the prediction of xₜ depends only on x_<t. Enables parallel training over the entire sequence without future leakage.
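A triangular mask can be sketched as follows. The example assumes the common shifted-input convention, in which the output at index i predicts the next token, so index i may attend to itself and to earlier indices but never to later ones. The helper names and the score matrix are hypothetical.

```python
import math

def causal_mask(T):
    """mask[i][j] is True when query index i may attend to key index j."""
    return [[j <= i for j in range(T)] for i in range(T)]

def masked_softmax(scores, mask):
    """Row-wise softmax over attention scores; masked entries get weight 0."""
    out = []
    for row, allowed in zip(scores, mask):
        m = max(s for s, ok in zip(row, allowed) if ok)  # for numerical stability
        exps = [math.exp(s - m) if ok else 0.0 for s, ok in zip(row, allowed)]
        z = sum(exps)
        out.append([e / z for e in exps])
    return out

# Arbitrary illustrative scores for a length-3 sequence.
scores = [[0.0, 1.0, 2.0],
          [1.0, 0.0, 3.0],
          [0.5, 0.5, 0.5]]
weights = masked_softmax(scores, causal_mask(3))
```

The large scores in the upper triangle have no effect on the result, which is what prevents future leakage when all positions are trained in parallel.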

Conditional distribution head: Outputs p(xₜ | x_<t)

Output layer (typically linear + softmax) producing a probability distribution over the token / category vocabulary.
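A linear + softmax head can be sketched in plain Python. The weights, bias, and hidden state below are made-up values for a three-entry vocabulary, and the `temperature` argument is included only to show where temperature scaling would enter.

```python
import math

def output_head(hidden, weight, bias):
    """Linear projection: one logit per vocabulary entry."""
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(weight, bias)]

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; temperature < 1 sharpens the distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 2-dimensional hidden state and 3-token vocabulary.
hidden = [1.0, -1.0]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.0]

logits = output_head(hidden, weight, bias)
probs = softmax(logits)
sharp = softmax(logits, temperature=0.5)
```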

Sampling strategy: Selects xₜ from the distribution p(xₜ | x_<t)

Sampling decoder — greedy, temperature, top-k, top-p, beam search, contrastive, min-p. Affects quality, diversity, and hallucinations.

Greedy: argmax; deterministic, prone to loops.

Top-k: Sampling from the k most likely tokens.

Top-p (nucleus): Sampling from the smallest set with cumulative probability ≥ p.

Beam search: Maintains b best sequence hypotheses; common in translation.
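The greedy, top-k, and top-p rules can be written as small filters over a next-token distribution. This is a sketch under simple assumptions: the distribution is a plain dict with made-up probabilities, and ties are broken by insertion order. Production decoders typically work on logit tensors instead.

```python
def greedy(probs):
    """Pick the single most likely token."""
    return max(probs, key=probs.get)

def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize."""
    keep = sorted(probs, key=probs.get, reverse=True)[:k]
    z = sum(probs[t] for t in keep)
    return {t: probs[t] / z for t in keep}

def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose cumulative mass is >= p."""
    kept, cum = [], 0.0
    for tok in sorted(probs, key=probs.get, reverse=True):
        kept.append(tok)
        cum += probs[tok]
        if cum >= p:
            break
    z = sum(probs[t] for t in kept)
    return {t: probs[t] / z for t in kept}

# Hypothetical next-token distribution.
probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "dog": 0.05}
k2 = top_k_filter(probs, 2)         # keeps "the" and "a"
nucleus = top_p_filter(probs, 0.9)  # keeps "the", "a", "cat"
```

Sampling then draws from the filtered dict, for example with `random.choices(list(k2), weights=list(k2.values()))`.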


KV cache: Inference optimization

Cache of K and V tensors for all previous positions, eliminating recomputation at every generation step.
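The caching idea can be illustrated with a toy single-head attention step. The sketch makes strong simplifying assumptions: one layer, one head, and identity projections, so each token embedding serves as its own query, key, and value. `KVCache` and `attend` are hypothetical names, not any library's API.

```python
import math

def attend(query, keys, values):
    """Dot-product attention of one query over all stored keys and values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    dim = len(values[0])
    return [sum(e / z * v[d] for e, v in zip(exps, values)) for d in range(dim)]

class KVCache:
    """Stores K and V for past positions so each step only adds one new entry."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        self.keys.append(k)    # O(1) append instead of recomputing all keys
        self.values.append(v)
        return attend(q, self.keys, self.values)  # O(t) work at step t

# Hypothetical embeddings for three generated tokens.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cache = KVCache()
cached_outputs = [cache.step(x, x, x) for x in xs]
# Reference: rebuild keys and values from the whole prefix at every step.
full_outputs = [attend(xs[t], xs[:t + 1], xs[:t + 1]) for t in range(len(xs))]
```

Both paths give the same outputs. In a real model the saving comes from skipping the key and value projections for every earlier position at every step.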


Tokenizer: Maps raw data sequences to discrete tokens

For text: BPE / SentencePiece / Unigram. For audio and image: VQ-VAE / RVQ. Determines sequence length and the compression-quality trade-off.
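To see how a tokenizer changes sequence length, one BPE-style merge step can be sketched as below. This is a simplified illustration on a made-up character string. Real BPE trainers operate on word-frequency tables and repeat the merge many times, and the helper names here are hypothetical.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the adjacent token pair that occurs most often."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

chars = list("abababc")           # 7 character-level tokens
pair = most_frequent_pair(chars)  # ("a", "b") occurs three times
tokens = merge_pair(chars, pair)  # ["ab", "ab", "ab", "c"]
```

Each merge adds one vocabulary entry and shortens the sequence, which is the compression side of the trade-off. Since AR generation takes one step per token, shorter sequences also mean fewer decoding steps.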

Implementation

Reference implementations

Hugging Face Transformers (generate API)

Python


nanoGPT (Karpathy's reference GPT implementation)

Python

llama.cpp (efficient AR inference on CPU/GPU)

C/C++

vLLM (high-throughput AR serving)

Implementation pitfalls

Exposure bias · Medium

Training on ground truth (teacher forcing) differs from inference on the model's own predictions, so errors compound.

Fix: Scheduled sampling, RL fine-tuning (RLHF), DPO, full-sequence scoring, larger scale.

Generation loops and repetitions · Medium

Greedy decoding, beam search, and low-temperature sampling can fall into "X X X X" repetition loops.

Fix: Repetition penalty, no-repeat-ngram, top-p, contrastive search.
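One of the listed fixes, no-repeat-ngram blocking, can be sketched as a filter applied before each token is chosen. The function names and the fallback behaviour when every candidate is banned are assumptions made for this example, not the behaviour of any particular library.

```python
def banned_next_tokens(seq, n):
    """Tokens that would complete an n-gram already present in seq."""
    if n < 1 or len(seq) < n - 1:
        return set()
    prefix = tuple(seq[len(seq) - (n - 1):])  # last n-1 generated tokens
    banned = set()
    for i in range(len(seq) - n + 1):
        if tuple(seq[i:i + n - 1]) == prefix:
            banned.add(seq[i + n - 1])
    return banned

def apply_ban(probs, banned):
    """Drop banned tokens and renormalize; fall back to probs if nothing is left."""
    kept = {t: p for t, p in probs.items() if t not in banned}
    if not kept:
        return dict(probs)
    z = sum(kept.values())
    return {t: p / z for t, p in kept.items()}

history = ["x", "y", "x", "y", "x"]
banned = banned_next_tokens(history, n=3)  # "y" would repeat the trigram y x y
filtered = apply_ban({"x": 0.1, "y": 0.8, "z": 0.1}, banned)
```

In this example the most likely token "y" is removed, so the decoder is pushed out of the "x y x y" cycle even under greedy selection.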

Hallucinations and low factuality · High

The model samples from a distribution, so it can produce grammatically correct but factually wrong text.

Fix: RAG, RLHF, constrained decoding, lower temperature, verification tools.

Sequential inference latency · High

AR generation is naturally sequential and memory-bandwidth bound.

Fix: KV-cache, paged attention (vLLM), speculative decoding, batching, quantization.

KV-cache memory growth with length · High

KV-cache grows linearly with context length and number of layers; at 128k+ tokens it becomes the main VRAM consumer.

Fix: GQA / MQA, sliding window attention, KV-cache quantization, paged attention, KV compression.
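The linear growth is easy to estimate with back-of-the-envelope arithmetic. The configuration below (32 layers, head dimension 128, fp16 storage, 131,072-token context) is a hypothetical example chosen for round numbers, not the specification of any particular model.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_elem=2, batch=1):
    """Size of the KV cache; the factor 2 counts one K and one V tensor per layer."""
    return 2 * batch * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

GIB = 1024 ** 3

# Full multi-head attention: every one of 32 heads stores its own K and V.
mha = kv_cache_bytes(seq_len=131072, n_layers=32, n_kv_heads=32, head_dim=128)
# Grouped-query attention: 8 shared KV heads instead of 32.
gqa = kv_cache_bytes(seq_len=131072, n_layers=32, n_kv_heads=8, head_dim=128)

print(mha / GIB, gqa / GIB)  # 64.0 16.0
```

Under these assumptions a single 128k-token sequence needs 64 GiB of cache with full multi-head attention and 16 GiB with 8 KV heads, which is why reducing the number of KV heads and quantizing the cache are listed among the fixes.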

Evolution

Original paper · 2003 · JMLR 2003 · Yoshua Bengio

A Neural Probabilistic Language Model

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, Christian Jauvin

2003

Neural Probabilistic Language Model — first neural AR LM

Inflection point

Bengio et al. introduce a neural autoregressive language model as an alternative to n-grams.

2014

Seq2seq — encoder-decoder AR

Sutskever et al. and Cho et al. demonstrate autoregressive RNN decoders for machine translation.

Sequence to Sequence Learning with Neural Networks (paper)

2016

PixelRNN / PixelCNN — image autoregression

van den Oord et al. extend AR to pixels with masked convolutions.

Pixel Recurrent Neural Networks (paper)

2016

WaveNet — sample-by-sample AR audio

DeepMind shows raw audio generation via AR causal dilated convolutions.

WaveNet: A Generative Model for Raw Audio (paper)

2018

GPT — Transformer AR as the LLM foundation

Inflection point

OpenAI combines AR with the Transformer architecture and large-corpus pre-training.

Transformer (concept)

2020

GPT-3 — emergence at scale

Inflection point

Brown et al. show that a 175B-parameter AR LM exhibits few-shot in-context learning.

LLM (concept)

2020

ImageGPT — pixel AR as visual pretraining

OpenAI demonstrates that a Transformer trained AR on pixels produces useful representations.

2021

Decision Transformer — RL as AR

Chen et al. cast reinforcement learning as autoregressive (return, state, action) sequences.

Decision Transformer: Reinforcement Learning via Sequence Modeling (paper)

2023

Speculative Decoding — parallelizing AR inference

Leviathan et al. and Chen et al. introduce draft+verify to reduce AR LLM latency.

Speculative Decoding (concept)

Fast Inference from Transformers via Speculative Decoding (paper)

2024

VAR / MAR — challenging diffusion in image generation

Inflection point

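Tian et al. (VAR) and Li et al. (MAR) show that AR image generation can rival or surpass diffusion on ImageNet, in VAR's case by replacing raster-order next-token prediction with coarse-to-fine next-scale prediction.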

Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction (paper)

Sources

A Neural Probabilistic Language Model

Paper

JMLR 2003

Sequence to Sequence Learning with Neural Networks

Paper

arXiv / NeurIPS 2014

Pixel Recurrent Neural Networks

Paper

arXiv / ICML 2016

WaveNet: A Generative Model for Raw Audio

Paper

arXiv

Improving Language Understanding by Generative Pre-Training (GPT)

Paper

OpenAI

Language Models are Few-Shot Learners (GPT-3)

Paper

arXiv / NeurIPS 2020

Decision Transformer: Reinforcement Learning via Sequence Modeling

Paper

arXiv / NeurIPS 2021

Fast Inference from Transformers via Speculative Decoding

Paper

arXiv

Visual Autoregressive Modeling (VAR)

Paper

arXiv / NeurIPS 2024

Hugging Face Transformers — text generation strategies

Documentation

Hugging Face
