Training: for each training sequence x = (x₁, …, x_T) the model receives the entire sequence and predicts the targets shifted by one position (teacher forcing). A causal mask in self-attention ensures that the prediction for xₜ depends only on positions <t. Loss: ℒ = −∑ₜ log p_θ(xₜ | x_<t). All positions of all sequences in the batch are processed in parallel.
Inference: generation starts from a context (or BOS). In a loop, the model computes p(xₜ | x_<t), selects xₜ via the chosen decoding strategy (greedy / temperature / top-k / top-p / beam), appends it to the sequence, and runs again on the extended context. A KV-cache avoids recomputing keys and values for already-processed tokens, reducing the attention cost per step from O(t²) to O(t). Speculative decoding speeds up inference by drafting several tokens with a small model and verifying them in parallel with the large one. Generation ends at a stop / EOS token or upon reaching the maximum length.
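The training loss and the generation loop can be sketched with a toy stand-in for the network. Everything named here (the `PROBS` bigram table, `next_token_probs`, `teacher_forced_nll`, `generate`) is hypothetical and chosen only for illustration; a real model conditions on the whole prefix rather than on the last token.

```python
import math
import random

# Hypothetical toy "model": a bigram table p(next | previous) over 3 tokens.
# Token 0 doubles as BOS and EOS here (an assumption made for brevity).
PROBS = {
    0: [0.0, 0.7, 0.3],
    1: [0.2, 0.3, 0.5],
    2: [0.6, 0.4, 0.0],
}

def next_token_probs(prefix):
    """Return p(x_t | x_<t); this toy only looks at the last token."""
    return PROBS[prefix[-1]]

def teacher_forced_nll(sequence):
    """Sum of -log p(x_t | x_<t), always conditioning on the ground-truth prefix."""
    nll = 0.0
    for t in range(1, len(sequence)):
        nll -= math.log(next_token_probs(sequence[:t])[sequence[t]])
    return nll

def generate(prefix, max_len, eos=0, rng=None):
    """Sample one token at a time, feeding each sample back in as context."""
    rng = rng or random.Random(0)
    out = list(prefix)
    while len(out) < max_len:
        probs = next_token_probs(out)
        token = rng.choices(range(len(probs)), weights=probs)[0]
        out.append(token)
        if token == eos:  # stop token reached
            break
    return out
```

Note the asymmetry: `teacher_forced_nll` never consumes the model's own samples, whereas `generate` conditions only on them.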
Modeling high-dimensional data distributions in a single step is hard. Autoregressive factorization reduces the problem to a sequence of easy next-element prediction subproblems, for which cross-entropy with teacher forcing provides a stable, well-scaling training signal.
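Written out in the notation of the loss ℒ, the factorization is the chain rule of probability, and taking the logarithm turns the joint likelihood into a sum of per-step terms:

```latex
p_\theta(x) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t}),
\qquad
\log p_\theta(x) = \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}) = -\mathcal{L}
```

Each factor is an ordinary classification problem over the vocabulary, which is why cross-entropy applies directly.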
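Triangular mask in self-attention that prevents the prediction for position t from depending on tokens at positions ≥ t (with inputs shifted by one, each query attends only to itself and earlier inputs). Enables parallel training over the entire sequence without future leakage.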
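A minimal sketch of such a mask in plain Python follows. It assumes the common shifted-input convention: the output at input position i predicts token i+1, so query i may attend to keys j ≤ i, which still means the predicted token never sees itself or anything later. The function names are illustrative, not taken from any library.

```python
import math

def causal_mask(T):
    """mask[i][j] is True when query position i may attend to key position j."""
    return [[j <= i for j in range(T)] for i in range(T)]

def masked_softmax(scores, mask_row):
    """Softmax over one row of attention scores; masked-out keys get zero weight."""
    allowed = [s for s, ok in zip(scores, mask_row) if ok]
    m = max(allowed)  # subtract the max for numerical stability
    exps = [math.exp(s - m) if ok else 0.0 for s, ok in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]
```

Because every row of the mask is known in advance, all T rows can be evaluated at once during training.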
Output layer (typically linear + softmax) producing a probability distribution over the token / category vocabulary.
Decoding strategy: greedy, temperature, top-k, top-p, beam search, contrastive, min-p. Affects output quality, diversity, and the rate of hallucinations.
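Three of the listed strategies can be sketched as simple operations on a next-token probability vector. This is an illustrative version in plain Python with hypothetical function names; production implementations usually work on logits and on batches.

```python
def greedy(probs):
    """Pick the single most probable token index."""
    return max(range(len(probs)), key=lambda i: probs[i])

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return [probs[i] / mass if i in keep else 0.0 for i in range(len(probs))]
```

After filtering, the next token is sampled from the renormalized vector; top-k fixes the number of candidates, while top-p lets that number vary with how peaked the distribution is.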
Cache of K and V tensors for all previous positions, eliminating recomputation at every generation step.
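The idea can be sketched for a single attention head in plain Python. The `attend` function and the `KVCache` class are hypothetical names for illustration, and the sketch assumes the per-token query, key, and value vectors are already computed (a real layer obtains them through learned projections).

```python
import math

def attend(q, keys, values):
    """Single-head scaled dot-product attention for one query over the given keys."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]

class KVCache:
    """Stores K and V for past positions so each step handles only the new token."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        self.keys.append(k)    # one new K row per generated token
        self.values.append(v)  # one new V row per generated token
        return attend(q, self.keys, self.values)
```

Calling `step` once per token gives the same output for the newest position as recomputing attention over the full sequence, while earlier keys and values are read from the cache instead of being rebuilt.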
For text: BPE / SentencePiece / Unigram. For audio and image: VQ-VAE / RVQ. Determines sequence length and the compression-quality trade-off.
Training on ground truth (teacher forcing) differs from inference on the model's own predictions — errors compound.
Greedy decoding, beam search, and low-temperature sampling can fall into "X X X X" repetition loops.
The model samples from a distribution — it can produce grammatically correct but factually wrong text.
AR generation is naturally sequential and memory-bandwidth bound.
KV-cache grows linearly with context length, number of layers, and batch size; at 128k+ tokens of context it can become the main VRAM consumer.
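A back-of-the-envelope estimate makes the growth concrete. The configuration below (32 layers, 8 KV heads of dimension 128, 16-bit elements, batch size 1) is a hypothetical example, not a description of any particular model.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_elem=2, batch=1):
    """Bytes for K and V across all layers: 2 tensors * layers * heads * dim * tokens."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch

# Hypothetical configuration, chosen only for illustration.
size = kv_cache_bytes(seq_len=131072, n_layers=32, n_kv_heads=8, head_dim=128)
print(size / 2**30, "GiB")
```

Under these assumptions a single 128k-token sequence needs 16 GiB for the cache alone, and the figure scales linearly with batch size.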
Bengio et al. introduce a neural autoregressive language model as an alternative to n-grams.
Sutskever et al. and Cho et al. demonstrate autoregressive RNN decoders for machine translation.
van den Oord et al. extend AR to pixels with masked convolutions.
DeepMind shows raw audio generation via AR causal dilated convolutions.
OpenAI combines AR with the Transformer architecture and large-corpus pre-training.
Brown et al. show that a 175B-parameter AR LM exhibits few-shot in-context learning.
OpenAI demonstrates that a Transformer trained AR on pixels produces useful representations.
Chen et al. cast reinforcement learning as autoregressive (return, state, action) sequences.
Leviathan et al. and Chen et al. introduce draft+verify to reduce AR LLM latency.
Tian et al. (VAR) and Li et al. (MAR) show that AR image generation with a suitable prediction order (such as VAR's coarse-to-fine, next-scale ordering) can surpass diffusion on ImageNet.
Time complexity: O(T · C(t)) for inference, where C(t) is the cost of one forward pass at step t (O(t·d) with KV-cache, O(t²·d) without).
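To see how the per-step cost adds up over a whole generation, the sketch below counts query-key dot products per layer and head. The function name is hypothetical, and the uncached count uses the t² upper bound, ignoring the roughly 2x saving from causal masking.

```python
def attention_cost(T, use_cache):
    """Count query-key dot products needed to generate T tokens."""
    total = 0
    for t in range(1, T + 1):
        if use_cache:
            total += t      # only the new query attends to t cached keys
        else:
            total += t * t  # all t queries are recomputed against t keys
    return total
```

Summed over T steps, the cached variant grows as O(T²) and the uncached one as O(T³); for example, `attention_cost(4, True)` is 10 while `attention_cost(4, False)` is 30.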
Maximum sequence length visible to the model (4k, 32k, 128k, 1M).
Vocabulary token count (typically 32k-256k for text, 8k-65k for image VQ).
Greedy / top-k / top-p / beam / contrastive / min-p — strongly affects quality vs diversity.
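Logit scaling factor τ (logits are divided by τ before the softmax): τ→0 approaches greedy, deterministic decoding; τ>1 flattens the distribution and yields more diverse ("more creative") output.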
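The effect of the scaling factor is easy to check numerically. A minimal sketch, assuming the usual definition in which logits are divided by τ before the softmax (the function name is illustrative):

```python
import math

def softmax_with_temperature(logits, tau=1.0):
    """Softmax of logits / tau; tau must be positive."""
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
cold = softmax_with_temperature(logits, tau=0.1)  # sharply peaked
base = softmax_with_temperature(logits, tau=1.0)
hot = softmax_with_temperature(logits, tau=2.0)   # flatter
```

With a low τ nearly all probability mass lands on the top logit, and with a high τ it spreads toward uniform, which is why temperature trades determinism against diversity.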
Hard cap on generation length.
BPE / SentencePiece / Unigram / VQ — affects sequence length and quality.
The entire network is active for each generated token (except MoE-AR variants, which are sparse / conditional).
Training is fully token-parallel thanks to teacher forcing + causal mask. Inference is inherently sequential across tokens (each new token depends on the previous), although speculative decoding and parallel sampling allow limited parallelization.
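The draft-and-verify idea can be sketched with greedy acceptance. This is a simplified variant rather than the rejection-sampling scheme of the original speculative decoding papers, and `draft_next` / `target_next` are hypothetical callables that return the next token for a given prefix.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """One draft-and-verify step with greedy acceptance (simplified variant)."""
    # The small draft model proposes k tokens sequentially (cheap).
    draft = []
    for _ in range(k):
        draft.append(draft_next(prefix + draft))
    # The target model checks each proposed position; in a real system these
    # checks are computed together in one parallel forward pass.
    accepted = []
    for tok in draft:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(tok)
    # Every draft token matched, so the target contributes one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted
```

Each call yields between 1 and k+1 tokens that match what greedy decoding with the target model alone would produce, so the gain depends entirely on how often the draft agrees with the target.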
AR LM training is matmul-bound and maps well onto tensor cores. AR inference is memory-bandwidth-bound, so GPUs with fast HBM (H100/H200/MI300) are preferred.
Google uses TPUs to train Gemini and PaLM. JAX/XLA parallelizes teacher forcing well.
llama.cpp with quantization (Q4-Q8) enables practical AR inference on CPU, although throughput is limited.
Custom FPGA accelerators for AR inference exist in niches but are not mainstream.