Step 1 (Patching): an H × W × C image is split into N = (H / P) · (W / P) non-overlapping patches of P × P pixels (e.g. 224 × 224 with P = 16 → 14 × 14 = 196 patches). In practice, patching and the projection of Step 2 are fused into a single Conv2d(in=C, out=d_model, kernel=P, stride=P), which is efficient on GPU.
Step 2 (Patch embedding): each flattened patch (P²·C dims) is linearly projected to d_model.
Step 3 ([CLS] token): a learned d_model vector is prepended as the classification token (analogous to BERT).
Step 4 (Positional embedding): a learned 1D positional embedding (one d_model vector for each of the N+1 tokens) is added to the token sequence to carry spatial information, because self-attention on its own is permutation-equivariant and ignores token order.
Step 5 (Transformer encoder stack): L layers, each with LayerNorm → Multi-Head Self-Attention → residual → LayerNorm → FFN (GELU) → residual. ViT uses pre-norm (LN is applied before attention and before the FFN).
Step 6 (Classification): the final-layer [CLS] representation goes through an MLP or linear head → softmax over classes.
Training: the standard recipe is supervised cross-entropy; modern variants pretrain with masked image modeling (MAE), contrastive image-text learning (CLIP), or self-distillation (DINO, DINOv2).
Inference at a new resolution requires interpolating the positional embeddings.
How to reach state-of-the-art image classification without relying on the hand-engineered inductive biases of convolutions (locality, translation equivariance, hierarchical receptive fields), and how to unify NLP and vision architectures, enabling multimodal models with a single backbone.
Split of the image into N non-overlapping P × P patches and their linear projection to d_model. Typically implemented as Conv2d(C, d_model, kernel=P, stride=P).
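The patch split and linear projection described above can be sketched in NumPy as a reshape followed by a matrix multiply, which is the same computation a Conv2d with kernel = stride = P performs. This is a minimal sketch under assumptions: the helper name `patchify`, the random weights, and the sizes (224 × 224 × 3 input, P = 16, d_model = 768) are illustrative and not taken from any particular implementation.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into (N, patch*patch*C) flattened patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "H and W must be divisible by P"
    gh, gw = h // patch, w // patch
    # (gh, P, gw, P, C) -> (gh, gw, P, P, C) -> (N, P*P*C), row-major over the grid
    x = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * c)

rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))   # stand-in for a real image
P, d_model = 16, 768                         # assumed ViT-B/16-like sizes

patches = patchify(image, P)                 # (196, 768): N patches of P*P*C values

# Hypothetical projection weights; in a trained model these are learned.
W_embed = rng.standard_normal((P * P * 3, d_model)) * 0.02
b_embed = np.zeros(d_model)
tokens = patches @ W_embed + b_embed         # (196, d_model)

print(patches.shape, tokens.shape)
```

With these sizes P²·C happens to equal d_model (both 768); that is a coincidence of the chosen numbers, not a requirement.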
A learned d_model vector prepended to the sequence. Its last-layer representation is used as the global image descriptor for classification.
A learned [N+1, d_model] tensor added to the tokens, since self-attention is permutation-equivariant (reordering the input tokens only reorders the outputs) and therefore does not know patch positions on its own.
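To make the role of the [CLS] token and the positional embedding concrete, the sketch below builds the encoder input and checks that attention without positional information treats tokens as an unordered set. All tensors are random stand-ins for learned parameters, and `plain_attention` is a deliberately simplified single-head attention without learned projections, introduced here only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_model = 196, 768

patch_tokens = rng.standard_normal((N, d_model))          # stand-in for embedded patches
cls_token = rng.standard_normal((1, d_model)) * 0.02      # learned in a real model
pos_embed = rng.standard_normal((N + 1, d_model)) * 0.02  # learned in a real model

# Prepend [CLS], then add one positional vector per token.
no_pos = np.concatenate([cls_token, patch_tokens], axis=0)  # (N + 1, d_model)
sequence = no_pos + pos_embed

def plain_attention(x):
    """Single-head self-attention with identity projections (illustration only)."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x

perm = rng.permutation(N + 1)

# Without positions: shuffling the input just shuffles the output rows.
order_blind = np.allclose(plain_attention(no_pos[perm]), plain_attention(no_pos)[perm])

# With positions added after shuffling, the result depends on token order.
order_aware = not np.allclose(
    plain_attention(no_pos[perm] + pos_embed), plain_attention(sequence)[perm]
)
print(sequence.shape, order_blind, order_aware)
```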
L layers, each: LN → MHSA → residual → LN → FFN(GELU) → residual. This is the same block structure as a pre-norm GPT-style Transformer, but without a causal mask (all-to-all attention, as in BERT).
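One pre-norm encoder layer can be sketched in NumPy as follows. This is a simplified illustration rather than a reference implementation: it handles a single sequence (no batch dimension), omits biases and the learned LayerNorm scale and shift, uses the tanh approximation of GELU, and the parameter names (`Wqkv`, `Wo`, `W1`, `W2`) and the tiny sizes are assumptions made for the example.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token over the feature axis (no learned scale/shift here).
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mhsa(x, Wqkv, Wo, num_heads):
    n, d = x.shape
    hd = d // num_heads                                   # head dim = d_model / num_heads
    q, k, v = np.split(x @ Wqkv, 3, axis=-1)              # each (n, d)
    # (n, d) -> (heads, n, hd)
    q, k, v = (t.reshape(n, num_heads, hd).transpose(1, 0, 2) for t in (q, k, v))
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(hd))  # (heads, n, n), no causal mask
    out = (attn @ v).transpose(1, 0, 2).reshape(n, d)     # merge heads back to (n, d)
    return out @ Wo

def encoder_block(x, p, num_heads):
    x = x + mhsa(layer_norm(x), p["Wqkv"], p["Wo"], num_heads)  # LN -> MHSA -> residual
    x = x + gelu(layer_norm(x) @ p["W1"]) @ p["W2"]             # LN -> FFN(GELU) -> residual
    return x

rng = np.random.default_rng(0)
n_tokens, d, heads = 5, 8, 2                              # tiny sizes for illustration
params = {
    "Wqkv": rng.standard_normal((d, 3 * d)) * 0.1,
    "Wo": rng.standard_normal((d, d)) * 0.1,
    "W1": rng.standard_normal((d, 4 * d)) * 0.1,
    "W2": rng.standard_normal((4 * d, d)) * 0.1,
}
x = rng.standard_normal((n_tokens, d))
y = encoder_block(x, params, heads)
print(y.shape)
```

Because both sublayers are added to a residual path, zeroing the output projections `Wo` and `W2` turns the block into the identity, which is a convenient sanity check.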
MLP or linear layer mapping the [CLS] representation to class logits. Replaced by a projection head in self-supervised pretraining (e.g. DINO MLP).
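A linear classification head on top of the [CLS] representation might look like the sketch below. The weights, the 1000-class output size, and the ground-truth index `label` are hypothetical values chosen for illustration; the cross-entropy line shows the supervised loss for a single example.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
d_model, num_classes = 768, 1000                     # assumed sizes

cls_repr = rng.standard_normal(d_model)              # stand-in for the final [CLS] vector
W_head = rng.standard_normal((d_model, num_classes)) * 0.02
b_head = np.zeros(num_classes)

logits = cls_repr @ W_head + b_head                  # (num_classes,)
probs = softmax(logits)
pred = int(np.argmax(probs))

label = 3                                            # hypothetical ground-truth class
loss = -np.log(probs[label])                         # cross-entropy for one example
print(pred, float(loss))
```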
Training ViT from scratch on ImageNet-1k yields lower accuracy than ResNet — without the convolutional inductive biases the model needs far more data.
Learned 1D positional embeddings are specific to the pretraining N. Fine-tuning at 384×384 after pretraining at 224×224 requires 2D interpolation, otherwise performance drops.
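The 2D interpolation mentioned above can be sketched as follows: the patch part of the positional embedding is reshaped to its 14 × 14 grid, resized to 24 × 24 (384 / 16), and flattened again, while the [CLS] position is kept unchanged. This sketch assumes patch size 16, a square grid, and corner-aligned bilinear interpolation; the function name `resize_pos_embed` is hypothetical, and real implementations differ in interpolation mode and alignment convention.

```python
import numpy as np

def resize_pos_embed(pos_embed, old_grid, new_grid):
    """pos_embed: (1 + old_grid**2, d) -> (1 + new_grid**2, d), bilinear, corner-aligned."""
    cls_pos, patch_pos = pos_embed[:1], pos_embed[1:]
    d = patch_pos.shape[-1]
    grid = patch_pos.reshape(old_grid, old_grid, d)

    # Positions of the new grid points expressed in old-grid coordinates.
    coords = np.linspace(0.0, old_grid - 1, new_grid)
    lo = np.floor(coords).astype(int)
    hi = np.minimum(lo + 1, old_grid - 1)
    frac = coords - lo

    # Interpolate along rows, then along columns.
    rows = grid[lo] * (1 - frac)[:, None, None] + grid[hi] * frac[:, None, None]
    out = rows[:, lo] * (1 - frac)[None, :, None] + rows[:, hi] * frac[None, :, None]
    return np.concatenate([cls_pos, out.reshape(new_grid * new_grid, d)], axis=0)

rng = np.random.default_rng(0)
d_model = 32                                          # small for illustration
pos_224 = rng.standard_normal((1 + 14 * 14, d_model))  # stand-in for learned embeddings
pos_384 = resize_pos_embed(pos_224, old_grid=14, new_grid=24)
print(pos_224.shape, "->", pos_384.shape)
```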
Dense tasks (segmentation, detection) require high resolutions; standard ViT scales O(N²) in number of patches.
CNNs naturally build a hierarchy of feature maps from local to global; standard ViT operates at a single scale, which can be problematic for detection of differently sized objects.
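Very large ViTs can become unstable during training: in the ViT-22B work, attention logits grew very large, pushing attention weights toward one-hot and causing the loss to diverge. Normalizing queries and keys (QK LayerNorm) was used there to stabilize training.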
Self-attention without recurrence emerges in NLP — the foundation of the later ViT.
First influential demonstration of a pure Transformer on images (at the pixel level), a precursor to ViT.
Full formulation of ViT: 16×16 patching, pure Transformer, pretraining on JFT-300M. ImageNet result beats the best CNNs.
Shows that ViT can be trained on ImageNet-1k without massive pretraining via distillation and improved augmentation.
Local self-attention windows, shifted windows, and a multi-resolution hierarchy make ViT-style models competitive as general-purpose backbones (detection, segmentation).
ViT becomes the standard backbone for contrastive image-text pretraining; opens the era of zero-shot vision.
Masking ~75% of patches and reconstructing — a highly efficient self-supervised pretraining for ViT.
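The masking step can be sketched as random selection of the patch tokens that remain visible to the encoder. This is only the sampling part, under assumptions: the function name `random_masking` and the sizes are illustrative, and the decoder that reconstructs the masked patches is omitted.

```python
import numpy as np

def random_masking(tokens, mask_ratio, rng):
    """Keep a random subset of patch tokens; return visible tokens, their indices, and a mask."""
    n = tokens.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    perm = rng.permutation(n)
    keep_idx = np.sort(perm[:n_keep])        # indices of visible patches, in original order
    mask = np.ones(n, dtype=bool)            # True = masked (to be reconstructed)
    mask[keep_idx] = False
    return tokens[keep_idx], keep_idx, mask

rng = np.random.default_rng(0)
tokens = rng.standard_normal((196, 64))      # stand-in for 196 embedded patches
visible, keep_idx, mask = random_masking(tokens, mask_ratio=0.75, rng=rng)
print(visible.shape, int(mask.sum()))
```

Only the visible quarter of the tokens is passed through the encoder, which is where the efficiency of this pretraining scheme comes from.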
Self-supervised pretraining of ViT reveals emergent segmentation properties in attention maps.
Shows that ViT scales analogously to LLMs; reveals new behavioral properties at large scale.
ViT becomes the foundation of open vision foundation models: general-purpose features (DINOv2) and promptable segmentation (SAM).
Time complexity: O(N² · d) + O(N · d²) per layer. Space complexity: O(N² + N · d).
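A rough multiply-accumulate count per layer shows which of the two terms dominates at typical sizes. The constants below are assumptions for a standard block (fused QKV projection, one output projection, FFN hidden size 4·d) and ignore softmax, LayerNorm, and biases, so the numbers are estimates rather than measured costs.

```python
def per_layer_macs(n_tokens, d_model, ffn_ratio=4):
    """Approximate multiply-accumulates for one encoder layer (assumed standard block)."""
    qkv_proj = 3 * n_tokens * d_model**2               # Q, K, V projections
    attn_scores = n_tokens**2 * d_model                # Q @ K^T over all heads
    attn_values = n_tokens**2 * d_model                # softmax(QK^T) @ V
    out_proj = n_tokens * d_model**2                   # attention output projection
    ffn = 2 * ffn_ratio * n_tokens * d_model**2        # two FFN matmuls
    quadratic = attn_scores + attn_values              # the O(N^2 * d) term
    linear = qkv_proj + out_proj + ffn                 # the O(N * d^2) term
    return quadratic, linear

# 196 patches + [CLS] at d_model = 768 (assumed ViT-B/16-like sizes)
quadratic, linear = per_layer_macs(197, 768)
print(quadratic, linear, round(linear / quadratic, 1))

# With these constants the two terms are equal when N = 6 * d_model.
crossover_tokens = 6 * 768
print(crossover_tokens)
```

At 197 tokens the N·d² term is roughly 23 times larger than the N²·d term, so the quadratic attention cost only starts to dominate at the much longer sequences that high-resolution inputs produce.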
Standard ViT is a dense model — all parameters active for every patch. MoE variants (V-MoE) introduce conditional computation but are not part of the core definition.
ViT is an encoder (no causal mask) — all patches are processed in parallel during both training and inference. Ideal for tensor and sequence parallelism in very large models (ViT-22B).
Patch side length P in pixels (16, 14, 8). Smaller P → more tokens (N grows as 1/P²) → attention cost grows quadratically in N, but spatial resolution improves.
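The effect of the patch size on sequence length can be checked with simple arithmetic. The sketch assumes a square 224 × 224 input and counts only patch tokens (no [CLS]); the function name is illustrative.

```python
def token_stats(image_size, patch):
    """Return (number of patch tokens N, entries in one N x N attention matrix)."""
    assert image_size % patch == 0, "image size must be divisible by patch size"
    n = (image_size // patch) ** 2
    return n, n * n

stats = {p: token_stats(224, p) for p in (16, 14, 8)}
for p, (n, attn_entries) in stats.items():
    print(f"P={p:2d}  N={n:4d}  attention entries per head={attn_entries}")
```

Halving P from 16 to 8 multiplies the token count by 4 and the size of each attention matrix by 16.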
Standard variants: ViT-Ti, ViT-S, ViT-B (Base, ~86M), ViT-L (Large, ~307M), ViT-H (Huge, ~632M), ViT-g/G, ViT-22B.
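The ~86M figure for ViT-B can be reproduced by counting parameters directly. The configuration below is an assumption based on the commonly cited ViT-B/16 setup (12 layers, d_model = 768, FFN hidden size 3072, 224 × 224 RGB input, biases on all linear layers, LayerNorm with scale and shift, a final LayerNorm, and a 1000-class linear head); other heads or resolutions shift the total slightly.

```python
def vit_param_count(depth, d, ffn_dim, patch=16, channels=3, image=224, classes=1000):
    """Parameter count for a plain ViT under the assumptions stated above."""
    n_tokens = (image // patch) ** 2 + 1           # patches + [CLS]
    patch_embed = patch * patch * channels * d + d
    cls_token = d
    pos_embed = n_tokens * d
    per_layer = (
        2 * (2 * d)                                # two LayerNorms (scale + shift)
        + d * 3 * d + 3 * d                        # fused QKV projection
        + d * d + d                                # attention output projection
        + d * ffn_dim + ffn_dim                    # FFN up-projection
        + ffn_dim * d + d                          # FFN down-projection
    )
    final_norm = 2 * d
    head = d * classes + classes
    return patch_embed + cls_token + pos_embed + depth * per_layer + final_norm + head

vit_b = vit_param_count(depth=12, d=768, ffn_dim=3072)
print(f"{vit_b:,}")
```

Almost all of the parameters sit in the encoder layers; the patch embedding, positional embedding, and head together account for under 2M.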
Most commonly 224×224 (pretraining), 384×384 (fine-tuning). Changing resolution requires positional-embedding interpolation.
Number of attention heads, standard values: 12 (ViT-B), 16 (ViT-L), 16 (ViT-H). Head dimension is d_model / num_heads.
Pretraining-data scale is a critical axis in the original paper: ViT trails ResNets when pretrained on ImageNet-1k but overtakes them when pretrained on ImageNet-21k or JFT-300M.
1D learned (original), 2D learned, sinusoidal, relative, RoPE — affects the ability to change resolution.
ViT is a dense Transformer: the dominant ops (patch embedding, MHSA, FFN) map to matmul and are well suited to tensor cores (FP16/BF16/FP8).
ViT originated at Google, where it has been trained on TPUs up to the 22B-parameter scale; the TPU's systolic-array matrix units map well to the matmul-dominated MHSA and FFN.
ViT-B/S inference on CPU AVX/AVX-512 (ONNX Runtime, OpenVINO) is practical for batch use cases, though slower than on GPU.
Academic FPGA accelerators for ViT exist, but a broad production ecosystem is lacking.