Architecture

ViT

2020 · Active · Published: 28 May 2026 · Updated: 28 May 2026

Key innovation

Applying a pure Transformer architecture to images by splitting them into a sequence of flattened patches (e.g. 16×16 px) treated as tokens. This shows that the inductive biases of convolutions (locality, translation equivariance) are not required if the model is pretrained on sufficiently large datasets.

How it works

Step 1. Patching: an H × W × C image is split into N = (H / P) · (W / P) non-overlapping patches of P × P pixels (e.g. 224 × 224 with P = 16 → 14 × 14 = 196 patches).

Step 2. Patch embedding: each flattened patch (P²·C values) is linearly projected to d_model. In practice Steps 1 and 2 are fused into a single Conv2d(in=C, out=d_model, kernel=P, stride=P), which is efficient on GPU.

Step 3. [CLS] token: a learned d_model vector is prepended as the classification token (analogous to BERT).

Step 4. Positional embedding: a learned 1D position table (N+1 vectors of size d_model) is added to the tokens to carry spatial information, because self-attention on its own is permutation-equivariant and has no notion of token order.

Step 5. Transformer encoder stack: L layers, each with LayerNorm → Multi-Head Self-Attention → residual → LayerNorm → FFN (GELU) → residual. ViT uses pre-norm (LayerNorm before attention and before the FFN).

Step 6. Classification: the final-layer [CLS] representation goes through an MLP or linear head, then softmax over classes.

Training: the original recipe uses supervised cross-entropy; later variants pretrain with masked image modeling (MAE), contrastive image-text learning (CLIP), or self-distillation (DINO, DINOv2).

Inference at a new resolution requires interpolating the positional embeddings.
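The shape bookkeeping for Steps 1 to 4 can be sketched in a few lines of plain Python. The helper name `vit_token_shapes` is hypothetical, and d_model = 768 is used only as an example width (it corresponds to the ViT-Base configuration); nothing here is taken from a specific library.

```python
def vit_token_shapes(height, width, channels, patch, d_model):
    """Return the per-image tensor shapes produced by ViT Steps 1-4."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    grid_h, grid_w = height // patch, width // patch
    num_patches = grid_h * grid_w            # N
    patch_dim = patch * patch * channels     # P^2 * C values per flattened patch
    return {
        "grid": (grid_h, grid_w),
        "flattened_patches": (num_patches, patch_dim),
        "patch_embeddings": (num_patches, d_model),
        "with_cls_token": (num_patches + 1, d_model),
        "positional_embedding": (num_patches + 1, d_model),
    }


shapes = vit_token_shapes(224, 224, 3, patch=16, d_model=768)
for name, shape in shapes.items():
    print(f"{name:22s} {shape}")
```

For a 224 × 224 RGB image with P = 16, the encoder therefore sees 197 tokens: 196 patches plus the [CLS] token.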

Problem solved

How to reach state-of-the-art image classification without relying on the hand-engineered inductive biases of convolutions (locality, translation equivariance, hierarchical receptive fields), and how to unify NLP and vision architectures, enabling multimodal models with a single backbone.

Components

Patch embedding · Image tokenization

Split of the image into N non-overlapping P × P patches and their linear projection to d_model. Typically implemented as Conv2d(C, d_model, kernel=P, stride=P).

IN: Image tensor (batch, channels, height, width); channels is typically 3 for RGB.

OUT: Sequence of N patch embeddings of dimension d_model.
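As a rough illustration of the patch-embedding component, the NumPy sketch below splits a batch into non-overlapping patches and applies one linear projection. The function name `patchify`, the toy sizes, and the random weights are assumptions for the example. The flattening order (channel, row, column) is chosen to match how a Conv2d kernel of shape (d_model, C, P, P) would be flattened, which is why the two formulations compute the same thing.

```python
import numpy as np


def patchify(images, patch):
    """(B, C, H, W) -> (B, N, C*P*P) using non-overlapping P x P patches."""
    b, c, h, w = images.shape
    assert h % patch == 0 and w % patch == 0, "H and W must be divisible by P"
    gh, gw = h // patch, w // patch
    x = images.reshape(b, c, gh, patch, gw, patch)
    x = x.transpose(0, 2, 4, 1, 3, 5)              # (B, gh, gw, C, P, P)
    return x.reshape(b, gh * gw, c * patch * patch)


rng = np.random.default_rng(0)
images = rng.standard_normal((2, 3, 32, 32))       # toy batch, not real data
patch, d_model = 8, 64

# Stand-ins for learned parameters of the linear projection.
weight = rng.standard_normal((3 * patch * patch, d_model)) * 0.02
bias = np.zeros(d_model)

tokens = patchify(images, patch) @ weight + bias   # (B, N, d_model)
print(tokens.shape)
```

Patches are ordered row by row, so token index i * (W / P) + j corresponds to the patch in grid row i and column j.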

Official

[CLS] token · Classification placeholder aggregating information

A learned d_model vector prepended to the sequence. Its last-layer representation is used as the global image descriptor for classification.
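A minimal NumPy sketch of how the [CLS] token is attached and later read out. The zero-initialised `cls_token` stands in for a learned parameter, and the encoder is skipped entirely, so `encoder_output` is only a placeholder with the right shape.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, num_patches, d_model = 4, 196, 32
patch_tokens = rng.standard_normal((batch, num_patches, d_model))

# One shared vector, broadcast over the batch and placed at position 0.
cls_token = np.zeros((1, 1, d_model))
cls = np.broadcast_to(cls_token, (batch, 1, d_model))
sequence = np.concatenate([cls, patch_tokens], axis=1)   # (B, N + 1, d_model)

# Placeholder: a real model would run `sequence` through the encoder stack.
encoder_output = sequence

# The global image descriptor is the token at index 0 of the last layer.
image_descriptor = encoder_output[:, 0]                  # (B, d_model)
print(sequence.shape, image_descriptor.shape)
```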

Official

Positional embedding (1D learned) · Injecting spatial information

A learned [N+1, d_model] tensor added to the tokens, since self-attention is permutation-equivariant: without it the model has no way to tell which patch came from which image location.

1D learned: Default in original ViT.

2D learned: Separate embeddings for x and y axes.

Sinusoidal: Static, as in NLP Transformer.

Relative / RoPE: Introduced in newer variants (e.g. relative position bias in Swin, 2D RoPE in EVA-02).
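To make the difference between the learned and the sinusoidal options concrete, here is a small NumPy sketch. The sinusoidal table follows the formula from the original NLP Transformer; the "learned" table is just a randomly initialised array standing in for a trainable parameter. The function name and sizes are illustrative assumptions.

```python
import numpy as np


def sinusoidal_positions(num_positions, d_model):
    """Fixed table: PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same)."""
    assert d_model % 2 == 0, "d_model must be even for sin/cos pairs"
    positions = np.arange(num_positions)[:, None]
    freqs = np.exp(-np.log(10000.0) * np.arange(0, d_model, 2) / d_model)
    table = np.zeros((num_positions, d_model))
    table[:, 0::2] = np.sin(positions * freqs)
    table[:, 1::2] = np.cos(positions * freqs)
    return table


rng = np.random.default_rng(0)
num_tokens, d_model = 197, 64                      # 196 patches + [CLS]
tokens = rng.standard_normal((2, num_tokens, d_model))

# Option A: learned 1D table (random init here, trained in a real model).
learned_pos = rng.standard_normal((num_tokens, d_model)) * 0.02
# Option B: static sinusoidal table, no parameters to train.
fixed_pos = sinusoidal_positions(num_tokens, d_model)

# Either table is simply added to the token sequence (broadcast over batch).
tokens_learned = tokens + learned_pos
tokens_fixed = tokens + fixed_pos
print(tokens_learned.shape, tokens_fixed.shape)
```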

Official

Transformer encoder block (pre-norm) · Modeling global dependencies between patches

L layers, each: LN → MHSA → residual → LN → FFN(GELU) → residual. The block has the same components as BERT/GPT blocks; like BERT it uses no causal mask (all-to-all attention), and like GPT-2 it places LayerNorm before each sub-layer.
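The block structure can be sketched in NumPy as follows. This is a simplified illustration, not a reference implementation: biases and the learned LayerNorm scale and shift are omitted, GELU uses the tanh approximation, and the parameter names (`w_qkv`, `w_out`, `w_fc1`, `w_fc2`) are made up for the example.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    mean = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def mhsa(x, w_qkv, w_out, num_heads):
    b, n, d = x.shape
    head_dim = d // num_heads
    qkv = (x @ w_qkv).reshape(b, n, 3, num_heads, head_dim).transpose(2, 0, 3, 1, 4)
    q, k, v = qkv[0], qkv[1], qkv[2]               # each (B, heads, N, head_dim)
    # All-to-all attention: no causal mask is applied.
    attn = softmax(q @ k.transpose(0, 1, 3, 2) / np.sqrt(head_dim))
    out = (attn @ v).transpose(0, 2, 1, 3).reshape(b, n, d)
    return out @ w_out


def encoder_block(x, params, num_heads):
    # Pre-norm: LayerNorm is applied before each sub-layer, residual added after.
    x = x + mhsa(layer_norm(x), params["w_qkv"], params["w_out"], num_heads)
    x = x + gelu(layer_norm(x) @ params["w_fc1"]) @ params["w_fc2"]
    return x


rng = np.random.default_rng(0)
d_model, hidden, heads = 32, 64, 4
params = {
    "w_qkv": rng.standard_normal((d_model, 3 * d_model)) * 0.1,
    "w_out": rng.standard_normal((d_model, d_model)) * 0.1,
    "w_fc1": rng.standard_normal((d_model, hidden)) * 0.1,
    "w_fc2": rng.standard_normal((hidden, d_model)) * 0.1,
}
x = rng.standard_normal((2, 17, d_model))           # 16 patches + [CLS]
y = encoder_block(x, params, heads)
print(y.shape)
```

Because no positional information enters the block itself, permuting the input tokens permutes the outputs in the same way, which is why the positional embedding has to be added beforehand.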

Classification head · Output

MLP or linear layer mapping the [CLS] representation to class logits. Replaced by a projection head in self-supervised pretraining (e.g. DINO MLP).
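A linear head on the [CLS] representation, followed by softmax cross-entropy, might look like the NumPy sketch below. The function names, the 10-class setup, and the random inputs are assumptions for illustration only.

```python
import numpy as np


def classify(cls_repr, weight, bias):
    """Map (B, d_model) [CLS] vectors to logits and log-probabilities."""
    logits = cls_repr @ weight + bias
    shifted = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return logits, log_probs


def cross_entropy(log_probs, labels):
    """Mean negative log-likelihood of the integer class labels."""
    return -log_probs[np.arange(len(labels)), labels].mean()


rng = np.random.default_rng(0)
d_model, num_classes = 32, 10
cls_repr = rng.standard_normal((4, d_model))        # stand-in for encoder output[:, 0]
weight = rng.standard_normal((d_model, num_classes)) * 0.02
bias = np.zeros(num_classes)
labels = np.array([1, 0, 7, 3])

logits, log_probs = classify(cls_repr, weight, bias)
loss = cross_entropy(log_probs, labels)
print(logits.shape, float(loss))
```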

Official

Implementation

Reference implementations

google-research/vision_transformer (official ViT)

Python (JAX/Flax) · Google Research

Official

Hugging Face Transformers — ViTModel

Python (PyTorch) · Hugging Face

Official

timm (rwightman) — PyTorch Image Models

Python (PyTorch) · Ross Wightman / Hugging Face

lucidrains/vit-pytorch (educational)

Python (PyTorch) · Phil Wang (lucidrains)

facebookresearch/dinov2

Python (PyTorch) · Meta AI

Official

Implementation pitfalls

Poor results without large-scale pretraining · Severity: High

Training ViT from scratch on ImageNet-1k with the original recipe yields lower accuracy than a comparable ResNet: without the convolutional inductive biases the model needs far more data or much stronger regularization.

Fix: Pretrain on ImageNet-21k / JFT or distill from a CNN teacher (DeiT). Use strong augmentation (RandAugment, Mixup, CutMix) and stochastic depth.
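Of the augmentations listed in the fix, Mixup is the simplest to show in code. The NumPy sketch below blends each image and its one-hot label with a randomly chosen partner from the same batch. The function name and the value alpha = 0.8 are illustrative assumptions, not a recommended setting.

```python
import numpy as np


def mixup(images, labels_onehot, alpha, rng):
    """Blend each sample with a random partner from the same batch."""
    lam = rng.beta(alpha, alpha)                    # mixing coefficient in [0, 1]
    perm = rng.permutation(len(images))
    mixed_images = lam * images + (1.0 - lam) * images[perm]
    mixed_labels = lam * labels_onehot + (1.0 - lam) * labels_onehot[perm]
    return mixed_images, mixed_labels, lam


rng = np.random.default_rng(0)
images = rng.standard_normal((8, 3, 32, 32))        # toy batch
labels = np.eye(10)[rng.integers(0, 10, size=8)]    # one-hot labels, 10 classes

mixed_images, mixed_labels, lam = mixup(images, labels, alpha=0.8, rng=rng)
print(mixed_images.shape, mixed_labels.shape, round(float(lam), 3))
```

The soft labels still sum to 1 per sample, so the usual cross-entropy loss applies unchanged.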

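Positional-embedding interpolation when changing resolution · Severity: High

Learned 1D positional embeddings are specific to the pretraining N. With P = 16, fine-tuning at 384×384 (24 × 24 patches) after pretraining at 224×224 (14 × 14 patches) requires 2D interpolation of the patch position grid; otherwise the table no longer matches the token count and performance drops.

Fix: Reshape the patch positional embeddings to a 2D grid (keeping the [CLS] entry aside), apply bilinear or bicubic interpolation, then flatten back. Alternatively use RoPE or relative position encodings.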
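The reshape, interpolate, flatten procedure can be sketched in NumPy as below. This version uses corner-aligned bilinear interpolation written by hand and assumes a square grid with the [CLS] entry stored first; the function name is hypothetical. Real libraries typically use their framework's built-in resize (often bicubic) and may differ in alignment conventions, so the numbers will not match them exactly.

```python
import numpy as np


def resize_pos_embed(pos_embed, old_grid, new_grid):
    """(1 + old_grid^2, d) -> (1 + new_grid^2, d); the [CLS] entry is kept as-is."""
    cls_pos, patch_pos = pos_embed[:1], pos_embed[1:]
    d = patch_pos.shape[1]
    grid = patch_pos.reshape(old_grid, old_grid, d)          # back to 2D layout

    # Sample positions in old-grid coordinates (corners map onto corners).
    coords = np.linspace(0.0, old_grid - 1, new_grid)
    lo = np.floor(coords).astype(int)
    hi = np.minimum(lo + 1, old_grid - 1)
    frac = coords - lo

    # Bilinear = linear interpolation along rows, then along columns.
    rows = grid[lo] * (1 - frac)[:, None, None] + grid[hi] * frac[:, None, None]
    out = rows[:, lo] * (1 - frac)[None, :, None] + rows[:, hi] * frac[None, :, None]

    return np.concatenate([cls_pos, out.reshape(new_grid * new_grid, d)], axis=0)


rng = np.random.default_rng(0)
d_model = 16
pos_224 = rng.standard_normal((1 + 14 * 14, d_model))        # pretrained at 224 px
pos_384 = resize_pos_embed(pos_224, old_grid=14, new_grid=24)
print(pos_224.shape, "->", pos_384.shape)
```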

Quadratic cost at high resolutions · Severity: High

Dense tasks (segmentation, detection) require high resolutions; standard ViT self-attention scales as O(N²) in the number of patches N.

Fix: Swin (local windows), FlashAttention (still O(N²) compute, but avoids materializing the N × N attention matrix in memory), Token Merging (fewer tokens), hierarchical backbones.
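A quick back-of-the-envelope calculation shows how fast the attention matrix grows with resolution. The helper below is hypothetical; 12 heads and 2 bytes per value (half precision) are assumed example settings, and the result covers a single layer and a single image only.

```python
def attention_cost(image_size, patch=16, num_heads=12, bytes_per_value=2):
    """Token count, attention entries per head, and bytes for one layer's attention maps."""
    n = (image_size // patch) ** 2 + 1               # patches + [CLS]
    entries = n * n                                  # one N x N matrix per head
    return n, entries, num_heads * entries * bytes_per_value


for size in (224, 384, 512, 1024):
    n, entries, nbytes = attention_cost(size)
    print(f"{size:5d} px  tokens={n:5d}  entries/head={entries:10d}  "
          f"attention maps={nbytes / 2**20:8.1f} MiB")
```

Going from 224 px to 384 px roughly triples the token count (197 to 577) but increases the number of attention entries by about 8.6 times.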

No hierarchical receptive fields · Severity: Medium

CNNs naturally build a hierarchy of feature maps from local to global; standard ViT operates at a single scale, which can be problematic for detecting objects of different sizes.

Fix: Swin Transformer, MViT, PVT introduce a hierarchy. ViTDet shows that for detection a "simple feature pyramid" built from the single-scale output of a plain ViT suffices.

Unstable training of large ViTs · Severity: Medium

Very deep / large ViTs (ViT-H, ViT-22B) can diverge during training when attention logits grow very large and the softmax saturates.

Fix: QK-norm (query/key normalization), more careful LayerNorm placement, gradient clipping, learning-rate warm-up, freezing the patch embedding initially.
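The effect of QK-norm on the attention logits can be illustrated with a short NumPy sketch. It applies a LayerNorm without learned scale to queries and keys before the dot product, which is a simplification: real implementations usually keep a learnable scale. The inputs are deliberately inflated to mimic activations that have grown during training.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    mean = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def attention_logits(q, k, qk_norm):
    """Scaled dot-product logits, optionally normalising q and k first."""
    if qk_norm:
        q, k = layer_norm(q), layer_norm(k)
    return q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1])


rng = np.random.default_rng(0)
head_dim = 64
q = rng.standard_normal((197, head_dim)) * 100.0    # artificially large activations
k = rng.standard_normal((197, head_dim)) * 100.0

plain = attention_logits(q, k, qk_norm=False)
normed = attention_logits(q, k, qk_norm=True)
print("max |logit| without QK-norm:", float(np.abs(plain).max()))
print("max |logit| with QK-norm:   ", float(np.abs(normed).max()))
```

With this unscaled normalization each query and key has norm at most sqrt(head_dim), so the logits are bounded by sqrt(head_dim) regardless of how large the raw activations become.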

Evolution

Original paper · 2020 · ICLR 2021 · Alexey Dosovitskiy

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby

2017

Transformer (Vaswani et al.) — source architecture

Self-attention without recurrence emerges in NLP — the foundation of the later ViT.

Attention Is All You Need (paper)

2020

iGPT (Chen et al., OpenAI) — autoregressive generative pretraining on pixels

First influential demonstration of a pure Transformer on images (at the pixel level), a precursor to ViT.

2020

ViT — "An Image is Worth 16x16 Words" (Dosovitskiy et al.)

Inflection point

Full formulation of ViT: 16×16 patching, pure Transformer, pretraining on JFT-300M. ImageNet accuracy matches or exceeds the best CNNs of the time while requiring less pretraining compute.

An Image is Worth 16x16 Words (paper)

2021

DeiT (Touvron et al., Meta) — data-efficient ViT

Inflection point

Shows that ViT can be trained on ImageNet-1k without massive pretraining via distillation and improved augmentation.

Training data-efficient image transformers & distillation through attention (paper)

2021

Swin Transformer (Liu et al., Microsoft) — hierarchical windowed ViT

Inflection point

Local self-attention windows, shifted windows, and a multi-resolution hierarchy make ViT competitive as a general-purpose backbone (detection, segmentation).

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (paper)

2021

CLIP (Radford et al., OpenAI) — ViT as the visual encoder in multimodality

Inflection point

ViT becomes the standard backbone for contrastive image-text pretraining; opens the era of zero-shot vision.

Learning Transferable Visual Models From Natural Language Supervision (paper)

2021

MAE (He et al., Meta) — masked autoencoder pretraining for ViT

Inflection point

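Masking ~75% of patches and reconstructing their pixels from the remaining visible patches: a highly efficient form of self-supervised pretraining for ViT, since the encoder only processes the unmasked ~25%.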

Masked Autoencoders Are Scalable Vision Learners (paper)
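The random masking step at the heart of MAE can be sketched in NumPy as follows. This is a simplified illustration with a hypothetical function name; it covers only the selection of visible patches, not the encoder, decoder, or reconstruction loss.

```python
import numpy as np


def random_masking(tokens, mask_ratio, rng):
    """Keep a random subset of patch tokens per sample; return it with a boolean mask."""
    b, n, _ = tokens.shape
    num_keep = int(n * (1 - mask_ratio))
    noise = rng.random((b, n))                       # one random score per patch
    keep = np.argsort(noise, axis=1)[:, :num_keep]   # lowest scores are kept
    visible = np.take_along_axis(tokens, keep[:, :, None], axis=1)
    mask = np.ones((b, n), dtype=bool)               # True = masked (to reconstruct)
    np.put_along_axis(mask, keep, False, axis=1)
    return visible, mask, keep


rng = np.random.default_rng(0)
tokens = rng.standard_normal((2, 196, 8))            # patch tokens, no [CLS]
visible, mask, keep = random_masking(tokens, mask_ratio=0.75, rng=rng)
print(visible.shape, int(mask[0].sum()))
```

With 196 patches and a 75% ratio, the encoder sees 49 tokens per image and 147 patches are left for reconstruction.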

2021