Before Transformers: The Sequential Bottleneck
For most of the 2010s, the dominant paradigm in natural language processing was the recurrent neural network (RNN) and its more advanced variants, the LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) networks. All of these architectures shared one fundamental limitation: they processed data sequentially.
To compute the hidden state for the token at a given step, the network first had to process every preceding token in order. For a sequence of 512 tokens, this meant 512 sequential computation steps, so the parallel hardware of a GPU sat largely underused.
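The sequential dependency can be made concrete with a toy recurrence. This is only a sketch: the scalar weights `w_in` and `w_rec` and the function name are illustrative assumptions, not taken from any particular model.

```python
import math

def rnn_hidden_states(inputs, w_in=0.5, w_rec=0.9):
    """Toy scalar RNN: h_t = tanh(w_in * x_t + w_rec * h_{t-1}).

    w_in and w_rec are made-up illustrative weights.
    """
    h = 0.0
    states = []
    for x in inputs:
        # Step t reads h from step t-1, so the iterations cannot run in parallel.
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states

states = rnn_hidden_states([1.0, 0.0, -1.0, 0.5])
print(len(states), "hidden states computed one after another")
```

Each pass through the loop needs the `h` produced by the previous pass, which is why a longer sequence translates directly into more sequential steps.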
A second problem was the information bottleneck. In sequence-to-sequence models, the entire input context had to be compressed into a single fixed-length vector. Information from the first token had to "survive" hundreds of intermediate steps before influencing the last one, which degraded that information and made long-range dependencies hard to capture.
Researchers tried to alleviate these limitations (Bahdanau attention, introduced in 2014, is one example), but sequential processing remained a hard limit on scalability.
"Attention Is All You Need": The 2017 Breakthrough
In June 2017, eight researchers from Google Brain, Google Research, and the University of Toronto published a paper titled "Attention Is All You Need". They proposed a radical hypothesis: recurrence and convolutions could be eliminated entirely, leaving self-attention alone in their place.
The Transformer processed all tokens simultaneously, reducing the path between any two tokens to a constant number of steps. Removing sequentiality enabled massive computational parallelism and the training of far larger models on far larger datasets. By early 2026, the paper had accumulated over 173,000 citations, making it one of the most cited machine learning papers of the 21st century.
The Self-Attention Mechanism: The Heart of the Transformer
The key innovation of the Transformer is self-attention, a mechanism that lets the model weigh the relevance of every token against all other tokens in the sequence at once.
Take the sentence: "I sat on the bank watching the river flow." Self-attention allows the model to look simultaneously at "sat" and "river" to determine whether "bank" refers to a financial institution or a riverbank.
Technically, for each token in the sequence, the model generates three vectors: Query (Q), what the token is looking for; Key (K), what the token offers to others; and Value (V), the actual content the token "contributes." The full scaled dot-product attention formula:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
Dividing by $\sqrt{d_k}$ (where $d_k$ is the dimension of the key vectors) keeps the dot products from growing too large as the dimension increases. The softmax function then normalizes these scores into a probability distribution summing to 1, producing the final attention weights.
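As a minimal sketch of scaled dot-product attention (softmax of the query-key dot products divided by the square root of the key dimension, then used to weight the values), the following plain-Python version works on lists of vectors. The function names and the tiny example matrices are illustrative assumptions; real implementations use batched tensor operations.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def scaled_dot_product_attention(Q, K, V):
    """Return (outputs, weights) for lists of query, key and value vectors."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [dot(q, k) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors.
        outputs.append([sum(w_j * v[c] for w_j, v in zip(w, V))
                        for c in range(len(V[0]))])
    return outputs, weights

# Self-attention: queries, keys and values all come from the same tokens.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = scaled_dot_product_attention(X, X, X)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row is one token's attention distribution over all three tokens, and each row sums to 1.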
Multi-Head Attention: Multiple Experts Reading the Same Text
Rather than a single self-attention operation, the Transformer uses Multi-Head Attention: the Q, K, and V matrices are linearly projected into multiple smaller subspaces (called "heads"). Each head can specialize in a different aspect: one might track grammatical dependencies, another semantic meaning, and a third long-range pronoun references.
The results of all heads are then concatenated and passed through a final linear transformation to produce a comprehensive representation.
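The split-attend-concatenate pattern can be sketched as follows. This is a simplification under a loud assumption: plain slicing of each token vector stands in for the learned per-head projections, and the final linear transformation is omitted. All function names are hypothetical.

```python
import math

def attention(Q, K, V):
    """Scaled dot-product attention over lists of vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        out.append([sum((e / z) * v[c] for e, v in zip(exps, V))
                    for c in range(len(V[0]))])
    return out

def split_heads(X, num_heads):
    """Slice each token vector into num_heads equal parts (stand-in for learned projections)."""
    d_head = len(X[0]) // num_heads
    return [[row[h * d_head:(h + 1) * d_head] for row in X]
            for h in range(num_heads)]

def multi_head_self_attention(X, num_heads):
    """Run attention independently per head, then concatenate per token."""
    per_head = [attention(H, H, H) for H in split_heads(X, num_heads)]
    return [[value for head in per_head for value in head[i]]
            for i in range(len(X))]

X = [[1.0, 0.0, 0.0, 1.0],
     [0.0, 1.0, 1.0, 0.0],
     [1.0, 1.0, 0.0, 0.0]]
out = multi_head_self_attention(X, num_heads=2)
print(len(out), "tokens,", len(out[0]), "dimensions each")
```

Because each head sees only its own slice, the two heads here produce different attention patterns over the same three tokens before their outputs are joined back to the original width.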
Encoder and Decoder: Two Faces of the Transformer
The original 2017 architecture had an encoder-decoder structure designed for machine translation.
- The encoder processes the input sequence (e.g., an English sentence) and builds a rich, bidirectional contextual representation in which each token can "look" both left and right at once.
- The decoder is responsible for generating the output sequence (e.g., a Polish translation) token by token. It contains an additional cross-attention layer: Queries (Q) come from the decoder, while Keys (K) and Values (V) come from the encoder's output. This allows the decoder to dynamically focus on the most relevant parts of the source text.
To prevent the decoder from "cheating" during training by peeking at future tokens, masked self-attention is used, blocking each position's access to the tokens that come after it.
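One common way to implement this masking is to set the scores for later positions to negative infinity before the softmax, so they receive zero weight. The sketch below assumes a precomputed square matrix of raw attention scores; the function name is illustrative.

```python
import math

def causal_attention_weights(scores):
    """Turn an n x n matrix of raw scores into causally masked attention weights."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Position i may attend to positions 0..i only; later ones are masked out.
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        m = max(masked)  # finite, because position i can always see itself
        exps = [math.exp(s - m) for s in masked]
        z = sum(exps)
        weights.append([e / z for e in exps])
    return weights

# With equal raw scores, each row spreads its weight evenly over the visible prefix.
for row in causal_attention_weights([[0.0] * 3 for _ in range(3)]):
    print([round(w, 3) for w in row])
```

The resulting matrix is lower-triangular: the first token attends only to itself, the second to the first two, and so on.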
Figure: the encoder-decoder architecture. Frames labeled × N wrap layers that are repeated N times (typically 6, 12, or more). Residual connections let each sub-block's input bypass it and be added to its output at "Add & Norm". Cross-attention feeds the encoder's output (as K and V) into every decoder layer. Both stacks flow bottom-up.
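The "Add & Norm" step can be sketched as a residual addition followed by layer normalization. This simplified version omits the learned scale and shift parameters that layer normalization usually carries, and the `sublayer` passed in below is a made-up stand-in for an attention or feed-forward block.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer):
    """Residual connection: add the sub-block's output to its input, then normalize."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])

# Hypothetical sub-block that just scales its input.
out = add_and_norm([1.0, 2.0, 3.0, 4.0], lambda v: [0.1 * t for t in v])
print([round(v, 3) for v in out])
```

Because the input is added back unchanged, the sub-block only has to learn a correction to its input rather than a full replacement for it.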
Three Model Families: BERT, GPT, and T5
Researchers quickly realized the architecture could be split and optimized for specific tasks:
- Encoder-only models (BERT, 2018): Google dropped the decoder, keeping full bidirectional attention. BERT is trained by masking 15% of tokens and predicting them from context. It specializes in language understanding: sentiment analysis, named entity recognition, extractive question answering. Google began using it in Search in 2019.
- Decoder-only models (the GPT family): OpenAI dropped the encoder, keeping causal, left-to-right attention. GPT is trained by predicting the next token (causal language modeling). This architecture gave us ChatGPT, GitHub Copilot, and the wider generative AI revolution.
- Encoder-decoder models (T5, BART): these preserve the original structure and work best for conditional generation tasks such as automatic text summarization and machine translation.
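The two pretraining objectives mentioned in the list, next-token prediction and masked-token prediction, differ mainly in how training examples are built from raw text. The sketch below is a simplification: it works on whitespace-split words rather than subword tokens, and it always substitutes `[MASK]`, whereas BERT's actual recipe sometimes keeps the original token or swaps in a random one. The helper names and the `seed` parameter are assumptions for illustration.

```python
import random

def next_token_pairs(tokens):
    """Causal LM examples: each prefix is paired with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def mask_tokens(tokens, mask_rate=0.15, seed=0, mask_token="[MASK]"):
    """Masked LM example: hide about mask_rate of the tokens and remember the originals."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for p in positions:
        targets[p] = tokens[p]
        masked[p] = mask_token
    return masked, targets

tokens = "the cat sat on the mat".split()
print(next_token_pairs(tokens)[:2])
print(mask_tokens(tokens))
```

The causal examples only ever expose the left context, while the masked example leaves context on both sides of the hidden token, which matches the decoder-only versus encoder-only split.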
Scaling Laws and the LLM Era
The Transformer unlocked AI scaling laws: as the number of model parameters and the amount of training data increase, model performance improves predictably. The original Transformer had about 65 million parameters in its base configuration and 213 million in its large one. GPT-3 has 175 billion. Some recent models are reported to reach into the trillions.
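For a sense of where such parameter counts come from, a widely used back-of-the-envelope estimate for a decoder-only Transformer is 12 × layers × width², ignoring embeddings, biases, and normalization parameters. The sketch below applies it to GPT-3's published configuration of 96 layers and a model width of 12288. The function name and the assumption of a feed-forward hidden size of four times the width are illustrative.

```python
def approx_decoder_params(num_layers, d_model):
    """Rough non-embedding parameter count: 12 * num_layers * d_model**2."""
    attention_params = 4 * d_model * d_model      # Q, K, V and output projections
    feed_forward_params = 8 * d_model * d_model   # two matrices, assuming a 4x hidden width
    return num_layers * (attention_params + feed_forward_params)

estimate = approx_decoder_params(num_layers=96, d_model=12288)
print(f"about {estimate / 1e9:.0f} billion parameters")
```

The estimate lands close to the 175 billion figure, which suggests that nearly all of the parameters sit in the attention and feed-forward weight matrices.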
This massive scalability allowed models to absorb vast amounts of human knowledge from internet corpora (Wikipedia, Common Crawl, BooksCorpus). As a result, LLMs transitioned from simple statistical pattern matchers to systems exhibiting emergent capabilities: zero-shot learning, complex logical reasoning, advanced coding, and dialogue.
The entire AI paradigm shifted away from task-specific algorithms toward generalized foundation models adaptable to thousands of different tasks through simple text prompts.
Vision Transformers: Convolutions Become Optional
In 2020, researchers at Google published "An Image is Worth 16x16 Words", introducing the Vision Transformer (ViT). They demonstrated that a pure Transformer, an architecture designed for text, could achieve state-of-the-art results on image classification without a single convolutional layer.
The trick: images are divided into a grid of non-overlapping patches, typically 16×16 pixels. Each patch is flattened into a 1D vector and projected into the model's vector space (an internal numerical representation where similar concepts have similar vectors), creating "visual tokens." A special classification token [CLS], prepended to the sequence, aggregates information from all patches through self-attention and is used for the final image classification.
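The patch-splitting step can be sketched in a few lines. This version assumes a single-channel image stored as a list of rows and stops at flattening: the learned projection and the [CLS] token are left out, and the function name is illustrative.

```python
def image_to_patches(image, patch_size):
    """Split a 2D image (list of rows) into flattened, non-overlapping square patches."""
    height, width = len(image), len(image[0])
    assert height % patch_size == 0 and width % patch_size == 0
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten the patch row by row into a 1D vector.
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            patches.append(patch)
    return patches

# A 4x4 toy image split into 2x2 patches gives 4 "visual tokens" of length 4.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
for patch in image_to_patches(image, patch_size=2):
    print(patch)
```

By the same arithmetic, a 224×224 image with 16×16 patches yields a sequence of 14 × 14 = 196 patches.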
ViT outperforms CNNs when pretrained on very large datasets, and it offers a global receptive field from the very first layer: a pixel in the top-left corner can immediately "see" a pixel in the bottom-right. This opened the door to multimodal architectures integrating text and images.
Why the Transformer Will Survive the Next Decade
The Transformer is today the foundational blueprint for nearly all modern AI. Its success stems from several mutually reinforcing properties:
- Computational parallelism: eliminating sequentiality enables training on thousands of GPUs simultaneously.
- Scalability: performance grows predictably with model size and data.
- Lack of strong inductive biases: the same architecture works for text, images, audio, proteins, code, and even robot trajectories.
- Transfer learning: a single pretrained model can be quickly adapted to dozens of specialized tasks.
While newer architectures (Mamba, RWKV, Hyena) attempt to avoid the quadratic cost of self-attention in sequence length, the Transformer remains the reference point and the foundation for experiments in scaling, multimodality, and reasoning.
Sources
- Vaswani et al., Attention Is All You Need (2017), arXiv:1706.03762
- Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018), arXiv:1810.04805
- Brown et al., Language Models are Few-Shot Learners (GPT-3, 2020), arXiv:2005.14165
- Dosovitskiy et al., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (2020), arXiv:2010.11929
- IBM, What is a Transformer Model? (ibm.com/topics/transformer-model)
- Wikipedia, Transformer (deep learning architecture)
