What is self-attention?
Self-attention (sometimes called intra-attention) is a mathematical operation that, for each element of a sequence — usually a token, a fragment of text — computes how strongly it should "attend to" the other elements of the same sequence. The result is a new, contextual representation of every token, enriched with information from the entire sentence.
It is worth stating right away what self-attention is not. It is not a model or a neural network in itself: it is a single type of layer, a computational block that is repeated many times inside the Transformer architecture (six times each in the encoder and decoder of the original model, dozens of times in today's large models). Nor is it identical to "attention" in general: an attention mechanism existed earlier (Bahdanau et al., 2014) as an add-on to recurrent networks. Self-attention is the special variant in which the queries, keys and values all come from the same sequence, hence the prefix "self".
The intuition that "the model reads itself" refers to exactly this: instead of processing a sentence word by word, the model ingests the whole sequence at once and lets every token "look around" at the others to better understand its own meaning in context.
Who is behind it?
Self-attention in its current form appeared in 2017 in the paper “Attention Is All You Need”, published by a team from Google Brain and Google Research. Its eight authors — Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser and Illia Polosukhin — proposed discarding recurrent and convolutional layers and basing the entire architecture solely on the attention mechanism. They named the architecture the Transformer; the name was reportedly suggested by Jakob Uszkoreit because of how it sounded.
The paper's influence is hard to overstate: it has been cited more than 170,000 times (as of 2025) and became the foundation for BERT, the GPT family, T5 and Llama. The attention mechanism itself had been developed earlier, notably by Dzmitry Bahdanau and colleagues in 2014, but it was the Google team that showed attention could carry the whole architecture rather than serve as an add-on.
How does it work?
Three vectors: Query, Key and Value
At the heart of self-attention lies a variant called scaled dot-product attention. Each input token is projected — using learned weight matrices — into three separate vectors: Query (Q), Key (K) and Value (V). The metaphor is borrowed from database lookup: the Query is what the token is "searching for", the Key is a label describing what a given token offers, and the Value is the actual content.
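The projection step can be sketched in a few lines of plain Python. This is only a toy illustration: the embeddings in `X` and the weight matrices `W_Q`, `W_K` and `W_V` are made-up numbers (in a real model the weights are learned), and the helper `matmul` is a hypothetical name introduced here.

```python
# Toy sketch: project each token embedding into Query, Key and Value vectors.
# All numbers are invented for illustration; real models learn the weights.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# Three tokens, each embedded as a 4-dimensional vector (hypothetical values).
X = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 2.0, 0.0, 2.0],
    [1.0, 1.0, 1.0, 1.0],
]

# Hypothetical weight matrices of shape (4, 2): embedding size 4, d_k = d_v = 2.
W_Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
W_K = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
W_V = [[1.0, 1.0], [0.0, 0.0], [0.0, 0.0], [1.0, -1.0]]

Q = matmul(X, W_Q)  # one Query row per token
K = matmul(X, W_K)  # one Key row per token
V = matmul(X, W_V)  # one Value row per token
```

Every token ends up with its own row in each of the three matrices, which is what allows all tokens to be compared with one another in a single matrix product.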
The computation runs in four steps, elegantly captured by a single formula:
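In the notation of Vaswani et al. (2017), with the softmax applied separately to each row of the score matrix, the formula reads:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```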
| Symbol | Meaning |
|---|---|
| … | query matrix: what the token is looking for in other positions |
| … | key matrix: what the token advertises to others |
| … | value matrix: content passed on when there is a strong match |
| … | key-vector dimension; dividing by \sqrt{d_k} keeps the softmax numerically stable |
- First, the model multiplies the query matrix by the transposed key matrix (…), producing a raw similarity score for every pair of tokens.
- Then it divides those scores by the square root of the key dimension (…, often …). Without this scaling, large dimensions push the values so high that the softmax (a function that maps a vector of real numbers to a probability distribution) falls into a region of near-zero gradient. The gradient measures how much the loss changes with each parameter, so a near-zero gradient hampers training.
- In the third step, softmax turns the scores into attention weights — numbers between 0 and 1 that sum to one.
- Finally those weights multiply the Value vectors: relevant tokens contribute a large share of their content to the updated vector, while irrelevant ones contribute almost nothing.
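The four steps above can be sketched as a short pure-Python function. This is a minimal illustration rather than an efficient implementation: the function names and the toy matrices `Q`, `K` and `V` are assumptions made here, and real systems run the same arithmetic as batched matrix operations on a GPU.

```python
import math

def softmax(row):
    """Turn a list of scores into weights in (0, 1) that sum to one."""
    m = max(row)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q and K are lists of d_k-dimensional rows; V is a list of value rows."""
    d_k = len(K[0])
    # Step 1: raw similarity score for every (query, key) pair.
    scores = [[sum(a * b for a, b in zip(q, k)) for k in K] for q in Q]
    # Step 2: divide by the square root of the key dimension.
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    # Step 3: softmax over each row gives the attention weights.
    weights = [softmax(row) for row in scaled]
    # Step 4: each output row is a weighted sum of the Value rows.
    output = [
        [sum(w * v[j] for w, v in zip(w_row, V)) for j in range(len(V[0]))]
        for w_row in weights
    ]
    return output, weights

# Hypothetical toy input: 3 tokens, d_k = d_v = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

output, weights = scaled_dot_product_attention(Q, K, V)
```

Each row of `weights` sums to one, and each row of `output` is the new contextual vector for the corresponding token.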
A worked example
Take the sentence "the dog fetches the stick". When the model updates the vector for "fetches", the Query of that word is compared with the Key of every token in the sentence. A high match between the Query of "fetches" and the Key of "dog" routes part of the Value vector of "dog" into the new representation of "fetches", which now encodes that it is the dog performing the action.
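To make the example concrete, here is a sketch with invented numbers. The 2-dimensional Key vectors and the Query for "fetches" are hypothetical and were picked by hand so that "dog" matches best; a trained model would learn such vectors from data.

```python
import math

tokens = ["the", "dog", "fetches", "the", "stick"]

# Hypothetical 2-dimensional Key vectors, one per token (made-up numbers).
keys = [[0.1, 0.1], [2.0, 0.0], [0.5, 0.5], [0.1, 0.1], [0.0, 1.5]]
# Hypothetical Query vector for "fetches".
query_fetches = [2.0, 0.5]

d_k = len(query_fetches)
scores = [
    sum(q * k for q, k in zip(query_fetches, key)) / math.sqrt(d_k)
    for key in keys
]

# Softmax over the scaled scores gives the attention weights of "fetches".
top = max(scores)
exps = [math.exp(s - top) for s in scores]
weights = [e / sum(exps) for e in exps]

for token, weight in zip(tokens, weights):
    print(f"{token:>8}: {weight:.2f}")
```

With these numbers the largest weight lands on "dog", so the Value vector of "dog" dominates the updated representation of "fetches".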
What are its key components?
Multi-head attention — multiple heads, multiple perspectives
Imagine reading the sentence “The banker the lawyer advised won the case.” One attention head might track the subject–verb relation, while another tracks that the lawyer appears in an embedded clause and is not the main subject. This is exactly why the Transformer’s authors proposed multi-head attention: instead of a single attention mechanism the model runs … independent “heads”, each learning to detect a different type of linguistic dependency with its own weight matrices for Q, K and V. For a model with … and 8 heads, each head operates on vectors of dimension 64.
After the independent computations, the outputs of all heads are concatenated (joined end-to-end into a single longer vector, e.g. 8 vectors of dimension 64 yield a single vector of dimension 512) and passed through a final projection matrix …, a learned linear layer that merges the outputs of all attention heads back into the model's original dimension. Because each head works in a smaller dimension, the total compute cost stays close to that of a single attention head with full dimensionality, while the model gains the ability to capture many linguistic nuances in parallel.
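The dimension bookkeeping can be checked with a small sketch. It assumes the figures from the text (8 heads of dimension 64, hence a model dimension of 512) and a hypothetical sequence length `n`. Slicing one wide vector stands in for the separate per-head projections, and both the per-head attention and the final projection are left out.

```python
d_model = 512                   # 8 heads x 64 dimensions, as in the text
num_heads = 8
d_head = d_model // num_heads   # 64

# A hypothetical token vector with d_model entries.
token = [float(i) for i in range(d_model)]

# Split into one slice per head (each head would run its own attention here).
heads = [token[i * d_head:(i + 1) * d_head] for i in range(num_heads)]

# Concatenate the per-head results back into a single d_model-sized vector.
concatenated = [x for head in heads for x in head]

# Multiply-adds needed for the score matrix alone, for n tokens (n is assumed).
n = 1000
single_head_cost = n * n * d_model
multi_head_cost = num_heads * n * n * d_head
```

Under this simple count `multi_head_cost` equals `single_head_cost`, which is the sense in which splitting into heads does not make the score computation more expensive.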
Three variants in the Transformer architecture
| Variant | Where | Mechanism |
|---|---|---|
| Self-attention | Encoder | Bidirectional: every token sees all the others in the sequence |
| Masked self-attention | Decoder | Causal mask: each position sees only itself and earlier tokens |
| Cross-attention | Encoder–decoder bridge | Query from the decoder, Key and Value from the encoder |
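The difference between the first two rows can be shown with a small sketch. It assumes a sequence of 4 positions and gives every pair the same raw score, so any difference in the resulting weights comes from the mask alone; the variable names are illustrative.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

n = 4
# Hypothetical raw scores: identical for every pair of positions.
scores = [[1.0] * n for _ in range(n)]

NEG_INF = float("-inf")
# Causal mask: position i may look at positions 0..i, never at later ones.
masked = [
    [scores[i][j] if j <= i else NEG_INF for j in range(n)]
    for i in range(n)
]

bidirectional = [softmax(row) for row in scores]  # encoder-style
causal = [softmax(row) for row in masked]         # decoder-style
```

Setting a score to negative infinity before the softmax gives that position a weight of exactly zero, so the first token attends only to itself while the last token spreads its attention over the whole prefix.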
What can it be used for?
Self-attention powers two main families of language models. Encoder-only architectures, such as BERT from Google, serve deep text understanding: search, classification, masked-word prediction. Decoder-only architectures, such as the GPT family from OpenAI and Llama from Meta, specialise in generating fluent text and underpin today's chatbots.
The mechanism quickly moved beyond text. The Vision Transformer (ViT) showed that if you cut an image into square "patches" and treat them as sequence tokens, self-attention handles image recognition very well. Today the same mechanism underlies multimodal models, image generators and audio-processing systems.
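The patching idea can be sketched as follows. The function name `image_to_patches` and the tiny 4x4 "image" are assumptions for illustration, and the sketch expects the image size to be a multiple of the patch size. A real Vision Transformer additionally projects each flattened patch linearly and adds position information before attention is applied.

```python
def image_to_patches(image, patch_size):
    """Cut a 2D image (list of rows) into flattened square patches, row by row."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches

# Hypothetical 4x4 single-channel image with pixel values 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = image_to_patches(image, patch_size=2)
```

Each entry of `patches` then plays the role that a token plays in text: one element of the sequence that self-attention compares with all the others.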
How does it differ from other approaches?
Before 2017, recurrent networks (RNN, LSTM, GRU) dominated; they processed tokens sequentially, one after another. This had two serious drawbacks: training was hard to parallelise on GPUs, and learning dependencies between distant words was hampered by the vanishing-gradient problem. Error gradients shrink exponentially as they are backpropagated through many layers or time steps, so the parameters barely update in response to distant words.
Self-attention solves both at once. First, all tokens are processed in parallel as matrix operations — a perfect fit for GPU architecture, which enabled unprecedented model scaling. Second, the information path between any two positions in the sequence has a constant length (on the order of a single operation), instead of growing with distance as in an RNN. It is precisely this combination — parallelism plus short paths — that made the Transformer the dominant architecture.
Key limitations and challenges
Self-attention has one serious flaw: quadratic complexity with respect to sequence length. Because the mechanism computes relations between every pair of tokens, the compute and memory cost grows as …. For 2,000 tokens the attention matrix (an n×n matrix storing the attention weight between every pair of tokens) takes about 8 MB per head at 16-bit precision, which is harmless. But for a 128,000-token window, now common in large models, the same matrix swells to over 32 GB per head, and there are dozens of heads and layers. Engineers call this the "memory wall".
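The figures above can be reproduced with a few lines of arithmetic. The sketch assumes 2 bytes per matrix entry (16-bit floats) and decimal units (1 MB = 10^6 bytes); the function name is illustrative.

```python
def attention_matrix_bytes(num_tokens, bytes_per_entry=2):
    """Size of one n x n attention matrix; 2 bytes per entry assumes 16-bit floats."""
    return num_tokens * num_tokens * bytes_per_entry

small = attention_matrix_bytes(2_000)    # 8,000,000 bytes, i.e. 8 MB
large = attention_matrix_bytes(128_000)  # 32,768,000,000 bytes, about 32.8 GB

print(f"2,000 tokens:   {small / 1e6:.1f} MB per head")
print(f"128,000 tokens: {large / 1e9:.1f} GB per head")
```

The context is 64 times longer, but the matrix is 64 × 64 = 4,096 times larger, which is what quadratic growth means in practice.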
The answer is a series of optimisations. Flash Attention, developed by Tri Dao's team at Stanford, does not change the maths but splits the computation into small tiles that fit in fast on-chip SRAM, instead of materialising the whole matrix in slower HBM (high-bandwidth memory). The result is a reported 2–4× speed-up, and because the output is exact rather than approximate, nothing is lost in quality. Sparse attention goes further and skips computing relations judged irrelevant, accepting a small drop in quality. Meanwhile Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) shrink the KV cache, the stored keys and values of already-processed tokens, during text generation by sharing keys and values across heads. GQA, the compromise variant between full multi-head attention and MQA, is now used in models such as Llama 3, Mistral 7B and Gemma 2.
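The idea that tiling "does not change the maths" can be illustrated with a running softmax for a single query. This is a sketch of the underlying trick only, not of the actual Flash Attention GPU kernel; the function names and the toy data are assumptions made here.

```python
import math

def attention_row_full(q, K, V):
    """Reference: compute all scores for one query, then softmax, then mix values."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in K]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum(e * v[j] for e, v in zip(exps, V)) / total for j in range(len(V[0]))]

def attention_row_tiled(q, K, V, tile_size):
    """Same result, visiting keys and values tile by tile with a running softmax."""
    scale = math.sqrt(len(q))
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(V[0])  # unnormalised weighted sum of value rows
    for start in range(0, len(K), tile_size):
        k_tile = K[start:start + tile_size]
        v_tile = V[start:start + tile_size]
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * v_j for a, v_j in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

# Hypothetical toy data: one query, five keys and values of dimension 2.
q = [1.0, 2.0]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5], [2.0, -1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, -1.0], [0.5, 0.5]]

full = attention_row_full(q, K, V)
tiled = attention_row_tiled(q, K, V, tile_size=2)
```

Up to floating-point rounding, `tiled` matches `full` even though the tiled version never holds more than `tile_size` scores at once.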
Why does it matter?
Self-attention is not just another technical curiosity but a single idea that redefined the trajectory of the entire field of machine learning. It is hard to point to another mechanism that became, in such a short time, the common denominator of nearly every breakthrough model of the past decade — from language understanding, through image generation, to multimodal and robotic systems.
Its significance rests on three things. First, simplicity: the whole mechanism fits in a single formula, yet it is general enough to work for text, images and audio. Second, scalability: because self-attention is essentially matrix multiplication, it exploits GPU parallelism perfectly, which opened the door to models with hundreds of billions of parameters. Third, the fact that its main limitation — quadratic complexity — turned out to be an engineering problem rather than a conceptual barrier: successive variants (Flash Attention, GQA) keep pushing the context-length frontier without changing the foundation. Understanding self-attention is today a prerequisite for understanding how modern AI systems actually work.
Self-attention shows that the biggest progress sometimes lies not in adding more components but in finding one operation general enough to replace many earlier ones. "Attention is all you need" was a provocative claim — eight years on, it still proves largely true.
Sources
- Bahdanau et al. (2014). Neural Machine Translation by Jointly Learning to Align and Translate. arXiv:1409.0473
- Vaswani et al. (2017). Attention Is All You Need. arXiv:1706.03762
- Devlin et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805
- Dao et al. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. arXiv:2205.14135
- Ainslie et al. (2023). GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. arXiv:2305.13245
