AI Architecture

Self-Attention — how a model "reads itself"?

Sir Robot · 25 June 2026 · 8 min read · AI-assisted · editorial review

Self-attention is the mechanism that lets a language model assess how every word in a sentence relates to all the other words at once. It is the foundation of the Transformer architecture and of every modern large language model — without it there would be no GPT, no Llama, no BERT.

What is self-attention?

Self-attention (sometimes called intra-attention) is a mathematical operation that, for each element of a sequence — usually a token, a fragment of text — computes how strongly it should "attend to" the other elements of the same sequence. The result is a new, contextual representation of every token, enriched with information from the entire sentence.

It is worth stating right away what self-attention is not. It is not a model or a neural network in itself — it is a single layer, a computational block that repeats dozens of times inside the Transformer architecture. Nor is it identical to "attention" in general: an attention mechanism existed earlier (Bahdanau, 2014) as an add-on to recurrent networks. Self-attention is a special variant in which the queries, keys and values all come from the same sequence — hence the prefix "self".

The intuition that "the model reads itself" refers to exactly this: instead of processing a sentence word by word, the model ingests the whole sequence at once and lets every token "look around" at the others to better understand its own meaning in context.

Who is behind it?

Self-attention in its current form appeared in 2017 in the paper “Attention Is All You Need”, published by a team from Google Brain and Google Research. Its eight authors — Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser and Illia Polosukhin — proposed discarding recurrent and convolutional layers and basing the entire architecture solely on the attention mechanism. They named the architecture the Transformer; the name was reportedly suggested by Jakob Uszkoreit because of how it sounded.

The paper’s influence is hard to overstate — it has been cited more than 170,000 times (as of 2025) and became the foundation for BERT, the GPT family, T5 and Llama. The attention mechanism itself had been developed earlier, notably by Dzmitry Bahdanau in 2014, but it was the Google team that showed attention could be not an add-on but the whole thing.

How does it work?

Three vectors: Query, Key and Value

At the heart of self-attention lies a variant called scaled dot-product attention. Each input token is projected — using learned weight matrices — into three separate vectors: Query (Q), Key (K) and Value (V). The metaphor is borrowed from database lookup: the Query is what the token is "searching for", the Key is a label describing what a given token offers, and the Value is the actual content.
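The projection step can be illustrated with a minimal sketch. The weight matrices `W_Q`, `W_K`, `W_V` and the helper names below are hypothetical toy values chosen for readability; in a real model they are learned during training and far larger.

```python
def matvec(vector, matrix):
    """Multiply a row vector (length d_in) by a d_in x d_out matrix."""
    return [
        sum(vector[i] * matrix[i][j] for i in range(len(vector)))
        for j in range(len(matrix[0]))
    ]

# Hypothetical toy weights: embedding dimension 3, query/key/value dimension 2.
W_Q = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
W_K = [[0.0, 1.0], [1.0, 0.0], [0.0, 0.0]]
W_V = [[1.0, 1.0], [0.0, 0.0], [0.0, 2.0]]

def project_token(embedding):
    """Return the (query, key, value) vectors for one token embedding."""
    return (
        matvec(embedding, W_Q),
        matvec(embedding, W_K),
        matvec(embedding, W_V),
    )

query, key, value = project_token([1.0, 2.0, 3.0])
print(query, key, value)  # [1.0, 2.0] [2.0, 1.0] [1.0, 7.0]
```

The same embedding yields three different vectors because each role (searching, advertising, carrying content) has its own matrix.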

The computation runs in four steps, elegantly captured by a single formula:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$

Symbol	Meaning
Q	query matrix: what the token is looking for in other positions
K	key matrix: what the token advertises to others
V	value matrix: content passed on when there is a strong match
d_k	key-vector dimension; dividing by \sqrt{d_k} keeps the softmax numerically stable

First, the model multiplies the query matrix by the transposed key matrix ($QK^{\top}$), producing a raw similarity score for every pair of tokens.
Then it divides those scores by the square root of the key dimension ($\sqrt{d_k}$, often …). Without this scaling, large dimensions push the values so high that the softmax (a function that maps a vector of real numbers to a probability distribution, with each value in (0, 1) and all values summing to one) falls into a region of near-zero gradient, which hampers training. The gradient measures how much the loss changes with each parameter; when it is near zero, the network cannot learn effectively in deeper layers.
In the third step, softmax turns the scores into attention weights — numbers between 0 and 1 that sum to one.
Finally those weights multiply the Value vectors: relevant tokens contribute a large share of their content to the updated vector, while irrelevant ones contribute almost nothing.
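The four steps above can be sketched in plain Python. This is an illustrative version using only the standard library and nested lists; the function names are my own, and a production implementation would use batched matrix operations on a GPU instead.

```python
import math

def softmax(scores):
    """Map a list of real numbers to a probability distribution."""
    peak = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q, K: lists of d_k-dim vectors; V: list of d_v-dim vectors, one per token."""
    d_k = len(K[0])
    d_v = len(V[0])
    outputs = []
    for q in Q:
        # Steps 1 and 2: similarity with every key, divided by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K
        ]
        # Step 3: softmax turns the scores into weights that sum to one.
        weights = softmax(scores)
        # Step 4: weighted sum of the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, V)) for j in range(d_v)]
        )
    return outputs

# Toy check: a zero query scores every key equally, so the output is the mean of V.
result = scaled_dot_product_attention(
    Q=[[0.0, 0.0]], K=[[1.0, 0.0], [0.0, 1.0]], V=[[1.0, 2.0], [3.0, 4.0]]
)
print(result)  # [[2.0, 3.0]]
```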

A worked example

Take the sentence "the dog fetches the stick". When the model updates the vector for "fetches", that word sends its Query to all tokens. A high match between the Query "fetches" and the Key "dog" routes part of the Value vector of "dog" into the new representation of "fetches" — the model learns that it is the dog performing the action.
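To make the example concrete, here is a small sketch with invented numbers. The two-dimensional Key vectors and the Query for "fetches" are hypothetical, hand-picked so that "dog" matches best; a trained model would learn such vectors rather than have them assigned.

```python
import math

tokens = ["the", "dog", "fetches", "the", "stick"]

# Hypothetical 2-d Key vectors, one per token (not taken from any real model).
keys = [[0.1, 0.0], [2.0, 1.0], [0.5, 0.5], [0.1, 0.0], [1.0, 0.2]]
query_fetches = [2.0, 1.0]  # hypothetical Query vector for "fetches"

d_k = len(query_fetches)
scores = [
    sum(q * k for q, k in zip(query_fetches, key)) / math.sqrt(d_k)
    for key in keys
]
peak = max(scores)
exps = [math.exp(s - peak) for s in scores]
weights = [e / sum(exps) for e in exps]

for token, weight in zip(tokens, weights):
    print(f"{token:8s} {weight:.3f}")

most_attended = tokens[weights.index(max(weights))]
print("strongest match:", most_attended)  # strongest match: dog
```

With these toy vectors the largest attention weight falls on "dog", so most of what flows into the new representation of "fetches" is the Value vector of "dog".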

What are its key components?

Multi-head attention — multiple heads, multiple perspectives

Imagine reading the sentence “The banker the lawyer advised won the case.” One attention head might track the subject–verb relation, while another tracks that the lawyer appears in an embedded clause and is not the main subject. This is exactly why the Transformer’s authors proposed multi-head attention: instead of a single attention mechanism the model runs $h$ independent “heads”, each learning to detect a different type of linguistic dependency with its own weight matrices for Q, K and V. For a model with $d_{\text{model}} = 512$ and 8 heads, each head operates on vectors of dimension 512 / 8 = 64.

After the independent computations, the outputs of all heads are concatenated, that is, joined end-to-end into a single longer vector (e.g. 8 vectors of dimension 64 yield a single vector of dimension 512). The result is passed through a final projection matrix $W^O$, a learned linear layer that merges the outputs of all attention heads back into the model’s original dimension. Because each head works in a smaller dimension, the total compute cost stays close to that of single-head attention at full dimensionality, while the model gains the ability to capture many linguistic nuances in parallel.
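The split-and-concatenate bookkeeping can be sketched as follows. This is a simplified illustration: it slices already-projected vectors into heads and omits the learned per-head projections and the final output projection, and all function names are my own.

```python
import math

def attention(Q, K, V):
    """Scaled dot-product attention over lists of vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        out.append(
            [sum(e / total * v[j] for e, v in zip(exps, V)) for j in range(len(V[0]))]
        )
    return out

def split_heads(X, num_heads):
    """Slice each d_model-dim row into num_heads chunks of d_model // num_heads."""
    d_head = len(X[0]) // num_heads
    return [
        [row[h * d_head:(h + 1) * d_head] for row in X] for h in range(num_heads)
    ]

def multi_head_attention(Q, K, V, num_heads):
    """Run attention per head, then concatenate head outputs token by token."""
    heads = zip(
        split_heads(Q, num_heads), split_heads(K, num_heads), split_heads(V, num_heads)
    )
    head_outputs = [attention(q_h, k_h, v_h) for q_h, k_h, v_h in heads]
    return [
        [x for head in head_outputs for x in head[t]] for t in range(len(Q))
    ]

# Toy input: 3 tokens, model dimension 4, 2 heads of dimension 2 each.
X = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
Y = multi_head_attention(X, X, X, num_heads=2)
print(len(Y), len(Y[0]))  # 3 4
```

Each head sees only its own slice of every vector, and concatenation brings every token back to the model dimension, which is where the final projection would be applied.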

Three variants in the Transformer architecture

Variant	Where	Mechanism
Self-attention	Encoder	Bidirectional: every token sees all the others in the sequence
Masked self-attention	Decoder	Causal mask: each position sees only itself and earlier tokens
Cross-attention	Encoder–decoder bridge	Query from the decoder, Key and Value from the encoder
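The causal mask used by the decoder variant can be sketched in a few lines. This assumes the common approach of setting masked scores to negative infinity before the softmax; the function name is illustrative.

```python
import math

def causal_attention_weights(scores):
    """scores[i][j] is the raw score of query i against key j (square matrix)."""
    weights = []
    for i, row in enumerate(scores):
        # Position i may only see positions 0..i; later ones are masked out.
        masked = [s if j <= i else -math.inf for j, s in enumerate(row)]
        peak = max(masked)  # always finite: the diagonal entry is never masked
        exps = [math.exp(s - peak) for s in masked]  # exp(-inf) == 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Equal raw scores everywhere, so only the mask shapes the result.
W = causal_attention_weights([[0.0, 0.0, 0.0] for _ in range(3)])
for row in W:
    print([round(w, 3) for w in row])
```

The first token can attend only to itself, the second splits its weight over two positions, and the last spreads it over all three. Dropping the mask gives the bidirectional encoder behaviour.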

What can it be used for?

Self-attention powers two main families of language models. Encoder-only architectures, such as BERT from Google, serve deep text understanding — search, classification, masked-word prediction. Decoder-only architectures, with the GPT family from OpenAI and Llama from Meta, specialise in generating fluent text and underpin today’s chatbots.

The mechanism quickly moved beyond text. The Vision Transformer (ViT) showed that if you cut an image into square "patches" and treat them as sequence tokens, self-attention handles image recognition very well. Today the same mechanism underlies multimodal models, image generators and audio-processing systems.
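The patch idea can be sketched as follows. This only shows how an image is cut into flattened square patches; the learned linear projection and position information that a real Vision Transformer adds to each patch are left out, and the function name is my own.

```python
def image_to_patches(image, patch_size):
    """Split an H x W image (list of rows) into flattened square patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ]
            patches.append(patch)  # one patch plays the role of one token
    return patches

# Toy 4x4 grayscale image with pixel values 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = image_to_patches(image, patch_size=2)
print(len(patches), patches[0])  # 4 [0, 1, 4, 5]
```

Once the image is a sequence of patch vectors, self-attention can relate any patch to any other, exactly as it relates tokens in a sentence.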

How does it differ from other approaches?

Before 2017, recurrent networks (RNN, LSTM, GRU) dominated; they processed tokens sequentially, one after another. This had two serious drawbacks: training was hard to parallelise on GPUs, and learning dependencies between distant words was hampered by the vanishing-gradient problem. In that phenomenon, error gradients shrink exponentially as they are backpropagated through layers (or, in a recurrent network, through time steps), so the earliest parameters barely update.

Self-attention solves both at once. First, all tokens are processed in parallel as matrix operations — a perfect fit for GPU architecture, which enabled unprecedented model scaling. Second, the information path between any two positions in the sequence has a constant length (on the order of a single operation), instead of growing with distance as in an RNN. It is precisely this combination — parallelism plus short paths — that made the Transformer the dominant architecture.

Key limitations and challenges

Self-attention has one serious flaw: quadratic complexity with respect to sequence length. Because the mechanism computes relations between every pair of tokens, the compute and memory cost grows as $O(n^2)$, where $n$ is the number of tokens. The attention matrix is an n×n matrix storing the attention weight between every pair of tokens in the sequence. For 2,000 tokens, assuming 2 bytes per entry (e.g. fp16), it weighs about 8 MB per head, which is harmless. But for a 128,000-token window, now standard, the same matrix swells to over 32 GB per head, and there are dozens of heads and layers. Engineers call this the "memory wall".
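The arithmetic behind these figures is easy to reproduce. The sketch below assumes 2 bytes per matrix entry (half precision), which is the assumption under which the quoted numbers work out; other precisions scale the result proportionally.

```python
def attention_matrix_bytes(num_tokens, bytes_per_entry=2):
    """Size of one n x n attention matrix; 2 bytes per entry assumes fp16."""
    return num_tokens * num_tokens * bytes_per_entry

small = attention_matrix_bytes(2_000)
large = attention_matrix_bytes(128_000)
print(small / 1e6, "MB per head")  # 8.0 MB per head
print(large / 1e9, "GB per head")  # 32.768 GB per head
print(large // small)              # 4096
```

A 64-fold increase in context length costs 64² = 4,096 times more memory for the matrix, which is what quadratic growth means in practice.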

The answer is a series of optimisations. Flash Attention, developed by Tri Dao's team at Stanford, does not change the maths but splits the computation into small tiles that fit in the very fast SRAM cache, instead of materialising the whole matrix in slower HBM memory — yielding a 2–4× speed-up with no loss of precision. Sparse attention goes further and skips computing irrelevant relations, accepting a small drop in quality. Meanwhile Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) shrink the KV cache during text generation by sharing keys and values across heads. GQA — the compromise variant — is now standard in Llama 3, Mistral and Gemma.
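The claim that tiling "does not change the maths" can be checked with a small sketch of the running-softmax idea for a single query. This is only an illustration of the numerical trick: it does not model SRAM or HBM, it is not the actual FlashAttention kernel, and the function names are my own.

```python
import math

def attention_full(q, K, V):
    """Reference: materialise all scores for one query, then softmax and mix."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in K]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [sum(e * v[j] for e, v in zip(exps, V)) / total for j in range(len(V[0]))]

def attention_tiled(q, K, V, tile_size):
    """Same result, but visiting keys and values one tile at a time."""
    scale = math.sqrt(len(q))
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(V[0])  # unnormalised weighted sum of values
    for start in range(0, len(K), tile_size):
        k_tile = K[start:start + tile_size]
        v_tile = V[start:start + tile_size]
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated so far to the new reference maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [1.0, 2.0]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [0.5, 0.5]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 1.0], [0.0, 4.0]]
full = attention_full(q, K, V)
tiled = attention_tiled(q, K, V, tile_size=2)
print(all(abs(a - b) < 1e-12 for a, b in zip(full, tiled)))  # True
```

The tiled version never holds more than one tile of scores at a time, yet matches the reference up to floating-point rounding. That is the sense in which the optimisation is exact rather than approximate.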

Why does it matter?

Self-attention is not just another technical curiosity but a single idea that redefined the trajectory of the entire field of machine learning. It is hard to point to another mechanism that became, in such a short time, the common denominator of nearly every breakthrough model of the past decade — from language understanding, through image generation, to multimodal and robotic systems.

Its significance rests on three things. First, simplicity: the whole mechanism fits in a single formula, yet it is general enough to work for text, images and audio. Second, scalability: because self-attention is essentially matrix multiplication, it exploits GPU parallelism perfectly, which opened the door to models with hundreds of billions of parameters. Third, the fact that its main limitation — quadratic complexity — turned out to be an engineering problem rather than a conceptual barrier: successive variants (Flash Attention, GQA) keep pushing the context-length frontier without changing the foundation. Understanding self-attention is today a prerequisite for understanding how modern AI systems actually work.

Self-attention shows that the biggest progress sometimes lies not in adding more components but in finding one operation general enough to replace many earlier ones. "Attention is all you need" was a provocative claim; nine years on, it still proves largely true.

Sources

Bahdanau et al. (2014) — Neural Machine Translation by Jointly Learning to Align and Translate — arXiv:1409.0473
Vaswani et al. (2017) — Attention Is All You Need — arXiv:1706.03762
Devlin et al. (2018) — BERT: Pre-training of Deep Bidirectional Transformers — arXiv:1810.04805
Dao et al. (2022) — FlashAttention: Fast and Memory-Efficient Exact Attention — arXiv:2205.14135
Ainslie et al. (2023) — GQA: Training Generalized Multi-Query Transformer Models — arXiv:2305.13245
