The context window — what it is and why it defines what an LLM can do

Sir Robot · 9 June 2026 · 9 min read · AI-assisted · editorial review

The context window is a language model's working memory — it sets how much text the AI can take into account at once. Understanding how it works and where its limits lie explains both the cost and the surprising mistakes of modern models.

What is a context window?

A context window is the maximum amount of information a large language model (LLM) can process within a single request. It holds everything the model "sees" at a given moment: the system instruction, the user's prompt, any attached documents, the conversation history, and the response the model is still generating. It is neither a knowledge base nor permanent memory — it is more like a workbench on which the model lays out the material it needs for one specific answer.

The key consequence: content that exceeds the window's limit never reaches the model. Depending on the application, an oversized request is either rejected or trimmed, usually by dropping the oldest parts of the conversation, and for the model that material is simply gone. There is no going back to earlier messages the way a person mentally returns to the start of a chat. Only what currently fits inside the window counts.

A window's capacity is measured not in words or characters but in tokens. A token is the smallest unit of text the model splits the input into during tokenization. In English, one token is on average about 0.75 of a word (roughly four characters), because tokenizers are trained largely on English, Latin-script text. The same meaning costs far more tokens in languages the tokenizer saw less of — non-Latin scripts such as Chinese, Japanese, Arabic or Hindi, and morphologically rich languages like Polish or Turkish — and in source code, which makes every such request longer and more expensive. Because model providers bill per token, the window's size and how efficiently it is used translate directly into the cost of every request. We break down exactly how models split and count text in our dedicated explainer, AI Tokens — How Language Models Break Down and Count Text.
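To make the link between characters, tokens, and cost concrete, here is a minimal sketch built on the four-characters-per-token rule of thumb above. The function names and the price of 3.0 per million tokens are hypothetical; a real tokenizer and a real price list will give different numbers, especially for non-English text and code.

```python
import math

# Rule of thumb from the text: roughly four characters per English token.
CHARS_PER_TOKEN = 4.0


def estimate_tokens(text: str, chars_per_token: float = CHARS_PER_TOKEN) -> int:
    """Rough token estimate; a real tokenizer will return a different count."""
    return math.ceil(len(text) / chars_per_token)


def estimate_cost(tokens: int, price_per_million: float) -> float:
    """Cost of `tokens` at a hypothetical price per million tokens."""
    return tokens / 1_000_000 * price_per_million


prompt = "Summarise the attached contract in five bullet points."
tokens = estimate_tokens(prompt)
print(tokens, estimate_cost(tokens, price_per_million=3.0))
```

An estimate like this is only good enough for budgeting; for exact counts, use the tokenizer that belongs to the model you are calling.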

Who is behind it?

The notion of a context window follows directly from the Transformer architecture, described in the 2017 paper "Attention Is All You Need" by a team of Google researchers. It was the attention mechanism that defined how a model looks at all tokens at once — and at the same time imposed practical limits on window size.

Today the largest labs compete to enlarge and streamline the window. OpenAI expanded the window of its GPT-4 line from an initial 8,000 to 128,000 tokens. Anthropic's Claude models offer 200,000 tokens as standard, and even more in enterprise tiers. Google's Gemini models pushed the boundary to 1 million and then 2 million tokens. In parallel, the research and open-source community (including the authors of FlashAttention and the YaRN method) supplies the techniques that make long windows computable on available hardware at all.

How does it work?

At the heart of the Transformer is the self-attention mechanism. To understand a sentence, the model computes, for each token, how strongly it relates to every other token in the sequence. This gives the model a picture of the dependencies in the text, but it comes at a price: the number of these comparisons grows quadratically with input length. For a sequence of n tokens, the computational and memory complexity of full attention is O(n²).
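The n × n comparison is easiest to see in code. Below is a minimal pure-Python sketch of single-head scaled dot-product attention with no masking. As a simplifying assumption, the queries, keys, and values are passed in directly; in a real Transformer they come from learned projections of the token embeddings.

```python
import math


def softmax(row):
    """Turn a list of scores into weights that sum to 1."""
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(q, k, v):
    """Scaled dot-product attention over lists of vectors (one head, no mask)."""
    d = len(q[0])
    # n x n score matrix: every query is compared with every key.
    scores = [
        [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        for qi in q
    ]
    weights = [softmax(row) for row in scores]
    # Each output vector is a weighted average of the value vectors.
    out = [
        [sum(w * vj[c] for w, vj in zip(row, v)) for c in range(len(v[0]))]
        for row in weights
    ]
    return out, weights


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy token vectors
out, weights = self_attention(x, x, x)
print(len(weights), "x", len(weights[0]), "attention weights")
```

The `scores` matrix is where the quadratic cost lives: three tokens need nine scores, and n tokens need n².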

In practice, a tenfold increase in text does not raise the workload tenfold but a hundredfold. Doubling the context quadruples the number of operations. For hundreds of thousands of tokens, computing full attention becomes extremely expensive and hungry for GPU memory — for years this was the "iron barrier" of scaling.
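The arithmetic behind those ratios can be checked in a few lines. This sketch counts only query-key comparisons in full, unmasked attention; it ignores everything else a model does, so read the figures as relative growth, not as real workloads.

```python
def attention_pairs(n_tokens: int) -> int:
    """Number of query-key comparisons in full (unmasked) self-attention."""
    return n_tokens * n_tokens


base = attention_pairs(1_000)
for n in (1_000, 2_000, 10_000, 100_000):
    pairs = attention_pairs(n)
    print(f"{n:>7} tokens: {pairs:>14,} pairs, {pairs // base}x the 1,000-token count")
```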

The second pillar is the KV cache (key-value cache). Autoregressive models, like the GPT family, generate text token by token. To avoid recomputing the same historical values over and over, they store previously computed key and value vectors in GPU memory. As a result, the cost of adding each new token is linear in the length of the sequence so far rather than quadratic. The catch is that the KV cache grows with sequence length, the number of network layers, and the number of parallel requests. At very long contexts the cache itself can fill all of the GPU memory, and because it must be read back for every generated token, generation becomes limited by memory capacity and bandwidth rather than by compute.
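A back-of-the-envelope estimate shows how quickly the cache grows. The sketch below assumes a plain cache holding one key and one value vector per token, per layer, per key-value head, stored in 16-bit precision. The model shape is hypothetical and chosen only for illustration.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to cache keys and values for every layer and position."""
    # Factor 2: one key vector and one value vector per token, layer and KV head.
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical model shape, not taken from any specific model.
config = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for seq_len in (4_000, 128_000, 1_000_000):
    gib = kv_cache_bytes(seq_len, **config) / 2**30
    print(f"{seq_len:>9} tokens: {gib:8.2f} GiB per request")
```

Every term in the product is a multiplier, which is why serving many long requests in parallel exhausts memory long before compute runs out.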

What are its key components?

The contents of the window consist of several layers that together consume the available token budget:

System instruction — hidden directives defining the model's role and behaviour.
User prompt — the current question or task.
Attached material — documents, code fragments, data pasted into the request.
Conversation history — earlier exchanges within the same session.
Generated response — the text the model is producing also takes up window space.

Each of these layers competes for the same finite limit. The more space history and attachments take, the less remains for the answer — and the faster the oldest fragments fall out of the window.
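One common way an application handles this competition is to reserve room for the answer and then drop the oldest turns until the rest fits. The sketch below is one possible policy, not a standard API: `trim_history` and `rough_count` are hypothetical helpers, and the token counter is the crude four-characters-per-token estimate.

```python
def trim_history(history, count_tokens, window, reserved_for_output, fixed_tokens):
    """Drop the oldest messages until the request fits the window.

    history: list of message strings, oldest first.
    fixed_tokens: tokens already used by the system instruction,
                  the current prompt and any attachments.
    """
    budget = window - reserved_for_output - fixed_tokens
    kept = []
    used = 0
    # Walk from newest to oldest so the most recent turns survive.
    for message in reversed(history):
        cost = count_tokens(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


def rough_count(text):
    """Hypothetical counter: about four characters per token."""
    return max(1, len(text) // 4)


history = ["a" * 40, "b" * 40, "c" * 40]  # three turns of about 10 tokens each
kept = trim_history(history, rough_count, window=100,
                    reserved_for_output=50, fixed_tokens=25)
print(len(kept), "of", len(history), "turns kept")
```

More careful systems summarise the dropped turns instead of discarding them, but the budget arithmetic stays the same.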

Technically, the window also relies on how token positions are encoded. The RoPE technique (Rotary Position Embedding) encodes a token's position by rotating its query and key vectors through a position-dependent angle, letting the model understand distances between tokens. Models trained on a limit of, say, 4,000 tokens cannot on their own handle longer text, because they meet positional values they never saw during training.
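The rotation idea can be sketched in pure Python. This version rotates adjacent pairs of vector components and uses the commonly cited base of 10000; real implementations differ in how they pair components, so treat the layout as an assumption. The check at the end illustrates the property that matters: the dot product between a rotated query and key depends only on how far apart the two positions are.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles that depend on `position`.

    Assumes len(vec) is even.
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # later pairs rotate more slowly
        x, y = vec[i], vec[i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 0.5, -0.3, 0.8]
k = [0.2, -0.7, 0.9, 0.1]

# Same distance (3 positions apart) at two different absolute offsets.
near = dot(rope(q, 5), rope(k, 2))
far = dot(rope(q, 105), rope(k, 102))
print(abs(near - far) < 1e-9)  # True: only the relative distance matters
```

A model trained on short sequences has only ever seen small rotation angles on the slow-rotating pairs, which is the gap that methods like YaRN try to bridge.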

What can it be used for?

A large context window unlocks use cases that were unworkable with a short window or required artificially cutting up the data. A model with a window of 128,000–200,000 tokens can take in an entire lengthy contract, a several-hundred-page report, or a large codebase and answer questions in the context of the whole rather than a torn-out fragment.

In practice this means analysing legal and financial documents without manually splitting them, reviewing and refactoring code spanning many files at once, summarising long transcripts, or holding multi-hour conversations in which the model remembers earlier decisions. Million-token windows, as in Gemini, allow loading several megabytes of logs (at roughly four characters per token, a million tokens is about 4 MB of English text) or many hours of material in one go. The key advantage of a long window over chunking is the ability to spot relationships scattered across the whole document, links that vanish when the material is cut into independent fragments.

How does it differ from other approaches?

When a model needs access to a large body of external knowledge, two approaches compete: RAG (Retrieval-Augmented Generation) and long context.

RAG does not push the whole collection into the window. Documents are split into chunks and indexed in a vector database, and only the chunks that semantically best match the question are sent to the model. This is fast and cheap — few tokens are sent — and excellent for dynamic, frequently updated data such as news or logs. RAG's weakness shows when the meaning of an answer is spread across fragments that the retrieval algorithm cuts and separates.
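The retrieval step can be illustrated with a toy example. As a loudly simplified stand-in for a vector database, this sketch scores chunks by bag-of-words cosine similarity; production systems use learned embedding vectors and an approximate nearest-neighbour index instead. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def chunk(text, size=40):
    """Split text into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def bag_of_words(text):
    """Lowercased word counts, a crude substitute for an embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    shared = set(a) & set(b)
    num = sum(a[w] * b[w] for w in shared)
    den = (math.sqrt(sum(c * c for c in a.values()))
           * math.sqrt(sum(c * c for c in b.values())))
    return num / den if den else 0.0


def retrieve(question, chunks, top_k=2):
    """Return the top_k chunks most similar to the question."""
    q = bag_of_words(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, bag_of_words(c)), reverse=True)
    return ranked[:top_k]


chunks = [
    "The KV cache stores key and value vectors.",
    "RAG retrieves relevant chunks from an index.",
    "Tokens are the units a model reads.",
]
print(retrieve("What does the KV cache store?", chunks, top_k=1))
```

Only the returned chunks are placed in the window, which is why RAG stays cheap and also why an answer spread across several chunks can be missed.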

Long context works the other way: the entire content enters the window, so the model sees it all at once and captures non-linear dependencies between distant fragments. The price is cost and latency: with million-token inputs, time to first token is measured in seconds at best, the per-token bill grows with every token sent, and the underlying compute grows with the quadratic complexity of attention.

In practice these approaches increasingly combine. A hybrid uses RAG to roughly narrow down a huge corpus, then passes the selected documents to a model with a moderately long window, where the actual in-depth analysis happens.

Key limitations and challenges

The biggest trap of long windows is the gap between their advertised size and their actual usefulness. The "Lost in the Middle" study (Liu et al., 2023) showed that models make best use of information placed at the beginning and end of the input, while facts buried in the middle of a long document tend to be missed. Performance follows a U-shaped curve: a strong recency and primacy effect, a weak middle. This means a model with a two-million-token window does not use those tokens evenly.

Related to this is the phenomenon known as "context rot" — a decline in quality as the window fills up. Engineers observe that although providers boast huge maximum window sizes, a model's competence can drop sharply at only a small fraction of capacity, especially on harder tasks. Hence the notion of a maximum effective context window (MECW), which is often far smaller than the advertised one. An excess of irrelevant text acts as noise and distracts the model.

On top of this come hard infrastructure limits: the quadratic cost of attention, pressure on GPU memory from the KV cache, and latency with long inputs. Techniques such as FlashAttention (computing exact attention in blocks so that the full n × n score matrix is never written to GPU memory), sliding window and sparse attention (limiting the scope of attention), or YaRN (extending positional reach) ease these barriers but do not remove them.
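To see what "limiting the scope of attention" buys, the sketch below compares the number of query-key pairs under ordinary causal attention with a sliding window in which each token attends only to itself and a fixed number of preceding tokens. It is a counting exercise under those assumptions, not a description of any particular model's attention pattern.

```python
def causal_pairs(n):
    """Pairs when each token attends to itself and all earlier tokens."""
    return n * (n + 1) // 2


def sliding_window_pairs(n, window):
    """Pairs when each token attends to at most `window` recent tokens (itself included)."""
    return sum(min(i + 1, window) for i in range(n))


def sliding_window_mask(n, window):
    """mask[i][j] is True when token i may attend to token j."""
    return [[0 <= i - j < window for j in range(n)] for i in range(n)]


n, window = 100_000, 4_096
print(f"causal:         {causal_pairs(n):,} pairs")
print(f"sliding window: {sliding_window_pairs(n, window):,} pairs")
```

The windowed count grows roughly linearly with n, at the price that a token can no longer look directly at anything older than the window.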

Why does it matter?

Context window size has become one of the main figures labs use to market their models — which is exactly why it is so easy to misread. The number "two million tokens" sounds like a promise that the model will read and understand any amount of content at all. Research on "lost in the middle" and context rot shows this is an oversimplification: a window's capacity is not the same as the ability to use it effectively. For anyone building products on LLMs, this has direct consequences — dumping everything into the window "just in case" raises costs and can lower answer quality.

That is why the focus is shifting from raw window enlargement to context engineering: deliberately selecting, compressing, and ordering what reaches the model. Instead of treating the window as an unlimited data dump, mature systems filter the material, place the most important information at the edges of the window, and prune unnecessary history in long agentic scenarios. The context window will remain a measure of raw capacity, but a model's real intelligence increasingly depends on how well that capacity is managed — not simply on how large it is.
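Placing the most important information at the edges of the window can be as simple as reordering already ranked material. The helper below is a hypothetical sketch of one such ordering: it assumes the input is sorted from most to least relevant and alternates items between the front and the back, so the weakest ones end up in the middle.

```python
def order_for_edges(docs_by_relevance):
    """Place the most relevant items at the start and end, the weakest in the middle.

    docs_by_relevance: list sorted from most to least relevant.
    """
    front, back = [], []
    for rank, doc in enumerate(docs_by_relevance):
        # Ranks 0, 2, 4, ... go to the front; ranks 1, 3, 5, ... go to the back.
        (front if rank % 2 == 0 else back).append(doc)
    return front + back[::-1]


print(order_for_edges(["A", "B", "C", "D", "E"]))
# ['A', 'C', 'E', 'D', 'B']
```

Whether this helps depends on the model and the task, so it is worth measuring rather than assuming.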

The context window is best understood not as a size of memory but as a field of attention with uneven sharpness. The better we grasp its mechanics — tokens, quadratic cost, the KV cache, and the middle effect — the more accurately we design the prompts and systems that rely on these models.

Sources

IBM, "What is a context window?"
Vaswani et al., "Attention Is All You Need" (2017), arXiv
Liu et al., "Lost in the Middle: How Language Models Use Long Contexts" (2023), arXiv
Dao et al., "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness" (2022), arXiv
Peng et al., "YaRN: Efficient Context Window Extension of Large Language Models" (2023), arXiv
Google AI for Developers, "Long context" (Gemini API)
Anthropic, "Context windows"
