There’s a phrase that’s been floating around since 2017: “Attention is all you need.” It comes from the paper that started everything: the Google research team’s transformer architecture, which displaced recurrent networks as the dominant approach to language AI and became the foundation of ChatGPT, Claude, Gemini, and every other model you’re using right now.
Then there’s a newer phrase gaining traction: “context engineering.” Andrej Karpathy called it “the delicate art and science of filling the context window with just the right information for the next step.” Shopify’s CEO Tobi Lutke called it “the art of providing all the context for the task to be plausibly solvable by the LLM.”
These sound like they’re about the same thing. They’re not. And if you’re building with AI agents — not just chatting with them, but building real systems — confusing them will cost you time, money, and accuracy.
Two layers, one stack
Here’s the simplest way I can put it:
Attention is how the model thinks. It’s the internal mechanism — the engine under the hood — that determines how the model processes whatever information it receives. You don’t control it. You didn’t build it. It was designed by researchers at Google, Anthropic, OpenAI, and DeepMind and baked into the model weights during training.
Context engineering is what the model sees. It’s the external discipline of designing, selecting, and arranging the information that gets fed into that engine. You control all of it. Every system prompt, every retrieved document, every tool description, every piece of conversation history — that’s your context pipeline, and how well you design it determines whether the model’s attention has anything useful to work with.
One is architecture. The other is information design. They’re two different layers of the same stack.
What “attention” actually does
Before 2017, language models processed words one at a time, in sequence. If you wanted the model to connect the word “bank” in sentence 1 to “deposit” in sentence 47, it had to carry that signal through every intervening step — and the signal degraded along the way. Long-range connections were fragile.
The transformer changed this with self-attention: every token can directly “look at” every other token and decide how much to care about it. The word “bank” doesn’t need to pass a signal through 46 sentences. It directly attends to “deposit” — and to every other token in the input — in parallel.
That’s what “attention is all you need” means. The researchers showed you could throw away the sequential processing entirely and replace it with this single mechanism: tokens attending to tokens. Multi-head attention lets the model learn different types of relationships simultaneously — one head might track syntax, another might track meaning, another might track co-reference across paragraphs.
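If it helps to see the mechanism concretely, here is a minimal NumPy sketch of single-head scaled dot-product attention, the core operation the paper introduced. The random weight matrices stand in for what training actually learns; a real transformer runs many heads like this in parallel, with masking and positional information this sketch omits.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over token vectors."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv         # project tokens into query/key/value spaces
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # every token scores every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax: one attention distribution per token
    return weights @ V                       # each output mixes all values by those weights

# Toy example: 5 tokens with 8-dimensional embeddings
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)  # shape (5, 8): every token attended to every token
```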
The result is the architecture behind every frontier model. GPT-4, Claude, Gemini, Llama — they’re all transformers. They all use attention. The differences between them are in scale, training data, and post-training alignment — not in the fundamental mechanism.
The key point: you don’t touch this. When you use Claude or GPT-4, the attention mechanism is fixed. The weights are set. The architecture is done. You can’t make the model “pay more attention” to something by wanting it to. You can only change what’s in front of it — which is where context engineering begins.
What “context engineering” actually does
Context engineering is the discipline of designing what the model sees before it generates a response. If attention is the engine, context engineering is choosing what fuel to put in and how to arrange it.
A model’s context window has eight core components (a minimal assembly sketch follows the list):
- System prompt — identity, role, constraints, reasoning behavior
- Conversation history — prior turns in the current session
- Retrieved documents — chunks pulled from external knowledge (RAG)
- Tool descriptions — schemas for available tools and APIs
- Few-shot examples — input-output pairs that demonstrate desired behavior
- Memory — facts persisted across sessions
- Metadata — timestamps, user info, session state
- User message — the actual prompt
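To make that list concrete, here is a hedged sketch of how an application might gather all eight into the single window a model receives. The class and field names are illustrative, not any particular framework’s API; the point is that everything renders down to one token sequence.

```python
from dataclasses import dataclass, field

@dataclass
class ContextWindow:
    """Illustrative container for the eight components; names are hypothetical."""
    system_prompt: str = ""
    history: list[str] = field(default_factory=list)
    retrieved_docs: list[str] = field(default_factory=list)
    tool_descriptions: list[str] = field(default_factory=list)
    few_shot_examples: list[str] = field(default_factory=list)
    memory: list[str] = field(default_factory=list)
    metadata: dict[str, str] = field(default_factory=dict)
    user_message: str = ""

    def render(self) -> str:
        """Flatten everything into the single sequence the model actually sees."""
        parts = [
            self.system_prompt,
            *self.memory,
            *self.tool_descriptions,
            *self.few_shot_examples,
            *self.retrieved_docs,
            *(f"{k}: {v}" for k, v in self.metadata.items()),
            *self.history,
            self.user_message,
        ]
        return "\n\n".join(p for p in parts if p)
```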
Most people optimize only the last component, the user message. That’s prompt engineering. Context engineering is optimizing all eight, because the model’s attention doesn’t distinguish between them. It processes the entire window as one sequence of tokens. A poorly written system prompt eats the same attention budget as a carefully tuned one. A wall of irrelevant retrieved documents dilutes the signal that matters.
A February 2026 peer-reviewed study across 9,649 experiments confirmed what practitioners already suspected: context quality matters more than prompt phrasing for frontier models. You can perfect your prompt all day. If the surrounding context is noise, the model’s attention is spent processing noise.
Where the two layers meet
There’s a specific place where attention and context engineering collide, and it’s the most practical insight in all of this: the U-shaped attention curve.
Researchers at Stanford (Liu et al., 2023) documented that LLMs pay disproportionate attention to information at the beginning and end of the context window. Information buried in the middle gets significantly less weight. They called the paper “Lost in the Middle.”
This is an attention behavior — it’s how the model’s internal mechanism distributes its processing across the token sequence. But the practical response is a context engineering decision: put your most critical instructions at the start. Put your most critical user context at the end. Don’t bury key facts in the middle of a long document dump.
That’s the handshake between the two layers. You can’t change the attention curve. You can design your context to work with it instead of against it.
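In code, working with the curve can be as simple as an ordering convention. This sketch is a heuristic drawn from the Lost in the Middle finding, not a guarantee; the function name and layout are my own.

```python
def order_for_u_curve(critical_instructions: str,
                      bulk_context: list[str],
                      critical_user_context: str,
                      user_message: str) -> str:
    """Place must-not-miss content at the attention-favored edges of the window.

    Beginning: identity and hard constraints (high attention).
    Middle:    bulk documents and history (lowest attention).
    End:       the freshest, most task-relevant material (high attention).
    """
    return "\n\n".join([
        critical_instructions,   # start of window: attention-favored
        *bulk_context,           # middle: supporting detail only
        critical_user_context,   # near the end: attention-favored again
        user_message,            # the actual ask, last thing the model reads
    ])
```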
Another collision point: effective vs. advertised context length. Models advertise windows of 128K, 200K, even 2 million tokens. But empirical research consistently shows degradation well before the published limit. The attention mechanism doesn’t fall off a cliff — it degrades gradually, losing retrieval accuracy as the window fills. Context engineering responds to this by keeping the active context lean: load what’s needed, summarize what’s old, compress what’s verbose, and don’t treat the advertised limit as an operating target.
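One way to act on this is to enforce a self-imposed budget well under the advertised limit, folding the oldest material into summaries when you exceed it. A rough sketch, where the budget fraction, the token estimate, and the summarizer are all assumptions you would tune:

```python
def trim_to_budget(turns: list[str],
                   summarize,                 # assumed: callable that condenses old text
                   advertised_limit: int = 200_000,
                   effective_fraction: float = 0.5,
                   count_tokens=lambda s: len(s) // 4):  # crude ~4 chars/token estimate
    """Keep the live context under a conservative fraction of the advertised window."""
    budget = int(advertised_limit * effective_fraction)
    while sum(count_tokens(t) for t in turns) > budget and len(turns) > 2:
        # Fold the two oldest turns into one summary rather than dropping them outright
        turns = [summarize(turns[0] + "\n" + turns[1])] + turns[2:]
    return turns
```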
Why this distinction matters in practice
If you’re building with AI agents, confusing these two layers leads to specific, expensive mistakes:
Mistake 1: “The model should just pay attention to what matters.” It can’t choose. Attention distributes across whatever you put in the window. If you dump 50 loosely relevant documents into the context and hope the model figures out which 3 matter, you’re asking the attention mechanism to do context engineering’s job. It won’t. A well-ranked top-5 chunk set almost always outperforms a dump of 50.
Mistake 2: “My prompt is good enough — the model is just bad.” Maybe. But more likely, the 7 other context components are working against you. Your system prompt contradicts your few-shot examples. Your retrieved documents are stale. Your conversation history is 30 turns of irrelevant back-and-forth eating tokens. The model’s attention is fine — it’s processing exactly what you gave it.
Mistake 3: “I need a bigger context window.” Sometimes. But a bigger window with the same noisy context just gives the model more noise to attend to. Anthropic’s research shows that structured context reduces hallucination rates by over 40%. The fix isn’t more space — it’s better curation of what goes in.
Mistake 4: “I’ll just use RAG — retrieval solves everything.” RAG is one of eight context components. If your retrieval is returning 20 chunks when 3 would do, or if the chunks aren’t reranked for relevance, you’re creating a context engineering problem that no amount of attention can fix. Retrieval quality beats retrieval volume.
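The minimal version of “quality over volume” is a rerank-and-truncate step between retrieval and the context window. The scoring function here is a placeholder for whatever reranker you use, such as a cross-encoder or embedding similarity; the shape of the fix is the same either way.

```python
def rerank_top_k(query: str, chunks: list[str], score, k: int = 3) -> list[str]:
    """Keep the k most relevant chunks instead of dumping everything retrieved.

    `score(query, chunk) -> float` is an assumed reranking function; higher is better.
    """
    ranked = sorted(chunks, key=lambda c: score(query, c), reverse=True)
    return ranked[:k]

# Usage: retrieval may return 20 chunks; the model sees only the few that matter.
# context_chunks = rerank_top_k(user_query, retrieved_chunks, score=my_reranker, k=3)
```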
What you control and what you don’t
| | Attention | Context engineering |
|---|---|---|
| What it is | Internal mechanism — how the model weighs relationships between tokens | External discipline — what information the model sees at inference time |
| Who controls it | Model architects at research labs | You |
| When it’s set | Training time | Every single request |
| Your lever | None (it’s baked into the weights) | System prompt, RAG pipeline, memory, tools, history, metadata |
| Failure mode | Can’t fix — it’s the model’s architecture | Can fix — redesign your context pipeline |
| Analogy | How a brain processes what it sees | What briefing packet you put on someone’s desk |
The liberating thing about this distinction: almost everything that goes wrong in AI-assisted work is a context engineering problem, not an attention problem. And context engineering is entirely in your hands.
What this looks like in my workspace
I work in Cursor with an AI agent every day, building real software. My workspace has 17 always-on rules, 11 skills, 7 slash commands, daily journal entries, session handoff files, and an agent accomplishment log. Every single one of these is a context engineering artifact.
My activeContext.md tells the agent where we left off — that’s short-term memory injected into the context window. My .cursor/rules/ folder contains persistent instructions — that’s the system prompt layer. My cursor-knowledge/ directory is long-term memory available for retrieval. My skills and commands are tool descriptions that load on demand.
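None of these artifacts are magic; they are plain text loaded into the window at the right moment. A simplified sketch of that pipeline, using the paths from my setup with the loader logic purely illustrative:

```python
from pathlib import Path

def build_session_context(workspace: Path, user_message: str) -> str:
    """Illustrative sketch: gather persistent workspace artifacts into one window."""
    rules_dir = workspace / ".cursor" / "rules"
    rules = "\n\n".join(p.read_text() for p in sorted(rules_dir.iterdir()) if p.is_file())
    active = (workspace / "activeContext.md").read_text()  # short-term memory: where we left off
    # cursor-knowledge/ holds long-term memory; in practice only relevant files get retrieved
    return "\n\n".join([rules, active, user_message])
```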
None of this changes how the model’s attention works. All of it changes what the attention has to work with. And the difference between a first-turn response that nails it and one that asks me to re-explain everything is almost always a context engineering difference, not a model capability difference.
I’ve been doing context engineering for weeks without calling it that. The discipline just gave it a name and a science.
The real insight
“Attention is all you need” was a statement about model architecture. It was true in 2017 and it’s true now — the attention mechanism is the foundation of everything.
But for practitioners — people building with these models, not building the models themselves — the operational truth is different: context is all you control.
You can’t retrain the model. You can’t rewire the attention heads. You can’t change how it distributes processing across the token sequence. What you can do is design what it sees. And that design — the curation, compression, sequencing, and timing of information — is what determines whether the model’s extraordinary attention mechanism works for you or wastes itself on noise.
The transformer paper told us how AI thinks. Context engineering tells us what to put in front of it. If you’re only thinking about one of these layers, you’re leaving performance on the table.
Context engineering is a thread running through most of what I write about. For how I set up my workspace to manage context: Building With AI Agents. Questions? Get in touch.