Attention Is All You Need — re-read, 2026 edition

It’s 11:47 PM on a Wednesday. A production model is returning different answers to the same prompt — not randomly, not always, but reliably past a certain context length. The support ticket has been open for six hours. Two of us are on a call, screen-shared into a terminal, watching logits behave like they’ve forgotten the beginning of their own input.

My colleague types: “Is it the positional encoding?”

I pause. I think I used to know the answer to that. I open a browser tab. I reach for the paper.

That’s why I re-read Attention Is All You Need this week — not nostalgia, not discipline. It’s on my list of things worth returning to every year because nine years of hindsight changes what you notice, and a live bug changes what you need. Here’s what I took away this round.

What it replaced

Before June 2017, the state of the art in sequence modeling was recurrence. LSTMs and GRUs processed tokens one at a time, each step conditioned on the hidden state from the previous step. That sequential dependency was the bottleneck: you couldn’t parallelize training across a sequence, you struggled to propagate information across long distances, and the architectures — however much we loved them — had hit a plateau.

The paper proposed something almost irresponsibly simple: drop recurrence entirely. Replace it with a mechanism that computes, for every pair of positions in a sequence, how much one should “attend to” the other. No hidden state. No step-by-step recursion. Just matrix multiplications, done in parallel.

That simplicity is the reason this paper mattered. Attention itself wasn’t new — it had existed as an auxiliary mechanism in seq2seq models since Bahdanau et al. (2014). What this paper showed was that attention was sufficient. Everything else could go.

The idea in one sentence

Attention is a weighted sum of values, where the weights come from comparing queries to keys.

Attention(Q, K, V) = softmax(QK^T / √d_k) V

The scaled dot-product attention formula, where Q, K, and V are the query, key, and value matrices. Queries and keys produce the weights; values carry the content.

Three matrices are produced by multiplying the input by three learned projection matrices: one for queries (what each position is looking for), one for keys (what each position offers), one for values (what each position carries). The query-key dot product produces a similarity matrix; dividing by √d_k keeps those scores from growing with the key dimension and saturating the softmax; softmax then normalizes each row into weights that sum to 1; and that weight matrix is applied to the values.
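The steps in that paragraph fit in a few lines of NumPy. This is a minimal single-head sketch, not the paper's code: the sizes, the random "learned" projections, and the names `softmax` and `scaled_dot_product_attention` are all assumptions chosen for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """q: (n, d_k), k: (m, d_k), v: (m, d_v) -> output (n, d_v), weights (n, m)."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)     # similarity of every query with every key
    weights = softmax(scores, axis=-1)  # each row becomes weights that sum to 1
    return weights @ v, weights         # weighted sum of values

# Toy setup: sizes and random weights are illustrative assumptions.
rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 4, 8, 8
x = rng.normal(size=(n_tokens, d_model))  # stand-in for token embeddings
w_q = rng.normal(size=(d_model, d_k))     # projections would be learned; random here
w_k = rng.normal(size=(d_model, d_k))
w_v = rng.normal(size=(d_model, d_k))

out, weights = scaled_dot_product_attention(x @ w_q, x @ w_k, x @ w_v)
```

Each row of `weights` says how much one position draws from every other position, and `out` has one refined vector per input token.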

Every architectural choice in the paper is downstream of this. Multi-head attention runs the operation in parallel with different projection matrices. Positional encodings exist because attention is otherwise blind to order: shuffle the input tokens and the outputs are simply shuffled the same way (strictly speaking, it is permutation-equivariant). The feed-forward layers between attention blocks exist because attention alone is not enough: its output is a weighted average of value vectors, so you need a per-position nonlinearity to mix features within a token’s representation.
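The order-blindness is easy to check numerically. The sketch below, with made-up sizes, random weights, and a hypothetical `self_attention` helper, runs self-attention with no positional information on a sequence and on a shuffled copy, then compares the results.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    # Plain single-head self-attention with no positional information added.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))  # 5 tokens, 8 features (illustrative sizes)
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))

perm = np.array([3, 0, 4, 1, 2])  # an arbitrary reordering of the 5 tokens
out = self_attention(x, w_q, w_k, w_v)
out_shuffled = self_attention(x[perm], w_q, w_k, w_v)

# Shuffling the input only shuffles the output rows in the same way.
order_is_invisible = np.allclose(out[perm], out_shuffled)
```

Each token gets the same output vector no matter where it sits in the sequence, which is exactly why something extra has to inject position.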

The architecture

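The transformer as a stack: tokens become vectors, vectors get positional information, then N identical blocks refine them. The paper used N=6 for both the base and big models (big widened d_model from 512 to 1024 rather than adding layers); today’s large models stack anywhere from several dozen to over a hundred blocks.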

The paper presents an encoder-decoder architecture for translation: six encoder blocks, six decoder blocks. Each block is the same small set of operations — multi-head attention, residual connection, layer norm, feed-forward network, another residual, another layer norm. The decoder has an additional cross-attention step that looks at the encoder’s output.
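The encoder block described above can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions: toy sizes, random weights, no dropout, no learned gain or bias in the layer norm, and helper names (`encoder_block`, `make_params`, and so on) invented for this example. It uses the paper's post-norm ordering, where normalization follows each residual addition.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # stabilize before exponentiating
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each token vector; the learned gain and bias are omitted.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def multi_head_attention(x, p, n_heads):
    n, d_model = x.shape
    d_head = d_model // n_heads

    def split(t):  # (n, d_model) -> (n_heads, n, d_head)
        return t.reshape(n, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ p["w_q"]), split(x @ p["w_k"]), split(x @ p["w_v"])
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head))  # (h, n, n)
    heads = weights @ v                                            # (h, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)          # re-join heads
    return concat @ p["w_o"]

def feed_forward(x, p):
    # Position-wise: the same two-layer ReLU network is applied to every token.
    return np.maximum(0.0, x @ p["w_1"] + p["b_1"]) @ p["w_2"] + p["b_2"]

def encoder_block(x, p, n_heads):
    x = layer_norm(x + multi_head_attention(x, p, n_heads))  # attention sublayer
    return layer_norm(x + feed_forward(x, p))                # feed-forward sublayer

def make_params(rng, d_model, d_ff):
    def w(*shape):
        return rng.normal(scale=0.1, size=shape)
    return {
        "w_q": w(d_model, d_model), "w_k": w(d_model, d_model),
        "w_v": w(d_model, d_model), "w_o": w(d_model, d_model),
        "w_1": w(d_model, d_ff), "b_1": np.zeros(d_ff),
        "w_2": w(d_ff, d_model), "b_2": np.zeros(d_model),
    }

# Toy sizes (assumptions); the paper's base model uses d_model=512, h=8, d_ff=2048.
d_model, d_ff, n_heads, n_tokens, n_layers = 16, 64, 4, 5, 6
rng = np.random.default_rng(0)
layers = [make_params(rng, d_model, d_ff) for _ in range(n_layers)]

x = rng.normal(size=(n_tokens, d_model))  # stand-in for embeddings + positions
for params in layers:
    x = encoder_block(x, params, n_heads)
```

The shape going in is the shape coming out, which is what lets you stack the block N times. A decoder block would add a causal mask to its self-attention and a cross-attention sublayer over the encoder output.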

What’s remarkable looking back is that almost every successful transformer since has been some subset of this diagram. BERT is the encoder stack alone. GPT-2 and its descendants are the decoder stack alone, with self-attention masked so no token sees the future. T5 kept the full encoder-decoder. The original paper’s architecture diagram turned out to be a kind of menu.
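The "no token sees the future" rule in decoder-only models comes down to a mask applied to the scores before softmax. A minimal sketch, assuming a random score matrix as a stand-in for QK^T / √d_k and a hypothetical function name:

```python
import numpy as np

def causal_attention_weights(scores):
    """Turn an (n, n) score matrix into weights where position i ignores j > i."""
    n = scores.shape[-1]
    # True above the diagonal marks "key j is in the future of query i".
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    masked = np.where(future, -np.inf, scores)
    # The diagonal is never masked, so every row has a finite maximum.
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)  # exp(-inf) is exactly 0
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))  # illustrative stand-in for QK^T / sqrt(d_k)
weights = causal_attention_weights(scores)
```

The result is lower-triangular: the first token can only attend to itself, the second to the first two, and so on.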

What held up

After nine years and several orders of magnitude in model scale, most of what the paper said is still true in practice:

Scaled dot-product attention. The formula softmax(QK^T / √d_k) V still works. Nobody has replaced it at the frontier with something qualitatively different, though plenty of people have tried.
Multi-head attention as a useful default. Running many attention ops in parallel with different projections turned out to be robust across every model scale we’ve seen. Nobody sets heads = 1.
Decomposition into attention + FFN. Alternating attention with a per-token feed-forward network is still the canonical block shape. You can make the FFN bigger, you can make it a mixture-of-experts, but the shape is the shape.
Residual connections + layer norm. Unsexy structural plumbing that makes deep stacks trainable. Many later models swapped LayerNorm for RMSNorm or moved normalization before each sublayer (pre-norm) instead of after it (post-norm, as in the paper), but the principle is unchanged.
Adam + warmup. The optimizer and schedule they used are roughly what most teams still start with. The paper’s specific inverse-square-root decay is less common now, but the general “ramp up then decay” shape stuck.
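For reference, the schedule behind the "Adam + warmup" item is a one-liner. The formula and the defaults (d_model = 512, 4000 warmup steps) are the paper's; the function name is an assumption for this sketch.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate at a given training step (steps are counted from 1)."""
    # Linear ramp while step < warmup_steps, then decay proportional to 1/sqrt(step).
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

early = transformer_lr(100)     # still ramping up
peak = transformer_lr(4000)     # the two branches meet here, at about 7e-4
late = transformer_lr(100_000)  # decaying
```

Most current recipes keep the warmup and replace the tail with cosine or linear decay, but the overall shape is recognizably this one.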

What didn’t

A good paper doesn’t have to be right about everything. This one got specific things wrong:

Sinusoidal positional encodings. The paper proposed a clever closed-form positional encoding based on sines and cosines. It shipped, it worked, and then rotary positional embeddings (RoPE) came along in 2021 and ate its lunch. RoPE makes attention scores depend on relative position and, combined with frequency-scaling methods, has proven easier to extend to sequences longer than the model was trained on. If you’re implementing attention from scratch in 2026, reach for RoPE first. Sinusoidal is a historical curiosity.
The scaling story. The paper’s biggest model is the “Transformer (big)” at 213M parameters. That number was a footnote within three years: GPT-2 (2019) had 1.5B parameters and GPT-3 (2020) had 175B. The paper gave us the architecture; it didn’t predict what would happen when we fed it the public internet. In fairness, nothing in 2017 did.
Specific layer counts and dimensions. The numbers in the “base” and “big” rows of the paper’s tables (base: d_model = 512, h = 8, N = 6; big: d_model = 1024, h = 16, N = 6) got blown past immediately. Don’t anchor on them. Whatever your model size, the ratios between d_model, number of heads, and FFN width are worth tuning, but not by staring at the 2017 defaults.
WMT as the primary benchmark. The paper evaluated on English-German and English-French machine translation. Translation turned out to be a secondary application of transformers. Language modeling — the boring pretext task from the introduction — turned out to be the main event.
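To make the RoPE item above concrete: the core idea is to rotate pairs of query and key features by an angle proportional to the token's position, so that the dot product between a rotated query and a rotated key depends only on how far apart they are. The sketch below is a minimal illustration, not a production implementation. The function name `rope` is hypothetical, and the interleaved pairing and base of 10000 follow the RoFormer paper's convention; real libraries differ in how they pair features.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive feature pairs of x (shape (d,), d even) by pos-dependent angles."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)  # one rotation frequency per pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin     # standard 2D rotation per pair
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Same offset (3 positions apart) at very different absolute positions.
score_near = rope(q, 5) @ rope(k, 2)
score_far = rope(q, 105) @ rope(k, 102)
same_score = np.isclose(score_near, score_far)
```

Because the rotation is applied to queries and keys rather than added to the embeddings, nothing about absolute position leaks into the values, and vector norms are preserved.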

What’s actually worth remembering as a practitioner

If you’re shipping things with attention-based models — not someone training foundation models — the paper’s lasting lessons are simpler than you’d expect:

Attention is a function of context. The model’s answer depends on what’s in the context window. This obvious point governs most of prompt engineering: you can change the output substantially just by rearranging what the model sees.
Positional information is non-trivial. Attention has no built-in sense of order; without positional information, shuffling the tokens just shuffles the outputs. Something has to put order back in. If your model does something weird with ordering (ignores the start of a list, treats duplicates as identical), the positional encoding is a natural first suspect.
Heads can learn specialized behaviors. Some heads attend to syntax, some to semantics, some to rare tokens. You don’t get to choose which head learns what, but knowing they specialize helps you reason about failure modes. Mechanistic interpretability work (A Mathematical Framework for Transformer Circuits) is what you read next if this interests you.
Attention is O(n²) in sequence length. Every “longer context” announcement since 2023 has been about sidestepping this structural cost. FlashAttention reorders memory access patterns without changing the math. Sparse attention throws out most of the matrix. State-space models like Mamba change the mechanism entirely. The cost didn’t disappear. Someone moved it.
Decoder-only is the default for generation. When the industry consolidated, it wasn’t around the encoder-decoder in the paper — it was around the decoder-only stack GPT popularized. The encoder got repurposed for retrieval and representation learning. If you’re building a chatbot, you want decoder-only. If you’re building a semantic search index, you want an encoder. Both descend directly from this paper.
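A quick back-of-the-envelope for the O(n²) item above: the score matrix has one entry per pair of positions, so its size grows with the square of the context length. The helper name is hypothetical, and the 2 bytes per entry assumes half-precision scores.

```python
def attention_matrix_bytes(seq_len, bytes_per_entry=2):
    """Size of one head's (seq_len x seq_len) score matrix if fully materialized."""
    return seq_len * seq_len * bytes_per_entry

for n in (8_192, 131_072):
    gib = attention_matrix_bytes(n) / 2**30
    print(f"n={n}: {gib:.3f} GiB per head, per layer")

growth = attention_matrix_bytes(131_072) // attention_matrix_bytes(8_192)
```

A 16x longer context means a 256x larger matrix: 0.125 GiB versus 32 GiB for a single head in a single layer. That is why exact-attention kernels such as FlashAttention avoid ever storing the full matrix, and why the other approaches change what gets computed.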

How to actually read it

The paper is short — eight pages of core content plus a few pages of ablations. A working engineer can get what they need in under an hour. A few suggestions for a productive re-read:

Start with Figure 1 (the architecture diagram) and Section 3.2.1 (the attention formula). Those are the two mental models you’ll carry afterward.
Skip Section 3.1 unless you want the encoder-decoder derivation. If you already know what a transformer is, it’s review.
Read Section 4 (“Why self-attention?”) more carefully than you remember. The table comparing layer types by complexity and path length is the best justification in the paper — it answers “why this, why now” concretely.
The ablations in Section 6.2 age well. Notice that the “big” model’s only real superpower over the “base” model is size. The paper’s modesty about scale is, in retrospect, the most interesting thing about it.

The paper is also a very confident piece of writing. It doesn’t apologize, doesn’t over-qualify, doesn’t hedge. Read it once for the architecture, twice for how to write a paper that changes the field.

Why I re-read it

The bug that night turned out to be something more mundane — a context truncation in our preprocessing pipeline, not the positional encoding at all. But reading the paper helped me eliminate the suspect quickly and think clearly about where else to look. A clean mental model of how attention handles position is faster than any blog post.

It’s 11:47 PM, two colleagues are tired, and the logs still look wrong. What you want in that moment isn’t a tutorial. You want the source.

Papers don’t compound unless you come back to them. This one does.

References

Bahdanau, D., Cho, K., Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate. arXiv:1409.0473
Vaswani, A. et al. (2017). Attention Is All You Need. arXiv:1706.03762
Devlin, J. et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805
Radford, A. et al. (2019). Language Models are Unsupervised Multitask Learners (GPT-2). OpenAI.
Raffel, C. et al. (2019). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (T5). arXiv:1910.10683
Brown, T. et al. (2020). Language Models are Few-Shot Learners (GPT-3). arXiv:2005.14165
Su, J. et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv:2104.09864
Elhage, N. et al. (2021). A Mathematical Framework for Transformer Circuits. Anthropic.
Dao, T. et al. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. arXiv:2205.14135
Gu, A. & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv:2312.00752