Attention Is All You Need — Explained Simply
01 · The problem
Why old AI read like a slow reader
Before this paper, the best language AIs read a sentence the way you might read in a noisy room — one word at a time, in strict order, holding everything in your head as you go.
These models were called RNNs (recurrent neural networks). To understand word number 50, the model first had to process words 1 through 49, in sequence. It could not look at word 50 until everything before it was done.
Relatable example Imagine reading a book where each page is glued shut until you finish the previous one. You can never skip ahead or glance at two pages side by side. That is fine for a short note but painful for a long chapter, and extra readers cannot help, because nobody can start a page until the one before it is finished.
This caused two big headaches: it was slow (no skipping ahead, no doing things at the same time), and it had a bad memory for things far apart. By the time the model reached the end of a long sentence, it had often half-forgotten the beginning.
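The bottleneck can be sketched in a few lines. This is a toy recurrence, not the architecture of any particular RNN: the scalar weights `w_h` and `w_x` and the input numbers are made up for illustration. The point is only that each step needs the result of the step before it, so the loop cannot be split across processors.

```python
import math

def rnn_read(inputs, w_h=0.5, w_x=1.0):
    """Toy scalar RNN: each hidden state depends on the previous one.

    w_h and w_x are hypothetical weights chosen for illustration.
    """
    h = 0.0
    states = []
    for x in inputs:  # strictly in order: step t needs h from step t-1
        h = math.tanh(w_h * h + w_x * x)
        states.append(h)
    return states

# Four made-up "word" values read one after another.
states = rnn_read([0.1, -0.3, 0.8, 0.2])
print(states)
```

Notice that the first input only reaches the last state after passing through every step in between, which is one way to picture why the beginning of a long sentence fades.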
Reading one word at a time vs. all at once
Watch how an old model (RNN) must crawl through a sentence in order, while a Transformer looks at every word in the same instant.
The Transformer’s trick was to throw out the “one word at a time” rule completely. But if you read every word at once, how does the model know which words are related to which? That is where attention comes in.
02 · The big idea
Attention: letting words look at each other
“Attention” is just a fancy word for a simple human habit: when you read a word, you instinctively glance at other words that help explain it.
Relatable example Read this sentence: “The animal didn’t cross the street because it was too tired.” What does “it” mean — the animal or the street? You instantly look back at “animal.” Your brain paid attention to the right earlier word. That glance is the entire idea.
The Transformer gives every word the ability to do this glancing, all at once. For each word, the model asks: “Which other words in this sentence should I pay attention to in order to understand myself?”
What is each word looking at?
Click a word below. The darker another word glows, the more the selected word "pays attention" to it. Notice that "it" strongly attends to "animal": the model has learned what "it" refers to.
These attention strengths are illustrative, chosen to match the kind of patterns the real trained model learns — the paper shows real examples of exactly this “it → animal” behavior.
03 · Words into numbers
First, turn every word into a list of numbers
Computers can’t do math on the letters c-a-t. So the very first step is to turn each word into a list of numbers called a vector (the paper uses lists of 512 numbers each).
You can picture each word as a point in space. Words with similar meanings end up near each other. “King” sits near “queen”; “cat” sits near “dog”; “Monday” sits far from “banana.” The model learns these positions on its own by reading huge amounts of text.
Relatable example Think of a map of a country. Cities that are culturally similar tend to be close together. A “word map” works the same way: the direction and distance between word-points carries meaning. Famously, “king − man + woman” lands you right next to “queen.”
This list of numbers is called an embedding. Everything that follows — all the attention magic — is just arithmetic on these number-lists.
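To make "near each other" concrete, here is a minimal sketch. The three-number vectors below are hand-picked assumptions for illustration (real embeddings are learned, and the paper uses 512 numbers per word), and cosine similarity is one common way to measure closeness, not something the paper prescribes for this purpose.

```python
import math

# Hypothetical 3-number embeddings, hand-picked for illustration only.
embeddings = {
    "cat":    [0.9, 0.8, 0.1],
    "dog":    [0.8, 0.9, 0.2],
    "banana": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Close to 1.0 when two vectors point the same way, lower when they diverge."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])
cat_banana = cosine_similarity(embeddings["cat"], embeddings["banana"])
print(f"cat vs dog:    {cat_dog:.2f}")
print(f"cat vs banana: {cat_banana:.2f}")
```

With these made-up numbers, "cat" scores much closer to "dog" than to "banana", which is the pattern a trained model discovers on its own.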
04 · Word order
Reminding the model where each word sits
There’s a catch. Because the Transformer reads all words at once instead of in order, it has no built-in sense of which word came first.
But order matters enormously:
Relatable example “The dog bit the man” and “The man bit the dog” use the exact same words. Only the order tells you who is in trouble. Without order, both sentences look identical to the model.
The fix is clever: before the words enter the model, each one gets a unique “position fingerprint” added to its number-list. Position 1 gets one fingerprint, position 2 a slightly different one, and so on.
The paper builds these fingerprints out of sine and cosine waves of different frequencies, like a row of clocks all ticking at different rates. Each position lights up a unique combination of fast and slow waves, giving it a signature the model can recognize. A possible bonus: the authors chose this method because they hypothesized it may let the model handle sentences longer than any it saw during training.
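The wave fingerprint can be sketched directly from the paper's formula, PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)). The function name `positional_encoding` and the tiny size of 8 used in the usage lines are illustrative choices; the paper's vectors have 512 numbers.

```python
import math

def positional_encoding(position, d_model=512):
    """Sinusoidal fingerprint for one position (d_model is assumed even).

    Even slots hold sin(angle), odd slots hold cos(angle), and the angle
    slows down as the slot index grows.
    """
    fingerprint = []
    for i in range(0, d_model, 2):  # i plays the role of 2i in the formula
        angle = position / (10000 ** (i / d_model))
        fingerprint.append(math.sin(angle))
        fingerprint.append(math.cos(angle))
    return fingerprint

# A tiny 8-number version for the first four positions.
fingerprints = [positional_encoding(p, d_model=8) for p in range(4)]
for p, row in enumerate(fingerprints):
    print(p, [round(v, 3) for v in row])
```

Each fingerprint is simply added, number by number, to the word's embedding before the first layer.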
Each position has its own wave-fingerprint
Drag to move through positions. Each vertical slice is one word-position; the colored stripes are the sine/cosine values that make its unique signature. No two columns look the same.
05 · The mechanism
Query, Key, and Value: the heart of attention
Here is the one idea that the whole paper is built on. To decide what to pay attention to, every word produces three little number-lists, with three jobs.
Relatable example · a library search Imagine searching a library. Query is what you’re looking for (“books about space”). Key — each book has a label on its spine describing what it’s about. Value — the actual content inside. You compare your query against every key, find the best matches, and then collect mostly the values of the books that matched well.
Attention does exactly this. For one word (the Query), the model compares it against the Key of every word. Strong matches get a high score; weak matches get a low score. Those scores become percentages (they add up to 100%), and the word’s new representation is a blend of all the Values, weighted by those percentages.
The “comparison” is a dot product — a quick way to measure how aligned two number-lists are. Then a function called softmax squashes the scores into clean percentages.
Output = a weighted blend of Values, where the weights come from Query-meets-Key.
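Here is a minimal sketch of that recipe for a single query word, in plain Python. The toy keys, values, and query are made-up numbers, and the helper names `softmax` and `attend` are illustrative. The division by the square root of the list length follows the paper's "scaled" version of attention.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d = len(query)
    # Query-meets-Key: one dot product per word, scaled by sqrt(d).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # Weighted blend of the Values.
    blended = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
    return weights, blended

# Hypothetical 2-number vectors for a three-word sentence.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [1.0, 0.0]

weights, output = attend(query, keys, values)
print([round(w, 3) for w in weights])
print([round(x, 3) for x in output])
```

The query lines up with the first and third keys equally well, so those two words receive equal, larger weights, and the output leans toward their values.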
Pick a query word and watch attention compute
Choose which word is "asking." We compare its Query against every word's Key (a real dot product), scale and softmax the scores into percentages, then blend. The math below is actually being computed live.
| Word (Key) | Q·K score | ÷√d | Attention % |
|---|---|---|---|
06 · A small but vital tweak
Why divide by the square root of the size
The paper’s name for its method is “Scaled Dot-Product Attention.” The word that matters there is scaled.
When the number-lists are long, the dot-product scores can get huge. And when scores are huge, softmax becomes too confident — it slams almost all the attention onto a single word and ignores everything else. That makes the model hard to train.
The fix is tiny: divide every score by the square root of the list length (√d) before softmax. This keeps the scores in a sensible range so attention stays smooth and balanced.
Relatable example It’s like turning down an over-sensitive microphone. Too sensitive, and the loudest voice drowns out everyone in the room. Dial it back, and you can hear the whole conversation in proportion.
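The effect is easy to reproduce. In this sketch the three raw scores are made-up numbers chosen to be large, and d = 512 matches the list length quoted earlier; only the comparison between the two softmax results matters.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

d = 512                          # list length, as in the paper
raw_scores = [40.0, 20.0, 10.0]  # hypothetical large dot products

unscaled = softmax(raw_scores)
scaled = softmax([s / math.sqrt(d) for s in raw_scores])

print("unscaled:", [round(w, 4) for w in unscaled])
print("scaled:  ", [round(w, 4) for w in scaled])
```

Without scaling, practically all the attention lands on the first word. After dividing by √d, the first word still leads, but the other two keep a meaningful share.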
Watch softmax go from balanced to all-or-nothing
Same scores, different scaling. Slide right (less scaling / bigger numbers) and watch attention collapse onto one bar. Slide left and it spreads out fairly.
07 · Multi-head attention
Looking at the sentence from several angles at once
One round of attention captures one type of relationship. But language has many at the same time — grammar, meaning, who-did-what-to-whom.
So the Transformer runs attention several times in parallel (the paper uses 8), each with its own learned way of looking. These are called heads. One head might track which adjective describes which noun; another might track which pronoun points to which name. Their findings are then combined.
Relatable example It’s like having 8 expert proofreaders read the same sentence. One checks grammar, one checks tone, one checks facts, one checks pronouns. Each notices different things, and you merge all their notes into one richer understanding.
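A rough sketch of the idea, using 2 heads on 4-number vectors instead of the paper's 8 heads on 512. Each head gets its own Query, Key, and Value matrices, and a final output matrix mixes the concatenated results. All matrices here are random stand-ins for weights a real model would learn, and every name (`multi_head_attention`, `random_matrix`, `w_out`) is illustrative.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def random_matrix(rows, cols, rng):
    """Random stand-in for a learned weight matrix."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def attention(queries, keys, values):
    """Scaled dot-product attention for every query at once."""
    d = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

def multi_head_attention(words, heads, w_out):
    """heads is a list of (W_q, W_k, W_v) triples, one triple per head."""
    per_head = []
    for w_q, w_k, w_v in heads:
        queries = [matvec(w_q, word) for word in words]
        keys = [matvec(w_k, word) for word in words]
        values = [matvec(w_v, word) for word in words]
        per_head.append(attention(queries, keys, values))
    combined = []
    for i in range(len(words)):
        # Glue each head's finding for word i end to end, then mix them.
        concatenated = [x for head in per_head for x in head[i]]
        combined.append(matvec(w_out, concatenated))
    return combined

rng = random.Random(0)
d_model, n_heads = 4, 2
d_head = d_model // n_heads
heads = [tuple(random_matrix(d_head, d_model, rng) for _ in range(3))
         for _ in range(n_heads)]
w_out = random_matrix(d_model, d_model, rng)

# Hypothetical embeddings for a three-word sentence.
words = [[1.0, 0.0, 0.5, 0.2], [0.0, 1.0, 0.3, 0.9], [0.7, 0.7, 0.1, 0.4]]
output = multi_head_attention(words, heads, w_out)
print(len(output), "words, each with", len(output[0]), "numbers")
```

Because each head works on a smaller slice (here 2 numbers instead of 4), running several heads costs about the same as running one full-size head.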
Eight named heads, each doing a different job
First click any word in the sentence to choose who is "looking." Then switch heads on and off below. Each head is one expert with one specialty; the green glow shows where the selected experts look. Turn several on to see their combined view, or hit "All experts" to see everything at once.
08 · The full machine
Putting it together: encoder and decoder
The Transformer has two halves. The encoder reads and understands the input. The decoder writes the output, one word at a time.
Think of translating English to French. The encoder fully digests the English sentence into rich number-lists. The decoder then produces French words, at each step glancing both at the English meaning (via attention) and at the French words it has already written.
The architecture, one piece at a time
Encoder (reads input) ×6
Decoder (writes output) ×6
Two quiet helpers wrap every block: a residual connection (a shortcut that adds the original input back to the output so nothing important gets lost) and layer normalization (which keeps the numbers tidy and stable). The encoder and the decoder each stack 6 of these layers, so understanding deepens layer by layer.
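The two helpers can be sketched as follows. The paper writes the combination as LayerNorm(x + Sublayer(x)). This simplified version leaves out the learned gain and bias that real layer normalization usually carries, and `double` is a made-up placeholder standing in for an attention block.

```python
import math

def layer_norm(vector, eps=1e-5):
    """Rescale a vector to mean 0 and variance close to 1.

    Simplified: no learned gain or bias. eps guards against division by zero.
    """
    mean = sum(vector) / len(vector)
    variance = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(variance + eps) for x in vector]

def add_and_norm(x, sublayer):
    """Residual shortcut plus normalization: LayerNorm(x + Sublayer(x))."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])

def double(vector):
    """Hypothetical stand-in for a real block such as attention."""
    return [2.0 * v for v in vector]

result = add_and_norm([1.0, 2.0, 3.0, 4.0], double)
print([round(v, 3) for v in result])
```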
Why “no peeking” matters
When the decoder writes a sentence, it must predict the next word using only the words before it — exactly like you do when finishing someone’s sentence. The paper enforces this with a mask that hides all future words during training, so the model can’t cheat by looking at the answer.
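One common way to implement such a mask, sketched below, is to replace the scores of future positions with negative infinity before softmax, so they end up with exactly zero attention. The function name `causal_softmax` and the example scores are illustrative.

```python
import math

def causal_softmax(scores_row, position):
    """Softmax over one row of scores with every later position hidden.

    position is the index of the word doing the looking; it may see
    itself and anything earlier, but nothing after.
    """
    masked = [s if j <= position else -math.inf
              for j, s in enumerate(scores_row)]
    m = max(masked)
    exps = [math.exp(s - m) for s in masked]  # exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Word at index 1 looking at a four-word sentence (made-up scores).
row = causal_softmax([1.0, 2.0, 3.0, 4.0], position=1)
print([round(w, 3) for w in row])
```

The last two words had the highest raw scores, yet they receive no attention at all, because the word at index 1 is not allowed to see them.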
09 · The results
Faster to train, and better than everything before it
The payoff was dramatic. On the WMT 2014 English-to-German and English-to-French translation benchmarks, the Transformer outscored the best previously published single models, while training in a fraction of the time because it could finally do all that work in parallel.
Translation quality is scored with BLEU (higher is better). On English-to-German, the big Transformer set a new record of 28.4, beating even ensembles — combinations of many models — by more than 2 points.
The Transformer vs. the previous best
And the cost? The big model trained in about 3.5 days on 8 GPUs — a small fraction of what competing models needed. The base model reached state-of-the-art in roughly 12 hours. Crucially, it also generalized: applied to a totally different task (English grammar parsing), it again performed near the top without special tuning.
10 · Why it changed everything
The paper that quietly started a revolution
In 2017 this looked like a smart improvement to machine translation. Today we know it was the foundation of nearly all modern AI.
Because the Transformer dropped the “one word at a time” rule, it could be trained on enormous amounts of text using thousands of processors at once. That scalability is exactly what made today’s large language models possible.
The “GPT” in ChatGPT stands for Generative Pre-trained Transformer. BERT, the model behind years of Google Search improvements, is a Transformer. So are the models that generate images, write code, transcribe speech, and fold proteins. They are all descendants of the simple idea in this paper: let the pieces of your input pay attention to each other, and that is enough.
Attention, it turned out, really was all you needed.
11 · Build it yourself · zero to hero
From raw text to a question-answering model
This is the whole journey, running for real in your browser on text you provide. A genuine tiny Transformer: it tokenizes your text, learns embeddings, runs multi-head masked attention, pre-trains itself to predict the next word, and then gets fine-tuned into a model that answers questions. Nothing is pre-loaded — every weight is learned live, in front of you.
Go through the steps in order. Each one unlocks the next.
Give the model something to read
This is the raw material the model learns language from (its "pre-training data"). Keep sentences simple and repetitive so a tiny model can find the patterns. Edit it freely, then build the dataset.
Turn words into numbered tokens
The model can't read letters, so every unique word becomes a numbered token. The full list of tokens is the vocabulary. Three special tokens are added too: <q> and <a> (to mark questions and answers later) and <end>.
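A word-level tokenizer of this kind can be sketched in a few lines. Lowercasing and splitting on spaces are assumptions made for this sketch (punctuation is ignored entirely), and the function names `build_vocab` and `encode` are illustrative.

```python
def build_vocab(text):
    """Give every unique word a number, after the three special tokens."""
    special = ["<q>", "<a>", "<end>"]
    vocab = {token: i for i, token in enumerate(special)}
    for word in text.lower().split():  # assumption: words are space-separated
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab

def encode(text, vocab):
    """Turn a sentence into a list of token ids."""
    return [vocab[word] for word in text.lower().split()]

vocab = build_vocab("the cat sat on the mat")
print(vocab)
print(encode("the mat", vocab))
```

A word that never appeared in the training text has no id, so `encode` would raise a `KeyError` for it. That is one reason the training text and the questions should share a vocabulary.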
Give every token a vector of numbers
Each token id is mapped to a row of numbers (its embedding). Below is the entire embedding table: one row per word, one column per dimension, colour = value. Right now it is random noise — the model knows nothing yet. Watch this same picture after training to see meaning appear.
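The table itself is nothing more than a grid of numbers indexed by token id. The sketch below assumes small Gaussian noise for the starting values and a vocabulary of 10 tokens, both illustrative choices; the 32 columns match the embedding size this page's model uses.

```python
import random

def init_embedding_table(vocab_size, dim, seed=0):
    """One row of small random numbers per token: pure noise before training."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(vocab_size)]

table = init_embedding_table(vocab_size=10, dim=32)

# Looking up a sentence is just picking rows by token id.
token_ids = [3, 7]
vectors = [table[t] for t in token_ids]
print(len(vectors), "vectors of", len(vectors[0]), "numbers each")
```

Training never changes how the lookup works. It only changes the numbers stored in the rows.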
Stamp each slot with its position
Because the model reads all words at once, we add a unique wave-based "position fingerprint" to each word's vector so order is preserved. Type a sentence; each column below is the fingerprint added at that position.
Watch the words look at each other
Here is one real forward pass: the model computes Query, Key and Value for every word, scores them, masks the future, and softmaxes into the attention grid below (row looks at column). This model has 8 heads running in parallel. Pick any one to see it alone, or select several to see their combined view. Each head is named by what it actually does (the names become meaningful after you pre-train in Step 6).
Design your own head
Instead of letting the model learn where to look, you can force a focusing rule and test it. Pick a preset or build your own with the sliders, preview the pattern, then score it: the model runs with every head forced to use your rule, and we compare its prediction error to the model's own learned attention. Lower loss = better head.
Teach it to predict the next word
Now the model reads your text over and over, each time nudging its weights to better predict the next word. The loss (how wrong it is) should drop fast. On the right, watch it try to continue a sentence and get better.
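"How wrong it is" needs a number. A standard choice, assumed here since this page does not name its loss function, is cross-entropy: the negative log of the probability the model gave to the word that actually came next. The raw scores below are made up.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def next_word_loss(raw_scores, target_id):
    """Cross-entropy for one prediction: low when the right word gets high probability."""
    probabilities = softmax(raw_scores)
    return -math.log(probabilities[target_id])

# A three-word vocabulary; the correct next word is token 0.
unsure = next_word_loss([0.0, 0.0, 0.0], target_id=0)      # pure guess
confident = next_word_loss([5.0, 0.0, 0.0], target_id=0)   # mostly right
wrong = next_word_loss([0.0, 5.0, 0.0], target_id=0)       # confidently wrong

print(f"unsure:    {unsure:.3f}")
print(f"confident: {confident:.3f}")
print(f"wrong:     {wrong:.3f}")
```

Training nudges the weights in whatever direction lowers this number, averaged over every position in the text.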
It can autocomplete — but it can't answer
Your pre-trained "base model" is a talented autocomplete. Give it a few words and it continues them. But ask it a direct question and it just rambles, because it was only ever taught to continue text, not to answer. That is exactly why the next step exists.
Show it examples of good answers
To turn the autocomplete into an assistant, we give it example question | answer pairs. Each pair is wrapped in a format the model can recognise: <q> question <a> answer <end>. Enter one pair per line, with a vertical bar between question and answer.
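The wrapping step might look like this. The function name `format_pair` and the example lines are made up for illustration; only the `<q> ... <a> ... <end>` layout and the vertical bar come from the description above.

```python
def format_pair(line):
    """Turn 'question | answer' into '<q> question <a> answer <end>'."""
    question, answer = (part.strip() for part in line.split("|", 1))
    return f"<q> {question} <a> {answer} <end>"

# Hypothetical fine-tuning examples, one pair per line.
raw_examples = [
    "where is the cat | on the mat",
    "what does the dog eat | the dog eats food",
]
training_lines = [format_pair(line) for line in raw_examples]
for line in training_lines:
    print(line)
```

A line without a vertical bar would raise a `ValueError` here, which is a reasonable signal that the example is malformed.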
Continue training — on the task this time
Fine-tuning does not start from scratch. It keeps all the language the base model already learned and gently nudges it (a smaller learning rate) to follow the question-answer format. The loss on the answers should plunge toward zero.
Ask it a question
The same model that could only autocomplete now answers in the format it was fine-tuned on. Try the questions you trained it on, and then try slight variations to see how much (or how little) a tiny model can generalise.
This is a real but deliberately small model (32-dim embeddings, 2 heads, one Transformer block) trained on a handful of sentences, so it memorises more than it generalises. The exact same machinery — tokenize, embed, attend, pre-train, fine-tune — scaled to billions of weights and much of the internet, is what produces ChatGPT and its peers. You just did the whole pipeline, end to end.
In one breath
The whole paper, summarized
- The problem: Old models read text one word at a time — slow, and forgetful over long distances.
- The idea: Let every word directly “attend” to every other word, all at once, instead of in sequence.
- The mechanism: Each word makes a Query, Key, and Value. Match queries to keys, turn matches into percentages, blend the values.
- The refinements: Scale the scores for stability; use several attention “heads” for different relationships; add position fingerprints so word order survives.
- The result: Better translations, far faster training — and the blueprint for ChatGPT, BERT, and modern AI.
A plain-language, interactive retelling of “Attention Is All You Need” (Vaswani et al., 2017). For the precise math, figures, and experiments, read the original at arxiv.org/abs/1706.03762.