The Transformer is a sequence-to-sequence model — it takes a sequence in one language and produces one in another. It has two halves: an Encoder that reads and understands the input, and a Decoder that writes the output word by word.
Encoder — runs once
Input embedding: words → vectors
+ Pos encoding: inject order
Self-attention: ×6 layers
Add & Norm: stabilise
Feed-Forward: 512 → 2048 → 512
Add & Norm: stabilise
Output: H (context vectors)
Decoder — runs per token
Output embedding: prev tokens
+ Pos encoding: inject order
Masked self-attn: no peeking
Add & Norm: stabilise
Cross-attention: Q←dec, KV←enc
Add & Norm: stabilise
Feed-Forward: ×6 layers
Linear + Softmax: → word
Mental model: Encoder = a reader deeply absorbing a French sentence. Decoder = a writer producing English words one at a time, consulting what the reader understood.
Step 1 — Embeddings
Words become numbers
Computers cannot understand words. We must convert each word into a list of 512 numbers — a vector. We do this using a learned lookup table called the Embedding Matrix W.
Matrix W shape: 50K × 512. 50,000 rows (vocab) × 512 columns (dims).
Each word gets 1 row: its unique 512-number identity vector.
Values after training: semantic. Similar words end up with similar rows.
1. One-hot encoding. Every word gets a unique position in a dictionary of 50,000 words. "cat" = position 1. Its one-hot vector is 49,999 zeros and one 1 at position 1.
2. Matrix lookup. Multiply the one-hot vector by W. This just picks out the corresponding row, like reading row 1 from a giant spreadsheet. Result: a 512-number vector.
3. Semantic geometry emerges from training. After training on billions of words, similar-meaning words cluster together in the 512-dimensional space. "king" − "man" + "woman" ≈ "queen", purely from the numbers.
W shape: [50,000 × 512] ← starts random, learned during training
To embed "cat" (position 1):
one_hot   = [0, 1, 0, 0, …, 0]   ← 50,000 numbers
embedding = one_hot × W = W[1]   ← just row 1
Result X: table of shape [num_words × 512]
Why 512 dimensions? Powers of 2 run fast on GPUs. 512 is rich enough to capture meaning but small enough to train on 2017 hardware. GPT-3 uses 12,288 dimensions.
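The lookup above can be checked numerically. A minimal sketch with NumPy; the random W here stands in for a trained embedding matrix:

```python
import numpy as np

vocab_size, d_model = 50_000, 512
rng = np.random.default_rng(0)
W = rng.normal(size=(vocab_size, d_model))   # embedding matrix; learned during real training

cat_id = 1                                   # "cat" = position 1 in the vocabulary
one_hot = np.zeros(vocab_size)
one_hot[cat_id] = 1.0

embedding = one_hot @ W                      # multiplying by a one-hot vector...
# ...is exactly the same as reading out row 1 of W directly: W[cat_id]
```

In practice no framework materialises the one-hot vector; the multiplication is implemented as a direct row lookup.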
Step 2 — Positional Encoding
Injecting word order
After embedding, the model sees all words simultaneously with no sense of order. "cat ate fish" and "fish ate cat" look identical. We fix this by adding a unique position fingerprint to every word vector using sine and cosine waves.
sin alone repeats, so two positions can look identical. The pair (sin, cos) at the same frequency pins down a unique point on the unit circle within each period (sin² + cos² = 1 keeps the point on the circle), and combining 256 different frequencies makes every position distinct.
Why 256 pairs? 512 dims ÷ 2 = 256 clocks, each pair running at a different frequency. Fast pairs distinguish nearby positions; slow pairs distinguish distant ones. All distances covered.
⊕ means element-wise addition:
word_embedding (512 nums)
+ pos_encoding (512 nums)
─────────────────────
final_vector (512 nums) = WHAT it means + WHERE it sits
The ⊕ symbol in the Transformer diagram just means addition — we literally add the positional encoding numbers to the embedding numbers, element by element.
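The sinusoidal fingerprint can be written out directly. A minimal NumPy sketch of the original formula, PE(pos, 2i) = sin(pos/10000^(2i/512)) and PE(pos, 2i+1) = cos(...):

```python
import numpy as np

def positional_encoding(num_positions, d_model=512):
    pos = np.arange(num_positions)[:, None]        # [num_positions, 1]
    i = np.arange(d_model // 2)[None, :]           # [1, 256] frequency index
    angles = pos / (10000 ** (2 * i / d_model))    # one "clock" per sin/cos pair
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dims: sine
    pe[:, 1::2] = np.cos(angles)                   # odd dims: cosine
    return pe

pe = positional_encoding(10)                       # fingerprints for 10 positions
# final input = word_embedding + pe, element by element (the ⊕ in the diagram)
```

Every value stays in [−1, 1], so the fingerprint perturbs the embedding without drowning it out.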
Step 3 — Self-Attention
Words looking at each other
After steps 1–2, each word knows its own meaning and position — but nothing about the other words. Attention lets every word ask: which other words in this sentence are relevant to understanding me?
Q — Query: what am I looking for? "Which other words matter to me right now?"
K — Key: what do I offer? Every word's "label": how it might be relevant to others.
V — Value: what's inside me? The actual content shared once a word is "selected."
The attention formula
Attention(Q, K, V) = softmax(Q·K^T / √d_k) · V
Q·K^T → score every query against every key (how relevant?)
/ √d_k → scale down (prevents softmax from saturating)
softmax() → convert scores to weights summing to 1.0
· V → blend values weighted by attention scores
Example sentence: "je suis étudiant".
"je" attends most to itself, some to "suis" (verb it's the subject of).
The 5 stages of one attention head:
1. Create Q, K, V. Multiply word vector x (512 dims) by three learned weight matrices W_Q, W_K, W_V, each [512 × 64], producing 64-dim vectors Q, K, V for every word.
2. Compute raw scores. Score(i→j) = Q_i · K_j: the dot product measures how relevant word j is to word i. Higher = more relevant.
3. Scale by √64 = 8. Dividing by √d_k prevents the scores from growing too large, which would push softmax to extreme values and kill the gradients.
4. Softmax → attention weights. Converts scores to probabilities (all positive, each row sums to 1.0). These are the "percentages" of attention paid to each word.
5. Weighted blend of Values. Multiply each V by its weight and sum. The result for word i now contains borrowed information from every word it attended to, weighted by relevance.
Multi-head attention (8 heads × 64 dims = 512): Run the entire Q/K/V process 8 times in parallel with different weight matrices. Each head learns different relationships — syntax, coreference, semantics. Concatenate and project back to 512.
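The five stages of a single head fit in a few lines. A minimal NumPy sketch for 3 words (random weight matrices stand in for trained ones):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V          # stage 1: project to 64 dims
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # stages 2-3: score and scale
    weights = softmax(scores, axis=-1)           # stage 4: rows sum to 1
    return weights @ V, weights                  # stage 5: weighted blend of values

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 512))                    # 3 word vectors
W_Q, W_K, W_V = (rng.normal(size=(512, 64)) * 0.02 for _ in range(3))
out, weights = attention_head(X, W_Q, W_K, W_V)  # out: [3, 64]
```

Multi-head attention simply runs this function 8 times with different weight matrices and concatenates the 8 outputs of 64 dims back into 512.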
Step 4 — Add & Norm
Stability in deep networks
Applied after every sub-layer (attention and FFN). Two operations working together to keep the network stable and trainable at 6 layers deep.
ADD — residual connection: x + f(x). Adds the original input back to the sub-layer output. Each layer only learns the correction, not the whole representation. Prevents vanishing gradients.
NORM — layer normalisation: (x − μ)/σ. Rescales every word's 512 values to mean 0 and std 1, then applies learned γ and β. Prevents exploding or collapsing activations.
The complete formula
output = LayerNorm( x + sublayer(x) )
sublayer can be attention OR feed-forward
LN(x) = γ × (x − μ) / √(σ² + ε) + β
μ = mean of the 512 values, σ = standard deviation
γ, β = learned scale and shift (initialised to 1 and 0)
Mental model for ADD: An editor marks corrections in red pen — the original essay stays intact, only changes are layered on top. The network learns adjustments, not transformations from scratch.
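Both operations together are a one-liner. A minimal NumPy sketch of the complete formula, with γ and β left at their initial values of 1 and 0:

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)          # μ over each word's 512 values
    var = x.var(axis=-1, keepdims=True)          # σ² over the same values
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def add_and_norm(x, sublayer_out):
    return layer_norm(x + sublayer_out)          # ADD (residual), then NORM

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 512))                    # 3 word vectors
y = add_and_norm(x, rng.normal(size=(3, 512)))   # every row: mean ≈ 0, std ≈ 1
```

Whatever scale the sub-layer output had, each word vector leaves with mean 0 and std 1, which keeps activations well-behaved across all 6 layers.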
Step 5 — Feed-Forward Network
Individual deep thinking
After attention (words sharing information), each word processes what it learned — alone. No mixing between positions. Each word gets its own two-layer neural network applied independently.
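The per-word network is just two matrix multiplies with a ReLU between them. A minimal NumPy sketch of the 512 → 2048 → 512 expansion (random weights stand in for trained ones):

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    hidden = np.maximum(0, x @ W1 + b1)          # ReLU, expand 512 -> 2048
    return hidden @ W2 + b2                      # project back, 2048 -> 512

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(512, 2048)) * 0.02, np.zeros(2048)
W2, b2 = rng.normal(size=(2048, 512)) * 0.02, np.zeros(512)

x = rng.normal(size=(3, 512))                    # 3 words
out = feed_forward(x, W1, b1, W2, b2)            # each row processed independently
```

Because the same weights are applied row by row, processing word 1 alone gives exactly the same result as processing it inside the full batch: there is no mixing between positions.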
Step 6 — The Decoder
Generating the output one token at a time
The encoder ran once and produced H — a deep understanding of the French. Now the decoder uses H to generate English, one token per run, feeding each new word back as input for the next step. This is called autoregressive generation.
Autoregressive generation — "je suis étudiant"
Masked self-attention — no peeking: when predicting word N, future positions are set to −∞ before softmax, so they receive 0% attention. This matches how generation actually works: you never know the future.
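The mask is a triangle of −∞ above the diagonal. A minimal NumPy sketch for 4 positions, using uniform raw scores so the masking effect is easy to see:

```python
import numpy as np

n = 4
scores = np.zeros((n, n))                                # stand-in for raw Q·K^T scores
mask = np.triu(np.ones((n, n), dtype=bool), k=1)         # True above diagonal = future
scores[mask] = -np.inf                                   # future positions get -inf

# softmax row by row: exp(-inf) = 0, so future words get exactly 0% attention
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
# row 0 attends only to itself; row i spreads attention over positions 0..i
```

With uniform scores, row 0 is [1, 0, 0, 0] and row 1 is [0.5, 0.5, 0, 0]: each position shares attention only over itself and the past.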
Cross-attention — the bridge: Q comes from the decoder; K and V come from the encoder output H. The decoder asks: which French words matter for my current English word?
When generating "student":
decoder_query → cross-attention → encoder H
Score(query vs "je") = 0.1 → 10%
Score(query vs "suis") = 0.1 → 10%
Score(query vs "étudiant") = 0.8 → 80%
Nobody programmed this translation rule. The model learned to align "étudiant" with "student" purely by seeing millions of French-English sentence pairs during training.
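Cross-attention reuses the same machinery as self-attention, only the inputs differ. A minimal NumPy sketch with random weights: the query comes from one decoder state, keys and values come from a 3-word encoder output H:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
H = rng.normal(size=(3, 512))                    # encoder output: 3 French words
dec = rng.normal(size=(1, 512))                  # current decoder state

W_Q, W_K, W_V = (rng.normal(size=(512, 64)) * 0.02 for _ in range(3))
Q = dec @ W_Q                                    # query from the DECODER
K, V = H @ W_K, H @ W_V                          # keys and values from the ENCODER
weights = softmax(Q @ K.T / np.sqrt(64))         # which French words matter?
context = weights @ V                            # [1, 64] blended encoder content
```

With trained weights, `weights` would concentrate on the aligned source word (e.g. "étudiant" when generating "student"); here the weights are arbitrary but still form a valid distribution.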
Step 7 — Output Layer
Vector → Word
After 6 decoder layers, we have a 512-number vector. Two operations turn it into an actual word from the vocabulary.
1. Linear projection: 512 → 50,000. Multiply by W_out [512 × 50,000] to get one raw score (logit) per word in the vocabulary. The highest logit corresponds to the most likely next word.
2. Softmax: logits → probabilities. P(word_j) = exp(logit_j) / Σ_k exp(logit_k). All values become positive and sum to 1. Pick the word with the highest probability.
Example output logits → probabilities:
logit("I") = 1.2 → P = 12%
logit("am") = 0.8 → P = 8%
logit("student") = 3.9 → P = 68%← pick this
logit("cat") =−1.2 → P = 2%
logit("pizza") =−2.4 → P = 1%
→ output: "student" ✓
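The arithmetic of the example can be checked directly. A minimal NumPy sketch over the five example logits (the percentages in the text are rounded illustrations; a real model's softmax runs over all 50,000 logits, so exact values differ):

```python
import numpy as np

words = ["I", "am", "student", "cat", "pizza"]
logits = np.array([1.2, 0.8, 3.9, -1.2, -2.4])   # raw scores from the linear layer

probs = np.exp(logits - logits.max())            # softmax, numerically stable
probs /= probs.sum()                             # all positive, sums to 1

prediction = words[int(np.argmax(probs))]        # greedy decoding: pick the top word
```

Greedy argmax is the simplest decoding rule; real systems often sample from `probs` or use beam search instead.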
Bonus — Training
How the model learns everything
Every weight matrix (W_e, W_Q, W_K, W_V, W1, W2, W_out…) starts as random numbers. Training adjusts all of them simultaneously using a single objective: predict the next word correctly.
Training loop
1. Forward pass. Feed a French-English sentence pair through the entire Transformer.
2. Compute loss. L = −log(P(correct_word)). If the model was 99% sure of the right word, loss ≈ 0.01. If 0.1% sure, loss ≈ 6.9. Large loss = bad prediction.
3. Backpropagation. Compute ∂L/∂W for every weight in the model: how much did each weight contribute to the error?
4. Adam optimizer update. W = W − lr × m̂ / (√v̂ + ε), the momentum-smoothed gradient m̂ scaled by a running average v̂ of squared gradients. Nudge every weight slightly in the direction that reduces loss. Repeat billions of times.
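The loss numbers in step 2 are easy to verify. A minimal NumPy sketch of the cross-entropy loss on a toy two-word vocabulary:

```python
import numpy as np

def cross_entropy(probs, correct_idx):
    """L = -log(P(correct word)): near 0 when confident and right, large when wrong."""
    return -np.log(probs[correct_idx])

confident = cross_entropy(np.array([0.99, 0.01]), 0)   # model 99% sure, correct
unsure = cross_entropy(np.array([0.001, 0.999]), 0)    # model 0.1% sure, correct word
```

`confident` comes out near 0.01 and `unsure` near 6.9, matching the figures in step 2: confidence in the right answer is barely penalised, while near-certainty in the wrong one is punished hard.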
Training loss curve (train loss vs. val loss)
Loss function: cross-entropy. L = −log(P_correct). Punishes confident wrong answers severely.
Optimizer: Adam. Adaptive per-parameter learning rates with momentum; the standard choice for Transformers.
All knowledge from one signal. Grammar, facts, reasoning: all emerge from just "predict the next word correctly."
The magic of emergence: Nobody programs "étudiant means student." The model sees millions of French-English pairs and discovers the correspondence purely by minimising prediction error. All linguistic knowledge is implicit in the weights.
Meaning is context. Context is statistics. Statistics become geometry. Geometry enables intelligence.
Reference — Cheatsheet
All steps at a glance
Step | Component | What it does | Key formula
1 | Embeddings | Word → 512-dim vector via lookup in W [50K × 512] | E = one_hot × W
2 | Pos Encoding | Add sin/cos position fingerprint to every vector | sin(pos / 10000^(2i/512))
3 | Self-Attention | Each word gathers info from all others, weighted by relevance | softmax(Q·K^T/√d_k)·V
4 | Add & Norm | Preserve signal + stabilise scale | LN(x + sublayer(x))
5 | Feed-Forward | Per-word deep processing: 512 → 2048 → 512 | ReLU(xW1 + b1)W2 + b2
6 | Decoder | Masked self-attn + cross-attn; generates one token per run | Q←dec, KV←enc
7 | Output | 512 → 50K logits → softmax → pick word | P = softmax(x·W_out)
★ | Training | Minimise cross-entropy loss via Adam over billions of examples | L = −log(P_correct)