Transformer Architecture · Complete Guide

The Transformer Architecture

From raw words to output tokens — every component explained from first principles. No prior ML knowledge assumed.


00 — Overview

The Big Picture

The Transformer is a sequence-to-sequence model: it takes an input sequence and produces an output sequence, in this guide a French sentence translated into English. It has two halves: an Encoder that reads and understands the input, and a Decoder that writes the output word by word.

Encoder — runs once
Input embedding: words → vectors
+ Pos encoding: inject order
Self-attention: ×6 layers
Add & Norm: stabilise
Feed-Forward: 512→2048→512
Add & Norm: stabilise
Output: H (context vectors)
Decoder — runs per token
Output embedding: prev tokens
+ Pos encoding: inject order
Masked self-attn: no peeking
Add & Norm: stabilise
Cross-attention: Q←dec, KV←enc
Add & Norm: stabilise
Feed-Forward: ×6 layers
Linear + Softmax → word
Mental model: Encoder = a reader deeply absorbing a French sentence. Decoder = a writer producing English words one at a time, consulting what the reader understood.

Step 1 — Embeddings

Words become numbers

Computers cannot understand words. We must convert each word into a list of 512 numbers — a vector. We do this using a learned lookup table called the Embedding Matrix W.

Matrix W shape
50K × 512
50,000 rows (vocab) × 512 columns (dims)
Each word gets
1 row
Its unique 512-number identity vector
Values after training
semantic
Similar words end up with similar rows
1
One-hot encoding
Every word gets a unique position in a dictionary of 50,000 words. "cat" = position 1. Its one-hot vector is 49,999 zeros and one 1 at position 1.
2
Matrix lookup
Multiply the one-hot vector by W. This just picks out the corresponding row — like reading row 1 from a giant spreadsheet. Result: a 512-number vector.
3
Semantic geometry emerges from training
After training on billions of words, similar-meaning words cluster together in the 512-dimensional space. "king" – "man" + "woman" ≈ "queen" — purely from the numbers.
W shape: [50,000 × 512] ← starts random, learned during training

To embed "cat" (position 1):
one_hot = [0, 1, 0, 0, … 0] ← 50,000 numbers
embedding = one_hot × W = W[1] ← just row 1

Result X: table of shape [num_words × 512]
Why 512 dimensions? Powers of 2 run fast on GPUs. 512 is rich enough to capture meaning but small enough to train on 2017 hardware. GPT-3 uses 12,288 dimensions.
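A minimal NumPy sketch of the lookup (toy values: the random initialisation and the index of "cat" are illustrative, and real implementations index the row directly instead of multiplying by a one-hot vector):

import numpy as np

vocab_size, d_model = 50_000, 512
W = np.random.randn(vocab_size, d_model) * 0.02   # starts random, learned during training

cat_id = 1                                        # "cat" = position 1 in the vocabulary
one_hot = np.zeros(vocab_size)
one_hot[cat_id] = 1.0

embedding = one_hot @ W                           # picks out row 1 of W
assert np.allclose(embedding, W[cat_id])          # identical to reading the row directly

token_ids = np.array([1, 7, 42])                  # a 3-word sentence as vocabulary indices
X = W[token_ids]                                  # result X: shape [num_words × 512]
print(X.shape)                                    # (3, 512)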

Step 2 — Positional Encoding

Injecting word order

After embedding, the model sees all words simultaneously with no sense of order. "cat ate fish" and "fish ate cat" look identical. We fix this by adding a unique position fingerprint to every word vector using sine and cosine waves.

Positional encoding formula
PE(pos, 2i)   = sin( pos / 10000^(2i/512) )
PE(pos, 2i+1) = cos( pos / 10000^(2i/512) )
Why sin+cos pairs?
Unit circle
sin alone repeats, so two distant positions can look identical. A (sin, cos) pair at the same frequency is a single point on the unit circle (sin²+cos²=1), which pins down the position within that clock's period; the 256 clocks together make every position unique.
Why 256 pairs?
512 ÷ 2
512 dims / 2 = 256 clocks. Each pair has a different frequency. Fast pairs → nearby positions. Slow pairs → distant positions. All distances covered.
⊕ means element-wise addition:

word_embedding (512 nums)
+ pos_encoding (512 nums)
─────────────────────
final_vector    (512 nums) = WHAT it means + WHERE it sits
The ⊕ symbol in the Transformer diagram just means addition — we literally add the positional encoding numbers to the embedding numbers, element by element.
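A minimal NumPy sketch of the two formulas and the ⊕ step (the 3-word sentence and its random embeddings are illustrative):

import numpy as np

def positional_encoding(num_positions, d_model=512):
    pos = np.arange(num_positions)[:, None]        # [num_positions, 1]
    i = np.arange(d_model // 2)[None, :]           # [1, 256], one frequency ("clock") per pair
    angles = pos / (10000 ** (2 * i / d_model))    # pos / 10000^(2i/512)
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions: sin
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions: cos
    return pe

X = np.random.randn(3, 512)                        # embeddings of a 3-word sentence
X = X + positional_encoding(3)                     # ⊕: element-wise addition, WHAT + WHERE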

Step 3 — Self-Attention

Words looking at each other

After steps 1–2, each word knows its own meaning and position — but nothing about the other words. Attention lets every word ask: which other words in this sentence are relevant to understanding me?

Q — Query
What am I?
"Which other words matter to me right now?"
K — Key
What do I offer?
Every word's "label" — how it might be relevant to others.
V — Value
What's inside me?
The actual content shared once a word is "selected."
The attention formula
Attention(Q,K,V) = softmax( Q·K^T / √d_k ) · V

Q · K^T   → score every query vs every key (how relevant?)
/ √d_k    → scale down (prevents softmax from saturating)
softmax() → convert scores to weights summing to 1.0
· V       → blend values weighted by attention scores
Example: attention weights for "je suis étudiant"
"je" attends most to itself, some to "suis" (the verb it is the subject of).
The 5 stages of one attention head:
1
Create Q, K, V
Multiply word vector x (512 dims) by three learned weight matrices W_Q, W_K, W_V — each [512×64] — producing 64-dim vectors Q, K, V for every word.
2
Compute raw scores
Score(i→j) = Q_i · K_j — the dot product measures how relevant word j is to word i. Higher = more relevant.
3
Scale by √64 = 8
Dividing by √d_k prevents the scores from growing too large, which would push softmax to extreme values and kill the gradients.
4
Softmax → attention weights
Converts scores to probabilities (all positive, each row sums to 1.0). These are the "percentages" of attention paid to each word.
5
Weighted blend of Values
Multiply each V by its weight and sum up. The result for word i now contains borrowed information from all words it attended to — weighted by relevance.
Multi-head attention (8 heads × 64 dims = 512): Run the entire Q/K/V process 8 times in parallel with different weight matrices. Each head learns different relationships — syntax, coreference, semantics. Concatenate and project back to 512.
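The five stages above as a minimal NumPy sketch (the weight matrices are random stand-ins for learned ones, and only a single head is shown):

import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d_model, d_k = 512, 64
X = np.random.randn(3, d_model)                    # 3 words, e.g. "je suis étudiant"

W_Q = np.random.randn(d_model, d_k) * 0.02         # stage 1: learned projections
W_K = np.random.randn(d_model, d_k) * 0.02
W_V = np.random.randn(d_model, d_k) * 0.02
Q, K, V = X @ W_Q, X @ W_K, X @ W_V                # each [3 × 64]

scores = Q @ K.T / np.sqrt(d_k)                    # stages 2-3: score every pair, then scale
weights = softmax(scores)                          # stage 4: each row sums to 1.0
output = weights @ V                               # stage 5: weighted blend of values
print(weights.shape, output.shape)                 # (3, 3) (3, 64)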

Step 4 — Add & Norm

Stability in deep networks

Applied after every sub-layer (attention and FFN): two operations work together to keep the network stable and trainable at 6 layers deep.

ADD — residual connection
x + f(x)
Adds the original input back to the sub-layer output. Each layer only learns the correction, not the whole representation. Prevents vanishing gradients.
NORM — layer normalisation
(x − μ)/σ
Rescales every word's 512 values to have mean=0 and std=1, then applies learned γ and β. Prevents exploding/collapsing activations.
The complete formula
output = LayerNorm( x + sublayer(x) )

sublayer can be attention OR feed-forward
LN(x) = γ × (x − μ) / √(σ² + ε) + β
μ = mean of the 512 values
σ = standard deviation
γ, β = learned scale and shift (both start at 1, 0)
Numerical example of Layer Norm:

x = [2.0, 4.0, 1.0, 3.0]
μ = (2+4+1+3)/4 = 2.5
σ = √mean[(x−μ)²] = √1.25 = 1.118
x_norm = (x − 2.5) / 1.118
       = [−0.45, +1.34, −1.34, +0.45]
→ mean=0, std=1 ✓
Mental model for ADD: An editor marks corrections in red pen — the original essay stays intact, only changes are layered on top. The network learns adjustments, not transformations from scratch.
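The same arithmetic as a minimal NumPy sketch, followed by the full Add & Norm pattern (the pretend sub-layer output is illustrative):

import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    mu = x.mean()                                  # mean of the values
    var = ((x - mu) ** 2).mean()                   # variance (σ²)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

x = np.array([2.0, 4.0, 1.0, 3.0])
print(layer_norm(x))                               # ≈ [-0.447  1.342 -1.342  0.447]

sublayer_out = np.array([0.1, -0.2, 0.3, 0.0])     # pretend attention/FFN output
out = layer_norm(x + sublayer_out)                 # output = LayerNorm(x + sublayer(x))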

Step 5 — Feed-Forward Network

Individual deep thinking

After attention (words sharing information), each word processes what it learned — alone. No mixing between positions. Each word gets its own two-layer neural network applied independently.

FFN formula
FFN(x) = ReLU( x·W1 + b1 ) · W2 + b2

Shapes:
W1: [512 × 2048] b1: [2048] ← EXPAND (4×)
W2: [2048 × 512] b2: [512] ← CONTRACT back
Input 512 → Hidden 2048 → Output 512
ReLU
max(0, x)
Kills negative values. Breaks linearity so stacked layers genuinely learn different things. Without it, 6 layers collapse into 1.
Why expand to 2048?
4× working space
Like spreading books on a large table before organising — more space means richer, more nuanced processing before packing back to 512.
Attention = group discussion (words talk to each other)
FFN      = individual private study (each word alone)

ReLU examples:
ReLU( 2.5) = 2.5 ← positive → keep
ReLU(−0.3) = 0.0 ← negative → kill
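A minimal NumPy sketch of the expand → ReLU → contract pattern (random weights stand in for the learned W1, W2):

import numpy as np

d_model, d_ff = 512, 2048
W1, b1 = np.random.randn(d_model, d_ff) * 0.02, np.zeros(d_ff)      # expand: 512 → 2048
W2, b2 = np.random.randn(d_ff, d_model) * 0.02, np.zeros(d_model)   # contract: 2048 → 512

def ffn(x):
    hidden = np.maximum(0, x @ W1 + b1)            # ReLU: keep positives, kill negatives
    return hidden @ W2 + b2                        # back to 512

X = np.random.randn(3, d_model)                    # 3 words
out = ffn(X)                                       # applied to each word independently
print(out.shape)                                   # (3, 512)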

Step 6 — The Decoder

Generating words one at a time

The encoder ran once and produced H — a deep understanding of the French. Now the decoder uses H to generate English, one token per run, feeding each new word back as input for the next step. This is called autoregressive generation.

Autoregressive generation — "je suis étudiant"
Masked self-attention
No peeking
When predicting word N, future positions are set to −∞ before softmax → 0% attention. Matches how generation actually works: you never know the future.
Cross-attention
The bridge
Q comes from the decoder. K and V come from the encoder output H. The decoder asks: which French words matter for my current English word?
When generating "student":

decoder_query → cross-attention → encoder H

Score(query vs "je") = 0.1 → 10%
Score(query vs "suis") = 0.1 → 10%
Score(query vs "étudiant") = 0.8 → 80%

output = 0.80×V_étudiant + 0.10×V_suis + 0.10×V_je
→ generates "student" ✓
Nobody programmed this translation rule. The model learned to align "étudiant" with "student" purely by seeing millions of French-English sentence pairs during training.
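A minimal NumPy sketch of the "no peeking" mask from masked self-attention (the scores are random stand-ins; the point is what the mask and softmax do to them):

import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n = 4                                              # tokens generated so far
scores = np.random.randn(n, n)                     # decoder self-attention scores

mask = np.triu(np.ones((n, n), dtype=bool), k=1)   # True above the diagonal = future positions
scores[mask] = -np.inf                             # future positions → −∞ before softmax
weights = softmax(scores)                          # −∞ becomes exactly 0% attention
print(np.round(weights, 2))                        # upper triangle is all zeros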

Step 7 — Output Layer

Vector → Word

After 6 decoder layers, we have a 512-number vector. Two operations turn it into an actual word from the vocabulary.

1
Linear projection: 512 → 50,000
Multiply by W_out [512×50,000]. One raw score (logit) per word in vocabulary. The highest logit corresponds to the most likely next word.
2
Softmax: logits → probabilities
P(word_j) = exp(logit_j) / Σ exp(logit_k). All values become positive and sum to 1. Pick the word with highest probability.
Example output logits → probabilities:

logit("I")       = 1.2 → P = 12%
logit("am")      = 0.8 → P = 8%
logit("student") = 3.9 → P = 68% ← pick this
logit("cat")     =−1.2 → P = 2%
logit("pizza")   =−2.4 → P = 1%

→ output: "student" ✓

Bonus — Training

How the model learns everything

Every weight matrix (W_e, W_Q, W_K, W_V, W1, W2, W_out…) starts as random numbers. Training adjusts all of them simultaneously using a single objective: predict the next word correctly.

Training loop
1
Forward pass
Feed a French-English sentence pair through the entire Transformer.
2
Compute loss
L = −log(P(correct_word)). If the model was 99% sure of the right word, loss ≈ 0.01. If 0.1% sure, loss = 6.9. Large loss = bad prediction.
3
Backpropagation
Compute ∂L/∂W for every weight in the model — how much did each weight contribute to the error?
4
Adam optimizer update
W = W − lr × (gradient average / √(squared-gradient average)), a simplified view of Adam's update. Nudge every weight slightly in the direction that reduces loss. Repeat billions of times.
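A quick numeric check of the loss formula from step 2 (the two confidence levels are the ones quoted above):

import numpy as np

p_correct = np.array([0.99, 0.001])                # 99% sure vs. 0.1% sure of the right word
loss = -np.log(p_correct)                          # cross-entropy: L = −log(P(correct_word))
print(np.round(loss, 2))                           # [0.01 6.91]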
Training loss curve (train loss and val loss)
Loss function
Cross-entropy
L = −log(P_correct). Punishes wrong answers exponentially.
Optimizer
Adam
Adaptive per-parameter learning rates with momentum. The standard optimizer for Transformers.
All knowledge from
one signal
Grammar, facts, reasoning — all emerge from just "predict next word correctly."
The magic of emergence: Nobody programs "étudiant means student." The model sees millions of French-English pairs and discovers the correspondence purely by minimising prediction error. All linguistic knowledge is implicit in the weights.

Meaning is context. Context is statistics. Statistics become geometry. Geometry enables intelligence.
Reference — Cheatsheet

All steps at a glance

Step | Component | What it does | Key formula
1 | Embeddings | Word → 512-dim vector via lookup in W [50K × 512] | E = one_hot × W
2 | Pos Encoding | Add sin/cos position fingerprint to every vector | sin(pos / 10000^(2i/512))
3 | Self-Attention | Each word gathers info from all others, weighted by relevance | softmax(QK^T/√d_k)·V
4 | Add & Norm | Preserve signal + stabilise scale | LN(x + sublayer(x))
5 | Feed-Forward | Per-word deep processing: 512→2048→512 | ReLU(xW1+b1)W2+b2
6 | Decoder | Masked self-attn + cross-attn; generates one token per run | Q←dec, KV←enc
7 | Output | 512 → 50K logits → softmax → pick word | P = softmax(x·W_out)
Bonus | Training | Minimise cross-entropy loss via Adam over billions of examples | L = −log(P_correct)