The Transformer is a sequence-to-sequence model — it takes a sequence in one language and produces one in another. It has two halves: an Encoder that reads and understands the input, and a Decoder that writes the output word by word.
Encoder — runs once
Input embedding: words → vectors
+ Pos encoding: inject order
Self-attention: ×6 layers
Add & Norm: stabilise
Feed-Forward: 512 → 2048 → 512
Add & Norm: stabilise
Output: H (context vectors)
Decoder — runs per token
Output embedding: prev tokens
+ Pos encoding: inject order
Masked self-attn: no peeking
Add & Norm: stabilise
Cross-attention: Q←dec, KV←enc
Add & Norm: stabilise
Feed-Forward: ×6 layers
Linear + Softmax: → word
Mental model: Encoder = a reader deeply absorbing a French sentence. Decoder = a writer producing English words one at a time, consulting what the reader understood.
Step 1 — Embeddings
Words become numbers
Computers cannot understand words. We must convert each word into a list of 512 numbers — a vector. We do this using a learned lookup table called the Embedding Matrix W.
Matrix W shape: 50K × 512. 50,000 rows (vocab) × 512 columns (dims).
Each word gets 1 row: its unique 512-number identity vector.
Values after training: semantic. Similar words end up with similar rows.
1. One-hot encoding. Every word gets a unique position in a dictionary of 50,000 words. "cat" = position 1. Its one-hot vector is 49,999 zeros and one 1 at position 1.
2. Matrix lookup. Multiply the one-hot vector by W. This just picks out the corresponding row, like reading row 1 from a giant spreadsheet. Result: a 512-number vector.
3. Semantic geometry emerges from training. After training on billions of words, similar-meaning words cluster together in the 512-dimensional space. "king" − "man" + "woman" ≈ "queen", purely from the numbers.
W shape: [50,000 × 512] ← starts random, learned during training
To embed "cat" (position 1):
one_hot   = [0, 1, 0, 0, …, 0]   ← 50,000 numbers
embedding = one_hot × W = W[1]   ← just row 1
Result X: table of shape [num_words × 512]
Why 512 dimensions? Powers of 2 run fast on GPUs. 512 is rich enough to capture meaning but small enough to train on 2017 hardware. GPT-3 uses 12,288 dimensions.
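The lookup above can be checked numerically. A minimal sketch with NumPy; the random W here stands in for a trained embedding matrix:

```python
import numpy as np

vocab_size, d_model = 50_000, 512
rng = np.random.default_rng(0)
W = rng.normal(size=(vocab_size, d_model))   # embedding matrix; learned during real training

cat_id = 1                                   # "cat" = position 1 in the vocabulary
one_hot = np.zeros(vocab_size)
one_hot[cat_id] = 1.0

embedding = one_hot @ W                      # multiplying by a one-hot vector...
# ...is exactly the same as reading out row 1 of W directly: W[cat_id]
```

In practice no framework materialises the one-hot vector; the multiplication is implemented as a direct row lookup.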
Step 2 — Positional Encoding
Injecting word order
After embedding, the model sees all words simultaneously with no sense of order. "cat ate fish" and "fish ate cat" look identical. We fix this by adding a unique position fingerprint to every word vector using sine and cosine waves.
sin alone repeats, so two positions can look identical. The pair (sin, cos) at the same frequency pins down a unique point on the unit circle within each period (sin² + cos² = 1 keeps the point on the circle), and combining 256 different frequencies makes every position distinct.
Why 256 pairs? 512 dims ÷ 2 = 256 clocks, each pair running at a different frequency. Fast pairs distinguish nearby positions; slow pairs distinguish distant ones. All distances covered.
⊕ means element-wise addition:
word_embedding (512 nums)
+ pos_encoding (512 nums)
─────────────────────
final_vector (512 nums) = WHAT it means + WHERE it sits
The ⊕ symbol in the Transformer diagram just means addition — we literally add the positional encoding numbers to the embedding numbers, element by element.
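The sinusoidal fingerprint can be written out directly. A minimal NumPy sketch of the original formula, PE(pos, 2i) = sin(pos/10000^(2i/512)) and PE(pos, 2i+1) = cos(...):

```python
import numpy as np

def positional_encoding(num_positions, d_model=512):
    pos = np.arange(num_positions)[:, None]        # [num_positions, 1]
    i = np.arange(d_model // 2)[None, :]           # [1, 256] frequency index
    angles = pos / (10000 ** (2 * i / d_model))    # one "clock" per sin/cos pair
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dims: sine
    pe[:, 1::2] = np.cos(angles)                   # odd dims: cosine
    return pe

pe = positional_encoding(10)                       # fingerprints for 10 positions
# final input = word_embedding + pe, element by element (the ⊕ in the diagram)
```

Every value stays in [−1, 1], so the fingerprint perturbs the embedding without drowning it out.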
Step 3 — Self-Attention
Words looking at each other
After steps 1–2, each word knows its own meaning and position — but nothing about the other words. Attention lets every word ask: which other words in this sentence are relevant to understanding me?
Q — Query: what am I looking for? "Which other words matter to me right now?"
K — Key: what do I offer? Every word's "label": how it might be relevant to others.
V — Value: what's inside me? The actual content shared once a word is "selected."
The attention formula
Attention(Q, K, V) = softmax(Q·K^T / √d_k) · V
Q·K^T → score every query against every key (how relevant?)
/ √d_k → scale down (prevents softmax from saturating)
softmax() → convert scores to weights summing to 1.0
· V → blend values weighted by attention scores
Example sentence: "je suis étudiant".
"je" attends most to itself, some to "suis" (verb it's the subject of).
The 5 stages of one attention head:
1. Create Q, K, V. Multiply word vector x (512 dims) by three learned weight matrices W_Q, W_K, W_V, each [512 × 64], producing 64-dim vectors Q, K, V for every word.
2. Compute raw scores. Score(i→j) = Q_i · K_j: the dot product measures how relevant word j is to word i. Higher = more relevant.
3. Scale by √64 = 8. Dividing by √d_k prevents the scores from growing too large, which would push softmax to extreme values and kill the gradients.
4. Softmax → attention weights. Converts scores to probabilities (all positive, each row sums to 1.0). These are the "percentages" of attention paid to each word.
5. Weighted blend of Values. Multiply each V by its weight and sum. The result for word i now contains borrowed information from every word it attended to, weighted by relevance.
Multi-head attention (8 heads × 64 dims = 512): Run the entire Q/K/V process 8 times in parallel with different weight matrices. Each head learns different relationships — syntax, coreference, semantics. Concatenate and project back to 512.
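The five stages of a single head fit in a few lines. A minimal NumPy sketch for 3 words (random weight matrices stand in for trained ones):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V          # stage 1: project to 64 dims
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # stages 2-3: score and scale
    weights = softmax(scores, axis=-1)           # stage 4: rows sum to 1
    return weights @ V, weights                  # stage 5: weighted blend of values

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 512))                    # 3 word vectors
W_Q, W_K, W_V = (rng.normal(size=(512, 64)) * 0.02 for _ in range(3))
out, weights = attention_head(X, W_Q, W_K, W_V)  # out: [3, 64]
```

Multi-head attention simply runs this function 8 times with different weight matrices and concatenates the 8 outputs of 64 dims back into 512.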
Step 4 — Add & Norm
Stability in deep networks
Applied after every sub-layer (attention and FFN). Two operations working together to keep the network stable and trainable at 6 layers deep.
ADD — residual connection: x + f(x). Adds the original input back to the sub-layer output. Each layer only learns the correction, not the whole representation. Prevents vanishing gradients.
NORM — layer normalisation: (x − μ)/σ. Rescales every word's 512 values to mean 0 and std 1, then applies learned γ and β. Prevents exploding or collapsing activations.
The complete formula
output = LayerNorm( x + sublayer(x) )
sublayer can be attention OR feed-forward
LN(x) = γ × (x − μ) / √(σ² + ε) + β
μ = mean of the 512 values, σ = standard deviation
γ, β = learned scale and shift (initialised to 1 and 0)
Mental model for ADD: An editor marks corrections in red pen — the original essay stays intact, only changes are layered on top. The network learns adjustments, not transformations from scratch.
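Both operations together are a one-liner. A minimal NumPy sketch of the complete formula, with γ and β left at their initial values of 1 and 0:

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)          # μ over each word's 512 values
    var = x.var(axis=-1, keepdims=True)          # σ² over the same values
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def add_and_norm(x, sublayer_out):
    return layer_norm(x + sublayer_out)          # ADD (residual), then NORM

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 512))                    # 3 word vectors
y = add_and_norm(x, rng.normal(size=(3, 512)))   # every row: mean ≈ 0, std ≈ 1
```

Whatever scale the sub-layer output had, each word vector leaves with mean 0 and std 1, which keeps activations well-behaved across all 6 layers.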
Step 5 — Feed-Forward Network
Individual deep thinking
After attention (words sharing information), each word processes what it learned — alone. No mixing between positions. Each word gets its own two-layer neural network applied independently.
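The per-word network is just two matrix multiplies with a ReLU between them. A minimal NumPy sketch of the 512 → 2048 → 512 expansion (random weights stand in for trained ones):

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    hidden = np.maximum(0, x @ W1 + b1)          # ReLU, expand 512 -> 2048
    return hidden @ W2 + b2                      # project back, 2048 -> 512

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(512, 2048)) * 0.02, np.zeros(2048)
W2, b2 = rng.normal(size=(2048, 512)) * 0.02, np.zeros(512)

x = rng.normal(size=(3, 512))                    # 3 words
out = feed_forward(x, W1, b1, W2, b2)            # each row processed independently
```

Because the same weights are applied row by row, processing word 1 alone gives exactly the same result as processing it inside the full batch: there is no mixing between positions.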
Step 6 — The Decoder
Generating the output one token at a time
The encoder ran once and produced H — a deep understanding of the French. Now the decoder uses H to generate English, one token per run, feeding each new word back as input for the next step. This is called autoregressive generation.
Autoregressive generation — "je suis étudiant"
Masked self-attention — no peeking: when predicting word N, future positions are set to −∞ before softmax, so they receive 0% attention. This matches how generation actually works: you never know the future.
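The mask is a triangle of −∞ above the diagonal. A minimal NumPy sketch for 4 positions, using uniform raw scores so the masking effect is easy to see:

```python
import numpy as np

n = 4
scores = np.zeros((n, n))                                # stand-in for raw Q·K^T scores
mask = np.triu(np.ones((n, n), dtype=bool), k=1)         # True above diagonal = future
scores[mask] = -np.inf                                   # future positions get -inf

# softmax row by row: exp(-inf) = 0, so future words get exactly 0% attention
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
# row 0 attends only to itself; row i spreads attention over positions 0..i
```

With uniform scores, row 0 is [1, 0, 0, 0] and row 1 is [0.5, 0.5, 0, 0]: each position shares attention only over itself and the past.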
Cross-attention — the bridge: Q comes from the decoder; K and V come from the encoder output H. The decoder asks: which French words matter for my current English word?
When generating "student":
decoder_query → cross-attention → encoder H
Score(query vs "je") = 0.1 → 10%
Score(query vs "suis") = 0.1 → 10%
Score(query vs "étudiant") = 0.8 → 80%
Nobody programmed this translation rule. The model learned to align "étudiant" with "student" purely by seeing millions of French-English sentence pairs during training.
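Cross-attention reuses the same machinery as self-attention, only the inputs differ. A minimal NumPy sketch with random weights: the query comes from one decoder state, keys and values come from a 3-word encoder output H:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
H = rng.normal(size=(3, 512))                    # encoder output: 3 French words
dec = rng.normal(size=(1, 512))                  # current decoder state

W_Q, W_K, W_V = (rng.normal(size=(512, 64)) * 0.02 for _ in range(3))
Q = dec @ W_Q                                    # query from the DECODER
K, V = H @ W_K, H @ W_V                          # keys and values from the ENCODER
weights = softmax(Q @ K.T / np.sqrt(64))         # which French words matter?
context = weights @ V                            # [1, 64] blended encoder content
```

With trained weights, `weights` would concentrate on the aligned source word (e.g. "étudiant" when generating "student"); here the weights are arbitrary but still form a valid distribution.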
Step 7 — Output Layer
Vector → Word
After 6 decoder layers, we have a 512-number vector. Two operations turn it into an actual word from the vocabulary.
1. Linear projection: 512 → 50,000. Multiply by W_out [512 × 50,000] to get one raw score (logit) per word in the vocabulary. The highest logit corresponds to the most likely next word.
2. Softmax: logits → probabilities. P(word_j) = exp(logit_j) / Σ_k exp(logit_k). All values become positive and sum to 1. Pick the word with the highest probability.
Example output logits → probabilities:
logit("I") = 1.2 → P = 12%
logit("am") = 0.8 → P = 8%
logit("student") = 3.9 → P = 68%← pick this
logit("cat") =−1.2 → P = 2%
logit("pizza") =−2.4 → P = 1%
→ output: "student" ✓
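The arithmetic of the example can be checked directly. A minimal NumPy sketch over the five example logits (the percentages in the text are rounded illustrations; a real model's softmax runs over all 50,000 logits, so exact values differ):

```python
import numpy as np

words = ["I", "am", "student", "cat", "pizza"]
logits = np.array([1.2, 0.8, 3.9, -1.2, -2.4])   # raw scores from the linear layer

probs = np.exp(logits - logits.max())            # softmax, numerically stable
probs /= probs.sum()                             # all positive, sums to 1

prediction = words[int(np.argmax(probs))]        # greedy decoding: pick the top word
```

Greedy argmax is the simplest decoding rule; real systems often sample from `probs` or use beam search instead.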
Bonus — Training
How the model learns everything
Every weight matrix (W_e, W_Q, W_K, W_V, W1, W2, W_out…) starts as random numbers. Training adjusts all of them simultaneously using a single objective: predict the next word correctly.
Training loop
1. Forward pass. Feed a French-English sentence pair through the entire Transformer.
2. Compute loss. L = −log(P(correct_word)). If the model was 99% sure of the right word, loss ≈ 0.01. If 0.1% sure, loss ≈ 6.9. Large loss = bad prediction.
3. Backpropagation. Compute ∂L/∂W for every weight in the model: how much did each weight contribute to the error?
4. Adam optimizer update. W = W − lr × m̂ / (√v̂ + ε), the momentum-smoothed gradient m̂ scaled by a running average v̂ of squared gradients. Nudge every weight slightly in the direction that reduces loss. Repeat billions of times.
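The loss numbers in step 2 are easy to verify. A minimal NumPy sketch of the cross-entropy loss on a toy two-word vocabulary:

```python
import numpy as np

def cross_entropy(probs, correct_idx):
    """L = -log(P(correct word)): near 0 when confident and right, large when wrong."""
    return -np.log(probs[correct_idx])

confident = cross_entropy(np.array([0.99, 0.01]), 0)   # model 99% sure, correct
unsure = cross_entropy(np.array([0.001, 0.999]), 0)    # model 0.1% sure, correct word
```

`confident` comes out near 0.01 and `unsure` near 6.9, matching the figures in step 2: confidence in the right answer is barely penalised, while near-certainty in the wrong one is punished hard.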
Training loss curve (train loss vs. val loss)
Loss function: cross-entropy. L = −log(P_correct). Punishes confident wrong answers severely.
Optimizer: Adam. Adaptive per-parameter learning rates with momentum; the standard choice for Transformers.
All knowledge from one signal. Grammar, facts, reasoning: all emerge from just "predict the next word correctly."
The magic of emergence: Nobody programs "étudiant means student." The model sees millions of French-English pairs and discovers the correspondence purely by minimising prediction error. All linguistic knowledge is implicit in the weights.
Meaning is context. Context is statistics. Statistics become geometry. Geometry enables intelligence.
Reference — Cheatsheet
All steps at a glance
Step | Component | What it does | Key formula
1 | Embeddings | Word → 512-dim vector via lookup in W [50K × 512] | E = one_hot × W
2 | Pos Encoding | Add sin/cos position fingerprint to every vector | sin(pos / 10000^(2i/512))
3 | Self-Attention | Each word gathers info from all others, weighted by relevance | softmax(Q·K^T/√d_k)·V
4 | Add & Norm | Preserve signal + stabilise scale | LN(x + sublayer(x))
5 | Feed-Forward | Per-word deep processing: 512 → 2048 → 512 | ReLU(xW1 + b1)W2 + b2
6 | Decoder | Masked self-attn + cross-attn; generates one token per run | Q←dec, KV←enc
7 | Output | 512 → 50K logits → softmax → pick word | P = softmax(x·W_out)
★ | Training | Minimise cross-entropy loss via Adam over billions of examples | L = −log(P_correct)