Inside the Transformer
Demystify the Transformer architecture — the engine behind every modern LLM. Understand the forward pass, attention mechanisms, generation strategies, and the architectural innovations that enable models like Llama 3.
Why take this course?
Every modern LLM is a Transformer. This course opens the black box: autoregressive generation, self-attention, multi-head attention, decoding strategies, KV caching, positional encodings (RoPE), Flash Attention, grouped-query attention, and the modern architectural improvements that make state-of-the-art models possible.
Prerequisites
This course builds on concepts from the following courses; we recommend completing them first:
Course Modules
Demystify how Transformers work: from autoregressive token-by-token generation, through the three-component architecture (tokenizer, Transformer block stack, LM head), to the core mechanism of self-attention and the KV cache optimization.
Learning Goals
- Explain autoregressive generation and why LLMs produce one token at a time.
- Describe the three components of a Transformer LLM: the tokenizer, the stack of Transformer blocks, and the LM head.
- Compare greedy decoding versus sampling strategies and the role of temperature.
- Understand how the KV cache speeds up generation and why it limits context length.
- Explain self-attention: Query-Key scoring and Value combination.
- Describe multi-head attention and how heads specialize in different patterns.
- Understand positional encodings (RoPE) and why they are necessary.
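The KV cache goal above can be made concrete with a toy sketch. The idea: at each generation step, only the newest token's Key and Value vectors are computed; everything earlier is reused from the cache, so per-step work stays small while cache memory grows linearly with context. The projection matrices `W_k` and `W_v` here are random stand-ins, not real model weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                            # toy head dimension
W_k = rng.normal(size=(d, d))    # hypothetical key projection
W_v = rng.normal(size=(d, d))    # hypothetical value projection

k_cache, v_cache = [], []        # the KV cache: one entry per past token

def step(x_new):
    """Process one new token: project only it, reuse cached K/V for the rest."""
    k_cache.append(x_new @ W_k)
    v_cache.append(x_new @ W_v)
    K = np.stack(k_cache)        # (seq_len, d) -- grows by one row per step
    V = np.stack(v_cache)
    return K, V

for t in range(3):               # "generate" three tokens
    K, V = step(rng.normal(size=d))

# cache length equals tokens seen: memory grows with context length,
# which is exactly why the cache bounds how long a context can get
assert K.shape == (3, d) and V.shape == (3, d)
```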
Concept Card Preview
Visuals, diagrams, and micro-interactions you'll see in this module.
The One-Token-at-a-Time Surprise
In the previous course, we explored how LLMs work at a high level — tokenization, embeddings, and the basics of how mode…
The Three-Component Machine
A Transformer LLM isn't a single thing — it's three separate components working in sequence.
1. Tokenizer — We cove…
Choosing the Next Token: Greedy vs. Sampling
The LM head outputs probabilities for every token. 'Paris' might have 40% probability, 'the' might have 15%, 'France' mi…
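A minimal sketch of the greedy-vs-sampling choice, using made-up logits for the card's 'Paris'/'the'/'France' example. Greedy decoding always takes the argmax; sampling draws from the softmax distribution, and temperature controls how sharp that distribution is.

```python
import math
import random

random.seed(0)

# hypothetical next-token logits from the LM head (illustrative values)
logits = {"Paris": 2.0, "the": 1.0, "France": 0.8, "Lyon": 0.3}

def sample(logits, temperature=1.0):
    """Softmax over logits / temperature, then draw one token."""
    scaled = {t: v / temperature for t, v in logits.items()}
    z = sum(math.exp(v) for v in scaled.values())
    probs = {t: math.exp(v) / z for t, v in scaled.items()}
    tokens, weights = zip(*probs.items())
    return random.choices(tokens, weights=weights)[0]

greedy = max(logits, key=logits.get)   # greedy decoding: always the argmax
assert greedy == "Paris"

# low temperature sharpens the distribution toward the argmax;
# high temperature flattens it, so less likely tokens appear more often
cold = [sample(logits, 0.1) for _ in range(100)]
hot  = [sample(logits, 5.0) for _ in range(100)]
assert cold.count("Paris") >= hot.count("Paris")
```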
Explore the efficiency innovations that power modern LLMs: sparse attention patterns, grouped-query attention (GQA), Flash Attention memory optimization, and architectural improvements like pre-normalization, RMSNorm, and SwiGLU activations.
Learning Goals
- Explain the O(N²) complexity problem of full attention.
- Describe sparse attention patterns and their tradeoffs.
- Compare multi-head, multi-query, and grouped-query attention.
- Explain how Flash Attention reorganizes memory access for 2-4× speedups.
- Describe modern architectural improvements: pre-normalization, RMSNorm, SwiGLU.
- Understand residual connections and why they enable deep networks.
- Explain document packing and how RoPE enables efficient training.
Concept Card Preview
Visuals, diagrams, and micro-interactions you'll see in this module.
The Hidden Cost of Full Attention
Nina wanted to process a 100,000-token document. Her LLM API charges skyrocketed. She dug into the pricing and found som…
Sparse Attention: Looking at Less
If every token attending to every other token is too expensive, what's the solution? **Don't let every token see everyth…
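One common sparse pattern is a sliding window: each token attends only to itself and a fixed number of recent tokens, so per-token cost is O(window) instead of O(N). This is a toy mask construction, not any particular model's implementation.

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """Causal mask where each token sees itself and the previous window-1 tokens."""
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(6, 3)
# token 5 attends to positions 3, 4, 5 only
assert mask[5].tolist() == [False, False, False, True, True, True]
# no row ever attends to more than `window` positions
assert mask.sum(axis=1).max() == 3
```

The tradeoff the card alludes to: distant tokens can no longer interact directly within one layer, so information must propagate across multiple layers.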
Grouped-Query Attention: The Sweet Spot
Multi-query attention is fast but hurts quality. Multi-head is high-quality but slow. Is there a middle ground?
**Group…
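The head-sharing arithmetic behind GQA can be sketched in a few lines. Here 8 query heads share 2 key/value heads (group size 4), so the KV cache shrinks 4× relative to full multi-head attention; the sizes are illustrative, not from any specific model.

```python
import numpy as np

n_q_heads, n_kv_heads, d = 8, 2, 16   # 8 query heads share 2 KV heads
group = n_q_heads // n_kv_heads       # 4 query heads per KV head

rng = np.random.default_rng(0)
q = rng.normal(size=(n_q_heads, d))   # one query vector per head (single token, toy)
k = rng.normal(size=(n_kv_heads, d))  # far fewer key heads to compute and cache

# each query head scores against the KV head assigned to its group
scores = np.array([q[h] @ k[h // group] for h in range(n_q_heads)])
assert scores.shape == (n_q_heads,)

# multi-head would cache 8 KV heads here; GQA caches 2 -- a 4x smaller KV cache
assert n_q_heads // n_kv_heads == 4
```

Multi-query attention is the extreme case `n_kv_heads = 1`; full multi-head is `n_kv_heads = n_q_heads`; GQA sits in between.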
The attention mechanism is the transformer's core innovation. Walk through Q/K/V projections step by step and understand how tokens find and combine information from each other, why causal masking separates decoders from encoders, how residual connections enable deep networks, and what different attention heads learn to specialize in.
Learning Goals
- Explain how Q/K/V projection matrices transform tokens into three different representations.
- Walk through the attention calculation: Q*K^T scoring, softmax, weighted sum of V.
- Understand causal masking and why it distinguishes decoders from encoders.
- Describe residual connections and why they are essential for deep transformer networks.
- Recognize that attention heads specialize in different patterns without explicit programming.
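The causal-masking goal above can be illustrated directly: future positions get a score of negative infinity before the softmax, so they receive exactly zero attention weight. This is a generic sketch of decoder-style masking, with uniform scores used purely for illustration.

```python
import numpy as np

def causal_attention_weights(scores):
    """Mask future positions with -inf before softmax so each token
    attends only to itself and earlier tokens (decoder-style)."""
    n = scores.shape[0]
    allowed = np.tril(np.ones((n, n), dtype=bool))      # lower triangle = past + self
    masked = np.where(allowed, scores, -np.inf)
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

w = causal_attention_weights(np.zeros((4, 4)))   # uniform scores for illustration
# token 0 can only see itself; token 3 averages over all four positions
assert np.allclose(w[0], [1, 0, 0, 0])
assert np.allclose(w[3], [0.25, 0.25, 0.25, 0.25])
```

An encoder simply skips the mask, letting every token attend in both directions.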
Concept Card Preview
Visuals, diagrams, and micro-interactions you'll see in this module.
Three Lenses on Every Token
Module 1 introduced attention as 'tokens finding relevant context.' Now let's open the hood and see exactly how.
Before…

How Tokens Find Each Other: Q*K Scoring
Now that each token has a Query and every token has a Key, how does the model determine relevance?
Dot product. Mul…
From Scores to Weights: Softmax
Raw dot-product scores can be any number — positive, negative, huge, tiny. We need them to be **probabilities that sum t…
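The two cards above -- Q·K scoring and softmax normalization -- plus the final weighted sum of Values fit in one short numpy sketch. The vectors are random toys; the scaling by sqrt(d) is the standard stabilizer from scaled dot-product attention.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))   # shift for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
seq, d = 3, 4
Q = rng.normal(size=(seq, d))   # one query per token
K = rng.normal(size=(seq, d))   # one key per token
V = rng.normal(size=(seq, d))   # one value per token

scores = Q @ K.T / np.sqrt(d)   # raw relevance: any real number, large or small
weights = softmax(scores)       # each row becomes probabilities that sum to 1
out = weights @ V               # each output is a weighted blend of value vectors

assert np.allclose(weights.sum(axis=1), 1.0)
assert out.shape == (seq, d)
```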
The feedforward network stores the transformer's learned knowledge while the attention mechanism retrieves it. Understand the FFN's dual role as memory bank and pattern interpolator, how RoPE gives transformers a sense of position, and practical generation controls (top-k, top-p) that let you tune model output for your application.
Learning Goals
- Explain the feedforward network's role as the transformer's knowledge store (memorization + interpolation).
- Describe how attention and FFN work together in each transformer block.
- Understand why positional encodings (RoPE) are necessary and how they work.
- Configure top-k and top-p sampling for different use cases.
- Trace the complete forward pass from input text to generated output token.
- Read a model card and understand what architectural parameters mean in practice.
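The top-k and top-p goals above come down to two small filters applied to the next-token distribution before sampling. A pure-Python sketch with a made-up distribution:

```python
def top_k_filter(probs, k):
    """Keep only the k most likely tokens, then renormalize."""
    kept = dict(sorted(probs.items(), key=lambda kv: -kv[1])[:k])
    z = sum(kept.values())
    return {t: p / z for t, p in kept.items()}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability >= p."""
    kept, cum = {}, 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept[token] = prob
        cum += prob
        if cum >= p:
            break
    z = sum(kept.values())
    return {t: pr / z for t, pr in kept.items()}

# hypothetical next-token distribution
probs = {"Paris": 0.40, "a": 0.30, "the": 0.15, "France": 0.10, "Lyon": 0.05}

assert set(top_k_filter(probs, 2)) == {"Paris", "a"}
# top-p adapts to the shape: 0.40 + 0.30 + 0.15 = 0.85 >= 0.8, so three survive
assert set(top_p_filter(probs, 0.8)) == {"Paris", "a", "the"}
```

The key difference: top-k keeps a fixed number of candidates regardless of confidence, while top-p keeps more candidates when the model is uncertain and fewer when one token dominates.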
Concept Card Preview
Visuals, diagrams, and micro-interactions you'll see in this module.
The Feedforward Network: The Transformer's Memory Bank
Each transformer block has two components: attention and the feedforward network (FFN). We've covered attention. Now…
Memorization vs. Interpolation: The FFN's Dual Role
The FFN isn't just a lookup table of memorized facts. It plays two distinct roles:
Memorization: Direct factual rec…
Attention + FFN: The Two-Step Dance
Every transformer block performs the same two-step sequence:
Step 1 — Attention: Each token looks at all previous t…
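The two-step sequence can be sketched end to end: attention mixes information across positions, then the FFN transforms each token independently, with residual connections adding each sublayer's output back to its input. All weights here are random stand-ins, and normalization and causal masking are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
seq, d, hidden = 3, 4, 8
x = rng.normal(size=(seq, d))           # token representations entering the block

def attention(x):
    """Toy self-attention: mix information across token positions."""
    scores = x @ x.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

W1 = rng.normal(size=(d, hidden))       # hypothetical FFN weights
W2 = rng.normal(size=(hidden, d))

def ffn(x):
    """Position-wise feedforward: each token is processed independently."""
    return np.maximum(x @ W1, 0) @ W2   # ReLU for simplicity (modern models use SwiGLU)

# Step 1: attention gathers context. Step 2: the FFN transforms each token.
# Residual connections (the `x +` and `h +`) let gradients flow through deep stacks.
h = x + attention(x)
out = h + ffn(h)
assert out.shape == x.shape
```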