Inside the Transformer
Demystify the Transformer architecture, the engine behind nearly every modern LLM. Understand the forward pass, attention mechanisms, generation strategies, and the architectural innovations that enable models like Llama 3.
Why take this course?
Nearly every modern LLM is a Transformer. This course opens the black box: autoregressive generation, self-attention, multi-head attention, decoding strategies, KV caching, positional encodings (RoPE), Flash Attention, grouped-query attention, and the modern architectural improvements that make state-of-the-art models possible.
Prerequisites
This course builds on concepts from the following courses. It is recommended to complete them first:
Course Modules
Demystify how Transformers work: from autoregressive token-by-token generation, through the three-component architecture (tokenizer, stack, LM head), to the core mechanism of self-attention and the KV cache optimization.
Learning Goals
- Explain autoregressive generation and why LLMs produce one token at a time.
- Describe the three components of a Transformer LLM: tokenizer, stack, and LM head.
- Compare greedy decoding versus sampling strategies and the role of temperature.
- Understand how the KV cache speeds up generation and why its growing memory footprint constrains context length.
- Explain self-attention: Query-Key scoring and Value combination.
- Describe multi-head attention and how heads specialize in different patterns.
- Understand positional encodings (RoPE) and why they are necessary.
Concept Card Preview
Visuals, diagrams, and micro-interactions you'll see in this module.
One Token at a Time
A chat model may look like it writes a full answer and streams it to the screen. Internally, it does something stricter:…
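The one-token-at-a-time loop can be sketched in a few lines. This is a toy illustration, not a real model: `next_token_logits`, the `TOY_LOGITS` table, and `EOS_ID` are hypothetical stand-ins for a Transformer forward pass and its vocabulary.

```python
# Toy stand-in for a model: maps the most recent token ID to logits
# over a made-up 4-token vocabulary. A real model would look at the
# whole sequence, not just the last token.
TOY_LOGITS = {
    0: [0.1, 2.0, 0.3, 0.0],
    1: [0.0, 0.1, 1.5, 0.2],
    2: [0.2, 0.0, 0.1, 3.0],
    3: [0.0, 0.0, 0.0, 0.0],
}
EOS_ID = 3  # hypothetical end-of-sequence token


def next_token_logits(token_ids):
    """Hypothetical forward pass: one logit per vocabulary entry."""
    return TOY_LOGITS[token_ids[-1]]


def generate_greedy(prompt_ids, max_new_tokens=10):
    """Append one token per step; each step sees everything produced so far."""
    token_ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(token_ids)
        # Greedy decoding: pick the index with the highest logit.
        next_id = max(range(len(logits)), key=lambda i: logits[i])
        token_ids.append(next_id)
        if next_id == EOS_ID:
            break
    return token_ids


print(generate_greedy([0]))  # [0, 1, 2, 3]
```

Streaming in a chat interface corresponds to showing each appended token as soon as the loop produces it.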
From Token IDs to the Transformer Stack
The model does not read text directly. The tokenizer first converts text into token IDs. Those IDs index into the embedd…
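As a rough sketch of the text-to-IDs-to-vectors step, assuming a toy whitespace tokenizer and a hand-written embedding table (real tokenizers split text into subword pieces, and real tables are learned):

```python
# Hypothetical 3-word vocabulary and 3-dimensional embedding table.
VOCAB = {"the": 0, "cat": 1, "sat": 2}
EMBEDDING_TABLE = [
    [0.1, 0.2, 0.3],  # row 0: vector for "the"
    [0.4, 0.5, 0.6],  # row 1: vector for "cat"
    [0.7, 0.8, 0.9],  # row 2: vector for "sat"
]


def encode(text):
    """Toy tokenizer: lowercase, split on whitespace, look up each word's ID."""
    return [VOCAB[word] for word in text.lower().split()]


def embed(token_ids):
    """Each token ID selects one row of the embedding table."""
    return [EMBEDDING_TABLE[token_id] for token_id in token_ids]


ids = encode("The cat sat")
vectors = embed(ids)
print(ids)         # [0, 1, 2]
print(vectors[1])  # [0.4, 0.5, 0.6]
```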
Parallel Context, One Output Token
During one forward pass, the model processes all prompt positions in parallel. If the prompt has 100 tokens, the model c…
Explore the efficiency innovations that power modern LLMs: sparse attention patterns, grouped-query attention (GQA), Flash Attention memory optimization, and architectural improvements like pre-normalization, RMSNorm, and SwiGLU activations.
Learning Goals
- Explain the O(N²) complexity problem of full attention.
- Describe sparse attention patterns and their tradeoffs.
- Compare multi-head, multi-query, and grouped-query attention.
- Explain how Flash Attention optimizes memory access for 2-4× speedups.
- Describe modern architectural improvements: pre-normalization, RMSNorm, SwiGLU.
- Understand residual connections and why they enable deep networks.
- Explain document packing and how RoPE enables efficient training.
Concept Card Preview
Visuals, diagrams, and micro-interactions you'll see in this module.
Why Long Context Gets Expensive
Self-attention is powerful because each token can compare itself with other tokens in the context. The cost is that full…
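To see how quickly the cost grows, here is a small arithmetic sketch. The function name is illustrative; it counts query-key comparisons for a single head in a single layer and ignores constant factors.

```python
def attention_score_count(num_tokens, causal=True):
    """Number of query-key pairs scored by full attention.

    With a causal mask, token i only looks at tokens 0..i, which roughly
    halves the count but keeps the growth quadratic.
    """
    if causal:
        return num_tokens * (num_tokens + 1) // 2
    return num_tokens * num_tokens


for n in (1_000, 2_000, 4_000):
    print(n, attention_score_count(n, causal=False))
```

Doubling the context length quadruples the number of unmasked comparisons.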
Sparse Attention: Looking at Less
One way to reduce attention cost is simple: do not let every token attend to every other token.
Sliding-window attentio…
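A sliding-window pattern can be expressed as a boolean mask. The sketch below assumes a causal model and a hypothetical `window` parameter that counts the current token plus the tokens immediately before it; real implementations may define the window slightly differently.

```python
def sliding_window_mask(num_tokens, window):
    """mask[i][j] is True when query position i may attend to key position j.

    Causal: j must not be after i.
    Windowed: j must be among the `window` most recent positions.
    """
    return [
        [(j <= i) and (i - j < window) for j in range(num_tokens)]
        for i in range(num_tokens)
    ]


for row in sliding_window_mask(4, window=2):
    print(["x" if allowed else "." for allowed in row])
```

Each row has at most `window` allowed positions, so the number of scored pairs grows linearly with sequence length instead of quadratically.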
Grouped-Query Attention
Attention heads need queries, keys, and values. In classic multi-head attention, every head has its own key/value projec…
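In grouped-query attention, several query heads share one key/value head. A minimal sketch of the head mapping and the resulting KV cache size follows; the function names and the configuration numbers in the usage lines are illustrative assumptions, not taken from a specific model card.

```python
def kv_head_for_query_head(query_head, num_query_heads, num_kv_heads):
    """Index of the key/value head that a given query head reads from."""
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size


def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Approximate KV cache size for one sequence.

    The leading 2 accounts for storing both keys and values in every layer.
    bytes_per_value=2 assumes 16-bit storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


GIB = 1024 ** 3
# Multi-head attention: one KV head per query head.
print(kv_cache_bytes(32, 32, 128, 8192) / GIB)  # 4.0
# Grouped-query attention: 32 query heads share 8 KV heads.
print(kv_cache_bytes(32, 8, 128, 8192) / GIB)   # 1.0
```

Setting `num_kv_heads` to 1 gives multi-query attention; setting it equal to the number of query heads gives classic multi-head attention.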
The attention mechanism is the Transformer's core innovation. Walk through Q/K/V projections step-by-step, understand how tokens find and combine information from each other, why causal masking separates encoders from decoders, how residual connections enable deep networks, and what different attention heads learn to specialize in.
Learning Goals
- Explain how Q/K/V projection matrices transform tokens into three different representations.
- Walk through the attention calculation: QK^T scoring, softmax, weighted sum of V.
- Understand causal masking and why it distinguishes decoders from encoders.
- Describe residual connections and why they are essential for deep Transformer networks.
- Recognize that attention heads specialize in different patterns without explicit programming.
Concept Card Preview
Visuals, diagrams, and micro-interactions you'll see in this module.
Q, K, and V: Three Views of One Token
Attention starts with the current residual-stream vectors. For each token, the model creates three new vectors by multip…
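The projection step can be sketched with plain lists. The weight matrices `W_Q`, `W_K`, and `W_V` below are tiny hand-picked values for illustration; in a real model they are learned and much larger.

```python
def matvec(vector, matrix):
    """Row vector times matrix: vector has d_in entries, matrix is d_in x d_out."""
    return [
        sum(vector[i] * matrix[i][j] for i in range(len(vector)))
        for j in range(len(matrix[0]))
    ]


# Hypothetical 2x2 projection matrices.
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.0, 1.0], [1.0, 0.0]]
W_V = [[2.0, 0.0], [0.0, 2.0]]


def project_qkv(token_vector):
    """Produce the query, key, and value vectors for one token."""
    return (
        matvec(token_vector, W_Q),
        matvec(token_vector, W_K),
        matvec(token_vector, W_V),
    )


q, k, v = project_qkv([1.0, 3.0])
print(q, k, v)  # [1.0, 3.0] [3.0, 1.0] [2.0, 6.0]
```

The same input vector yields three different views because each matrix applies a different linear map.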

Relevance Scoring with QK^T
Once every token has a query and a key, attention scores relevance with dot products.
For one token, the model compares…
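A minimal sketch of the scoring step for one query against several keys. Dividing by the square root of the vector length is the standard scaled dot-product formulation; the preview text above mentions only the dot product, so treat the scaling as an added assumption.

```python
import math


def attention_scores(query, keys):
    """Scaled dot product of one query vector with every key vector."""
    scale = math.sqrt(len(query))
    return [
        sum(q * k for q, k in zip(query, key)) / scale
        for key in keys
    ]


query = [1.0, 0.0, 0.0, 0.0]
keys = [
    [2.0, 0.0, 0.0, 0.0],  # points the same way as the query
    [0.0, 2.0, 0.0, 0.0],  # orthogonal to the query
]
print(attention_scores(query, keys))  # [1.0, 0.0]
```

Keys that align with the query get higher scores; orthogonal keys score zero.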
Softmax Turns Scores into a Blend
Raw attention scores are not yet useful. They can be negative, large, or uneven. Softmax converts each row of scores int…
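The softmax-and-blend step might look like the following sketch. The helper names are illustrative; subtracting the maximum score before exponentiating is a common numerical-stability trick and does not change the result.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def blend_values(weights, values):
    """Weighted sum of value vectors, one weight per vector."""
    dim = len(values[0])
    return [
        sum(w * value[d] for w, value in zip(weights, values))
        for d in range(dim)
    ]


weights = softmax([0.0, 0.0])
print(weights)                                            # [0.5, 0.5]
print(blend_values(weights, [[1.0, 0.0], [3.0, 2.0]]))    # [2.0, 1.0]
```

Equal scores produce an even blend; a much larger score pushes almost all of the weight onto one value vector.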
The feedforward network stores much of the Transformer's learned knowledge, while the attention mechanism routes the context needed to retrieve it. Understand the FFN's dual role as memory bank and pattern interpolator, how RoPE gives Transformers a sense of position, and practical generation controls (top-k, top-p) that let you tune model output for your application.
Learning Goals
- Explain the feedforward network's role as the Transformer's knowledge store (memorization + interpolation).
- Describe how attention and FFN work together in each transformer block.
- Understand why positional encodings (RoPE) are necessary and how they work.
- Configure top-k and top-p sampling for different use cases.
- Trace the complete forward pass from input text to generated output token.
- Read a model card and understand what architectural parameters mean in practice.
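As a rough illustration of the top-k and top-p controls named in the goals above, the sketch below filters a probability distribution before sampling. The function names are illustrative and do not correspond to any particular library's API.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens, zero out the rest, renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = order[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose total probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


probs = [0.5, 0.3, 0.15, 0.05]
print(top_k_filter(probs, k=2))    # only the two most likely tokens stay nonzero
print(top_p_filter(probs, p=0.9))  # the three most likely tokens stay nonzero
```

Top-k keeps a fixed number of candidates, while top-p adapts the number of candidates to how concentrated the distribution is.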
Concept Card Preview
Visuals, diagrams, and micro-interactions you'll see in this module.
The Feedforward Network
Every Transformer block has two major parts. Attention lets tokens exchange information. The feedforward network, or FFN…
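A minimal sketch of a feedforward network applied to one token vector. It assumes a simple expand, activate, and project-back layout with ReLU; modern models often use gated variants such as SwiGLU instead, and the weights here are hand-picked for illustration.

```python
def matvec(vector, matrix):
    """Row vector times matrix: vector has d_in entries, matrix is d_in x d_out."""
    return [
        sum(vector[i] * matrix[i][j] for i in range(len(vector)))
        for j in range(len(matrix[0]))
    ]


# Hypothetical weights: expand 2 -> 4 dimensions, then project back 4 -> 2.
W_UP = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
W_DOWN = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]


def ffn(token_vector):
    """Expand, apply ReLU, and project back to the model dimension."""
    hidden = [max(0.0, h) for h in matvec(token_vector, W_UP)]
    return matvec(hidden, W_DOWN)


tokens = [[1.0, -1.0], [0.5, 2.0]]
# The FFN runs on each token independently; tokens do not interact here.
print([ffn(token) for token in tokens])
```

Unlike attention, nothing in `ffn` looks at neighboring tokens, which is why the two parts of a block play different roles.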
Attention Routes, FFN Computes
A Transformer block repeats a compact pattern.
First, attention lets each token pull information from other visible tok…
RoPE: Giving Attention a Sense of Position
Attention compares token vectors. By itself, that comparison does not automatically know word order. The model needs pos…
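The core idea of RoPE can be sketched as rotating pairs of vector components by an angle that depends on the token's position. This is a simplified version that pairs adjacent components and assumes an even vector length and the commonly used base of 10000; real implementations may pair components differently.

```python
import math


def rope(vector, position, base=10000.0):
    """Rotate pairs (x0, x1), (x2, x3), ... by position-dependent angles."""
    dim = len(vector)
    rotated = []
    for i in range(0, dim, 2):
        # Earlier pairs rotate quickly, later pairs rotate slowly.
        angle = position * base ** (-i / dim)
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        x, y = vector[i], vector[i + 1]
        rotated.extend([x * cos_a - y * sin_a, x * sin_a + y * cos_a])
    return rotated


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]
# Same distance between positions (2) gives the same score, wherever the pair sits.
print(dot(rope(q, 5), rope(k, 3)))
print(dot(rope(q, 12), rope(k, 10)))
```

Because both vectors are rotated, the query-key dot product depends on the relative distance between positions rather than on their absolute values.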