
Inside the Transformer

Demystify the Transformer architecture, the engine behind nearly every modern LLM. Understand the forward pass, attention mechanisms, generation strategies, and the architectural innovations that enable models like Llama 3.

Why take this course?

Nearly every modern LLM is built on the Transformer. This course opens the black box: autoregressive generation, self-attention, multi-head attention, decoding strategies, KV caching, positional encodings (RoPE), Flash Attention, grouped-query attention, and the modern architectural improvements that make state-of-the-art models possible.

Prerequisites

This course builds on concepts from the following course. We recommend completing it first:

How LLMs Work

Course Modules

Module 1: Transformer Foundations

Demystify how Transformers work: from autoregressive token-by-token generation, through the three-component architecture (tokenizer, stack, LM head), to the core mechanism of self-attention and the KV cache optimization.

Learning Goals

Explain autoregressive generation and why LLMs produce one token at a time.
Describe the three components of a Transformer LLM: tokenizer, stack, and LM head.
Compare greedy decoding versus sampling strategies and the role of temperature.
Understand how the KV cache speeds up generation and why it limits context length.
Explain self-attention: Query-Key scoring and Value combination.
Describe multi-head attention and how heads specialize in different patterns.
Understand positional encodings (RoPE) and why they are necessary.

Concept Card Preview

Visuals, diagrams, and micro-interactions you'll see in this module.

One Token at a Time

A chat model may look like it writes a full answer and streams it to the screen. Internally, it does something stricter:…
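As a rough illustration of generating one token at a time, the loop below picks a token, appends it to the sequence, and runs the model again. This is a sketch under stated assumptions, not a real model: `toy_next_token_scores` is a hypothetical stand-in for a forward pass, and the five-word vocabulary is invented for the example.

```python
# Toy vocabulary, invented for illustration only.
VOCAB = ["<eos>", "the", "cat", "sat", "down"]

def toy_next_token_scores(token_ids):
    """Hypothetical stand-in for a forward pass: one score per vocabulary entry.

    It simply favors the vocabulary entry after the last token, wrapping to <eos>.
    """
    favored = (token_ids[-1] + 1) % len(VOCAB)
    return [1.0 if i == favored else 0.0 for i in range(len(VOCAB))]

def generate_greedy(prompt_ids, max_new_tokens=10, eos_id=0):
    """Autoregressive loop: pick one token, append it, run the model again."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        scores = toy_next_token_scores(ids)
        # Greedy decoding: always take the highest-scoring token.
        next_id = max(range(len(scores)), key=scores.__getitem__)
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids

generated = generate_greedy([1])  # start from "the"
print(" ".join(VOCAB[i] for i in generated))
```

Each pass through the loop yields exactly one new token, and that token becomes part of the input for the next pass.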

From Token IDs to the Transformer Stack

The model does not read text directly. The tokenizer first converts text into token IDs. Those IDs index into the embedd…

Parallel Context, One Output Token

During one forward pass, the model processes all prompt positions in parallel. If the prompt has 100 tokens, the model c…

Module 2: Architecture & Decoding

Explore the efficiency innovations that power modern LLMs: sparse attention patterns, grouped-query attention (GQA), Flash Attention memory optimization, and architectural improvements like pre-normalization, RMSNorm, and SwiGLU activations.

Learning Goals

Explain the O(N²) complexity problem of full attention.
Describe sparse attention patterns and their tradeoffs.
Compare multi-head, multi-query, and grouped-query attention.
Explain how Flash Attention optimizes memory access for 2-4× speedups.
Describe modern architectural improvements: pre-normalization, RMSNorm, SwiGLU.
Understand residual connections and why they enable deep networks.
Explain document packing and how RoPE enables efficient training.

Concept Card Preview

Visuals, diagrams, and micro-interactions you'll see in this module.

Why Long Context Gets Expensive

Self-attention is powerful because each token can compare itself with other tokens in the context. The cost is that full…
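A small arithmetic sketch shows why the cost grows so quickly. It counts only query-key comparisons for one head in one layer, assuming full attention where every token is compared with every token; the function name is hypothetical.

```python
def attention_score_count(num_tokens):
    """Query-key comparisons in full attention, per head per layer."""
    return num_tokens * num_tokens

for n in (1_000, 2_000, 4_000):
    print(f"{n} tokens -> {attention_score_count(n)} scores")
```

Doubling the context length quadruples the number of scores. Causal masking roughly halves the count but does not change the quadratic growth.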

Sparse Attention: Looking at Less

One way to reduce attention cost is simple: do not let every token attend to every other token.

Sliding-window attentio…

Grouped-Query Attention

Attention heads need queries, keys, and values. In classic multi-head attention, every head has its own key/value projec…
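One way to see the effect of sharing key/value heads is to estimate KV cache size. The sketch below is a back-of-the-envelope calculation, not a measurement: the layer count, head dimension, and sequence length are hypothetical round numbers, and 2 bytes per value assumes a 16-bit format.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical configuration with 32 query heads, chosen for round numbers.
config = dict(num_layers=32, head_dim=128, seq_len=8192)
mha = kv_cache_bytes(num_kv_heads=32, **config)  # one K/V head per query head
gqa = kv_cache_bytes(num_kv_heads=8, **config)   # four query heads share each K/V head
mqa = kv_cache_bytes(num_kv_heads=1, **config)   # all query heads share one K/V head

for name, size in (("multi-head", mha), ("grouped-query", gqa), ("multi-query", mqa)):
    print(f"{name}: {size / 2**30:.3f} GiB")
```

The cache shrinks in proportion to the number of key/value heads, which is the memory saving that grouped-query and multi-query attention trade against per-head flexibility.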

Module 3: Inside the Attention Head

The attention mechanism is the Transformer's core innovation. Walk through Q/K/V projections step by step, and understand how tokens find and combine information from each other, why causal masking separates encoders from decoders, how residual connections enable deep networks, and what different attention heads learn to specialize in.

Learning Goals

Explain how Q/K/V projection matrices transform tokens into three different representations.
Walk through the attention calculation: QK^T scoring, softmax, weighted sum of V.
Understand causal masking and why it distinguishes decoders from encoders.
Describe residual connections and why they are essential for deep Transformer networks.
Recognize that attention heads specialize in different patterns without explicit programming.

Concept Card Preview

Visuals, diagrams, and micro-interactions you'll see in this module.

Q, K, and V: Three Views of One Token

Attention starts with the current residual-stream vectors. For each token, the model creates three new vectors by multip…

Relevance Scoring with QK^T

Once every token has a query and a key, attention scores relevance with dot products.

For one token, the model compares…
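The scoring step can be sketched for a single query as follows. The vectors are made up, and the division by the square root of the vector length follows the standard scaled dot-product formulation, which is an assumption here because the preview text does not show it.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def attention_scores(query, keys):
    """Scaled dot-product score of one query against every key."""
    scale = math.sqrt(len(query))
    return [dot(query, key) / scale for key in keys]

# Made-up 4-dimensional key vectors for three tokens.
keys = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 0.0],
]
query = [2.0, 0.0, 2.0, 0.0]

scores = attention_scores(query, keys)
print(scores)
```

The first key points in the same direction as the query and receives the highest score; the second is orthogonal to it and scores zero.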

Softmax Turns Scores into a Blend

Raw attention scores are not yet useful. They can be negative, large, or uneven. Softmax converts each row of scores int…
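A minimal sketch of this step for one row of scores is shown below. The scores and value vectors are invented, and `blend` is a hypothetical helper name for the weighted sum of value vectors.

```python
import math

def softmax(scores):
    """Convert raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def blend(weights, values):
    """Weighted sum of value vectors, one weight per vector."""
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

scores = [2.0, 0.0, -1.0]                       # made-up raw scores for three tokens
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # made-up value vectors

weights = softmax(scores)
output = blend(weights, values)
print(weights)
print(output)
```

Negative and uneven scores become a set of positive weights, and the output is pulled mostly toward the value vector of the highest-scoring token.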

Module 4: The Feedforward Network & Generation Control

The feedforward network stores the Transformer's learned knowledge while the attention mechanism retrieves it. Understand the FFN's dual role as memory bank and pattern interpolator, how RoPE gives Transformers a sense of position, and practical generation controls (top-k, top-p) that let you tune model output for your application.

Learning Goals

Explain the feedforward network's role as the Transformer's knowledge store (memorization + interpolation).
Describe how attention and FFN work together in each Transformer block.
Understand why positional encodings (RoPE) are necessary and how they work.
Configure top-k and top-p sampling for different use cases.
Trace the complete forward pass from input text to generated output token.
Read a model card and understand what architectural parameters mean in practice.
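To make the top-k and top-p goal concrete, here is a minimal sketch of both filters applied to a made-up probability distribution. The function names are hypothetical and do not correspond to any particular library's API; real inference stacks expose these controls as configuration parameters.

```python
import random

def filter_top_k_top_p(probs, top_k=None, top_p=None):
    """Return {token_id: probability} after filtering, renormalized to sum to 1."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]  # keep only the k most likely tokens
    if top_p is not None:
        kept, cumulative = [], 0.0
        for token_id, p in ranked:
            kept.append((token_id, p))
            cumulative += p
            if cumulative >= top_p:  # smallest prefix whose mass reaches top_p
                break
        ranked = kept
    total = sum(p for _, p in ranked)
    return {token_id: p / total for token_id, p in ranked}

def sample(filtered, rng):
    """Draw one token id according to the filtered probabilities."""
    token_ids = list(filtered)
    return rng.choices(token_ids, weights=[filtered[t] for t in token_ids], k=1)[0]

probs = [0.5, 0.3, 0.15, 0.05]  # made-up next-token distribution
top2 = filter_top_k_top_p(probs, top_k=2)
nucleus = filter_top_k_top_p(probs, top_p=0.9)
choice = sample(nucleus, random.Random(0))
print(top2, nucleus, choice)
```

Top-k keeps a fixed number of candidates, while top-p keeps however many are needed to cover the requested probability mass, so the candidate set grows when the model is less certain.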

Concept Card Preview

Visuals, diagrams, and micro-interactions you'll see in this module.

The Feedforward Network

Every Transformer block has two major parts. Attention lets tokens exchange information. The feedforward network, or FFN…

Attention Routes, FFN Computes

A Transformer block repeats a compact pattern.

First, attention lets each token pull information from other visible tok…

RoPE: Giving Attention a Sense of Position

Attention compares token vectors. By itself, that comparison does not automatically know word order. The model needs pos…
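The idea behind RoPE can be sketched in a few lines: rotate each pair of vector components by an angle that depends on the token's position, then take the usual dot product. This is an illustration under assumptions, not a production implementation: it pairs adjacent components and uses a frequency base of 10000, whereas real implementations differ in how they pair components and choose the base.

```python
import math

def rope_rotate(vector, position, base=10000.0):
    """Rotate consecutive component pairs by angles proportional to `position`."""
    dim = len(vector)
    rotated = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # later pairs rotate more slowly
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vector[i], vector[i + 1]
        rotated += [x * cos_t - y * sin_t, x * sin_t + y * cos_t]
    return rotated

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.5, -0.3, 0.8]   # made-up query vector
k = [0.2, -0.7, 0.9, 0.4]   # made-up key vector

# The same 3-position offset at two different absolute positions.
score_near = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_far = dot(rope_rotate(q, 105), rope_rotate(k, 102))
print(score_near, score_far)
```

The two scores agree up to floating-point error, because the rotated dot product depends only on how far apart the two tokens are, not on where they sit in the sequence.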