
Inside the Transformer

Demystify the Transformer architecture — the engine behind every modern LLM. Understand the forward pass, attention mechanisms, generation strategies, and the architectural innovations that enable models like Llama 3.

Why take this course?

Every modern LLM is a Transformer. This course opens the black box: autoregressive generation, self-attention, multi-head attention, decoding strategies, KV caching, positional encodings (RoPE), Flash Attention, grouped-query attention, and the modern architectural improvements that make state-of-the-art models possible.

Prerequisites

This course builds on concepts from earlier courses in the series. Completing them first is recommended.

Course Modules

Module 1: Transformer Foundations

Demystify how Transformers work: from autoregressive token-by-token generation, through the three-component architecture (tokenizer, stack, LM head), to the core mechanism of self-attention and the KV cache optimization.
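To ground the idea before the cards below, here is a minimal sketch of that token-by-token loop. `model_step` and `sample` are hypothetical stand-ins for a real forward pass and a decoding strategy, and the `kv_cache` shows where the optimization slots in:

```python
# Minimal sketch of autoregressive generation with a KV cache.
# `model_step` and `sample` are hypothetical stand-ins, and eos_id=0
# is an illustrative end-of-sequence token id.

def generate(prompt_tokens, model_step, sample, max_new_tokens=32, eos_id=0):
    tokens = list(prompt_tokens)
    kv_cache = None       # holds Keys/Values of all tokens processed so far
    next_input = tokens   # the first step processes the whole prompt
    for _ in range(max_new_tokens):
        # With a KV cache, each step only computes K/V for the new tokens;
        # earlier tokens' K/V are reused from the cache.
        logits, kv_cache = model_step(next_input, kv_cache)
        next_token = sample(logits)
        if next_token == eos_id:
            break
        tokens.append(next_token)
        next_input = [next_token]  # later steps feed one token at a time
    return tokens
```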

Learning Goals

  • Explain autoregressive generation and why LLMs produce one token at a time.
  • Describe the three components of a Transformer LLM: tokenizer, stack, and LM head.
  • Compare greedy decoding versus sampling strategies and the role of temperature.
  • Understand how the KV cache speeds up generation and why it limits context length.
  • Explain self-attention: Query-Key scoring and Value combination.
  • Describe multi-head attention and how heads specialize in different patterns.
  • Understand positional encodings (RoPE) and why they are necessary.
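For that last goal, a minimal numpy sketch of the rotation RoPE applies; the interleaved-pair convention and the toy 8-dimensional vectors are illustrative assumptions:

```python
import numpy as np

def rope(x, position, base=10000.0):
    """Rotate a (d,)-vector as RoPE would at the given position.

    Pairs of dimensions (2i, 2i+1) are rotated by position * base**(-2i/d),
    so relative offsets between tokens show up as relative rotation angles.
    """
    d = x.shape[0]
    half = d // 2
    freqs = base ** (-np.arange(half) * 2.0 / d)  # one frequency per pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]                     # the two halves of each pair
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin               # standard 2-D rotation
    out[1::2] = x1 * sin + x2 * cos
    return out

# q . k then depends on the relative offset 5 - 3, not absolute positions:
q = rope(np.ones(8), position=5)
k = rope(np.ones(8), position=3)
```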

Concept Card Preview

Visuals, diagrams, and micro-interactions you'll see in this module.


The One-Token-at-a-Time Surprise

In the previous course, we explored how LLMs work at a high level — tokenization, embeddings, and the basics of how mode…


The Three-Component Machine

A Transformer LLM isn't a single thing — it's three separate components working in sequence.

1. Tokenizer — We cove…
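A hypothetical sketch of that three-stage pipeline, with `tokenizer`, `transformer_stack`, and `lm_head` as stand-in callables rather than any real API:

```python
# Sketch of the three components working in sequence.

def forward_pass(text, tokenizer, transformer_stack, lm_head):
    token_ids = tokenizer(text)                   # 1. text -> token ids
    hidden_states = transformer_stack(token_ids)  # 2. ids -> contextual vectors
    logits = lm_head(hidden_states[-1])           # 3. last vector -> one score per vocab token
    return logits
```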


Choosing the Next Token: Greedy vs. Sampling

The LM head outputs probabilities for every token. 'Paris' might have 40% probability, 'the' might have 15%, 'France' mi…
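A minimal numpy sketch of the two strategies this card contrasts; the logits are assumed to be raw LM-head scores, and dividing by temperature is the standard way to sharpen or flatten the distribution:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def greedy(logits):
    return int(np.argmax(logits))      # always the single most likely token

def sample_with_temperature(logits, temperature=1.0):
    # temperature < 1 sharpens the distribution, > 1 flattens it
    probs = softmax(logits / temperature)
    return int(np.random.choice(len(probs), p=probs))
```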

Module 2: Architecture & Decoding

Explore the efficiency innovations that power modern LLMs: sparse attention patterns, grouped-query attention (GQA), Flash Attention memory optimization, and architectural improvements like pre-normalization, RMSNorm, and SwiGLU activations.

Learning Goals

  • Explain the O(N²) complexity problem of full attention.
  • Describe sparse attention patterns and their tradeoffs.
  • Compare multi-head, multi-query, and grouped-query attention.
  • Explain how Flash Attention optimizes memory access for 2-4× speedups.
  • Describe modern architectural improvements: pre-normalization, RMSNorm, SwiGLU.
  • Understand residual connections and why they enable deep networks.
  • Explain document packing and how RoPE enables efficient training.

Concept Card Preview

Visuals, diagrams, and micro-interactions you'll see in this module.


The Hidden Cost of Full Attention

Nina wanted to process a 100,000-token document. Her LLM API charges skyrocketed. She dug into the pricing and found som…
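The arithmetic behind that surprise, as a quick illustration (these numbers are just the size of the attention score matrix, not real pricing):

```python
# Back-of-the-envelope: the attention score matrix is N x N.
for n in (1_000, 10_000, 100_000):
    scores = n * n
    print(f"{n:>7} tokens -> {scores:.1e} pairwise scores per head per layer")
# 100x more tokens means 10,000x more scores: that's the O(N^2) wall.
```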


Sparse Attention: Looking at Less

If every token attending to every other token is too expensive, what's the solution? **Don't let every token see everyth…
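One common sparse pattern is a causal sliding window. A minimal numpy sketch, with an arbitrary toy window size:

```python
import numpy as np

def sliding_window_mask(n, window):
    """Boolean mask: token i may attend to token j (causal, local window)."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - window)   # only the last `window` tokens

mask = sliding_window_mask(n=8, window=3)
# Each token sees at most 3 recent tokens instead of all 8,
# so cost grows as O(N * window) rather than O(N^2).
```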


Grouped-Query Attention: The Sweet Spot

Multi-query attention is fast but hurts quality. Multi-head is high-quality but slow. Is there a middle ground?

**Group…
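A shapes-only numpy sketch of the sharing scheme, with toy head counts chosen for illustration:

```python
import numpy as np

# How grouped-query attention shares K/V heads across query heads.
n_tokens, head_dim = 16, 64
n_q_heads, n_kv_heads = 8, 2          # 8 query heads share 2 key/value heads

Q = np.random.randn(n_q_heads, n_tokens, head_dim)
K = np.random.randn(n_kv_heads, n_tokens, head_dim)
V = np.random.randn(n_kv_heads, n_tokens, head_dim)

group_size = n_q_heads // n_kv_heads  # 4 query heads per K/V head
for h in range(n_q_heads):
    kv = h // group_size              # which shared K/V head this query head uses
    scores = Q[h] @ K[kv].T / np.sqrt(head_dim)

# n_kv_heads = n_q_heads -> multi-head; n_kv_heads = 1 -> multi-query;
# anything in between -> grouped-query attention.
```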

Module 3: Inside the Attention Head

The attention mechanism is the transformer's core innovation. Walk through the Q/K/V projections step by step and understand how tokens find and combine information from each other, why causal masking distinguishes decoders from encoders, how residual connections enable deep networks, and what different attention heads learn to specialize in.

Learning Goals

  • Explain how Q/K/V projection matrices transform tokens into three different representations.
  • Walk through the attention calculation: Q*K^T scoring, softmax, weighted sum of V.
  • Understand causal masking and why it distinguishes decoders from encoders.
  • Describe residual connections and why they are essential for deep transformer networks.
  • Recognize that attention heads specialize in different patterns without explicit programming.

Concept Card Preview

Visuals, diagrams, and micro-interactions you'll see in this module.


Three Lenses on Every Token

Module 1 introduced attention as 'tokens finding relevant context.' Now let's open the hood and see exactly how.

Before…
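A minimal numpy sketch of those three lenses; the random weight matrices stand in for learned parameters:

```python
import numpy as np

d_model = 8                              # toy hidden size
x = np.random.randn(5, d_model)          # 5 token vectors

# Three learned projection matrices give three views of the same token:
W_q = np.random.randn(d_model, d_model)  # Query: "what am I looking for?"
W_k = np.random.randn(d_model, d_model)  # Key:   "what do I offer?"
W_v = np.random.randn(d_model, d_model)  # Value: "what do I pass along?"

Q, K, V = x @ W_q, x @ W_k, x @ W_v      # each is (5, d_model)
```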

How Tokens Find Each Other: Q*K Scoring

Now that each token has a Query and every token has a Key, how does the model determine relevance?

Dot product. Mul…
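A minimal numpy sketch of the scoring step, with the causal mask from this module's goals applied before softmax; shapes and values are toy examples:

```python
import numpy as np

n, d = 5, 8
Q = np.random.randn(n, d)
K = np.random.randn(n, d)

# Every Query dotted with every Key: entry [i, j] scores how relevant
# token j is to token i. Dividing by sqrt(d) keeps the scores well-scaled.
scores = Q @ K.T / np.sqrt(d)                  # shape (n, n)

# Causal mask (decoder-style): token i may not look at tokens after it.
causal = np.triu(np.ones((n, n), dtype=bool), k=1)
scores[causal] = -np.inf                       # softmax will zero these out
```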


From Scores to Weights: Softmax

Raw dot-product scores can be any number — positive, negative, huge, tiny. We need them to be **probabilities that sum t…
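A minimal numpy sketch of that conversion, continuing with toy shapes; the row-wise max subtraction is the standard numerical-stability trick:

```python
import numpy as np

def softmax(scores):
    z = scores - scores.max(axis=-1, keepdims=True)  # stability shift
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Turn each row of raw scores into attention weights that are non-negative
# and sum to 1, then mix the Values with them.
n, d = 5, 8
scores = np.random.randn(n, n)
V = np.random.randn(n, d)
weights = softmax(scores)   # each row sums to 1
output = weights @ V        # each token's output is a weighted blend of Values
```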

Module 4: The Feedforward Network & Generation Control

The feedforward network stores the transformer's learned knowledge while the attention mechanism retrieves it. Understand the FFN's dual role as memory bank and pattern interpolator, how RoPE gives transformers a sense of position, and practical generation controls (top-k, top-p) that let you tune model output for your application.
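As a preview of those generation controls, here is a minimal numpy sketch of combined top-k/top-p filtering; the function name and defaults are illustrative, not a library API:

```python
import numpy as np

def top_k_top_p_sample(probs, k=50, p=0.9):
    """Sample from `probs` after top-k and top-p (nucleus) filtering.

    A sketch of the generation controls named above, not a library API.
    """
    order = np.argsort(probs)[::-1]      # token ids, most likely first
    ranked = probs[order]
    keep = np.zeros_like(ranked, dtype=bool)
    keep[:k] = True                      # top-k: only the k most likely tokens
    cumulative = np.cumsum(ranked)
    keep &= cumulative - ranked < p      # top-p: smallest set covering mass p
    filtered = np.where(keep, ranked, 0.0)
    filtered /= filtered.sum()           # renormalize what's left
    return int(np.random.choice(order, p=filtered))
```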

Learning Goals

  • Explain the feedforward network's role as the transformer's knowledge store (memorization + interpolation).
  • Describe how attention and FFN work together in each transformer block.
  • Understand why positional encodings (RoPE) are necessary and how they work.
  • Configure top-k and top-p sampling for different use cases.
  • Trace the complete forward pass from input text to generated output token.
  • Read a model card and understand what architectural parameters mean in practice.

Concept Card Preview

Visuals, diagrams, and micro-interactions you'll see in this module.


The Feedforward Network: The Transformer's Memory Bank

Each transformer block has two components: attention and the feedforward network (FFN). We've covered attention. Now…
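A minimal numpy sketch of the FFN's expand-then-contract shape; a plain ReLU stands in here for the gated activations (like SwiGLU) that modern models actually use:

```python
import numpy as np

def ffn(x, W_up, b_up, W_down, b_down):
    """Position-wise feedforward network: expand, nonlinearity, contract.

    The hidden layer is typically ~4x wider than the model dimension;
    its weights are where much of the model's learned knowledge lives.
    """
    h = np.maximum(0.0, x @ W_up + b_up)   # expand + ReLU-style nonlinearity
    return h @ W_down + b_down             # contract back to model size

d_model, d_hidden = 8, 32                  # toy sizes; real models use thousands
x = np.random.randn(d_model)
out = ffn(x, np.random.randn(d_model, d_hidden), np.zeros(d_hidden),
          np.random.randn(d_hidden, d_model), np.zeros(d_model))
```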


Memorization vs. Interpolation: The FFN's Dual Role

The FFN isn't just a lookup table of memorized facts. It plays two distinct roles:

Memorization: Direct factual rec…


Attention + FFN: The Two-Step Dance

Every transformer block performs the same two-step sequence:

Step 1 — Attention: Each token looks at all previous t…
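A pseudocode-level sketch of that sequence, with `attention`, `ffn`, and the norms as stand-in callables; the residual additions are the connections Module 3 highlighted:

```python
# One transformer block's two-step sequence.

def transformer_block(x, attention, ffn, norm1, norm2):
    # Step 1 - attention: tokens gather information from earlier tokens.
    # The residual (`x +`) lets each sublayer learn only a small correction,
    # which is what makes very deep stacks trainable.
    x = x + attention(norm1(x))   # pre-normalization, as in modern LLMs
    # Step 2 - FFN: each token is then processed independently.
    x = x + ffn(norm2(x))
    return x
```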
