Retrieval-Augmented Generation
Build RAG systems: chunking strategies, retrieval pipelines, prompt integration, and evaluation.
Why take this course?
The most practical pattern in AI engineering. Learn why RAG exists, how the retrieve-augment-generate pipeline works, how to measure and improve retrieval quality, how to diagnose failure modes, and how to ship production-ready systems with caching and observability.
Prerequisites
This course builds on concepts from the following courses; we recommend completing them first:
Course Modules
LLMs hallucinate, have knowledge cutoffs, and cannot access your private data. RAG solves all three by separating retrieval from reasoning: a knowledge base provides current, authoritative facts; the LLM provides the reasoning engine that synthesizes them into answers.
Learning Goals
- Explain why LLMs hallucinate and why hallucination is systematic rather than random.
- Understand training cutoffs and why fine-tuning is not a practical solution to stale knowledge.
- Describe the parametric vs. external memory distinction and how RAG bridges them.
- Identify which use cases are better suited to RAG vs. fine-tuning vs. both.
Concept Card Preview
Visuals, diagrams, and micro-interactions you'll see in this module.

The Hallucination Problem
Nina ships a customer-support chatbot powered by GPT-4. The first week goes well — until a user asks about the company's…
Knowledge Cutoffs and Stale Facts
Nina's hallucination problem has a cousin: staleness. Every LLM has a training cutoff — a date after which it kn…
RAG as External Memory
Hallucinations and staleness share a root cause: the model is forced to answer from parametric memory — knowledge co…
RAG has two pipelines: an offline ingestion pipeline that prepares the knowledge base, and an online query pipeline that retrieves and generates answers at request time. Understanding how each stage works — and how they must mirror each other — is the foundation for building a system that actually works.
Learning Goals
- Walk through all three RAG stages: Retrieve, Augment, Generate — and explain what each produces.
- Understand context window injection: how retrieved chunks are assembled into the augmented prompt.
- Describe the offline ingestion pipeline: load → clean → chunk → embed → upsert.
- Explain why preprocessing consistency between ingestion and query time is critical.
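The two pipelines above can be sketched end to end in a few dozen lines. This is an illustrative sketch only: `embed` here is a toy hashed bag-of-words stand-in for a real embedding model, and the document text is invented. The point it demonstrates is the last learning goal: `clean` and `embed` must be the exact same functions at ingestion time and at query time.

```python
import hashlib
import math

def clean(text: str) -> str:
    # Normalization must be identical at ingestion and query time:
    # a chunk embedded lowercase will never match a mixed-case query.
    return " ".join(text.lower().split())

def embed(text: str, dim: int = 64) -> list[float]:
    # Toy hashed bag-of-words vector, unit-normalized. A real system
    # would call an embedding model here; this is illustration only.
    vec = [0.0] * dim
    for token in clean(text).split():
        idx = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def chunk(doc: str, size: int = 40) -> list[str]:
    # Fixed-size word chunking; one of many possible strategies.
    words = doc.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

# Offline ingestion: load -> clean -> chunk -> embed -> upsert.
index: list[tuple[str, list[float]]] = []
docs = ["Refunds are issued within 14 days of purchase. "
        "Contact support with your order ID."]
for doc in docs:
    for c in chunk(clean(doc)):
        index.append((c, embed(c)))

# Online query: embed the query the SAME way, retrieve top-k, augment.
def retrieve(query: str, k: int = 3) -> list[str]:
    qv = embed(query)
    scored = sorted(index, key=lambda p: -sum(a * b for a, b in zip(qv, p[1])))
    return [text for text, _ in scored[:k]]

question = "how long do refunds take?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

The augmented `prompt` is what actually reaches the LLM: retrieved chunks first, then the user's question.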
Concept Card Preview
Visuals, diagrams, and micro-interactions you'll see in this module.
Retrieve, Augment, Generate
Nina knows why RAG exists — now she needs to understand how it works. The name spells out the three stages: **Retrie…

Context Window Injection
The augmented prompt must fit within the LLM's context window — the maximum number of tokens the model can process i…
Lost in the Middle
Nina runs an experiment: she places the correct answer in different positions within a 20-chunk context and measures acc…
Poor retrieval quality is the most common RAG failure mode. Learn how to measure retrieval with recall and precision, build evaluation test sets, and improve quality using cross-encoder reranking — so you can tell whether your changes make things better or worse.
Learning Goals
- Explain the recall vs. precision tradeoff in retrieval and how top-k affects each.
- Build a retrieval evaluation test set using manual annotation or synthetic LLM-generated pairs.
- Interpret retrieval metrics: Recall@k, Precision@k, MRR, and NDCG.
- Explain two-stage retrieval (bi-encoder + cross-encoder reranker) and when reranking is worth the latency.
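The first three metrics above are simple enough to compute by hand. A minimal sketch, using made-up chunk IDs and relevance labels (in practice these come from your annotated evaluation test set):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Fraction of all relevant chunks that appear in the top-k results.
    hits = sum(1 for d in retrieved[:k] if d in relevant)
    return hits / len(relevant)

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Fraction of the top-k results that are actually relevant.
    hits = sum(1 for d in retrieved[:k] if d in relevant)
    return hits / k

def mrr(queries: list[tuple[list[str], set[str]]]) -> float:
    # Mean Reciprocal Rank: 1/rank of the FIRST relevant result,
    # averaged over all queries. Rewards putting a hit near the top.
    total = 0.0
    for retrieved, relevant in queries:
        for rank, d in enumerate(retrieved, start=1):
            if d in relevant:
                total += 1.0 / rank
                break
    return total / len(queries)

retrieved = ["c7", "c2", "c9", "c4", "c1"]  # ranked retriever output
relevant = {"c2", "c4"}                      # ground-truth labels
print(recall_at_k(retrieved, relevant, 5))     # 1.0: both relevant chunks in top-5
print(precision_at_k(retrieved, relevant, 5))  # 0.4: only 2 of 5 results relevant
print(mrr([(retrieved, relevant)]))            # 0.5: first hit at rank 2
```

Note the tension the module explores: raising k can only raise Recall@k, but it typically lowers Precision@k.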
Concept Card Preview
Visuals, diagrams, and micro-interactions you'll see in this module.
Recall vs. Precision in Retrieval
Nina's RAG prototype returns answers, but some are wrong despite the correct chunks existing in her corpus. She needs a…

Key Retrieval Metrics
Knowing recall and precision exist is not enough — Nina needs specific metrics she can compute and track over time. **R…
End-to-End Evaluation with RAGAS
Retrieval metrics tell Nina whether the retriever works. But users care about the final answer, not whether chunk 4 was…
RAG can fail in ways that are hard to detect: the right chunk retrieved but ignored, too much context confusing the model, or information existing in the corpus but never found. Learn to diagnose each failure mode and apply targeted fixes.
Learning Goals
- Explain the "lost in the middle" phenomenon and why context position affects LLM answer quality.
- Identify context stuffing symptoms and understand why more retrieved chunks can hurt answer quality.
- Diagnose retrieval misses: vocabulary gaps, concept splitting, low-frequency information, format mismatch.
- Recognize LLM override — when the model ignores retrieved context in favor of parametric knowledge.
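One common mitigation for the "lost in the middle" effect can be sketched directly: since models attend most reliably to the start and end of the context, interleave the best-scoring chunks toward both edges and let the weakest sink toward the middle. This reordering heuristic is one approach among several, shown here with placeholder chunk labels:

```python
def reorder_for_position_bias(chunks_by_score: list[str]) -> list[str]:
    # Input is sorted best-first. Alternate chunks between the front
    # and the back of the context so the top results land at the
    # positions the LLM attends to most, and the weakest end up
    # buried in the middle.
    front, back = [], []
    for i, chunk in enumerate(chunks_by_score):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# "A" is the best-scoring chunk, "E" the worst.
print(reorder_for_position_bias(["A", "B", "C", "D", "E"]))
# ['A', 'C', 'E', 'D', 'B'] -> best chunk first, second-best last,
# weakest chunk pushed to the middle
```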
Concept Card Preview
Visuals, diagrams, and micro-interactions you'll see in this module.
Lost in the Middle
Nina's retrieval metrics look solid — Recall@5 is 0.85. But her users still get wrong answers. She digs into the logs an…
Context Stuffing
Knowing about positional bias, Nina tries a different approach: she increases top-k from 5 to 15, reasoning that retriev…
Retrieval Misses
Nina fixes context ordering and reduces top-k. Most queries work now. But she notices a pattern in the remaining failure…
A working prototype and a production system are very different things. Learn how to manage latency budgets (hint: the LLM dominates), implement caching strategies with appropriate risk management, combine dense and sparse retrieval for robust search, and build observability into your system so you know when it fails.
Learning Goals
- Break down RAG latency by component and identify the highest-ROI optimization targets.
- Implement embedding, retrieval, and semantic caches — and understand the risks of each.
- Explain hybrid retrieval (dense + sparse) and how Reciprocal Rank Fusion (RRF) merges their results.
- Design an observability strategy: what to log, what metrics to track, and how to detect silent failures.
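Reciprocal Rank Fusion is small enough to show in full. It merges ranked lists using only each document's rank, which sidesteps the problem that dense cosine scores and sparse BM25 scores live on incomparable scales. The chunk IDs below are invented; `k = 60` is the constant proposed in the original RRF paper:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Each document scores sum(1 / (k + rank)) across all input
    # rankings; documents ranked highly by several retrievers win.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["c3", "c1", "c7"]   # semantic (embedding) results, best first
sparse = ["c1", "c9", "c3"]   # keyword (BM25) results, best first
print(reciprocal_rank_fusion([dense, sparse]))
# ['c1', 'c3', 'c9', 'c7'] -> c1 wins by appearing near the top of BOTH lists
```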
Concept Card Preview
Visuals, diagrams, and micro-interactions you'll see in this module.
The RAG Latency Budget
Nina's RAG prototype works — but it takes six seconds to answer a question. Her PM asks: "Can you make it faster?" Befor…
Optimization Priorities — Highest ROI First
Now that Nina knows where time goes, she needs to decide what to optimize first. Not all optimizations are equal — some…
Caching Layers in RAG Systems
With latency priorities set, Nina turns to a complementary strategy: avoiding redundant work entirely. RAG systems have…
Put everything together. Nina builds a customer support RAG chatbot that resolves 60% of tickets — making decisions about retrieval strategy, failure modes, trust calibration, and production observability.
Learning Goals
- Design a RAG architecture under real production constraints (latency, trust, freshness).
- Diagnose retrieval failures in production (exact ID search, stale data, lost in the middle).
- Implement trust calibration — knowing when the bot should escalate to a human.
- Build an observability strategy that catches quality drift before users do.
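A trust-calibration gate can be as simple as a pair of checks before the answer ships. This is a hedged sketch, not a prescription: the threshold value and the abstention phrases are illustrative placeholders you would tune against your own retrieval scores and model behavior.

```python
def should_escalate(retrieval_scores: list[float],
                    answer: str,
                    threshold: float = 0.35) -> bool:
    # Escalate to a human when the system has no business answering.
    if not retrieval_scores or max(retrieval_scores) < threshold:
        return True   # nothing in the knowledge base matched well: don't guess
    lowered = answer.lower()
    if "i don't know" in lowered or "not sure" in lowered:
        return True   # the model itself abstained: route to an agent
    return False      # confident retrieval and a committed answer

print(should_escalate([0.12, 0.08], "Refunds take 14 days."))  # True: weak retrieval
print(should_escalate([0.82, 0.61], "Refunds take 14 days."))  # False: answer the user
```

The design choice worth noticing: escalation is decided from signals the system already has (retrieval scores, the generated text), so it costs nothing extra at request time and every escalation event can be logged as an observability metric.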
Concept Card Preview
Visuals, diagrams, and micro-interactions you'll see in this module.
The Brief
Nina's VP pulls her into a meeting: "Our support team handles 3,000 tickets per day. Average resolution time is 4 hours.…
The Architecture
Nina sketches the system on a whiteboard. The retrieval layer uses the embedding pipeline she already knows: chunk the h…
The First Failure
Nina launches the beta to 100 internal support agents. Day one feedback: "The bot answered a question about our old pric…