Retrieval-Augmented Generation
Build RAG systems: chunking strategies, retrieval pipelines, prompt integration, and evaluation.
Why take this course?
The most practical pattern in AI engineering. Learn why RAG exists, how the retrieve-augment-generate pipeline works, how to measure and improve retrieval quality, how to diagnose failure modes, and how to ship production-ready systems with caching and observability.
Prerequisites
This course builds on concepts from the following courses; we recommend completing them first:
Course Modules
LLMs hallucinate, have knowledge cutoffs, and cannot access your private data. RAG solves all three by separating retrieval from reasoning: a knowledge base provides current, authoritative facts; the LLM provides the reasoning engine that synthesizes them into answers.
Learning Goals
- Explain why LLMs hallucinate and why hallucination is systematic rather than random.
- Understand training cutoffs and why fine-tuning is not a practical solution to stale knowledge.
- Describe the parametric vs. external memory distinction and how RAG bridges them.
- Identify which use cases are better suited to RAG vs. fine-tuning vs. both.
Concept Card Preview
Visuals, diagrams, and micro-interactions you'll see in this module.

The Hallucination Problem
Nina ships a customer-support chatbot powered by GPT-4. The first week goes well — until a user asks about the company's…
Knowledge Cutoffs and Stale Facts
Nina's hallucination problem has a cousin: staleness. Every LLM has a training cutoff — a date after which it kn…
RAG as External Memory
Hallucinations and staleness share a root cause: the model is forced to answer from parametric memory — knowledge co…
RAG has two pipelines: an offline ingestion pipeline that prepares the knowledge base, and an online query pipeline that retrieves and generates answers at request time. Understanding how each stage works — and how they must mirror each other — is the foundation for building a system that actually works.
Learning Goals
- Walk through all three RAG stages: Retrieve, Augment, Generate — and explain what each produces.
- Understand context window injection: how retrieved chunks are assembled into the augmented prompt.
- Describe the offline ingestion pipeline: load → clean → chunk → embed → upsert.
- Explain why preprocessing consistency between ingestion and query time is critical.
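The two pipelines above can be sketched end to end in a few dozen lines. This is an illustrative sketch only: `embed` here is a toy hashed bag-of-words stand-in for a real embedding model, and the document text is invented. The point it demonstrates is the last learning goal: `clean` and `embed` must be the exact same functions at ingestion time and at query time.

```python
import hashlib
import math

def clean(text: str) -> str:
    # Normalization must be identical at ingestion and query time:
    # a chunk embedded lowercase will never match a mixed-case query.
    return " ".join(text.lower().split())

def embed(text: str, dim: int = 64) -> list[float]:
    # Toy hashed bag-of-words vector, unit-normalized. A real system
    # would call an embedding model here; this is illustration only.
    vec = [0.0] * dim
    for token in clean(text).split():
        idx = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def chunk(doc: str, size: int = 40) -> list[str]:
    # Fixed-size word chunking; one of many possible strategies.
    words = doc.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

# Offline ingestion: load -> clean -> chunk -> embed -> upsert.
index: list[tuple[str, list[float]]] = []
docs = ["Refunds are issued within 14 days of purchase. "
        "Contact support with your order ID."]
for doc in docs:
    for c in chunk(clean(doc)):
        index.append((c, embed(c)))

# Online query: embed the query the SAME way, retrieve top-k, augment.
def retrieve(query: str, k: int = 3) -> list[str]:
    qv = embed(query)
    scored = sorted(index, key=lambda p: -sum(a * b for a, b in zip(qv, p[1])))
    return [text for text, _ in scored[:k]]

question = "how long do refunds take?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

The augmented `prompt` is what actually reaches the LLM: retrieved chunks first, then the user's question.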
Concept Card Preview
Visuals, diagrams, and micro-interactions you'll see in this module.
Retrieve, Augment, Generate
Nina knows why RAG exists — now she needs to understand how it works. The name spells out the three stages: **Retrie…

Context Window Injection
The augmented prompt must fit within the LLM's context window — the maximum number of tokens the model can process i…
Lost in the Middle
Nina runs an experiment: she places the correct answer in different positions within a 20-chunk context and measures acc…
Poor retrieval quality is the most common RAG failure mode. Learn how to measure retrieval with recall and precision, build evaluation test sets, and improve quality using cross-encoder reranking — so you can tell whether your changes make things better or worse.
Learning Goals
- Explain the recall vs. precision tradeoff in retrieval and how top-k affects each.
- Build a retrieval evaluation test set using manual annotation or synthetic LLM-generated pairs.
- Interpret retrieval metrics: Recall@k, Precision@k, MRR, and NDCG.
- Explain two-stage retrieval (bi-encoder + cross-encoder reranker) and when reranking is worth the latency.
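The first three metrics above are simple enough to compute by hand. A minimal sketch, using made-up chunk IDs and relevance labels (in practice these come from your annotated evaluation test set):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Fraction of all relevant chunks that appear in the top-k results.
    hits = sum(1 for d in retrieved[:k] if d in relevant)
    return hits / len(relevant)

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Fraction of the top-k results that are actually relevant.
    hits = sum(1 for d in retrieved[:k] if d in relevant)
    return hits / k

def mrr(queries: list[tuple[list[str], set[str]]]) -> float:
    # Mean Reciprocal Rank: 1/rank of the FIRST relevant result,
    # averaged over all queries. Rewards putting a hit near the top.
    total = 0.0
    for retrieved, relevant in queries:
        for rank, d in enumerate(retrieved, start=1):
            if d in relevant:
                total += 1.0 / rank
                break
    return total / len(queries)

retrieved = ["c7", "c2", "c9", "c4", "c1"]  # ranked retriever output
relevant = {"c2", "c4"}                      # ground-truth labels
print(recall_at_k(retrieved, relevant, 5))     # 1.0: both relevant chunks in top-5
print(precision_at_k(retrieved, relevant, 5))  # 0.4: only 2 of 5 results relevant
print(mrr([(retrieved, relevant)]))            # 0.5: first hit at rank 2
```

Note the tension the module explores: raising k can only raise Recall@k, but it typically lowers Precision@k.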
Concept Card Preview
Visuals, diagrams, and micro-interactions you'll see in this module.
Recall vs. Precision in Retrieval
Nina's RAG prototype returns answers, but some are wrong despite the correct chunks existing in her corpus. She needs a…

Key Retrieval Metrics
Knowing recall and precision exist is not enough — Nina needs specific metrics she can compute and track over time. **R…
End-to-End Evaluation with RAGAS
Retrieval metrics tell Nina whether the retriever works. But users care about the final answer, not whether chunk 4 was…
RAG can fail in ways that are hard to detect: the right chunk retrieved but ignored, too much context confusing the model, or information existing in the corpus but never found. Learn to diagnose each failure mode and apply targeted fixes.
Learning Goals
- Explain the "lost in the middle" phenomenon and why context position affects LLM answer quality.
- Identify context stuffing symptoms and understand why more retrieved chunks can hurt answer quality.
- Diagnose retrieval misses: vocabulary gaps, concept splitting, low-frequency information, format mismatch.
- Recognize LLM override — when the model ignores retrieved context in favor of parametric knowledge.
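One common mitigation for the "lost in the middle" effect can be sketched directly: since models attend most reliably to the start and end of the context, interleave the best-scoring chunks toward both edges and let the weakest sink toward the middle. This reordering heuristic is one approach among several, shown here with placeholder chunk labels:

```python
def reorder_for_position_bias(chunks_by_score: list[str]) -> list[str]:
    # Input is sorted best-first. Alternate chunks between the front
    # and the back of the context so the top results land at the
    # positions the LLM attends to most, and the weakest end up
    # buried in the middle.
    front, back = [], []
    for i, chunk in enumerate(chunks_by_score):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# "A" is the best-scoring chunk, "E" the worst.
print(reorder_for_position_bias(["A", "B", "C", "D", "E"]))
# ['A', 'C', 'E', 'D', 'B'] -> best chunk first, second-best last,
# weakest chunk pushed to the middle
```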
Concept Card Preview
Visuals, diagrams, and micro-interactions you'll see in this module.
Lost in the Middle
Nina's retrieval metrics look solid — Recall@5 is 0.85. But her users still get wrong answers. She digs into the logs an…
Context Stuffing
Knowing about positional bias, Nina tries a different approach: she increases top-k from 5 to 15, reasoning that retriev…
Retrieval Misses
Nina fixes context ordering and reduces top-k. Most queries work now. But she notices a pattern in the remaining failure…
A working prototype and a production system are very different things. Learn how to manage latency budgets (hint: the LLM dominates), implement caching strategies with appropriate risk management, combine dense and sparse retrieval for robust search, and build observability into your system so you know when it fails.
Learning Goals
- Break down RAG latency by component and identify the highest-ROI optimization targets.
- Implement embedding, retrieval, and semantic caches — and understand the risks of each.
- Explain hybrid retrieval (dense + sparse) and how Reciprocal Rank Fusion (RRF) merges their results.
- Design an observability strategy: what to log, what metrics to track, and how to detect silent failures.
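Reciprocal Rank Fusion is small enough to show in full. It merges ranked lists using only each document's rank, which sidesteps the problem that dense cosine scores and sparse BM25 scores live on incomparable scales. The chunk IDs below are invented; `k = 60` is the constant proposed in the original RRF paper:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Each document scores sum(1 / (k + rank)) across all input
    # rankings; documents ranked highly by several retrievers win.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["c3", "c1", "c7"]   # semantic (embedding) results, best first
sparse = ["c1", "c9", "c3"]   # keyword (BM25) results, best first
print(reciprocal_rank_fusion([dense, sparse]))
# ['c1', 'c3', 'c9', 'c7'] -> c1 wins by appearing near the top of BOTH lists
```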
Concept Card Preview
Visuals, diagrams, and micro-interactions you'll see in this module.
The RAG Latency Budget
Nina's RAG prototype works — but it takes six seconds to answer a question. Her PM asks: "Can you make it faster?" Befor…
Optimization Priorities — Highest ROI First
Now that Nina knows where time goes, she needs to decide what to optimize first. Not all optimizations are equal — some…
Caching Layers in RAG Systems
With latency priorities set, Nina turns to a complementary strategy: avoiding redundant work entirely. RAG systems have…
Put everything together. Nina builds a customer support RAG chatbot that resolves 60% of tickets — making decisions about retrieval strategy, failure modes, trust calibration, and production observability.
Learning Goals
- Design a RAG architecture under real production constraints (latency, trust, freshness).
- Diagnose retrieval failures in production (exact ID search, stale data, lost in the middle).
- Implement trust calibration — knowing when the bot should escalate to a human.
- Build an observability strategy that catches quality drift before users do.
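A trust-calibration gate can be as simple as a pair of checks before the answer ships. This is a hedged sketch, not a prescription: the threshold value and the abstention phrases are illustrative placeholders you would tune against your own retrieval scores and model behavior.

```python
def should_escalate(retrieval_scores: list[float],
                    answer: str,
                    threshold: float = 0.35) -> bool:
    # Escalate to a human when the system has no business answering.
    if not retrieval_scores or max(retrieval_scores) < threshold:
        return True   # nothing in the knowledge base matched well: don't guess
    lowered = answer.lower()
    if "i don't know" in lowered or "not sure" in lowered:
        return True   # the model itself abstained: route to an agent
    return False      # confident retrieval and a committed answer

print(should_escalate([0.12, 0.08], "Refunds take 14 days."))  # True: weak retrieval
print(should_escalate([0.82, 0.61], "Refunds take 14 days."))  # False: answer the user
```

The design choice worth noticing: escalation is decided from signals the system already has (retrieval scores, the generated text), so it costs nothing extra at request time and every escalation event can be logged as an observability metric.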
Concept Card Preview
Visuals, diagrams, and micro-interactions you'll see in this module.
The Brief
Nina's VP pulls her into a meeting: "Our support team handles 3,000 tickets per day. Average resolution time is 4 hours.…
The Architecture
Nina sketches the system on a whiteboard. The retrieval layer uses the embedding pipeline she already knows: chunk the h…
The First Failure
Nina launches the beta to 100 internal support agents. Day one feedback: "The bot answered a question about our old pric…