Fine-Tuning & Alignment
Go beyond prompting — adapt generative models to your domain using SFT, QLoRA, RLHF, and DPO.
Why take this course?
When prompting is not enough, fine-tuning lets you shape model behavior at the weight level. This course covers the full pipeline: supervised fine-tuning, parameter-efficient methods (LoRA/QLoRA), evaluation strategies, and preference alignment with RLHF and DPO.
Prerequisites
This course builds on concepts from the following courses; we recommend completing them first:
Course Modules
Fine-tuning is powerful but expensive. Learn the decision framework: when prompting is enough, when RAG is better, and when fine-tuning is the right tool. Covers cost-benefit analysis and the "prompting → RAG → fine-tuning" escalation ladder.
Learning Goals
- Apply a decision framework to choose between prompting, RAG, and fine-tuning for a given task.
- Estimate the cost and effort of fine-tuning vs. alternative approaches.
- Identify tasks where fine-tuning provides irreplaceable value (style, domain language, latency).
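The escalation ladder above can be sketched as a simple decision function. This is a hypothetical heuristic for illustration; the task attributes and their ordering are assumptions, not a fixed rule.

```python
# Hypothetical heuristic for the "prompting -> RAG -> fine-tuning" escalation
# ladder. The task attributes and their priority order are illustrative
# assumptions, not part of any library or standard.

def choose_adaptation(needs_private_knowledge: bool,
                      knowledge_changes_often: bool,
                      needs_consistent_style_or_format: bool,
                      prompt_already_works: bool) -> str:
    """Return the cheapest technique likely to solve the task."""
    if prompt_already_works:
        return "prompting"        # cheapest rung: keep iterating on the prompt
    if needs_private_knowledge or knowledge_changes_often:
        return "RAG"              # fresh or private facts: retrieve, don't train
    if needs_consistent_style_or_format:
        return "fine-tuning"      # style and domain language live in the weights
    return "prompting"            # default to the cheapest rung

print(choose_adaptation(False, False, True, False))  # -> fine-tuning
```

The point of the sketch is the ordering: each rung is tried only when the cheaper ones are ruled out, mirroring the cost-benefit analysis the module covers.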
The most common fine-tuning approach: teach a model new behaviors by training on curated input-output pairs. Covers dataset preparation, training pipelines, hyperparameter selection, and evaluation.
Learning Goals
- Prepare training datasets with quality filtering, deduplication, and format requirements.
- Configure SFT training runs with appropriate learning rates, batch sizes, and epoch counts.
- Evaluate fine-tuned models against held-out test sets and production baselines.
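The dataset-preparation step above can be sketched as a small cleaning pass: quality filtering plus exact deduplication over (prompt, response) pairs. The length thresholds are illustrative assumptions; production pipelines typically add fuzzy dedup and model-based quality scoring on top.

```python
# Minimal sketch of SFT dataset preparation: quality filtering and exact
# deduplication over (prompt, response) pairs. The length thresholds below
# are illustrative assumptions, not recommended values.

def prepare_sft_dataset(examples):
    seen = set()
    cleaned = []
    for ex in examples:
        prompt = ex.get("prompt", "").strip()
        response = ex.get("response", "").strip()
        # Quality filter: drop empty or trivially short pairs.
        if len(prompt) < 10 or len(response) < 5:
            continue
        # Exact dedup on the case-normalized pair.
        key = (prompt.lower(), response.lower())
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"prompt": prompt, "response": response})
    return cleaned

raw = [
    {"prompt": "Summarize the support ticket below.", "response": "User cannot log in."},
    {"prompt": "Summarize the support ticket below.", "response": "User cannot log in."},  # duplicate
    {"prompt": "Hi", "response": "ok"},  # too short, filtered out
]
print(len(prepare_sft_dataset(raw)))  # -> 1
```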
Full fine-tuning requires massive compute. LoRA and QLoRA let you adapt models by training a tiny fraction of parameters — making fine-tuning accessible on consumer hardware.
Learning Goals
- Explain how LoRA reduces trainable parameters via low-rank decomposition.
- Describe QLoRA and how quantization enables fine-tuning on consumer GPUs.
- Choose appropriate rank, alpha, and target modules for LoRA configurations.
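The parameter savings from low-rank decomposition can be made concrete with a few lines of arithmetic. For a d_out x d_in weight matrix W, LoRA freezes W and learns two small matrices B (d_out x r) and A (r x d_in), adding (alpha / r) * B @ A to the output. The 4096 x 4096 projection below is a hypothetical example size.

```python
# Sketch of LoRA's parameter savings. Instead of updating a d_out x d_in
# weight W, LoRA trains B (d_out x r) and A (r x d_in) and adds
# (alpha / r) * B @ A to the frozen W. Sizes below are a hypothetical
# 4096 x 4096 attention projection.

def lora_params(d_out: int, d_in: int, r: int) -> int:
    # Trainable parameters in the low-rank adapter B @ A.
    return d_out * r + r * d_in

d_out = d_in = 4096
full = d_out * d_in                      # full fine-tuning: 16,777,216 params
adapter = lora_params(d_out, d_in, r=8)  # LoRA with rank 8: 65,536 params
print(full, adapter, full // adapter)    # -> 16777216 65536 256
```

A 256x reduction per adapted matrix is why LoRA (and, with 4-bit quantization of the frozen base, QLoRA) fits on consumer GPUs; raising the rank r trades some of that saving for adapter capacity.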
SFT teaches what to say; alignment teaches what not to say. Learn how RLHF and DPO shape model behavior toward human preferences, safety, and helpfulness — and the alignment tax you pay.
Learning Goals
- Explain the RLHF pipeline: reward model training, PPO optimization, and the alignment tax.
- Describe DPO as a simpler alternative to RLHF that eliminates the reward model.
- Evaluate alignment quality and detect when alignment degrades task performance.
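The "DPO eliminates the reward model" claim can be seen directly in its loss: it needs only log-probabilities of the chosen and rejected responses under the policy and a frozen reference model. The sketch below computes the loss for a single preference pair; the numeric log-probs are made up for illustration.

```python
import math

# Sketch of the DPO loss for one preference pair. Inputs are summed
# log-probabilities of the chosen/rejected responses under the policy
# (pi_*) and the frozen reference model (ref_*). No reward model needed.

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Implicit reward margin: how much more the policy favors the chosen
    # response over the reference, minus the same quantity for the rejected.
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)): shrinks as the policy's preference
    # for the chosen response grows relative to the reference.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy favors the chosen response more than the reference does -> loss
# falls below -log(0.5), the value at zero margin. Values are made up.
print(dpo_loss(-10.0, -20.0, -12.0, -18.0))
```

At zero margin the loss is exactly log 2; gradient descent on this objective pushes the margin up, which is the same preference pressure RLHF applies via a learned reward model and PPO, but in one supervised-style training loop.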