Fine-Tuning & Alignment
Go beyond prompting — adapt generative models to your domain using SFT, QLoRA, RLHF, and DPO.
Why take this course?
When prompting is not enough, fine-tuning lets you shape model behavior at the weight level. This course covers the full pipeline: supervised fine-tuning, parameter-efficient methods (LoRA/QLoRA), evaluation strategies, and preference alignment with RLHF and DPO.
Prerequisites
This course builds on concepts from the following courses; we recommend completing them first:
Course Modules
Fine-tuning is powerful but expensive. Learn the decision framework: when prompting is enough, when RAG is better, and when fine-tuning is the right tool. Covers cost-benefit analysis and the "prompting → RAG → fine-tuning" escalation ladder.
Learning Goals
- Apply a decision framework to choose between prompting, RAG, and fine-tuning for a given task.
- Estimate the cost and effort of fine-tuning vs. alternative approaches.
- Identify tasks where fine-tuning provides irreplaceable value (style, domain language, latency).
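The escalation ladder above can be sketched as a simple decision function. This is a hypothetical heuristic for illustration; the task attributes and their ordering are assumptions, not a fixed rule.

```python
# Hypothetical heuristic for the "prompting -> RAG -> fine-tuning" escalation
# ladder. The task attributes and their priority order are illustrative
# assumptions, not part of any library or standard.

def choose_adaptation(needs_private_knowledge: bool,
                      knowledge_changes_often: bool,
                      needs_consistent_style_or_format: bool,
                      prompt_already_works: bool) -> str:
    """Return the cheapest technique likely to solve the task."""
    if prompt_already_works:
        return "prompting"        # cheapest rung: keep iterating on the prompt
    if needs_private_knowledge or knowledge_changes_often:
        return "RAG"              # fresh or private facts: retrieve, don't train
    if needs_consistent_style_or_format:
        return "fine-tuning"      # style and domain language live in the weights
    return "prompting"            # default to the cheapest rung

print(choose_adaptation(False, False, True, False))  # -> fine-tuning
```

The point of the sketch is the ordering: each rung is tried only when the cheaper ones are ruled out, mirroring the cost-benefit analysis the module covers.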
The most common fine-tuning approach: teach a model new behaviors by training on curated input-output pairs. Covers dataset preparation, training pipelines, hyperparameter selection, and evaluation.
Learning Goals
- Prepare training datasets with quality filtering, deduplication, and format requirements.
- Configure SFT training runs with appropriate learning rates, batch sizes, and epoch counts.
- Evaluate fine-tuned models against held-out test sets and production baselines.
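The dataset-preparation step above can be sketched as a small cleaning pass: quality filtering plus exact deduplication over (prompt, response) pairs. The length thresholds are illustrative assumptions; production pipelines typically add fuzzy dedup and model-based quality scoring on top.

```python
# Minimal sketch of SFT dataset preparation: quality filtering and exact
# deduplication over (prompt, response) pairs. The length thresholds below
# are illustrative assumptions, not recommended values.

def prepare_sft_dataset(examples):
    seen = set()
    cleaned = []
    for ex in examples:
        prompt = ex.get("prompt", "").strip()
        response = ex.get("response", "").strip()
        # Quality filter: drop empty or trivially short pairs.
        if len(prompt) < 10 or len(response) < 5:
            continue
        # Exact dedup on the case-normalized pair.
        key = (prompt.lower(), response.lower())
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"prompt": prompt, "response": response})
    return cleaned

raw = [
    {"prompt": "Summarize the support ticket below.", "response": "User cannot log in."},
    {"prompt": "Summarize the support ticket below.", "response": "User cannot log in."},  # duplicate
    {"prompt": "Hi", "response": "ok"},  # too short, filtered out
]
print(len(prepare_sft_dataset(raw)))  # -> 1
```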
Full fine-tuning requires massive compute. LoRA and QLoRA let you adapt models by training a tiny fraction of parameters — making fine-tuning accessible on consumer hardware.
Learning Goals
- Explain how LoRA reduces trainable parameters via low-rank decomposition.
- Describe QLoRA and how quantization enables fine-tuning on consumer GPUs.
- Choose appropriate rank, alpha, and target modules for LoRA configurations.
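The parameter savings from low-rank decomposition can be made concrete with a few lines of arithmetic. For a d_out x d_in weight matrix W, LoRA freezes W and learns two small matrices B (d_out x r) and A (r x d_in), adding (alpha / r) * B @ A to the output. The 4096 x 4096 projection below is a hypothetical example size.

```python
# Sketch of LoRA's parameter savings. Instead of updating a d_out x d_in
# weight W, LoRA trains B (d_out x r) and A (r x d_in) and adds
# (alpha / r) * B @ A to the frozen W. Sizes below are a hypothetical
# 4096 x 4096 attention projection.

def lora_params(d_out: int, d_in: int, r: int) -> int:
    # Trainable parameters in the low-rank adapter B @ A.
    return d_out * r + r * d_in

d_out = d_in = 4096
full = d_out * d_in                      # full fine-tuning: 16,777,216 params
adapter = lora_params(d_out, d_in, r=8)  # LoRA with rank 8: 65,536 params
print(full, adapter, full // adapter)    # -> 16777216 65536 256
```

A 256x reduction per adapted matrix is why LoRA (and, with 4-bit quantization of the frozen base, QLoRA) fits on consumer GPUs; raising the rank r trades some of that saving for adapter capacity.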
SFT teaches what to say; alignment teaches what not to say. Learn how RLHF and DPO shape model behavior toward human preferences, safety, and helpfulness — and the alignment tax you pay.
Learning Goals
- Explain the RLHF pipeline: reward model training, PPO optimization, and the alignment tax.
- Describe DPO as a simpler alternative to RLHF that eliminates the reward model.
- Evaluate alignment quality and detect when alignment degrades task performance.
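The "DPO eliminates the reward model" claim can be seen directly in its loss: it needs only log-probabilities of the chosen and rejected responses under the policy and a frozen reference model. The sketch below computes the loss for a single preference pair; the numeric log-probs are made up for illustration.

```python
import math

# Sketch of the DPO loss for one preference pair. Inputs are summed
# log-probabilities of the chosen/rejected responses under the policy
# (pi_*) and the frozen reference model (ref_*). No reward model needed.

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Implicit reward margin: how much more the policy favors the chosen
    # response over the reference, minus the same quantity for the rejected.
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)): shrinks as the policy's preference
    # for the chosen response grows relative to the reference.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy favors the chosen response more than the reference does -> loss
# falls below -log(0.5), the value at zero margin. Values are made up.
print(dpo_loss(-10.0, -20.0, -12.0, -18.0))
```

At zero margin the loss is exactly log 2; gradient descent on this objective pushes the margin up, which is the same preference pressure RLHF applies via a learned reward model and PPO, but in one supervised-style training loop.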