Module 7 — Customization

Fine-Tuning LLMs

How to take a general-purpose LLM and teach it to do YOUR specific job — from SFT to LoRA to RLHF to DPO.

Pre-training vs Fine-tuning

Every powerful LLM goes through two phases. Pre-training gives the model general knowledge; fine-tuning makes it useful for a specific task.

ELI5

Pre-training = going to school for 12 years. You learn reading, math, science, history — a little bit of everything. Fine-tuning = going to medical school. Now you specialize. You already know how to read and think — you just need to learn the doctor stuff.

The Career Analogy

A pre-trained model is like a college graduate — smart, educated, but not yet specialized. Fine-tuning is the job training that turns them into a doctor, lawyer, or customer support agent. You don't re-teach them to read — you teach them the domain.

Two-Phase Training Pipeline

  • Phase 1, Pre-training: billions of tokens (books, Wikipedia, the web) teach the model to "read & think"; the output is a base model (GPT, LLaMA, etc.); takes weeks on thousands of GPUs.
  • Phase 2, Fine-tuning: thousands of task-specific examples teach the model "a specific job"; the output is a specialized model; takes hours on 1–8 GPUs.

Key Takeaway

  • Pre-training is expensive (millions of dollars) and gives general knowledge
  • Fine-tuning is cheap (a few hundred dollars) and gives specific capability
  • You almost never pre-train from scratch — you fine-tune an existing base model

Supervised Fine-Tuning (SFT)

The most straightforward approach: show the model instruction → response pairs and train it to produce the correct response for each instruction.

ELI5

Pre-training = going to school and learning everything. SFT = doing an internship at a hospital. The attending doctor says "When a patient has fever and cough, do X" and you learn by watching hundreds of these examples.

The Apprenticeship

SFT is like an apprentice watching a master chef. The master shows hundreds of recipes: "Given these ingredients → here's what you cook." After enough examples, the apprentice can cook new dishes on their own.

What Training Data Looks Like

Example 1
  Instruction: "Summarize this article about climate change in 3 bullet points."
  Response:
    • Global temps rose 1.1°C since 1900
    • CO₂ levels highest in 800K years
    • Ice sheets losing 150B tons/year

Example 2
  Instruction: "Write a Python function that reverses a string."
  Response: def reverse(s): return s[::-1]
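
Before training, each pair is flattened into one prompt-formatted string. A minimal sketch of such a formatter (the Alpaca-style "### Instruction / ### Response" template below is one common convention, not the only one):

Python
def format_example(instruction: str, response: str) -> str:
    """Render an instruction-response pair in an Alpaca-style prompt template."""
    return f"### Instruction:\n{instruction}\n\n### Response:\n{response}"

print(format_example("Write a Python function that reverses a string.",
                     "def reverse(s): return s[::-1]"))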

SFT with HuggingFace

Python
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer
from datasets import load_dataset

# Load base model + tokenizer
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Load instruction-response dataset
dataset = load_dataset("tatsu-lab/alpaca", split="train")

# Training config
args = TrainingArguments(
    output_dir="./sft-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
)

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    dataset_text_field="text",
)
trainer.train()

LoRA & QLoRA

Full fine-tuning updates all billions of parameters. LoRA (Low-Rank Adaptation) adds tiny "adapter" matrices and only trains those — making fine-tuning 10–100× cheaper.

ELI5

Imagine you have a giant textbook with 1,000 pages. Instead of rewriting the entire book, you just add sticky notes to the pages that need changes. The original book stays the same — the sticky notes are your LoRA adapters!

The Sticky Notes Analogy

Original weights = the textbook (frozen, untouched). LoRA matrices = tiny sticky notes (trainable). Rank (r) = how big the sticky notes are. Rank 4 = small Post-it. Rank 64 = full-page sticky note. QLoRA = same idea but the textbook is compressed (quantized to 4-bit) to save memory.

LoRA: Add Small Matrices Instead of Updating Everything

Full fine-tuning: all 7B parameters updated (100% trainable), ~28 GB VRAM needed. LoRA fine-tuning: the 7B base weights stay frozen and only ~8M LoRA parameters train, ~6 GB VRAM (QLoRA: ~4 GB). The math behind LoRA: W' = W + BA, where W is the frozen d×d weight matrix, B is d×r, A is r×d, and the rank r ≪ d (e.g. d = 4096, r = 8).
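
In code, a LoRA layer is just a frozen linear layer plus that trainable low-rank update. A from-scratch PyTorch sketch (simplified relative to what the peft library actually does):

Python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: W' = W + BA."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # the "textbook" stays frozen
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # A: r×d, small random init
        self.B = nn.Parameter(torch.zeros(d_out, r))        # B: d×r, zero init so W' = W at start
        self.scale = alpha / r

    def forward(self, x):
        # base(x) uses the frozen W; the second term is the sticky-note correction
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096), r=8)      # 2 × 4096 × 8 = 65,536 trainable params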

Interactive: LoRA Rank Explorer

Adjust the rank (r) to see how many parameters LoRA adds vs full fine-tuning; the sketch below computes the same numbers for a few representative ranks.

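A quick sketch of that relationship, assuming LoRA on the q_proj and v_proj matrices of a 32-layer model with hidden size d = 4096 (the Llama-2-7B shapes used above):

Python
d, layers, modules = 4096, 32, 2           # hidden size, layers, q_proj + v_proj
full = 6_738_415_616                       # total params in Llama-2-7B
for r in (4, 8, 16, 64):
    lora = layers * modules * 2 * d * r    # A (r×d) + B (d×r) per adapted module
    print(f"r={r:>2}: {lora:>12,} trainable params ({lora / full:.2%} of full)")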

LoRA with PEFT Library

Python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,                     # rank — size of the "sticky notes"
    lora_alpha=32,            # scaling factor
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# → "trainable: 4,194,304 / 6,738,415,616 (0.06%)"

RLHF — Reinforcement Learning from Human Feedback

SFT teaches the model what to say. RLHF teaches it how humans prefer it to say things — making outputs more helpful, harmless, and honest.

ELI5

Imagine training a dog. The dog does a trick → you say "Good boy!" (reward) or "Bad boy!" (penalty). Over time, the dog learns which tricks make you happy. RLHF works the same way — humans rate the model's outputs, and the model adjusts to get more "good boy" ratings.

The 3-Step RLHF Pipeline

Step 1 — Supervised Fine-Tuning (SFT)
Train on instruction→response pairs (what we learned above). This gives us a baseline model that can follow instructions.
Step 2 — Train a Reward Model
Show humans two model responses to the same prompt. They pick the better one. Train a separate model to predict which response humans prefer (a loss sketch follows this list).
Step 3 — PPO Optimization
Use the reward model as a "judge." The LLM generates responses, the reward model scores them, and PPO (Proximal Policy Optimization) nudges the LLM toward higher-scoring outputs.
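
The Step 2 training objective is simple: the reward model should score the chosen response above the rejected one. A minimal sketch of that pairwise loss (a Bradley–Terry-style objective commonly used for reward models; the scores here are toy values):

Python
import torch
import torch.nn.functional as F

def reward_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected)."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy scores for a batch of 3 preference pairs
loss = reward_loss(torch.tensor([1.2, 0.3, 2.0]), torch.tensor([0.4, 0.9, 1.1]))
print(loss)  # lower when chosen responses consistently outscore rejected ones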

The RLHF Pipeline

Step 1: SFT (instruction tuning) → SFT model. Step 2: human comparison ("Response A or B — which is better?") → reward model that predicts human preference. Step 3: PPO loop (generate → score → update, maximizing the reward signal) → RLHF-tuned LLM, aligned with humans.

Conceptual RLHF Code

Python — Conceptual
from transformers import AutoTokenizer
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead

# 1. Load SFT model + add value head for PPO
model = AutoModelForCausalLMWithValueHead.from_pretrained("sft-model")
tokenizer = AutoTokenizer.from_pretrained("sft-model")

# 2. Load reward model (load_reward_model is a placeholder: stands in
#    for however you load your preference-trained reward model)
reward_model = load_reward_model("reward-model")

ppo_config = PPOConfig(batch_size=16, learning_rate=1.4e-5)
ppo_trainer = PPOTrainer(config=ppo_config, model=model, tokenizer=tokenizer)

for batch in dataloader:                          # dataloader yields prompt batches
    prompts = batch["prompt"]
    responses = model.generate(prompts)           # LLM generates
    rewards = reward_model.score(responses)       # Reward model judges
    ppo_trainer.step(prompts, responses, rewards) # PPO updates weights

Interactive: Be the Human Rater!

Pick which response is better — this is exactly what RLHF annotators do.

DPO — Direct Preference Optimization

RLHF works great but is complex — you need a separate reward model + PPO. DPO simplifies this: just show pairs of (better answer, worse answer) and the model learns directly, no reward model needed.

ELI5

RLHF = hiring a judge, then training the dog based on what the judge says. DPO = skipping the judge entirely. You just show the dog two tricks side by side: "This one was good, this one was bad." The dog figures it out directly.

The Shortcut

RLHF takes the scenic route: SFT → Reward Model → PPO → Final Model (3 models to train). DPO takes the highway: SFT → Direct training on preferences → Final Model (just 1 extra step). Same destination, less gas.

RLHF vs DPO Pipeline Comparison

RLHF (complex): SFT model → human preferences (A > B pairs) → reward model → PPO → aligned model. Three models to train; complex and can be unstable. DPO (simple): SFT model → the same preference data → direct DPO optimization → aligned model. One training step; simple and stable.

RLHF vs DPO — When to Use What?

  • RLHF: better for very large models where you need fine-grained control; used by OpenAI for ChatGPT
  • DPO: simpler, more stable, increasingly popular; used by Meta for Llama-2-Chat
  • Both need human preference data — the difference is how they use it
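
With trl, DPO is a single trainer call over (prompt, chosen, rejected) triples. A minimal sketch, assuming an already SFT-tuned checkpoint (the "sft-model" name is illustrative) and the classic trl DPOTrainer interface; argument names have shifted across trl versions:

Python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model = AutoModelForCausalLM.from_pretrained("sft-model")   # your SFT checkpoint
tokenizer = AutoTokenizer.from_pretrained("sft-model")

# Preference data: the same (prompt, chosen, rejected) triples RLHF would use
dataset = Dataset.from_list([
    {"prompt": "Explain LoRA briefly.",
     "chosen": "LoRA freezes the base weights and trains small adapter matrices.",
     "rejected": "LoRA is a type of parrot."},
    # Add many more preference pairs here
])

trainer = DPOTrainer(
    model=model,
    ref_model=None,        # trl keeps a frozen copy of the model as the reference
    beta=0.1,              # how strongly to stay close to the reference model
    train_dataset=dataset,
    tokenizer=tokenizer,
    args=TrainingArguments(output_dir="./dpo-output", per_device_train_batch_size=2,
                           remove_unused_columns=False),
)
trainer.train()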

Complete Fine-Tuning Code

A complete, runnable example: fine-tune a small model with LoRA on your own dataset using HuggingFace transformers + PEFT.

ELI5

We're going to take a pre-trained model, slap on some tiny LoRA adapters (sticky notes!), show it our custom instruction data, and train. The whole thing runs on a single GPU.

Python — Full Pipeline
import torch
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer

# --- 1. Prepare your custom dataset ---
data = [
    {"text": "### Instruction: Explain LoRA\n### Response: LoRA adds small adapter matrices..."},
    {"text": "### Instruction: What is RLHF?\n### Response: RLHF uses human feedback..."},
    # Add hundreds more examples here
]
dataset = Dataset.from_list(data)

# --- 2. Load model in 4-bit (QLoRA) ---
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_name = "mistralai/Mistral-7B-v0.1"
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# --- 3. Add LoRA adapters ---
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj",
    "k_proj", "o_proj"], lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)

# --- 4. Train! ---
trainer = SFTTrainer(
    model=model, train_dataset=dataset, tokenizer=tokenizer,
    dataset_text_field="text", max_seq_length=512,
    args=TrainingArguments(
        output_dir="./lora-output", num_train_epochs=3,
        per_device_train_batch_size=4, gradient_accumulation_steps=4,
        learning_rate=2e-4, bf16=True, logging_steps=10,  # bf16 matches the compute dtype above
        save_strategy="epoch", warmup_ratio=0.03,
    ),
)
trainer.train()

# --- 5. Save & use ---
model.save_pretrained("./my-lora-adapter")
print("Done! Adapter saved. Merge with base model for deployment.")

Interactive: Compare Fine-Tuning Methods

Select a method to see its cost, complexity, and use case.

Module Summary

  • Pre-training = general knowledge; Fine-tuning = specific skill
  • SFT = train on instruction-response pairs (the foundation)
  • LoRA/QLoRA = add tiny adapters instead of updating all weights (10–100× cheaper)
  • RLHF = use human preferences + reward model + PPO to align outputs
  • DPO = simpler alternative — learn from preference pairs directly

Test Your Knowledge

Let's see how well you understood fine-tuning! Answer these 6 questions.

Q1: What is the main difference between pre-training and fine-tuning an LLM?
Q2: What is the purpose of Supervised Fine-Tuning (SFT)?
Q3: Why is LoRA (Low-Rank Adaptation) so much more efficient than full fine-tuning?
Q4: What does QLoRA add on top of regular LoRA?
Q5: In RLHF, what is the role of the reward model?
Q6: What is the key advantage of DPO (Direct Preference Optimization) over RLHF?
