How to take a general-purpose LLM and teach it to do YOUR specific job — from SFT to LoRA to RLHF to DPO.
Every powerful LLM goes through two phases. Pre-training gives the model general knowledge; fine-tuning makes it useful for a specific task.
Pre-training = going to school for 12 years. You learn reading, math, science, history — a little bit of everything. Fine-tuning = going to medical school. Now you specialize. You already know how to read and think — you just need to learn the doctor stuff.
A pre-trained model is like a college graduate — smart, educated, but not yet specialized. Fine-tuning is the job training that turns them into a doctor, lawyer, or customer support agent. You don't re-teach them to read — you teach them the domain.
Supervised fine-tuning (SFT) is the most straightforward approach: show the model instruction → response pairs and train it to produce the correct response for each instruction.
Pre-training = going to school and learning everything. SFT = doing an internship at a hospital. The attending doctor says "When a patient has fever and cough, do X" and you learn by watching hundreds of these examples.
SFT is like an apprentice watching a master chef. The master shows hundreds of recipes: "Given these ingredients → here's what you cook." After enough examples, the apprentice can cook new dishes on their own.
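Concretely, each training example is just an instruction paired with its ideal response, formatted into a single string the model learns to reproduce token by token. A minimal sketch (the Alpaca-style "### Instruction / ### Response" template below is one common convention, assumed here, not required):

def format_example(instruction: str, response: str) -> str:
    # Concatenate the pair into the single text sequence the model trains on.
    return (
        "### Instruction:\n" + instruction + "\n\n"
        "### Response:\n" + response
    )

pair = {
    "instruction": "Summarize this support ticket in one sentence.",
    "response": "The customer cannot log in after resetting their password.",
}

print(format_example(pair["instruction"], pair["response"]))
# During SFT, the model is trained to predict exactly this text, token by token.

The training code that follows uses a public dataset (Alpaca) that is already formatted this way.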
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer
from datasets import load_dataset

# Load base model + tokenizer
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Load instruction-response dataset
dataset = load_dataset("tatsu-lab/alpaca", split="train")

# Training config
args = TrainingArguments(
    output_dir="./sft-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
)

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    dataset_text_field="text",
)
trainer.train()
Full fine-tuning updates all of the model's billions of parameters. LoRA (Low-Rank Adaptation) freezes them and adds tiny "adapter" matrices, training only those, which makes fine-tuning 10–100× cheaper.
Imagine you have a giant textbook with 1,000 pages. Instead of rewriting the entire book, you just add sticky notes to the pages that need changes. The original book stays the same — the sticky notes are your LoRA adapters!
Original weights = the textbook (frozen, untouched). LoRA matrices = tiny sticky notes (trainable). Rank (r) = how big the sticky notes are. Rank 4 = small Post-it. Rank 64 = full-page sticky note. QLoRA = same idea but the textbook is compressed (quantized to 4-bit) to save memory.
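Under the hood, each "sticky note" is a pair of small matrices: the frozen weight W stays put, and LoRA learns A and B so the layer computes Wx + (alpha/r)·B(Ax). Here's a minimal PyTorch sketch of the idea, plus the parameter arithmetic for a single 4096×4096 layer (illustrative only; in practice the peft library shown next handles this for you):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA wrapper: frozen weight + low-rank trainable update."""
    def __init__(self, d_in, d_out, r=8, alpha=32):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad = False              # the "textbook" stays frozen
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # sticky note, part 1
        self.B = nn.Parameter(torch.zeros(d_out, r))         # sticky note, part 2 (zero-init: adapter starts as a no-op)
        self.scale = alpha / r

    def forward(self, x):
        # Original output plus the scaled low-rank update.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(4096, 4096, r=8)
full = 4096 * 4096          # parameters a full fine-tune would update in this layer
lora = 8 * 4096 * 2         # parameters LoRA actually trains (A and B)
print(f"full: {full:,}  lora: {lora:,}  ({lora / full:.2%})")
# full: 16,777,216  lora: 65,536  (0.39%)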
The rank r controls the size of those adapter matrices, and therefore how many trainable parameters LoRA adds compared to full fine-tuning.
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,                     # rank: size of the "sticky notes"
    lora_alpha=32,           # scaling factor
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# → "trainable: 4,194,304 / 6,738,415,616 (0.06%)"
SFT teaches the model what to say. RLHF (Reinforcement Learning from Human Feedback) teaches it how humans prefer it to say things, making outputs more helpful, harmless, and honest.
Imagine training a dog. The dog does a trick → you say "Good boy!" (reward) or "Bad boy!" (penalty). Over time, the dog learns which tricks make you happy. RLHF works the same way — humans rate the model's outputs, and the model adjusts to get more "good boy" ratings.
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead

# 1. Load SFT model + add value head for PPO
model = AutoModelForCausalLMWithValueHead.from_pretrained("sft-model")

# 2. Load reward model (trained on human preferences)
reward_model = load_reward_model("reward-model")   # placeholder: load your own reward model here

ppo_config = PPOConfig(batch_size=16, learning_rate=1.4e-5)
ppo_trainer = PPOTrainer(config=ppo_config, model=model, tokenizer=tokenizer)

for batch in dataloader:                            # your dataloader of prompts
    prompts = batch["prompt"]
    responses = model.generate(prompts)             # LLM generates
    rewards = reward_model.score(responses)         # Reward model judges
    ppo_trainer.step(prompts, responses, rewards)   # PPO updates weights
Human preference data comes from annotators doing exactly this: they look at two model responses to the same prompt and pick which one is better. Those comparisons are what the reward model is trained on.
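The training objective for that reward model is a pairwise (Bradley–Terry style) loss: score the chosen response higher than the rejected one. A minimal sketch, assuming the scalar scores come from the reward model's forward pass on each response:

import torch
import torch.nn.functional as F

def pairwise_reward_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    # Push the chosen response's score above the rejected response's score.
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Toy scores a reward model might produce for a batch of comparison pairs.
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.5, 0.4, -1.0])
print(pairwise_reward_loss(chosen, rejected))   # lower loss = better-separated scores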
RLHF works great but is complex: you need a separate reward model plus PPO. DPO (Direct Preference Optimization) simplifies this: just show pairs of (better answer, worse answer) and the model learns from them directly, no reward model needed.
RLHF = hiring a judge, then training the dog based on what the judge says. DPO = skipping the judge entirely. You just show the dog two tricks side by side: "This one was good, this one was bad." The dog figures it out directly.
RLHF takes the scenic route: SFT → Reward Model → PPO → Final Model (3 models to train). DPO takes the highway: SFT → Direct training on preferences → Final Model (just 1 extra step). Same destination, less gas.
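In code, the DPO objective is just a loss over log-probabilities: how much more the model prefers the chosen answer over the rejected one, measured against a frozen reference copy (usually the SFT checkpoint). A minimal sketch, assuming the summed per-token log-probabilities for each full response have already been computed:

import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Margin by which the policy favors chosen over rejected, minus the same
    # margin under the frozen reference (SFT) model. No reward model anywhere.
    policy_margin = logp_chosen - logp_rejected
    ref_margin = ref_logp_chosen - ref_logp_rejected
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy log-probabilities (sum of token log-probs for each response).
loss = dpo_loss(
    logp_chosen=torch.tensor([-12.0]), logp_rejected=torch.tensor([-15.0]),
    ref_logp_chosen=torch.tensor([-13.0]), ref_logp_rejected=torch.tensor([-14.0]),
)
print(loss)

In practice, TRL's DPOTrainer wraps this loss for you; you supply a dataset with prompt, chosen, and rejected columns.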
A complete, runnable example: fine-tune a 7B model with QLoRA (4-bit quantization + LoRA) on your own dataset using Hugging Face transformers, PEFT, and TRL.
We're going to take a pre-trained model, slap on some tiny LoRA adapters (sticky notes!), show it our custom instruction data, and train. The whole thing runs on a single GPU.
import torch
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer

# --- 1. Prepare your custom dataset ---
data = [
    {"text": "### Instruction: Explain LoRA\n### Response: LoRA adds small adapter matrices..."},
    {"text": "### Instruction: What is RLHF?\n### Response: RLHF uses human feedback..."},
    # Add hundreds more examples here
]
dataset = Dataset.from_list(data)

# --- 2. Load model in 4-bit (QLoRA) ---
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_name = "mistralai/Mistral-7B-v0.1"
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# --- 3. Add LoRA adapters ---
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# --- 4. Train! ---
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    tokenizer=tokenizer,
    dataset_text_field="text",
    max_seq_length=512,
    args=TrainingArguments(
        output_dir="./lora-output",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=10,
        save_strategy="epoch",
        warmup_ratio=0.03,
    ),
)
trainer.train()

# --- 5. Save & use ---
model.save_pretrained("./my-lora-adapter")
print("Done! Adapter saved. Merge with base model for deployment.")
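To use the result, load the base model again and attach the saved adapter, optionally merging it into the base weights for a single deployable checkpoint. A sketch using PEFT's PeftModel, assuming the paths from the script above:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# Attach the trained LoRA adapter on top of the frozen base weights.
model = PeftModel.from_pretrained(base, "./my-lora-adapter")

# Optionally fold the adapter into the base weights for deployment.
model = model.merge_and_unload()

prompt = "### Instruction: Explain LoRA\n### Response:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))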
Each method trades off cost, complexity, and use case: plain SFT is the simplest, LoRA and QLoRA deliver most of its benefit at a fraction of the cost, RLHF is the most complex (reward model plus PPO) but shapes outputs around human preferences, and DPO reaches a similar result with a much simpler training loop.