Take your LLM from a notebook prototype to a fast, safe, production-ready system that real users depend on.
Training a model is only half the battle. When real users hit your API, every millisecond counts. Inference optimization is the art of generating tokens as quickly and cheaply as possible.
Autoregressive generation means the model produces one token at a time, each depending on every previous token. For a 500-token response, the model runs forward 500 times. Without tricks, each pass recomputes attention over the entire sequence from scratch.
Imagine a sit-down restaurant where the waiter walks back to the kitchen, re-reads the entire order from scratch, and brings one dish at a time. That's naive inference.
A drive-through keeps your order on a sticky note (the KV-cache) so each new item is added instantly without re-reading everything. Way faster!
The Key-Value cache stores the K and V matrices from previous tokens so they never need recomputing. At step t, only the new token's K and V are computed and appended. This turns an O(n²) operation into O(n) per step.
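To make this concrete, here is a minimal hand-rolled decoding loop that reuses the cache via Hugging Face Transformers. This is a sketch: greedy decoding, GPT-2 as a stand-in model, and in practice model.generate handles all of this for you.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

input_ids = tokenizer("The drive-through is", return_tensors="pt").input_ids
past_key_values = None  # the KV cache, empty before the first step
next_token = None

for _ in range(20):
    with torch.no_grad():
        if past_key_values is None:
            # First step: full forward pass over the whole prompt
            out = model(input_ids, use_cache=True)
        else:
            # Later steps: feed ONLY the new token; cached K/V cover the rest
            out = model(next_token, past_key_values=past_key_values, use_cache=True)
    past_key_values = out.past_key_values            # cache grows by one position
    next_token = out.logits[:, -1:].argmax(dim=-1)   # greedy pick, shape (1, 1)
    input_ids = torch.cat([input_ids, next_token], dim=-1)

print(tokenizer.decode(input_ids[0]))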
GPUs are parallel beasts — running one request wastes most of their capacity. Continuous batching groups multiple user requests together so the GPU stays saturated. Unlike static batching, new requests join the batch as soon as a slot frees up.
Static batching is like an elevator that waits until it's completely full before moving. Continuous batching lets people hop on and off at every floor — the elevator never stops to wait.
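In practice you rarely implement continuous batching yourself; serving engines such as vLLM do it under the hood. A minimal sketch, where the model name and prompts are just illustrations:

from vllm import LLM, SamplingParams

# vLLM's scheduler applies continuous batching: new requests join the
# running batch as soon as earlier sequences finish and free a slot.
llm = LLM(model="facebook/opt-125m")  # small stand-in model

prompts = [
    "The KV cache speeds up inference because",
    "Continuous batching keeps the GPU busy by",
]
params = SamplingParams(temperature=0.8, max_tokens=50)

for out in llm.generate(prompts, params):
    print(out.prompt, "->", out.outputs[0].text)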
A 7-billion-parameter model in FP32 needs 28 GB of memory just for the weights. Most GPUs can't hold that. Quantization shrinks each number's precision so the model fits on smaller hardware with minimal quality loss.
FP32 (32 bits per weight) → FP16 (16 bits) → INT8 (8 bits) → INT4 (4 bits). Each step halves the bits per weight, cutting memory use roughly in half.
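The arithmetic for a 7-billion-parameter model, counting weights only (activations and the KV cache come on top):

# Weight memory = parameter count × bits per weight / 8 bits per byte
params = 7e9
for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {params * bits / 8 / 1e9:.1f} GB")
# FP32: 28.0 GB   FP16: 14.0 GB   INT8: 7.0 GB   INT4: 3.5 GB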
FP32 is measuring with a ruler that has marks every 0.001 mm — super precise but heavy to carry. INT4 is a pocket ruler with marks every centimetre — much lighter, and for most tasks, close enough!
Drag the slider to change bit precision and watch the model shrink!
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype="float16",
    bnb_4bit_quant_type="nf4",          # normalized float 4-bit
    bnb_4bit_use_double_quant=True,     # quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=quant_config,
    device_map="auto",
)
# 7B model now fits in ~3.5 GB VRAM!
Quantization is like saving a photo as a JPEG instead of a raw BMP. The file is dramatically smaller and for most viewers, the image looks identical — only if you zoom to pixel level do you notice tiny differences.
LLMs have a knowledge cutoff and sometimes hallucinate facts. RAG solves this by fetching real documents at query time and stuffing them into the prompt so the model has reference material.
A closed-book exam forces you to rely on memory — you might mix things up. An open-book exam lets you flip to the right page and quote directly. RAG gives the LLM an open book!
1) Embed your documents into vectors and store them.
2) At query time, embed the user's question.
3) Retrieve the top-K most similar chunks.
4) Inject those chunks into the prompt.
5) Let the LLM generate an answer grounded in real data.
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA

# 1. Load and chunk documents
docs = TextLoader("knowledge_base.txt").load()
chunks = RecursiveCharacterTextSplitter(
    chunk_size=500, chunk_overlap=50
).split_documents(docs)

# 2. Embed and store in a vector DB
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vector_store = FAISS.from_documents(chunks, embeddings)

# 3. Build the retrieval chain
qa_chain = RetrievalQA.from_chain_type(
    llm=HuggingFacePipeline.from_model_id(
        "google/flan-t5-base", task="text2text-generation"
    ),
    retriever=vector_store.as_retriever(search_kwargs={"k": 3}),
)

answer = qa_chain.run("What is retrieval augmented generation?")
print(answer)
The same model can give wildly different answers depending on how you ask. Prompt engineering is the craft of structuring your input for the best output — no retraining required.
Asking a stranger "Where's the place?" (zero-shot) gets a confused look. Saying "I'm looking for the nearest coffee shop — last time I asked, someone said go past the park and turn left" (few-shot with context) gets a precise answer!
Zero-shot: Just ask the question directly.
Few-shot: Provide examples of input → output pairs.
Chain-of-thought (CoT): Ask the model to reason step by step before answering.
Click a strategy to see how the prompt and output change for the same question.
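As static text, the three strategies could look like this for one question (the question and the few-shot pairs below are our own illustrations):

question = (
    "A bat and a ball cost $1.10 in total. The bat costs $1.00 "
    "more than the ball. How much does the ball cost?"
)

# Zero-shot: just the question
zero_shot = question

# Few-shot: show input -> output pairs before the real question
few_shot = (
    "Q: I have 3 apples and eat 1. How many are left?\nA: 2\n\n"
    "Q: A pen costs $2 and a notebook costs $3. What is the total?\nA: $5\n\n"
    f"Q: {question}\nA:"
)

# Chain-of-thought: invite step-by-step reasoning
chain_of_thought = question + "\nLet's think step by step."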
Deploying an LLM means putting a confident, fluent writer in front of your users. The problem? It can be confidently wrong, biased, or manipulated.
LLMs are like the student who always raises their hand, even when they don't know the answer. They'll give a beautifully worded response that sounds right but might be completely made up. You need a teacher (guardrails) checking their work!
The model generates plausible-sounding but factually incorrect text. Mitigation: RAG (ground in real data), temperature reduction, and asking the model to say "I don't know" when uncertain.
Models inherit biases from training data. A model trained on internet text absorbs stereotypes. Mitigation: evaluation benchmarks (e.g., BBQ, WinoBias), human review, and diverse training data.
Clever prompts can trick models into bypassing safety guidelines. Examples: "Pretend you're an evil AI…" or role-play scenarios. Mitigation: input/output filters, red-teaming, and RLHF alignment.
You wouldn't deploy a self-driving car without lane markers, speed limits, and emergency braking. LLMs need the same: guardrails that keep outputs within safe boundaries, even when the model is "driving" on its own.
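To give a flavour of what a guardrail layer looks like in code, here's a deliberately naive sketch. Everything in it is hypothetical: the patterns are toy examples, generate_fn stands in for whatever calls your model, and production systems use dedicated moderation models and red-teamed rule sets rather than a few regexes.

import re

# Crude input filter: block obvious jailbreak phrasings (toy patterns only)
BLOCKED_PATTERNS = [
    r"pretend you(?:'re| are) an evil ai",
    r"ignore (?:all|your) (?:previous|prior) instructions",
]

# Output-side guardrail: steer the model toward grounded, honest answers
SYSTEM_PROMPT = (
    "Answer only from the provided context. "
    'If the context does not contain the answer, say "I don\'t know."'
)

def check_input(prompt: str) -> bool:
    """Return True if the prompt passes the (naive) input filter."""
    lowered = prompt.lower()
    return not any(re.search(p, lowered) for p in BLOCKED_PATTERNS)

def guarded_generate(prompt: str, generate_fn) -> str:
    if not check_input(prompt):
        return "Request blocked by safety filter."
    return generate_fn(f"{SYSTEM_PROMPT}\n\n{prompt}")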
The final step: wrap your model in a REST API so any application can call it. We'll use FastAPI (async, fast, auto-docs) + Hugging Face Transformers.
Your model is a chef locked in a kitchen. An API is the vending machine window — users press a button (send a request), and food (generated text) comes out. They never need to see the kitchen!
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

app = FastAPI(title="LLM API")

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 150
    temperature: float = 0.7

class GenerateResponse(BaseModel):
    text: str
    tokens_generated: int

@app.post("/generate", response_model=GenerateResponse)
async def generate(req: GenerateRequest):
    if not req.prompt.strip():
        raise HTTPException(400, "Prompt cannot be empty")
    inputs = tokenizer(req.prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=req.max_tokens,
            temperature=req.temperature,
            do_sample=True,
        )
    text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    n_gen = outputs.shape[1] - inputs["input_ids"].shape[1]
    return GenerateResponse(text=text, tokens_generated=n_gen)

@app.get("/health")
async def health():
    return {"status": "ok"}
# Start the server
uvicorn app:app --host 0.0.0.0 --port 8000

# In another terminal, send a request
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "The future of AI is", "max_tokens": 50}'
device_map="auto" and serve behind a load balancerBuilding the model is learning to cook. Serving via API is opening a restaurant: you need a menu (docs), a front door (endpoint), a health inspector (monitoring), and a fire exit (error handling). The cooking is the easy part!
Time to check what you learned about deploying LLMs to production! Answer all 6 questions.
Module 8 of the LLM Engineering Course | Built by Fakhruddin Khambaty