MODULE 8

Deployment & Production

Take your LLM from a notebook prototype to a fast, safe, production-ready system that real users depend on.

Part 1: Making the Model Fast

Training a model is only half the battle. When real users hit your API, every millisecond counts. Inference optimization is the art of generating tokens as quickly and cheaply as possible.

Why Inference Is Slow

Autoregressive generation means the model produces one token at a time, each depending on every previous token. For a 500-token response, the model runs forward 500 times. Without tricks, each pass recomputes attention over the entire sequence from scratch.

ELI5: Drive-Through vs Sit-Down Restaurant

Imagine a sit-down restaurant where the waiter walks back to the kitchen, re-reads the entire order from scratch, and brings one dish at a time. That's naive inference.

A drive-through keeps your order on a sticky note (the KV-cache) so each new item is added instantly without re-reading everything. Way faster!

KV-Cache

The Key-Value cache stores the K and V matrices from previous tokens so they never need recomputing. At step t, only the new token's K and V are computed and appended. This turns an O(n²) operation into O(n) per step.

Token Generation: Without vs With KV-Cache

Without KV-cache (slow): at step 3 the model recomputes K and V for t1, t2, and t3, so every step redoes O(n) K/V work.
With KV-cache (fast): at step 3 it reads the cached K and V for t1 and t2 and computes them only for t3, so every step does O(1) new K/V work.
Over 100 tokens that is roughly 5,050 recomputations without the cache versus 100 with it — about 50× fewer redundant computations.
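
In Hugging Face Transformers the cache is on by default; one rough way to feel the difference is to toggle use_cache during generation. A minimal sketch with gpt2 (absolute timings will be small here; the gap grows with model size and sequence length):

Python
import time
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("The quick brown fox", return_tensors="pt")

def timed_generate(use_cache: bool) -> float:
    start = time.time()
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=100, do_sample=False, use_cache=use_cache)
    return time.time() - start

print(f"with KV-cache:    {timed_generate(True):.2f}s")
print(f"without KV-cache: {timed_generate(False):.2f}s")  # noticeably slower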

Continuous Batching

GPUs are parallel beasts — running one request wastes most of their capacity. Continuous batching groups multiple user requests together so the GPU stays saturated. Unlike static batching, new requests join the batch as soon as a slot frees up.

Analogy: Elevator vs Stairs

Static batching is like an elevator that waits until it's completely full before moving. Continuous batching lets people hop on and off at every floor — the elevator never stops to wait.

Key Inference Frameworks

  • vLLM — PagedAttention for memory-efficient KV-cache management (see the sketch after this list)
  • TGI (Text Generation Inference) — Hugging Face's production server
  • TensorRT-LLM — NVIDIA's optimized runtime for maximum GPU throughput
  • llama.cpp — Run quantized models on CPUs and Apple Silicon
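
As promised above, a minimal vLLM sketch (assuming vLLM is installed and the model fits on your GPU). The engine applies continuous batching automatically when you hand it many prompts or serve concurrent requests:

Python
from vllm import LLM, SamplingParams

# vLLM schedules all of these prompts with continuous batching under the hood
llm = LLM(model="meta-llama/Llama-2-7b-hf")
params = SamplingParams(temperature=0.7, max_tokens=100)

prompts = [
    "Explain the KV-cache in one sentence.",
    "Why do GPUs prefer batched requests?",
    "What is continuous batching?",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)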

Part 2: Quantization

A 7-billion-parameter model in FP32 needs 28 GB of memory just for the weights. Most GPUs can't hold that. Quantization shrinks each number's precision so the model fits on smaller hardware with minimal quality loss.

The Precision Ladder

FP32 (32 bits per weight) → FP16 (16 bits) → INT8 (8 bits) → INT4 (4 bits). Each step down the ladder roughly halves the memory footprint.

ELI5: Ruler Precision

FP32 is measuring with a ruler that has marks every 0.001 mm — super precise but heavy to carry. INT4 is a pocket ruler with marks every centimetre — much lighter, and for most tasks, close enough!

Precision vs Memory Trade-Off

  • FP32: 32 bits per param → 28 GB (max precision)
  • FP16: 16 bits per param → 14 GB
  • INT8: 8 bits per param → 7 GB
  • INT4: 4 bits per param → 3.5 GB

Each step ≈ halves memory with a small quality trade-off.
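
A quick sanity check of those numbers (a rough estimate that counts weights only, ignoring activations and the KV-cache):

Python
params = 7e9  # 7B parameters

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    gigabytes = params * bits / 8 / 1e9  # bits → bytes → GB
    print(f"{name}: ~{gigabytes:.1f} GB")
# FP32: ~28.0 GB, FP16: ~14.0 GB, INT8: ~7.0 GB, INT4: ~3.5 GB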

Code: 4-Bit Loading with bitsandbytes

Python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype="float16",
    bnb_4bit_quant_type="nf4",         # normalized float 4-bit
    bnb_4bit_use_double_quant=True,  # quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=quant_config,
    device_map="auto",
)
# 7B model now fits in ~3.5 GB VRAM!

Analogy: JPEG Compression

Quantization is like saving a photo as a JPEG instead of a raw BMP. The file is dramatically smaller and for most viewers, the image looks identical — only if you zoom to pixel level do you notice tiny differences.

Part 3: Retrieval Augmented Generation

LLMs have a knowledge cutoff and sometimes hallucinate facts. RAG solves this by fetching real documents at query time and stuffing them into the prompt so the model has reference material.

ELI5: Open-Book vs Closed-Book Exam

A closed-book exam forces you to rely on memory — you might mix things up. An open-book exam lets you flip to the right page and quote directly. RAG gives the LLM an open book!

How RAG Works

1. Embed your documents into vectors and store them.
2. At query time, embed the user's question.
3. Retrieve the top-K most similar chunks.
4. Inject those chunks into the prompt.
5. Let the LLM generate an answer grounded in real data.

Animated RAG Pipeline

User query ("What is RAG?") → Retrieve: vector DB search returns the top-K chunks → Stuff prompt: context + question go into the system message → Generate: a grounded answer with citations. The model answers from real documents, not just memory.

Code: Simple RAG with LangChain

Python
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA

# 1. Load and chunk documents
docs = TextLoader("knowledge_base.txt").load()
chunks = RecursiveCharacterTextSplitter(
    chunk_size=500, chunk_overlap=50
).split_documents(docs)

# 2. Embed and store in vector DB
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vector_store = FAISS.from_documents(chunks, embeddings)

# 3. Build retrieval chain
qa_chain = RetrievalQA.from_chain_type(
    llm=HuggingFacePipeline.from_model_id(
        "google/flan-t5-base", task="text2text-generation"
    ),
    retriever=vector_store.as_retriever(search_kwargs={"k": 3}),
)

answer = qa_chain.run("What is retrieval augmented generation?")
print(answer)

RAG Best Practices

  • Chunk wisely — too big = noise, too small = missing context (300-500 tokens is a good start)
  • Overlap chunks — so sentences aren't cut in half
  • Re-rank — use a cross-encoder to reorder retrieved chunks by relevance (see the sketch after this list)
  • Cite sources — always show the user which documents grounded the answer
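
The re-rank step can be a few lines with sentence-transformers. In this sketch the candidate chunks are placeholders standing in for whatever your retriever returns, and the cross-encoder checkpoint is one public option among many:

Python
from sentence_transformers import CrossEncoder

query = "What is retrieval augmented generation?"
candidates = [
    "RAG retrieves documents at query time and adds them to the prompt.",
    "FAISS is a library for efficient similarity search over vectors.",
    "Chunk overlap prevents sentences from being cut in half.",
]

# Score every (query, chunk) pair, then keep the highest-scoring chunks
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, c) for c in candidates])
top_chunks = [
    c for _, c in sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)
][:2]
print(top_chunks)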

Part 4: Prompt Engineering

The same model can give wildly different answers depending on how you ask. Prompt engineering is the craft of structuring your input for the best output — no retraining required.

ELI5: Asking for Directions

Asking a stranger "Where's the place?" (zero-shot) gets a confused look. Saying "I'm looking for the nearest coffee shop — last time I asked, someone said go past the park and turn left" (few-shot with context) gets a precise answer!

The Three Strategies

Zero-shot: Just ask the question directly. Few-shot: Provide examples of input → output pairs. Chain-of-thought (CoT): Ask the model to reason step by step before answering.

Comparing Prompt Strategies

Here is the same question under the zero-shot strategy; the few-shot and chain-of-thought variants are sketched below.

Prompt:
"Is 17 a prime number?"

Model output:
"Yes."

⚠ Correct, but no reasoning — fragile for harder questions.
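
A sketch of how the same question might be phrased under each strategy (the exact wording is illustrative, not canonical):

Python
question = "Is 17 a prime number?"

# Zero-shot: just ask
zero_shot = question

# Few-shot: show a couple of input → output pairs first
few_shot = (
    "Q: Is 9 a prime number?\nA: No, 9 = 3 × 3.\n"
    "Q: Is 13 a prime number?\nA: Yes, its only divisors are 1 and 13.\n"
    f"Q: {question}\nA:"
)

# Chain-of-thought: ask for reasoning before the final answer
chain_of_thought = (
    f"{question} Think step by step: check divisibility by every prime "
    "up to its square root, then state a final yes/no answer."
)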

Prompt Anatomy

  • System prompt — role, constraints, tone
  • Few-shot examples (optional)
  • Context / retrieved docs (RAG)
  • User question — the actual task

Each layer adds more grounding → better, more reliable answers.
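
One way those layers might be assembled into a single prompt string. This is a simplified sketch; build_prompt and its inputs are placeholders, and chat models would normally use structured messages instead of one big string:

Python
def build_prompt(system, examples, context, question):
    # Layers in order: system → few-shot examples → retrieved context → question
    parts = [f"System: {system}"]
    parts += [f"Example:\n{ex}" for ex in examples]
    if context:
        parts.append(f"Context:\n{context}")
    parts.append(f"User: {question}")
    return "\n\n".join(parts)

prompt = build_prompt(
    system="You are a concise assistant. Cite the context when you use it.",
    examples=["Q: What is the capital of France?\nA: Paris."],
    context="(retrieved chunks from the vector store would go here)",
    question="What is retrieval augmented generation?",
)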

Anti-Patterns to Avoid

  • Vague instructions — "Be good at this" gives the model nothing to work with
  • Conflicting constraints — "Be concise" + "Explain in detail" confuses the model
  • Prompt injection — never let untrusted user input appear unsanitized in system prompts
  • Over-engineering — if zero-shot works, don't add 50 examples

Part 5: Safety & Ethics

Deploying an LLM means putting a confident, fluent writer in front of your users. The problem? It can be confidently wrong, biased, or manipulated.

ELI5: The Eager Student

LLMs are like the student who always raises their hand — even when they don't know the answer. They'll give a beautifully-worded response that sounds right but might be completely made up. You need a teacher (guardrails) checking their work!

Hallucinations

The model generates plausible-sounding but factually incorrect text. Mitigation: RAG (ground in real data), temperature reduction, and asking the model to say "I don't know" when uncertain.

Bias

Models inherit biases from training data. A model trained on internet text absorbs stereotypes. Mitigation: evaluation benchmarks (e.g., BBQ, WinoBias), human review, and diverse training data.

Jailbreaks

Clever prompts can trick models into bypassing safety guidelines. Examples: "Pretend you're an evil AI…" or role-play scenarios. Mitigation: input/output filters, red-teaming, and RLHF alignment.

Production Safety Checklist

  • Input filtering — block prompt injection patterns and malicious inputs (see the sketch after this list)
  • Output filtering — scan for harmful, biased, or PII-leaking content
  • Rate limiting — prevent abuse and runaway costs
  • Human-in-the-loop — for high-stakes decisions, require human approval
  • Logging & monitoring — track model outputs for drift and failure patterns
  • Graceful degradation — if the model fails, fall back to a safe default
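
A deliberately naive sketch of the first two checklist items. The regex patterns are illustrative placeholders; production systems typically rely on dedicated moderation models or services:

Python
import re

# Illustrative patterns only; real deployments need far more robust checks
INJECTION_PATTERNS = [
    r"ignore (all )?previous instructions",
    r"you are now .* with no restrictions",
]
PII_PATTERNS = [
    r"\b\d{3}-\d{2}-\d{4}\b",      # US SSN-like number
    r"[\w.+-]+@[\w-]+\.[\w.]+",    # email address
]

def filter_input(prompt: str) -> str:
    # Reject inputs that look like prompt injection attempts
    for pattern in INJECTION_PATTERNS:
        if re.search(pattern, prompt, re.IGNORECASE):
            raise ValueError("Possible prompt injection detected")
    return prompt

def filter_output(text: str) -> str:
    # Redact obvious PII before returning the model's text to the user
    for pattern in PII_PATTERNS:
        text = re.sub(pattern, "[REDACTED]", text)
    return text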

Analogy: Self-Driving Car

You wouldn't deploy a self-driving car without lane markers, speed limits, and emergency braking. LLMs need the same: guardrails that keep outputs within safe boundaries, even when the model is "driving" on its own.

Part 6: Serving via API

The final step: wrap your model in a REST API so any application can call it. We'll use FastAPI (async, fast, auto-docs) + Hugging Face Transformers.

ELI5: Vending Machine

Your model is a chef locked in a kitchen. An API is the vending machine window — users press a button (send a request), and food (generated text) comes out. They never need to see the kitchen!

Complete FastAPI Server

Python — app.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

app = FastAPI(title="LLM API")

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 150
    temperature: float = 0.7

class GenerateResponse(BaseModel):
    text: str
    tokens_generated: int

@app.post("/generate", response_model=GenerateResponse)
async def generate(req: GenerateRequest):
    if not req.prompt.strip():
        raise HTTPException(400, "Prompt cannot be empty")

    inputs = tokenizer(req.prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=req.max_tokens,
            temperature=req.temperature,
            do_sample=True,
        )
    text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    n_gen = outputs.shape[1] - inputs["input_ids"].shape[1]
    return GenerateResponse(text=text, tokens_generated=n_gen)

@app.get("/health")
async def health():
    return {"status": "ok"}

Testing It

Bash
# Start the server
uvicorn app:app --host 0.0.0.0 --port 8000

# In another terminal — send a request
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "The future of AI is", "max_tokens": 50}'

Production Hardening

  • Add authentication — API keys or OAuth to prevent unauthorized access
  • Set rate limits — use something like SlowAPI to throttle per user
  • Containerize — Docker image with pinned dependencies for reproducibility
  • GPU inference — switch to device_map="auto" and serve behind a load balancer
  • Streaming — use Server-Sent Events for token-by-token output to improve UX
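
A sketch of token-by-token streaming with Server-Sent Events, extending the app.py server above (it reuses model, tokenizer, and GenerateRequest; TextIteratorStreamer is part of transformers and yields text while generation runs in a background thread):

Python
from threading import Thread

from fastapi.responses import StreamingResponse
from transformers import TextIteratorStreamer

@app.post("/generate/stream")
async def generate_stream(req: GenerateRequest):
    inputs = tokenizer(req.prompt, return_tensors="pt")
    streamer = TextIteratorStreamer(
        tokenizer, skip_prompt=True, skip_special_tokens=True
    )

    # Generation runs in a background thread; the streamer yields text as it arrives
    Thread(
        target=model.generate,
        kwargs=dict(**inputs, max_new_tokens=req.max_tokens, streamer=streamer),
    ).start()

    def event_stream():
        for chunk in streamer:
            yield f"data: {chunk}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(event_stream(), media_type="text/event-stream")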

Analogy: Production = Restaurant Opening

Building the model is learning to cook. Serving via API is opening a restaurant: you need a menu (docs), a front door (endpoint), a health inspector (monitoring), and a fire exit (error handling). The cooking is the easy part!

Test Your Knowledge

Time to check what you learned about deploying LLMs to production! Answer all 6 questions.

Q1: What does the KV-cache do during LLM inference?
Q2: What happens when you quantize a 7B-parameter model from FP32 to INT4?
Q3: What does RAG stand for, and why is it useful?
Q4: Which prompt engineering strategy asks the model to reason step by step before giving a final answer?
Q5: What is a "hallucination" in the context of LLMs?
Q6: Why is FastAPI a popular choice for serving LLM APIs in production?


Module 8 of the LLM Engineering Course | Built by Fakhruddin Khambaty