MODULE 8

Deployment & Production

Take your LLM from a notebook prototype to a fast, safe, production-ready system that real users depend on.

Part 1: Making the Model Fast

Training a model is only half the battle. When real users hit your API, every millisecond counts. Inference optimization is the art of generating tokens as quickly and cheaply as possible.

Why Inference Is Slow

Autoregressive generation means the model produces one token at a time, each depending on every previous token. For a 500-token response, the model runs forward 500 times. Without tricks, each pass recomputes attention over the entire sequence from scratch.

ELI5: Drive-Through vs Sit-Down Restaurant

Imagine a sit-down restaurant where the waiter walks back to the kitchen, re-reads the entire order from scratch, and brings one dish at a time. That's naive inference.

A drive-through keeps your order on a sticky note (the KV-cache) so each new item is added instantly without re-reading everything. Way faster!

KV-Cache

The Key-Value cache stores the K and V matrices from previous tokens so they never need recomputing. At step t, only the new token's K and V are computed and appended. This turns an O(n²) operation into O(n) per step.

Token Generation: Without vs With KV-Cache

Without KV-cache (slow): at step 3 the model recomputes K and V for t1, t2, and t3, so every step redoes O(n) K/V work.
With KV-cache (fast): at step 3 it reads the cached K and V for t1 and t2 and computes them only for t3, so every step does O(1) new K/V work.
Over 100 tokens that is roughly 5,050 recomputations without the cache versus 100 with it — about 50× fewer redundant computations.
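
In Hugging Face Transformers the cache is on by default; one rough way to feel the difference is to toggle use_cache during generation. A minimal sketch with gpt2 (absolute timings will be small here; the gap grows with model size and sequence length):

Python
import time
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("The quick brown fox", return_tensors="pt")

def timed_generate(use_cache: bool) -> float:
    start = time.time()
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=100, do_sample=False, use_cache=use_cache)
    return time.time() - start

print(f"with KV-cache:    {timed_generate(True):.2f}s")
print(f"without KV-cache: {timed_generate(False):.2f}s")  # noticeably slower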

Continuous Batching

GPUs are parallel beasts — running one request wastes most of their capacity. Continuous batching groups multiple user requests together so the GPU stays saturated. Unlike static batching, new requests join the batch as soon as a slot frees up.

Analogy: Elevator vs Stairs

Static batching is like an elevator that waits until it's completely full before moving. Continuous batching lets people hop on and off at every floor — the elevator never stops to wait.

Key Inference Frameworks

  • vLLM — PagedAttention for memory-efficient KV-cache management (see the sketch after this list)
  • TGI (Text Generation Inference) — Hugging Face's production server
  • TensorRT-LLM — NVIDIA's optimized runtime for maximum GPU throughput
  • llama.cpp — Run quantized models on CPUs and Apple Silicon
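
As promised above, a minimal vLLM sketch (assuming vLLM is installed and the model fits on your GPU). The engine applies continuous batching automatically when you hand it many prompts or serve concurrent requests:

Python
from vllm import LLM, SamplingParams

# vLLM schedules all of these prompts with continuous batching under the hood
llm = LLM(model="meta-llama/Llama-2-7b-hf")
params = SamplingParams(temperature=0.7, max_tokens=100)

prompts = [
    "Explain the KV-cache in one sentence.",
    "Why do GPUs prefer batched requests?",
    "What is continuous batching?",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)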

Part 2: Quantization

A 7-billion-parameter model in FP32 needs 28 GB of memory just for the weights. Most GPUs can't hold that. Quantization shrinks each number's precision so the model fits on smaller hardware with minimal quality loss.

The Precision Ladder

FP32 (32 bits per weight) → FP16 (16 bits) → INT8 (8 bits) → INT4 (4 bits). Each step down the ladder roughly halves the memory footprint.

ELI5: Ruler Precision

FP32 is measuring with a ruler that has marks every 0.001 mm — super precise but heavy to carry. INT4 is a pocket ruler with marks every centimetre — much lighter, and for most tasks, close enough!

Precision vs Memory Trade-Off

  • FP32: 32 bits per param → 28 GB (max precision)
  • FP16: 16 bits per param → 14 GB
  • INT8: 8 bits per param → 7 GB
  • INT4: 4 bits per param → 3.5 GB

Each step ≈ halves memory with a small quality trade-off.
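
A quick sanity check of those numbers (a rough estimate that counts weights only, ignoring activations and the KV-cache):

Python
params = 7e9  # 7B parameters

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    gigabytes = params * bits / 8 / 1e9  # bits → bytes → GB
    print(f"{name}: ~{gigabytes:.1f} GB")
# FP32: ~28.0 GB, FP16: ~14.0 GB, INT8: ~7.0 GB, INT4: ~3.5 GB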

Code: 4-Bit Loading with bitsandbytes

Python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype="float16",
    bnb_4bit_quant_type="nf4",         # normalized float 4-bit
    bnb_4bit_use_double_quant=True,  # quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=quant_config,
    device_map="auto",
)
# 7B model now fits in ~3.5 GB VRAM!

Analogy: JPEG Compression

Quantization is like saving a photo as a JPEG instead of a raw BMP. The file is dramatically smaller and for most viewers, the image looks identical — only if you zoom to pixel level do you notice tiny differences.

Part 3: Retrieval Augmented Generation

LLMs have a knowledge cutoff and sometimes hallucinate facts. RAG solves this by fetching real documents at query time and stuffing them into the prompt so the model has reference material.

ELI5: Open-Book vs Closed-Book Exam

A closed-book exam forces you to rely on memory — you might mix things up. An open-book exam lets you flip to the right page and quote directly. RAG gives the LLM an open book!

How RAG Works

1. Embed your documents into vectors and store them.
2. At query time, embed the user's question.
3. Retrieve the top-K most similar chunks.
4. Inject those chunks into the prompt.
5. Let the LLM generate an answer grounded in real data.

Animated RAG Pipeline

User query ("What is RAG?") → Retrieve: vector DB search returns the top-K chunks → Stuff prompt: context + question go into the system message → Generate: a grounded answer with citations. The model answers from real documents, not just memory.

Code: Simple RAG with LangChain

Python
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA

# 1. Load and chunk documents
docs = TextLoader("knowledge_base.txt").load()
chunks = RecursiveCharacterTextSplitter(
    chunk_size=500, chunk_overlap=50
).split_documents(docs)

# 2. Embed and store in vector DB
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vector_store = FAISS.from_documents(chunks, embeddings)

# 3. Build retrieval chain
qa_chain = RetrievalQA.from_chain_type(
    llm=HuggingFacePipeline.from_model_id(
        "google/flan-t5-base", task="text2text-generation"
    ),
    retriever=vector_store.as_retriever(search_kwargs={"k": 3}),
)

answer = qa_chain.run("What is retrieval augmented generation?")
print(answer)

RAG Best Practices

  • Chunk wisely — too big = noise, too small = missing context (300-500 tokens is a good start)
  • Overlap chunks — so sentences aren't cut in half
  • Re-rank — use a cross-encoder to reorder retrieved chunks by relevance (see the sketch after this list)
  • Cite sources — always show the user which documents grounded the answer
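
The re-rank step can be a few lines with sentence-transformers. In this sketch the candidate chunks are placeholders standing in for whatever your retriever returns, and the cross-encoder checkpoint is one public option among many:

Python
from sentence_transformers import CrossEncoder

query = "What is retrieval augmented generation?"
candidates = [
    "RAG retrieves documents at query time and adds them to the prompt.",
    "FAISS is a library for efficient similarity search over vectors.",
    "Chunk overlap prevents sentences from being cut in half.",
]

# Score every (query, chunk) pair, then keep the highest-scoring chunks
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, c) for c in candidates])
top_chunks = [
    c for _, c in sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)
][:2]
print(top_chunks)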

Part 4: Prompt Engineering

The same model can give wildly different answers depending on how you ask. Prompt engineering is the craft of structuring your input for the best output — no retraining required.

ELI5: Asking for Directions

Asking a stranger "Where's the place?" (zero-shot) gets a confused look. Saying "I'm looking for the nearest coffee shop — last time I asked, someone said go past the park and turn left" (few-shot with context) gets a precise answer!

The Three Strategies

Zero-shot: Just ask the question directly. Few-shot: Provide examples of input → output pairs. Chain-of-thought (CoT): Ask the model to reason step by step before answering.

Comparing Prompt Strategies

Here is the same question under the zero-shot strategy; the few-shot and chain-of-thought variants are sketched below.

Prompt:
"Is 17 a prime number?"

Model output:
"Yes."

⚠ Correct, but no reasoning — fragile for harder questions.
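
A sketch of how the same question might be phrased under each strategy (the exact wording is illustrative, not canonical):

Python
question = "Is 17 a prime number?"

# Zero-shot: just ask
zero_shot = question

# Few-shot: show a couple of input → output pairs first
few_shot = (
    "Q: Is 9 a prime number?\nA: No, 9 = 3 × 3.\n"
    "Q: Is 13 a prime number?\nA: Yes, its only divisors are 1 and 13.\n"
    f"Q: {question}\nA:"
)

# Chain-of-thought: ask for reasoning before the final answer
chain_of_thought = (
    f"{question} Think step by step: check divisibility by every prime "
    "up to its square root, then state a final yes/no answer."
)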

Prompt Anatomy

  • System prompt — role, constraints, tone
  • Few-shot examples (optional)
  • Context / retrieved docs (RAG)
  • User question — the actual task

Each layer adds more grounding → better, more reliable answers.
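
One way those layers might be assembled into a single prompt string. This is a simplified sketch; build_prompt and its inputs are placeholders, and chat models would normally use structured messages instead of one big string:

Python
def build_prompt(system, examples, context, question):
    # Layers in order: system → few-shot examples → retrieved context → question
    parts = [f"System: {system}"]
    parts += [f"Example:\n{ex}" for ex in examples]
    if context:
        parts.append(f"Context:\n{context}")
    parts.append(f"User: {question}")
    return "\n\n".join(parts)

prompt = build_prompt(
    system="You are a concise assistant. Cite the context when you use it.",
    examples=["Q: What is the capital of France?\nA: Paris."],
    context="(retrieved chunks from the vector store would go here)",
    question="What is retrieval augmented generation?",
)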

Anti-Patterns to Avoid

  • Vague instructions — "Be good at this" gives the model nothing to work with
  • Conflicting constraints — "Be concise" + "Explain in detail" confuses the model
  • Prompt injection — never let untrusted user input appear unsanitized in system prompts
  • Over-engineering — if zero-shot works, don't add 50 examples

Part 5: Safety & Ethics

Deploying an LLM means putting a confident, fluent writer in front of your users. The problem? It can be confidently wrong, biased, or manipulated.

ELI5: The Eager Student

LLMs are like the student who always raises their hand — even when they don't know the answer. They'll give a beautifully-worded response that sounds right but might be completely made up. You need a teacher (guardrails) checking their work!

Hallucinations

The model generates plausible-sounding but factually incorrect text. Mitigation: RAG (ground in real data), temperature reduction, and asking the model to say "I don't know" when uncertain.

Bias

Models inherit biases from training data. A model trained on internet text absorbs stereotypes. Mitigation: evaluation benchmarks (e.g., BBQ, WinoBias), human review, and diverse training data.

Jailbreaks

Clever prompts can trick models into bypassing safety guidelines. Examples: "Pretend you're an evil AI…" or role-play scenarios. Mitigation: input/output filters, red-teaming, and RLHF alignment.

Production Safety Checklist

  • Input filtering — block prompt injection patterns and malicious inputs (see the sketch after this list)
  • Output filtering — scan for harmful, biased, or PII-leaking content
  • Rate limiting — prevent abuse and runaway costs
  • Human-in-the-loop — for high-stakes decisions, require human approval
  • Logging & monitoring — track model outputs for drift and failure patterns
  • Graceful degradation — if the model fails, fall back to a safe default
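
A deliberately naive sketch of the first two checklist items. The regex patterns are illustrative placeholders; production systems typically rely on dedicated moderation models or services:

Python
import re

# Illustrative patterns only; real deployments need far more robust checks
INJECTION_PATTERNS = [
    r"ignore (all )?previous instructions",
    r"you are now .* with no restrictions",
]
PII_PATTERNS = [
    r"\b\d{3}-\d{2}-\d{4}\b",      # US SSN-like number
    r"[\w.+-]+@[\w-]+\.[\w.]+",    # email address
]

def filter_input(prompt: str) -> str:
    # Reject inputs that look like prompt injection attempts
    for pattern in INJECTION_PATTERNS:
        if re.search(pattern, prompt, re.IGNORECASE):
            raise ValueError("Possible prompt injection detected")
    return prompt

def filter_output(text: str) -> str:
    # Redact obvious PII before returning the model's text to the user
    for pattern in PII_PATTERNS:
        text = re.sub(pattern, "[REDACTED]", text)
    return text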

Analogy: Self-Driving Car

You wouldn't deploy a self-driving car without lane markers, speed limits, and emergency braking. LLMs need the same: guardrails that keep outputs within safe boundaries, even when the model is "driving" on its own.

Part 6: Serving via API

The final step: wrap your model in a REST API so any application can call it. We'll use FastAPI (async, fast, auto-docs) + Hugging Face Transformers.

ELI5: Vending Machine

Your model is a chef locked in a kitchen. An API is the vending machine window — users press a button (send a request), and food (generated text) comes out. They never need to see the kitchen!

Complete FastAPI Server

Python — app.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

app = FastAPI(title="LLM API")

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 150
    temperature: float = 0.7

class GenerateResponse(BaseModel):
    text: str
    tokens_generated: int

@app.post("/generate", response_model=GenerateResponse)
async def generate(req: GenerateRequest):
    if not req.prompt.strip():
        raise HTTPException(400, "Prompt cannot be empty")

    inputs = tokenizer(req.prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=req.max_tokens,
            temperature=req.temperature,
            do_sample=True,
        )
    text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    n_gen = outputs.shape[1] - inputs["input_ids"].shape[1]
    return GenerateResponse(text=text, tokens_generated=n_gen)

@app.get("/health")
async def health():
    return {"status": "ok"}

Testing It

Bash
# Start the server
uvicorn app:app --host 0.0.0.0 --port 8000

# In another terminal — send a request
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "The future of AI is", "max_tokens": 50}'

Production Hardening

  • Add authentication — API keys or OAuth to prevent unauthorized access
  • Set rate limits — use something like SlowAPI to throttle per user
  • Containerize — Docker image with pinned dependencies for reproducibility
  • GPU inference — switch to device_map="auto" and serve behind a load balancer
  • Streaming — use Server-Sent Events for token-by-token output to improve UX
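
A sketch of token-by-token streaming with Server-Sent Events, extending the app.py server above (it reuses model, tokenizer, and GenerateRequest; TextIteratorStreamer is part of transformers and yields text while generation runs in a background thread):

Python
from threading import Thread

from fastapi.responses import StreamingResponse
from transformers import TextIteratorStreamer

@app.post("/generate/stream")
async def generate_stream(req: GenerateRequest):
    inputs = tokenizer(req.prompt, return_tensors="pt")
    streamer = TextIteratorStreamer(
        tokenizer, skip_prompt=True, skip_special_tokens=True
    )

    # Generation runs in a background thread; the streamer yields text as it arrives
    Thread(
        target=model.generate,
        kwargs=dict(**inputs, max_new_tokens=req.max_tokens, streamer=streamer),
    ).start()

    def event_stream():
        for chunk in streamer:
            yield f"data: {chunk}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(event_stream(), media_type="text/event-stream")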

Analogy: Production = Restaurant Opening

Building the model is learning to cook. Serving via API is opening a restaurant: you need a menu (docs), a front door (endpoint), a health inspector (monitoring), and a fire exit (error handling). The cooking is the easy part!

Test Your Knowledge

Time to check what you learned about deploying LLMs to production! Answer all 6 questions.

Q1: What does the KV-cache do during LLM inference?
Q2: What happens when you quantize a 7B-parameter model from FP32 to INT4?
Q3: What does RAG stand for, and why is it useful?
Q4: Which prompt engineering strategy asks the model to reason step by step before giving a final answer?
Q5: What is a "hallucination" in the context of LLMs?
Q6: Why is FastAPI a popular choice for serving LLM APIs in production?


Module 8 of the LLM Engineering Course | Built by Fakhruddin Khambaty