Full-stack LLM development¶
At a glance¶
Building an AI product involves more than just chatting with a model. You need to choose a model, teach it your domain (fine-tuning), make it fast enough for real users (inference), measure if it works (evaluation), and keep it running (deployment). This article covers the engineering decisions between "I have access to an AI" and "I have a product people can use."
The full LLM stack: parameter-efficient fine-tuning (LoRA, QLoRA), alignment via RLHF or DPO, inference serving (vLLM, TGI), quantization for cost/speed tradeoffs, evaluation (benchmarks, LLM-as-judge, Chatbot Arena), and production operations (observability, prompt versioning, cost tracking). Key decision: start with prompting + RAG before fine-tuning; fine-tune only when you've exhausted cheaper approaches.
LoRA (Hu et al., 2021): rank-decomposed weight updates at <1% parameter cost. QLoRA (Dettmers, 2023): 4-bit NF4 quantization + LoRA for 65B models on consumer GPUs. Alignment: PPO-RLHF (Ouyang, 2022), DPO (Rafailov, 2023), GRPO (DeepSeek). Inference: PagedAttention (vLLM), INT4/AWQ/GPTQ quantization, KV-cache optimization. Evaluation: LMSYS Elo, LLM-as-judge calibration, faithfulness metrics. Production gap (Huyen): non-determinism, output format fragility, silent failures.
Building a production LLM application means more than calling an API. The "full stack" spans fine-tuning (adapting a model to your domain), inference serving (running the model at scale), evaluation (measuring whether it actually works), and deployment (keeping it running reliably). This article covers the engineering layer between "I have a foundation model" and "I have a product."
"It's easy to make something cool with LLMs, but very hard to make something production-ready." — Chip Huyen (2023)
Fine-tuning¶
When to fine-tune vs. prompt¶
Fine-tuning trains the model's weights on your data. Prompting and context engineering modify the input without changing the model. The decision depends on:
| Factor | Use prompting / RAG | Fine-tune |
|---|---|---|
| Data volume | Small (dozens of examples) | Large (thousands+) |
| Task specificity | General instruction following | Domain-specific format, tone, or knowledge |
| Latency budget | Can tolerate longer prompts | Need shorter prompts + internalized behavior |
| Update frequency | Knowledge changes often | Behavior is stable, knowledge via RAG |
| Cost structure | Pay per token at inference | Pay upfront for training, cheaper inference |
Parameter-efficient fine-tuning (PEFT)¶
Full fine-tuning updates all model parameters — expensive and requires substantial GPU memory. PEFT methods update a small fraction:
- LoRA (Low-Rank Adaptation) — injects small trainable low-rank matrices into attention layers; typically updates <1% of parameters while recovering 90%+ of full fine-tuning performance. Introduced by Hu et al. (2021).
- QLoRA — combines LoRA with 4-bit (NF4) quantization of the frozen base model, enabling fine-tuning of a 65B-parameter model on a single 48 GB GPU. Dettmers et al. (2023).
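The core LoRA idea is small enough to sketch directly: freeze the pretrained weight W and learn a low-rank update BA, scaled by alpha/r. A minimal NumPy illustration (dimensions are hypothetical, chosen to resemble a 4096-wide attention projection):

```python
import numpy as np

# LoRA sketch: instead of updating a full d x k weight matrix W,
# train two low-rank factors B (d x r) and A (r x k) with r << min(d, k).
# The effective weight is W + (alpha / r) * (B @ A); only A and B are trained.

d, k, r, alpha = 4096, 4096, 8, 16  # hypothetical projection size and LoRA rank

W = np.random.randn(d, k).astype(np.float32)          # frozen pretrained weight
A = np.random.randn(r, k).astype(np.float32) * 0.01   # small random init
B = np.zeros((d, r), dtype=np.float32)                # zero init: no drift at step 0

def lora_forward(x):
    """Adapted layer: base path plus the scaled low-rank update."""
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

full_params = d * k            # what full fine-tuning would train
lora_params = r * (d + k)      # what LoRA trains
print(f"trainable fraction: {lora_params / full_params:.4%}")  # ~0.39% for r=8
```

This is why LoRA adapters are cheap to store and swap: for this layer, the trainable fraction is under half a percent of the full weight matrix.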
RLHF and preference optimization¶
After SFT (supervised fine-tuning), alignment techniques shape model behavior:
- RLHF (Ouyang et al., 2022) — train a reward model on human preferences, then optimize the LLM via PPO. The technique behind ChatGPT's helpfulness.
- DPO (Rafailov et al., 2023) — directly optimize on preference pairs without a separate reward model. Simpler, more stable, increasingly preferred.
- GRPO (Group Relative Policy Optimization) — used by DeepSeek for reasoning models; computes advantages from groups of sampled responses, eliminating the separate learned value (critic) model.
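DPO's simplicity comes from its loss being a plain classification objective over preference pairs. A minimal sketch of the per-pair loss (the log-probabilities here are placeholder sequence log-likelihoods, not from a real model):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair.

    Inputs are sequence log-likelihoods under the trained policy and the
    frozen reference model; beta controls deviation from the reference.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)): loss shrinks as the policy favors the chosen answer
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy already prefers the chosen response relative to the reference:
low = dpo_loss(-4.0, -9.0, -5.0, -6.0)   # positive margin, small loss
high = dpo_loss(-9.0, -4.0, -6.0, -5.0)  # negative margin, large loss
```

No reward model, no PPO rollouts: gradient descent on this loss over a preference dataset is the whole training loop, which is why DPO is easier to stabilize than RLHF.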
Inference serving¶
Running LLMs in production requires specialized infrastructure:
Key concerns¶
- Latency — time to first token (TTFT) and tokens per second. Users notice >2s TTFT.
- Throughput — requests per second across concurrent users.
- Cost — GPU-hours per request. Dominates production costs.
- Memory — large models may not fit in a single GPU's VRAM, requiring model parallelism or quantization.
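The memory concern is dominated not just by weights but by the KV cache, which grows linearly with sequence length and batch size. A back-of-envelope estimate (the model shape below is a hypothetical 7B-class configuration, not a specific model's spec):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per=2):
    """Rough KV-cache footprint: 2 tensors (K and V) per layer, FP16 by default."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per

# Hypothetical 7B-class shape: 32 layers, 32 KV heads, head_dim 128
gb = kv_cache_bytes(32, 32, 128, seq_len=4096, batch=8) / 2**30
print(f"KV cache: {gb:.1f} GiB")  # 16.0 GiB at FP16
```

Numbers like this explain why vLLM's PagedAttention matters: naive allocators reserve the full-sequence-length cache per request, while paging allocates it on demand and reclaims it across concurrent users.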
Inference engines¶
| Engine | Key feature |
|---|---|
| vLLM | PagedAttention for efficient KV-cache management; high throughput for batch serving |
| TGI (Text Generation Inference) | Hugging Face's production server; supports quantization, tensor parallelism, streaming |
| TensorRT-LLM | NVIDIA's optimized engine; INT4/INT8 quantization with hardware acceleration |
| Ollama | Local inference with one-command model downloads; developer-friendly for prototyping |
Quantization¶
Reducing model precision (FP16 → INT8 → INT4) shrinks memory footprint and increases speed at the cost of some quality. AWQ, GPTQ, and GGUF are common quantization formats. For many applications, INT4 quantization delivers 90%+ of FP16 quality at 4x memory reduction.
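The memory arithmetic behind the FP16 → INT8 → INT4 ladder is straightforward; a quick sketch of approximate weight-only footprints (real quantized checkpoints carry some extra overhead for scales and unquantized layers):

```python
def weight_gb(n_params_billion, bits):
    """Approximate weight memory in GiB for an n-billion-parameter model."""
    return n_params_billion * 1e9 * bits / 8 / 2**30

# A 70B model at decreasing precision:
for bits in (16, 8, 4):
    print(f"70B @ {bits}-bit: {weight_gb(70, bits):.0f} GiB")
```

At FP16 a 70B model needs around 130 GiB for weights alone (multiple GPUs); at INT4 it drops to roughly a third of that, which is what puts large models within reach of single-node serving.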
Evaluation¶
LLM evaluation is notoriously difficult — outputs are open-ended, subjective, and context-dependent.
Evaluation approaches¶
| Approach | What it measures | Limitations |
|---|---|---|
| Benchmarks (MMLU, HumanEval, MATH) | Specific capability on standardized tasks | May not reflect real-world performance |
| LLM-as-judge | Automated quality assessment using another LLM | Biases (verbosity, position); requires calibration |
| Human evaluation | Ground truth on quality, safety, helpfulness | Expensive, slow, hard to scale |
| A/B testing | Real user preference in production | Requires production traffic; slow feedback loop |
| LMSYS Chatbot Arena | Crowdsourced relative model quality via blind pairwise votes | Relative rankings only; chat-style prompts may not reflect your domain |
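Chatbot Arena turns blind pairwise votes into a leaderboard using Elo-style ratings (LMSYS has since moved to a Bradley-Terry fit, but the classic Elo update conveys the idea). A minimal sketch:

```python
def elo_update(r_a, r_b, score_a, k=32):
    """One Elo update after a blind pairwise vote.

    score_a: 1.0 if model A wins, 0.5 for a tie, 0.0 if it loses.
    """
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))  # A's expected win prob
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# An upset: the lower-rated model wins, so it gains more than in an even match
a, b = elo_update(1000, 1200, score_a=1.0)
print(f"A: {a:.1f}, B: {b:.1f}")
```

The appeal for LLM evaluation is that each vote only requires a human to pick the better of two anonymous responses, which is far easier to collect at scale than absolute quality scores.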
What to measure¶
- Accuracy / correctness — does the output match ground truth?
- Faithfulness — does the output stay grounded in provided context (not hallucinate)?
- Helpfulness — does it actually answer the user's question?
- Safety — does it refuse harmful requests and avoid generating problematic content?
- Latency and cost — fast enough and cheap enough for the use case?
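Even a minimal eval harness beats eyeballing outputs. A toy sketch of the accuracy dimension, with `predict` standing in for your LLM call (a real harness would add graded checks for faithfulness and safety alongside exact match):

```python
def evaluate(predict, cases):
    """Tiny eval harness: exact-match accuracy over (input, expected) pairs.

    `predict` is any callable from prompt to string; in practice it wraps
    your model API call. Exact match only suits closed-form answers.
    """
    correct = sum(1 for q, expected in cases if predict(q).strip() == expected)
    return correct / len(cases)

# Hypothetical eval set with closed-form answers:
cases = [("2+2", "4"), ("capital of France", "Paris")]
acc = evaluate(lambda q: {"2+2": "4"}.get(q, "Paris"), cases)
print(f"accuracy: {acc:.0%}")
```

The point of writing this down as code, however small, is that every prompt or model change can be re-scored against the same fixed cases before it ships.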
Deployment and operations¶
Observability¶
LLM applications need purpose-built observability:
- Trace logging — capture full prompt, response, latency, token count, and model version for every request
- Cost tracking — token usage drives cost; track per-user, per-feature, per-model
- Quality monitoring — automated checks for hallucination, refusal rates, and output format compliance
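Trace logging can start as a thin wrapper around the model call. A minimal sketch, with a stub in place of a real API client and a whitespace word count standing in for real tokenization:

```python
import time

def traced(model_call, log):
    """Wrap an LLM call so every request appends a trace record to `log`."""
    def wrapper(prompt, **kwargs):
        start = time.perf_counter()
        response = model_call(prompt, **kwargs)
        log.append({
            "prompt": prompt,
            "response": response,
            "latency_s": time.perf_counter() - start,
            # crude token proxy; production systems use the tokenizer's count
            "approx_tokens": len(prompt.split()) + len(response.split()),
        })
        return response
    return wrapper

traces = []
model = traced(lambda p: "stubbed answer", traces)  # stub model for illustration
model("What is LoRA?")
```

In production the records would go to a tracing backend rather than a list, and would also carry the model version and prompt version so regressions can be attributed to a specific change.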
Experiment tracking¶
Tools like Weights & Biases, MLflow, and LangSmith track training runs, prompt versions, evaluation scores, and deployment metrics. Version your prompts and context configurations the same way you version code.
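One lightweight way to version prompts, sketched here with a content hash (the helper name is illustrative, not from any particular tool):

```python
import hashlib

def prompt_version(template: str) -> str:
    """Content-addressed prompt version: log this ID with every request so
    traces can be joined back to the exact prompt text that produced them."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]

v1 = prompt_version("Summarize the following document:\n{doc}")
v2 = prompt_version("Summarize the following document briefly:\n{doc}")
# Any edit, however small, yields a new version ID:
print(v1, v2, v1 != v2)
```

Storing the templates in version control and the hash in every trace record gives you the prompt-level equivalent of tying a log line to a git commit.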
The production gap¶
Chip Huyen's framework identifies key production challenges:
- Non-determinism — same input can produce different outputs; complicates testing
- Output format fragility — no guarantee the model returns valid JSON, Markdown, or whatever downstream systems expect
- Natural language ambiguity — instructions that seem clear to humans are ambiguous to models
- Silent failures — the model produces plausible-sounding wrong answers; no error code, no stack trace
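Output format fragility, in particular, is usually handled with validate-and-retry at the application layer. A minimal sketch (the corrective-retry prompt is one common pattern, not the only one; structured-output APIs and constrained decoding are stronger alternatives where available):

```python
import json

def call_with_json_retry(model_call, prompt, retries=2):
    """Guard against output-format fragility: parse the reply, and retry
    with a corrective instruction when the model returns invalid JSON."""
    attempt_prompt = prompt
    for _ in range(retries + 1):
        raw = model_call(attempt_prompt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            attempt_prompt = prompt + "\nReturn ONLY valid JSON, no prose."
    raise ValueError("model never produced valid JSON")

# Stub model that fails once (prose around the JSON), then complies:
replies = iter(['Sure! Here is the JSON: {"a": 1}', '{"a": 1}'])
result = call_with_json_retry(lambda p: next(replies), "Extract fields as JSON")
```

Note what this does not catch: a silent failure that is syntactically valid JSON with plausible but wrong values will parse cleanly, which is why format validation needs to be paired with the quality monitoring described above.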
Key research and sources¶
| # | Source | Year | Why it matters |
|---|---|---|---|
| 1 | Chip Huyen, Building LLM Applications for Production | 2023 | Definitive practitioner guide to production challenges: ambiguity, non-determinism, evaluation, cost |
| 2 | Chip Huyen, AI Engineering (O'Reilly) | 2025 | Book-length treatment of building applications with foundation models |
| 3 | Hu et al., LoRA: Low-Rank Adaptation of Large Language Models (arXiv:2106.09685) | 2021 | Parameter-efficient fine-tuning — update <1% of parameters, get 90%+ of full fine-tuning quality |
| 4 | Dettmers et al., QLoRA (arXiv:2305.14314) | 2023 | Fine-tune 65B models on a single GPU via 4-bit quantization + LoRA |
| 5 | Ouyang et al., Training Language Models to Follow Instructions (arXiv:2203.02155) | 2022 | The InstructGPT / RLHF paper — the technique behind ChatGPT's helpfulness |
| 6 | Rafailov et al., Direct Preference Optimization (arXiv:2305.18290) | 2023 | Simpler alternative to RLHF; directly optimize on preference pairs without a reward model |
| 7 | Post-Training Scaling survey (ACL 2025) | 2025 | SFT, RLxF, and test-time compute as the next scaling frontier |
| 8 | vLLM — PagedAttention and efficient inference | 2023+ | High-throughput LLM serving with PagedAttention; the de facto open-source inference engine |
| 9 | LMSYS Chatbot Arena | Ongoing | Crowdsourced model comparison via blind A/B testing; most trusted public LLM leaderboard |
| 10 | Hugging Face — Transformers, PEFT, TRL ecosystem | Ongoing | Open-source ecosystem for model hosting, fine-tuning (PEFT/TRL), inference, and evaluation |
Practical takeaways¶
- Start with prompting and RAG before fine-tuning. Fine-tuning is powerful but expensive and slow to iterate. Most applications should exhaust prompt engineering and retrieval-augmented generation first.
- Evaluation is the hardest part. If you can't measure whether your application works, you can't improve it. Invest in evaluation infrastructure before scaling.
- LoRA is the 80/20 of fine-tuning. For most domain adaptation tasks, LoRA on a strong base model outperforms full fine-tuning on cost and iteration speed.
- Instrument everything from day one. Log prompts, responses, latency, token counts, and costs. You'll need this data for debugging, optimization, and cost control.
- Treat prompts as code. Version them, test them, review them. A prompt change is a deployment.
Further reading¶
- LLM foundations (what you're fine-tuning): llm-foundations.md
- AI services (which APIs to use): ai-services-and-apis.md
- Full bibliography: SOURCE_INDEX.md