The FastAPI + vLLM + Docker Stack for Serving LLMs
If you're serving LLMs in production today, you've probably already discovered that wrapping a HuggingFace model in a Flask app and praying doesn't scale. I learned this the hard way at BulkMagic, where we needed to serve multiple large language models with real latency requirements and real users waiting on the other end. After months of iteration, the stack I keep coming back to is FastAPI + vLLM + Docker. It's not fancy. It just works.
Why vLLM Over Vanilla HuggingFace
The short version: 3-5x throughput improvement with zero changes to your model. vLLM achieves this through a few key innovations. PagedAttention manages KV cache memory the way an operating system manages virtual memory, eliminating the wasteful pre-allocation that kills throughput on standard inference. Continuous batching means new requests get processed as soon as there's capacity, rather than waiting for the entire batch to finish. And tensor parallelism lets you split a single model across multiple GPUs without rewriting your serving code.
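To make tensor parallelism concrete, here's a hedged launch sketch using vLLM's OpenAI-compatible server; the model name and GPU count are placeholders, not a recommendation:

```shell
# Model name and counts are placeholders; --tensor-parallel-size shards
# the model across GPUs with no change to your serving code.
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.90
```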
If you're running anything larger than a 7B parameter model with more than a handful of concurrent users, vanilla HuggingFace inference is leaving performance on the table.
The Architecture
Here's how the stack fits together:
FastAPI handles everything before inference: authentication, input validation, rate limiting, and request queuing. Its async support is critical here because you need the API layer to handle hundreds of waiting connections while GPUs churn through the queue. vLLM handles inference, and only inference. You feed it prompts, it gives you tokens. Docker wraps the whole thing into a reproducible, deployable unit with GPU passthrough.
API Design Patterns That Matter
Three patterns I've found essential. First, streaming responses. For any generation longer than a sentence, stream tokens back via Server-Sent Events. Users perceive the response as faster, and your connection timeout problems disappear. Second, request batching at the API layer. Group similar-length prompts together before they hit vLLM; this improves GPU utilization dramatically. Third, queue management with backpressure. When the queue is full, return a 503 immediately rather than letting requests pile up and time out.
```python
from fastapi import HTTPException
from fastapi.responses import StreamingResponse

@app.post("/generate")
async def generate(request: GenerateRequest):
    # Backpressure: fail fast instead of letting requests pile up
    if request_queue.full():
        raise HTTPException(503, "Server at capacity")

    async def token_stream():
        # Stream tokens back as Server-Sent Events
        async for token in vllm_engine.generate_stream(request.prompt):
            yield f"data: {token}\n\n"

    return StreamingResponse(token_stream(), media_type="text/event-stream")
```

Docker GPU Setup
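The second pattern, batching by length, can be sketched as a small grouping step before dispatch. `bucket_by_length` is a hypothetical helper, and whitespace splitting stands in for a real tokenizer:

```python
from collections import defaultdict

def bucket_by_length(prompts, bucket_size=128):
    """Group prompts into batches of similar approximate token count.

    Whitespace splitting is a crude stand-in for real tokenization;
    swap in your model's tokenizer for production use.
    """
    buckets = defaultdict(list)
    for prompt in prompts:
        buckets[len(prompt.split()) // bucket_size].append(prompt)
    return list(buckets.values())
```

Prompts that land in the same bucket pad to roughly the same length, so the GPU wastes fewer cycles on padding tokens.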
The Docker side is straightforward but has gotchas (if you need a refresher on container basics, see my Docker 101 guide). You need the NVIDIA Container Toolkit installed on the host, and your docker-compose.yml needs the GPU reservation block. Pin your CUDA version. Pin your vLLM version. I've had deployments break because a minor vLLM update changed the default quantization behavior.
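A minimal sketch of what pinning looks like in the image itself; the image tag and file names here are illustrative, not an endorsement of specific versions:

```dockerfile
# Pin an exact vLLM image tag rather than :latest, so a registry update
# can't silently change inference behavior. Tag shown is illustrative.
FROM vllm/vllm-openai:v0.4.2

# requirements.txt should pin fastapi, uvicorn, etc. to exact versions
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
```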
```yaml
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: all
          capabilities: [gpu]
```

Monitoring and Health Checks
At minimum, track four things: tokens per second (your throughput metric), time to first token (your latency metric), queue depth (your capacity metric), and GPU memory utilization (your headroom metric). vLLM exposes Prometheus metrics natively. Hook them into Grafana and set alerts on queue depth. For a deeper approach to LLM observability, I recommend instrumenting with OpenTelemetry's GenAI conventions. If the queue is growing, you need more GPUs or fewer users.
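Pointing Prometheus at vLLM's metrics endpoint is a few lines of scrape config. The service name and port below are assumptions for this sketch; use whatever your compose file exposes:

```yaml
scrape_configs:
  - job_name: "vllm"
    scrape_interval: 15s
    static_configs:
      # Hypothetical hostname/port; vLLM serves metrics at /metrics by default
      - targets: ["vllm:8000"]
```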
The Bigger Picture
This stack is becoming the standard for self-hosted LLM serving, and that matters. A year ago, serving your own models felt like a research project. Today, with vLLM's maturity, FastAPI's ecosystem, and Docker's GPU support, it's a well-understood production pattern. The barrier to running your own inference has dropped dramatically. That shift changes the economics of the entire industry, because if serving your own model is cheap and reliable, the case for paying per-token to an API provider gets harder to justify.
Related Posts
From Hackathon to Production: What Changes When Prototypes Get Real
After years of hackathons and production systems, I've learned the gap between a winning demo and a reliable product is mostly about what you choose to worry about.
Voice AI Architecture: Building Conversational Agents at Scale
The full architecture behind voice AI systems. Pipeline design, latency budgets, and why voice is a fundamentally different engineering challenge than chat.
Multi-Agent Systems in Production: What Nobody Tells You
Lessons from building multi-agent systems that actually run in production. What works, what doesn't, and what the hype skips over.