The FastAPI + vLLM + Docker Stack for Serving LLMs
If you're serving LLMs in production today, you've probably already discovered that wrapping a HuggingFace model in a Flask app and praying doesn't scale. I learned this the hard way at BulkMagic, where we needed to serve multiple large language models with real latency requirements and real users waiting on the other end. After months of iteration, the stack I keep coming back to is FastAPI + vLLM + Docker. It's not fancy. It just works.
Why vLLM Over Vanilla HuggingFace
The short version: 3-5x throughput improvement with zero changes to your model. vLLM achieves this through a few key innovations. PagedAttention manages KV cache memory the way an operating system manages virtual memory, eliminating the wasteful pre-allocation that kills throughput on standard inference. Continuous batching means new requests get processed as soon as there's capacity, rather than waiting for the entire batch to finish. And tensor parallelism lets you split a single model across multiple GPUs without rewriting your serving code.
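To make tensor parallelism concrete, here's a hedged launch sketch using vLLM's OpenAI-compatible server; the model name and GPU count are placeholders, not a recommendation:

```shell
# Model name and counts are placeholders; --tensor-parallel-size shards
# the model across GPUs with no change to your serving code.
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.90
```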
If you're running anything larger than a 7B parameter model with more than a handful of concurrent users, vanilla HuggingFace inference is leaving performance on the table.
The Architecture
Here's how the stack fits together:
FastAPI handles everything before inference: authentication, input validation, rate limiting, and request queuing. Its async support is critical here because you need the API layer to handle hundreds of waiting connections while GPUs churn through the queue. vLLM handles inference, and only inference. You feed it prompts, it gives you tokens. Docker wraps the whole thing into a reproducible, deployable unit with GPU passthrough.
API Design Patterns That Matter
Three patterns I've found essential. First, streaming responses. For any generation longer than a sentence, stream tokens back via Server-Sent Events. Users perceive the response as faster, and your connection timeout problems disappear. Second, request batching at the API layer. Group similar-length prompts together before they hit vLLM; this improves GPU utilization dramatically. Third, queue management with backpressure. When the queue is full, return a 503 immediately rather than letting requests pile up and time out.
```python
from fastapi import HTTPException
from fastapi.responses import StreamingResponse

@app.post("/generate")
async def generate(request: GenerateRequest):
    # Backpressure: fail fast instead of letting requests pile up
    if request_queue.full():
        raise HTTPException(503, "Server at capacity")

    async def token_stream():
        # Stream tokens back as Server-Sent Events
        async for token in vllm_engine.generate_stream(request.prompt):
            yield f"data: {token}\n\n"

    return StreamingResponse(token_stream(), media_type="text/event-stream")
```

Docker GPU Setup
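The second pattern, batching by length, can be sketched as a small grouping step before dispatch. `bucket_by_length` is a hypothetical helper, and whitespace splitting stands in for a real tokenizer:

```python
from collections import defaultdict

def bucket_by_length(prompts, bucket_size=128):
    """Group prompts into batches of similar approximate token count.

    Whitespace splitting is a crude stand-in for real tokenization;
    swap in your model's tokenizer for production use.
    """
    buckets = defaultdict(list)
    for prompt in prompts:
        buckets[len(prompt.split()) // bucket_size].append(prompt)
    return list(buckets.values())
```

Prompts that land in the same bucket pad to roughly the same length, so the GPU wastes fewer cycles on padding tokens.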
The Docker side is straightforward but has gotchas (if you need a refresher on container basics, see my Docker 101 guide). You need the NVIDIA Container Toolkit installed on the host, and your docker-compose.yml needs the GPU reservation block. Pin your CUDA version. Pin your vLLM version. I've had deployments break because a minor vLLM update changed the default quantization behavior.
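A minimal sketch of what pinning looks like in the image itself; the image tag and file names here are illustrative, not an endorsement of specific versions:

```dockerfile
# Pin an exact vLLM image tag rather than :latest, so a registry update
# can't silently change inference behavior. Tag shown is illustrative.
FROM vllm/vllm-openai:v0.4.2

# requirements.txt should pin fastapi, uvicorn, etc. to exact versions
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
```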
```yaml
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: all
          capabilities: [gpu]
```

Monitoring and Health Checks
At minimum, track four things: tokens per second (your throughput metric), time to first token (your latency metric), queue depth (your capacity metric), and GPU memory utilization (your headroom metric). vLLM exposes Prometheus metrics natively. Hook them into Grafana and set alerts on queue depth. For a deeper approach to LLM observability, I recommend instrumenting with OpenTelemetry's GenAI conventions. If the queue is growing, you need more GPUs or fewer users.
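Pointing Prometheus at vLLM's metrics endpoint is a few lines of scrape config. The service name and port below are assumptions for this sketch; use whatever your compose file exposes:

```yaml
scrape_configs:
  - job_name: "vllm"
    scrape_interval: 15s
    static_configs:
      # Hypothetical hostname/port; vLLM serves metrics at /metrics by default
      - targets: ["vllm:8000"]
```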
The Bigger Picture
This stack is becoming the standard for self-hosted LLM serving, and that matters. A year ago, serving your own models felt like a research project. Today, with vLLM's maturity, FastAPI's ecosystem, and Docker's GPU support, it's a well-understood production pattern. The barrier to running your own inference has dropped dramatically. That shift changes the economics of the entire industry, because if serving your own model is cheap and reliable, the case for paying per-token to an API provider gets harder to justify.
Related Posts
From Hackathon to Production: What Changes When Prototypes Get Real
After years of hackathons and production systems, I've learned the gap between a winning demo and a reliable product is mostly about what you choose to worry about.
Voice AI Architecture: Building Conversational Agents at Scale
The full architecture behind voice AI systems. Pipeline design, latency budgets, and why voice is a fundamentally different engineering challenge than chat.
Multi-Agent Systems in Production: What Nobody Tells You
Lessons from building multi-agent systems that actually run in production. What works, what doesn't, and what the hype skips over.