vLLM PagedAttention: Why It's the Default for LLM Serving
I wrote about the FastAPI + vLLM stack back in July. That post covered the full production architecture -- API layer, Docker GPU passthrough, monitoring, the works. But re-reading it, I realized I glossed over the single most important piece: PagedAttention. I mentioned it in one sentence and moved on. That was a mistake, because PagedAttention is the reason vLLM exists, and it's the reason vLLM became the default serving engine for anyone running LLMs at scale.
Let me fix that.
The KV Cache Problem
Every time a transformer generates a token, it computes key-value pairs for every attention layer. These KV pairs get cached so the model doesn't have to recompute them for previous tokens when generating the next one. This is the KV cache, and it grows linearly with sequence length and model depth.
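To get a feel for the scale, here's a back-of-envelope calculation of KV cache size. The model dimensions below are assumptions for illustration (a 7B-class model along the lines of Llama 2 7B: 32 layers, 32 KV heads, head dim 128, fp16), not measurements from any specific deployment:

```python
# Rough KV cache sizing for an assumed 7B-class model in fp16.
num_layers = 32
num_kv_heads = 32
head_dim = 128
bytes_per_value = 2  # fp16

# 2x for key + value, per layer, per token
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
print(kv_bytes_per_token)         # 524288 bytes, i.e. 0.5 MiB per token

seq_len = 4096
kv_cache_gib = kv_bytes_per_token * seq_len / 2**30
print(f"{kv_cache_gib:.1f} GiB")  # 2.0 GiB for one full-length sequence
```

Half a megabyte per token doesn't sound like much until you multiply it by sequence length and concurrency.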
Here's where things get ugly. Traditional serving frameworks pre-allocate KV cache memory for the worst case -- the maximum sequence length the model supports. If your model handles 4096 tokens, the framework reserves GPU memory for all 4096 tokens the moment a request arrives, even if the actual generation only uses 200 tokens.
The waste is staggering. In practice, 60-80% of allocated GPU memory is just padding. Empty reserved space that no token will ever use. Multiply that across concurrent requests and you're burning through your most expensive resource -- GPU VRAM -- on nothing.
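The arithmetic makes the waste concrete. Assuming roughly 0.5 MiB of KV cache per token (a 7B-class model in fp16) and a hypothetical 40 GiB KV cache budget:

```python
# Assumed: ~0.5 MiB of KV cache per token (7B-class model, fp16),
# a 40 GiB KV budget, 4096-token reservations, ~200 tokens actually used.
budget_mib = 40 * 1024
reserved_per_req = 4096 * 0.5   # MiB reserved per request (worst case)
used_per_req = 200 * 0.5        # MiB actually holding KV pairs

fits_preallocated = int(budget_mib // reserved_per_req)
fits_on_demand = int(budget_mib // used_per_req)
waste = 1 - used_per_req / reserved_per_req

print(fits_preallocated)   # 20 concurrent requests fit
print(fits_on_demand)      # 409 would fit without the padding
print(f"{waste:.0%}")      # 95% of each reservation is padding
```

Under these (illustrative) numbers, pre-allocation caps you at 20 concurrent requests while 95% of the reserved memory sits empty.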
This is why naive HuggingFace serving falls over under load. It's not a compute bottleneck. It's a memory bottleneck. You run out of GPU RAM long before you run out of GPU compute.
How PagedAttention Fixes It
The insight behind PagedAttention is borrowed directly from operating systems. Your OS doesn't give each process a contiguous block of physical RAM. It uses virtual memory -- small fixed-size pages mapped to scattered physical locations. The process thinks it has a clean contiguous address space, but the OS is packing pages efficiently behind the scenes.
PagedAttention does the same thing for KV cache. Instead of one big contiguous reservation per request, the KV cache is split into small blocks (pages). Each block holds the KV pairs for a fixed number of tokens. Blocks are allocated on demand as tokens are actually generated, not reserved upfront.
TRADITIONAL ALLOCATION (contiguous, pre-allocated):

GPU Memory: [====Req A (padded)====][====Req B (padded)====][ WASTED ]
             ^actual^    ^padding^   ^actual^    ^padding^

PAGEDATTENTION (non-contiguous, on-demand):

GPU Memory: [A][B][A][B][A][B][A][B][ FREE ][ FREE ][ FREE ]
             ^-allocated as needed-^ ^available for new requests^
Because blocks are small and allocated on demand, fragmentation all but disappears: external fragmentation is eliminated since every block is the same size, and internal fragmentation is limited to a request's last, partially filled block. No padding. No wasted reservations. Memory that isn't actively holding KV pairs for real tokens stays in the free pool, available for new requests.
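The mechanics can be sketched in a few lines. This is a hypothetical toy allocator, not vLLM's actual implementation: a free pool of fixed-size physical blocks plus a per-request block table, with blocks handed out only as tokens are actually generated:

```python
BLOCK_SIZE = 16  # tokens per block (vLLM's default block size is also 16)

class BlockManager:
    """Toy paged-KV allocator: illustrative only, not vLLM's code."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # physical block ids
        self.tables = {}                     # req_id -> list of block ids
        self.lengths = {}                    # req_id -> tokens generated

    def append_token(self, req_id):
        """Reserve room for one more token; grab a block only when needed."""
        n = self.lengths.get(req_id, 0)
        if n % BLOCK_SIZE == 0:              # current block full (or first token)
            if not self.free:
                raise MemoryError("no free KV blocks; request must wait")
            self.tables.setdefault(req_id, []).append(self.free.pop())
        self.lengths[req_id] = n + 1

    def release(self, req_id):
        """Finished request: its blocks go straight back to the free pool."""
        self.free.extend(self.tables.pop(req_id, []))
        self.lengths.pop(req_id, None)

mgr = BlockManager(num_blocks=4)
for _ in range(20):                          # 20 tokens -> 2 blocks, not 4096 tokens' worth
    mgr.append_token("A")
print(len(mgr.tables["A"]), len(mgr.free))   # 2 blocks used, 2 still free
```

The key property: a request that generates 20 tokens holds exactly two 16-token blocks, and the moment it finishes, those blocks are reusable by anyone.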
The virtual memory analogy goes deeper. PagedAttention supports copy-on-write for shared prefixes. If multiple requests start with the same system prompt, they share KV cache blocks for that prefix rather than duplicating them. This dovetails with prompt caching at the provider level, which reuses pre-computed KV tensors for identical prefixes to slash costs. For chat applications where every request includes the same system message, this alone can save significant memory.
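Copy-on-write sharing can be sketched the same way. Again, this is a hypothetical illustration (the `Block`, `fork`, and `write_token` names are invented for this post, not vLLM's API): blocks carry a reference count, and a shared block is only duplicated when a request needs to write into it:

```python
class Block:
    """Toy KV block with a reference count (illustrative only)."""
    def __init__(self, tokens=()):
        self.tokens = list(tokens)
        self.refs = 1

def fork(block_table):
    """A new request shares the prompt's blocks instead of copying them."""
    for b in block_table:
        b.refs += 1
    return list(block_table)

def write_token(table, idx, token):
    """Copy-on-write: duplicate the block only if someone else shares it."""
    if table[idx].refs > 1:
        table[idx].refs -= 1
        table[idx] = Block(table[idx].tokens)  # private copy, refs = 1
    table[idx].tokens.append(token)

prompt = [Block(range(10))]   # KV blocks for a shared system prompt
a = fork(prompt)              # request A shares the prefix
b = fork(prompt)              # request B shares it too
write_token(a, 0, 99)         # A diverges: gets a private copy
print(prompt[0].refs)         # 2: still shared by the prompt table and B
```

Ten requests with the same system prompt pay for that prefix's KV cache once, not ten times.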
The Numbers
The throughput improvements are not marginal. The original vLLM paper demonstrated 2-4x throughput over state-of-the-art systems like FasterTransformer in typical serving scenarios. In high-concurrency situations with many simultaneous requests and varied sequence lengths, the improvement climbs to up to 24x over naive HuggingFace inference.
That 24x number sounds like marketing, but it makes sense when you think about it. If traditional serving wastes 70% of GPU memory on padding, you can only fit a few concurrent requests. PagedAttention reclaims that memory, which means more concurrent requests, which means the GPU stays busier, which means higher throughput. The improvement compounds.
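The rough arithmetic behind the compounding claim, using the 70% waste figure from above:

```python
# If 70% of allocated KV memory is padding, reclaiming it lets ~3.3x
# more requests share the same GPU -- before batching gains on top.
padding_fraction = 0.70
concurrency_multiplier = 1 / (1 - padding_fraction)
print(round(concurrency_multiplier, 1))  # 3.3
```

More than 3x the concurrency from memory alone; the rest of the gap to 24x comes from keeping the GPU saturated with those extra requests.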
Even in modest setups -- single GPU, moderate concurrency -- you can typically expect around 2x throughput. That means serving twice as many users with the same hardware, or cutting your GPU bill in half for the same traffic. Either way, it's arguably the single highest-ROI optimization in the LLM serving stack.
Continuous Batching
PagedAttention doesn't work alone. vLLM pairs it with continuous batching, and the combination is what makes it dominant.
Traditional batching is static. You collect N requests, process them as a batch, wait for the entire batch to finish, then start the next batch. The problem is obvious: if one request in the batch generates 500 tokens and another generates 50, the short request is done but sits idle while the long one finishes. GPU cycles wasted.
Continuous batching eliminates this. When a request in the batch finishes generating, its slot is immediately filled by a new request from the queue. The batch is never stalled waiting for the slowest request. The GPU is never idle while requests are waiting.
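A toy simulation makes the difference visible. This is an illustrative loop, not vLLM's actual scheduler: each step decodes one token for every in-flight request, and a finished request's slot is refilled from the queue in the same step:

```python
from collections import deque

def serve(requests, batch_size):
    """Toy continuous-batching loop. requests: (req_id, tokens_to_generate).
    Returns request ids in the order they finish."""
    queue = deque(requests)
    batch = {}               # req_id -> tokens still to generate
    finished = []
    while queue or batch:
        # Fill empty slots immediately instead of waiting for the whole batch.
        while queue and len(batch) < batch_size:
            req_id, n = queue.popleft()
            batch[req_id] = n
        for req_id in list(batch):   # one decode step for every request
            batch[req_id] -= 1
            if batch[req_id] == 0:
                finished.append(req_id)
                del batch[req_id]    # slot frees up this same step

    return finished

# B's 50-token request finishes and hands its slot to C long before
# A's 500-token request is done.
print(serve([("A", 500), ("B", 50), ("C", 50)], batch_size=2))  # ['B', 'C', 'A']
```

With static batching, C couldn't start until A's 500 tokens were done; here it starts at step 51.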
Combined with PagedAttention, this creates a serving engine where memory is allocated exactly as needed and compute is utilized continuously. No padding waste, no batch stalls. It's the difference between a highway where every lane stays full and one where traffic stops at every on-ramp.
The Bottom Line
PagedAttention isn't just a clever optimization. It's a fundamental rethinking of how GPU memory is managed during LLM inference. The virtual memory abstraction that operating systems figured out decades ago turns out to be exactly what LLM serving needed.
If you're serving any model larger than 7B to more than a handful of users, vLLM with PagedAttention isn't optional. It's table stakes. For lighter workloads and local prototyping, Ollama is the right starting point before graduating to vLLM.
Related Posts
Building an LLM Gateway with LiteLLM
One API to call OpenAI, Anthropic, and self-hosted models. LiteLLM handles routing, fallbacks, and cost tracking so you don't have to.
Self-Hosting Qdrant: From Docker Compose to Production
Qdrant gives you the fastest open-source vector search. Here's how to go from docker-compose up to production-ready deployment.
The LLM Inference Stack in 2026: From API Call to Response
The stack for serving LLMs has matured dramatically. Here's the full picture from API gateway to GPU, and where each layer is heading.