· 6 min read

vLLM PagedAttention: Why It's the Default for LLM Serving

llm-serving · infrastructure

I wrote about the FastAPI + vLLM stack back in July. That post covered the full production architecture -- API layer, Docker GPU passthrough, monitoring, the works. But re-reading it, I realized I glossed over the single most important piece: PagedAttention. I mentioned it in one sentence and moved on. That was a mistake, because PagedAttention is the reason vLLM exists, and it's the reason vLLM became the default serving engine for anyone running LLMs at scale.

Let me fix that.

PagedAttention manages GPU memory like virtual memory: logical KV cache blocks are mapped to scattered physical blocks on demand, all but eliminating fragmentation.

The KV Cache Problem

Every time a transformer generates a token, it computes key-value pairs for every attention layer. These KV pairs get cached so the model doesn't have to recompute them for previous tokens when generating the next one. This is the KV cache, and it grows linearly with sequence length and model depth.
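To get a feel for the scale, here's a back-of-the-envelope calculator for per-sequence KV cache size. The shapes below are a hypothetical 7B-class configuration (32 layers, 32 KV heads, head dimension 128, fp16), not measurements from any specific deployment:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, dtype_bytes=2):
    """Bytes of KV cache for one sequence: a K and a V tensor per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

# 7B-class shape: 32 layers, 32 KV heads, head_dim 128, fp16 (2 bytes).
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)     # 512 KiB per token
full_ctx  = kv_cache_bytes(32, 32, 128, seq_len=4096)  # 2 GiB per request
```

Half a megabyte per token means a single 4096-token reservation eats about 2 GB of VRAM -- whether or not those tokens are ever generated.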

Here's where things get ugly. Traditional serving frameworks pre-allocate KV cache memory for the worst case -- the maximum sequence length the model supports. If your model handles 4096 tokens, the framework reserves GPU memory for all 4096 tokens the moment a request arrives, even if the actual generation only uses 200 tokens.

The waste is staggering. In practice, 60-80% of allocated GPU memory is just padding. Empty reserved space that no token will ever use. Multiply that across concurrent requests and you're burning through your most expensive resource -- GPU VRAM -- on nothing.

This is why naive HuggingFace serving falls over under load. It's not a compute bottleneck. It's a memory bottleneck. You run out of GPU RAM long before you run out of GPU compute.

How PagedAttention Fixes It

The insight behind PagedAttention is borrowed directly from operating systems. Your OS doesn't give each process a contiguous block of physical RAM. It uses virtual memory -- small fixed-size pages mapped to scattered physical locations. The process thinks it has a clean contiguous address space, but the OS is packing pages efficiently behind the scenes.

PagedAttention does the same thing for KV cache. Instead of one big contiguous reservation per request, the KV cache is split into small blocks (pages). Each block holds the KV pairs for a fixed number of tokens. Blocks are allocated on demand as tokens are actually generated, not reserved upfront.
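The mechanics can be sketched as a page table: each request keeps a list mapping logical block indices to physical block IDs, and a new physical block is grabbed from the free pool only when the current one fills. This is a toy model of the idea, not vLLM's actual allocator; the 16-token block size matches vLLM's default:

```python
BLOCK_SIZE = 16  # tokens per KV block (vLLM's default block size)

class BlockAllocator:
    """Toy model of on-demand paged KV allocation."""
    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))
        self.block_tables = {}  # request id -> list of physical block ids

    def append_token(self, req_id, token_index):
        table = self.block_tables.setdefault(req_id, [])
        if token_index % BLOCK_SIZE == 0:   # current block full: map a new one
            table.append(self.free.pop())
        # logical position -> (physical block, offset), like a page table
        return table[token_index // BLOCK_SIZE], token_index % BLOCK_SIZE

    def release(self, req_id):
        # Finished request: all its blocks go straight back to the free pool.
        self.free.extend(self.block_tables.pop(req_id, []))
```

A request that generates 20 tokens holds exactly two 16-token blocks -- not a 4096-token reservation -- and releases them the moment it finishes.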

TRADITIONAL ALLOCATION (contiguous, pre-allocated):

GPU Memory: [====Req A (padded)====][====Req B (padded)====][  WASTED  ]
             ^^^actual^^^  ^padding^  ^^^actual^^^  ^padding^

PAGEDATTENTION (non-contiguous, on-demand):

GPU Memory: [A][B][A][B][A][B][A][B][  FREE  ][  FREE  ][  FREE  ]
             ^allocated as needed^    ^available for new requests^

Because blocks are small and non-contiguous, fragmentation all but disappears -- the only waste is the unfilled tail of each sequence's last block. No padding. No oversized reservations. Memory that isn't actively holding KV pairs for real tokens stays in the free pool, available for new requests.

The virtual memory analogy goes deeper. PagedAttention supports copy-on-write for shared prefixes. If multiple requests start with the same system prompt, they share KV cache blocks for that prefix rather than duplicating them. This dovetails with prompt caching at the provider level, which reuses pre-computed KV tensors for identical prefixes to slash costs. For chat applications where every request includes the same system message, this alone can save significant memory.
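Copy-on-write sharing reduces to reference counting. Forking a prefix just bumps refcounts; the actual copy happens only on the first divergent write. A minimal sketch (hypothetical names, not vLLM internals):

```python
class CowBlock:
    """Toy copy-on-write KV block with a reference count."""
    def __init__(self, data):
        self.data = data
        self.refs = 1

def fork(prefix_blocks):
    # Sharing a prefix is just bumping refcounts -- no KV data is copied.
    for block in prefix_blocks:
        block.refs += 1
    return list(prefix_blocks)

def write(table, i, new_data):
    # First divergent write to a shared block triggers the actual copy.
    if table[i].refs > 1:
        table[i].refs -= 1
        table[i] = CowBlock(table[i].data)
    table[i].data = new_data
```

Ten chat sessions sharing a long system prompt hold one physical copy of its KV blocks, and a session only pays for its own copy if it ever needs to diverge.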

The Numbers

The throughput improvements are not marginal. The original vLLM paper demonstrated 2-4x higher throughput than state-of-the-art systems like FasterTransformer in typical serving scenarios. In high-concurrency situations with many simultaneous requests and varied sequence lengths, the improvement climbs as high as 24x over naive HuggingFace inference.

That 24x number sounds like marketing, but it makes sense when you think about it. If traditional serving wastes 70% of GPU memory on padding, you can only fit a few concurrent requests. PagedAttention reclaims that memory, which means more concurrent requests, which means the GPU stays busier, which means higher throughput. The improvement compounds.
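The compounding is easy to see with illustrative numbers (a hypothetical 24 GB card with ~10 GB left for KV cache after weights, ~0.5 MB of KV per token for a 7B-class model in fp16):

```python
kv_budget_mb = 10 * 1024  # VRAM left for KV cache after model weights
per_token_mb = 0.5        # KV cache per token, 7B-class model in fp16

# Traditional: reserve the full 4096-token context per request.
reserved_per_req = 4096 * per_token_mb                 # 2048 MB each
static_concurrency = kv_budget_mb // reserved_per_req  # 5 requests fit

# Paged: allocate only what's generated -- say ~600 tokens on average.
used_per_req = 600 * per_token_mb                      # 300 MB each
paged_concurrency = kv_budget_mb // used_per_req       # 34 requests fit
```

Same GPU, roughly 7x the concurrent requests -- and higher concurrency is what keeps the GPU's compute saturated.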

Even in modest setups -- single GPU, moderate concurrency -- you're looking at 2x throughput minimum. That means serving twice as many users with the same hardware, or cutting your GPU bill in half for the same traffic. Either way, it's the single highest-ROI optimization in the LLM serving stack.

Continuous Batching

PagedAttention doesn't work alone. vLLM pairs it with continuous batching, and the combination is what makes it dominant.

Traditional batching is static. You collect N requests, process them as a batch, wait for the entire batch to finish, then start the next batch. The problem is obvious: if one request in the batch generates 500 tokens and another generates 50, the short request is done but sits idle while the long one finishes. GPU cycles wasted.

Continuous batching eliminates this. When a request in the batch finishes generating, its slot is immediately filled by a new request from the queue. The batch is never stalled waiting for the slowest request. The GPU is never idle while requests are waiting.
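A toy step-counting simulation shows the gap. Each "step" is one decode iteration over the batch; the only difference between the two modes is when freed slots are refilled. Request lengths and batch size are made up for illustration:

```python
def simulate(request_lengths, batch_size, continuous):
    """Count decode steps to drain all requests at a given batch size."""
    queue = list(request_lengths)
    slots, steps = [], 0
    while queue or slots:
        # Continuous batching refills free slots every step; static
        # batching only refills once the whole batch has drained.
        if continuous or not slots:
            while queue and len(slots) < batch_size:
                slots.append(queue.pop(0))
        steps += 1
        slots = [r - 1 for r in slots if r > 1]  # drop finished requests
    return steps

lengths = [500, 50, 50, 50]
static_steps     = simulate(lengths, batch_size=2, continuous=False)  # 550
continuous_steps = simulate(lengths, batch_size=2, continuous=True)   # 500
```

With static batching, the three short requests serialize behind the 500-token one; with continuous batching, they slip into the slot the moment it frees up, and total steps drop to the length of the longest request alone.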

Combined with PagedAttention, this creates a serving engine where memory is allocated exactly as needed and compute is utilized continuously. No padding waste, no batch stalls. It's the difference between a highway where every lane stays full and one where traffic stops at every on-ramp.

The Bottom Line

PagedAttention isn't just a clever optimization. It's a fundamental rethinking of how GPU memory is managed during LLM inference. The virtual memory abstraction that operating systems figured out decades ago turns out to be exactly what LLM serving needed.

If you're serving any model larger than 7B to more than a handful of users, vLLM with PagedAttention isn't optional. It's table stakes. For lighter workloads and local prototyping, Ollama is the right starting point before graduating to vLLM.