
The LLM Inference Stack in 2026: From API Call to Response


Over the course of this month, I wrote about individual pieces of the LLM serving stack: vLLM, Ollama, LiteLLM, OpenTelemetry. Each post zoomed into one layer -- how PagedAttention manages GPU memory, how LiteLLM unifies provider APIs, how OTel semantic conventions make LLM observability real, how prompt caching slashes costs by reusing KV tensors.

Let me zoom out and show how they all fit together.

Because here's the thing: these aren't isolated tools. They're layers in a stack that, as of early 2026, has become remarkably well-defined. If you're serving LLMs in production today, you're running some version of this architecture whether you designed it intentionally or stumbled into it.

The Full Stack

Here's what the production LLM inference stack looks like from top to bottom:

                        CLIENT REQUEST
                             |
                             v
                  +---------------------+
                  |     API GATEWAY     |
                  | (auth, rate limit,  |
                  |  request routing)   |
                  +---------------------+
                             |
                             v
                  +---------------------+
                  |    LLM ROUTER       |
                  |  (LiteLLM - model   |
                  |   selection,        |
                  |   fallback chains,  |
                  |   load balancing)   |
                  +---------------------+
                             |
                             v
                  +---------------------+
                  |  PROMPT CACHE CHECK |
                  |  (provider-native   |
                  |   KV tensor reuse)  |
                  +---------------------+
                             |
                             v
                  +---------------------+
                  |  INFERENCE ENGINE   |
                  |  (vLLM / TGI -      |
                  |   PagedAttention,   |
                  |   continuous        |
                  |   batching)         |
                  +---------------------+
                             |
                             v
                  +---------------------+
                  |    GPU CLUSTER      |
                  |  (tensor parallel,  |
                  |   pipeline parallel)|
                  +---------------------+
                             |
                             v
                  +---------------------+
                  | RESPONSE STREAMING  |
                  |  (SSE / WebSocket,  |
                  |   token-by-token)   |
                  +---------------------+
                             |
                             v
        +--------------------+--------------------+
        |                                         |
        v                                         v
+------------------+                  +-------------------+
| OBSERVABILITY    |                  | COST TRACKING     |
| PIPELINE         |                  | (token counts,    |
| (OTel ->         |                  |  per-model costs, |
| Langfuse /       |                  |  budget alerts)   |
| LangSmith)       |                  +-------------------+
+------------------+

The full LLM inference stack from client request to observability. Each layer handles a distinct concern in the serving pipeline.

Every production LLM deployment I've seen in the last six months has some version of this. The specific tools vary -- maybe you use AWS API Gateway instead of Kong, or TGI instead of vLLM, or Phoenix instead of Langfuse -- but the layers are always the same.

Let me walk through why each one exists.

The API gateway is table stakes. Authentication, rate limiting, request validation. This isn't LLM-specific and I won't belabor it. If you're exposing inference endpoints without a gateway, you're one scraped API key away from a very expensive morning.

The LLM router is where it gets interesting. I wrote about LiteLLM earlier this month, and this is the layer it owns. One unified interface to every provider. Model fallback chains so when OpenAI has a bad day, your app switches to Anthropic without your users noticing. Load balancing across multiple deployments. Cost-based routing so cheap queries go to cheap models. This layer barely existed 18 months ago. Now it's non-negotiable.
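
To make the fallback idea concrete, here's a toy router in plain Python. The provider names and call signature are hypothetical -- LiteLLM's real API looks different -- but the control flow is the point: try the primary, walk the chain on failure, and the caller never sees which backend answered.

```python
# Toy fallback-chain router. Provider names and the route() signature are
# illustrative stand-ins, not LiteLLM's actual API.
FALLBACK_CHAINS = {
    "gpt-4o": ["claude-sonnet", "llama-70b-selfhosted"],
}

def route(model, prompt, providers):
    """Return (model_used, response) from the first provider that succeeds."""
    for candidate in [model] + FALLBACK_CHAINS.get(model, []):
        if candidate not in providers:
            continue
        try:
            return candidate, providers[candidate](prompt)
        except Exception:
            continue  # provider outage or rate limit: fall through the chain
    raise RuntimeError("every provider in the fallback chain failed")

def flaky_primary(prompt):
    raise TimeoutError("simulated provider outage")

providers = {
    "gpt-4o": flaky_primary,
    "claude-sonnet": lambda prompt: f"answer to: {prompt}",
}
used, reply = route("gpt-4o", "hello", providers)
print(used)   # claude-sonnet -- the outage was invisible to the caller
```

The real version layers in retries, cooldowns, and load balancing, but the contract is the same: application code calls one interface and routing policy lives in one place.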

Prompt caching sits between the router and the engine. I covered this in detail -- both Anthropic and OpenAI now offer provider-native caching that reuses pre-computed KV tensors for identical prompt prefixes. If your system prompt is 2000 tokens and you're sending it with every request, caching turns that from "pay full price for 2000 input tokens every time" into "pay full price once, then a steep discount on every hit for as long as the cache entry lives." The savings are dramatic and the implementation cost is nearly zero.
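
The back-of-envelope math is worth seeing. The prices below are placeholders -- real per-token rates and cache-read discounts vary by provider -- but the shape of the savings doesn't:

```python
# Hypothetical pricing for illustration only; check your provider's rate card.
PRICE_PER_INPUT_TOKEN = 3.00 / 1_000_000   # $3 per million input tokens (assumed)
CACHE_READ_DISCOUNT = 0.10                 # cached tokens billed at 10% (assumed)

def prompt_cost(system_tokens, requests, cached):
    per_request = system_tokens * PRICE_PER_INPUT_TOKEN
    if cached:
        # First request writes the cache at full price; the rest read it.
        return per_request + (requests - 1) * per_request * CACHE_READ_DISCOUNT
    return requests * per_request

uncached = prompt_cost(2000, 1_000_000, cached=False)
cached = prompt_cost(2000, 1_000_000, cached=True)
print(f"uncached: ${uncached:,.0f}  cached: ${cached:,.0f}")
```

At a million requests a month, a 2000-token system prompt goes from thousands of dollars to roughly a tenth of that, under these assumed rates.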

The inference engine is the heart of the stack. vLLM with PagedAttention for memory-efficient KV cache management and continuous batching to maximize GPU throughput. I wrote about this in my first January post -- the virtual memory trick that turned GPU memory waste from 60-80% to near zero. This is the layer that made self-hosted LLM serving actually viable.
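
A stripped-down sketch of the PagedAttention idea: KV cache memory is carved into fixed-size blocks, and each sequence keeps a block table mapping logical positions to physical blocks. Blocks are grabbed only as tokens are generated, so nothing is reserved for a maximum length that's never reached. (This is the concept only; vLLM's real allocator handles eviction, sharing, and copy-on-write.)

```python
BLOCK_SIZE = 16  # tokens per physical block (vLLM uses small fixed blocks like this)

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # pool of physical block ids
        self.tables = {}    # seq_id -> list of physical block ids (the block table)
        self.lengths = {}   # seq_id -> tokens written so far

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:  # last block is full (or sequence is new)
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

alloc = BlockAllocator(num_blocks=64)
for _ in range(20):              # generate a 20-token sequence
    alloc.append_token("seq-a")
print(len(alloc.tables["seq-a"]))  # 2 blocks cover 20 tokens -- no max-length reservation
```

Contrast that with pre-paging engines, which reserved a contiguous max-length slab per sequence up front; that reservation is exactly where the 60-80% waste came from.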

The GPU cluster handles the physical compute -- tensor parallelism to split models across GPUs, pipeline parallelism for larger deployments. Most teams don't think about this layer directly because vLLM and TGI abstract it away, but it's there.

Response streaming delivers tokens as they're generated. Server-Sent Events for HTTP clients, WebSockets for real-time applications. This is why ChatGPT feels fast even when generation takes 10 seconds -- you see the first token in 200ms.
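
The SSE wire format is simple enough to sketch in a few lines -- each token becomes its own `data:` frame terminated by a blank line, and a sentinel marks the end of the stream (the `[DONE]` sentinel is a common convention, not part of the SSE spec itself):

```python
def sse_events(token_stream):
    """Wrap generated tokens as Server-Sent Events frames."""
    for token in token_stream:
        yield f"data: {token}\n\n"   # an SSE frame is "data: ..." plus a blank line
    yield "data: [DONE]\n\n"         # end-of-stream sentinel used by several providers

frames = list(sse_events(["The", " capital", " is", " Paris"]))
# frames[0] == "data: The\n\n"; frames[-1] == "data: [DONE]\n\n"
```

In a real server each frame is flushed to the socket as soon as the engine emits the token, which is what gets that first token on screen in 200ms.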

Finally, the observability pipeline and cost tracking run in parallel on every response. OTel spans capture prompt content, token counts, model version, latency breakdown. That data flows to Langfuse or LangSmith or Braintrust for analysis. Cost tracking aggregates token usage by model, endpoint, user segment, and query type. I covered both of these layers in dedicated posts this month.
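
Cost aggregation is mostly a fold over span data. The span shape and per-token prices below are assumptions for illustration -- in practice the fields come from OTel GenAI span attributes and the prices from your provider's rate card:

```python
from collections import defaultdict

PRICES = {  # (input, output) dollars per million tokens -- hypothetical
    "gpt-4o": (2.50, 10.00),
    "claude-sonnet": (3.00, 15.00),
}

def aggregate_costs(spans):
    """Sum per-model dollar cost from token counts recorded on each span."""
    totals = defaultdict(float)
    for span in spans:
        inp, out = PRICES[span["model"]]
        totals[span["model"]] += (span["input_tokens"] * inp +
                                  span["output_tokens"] * out) / 1_000_000
    return dict(totals)

spans = [
    {"model": "gpt-4o", "input_tokens": 1200, "output_tokens": 300},
    {"model": "gpt-4o", "input_tokens": 800, "output_tokens": 200},
    {"model": "claude-sonnet", "input_tokens": 500, "output_tokens": 100},
]
print(aggregate_costs(spans))
```

Swap "model" for "user segment" or "endpoint" as the grouping key and the same fold answers the budgeting questions that actually get asked.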

What Changed from 2024

If you were serving LLMs in production in 2024, you were probably duct-taping this stack together yourself. Here's what shifted.

vLLM became the undisputed default. In mid-2024, there was still a real debate between vLLM, TGI, and various llama.cpp wrappers. That debate is over. vLLM's combination of PagedAttention, continuous batching, and broad model support made it the standard. TGI is still solid, especially in Hugging Face's ecosystem, but the community momentum is overwhelmingly vLLM. Every major cloud provider's managed inference offering is built on it or heavily inspired by it.

Prompt caching became provider-native. In 2024, teams were rolling their own semantic caching layers with vector similarity search -- check if a similar prompt was asked before, return the cached response. It worked poorly. The similarity threshold was always wrong. Either you cache-hit too aggressively and return stale answers, or too conservatively and the cache is useless. Provider-native caching at the KV tensor level is deterministic, exact-match, and operates at a layer where it can't return wrong answers. It just skips redundant computation. This is a much better abstraction.

LLM gateways emerged as a category. LiteLLM, Portkey, Helicone -- the idea that you need a unified API layer in front of your LLM providers went from "nice to have" to "how else would you do it." Teams running multiple models without a gateway in 2026 are doing the equivalent of managing database connections without a connection pool. It's technically possible and practically insane.

Observability matured from "log everything" to structured conventions. The OTel GenAI semantic conventions gave the ecosystem a common vocabulary. Instead of every team inventing their own span attributes for token counts and model versions, there's now a standard. Tooling caught up fast. Langfuse and LangSmith both support OTel ingestion natively. The days of console.log debugging LLM pipelines are, mercifully, numbered.

Where Each Layer Is Heading

The stack is stable but not static. Here's where I see each layer evolving through 2026.

Inference: speculative decoding and disaggregated serving. Speculative decoding uses a small draft model to propose tokens and a large model to verify them in parallel. When the draft model guesses correctly -- and it does, often -- you get the quality of the large model at nearly the speed of the small one. Disaggregated serving separates the prefill phase (processing the input prompt) from the decode phase (generating tokens) onto different hardware, because they have completely different compute profiles. Prefill is compute-bound. Decode is memory-bound. Running both on the same GPU is a compromise that neither phase is happy with.
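
The accept/reject mechanics are easy to show with a greedy toy. Real speculative decoding verifies all k draft tokens in one batched forward pass of the target model and accepts via probability ratios; the stand-in functions below just make the "keep the longest agreeing prefix, plus one corrected token" step concrete:

```python
def speculative_step(draft_next, target_next, context, k=4):
    """One round: draft proposes k tokens greedily, target keeps the agreeing prefix."""
    proposed, ctx = [], list(context)
    for _ in range(k):
        t = draft_next(ctx)
        proposed.append(t)
        ctx.append(t)
    accepted, ctx = [], list(context)
    for t in proposed:
        correct = target_next(ctx)
        if t != correct:
            accepted.append(correct)  # replace the first wrong draft token and stop
            break
        accepted.append(t)
        ctx.append(t)
    return accepted

# Stand-in "models": the draft agrees with the target except at position 3.
target = lambda ctx: ["the", "sky", "is", "clear", "today"][len(ctx)]
draft = lambda ctx: ["the", "sky", "is", "blue", "today"][len(ctx)]
print(speculative_step(draft, target, []))  # ['the', 'sky', 'is', 'clear']
```

One target-model pass yielded four committed tokens instead of one -- that multiplier is the whole appeal.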

Routing: semantic routing by query intent. Today's routers select models based on static rules -- this endpoint uses GPT-4o, that one uses Claude. The next generation will classify queries by intent and route dynamically. A simple factual lookup doesn't need a frontier model. A complex multi-step reasoning task does. The router should know the difference and act accordingly, cutting costs without sacrificing quality where it matters.
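
In skeleton form, intent routing is a classifier in front of the model table. A production router would use a small trained classifier rather than the keyword heuristic below, and the model names are placeholders:

```python
# Crude stand-in for an intent classifier; real routers use a small model here.
COMPLEX_MARKERS = ("step by step", "analyze", "compare", "prove", "plan")

def pick_model(query):
    q = query.lower()
    if any(marker in q for marker in COMPLEX_MARKERS) or len(q.split()) > 40:
        return "frontier-model"    # multi-step reasoning: pay for quality
    return "small-cheap-model"     # simple lookup: don't overpay

print(pick_model("What's the capital of France?"))          # small-cheap-model
print(pick_model("Analyze the tradeoffs between X and Y"))  # frontier-model
```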

Caching: smarter semantic caching that actually works. Provider-native exact-match caching is great for identical prefixes. But the holy grail is semantic caching -- recognizing that "What's the capital of France?" and "Tell me France's capital city" should return the same cached response. Early attempts were bad. But embedding models have gotten good enough, and cache invalidation strategies have gotten smart enough, that I expect reliable semantic caching layers to emerge this year. The key insight is that semantic caching should be opt-in per query type, not a blanket policy.
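
The lookup logic is straightforward once you have embeddings. The bag-of-words cosine below is a deliberately crude stand-in for a learned embedding model -- it exists only to make the threshold-based hit/miss decision concrete:

```python
import math
from collections import Counter

def embed(text):
    # Toy "embedding": normalized token counts. Real systems use embedding models.
    return Counter(text.lower().replace("?", "").replace("'s", " is").split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    if dot == 0:
        return 0.0
    return dot / (math.sqrt(sum(v * v for v in a.values())) *
                  math.sqrt(sum(v * v for v in b.values())))

class SemanticCache:
    def __init__(self, threshold=0.6):
        self.entries = []          # (embedding, cached response)
        self.threshold = threshold

    def get(self, query):
        q = embed(query)
        for emb, resp in self.entries:
            if cosine(q, emb) >= self.threshold:
                return resp
        return None

    def put(self, query, response):
        self.entries.append((embed(query), response))

cache = SemanticCache()
cache.put("What is the capital of France?", "Paris")
print(cache.get("what is france's capital?"))  # Paris -- paraphrase hits
print(cache.get("how do I bake bread"))        # None -- unrelated query misses
```

Everything hard about this layer lives in the threshold and the invalidation policy, which is exactly why it should be opt-in per query type.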

Observability: real-time eval-in-production. Today's observability is backward-looking. You trace requests, review them later, maybe run offline evals nightly. The next step is evaluating quality in real-time as responses stream. Lightweight classifiers that score hallucination risk, relevance, and safety before the response reaches the user. Not replacing offline evals -- supplementing them with a real-time safety net. Langfuse is already building toward this with their online evaluation features.
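
Structurally, a real-time gate is just cheap scorers between generation and delivery. The checks below are deliberately simplistic stand-ins for lightweight classifier models -- a crude groundedness test against retrieved context, plus an emptiness check:

```python
def score_response(prompt, response, context_docs):
    """Return a list of flags; empty means the response passes the gate."""
    flags = []
    # Crude groundedness stand-in: long content words should appear in context.
    unsupported = [w for w in response.lower().split()
                   if w.isalpha() and len(w) > 6
                   and not any(w in doc.lower() for doc in context_docs)]
    if len(unsupported) > 3:
        flags.append("possible-hallucination")
    if not response.strip():
        flags.append("empty-response")
    return flags

print(score_response("Q", "short grounded answer", ["grounded answer doc"]))  # []
```

The point is the placement, not the heuristics: scoring happens on the response path, cheap enough to run on every request, feeding the same trace the offline evals read later.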

The Self-Hosting vs API Decision

All of this raises the obvious question: should you even run this stack yourself?

Self-host when:

  • Data privacy is non-negotiable. Healthcare, finance, legal. If your data cannot leave your infrastructure, API providers are off the table. Full stop.
  • You're at scale. Past roughly 100M tokens per month, self-hosting on your own GPU cluster starts beating API pricing. The exact crossover depends on your model choice and hardware costs, but the economics flip faster than most people expect.
  • You need fine-tuned models. If you're running custom LoRA fine-tunes or domain-specific models that aren't available from providers, self-hosting is your only option. The stack I described above -- vLLM, LiteLLM, OTel -- works the same with your own models.
  • Inference is a core competency. If you're an AI company whose product is built on model serving, you should own the stack. Outsourcing your core to API rate limits and provider outages is a strategic risk.
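
The crossover point in the second bullet is worth sanity-checking with your own numbers. Every figure below is an assumption -- a blended API rate and an all-in monthly cost for one reserved GPU -- chosen only to show the shape of the calculation:

```python
API_BLENDED_PER_M = 15.00    # blended input+output $/M tokens for a frontier API (assumed)
GPU_MONTHLY_COST = 1_500.00  # one reserved GPU, all-in monthly cost (assumed)

def monthly_cost(tokens, self_hosted):
    if self_hosted:
        return GPU_MONTHLY_COST  # fixed, until volume forces a second GPU
    return tokens / 1_000_000 * API_BLENDED_PER_M

crossover = GPU_MONTHLY_COST / API_BLENDED_PER_M * 1_000_000
print(f"crossover ~ {crossover / 1e6:.0f}M tokens/month")
```

Under these assumptions the curves cross around 100M tokens per month, which is where that rule of thumb comes from; cheaper hardware or pricier API tiers pull the crossover lower.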

Use API providers when:

  • You're small. Below the cost crossover point, API providers are cheaper because you're not paying for idle GPUs. You're paying per token, which scales linearly with actual usage.
  • You need frontier models. GPT-4o, Claude Opus, Gemini Ultra -- if you need the absolute best models, the providers have them and you don't. Self-hosting open models is viable for many use cases, but there are still tasks where the proprietary frontier models are meaningfully better.
  • ML infrastructure isn't your thing. Running GPU clusters, managing model updates, handling failover -- this is operational complexity. If your team's expertise is in building the application layer, not the infrastructure layer, let someone else manage the GPUs.
  • Speed to market matters. An API call is an afternoon. A self-hosted inference stack is a quarter. If you're validating a product idea, don't build infrastructure for a product that might not exist in three months.

Most teams I advise end up with a hybrid approach: API providers for frontier model capabilities and quick experimentation, self-hosted infrastructure for high-volume production workloads where the economics and privacy requirements justify it. LiteLLM makes this painless because your application code doesn't care which backend is serving the response.

Wrapping Up January

This month, I set out to document the LLM serving stack piece by piece. vLLM and PagedAttention. Ollama for local development. LiteLLM as the routing layer. OpenTelemetry for observability. Langfuse vs LangSmith for the analysis platform. Token cost tracking. Prompt caching.

Each post covered one layer in depth. This post is the zoomed-out view -- how they connect, why each layer exists, and where the whole stack is heading.

The thing that strikes me most, looking at the complete picture, is how normal this all feels now. Two years ago, serving an LLM in production was a research project. You were reading papers, writing custom CUDA kernels, praying your OOM errors would stop. Today, it's a stack. A well-understood, well-tooled, boring stack.

The infrastructure is boring now. That's the best compliment I can give it.

Boring means stable. Boring means predictable. Boring means teams can stop worrying about how to serve models and start worrying about what to build with them. And that shift -- from infrastructure as a challenge to infrastructure as a commodity -- is the real story of early 2026.

On to February.