
Building an LLM Gateway with LiteLLM

llm-serving · infrastructure

Every LLM project I've worked on eventually hits the same wall: you're calling three different APIs with three different SDKs, and switching between them means rewriting your integration code. OpenAI uses one format. Anthropic uses another. Your self-hosted Llama endpoint uses a third. And when management asks you to "just try Claude instead of GPT-4 for this use case," that "just" turns into a week of refactoring.

This is a solved problem. The solution is an LLM gateway, and LiteLLM is the one I keep reaching for.

LiteLLM acts as a unified gateway: your application sends requests to one endpoint, and the gateway routes them to OpenAI, Anthropic, self-hosted models, or any other provider.

What LiteLLM Does

LiteLLM gives you a unified OpenAI-compatible interface for over 100 LLM providers. Anthropic, Cohere, Replicate, Azure OpenAI, Bedrock, Vertex AI, local models via Ollama -- all accessible through the same completion() call with the same request and response format.

You can use it two ways. The Python SDK lets you drop litellm.completion() into your existing code and swap models by changing a string. The proxy server sits between your application and your LLM providers, acting as a unified API endpoint that any HTTP client can hit. Your application talks to one URL. LiteLLM figures out the rest.

The proxy mode is where things get interesting for teams. You deploy it once, and every service in your org talks to the same gateway. No more individual teams managing their own API keys and SDKs.

The Proxy Setup

Here's the Docker Compose setup I use to run the LiteLLM proxy with PostgreSQL for usage tracking:

services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - "4000:4000"
    volumes:
      - ./litellm_config.yaml:/app/config.yaml
    environment:
      - DATABASE_URL=postgresql://llm_user:llm_pass@db:5432/litellm
      - LITELLM_MASTER_KEY=sk-your-master-key
    command: ["--config", "/app/config.yaml", "--detailed_debug"]
    depends_on:
      - db
 
  db:
    image: postgres:16
    environment:
      POSTGRES_USER: llm_user
      POSTGRES_PASSWORD: llm_pass
      POSTGRES_DB: litellm
    volumes:
      - pgdata:/var/lib/postgresql/data
 
volumes:
  pgdata:

The config YAML is where you define your model routing:

model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
 
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
 
  - model_name: llama-local
    litellm_params:
      model: ollama/llama3.2
      api_base: http://ollama:11434
 
  # Two deployments under one alias: the router load-balances across them
  # and retries against the other if one fails
  - model_name: best-available
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
    model_info:
      id: best-available-primary
  - model_name: best-available
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
    model_info:
      id: best-available-fallback
 
general_settings:
  master_key: sk-your-master-key
  database_url: os.environ/DATABASE_URL
 
litellm_settings:
  num_retries: 3
  request_timeout: 120
  # If gpt-4o still errors after retries, re-route the request to claude-sonnet
  fallbacks: [{"gpt-4o": ["claude-sonnet"]}]

You can also create virtual API keys per team through the admin API. Each key has its own budget limits, allowed models, and rate limits. The product team gets access to GPT-4o with a monthly cap. The research team gets Claude with higher limits. Intern projects get Llama-local only. All tracked, all enforced at the gateway.
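As a sketch of what that looks like, here's one way to call the proxy's /key/generate admin endpoint from Python using only the standard library. The field names (models, max_budget, budget_duration, key_alias) follow LiteLLM's key-management API, but verify them against the docs for your version; the URL and master key are the ones from the compose file above:

```python
import json
import urllib.request

PROXY_URL = "http://localhost:4000"   # your gateway
MASTER_KEY = "sk-your-master-key"     # from general_settings

def key_generate_payload(team: str, models: list, monthly_budget: float) -> dict:
    """Build the JSON body for the proxy's /key/generate endpoint."""
    return {
        "key_alias": f"{team}-key",
        "models": models,              # models this key is allowed to call
        "max_budget": monthly_budget,  # hard spend cap in USD
        "budget_duration": "30d",      # budget resets every 30 days
    }

def create_virtual_key(payload: dict) -> dict:
    """POST the payload to the admin API (needs a running proxy)."""
    req = urllib.request.Request(
        f"{PROXY_URL}/key/generate",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {MASTER_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = key_generate_payload("product", ["gpt-4o"], monthly_budget=200.0)
# create_virtual_key(payload)  # responds with the new virtual key
```

The response contains the virtual key your team then uses in place of a real provider key.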

Why Gateways Matter

Once you have a gateway in place, you unlock capabilities that are painful to build yourself.

Automatic failover. If OpenAI goes down -- and it does -- your requests silently route to your fallback model. Your users don't notice. Your on-call engineer doesn't get paged at 2 AM. You define the fallback chain once in the config, and LiteLLM handles the rest.

Load balancing. When you have multiple deployments of the same model, maybe across regions or providers, the gateway distributes requests across them. You can weight the distribution or let it balance automatically based on latency.
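As an illustrative config fragment: two deployments sharing one model_name get load-balanced automatically, and router_settings picks the strategy. The Azure deployment details here are made up, and the routing_strategy values follow LiteLLM's router docs, so check them against your version:

```yaml
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o-eu          # hypothetical Azure deployment
      api_base: os.environ/AZURE_EU_API_BASE
      api_key: os.environ/AZURE_EU_API_KEY
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY

router_settings:
  routing_strategy: latency-based-routing   # or the default simple-shuffle
```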

Per-model rate limiting. Different providers have different rate limits. The gateway tracks them and queues or reroutes requests before you hit a 429. Few teams appreciate how much this matters until they're debugging why 5% of their production requests randomly fail.
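As a sketch, per-deployment limits go straight into the model config. The rpm and tpm fields follow LiteLLM's documented litellm_params, though the numbers here are invented:

```yaml
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      rpm: 500      # cap this deployment at 500 requests/minute
      tpm: 300000   # and 300k tokens/minute
```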

Cost tracking by team and project. The PostgreSQL backend stores every request with its token count, cost, and the virtual key that made it. You can finally answer the question "how much is the recommendations team spending on GPT-4o this month?" without scraping billing dashboards. I go deeper into building cost dashboards and budget alerts in my post on tracking token costs.
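If you'd rather pull spend data programmatically than query Postgres directly, the proxy exposes admin endpoints for it. Here's a minimal standard-library sketch, assuming a /spend/logs endpoint with an api_key query filter — both are assumptions to verify against your LiteLLM version:

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

PROXY_URL = "http://localhost:4000"
MASTER_KEY = "sk-your-master-key"

def spend_logs_url(api_key: Optional[str] = None) -> str:
    """Build the spend-log query URL, optionally filtered to one virtual key."""
    params = {"api_key": api_key} if api_key else {}
    query = urllib.parse.urlencode(params)
    return f"{PROXY_URL}/spend/logs" + (f"?{query}" if query else "")

def fetch_spend(api_key: Optional[str] = None) -> list:
    """GET the logs from a running proxy (requires network access)."""
    req = urllib.request.Request(
        spend_logs_url(api_key),
        headers={"Authorization": f"Bearer {MASTER_KEY}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# fetch_spend("sk-team-virtual-key")  # one row per request: tokens, cost, key
```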

A/B testing models in production. Want to know if Claude produces better results than GPT-4o for your specific use case? Route 10% of traffic to one, 90% to the other, and compare outputs. The gateway makes this a config change, not a code change.
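One way to sketch that split, assuming your LiteLLM version supports the weight field on deployments (used by the simple-shuffle routing strategy) — the "summarizer" alias is invented for this example:

```yaml
model_list:
  # 90/10 traffic split under one alias
  - model_name: summarizer
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      weight: 9
  - model_name: summarizer
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
      weight: 1
```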

The Code Is Almost Boring

Here's the Python SDK in action. Same interface, different models:

import litellm
 
# Call OpenAI
response = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize this product description."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
 
# Call Anthropic -- same interface
response = litellm.completion(
    model="anthropic/claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": "Summarize this product description."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
 
# Call your self-hosted model -- still the same interface
response = litellm.completion(
    model="ollama/llama3.2",
    messages=[{"role": "user", "content": "Summarize this product description."}],
    api_base="http://localhost:11434",
    max_tokens=256,
)
print(response.choices[0].message.content)

Three different providers. Three different underlying APIs. One interface. The response object is identical across all three. Your application code doesn't know or care which model answered.

If you're using the proxy mode, it's even simpler. Just point the standard OpenAI SDK at your gateway URL:

from openai import OpenAI
 
client = OpenAI(
    api_key="sk-your-virtual-key",
    base_url="http://localhost:4000",
)
 
response = client.chat.completions.create(
    model="best-available",
    messages=[{"role": "user", "content": "Summarize this product description."}],
)

Your existing OpenAI code works without modification. You just change the base URL and the key.

The Bottom Line

The LLM gateway is becoming as standard as the API gateway was five years ago. If you're building anything serious with LLMs -- especially if your team uses more than one provider -- running a gateway isn't optional anymore. It's infrastructure.

LiteLLM isn't the only option. There's also Portkey, Helicone, and a few others. But LiteLLM is open-source, self-hostable, and the community is moving fast. For most teams, it's the right starting point.

Stop writing provider-specific integration code. Put a gateway in front of your models, and spend your time on the parts of the system that actually differentiate your product.