Building an LLM Gateway with LiteLLM
Every LLM project I've worked on eventually hits the same wall: you're calling three different APIs with three different SDKs, and switching between them means rewriting your integration code. OpenAI uses one format. Anthropic uses another. Your self-hosted Llama endpoint uses a third. And when management asks you to "just try Claude instead of GPT-4 for this use case," that "just" turns into a week of refactoring.
This is a solved problem. The solution is an LLM gateway, and LiteLLM is the one I keep reaching for.
What LiteLLM Does
LiteLLM gives you a unified OpenAI-compatible interface for over 100 LLM providers. Anthropic, Cohere, Replicate, Azure OpenAI, Bedrock, Vertex AI, local models via Ollama -- all accessible through the same completion() call with the same request and response format.
You can use it two ways. The Python SDK lets you drop litellm.completion() into your existing code and swap models by changing a string. The proxy server sits between your application and your LLM providers, acting as a unified API endpoint that any HTTP client can hit. Your application talks to one URL. LiteLLM figures out the rest.
The proxy mode is where things get interesting for teams. You deploy it once, and every service in your org talks to the same gateway. No more individual teams managing their own API keys and SDKs.
The Proxy Setup
Here's the Docker Compose setup I use to run the LiteLLM proxy with PostgreSQL for usage tracking:
```yaml
version: "3.9"
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - "4000:4000"
    volumes:
      - ./litellm_config.yaml:/app/config.yaml
    environment:
      - DATABASE_URL=postgresql://llm_user:llm_pass@db:5432/litellm
      - MASTER_KEY=sk-your-master-key
    command: ["--config", "/app/config.yaml", "--detailed_debug"]
    depends_on:
      - db
  db:
    image: postgres:16
    environment:
      POSTGRES_USER: llm_user
      POSTGRES_PASSWORD: llm_pass
      POSTGRES_DB: litellm
    volumes:
      - pgdata:/var/lib/postgresql/data
volumes:
  pgdata:
```
The config YAML is where you define your model routing:
```yaml
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: llama-local
    litellm_params:
      model: ollama/llama3.2
      api_base: http://ollama:11434
  # Fallback chain: try GPT-4o first, fall back to Claude
  - model_name: best-available
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
    model_info:
      id: best-available-primary
  - model_name: best-available
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
    model_info:
      id: best-available-fallback

general_settings:
  master_key: sk-your-master-key
  database_url: os.environ/DATABASE_URL

litellm_settings:
  num_retries: 3
  request_timeout: 120
  fallbacks: [{"best-available": ["best-available"]}]
```
You can also create virtual API keys per team through the admin API. Each key has its own budget limits, allowed models, and rate limits. The product team gets access to GPT-4o with a monthly cap. The research team gets Claude with higher limits. Intern projects get llama-local only. All tracked, all enforced at the gateway.
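Key creation is a POST to the proxy's admin API. Here's a minimal sketch using only the standard library -- the `/key/generate` endpoint and the `models`/`max_budget` fields match the LiteLLM docs as of this writing, but verify them against your version:

```python
import json
import urllib.request

def build_key_request(models, max_budget, master_key, base_url="http://localhost:4000"):
    """Build a POST to the proxy's /key/generate admin endpoint.

    models and max_budget are the fields I use most; the endpoint accepts
    more (duration, tpm_limit, rpm_limit, ...) -- check the docs for your
    LiteLLM version.
    """
    payload = {
        "models": models,          # model_names this key is allowed to call
        "max_budget": max_budget,  # USD spend cap, tracked in Postgres
        "metadata": {"team": "product"},
    }
    return urllib.request.Request(
        f"{base_url}/key/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {master_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# With the proxy running:
# resp = urllib.request.urlopen(build_key_request(["gpt-4o"], 100.0, "sk-your-master-key"))
# print(json.loads(resp.read())["key"])  # the virtual key to hand to the team
```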
Why Gateways Matter
Once you have a gateway in place, you unlock capabilities that are painful to build yourself.
Automatic failover. If OpenAI goes down -- and it does -- your requests silently route to your fallback model. Your users don't notice. Your on-call engineer doesn't get paged at 2 AM. You define the fallback chain once in the config, and LiteLLM handles the rest.
Load balancing. When you have multiple deployments of the same model, maybe across regions or providers, the gateway distributes requests across them. You can weight the distribution or let it balance automatically based on latency.
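In config terms, load balancing falls out of giving two deployments the same model_name. A sketch, assuming one Azure deployment (the deployment name here is hypothetical) and one OpenAI deployment -- the weight field and routing_strategy values are from the LiteLLM router docs, so confirm them against your version:

```yaml
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: azure/my-gpt-4o-eu          # hypothetical Azure deployment name
      api_base: os.environ/AZURE_API_BASE
      api_key: os.environ/AZURE_API_KEY
      weight: 2                          # roughly 2x the traffic of the entry below
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      weight: 1

router_settings:
  routing_strategy: simple-shuffle       # weighted random; latency-based-routing is another option
```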
Per-model rate limiting. Different providers have different rate limits. The gateway tracks them and queues or reroutes requests before you hit a 429. Most teams underestimate how much this matters until they're debugging why 5% of their production requests fail at random.
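Those limits live in the same config, per deployment. A sketch -- tpm and rpm are the token-per-minute and request-per-minute caps LiteLLM's router tracks; as always, check the field names against your version's docs:

```yaml
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      tpm: 100000   # tokens per minute this deployment can absorb
      rpm: 500      # requests per minute before the router backs off
```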
Cost tracking by team and project. The PostgreSQL backend stores every request with its token count, cost, and the virtual key that made it. You can finally answer the question "how much is the recommendations team spending on GPT-4o this month?" without scraping billing dashboards. I go deeper into building cost dashboards and budget alerts in my post on tracking token costs.
A/B testing models in production. Want to know if Claude produces better results than GPT-4o for your specific use case? Route 10% of traffic to one, 90% to the other, and compare outputs. The gateway makes this a config change, not a code change.
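The gateway-side version of this is weighting two deployments under one model_name. If you want sticky per-user bucketing instead -- each user always sees the same model, which makes output comparisons cleaner -- the split logic is small enough to sketch client-side. The model names below are the ones from the config earlier; the function and experiment name are illustrative:

```python
import hashlib

def pick_model(user_id: str, experiment: str = "summarize-ab", b_pct: int = 10) -> str:
    """Deterministic 90/10 split: hash (experiment, user) into 100 buckets.

    The same user always lands in the same bucket, so their experience is
    stable across requests; changing the experiment name reshuffles everyone.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return "claude-sonnet" if bucket < b_pct else "gpt-4o"

# Route the request through the gateway with the chosen model:
# client.chat.completions.create(model=pick_model(user_id), messages=...)
```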
The Code Is Almost Boring
Here's the Python SDK in action. Same interface, different models:
```python
import litellm

# Call OpenAI
response = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize this product description."}],
    max_tokens=256,
)
print(response.choices[0].message.content)

# Call Anthropic -- same interface
response = litellm.completion(
    model="anthropic/claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": "Summarize this product description."}],
    max_tokens=256,
)
print(response.choices[0].message.content)

# Call your self-hosted model -- still the same interface
response = litellm.completion(
    model="ollama/llama3.2",
    messages=[{"role": "user", "content": "Summarize this product description."}],
    api_base="http://localhost:11434",
    max_tokens=256,
)
print(response.choices[0].message.content)
```
Three different providers. Three different underlying APIs. One interface. The response object is identical across all three. Your application code doesn't know or care which model answered.
If you're using the proxy mode, it's even simpler. Just point the standard OpenAI SDK at your gateway URL:
```python
from openai import OpenAI

client = OpenAI(
    api_key="sk-your-virtual-key",
    base_url="http://localhost:4000",
)

response = client.chat.completions.create(
    model="best-available",
    messages=[{"role": "user", "content": "Summarize this product description."}],
)
```
Your existing OpenAI code works without modification. You just change the base URL and the key.
The Bottom Line
The LLM gateway is becoming as standard as the API gateway was five years ago. If you're building anything serious with LLMs -- especially if your team uses more than one provider -- running a gateway isn't optional anymore. It's infrastructure.
LiteLLM isn't the only option. There's also Portkey, Helicone, and a few others. But LiteLLM is open-source, self-hostable, and the community is moving fast. For most teams, it's the right starting point.
Stop writing provider-specific integration code. Put a gateway in front of your models, and spend your time on the parts of the system that actually differentiate your product.