
Self-Hosting LLMs with Ollama: When It Makes Sense


The pitch for Ollama is irresistible: one command to download and run any open model locally. ollama run llama3 and you're chatting. I use it daily for prototyping. When I'm evaluating a new model, testing prompt strategies, or building something that talks to a local LLM, Ollama is the first thing I reach for. It removes every barrier between "I want to try this model" and actually running it.

But I've also watched people try to take that same ollama run setup and put it in front of real users. That's where the story gets more complicated.

The Ollama workflow: pull a model from the registry, quantize it automatically, serve it via a local REST API, and hit the endpoint from your application.

What Ollama Actually Does

Ollama is a model management layer built on top of llama.cpp. That distinction matters. It's not an inference engine in the way vLLM or TGI are inference engines. It's a developer experience layer that makes the excellent but low-level llama.cpp accessible to anyone who can type a terminal command.

Here's what it handles for you. Model downloads come from a curated registry, similar to Docker Hub but for LLMs; you pull models by name and tag. Quantization formats are abstracted away -- you don't need to know that you're running a Q4_K_M GGUF file unless you want to. GPU offloading happens automatically based on your available VRAM: if you have a GPU, Ollama uses it; if you don't, it falls back to CPU. No flags, no configuration.

The Modelfile system is where things get interesting for customization. Think of it like a Dockerfile but for model configuration. You can set system prompts, adjust temperature, define stop tokens, and layer LoRA adapters, all in a declarative file. It's genuinely well-designed.

FROM llama3
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
SYSTEM "You are a helpful coding assistant. Be concise."

And the REST API makes programmatic access trivial. A POST to localhost:11434/api/generate with a JSON body and you're getting tokens back. It's compatible with the OpenAI API format too, which means most LLM libraries and frameworks work with it out of the box.

curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Explain PagedAttention in two sentences."
}'
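By default, the generate endpoint streams its reply as newline-delimited JSON, one object per chunk, ending with an object whose done field is true. A minimal sketch of the client-side parsing; the sample lines below are illustrative, not a real server transcript:

```python
import json

def collect_stream(lines):
    """Concatenate the "response" fragments from an NDJSON token stream."""
    text = []
    for line in lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        text.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(text)

# Illustrative lines, shaped like /api/generate's streaming output.
sample = [
    '{"model": "llama3", "response": "Paged", "done": false}',
    '{"model": "llama3", "response": "Attention ...", "done": false}',
    '{"model": "llama3", "response": "", "done": true}',
]
print(collect_stream(sample))  # PagedAttention ...
```

In a real client you'd iterate over the HTTP response line by line instead of a list, but the parsing logic is the same.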

For a single developer on a single machine, this is about as good as the experience gets.

Where It's Perfect

Local development and prototyping. This is Ollama's sweet spot and it's a big one. When I'm building a new feature that involves LLM calls, I don't want to burn API credits on OpenAI while I'm iterating on prompts and testing edge cases. Ollama lets me make hundreds of calls for free while I figure out what I actually want the model to do. The feedback loop is tight, the cost is zero, and I'm not dependent on anyone's API being up.

Privacy-sensitive workflows. If you're working with data that can't leave your machine -- medical records, financial documents, proprietary code -- local inference isn't a nice-to-have, it's a requirement. Ollama makes this dead simple. No network calls, no terms of service to audit, no data processing agreements. The data stays on your hardware, full stop.

Offline environments. I've used Ollama on planes, in coffee shops with terrible WiFi, and in environments where network access was restricted by policy. Once you've pulled a model, it works without any internet connection. That reliability is underrated.

Quick model evaluation. When a new open model drops, I can have it running locally in under a minute. ollama pull the model, run a few test prompts, compare it against what I was using before. No signing up for a new API, no reading documentation about authentication. Just pull and run.

Teaching and demos. I use Ollama constantly when showing people how LLMs work. There's something powerful about running a model on your own hardware and watching it generate tokens in real time. It demystifies the technology in a way that calling an API never does. Students get it immediately when they can see the model running on the machine in front of them.

Where It Breaks Down

Here's the part that most Ollama tutorials skip: concurrency.

Ollama was designed as a single-user, single-request serving tool. When one person is sending prompts and waiting for responses, it works beautifully. You'll see around 50 tokens per second on a decent GPU, which feels fast and responsive. The experience is genuinely good.

The problem starts when you try to serve multiple users. At 5 concurrent requests, you'll notice latency creeping up. At 10+, it degrades significantly. By the time you have 20 people hitting the same Ollama instance, response times become unacceptable.

This isn't a bug. It's an architectural limitation. Ollama lacks the key features that production inference engines rely on for throughput under load.

No continuous batching. Production engines like vLLM process multiple requests simultaneously, dynamically adding new requests to the batch as capacity opens up. Ollama processes requests largely sequentially, so each new request waits for the ones ahead of it to clear.

No PagedAttention. This is the memory management technique that lets vLLM serve many concurrent requests without running out of GPU memory. Without it, KV cache memory is allocated inefficiently, and you hit VRAM limits much sooner than you should.

No tensor parallelism. If you have multiple GPUs, production engines can split a single model across them to increase throughput. Ollama doesn't support this. You get one GPU per model instance.

The numbers tell the story. Ollama serving a single user: ~50 tokens/sec, great experience. vLLM serving 20 concurrent users: ~200 tokens/sec sustained, with consistent per-user latency. The gap isn't about raw speed -- it's about what happens when load increases. With Ollama, every added concurrent user stretches everyone's wait times; with vLLM, per-user latency stays relatively flat until you hit GPU saturation.
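A back-of-the-envelope model makes the degradation concrete. Assume a one-at-a-time server generating a fixed 50 tokens/sec, with every request wanting 200 tokens: the last user in an N-deep queue waits for everyone ahead of them. These numbers are illustrative arithmetic, not benchmarks:

```python
def sequential_wait(n_queued, tokens_per_request=200, tok_per_sec=50.0):
    """Seconds until the last queued request finishes on a sequential server."""
    return n_queued * tokens_per_request / tok_per_sec

for n in (1, 5, 10, 20):
    print(f"{n:>2} concurrent requests -> last user waits {sequential_wait(n):.0f}s")
```

One user waits 4 seconds; twenty users mean the unlucky last one waits 80. Continuous batching avoids this cliff by interleaving all twenty requests instead of draining the queue one at a time.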

The Progression I Recommend

After going through this evolution on multiple projects, here's the path I suggest.

Start with Ollama. Always. For every new LLM project, begin with Ollama on your local machine. Use it to evaluate models, develop prompts, build your application logic, and test your integration. It's the fastest way to go from idea to working prototype. Don't overthink the infrastructure at this stage.

Migrate to vLLM or TGI when you need concurrent users. The moment your application needs to serve more than a handful of simultaneous users, swap out the inference backend. vLLM is my default recommendation because the PagedAttention + continuous batching combination handles real-world load patterns better than anything else I've tested. Text Generation Inference from Hugging Face is a solid alternative, especially if you're already in the Hugging Face ecosystem.

The migration is usually straightforward because Ollama's API is similar enough to OpenAI's format that most application code doesn't need to change. You're swapping the inference backend, not rewriting your app.
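Because both backends speak the OpenAI chat-completions format, the swap can be as small as changing a base URL. A sketch, assuming Ollama's default port 11434 and vLLM's OpenAI-compatible server on its default port 8000 (adjust for your deployment):

```python
def chat_request(backend, model, prompt):
    """Build an OpenAI-format chat request; only the base URL differs per backend."""
    base_urls = {
        "ollama": "http://localhost:11434/v1",
        "vllm": "http://localhost:8000/v1",
    }
    return {
        "url": f"{base_urls[backend]}/chat/completions",
        "json": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        },
    }

# The application code stays the same; only the backend name changes.
req = chat_request("ollama", "llama3", "Explain PagedAttention briefly.")
```

Any OpenAI-compatible client library works the same way: point its base URL at whichever backend you're running.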

Use Ollama in CI for model testing. This is an underrated pattern. If you have tests that validate model behavior -- output format, response quality, edge case handling -- run them against Ollama in your CI pipeline. It's lightweight enough to run on a standard CI runner, and it gives you confidence that model updates don't break your application. No GPU CI runners needed for basic validation with smaller quantized models.
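The CI tests don't need to judge response quality; checking structure is often enough. A sketch of the kind of format validation you might run against Ollama's output in a pipeline; the expected schema here is hypothetical, substitute your own:

```python
import json

def is_valid_summary(output: str) -> bool:
    """Check that the model returned JSON with the fields the app expects."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return (
        isinstance(data.get("summary"), str)
        and len(data["summary"]) > 0
        and isinstance(data.get("tags"), list)
    )

# In CI, `output` would come from a call to the local Ollama endpoint.
good = '{"summary": "Ollama is a local LLM runner.", "tags": ["llm"]}'
bad = "Sure! Here is your summary: ..."
print(is_valid_summary(good), is_valid_summary(bad))  # True False
```

Run checks like this against a small quantized model in CI and you'll catch prompt or model regressions that break your output contract, without ever scoring the prose itself.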

Keep Ollama for personal use regardless. Even when your production stack is vLLM behind a load balancer, keep Ollama on your dev machine. It's the best tool for the "let me quickly try something" workflow that makes up 80% of LLM development time.

Why Local Models Still Matter

There's an argument that local inference is pointless when API providers like OpenAI and Anthropic are so cheap per token. I disagree, and not just for the obvious privacy and cost reasons.

Running models locally builds intuition. When you see a 7B model struggle with something that a 70B model handles easily, you develop a gut sense for model capabilities that no amount of API documentation gives you. When you watch GPU memory fill up as context length increases, you understand the tradeoffs at a visceral level. That intuition makes you a better engineer even when you're using hosted APIs.

Local models keep you independent. API providers change pricing, change rate limits, change models, deprecate endpoints. If your entire stack depends on one provider's API, you're one policy change away from a bad day. Having the muscle memory to stand up your own inference means you always have an exit strategy. An LLM gateway like LiteLLM makes the transition between local and hosted models seamless.

The open-source model ecosystem is accelerating. Every month, the gap between open models and frontier APIs narrows. Llama 3, Mixtral, DeepSeek, Qwen -- the quality of freely available models is stunning. The tooling to run them needs to keep up, and Ollama is a critical piece of that ecosystem. It's the on-ramp that gets developers comfortable with open models, and from there they graduate to production-grade serving when they're ready.

Ollama isn't trying to be a production inference engine. It's trying to be the best possible developer experience for local LLMs, and it succeeds at that completely. Use it for what it's good at, know when to move beyond it, and you'll have the right tool at every stage of the journey.