Building an LLM Microservice with FastAPI and Llama 3.2 on AWS ECS
At BulkMagic, we needed a service that could take raw product data and generate clean, consistent summaries at scale. Not a chatbot. Not a playground. A proper microservice that other services could call, get a response, and move on. The kind of thing that needs to work at 3 AM on a Tuesday with nobody watching.
Here's how I built it with FastAPI, Llama 3.2, and AWS ECS, and what I learned about running open-source LLMs in production.
The Architecture
The system is straightforward. A FastAPI application wraps Llama 3.2, exposes a REST API, gets containerized with Docker, and runs on AWS ECS with auto-scaling.
Nothing exotic. That simplicity is deliberate. When you're building infrastructure that needs 99.9% reliability, every unnecessary component is a liability.
API Design for LLM Services
LLM endpoints are different from typical CRUD endpoints. The response times are measured in seconds, not milliseconds. That changes everything about how you design the API.
Streaming responses are non-negotiable for anything user-facing. FastAPI's StreamingResponse with server-sent events works well here. For service-to-service calls where the downstream consumer just needs the final result, a standard JSON response with aggressive timeout handling is cleaner.
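A minimal sketch of the SSE framing: the token generator below is a stand-in for the real Llama 3.2 inference loop, and the commented line shows where the generator plugs into FastAPI's StreamingResponse.

```python
import asyncio
from typing import AsyncIterator

async def generate_tokens(prompt: str) -> AsyncIterator[str]:
    # Stand-in for the model's token stream; in the real service this
    # would be the Llama 3.2 inference loop.
    for tok in ["Clean", " product", " summary", "."]:
        await asyncio.sleep(0)  # yield control, as real inference would
        yield tok

async def sse_events(prompt: str) -> AsyncIterator[str]:
    # Server-sent events framing: each token becomes a `data:` line,
    # and a terminal [DONE] event tells the client the stream is over.
    async for token in generate_tokens(prompt):
        yield f"data: {token}\n\n"
    yield "data: [DONE]\n\n"

# Inside a FastAPI endpoint, the generator is handed to StreamingResponse:
#   return StreamingResponse(sse_events(prompt), media_type="text/event-stream")
```

The `[DONE]` sentinel mirrors the convention OpenAI-style APIs use, so most SSE client libraries handle it out of the box.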
Token counting matters more than you'd think. I added input token estimation to the validation layer so we could reject requests that would exceed the model's context window before wasting compute on them. A fast rejection is better than a slow failure.
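The pre-flight check can be as simple as the sketch below. The 4-characters-per-token heuristic and the context window numbers are assumptions for illustration; a real tokenizer count is more accurate but slower, and for a fast rejection the estimate is good enough.

```python
CONTEXT_WINDOW = 8192        # assumed limit for this Llama 3.2 deployment
MAX_OUTPUT_TOKENS = 1024     # budget reserved for the generated summary

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    return len(text) // 4 + 1

def validate_request(prompt: str) -> None:
    # Reject oversized prompts before they ever touch the GPU.
    budget = CONTEXT_WINDOW - MAX_OUTPUT_TOKENS
    estimated = estimate_tokens(prompt)
    if estimated > budget:
        raise ValueError(
            f"prompt is ~{estimated} tokens, budget is {budget}"
        )
```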
Structured output validation was the other critical piece. LLMs don't always return well-formed JSON. I built a Pydantic validation layer that retries with adjusted prompts when the model output fails to parse. In production, this retry logic fires on about 3% of requests, and it catches almost all of them on the second attempt. I later wrote a deeper breakdown of JSON mode vs function calling patterns: the retry-with-error-feedback approach is the key to making this reliable.
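The shape of that retry loop, sketched with stdlib `json` instead of Pydantic to keep it dependency-free; the schema keys are hypothetical. The important part is that the parse error itself goes back into the follow-up prompt:

```python
import json

MAX_ATTEMPTS = 2  # in production the second attempt catches almost everything

def validate_output(raw: str) -> dict:
    # The production layer uses Pydantic models; plain json plus key
    # checks keeps this sketch dependency-free while showing the contract.
    data = json.loads(raw)  # raises ValueError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("top-level JSON must be an object")
    missing = {"title", "summary"} - data.keys()  # hypothetical schema
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return data

def generate_validated(call_model, prompt: str) -> dict:
    """Retry with error feedback: the parse error is fed back to the model."""
    last_error = None
    for _ in range(MAX_ATTEMPTS):
        if last_error is None:
            raw = call_model(prompt)
        else:
            raw = call_model(
                f"{prompt}\n\nYour last reply was invalid ({last_error}). "
                "Return only valid JSON."
            )
        try:
            return validate_output(raw)
        except ValueError as exc:
            last_error = exc
    raise RuntimeError(f"model output failed validation: {last_error}")
```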
Containerization Challenges
Packaging an LLM into a Docker container is where things get interesting. The model weights for Llama 3.2 are several gigabytes. You don't want to download them every time a container starts.
Model caching was the first optimization. We bake the quantized model weights into the Docker image during the build step. The image is large, around 8GB, but container startup is fast because there's no download step. ECS pulls the image once and caches it on the host.
GPU containers require the NVIDIA Container Toolkit and specific base images (if you are new to containerization, my Docker 101 walkthrough covers the fundamentals). We used nvidia/cuda:12.1.0-runtime-ubuntu22.04 as the base and installed our Python dependencies on top. Getting CUDA versions aligned between the base image, PyTorch, and the host driver is the kind of dependency management that will eat an entire afternoon if you're not careful.
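Put together, the Dockerfile looks roughly like this. The paths, the quantized model filename, and the app layout are illustrative; the key ideas are the CUDA runtime base and the weights baked in as an image layer:

```dockerfile
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04

RUN apt-get update && apt-get install -y --no-install-recommends \
        python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*

# Bake the quantized weights into the image so task startup
# skips the multi-gigabyte download.
COPY models/llama-3.2-q4.gguf /opt/models/llama-3.2-q4.gguf

COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt

COPY app/ /app/
WORKDIR /app
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```

Copying the weights before the Python dependencies means a requirements change doesn't invalidate the largest layer in the build cache.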
ECS Task Definitions and Scaling
The ECS task definition specifies GPU requirements, memory limits, and health check endpoints. We run on g5.xlarge instances with one task per instance, since each Llama 3.2 instance needs the full GPU.
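A trimmed task definition showing those pieces; the family name, image URI, and memory figure are placeholders, and the health check endpoint is discussed further below:

```json
{
  "family": "llm-summarizer",
  "requiresCompatibilities": ["EC2"],
  "containerDefinitions": [
    {
      "name": "llama-service",
      "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/llm-summarizer:latest",
      "memory": 14336,
      "resourceRequirements": [
        { "type": "GPU", "value": "1" }
      ],
      "portMappings": [{ "containerPort": 8000 }],
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -f http://localhost:8000/health || exit 1"],
        "interval": 30,
        "startPeriod": 120
      }
    }
  ]
}
```

The generous `startPeriod` matters: loading several gigabytes of weights into GPU memory takes a while, and without it ECS will kill the task before the model is ready.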
Auto-scaling is based on a custom CloudWatch metric tracking request queue depth rather than CPU utilization. CPU is a poor proxy for LLM workload because inference is GPU-bound. When the queue depth exceeds a threshold, ECS spins up additional tasks. Scale-down is conservative, with a 10-minute cooldown, because cold starts on GPU instances are expensive in both time and cost.
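The scaling policy, in the shape Application Auto Scaling expects for a target-tracking policy on a custom metric. The metric name, namespace, and target value are illustrative; the 600-second scale-in cooldown is the conservative 10-minute window described above:

```json
{
  "TargetTrackingScalingPolicyConfiguration": {
    "TargetValue": 5.0,
    "ScaleOutCooldown": 60,
    "ScaleInCooldown": 600,
    "CustomizedMetricSpecification": {
      "MetricName": "RequestQueueDepth",
      "Namespace": "LLMService",
      "Statistic": "Average"
    }
  }
}
```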
Production Reliability
Hitting 99.9% API reliability required a few things beyond the happy path. Circuit breakers prevent cascade failures when the model gets stuck. Health checks verify not just that the container is running but that the model is loaded and responsive. Structured logging with correlation IDs makes it possible to trace a request through the entire pipeline when something goes wrong.
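The deep health check can be sketched like this. The `model` object and its `generate` signature are hypothetical stand-ins for the real inference wrapper; the point is that the check runs a one-token probe rather than just reporting that the process is alive:

```python
import time

class HealthCheck:
    """Deep health check: verifies the model is loaded and can actually
    infer, not merely that the container process is running."""

    def __init__(self, model, probe_timeout: float = 5.0):
        self.model = model              # hypothetical inference wrapper
        self.probe_timeout = probe_timeout

    def status(self) -> dict:
        if self.model is None:
            return {"status": "unhealthy", "reason": "model not loaded"}
        start = time.monotonic()
        try:
            # One-token probe keeps the check cheap while still
            # exercising the full inference path.
            self.model.generate("ping", max_tokens=1)
        except Exception as exc:
            return {"status": "unhealthy", "reason": str(exc)}
        elapsed = time.monotonic() - start
        if elapsed > self.probe_timeout:
            return {"status": "unhealthy",
                    "reason": f"probe took {elapsed:.1f}s"}
        return {"status": "healthy", "probe_seconds": round(elapsed, 3)}
```

Wired to a `/health` endpoint, this is what the ECS health check and the load balancer both hit, so a wedged model drops out of rotation automatically.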
The monitoring stack is CloudWatch for metrics, structured JSON logs for debugging, and PagerDuty alerts for when response latency exceeds our SLA. For more on tracking token costs specifically, I wrote a dedicated post on building cost dashboards. Most production incidents weren't model failures. They were infrastructure issues like ECS task placement failures or GPU memory leaks from long-running containers.
The Bigger Picture
The pattern here (FastAPI wrapper, open-source LLM, Docker container, cloud orchestration) is becoming the standard architecture for AI microservices. I later refined this into the FastAPI + vLLM + Docker stack that handles higher concurrency with PagedAttention. It's the same pattern whether you're running Llama, Mistral, or any other open-weight model. The tooling is mature enough now that this is engineering, not research.
A year ago, running your own LLM in production felt like a bold choice. Now it feels like the obvious one for any use case where you need cost control, data privacy, or customization. The infrastructure patterns have caught up with the models.