Building an LLM Microservice with FastAPI and Llama 3.2 on AWS ECS
At BulkMagic, we needed a service that could take raw product data and generate clean, consistent summaries at scale. Not a chatbot. Not a playground. A proper microservice that other services could call, get a response, and move on. The kind of thing that needs to work at 3 AM on a Tuesday with nobody watching.
Here's how I built it with FastAPI, Llama 3.2, and AWS ECS, and what I learned about running open-source LLMs in production.
The Architecture
The system is straightforward. A FastAPI application wraps Llama 3.2, exposes a REST API, gets containerized with Docker, and runs on AWS ECS with auto-scaling.
Nothing exotic. That simplicity is deliberate. When you're building infrastructure that needs 99.9% reliability, every unnecessary component is a liability.
API Design for LLM Services
LLM endpoints are different from typical CRUD endpoints. The response times are measured in seconds, not milliseconds. That changes everything about how you design the API.
Streaming responses are non-negotiable for anything user-facing. FastAPI's StreamingResponse with server-sent events works well here. For service-to-service calls where the downstream consumer just needs the final result, a standard JSON response with aggressive timeout handling is cleaner.
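A minimal sketch of the SSE framing: the token generator below is a stand-in for the real Llama 3.2 inference loop, and the commented line shows where the generator plugs into FastAPI's StreamingResponse.

```python
import asyncio
from typing import AsyncIterator

async def generate_tokens(prompt: str) -> AsyncIterator[str]:
    # Stand-in for the model's token stream; in the real service this
    # would be the Llama 3.2 inference loop.
    for tok in ["Clean", " product", " summary", "."]:
        await asyncio.sleep(0)  # yield control, as real inference would
        yield tok

async def sse_events(prompt: str) -> AsyncIterator[str]:
    # Server-sent events framing: each token becomes a `data:` line,
    # and a terminal [DONE] event tells the client the stream is over.
    async for token in generate_tokens(prompt):
        yield f"data: {token}\n\n"
    yield "data: [DONE]\n\n"

# Inside a FastAPI endpoint, the generator is handed to StreamingResponse:
#   return StreamingResponse(sse_events(prompt), media_type="text/event-stream")
```

The `[DONE]` sentinel mirrors the convention OpenAI-style APIs use, so most SSE client libraries handle it out of the box.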
Token counting matters more than you'd think. I added input token estimation to the validation layer so we could reject requests that would exceed the model's context window before wasting compute on them. A fast rejection is better than a slow failure.
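The pre-flight check can be as simple as the sketch below. The 4-characters-per-token heuristic and the context window numbers are assumptions for illustration; a real tokenizer count is more accurate but slower, and for a fast rejection the estimate is good enough.

```python
CONTEXT_WINDOW = 8192        # assumed limit for this Llama 3.2 deployment
MAX_OUTPUT_TOKENS = 1024     # budget reserved for the generated summary

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    return len(text) // 4 + 1

def validate_request(prompt: str) -> None:
    # Reject oversized prompts before they ever touch the GPU.
    budget = CONTEXT_WINDOW - MAX_OUTPUT_TOKENS
    estimated = estimate_tokens(prompt)
    if estimated > budget:
        raise ValueError(
            f"prompt is ~{estimated} tokens, budget is {budget}"
        )
```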
Structured output validation was the other critical piece. LLMs don't always return well-formed JSON. I built a Pydantic validation layer that retries with adjusted prompts when the model output fails to parse. In production, this retry logic fires on about 3% of requests, and it catches almost all of them on the second attempt. I later wrote a deeper breakdown of JSON mode vs function calling patterns: the retry-with-error-feedback approach is the key to making this reliable.
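The shape of that retry loop, sketched with stdlib `json` instead of Pydantic to keep it dependency-free; the schema keys are hypothetical. The important part is that the parse error itself goes back into the follow-up prompt:

```python
import json

MAX_ATTEMPTS = 2  # in production the second attempt catches almost everything

def validate_output(raw: str) -> dict:
    # The production layer uses Pydantic models; plain json plus key
    # checks keeps this sketch dependency-free while showing the contract.
    data = json.loads(raw)  # raises ValueError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("top-level JSON must be an object")
    missing = {"title", "summary"} - data.keys()  # hypothetical schema
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return data

def generate_validated(call_model, prompt: str) -> dict:
    """Retry with error feedback: the parse error is fed back to the model."""
    last_error = None
    for _ in range(MAX_ATTEMPTS):
        if last_error is None:
            raw = call_model(prompt)
        else:
            raw = call_model(
                f"{prompt}\n\nYour last reply was invalid ({last_error}). "
                "Return only valid JSON."
            )
        try:
            return validate_output(raw)
        except ValueError as exc:
            last_error = exc
    raise RuntimeError(f"model output failed validation: {last_error}")
```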
Containerization Challenges
Packaging an LLM into a Docker container is where things get interesting. The model weights for Llama 3.2 are several gigabytes. You don't want to download them every time a container starts.
Model caching was the first optimization. We bake the quantized model weights into the Docker image during the build step. The image is large, around 8GB, but container startup is fast because there's no download step. ECS pulls the image once and caches it on the host.
GPU containers require the NVIDIA Container Toolkit and specific base images (if you are new to containerization, my Docker 101 walkthrough covers the fundamentals). We used nvidia/cuda:12.1.0-runtime-ubuntu22.04 as the base and installed our Python dependencies on top. Getting CUDA versions aligned between the base image, PyTorch, and the host driver is the kind of dependency management that will eat an entire afternoon if you're not careful.
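Put together, the Dockerfile looks roughly like this. The paths, the quantized model filename, and the app layout are illustrative; the key ideas are the CUDA runtime base and the weights baked in as an image layer:

```dockerfile
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04

RUN apt-get update && apt-get install -y --no-install-recommends \
        python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*

# Bake the quantized weights into the image so task startup
# skips the multi-gigabyte download.
COPY models/llama-3.2-q4.gguf /opt/models/llama-3.2-q4.gguf

COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt

COPY app/ /app/
WORKDIR /app
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```

Copying the weights before the Python dependencies means a requirements change doesn't invalidate the largest layer in the build cache.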
ECS Task Definitions and Scaling
The ECS task definition specifies GPU requirements, memory limits, and health check endpoints. We run on g5.xlarge instances with one task per instance, since each Llama 3.2 instance needs the full GPU.
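A trimmed task definition showing those pieces; the family name, image URI, and memory figure are placeholders, and the health check endpoint is discussed further below:

```json
{
  "family": "llm-summarizer",
  "requiresCompatibilities": ["EC2"],
  "containerDefinitions": [
    {
      "name": "llama-service",
      "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/llm-summarizer:latest",
      "memory": 14336,
      "resourceRequirements": [
        { "type": "GPU", "value": "1" }
      ],
      "portMappings": [{ "containerPort": 8000 }],
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -f http://localhost:8000/health || exit 1"],
        "interval": 30,
        "startPeriod": 120
      }
    }
  ]
}
```

The generous `startPeriod` matters: loading several gigabytes of weights into GPU memory takes a while, and without it ECS will kill the task before the model is ready.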
Auto-scaling is based on a custom CloudWatch metric tracking request queue depth rather than CPU utilization. CPU is a poor proxy for LLM workload because inference is GPU-bound. When the queue depth exceeds a threshold, ECS spins up additional tasks. Scale-down is conservative, with a 10-minute cooldown, because cold starts on GPU instances are expensive in both time and cost.
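The scaling policy, in the shape Application Auto Scaling expects for a target-tracking policy on a custom metric. The metric name, namespace, and target value are illustrative; the 600-second scale-in cooldown is the conservative 10-minute window described above:

```json
{
  "TargetTrackingScalingPolicyConfiguration": {
    "TargetValue": 5.0,
    "ScaleOutCooldown": 60,
    "ScaleInCooldown": 600,
    "CustomizedMetricSpecification": {
      "MetricName": "RequestQueueDepth",
      "Namespace": "LLMService",
      "Statistic": "Average"
    }
  }
}
```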
Production Reliability
Hitting 99.9% API reliability required a few things beyond the happy path. Circuit breakers prevent cascade failures when the model gets stuck. Health checks verify not just that the container is running but that the model is loaded and responsive. Structured logging with correlation IDs makes it possible to trace a request through the entire pipeline when something goes wrong.
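The deep health check can be sketched like this. The `model` object and its `generate` signature are hypothetical stand-ins for the real inference wrapper; the point is that the check runs a one-token probe rather than just reporting that the process is alive:

```python
import time

class HealthCheck:
    """Deep health check: verifies the model is loaded and can actually
    infer, not merely that the container process is running."""

    def __init__(self, model, probe_timeout: float = 5.0):
        self.model = model              # hypothetical inference wrapper
        self.probe_timeout = probe_timeout

    def status(self) -> dict:
        if self.model is None:
            return {"status": "unhealthy", "reason": "model not loaded"}
        start = time.monotonic()
        try:
            # One-token probe keeps the check cheap while still
            # exercising the full inference path.
            self.model.generate("ping", max_tokens=1)
        except Exception as exc:
            return {"status": "unhealthy", "reason": str(exc)}
        elapsed = time.monotonic() - start
        if elapsed > self.probe_timeout:
            return {"status": "unhealthy",
                    "reason": f"probe took {elapsed:.1f}s"}
        return {"status": "healthy", "probe_seconds": round(elapsed, 3)}
```

Wired to a `/health` endpoint, this is what the ECS health check and the load balancer both hit, so a wedged model drops out of rotation automatically.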
The monitoring stack is CloudWatch for metrics, structured JSON logs for debugging, and PagerDuty alerts for when response latency exceeds our SLA. For more on tracking token costs specifically, I wrote a dedicated post on building cost dashboards. Most production incidents weren't model failures. They were infrastructure issues like ECS task placement failures or GPU memory leaks from long-running containers.
The Bigger Picture
The pattern here (FastAPI wrapper, open-source LLM, Docker container, cloud orchestration) is becoming the standard architecture for AI microservices. I later refined this into the FastAPI + vLLM + Docker stack that handles higher concurrency with PagedAttention. It's the same pattern whether you're running Llama, Mistral, or any other open-weight model. The tooling is mature enough now that this is engineering, not research.
A year ago, running your own LLM in production felt like a bold choice. Now it feels like the obvious one for any use case where you need cost control, data privacy, or customization. The infrastructure patterns have caught up with the models.