Tracking Token Costs Before They Blow Up Your Bill
The first time I checked our LLM API bill after a feature launch, I nearly fell out of my chair. We'd burned through our monthly budget in five days.
The feature worked great. Users loved it. But we'd underestimated how chatty the model would be in production, and nobody had set up any cost monitoring beyond "check the billing dashboard next month." That's not a strategy. That's a prayer.
Token costs are the cloud compute bill of the LLM era. If you're not tracking them granularly, you will get surprised. And the surprises are never pleasant.
Why Token Costs Surprise You
Most teams look at per-million-token pricing and think "that's cheap." And it is -- until you do the math on output tokens.
Here's the thing that catches people off guard: output tokens cost 4-8x more than input tokens. With Claude Sonnet, you're paying $3 per million input tokens but $15 per million output tokens. GPT-4o has a similar ratio. This asymmetry is the source of almost every cost surprise I've seen.
Think about what that means in practice. Your system prompt might be 800 tokens. That's cheap -- it's input. But if your model generates a 2,000-token response for every query, and you're handling 50,000 queries a day, that's 100 million output tokens daily. At $15 per million output tokens, that's $1,500 per day just for output on one endpoint. Scale that across multiple features and user segments, and you're looking at real money.
The worst offender I've seen was a summarization feature where the "summary" was consistently longer than the original text. Nobody had set a max_tokens limit. The model was happily generating 4,000-token responses when 500 would have been fine. That single oversight was 60% of our monthly bill.
What to Track
Knowing your total spend is table stakes. The question is: spend on what? Here's what actually matters:
Cost per query type. A simple chat response, a document summarization, and a code generation request have wildly different token profiles. If you're lumping them together, you can't optimize anything. Break it down. Know that your chat endpoint costs $0.002 per query on average but your code generation endpoint costs $0.015.
Cost per user segment. Power users generate 10-50x more tokens than casual users. Are your top 1% of users responsible for 40% of your cost? That's not unusual. You need to know so you can make informed decisions about rate limits, plan tiers, or caching strategies.
Cost per model. If you're running multiple models -- and you should be -- track each one separately. Maybe 80% of your queries can be handled by a cheaper model and only 20% need the expensive one. You can't route intelligently without this data.
Token efficiency. This is the ratio of useful output tokens to total output tokens. If your model consistently generates 1,500 tokens but users only read the first 300, you're paying for output nobody uses. Measure it. Then fix it with better prompts or structured output formats.
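To make the user-segment numbers concrete, here's a minimal sketch that computes the share of total spend coming from your top users. It assumes usage records carrying user_id and estimated_cost_usd fields; the function name and record shape are illustrative:

```python
from collections import defaultdict


def cost_concentration(records: list[dict], top_fraction: float = 0.01) -> float:
    """Share of total cost attributable to the top `top_fraction` of users."""
    per_user = defaultdict(float)
    for r in records:
        per_user[r["user_id"]] += r["estimated_cost_usd"]
    costs = sorted(per_user.values(), reverse=True)
    if not costs:
        return 0.0
    # Always include at least one user so small datasets still give an answer.
    top_n = max(1, int(len(costs) * top_fraction))
    return sum(costs[:top_n]) / sum(costs)
```

If this returns 0.4 for top_fraction=0.01, your top 1% of users are driving 40% of spend -- exactly the situation the rate-limit and plan-tier conversation is about.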
Building a Cost Dashboard
You don't need a fancy observability platform for this -- though if you want one, I compared LangSmith, Langfuse, and Braintrust in a separate post. At minimum, you need a logging middleware and a database.
The core idea: log every API call with the fields that matter for cost analysis. Model name, input token count, output token count, estimated cost, query type, user ID, and timestamp. That's it. Ship these to wherever you store structured logs -- Postgres, BigQuery, a data warehouse, even a CSV if you're scrappy.
Here's a small Python helper that does exactly this -- call it from your API layer after every model call (a FastAPI middleware is a natural home for it):
import logging
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

logger = logging.getLogger("token_costs")

# Cost per million tokens by model (update as pricing changes)
MODEL_PRICING = {
    "claude-sonnet-4-20250514": {"input": 3.00, "output": 15.00},
    "claude-3-5-haiku-20241022": {"input": 0.80, "output": 4.00},
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

@dataclass
class TokenUsageRecord:
    timestamp: str
    model: str
    query_type: str
    user_id: str
    input_tokens: int
    output_tokens: int
    estimated_cost_usd: float
    latency_ms: float

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    pricing = MODEL_PRICING.get(model)
    if not pricing:
        return 0.0
    input_cost = (input_tokens / 1_000_000) * pricing["input"]
    output_cost = (output_tokens / 1_000_000) * pricing["output"]
    return round(input_cost + output_cost, 6)

def log_token_usage(
    model: str,
    query_type: str,
    user_id: str,
    input_tokens: int,
    output_tokens: int,
    latency_ms: float,
):
    record = TokenUsageRecord(
        timestamp=datetime.now(timezone.utc).isoformat(),
        model=model,
        query_type=query_type,
        user_id=user_id,
        input_tokens=input_tokens,
        output_tokens=output_tokens,
        estimated_cost_usd=estimate_cost(model, input_tokens, output_tokens),
        latency_ms=latency_ms,
    )
    # Ship to your storage layer -- DB insert, log aggregator, etc.
    logger.info("token_usage", extra=asdict(record))
    return record

Once you have a few days of data, the queries write themselves. Total cost by day. Cost by query type. Top 10 most expensive user sessions. Average cost per model. You can build these in SQL in an afternoon.
Set alerts at 80% of your budget threshold. If your monthly budget is $5,000, alert at $4,000. If your daily average suddenly doubles, alert immediately. The goal is to catch problems before they hit the billing page, not after.
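Those two alert rules fit in a few lines. This is a sketch -- the function name, message formats, and 2x spike threshold are my choices, not any particular alerting library's:

```python
def budget_alerts(
    month_to_date_spend: float,
    monthly_budget: float,
    today_spend: float,
    trailing_daily_avg: float,
) -> list[str]:
    """Return alert messages for the 80%-of-budget and daily-spike rules."""
    alerts = []
    if month_to_date_spend >= 0.8 * monthly_budget:
        alerts.append(
            f"spend ${month_to_date_spend:,.2f} crossed 80% of ${monthly_budget:,.2f} budget"
        )
    if trailing_daily_avg > 0 and today_spend >= 2 * trailing_daily_avg:
        alerts.append(
            f"today's spend ${today_spend:,.2f} is 2x the trailing daily average"
        )
    return alerts
```

Run it on a schedule against the logged records and pipe any non-empty result to Slack or PagerDuty.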
Quick Wins
You don't need a month-long optimization project. These five changes can cut your bill significantly in a week:
Shorter system prompts. Every token in your system prompt is repeated on every single request. I've seen system prompts with 2,000 tokens of instructions that could be compressed to 400. That's a 4x reduction in input cost for your highest-volume endpoint. Audit every system prompt. Be ruthless.
Max token limits per endpoint. Set max_tokens on every API call. Your chat endpoint probably doesn't need to generate more than 500 tokens. Your summarization endpoint doesn't need more than 1,000. Without explicit limits, you're trusting the model to be concise. It won't be.
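One low-tech way to enforce this is a per-endpoint cap you consult before every call. The endpoint names and limits below are illustrative:

```python
# Output caps per endpoint; tune these to what each feature actually needs.
MAX_TOKENS_BY_ENDPOINT = {
    "chat": 500,
    "summarize": 1000,
    "codegen": 2000,
}

def max_tokens_for(endpoint: str) -> int:
    # Fall back to a conservative default rather than leaving the cap unset.
    return MAX_TOKENS_BY_ENDPOINT.get(endpoint, 500)
```

Pass the result as max_tokens on the API call for that endpoint, and an unexpectedly verbose model becomes a truncated response instead of a billing surprise.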
Cache identical queries. If 15% of your queries are duplicates or near-duplicates, you're paying for the same output tokens over and over. A simple Redis cache with a TTL can eliminate this waste entirely. Hash the prompt, check the cache, skip the API call.
Model routing. Not every query needs your most expensive model. Simple classification tasks, yes/no questions, and short factual lookups can be handled by a smaller, cheaper model. Route based on query complexity and you can cut costs 40-60% without meaningfully impacting quality. I covered gateway setup in my LiteLLM post -- the routing piece is a config change once you have the infrastructure.
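A first-pass router can be as simple as a lookup plus a length check. This sketch is illustrative -- the model names come from the pricing table above, but the query-type labels and word-count threshold are assumptions; production routing usually leans on query type or a lightweight classifier:

```python
CHEAP_MODEL = "gpt-4o-mini"
EXPENSIVE_MODEL = "gpt-4o"

# Query types that a small model handles reliably.
SIMPLE_QUERY_TYPES = {"classification", "yes_no", "lookup"}


def route(query_type: str, prompt: str) -> str:
    """Pick a model: cheap for simple or short queries, expensive otherwise."""
    if query_type in SIMPLE_QUERY_TYPES:
        return CHEAP_MODEL
    if len(prompt.split()) < 30:
        return CHEAP_MODEL
    return EXPENSIVE_MODEL
```

Because the router returns a model name, tightening it later (say, swapping the length check for a classifier score) doesn't touch any call sites.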
Prompt caching. Anthropic and OpenAI both support prompt caching now, where repeated prefixes in your prompts are cached server-side and billed at a steep discount. If your system prompt is the same across requests -- and it should be -- prompt caching alone can cut your input token costs by 90%.
The Bottom Line
Token cost monitoring is the seatbelt of LLM engineering. You don't think about it when things are going well. But the moment something goes wrong -- a prompt regression, a traffic spike, a model that suddenly gets verbose -- it's the difference between catching the problem in an hour and discovering a four-figure bill at the end of the month.
Log every call. Track by query type. Set budget alerts. Optimize the expensive endpoints first. It's not glamorous work, but it's the work that keeps your LLM features sustainable instead of one bad week away from getting shut down by finance.
Related Posts
Self-Hosting Qdrant: From Docker Compose to Production
Qdrant gives you the fastest open-source vector search. Here's how to go from docker-compose up to production-ready deployment.
The LLM Inference Stack in 2026: From API Call to Response
The stack for serving LLMs has matured dramatically. Here's the full picture from API gateway to GPU, and where each layer is heading.
Prompt Caching: How Anthropic and OpenAI Cut Costs by 90%
Prompt caching reuses pre-computed KV tensors for identical prompt prefixes. It's the easiest cost reduction you're not using yet.