OpenTelemetry for LLM Apps: Tracing Prompts and Tokens
I've been running LLM-powered features in production for over a year now, and the scariest part isn't the model. It's the black box between prompt and response.
You send a prompt. Tokens come back. The user is happy or they're not. If something goes wrong -- latency spikes, garbage output, cost explosions -- you're stuck guessing. Did the prompt change? Did the model get slower? Did the token count balloon because someone's input was unexpectedly long? Without proper instrumentation, you're flying blind.
OpenTelemetry fixes this. And it now has first-class support for LLM workloads.
Why Standard Logging Isn't Enough
If you're already running distributed tracing with OTel for your web services, you might think you're covered. You're not. Standard HTTP tracing tells you that a request to your /chat endpoint took 4.2 seconds. Cool. But it tells you nothing about why.
LLM calls have dimensions that don't exist in traditional request-response patterns:
- Prompt content and structure -- was this a single-turn call or a multi-turn conversation with 15 messages in context?
- Token counts -- input tokens vs output tokens, which directly map to cost and latency
- Model version -- are you on gpt-4o or gpt-4o-mini? Did someone change the model in a config file and forget to tell anyone?
- Temperature and sampling parameters -- production should be deterministic, but is it?
- Latency breakdown -- time to first token vs total generation time. These are completely different signals. A slow TTFT means queuing or prefill issues. A slow total time means the output is just long.
Standard HTTP spans collapse all of this into one opaque duration. That's not observability. That's a stopwatch.
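The TTFT-vs-total split above is easy to capture yourself when you stream responses. A minimal sketch of a timing helper that works on any iterable of streamed chunks (the function name and return shape are my own, not from any SDK):

```python
import time

def measure_stream(chunks):
    """Time a streaming LLM response.

    `chunks` is any iterable of response chunks (e.g. an SDK streaming
    response). Returns (ttft_seconds, total_seconds, chunk_count):
    time to first token, total generation time, and number of chunks.
    """
    start = time.monotonic()
    ttft = None
    count = 0
    for _ in chunks:
        if ttft is None:
            # First chunk arrived: this is your time-to-first-token.
            ttft = time.monotonic() - start
        count += 1
    total = time.monotonic() - start
    return ttft, total, count
```

In practice you would wrap the iterator returned by a streaming call (e.g. `client.chat.completions.create(..., stream=True)`) and record both numbers as span attributes, so a slow prefill and a long output are distinguishable in your traces.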
OTel GenAI Semantic Conventions
OpenTelemetry's semantic conventions are the agreed-upon attribute names that make traces consistent across tools and vendors. In 2024, OTel introduced semantic conventions specifically for generative AI. This is a big deal because it means your LLM traces are vendor-neutral from day one.
The key span attributes:
- gen_ai.system -- the AI provider (e.g., openai, anthropic)
- gen_ai.request.model -- the model you requested (e.g., gpt-4o)
- gen_ai.response.model -- the model that actually responded (these can differ)
- gen_ai.usage.input_tokens -- tokens consumed by the prompt
- gen_ai.usage.output_tokens -- tokens generated in the response
- gen_ai.response.finish_reason -- did it finish normally (stop), hit the token limit (length), or get filtered (content_filter)?
These attributes live on your spans just like http.method or db.system do for web and database calls. Any OTel-compatible backend -- Datadog, Grafana Tempo, SigNoz, Jaeger -- can ingest, index, and alert on them.
This is the key insight. You don't need a separate "LLM observability platform." You need your existing observability stack to understand LLM-shaped data. OTel's GenAI conventions give you exactly that.
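If you instrument by hand rather than through an auto-instrumentation library, these attributes are just key-value pairs you set on a span. A sketch of the mapping, with a plain dict standing in for `span.set_attribute` calls and a stub response shaped like an OpenAI chat completion (the helper name is my own):

```python
def genai_span_attributes(system, request_model, response):
    """Map an LLM response onto the OTel GenAI semantic convention keys.

    In real code you'd call span.set_attribute(key, value) for each
    entry; here a plain dict stands in for the span.
    """
    return {
        "gen_ai.system": system,
        "gen_ai.request.model": request_model,
        "gen_ai.response.model": response["model"],
        "gen_ai.usage.input_tokens": response["usage"]["prompt_tokens"],
        "gen_ai.usage.output_tokens": response["usage"]["completion_tokens"],
        "gen_ai.response.finish_reason": response["choices"][0]["finish_reason"],
    }

# Stub response mimicking the shape of an OpenAI chat completion.
resp = {
    "model": "gpt-4o-2024-08-06",
    "usage": {"prompt_tokens": 42, "completion_tokens": 180},
    "choices": [{"finish_reason": "stop"}],
}
attrs = genai_span_attributes("openai", "gpt-4o", resp)
```

Note that gen_ai.request.model and gen_ai.response.model differ here: you asked for an alias, the provider answered with a dated snapshot. That gap is exactly what these two attributes exist to surface.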
OpenLLMetry: OTel for LLMs
Writing manual instrumentation for every LLM call is tedious. That's where OpenLLMetry comes in -- an open-source library from Traceloop that auto-instruments the most common LLM SDKs.
It supports OpenAI, Anthropic, LangChain, LlamaIndex, Cohere, and more. The setup is literally one line. After that, every LLM call in your application automatically produces OTel spans with the GenAI semantic conventions, including token counts, model info, and latency breakdowns.
Here's what it looks like in practice:
```python
# pip install traceloop-sdk openai
from traceloop.sdk import Traceloop
from openai import OpenAI

# Initialize OpenLLMetry -- that's it, one line
Traceloop.init(app_name="my-llm-service")

client = OpenAI()

# This call is automatically traced with GenAI semantic conventions
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain observability in one paragraph."},
    ],
    temperature=0.0,
)

print(response.choices[0].message.content)
```

That single Traceloop.init() call monkey-patches the OpenAI SDK. Every subsequent call to client.chat.completions.create generates a span with gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and all the other GenAI attributes. No manual span creation. No decorators on every function.
You export to any OTel-compatible backend by setting the standard OTEL_EXPORTER_OTLP_ENDPOINT environment variable. If you're already running a collector for your web services, your LLM traces go to the same place. Same dashboards. Same alerting rules. Same on-call workflows. I compare the major backends -- LangSmith, Langfuse, and Braintrust -- in a separate post.
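Concretely, the export configuration is just the standard OTel environment variables. The endpoint, token, and service name below are placeholders for your own setup:

```shell
# Point the OTLP exporter at your existing collector.
# Endpoint, token, and service name are placeholders.
export OTEL_EXPORTER_OTLP_ENDPOINT="http://otel-collector:4318"
# Header values must be URL-encoded (%20 is the space in "Bearer <token>").
export OTEL_EXPORTER_OTLP_HEADERS="authorization=Bearer%20<token>"
export OTEL_SERVICE_NAME="my-llm-service"
```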
What You Can Actually Do With This
Once you have LLM traces flowing, the questions you can answer change completely:
- Cost attribution. Which feature is burning the most tokens? Group by endpoint and sum gen_ai.usage.input_tokens plus gen_ai.usage.output_tokens. Now you know. I go deeper into building cost dashboards and budget alerts in a dedicated post.
- Latency debugging. Is the LLM slow, or is it your code before/after the LLM call? The span tree shows you exactly where time is spent.
- Regression detection. Did output quality drop after switching models? Correlate gen_ai.response.model changes with user feedback signals. This ties directly into the agent evaluation flywheel -- online traces feed your offline test suite.
- Token budget alerts. Set an alert when gen_ai.usage.output_tokens exceeds a threshold. Catch runaway generation before it hits your bill.
- Finish reason monitoring. A spike in finish_reason: length means your outputs are getting truncated. Users are getting incomplete answers and you might not even know.
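The cost-attribution item above reduces to simple arithmetic over span attributes. A sketch, with placeholder per-million-token prices (illustrative numbers, not current provider pricing):

```python
# Placeholder USD-per-million-token prices -- illustrative, not real pricing.
PRICE_PER_MTOK = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def span_cost_usd(span_attrs):
    """Estimate the dollar cost of one LLM span from its GenAI attributes."""
    price = PRICE_PER_MTOK[span_attrs["gen_ai.request.model"]]
    cost_in = span_attrs["gen_ai.usage.input_tokens"] / 1e6 * price["input"]
    cost_out = span_attrs["gen_ai.usage.output_tokens"] / 1e6 * price["output"]
    return cost_in + cost_out

# Two spans as your backend might return them from a trace query.
spans = [
    {"gen_ai.request.model": "gpt-4o",
     "gen_ai.usage.input_tokens": 8000, "gen_ai.usage.output_tokens": 2000},
    {"gen_ai.request.model": "gpt-4o-mini",
     "gen_ai.usage.input_tokens": 1000, "gen_ai.usage.output_tokens": 500},
]
total = sum(span_cost_usd(s) for s in spans)
```

Most OTel backends let you express the same sum as a dashboard query grouped by endpoint or model, so you rarely need to run this by hand; the point is that the attributes make it a one-liner either way.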
The Bottom Line
Observability is the difference between "the LLM is slow" and knowing exactly why. Is it slow because the prompt has 8,000 tokens of context? Because the model is generating a 2,000-token response? Because the provider is having a bad day? These are completely different problems with completely different fixes.
The tooling is here. OTel's GenAI conventions are standardized. OpenLLMetry makes instrumentation trivial. Your existing observability backend already speaks OTel. There's no reason to treat LLM calls as black boxes anymore.
Instrument your LLM calls the same way you instrument your database calls. This is one piece of the broader MLOps discipline that separates production-grade AI systems from prototypes. Future-you debugging a production incident at 2 AM will be grateful.
Related Posts
Self-Hosting Qdrant: From Docker Compose to Production
Qdrant gives you the fastest open-source vector search. Here's how to go from docker-compose up to production-ready deployment.
The LLM Inference Stack in 2026: From API Call to Response
The stack for serving LLMs has matured dramatically. Here's the full picture from API gateway to GPU, and where each layer is heading.
Tracking Token Costs Before They Blow Up Your Bill
Output tokens cost 4-8x more than input tokens. If you're not tracking usage by query type and user segment, you're flying blind.