LangSmith vs Langfuse vs Braintrust: Picking Your LLM Observability Stack
After running LLM features in production for a while, I got tired of console.log debugging. You know the drill. Something goes wrong, a user complains that the output was garbage, and you're staring at a wall of unstructured text in CloudWatch trying to figure out which prompt template was even used, what the token count was, and where the latency came from. It's miserable.
So I went looking for real observability tooling. Not APM for HTTP endpoints. Observability purpose-built for LLM applications. I evaluated three platforms seriously: LangSmith, Langfuse, and Braintrust. Each one takes a different philosophical approach to the same problem, and the right choice depends entirely on what you care about most.
What You Actually Need
Before comparing tools, it helps to know what "LLM observability" even means in practice. Here's the checklist I started with:
Trace visualization. When your chain makes four LLM calls, two tool invocations, and a retrieval step, you need to see the full execution tree. Input, output, latency, and token count at every node. This is table stakes.
Token cost tracking. LLM APIs charge per token. If you can't see your cost per request, per user, per feature, you're flying blind. I've seen a single poorly written prompt drain a budget in days because nobody was watching the token counts. I cover the mechanics of building token cost dashboards from scratch in a separate post.
Latency breakdown. Is the bottleneck the embedding lookup, the LLM call, or the post-processing? You need to know without guessing.
Prompt versioning. When you tweak a system prompt, you need to know which version produced which outputs. Without this, debugging regressions is impossible.
Evaluation harness. Can you run your test dataset against a new prompt version and compare results systematically? Manual spot-checking doesn't scale. For agent-heavy systems, see my post on evaluating AI agents for the specific metrics and frameworks that matter.
Data export. Can you get your data out? This matters more than people think. Lock-in is real.
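Most of this checklist reduces to structured spans carrying token and timing metadata. Here's a minimal, stdlib-only sketch of the shape a platform records per node -- the `Span` class and the per-million-token prices are illustrative assumptions, not any vendor's actual schema or rates:

```python
from dataclasses import dataclass, field

# Illustrative per-1M-token prices; real rates vary by model and vendor.
PRICE_PER_1M = {"input": 3.00, "output": 15.00}

@dataclass
class Span:
    """One node in an execution trace: an LLM call, tool call, or retrieval."""
    name: str
    input_tokens: int = 0
    output_tokens: int = 0
    latency_ms: float = 0.0
    children: list = field(default_factory=list)

    def cost_usd(self) -> float:
        # Cost for this node only -- children are accounted for separately.
        return (self.input_tokens * PRICE_PER_1M["input"]
                + self.output_tokens * PRICE_PER_1M["output"]) / 1_000_000

    def total_cost_usd(self) -> float:
        # Roll up cost across the whole subtree.
        return self.cost_usd() + sum(c.total_cost_usd() for c in self.children)

# A toy trace: one chain with a retrieval step and two LLM calls.
root = Span("chain", latency_ms=2400)
root.children = [
    Span("retrieval", latency_ms=180),
    Span("llm:draft", input_tokens=1200, output_tokens=400, latency_ms=1300),
    Span("llm:refine", input_tokens=1800, output_tokens=350, latency_ms=900),
]
print(f"total cost: ${root.total_cost_usd():.4f}")
```

Once every node carries this metadata, cost dashboards, latency breakdowns, and trace trees are all just different views over the same spans.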
With that framework in mind, here's how the three platforms stack up.
LangSmith
LangSmith is built by the LangChain team, and it shows. If you're already using LangChain or LangGraph, the integration is essentially one line of code. Set an environment variable and your traces start flowing. That's it. No SDK wrapping, no decorator gymnastics. It just works.
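For LangChain users, enabling tracing really is just environment configuration. A sketch, using the variable names LangSmith documents at the time of writing (the project name is a placeholder; set these before your chain code runs):

```python
import os

# LangSmith tracing is toggled entirely through environment variables.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-api-key>"
os.environ["LANGCHAIN_PROJECT"] = "my-app-prod"  # placeholder project name

# From here, any LangChain/LangGraph invocation is traced automatically --
# no SDK wrapping or decorators required.
```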
The trace UI is genuinely excellent. It's the best I've used for visualizing complex chains. You can click into any node, see the exact prompt that was sent, the raw completion, the token counts, and the latency. The hierarchy is clear even for deeply nested agent loops. They clearly spent a lot of design effort here.
LangSmith also has built-in evaluation with datasets. You can upload test cases, run them against different prompt versions, and compare metrics side by side. The annotation queue lets you do human labeling directly in the platform, which is useful for building evaluation datasets from production traffic.
The downsides are real though. It's hosted only unless you're on an enterprise plan; for everyone else, your data lives on their servers, which can be a non-starter depending on your compliance requirements.
More importantly, the tight LangChain coupling is a double-edged sword. If you're using LangChain, the experience is magical. If you're not, you can still use LangSmith via their SDK, but it feels like a second-class experience. You're paying for an integration layer that doesn't benefit you.
Pricing scales steeply. The free tier is generous for development, but production workloads with high trace volumes can get expensive quickly. And when you stack it against self-hosted Langfuse, you're comparing managed-service pricing against run-it-yourself pricing, which isn't quite apples to apples but matters for budget conversations.
Langfuse
Langfuse takes the opposite approach. It's open-source and self-hostable. You can run it on your own infrastructure, keep your data in your own Postgres database, and never worry about a vendor's pricing page again. They also offer a managed cloud option if you don't want to deal with infrastructure, but the self-hosted path is fully supported and well-documented.
The big technical differentiator is native OpenTelemetry support. This means Langfuse plugs into the same observability infrastructure you're probably already running. If you have Grafana dashboards, if you use OTLP collectors, Langfuse speaks your language. It's framework-agnostic by design. It works with LangChain, LlamaIndex, raw OpenAI calls, Anthropic calls, or anything else. You're not locked into any particular LLM framework.
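Because Langfuse speaks OTLP, you can route traces through a collector you already run instead of adding a new SDK everywhere. A sketch of a collector pipeline -- the endpoint path and auth scheme here are assumptions drawn from Langfuse's OTel docs, so check the current documentation for exact values:

```yaml
receivers:
  otlp:
    protocols:
      http:

exporters:
  otlphttp/langfuse:
    endpoint: https://your-langfuse-host/api/public/otel  # assumed path
    headers:
      # Assumed auth scheme: Basic auth over your Langfuse key pair.
      Authorization: Basic <base64(public_key:secret_key)>

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlphttp/langfuse]
```

The payoff is that any OTel-instrumented service in your stack can feed the same Langfuse instance without app-level changes.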
Prompt management and evaluation are both included. You can version prompts, run evaluations, and track metrics over time. The community is growing fast, and the pace of development has been impressive. New integrations and features land regularly.
The tradeoffs? The UI is less polished than LangSmith's. It's functional and getting better with every release, but if you put the two trace viewers side by side, LangSmith feels more refined. This is the kind of gap that narrows over time, but it's noticeable today.
The team is smaller. That means fewer resources for documentation, fewer pre-built integrations, and occasionally rougher edges in newer features. The open-source community helps fill this gap, but it's worth being realistic about the support experience compared to a well-funded commercial product.
For my own projects, Langfuse is what I actually run. The self-hosting story, the framework independence, and the open-source license make it the pragmatic choice when you want control over your stack. I deployed it on Railway in about 20 minutes and it's been solid since.
Braintrust
Braintrust comes at this from a different angle entirely. While LangSmith and Langfuse lead with tracing, Braintrust treats evaluation as the primary use case and builds observability around it. The philosophy is that you can't improve what you can't measure, and measurement means structured evals, not just pretty trace visualizations.
The eval framework is genuinely first-class. You define scoring functions, create datasets, and run experiments. Braintrust tracks every experiment, compares results across runs, and helps you understand which prompt changes actually improved things versus which ones just felt better. If you've ever done A/B testing for traditional software, this feels like that but for prompts.
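The core loop is simple enough to sketch without any SDK: score each dataset row under two prompt versions and compare aggregates. Everything below -- the `run_prompt` stub, the exact-match scorer -- is a hypothetical stand-in for Braintrust's actual API, just to show the experiment-comparison shape:

```python
# Toy dataset; in practice this comes from production traces or hand labeling.
dataset = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
    {"input": "3*3", "expected": "9"},
]

def run_prompt(version: str, question: str) -> str:
    # Stand-in for an LLM call under a given prompt version.
    # "v1" deliberately gets one answer wrong so the comparison is visible.
    canned = {"2+2": "4", "capital of France": "Paris", "3*3": "9"}
    return canned[question] if version == "v2" or question != "3*3" else "6"

def exact_match(output: str, expected: str) -> float:
    return 1.0 if output.strip() == expected else 0.0

def experiment(version: str) -> float:
    # Run the full dataset under one prompt version and average the scores.
    scores = [exact_match(run_prompt(version, row["input"]), row["expected"])
              for row in dataset]
    return sum(scores) / len(scores)

baseline, candidate = experiment("v1"), experiment("v2")
print(f"v1={baseline:.2f}  v2={candidate:.2f}  delta={candidate - baseline:+.2f}")
```

What Braintrust adds on top of this loop is the bookkeeping: every experiment is stored, diffable, and comparable across runs.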
It's TypeScript-first, which is either a selling point or a dealbreaker depending on your stack. The SDK is clean and well-designed, and the TypeScript types make the developer experience smooth if you're already in that ecosystem.
Braintrust also includes tracing and logging. It's not observability-only or eval-only. But the emphasis is clear: evals are the center of gravity, and everything else orbits around them. Prompt optimization, dataset management, and scoring all feel deeply integrated rather than bolted on.
The downside is that Braintrust is newer and the ecosystem is smaller. Fewer community integrations, fewer blog posts and tutorials, fewer Stack Overflow answers when you hit a wall. The product itself is solid, but the surrounding support network is still developing. If you're the kind of person who needs a mature community with a thousand answered questions, this might feel early.
My Recommendation
There's no single winner. The right choice depends on your situation.
If you're deep in the LangChain ecosystem, use LangSmith. The integration is unmatched, the trace UI is best-in-class, and you'll be productive immediately. Don't fight the ecosystem; lean into it.
If you want vendor independence, self-hosting, or framework flexibility, use Langfuse. The open-source model means you own your data and your infrastructure. The OpenTelemetry integration is a genuine differentiator if you care about standards-based observability. This is the choice for teams that think long-term about their stack.
If evaluations are your primary concern, use Braintrust. If your biggest pain point is "I don't know if this prompt change actually made things better," Braintrust's eval-first approach is the most direct solution. The experiment tracking alone is worth evaluating.
You can also start with one and switch later. These tools mostly consume the same data -- traces, spans, scores -- and migrating between them is annoying but not catastrophic. Don't let analysis paralysis stop you from instrumenting your application.
The Bigger Picture
The fact that we now have three serious LLM observability platforms -- each with real users, active development, and distinct philosophies -- is itself a sign of how much this space has matured. A year ago, most of us were genuinely debugging LLM applications with print statements and prayer. That's no longer acceptable, and it's no longer necessary.
Pick one, instrument your app, and start collecting data. The specific tool matters far less than the practice of actually measuring what your LLM is doing in production.
Once you have visibility, everything else -- prompt optimization, cost reduction, latency improvements -- becomes tractable. Without it, you're guessing. And in production, guessing is expensive.