Evaluating AI Agents: Beyond 'Does It Work?'
The question "does this agent work?" is almost meaningless. Work how well? On what inputs? Compared to what baseline? Under what conditions?
I keep seeing teams demo an agent, watch it complete a task three times in a row, and declare it production-ready. That's not evaluation. That's a coin flip that happened to land heads.
If you're shipping agents to real users, you need a real evaluation strategy. Not a checklist. A layered system that catches different failure modes at different stages. Here's the framework I've converged on after building and breaking enough agents to have strong opinions.
Why Agent Eval Is Harder Than Model Eval
Model evaluation is relatively straightforward. You have inputs, you have expected outputs, you measure the gap. Accuracy, F1, perplexity -- pick your metric and run the benchmark.
Agent evaluation is a different beast entirely. An agent doesn't just produce an output. It produces a trajectory -- a sequence of decisions, tool calls, observations, and reactions that eventually land somewhere. That trajectory is where everything interesting happens, and where everything silently breaks.
Consider a coding agent asked to fix a bug. It could read the right file, identify the issue, and apply a surgical fix -- the kind of focused path a test-driven workflow naturally produces. Or it could read every file in the repo, run the test suite six times, make three wrong patches, then stumble into the correct one. Both "work." One costs 50x more tokens and will fail on anything slightly harder.
This is the core tension: right answer via wrong path is fragile, wrong answer via reasonable path is recoverable. An agent that consistently takes sensible steps but occasionally gets the final answer wrong is far more valuable than one that produces correct outputs through chaotic, unrepeatable trajectories. The first agent you can improve. The second agent you can only pray over.
Success also depends on factors that don't exist in model eval. Tool choice matters -- did the agent use the right tool for the job, or did it hammer every nail with the same API call? The three-pillar classification of function calls -- reads, computations, and actions -- shapes what "reasonable" even means for a given tool invocation. Ordering matters -- did it gather information before acting, or did it act first and scramble to recover? Parameter selection matters -- did it pass reasonable arguments, or did it stuff in defaults and hope for the best?
You can't capture any of this with a single pass/fail metric. You need layers.
Layer 1: System Efficiency
The first layer is the cheapest to measure and the fastest to catch regressions. It doesn't tell you whether the agent is good, but it tells you immediately when something has gotten worse.
Track these per task:
- Token usage. Total input and output tokens consumed. A sudden spike means the agent is looping, over-reading, or generating verbose intermediate steps.
- Tool call count. How many tools did the agent invoke? More isn't better. A well-designed agent should converge, not explore.
- Completion time. Wall clock, end to end. Latency matters for user-facing agents.
- Retry rate. How often did a tool call fail and get retried? High retry rates signal either bad parameter selection or unreliable tool integrations.
None of these metrics measure quality directly. But they're your smoke detectors. When you push a new prompt version or swap an underlying model and your average token usage doubles overnight, you know something broke before any user reports it.
Set baselines from your current agent performance. Alert on deviations. This layer pays for itself in the first week.
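As a sketch of the alerting side, here's one way to flag token-usage regressions against a baseline of recent runs. The three-sigma threshold and the sample numbers are illustrative, not from the post:

```python
from statistics import mean, stdev

def check_regression(baseline: list[int], current: int, sigmas: float = 3.0) -> bool:
    """Flag a run whose token usage deviates more than N sigmas from baseline."""
    mu, sd = mean(baseline), stdev(baseline)
    return abs(current - mu) > sigmas * sd

# Token usage from recent known-good runs of the same task (illustrative numbers).
baseline = [4200, 3900, 4100, 4500, 4000]
check_regression(baseline, 4300)  # normal variation -> False
check_regression(baseline, 9800)  # usage spike, something broke -> True
```

The same check works for any of the Layer 1 metrics; the point is that it's cheap enough to run on every eval pass.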
Layer 2: Trajectory Analysis
This is where evaluation gets interesting. You're no longer asking "did it finish?" You're asking "how did it finish?"
Trajectory analysis means recording the full sequence of agent actions -- every tool call, every parameter, every observation -- and comparing it against reference trajectories. You don't need exact matches. You need reasonable alignment.
A good trajectory analysis checks:
- Tool selection. Did the agent pick appropriate tools for each subtask? If a search tool was available and the agent instead tried to reason from memory, that's a trajectory problem even if the final answer was correct.
- Action ordering. Did the agent gather context before acting? Did it validate assumptions before committing to a path? Ordering errors are the most common source of agent fragility.
- Parameter quality. When the agent called a tool, were the arguments well-formed and specific? Vague or overly broad parameters signal an agent that's guessing rather than reasoning.
The practical way to do this: build a small library of "golden trajectories" for your most important tasks. Not hundreds -- even 10-20 well-annotated reference trajectories give you enormous signal. Score new agent runs against them using embedding similarity or a lightweight LLM judge that compares action sequences.
This layer catches the "right answer, wrong path" failure mode that pure outcome metrics miss completely.
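One lightweight way to score a run against a golden trajectory -- assuming you reduce each trajectory to an ordered list of tool names -- is plain sequence alignment. The tool names here are hypothetical:

```python
from difflib import SequenceMatcher

def trajectory_similarity(golden: list[str], actual: list[str]) -> float:
    """Alignment ratio between a golden and an actual tool-call sequence."""
    return SequenceMatcher(None, golden, actual).ratio()

golden = ["search_docs", "read_file", "edit_file", "run_tests"]
clean  = ["search_docs", "read_file", "edit_file", "run_tests"]
messy  = ["read_file", "read_file", "edit_file", "run_tests", "edit_file", "run_tests"]

trajectory_similarity(golden, clean)  # 1.0 -- exact alignment
trajectory_similarity(golden, messy)  # noticeably lower -- skipped search, looped
```

This ignores parameter quality, so in practice you'd pair it with a lightweight LLM judge for the arguments. But as a first-pass trajectory score, it's hard to beat for the effort involved.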
Layer 3: Task Success
This is the layer everyone starts with, and the one that's insufficient on its own. But it's still essential.

Task success evaluation has two tiers:
Deterministic checks for objective criteria. Did the agent produce valid JSON? Did the code it wrote pass the test suite? Did it call the correct API endpoint? These are binary, automatable, and should run on every single evaluation. No excuses.
LLM-as-judge for subjective quality. Is the summary coherent? Is the response helpful? Does the generated email match the requested tone? Use a separate, stronger model as a judge with a well-defined rubric. This isn't perfect, but calibrated LLM judges correlate surprisingly well with human ratings -- around 0.8+ agreement in most studies I've seen.
The key insight: grade on a rubric, not a binary. A 1-5 scale with explicit criteria for each level gives you far more signal than pass/fail. An agent scoring 3.2 on average tells you something different than one scoring 4.7, even if both "pass" most of the time.
Combine deterministic checks and LLM judges. Run the deterministic checks first -- they're cheap and fast. Only invoke the LLM judge when the deterministic checks pass. This keeps your eval costs sane.
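A sketch of that ordering, with a validity check as the deterministic gate and a hypothetical `judge` callable standing in for the LLM judge:

```python
import json

def deterministic_checks(output: str) -> bool:
    """Cheap binary gate: here, the output must be valid JSON."""
    try:
        json.loads(output)
    except json.JSONDecodeError:
        return False
    return True

def grade(output: str, judge) -> int:
    """Two-tier grading: deterministic gate first, rubric judge second.
    `judge` is any callable returning a 1-5 rubric score (assumed interface)."""
    if not deterministic_checks(output):
        return 1  # hard fail -- don't spend judge tokens on broken output
    return judge(output)

grade('{"summary": "ok"}', judge=lambda o: 4)  # gate passes, judge scores -> 4
grade('not json', judge=lambda o: 4)           # gate fails, judge never runs -> 1
```

In a real harness the judge would be a separate model call with the rubric in its prompt; the lambda here just shows the control flow.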
Offline vs Online Evals
Here's the uncomfortable truth: only about 52% of organizations building with LLMs run offline evaluations at all. The rest are testing in production, whether they admit it or not.
You need both offline and online evals. They catch different things.
Offline evals are your pre-deployment gate. A curated test suite of tasks with known-good outcomes and reference trajectories. Run this on every prompt change, every model swap, every tool integration update. It should take minutes, not hours. If it takes hours, your test suite is too big -- shrink it to the cases that actually matter.
Here's a minimal eval harness in Python:
```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    task: str
    expected_tools: list[str]
    success_criteria: dict
    max_tokens: int = 10000
    max_tool_calls: int = 15


def check_criteria(output: str, criteria: dict) -> bool:
    # One possible implementation: required substrings listed in the criteria.
    return all(s in output for s in criteria.get("must_contain", []))


def run_eval(agent, cases: list[EvalCase]) -> dict:
    results = []
    for case in cases:
        trace = agent.run(case.task)
        result = {
            "task": case.task,
            "tokens_used": trace.total_tokens,
            "tool_calls": len(trace.actions),
            "tools_used": [a.tool for a in trace.actions],
            "token_budget_ok": trace.total_tokens <= case.max_tokens,
            "tool_budget_ok": len(trace.actions) <= case.max_tool_calls,
            "correct_tools": set(case.expected_tools)
            <= {a.tool for a in trace.actions},
            "success": check_criteria(trace.output, case.success_criteria),
        }
        results.append(result)
    passed = sum(1 for r in results if r["success"] and r["token_budget_ok"])
    return {
        "total": len(results),
        "passed": passed,
        "pass_rate": passed / len(results),
        "details": results,
    }
```

It's not fancy. It doesn't need to be. What matters is that it runs automatically and blocks deployment when the pass rate drops.
Online evals are your production monitoring. Sample a percentage of real agent interactions -- 5-10% is usually enough -- and run them through your Layer 2 and Layer 3 checks asynchronously. Flag anomalies for human review.
The critical addition for online evals: sampled human review. The same observability infrastructure you use for latency and error tracking can power your online eval sampling. Have a human look at 50-100 agent interactions per week. Not to grade them exhaustively, but to catch failure modes your automated evals don't know to look for. Every time a human reviewer finds a new failure pattern, encode it as a new offline eval case. This creates a flywheel: production failures feed your test suite, which prevents future regressions.
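For the sampling itself, hashing the interaction id keeps the 5-10% sample deterministic: a given interaction is always either in or out, which makes results reproducible across pipeline reruns. The `should_sample` helper here is illustrative:

```python
import hashlib

def should_sample(interaction_id: str, rate: float = 0.05) -> bool:
    """Deterministically sample ~`rate` of interactions by hashing their id."""
    digest = hashlib.sha256(interaction_id.encode()).digest()
    return digest[0] / 256 < rate

# Roughly 5% of interactions get routed to async Layer 2/3 checks.
sampled = [i for i in range(10000) if should_sample(f"interaction-{i}")]
```

Avoid Python's built-in `hash()` for this: it's salted per process, so the same interaction would flip in and out of the sample between runs.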
The Eval Flywheel
The teams that ship the best agents aren't the ones with the best prompts or the most expensive models. They're the ones with the tightest eval loops.
The pattern looks like this: ship an agent, monitor it in production, catch failures through online evals and human review, encode those failures as offline test cases, iterate on the agent until the new tests pass, ship again. Every cycle, the eval suite gets smarter. Every cycle, the agent gets more robust. Every cycle, your confidence in deployment increases.
Eval is the moat. Not the model. Not the prompt. Not the framework. Teams that evaluate rigorously ship better agents, faster, with fewer production incidents and more confidence in every deployment.
The question isn't "does this agent work?" The question is "how would I know if it stopped working?" If you can't answer that, you're not ready to ship.