Red/Green TDD with Coding Agents: Why Test-First Matters More Than Ever
Here's an observation from using Claude Code for hundreds of sessions: the quality of AI-generated code is directly proportional to the quality of the tests I write first.
Not the quality of my prompts. Not the quality of my descriptions. The quality of my tests. The concrete, executable, pass-or-fail specifications I write before I ask the agent to do anything.
This isn't a coincidence. It's a fundamental shift in what TDD means when a machine is writing the implementation.
The Old Argument for TDD
Test-driven development has been around for decades. The classic arguments are well-known. Tests document behavior. Tests prevent regressions. Tests enable fearless refactoring. Tests force you to think about interfaces before implementations. All of these arguments are still true.
But if you've spent time in the industry, you know how this usually plays out. Teams adopt TDD with enthusiasm. Then deadlines hit. Tests get written after the fact, if at all. The red/green/refactor cycle becomes "write code, then write enough tests to satisfy the coverage metric." I've been guilty of this myself. Most engineers have.
The old arguments for TDD were correct but insufficient. They told you why TDD was good but didn't give you a forcing function strong enough to overcome the natural gravity toward writing implementation first. You could always tell yourself you'd write the tests later.
That calculus changed.
The New Argument
When an AI agent writes your implementation, the test becomes something it never was before: the specification interface.
Think about what happens when you ask a coding agent to build a feature. You describe what you want in natural language. The agent interprets your description, makes assumptions, fills in gaps, and produces code. Sometimes the code matches your intent. Sometimes it's close but subtly wrong. Sometimes it's completely off-base. The failure mode is always the same: ambiguity in your specification.
Now think about what happens when you hand the agent a failing test. There is no ambiguity. The test defines the input. The test defines the expected output. The test defines the edge cases. The test defines the error handling. The agent's job becomes mechanical: make the red thing green.
The test IS the prompt, expressed in a language the machine can verify. Natural language tells the agent what you want. A failing test tells the agent exactly what "done" looks like.
The Red/Green Loop with Agents
Here's what the workflow actually looks like in practice.
Step 1: Write a failing test (red). This is your job. You're specifying behavior. You think about what the function should accept, what it should return, how it should handle edge cases. You write assertions that encode these decisions.
```python
import pytest

# parse_duration does not exist yet -- the agent will implement it.
def test_parse_duration_string():
    assert parse_duration("2h30m") == 9000  # seconds
    assert parse_duration("45m") == 2700
    assert parse_duration("1h") == 3600
    assert parse_duration("0m") == 0

def test_parse_duration_invalid_input():
    with pytest.raises(ValueError):
        parse_duration("")
    with pytest.raises(ValueError):
        parse_duration("abc")
    with pytest.raises(ValueError):
        parse_duration("-1h")
```

Step 2: Tell the agent "make this test pass." That's it. That's the entire prompt. The agent reads the test, understands the contract, and writes an implementation. No multi-paragraph explanation needed. No examples to provide. The test IS the example.
Step 3: Review the implementation (green). The agent produces code that passes. You read it. Does it make sense? Is it clean? Does it handle the cases you care about? If yes, move on. If it's ugly or overly complex, you have a safety net: the test. Refactor freely. Tell the agent to simplify. Rewrite it yourself. The test catches regressions instantly.
Step 4: Add more tests, repeat. Each new test tightens the specification. Each green run confirms the implementation still satisfies the full contract.
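To make the loop concrete, here is one implementation an agent might plausibly produce from the parse_duration tests above. This is a sketch, not a canonical answer; real agent output will vary.

```python
import re

def parse_duration(s: str) -> int:
    """Parse a duration string like "2h30m" into total seconds."""
    # Hours and minutes are each optional, but the whole string must match.
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?", s)
    if match is None or not s:
        raise ValueError(f"invalid duration: {s!r}")
    hours, minutes = match.groups()
    return int(hours or 0) * 3600 + int(minutes or 0) * 60
```

Note how tightly the tests constrain this code: the regex rejects "abc" and "-1h", the empty-string check handles "", and nothing beyond the asserted behavior gets built.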
I've been running this loop hundreds of times across real projects. It works consistently. The agent almost always produces correct code on the first try when given a clear test. When I skip the test and describe behavior in prose, the error rate climbs dramatically.
Why It Produces Better Code
There are several compounding reasons this loop produces better output than prose-prompted generation.
The agent gets a concrete, verifiable target. Instead of interpreting "handle errors gracefully" (which means different things to every engineer), the agent sees pytest.raises(ValueError) and knows exactly what to do. Precision eliminates guesswork.
There is zero ambiguity about scope. The agent doesn't add features the test doesn't require. No bonus helper functions. No unnecessary abstractions. No "I'll also add logging and retry logic in case you need it." The test defines the boundary, and the agent stays inside it. This alone eliminates a massive category of AI-generated bloat.
The feedback loop is immediate. If the agent's implementation is wrong, the test fails. You don't discover the bug in staging three days later. You discover it in the same minute. Fast feedback means fast iteration. The agent can try again immediately, with the failing assertion pointing directly at the problem.
Refactoring becomes trivial. AI-generated code sometimes has style issues. Variable names that don't match your conventions. An approach that works but isn't idiomatic. With a passing test suite, you can refactor aggressively without fear. Tell the agent to rewrite the function using a different approach. Run the tests. Green means the new version is functionally identical. This makes code review a conversation instead of an argument.
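As an illustration of test-protected refactoring, here is a hypothetical rewrite of the same parse_duration using a manual character scan instead of a regex. Because the contract lives in the tests, swapping one version for the other is safe as long as the suite stays green.

```python
def parse_duration(s: str) -> int:
    """Parse a duration string like "2h30m" into total seconds (no regex)."""
    units = {"h": 3600, "m": 60}
    total, digits, seen_unit = 0, "", False
    for ch in s:
        if ch.isdigit():
            digits += ch
        elif ch in units and digits:
            # A unit letter closes the number accumulated so far.
            total += int(digits) * units[ch]
            digits, seen_unit = "", True
        else:
            raise ValueError(f"invalid duration: {s!r}")
    if not seen_unit or digits:
        # Reject empty input and trailing numbers with no unit.
        raise ValueError(f"invalid duration: {s!r}")
    return total
```

Run the tests after the rewrite; green means the two versions are interchangeable with respect to everything the suite asserts.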
The Meta-Lesson
There's a bigger pattern here that extends beyond TDD.
As AI agents generate more of the code we ship, the most valuable engineering skills are shifting. Not disappearing. Shifting. The center of gravity is moving from writing implementations to specifying behavior.
Tests are one form of specification. But the principle applies broadly. Type signatures specify contracts. JSON schemas specify data shapes. Interface definitions specify boundaries. API contracts specify communication protocols. Database schemas specify data models. All of these are artifacts that express intent in a machine-verifiable form.
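As a minimal sketch of a non-test specification, consider a hypothetical TypedDict plus a typed signature. A type checker such as mypy can verify that producers and consumers agree on the data shape before anything runs; the names here are illustrative, not from the post.

```python
from typing import TypedDict

class DurationParts(TypedDict):
    """A machine-checkable spec for the shape of a parsed duration."""
    hours: int
    minutes: int

def split_duration(total_seconds: int) -> DurationParts:
    # The signature alone pins down the input and output contract.
    return {"hours": total_seconds // 3600,
            "minutes": (total_seconds % 3600) // 60}
```

A test asserts behavior at runtime; the type annotation asserts structure statically. Both are specifications an agent can be held to.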
The engineers who will thrive in the agent era are the ones who can precisely specify what they want. Not in natural language, which is inherently ambiguous, but in formal artifacts that a machine can validate against. This is why evaluating agent output requires layered strategies rather than simple pass/fail checks. The skill isn't writing code anymore. The skill is writing specs that generate correct code.
This is a profound inversion. For decades, we taught junior engineers to write implementations and treated specification as a senior skill. Now the implementation is increasingly automated, and specification is the primary interface between human intent and machine output. It is the same shift I describe in "Context Engineering Is Not Prompt Engineering": the value is moving from crafting instructions to building the information environment.
If you're an engineer who already thinks in contracts and interfaces, you have a massive head start. If you've been the person who writes clear test cases, defines precise types, and thinks carefully about edge cases before writing a line of code, congratulations. You've been training for this era without knowing it.
Start Here
If you take one thing from this post: next time you're about to describe a feature to a coding agent in prose, write a failing test instead.
Write what you want the code to do, expressed as assertions. Then tell the agent to make the tests pass. Compare the quality of the output to what you get from a natural language prompt. I think you'll find the difference is striking.
TDD isn't a practice from the past that still works. It's a practice from the past that the AI era made essential. The red/green loop isn't a development methodology anymore. It's the communication protocol between humans who know what they want and machines that can build it.