Structured Output That Actually Works: JSON Mode vs Function Calling
I've shipped three different features that required structured JSON output from LLMs. Every single one broke in production in a different way. The first one failed because GPT-3.5 decided to wrap its JSON in a friendly "Here's your response:" preamble. The second broke because Claude returned trailing commas in arrays, which JavaScript's JSON.parse does not tolerate. The third looked perfect in testing but started hallucinating extra fields under load that blew up my downstream Postgres inserts.
If you've worked with LLMs in production, you know structured output is the hard part. Not prompting. Not fine-tuning. Getting a language model to reliably emit a specific data structure, every single time, at scale.
The Problem
LLMs are text generators. That's it. They predict the next token based on everything that came before. They have no concept of "valid JSON" or "matches this schema." When you tell a model to "respond in JSON format," you're asking it to do something it wasn't fundamentally designed to do -- generate text that happens to also be a valid data structure.
The failure modes are predictable and maddening:
- Trailing commas after the last element in arrays or objects
- Missing quotes around keys, especially after long generation contexts
- Preamble text before the JSON -- "Sure, here's the JSON:" followed by the actual payload
- Postamble text after the JSON -- the model "explaining" what it just generated
- Schema drift where the model invents new fields, drops optional ones, or nests things differently than you specified
- Type confusion where a field that should be a number comes back as a string, or a boolean comes back as "yes"
These aren't edge cases. These are things I've seen happen in the first week of every structured output feature I've shipped. The error rate might be 2-5%, which sounds low until you realize that means dozens of failed requests per hour at any reasonable scale.
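Before the better fixes below, it's worth seeing what a band-aid for the syntax-level failures looks like. This is a sketch, not a library API -- `salvage_json` is a name I'm introducing here -- and it can only paper over the first two failure modes, never schema drift or type confusion:

```python
import json
import re

def salvage_json(raw: str) -> dict:
    """Best-effort recovery of a JSON object from noisy model output.

    Handles preamble/postamble text around the payload and trailing
    commas. It cannot fix missing quotes, schema drift, or type
    confusion -- those need validation downstream.
    """
    # Strip preamble/postamble: keep everything between the first
    # '{' and the last '}' in the response.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in model output")
    candidate = raw[start : end + 1]
    # Remove trailing commas before a closing brace or bracket.
    # (Best-effort: a comma-brace sequence inside a string would
    # be wrongly stripped, which is why this is only a band-aid.)
    candidate = re.sub(r",\s*([}\]])", r"\1", candidate)
    return json.loads(candidate)
```

Heuristics like this reduce the error rate but never get it to zero, which is the whole motivation for the approaches below.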
JSON Mode
OpenAI introduced JSON mode, and other providers have since shipped similar constrained-output features. The idea is simple: constrain the model's token generation so that it can only produce syntactically valid JSON. Under the hood, this is constrained decoding -- at each token position, the model's logits are masked so that only tokens keeping the output valid JSON remain eligible. The model literally cannot produce invalid syntax.
This is a real improvement. You will always get parseable JSON back. JSON.parse will never throw. That alone eliminates a whole class of production errors.
But here's what JSON mode does not guarantee:
- That the JSON matches your expected schema
- That required fields are present
- That field types are correct
- That enum values are from your allowed set
- That nested structures have the right shape
I've seen JSON mode return a perfectly valid JSON object with completely wrong field names. Valid JSON, useless data. So JSON mode is a necessary foundation, but it's not sufficient on its own.
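That failure looks like this in practice -- a contrived example rather than real model output, but representative of the valid-JSON-wrong-keys problem:

```python
import json

# Syntactically valid output from JSON mode -- json.loads succeeds,
# so the "always parseable" guarantee holds.
raw = '{"full_name": "Ada Lovelace", "contact_email": "ada@example.com"}'
data = json.loads(raw)

# But a consumer expecting these keys gets nothing useful back.
required = {"name", "email"}
missing = required - data.keys()  # both required keys are absent
```

Nothing in JSON mode itself will ever surface that mismatch; only a schema check will.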
Function Calling and Tool Use
Function calling -- or "tool use" in Anthropic's API -- takes things a step further (I cover the broader production patterns for function calling in a separate post). Instead of asking the model to generate free-form JSON, you define a schema upfront. The model generates arguments that are supposed to match that schema. The API layer enforces structural constraints beyond just "valid JSON."
```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "extract_contact",
            "description": "Extract contact information from text",
            "parameters": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "email": {"type": "string", "format": "email"},
                    "phone": {"type": "string"},
                    "company": {"type": "string"}
                },
                "required": ["name", "email"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "extract_contact"}}
)
```

This works well for flat, simple structures. The model is very reliable at filling in top-level string and number fields. But it starts to struggle with:
- Deeply nested objects -- three or more levels deep and accuracy drops noticeably
- Optional fields -- the model sometimes includes them with null values, sometimes omits them entirely, and your code needs to handle both
- Arrays of complex objects -- each element in the array might have slightly different structure
- Union types -- fields that could be one of several types are a consistent pain point
Function calling gives you a stronger guarantee than JSON mode alone, but it's still not bulletproof. And in production, "usually works" is not good enough.
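Even with `tool_choice` forcing the function, the arguments still arrive as a JSON string that your code has to parse and check. A minimal sketch -- the `parse_tool_args` helper and its `required` parameter are my own names, not part of the OpenAI SDK:

```python
import json

def parse_tool_args(arguments_json: str, required: tuple) -> dict:
    """Parse a tool call's arguments string and verify required fields.

    In the example above, `arguments_json` would come from
    response.choices[0].message.tool_calls[0].function.arguments.
    """
    args = json.loads(arguments_json)
    missing = [field for field in required if field not in args]
    if missing:
        # The API enforced structure at generation time, but your
        # code is the last line of defense against omitted fields.
        raise ValueError(f"model omitted required fields: {missing}")
    return args
```

This is deliberately paranoid; the next section generalizes the same idea into a proper validation layer.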
The Schema Validation Layer
Here's the pattern that actually works. It's not clever. It's just defense in depth.
Step 1: Use function calling or JSON mode for generation. Get the best structured output the model can give you.
Step 2: Validate the output against a strict schema using Pydantic or Zod.
Step 3: If validation fails, retry with the validation error message fed back to the model.
```python
from typing import Optional

from openai import OpenAI
from pydantic import BaseModel, Field, ValidationError

class ContactInfo(BaseModel):
    name: str = Field(description="Full name of the contact")
    email: str = Field(description="Email address")
    phone: Optional[str] = Field(default=None, description="Phone number")
    company: str = Field(description="Company name")
    role: str = Field(description="Job title or role")

def extract_contact(text: str, max_retries: int = 3) -> ContactInfo:
    client = OpenAI()
    # JSON mode requires the word "JSON" to appear in the messages,
    # so the system prompt names the format and the expected keys.
    messages = [
        {
            "role": "system",
            "content": (
                "Extract contact information from the provided text. "
                "Respond in JSON with the keys: name, email, phone, company, role."
            ),
        },
        {"role": "user", "content": text},
    ]
    for attempt in range(max_retries):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            response_format={"type": "json_object"},
        )
        raw_json = response.choices[0].message.content
        try:
            return ContactInfo.model_validate_json(raw_json)
        except ValidationError as e:
            # Feed the validation error back to the model and retry
            messages.append({"role": "assistant", "content": raw_json})
            messages.append({
                "role": "user",
                "content": f"That JSON had validation errors: {e}. Please fix and try again.",
            })
    raise ValueError(f"Failed to get valid output after {max_retries} attempts")
```

The key insight is the retry with error feedback. When you tell the model "field email is required but was missing," it almost always fixes it on the next attempt. In my experience, about 3-5% of first attempts fail validation, and the second attempt succeeds over 95% of the time. Third attempts are rare.
This pattern works because you're combining the model's generation capability with deterministic validation. The model does what it's good at -- understanding text and extracting information. Pydantic does what it's good at -- enforcing schemas. Neither alone is sufficient, but together they're rock solid.
The Instructor Library
If you're thinking "that retry pattern seems like something that should be a library," you're right. Instructor by Jason Liu wraps exactly this pattern into a clean interface.
```python
from typing import Optional

import instructor
from openai import OpenAI
from pydantic import BaseModel, Field

# Patch the OpenAI client
client = instructor.from_openai(OpenAI())

class ContactInfo(BaseModel):
    name: str = Field(description="Full name of the contact")
    email: str = Field(description="Email address")
    phone: Optional[str] = Field(default=None, description="Phone number")
    company: str = Field(description="Company name")
    role: str = Field(description="Job title or role")

# This handles JSON mode, validation, and retries automatically
contact = client.chat.completions.create(
    model="gpt-4o",
    response_model=ContactInfo,
    max_retries=3,
    messages=[
        {"role": "user", "content": "Extract: John Smith, john@acme.co, VP of Engineering at Acme Corp"}
    ],
)

# contact is a validated ContactInfo instance
print(contact.name)   # "John Smith"
print(contact.email)  # "john@acme.co"
```

Instructor handles the messy parts: it patches the client to use the right generation mode, validates against your Pydantic model, retries with error context, and returns a typed Python object. It also supports streaming, partial responses, and works with both OpenAI and Anthropic APIs.
I've been using Instructor in production for about six months now. The code is cleaner, the error handling is better, and I haven't had a single structured output failure make it past the validation layer. The library isn't magic -- it's doing exactly the retry-with-feedback pattern described above -- but it does it consistently and correctly, which matters more than cleverness.
The Takeaway
The answer isn't picking one approach. It's layering them: constrained generation plus schema validation plus retry. Belt and suspenders.
If I'm starting a new feature today that needs structured output from an LLM, here's my exact playbook:
1. Define the Pydantic model first. Before writing any LLM code, nail down exactly what structure you need. This forces you to think about optional fields, types, and edge cases upfront.
2. Use Instructor or build the equivalent pattern yourself. Function calling for generation, Pydantic for validation, retry with error feedback for resilience.
3. Log every validation failure. Even with retries, track what's failing. It tells you where your prompts are weak or your schemas are ambiguous. An LLM observability platform makes this analysis much easier at scale.
4. Set a retry budget. Three attempts is my default. If it fails three times, something is wrong with your prompt or schema, not with the model's luck.
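The logging step above can be sketched as a small helper. This is a hypothetical function, not a library API -- `log_validation_failure` is my own name, and the error-dict shape assumed here matches what Pydantic's `ValidationError.errors()` returns:

```python
import json
import logging

logger = logging.getLogger("structured_output")

def log_validation_failure(raw_json: str, errors: list, attempt: int) -> dict:
    """Build and log one structured record per failed validation attempt.

    `errors` is the list returned by Pydantic's ValidationError.errors()
    (each entry has "loc", "msg", "type"). Returning the record makes
    the helper easy to unit test and to ship to an observability platform.
    """
    record = {
        "attempt": attempt,
        "error_count": len(errors),
        # Dotted field paths, e.g. "address.zip" for nested failures
        "fields": [".".join(str(part) for part in e["loc"]) for e in errors],
        "raw_sample": raw_json[:200],  # truncate: raw output can be huge
    }
    logger.warning("structured output validation failed: %s", json.dumps(record))
    return record
```

Aggregating these records over a week is usually enough to spot which fields the model consistently fumbles.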
Structured output from LLMs went from "basically impossible to rely on" to "production-ready with the right patterns" in about eighteen months. The models got better, the APIs added constraints, and the tooling matured. But you still can't just ask nicely and hope for the best. The reliability comes from the validation layer, not from the model. When building production systems, pair this with proper guardrails to catch not just schema issues but content safety and injection attempts too.
Related Posts
LLM Guardrails in Practice: Input Validation to Output Filtering
A three-layer guardrail pipeline: validate inputs, constrain execution, filter outputs. Here's what each layer catches and how to build them.
Function Calling Patterns for Production LLM Agents
Function calling connects LLMs to the real world. Here are the patterns that survive production: permission models, error handling, and human-in-the-loop checkpoints.
Reranking: The 20-Line Fix for Bad RAG Results
If your RAG pipeline retrieves the wrong chunks, adding a cross-encoder reranker between retrieval and generation can fix it in 20 lines of code.