
Function Calling Patterns for Production LLM Agents


Function calling is the bridge between "LLM that talks" and "LLM that does things." I've been building agents that call external tools for the past year, and the patterns that work in demos rarely survive production. The demo agent calls a weather API, gets a clean JSON response, and formats it nicely. The production agent calls a payment API that returns a 503, the model hallucinates a retry strategy that doesn't exist, and your customer gets charged twice.

The gap between demo and production isn't about the model's intelligence. It's about the infrastructure you build around function calls. The model is the easy part. The plumbing is everything.

I've learned this building agents at work, building tools at hackathons, and -- most concretely -- using Claude Code daily as a function-calling agent that operates on my actual codebase. The patterns I'm about to describe aren't theoretical. They're extracted from systems that run every day and handle real failures.

Function calling lifecycle: the model proposes a tool call, the permission layer gates it, and structured errors enable recovery.

The Three Pillars

Not all function calls are created equal. The first thing I do when designing a tool set for an agent is classify every function into one of three categories based on what it does to the world.

Pillar 1: Reading data. These are pure reads. Fetch a user profile. Query a database. List files in a directory. Read-only operations that cannot change state. They're safe by default. If a read call fails, you retry. If it returns garbage, you ask the model to try again. The worst case is wasted tokens. Nothing burns down.

Pillar 2: Computing on data. These are transforms. Summarize a document. Calculate a price. Parse a CSV into structured records. They take input, produce output, and have no side effects. The input and output exist only in the agent's context. Like reads, these are inherently safe -- you can run them ten times and the outside world doesn't change.

Pillar 3: Taking actions. These change state. Send an email. Create a database record. Delete a file. Push a git commit. Transfer money. Every action is potentially irreversible, and every action needs safeguards that the first two pillars don't.

The mistake I see in almost every agent architecture is treating all three pillars identically. Same approval flow, same error handling, same retry logic. That's wrong. A read can be retried aggressively. An action should be retried never -- or at least not without explicit confirmation that the first attempt actually failed and didn't just time out after succeeding.

The pillar classification should be encoded in your tool definitions. Tag each function. The orchestration layer uses those tags to decide approval requirements, retry behavior, and logging verbosity. Reads get auto-approved and retried three times. Computations get auto-approved with no retries since they're deterministic. Actions get human approval and zero automatic retries.
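A minimal sketch of what this tagging looks like in code. The `Pillar` enum, `Tool` dataclass, and the specific policy values are illustrative, not from any particular framework:

```python
from dataclasses import dataclass
from enum import Enum

class Pillar(Enum):
    READ = "read"       # Pillar 1: pure reads
    COMPUTE = "compute" # Pillar 2: side-effect-free transforms
    ACTION = "action"   # Pillar 3: changes state in the world

# Hypothetical policy table: the orchestration layer reads this,
# never the model.
POLICY = {
    Pillar.READ:    {"auto_approve": True,  "max_retries": 3},
    Pillar.COMPUTE: {"auto_approve": True,  "max_retries": 0},
    Pillar.ACTION:  {"auto_approve": False, "max_retries": 0},
}

@dataclass
class Tool:
    name: str
    pillar: Pillar

def policy_for(tool: Tool) -> dict:
    """Look up approval and retry behavior from the tool's pillar tag."""
    return POLICY[tool.pillar]
```

The point of centralizing the table is that no individual tool implementation decides its own retry or approval behavior; the tag does.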

Schema Design

The model reads your function schemas. Every description, every parameter name, every enum value is part of the prompt. Bad schema design produces bad function calls, and no amount of prompt engineering in the system message fixes a confusing schema.

Keep schemas flat. Nested objects are where models start hallucinating field names. If your function expects a deeply nested config object with optional sub-objects, the model will invent fields that don't exist. Flatten it. If a function needs five parameters, make them five top-level parameters. It's uglier in your API spec but dramatically more reliable in practice. I cover the full spectrum of structured output strategies -- JSON mode, function calling, and the validation-retry pattern -- in a dedicated post.

Use enums for constrained parameters. If a parameter can only be "asc" or "desc", don't describe it as a string with a note saying "must be asc or desc." Make it an enum. Models respect enum constraints almost perfectly. They respect description-based constraints maybe 80% of the time. That 20% gap is production incidents.

Write descriptions for the model, not for humans. Your function description isn't documentation for developers. It's a prompt for the model. "Sends an email to the specified recipient" is fine for a developer. The model needs more: "Sends an email. Use this when the user explicitly asks to send a message. Do not use for drafts or previews. The recipient must be a valid email address. The subject should be concise." Tell the model when to use the function, when not to, and what constraints matter.

Minimize optional parameters. Every optional parameter is a decision the model has to make. Should it include the timezone parameter or let it default? The model doesn't know your defaults. It guesses. Sometimes it guesses wrong and passes null when you expected it to omit the field entirely, or includes a value when you expected the default. Make parameters required when possible. If a parameter has a sensible default, consider removing it from the schema entirely and hardcoding the default server-side.
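Putting these guidelines together, a schema might look like the sketch below. The `send_email` function and its fields are hypothetical; the shape follows the JSON Schema convention most function-calling APIs use:

```python
# Flat parameters, an enum for the constrained value, model-facing
# descriptions, and every parameter required -- no optional guesswork.
send_email_schema = {
    "name": "send_email",
    "description": (
        "Sends an email. Use this only when the user explicitly asks to "
        "send a message. Do not use for drafts or previews."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "recipient": {
                "type": "string",
                "description": "A valid email address for the recipient.",
            },
            "subject": {
                "type": "string",
                "description": "Concise subject line.",
            },
            "body": {
                "type": "string",
                "description": "Plain-text body of the email.",
            },
            "priority": {
                "type": "string",
                "enum": ["low", "normal", "high"],  # enum, not a described string
                "description": "Delivery priority.",
            },
        },
        "required": ["recipient", "subject", "body", "priority"],
    },
}
```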

Permission Models

This is where Claude Code taught me the most. I use Claude Code every day, and its permission model is the best reference implementation I've seen for how a function-calling agent should handle trust.

Claude Code operates on a three-tier approval system that maps cleanly to the three pillars I described above.

Tier 1: Auto-approve. Read-only operations run without asking. git status, ls, cat, file reads -- these happen silently because they can't change anything. The agent stays fast for safe operations. This maps to Pillar 1.

Tier 2: Session-approve. Operations like file edits get approved once, then allowed for the rest of the session. You review the first edit, and if it looks good, you let the agent continue editing without interrupting you for every change. Trust builds incrementally within a session but resets between sessions. This maps to Pillar 2 and low-risk actions.

Tier 3: Always-ask. Destructive operations -- git push --force, rm -rf, anything that touches production -- require explicit approval every single time. No session memory. No permanent allowlist. The human is always in the loop for high-stakes actions. This maps to Pillar 3.

Every production agent needs a permission layer. This is one piece of the broader guardrails pipeline -- input validation, execution constraints, and output filtering -- that I wrote about separately. If your agent can call functions that change the world and there's no approval step between the model's decision and the execution, you have a production incident waiting to happen. The model will eventually hallucinate a parameter, misunderstand an instruction, or be manipulated by adversarial content in its context. The permission layer is what stands between that failure and your users.

The implementation doesn't have to be complex. A simple mapping from function names to approval tiers, checked before every execution, is enough. The hard part isn't building the permission system. It's resisting the urge to set everything to auto-approve because the approval prompts feel slow.
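A sketch of that mapping, with a session-approval store for Tier 2. The function names and tier labels are illustrative:

```python
# Hypothetical tier table. Unknown tools fall through to the safest tier.
TIERS = {
    "read_file": "auto",            # Tier 1: runs silently
    "edit_file": "session",         # Tier 2: approve once per session
    "git_push_force": "always_ask", # Tier 3: approve every single time
}

session_approvals: set[str] = set()

def needs_approval(fn_name: str) -> bool:
    """Checked before every execution. True means block until a human confirms."""
    tier = TIERS.get(fn_name, "always_ask")
    if tier == "auto":
        return False
    if tier == "session":
        return fn_name not in session_approvals
    return True  # always_ask: no session memory, no permanent allowlist

def record_session_approval(fn_name: str) -> None:
    """After the first confirmation, allow this tool for the rest of the session."""
    if TIERS.get(fn_name) == "session":
        session_approvals.add(fn_name)
```

Note that `record_session_approval` refuses to remember approvals for Tier 3 tools -- that asymmetry is the whole design.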

Error Handling for Flaky APIs

External APIs fail. They fail in creative and inconsistent ways. They return 500s with HTML error pages. They time out after 29 seconds. They return 200 OK with an error message buried in a nested JSON field. Your model has to handle all of this, and it can't handle what it doesn't understand.

The model needs structured error feedback, not stack traces. When a function call fails, don't return the raw exception. The model will try to parse a Python traceback, extract the wrong conclusion, and compound the error. Instead, return a structured error object.

{
  "success": false,
  "error_type": "api_timeout",
  "message": "The payment API did not respond within 10 seconds",
  "retryable": true,
  "suggested_action": "Wait and retry, or ask the user to try again later"
}

The error_type tells the model what category of failure occurred. The message gives human-readable context. The retryable flag tells the model whether trying again makes sense. The suggested_action gives the model a hint about what to do next. This is vastly more useful than a raw exception.

Wrap every external call in a try-catch that returns structured errors. This is non-negotiable. Every function that touches an external service should catch all exceptions and convert them into the structured format above. The model should never see a raw stack trace from a function call. It should always see a clean, parseable error object that it can reason about.

Distinguish between "the API failed" and "the model called it wrong." A 404 because the resource doesn't exist is different from a 400 because the model passed an invalid parameter. Your error wrapper should detect parameter validation failures and return them with enough context for the model to self-correct. "Parameter user_id must be a UUID, but received john_smith" is actionable. "Bad Request" is not.
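A sketch of such a wrapper. The `get_user` tool and its UUID validation are hypothetical; the structure mirrors the error object above and separates "the API failed" from "the model called it wrong":

```python
import uuid

def call_tool(fn, **params):
    """Wrap an external call so the model only ever sees structured results."""
    try:
        return {"success": True, "result": fn(**params)}
    except TimeoutError:
        return {
            "success": False,
            "error_type": "api_timeout",
            "message": "The API did not respond in time",
            "retryable": True,
            "suggested_action": "Wait and retry, or ask the user to try again later",
        }
    except ValueError as e:  # the model called it wrong: retrying as-is won't help
        return {
            "success": False,
            "error_type": "invalid_parameter",
            "message": str(e),
            "retryable": False,
            "suggested_action": "Correct the parameter and call again",
        }
    except Exception as e:  # anything else: never leak a raw stack trace
        return {
            "success": False,
            "error_type": "api_error",
            "message": f"{type(e).__name__}: {e}",
            "retryable": False,
            "suggested_action": "Report the failure to the user",
        }

def get_user(user_id: str) -> dict:
    # Hypothetical tool that validates its own parameters with an
    # actionable message the model can self-correct from.
    try:
        uuid.UUID(user_id)
    except ValueError:
        raise ValueError(f"Parameter user_id must be a UUID, but received {user_id}")
    return {"id": user_id}
```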

I've found that structured error handling alone reduces cascading failures by roughly half. The model stops guessing about what went wrong and starts reasoning about it. That's the difference between an agent that recovers gracefully and one that spirals into increasingly confused retry attempts.

Human-in-the-Loop

For high-stakes actions, the pattern is straightforward: the model proposes, the human confirms, the system executes. Three steps, always in that order.

The model generates a proposed action -- a complete description of what it wants to do, with all parameters filled in. This gets presented to the user before anything executes. "I'm going to send an email to alice@company.com with the subject 'Invoice #4521' and the body below. Should I proceed?"

The user reviews and confirms or rejects. If confirmed, the action executes exactly as proposed. No modifications between confirmation and execution. What the user approved is what runs. If rejected, the model asks for clarification and proposes a revised action.

This sounds obvious. The subtlety is in the implementation. The proposed action must be complete and unambiguous. "I'll send the email" is not a confirmation prompt. The user needs to see the recipient, the subject, the body, and any attachments. Every parameter that will be passed to the function should be visible in the proposal. If the user can't tell exactly what will happen from reading the proposal, the proposal isn't detailed enough.

The second subtlety: the confirmation step must be a hard gate, not a soft suggestion. I've seen agent implementations where the model "asks" for confirmation but proceeds if the user doesn't respond within a timeout. That's not human-in-the-loop. That's human-in-the-general-vicinity. If the action requires confirmation, the system blocks until it gets explicit approval. No timeouts. No defaults. No "proceeding since I didn't hear back."
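The propose-confirm-execute loop can be sketched like this. The `render_proposal` and `ask` callables are illustrative; the properties that matter are that every parameter is visible in the proposal, that `ask` blocks with no timeout path, and that the approved parameters are exactly what runs:

```python
def render_proposal(fn_name: str, params: dict) -> str:
    """Show every parameter so the user can see exactly what will happen."""
    lines = [f"Proposed action: {fn_name}"]
    lines += [f"  {k}: {v!r}" for k, v in sorted(params.items())]
    lines.append("Proceed? (yes/no)")
    return "\n".join(lines)

def confirm_and_execute(fn, fn_name: str, params: dict, ask):
    """`ask` blocks until the human answers. There is no timeout, no default."""
    answer = ask(render_proposal(fn_name, params))
    if answer.strip().lower() != "yes":
        return {"executed": False, "reason": "rejected by user"}
    # Execute exactly the approved parameters -- no modification in between.
    return {"executed": True, "result": fn(**params)}
```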

I use this pattern for anything involving money, external communication, data deletion, or access control changes. The friction is the point. A two-second confirmation step prevents the kind of errors that take hours to unwind.

The Model That Ships

The model that can call functions is interesting. The model that can call functions with guardrails is production-ready. The difference is everything I've described: pillar classification so you know what's safe and what isn't, flat schemas the model actually follows, permission tiers that match the risk level, structured error handling that enables recovery, and human-in-the-loop checkpoints for actions that matter.

None of this is glamorous. Permission models don't make good demos. Error wrappers don't trend on social media. Confirmation prompts feel like friction, not features. But these are the patterns that keep agents running in production while the demo-only agents are still crashing on their first 503.

Build the plumbing. The model will do the rest. If you are building more complex multi-agent systems, my comparison of LangGraph, CrewAI, and AutoGen covers how these frameworks handle function calling orchestration at a higher level.