Function Calling (LLM)

Key Takeaways

  • Function calling is how LLMs interact with external systems — it enables models to select and parameterize function invocations based on natural language input, bridging text generation and real-world action.
  • The LLM never executes the function itself — it outputs a structured JSON object describing which function to call and with what arguments; your application code handles actual execution.
  • Function calling and tool calling are used interchangeably, though "tool calling" is the more modern, broader term encompassing code interpreters and retrieval alongside custom functions.
  • Reliability varies significantly across models — the Berkeley Function Calling Leaderboard (BFCL) tracks accuracy across categories like parallel calls, multi-step chains, and relevance detection, revealing that no single benchmark tells the whole story.
  • Security risk scales with capability — every function an LLM can invoke becomes an attack surface for prompt injection, making scoped permissions, input validation, and audit trails non-negotiable in production.
  • Structured Outputs improve reliability — enabling strict mode (e.g., `strict: true` in OpenAI's API) ensures function call arguments match your JSON Schema exactly, achieving 100% schema adherence on evals.

What Is Function Calling (LLM)?

Function calling is the ability of a large language model to analyze a user's natural language input, determine that an external function or API should be invoked, and produce a structured output specifying the function name and arguments. The LLM never executes the function itself; it identifies the appropriate function, gathers the required parameters, and returns them in a structured JSON format for your application to act on.

Think of it like a dispatcher at an operations center. The dispatcher (the LLM) listens to incoming requests, decides which specialist team (function) should handle each one, and writes up a work order (JSON payload) with all the relevant details. But the dispatcher never picks up the tools. Your application code reads the work order and executes the call — to a database, an API, an internal service, whatever the function targets.

Function calling allows LLMs to interact with external tools, APIs, and functions based on user input. Instead of just generating text, the LLM can determine that a specific action needs to be taken, and request that action be performed by an external function. This is what turns a text generator into something that can query your Salesforce instance, check inventory levels, trigger a CI/CD pipeline, or create a Jira ticket — all from a conversational prompt.

Function calling is often referred to as tool calling, and the two terms are frequently used interchangeably. However, as Martin Fowler's engineering blog explains, "tool calling" is the more general and modern term, referring to a broader set of capabilities that LLMs can use to interact with the outside world, including built-in tools like code interpreters and retrieval mechanisms, not just custom functions.

How Function Calling Works

The Request-Response Loop

The mechanics are straightforward but precise. Here's what happens when a user asks an agent, "What's the status of order #4821?":

  1. Definition — You provide the LLM with a set of function schemas (JSON descriptions including name, description, and parameters) alongside the user's message.
  2. Selection — The LLM processes the prompt and decides whether a function call is needed. If so, it identifies the correct function from the provided list and generates a JSON object containing the selected function's name and the required input arguments.
  3. Execution — Your application code receives the JSON, calls the actual function (e.g., `get_order_status(order_id="4821")`), and captures the result.
  4. Integration — The retrieved data is sent back to the LLM, which processes it and generates a contextual, accurate response for the user.

This loop can repeat across multiple turns. A deployment agent might chain `get_latest_commit()` → `run_tests(commit_sha)` → `deploy_to_staging(build_id)` in sequence, each function call feeding the next.
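Here is a minimal sketch of that loop in Python, assuming the OpenAI Python SDK and the Chat Completions API; `get_order_status` is a hypothetical stand-in for your real order service.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical schema for the order-status example above.
tools = [{
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Look up the current status of an order by its ID.",
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string", "description": "Order identifier, e.g. '4821'."},
            },
            "required": ["order_id"],
            "additionalProperties": False,
        },
        "strict": True,
    },
}]

def get_order_status(order_id: str) -> dict:
    # Stand-in for a real lookup against your order system.
    return {"order_id": order_id, "status": "shipped"}

messages = [{"role": "user", "content": "What's the status of order #4821?"}]

# Steps 1-2: the model sees the schema and decides to request a call.
response = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
call = response.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)

# Step 3: your code executes the real function.
result = get_order_status(**args)

# Step 4: return the result so the model can compose the final answer.
messages.append(response.choices[0].message)
messages.append({"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)})
final = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
print(final.choices[0].message.content)
```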

JSON Schema and Structured Outputs

Function definitions follow the OpenAPI/JSON Schema specification. Almost every SDK and LLM API that supports function calling requires you to send your function or tool definitions as JSON Schema objects. A typical definition includes the function name, a natural language description, and a `parameters` object specifying types, enums, and required fields.

Reliability here matters. On evals of complex JSON schema following, GPT-4o-2024-08-06 with Structured Outputs scores a perfect 100%, compared to less than 40% for GPT-4-0613. As OpenAI's documentation notes, setting `strict: true` will ensure function calls reliably adhere to the function schema, instead of being best effort. They recommend always enabling strict mode.

Native vs. Prompt-Engineered Function Calling

Not all models support function calling natively. Reliable function calling requires consistent, structured behavior: models whose SDKs and APIs expose function/tool calling are typically trained to perform it, and those trained models give better, more consistent results. For open-source models without native support, developers use prompt engineering techniques and constrained decoding to achieve similar results, though with lower reliability.
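For illustration, here is a hedged sketch of the prompt-engineered fallback: the tool contract lives entirely in the prompt, and your code parses and checks whatever JSON the model emits. The function name and prompt wording are illustrative, not taken from any particular framework.

```python
import json
import re

# The tool list and the output contract live in the prompt; reliability depends
# on the model following instructions, so validate everything it returns.
TOOLS_PROMPT = """You can call one of these functions by replying with ONLY a JSON object
of the form {"name": "<function_name>", "arguments": {...}}.

Available functions:
- get_order_status(order_id: string): look up the status of an order.

If no function is needed, reply with {"name": null, "arguments": {}}.
"""

def parse_tool_call(model_output: str) -> dict | None:
    """Extract the first JSON object from the model's reply, or None if parsing fails."""
    match = re.search(r"\{.*\}", model_output, re.DOTALL)
    if not match:
        return None
    try:
        call = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if call.get("name") is not None and call["name"] not in {"get_order_status"}:
        return None  # reject hallucinated function names
    return call

# Example: a raw completion string from any text-only model.
print(parse_tool_call('{"name": "get_order_status", "arguments": {"order_id": "4821"}}'))
```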

Why Function Calling Matters

Function calling is the mechanism that turns LLMs from text generators into systems that can act on the world. Without it, an AI agent is just a chatbot with good diction.

From Text to Action

Function calling significantly extends the application boundary of large language models, and high-quality, diverse training data is critical for unlocking this capability. Consider a production incident response agent: it can't just describe what to check. It needs to actually query your monitoring API, pull recent deploys from your CD system, and page the on-call engineer. Function calling makes each of those actions possible.

Keeping Models Current Without Retraining

As new functions or APIs become available, the LLM can be updated to use them without retraining the entire model. This keeps the system up-to-date with minimal effort. Your PagerDuty integration changes? Update the function schema. No fine-tuning required.

Enabling Multi-Step Agent Workflows

Function calling is the primitive that makes agentic workflows possible. Instead of just answering a single question, the LLM can orchestrate multiple function calls to solve a multi-step problem — like planning a trip by checking flight availability, booking a hotel, and renting a car through different APIs. In engineering terms, this is how an agent chains `fetch_logs()` → `correlate_with_deploys()` → `generate_incident_summary()` → `post_to_slack()` in a single flow.
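A sketch of that orchestration pattern, assuming the OpenAI Python SDK: the loop keeps executing whatever tools the model requests until it returns a plain text answer. The four tool implementations are hypothetical stubs, and the real schemas would be passed in via `tools`.

```python
import json
from openai import OpenAI

client = OpenAI()

# Hypothetical local implementations of the chained tools mentioned above.
LOCAL_TOOLS = {
    "fetch_logs": lambda **kw: {"lines": ["..."]},
    "correlate_with_deploys": lambda **kw: {"suspect_deploy": "build-512"},
    "generate_incident_summary": lambda **kw: {"summary": "Rollout of build-512 caused a 5xx spike."},
    "post_to_slack": lambda **kw: {"ok": True},
}

def run_agent(messages: list, tools: list, model: str = "gpt-4o", max_turns: int = 8) -> str:
    """Let the model request tools, execute them, feed results back, stop on a text answer."""
    for _ in range(max_turns):
        response = client.chat.completions.create(model=model, messages=messages, tools=tools)
        msg = response.choices[0].message
        if not msg.tool_calls:  # no more tool requests: this is the final answer
            return msg.content
        messages.append(msg)
        for call in msg.tool_calls:
            fn = LOCAL_TOOLS[call.function.name]  # KeyError here means a hallucinated tool name
            result = fn(**json.loads(call.function.arguments))
            messages.append({"role": "tool", "tool_call_id": call.id,
                             "content": json.dumps(result)})
    raise RuntimeError("Agent exceeded max_turns without producing a final answer")
```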

Function Calling in Practice

CI/CD Pipeline Automation

A deployment agent receives "Deploy the feature branch to staging and run smoke tests." It calls `get_branch_status(branch="feature-xyz")`, evaluates the result, then invokes `trigger_deploy(environment="staging", branch="feature-xyz")`, followed by `run_smoke_tests(deploy_id)`. Each function returns structured data that feeds the next decision. The agent only triggers the deploy if the branch checks pass, and only reports success if the smoke tests do: function calling gives it the control flow, your code enforces the gates.
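As a sketch of that last point, the gate can live in the dispatcher your application uses to execute model-requested calls. The helper functions below are hypothetical stand-ins for your CI/CD clients.

```python
def check_branch(branch: str) -> bool:  # stand-in for a CI status API call
    return True

def start_deploy(environment: str, branch: str) -> dict:  # stand-in for a CD trigger
    return {"deploy_id": "dep-001", "environment": environment, "branch": branch}

def run_smoke(deploy_id: str) -> dict:  # stand-in for a test-runner call
    return {"deploy_id": deploy_id, "passed": True}

def dispatch(call_name: str, args: dict, state: dict) -> dict:
    """Execute a model-requested call, enforcing gates the model cannot bypass."""
    if call_name == "get_branch_status":
        state["branch_ok"] = check_branch(args["branch"])
        return {"branch_ok": state["branch_ok"]}
    if call_name == "trigger_deploy":
        if not state.get("branch_ok"):
            return {"error": "refused: branch checks have not passed"}
        return start_deploy(args["environment"], args["branch"])
    if call_name == "run_smoke_tests":
        return run_smoke(args["deploy_id"])
    return {"error": f"unknown function: {call_name}"}
```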

Data Extraction and Structured Output

A data pipeline agent processes unstructured customer feedback emails. Using function calling, it invokes `extract_entities(text, schema)` with a defined output structure for sentiment, product mentions, and urgency rating. As OpenAI describes, generating structured data from unstructured inputs is one of the core use cases for AI in today's applications — including function calling to fetch data and answer questions, extracting structured data for data entry, and building multi-step agentic workflows.
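A minimal sketch of that extraction pattern, assuming the OpenAI Python SDK; the `extract_entities` schema and its fields are illustrative, not a fixed API.

```python
import json
from openai import OpenAI

client = OpenAI()

# Hypothetical extraction schema: sentiment, product mentions, urgency.
extract_tool = {
    "type": "function",
    "function": {
        "name": "extract_entities",
        "description": "Extract structured fields from a customer feedback email.",
        "parameters": {
            "type": "object",
            "properties": {
                "sentiment": {"type": "string", "enum": ["positive", "neutral", "negative"]},
                "product_mentions": {"type": "array", "items": {"type": "string"}},
                "urgency": {"type": "integer", "description": "1 (low) to 5 (critical)"},
            },
            "required": ["sentiment", "product_mentions", "urgency"],
            "additionalProperties": False,
        },
        "strict": True,
    },
}

email = "The new dashboard is great, but exports have been broken for two days and we ship Friday."
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": f"Extract entities from this email:\n\n{email}"}],
    tools=[extract_tool],
    tool_choice={"type": "function", "function": {"name": "extract_entities"}},  # force the call
)
print(json.loads(response.choices[0].message.tool_calls[0].function.arguments))
```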

Dynamic Tool Discovery with MCP

The Model Context Protocol (MCP), an open protocol proposed by Anthropic, is gaining traction as a standardized way to structure how LLM-based applications interact with the external world. A growing number of SaaS providers are now exposing their services to LLM agents through this protocol. MCP decouples tool definitions from the agent itself, enabling dynamic discovery: an agent can learn about new tools at runtime rather than having them hardcoded. As Martin Fowler notes, hardcoding a fixed tool set limits an agent's ability to adapt or scale to new types of requests, but in turn makes it easier to secure against malicious usage.

Key Considerations

Hallucinated Arguments Are Real

LLMs will hallucinate, and this also applies to the arguments they specify for the parameters of a function call. Code accordingly. A model asked to query an order might fabricate an order ID that looks plausible but doesn't exist. Validate every argument before execution. Treat function call outputs the same way you'd treat user input: never trust, always verify.
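One way to enforce that, sketched here with the `jsonschema` package; the argument schema and the order store are hypothetical stand-ins.

```python
import json
from jsonschema import validate  # pip install jsonschema

# Hypothetical argument schema for the order-status function.
ORDER_ARGS_SCHEMA = {
    "type": "object",
    "properties": {"order_id": {"type": "string", "pattern": r"^\d{1,10}$"}},
    "required": ["order_id"],
    "additionalProperties": False,
}

KNOWN_ORDERS = {"4821"}  # stand-in for a real existence check against your database

def safe_get_order_status(raw_arguments: str) -> dict:
    """Parse, schema-check, and existence-check arguments before touching real systems."""
    args = json.loads(raw_arguments)                      # may raise json.JSONDecodeError
    validate(instance=args, schema=ORDER_ARGS_SCHEMA)     # may raise ValidationError
    if args["order_id"] not in KNOWN_ORDERS:
        return {"error": f"order {args['order_id']} not found"}  # hallucinated but plausible ID
    return {"order_id": args["order_id"], "status": "shipped"}

print(safe_get_order_status('{"order_id": "4821"}'))
```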

Every Function Is an Attack Surface

Simple conversational LLMs are limited to text generation; a compromise results in misinformation or system prompt extraction. In contrast, tool-calling agents turn prompt injection from an information threat into an operational threat with real-world consequences. The OWASP Top 10 for LLMs lists prompt injection as the number-one risk. Attacks targeting LLM agents with tool access include thought/observation injection, tool manipulation (tricking agents into calling tools with attacker-controlled parameters), and context poisoning. Scope function permissions tightly: an agent that can read orders should not be able to delete them.
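A minimal sketch of least-privilege tool scoping enforced in application code; the roles and function names are illustrative.

```python
# Each agent role gets an explicit allow-list; the dispatcher rejects anything outside it.
AGENT_SCOPES = {
    "support_agent": {"get_order_status", "get_shipping_eta"},
    "ops_agent": {"get_order_status", "trigger_refund"},
}

def authorize(agent_role: str, function_name: str) -> None:
    """Raise before execution if the agent is not allowed to call this function."""
    allowed = AGENT_SCOPES.get(agent_role, set())
    if function_name not in allowed:
        raise PermissionError(f"{agent_role} is not permitted to call {function_name}")

authorize("support_agent", "get_order_status")   # passes
# authorize("support_agent", "trigger_refund")   # would raise PermissionError
```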

Benchmark Scores Don't Tell the Full Story

The Berkeley Function Calling Leaderboard tracks model accuracy across categories including single-function, parallel, and multi-step scenarios. Function calling is a complex capability that significantly enhances the utility of LLMs in real-world applications. However, evaluating and improving this capability is far from straightforward. No single benchmark tells the whole story — a holistic approach combining multiple evaluation frameworks is crucial. As Databricks found, high scores on certain benchmarks, while necessary, are not always sufficient to guarantee superior function-calling performance in practice.

Nondeterminism Is Inherent

As LLMs are nondeterministic by design, there is no guarantee that tool calling works flawlessly all the time. Build retry logic, fallback paths, and validation layers. A production agent calling `execute_trade()` needs fundamentally different safeguards than one calling `get_weather()`.
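A hedged sketch of the retry-and-validate pattern; `ask_model` and the validation check are hypothetical hooks standing in for your real API call and argument checks.

```python
import json

def ask_model(messages: list) -> str:
    """Stand-in for a chat-completions call that returns raw tool-call arguments."""
    return '{"order_id": "4821"}'

def validate_call(raw_arguments: str) -> dict | None:
    """Return parsed args if they pass checks, else None."""
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError:
        return None
    return args if isinstance(args.get("order_id"), str) else None

def call_with_retries(messages: list, max_attempts: int = 3) -> dict:
    """Retry the model when its output fails validation; fail closed after max_attempts."""
    for _ in range(max_attempts):
        args = validate_call(ask_model(messages))
        if args is not None:
            return args
        messages.append({"role": "user",
                         "content": "Your last tool call was invalid. Return valid JSON arguments."})
    raise RuntimeError("Model failed to produce valid tool arguments; escalate to a human")

print(call_with_retries([{"role": "user", "content": "Status of order #4821?"}]))
```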

Cost and Latency Compound

Every function call adds a round-trip: tokens for the schema, tokens for the model's reasoning, network latency for the API call, and tokens for processing the result. A five-step agent chain means five round-trips minimum. Monitor token consumption per workflow, not just per call; this is where 1,440x cost-multiplier surprises come from when agents run unsupervised loops.
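One way to get that per-workflow visibility is a simple meter accumulated across the agent loop. This sketch assumes responses expose token counts the way the OpenAI SDK's `usage` object does.

```python
from dataclasses import dataclass
from types import SimpleNamespace

@dataclass
class WorkflowMeter:
    """Accumulates token usage across every round-trip in one agent workflow."""
    calls: int = 0
    prompt_tokens: int = 0
    completion_tokens: int = 0

    def record(self, usage) -> None:
        # `usage` mirrors the fields on the OpenAI SDK's response.usage object.
        self.calls += 1
        self.prompt_tokens += usage.prompt_tokens
        self.completion_tokens += usage.completion_tokens

    def report(self) -> str:
        return (f"{self.calls} calls, {self.prompt_tokens} prompt tokens, "
                f"{self.completion_tokens} completion tokens")

meter = WorkflowMeter()
meter.record(SimpleNamespace(prompt_tokens=1200, completion_tokens=250))  # first round-trip
meter.record(SimpleNamespace(prompt_tokens=1900, completion_tokens=310))  # prompt grows with prior results
print(meter.report())
```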

The Future We're Building at Guild

Function calling shows what happens when an LLM gets the ability to act on real systems. But it also exposes the limits of single-player tooling: no governance, no shared context, no way to inspect what an agent actually did across a team. Guild.ai builds the infrastructure layer where AI agents, including coding assistants, run as governed, observable, shared systems. Versioned, permissioned, auditable.

Learn more and join the waitlist at Guild.ai

Where builders shape the world's intelligence. Together.

The future of software won't be written by one company. It'll be built by all of us. Our mission: make building with AI as collaborative as open source.

FAQs

Is function calling the same as tool calling?

While these terms are often used interchangeably, "tool calling" is the more general and modern term. Function calling typically refers to invoking a specific, developer-defined function with structured arguments. Tool calling is broader and can include built-in capabilities like code execution and file retrieval. In practice, most APIs treat them as synonymous.

Does the LLM execute the function itself?

No. LLMs cannot actually call the tool themselves; instead, they express the intent to call a specific tool in their response. Developers then execute the tool with the provided arguments and report back the results. Your application handles all execution, error handling, and security validation.

Which models support function calling?

Most major commercial models support it natively, including OpenAI's GPT-4o family, Anthropic's Claude, Google's Gemini, and Mistral's models. Open-source models like Llama 3 and Mistral also support it through frameworks like Ollama. Not all LLMs support tools equally well: the ability to understand, select, and correctly use tools depends heavily on the specific model and its capabilities.

How do you secure function calling in production?

Apply least-privilege principles: scope each agent's available functions to the minimum needed. Validate all arguments before execution. Use human-in-the-loop approval for high-risk operations. Avoid connecting LLMs to external resources unless it is genuinely necessary, and rigorously review multi-step chains that call multiple external services from a security perspective. Standard security practices such as least privilege, parameterization, and input sanitization must be followed.

How does function calling relate to the Model Context Protocol (MCP)?

Function calling is the model-level capability. MCP is a protocol layer that standardizes how agents discover and connect to external tools. The core problem MCP addresses is flexibility and dynamic tool discovery. MCP builds on function calling but adds a client-server architecture for managing tool registries at scale.

Can a model call multiple functions at once?

Yes. Many models support parallel function calling, where multiple functions are invoked in a single turn. However, OpenAI notes that Structured Outputs is not compatible with parallel function calls: when a parallel function call is generated, it may not match supplied schemas. Set `parallel_tool_calls: false` to disable parallel function calling if strict schema adherence is required.

Function calling is the bridge between what an LLM knows and what it can do. At Guild.ai, we're building the runtime and control plane that makes this bridge safe for production, with scoped permissions on every function an agent can invoke, full audit trails of every call made, and cost visibility across multi-step agent workflows. When agents act on the world, someone needs to govern what they're allowed to touch. Learn more about how Guild.ai is building the infrastructure for AI agents at guild.ai.