LLM Agents and Tool Calling: A Practical Engineering Guide

Tool calling is the single feature that turns a language model from a clever text generator into something that can actually do work in your systems. It is also where most teams either ship a genuinely useful product or waste a quarter building an agent that should have been a single prompt. This guide is the practical version: how function calling actually works on the wire, how to design tool schemas the model will use correctly, how to run a multi-step agent loop without it spiraling, the guardrails that keep it safe, and the honest decision of when an agent is worth the cost at all.

What "tool calling" actually is

A tool call is a structured request the model emits instead of a final answer. You hand the model a list of tools — each with a name, a description, and a JSON Schema for its inputs — and when the model decides a tool would help, it stops generating prose and returns a structured block that says, in effect, "call get_order with {order_id: "A-1843"}."

The model does not run anything. It cannot reach your database, your payment processor, or the public internet on its own. It produces a request; your code executes it and feeds the result back. That separation is the whole security model, and it is why tool calling is safe to build on: the boundary between "the model wants to do X" and "X actually happens" lives entirely in your harness.

The round trip looks like this:

1. You send the conversation plus the tool definitions.
2. The model responds with a tool_use block (or a normal text answer, if no tool is needed).
3. You execute the tool and return a tool_result referencing the same call ID.
4. The model reads the result and either calls another tool or writes its final answer.

Everything else — agents, multi-step planning, retrieval pipelines — is built on this one primitive.
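
To make the round trip concrete, here is a sketch of the two blocks that cross the boundary, written as plain JavaScript objects. The field names follow the Messages API shapes used in the loop later in this guide; the ID value, the order data, and the helper name toToolResult are made up for illustration.

```javascript
// What the model returns when it wants a tool run (ID and input are illustrative).
const toolUse = {
  type: "tool_use",
  id: "toolu_example_01",
  name: "get_order",
  input: { order_id: "A-1843" },
};

// Hypothetical helper: wrap whatever your tool returned in a tool_result
// block that points back at the originating call.
function toToolResult(toolUseBlock, result) {
  return {
    type: "tool_result",
    tool_use_id: toolUseBlock.id, // must match the tool_use block's id
    content: JSON.stringify(result),
  };
}

// Your code, not the model, produced this data.
const toolResult = toToolResult(toolUse, { status: "shipped" });
console.log(toolResult.tool_use_id); // toolu_example_01
```

Nothing in the first object has any effect until your harness decides to act on it, which is the separation described above.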

Designing tool schemas the model will use correctly

The schema is the contract, but the description is the prompt. The model chooses tools and fills in arguments based almost entirely on the natural-language description and the per-field hints. A vague description produces a tool that gets ignored or misused; a precise one produces reliable calls.

Three rules carry most of the weight:

- **Describe *when* to call, not just what it does.** "Look up a customer's current subscription. Call this whenever the user asks about their plan, billing date, or renewal." beats "Gets subscription data." Recent models are conservative about reaching for tools, so the trigger condition earns measurable lift.
- **Constrain inputs at the schema level.** Use enum for fixed sets, mark only the truly required fields as required, and give each property its own one-line description. The tighter the schema, the fewer malformed calls.
- **Keep the tool set small and orthogonal.** Five sharp tools beat fifteen overlapping ones. If two tools could plausibly answer the same request, the model will sometimes pick the wrong one.

Here is a well-formed definition. Note the strict: true flag, which guarantees the model's arguments validate against the schema rather than merely being encouraged to:

const tools = [
  {
    name: "get_order_status",
    description:
      "Look up the current status of a customer order by its ID. " +
      "Call this when the user asks where their order is, whether it " +
      "shipped, or for a delivery estimate. Do not guess order IDs.",
    strict: true,
    input_schema: {
      type: "object",
      properties: {
        order_id: {
          type: "string",
          description: "Order ID in the form 'A-1843'. Ask the user if unknown.",
        },
        include_tracking: {
          type: "boolean",
          description: "Whether to return carrier tracking events.",
        },
      },
      required: ["order_id"],
      additionalProperties: false,
    },
  },
];
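
The definition above has no fixed-set field, so here is a companion sketch that shows enum in use. The tool name list_orders, its fields, and the status values are hypothetical; the strict flag simply mirrors the example above. The small lint helper is also an assumption, one possible way to check that every property carries a description.

```javascript
// Hypothetical second tool illustrating an enum-constrained input.
const listOrdersTool = {
  name: "list_orders",
  description:
    "List a customer's orders filtered by status. Call this when the user " +
    "asks to see their open, shipped, or cancelled orders.",
  strict: true,
  input_schema: {
    type: "object",
    properties: {
      status: {
        type: "string",
        enum: ["open", "shipped", "cancelled"], // fixed set, so use enum
        description: "Which order status to filter by.",
      },
      limit: {
        type: "integer",
        description: "Maximum number of orders to return.",
      },
    },
    required: ["status"], // limit stays optional
    additionalProperties: false,
  },
};

// Small lint: names of properties that lack a description.
function propertiesMissingDescriptions(tool) {
  return Object.entries(tool.input_schema.properties)
    .filter(([, schema]) => !schema.description)
    .map(([name]) => name);
}

console.log(propertiesMissingDescriptions(listOrdersTool)); // prints []
```

Running a check like this over the whole tool list in CI is a cheap way to keep descriptions from going missing as tools are added.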

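One operational detail that bites people: always read the model's arguments through a real JSON parser, never with string matching. Models can vary Unicode and slash escaping in the serialized input, so substring checks against the raw text are a latent bug. The SDK hands you block.input already parsed into an object; if you work from the raw serialized string instead (for example, when assembling streamed argument fragments yourself), run it through JSON.parse before inspecting it.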
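
A minimal sketch of why string matching is fragile. The two raw strings below are made-up examples of the same arguments serialized with different escaping; both are valid JSON and parse to identical values.

```javascript
// Same arguments, two serializations: rawA escapes the "A" as \u0041 and the
// slash as \/, rawB does not. (The doubled backslashes are JavaScript string
// escaping; the JSON text itself contains single backslashes.)
const rawA = '{"order_id":"\\u0041-1843","path":"orders\\/2024"}';
const rawB = '{"order_id":"A-1843","path":"orders/2024"}';

// Fragile: substring matching sees different text.
const substringA = rawA.includes('"order_id":"A-1843"');
const substringB = rawB.includes('"order_id":"A-1843"');

// Robust: parse first, then compare values.
const parsedA = JSON.parse(rawA);
const parsedB = JSON.parse(rawB);

console.log(substringA, substringB); // false true
console.log(parsedA.order_id === parsedB.order_id); // true
console.log(parsedA.path === parsedB.path); // true
```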

The agent loop: from one call to many

A single tool call is one round trip. An agent is what you get when you put that exchange in a loop and let the model decide its own next step until the task is done. The minimal loop, using the Anthropic SDK with claude-opus-4-8, is short enough to read in one sitting:

import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic();

let messages = [{ role: "user", content: userInput }];

while (true) {
  const response = await client.messages.create({
    model: "claude-opus-4-8",
    max_tokens: 4096,
    tools,
    messages,
  });

  messages.push({ role: "assistant", content: response.content });

  if (response.stop_reason !== "tool_use") break; // model gave a final answer

  const toolResults = [];
  for (const block of response.content) {
    if (block.type !== "tool_use") continue;
    const result = await executeTool(block.name, block.input); // YOUR code
    toolResults.push({
      type: "tool_result",
      tool_use_id: block.id,
      content: JSON.stringify(result),
    });
  }
  messages.push({ role: "user", content: toolResults });
}

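The loop exits as soon as stop_reason is anything other than tool_use. In the normal case that value is end_turn: the model has stopped calling tools and produced an answer. Other values, such as max_tokens, also exit the loop, so check stop_reason before treating the last message as a finished answer. Two things make this production-grade rather than a demo:

- **Append the full `response.content`, not just the text.** The tool_use blocks must stay in the history or the next turn loses the thread.
- **Match every `tool_result` to its `tool_use_id`.** The model can emit several tool_use blocks in a single turn; the IDs are how the results get reconnected.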

The SDKs also ship a tool runner that wraps this loop for you, executing your tool functions and looping automatically. Use it for the common case. Write the manual loop when you need what the runner abstracts away: human-in-the-loop approval, custom logging, conditional execution, or per-step budget checks.

Guardrails: keeping an autonomous loop safe and bounded

An agent that can call tools is an agent that can call tools too many times, call the wrong one, or take an irreversible action you did not intend. Guardrails are not optional polish — they are the difference between a tool you trust in production and one you babysit.

The ones that matter most:

- **Cap the iterations.** Set a hard max_steps ceiling on the loop. Without it, a confused model can churn through dozens of calls (and dollars) chasing a goal it cannot reach. Break and surface a clear error when you hit the cap.
- **Gate destructive actions.** Reversibility is the right criterion. A read-only get_order can run freely; issue_refund, send_email, or delete_account should require explicit confirmation. With a manual loop you intercept the tool_use block and wait for a human; managed agent frameworks expose an "always ask" permission policy that pauses the session for approval.
- **Validate inputs inside the tool, not just in the schema.** The schema constrains shape; your handler enforces business rules: that the order belongs to the requesting user, that the refund amount is within policy, that the ID exists. Treat tool inputs as untrusted, the same way you treat any API request.
- **Return errors as data, not exceptions.** When a tool fails, send back a tool_result with an informative message and an error flag. The model reads it and adapts, retrying with a corrected argument or asking the user for clarification, instead of the whole loop crashing.
- **Set a token budget for the whole task.** Long agentic runs accumulate context. Cap max_tokens per response and watch the cumulative spend; for long-horizon work, the newer models expose a task-budget mechanism the model itself can see and self-moderate against.
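
Two of the guardrails above, the iteration cap and errors-as-data, can be sketched together as a bounded loop. This is one possible shape, not the SDK's tool runner: runAgent, callModel, and the default of 8 steps are assumptions, and callModel is injected so the control flow can be exercised without a network call (in real use it would wrap client.messages.create). The is_error field is the error flag on a tool_result block.

```javascript
// Sketch of a bounded agent loop with failures returned to the model as data.
async function runAgent({ callModel, executeTool, messages, maxSteps = 8 }) {
  for (let step = 0; step < maxSteps; step++) {
    const response = await callModel(messages);
    messages.push({ role: "assistant", content: response.content });
    if (response.stop_reason !== "tool_use") {
      return { ok: true, steps: step + 1, messages };
    }

    const toolResults = [];
    for (const block of response.content) {
      if (block.type !== "tool_use") continue;
      try {
        const result = await executeTool(block.name, block.input);
        toolResults.push({
          type: "tool_result",
          tool_use_id: block.id,
          content: JSON.stringify(result),
        });
      } catch (err) {
        // The failure goes back to the model instead of crashing the loop.
        toolResults.push({
          type: "tool_result",
          tool_use_id: block.id,
          content: `Tool failed: ${err.message}`,
          is_error: true,
        });
      }
    }
    messages.push({ role: "user", content: toolResults });
  }
  // Cap reached: stop and surface a clear error instead of looping on.
  return { ok: false, error: `Stopped after ${maxSteps} steps`, messages };
}

// Stubs for illustration: a model that never stops asking, a tool that always fails.
const stuckModel = async () => ({
  stop_reason: "tool_use",
  content: [{ type: "tool_use", id: "toolu_x", name: "get_order", input: {} }],
});
const failingTool = async () => {
  throw new Error("order not found");
};

const demoRun = runAgent({
  callModel: stuckModel,
  executeTool: failingTool,
  messages: [{ role: "user", content: "Where is my order?" }],
  maxSteps: 3,
});
demoRun.then((out) => console.log(out.ok, out.error)); // false Stopped after 3 steps
```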

The single highest-leverage guardrail is the reversibility gate. Most agent incidents are not the model hallucinating — they are the model confidently taking a real, hard-to-undo action that nobody put a confirmation step in front of.
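
As a rough sketch of such a gate in a manual loop: the set of tool names comes from the examples above, while executeWithGate, confirm, and the wording of the refusal message are hypothetical. The confirm callback stands in for whatever approval step your product uses (a UI prompt, a review queue, a chat message).

```javascript
// Tools whose effects are hard to undo, per the examples in this guide.
const NEEDS_CONFIRMATION = new Set(["issue_refund", "send_email", "delete_account"]);

// Runs a tool only if it is outside the gated set or a human approved it.
// `confirm` and `executeTool` are supplied by your harness.
async function executeWithGate(block, { executeTool, confirm }) {
  if (NEEDS_CONFIRMATION.has(block.name)) {
    const approved = await confirm(block.name, block.input);
    if (!approved) {
      // A declined action is returned as data so the model can explain or re-plan.
      return {
        type: "tool_result",
        tool_use_id: block.id,
        content: "Action was not approved by a human reviewer.",
        is_error: true,
      };
    }
  }
  const result = await executeTool(block.name, block.input);
  return {
    type: "tool_result",
    tool_use_id: block.id,
    content: JSON.stringify(result),
  };
}
```

In the manual loop, this would replace the direct executeTool call inside the for loop over response.content. An allowlist of read-only tools, with everything else gated by default, is the stricter variant of the same idea.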

When an agent is worth it — and when it is not

This is the decision most teams get wrong, and getting it wrong is expensive. Agents cost more in latency, tokens, and engineering complexity than a single call. They are worth that cost only when the task genuinely demands open-ended, model-driven steps.

Check all four before reaching for an agent:

Complexity — Is the task multi-step and hard to fully specify in advance? "Turn this support ticket into a resolved order adjustment" qualifies. "Extract the invoice total from this PDF" does not.
Value — Does the outcome justify the higher cost and latency per run?
Viability — Is the model actually good at this task type today?
Cost of error — Can mistakes be caught and recovered (tests, review, rollback), or does a wrong step cause real damage?

If any answer is no, step down a tier. The tiers, simplest first:

| Tier | Use when | Shape |
| --- | --- | --- |
| Single prompt | Classification, summarization, extraction, Q&A | One request, one response |
| Single tool call | The model needs one piece of live data to answer | One round trip through your code |
| Workflow | Multi-step but the steps are known and fixed | You orchestrate the sequence in code; the model fills in each step |
| Agent | The path is genuinely unknown until the model explores it | The model decides the next step in a loop |

A surprising amount of "agent" work is really a workflow: a fixed pipeline of three known steps where you control the order and the model just does the reasoning at each one. Workflows are more predictable, cheaper, easier to test, and easier to debug than agents. Reach for a true agent only when you genuinely cannot pre-determine the sequence of steps. The instinct to build an agent because agents are exciting is the most common and most costly mistake in this space.
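
For contrast with the agent loop, here is a sketch of a three-step workflow. Everything named here is hypothetical: handleTicket, the prompts, and the injected askModel and getOrder functions (askModel stands in for a single model call that returns text). The point is only the shape: the code fixes the order of steps and the model reasons inside each one.

```javascript
// Fixed workflow: code owns the sequence, the model fills in each step.
async function handleTicket(ticketText, { askModel, getOrder }) {
  // Step 1: the model classifies; the code decides what happens next.
  const category = (
    await askModel(`Classify this ticket as "order" or "other": ${ticketText}`)
  ).trim();
  if (category !== "order") return { handled: false, category };

  // Step 2: the model extracts one field; the code does the lookup.
  const orderId = (await askModel(`Extract the order ID from: ${ticketText}`)).trim();
  const order = await getOrder(orderId);

  // Step 3: the model drafts a reply from data the code fetched.
  const reply = await askModel(`Write a short reply. Order status: ${order.status}`);
  return { handled: true, orderId, reply };
}
```

Because each step is an ordinary function call with a known input and output, each one can be unit-tested with a stubbed askModel, which is much harder to do with an open-ended loop.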

Putting it together: a pragmatic build order

When we build this at CodeAustral, the order of operations is consistent: start with the simplest tier that could work, define a small set of sharp tools with trigger-aware descriptions and strict schemas, run a manual loop first so the behavior is observable, add the iteration cap and the reversibility gate before anything touches production, and only then consider whether the workload truly needs an autonomous agent or whether a fixed workflow would be more reliable. Instrument token usage and tool-call counts from day one — they are your early warning that the model is thrashing.
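
A minimal sketch of that instrumentation, assuming responses shaped like the ones in the loop above (a usage object with input_tokens and output_tokens, and a content array of blocks). The helper names and the threshold of 20 tool calls are assumptions, not recommendations; pick limits from your own traffic.

```javascript
// Running totals for one agent task.
function createRunStats() {
  return { inputTokens: 0, outputTokens: 0, toolCalls: 0, steps: 0 };
}

// Call once per model response inside the loop.
function recordResponse(stats, response) {
  stats.steps += 1;
  stats.inputTokens += response.usage?.input_tokens ?? 0;
  stats.outputTokens += response.usage?.output_tokens ?? 0;
  stats.toolCalls += response.content.filter((b) => b.type === "tool_use").length;
  return stats;
}

// Crude thrashing signal: an unusually high number of tool calls in one task.
function looksLikeThrashing(stats, { maxToolCalls = 20 } = {}) {
  return stats.toolCalls > maxToolCalls;
}
```

Logging these totals per task, alongside the final stop_reason, is usually enough to spot runs that hit the iteration cap or burn tokens without converging.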

The mental model worth keeping: the model proposes, your code disposes. Every capability the agent has is a tool you wrote and a guardrail you placed. Design those well and tool calling is one of the most reliable building blocks in modern software. Skip the guardrails and you have built a very expensive way to take actions you cannot take back.

Frequently Asked Questions

What is the difference between function calling and tool calling?

They are the same mechanism under two names. "Function calling" was the original term; "tool calling" is the broader, current label because tools can be more than functions — they include server-side capabilities like web search and code execution. In both cases the model emits a structured request with validated arguments, and your code executes it and returns the result.

Do LLM agents execute code on their own?

No. The model only emits a structured request describing which tool to call and with what arguments. Your application code actually runs the tool and decides whether to run it at all. This separation is the core security boundary: nothing happens in your systems unless your harness chooses to execute it, which is why you can safely gate destructive actions behind confirmation.

When should I use an agent instead of a single prompt?

Use an agent only when the task is multi-step, the sequence of steps cannot be specified in advance, the value justifies higher latency and cost, and errors are recoverable. If the steps are known and fixed, build a workflow you orchestrate in code instead. For classification, extraction, or Q&A, a single prompt is almost always the right and cheapest answer.

How do I stop an agent loop from running forever?

Set a hard maximum-iterations cap and break out of the loop when you hit it, returning a clear error. Also cap tokens per response, set a budget for the whole task, and return tool failures as data so the model adapts rather than retrying blindly. The loop should naturally end when the model stops calling tools, but the cap is your safety net for when it does not.

What makes a good tool schema?

Clear, trigger-aware descriptions that tell the model *when* to call the tool, tight JSON Schemas using enums and required fields, a one-line description on every property, and strict validation enabled. Keep the tool set small and non-overlapping so the model is never guessing between two tools that could answer the same request. Always parse the model's arguments with a real JSON parser.

Working with CodeAustral

We build LLM-powered features, agents, and tool-calling pipelines for web platforms, AI products, and operational tools — and we are just as happy to tell you that a single prompt or a fixed workflow is the better fit. If you are designing an agent and want a second opinion on the architecture, the guardrails, or whether it should be an agent at all, send us a brief at https://codeaustral.com/contact.