Tool calling in voice agents: actions from spoken requests

"Book me a table for two at seven tonight." A voice agent without tools can only answer; it might say "sure, I've noted that," and do nothing. An agent with tools recognizes that this requires an action, calls your reservation system, gets a confirmation number, and says "done, you're booked for seven, confirmation 4471." That step is tool calling, and wiring it into a voice agent has wrinkles a text chatbot never faces.

The loop has four steps, then the problems specific to doing it out loud.^[1]

Describe the tools to the model

The model can only call tools it knows about, so you give it a description of each: a name, what it does, and the parameters it takes. The model reads these and decides, per turn, whether and how to call them.

[
  {
    "name": "book_table",
    "description": "Reserve a table at the restaurant.",
    "parameters": {
      "type": "object",
      "properties": {
        "party_size": { "type": "integer" },
        "time": { "type": "string", "description": "ISO 8601 datetime" }
      },
      "required": ["party_size", "time"]
    }
  }
]

Structured tool-call output

When the user's request matches a tool, the model does not produce words to speak. It produces a structured call: the tool name and the arguments it extracted from the conversation. This is where the agent shifts from talking to acting.

{ "tool_call": { "name": "book_table", "arguments": { "party_size": 2, "time": "2026-06-17T19:00" } } }

Run the tool in your code

The model cannot touch your systems; it only asks. It is the customer at a counter writing an order slip, not the clerk who fills it. Your code is the clerk: it executes the call against the real API and brings back a result.

Return the tool result to the model

The result goes back into the model as part of the conversation, and the model produces the spoken reply, now grounded in real data.

That is the full loop. Everything past here is making it work in a live conversation rather than a text box.

Tool-call latency

A tool call adds a round trip the user can hear. The model stops, your API responds, the model runs again to phrase the result, and all of that is silence on the line, a second or more, landing right in the middle of the latency budget. In text, a brief spinner covers it. In voice there is no spinner, so a second of nothing reads as the agent having frozen.

The fix is a holding phrase. The moment the agent decides to call a tool, it says something to fill the gap, "let me check that for you," or "one moment," while the call runs in the background.

Confirmation before consequential actions

Speech is lossy. The agent might have misheard "transfer five hundred" as "transfer five thousand," or the model might have extracted the wrong account, and unlike a form, the user never saw what was captured. For any irreversible or consequential action, such as moving money, cancelling a booking, or sending a message, the agent should confirm before executing: "transferring five hundred dollars to savings, is that right?" This catches recognition errors on the numbers and names that tool parameters are full of, where transcription is weakest, before they become actions you cannot take back.

Parameter errors and tool failures

The parameters come from speech, so the dates, quantities, and identifiers the model extracts inherit every difficulty of alphanumeric recognition. A careful agent validates them and asks again when they are implausible rather than booking a table for "two hundred" people. Tools also fail: the API times out, or the slot is gone. The agent needs a spoken fallback ("I couldn't reach the booking system, want me to try again?") instead of a stack trace or dead air. And because the user can barge in during the holding phrase, the orchestrator has to handle an interruption that arrives while a tool call is still in flight.

Tool calling makes a voice agent useful, and it is also where it gains the power to be confidently wrong out loud. A booking it never made, a transfer for the wrong amount, a balance read back with a digit dropped: the agent says each in the same composed voice it uses for correct answers. The confirmation, validation, and fallback work above is what keeps a fluent voice from acting on errors it cannot take back.

Common questions

What is tool calling in a voice agent?

The step that turns a spoken request into a real action: the model emits a structured call instead of speech, your code runs it against your systems, and the result feeds back so the reply is grounded in real data. It separates an agent that says "I've noted that" from one that actually books the table and reads back the confirmation number.

Why do voice agents say "one moment" before answering?

To cover a round trip you would otherwise hear as dead air. The model stops, your API responds, the model runs again to phrase the result, and that one to two seconds is silence on the line that reads as a frozen call. The holding phrase masks it. A text spinner does the same job, but voice has none.

Should a voice agent confirm before taking an action?

For anything consequential or irreversible, yes. The user never saw what was captured, and tool parameters are full of the numbers and names that recognition handles worst, where "five hundred" and "five thousand" diverge. Read the critical values back ("transferring five hundred dollars, correct?") before executing, and you catch the error before it becomes an action you cannot undo.

What happens if a tool call fails during a conversation?

The agent needs a spoken fallback ("I couldn't reach the booking system, want me to try again?") rather than a stack trace or dead air. It also has to handle the user barging in while the call is still in flight, since the holding phrase invites a reply mid-lookup.

References

Soniox (2026). Build a voice agent with Pipecat and Soniox. Soniox documentation.