Putting voice agents on the phone: Twilio, SIP, and codecs

The phone is the oldest and largest voice interface on earth, and almost everyone knows how to use it. That is why so many voice agents end up there: there is no app to install and no link to click, only a number. Connecting a 2026 AI to it means inheriting a network designed in the 1970s: a narrowband codec and signaling conventions that predate the internet, with a latency floor you cannot negotiate.

A browser agent and a phone agent run the same STT, model, and TTS. What changes is everything around them.

Call audio path

A phone agent has more links in its chain than a browser one. A call originates on the public telephone network (the PSTN) or a VoIP system and reaches a telephony provider, such as Twilio or a SIP trunk, that bridges the phone world and yours. The provider sets up the call using SIP (Session Initiation Protocol, which establishes, modifies, and tears down calls) and carries the audio, classically over RTP, the real-time media protocol.

To get that audio into your agent, the modern pattern is media streaming. The provider opens a connection to your server, often a WebSocket, and streams the call's audio to you in small chunks while you stream the agent's audio back.^[1] Twilio's Media Streams works this way: it sends the caller's audio as it arrives and plays back what you return. Your server runs the agent (recognition, model, synthesis); the provider is the gateway to the caller.

flowchart LR
    A[Caller<br/>PSTN] --> B[Telephony provider<br/>SIP + media]
    B --> C[Your server<br/>STT to LLM to TTS]
    C --> B

The phone agent stack. The provider bridges the telephone network and your agent over a media stream.
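
As a rough illustration of the media-stream pattern, the sketch below parses and builds the JSON text frames that a Twilio Media Streams WebSocket exchanges, as I understand that protocol: a `media` event carries base64-encoded 8 kHz μ-law audio, and audio sent back uses the same shape plus the `streamSid`. The function names `parse_inbound` and `build_outbound` are hypothetical, the WebSocket server itself is left out, and the field names should be checked against the provider's current documentation.

```python
import base64
import json
from typing import Optional


def parse_inbound(raw: str) -> tuple[str, Optional[str], bytes]:
    """Return (event, stream_sid, audio) for one JSON text frame.

    audio is empty unless the frame is a "media" event.
    Field names follow Twilio Media Streams as assumed in the lead-in.
    """
    msg = json.loads(raw)
    event = msg.get("event", "")
    stream_sid = msg.get("streamSid")
    audio = b""
    if event == "media":
        # The payload is base64-encoded mu-law audio from the caller.
        audio = base64.b64decode(msg["media"]["payload"])
    return event, stream_sid, audio


def build_outbound(stream_sid: str, mulaw_audio: bytes) -> str:
    """Wrap agent audio (mu-law bytes) in a "media" frame to send back."""
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": base64.b64encode(mulaw_audio).decode("ascii")},
    })
```

In a real server, each frame received on the socket would pass through `parse_inbound`, the audio would go to the recognizer, and synthesized audio would be wrapped with `build_outbound` and written to the same socket.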

Effects of telephony codecs

The first thing the phone imposes is its audio. Call audio is narrowband: sampled at 8 kHz, encoded as G.711 μ-law or A-law, and not negotiable on a normal call. The band was chosen in the 1960s to fit the most channels onto a copper wire, and it has outlived the copper. Two consequences follow.

Recognition gets harder. Sampling at 8 kHz caps the audio below 4 kHz, which removes the high-frequency detail that distinguishes many consonants, so a recognizer that excels on a browser's wideband audio makes more errors on the phone, for the reasons laid out in telephony transcription. A phone agent wants a recognizer tuned for telephony, not one that only ever heard studio speech.

Synthesis has to match. The agent's TTS output must be 8 kHz μ-law (or A-law, depending on the network) to feed back into the call cleanly. Generating a pristine 48 kHz waveform for a phone line is like mailing a 4K film to someone with a fax machine: everything above the phone band is discarded before the caller hears it. Match the format to the channel and you avoid a needless, lossy conversion.
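
To make the format concrete, here is a minimal pure-Python sketch of G.711 μ-law encoding, assuming the TTS engine already produces 16-bit mono PCM at 8 kHz. It uses the common bias-and-segment formulation of the codec. The function names are illustrative, and a production system would more likely use a tested audio library or request μ-law output from the TTS engine directly.

```python
import struct

BIAS = 0x84   # 132, added to the magnitude before the segment search
CLIP = 32635  # largest magnitude that still fits in 15 bits after BIAS


def linear_to_ulaw(sample: int) -> int:
    """Encode one signed 16-bit PCM sample as one mu-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    # Find the segment (exponent): position of the highest set bit.
    exponent, mask = 7, 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    # mu-law bytes are transmitted bit-inverted.
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def ulaw_to_linear(byte: int) -> int:
    """Decode one mu-law byte back to a signed PCM value."""
    byte = ~byte & 0xFF
    exponent = (byte >> 4) & 0x07
    mantissa = byte & 0x0F
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if byte & 0x80 else magnitude


def pcm16_to_ulaw(pcm: bytes) -> bytes:
    """Encode little-endian 16-bit mono PCM (already at 8 kHz) as mu-law."""
    count = len(pcm) // 2
    samples = struct.unpack(f"<{count}h", pcm[: count * 2])
    return bytes(linear_to_ulaw(s) for s in samples)
```

Each 16-bit sample becomes one byte, so 20 ms of call audio is 160 bytes. The sketch does not resample: converting 48 kHz output to 8 kHz needs a proper low-pass filter first, which is one more reason to ask the TTS engine for the phone format up front.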

Network latency

Every link in the longer chain costs time. Audio travels caller to PSTN to provider to your server and back, and those hops add a latency floor that a browser's direct connection does not have. That is why the same agent feels slower on the phone, the question raised in the latency budget: the round trip is longer before the agent's own processing begins. You cannot beat the speed of the network; you absorb it by being faster everywhere you control and by placing your servers near the provider's.
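
One way to reason about that floor is to subtract the fixed network hops from the response time you are aiming for and see what is left for the agent. The sketch below does only that arithmetic. The hop names and every number in it are placeholders for illustration, not measurements; substitute figures from your own call path.

```python
def agent_budget_ms(target_ms: float, fixed_hops_ms: dict[str, float]) -> float:
    """Time left for STT + model + TTS once the network hops are paid."""
    return target_ms - sum(fixed_hops_ms.values())


# Placeholder values only; measure your own path before relying on them.
hops = {
    "caller_to_provider": 60.0,
    "provider_to_server": 30.0,
    "server_to_provider": 30.0,
    "provider_to_caller": 60.0,
}
remaining = agent_budget_ms(800.0, hops)
print(f"Budget left for the agent: {remaining:.0f} ms")
```

The only hops in that dictionary you can shrink are the two between the provider and your server, which is why server placement matters.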

Telephone signaling and keypad input

The phone carries baggage a browser agent never meets. DTMF tones, the beeps of the keypad, are how callers enter numbers in menus. They are not speech, so the agent detects them separately rather than transcribing them, which makes them reliable for press-1 input and for capturing digits when recognition is risky. Echo is worse on a phone, so acoustic echo cancellation is mandatory; without it the agent trips on its own voice. And telephony has events with no browser equivalent: the call being answered, transferred to a human, sent to voicemail (which an outbound agent should detect rather than talk to), or hung up. A phone agent handles the whole call lifecycle, including the parts outside the conversation.
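
Many providers report keypad presses as events, in which case no audio analysis is needed. When the tones arrive in-band, a classic way to detect them is the Goertzel algorithm, which measures signal power at each of the eight DTMF frequencies. The sketch below assumes 8 kHz float samples and a 40 ms block; the `threshold` value and the `make_tone` test helper are illustrative choices, not tuned for real call audio.

```python
import math
from typing import Optional, Sequence

RATE = 8000
ROWS = (697, 770, 852, 941)       # low-group frequencies in Hz
COLS = (1209, 1336, 1477, 1633)   # high-group frequencies in Hz
KEYS = ("123A", "456B", "789C", "*0#D")


def goertzel_power(samples: Sequence[float], freq: float, rate: int = RATE) -> float:
    """Squared magnitude of the signal's spectrum at one frequency."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq / rate)
    s1 = s2 = 0.0
    for x in samples:
        s1, s2 = x + coeff * s1 - s2, s1
    return s1 * s1 + s2 * s2 - coeff * s1 * s2


def detect_digit(samples: Sequence[float], threshold: float = 0.2) -> Optional[str]:
    """Return the DTMF key in this block, or None if no clear dual tone."""
    n = len(samples)
    energy = sum(x * x for x in samples)
    if n == 0 or energy == 0.0:
        return None
    # Normalize so each value is roughly the share of energy at that tone.
    scale = 2.0 / (n * energy)
    rows = [goertzel_power(samples, f) * scale for f in ROWS]
    cols = [goertzel_power(samples, f) * scale for f in COLS]
    r = max(range(4), key=rows.__getitem__)
    c = max(range(4), key=cols.__getitem__)
    # A valid key needs one strong tone from each group.
    if rows[r] < threshold or cols[c] < threshold:
        return None
    return KEYS[r][c]


def make_tone(digit: str, n: int = 320, rate: int = RATE) -> list[float]:
    """Synthesize a clean DTMF tone for testing (40 ms at 8 kHz by default)."""
    r = next(i for i, row in enumerate(KEYS) if digit in row)
    c = KEYS[r].index(digit)
    lo, hi = ROWS[r], COLS[c]
    return [
        0.5 * math.sin(2 * math.pi * lo * t / rate)
        + 0.5 * math.sin(2 * math.pi * hi * t / rate)
        for t in range(n)
    ]
```

A real detector would also debounce across consecutive blocks and guard against speech that happens to excite two of the frequencies; this sketch covers only the per-block decision.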

Inbound and outbound calling

Phone agents come in two directions, with different concerns. Inbound agents answer calls (support and reception) and must pick up fast and handle whoever calls. Outbound agents place calls (reminders and notifications) and add problems of their own: detecting voicemail and respecting calling rules and consent. If you skip voicemail detection, the agent reads its script to an answering machine.
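
As one hedged example of acting on voicemail detection, the sketch below maps an answering-machine-detection result to the agent's next step. The result strings follow the `AnsweredBy` values that Twilio's answering machine detection reports, as I understand them; the action names and the policy itself (leave a message after the greeting, hang up otherwise, treat unknown as human) are assumptions for illustration.

```python
def next_action(answered_by: str) -> str:
    """Choose what an outbound agent does once the call is classified.

    answered_by is assumed to be a Twilio-style AnsweredBy value; the
    returned action names are hypothetical labels for your own handlers.
    """
    if answered_by == "human":
        return "start_conversation"
    if answered_by.startswith("machine_end"):
        # The greeting has finished, so a short recorded message can be left.
        return "leave_message"
    if answered_by in ("machine_start", "fax"):
        # Greeting still playing, or not a voice line at all.
        return "hang_up"
    # "unknown" or anything unexpected: err on the side of a possible human.
    return "start_conversation"
```

Whether to leave a message at all, and what it may say, depends on the calling rules and consent requirements that apply to the call.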

Both directions run into scaling as soon as they carry real traffic. A hundred simultaneous calls means a hundred concurrent recognition and synthesis streams, a capacity and cost question before it is a code one. The phone makes an agent reachable by anyone, so the load can arrive without warning.

Common questions

How does a voice agent connect to a phone call?

Through a telephony provider such as Twilio or a SIP trunk. It sets up the call with SIP and streams the call audio to your server, often over a WebSocket, where your agent recognizes, reasons, and synthesizes, then sends audio back for the provider to play to the caller.

Why does my voice agent sound worse and slower on the phone?

Both come from the phone network, not the agent. The narrowband codec, sampled at 8 kHz, strips the high-frequency detail that recognition relies on, so errors rise. The extra hops (caller to PSTN to provider to your server and back) add a latency floor that a browser's direct connection does not have.

What format should my agent's TTS output for a phone call?

8 kHz μ-law (or A-law), to match what the network carries. Higher-rate audio is wasted; the network downsamples it. Matching the channel avoids a lossy conversion and keeps the audio clean as it feeds back in.

Do phone agents need to handle keypad input?

Usually. DTMF tones are how callers enter numbers and menu choices, and they are not speech, so the agent detects them separately. They are more reliable than spoken digits when the line is noisy and a misheard number costs you.

References

[1] Soniox (2026). Build a voice agent with Pipecat and Soniox. Soniox documentation.