Transcribing phone calls: 8 kHz audio, channels, and telephony codecs

A call-center recording comes back with "send" transcribed as "sent," "fifteen" as "sixteen," and the agent's careful spelling of a reference code mangled into prose. The same recognizer handles a studio podcast almost perfectly. What changed is not the model but the audio: it came through a phone. The phone network was engineered for intelligibility at the lowest possible cost, decades before anyone wanted to transcribe its output.

Narrowband sampling at 8 kHz

Phone audio is sampled at 8 kHz and band-limited to roughly 300 to 3,400 Hz, the passband the telephone system settled on to keep a voice intelligible on a cheap line. That ceiling is baked into the network, not a setting you can raise on a live call.

Sampling at 8 kHz captures sound only up to 4 kHz (the Nyquist limit), and the telephone passband cuts off lower still, near 3.4 kHz. Much of what distinguishes consonants lives above that. The hiss that separates "s" from "f" and the burst that separates "t" from "k" are attenuated or gone. The recognizer is asked to tell apart sounds whose distinguishing features were filtered out before it heard them, so consonant confusions and number errors rise on the phone for reasons the model cannot fix.
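The Nyquist limit can be checked numerically. The sketch below, with illustrative names and tone frequencies that are not from any particular library, samples a 5 kHz tone at 8 kHz and shows that the result is sample-for-sample identical to an inverted 3 kHz tone. Anything above 4 kHz either has to be filtered out before sampling or folds back into the band, so it cannot be recovered afterward.

```python
import math

SAMPLE_RATE_HZ = 8_000            # telephony sampling rate
NYQUIST_HZ = SAMPLE_RATE_HZ / 2   # highest frequency the samples can represent


def sample_tone(freq_hz: float, n_samples: int) -> list[float]:
    """Sample a unit-amplitude sine wave at the telephony rate."""
    return [
        math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE_HZ)
        for n in range(n_samples)
    ]


# A 5 kHz component lies above the Nyquist limit. Once sampled at 8 kHz it
# is indistinguishable from a 3 kHz component with inverted polarity.
above_nyquist = sample_tone(5_000, 64)
alias = sample_tone(3_000, 64)

# If the two sequences are exact negatives, every pairwise sum is ~0.
max_difference = max(abs(a + b) for a, b in zip(above_nyquist, alias))
print(f"Nyquist limit: {NYQUIST_HZ:.0f} Hz, max difference: {max_difference:.2e}")
```

This is why the network filters before it samples: the high-frequency detail is removed on purpose, because the alternative is having it masquerade as lower-frequency content.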

Codec distortion

Inside the network, the audio is compressed. The classic codec is G.711 (μ-law in North America and Japan, A-law elsewhere), an 8-bit logarithmic encoding standardized in 1972 that runs at 64 kbit/s and is computationally light and ubiquitous. Lower-bitrate codecs like G.729 (8 kbit/s), and narrowband Opus on modern VoIP, compress harder to save bandwidth.
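To see what "logarithmic encoding" buys, here is a minimal sketch of μ-law companding using the continuous textbook formula with μ = 255. Real G.711 uses a piecewise-linear approximation of this curve on integer samples, so treat the helper names and the simple 127-step quantizer as illustrative assumptions rather than a bit-exact codec.

```python
import math

MU = 255  # companding constant for the mu-law variant


def mulaw_compress(x: float) -> float:
    """Map a sample in [-1, 1] onto the logarithmic mu-law curve."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mulaw_expand(y: float) -> float:
    """Inverse of mulaw_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def quantize(value: float, levels: int = 127) -> float:
    """Round a value in [-1, 1] to evenly spaced steps (roughly 8 bits)."""
    return round(value * levels) / levels


def mulaw_roundtrip(x: float) -> float:
    """Compress, quantize, and expand one sample."""
    return mulaw_expand(quantize(mulaw_compress(x)))


quiet = 0.01  # a soft sample at 1% of full scale
loud = 0.9    # a sample near full scale

err_uniform_quiet = abs(quantize(quiet) - quiet)    # plain linear quantizer
err_mulaw_quiet = abs(mulaw_roundtrip(quiet) - quiet)
err_mulaw_loud = abs(mulaw_roundtrip(loud) - loud)

print(f"quiet sample, uniform: {err_uniform_quiet:.6f}")
print(f"quiet sample, mu-law:  {err_mulaw_quiet:.6f}")
print(f"loud sample,  mu-law:  {err_mulaw_loud:.6f}")
```

The logarithmic curve spends its limited steps on quiet samples, where the ear is most sensitive, and accepts coarser steps on loud ones. That trade keeps speech intelligible in 8 bits, but it is still a lossy step.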

Every lossy codec discards detail, and the harder it compresses, the more it discards. G.711 is relatively gentle; aggressive low-bitrate codecs blur the fine spectral structure recognition leans on. And because a call may be transcoded several times through carriers and gateways, the losses stack, each conversion shaving a little more off a signal that was already narrowband. By the time the audio reaches a recognizer, it can be several lossy steps removed from the voice that was spoken.
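The stacking effect can be illustrated with a toy tandem: μ-law, then A-law, then μ-law again, as might happen when a call crosses between regions. This sketch uses the continuous companding formulas (μ = 255, A = 87.6) with a simple 127-step quantizer, not the bit-exact G.711 tables, and the quiet 440 Hz test tone is an arbitrary stand-in for speech.

```python
import math

MU = 255.0    # mu-law constant
A = 87.6      # A-law constant
LEVELS = 127  # quantizer steps per polarity, standing in for an 8-bit code


def quantize(value):
    return round(value * LEVELS) / LEVELS


def mulaw(x):
    """One mu-law encode/decode pass on a sample in [-1, 1]."""
    y = quantize(math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x))
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def alaw(x):
    """One A-law encode/decode pass on a sample in [-1, 1]."""
    denom = 1 + math.log(A)
    magnitude = abs(x)
    if magnitude < 1 / A:
        y = A * magnitude / denom                    # linear segment near zero
    else:
        y = (1 + math.log(A * magnitude)) / denom    # logarithmic segment
    y = quantize(y)
    if y < 1 / denom:
        decoded = y * denom / A
    else:
        decoded = math.exp(y * denom - 1) / A
    return math.copysign(decoded, x)


def rms_error(reference, decoded):
    squared = sum((r - d) ** 2 for r, d in zip(reference, decoded))
    return math.sqrt(squared / len(reference))


# A quiet 440 Hz tone at 8 kHz, 5% of full scale (illustrative input).
signal = [0.05 * math.sin(2 * math.pi * 440 * n / 8000) for n in range(800)]

one_pass = [mulaw(x) for x in signal]
tandem = [mulaw(alaw(x)) for x in one_pass]   # mu-law -> A-law -> mu-law

single_rms = rms_error(signal, one_pass)
tandem_rms = rms_error(signal, tandem)
print(f"one pass: {single_rms:.6f}, three passes: {tandem_rms:.6f}")
```

Each codec rounds to its own grid, so a value that survived the first pass gets nudged again by the second, and nothing downstream can restore what an earlier pass discarded.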

Packet loss

VoIP carries audio as packets over a network that does not guarantee delivery. Some packets arrive late, some never arrive, and the jitter buffer that smooths this out can only do so much.

A lost packet is a small hole in the audio: tens of milliseconds of silence, or a concealment guess where speech used to be. A single dropped packet can clip a short word out of existence, and a burst of loss during a bad connection can shred a whole phrase. The recognizer sees gaps and glitches that were never spoken, and on a poor line these accumulate into dropped and garbled words with nothing to do with the speaker.
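A rough simulation makes the scale concrete. The sketch below assumes 20 ms packets, a common VoIP framing but not the only one, and the crudest possible concealment: replacing a lost packet with silence. The function names are illustrative.

```python
import math

SAMPLE_RATE_HZ = 8_000
FRAME_MS = 20                                       # assumed packet duration
FRAME_SAMPLES = SAMPLE_RATE_HZ * FRAME_MS // 1000   # 160 samples per packet


def packetize(samples):
    """Split audio into fixed-size packets."""
    return [
        samples[i:i + FRAME_SAMPLES]
        for i in range(0, len(samples), FRAME_SAMPLES)
    ]


def receive(packets, lost):
    """Rebuild the audio, zero-filling any packet whose index is in `lost`."""
    audio = []
    for index, packet in enumerate(packets):
        if index in lost:
            audio.extend([0.0] * len(packet))   # silence where speech was
        else:
            audio.extend(packet)
    return audio


# 200 ms of a 300 Hz tone standing in for one short word.
word = [0.5 * math.sin(2 * math.pi * 300 * n / SAMPLE_RATE_HZ) for n in range(1600)]
packets = packetize(word)

isolated = receive(packets, lost={4})             # one dropped packet
burst = receive(packets, lost={3, 4, 5, 6, 7})    # a bad stretch of the call

print(f"{len(packets)} packets; isolated loss = {FRAME_MS} ms, "
      f"burst loss = {5 * FRAME_MS} ms of a {len(word) * 1000 // SAMPLE_RATE_HZ} ms word")
```

With these assumptions a five-packet burst removes half of a 200 ms word. Production systems use smarter concealment than silence, but the recognizer is still given audio that the speaker never produced.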

Mono and separate-channel audio

How the call is recorded decides how hard the rest of the job is. A two-party call can be captured with each leg on its own channel (caller left, agent right) or mixed down to a single mono track.

In the mono case, both speakers share one channel, so separating them becomes a diarization problem, and any moment they talk over each other is overlapping speech, one of the hardest conditions for a recognizer. Capture the same call in stereo with one speaker per channel and the separation comes for free, because each voice is already on its own track. The single most useful thing you can do for call transcription is often to record the legs separately.
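When a recording does arrive as a two-channel file, splitting it is a matter of de-interleaving samples. A minimal sketch using Python's standard `wave` and `struct` modules follows; it assumes 16-bit PCM with the caller on the left channel and the agent on the right, and the helper names are illustrative.

```python
import io
import struct
import wave


def write_stereo_wav(left, right, sample_rate=8_000):
    """Build an in-memory 16-bit stereo WAV from two equal-length int lists."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(2)
        wav.setsampwidth(2)          # 2 bytes = 16-bit samples
        wav.setframerate(sample_rate)
        interleaved = [s for pair in zip(left, right) for s in pair]
        wav.writeframes(struct.pack(f"<{len(interleaved)}h", *interleaved))
    buffer.seek(0)
    return buffer


def split_channels(wav_file):
    """Return (left, right) sample lists from a 16-bit stereo WAV."""
    with wave.open(wav_file, "rb") as wav:
        if wav.getnchannels() != 2 or wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit stereo audio")
        raw = wav.readframes(wav.getnframes())
    samples = struct.unpack(f"<{len(raw) // 2}h", raw)
    # Stereo frames are interleaved: L, R, L, R, ...
    return list(samples[0::2]), list(samples[1::2])


caller = [100, -200, 300, 0]    # toy samples for the caller leg
agent = [0, 50, -50, 25]        # toy samples for the agent leg

left, right = split_channels(write_stereo_wav(caller, agent))
print(left, right)
```

Each list can then be sent to the recognizer as its own stream, with the speaker label known in advance instead of inferred.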

Echo, hold music, and tones

Real calls are not pure speech. Echo, the agent's own audio leaking back through the caller's line, confuses turn-taking and can be transcribed as phantom speech if it is not cancelled. Hold music and IVR prompts are non-speech audio that a recognizer may try to turn into words (a hallucination risk). And DTMF tones, the beeps of a keypad, are not speech at all; they need to be detected separately rather than transcribed. A system that assumes a clean conversational stream has no plan for any of these unless someone built one. Telephony deployments that work treat echo cancellation, non-speech handling, and DTMF as part of the pipeline.
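DTMF detection is a well-understood signal-processing task, which is why it belongs in its own stage. Each key is the sum of one row tone and one column tone, and the Goertzel algorithm measures energy at a handful of known frequencies cheaply. The sketch below is a bare-bones illustration: the power threshold and 205-sample window are assumptions, and a production detector would also check tone duration, the balance between the two tones, and rejection of speech.

```python
import math

SAMPLE_RATE_HZ = 8_000
ROW_HZ = (697, 770, 852, 941)
COL_HZ = (1209, 1336, 1477, 1633)
KEYPAD = ("123A", "456B", "789C", "*0#D")   # KEYPAD[row][col]


def goertzel_power(samples, freq_hz):
    """Signal power at a single frequency, via the Goertzel recurrence."""
    coeff = 2 * math.cos(2 * math.pi * freq_hz / SAMPLE_RATE_HZ)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2


def detect_dtmf(samples, min_power=1.0):
    """Return the key for the strongest row/column pair, or None."""
    row_power = [goertzel_power(samples, f) for f in ROW_HZ]
    col_power = [goertzel_power(samples, f) for f in COL_HZ]
    row = max(range(4), key=row_power.__getitem__)
    col = max(range(4), key=col_power.__getitem__)
    if row_power[row] < min_power or col_power[col] < min_power:
        return None   # not enough energy at DTMF frequencies
    return KEYPAD[row][col]


def dtmf_tone(key, n_samples=205):
    """Synthesize the two-tone signal for one key (test helper)."""
    row = next(r for r, keys in enumerate(KEYPAD) if key in keys)
    col = KEYPAD[row].index(key)
    return [
        0.5 * math.sin(2 * math.pi * ROW_HZ[row] * n / SAMPLE_RATE_HZ)
        + 0.5 * math.sin(2 * math.pi * COL_HZ[col] * n / SAMPLE_RATE_HZ)
        for n in range(n_samples)
    ]


print(detect_dtmf(dtmf_tone("5")), detect_dtmf([0.0] * 205))
```

Routing keypad input through a detector like this, and keeping it away from the recognizer, avoids asking a speech model to interpret beeps.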

Requirements for telephony transcription

Phone audio is its own domain. Capture both legs separately when you can, for free speaker separation and cleaner overlap. Expect 8 kHz and choose a recognizer trained on telephony audio rather than one that only saw studio speech, because tolerance for these conditions comes from having heard them before [1]. Tune endpointing and VAD for the phone, separately from your microphone settings, since comfort noise, packet loss, and the narrow band all push silence detection toward false readings. For live phone agents, the same constraints apply under a tight latency budget.
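The endpointing point is easy to demonstrate with a toy energy-based detector. In the sketch below, the thresholds, the comfort-noise level, and the hangover length are all made-up values chosen for illustration, and real VADs are more sophisticated than a single RMS threshold. The shape of the failure is the same, though: a threshold tuned for a silent microphone never sees the call go quiet.

```python
import math
import random

FRAME_SAMPLES = 160   # 20 ms at 8 kHz


def frame_rms(frame):
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def find_endpoint(frames, speech_threshold, hangover_frames=10):
    """Return the frame index where end of speech is declared, or None.

    The endpoint fires after `hangover_frames` consecutive frames fall
    below `speech_threshold`, once some speech has been heard.
    """
    heard_speech = False
    silent_run = 0
    for index, frame in enumerate(frames):
        if frame_rms(frame) >= speech_threshold:
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= hangover_frames:
                return index
    return None


rng = random.Random(0)   # seeded so the example is repeatable

# 10 frames of "speech" (a loud 200 Hz tone), then 30 frames of low-level
# noise standing in for network comfort noise during the pause.
speech = [
    [0.3 * math.sin(2 * math.pi * 200 * n / 8000) for n in range(FRAME_SAMPLES)]
    for _ in range(10)
]
comfort_noise = [
    [rng.uniform(-0.01, 0.01) for _ in range(FRAME_SAMPLES)]
    for _ in range(30)
]
call = speech + comfort_noise

mic_endpoint = find_endpoint(call, speech_threshold=0.002)   # clean-mic tuning
phone_endpoint = find_endpoint(call, speech_threshold=0.02)  # telephony tuning
print(mic_endpoint, phone_endpoint)
```

With the microphone threshold, the comfort noise keeps registering as speech and the turn never ends. Raising the threshold above the noise floor lets the detector find the pause.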

Common questions

Why is phone audio harder to transcribe than a recording from my laptop?

Because the phone network samples at 8 kHz and band-limits the voice, removing the high-frequency detail that distinguishes many consonants, then compresses it with lossy codecs and may drop packets along the way. A laptop can capture wideband 16 kHz audio with none of those losses, so the recognizer simply has more to work with.

Should I record a call in mono or stereo?

Stereo with one speaker per channel, whenever you can. That separates the caller and agent for free, avoids needing diarization, and makes overlapping speech a non-issue because each voice is on its own track. A mono mixdown forces you to separate the speakers afterward and suffers wherever they talk at once.

Why does endpointing misbehave on phone calls?

Because the phone network injects comfort noise during pauses, may stop transmitting altogether while a speaker is silent, and drops packets, all of which a silence detector tuned on clean audio reads incorrectly. Telephony usually needs its own voice-activity and endpointing thresholds rather than the ones that work on a clean microphone.

Can a recognizer handle keypad tones and hold music?

Not as speech, and it should not try. DTMF keypad tones need separate detection, and hold music or IVR prompts are non-speech audio that can trigger hallucinated words if fed to the recognizer. A telephony pipeline handles these explicitly rather than transcribing everything that comes down the line.

References

[1] Soniox (2026). Real-time transcription. Soniox documentation.