The number 8,000 has shaped speech technology for more than half a century. When engineers digitized the telephone, they decided a voice needs only the band from about 300 to 3,400 hertz to stay intelligible, sampled the signal 8,000 times a second, and built the entire phone network around that budget. It is why a voice on a landline sounds thin, why "s" and "f" are hard to tell apart on a call, and why a recognizer that shines on a podcast stumbles on a phone line. That capture decision set the ceiling on what any recognizer could hear from a phone call.
A format is a stack of those decisions, made once at capture and binding forever after. Each setting keeps something and throws something out. Some of those losses a recognizer can survive, and some it cannot.
Sample rate
Sound is a wave, and a digital recording measures its height many times a second. The sample rate is how many times. The governing rule, the Nyquist theorem, is unforgiving: to capture a frequency, you must sample at more than twice that frequency. Sample at 8,000 hertz and nothing above 4,000 hertz can be represented. Everything above that limit is gone, not merely degraded but absent from the recording.
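A minimal sketch of why that limit is absolute, using only the Python standard library. It samples a 5,000 Hz tone at 8,000 Hz and shows the result is sample-for-sample identical to a 3,000 Hz tone, which is why capture hardware filters out content above the limit before sampling. The function names are illustrative, not from any library.

```python
import math

def nyquist_limit(sample_rate_hz):
    """Highest frequency a given sample rate can represent."""
    return sample_rate_hz / 2

def sample_tone(freq_hz, sample_rate_hz, n_samples):
    """Return n_samples of a unit-amplitude cosine sampled at sample_rate_hz."""
    return [math.cos(2 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(n_samples)]

SAMPLE_RATE = 8000

# A tone 1,000 Hz above the limit and a tone 1,000 Hz below it.
above_limit = sample_tone(5000, SAMPLE_RATE, 64)
below_limit = sample_tone(3000, SAMPLE_RATE, 64)

# Largest difference between the two sampled sequences.
max_diff = max(abs(a - b) for a, b in zip(above_limit, below_limit))

print(nyquist_limit(SAMPLE_RATE))  # 4000.0
print(max_diff < 1e-9)             # True: the samples cannot tell the tones apart
```

Once sampled, nothing in the data distinguishes the two tones, so no later processing can restore the original 5,000 Hz content.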
Three rates matter for speech.
8 kHz (narrowband) is the telephone rate. It captures up to 4 kHz, which holds most of the energy that makes speech intelligible but clips the high-frequency detail that distinguishes similar consonants. It is a permanent feature of telephony transcription, not a choice you get to make on a phone call.
16 kHz (wideband) captures up to 8 kHz, covering essentially all the information a recognizer uses. This is the standard rate for speech recognition. It gives the model everything it needs without spending bytes on detail no recognizer will read.
44.1 and 48 kHz are music and video rates, capturing up to roughly 22 and 24 kHz respectively. For speech recognition this is overkill. The extra range sits above where speech lives, so it adds bytes without accuracy, and a recognizer typically downsamples it to 16 kHz internally anyway.
Bit depth
If sample rate is how often you look, bit depth is how precisely you record each look. Think of it as the number of rungs on the ladder the wave's height has to be rounded to: more rungs, less rounding. 16-bit audio, the standard for speech, gives 65,536 rungs per sample, enough that the rounding error (quantization noise) is inaudible and recognition unaffected. Higher bit depths exist for studio work and do nothing for recognition. The default is 16-bit, and almost everything assumes it.
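To put numbers on the ladder, the sketch below counts the rungs for a given bit depth and estimates the signal-to-quantization-noise ratio two ways: with the standard approximation for a full-scale sine wave (6.02 dB per bit plus 1.76 dB) and by rounding a test tone and measuring the error. The test tone frequency and sample rate are arbitrary choices for illustration.

```python
import math

def levels(bit_depth):
    """Number of distinct values ('rungs') a sample can take."""
    return 2 ** bit_depth

def theoretical_snr_db(bit_depth):
    """Standard approximation for a full-scale sine: 6.02 * N + 1.76 dB."""
    return 6.02 * bit_depth + 1.76

def measured_snr_db(bit_depth, n_samples=16000):
    """Quantize a full-scale sine and measure signal power over error power."""
    scale = 2 ** (bit_depth - 1) - 1  # largest positive integer level
    signal_power = 0.0
    noise_power = 0.0
    for i in range(n_samples):
        x = math.sin(2 * math.pi * 440.123 * i / 16000)  # arbitrary test tone
        q = round(x * scale) / scale                      # round to nearest rung
        signal_power += x * x
        noise_power += (x - q) ** 2
    return 10 * math.log10(signal_power / noise_power)

print(levels(16))  # 65536
for bits in (8, 16):
    print(bits, round(theoretical_snr_db(bits), 2), round(measured_snr_db(bits), 1))
```

At 16 bits the rounding error sits roughly 98 dB below a full-scale signal, which is why extra bit depth buys nothing for recognition.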
Codecs
Raw audio is PCM (pulse-code modulation): the uncompressed stream of samples, the baseline that every other format encodes or decodes to. PCM is large, so most audio is compressed by a codec, and codecs split into two kinds.
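How large PCM gets follows directly from the three settings covered so far: sample rate times bytes per sample times channels. The sketch below works that arithmetic for a few common configurations; the helper names are illustrative.

```python
def pcm_bytes_per_second(sample_rate_hz, bit_depth, channels):
    """Uncompressed PCM data rate in bytes per second."""
    return sample_rate_hz * (bit_depth // 8) * channels

def pcm_megabytes_per_hour(sample_rate_hz, bit_depth, channels):
    """Uncompressed PCM size for one hour of audio, in megabytes (10**6 bytes)."""
    return pcm_bytes_per_second(sample_rate_hz, bit_depth, channels) * 3600 / 1_000_000

configs = [
    ("8 kHz, 16-bit, mono", 8000, 16, 1),
    ("16 kHz, 16-bit, mono", 16000, 16, 1),
    ("48 kHz, 16-bit, stereo", 48000, 16, 2),
]

for label, rate, bits, channels in configs:
    print(label,
          pcm_bytes_per_second(rate, bits, channels), "B/s,",
          round(pcm_megabytes_per_hour(rate, bits, channels), 1), "MB/hour")
```

An hour of 16 kHz, 16-bit mono PCM comes to about 115 MB, while the same hour at 48 kHz stereo is six times larger.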
Lossless codecs (FLAC, for example) shrink the file while preserving every sample exactly; decode and you get the original PCM back. Lossy codecs throw information away to shrink the file far more, and the question for recognition is always what they threw away. A codec designed for speech discards what the ear and the recognizer can spare. A codec pushed to a very low bitrate discards things the recognizer needs.
The ones you will meet:
| Codec | Kind | Typical use | Notes for recognition |
|---|---|---|---|
| PCM (s16le) | Uncompressed | Local capture, streaming | The baseline; nothing lost |
| μ-law / A-law (G.711) | Lossy, 8 kHz | Telephony | 8-bit logarithmic; narrowband ceiling |
| Opus | Lossy | WebRTC, modern streaming | Excellent speech quality at low bitrate |
| MP3 / AAC | Lossy | Music, podcasts | Fine at decent bitrate; tuned for music |
| FLAC | Lossless | Archives | Safe; just larger than lossy |
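
The "8-bit logarithmic" note on μ-law can be illustrated with the continuous companding formula, using μ = 255. This is a sketch of the idea only: G.711 itself uses a segmented, piecewise-linear approximation of this curve and a specific bit layout, neither of which is reproduced here. The function names are illustrative.

```python
import math

MU = 255  # companding parameter for mu-law

def mulaw_compress(x, mu=MU):
    """Map a sample in [-1, 1] to [-1, 1], giving quiet samples more resolution."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def mulaw_expand(y, mu=MU):
    """Inverse of mulaw_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)

def to_8bit_and_back(value):
    """Quantize a value in [-1, 1] to one of 256 evenly spaced codes and decode it."""
    code = round((value + 1) / 2 * 255)
    return code / 255 * 2 - 1

def mulaw_roundtrip(x):
    """Compress, store in 8 bits, then expand."""
    return mulaw_expand(to_8bit_and_back(mulaw_compress(x)))

def linear_roundtrip(x):
    """Store the sample in 8 bits with no companding."""
    return to_8bit_and_back(x)

quiet = 0.01  # a low-amplitude sample, typical of soft speech
mulaw_error = abs(mulaw_roundtrip(quiet) - quiet)
linear_error = abs(linear_roundtrip(quiet) - quiet)
print(mulaw_error < linear_error)  # True: log spacing protects quiet samples
```

The logarithmic spacing is what lets 8 bits carry an intelligible voice, but it does nothing about the 8 kHz sample rate, so the narrowband ceiling remains.
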
Two hazards. Very low bitrates hurt: squeeze a voice into a tiny Opus or MP3 stream and you blur the fine detail recognition depends on. And transcoding compounds. Every lossy re-encode loses a little more, the way a photocopy of a photocopy slowly turns to mud, and a typical clip has been through several before it reaches you: phone to recording to upload to re-compress. Keep the audio in one good format from capture to recognizer rather than running it through that chain.
Channels
A recording is mono (one channel) or has multiple channels. For a single speaker, mono is all you need. The interesting case is telephony and conferencing, where the two sides of a call are recorded on separate channels: caller on the left, agent on the right. When that holds, you get speaker separation for free, no diarization required, because each channel is one person. Mixing down to mono throws that away. For two-party calls, keep the channels split if you can.
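A sketch of what keeping the channels split looks like in practice, assuming 16-bit little-endian PCM with interleaved stereo frames (left sample, then right sample). The caller-left, agent-right assignment follows the example above and is not guaranteed by any format. The sample values are contrived to make the loss from mixing obvious.

```python
import struct

def split_stereo_s16le(interleaved):
    """Split interleaved 16-bit little-endian stereo bytes into two sample lists."""
    if len(interleaved) % 4 != 0:
        raise ValueError("stereo s16le data must be a multiple of 4 bytes")
    left, right = [], []
    for l_sample, r_sample in struct.iter_unpack("<hh", interleaved):
        left.append(l_sample)
        right.append(r_sample)
    return left, right

def mix_to_mono(left, right):
    """Average the two channels into one, discarding who said what."""
    return [(l + r) // 2 for l, r in zip(left, right)]

# Three stereo frames: caller on the left channel, agent on the right.
frames = struct.pack("<6h", 100, -100, 200, -200, 300, -300)

caller, agent = split_stereo_s16le(frames)
print(caller)                      # [100, 200, 300]
print(agent)                       # [-100, -200, -300]
print(mix_to_mono(caller, agent))  # [0, 0, 0]
```

Real speech rarely cancels this neatly, but the principle holds: after the mixdown there is one stream and no record of which speaker contributed which part.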
Recommended input formats
If you control capture, send 16 kHz, 16-bit, mono PCM, or Opus at a healthy bitrate when you need to save bandwidth on a stream. On the phone, you are at 8 kHz with μ-law or A-law and there is no way around it, so tune for narrowband rather than fight it. Keep two-party calls in stereo to separate speakers by channel. And do not transcode without reason.
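One way to check a file against these recommendations before sending it, sketched with Python's standard `wave` module. The module reads PCM WAV files only, so compressed or μ-law files need a different reader. The `format_notes` rules and the `make_silent_wav` helper are illustrative assumptions that mirror the advice above, not part of any recognizer's API.

```python
import io
import wave

def describe_wav(source):
    """Read format fields from a PCM WAV file (a path or a binary file object)."""
    with wave.open(source, "rb") as wav:
        return {
            "sample_rate_hz": wav.getframerate(),
            "bit_depth": wav.getsampwidth() * 8,
            "channels": wav.getnchannels(),
        }

def format_notes(fmt):
    """Return human-readable notes where the format departs from 16 kHz/16-bit/mono."""
    notes = []
    if fmt["sample_rate_hz"] < 16000:
        limit = fmt["sample_rate_hz"] // 2
        notes.append(f"narrowband: nothing above {limit} Hz was captured")
    elif fmt["sample_rate_hz"] > 16000:
        notes.append("sample rate is higher than speech recognition needs")
    if fmt["bit_depth"] != 16:
        notes.append("bit depth is not 16")
    if fmt["channels"] == 2:
        notes.append("stereo: keep channels split if each one is a single speaker")
    return notes

def make_silent_wav(sample_rate_hz, channels, seconds=0.01):
    """Build a short silent 16-bit WAV in memory, for demonstration only."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate_hz)
        wav.writeframes(b"\x00\x00" * channels * int(sample_rate_hz * seconds))
    buf.seek(0)
    return buf

phone_call = describe_wav(make_silent_wav(8000, 2))
print(phone_call)
for note in format_notes(phone_call):
    print("-", note)
```

A check like this only reports what the file header claims. It cannot tell whether the audio was upsampled or transcoded somewhere earlier in the chain.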
Every one of those choices is made before the model runs, and none can be taken back. The recognizer can only work with the information the format kept.
Common questions
What sample rate should I use for speech recognition?
16 kHz if you control the capture. It records frequencies up to 8 kHz, covering everything a recognizer uses, with no wasted data. Higher rates like 44.1 kHz add no accuracy for speech and just enlarge the file. Use 8 kHz only when you are stuck with it, as on a phone call.
Does converting 8 kHz audio to 16 kHz improve recognition?
No. Upsampling makes the file larger but adds no information, because the frequencies above 4 kHz were never captured and cannot be recovered. The recognizer sees the same narrowband content either way. If you want wideband accuracy, capture at the higher rate from the start.
Will compressing my audio hurt accuracy?
It depends on the codec and bitrate. Lossless compression (FLAC) is safe. Speech-tuned lossy codecs like Opus at a reasonable bitrate cost little or no accuracy. The damage comes from very low bitrates and from repeated transcoding. Avoid stacking lossy conversions.
Should I send stereo or mono?
Mono is fine for a single speaker. For two-party calls, keep stereo if each side is on its own channel: that separates the speakers for free and avoids diarization. Mixing a two-party call down to mono discards that separation.
Related concepts
- Telephony transcription
- Transcribing noisy audio
- Speaker diarization
- Streaming speech recognition
- TTS audio formats
References
- Soniox (2026). WebSocket API. Soniox documentation.