Audio formats, sample rates, and codecs for speech recognition

The number 8,000 has shaped speech technology for over half a century. When engineers digitized the telephone, they decided a voice needs only the band from about 300 to 3,400 hertz to stay intelligible, sampled the signal 8,000 times a second, and built the entire phone network around that budget. It is why a voice on a landline sounds thin, and why a recognizer that shines on a podcast stumbles on a phone line. That capture decision set the ceiling on what any recognizer could ever hear from a call.

A format is a stack of such decisions, made once at capture and binding forever after. Each setting keeps something and throws something out, and only some of those losses are survivable.

Sample rate

Sound is a wave, and a digital recording measures its height many times a second. The sample rate is how many times. The governing rule, the Nyquist theorem, is unforgiving: to capture a frequency, you must sample more than twice as fast as that frequency. If you sample at 8,000 hertz, nothing above 4,000 hertz can be represented. Everything above that limit is gone, not merely degraded but absent from the recording.
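The arithmetic behind that rule fits in a few lines. The following is a minimal sketch, not code from any particular library; the function names are made up for illustration. The second function assumes a pure tone and no anti-aliasing filter. Capture hardware normally filters out content above the limit before sampling, and without that filter the content folds back into the band as a false lower tone.

```python
def nyquist_hz(sample_rate_hz: float) -> float:
    """Upper frequency limit for a given sample rate (half the rate)."""
    return sample_rate_hz / 2.0


def aliased_frequency_hz(tone_hz: float, sample_rate_hz: float) -> float:
    """Apparent frequency of a pure tone sampled with no anti-aliasing filter."""
    folded = tone_hz % sample_rate_hz
    # Anything above half the sample rate mirrors back below it.
    return min(folded, sample_rate_hz - folded)


for rate in (8_000, 16_000, 48_000):
    print(f"{rate} Hz sampling keeps content up to {nyquist_hz(rate):.0f} Hz")

# A 5 kHz tone sampled at 8 kHz without filtering shows up as a 3 kHz tone.
print(aliased_frequency_hz(5_000, 8_000))
```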

8 kHz (narrowband) is the telephone rate. It captures up to 4 kHz, which holds most of the energy that makes speech intelligible but clips the high-frequency detail that distinguishes similar consonants. That ceiling is a fixed feature of telephony transcription, not a choice you get to make on a phone call.

16 kHz (wideband) captures up to 8 kHz, covering essentially all the information a recognizer uses. This is the standard rate for speech recognition. It gives the model everything it needs without spending bytes on detail no recognizer will read.

44.1 and 48 kHz are music and video rates, capturing up to roughly 22 and 24 kHz respectively. For speech recognition this is overkill. The extra range sits above where speech lives, so it adds bytes without accuracy, and a recognizer built for 16 kHz input will typically downsample it anyway.
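To put numbers on "adds bytes without accuracy", the sketch below computes the uncompressed data rate at each sample rate. The function name is illustrative, and the figures assume 16-bit mono PCM with no container overhead.

```python
def pcm_bytes_per_second(sample_rate_hz: int, bit_depth: int = 16, channels: int = 1) -> int:
    """Uncompressed PCM data rate: samples per second x bytes per sample x channels."""
    return sample_rate_hz * (bit_depth // 8) * channels


for rate in (8_000, 16_000, 44_100, 48_000):
    per_minute_mb = pcm_bytes_per_second(rate) * 60 / 1_000_000
    print(f"{rate:>6} Hz, 16-bit mono: {per_minute_mb:.2f} MB per minute")
```

At 48 kHz the stream is exactly three times the size of the 16 kHz one, for the same speech content.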

Bit depth

If sample rate is how often you look, bit depth is how precisely you record each look. Think of it as the number of rungs on the ladder the wave's height has to be rounded to: more rungs, less rounding. 16-bit audio, the standard for speech, gives 65,536 rungs per sample, enough that the rounding error (quantization noise) is inaudible and recognition unaffected. Higher bit depths exist for studio work and do nothing for recognition. The default is 16-bit, and almost everything assumes it.
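The rung count and the resulting noise floor can be estimated directly. This is a rough sketch: the signal-to-noise figure uses the textbook approximation for an ideal quantizer driven by a full-scale sine wave (about 6.02 dB per bit plus 1.76 dB), so real recordings will come in lower. The function names are illustrative.

```python
def quantization_levels(bit_depth: int) -> int:
    """Number of distinct amplitude values a sample can take."""
    return 2 ** bit_depth


def ideal_snr_db(bit_depth: int) -> float:
    """Approximate signal-to-quantization-noise ratio for a full-scale sine."""
    return 6.02 * bit_depth + 1.76


for bits in (8, 16, 24):
    print(f"{bits}-bit: {quantization_levels(bits):,} levels, ~{ideal_snr_db(bits):.1f} dB")
```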

Codecs

Raw audio is PCM (pulse-code modulation): the uncompressed stream of samples, the baseline that every other format is encoded from and decodes back to. PCM is large, so most audio is compressed by a codec, and codecs split into two kinds.

Lossless codecs (FLAC, for example) shrink the file while preserving every sample exactly; decode and you get the original PCM back. Lossy codecs throw information away to shrink the file far more, and the question for recognition is always what they threw away. A codec designed for speech discards what the ear and the recognizer can spare. A codec pushed to a very low bitrate discards things the recognizer needs.
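The difference between the two kinds can be shown with stand-ins from the Python standard library. This sketch does not use FLAC or any real audio codec: `zlib` plays the part of a lossless compressor, and dropping the low 8 bits of each sample is a deliberately crude substitute for a lossy one. The test signal is a synthetic 440 Hz tone, chosen only for illustration.

```python
import math
import struct
import zlib

RATE = 16_000
# One second of a 440 Hz tone as 16-bit samples.
samples = [int(12_000 * math.sin(2 * math.pi * 440 * n / RATE)) for n in range(RATE)]
pcm = struct.pack(f"<{len(samples)}h", *samples)

# Lossless stand-in: compress, decompress, compare byte for byte.
packed = zlib.compress(pcm)
restored = zlib.decompress(packed)
print("lossless round trip identical:", restored == pcm)

# Lossy stand-in: keep only the top 8 bits of every sample.
coarse = [(s >> 8) << 8 for s in samples]
lossy_pcm = struct.pack(f"<{len(coarse)}h", *coarse)
worst_error = max(abs(a - b) for a, b in zip(samples, coarse))
print("lossy round trip identical:", lossy_pcm == pcm)
print("largest per-sample error:", worst_error)
```

Real lossy codecs choose what to discard far more carefully than this, but the principle holds: once the bits are dropped, no decoder can bring them back.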

The ones you will meet:

Codec	Kind	Typical use	Notes for recognition
PCM (s16le)	Uncompressed	Local capture, streaming	The baseline; nothing lost
μ-law / A-law (G.711)	Lossy, 8 kHz	Telephony	8-bit logarithmic; narrowband ceiling
Opus	Lossy	WebRTC, modern streaming	Excellent speech quality at low bitrate
MP3 / AAC	Lossy	Music, podcasts	Fine at decent bitrate; tuned for music
FLAC	Lossless	Archives	Safe; just larger than lossy
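The "8-bit logarithmic" note on the G.711 row is easier to see in code. The sketch below implements the continuous μ-law companding curve with μ = 255. It is an illustration of the idea only: actual G.711 uses a segmented approximation of this curve with its own bit layout, so the codes produced here are not G.711-compatible. All function names are illustrative.

```python
import math

MU = 255  # companding constant for the mu-law curve


def mulaw_encode(x: float) -> int:
    """Map a sample in [-1.0, 1.0] to an 8-bit code (0..255) on a log scale."""
    x = max(-1.0, min(1.0, x))
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    return round((y + 1.0) / 2.0 * 255)


def mulaw_decode(code: int) -> float:
    """Invert mulaw_encode back to an approximate sample in [-1.0, 1.0]."""
    y = code / 255 * 2.0 - 1.0
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def linear8_roundtrip(x: float) -> float:
    """Plain 8-bit linear quantization, for comparison."""
    code = round((x + 1.0) / 2.0 * 255)
    return code / 255 * 2.0 - 1.0


quiet = 0.01  # a quiet sample, where the log scale helps most
print("mu-law error:", abs(mulaw_decode(mulaw_encode(quiet)) - quiet))
print("linear error:", abs(linear8_roundtrip(quiet) - quiet))
```

The log scale spends its 256 codes unevenly, giving quiet samples much finer steps than loud ones, which is how 8 bits manage to carry intelligible speech.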

Two hazards. Very low bitrates hurt: if you squeeze a voice into a tiny Opus or MP3 stream, you blur the fine detail recognition depends on. And transcoding compounds. Every lossy re-encode loses a little more, the way a photocopy of a photocopy slowly turns to mud, and a typical clip has been through several before it reaches you: phone to recording to upload to re-compress. Keep the audio in one good format from capture to recognizer rather than running it through that chain.

Channels

A recording is mono (one channel) or has multiple channels. For a single speaker, mono is all you need. The interesting case is telephony and conferencing, where the two sides of a call are recorded on separate channels: caller on the left, agent on the right. When that holds, you get speaker separation for free, no diarization required, because each channel is one person. Mixing down to mono throws that away. For two-party calls, keep the channels split if you can.
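Splitting a two-channel recording back into its two speakers is mechanical when the audio is interleaved 16-bit little-endian PCM, where samples alternate left, right, left, right. The helper below is a minimal sketch under that assumption; its name is illustrative, and it does not parse WAV headers or any other container.

```python
import struct


def split_stereo_s16le(interleaved: bytes) -> tuple[bytes, bytes]:
    """Split interleaved stereo s16le PCM into (left, right) mono byte strings."""
    if len(interleaved) % 4 != 0:
        raise ValueError("expected whole stereo frames of 4 bytes each")
    count = len(interleaved) // 2
    samples = struct.unpack(f"<{count}h", interleaved)
    left, right = samples[0::2], samples[1::2]
    return (
        struct.pack(f"<{len(left)}h", *left),
        struct.pack(f"<{len(right)}h", *right),
    )


# Three stereo frames: positive values on the left, negative on the right.
frames = struct.pack("<6h", 1, -1, 2, -2, 3, -3)
caller, agent = split_stereo_s16le(frames)
print(struct.unpack("<3h", caller), struct.unpack("<3h", agent))
```

Each returned byte string can then be sent to the recognizer as its own mono stream, with the channel itself serving as the speaker label.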

flowchart TB
    A[Voice<br/>energy to ~8 kHz] --> B{Capture rate}
    B -->|8 kHz telephony| C[Keeps to 4 kHz<br/>consonants blur]
    B -->|16 kHz wideband| D[Keeps to 8 kHz<br/>full detail]

The same voice, two ceilings. 8 kHz telephony discards the high band before the recognizer ever sees it.

Recommended input formats

If you control capture, send 16 kHz, 16-bit, mono PCM, or Opus at a healthy bitrate when you need to save bandwidth on a stream.^[1] On the phone, you are at 8 kHz with μ-law or A-law and there is no way around it, so tune for narrowband rather than fight it. Keep two-party calls in stereo to separate speakers by channel. And do not transcode without reason.
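A quick pre-flight check can catch format surprises before audio is sent anywhere. The sketch below uses Python's standard `wave` module to read a WAV header and report how it differs from the 16 kHz, 16-bit recommendation. The function names and the wording of the notes are illustrative choices, not part of any recognizer's API, and the check covers only uncompressed PCM WAV files.

```python
import io
import wave


def check_wav_for_recognition(fileobj) -> list[str]:
    """Return notes on how a PCM WAV file deviates from 16 kHz, 16-bit audio."""
    notes = []
    with wave.open(fileobj, "rb") as wav:
        rate = wav.getframerate()
        width_bits = wav.getsampwidth() * 8
        channels = wav.getnchannels()
    if rate < 16_000:
        notes.append(f"{rate} Hz is narrowband; expect telephony-grade detail")
    elif rate > 16_000:
        notes.append(f"{rate} Hz is more than speech needs; 16000 Hz is enough")
    if width_bits != 16:
        notes.append(f"{width_bits}-bit samples; 16-bit is the usual default")
    if channels > 2:
        notes.append(f"{channels} channels; check which ones carry speech")
    return notes


def make_silent_wav(rate: int, channels: int = 1, seconds: float = 0.1) -> io.BytesIO:
    """Build a short silent 16-bit WAV in memory, for trying out the check."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(rate * seconds) * channels)
    buf.seek(0)
    return buf


print(check_wav_for_recognition(make_silent_wav(16_000)))
print(check_wav_for_recognition(make_silent_wav(48_000)))
```

Stereo passes without a note on purpose, since a two-channel file may be a two-party call that should stay split.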

Every one of those choices is made before the model runs, and none can be taken back. The recognizer can only work with the information the format kept.

Common questions

What sample rate should I use for speech recognition?

16 kHz if you control the capture. It records frequencies up to 8 kHz, covering everything a recognizer uses, with no wasted data. Higher rates like 44.1 kHz add no accuracy for speech and just enlarge the file. Use 8 kHz only when you are stuck with it, as on a phone call.

Does converting 8 kHz audio to 16 kHz improve recognition?

No. Upsampling makes the file larger but adds no information, because the frequencies above 4 kHz were never captured and cannot be recovered. The recognizer sees the same narrowband content either way. Capture at the higher rate from the start.

Will compressing my audio hurt accuracy?

Depends on the codec and bitrate. Lossless compression (FLAC) is safe. Speech-tuned lossy codecs like Opus at a reasonable bitrate are nearly free. The damage comes from very low bitrates and from repeated transcoding. Avoid stacking lossy conversions.

Should I send stereo or mono?

Mono is fine for a single speaker. For two-party calls, keep stereo if each side is on its own channel: that separates the speakers for free and avoids diarization. Mixing a two-party call down to mono discards that separation.

References

[1] Soniox (2026). WebSocket API. Soniox documentation.