TTS audio formats and sample rates: choosing output quality

A higher sample rate is not always better. If you ask a TTS engine for 48 kHz audio and send it down a phone line, the network downsamples it to 8 kHz before it reaches anyone, so you paid for bytes and latency that the channel threw away. The output format question is about matching the audio to its destination, not maximizing a single number.

This is the mirror image of audio formats for speech recognition: there the format is decided by where the audio came from; here you choose it for where it is going.

Sample rate and encoding are separate choices

Most format confusion comes from collapsing two unrelated choices into one slider. They answer different questions.

Sample rate sets the frequency ceiling. As on the input side, the Nyquist rule holds: a sample rate of N captures frequencies up to N/2.^[3] Speech lives almost entirely below 8 kHz, so 16 kHz output captures essentially all of it, and 24 kHz is a common TTS default that leaves a little headroom. 44.1 or 48 kHz extends the range further above where speech lives, so it adds little for a voice.^[2] The one exception runs in the other direction: telephony output must be 8 kHz to match the network.
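The Nyquist arithmetic is easy to check for the rates discussed here. A minimal sketch (the function name is illustrative, not from any TTS API):

```python
def nyquist_limit_hz(sample_rate_hz: int) -> float:
    """Highest frequency a given sample rate can represent: rate / 2."""
    return sample_rate_hz / 2


# Print the frequency ceiling for the sample rates mentioned in the text.
for rate in (8_000, 16_000, 24_000, 44_100, 48_000):
    print(f"{rate:>6} Hz sample rate -> audio up to {nyquist_limit_hz(rate):>7.0f} Hz")
```

A 16 kHz rate tops out at 8 kHz of audio, which is where the text places nearly all of speech.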

Encoding is a separate lever: raw or compressed, and by which codec. It governs file size, latency, and compatibility no matter the sample rate. Raw 8 kHz audio and compressed 24 kHz audio are both valid combinations, because the two settings move independently.

Match the format to the destination

The right format follows from where the audio plays.

For a phone line, output 8 kHz μ-law or A-law. The phone network is narrowband and expects exactly this, so generating it directly avoids a lossy conversion.^[1] Anything higher is downsampled away. This is the rule for voice agents on the phone.
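To make the μ-law side concrete: G.711 μ-law stores each sample in one byte, so 8 kHz audio comes to 64 kbit/s. The sketch below follows the widely used segment-based reference algorithm in pure Python. It assumes the input is already 8 kHz, 16-bit signed little-endian mono PCM and does no resampling; the function names are illustrative. In practice you would usually ask the TTS service or an audio library for μ-law directly.

```python
import struct

BIAS = 0x84   # 132, added before the segment search in the common reference code
CLIP = 32635  # largest magnitude that still fits in 15 bits once the bias is added


def linear_to_ulaw(sample: int) -> int:
    """Compress one 16-bit signed PCM sample into one 8-bit mu-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    # Find the segment (exponent): the highest set bit, searching from bit 14 down to bit 7.
    exponent = 7
    mask = 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    # Keep the four bits just below the segment's leading bit.
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    # mu-law bytes are transmitted bit-inverted.
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def pcm16_to_ulaw(pcm: bytes) -> bytes:
    """Convert 16-bit little-endian mono PCM to mu-law, one byte per sample."""
    count = len(pcm) // 2
    samples = struct.unpack(f"<{count}h", pcm[: count * 2])
    return bytes(linear_to_ulaw(s) for s in samples)
```

Silence (sample value 0) encodes to `0xFF`, and the output is exactly half the size of the 16-bit input.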

For a browser or app doing live playback, a 16 or 24 kHz stream gives full speech quality at a modest size. For low-latency streaming TTS, use raw PCM chunks or Opus: PCM needs no decode step, and Opus compresses well while staying fast, so both let audio start quickly.^[4]

For a file to download or store, a compressed format like MP3 or Opus at a reasonable bitrate cuts size with no audible loss for speech, which matters when you are saving thousands of clips.^[5]^[6] Use a lossless or raw format only when you will process the audio further and cannot afford generational loss.
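For the raw case, one simple way to keep PCM intact for later processing is to wrap it in a WAV container, which adds a header and leaves the samples untouched. A minimal sketch using Python's standard `wave` module, assuming 16-bit mono PCM at 24 kHz (the helper name and defaults are illustrative):

```python
import io
import wave


def pcm_to_wav_bytes(pcm: bytes, sample_rate_hz: int = 24_000) -> bytes:
    """Wrap raw 16-bit mono PCM in a WAV container without altering the samples."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)               # mono
        wav.setsampwidth(2)               # 2 bytes = 16-bit samples
        wav.setframerate(sample_rate_hz)
        wav.writeframes(pcm)
    return buffer.getvalue()


# Usage: 100 samples of silence become a small, valid WAV file in memory.
wav_bytes = pcm_to_wav_bytes(b"\x00\x00" * 100)
print(len(wav_bytes), "bytes including the WAV header")
```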

Streaming and file formats

The destination also decides raw versus compressed, through latency. In streaming, every conversion step adds delay, so raw PCM wins: the client plays it immediately with no decode step. Chunked Opus is the compressed option that stays low-latency. In file delivery, nobody is waiting on the first byte, so compression is nearly free: encode to MP3 or Opus, ship a small file, and decode at leisure.
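Raw PCM chunking is simple enough to sketch. The example below assumes 16-bit mono PCM and a 20 ms chunk duration, which is an illustrative choice rather than something the text prescribes; the function names are hypothetical.

```python
from typing import Iterator


def pcm_chunk_size(sample_rate_hz: int, chunk_ms: int = 20, bytes_per_sample: int = 2) -> int:
    """Bytes in one chunk of mono PCM lasting chunk_ms milliseconds."""
    return sample_rate_hz * chunk_ms // 1000 * bytes_per_sample


def iter_pcm_chunks(pcm: bytes, sample_rate_hz: int, chunk_ms: int = 20) -> Iterator[bytes]:
    """Yield fixed-duration chunks; the last chunk may be shorter."""
    size = pcm_chunk_size(sample_rate_hz, chunk_ms)
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


# Usage: one second of 24 kHz, 16-bit silence split into 20 ms chunks.
one_second = b"\x00\x00" * 24_000
chunks = list(iter_pcm_chunks(one_second, 24_000))
print(len(chunks), "chunks of", len(chunks[0]), "bytes")
```

Each chunk can be handed to the client as soon as it exists, which is why raw PCM starts playing quickly.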

Destination	Sample rate	Encoding	Why
Phone / telephony	8 kHz	μ-law / A-law	Matches the network; higher is discarded
Live web/app playback	16-24 kHz	PCM or Opus chunks	Full speech quality, low latency
Downloaded / stored file	16-24 kHz	MP3 or Opus	Small size, no audible loss for speech
Further processing	24 kHz+	PCM / lossless	Avoids generational loss

Output format by destination. The biggest sample rate is rarely the right one.
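The table translates naturally into a lookup. This is a sketch only: the destination keys and encoding labels are hypothetical names, not identifiers from any particular TTS API, and 24 kHz is one choice from the 16-24 kHz range above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutputFormat:
    sample_rate_hz: int
    encoding: str  # illustrative label, not a real API constant


# One row per destination in the table; 24 kHz is an assumed pick from the 16-24 kHz range.
FORMAT_BY_DESTINATION = {
    "telephony": OutputFormat(8_000, "mulaw"),
    "live_playback": OutputFormat(24_000, "pcm_s16le"),
    "stored_file": OutputFormat(24_000, "mp3"),
    "further_processing": OutputFormat(24_000, "pcm_s16le"),
}


def choose_format(destination: str) -> OutputFormat:
    """Return the output format for a destination, or raise on an unknown one."""
    try:
        return FORMAT_BY_DESTINATION[destination]
    except KeyError:
        raise ValueError(f"unknown destination: {destination!r}") from None


print(choose_format("telephony"))
```

Failing loudly on an unknown destination is deliberate: a silent default to a high sample rate is the mistake this table exists to prevent.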

Effects of bitrate

Compressed formats add a third number, the bitrate, which determines how much quality is lost. A high sample rate squeezed into a very low bitrate sounds worse than a modest sample rate at a healthy bitrate, because the bitrate constrains how much detail survives. When you compress, set a bitrate appropriate for speech rather than the lowest one that fits; a high sample rate cannot make up for a bitrate that is too low.^[4]
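One rough way to see the trade-off is to divide the bitrate by the sample rate, giving the average number of bits available per sample. This is a crude, codec-agnostic intuition and not a quality model, since real codecs allocate bits perceptually. The bitrates below are illustrative assumptions.

```python
def bits_per_sample(bitrate_bps: int, sample_rate_hz: int) -> float:
    """Average encoded bits available per audio sample (back-of-envelope only)."""
    return bitrate_bps / sample_rate_hz


# A high sample rate squeezed into a low bitrate (48 kHz at 16 kbit/s) ...
high_rate_low_bitrate = bits_per_sample(16_000, 48_000)
# ... versus a modest sample rate at a healthier bitrate (16 kHz at 32 kbit/s).
modest_rate_ok_bitrate = bits_per_sample(32_000, 16_000)

print(f"48 kHz @ 16 kbit/s: {high_rate_low_bitrate:.2f} bits per sample")
print(f"16 kHz @ 32 kbit/s: {modest_rate_ok_bitrate:.2f} bits per sample")
```

The first configuration has about a sixth of the bits per sample of the second, so the higher sample rate leaves the codec with less to spend on each sample.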

Storage and bandwidth costs

The destination sets both a floor and a ceiling. On a phone call, below 8 kHz the audio sounds worse, and above 8 kHz nothing the listener can hear improves; the extra bytes cost storage, bandwidth, and streaming latency on every request. Generate the audio that survives the trip and no more. Speech recognition is the input-side mirror of this problem: there, the format is dictated by where the audio came from.
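The cost of extra sample rate is straightforward arithmetic. A sketch for a 10-second mono clip, assuming 16-bit raw PCM and, for comparison, a compressed stream at an illustrative 48 kbit/s (that bitrate is an assumption, not a recommendation from the text):

```python
def raw_pcm_bytes(seconds: float, sample_rate_hz: int,
                  bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Size of uncompressed PCM: seconds x samples per second x bytes per sample x channels."""
    return int(seconds * sample_rate_hz * bytes_per_sample * channels)


def compressed_bytes(seconds: float, bitrate_bps: int) -> int:
    """Approximate size of a constant-bitrate compressed stream, ignoring container overhead."""
    return int(seconds * bitrate_bps / 8)


clip_seconds = 10
for rate in (8_000, 24_000, 48_000):
    print(f"raw PCM at {rate} Hz: {raw_pcm_bytes(clip_seconds, rate):,} bytes")
print(f"compressed at 48 kbit/s: {compressed_bytes(clip_seconds, 48_000):,} bytes")
```

Raw size scales linearly with sample rate, so 48 kHz costs six times what 8 kHz does for the same clip. Multiply by thousands of clips or requests and the difference becomes a real line item.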

Common questions

What sample rate should I use for TTS output?

8 kHz for telephony, 16 to 24 kHz for everything else. The phone network caps you at 8 kHz, so anything higher is downsampled away before it reaches the caller. For web and app playback, 16 to 24 kHz already covers essentially all of speech. The jump to 44.1 or 48 kHz only adds range above the voice, which matters for music but not for speech.

Is a higher sample rate always better quality?

No. Speech lives almost entirely below 8 kHz, which a 16 kHz sample rate already captures, so a bigger number adds little the listener can hear, and in streaming the extra bytes add latency. If the channel is a phone line, it downsamples to 8 kHz regardless of what you generate.

Which format is best for a streaming voice agent?

Raw PCM chunks or Opus, because both start playing with minimal delay: PCM needs no decode step, and Opus compresses while staying fast. For a phone agent specifically, generate 8 kHz μ-law or A-law directly, which matches the network and skips a lossy conversion you would otherwise pay for.

Should I store TTS output as MP3 or WAV?

MP3 (or Opus). Compression cuts file size with no audible loss for speech, which matters when you are saving thousands of clips and no one is waiting on the first byte. Keep WAV/PCM or lossless only when you will process the audio further and cannot accept generational loss from re-compression.

References

[1] ITU-T (1988). Pulse code modulation (PCM) of voice frequencies. ITU-T Recommendation G.711.
[2] Monson, B. B., Hunter, E. J., Lotto, A. J., & Story, B. H. (2014). The perceptual significance of high-frequency energy in the human voice. Frontiers in Psychology, 5, 587.
[3] Shannon, C. E. (1949). Communication in the Presence of Noise. Proceedings of the IRE.
[4] Valin, J. M., Vos, K., & Terriberry, T. (2012). Definition of the Opus Audio Codec. IETF RFC 6716.
[5] Zeghidour, N., Luebs, A., Omran, A., Skoglund, J., & Tagliasacchi, M. (2021). SoundStream: An End-to-End Neural Audio Codec. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30.
[6] Kumar, R., Seetharaman, P., Luebs, A., Kumar, I., & Kumar, K. (2023). High-Fidelity Audio Compression with Improved RVQGAN. Advances in Neural Information Processing Systems (NeurIPS) 2023.