TTS audio formats and sample rates

Selecting sample rates, encodings, and codecs for TTS output

Updated June 29, 2026

A higher sample rate is not always better. Ask a TTS engine for 48 kHz audio and send it down a phone line, and the network downsamples it to 8 kHz before it reaches anyone: you pay for bytes and latency that the channel throws away. Choosing an output format is about matching the audio to its destination, not maximizing a single number.

This is the mirror image of audio formats for speech recognition: there the format is decided by where the audio came from; here you choose it for where it is going.

Sample rate and encoding are separate choices

Most format confusion comes from collapsing two unrelated choices into one slider. They answer different questions.

Sample rate sets the frequency ceiling. As on the input side, the Nyquist rule holds: a sample rate of N Hz captures frequencies up to N/2 Hz.[9] Speech lives almost entirely below 8 kHz, so 16 kHz output captures essentially all of it, and 24 kHz is a common TTS default that leaves a little headroom. 44.1 or 48 kHz adds range above where speech lives, so it adds little that a listener will notice in a voice.[4][5][6][7][8] The one hard constraint runs the other way: telephony output must be 8 kHz to match the network.
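The Nyquist arithmetic above is small enough to check directly. The sketch below is illustrative only: the function names and the 8 kHz speech band constant are assumptions taken from the paragraph above, not part of any TTS API.

```python
# Assumption from the text above: speech lives almost entirely below 8 kHz.
SPEECH_BAND_HZ = 8_000


def nyquist_limit_hz(sample_rate_hz: int) -> float:
    """Highest frequency a given sample rate can represent (N / 2)."""
    return sample_rate_hz / 2


def covers_speech(sample_rate_hz: int) -> bool:
    """True when the Nyquist limit reaches the assumed speech band."""
    return nyquist_limit_hz(sample_rate_hz) >= SPEECH_BAND_HZ


for rate in (8_000, 16_000, 24_000, 48_000):
    print(rate, nyquist_limit_hz(rate), covers_speech(rate))
```

Only the 8 kHz telephony rate falls short of the assumed speech band; 16 kHz already reaches it, and every rate above that adds range beyond it.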

Encoding is a separate lever: raw or compressed, and by which codec. It governs file size, latency, and compatibility no matter the sample rate. Raw 8 kHz audio and compressed 24 kHz audio are both valid combinations, because the two settings move independently.

Match the format to the destination

The right format follows from where the audio plays.

For a phone line, output 8 kHz μ-law or A-law. The phone network is narrowband and expects exactly this, so generating it directly avoids a lossy conversion.[1][2][3] Anything higher is downsampled away. This is the rule for voice agents on the phone.
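To make "8 kHz μ-law" concrete, here is a minimal sketch of μ-law companding in pure Python, following the widely circulated bias-and-segment form of the G.711 algorithm. The function names are hypothetical, and the input is assumed to be 16-bit little-endian mono PCM that is already sampled at 8 kHz; this code does not resample. In production you would normally request μ-law from the TTS service directly or use a tested codec library.

```python
import struct

BIAS = 0x84   # 132, added to the magnitude before locating the segment
CLIP = 32635  # largest magnitude that still fits in 15 bits after the bias


def linear_to_ulaw(sample: int) -> int:
    """Compress one signed 16-bit PCM sample into one mu-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS

    # Find the segment (exponent): the highest set bit between bit 14 and bit 7.
    exponent = 7
    mask = 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1

    # Keep the four bits just below the segment's leading bit.
    mantissa = (magnitude >> (exponent + 3)) & 0x0F

    # G.711 mu-law transmits the bitwise complement of the assembled byte.
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def pcm16_to_ulaw(pcm: bytes) -> bytes:
    """Convert 16-bit little-endian PCM bytes to mu-law, one byte per sample."""
    count = len(pcm) // 2
    samples = struct.unpack(f"<{count}h", pcm[: count * 2])
    return bytes(linear_to_ulaw(s) for s in samples)
```

The output is half the size of the 16-bit input: one byte per sample, or 8,000 bytes per second at 8 kHz.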

For a browser or app doing live playback, a 16 or 24 kHz stream gives full speech quality at a modest size. For low-latency streaming TTS, use raw PCM chunks or Opus: PCM needs no decode step, and Opus compresses well while staying fast, so both let audio start quickly.[10][11]

For a file to download or store, a compressed format like MP3 or Opus at a reasonable bitrate cuts size with no audible loss for speech, which matters when you are saving thousands of clips.[12][13][17] Use a lossless or raw format only when you will process the audio further and cannot afford generational loss.
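For the lossless case, wrapping raw PCM in a WAV container needs nothing beyond Python's standard library. The sketch below assumes 16-bit mono PCM at 24 kHz and uses a generated tone as a stand-in for TTS output; the helper name `pcm16_to_wav` is hypothetical. Encoding to MP3 or Opus needs an external encoder and is not shown.

```python
import io
import math
import struct
import wave


def pcm16_to_wav(pcm: bytes, sample_rate_hz: int) -> bytes:
    """Wrap 16-bit mono PCM in a WAV container, with no compression."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)            # mono
        w.setsampwidth(2)            # 2 bytes = 16 bits per sample
        w.setframerate(sample_rate_hz)
        w.writeframes(pcm)
    return buf.getvalue()


# Stand-in for TTS output: 0.1 s of a 440 Hz tone at 24 kHz.
rate = 24_000
tone = b"".join(
    struct.pack("<h", int(12_000 * math.sin(2 * math.pi * 440 * n / rate)))
    for n in range(rate // 10)
)
wav_bytes = pcm16_to_wav(tone, rate)
```

Because the container stores the samples untouched, reading the file back returns exactly the bytes that went in, which is the property that matters when the audio will be processed again.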

Streaming and file formats

The destination also decides raw versus compressed, through latency. In streaming, every conversion step adds delay, so raw PCM wins: the client plays it immediately with no decode. Chunked Opus is the compressed option that stays low-latency.[16] In file delivery, nobody is waiting on the first byte, so compression is nearly free: encode to MP3 or Opus, ship a small file, and decode at leisure.
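Streaming raw PCM comes down to cutting the sample stream into small fixed-duration frames that the client can play as they arrive. The sketch below assumes 16-bit mono PCM and a 20 ms frame, a common frame length in voice applications; the function name and defaults are illustrative, not any particular service's chunking scheme.

```python
from typing import Iterator


def pcm_frames(
    pcm: bytes,
    sample_rate_hz: int,
    frame_ms: int = 20,          # assumed frame duration
    bytes_per_sample: int = 2,   # 16-bit PCM
) -> Iterator[bytes]:
    """Yield fixed-duration frames of raw PCM; the last frame may be shorter."""
    frame_bytes = sample_rate_hz * frame_ms // 1000 * bytes_per_sample
    for start in range(0, len(pcm), frame_bytes):
        yield pcm[start : start + frame_bytes]


# One second of 24 kHz silence splits into 50 frames of 960 bytes each.
one_second = b"\x00\x00" * 24_000
frames = list(pcm_frames(one_second, 24_000))
print(len(frames), len(frames[0]))
```

Smaller frames let playback start sooner at the cost of more per-chunk overhead, so the frame length is a tuning choice rather than a fixed rule.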

| Destination | Sample rate | Encoding | Why |
| --- | --- | --- | --- |
| Phone / telephony | 8 kHz | μ-law / A-law | Matches the network; higher is discarded |
| Live web/app playback | 16-24 kHz | PCM or Opus chunks | Full speech quality, low latency |
| Downloaded / stored file | 16-24 kHz | MP3 or Opus | Small size, no audible loss for speech |
| Further processing | 24 kHz+ | PCM / lossless | Avoids generational loss |
Output format by destination. The biggest sample rate is rarely the right one.
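In application code, the table can be kept as a small lookup so the format is chosen once per destination rather than ad hoc at each call site. This is a sketch under stated assumptions: the destination keys, the encoding strings, and the choice of 24 kHz within the 16 to 24 kHz range are all hypothetical and would need to be mapped onto whatever parameters your TTS provider actually accepts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutputFormat:
    sample_rate_hz: int
    encoding: str  # hypothetical label, not a vendor parameter value


# Mirrors the table above; 24 kHz is picked from the 16 to 24 kHz range.
FORMAT_BY_DESTINATION = {
    "telephony": OutputFormat(8_000, "mulaw"),
    "live_playback": OutputFormat(24_000, "pcm"),
    "stored_file": OutputFormat(24_000, "opus"),
    "further_processing": OutputFormat(24_000, "pcm"),
}


def choose_format(destination: str) -> OutputFormat:
    """Return the output format for a destination, or raise on unknown keys."""
    try:
        return FORMAT_BY_DESTINATION[destination]
    except KeyError:
        raise ValueError(f"unknown destination: {destination!r}") from None


print(choose_format("telephony"))
```

Raising on an unknown destination is deliberate: silently defaulting to a high sample rate is the mistake the table exists to prevent.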

Effects of bitrate

Compressed formats add a third number, the bitrate, which determines how much quality is lost. A high sample rate squeezed into a very low bitrate sounds worse than a modest sample rate at a healthy bitrate, because the bitrate constrains how much detail survives. When you compress, set a bitrate appropriate for speech rather than the lowest one that fits; a high sample rate cannot make up for a bitrate that is too low.[14][15]
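The relationship between bitrate and size is simple arithmetic, sketched below. Raw PCM has an implied bitrate of sample rate times bit depth times channels, while a compressed stream's size depends only on its bitrate and duration. The 32 kbps figure is an illustrative assumption for a speech-oriented Opus setting, not a recommendation from the sources cited here.

```python
def raw_pcm_kbps(sample_rate_hz: int, bits_per_sample: int = 16, channels: int = 1) -> float:
    """Implied bitrate of uncompressed PCM, in kilobits per second."""
    return sample_rate_hz * bits_per_sample * channels / 1000


def clip_size_bytes(bitrate_kbps: float, duration_s: float) -> int:
    """Approximate payload size of a clip at a constant bitrate."""
    return int(bitrate_kbps * 1000 * duration_s / 8)


raw_kbps = raw_pcm_kbps(24_000)            # 384.0 kbps for 24 kHz, 16-bit mono
raw_size = clip_size_bytes(raw_kbps, 10)   # 480000 bytes for a 10 s clip
opus_size = clip_size_bytes(32, 10)        # 40000 bytes at an assumed 32 kbps
print(raw_kbps, raw_size, opus_size)
```

Note that the compressed size does not depend on the sample rate at all: raising the sample rate at a fixed bitrate spreads the same number of bits over more signal, which is why it cannot rescue a bitrate that is too low.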

Storage and bandwidth costs

Audio generated above what the channel can carry does not improve the call. The excess is thrown away, usually by a downsampler in the network, before any of it reaches the listener. A 48 kHz stream bound for a phone line is downsampled at the network edge, and the caller hears the same 8 kHz audio they would have heard from a request a fraction of the size.

So the destination sets both a floor and a ceiling. On a phone call, below 8 kHz the audio sounds worse, and above 8 kHz nothing the listener can hear improves. Generate the audio that survives the trip and no more. The input side is the mirror of this problem: in speech recognition, the format is dictated by where the audio came from.
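The ceiling can be expressed as a clamp: whatever rate you generate, the listener receives at most the channel's rate. The sketch below counts samples only and uses hypothetical function names; the saving in bytes is larger still when the telephony leg uses 8-bit μ-law instead of 16-bit PCM.

```python
def delivered_rate_hz(generated_hz: int, channel_hz: int) -> int:
    """Sample rate that survives the channel: never more than it carries."""
    return min(generated_hz, channel_hz)


def discarded_fraction(generated_hz: int, channel_hz: int) -> float:
    """Share of generated samples the channel throws away."""
    return 1 - delivered_rate_hz(generated_hz, channel_hz) / generated_hz


# 48 kHz sent to an 8 kHz phone line: five of every six samples are dropped.
print(delivered_rate_hz(48_000, 8_000), discarded_fraction(48_000, 8_000))

# 8 kHz sent to the same line: nothing is wasted.
print(delivered_rate_hz(8_000, 8_000), discarded_fraction(8_000, 8_000))
```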

Common questions

What sample rate should I use for TTS output?

8 kHz for telephony, 16 to 24 kHz for everything else. The phone network caps you at 8 kHz, so anything higher is downsampled away before it reaches the caller. For web and app playback, 16 to 24 kHz already covers all of speech. The jump to 44.1 or 48 kHz only adds range above the voice, which matters for music but not for speech.

Is a higher sample rate always better quality?

No. Nearly all speech energy sits below about 8 kHz, which a 16 kHz sample rate already captures, so a bigger number adds little the listener can hear, and in streaming the extra bytes add latency. If the channel is a phone line, the network downsamples to 8 kHz regardless of what you generate.

Which format is best for a streaming voice agent?

Raw PCM chunks or Opus, because both start playing with minimal delay: PCM needs no decode step, and Opus compresses while staying fast. For a phone agent specifically, generate 8 kHz μ-law or A-law directly, which matches the network and skips a lossy conversion you would otherwise pay for.

Should I store TTS output as MP3 or WAV?

MP3 (or Opus). Compression cuts file size with no audible loss for speech, which matters when you are saving thousands of clips and no one is waiting on the first byte. Keep WAV/PCM or lossless only when you will process the audio further and cannot accept generational loss from re-compression.

References

  1. ITU-T (1988). Pulse code modulation (PCM) of voice frequencies. ITU-T Recommendation G.711.
  2. TMS320C6713 DSK Implementation of G.711 Coded VoIP Signal. Academia.edu.
  3. Improvement of a band extension technique for G.711 telephony speech by using steganography. IEEE Xplore.
  4. Monson, B. B., Hunter, E. J., Lotto, A. J., & Story, B. H. (2014). The perceptual significance of high-frequency energy in the human voice. Frontiers in Psychology, 5, 587.
  5. What sampling rate is necessary for speech recognition? Amivoice.com (2022).
  6. For normal speech, isn't the conventional 44.1 kHz overkill? Reddit (2022).
  7. Towards high-quality and efficient speech bandwidth extension with parallel amplitude and phase prediction. IEEE Xplore (2024).
  8. Introduction to the special issue on perception and production of sounds in the high-frequency range of human speech. AIP Publishing.
  9. Shannon, C. E. (1949). Communication in the Presence of Noise. Proceedings of the IRE.
  10. TTS API audio quality lower than Playground, even with PCM/WAV. OpenAI Community (2025).
  11. Valin, J. M., Vos, K., & Terriberry, T. (2012). Definition of the Opus Audio Codec. IETF RFC 6716.
  12. SoundStream: An end-to-end neural audio codec. IEEE Xplore (2021).
  13. A Comparative Study of Audio Compression Algorithms and Their Effects on Automated Speech Recognition Accuracy. project-archive.inf.ed.ac.uk.
  14. Digital audio basics: audio sample rate and bit depth. iZotope (2025).
  15. HighRateMOS: sampling-rate aware modeling for speech quality assessment. arXiv (2025).
  16. On the influence of best-effort network conditions on the perceived speech quality of VoIP connections. IEEE Xplore.
  17. High-fidelity audio compression with improved RVQGAN. Advances in Neural Information Processing Systems (NeurIPS) 2023.