Streaming TTS: why time-to-first-audio decides UX

Two text-to-speech systems can be equally fast at producing a five-second clip and still feel completely different to talk to. One waits until the whole clip is ready, then plays it. The other starts speaking after a couple of hundred milliseconds and generates the rest while you are already listening. The total work and the final audio are the same, but the experiences are opposite, because a person in a conversation reacts to the silence before the voice begins, not to the length of the clip. Cutting that silence is the purpose of streaming TTS, and time-to-first-audio (TTFA) measures whether you succeeded.

Time to first audio

People take turns in conversation on a tight clock. We begin replying within a few hundred milliseconds of the other person stopping, often before we have finished planning the sentence.^[1]^[2]^[3] A voice that matches that rhythm feels responsive, while one that pauses a second and a half before its first word reads as slow or broken.^[4] A listener on the other end of a call does not care that the synthesizer is busy; they hear a person who hesitated. The same effect applies to TTS voices on their own and to whole systems, where it shapes the voice agent's latency budget.

It is easy to optimize the wrong number. A system that synthesizes a whole sentence in 300 milliseconds looks excellent on a benchmark, but if it must finish all 300 milliseconds of work before playing any of it, the user hears 300 milliseconds of dead air first. A streaming system that takes longer overall but emits its first chunk in 150 milliseconds feels twice as fast, because the silence is half as long and the user is listening while the work continues. Total synthesis speed is throughput; time-to-first-audio tracks how responsive a conversation feels.
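The difference between the two numbers can be made concrete with a small measurement sketch. Everything in it is an illustrative assumption: `FakeClock`, `batch_synth`, and `streaming_synth` are hypothetical stand-ins for a real synthesizer, the 300 and 150 millisecond figures are the ones from the example in the text, and the 450 millisecond streaming total is an invented value standing for "longer overall".

```python
from typing import Callable, Iterable, Iterator, Optional, Tuple


class FakeClock:
    """Simulated clock in milliseconds, so the example runs instantly."""

    def __init__(self) -> None:
        self.now = 0.0

    def advance(self, ms: float) -> None:
        self.now += ms

    def __call__(self) -> float:
        return self.now


def batch_synth(clock: FakeClock) -> Iterator[bytes]:
    # Hypothetical batch engine: nothing is returned until the clip is done.
    clock.advance(300)
    yield b"\x00" * 3


def streaming_synth(clock: FakeClock) -> Iterator[bytes]:
    # Hypothetical streaming engine: slower overall (450 ms, an assumed
    # figure), but the first chunk is ready after 150 ms.
    for _ in range(3):
        clock.advance(150)
        yield b"\x00"


def measure(
    chunks: Iterable[bytes], clock: Callable[[], float]
) -> Tuple[Optional[float], float]:
    """Return (time_to_first_audio_ms, total_ms) for a stream of audio chunks."""
    start = clock()
    first: Optional[float] = None
    for chunk in chunks:
        if first is None and chunk:
            first = clock() - start  # the only number the listener feels
    return first, clock() - start


batch_clock, stream_clock = FakeClock(), FakeClock()
batch_ttfa, batch_total = measure(batch_synth(batch_clock), batch_clock)
stream_ttfa, stream_total = measure(streaming_synth(stream_clock), stream_clock)
print(batch_ttfa, batch_total)    # 300.0 300.0
print(stream_ttfa, stream_total)  # 150.0 450.0
```

Against a real engine, `clock` would be something like `time.monotonic`, and the first timestamp would ideally be taken where audio reaches the playback device, since that is the moment the user hears.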

Text-input and audio-output streaming

"Streaming TTS" names two distinct things, and conversational systems need both.

Output streaming is the obvious one: text goes in, audio comes back in chunks as it is generated, and playback starts early. This is what TTFA measures.

Input streaming matters just as much. In a voice agent, the text is not sitting ready; a language model produces it word by word. A TTS that consumes text incrementally can start synthesizing the beginning of the reply while the language model is still writing the end. A TTS that needs the complete text first makes the system wait for the whole reply before any audio begins, throwing away every millisecond the language model spent streaming. The fastest voice agents pipe partial text straight into a TTS that accepts it, so the stages overlap instead of running one after another.^[7]
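The overlap between the two stages can be sketched with two cooperating tasks and a queue. This is a minimal illustration, not any vendor's API: `fake_llm` and `fake_tts` are hypothetical stand-ins, the tokens are made up, and the "audio" is just the encoded text. The point is only the shape of the pipeline, in which the TTS side starts consuming before the language model side has finished.

```python
import asyncio
from typing import List, Optional, Tuple

TOKENS = ["Sure. ", "The ", "train ", "leaves ", "at ", "nine."]
Log = List[Tuple[str, str]]


async def fake_llm(queue: "asyncio.Queue[Optional[str]]", log: Log) -> None:
    # Hypothetical language model: emits the reply one token at a time.
    for token in TOKENS:
        await asyncio.sleep(0)  # yield control, as a network read would
        log.append(("llm", token))
        await queue.put(token)
    log.append(("llm", "<done>"))
    await queue.put(None)  # sentinel: no more text


async def fake_tts(queue: "asyncio.Queue[Optional[str]]", log: Log) -> List[bytes]:
    # Hypothetical input-streaming TTS: consumes text as soon as it arrives.
    audio: List[bytes] = []
    while (token := await queue.get()) is not None:
        log.append(("tts", token))
        audio.append(token.encode())  # stand-in for synthesized audio
    return audio


async def run_pipeline() -> Tuple[Log, List[bytes]]:
    log: Log = []
    queue: "asyncio.Queue[Optional[str]]" = asyncio.Queue()
    _, audio = await asyncio.gather(fake_llm(queue, log), fake_tts(queue, log))
    return log, audio


if __name__ == "__main__":
    events, _ = asyncio.run(run_pipeline())
    for who, what in events:
        print(who, repr(what))
```

In the printed log, `tts` entries appear before the `llm <done>` entry, which is the overlap the paragraph describes. A TTS that required complete text would instead log nothing until after `<done>`.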

	Batch TTS	Streaming TTS
First audio	After the whole clip is done	After the first chunk
Text input	Complete text required	Accepts partial text as it arrives
Interruption	The clip plays or is cut off	Cancels cleanly mid-stream
Prosody context	The whole utterance	A chunk, plus whatever lookahead it waited for

The same synthesis, two contracts. Everything a conversation cares about lives in the right column.

Latency and prosody trade-offs

Streaming would be free if a voice could begin speaking from the first phoneme with no context, but it cannot, because of prosody. To place the stress, pitch contour, and pauses correctly, the model needs to know where the sentence is going. A question ends differently from a statement, and the emphasis in a list depends on the whole list. Committing to the first word with no lookahead leaves the intonation wrong by the time the sentence resolves.^[5]

Every streaming TTS picks a point on a curve. Starting sooner, with less text in hand, lowers TTFA but risks degrading the prosody. Waiting for more text, up to a natural boundary such as the end of a clause or sentence, improves the prosody but delays the first sound. The usual compromise is to chunk at sentence or phrase boundaries: gather enough text to read one unit well, start speaking it, and generate the next unit while the first plays. The unit is small enough to start fast and complete enough to sound intentional.
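Chunking at sentence boundaries can be sketched as a small generator that buffers incoming text and releases one sentence at a time. This is a rough heuristic under stated assumptions, not how any particular engine does it: the function name `chunk_at_boundaries` is hypothetical, the boundary rule is simply "sentence-final punctuation followed by whitespace", and it will split wrongly on abbreviations such as "Dr. Smith".

```python
import re
from typing import Iterable, Iterator, List

# Whitespace that directly follows ., ! or ? is treated as a sentence boundary.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def chunk_at_boundaries(fragments: Iterable[str]) -> Iterator[str]:
    """Buffer streamed text and yield it one complete sentence at a time."""
    buffer = ""
    for fragment in fragments:
        buffer += fragment
        while True:
            match = SENTENCE_END.search(buffer)
            if match is None:
                break  # no complete sentence yet, wait for more text
            yield buffer[: match.start()]
            buffer = buffer[match.end():]
    if buffer.strip():
        yield buffer.strip()  # flush whatever is left when the stream ends


streamed = ["Sure", ". The train", " leaves at nine", ". Anything", " else?"]
sentences: List[str] = list(chunk_at_boundaries(streamed))
print(sentences)
# ['Sure.', 'The train leaves at nine.', 'Anything else?']
```

Note the cost hidden in the rule: a sentence is released only once the whitespace after its final punctuation has arrived, so the chunker always waits for a little of the next unit. Splitting at commas or other phrase boundaries as well would start sooner at some risk to prosody, which is the same curve described above.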

Continuous audio delivery

A fast start is necessary but not sufficient. Once playback begins, the audio must keep arriving faster than it is consumed, or the listener hears a gap. This is an underrun, the audio equivalent of a video stall. The client guards against it with a small playback buffer: it holds a little audio in reserve so a brief hiccup in generation or the network does not become an audible break. But the buffer is itself a form of latency, since audio sitting in it is audio you are not yet playing, so it is kept as small as the connection's reliability allows. A streaming voice is always trading an early start against a safe reserve.
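The trade between an early start and a safe reserve can be illustrated with a small playback simulation. The function name `simulate_playback`, the 100 millisecond chunk length, and the arrival times are all invented for the example; the model also assumes every chunk has the same duration and that playback resumes the instant a late chunk arrives.

```python
from typing import Sequence, Tuple


def simulate_playback(
    arrival_ms: Sequence[float], chunk_ms: float, prebuffer_chunks: int
) -> Tuple[float, float]:
    """Return (start_ms, stall_ms) for a stream of equal-length audio chunks.

    start_ms is when playback begins (after `prebuffer_chunks` have arrived);
    stall_ms is the total audible silence caused by underruns.
    """
    if not 1 <= prebuffer_chunks <= len(arrival_ms):
        raise ValueError("prebuffer_chunks must be between 1 and len(arrival_ms)")
    start = arrival_ms[prebuffer_chunks - 1]
    play_head = start  # time at which the next chunk is due to start playing
    stall = 0.0
    for arrived in arrival_ms:
        if arrived > play_head:
            # Underrun: the chunk was not there when the player needed it.
            stall += arrived - play_head
            play_head = arrived
        play_head += chunk_ms
    return start, stall


# Hypothetical arrival times with one hiccup before the fourth chunk.
arrivals = [150, 230, 310, 520, 560, 640]
eager = simulate_playback(arrivals, chunk_ms=100, prebuffer_chunks=1)
cautious = simulate_playback(arrivals, chunk_ms=100, prebuffer_chunks=2)
print(eager)     # (150, 70.0)
print(cautious)  # (230, 0.0)
```

With these made-up numbers, holding one extra chunk in reserve delays the first sound by 80 milliseconds and removes a 70 millisecond gap mid-sentence. Whether that is a good exchange depends on how often the connection hiccups, which is why the buffer is sized to the connection's reliability.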

Streaming requirements for interruption

A streaming voice produces audio in small pieces rather than one indivisible clip, so it can be stopped cleanly partway through. When a user interrupts, or barges in, the system cancels the rest of the synthesis and playback almost immediately, instead of talking over the user until a pre-generated clip finishes.^[6] This is why streaming is a prerequisite for natural turn-taking and barge-in. A voice you cannot interrupt without an awkward delay does not feel like a conversation.
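Clean cancellation can be sketched with `asyncio`, where a speaking task is cancelled at its next suspension point. The names `speak` and `demo_barge_in` are hypothetical, the chunks are placeholder strings, and the "user interrupts after three chunks" trigger is an assumption standing in for a real voice activity detector.

```python
import asyncio
from typing import List


async def speak(chunks: List[str], played: List[str]) -> None:
    """Play chunks one by one; cancellation can land at any await."""
    try:
        for chunk in chunks:
            played.append(chunk)    # stand-in for writing to the audio device
            await asyncio.sleep(0)  # stand-in for waiting on playback
    except asyncio.CancelledError:
        # A real system would stop synthesis and flush the playback
        # buffer here before letting the cancellation propagate.
        raise


async def demo_barge_in() -> List[str]:
    played: List[str] = []
    chunks = [f"chunk-{i}" for i in range(10)]
    task = asyncio.create_task(speak(chunks, played))

    # Assumed trigger: the user starts talking once three chunks have played.
    while len(played) < 3:
        await asyncio.sleep(0)

    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass  # expected: the rest of the utterance was dropped
    return played


if __name__ == "__main__":
    print(asyncio.run(demo_barge_in()))
```

Only the chunks played before the interruption are recorded; the remaining ones are never sent. With a single pre-generated clip there is no equivalent suspension point, so the only options are to let it finish or to cut the audio device off abruptly.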

Common questions

What is time-to-first-audio?

The delay between requesting speech and hearing the first sound. It is the latency the user actually feels, because they are timing the silence before the voice starts, not the length of the clip.

How is streaming TTS different from regular TTS?

Regular (batch) TTS synthesizes the entire utterance, then returns it, so the user waits for the whole thing before any audio plays. Streaming TTS emits audio in chunks as it is generated, so playback begins almost immediately. The total work is similar. The difference is whether the user waits in silence or listens while the rest is produced.

Why does a streaming voice sometimes get intonation slightly wrong?

To set pitch and emphasis, the model needs to see where the sentence is going, and streaming forces it to commit to early words before the later ones arrive. Chunking at clause or sentence boundaries is the compromise: enough context to sound intentional without waiting for the entire input.

Do I need streaming TTS for a voice agent?

Almost always. It cuts the delay before the agent starts speaking, and it lets the agent be interrupted cleanly mid-sentence. A non-streaming voice starts late and is awkward to barge in on, which breaks the feel of a real conversation.

References

[1] Levinson, S. C., & Torreira, F. (2015). Timing in turn-taking and its implications for processing models of language. Frontiers in Psychology, 6, 731.
[2] Meyer, A. S. (2023). Timing in Conversation. Journal of Cognition, 6(1), 20.
[3] Boudin, A. (2022). Interdisciplinary corpus-based approach for exploring multimodal conversational feedback. Proceedings of the 2022 International Conference on Multimodal Interaction.
[4] Song, J., Wan, N., Yang, F. Z., & Lin, W. (2026). From Perception to Cognition: How Latency Affects Interaction Fluency and Social Presence in VR Conferencing. arXiv preprint arXiv:2603.09261.
[5] Sheng et al. (2026). Prosodic Boundary-Aware Streaming Generation for LLM-Based TTS with Streaming Text Input. arXiv preprint arXiv:2603.06444.
[6] Chen, C., et al. (2025). Reinforcement learning enhanced full-duplex spoken dialogue language models for conversational interactions. OpenReview.
[7] Soniox (2026). Real-time Text-to-Speech generation. Soniox.