Streaming speech recognition: how live transcription works over WebSockets

Watch live captions closely and you will see them think. A word appears, sits for a moment, sometimes changes, then settles. The flicker comes from how the data moves: audio flows up the connection in small pieces, and the recognizer sends its best guess down just as fast, revising as more sound arrives.

Streaming recognition commonly uses a WebSocket because audio and recognition results must travel concurrently over a persistent bidirectional connection.^[3]

Cut the audio into chunks

You cannot wait for a speaker to finish before sending anything, so the first job is to slice the live microphone signal into small frames, commonly 20 to 100 milliseconds each. The exact size is a latency-versus-overhead trade-off: smaller frames mean lower latency and more messages; larger frames mean fewer messages and slightly more lag.^[1]^[2]
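The trade-off is easy to put numbers on. Here is a minimal Python sketch, assuming 16-bit mono PCM at 16 kHz; the function names are illustrative and not taken from any library.

```python
def frame_size_bytes(sample_rate_hz: int, frame_ms: int,
                     bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Bytes in one frame of raw PCM (defaults assume 16-bit mono)."""
    samples = sample_rate_hz * frame_ms // 1000
    return samples * bytes_per_sample * channels


def split_frames(pcm: bytes, frame_bytes: int) -> list[bytes]:
    """Slice a PCM buffer into fixed-size frames; the last one may be shorter."""
    return [pcm[i:i + frame_bytes] for i in range(0, len(pcm), frame_bytes)]


for frame_ms in (20, 100):
    size = frame_size_bytes(16000, frame_ms)
    print(f"{frame_ms} ms -> {size} bytes per frame, {1000 // frame_ms} messages/s")
# 20 ms -> 640 bytes per frame, 50 messages/s
# 100 ms -> 3200 bytes per frame, 10 messages/s
```

At 20 ms the client sends five times as many messages as at 100 ms, each carrying a fifth of the audio.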

In a browser, this means asking for the microphone, resampling to the rate the recognizer expects (16 kHz is typical), and letting an audio worklet hand back a steady drip of raw PCM frames. Doing that without ever dropping a sample is real engineering in its own right; for this walkthrough, assume the frames are arriving.

Send the configuration first

A streaming session begins with a handshake. Before sending audio, the client supplies the audio format, sample rate, model, and any language hints or custom vocabulary. The server needs this information to interpret everything that follows, because a stream of raw PCM bytes does not describe itself.^[3] The first message up the wire is configuration, not sound, and its shape is small:

{ "audio_format": "pcm_s16le", "sample_rate": 16000, "model": "stt-rt" }
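One way to enforce the ordering is to build the outgoing message sequence so that the configuration can only ever come first. This is a sketch, not a specific vendor's API: the three default fields are the ones shown above, while `build_config`, `outgoing_messages`, and any extra keys such as `language_hints` are hypothetical names chosen for illustration.

```python
import json
from typing import Iterable, Iterator, Union


def build_config(audio_format: str = "pcm_s16le", sample_rate: int = 16000,
                 model: str = "stt-rt", **extras) -> str:
    """Serialize the session configuration as a JSON text message."""
    config = {"audio_format": audio_format, "sample_rate": sample_rate, "model": model}
    # Extra keys (language hints, custom vocabulary) are passed through as-is.
    # Their names depend on the service; none are assumed here.
    config.update(extras)
    return json.dumps(config)


def outgoing_messages(config_json: str,
                      frames: Iterable[bytes]) -> Iterator[Union[str, bytes]]:
    """Yield the configuration as the first message, then the audio frames."""
    yield config_json
    yield from frames


# Hypothetical extra field, shown only to illustrate the pass-through.
messages = list(outgoing_messages(build_config(language_hints=["en"]), [b"\x00\x00"]))
print(type(messages[0]).__name__, type(messages[1]).__name__)  # str bytes
```

Sending the configuration as text and the audio as binary lets the server tell the two apart without inspecting the payload, assuming the service follows that convention.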

Stream at the rate audio is captured

Send frames at the rate the audio is captured, which is to say in real time. Dumping a complete file into the socket at once is not streaming; it is an upload wearing a costume. Chunks that arrive late are the opposite problem: they leave the recognizer idling with nothing to transcribe.^[3]
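A live microphone paces itself, but a client replaying recorded audio has to do the pacing by hand. The sketch below is one way to do it in Python with asyncio. It assumes `send` is whatever coroutine pushes one frame onto the socket; `send_paced` is an illustrative name, and the `clock` and `sleep` parameters exist only so the timing can be swapped out in tests.

```python
import asyncio
import time
from typing import Awaitable, Callable, Iterable


async def send_paced(
    frames: Iterable[bytes],
    frame_ms: int,
    send: Callable[[bytes], Awaitable[None]],
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> None:
    """Send each frame no earlier than real-time capture would have produced it."""
    start = clock()
    for index, frame in enumerate(frames):
        # Frame N is due N * frame_ms after the first frame. Scheduling against
        # the start time avoids the drift that fixed per-frame sleeps accumulate.
        due = start + index * frame_ms / 1000.0
        delay = due - clock()
        if delay > 0:
            await sleep(delay)
        await send(frame)
```

If a send runs long, the next frame goes out immediately rather than waiting a full interval, so the stream catches up to real time instead of falling steadily behind.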

Receive words as they finalize

This is what makes streaming feel alive. The server sends results continuously, each marked provisional or settled. Provisional (partial) words are the recognizer's current best guess and may still change. Settled (final) words will not.^[5] A single message down the wire can carry both at once:

{ "tokens": [ { "text": "hello", "is_final": true }, { "text": "wor", "is_final": false } ] }

Render finals in solid text and partials in a lighter style, then promote the partials as they lock. Why text rewrites itself is covered in partial vs final results. Short version: the recognizer is allowed to change its mind until it commits.
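On the client, that promotion logic can be very small. The following Python sketch assumes two things the message format above does not guarantee: that each message carries the complete current set of partial tokens (so older partials can simply be replaced), and that tokens are whole words that can be joined with spaces. `Transcript` is an illustrative name, and the square brackets stand in for the lighter visual style.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Transcript:
    finals: list[str] = field(default_factory=list)    # settled, never rewritten
    partials: list[str] = field(default_factory=list)  # current best guess

    def apply(self, raw_message: str) -> None:
        """Fold one server message into the transcript."""
        tokens = json.loads(raw_message).get("tokens", [])
        # Final tokens are appended once and kept.
        self.finals.extend(t["text"] for t in tokens if t["is_final"])
        # Assumption: partials in a new message replace all earlier partials.
        self.partials = [t["text"] for t in tokens if not t["is_final"]]

    def render(self) -> str:
        """Plain-text view: finals as-is, partials wrapped in brackets."""
        parts = list(self.finals)
        if self.partials:
            parts.append("[" + " ".join(self.partials) + "]")
        return " ".join(parts)


transcript = Transcript()
transcript.apply('{ "tokens": [ { "text": "hello", "is_final": true }, '
                 '{ "text": "wor", "is_final": false } ] }')
print(transcript.render())  # hello [wor]
transcript.apply('{ "tokens": [ { "text": "world", "is_final": true } ] }')
print(transcript.render())  # hello world
```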

flowchart LR
    A[Mic frames] --> B[WebSocket up]
    B --> C[Recognizer]
    C --> D[Tokens down]
    D --> A

The streaming loop: audio flows up in frames, words flow down continuously, and the connection stays open the whole time.

End the turn cleanly

The speaker stops. Two things happen: the system decides the turn is over (see endpoint detection), and the client tells the server that no more audio is coming, so the server can flush its last words and close the session. If you skip that explicit goodbye, you hit one of streaming's most common bugs: the final word or two never arrives, because the server is still waiting patiently for audio that will never come.^[6]
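The fix is to send the goodbye and then keep reading. The Python sketch below rests on labeled assumptions: `send` and `receive` are coroutines wrapping the socket, `receive` returns `None` once the server has closed, and the end-of-audio signal is an empty binary message. The real signal varies by service, so treat `end_signal` as a placeholder to check against your provider's documentation.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def finish_turn(
    send: Callable[[bytes], Awaitable[None]],
    receive: Callable[[], Awaitable[Optional[str]]],
    end_signal: bytes = b"",   # placeholder; the actual signal is service-specific
    timeout_s: float = 5.0,
) -> list[str]:
    """Announce end of audio, then collect every message until the server closes."""
    await send(end_signal)
    tail: list[str] = []
    while True:
        try:
            message = await asyncio.wait_for(receive(), timeout=timeout_s)
        except asyncio.TimeoutError:
            break  # stop waiting instead of hanging on a silent server
        if message is None:  # assumed convention: None means the server closed
            break
        tail.append(message)
    return tail
```

The messages returned here are the ones that carry the last word or two, so they should go through the same token handling as everything received earlier.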

Benefits and requirements

The main benefit is low incremental latency. Audio and results share one open connection, avoiding a new request handshake for every utterance.^[1]^[3] That is what makes live captions and voice agents practical.

The cost is lifecycle management. A streaming client has states a batch client never does: connecting, streaming, paused, reconnecting, finalizing, closed. Networks drop and sockets time out. A resilient client keeps the session alive through pauses, reconnects without losing the in-flight utterance, and never assumes the last word it received is the last word it will get.^[4] This work separates a demo from a production system, and it is why some teams reach for async transcription when nothing is waiting on the words.
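Making those states explicit helps catch lifecycle bugs early. The Python sketch below uses the six states named above; the table of allowed transitions is an assumption made for illustration, and a real client may permit a different set.

```python
from enum import Enum


class State(Enum):
    CONNECTING = "connecting"
    STREAMING = "streaming"
    PAUSED = "paused"
    RECONNECTING = "reconnecting"
    FINALIZING = "finalizing"
    CLOSED = "closed"


# Assumed transition table: which states each state may move to.
ALLOWED = {
    State.CONNECTING: {State.STREAMING, State.RECONNECTING, State.CLOSED},
    State.STREAMING: {State.PAUSED, State.RECONNECTING, State.FINALIZING, State.CLOSED},
    State.PAUSED: {State.STREAMING, State.RECONNECTING, State.FINALIZING, State.CLOSED},
    State.RECONNECTING: {State.STREAMING, State.CLOSED},
    State.FINALIZING: {State.CLOSED},
    State.CLOSED: set(),
}


class Session:
    def __init__(self) -> None:
        self.state = State.CONNECTING

    def transition(self, new_state: State) -> None:
        """Move to new_state, or raise if the move is not in the table."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"illegal transition: {self.state.value} -> {new_state.value}"
            )
        self.state = new_state
```

With the table in place, sending audio on a closed session fails loudly at the transition instead of silently writing into a dead socket.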

Common questions

Why a WebSocket and not regular HTTP?

A conventional HTTP request is one round trip: send, wait, receive, done. Streaming needs a connection that stays open so audio flows up and words flow down for as long as the speaker talks. A WebSocket is that full-duplex, long-lived channel.

How fast should I send audio chunks?

At the rate audio is captured: real time. Faster does not make words appear sooner and can overwhelm buffers. Slower starves the recognizer and adds lag.

Why do the captions change after they appear?

Early words are provisional. The recognizer shows its best guess immediately, revises it as later audio resolves ambiguity, then commits. See partial vs final results.

What happens if the network drops mid-sentence?

A naive client loses the utterance. A production client detects the drop, reconnects, and resumes without losing the audio buffered since the last final word. Designing for this case is most of the work in making streaming code survive a real network rather than only a clean demo.
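The buffering half of that can be sketched in a few lines of Python. The sketch assumes fixed-length frames and, more importantly, that the client can learn how far into the audio the finalized words extend, in milliseconds. Whether and how a service reports that position is not covered here, so `finalized_up_to_ms` is a hypothetical input, and `ReplayBuffer` is an illustrative name.

```python
from collections import deque


class ReplayBuffer:
    """Keep audio that has been sent but not yet covered by a final word."""

    def __init__(self, frame_ms: int) -> None:
        self.frame_ms = frame_ms
        self._frames: deque[tuple[int, bytes]] = deque()  # (start_ms, frame)
        self._next_start_ms = 0

    def append(self, frame: bytes) -> None:
        """Record a frame as it is sent."""
        self._frames.append((self._next_start_ms, frame))
        self._next_start_ms += self.frame_ms

    def trim(self, finalized_up_to_ms: int) -> None:
        """Drop frames that end at or before the last finalized position."""
        while self._frames and self._frames[0][0] + self.frame_ms <= finalized_up_to_ms:
            self._frames.popleft()

    def pending(self) -> list[bytes]:
        """Frames to resend after a reconnect, oldest first."""
        return [frame for _, frame in self._frames]


buffer = ReplayBuffer(frame_ms=20)
for frame in (b"a", b"b", b"c", b"d"):
    buffer.append(frame)
buffer.trim(finalized_up_to_ms=40)  # the first two frames are now settled
print(buffer.pending())  # [b'c', b'd']
```

After a reconnect, the client would send its configuration again, replay `pending()`, and then resume live frames.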

References

[1] Banfic, N., Fan, D., Vaishnavi, K., et al. (2026). Pushing the Limits of On-Device Streaming ASR: A Compact, High-Accuracy English Model for Low-Latency Inference. arXiv preprint arXiv:2604.14493.
[2] Wang, C., Wu, Y., Liu, S., et al. (2020). Low Latency End-to-End Streaming Speech Recognition with a Scout Network. arXiv preprint arXiv:2003.10369.
[3] Soniox (2026). Real-time Transcription. Soniox Docs.
[4] Kasakowskij, T., & Haake, J. M. (2025). A model for near real-time voice transcription in virtual group meetings. Discover Education (Springer).
[5] Bruguier, A., Qiu, D., & He, Y. (2023). Partial Rewriting for Multi-Stage ASR. arXiv preprint arXiv:2312.09463.
[6] Chang, S.-Y., Li, B., Sainath, T. N., Simko, G., & Parada, C. (2017). Endpoint Detection Using Grid Long Short-Term Memory Networks for Streaming Speech Recognition. Interspeech 2017.