Streaming speech recognition

Audio transport, partial results, finalization, and recovery

Updated June 29, 2026

Watch live captions closely and you will see them think. A word appears, sits for a moment, sometimes changes, then settles. The flicker comes from how the data moves: audio flows up the connection in small pieces, and the recognizer sends its best guess down just as fast, revising as more sound arrives.

Streaming recognition commonly uses a WebSocket because audio and recognition results must travel concurrently over a persistent bidirectional connection.[3]

Cut the audio into chunks

You cannot wait for a speaker to finish before you start sending, so the first job is to slice the live microphone signal into small frames, commonly 20 to 100 milliseconds each. The exact size is a latency-versus-overhead trade-off: smaller frames mean lower latency and more messages, while larger frames mean fewer messages and slightly more lag.[1][2]
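To make the trade-off concrete, here is the arithmetic as a small sketch. It assumes 16-bit mono PCM (2 bytes per sample), and `frameStats` is a hypothetical helper name, not part of any API.

```javascript
// Frame-size arithmetic, assuming 16-bit mono PCM unless told otherwise.
function frameStats(sampleRate, frameMs, bytesPerSample = 2) {
  const samplesPerFrame = Math.round((sampleRate * frameMs) / 1000);
  return {
    samplesPerFrame,
    bytesPerFrame: samplesPerFrame * bytesPerSample,
    messagesPerSecond: 1000 / frameMs,
  };
}

console.log(frameStats(16000, 20));
// { samplesPerFrame: 320, bytesPerFrame: 640, messagesPerSecond: 50 }
console.log(frameStats(16000, 100));
// { samplesPerFrame: 1600, bytesPerFrame: 3200, messagesPerSecond: 10 }
```

At 16 kHz, moving from 20 ms to 100 ms frames cuts the message count by a factor of five and adds up to 80 ms before each frame can leave the client.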

// Browser: get the mic and hand raw PCM frames to a worklet
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const ctx = new AudioContext({ sampleRate: 16000 });
const src = ctx.createMediaStreamSource(stream);
await ctx.audioWorklet.addModule("pcm-worklet.js");
const worklet = new AudioWorkletNode(ctx, "pcm-worklet");
src.connect(worklet); // worklet posts ~20ms PCM frames back to the main thread
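The snippet above leaves the contents of pcm-worklet.js open. As a rough sketch of the logic such a worklet might contain: the Web Audio engine hands a processor small Float32 blocks (typically 128 samples), so something has to convert them to 16-bit integers and regroup them into fixed-size frames. `floatToInt16` and `PcmFramer` are hypothetical names, and the code is written as plain JavaScript so it can be tried outside a worklet.

```javascript
// Convert Float32 samples in [-1, 1] to 16-bit signed integers.
function floatToInt16(samples) {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i])); // clamp to avoid wraparound
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out;
}

// Regroup small audio blocks into fixed-size frames.
class PcmFramer {
  constructor(samplesPerFrame) {
    this.frame = new Int16Array(samplesPerFrame);
    this.filled = 0; // samples currently held in the partial frame
  }

  // Add one block; returns the frames it completed (possibly none).
  push(float32Block) {
    const pcm = floatToInt16(float32Block);
    const frames = [];
    let offset = 0;
    while (offset < pcm.length) {
      const n = Math.min(pcm.length - offset, this.frame.length - this.filled);
      this.frame.set(pcm.subarray(offset, offset + n), this.filled);
      this.filled += n;
      offset += n;
      if (this.filled === this.frame.length) {
        frames.push(this.frame.slice()); // copy so the buffer can be reused
        this.filled = 0;
      }
    }
    return frames;
  }
}
```

Inside a real worklet, `push` would be called from the processor's `process()` method with the first input channel, and each completed frame posted to the main thread. With 320-sample frames (20 ms at 16 kHz) and 128-sample blocks, a frame completes on roughly every second or third call.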

Doing this without dropping samples is its own topic, covered in capturing microphone audio in the browser. Assume frames are arriving.

Send the configuration first

A streaming session begins with a handshake. Before sending audio, the client supplies the audio format, sample rate, model, and any language hints or custom vocabulary. The server needs this information to interpret the audio stream.[3]

const ws = new WebSocket("wss://stt.example.com/transcribe");
ws.onopen = () => {
  ws.send(JSON.stringify({
    audio_format: "pcm_s16le",
    sample_rate: 16000,
    model: "stt-rt",
  }));
};
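Microphone frames can start arriving before the socket has opened, and the configuration has to reach the server ahead of any audio. One way to enforce that ordering is a small gate in front of the socket. This is a sketch under assumptions: `OrderedSender` is a hypothetical helper, the 50-frame cap is arbitrary, and dropping the oldest queued frame on overflow is one policy among several.

```javascript
// Guarantee "config first, then audio", even if frames arrive before the socket opens.
class OrderedSender {
  constructor(socket, config, maxQueuedFrames = 50) {
    this.socket = socket;
    this.config = config;
    this.maxQueuedFrames = maxQueuedFrames;
    this.queue = [];   // frames captured before the socket was ready
    this.ready = false;
  }

  // Call from the socket's open handler.
  onOpen() {
    this.socket.send(JSON.stringify(this.config)); // configuration goes first
    for (const frame of this.queue) this.socket.send(frame);
    this.queue = [];
    this.ready = true;
  }

  // Call for every captured frame.
  sendFrame(frame) {
    if (this.ready) {
      this.socket.send(frame);
      return;
    }
    if (this.queue.length >= this.maxQueuedFrames) this.queue.shift(); // drop oldest
    this.queue.push(frame);
  }
}

// Possible wiring with the objects from the snippets above:
// const sender = new OrderedSender(ws, { audio_format: "pcm_s16le", sample_rate: 16000, model: "stt-rt" });
// ws.onopen = () => sender.onOpen();
// worklet.port.onmessage = (e) => sender.sendFrame(e.data);
```

The queued frames go out in one short burst on open, which is briefly faster than real time. With 20 ms frames, a 50-frame cap bounds that burst to one second of audio.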

Stream at the rate audio is captured

Send frames at approximately the rate at which the audio is captured. Pushing a complete file through the socket at once is not live streaming and can overwhelm buffers, while chunks that arrive late leave the recognizer with nothing new to work on.[3]

worklet.port.onmessage = (e) => {
  if (ws.readyState === WebSocket.OPEN) ws.send(e.data); // raw PCM bytes
};
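A live microphone paces itself, but audio read from a file or a test fixture does not. A minimal pacing sketch, assuming fixed-duration chunks: each chunk is scheduled against the stream's start time rather than the previous send, so timer jitter does not accumulate into drift. `delayBeforeChunk` and `sendPaced` are hypothetical helpers, and `send`, `sleep`, and `now` are injected so the loop can be exercised without a real socket or clock.

```javascript
// Milliseconds to wait before sending chunk `index` so the stream tracks real time.
function delayBeforeChunk(index, frameMs, startedAtMs, nowMs) {
  const dueAtMs = startedAtMs + index * frameMs; // chunk i is due i * frameMs after the first
  return Math.max(0, dueAtMs - nowMs);           // already late: send immediately
}

// Send chunks at the rate they represent.
async function sendPaced(
  chunks,
  frameMs,
  send,
  sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
  now = Date.now
) {
  const startedAtMs = now();
  for (let i = 0; i < chunks.length; i++) {
    await sleep(delayBeforeChunk(i, frameMs, startedAtMs, now()));
    send(chunks[i]);
  }
}

// Possible use with the socket from the snippets above:
// await sendPaced(chunks, 20, (chunk) => ws.send(chunk));
```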

Receive words as they finalize

This is what makes streaming feel alive. The server sends results continuously, each marked provisional or settled. Provisional (partial) words are the recognizer's current best guess and may still change. Settled (final) words will not.[5] Render finals in solid text and partials in a lighter style, then promote the partials as they lock.

ws.onmessage = (msg) => {
  // Some messages may carry no tokens (an error, for example), so default to an empty list.
  const { tokens = [] } = JSON.parse(msg.data);
  for (const t of tokens) {
    if (t.is_final) commit(t.text);   // will not change
    else preview(t.text);             // may still change
  }
};
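The `commit` and `preview` helpers above are left undefined. One way they might be backed is a small view model that appends finals and rebuilds the preview on every message. This sketch rests on two assumptions that vary by provider: each final token is delivered once, and every message carries the complete current set of provisional tokens. It also assumes tokens include their own spacing. `TranscriptView` is a hypothetical name.

```javascript
class TranscriptView {
  constructor() {
    this.finalText = "";   // settled words, append-only
    this.partialText = ""; // current best guess, replaced on every message
  }

  // Apply one message's tokens and return what to render.
  apply(tokens) {
    let partial = "";
    for (const t of tokens) {
      if (t.is_final) this.finalText += t.text; // settled: append once
      else partial += t.text;                   // provisional: rebuilt each time
    }
    this.partialText = partial; // replaces the previous preview entirely
    return { final: this.finalText, partial: this.partialText };
  }
}

const view = new TranscriptView();
view.apply([{ text: "I ", is_final: false }, { text: "scream", is_final: false }]);
// final: "", partial: "I scream"
const state = view.apply([{ text: "Ice ", is_final: true }, { text: "cream", is_final: false }]);
console.log(state); // { final: 'Ice ', partial: 'cream' }
```

Render `final` in solid text and `partial` in the lighter style. The earlier guess "I scream" disappears without any explicit delete because the preview is rebuilt from scratch each time.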

Why text rewrites itself is covered in partial vs final results. Short version: the recognizer is allowed to change its mind until it commits.

flowchart LR
  A[Mic frames] --> B[WebSocket up]
  B --> C[Recognizer]
  C --> D[Tokens down]
  D --> A
The streaming loop: audio flows up in frames, words flow down continuously, and the connection stays open the whole time.

End the turn cleanly

The speaker stops. Two things happen: the system decides the turn is over (see endpoint detection), and the client signals that no more audio is coming so the server can flush its last words and close. Skip the explicit end and you hit a common bug, where the final word or two never arrives because the server is still waiting for audio that will never come.[6]

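function finish() {
  ws.send(JSON.stringify({ type: "finalize" })); // ask for last finals
  // Do not close right away: a socket that is closing stops delivering
  // messages, so the last finals would be lost. Let the server close after
  // it flushes, and keep a fallback timeout (5 s here is an arbitrary choice).
  setTimeout(() => ws.close(), 5000); // no-op if the socket is already closed
}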

Benefits and requirements

The main benefit is low incremental latency. Audio and results share one open connection, avoiding a new request handshake for every utterance.[1][3] That is what lets live captions and voice agents keep pace with the speaker.

The cost is lifecycle management. A streaming client has states a batch client never does: connecting, streaming, paused, reconnecting, finalizing, closed. Networks drop and sockets time out. A resilient client keeps the session alive through pauses, reconnects without losing the in-flight utterance, and never assumes the last word it received is the last word it will get.[4] This work separates a demo from a production system, and it is why some teams reach for async transcription when nothing is waiting on the words.
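Making those states explicit tends to catch lifecycle bugs early, such as sending audio on a socket that is already finalizing. A minimal sketch follows. The state names come from the paragraph above, but the transition table is an assumption about one reasonable client, and `SessionState` is a hypothetical name.

```javascript
// Allowed transitions between client states (an assumed design, not a standard).
const TRANSITIONS = {
  connecting:   ["streaming", "reconnecting", "closed"],
  streaming:    ["paused", "reconnecting", "finalizing", "closed"],
  paused:       ["streaming", "reconnecting", "finalizing", "closed"],
  reconnecting: ["streaming", "closed"],
  finalizing:   ["closed"],
  closed:       [],
};

class SessionState {
  constructor() {
    this.state = "connecting";
  }

  canGo(next) {
    return TRANSITIONS[this.state].includes(next);
  }

  // Move to `next`, or throw if the move is not in the table.
  go(next) {
    if (!this.canGo(next)) {
      throw new Error(`illegal transition: ${this.state} -> ${next}`);
    }
    this.state = next;
    return this.state;
  }
}

const session = new SessionState();
session.go("streaming");
session.go("reconnecting"); // network dropped mid-utterance
session.go("streaming");    // resumed
session.go("finalizing");
session.go("closed");
```

Audio would be sent only while the state is `streaming`, and result handlers would stay attached through `finalizing` so late finals are still processed.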

Common questions

Why a WebSocket and not regular HTTP?

An HTTP request is one round trip: send, wait, receive, done. Streaming needs a connection that stays open so audio flows up and words flow down for as long as the speaker talks. A WebSocket is that full-duplex, long-lived channel. The trade-offs against other transports are in WebSockets vs HTTP for audio.

How fast should I send audio chunks?

At the rate audio is captured: real time. Faster does not make words appear sooner and can overwhelm buffers. Slower starves the recognizer and adds lag.

Why do the captions change after they appear?

Early words are provisional. The recognizer shows its best guess immediately, revises it as later audio resolves ambiguity, then commits. See partial vs final results.

What happens if the network drops mid-sentence?

A naive client loses the utterance. A production client detects the drop, reconnects, and resumes without losing the audio buffered since the last final word. Designing for this case is most of the work in making streaming code survive a real network rather than only a clean demo.
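The buffering half of that answer can be sketched as a small replay buffer: keep every sent frame until a final word covers it, and resend whatever remains after a reconnect. This assumes fixed-duration frames and that final tokens carry an end timestamp in milliseconds from the start of the stream. That timestamp field is hypothetical and differs between providers, and `ReplayBuffer` is an illustrative name.

```javascript
class ReplayBuffer {
  constructor(frameMs) {
    this.frameMs = frameMs;
    this.frames = [];      // sent frames not yet covered by a final word
    this.firstStartMs = 0; // stream time at which frames[0] begins
  }

  // Record every frame as it is sent.
  push(frame) {
    this.frames.push(frame);
  }

  // Drop frames that end at or before the end of the last final word.
  ackFinal(endMs) {
    const covered = Math.floor((endMs - this.firstStartMs) / this.frameMs);
    const drop = Math.max(0, Math.min(covered, this.frames.length));
    this.frames.splice(0, drop);
    this.firstStartMs += drop * this.frameMs;
  }

  // Frames to resend after a reconnect, oldest first.
  pending() {
    return this.frames.slice();
  }
}

const replay = new ReplayBuffer(20);
["f0", "f1", "f2", "f3", "f4"].forEach((f) => replay.push(f)); // 0 to 100 ms of audio
replay.ackFinal(50);           // a final word ended at 50 ms
console.log(replay.pending()); // [ 'f2', 'f3', 'f4' ]
```

A frame that only partly overlaps the final word is kept, which errs toward resending a little audio over losing any. Two details are left out here: a cap on buffer growth during long stretches without finals, and offsetting timestamps if the server restarts its clock on the new connection.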

References

  1. Banfic, N., Fan, D., Vaishnavi, K., et al. (2026). Pushing the Limits of On-Device Streaming ASR: A Compact, High-Accuracy English Model for Low-Latency Inference. arXiv preprint arXiv:2604.14493.
  2. Wang, C., Wu, Y., Liu, S., et al. (2020). Low Latency End-to-End Streaming Speech Recognition with a Scout Network. arXiv preprint arXiv:2003.10369.
  3. Soniox (2026). Real-time Transcription. Soniox Docs.
  4. Kasakowskij, T., & Haake, J. M. (2025). A model for near real-time voice transcription in virtual group meetings. Discover Education (Springer).
  5. Bruguier, A., Qiu, D., & He, Y. (2023). Partial Rewriting for Multi-Stage ASR. arXiv preprint arXiv:2312.09463.
  6. Chang, S.-Y., Li, B., Sainath, T. N., Simko, G., & Parada, C. (2017). Endpoint Detection Using Grid Long Short-Term Memory Networks for Streaming Speech Recognition. Interspeech 2017.