Real-time vs async transcription

Choosing between immediate output and complete-recording processing

Updated June 29, 2026

Many projects pick the wrong mode on day one, and the symptom is always the same. Someone builds a WebSocket streaming pipeline, complete with partial results, endpointing, and reconnection logic, to caption a folder of yesterday's recordings. Or someone drives live closed captions off a system designed to hand back a file twenty seconds after the meeting ends.

Both run, but each is the wrong tool for its job, and you pay in either needless complexity or unacceptable lag.

Choosing between real-time and asynchronous transcription

Ask whether a person or a process is blocked on the words as they are spoken.

Three cases where the answer is yes: a caption that has to appear while the speaker is mid-sentence, a voice agent deciding when to answer, a doctor watching the note form as they dictate. All three are blocked, so all three are real-time. Now consider a recorded podcast, a stack of call recordings, or a video that needs subtitles. Nothing is blocked, because the recording already exists in full. You want the best transcript you can get, and spending a few seconds or minutes on processing costs you nothing.

That second case is async, and it has an advantage people forget: it can use audio that arrives after a word to decide how to transcribe that word.
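The question above is small enough to write down as a decision helper. This is only a sketch: the `Workload` fields and the `choose_mode` function are illustrative names invented here, not part of any transcription API.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    REAL_TIME = "real-time"
    ASYNC = "async"


@dataclass
class Workload:
    # Hypothetical description of a job; the field names are illustrative.
    name: str
    blocked_on_live_words: bool  # is a person or process waiting on words as they are spoken?


def choose_mode(workload: Workload) -> Mode:
    """Apply the single question from the text: is anything blocked on the words?"""
    return Mode.REAL_TIME if workload.blocked_on_live_words else Mode.ASYNC


if __name__ == "__main__":
    jobs = [
        Workload("live captions", blocked_on_live_words=True),
        Workload("voice agent", blocked_on_live_words=True),
        Workload("podcast archive", blocked_on_live_words=False),
        Workload("video subtitles", blocked_on_live_words=False),
    ]
    for job in jobs:
        print(f"{job.name}: {choose_mode(job).value}")
```

Note that the helper takes no "is it being recorded?" input. Whether a file is saved along the way does not change the answer.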

Accuracy advantages of asynchronous transcription

A real-time recognizer commits to words before the speaker finishes the sentence. It guesses "two" before it has heard whether the next word is "hundred" or "tomorrow." An async system has the whole recording in hand before it writes a single word, so it can use audio from ten seconds later to fix a word now. Only the batch system gets that later context.[1][2]
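A toy example makes the difference concrete. The sketch below is not how a real recognizer works: real systems are statistical, and many streaming recognizers use a short lookahead window. Here the lexicon, the sound label `"tu"`, and both decoder functions are made up for illustration, and the streaming decoder is given no lookahead at all.

```python
# Toy lexicon: one ambiguous sound, resolved by the word that FOLLOWS it.
# All entries are invented for this illustration.
DEFAULT_GUESS = {"tu": "two"}
RIGHT_CONTEXT = {
    ("tu", "hundred"): "two",
    ("tu", "tomorrow"): "to",
}


def streaming_decode(sounds):
    """Commit each word as it arrives, with no access to later audio."""
    return [DEFAULT_GUESS.get(sound, sound) for sound in sounds]


def batch_decode(sounds):
    """Decode with the whole sequence in hand, so the next token can be consulted."""
    words = []
    for i, sound in enumerate(sounds):
        following = sounds[i + 1] if i + 1 < len(sounds) else None
        fallback = DEFAULT_GUESS.get(sound, sound)
        words.append(RIGHT_CONTEXT.get((sound, following), fallback))
    return words


if __name__ == "__main__":
    heard = ["move", "it", "tu", "tomorrow"]
    print("streaming:", " ".join(streaming_decode(heard)))
    print("batch:    ", " ".join(batch_decode(heard)))
```

Both decoders agree on "tu hundred"; they differ only when the later word contradicts the early guess, which is the situation the paragraph above describes.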

The gap is small for clean speech and widest where transcription is hardest. With a quiet single speaker you will rarely notice it; on an overlapping, accented, jargon-heavy call, the extra context earns its keep.[3]

Computational and operational costs

Real-time transcription requires a streaming connection, usually a WebSocket, that carries audio in small chunks as it is captured. The client must handle partial and final results, endpoint detection, pauses, and connection recovery. These requirements do not apply to a file-based batch request.[4]
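Handling partial and final results is the part of that list that most often surprises people, so here is a minimal sketch of it. It assumes a hypothetical message shape, a dict with `text` and `is_final` keys; real services name and structure these fields differently, and the transport (WebSocket, reconnection, endpointing) is deliberately left out.

```python
class CaptionBuffer:
    """Keeps finalized text stable while the in-progress partial keeps changing."""

    def __init__(self):
        self.final_segments = []  # text the recognizer has committed to
        self.partial = ""         # latest tentative text, may still change

    def on_message(self, message: dict) -> str:
        # Assumed message shape: {"text": str, "is_final": bool}.
        text = message["text"]
        if message["is_final"]:
            self.final_segments.append(text)
            self.partial = ""
        else:
            # Each partial replaces the previous one; it is never appended.
            self.partial = text
        return self.display()

    def display(self) -> str:
        parts = list(self.final_segments)
        if self.partial:
            parts.append(self.partial)
        return " ".join(parts)


if __name__ == "__main__":
    captions = CaptionBuffer()
    for msg in [
        {"text": "hello", "is_final": False},
        {"text": "hello wor", "is_final": False},
        {"text": "hello world", "is_final": True},
        {"text": "how", "is_final": False},
    ]:
        print(captions.on_message(msg))
```

A batch client has no equivalent of this class, because it only ever receives text that is already final.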

Async buys you simplicity and charges you latency. You hand over a file or a URL, get a job back, and collect the result later by polling or, better, through a webhook. Processing runs far faster than real time: an hour of audio commonly comes back in a few minutes, depending on load and file length.[5][6] Even so, a delay of a few minutes after the recording exists is a non-starter for anything live.
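The collect-later half of that flow can be sketched as a polling loop with backoff. To avoid inventing an API, the function below takes a `get_status` callable that you would supply yourself; the status strings `"completed"` and `"failed"` and the `text` and `error` keys are assumptions, not any particular service's response format.

```python
import time
from typing import Callable


def wait_for_transcript(
    get_status: Callable[[], dict],
    *,
    initial_delay: float = 1.0,
    max_delay: float = 30.0,
    timeout: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll a job until it finishes, doubling the wait up to max_delay."""
    delay = initial_delay
    waited = 0.0
    while True:
        job = get_status()  # assumed shape: {"status": ..., "text": ...}
        if job["status"] == "completed":
            return job["text"]
        if job["status"] == "failed":
            raise RuntimeError(job.get("error", "transcription failed"))
        if waited >= timeout:
            raise TimeoutError(f"no result after {waited:.0f} seconds")
        sleep(delay)
        waited += delay
        delay = min(delay * 2, max_delay)


if __name__ == "__main__":
    # Stand-in for a real status call: two "processing" answers, then the result.
    fake = iter([
        {"status": "processing"},
        {"status": "processing"},
        {"status": "completed", "text": "hello world"},
    ])
    print(wait_for_transcript(lambda: next(fake), sleep=lambda s: None))
```

Where the service supports it, a webhook removes this loop entirely: the service calls you when the job is done, and the client code shrinks to a single request handler.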

Figure: two pipelines side by side. Real-time: Live audio chunks → Partial words now → Final words as turns close. Async: Complete recording → Process when ready → Full transcript delivered later.
The same recognizer, two delivery contracts. Real-time trades context for immediacy; async trades immediacy for context.

Comparison of transcription modes

| | Real-time | Async (batch) |
|---|---|---|
| Input | Live stream, chunk by chunk | A complete file or URL |
| First text appears | Within ~hundreds of ms | After the job finishes |
| Sees future audio? | No | Yes, the whole recording |
| Transport | Streaming (WebSocket) | Upload + poll or webhook |
| Best for | Captions, agents, live notes | Recordings, archives, subtitles |
| Engineering effort | Higher (stream lifecycle) | Lower (submit and collect) |

Common selection errors

A live recording is still live. Recording a meeting and captioning it at the same time is a real-time job that happens to be saving a file. The recording is a byproduct; the captions are the point, and they are blocked on the words.

A "real-time" demo over a recorded file is a habit, not a design choice. Streaming a finished .wav through a real-time API to watch the words scroll looks impressive, but you pay streaming complexity and give up full-context accuracy to transcribe something sitting on disk. Send it to async.

"We might need it live someday" is not a reason to build streaming now. The two modes share a model and most of a vocabulary of concepts, so moving a batch pipeline to streaming later is bounded work, not a rewrite. Build for the latency you have today.

Common questions

Is real-time transcription less accurate than async?

Slightly, for a specific reason: the real-time system commits to words before hearing the rest of the sentence, while the async system reads the whole recording first. On clean speech the difference is hard to notice. It widens on names, numbers, and overlapping or accented speech, where later audio resolves earlier words.

Can I run async transcription on a live microphone?

No. Async needs a complete recording before it starts, so it cannot transcribe audio that is still being spoken. If you need words while someone is talking, that is real-time. If the transcript can wait until the speaker is done, record the live audio and send the finished file to async afterward.
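Recording for later async processing needs nothing more than writing the captured chunks to a file. The sketch below uses Python's standard `wave` module and assumes the microphone delivers raw 16-bit mono PCM at 16 kHz; the capture source itself is left out, and `record_to_wav` is an illustrative name.

```python
import os
import tempfile
import wave
from typing import Iterable


def record_to_wav(
    chunks: Iterable[bytes],
    path: str,
    *,
    sample_rate: int = 16000,  # assumed capture format: 16 kHz
    channels: int = 1,         # mono
    sample_width: int = 2,     # 16-bit PCM
) -> float:
    """Write raw PCM chunks to a WAV file and return its duration in seconds."""
    frames = 0
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        for chunk in chunks:
            wav.writeframes(chunk)
            frames += len(chunk) // (sample_width * channels)
    return frames / sample_rate


if __name__ == "__main__":
    # Ten 100 ms chunks of silence stand in for microphone input.
    silence = [b"\x00" * 3200] * 10
    with tempfile.TemporaryDirectory() as folder:
        target = os.path.join(folder, "meeting.wav")
        seconds = record_to_wav(silence, target)
        print(f"wrote {seconds:.1f} s of audio; the file is now ready to submit")
```

Only once the file is closed does it count as a complete recording, which is the point at which an async job can start.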

Why would I ever choose real-time if async is more accurate?

Because async gives you nothing until the recording is over. Captions, voice agents, and live dictation are blocked on the next word the instant it is spoken, so a transcript that lands after the conversation ends is useless to them. Choose real-time whenever something is waiting.[7]

How fast is async transcription?

Faster than real time. Processing commonly runs several times faster than playback, so an hour of recording can return in minutes, depending on file length and current load. The delay is still measured in minutes after the file exists, which is fine for archives but fatal for anything live.[8]
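The arithmetic behind "an hour in minutes" is a single division. In the sketch below, `speed_factor` is how many seconds of audio are processed per second of wall-clock time; the value 20 in the example is an assumption chosen for illustration, not a measured figure for any service, and `queue_seconds` stands in for whatever wait the current load adds.

```python
def estimated_turnaround_seconds(
    audio_seconds: float,
    speed_factor: float,
    queue_seconds: float = 0.0,
) -> float:
    """Rough turnaround: time spent queued plus audio length divided by processing speed."""
    if speed_factor <= 0:
        raise ValueError("speed_factor must be positive")
    return queue_seconds + audio_seconds / speed_factor


if __name__ == "__main__":
    one_hour = 3600.0
    seconds = estimated_turnaround_seconds(one_hour, speed_factor=20.0)
    print(f"{seconds / 60:.0f} minutes")  # 3600 / 20 = 180 s, i.e. 3 minutes
```

Even with a generous speed factor the result is a delay after the recording ends, so the estimate is useful for sizing batch jobs, not for anything live.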

References

  1. Andrusenko, A., Bataev, V., Grigoryan, L., Tadevosyan, N., Lavrukhin, V., & Ginsburg, B. (2026). Reducing the Offline-Streaming Gap for Unified ASR Transducer with Consistency Regularization. arXiv preprint arXiv:2604.19079.
  2. Moriya, T., Mimura, M., Matsui, K., & Sato, H. (2025). Attention-Free Dual-Mode ASR with Latency-Controlled Selective State Spaces. Interspeech.
  3. Kuhn, K., Kersken, V., Reuter, B., Egger, N., & Schmidt, M. (2024). Measuring the Accuracy of Automatic Speech Recognition Solutions. ACM Transactions on Speech and Language Processing, 1–27.
  4. Raj, D., Lu, L., Chen, Z., Gaur, Y., & Li, J. (2022). Continuous Streaming Multi-Talker ASR with Dual-Path Transducers. ICASSP 2022 — IEEE International Conference on Acoustics, Speech and Signal Processing, 7767–7771.
  5. Kudlur, M., King, E., Wang, J., & Warden, P. (2026). Moonshine v2: Ergodic Streaming Encoder ASR for Latency-Critical Speech Applications. arXiv preprint arXiv:2602.12241.
  6. Banfic, N., Fan, D., Vaishnavi, K., Kemp, S., & Choi, S. (2026). Pushing the Limits of On-Device Streaming ASR: A Compact, High-Accuracy English Model for Low-Latency Inference. arXiv preprint arXiv:2604.14493.
  7. Soniox (2026). Real-time Speech-to-Text. Soniox.
  8. Soniox (2026). Async Transcription. Soniox.