Real-time vs async transcription: which one do you need?

Many projects pick the wrong mode on day one, and the mistake runs in both directions. Someone builds a WebSocket streaming pipeline, complete with partial results and reconnection logic, to caption a folder of yesterday's recordings. Or someone drives live closed captions off a system designed to hand back a file twenty seconds after the meeting ends.

Both run, but each is the wrong tool for its job, and you pay in either needless complexity or unacceptable lag.

Choosing between real-time and asynchronous transcription

Ask whether a person or a process is blocked on the words as they are spoken.

A live caption is blocked: it has to appear while the speaker is still mid-sentence. A voice agent is blocked too, since it cannot even decide when to answer without the words as they land. Jobs like these are real-time. A recorded podcast or a stack of yesterday's call recordings is the opposite situation. Nothing is blocked, because the recording already exists in full; you want the best transcript you can get, and spending a few minutes on processing costs you nothing.

The recordings are the async case, and async has an advantage people forget: it can use audio that arrives after a word to decide how to transcribe that word.
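The blocking question can be written down as a one-line rule. The sketch below is only an illustration: `Workload` and `choose_mode` are hypothetical names invented here, not part of any transcription API, and the workloads are the examples from this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    # True when a person or process needs each word while it is being spoken.
    blocked_on_live_words: bool


def choose_mode(workload: Workload) -> str:
    """Return "real-time" only when something is waiting on the words."""
    return "real-time" if workload.blocked_on_live_words else "async"


# The examples discussed above, labeled by hand (an assumption of this sketch).
WORKLOADS = [
    Workload("live captions", blocked_on_live_words=True),
    Workload("voice agent", blocked_on_live_words=True),
    Workload("recorded podcast", blocked_on_live_words=False),
    Workload("yesterday's call recordings", blocked_on_live_words=False),
]

for w in WORKLOADS:
    print(f"{w.name}: {choose_mode(w)}")
```

The rule deliberately ignores everything except the blocking question. Whether the audio is being recorded, or how it arrives, does not change the answer.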

Accuracy advantages of asynchronous transcription

A real-time recognizer commits to words before the speaker finishes the sentence. It guesses "two" before it has heard whether the next word is "hundred" or "tomorrow." An async system has the whole recording in hand before it writes a single word, so it can use audio from ten seconds later to fix a word now. Only the batch system gets that later context.^[1]^[2]

The gap is small for clean speech and widest where transcription is hardest. With a quiet single speaker you will rarely notice it; on an overlapping, accented, jargon-heavy call, the extra context earns its keep.^[3]
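The "two" example can be mimicked with a toy scorer. This is not how a production recognizer works; the candidate words, the `PRIOR` table, and the `FOLLOW_SCORE` table are invented numbers, chosen only to show why a decision that can see the next word differs from one that cannot.

```python
# Hypothetical preference for each candidate before any later audio is heard.
PRIOR = {"two": 0.4, "to": 0.6}

# Hypothetical fit between a candidate and the word that follows it.
FOLLOW_SCORE = {
    ("two", "hundred"): 0.9,
    ("to", "hundred"): 0.1,
    ("two", "tomorrow"): 0.2,
    ("to", "tomorrow"): 0.8,
}


def streaming_pick(candidates):
    """Commit immediately: only the prior is available."""
    return max(candidates, key=lambda w: PRIOR[w])


def batch_pick(candidates, next_word):
    """Decide after the fact: combine the prior with the following word."""
    return max(candidates, key=lambda w: PRIOR[w] * FOLLOW_SCORE[(w, next_word)])


candidates = ["two", "to"]
print("streaming:", streaming_pick(candidates))
print("batch, next word 'hundred':", batch_pick(candidates, "hundred"))
print("batch, next word 'tomorrow':", batch_pick(candidates, "tomorrow"))
```

With these made-up numbers the streaming pick is the same regardless of what comes next, while the batch pick changes with the following word. That difference is the later context the async system gets to use.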

Computational and operational costs

Real-time transcription requires a streaming connection, usually a WebSocket, that carries audio in small chunks as it is captured. The client must handle partial and final results, endpoint detection, pauses, and connection recovery. These requirements do not apply to a file-based batch request.^[4]
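Handling partial and final results is the part of that list that most often surprises people. Here is a minimal sketch of the client-side bookkeeping, assuming a simplified event shape (`text` plus an `is_final` flag) that is hypothetical and not taken from any vendor's API. Events are simulated in memory, so no connection is involved.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    text: str
    is_final: bool  # hypothetical flag: True once the recognizer commits


@dataclass
class CaptionBuffer:
    finals: list = field(default_factory=list)
    partial: str = ""

    def apply(self, event: Event) -> None:
        if event.is_final:
            # Committed text is appended and never revised again.
            self.finals.append(event.text)
            self.partial = ""
        else:
            # A partial replaces the previous partial instead of extending it.
            self.partial = event.text

    def display(self) -> str:
        parts = self.finals + ([self.partial] if self.partial else [])
        return " ".join(parts)


buffer = CaptionBuffer()
simulated_stream = [
    Event("send to", is_final=False),
    Event("send two hundred", is_final=True),
    Event("units", is_final=False),
]
for event in simulated_stream:
    buffer.apply(event)
    print(buffer.display())
```

The design point is that partial text is provisional: the display shows it, but only final text is stored permanently. Endpoint detection and reconnection sit on top of this and are not shown.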

Async buys you simplicity and charges you latency. You hand over a file or a URL, get a job back, and collect the result later by polling or, better, through a webhook. Processing runs far faster than real time: an hour of audio commonly comes back in a few minutes, depending on load and file length.^[6] Even so, a delay of a few minutes after the recording exists is a non-starter for anything live.
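The collect-by-polling half of that contract can be sketched as a loop with exponential backoff. Everything service-specific here is an assumption: `get_status` stands in for whatever call your provider offers, and the `state` and `transcript` keys are hypothetical. A webhook removes the loop entirely.

```python
import time


def wait_for_transcript(get_status, job_id, *, initial_delay=1.0,
                        max_delay=30.0, timeout=600.0, sleep=time.sleep):
    """Poll a job until it completes, doubling the wait between checks."""
    delay = initial_delay
    waited = 0.0
    while True:
        status = get_status(job_id)  # hypothetical status lookup
        if status["state"] == "completed":
            return status["transcript"]
        if status["state"] == "failed":
            raise RuntimeError(f"job {job_id} failed")
        if waited >= timeout:
            raise TimeoutError(f"job {job_id} not done after {waited:.0f}s")
        sleep(delay)
        waited += delay
        delay = min(delay * 2, max_delay)


# Stand-in for a real service: two "processing" replies, then the result.
fake_replies = iter([
    {"state": "processing"},
    {"state": "processing"},
    {"state": "completed", "transcript": "hello world"},
])
naps = []  # records the requested delays instead of actually sleeping
print(wait_for_transcript(lambda job_id: next(fake_replies), "job-1",
                          sleep=naps.append))
print(naps)
```

Passing `sleep` in as a parameter keeps the sketch testable without real waiting. In production you would leave the default.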

flowchart LR
  subgraph RT [Real-time]
    direction TB
    A1[Live audio chunks] --> A2[Partial words now]
    A2 --> A3[Final words<br/>as turns close]
  end
  subgraph AS [Async]
    direction TB
    B1[Complete recording] --> B2[Process when ready]
    B2 --> B3[Full transcript<br/>delivered later]
  end

The same recognizer, two delivery contracts. Real-time trades context for immediacy; async trades immediacy for context.

Comparison of transcription modes

Aspect	Real-time	Async (batch)
Input	Live stream, chunk by chunk	A complete file or URL
First text appears	Within a few hundred milliseconds	After the job finishes
Sees future audio?	No	Yes, the whole recording
Transport	Streaming (WebSocket)	Upload + poll or webhook
Best for	Captions, agents, live notes	Recordings, archives, subtitles
Engineering effort	Higher (stream lifecycle)	Lower (submit and collect)

Common selection errors

A live recording is still live. Recording a meeting and captioning it at the same time is a real-time job that happens to be saving a file. The recording is a byproduct; the captions are the point, and they are blocked on the words.

A "real-time" demo over a recorded file is a habit, not a design choice. Streaming a finished .wav through a real-time API to watch the words scroll looks impressive, but you pay streaming complexity and give up full-context accuracy to transcribe something sitting on disk. Send it to async.

"We might need it live someday" is not a reason to build streaming now. The two modes share a model and most of their concepts, so moving a batch pipeline to streaming later is bounded work, not a rewrite. Build for the latency you have today.

Common questions

Is real-time transcription less accurate than async?

Slightly, for a specific reason: the real-time system commits to words before hearing the rest of the sentence, while the async system reads the whole recording first. On clean speech the difference is hard to notice. It widens on names, numbers, and overlapping or accented speech, where later audio resolves earlier words.

Can I run async transcription on a live microphone?

No. Async needs a complete recording before it starts, so it cannot transcribe audio that is still being spoken. If you need words while someone is talking, that is a real-time job. If the transcript can wait until the speaker has finished, record the live audio and send the finished file to async afterward.

Why would I ever choose real-time if async is more accurate?

Because async gives you nothing until the recording is over. A caption or a voice agent is blocked on the next word the instant it is spoken, so a transcript that lands after the conversation ends is useless to it. Choose real-time whenever something is waiting.^[5]

How fast is async transcription?

Faster than real time. Processing commonly runs several times faster than playback, so an hour of recording can return in minutes, depending on file length and current load. The delay is measured in minutes after the file exists, which is fine for archives but fatal for anything live.^[6]
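To make "minutes, not hours" concrete, here is the arithmetic as a small sketch. The speed factors are illustrative assumptions, not figures published by any provider, and the estimate covers processing time only; it ignores queueing and upload time.

```python
def estimated_processing_minutes(audio_minutes: float, speed_factor: float) -> float:
    """Processing time when the service runs speed_factor times faster than playback."""
    if speed_factor <= 0:
        raise ValueError("speed_factor must be positive")
    return audio_minutes / speed_factor


# Assumed speed factors for a one-hour recording.
for factor in (10, 20, 30):
    minutes = estimated_processing_minutes(60, factor)
    print(f"{factor}x faster than playback: about {minutes:.0f} min")
```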

References

[1] Andrusenko, A., Bataev, V., Grigoryan, L., Tadevosyan, N., Lavrukhin, V., & Ginsburg, B. (2026). Reducing the Offline-Streaming Gap for Unified ASR Transducer with Consistency Regularization. arXiv preprint arXiv:2604.19079.
[2] Moriya, T., Mimura, M., Matsui, K., & Sato, H. (2025). Attention-Free Dual-Mode ASR with Latency-Controlled Selective State Spaces. Interspeech.
[3] Kuhn, K., Kersken, V., Reuter, B., Egger, N., & Schmidt, M. (2024). Measuring the Accuracy of Automatic Speech Recognition Solutions. ACM Transactions on Speech and Language Processing, 1–27.
[4] Raj, D., Lu, L., Chen, Z., Gaur, Y., & Li, J. (2022). Continuous Streaming Multi-Talker ASR with Dual-Path Transducers. ICASSP 2022 - IEEE International Conference on Acoustics, Speech and Signal Processing, 7767–7771.
[5] Soniox (2026). Real-time Speech-to-Text. Soniox.
[6] Soniox (2026). Async Transcription. Soniox.