Picture the transcript of a four-person meeting with no speaker labels. One paragraph runs into the next, a question and its answer sit in the same block, and "yeah, I agree" floats with no way to tell who agreed with whom. Every word is correct, and the conversation is still illegible. Diarization adds the speaker labels that make the turns readable.
Diarization pipeline
The textbook diarization system runs in four stages, though the field has been folding them into single neural models since 2019.[1][2]
First, segmentation. You cut the audio into small pieces that each (you hope) contain a single speaker. Older systems detected speaker change points; modern ones slide a short window, often around 1.5 seconds, and accept that some windows will straddle a boundary.[1]
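The sliding-window approach can be sketched in a few lines of Python. This is an illustration, not any particular toolkit's implementation: the function name `sliding_windows` and the 0.75-second hop are assumptions, and only the 1.5-second window length comes from the text above.

```python
def sliding_windows(duration, window=1.5, hop=0.75):
    """Return (start, end) times in seconds that cover [0, duration].

    The hop of half a window is an illustrative assumption.
    """
    windows = []
    i = 0
    while True:
        start = i * hop
        if start >= duration:
            break
        # The last window is clipped to the end of the recording.
        end = min(start + window, duration)
        windows.append((round(start, 3), round(end, 3)))
        if end >= duration:
            break
        i += 1
    return windows


windows = sliding_windows(4.0)
```

Because consecutive windows overlap, a speaker change that falls inside one window is usually also seen near the edge of its neighbors, which is one reason the straddling windows are tolerable in practice.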
Second, speaker embeddings. Each segment is mapped to a fixed-length vector that captures voice characteristics while ignoring the words. The lineage is short: i-vectors (Dehak et al., 2011)[3] came out of factor analysis, neural d-vectors followed,[4] and then came x-vectors (Snyder et al., 2018), extracted from a time-delay neural network trained to tell speakers apart, so the embeddings cluster cleanly by speaker.[5] X-vectors were the workhorse for years and are still a sensible baseline.
Third, clustering. You group the embeddings so that one cluster equals one speaker. Agglomerative hierarchical clustering (AHC) merges the closest segments bottom-up until a stopping threshold[6]; spectral clustering uses the eigenstructure of a similarity matrix and handles awkward cluster shapes better.[7] Neither is told how many speakers to expect, which is most of the difficulty.
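A minimal sketch of threshold-stopped AHC, assuming average linkage over cosine distance and non-zero embeddings. The function names, the toy two-dimensional vectors, and the 0.5 threshold are all illustrative assumptions; real embeddings have hundreds of dimensions and production systems tune the threshold on held-out data.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def ahc(embeddings, threshold=0.5):
    """Average-linkage AHC; returns one cluster label per embedding."""
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1, c2):
        dists = [cosine_distance(embeddings[i], embeddings[j]) for i in c1 for j in c2]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        dist, a, b = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if dist > threshold:
            break  # nothing left that is close enough to merge
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = ahc(toy)
```

Note that the number of clusters is never passed in. It falls out of the threshold, so a threshold that is too loose merges speakers and one that is too tight splits them.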
Fourth, assignment: every frame of audio inherits the label of its cluster, and you have your who-spoke-when timeline.
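As a rough illustration of the assignment step, the sketch below merges adjacent segments that share a cluster label into speaker turns. It assumes non-overlapping, time-sorted segments for simplicity; the function name `to_timeline` is hypothetical, and systems that use overlapping windows typically resolve labels per frame before this step.

```python
def to_timeline(segments, labels):
    """Merge adjacent same-label segments into (start, end, speaker) turns."""
    turns = []
    for (start, end), label in zip(segments, labels):
        # Extend the previous turn when the speaker is unchanged and the
        # segments are contiguous; otherwise start a new turn.
        if turns and turns[-1][2] == label and turns[-1][1] == start:
            turns[-1] = (turns[-1][0], end, label)
        else:
            turns.append((start, end, label))
    return turns


segments = [(0.0, 1.5), (1.5, 3.0), (3.0, 4.5), (4.5, 6.0)]
timeline = to_timeline(segments, [0, 0, 1, 0])
```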
Sources of diarization error
Diarization stayed unsolved long after ASR became usable, for four stubborn reasons.
You usually do not know how many speakers there are. A clustering algorithm that assumes or estimates the wrong speaker count produces confidently wrong output: merge two people into one cluster, or split one person across two, and every label for those speakers is wrong downstream.[6]
Overlapping speech is one of the hardest problems. When two people talk at once, a system that assigns one label per frame has to pick one and is wrong about the other by construction.[2] For decades it has been one of the largest contributors to diarization error,[8] which is why so much recent work targets it directly.
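The "wrong by construction" point can be made concrete with a small calculation. The sketch below assumes a frame-level reference in which each frame lists the set of active speakers, and computes the share of reference speaker-time that a one-label-per-frame system must miss even if every label it emits is correct. The function name and the toy frames are illustrative assumptions.

```python
def single_label_miss_rate(reference):
    """reference: one set of active speakers per frame.

    Returns the fraction of speaker-time that cannot be covered when only
    one label per frame is allowed.
    """
    total = sum(len(active) for active in reference)
    # In a frame with k simultaneous speakers, k - 1 of them go unlabeled.
    missed = sum(len(active) - 1 for active in reference if len(active) > 1)
    return missed / total if total else 0.0


frames = [{"A"}, {"A"}, {"A", "B"}, {"B"}, set()]
floor = single_label_miss_rate(frames)
```

In this toy case one of five speaker-frames is unreachable, so the missed-speech component of the error has a floor of 20% no matter how good the embeddings and clustering are.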
Short turns starve the embeddings. A crisp x-vector needs a second or two of speech, while a clipped "right" or "no" gives the model almost nothing, and backchannel-heavy conversation is full of these.[1][9]
Similar voices defeat the geometry. Two speakers with close pitch and accent land near each other in embedding space, and the clustering step has no principled way to keep them apart.[10][7]
Diarization vs identification vs verification
These three get conflated constantly, so be precise. Diarization asks who-spoke-when with anonymous labels and no prior knowledge of the voices. Speaker identification asks "which known person is this?" and needs a database of enrolled voices to match against. Speaker verification asks "is this person who they claim to be?", a yes/no decision against one enrolled voiceprint, which is what your phone does when it unlocks to your voice.[12] Diarization can feed the other two (cluster first, then label each cluster with a name), but on its own it stays anonymous. The distinctions are covered in more detail in the article on diarization vs speaker identification vs verification.
| Task | Question answered | Needs enrollment? | Output |
|---|---|---|---|
| Diarization | Who spoke when? | No | Anonymous labels over time |
| Identification | Which known person? | Yes (many voices) | A name from a set |
| Verification | Is it this person? | Yes (one voice) | Accept / reject |
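The difference between the two enrollment-based tasks shows up clearly in code. In this sketch, identification is a best match over many enrolled voiceprints and verification is a threshold test against one. The cosine scoring, the 0.7 threshold, the names, and the two-dimensional vectors are illustrative assumptions, and the identification shown is closed-set (it always returns some enrolled name).

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def identify(embedding, enrolled):
    """Identification: pick the most similar of many enrolled voiceprints."""
    return max(enrolled, key=lambda name: cosine_similarity(embedding, enrolled[name]))


def verify(embedding, voiceprint, threshold=0.7):
    """Verification: accept or reject against a single claimed voiceprint."""
    return cosine_similarity(embedding, voiceprint) >= threshold


enrolled = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}
probe = [0.9, 0.2]
who = identify(probe, enrolled)
accepted = verify(probe, enrolled["alice"])
```

Diarization needs neither function: it never consults `enrolled`, which is why its labels stay anonymous.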
Online vs offline
Offline (batch) diarization sees the whole recording before deciding anything, so it can cluster globally and revise early guesses once later evidence arrives. This is the easier, more accurate setting, and what you want for recorded calls, podcasts, and meeting archives.
Online (streaming) diarization must label each speaker as the audio arrives, with no peeking ahead. It cannot recluster the past, so an early mistake tends to persist, and it has to decide on the spot whether a new voice is genuinely new or just speaker 2 sounding different. It is generally harder and usually posts a higher diarization error rate (DER) than its offline counterpart on the same audio.[8][1] If your product is a live captioning or agent-assist feature, this is the regime you are stuck with.
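One simple way to picture the online setting is incremental clustering against running centroids: each incoming embedding either joins its nearest existing speaker or opens a new one, and the decision is never revisited. The class below is a minimal sketch of that idea, not the method of any cited system; the class name, the cosine scoring, and the 0.6 threshold are assumptions.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


class OnlineDiarizer:
    def __init__(self, threshold=0.6):
        self.threshold = threshold
        # One running sum of embeddings per speaker. Cosine similarity is
        # scale-invariant, so the sum behaves like the mean here.
        self.centroids = []

    def assign(self, embedding):
        best, best_sim = None, -1.0
        for idx, centroid in enumerate(self.centroids):
            sim = cosine_similarity(embedding, centroid)
            if sim > best_sim:
                best, best_sim = idx, sim
        if best is None or best_sim < self.threshold:
            # Too far from every known voice: open a new speaker.
            self.centroids.append(list(embedding))
            return len(self.centroids) - 1
        self.centroids[best] = [c + e for c, e in zip(self.centroids[best], embedding)]
        return best


diarizer = OnlineDiarizer()
stream = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.95, 0.05], [0.1, 0.9]]
labels = [diarizer.assign(e) for e in stream]
```

The threshold carries the whole "new voice or same voice" decision. Set it too high and one speaker fragments into several; set it too low and a newcomer is absorbed into an existing speaker, with no later pass to undo either mistake.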
Combining diarization with ASR
Diarization gives you who spoke when; ASR gives you what was said when. The labeled transcript you want comes from intersecting the two on the time axis: each recognized word carries a start and end time (see timestamps and forced alignment), and you stamp it with whichever speaker owned that span.[13] Get the alignment slightly wrong and words land under the previous speaker, which is jarring at exactly the turn boundaries readers care about most. Run the two as one joint model and the boundaries tend to agree by design rather than having to be stitched together afterward.[14]
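The intersection on the time axis can be sketched as a maximum-overlap assignment: each word goes to the speaker turn that covers most of its duration. The tuple layouts and the function name `label_words` are assumptions for illustration, not the output format of any particular ASR or diarization API.

```python
def label_words(words, turns):
    """words: (text, start, end); turns: (start, end, speaker).

    Each word is stamped with the speaker whose turn overlaps it most,
    or None when no turn overlaps it at all.
    """
    labeled = []
    for text, w_start, w_end in words:
        best_speaker, best_overlap = None, 0.0
        for t_start, t_end, speaker in turns:
            overlap = min(w_end, t_end) - max(w_start, t_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labeled.append((text, best_speaker))
    return labeled


turns = [(0.0, 2.0, "Speaker 1"), (2.0, 4.0, "Speaker 2")]
words = [("hello", 0.2, 0.6), ("there", 1.8, 2.3), ("hi", 2.5, 2.8)]
transcript = label_words(words, turns)
```

The word "there" straddles the turn boundary at 2.0 seconds and goes to Speaker 2 because more of it falls on that side. Shift the boundary by a few hundred milliseconds and it flips, which is the turn-boundary fragility described above.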
Common questions
Does diarization tell me people's names?
No. It gives anonymous labels like Speaker 1 and Speaker 2 that are consistent within one recording. Putting a name to a label is speaker identification, a separate step that needs enrolled voices.
How many speakers can it handle?
Two-person calls are the well-behaved case. Accuracy degrades as the count rises and as turns get shorter and more overlapped, because both the embeddings and the clustering have less clean signal. Many systems can estimate the count automatically, but the estimate gets shakier with crowded, noisy audio.
Why are the speaker labels different every time I rerun it?
Because the labels are relative, not identities. Speaker 1 in one file has no connection to Speaker 1 in another, and even a rerun can swap the numbering. Stable identities across recordings require identification on top.
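If the swapped numbering is the only problem, one lightweight workaround is to renumber speakers by order of first appearance after each run. The helper below is a hypothetical sketch. It makes two runs of the same file agree whenever the underlying clusters agree, and it does nothing to link speakers across different recordings.

```python
def relabel_by_first_appearance(turns):
    """turns: (start, end, speaker). Renumber speakers in order of first speech."""
    mapping = {}
    relabeled = []
    for start, end, speaker in sorted(turns, key=lambda t: t[0]):
        if speaker not in mapping:
            mapping[speaker] = f"Speaker {len(mapping) + 1}"
        relabeled.append((start, end, mapping[speaker]))
    return relabeled


run_a = [(0, 2, "B"), (2, 3, "A"), (3, 5, "B")]
run_b = [(0, 2, "A"), (2, 3, "B"), (3, 5, "A")]
same = relabel_by_first_appearance(run_a) == relabel_by_first_appearance(run_b)
```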
Does background noise hurt diarization?
Yes. Noise and crosstalk corrupt the embeddings and blur speaker boundaries, raising DER. Telephony audio is its own challenge, narrowband and compressed; see transcribing noisy audio and telephony transcription.
Related concepts
- Diarization vs speaker identification vs verification
- What is speech recognition
- Timestamps and forced alignment
- Telephony transcription
- Transcribing noisy audio
- Conversation summarization
References
1. Park, T. J., Kanda, N., Dimitriadis, D., et al. (2022). A Review of Speaker Diarization: Recent Advances with Deep Learning. Computer Speech & Language, 72.
2. Fujita, Y., Kanda, N., Horiguchi, S., et al. (2019). End-to-End Neural Speaker Diarization with Permutation-Free Objectives. Interspeech 2019.
3. Dehak, N., Kenny, P. J., Dehak, R., et al. (2011). Front-End Factor Analysis for Speaker Verification. IEEE Transactions on Audio, Speech, and Language Processing, 19(4).
4. Variani, E., Lei, X., McDermott, E., et al. (2014). Deep Neural Networks for Small Footprint Text-Dependent Speaker Verification. ICASSP 2014 (IEEE).
5. Snyder, D., Garcia-Romero, D., Sell, G., et al. (2018). X-Vectors: Robust DNN Embeddings for Speaker Recognition. ICASSP 2018 (IEEE).
6. Anguera, X., Bozonnet, S., Evans, N., et al. (2012). Speaker Diarization: A Review of Recent Research. IEEE Transactions on Audio, Speech, and Language Processing, 20(2).
7. Wang, Q., Downey, C., Wan, L., et al. (2018). Speaker Diarization with LSTM. ICASSP 2018 (IEEE).
8. Coria, J. M., Bredin, H., Ghannay, S., & Rosset, S. (2021). Overlap-Aware Low-Latency Online Speaker Diarization Based on End-to-End Local Segmentation. IEEE ASRU 2021.
9. Kanagasundaram, A., Vogt, R., Dean, D., et al. (2011). i-vector Based Speaker Recognition on Short Utterances. Interspeech 2011.
10. Fürer, L., Schenk, N., Roth, V., et al. (2020). Supervised Speaker Diarization Using Random Forests: A Tool for Psychotherapy Process Research. Frontiers in Psychology, 11:1726.
11. Fiscus, J. G., Ajot, J., Michel, M., & Garofolo, J. S. (2006). The Rich Transcription 2006 Spring Meeting Recognition Evaluation. NIST Rich Transcription (Springer LNCS).
12. Bai, Z., & Zhang, X.-L. (2021). Speaker Recognition Based on Deep Learning: An Overview. Neural Networks, 140.
13. Soniox (2026). Speaker Diarization. Soniox Docs.
14. El Shafey, L., Soltau, H., & Shafran, I. (2019). Joint Speech Recognition and Speaker Diarization via Sequence Transduction. Interspeech 2019.