What is speaker diarization? Who-spoke-when, explained

Picture the transcript of a four-person meeting with no speaker labels. One paragraph runs into the next, a question and its answer sit in the same block, and "yeah, I agree" floats with no way to tell who agreed with whom. Every word is correct, and the conversation is still illegible. Diarization adds the speaker labels that make the turns readable.

Diarization pipeline

The textbook diarization system runs in four stages, though the field has been folding them into single neural models since 2019.^[1]^[2]

It starts with segmentation: cut the audio into small pieces that each (you hope) contain a single speaker. Older systems detected speaker change points; modern ones slide a short window, often around 1.5 seconds, and accept that some windows will straddle a boundary.^[1]
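The sliding-window idea can be sketched in a few lines. This is a minimal illustration, not any particular toolkit's segmenter: the function name `sliding_windows`, the 0.75-second hop (50% overlap), and the speaker change at 2.0 seconds are assumptions made for the example; only the 1.5-second window comes from the text above.

```python
def sliding_windows(duration, window=1.5, hop=0.75):
    """Return (start, end) pairs, in seconds, covering [0, duration].

    The hop of 0.75 s is an assumed value, not a recommendation.
    """
    if duration <= 0:
        return []
    windows = []
    i = 0
    while True:
        start = i * hop
        end = min(start + window, duration)  # clip the final window
        windows.append((round(start, 3), round(end, 3)))
        if end >= duration:
            break
        i += 1
    return windows


windows = sliding_windows(4.0)
# [(0.0, 1.5), (0.75, 2.25), (1.5, 3.0), (2.25, 3.75), (3.0, 4.0)]

# Hypothetical speaker change at 2.0 s: which windows contain two speakers?
change = 2.0
straddling = [w for w in windows if w[0] < change < w[1]]
# [(0.75, 2.25), (1.5, 3.0)]
```

Two of the five windows straddle the change point, which is the cost the fixed-window approach accepts in exchange for not having to detect change points first.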

Each segment then becomes a speaker embedding, a fixed-length vector that captures voice characteristics while ignoring the words. The lineage runs from i-vectors (Dehak et al., 2011),^[3] which borrowed from factor analysis, through neural d-vectors^[4] to x-vectors (Snyder et al., 2018), which trained a time-delay neural network to produce embeddings that cluster cleanly by speaker.^[5] X-vectors were the workhorse for years and are still a sensible baseline.

Clustering groups the embeddings so that one cluster equals one speaker. Agglomerative hierarchical clustering (AHC) merges the closest segments bottom-up until a stopping threshold is reached^[6]; spectral clustering uses the eigenstructure of a similarity matrix and handles awkward cluster shapes better.^[7] Neither is told how many speakers to expect, which is most of the difficulty.
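Threshold-driven AHC can be sketched as follows. This is a toy under stated assumptions, not a production clusterer: the cosine distance, the average linkage, the 0.3 default threshold, and the two-dimensional vectors are all choices made for readability, whereas real speaker embeddings have hundreds of dimensions.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def ahc(embeddings, threshold=0.3):
    """Average-linkage AHC; returns one cluster label per embedding."""
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1, c2):
        dists = [cosine_distance(embeddings[i], embeddings[j])
                 for i in c1 for j in c2]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        dist, a, b = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if dist > threshold:  # stopping threshold reached
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Hypothetical embeddings: segments 0, 1, 4 from one voice, 2 and 3 from another.
toy = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9), (1.0, 0.0)]
labels = ahc(toy)                        # [0, 0, 1, 1, 0]
too_strict = ahc(toy, threshold=0.0)     # [0, 1, 2, 3, 4]
too_loose = ahc(toy, threshold=1.0)      # [0, 0, 0, 0, 0]
```

The speaker count is never passed in; it falls out of the threshold. The same five segments yield two, five, or one speaker depending on that single number, which is why choosing it matters so much.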

Assignment closes the loop: every frame of audio inherits the label of its cluster, and you have your who-spoke-when timeline.
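As a rough sketch of that last step, per-frame labels can be collapsed into a timeline by merging consecutive frames that share a label. The 0.5-second frame length, the use of `None` for non-speech frames, and the name `frames_to_timeline` are assumptions for illustration.

```python
from itertools import groupby


def frames_to_timeline(frame_labels, frame_sec=0.5):
    """Merge runs of identical frame labels into (start, end, speaker) turns.

    Frames labeled None are treated as non-speech and skipped.
    """
    timeline = []
    t = 0  # index of the first frame in the current run
    for label, run in groupby(frame_labels):
        n = len(list(run))
        if label is not None:
            timeline.append((t * frame_sec, (t + n) * frame_sec, label))
        t += n
    return timeline


frame_labels = ["A", "A", "A", None, "B", "B", "A"]
timeline = frames_to_timeline(frame_labels)
# [(0.0, 1.5, 'A'), (2.0, 3.0, 'B'), (3.0, 3.5, 'A')]
```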

Audio in → Segment into windows → Embed (x-vectors) → Cluster (AHC or spectral) → Assign labels per frame → Speaker timeline

The classic four-stage diarization pipeline. End-to-end systems collapse these into one trained model, but the stages still describe what has to happen.

Sources of diarization error

Diarization stayed unsolved long after ASR became usable, for four stubborn reasons.

You usually do not know how many speakers there are. A clustering algorithm handed the wrong speaker count produces confidently wrong output: if it merges two people into one cluster or splits one person across two, every downstream label is off.^[6]

Overlapping speech is one of the hardest problems. When two people talk at once, a system that assigns one label per frame has to pick one and is wrong about the other by construction.^[2] For decades it has been one of the largest contributors to diarization error,^[8] which is why so much recent work targets it directly.
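The "wrong by construction" point can be made concrete with a small calculation. The sketch below assumes a frame-level reference in which each frame lists its active speakers, and it computes the minimum share of speaker-time that any one-label-per-frame system must miss. The function name and the toy reference are hypothetical.

```python
def single_label_miss_floor(reference):
    """Lower bound on missed speech for a one-label-per-frame system.

    reference: list of sets, the speakers active in each frame.
    A frame with k active speakers contributes k units of speaker-time,
    of which at most one can be labeled correctly.
    """
    total = sum(len(active) for active in reference)
    missed = sum(max(len(active) - 1, 0) for active in reference)
    return missed / total if total else 0.0


# Hypothetical 8-frame reference: A and B overlap in frames 2 and 3.
reference = [{"A"}, {"A"}, {"A", "B"}, {"A", "B"}, {"B"}, set(), {"B"}, {"B"}]
floor = single_label_miss_floor(reference)
# 2 of 9 speaker-frames are missed no matter which label is chosen
```

In this toy case the floor is about 22% before the system has made a single avoidable mistake, which is the motivation for models that can emit more than one speaker per frame.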

Short turns starve the embeddings. A crisp x-vector needs a second or two of speech, while a clipped "right" or "no" gives the model almost nothing, and backchannel-heavy conversation is full of these.^[1]^[9]

Similar voices defeat the geometry. Two speakers with close pitch and accent land near each other in embedding space, and the clustering step has no principled way to keep them apart.^[10]^[7]

What diarization is not

Diarization gets conflated with two neighboring tasks that it only loosely resembles. Speaker identification puts a name on a voice by matching it against enrolled voiceprints, and speaker verification checks a voice against one claimed identity, the yes-or-no decision behind voice login.^[12] Diarization needs neither names nor enrollment. It can feed the other two (cluster first, then look up each cluster), but on its own it stays anonymous, which is precisely why it carries none of their biometric baggage. The three-way comparison, including what each needs up front, is in diarization vs speaker identification vs verification.

Online vs offline

Offline (batch) diarization sees the whole recording before deciding anything, so it can cluster globally and revise early guesses once later evidence arrives. This is the easier, more accurate setting, and what you want for recorded calls, podcasts, and meeting archives.

Online (streaming) diarization must label speech as the audio arrives, with no peeking ahead. It cannot recluster the past, so an early mistake tends to persist, and it has to decide on the spot whether a new voice is genuinely new or just speaker 2 sounding different. It is generally harder and usually posts a higher diarization error rate (DER) than its offline counterpart on the same audio.^[8]^[1] If your product is a live captioning or agent-assist feature, this is the regime you are stuck with.
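The on-the-spot decision can be illustrated with a greedy sketch: each incoming embedding is compared with the running centroid of every speaker seen so far, and either joins the best match or opens a new speaker. This is an assumed, deliberately simple scheme, not the method of any cited system; the cosine similarity, the thresholds, and the two-dimensional toy stream are all illustrative.

```python
import math


def cosine_similarity(a, b):
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / norm


def online_assign(stream, new_speaker_threshold=0.8):
    """Label embeddings one at a time; past labels are never revised."""
    centroids = []  # running sum of embeddings per speaker
    labels = []
    for emb in stream:
        sims = [cosine_similarity(emb, c) for c in centroids]
        if sims and max(sims) >= new_speaker_threshold:
            k = sims.index(max(sims))
            # Cosine ignores scale, so a running sum acts like a mean.
            centroids[k] = [c + e for c, e in zip(centroids[k], emb)]
        else:
            centroids.append(list(emb))  # commit to a new speaker
            k = len(centroids) - 1
        labels.append(k)
    return labels


# Hypothetical stream: the third embedding is the second speaker
# "sounding different" from their first appearance.
stream = [(1.0, 0.0), (0.0, 1.0), (0.5, 1.0), (0.9, 0.1)]
lenient = online_assign(stream, new_speaker_threshold=0.8)   # [0, 1, 1, 0]
strict = online_assign(stream, new_speaker_threshold=0.95)   # [0, 1, 2, 0]
```

With the stricter threshold the drifted voice is committed as a spurious third speaker, and because nothing is reclustered afterwards, that label stays in the output.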

Combining diarization with ASR

Diarization gives you who spoke when; ASR gives you what was said when. The labeled transcript you want comes from intersecting the two on the time axis: each recognized word carries a start and end time (see timestamps and forced alignment), and you stamp it with whichever speaker owned that span.^[13] If the alignment is slightly wrong, words near a turn boundary land under the wrong speaker, which is jarring at exactly the points readers care about most. If you run the two as one joint model instead, the boundaries tend to agree by design, not by stitching.^[14]
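A minimal sketch of the stitching step, assuming words arrive as `(text, start, end)` tuples and speaker turns as `(start, end, speaker)` tuples. Both shapes are hypothetical rather than any vendor's API, and the rule used here (give each word to the turn it overlaps most) is one reasonable choice among several.

```python
def label_words(words, turns):
    """Attach to each word the speaker whose turn overlaps it the most.

    Words that overlap no turn get the speaker None.
    """
    labeled = []
    for text, w_start, w_end in words:
        best_speaker, best_overlap = None, 0.0
        for t_start, t_end, speaker in turns:
            overlap = min(w_end, t_end) - max(w_start, t_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labeled.append((best_speaker, text))
    return labeled


words = [("so", 0.2, 0.5), ("agreed", 1.8, 2.4), ("yeah", 2.5, 2.8)]

turns = [(0.0, 2.0, "Speaker 1"), (2.0, 4.0, "Speaker 2")]
good = label_words(words, turns)
# [('Speaker 1', 'so'), ('Speaker 2', 'agreed'), ('Speaker 2', 'yeah')]

# Same words, turn boundary misplaced by 0.3 s.
shifted = [(0.0, 2.3, "Speaker 1"), (2.3, 4.0, "Speaker 2")]
bad = label_words(words, shifted)
# [('Speaker 1', 'so'), ('Speaker 1', 'agreed'), ('Speaker 2', 'yeah')]
```

A boundary error of 0.3 seconds is enough to move "agreed" to the other speaker, which changes who agreed with whom.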

Common questions

Does diarization tell me people's names?

No. It gives anonymous labels like Speaker 1 and Speaker 2 that are consistent within one recording. Putting a name to a label is speaker identification, a separate step that needs enrolled voices.

How many speakers can it handle?

Two-person calls are the well-behaved case. Accuracy degrades as the count rises and as turns get shorter and more overlapped, because both the embeddings and the clustering have less clean signal. Many systems can estimate the count automatically, but the estimate gets shakier with crowded, noisy audio.

Why are the speaker labels different every time I rerun it?

Because the labels are relative, not identities. Speaker 1 in one file has no connection to Speaker 1 in another, and even a rerun can swap the numbering. Stable identities across recordings require identification on top.
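If the only difference between two runs is the numbering, one common workaround is to renumber speakers by order of first appearance. The sketch below assumes raw cluster IDs as input and uses a hypothetical helper name; it only stabilizes numbering within one recording and does nothing to link speakers across recordings.

```python
def relabel_by_first_appearance(labels):
    """Map arbitrary cluster IDs to 'Speaker 1', 'Speaker 2', ... in order seen."""
    mapping = {}
    out = []
    for label in labels:
        if label not in mapping:
            mapping[label] = f"Speaker {len(mapping) + 1}"
        out.append(mapping[label])
    return out


# Two hypothetical runs with the same grouping but different cluster IDs.
run_a = [2, 2, 0, 1, 0]
run_b = [0, 0, 1, 2, 1]
same = relabel_by_first_appearance(run_a) == relabel_by_first_appearance(run_b)
# True
```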

Does background noise hurt diarization?

Yes. Noise and crosstalk corrupt the embeddings and blur speaker boundaries, raising DER. Telephony audio is its own challenge, narrowband and compressed; see transcribing noisy audio and telephony transcription.

References

[1] Park, T. J., Kanda, N., Dimitriadis, D., et al. (2022). A Review of Speaker Diarization: Recent Advances with Deep Learning. Computer Speech & Language, 72.
[2] Fujita, Y., Kanda, N., Horiguchi, S., et al. (2019). End-to-End Neural Speaker Diarization with Permutation-Free Objectives. Interspeech 2019.
[3] Dehak, N., Kenny, P. J., Dehak, R., et al. (2011). Front-End Factor Analysis for Speaker Verification. IEEE Transactions on Audio, Speech, and Language Processing, 19(4).
[4] Variani, E., Lei, X., McDermott, E., et al. (2014). Deep Neural Networks for Small Footprint Text-Dependent Speaker Verification. ICASSP 2014 (IEEE).
[5] Snyder, D., Garcia-Romero, D., Sell, G., et al. (2018). X-Vectors: Robust DNN Embeddings for Speaker Recognition. ICASSP 2018 (IEEE).
[6] Anguera, X., Bozonnet, S., Evans, N., et al. (2012). Speaker Diarization: A Review of Recent Research. IEEE Transactions on Audio, Speech, and Language Processing, 20(2).
[7] Wang, Q., Downey, C., Wan, L., et al. (2018). Speaker Diarization with LSTM. ICASSP 2018 (IEEE).
[8] Coria, J. M., Bredin, H., Ghannay, S., & Rosset, S. (2021). Overlap-Aware Low-Latency Online Speaker Diarization Based on End-to-End Local Segmentation. IEEE ASRU 2021.
[9] Kanagasundaram, A., Vogt, R., Dean, D., et al. (2011). i-vector Based Speaker Recognition on Short Utterances. Interspeech 2011.
[10] Fürer, L., Schenk, N., Roth, V., et al. (2020). Supervised Speaker Diarization Using Random Forests: A Tool for Psychotherapy Process Research. Frontiers in Psychology, 11:1726.
[11] Fiscus, J. G., Ajot, J., Michel, M., & Garofolo, J. S. (2006). The Rich Transcription 2006 Spring Meeting Recognition Evaluation. NIST Rich Transcription (Springer LNCS).
[12] Bai, Z., & Zhang, X.-L. (2021). Speaker Recognition Based on Deep Learning: An Overview. Neural Networks, 140.
[13] Soniox (2026). Speaker Diarization. Soniox Docs.
[14] El Shafey, L., Soltau, H., & Shafran, I. (2019). Joint Speech Recognition and Speaker Diarization via Sequence Transduction. Interspeech 2019.