Speaker diarization
Learn how to separate speakers in both real-time and asynchronous processing.
Overview
Soniox Speech-to-Text AI supports speaker diarization, the ability to automatically detect and separate individual speakers within an audio stream. This feature enables you to generate speaker-labeled transcriptions for conversations, meetings, interviews, and other multi-speaker audio content — without the need for manual labeling or additional metadata.
Speaker diarization is designed to work seamlessly in both real-time and asynchronous transcription modes.
What is speaker diarization?
Speaker diarization answers the question: Who spoke when?
When enabled, Soniox identifies speaker changes throughout the audio and assigns a speaker label (e.g., Speaker 1, Speaker 2) to each token. This allows you to organize the transcription into coherent speaker segments.
Example
Suppose the audio contains three speakers with the following content:
how are you I am fantastic what about you feeling great today hey everyone
With speaker diarization enabled, Soniox may return the transcript like this:
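One plausible speaker-labeled output is shown below; the exact attribution of each phrase depends on the voices in the audio, so treat this as an illustration rather than a guaranteed result:

```
Speaker 1: how are you
Speaker 2: I am fantastic what about you
Speaker 1: feeling great today
Speaker 3: hey everyone
```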
This makes it easy to follow multi-speaker conversations and attribute statements accurately.
How to enable speaker diarization
To enable speaker separation, set the following parameter in your API request:
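A minimal request configuration might look like the sketch below. The parameter name `enable_speaker_diarization` and the model identifier are assumptions here; confirm the exact field names against the current Soniox API reference.

```python
# Minimal request configuration sketch (field names are assumptions,
# not confirmed API parameters).
transcription_request = {
    "model": "stt-async-preview",        # assumed model identifier
    "enable_speaker_diarization": True,  # turns speaker diarization on
}
```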
Speaker diarization is supported in both:
- Asynchronous transcription
- Real-time transcription
Output format
Each transcribed token includes a `speaker` field when speaker diarization is enabled:
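The snippet below shows an illustrative (not verbatim) shape of diarized tokens after parsing the JSON response; field names other than `speaker` may differ in practice.

```python
# Illustrative token shape after parsing the response JSON.
# Only "speaker" is the documented field here; the other names are assumptions.
tokens = [
    {"text": "how", "start_ms": 120, "end_ms": 300, "speaker": 1},
    {"text": "are", "start_ms": 300, "end_ms": 420, "speaker": 1},
    {"text": "you", "start_ms": 420, "end_ms": 560, "speaker": 1},
    {"text": "I",   "start_ms": 900, "end_ms": 980, "speaker": 2},
    {"text": "am",  "start_ms": 980, "end_ms": 1100, "speaker": 2},
]
```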
You can use the `speaker` field to group tokens into coherent speaker segments in your application or UI.
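One simple way to do this is to merge consecutive tokens that share the same speaker label. The sketch below continues from the illustrative `tokens` list above and joins token text with spaces, which is an assumption about how your tokens are formatted.

```python
def group_by_speaker(tokens):
    """Merge consecutive tokens with the same speaker label into segments."""
    segments = []
    for token in tokens:
        if segments and segments[-1]["speaker"] == token["speaker"]:
            segments[-1]["text"] += " " + token["text"]
        else:
            segments.append({"speaker": token["speaker"], "text": token["text"]})
    return segments

for segment in group_by_speaker(tokens):
    print(f"Speaker {segment['speaker']}: {segment['text']}")
```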
Real-time considerations
Real-time speaker diarization is inherently more challenging due to low-latency constraints. In real-time mode, the model may have less context to distinguish speakers, which can lead to:
- Slightly higher speaker attribution errors
- Temporary speaker switches that stabilize over time
- Potential delays in assigning speaker labels
Despite this, real-time diarization remains highly useful for live transcription, meetings, and voice interfaces.
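If you consume a live token stream, one way to keep a stable speaker-labeled view is to build segments only from finalized tokens, so that provisional speaker switches do not flicker in the UI. The sketch below assumes each incoming token dict exposes `text`, `speaker`, and `is_final` fields; these are illustrative names, not confirmed real-time API fields.

```python
# Sketch of maintaining speaker segments from a live token stream.
class LiveDiarizedTranscript:
    def __init__(self):
        self.final_segments = []  # segments built only from finalized tokens

    def on_token(self, token):
        # Ignore provisional tokens whose speaker label may still change.
        if not token.get("is_final"):
            return
        if self.final_segments and self.final_segments[-1]["speaker"] == token["speaker"]:
            self.final_segments[-1]["text"] += " " + token["text"]
        else:
            self.final_segments.append(
                {"speaker": token["speaker"], "text": token["text"]}
            )
```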
Number of supported speakers
The model supports up to 15 different speakers in a single transcription session. However, as the number of speakers grows, so does the likelihood of similar-sounding voices, which can reduce separation accuracy.
Use cases
| Use case | Description |
|---|---|
| Meeting transcription | Attribute dialogue to participants. |
| Interview transcription | Identify interviewer vs. guest. |
| Medical transcription | Identify doctor vs. patient. |
| Customer support calls | Distinguish agent and caller for training/QA. |
| Podcast editing | Separate hosts and guests for structured transcripts. |
| Legal proceedings | Track speaker statements for accurate documentation. |
Example
This example demonstrates how to transcribe a file with speaker separation and create a speaker-labeled transcript.
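The sketch below outlines one possible end-to-end flow: submit an audio file for asynchronous transcription with diarization enabled, wait for it to finish, and print a speaker-labeled transcript. The endpoints, parameter names, and response fields are assumptions based on a typical REST flow; check the Soniox API reference before using them.

```python
# End-to-end sketch (endpoints, parameters, and response fields are assumptions).
import time
import requests

API_BASE = "https://api.soniox.com/v1"  # assumed base URL
API_KEY = "<YOUR_API_KEY>"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

# 1. Request a transcription of a remote audio file with diarization enabled.
create = requests.post(
    f"{API_BASE}/transcriptions",
    headers=HEADERS,
    json={
        "model": "stt-async-preview",        # assumed model identifier
        "audio_url": "https://example.com/meeting.mp3",
        "enable_speaker_diarization": True,  # assumed parameter name
    },
)
create.raise_for_status()
transcription_id = create.json()["id"]

# 2. Poll until the transcription has completed (or failed).
while True:
    status = requests.get(
        f"{API_BASE}/transcriptions/{transcription_id}", headers=HEADERS
    ).json()
    if status["status"] in ("completed", "error"):
        break
    time.sleep(2)

# 3. Fetch the tokens and group consecutive tokens by speaker label.
result = requests.get(
    f"{API_BASE}/transcriptions/{transcription_id}/transcript", headers=HEADERS
).json()

segments = []
for token in result["tokens"]:
    speaker = token.get("speaker")
    if segments and segments[-1]["speaker"] == speaker:
        segments[-1]["text"] += " " + token["text"]
    else:
        segments.append({"speaker": speaker, "text": token["text"]})

for segment in segments:
    print(f"Speaker {segment['speaker']}: {segment['text']}")
```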