Speaker diarization
Learn how to separate speakers in both real-time and asynchronous processing.
Overview
Soniox Speech-to-Text AI supports speaker diarization, the ability to automatically detect and separate individual speakers within an audio stream. This feature enables you to generate speaker-labeled transcriptions for conversations, meetings, interviews, and other multi-speaker audio content—without the need for manual labeling or additional metadata.
What is speaker diarization?
Speaker diarization answers the question: Who spoke when?
When enabled, Soniox identifies speaker changes throughout the audio and assigns a speaker label (e.g., Speaker-1, Speaker-2) to each token. This allows you to organize the transcription into coherent speaker segments.
Example
Suppose the audio contains three speakers with the following content:
how are you I am fantastic what about you feeling great today hey everyone
With speaker diarization enabled, Soniox may return the transcript like this:
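```text
Speaker-1: How are you?
Speaker-2: I am fantastic. What about you?
Speaker-1: Feeling great today.
Speaker-3: Hey everyone!
```

The grouping and punctuation shown here are illustrative; the actual label assignment depends on the audio.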
This makes it easy to follow multi-speaker conversations and attribute statements accurately.
How to enable speaker diarization
To enable speaker separation, set the following parameter in your API request:
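```python
# Transcription request config (a minimal sketch; the model name is
# illustrative, and parameter names should be checked against the API reference):
config = {
    "model": "stt-async-preview",
    "enable_speaker_diarization": True,
}
```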
Speaker diarization is supported in both:
- Asynchronous transcription
- Real-time transcription
Output format
Each transcribed token includes a `speaker` field when speaker diarization is enabled:
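For example, a token might look like this (a sketch; the fields other than `speaker` are shown for context and may vary by API version):

```python
token = {
    "text": "fantastic",
    "start_ms": 2000,
    "end_ms": 2480,
    "confidence": 0.97,
    "speaker": "2",  # speaker label assigned by diarization
}
```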
You can use the `speaker` field to group tokens into coherent speaker segments in your application or UI.
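A minimal sketch of such grouping, assuming each token's `text` carries its own leading whitespace and `tokens` is the token list from the transcription response:

```python
def group_by_speaker(tokens):
    """Merge consecutive tokens with the same speaker into segments."""
    segments = []  # list of [speaker, text] pairs
    for tok in tokens:
        if segments and segments[-1][0] == tok["speaker"]:
            # Same speaker as the previous token: extend the current segment.
            segments[-1][1] += tok["text"]
        else:
            # Speaker change: start a new segment.
            segments.append([tok["speaker"], tok["text"]])
    return [(speaker, text.strip()) for speaker, text in segments]

for speaker, text in group_by_speaker(tokens):
    print(f"Speaker-{speaker}: {text}")
```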
Real-time considerations
Real-time speaker diarization is inherently more challenging due to low-latency constraints. In real-time mode, the model may have less context to distinguish speakers, which can lead to:
- A slightly higher rate of speaker attribution errors
- Temporary speaker switches that stabilize over time
- Potential delays in assigning speaker labels
Despite this, real-time diarization remains highly useful for live transcription, meetings, and voice interfaces.
Number of supported speakers
The model supports up to 15 distinct speakers in a single transcription session. However, as the number of speakers grows, similar-sounding voices become more likely, which can reduce separation accuracy.
Use cases
| Use case | Description |
|---|---|
| Meeting transcription | Attribute dialogue to participants. |
| Interview transcription | Identify interviewer vs. guest. |
| Medical transcription | Identify doctor vs. patient. |
| Customer support calls | Distinguish agent and caller for training/QA. |
| Podcast editing | Separate hosts and guests for structured transcripts. |
| Legal proceedings | Track speaker statements for accurate documentation. |
Example
This example demonstrates how to transcribe a file with speaker separation and create a speaker-labeled transcription.
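The sketch below uses an asynchronous REST flow with Python's `requests`: upload the file, start a transcription with diarization enabled, poll for completion, then print a speaker-labeled transcript. The endpoint paths, model name, and response fields are assumptions for illustration; consult the API reference for the authoritative names.

```python
import os
import time

import requests

API_BASE = "https://api.soniox.com"  # assumed base URL
headers = {"Authorization": f"Bearer {os.environ['SONIOX_API_KEY']}"}

# 1. Upload the audio file (endpoint path assumed).
with open("conversation.mp3", "rb") as f:
    file_id = requests.post(
        f"{API_BASE}/v1/files", headers=headers, files={"file": f}
    ).json()["id"]

# 2. Request an asynchronous transcription with speaker diarization enabled.
res = requests.post(
    f"{API_BASE}/v1/transcriptions",
    headers=headers,
    json={
        "file_id": file_id,
        "model": "stt-async-preview",        # illustrative model name
        "enable_speaker_diarization": True,
    },
)
transcription_id = res.json()["id"]

# 3. Poll until processing finishes.
while True:
    status = requests.get(
        f"{API_BASE}/v1/transcriptions/{transcription_id}", headers=headers
    ).json()["status"]
    if status in ("completed", "error"):
        break
    time.sleep(1)

# 4. Fetch the tokens and print one line per contiguous speaker segment.
tokens = requests.get(
    f"{API_BASE}/v1/transcriptions/{transcription_id}/transcript", headers=headers
).json()["tokens"]

speaker, line = None, ""
for tok in tokens:
    if tok.get("speaker") != speaker:
        if line:
            print(f"Speaker-{speaker}: {line.strip()}")
        speaker, line = tok.get("speaker"), ""
    line += tok["text"]
if line:
    print(f"Speaker-{speaker}: {line.strip()}")
```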