Speaker diarization

Learn how to separate speakers in both real-time and asynchronous processing.

Overview

Soniox Speech-to-Text AI supports speaker diarization — the ability to automatically detect and separate speakers in an audio stream. This allows you to generate speaker-labeled transcripts for conversations, meetings, interviews, podcasts, and other multi-speaker scenarios — without any manual labeling or extra metadata.


What is speaker diarization?

Speaker diarization answers the question: Who spoke when?

When enabled, Soniox automatically detects speaker changes and assigns each spoken segment to a speaker label (e.g., Speaker 1, Speaker 2). This lets you structure transcripts into clear, speaker-attributed sections.

Example

Input audio:

How are you? I am fantastic. What about you? Feeling great today. Hey everyone!

Output with diarization enabled:

Speaker 1: How are you?
Speaker 2: I am fantastic. What about you?
Speaker 1: Feeling great today.
Speaker 3: Hey everyone!

How to enable speaker diarization

Enable diarization by setting this parameter in your API request:

{
  "enable_speaker_diarization": true
}
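
For context, here is a minimal Python sketch of sending such a request. The endpoint URL, model name, and audio-reference field below are illustrative placeholders, not the exact Soniox request schema; only "enable_speaker_diarization" is taken from this page. Check the API reference for the real request format.

# Illustrative sketch: submit a transcription request with diarization on.
# The endpoint, model, and audio fields are placeholders; consult the
# Soniox API reference for the actual schema.
import requests

API_KEY = "<SONIOX_API_KEY>"

response = requests.post(
    "https://api.soniox.com/v1/transcriptions",  # placeholder endpoint
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "<model-name>",       # placeholder model identifier
        "audio_url": "<audio-url>",    # placeholder audio reference
        "enable_speaker_diarization": True,
    },
)
response.raise_for_status()
print(response.json())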

Output format

When speaker diarization is enabled, each token includes a speaker field:

{"text": "How",    "speaker": "1"}
{"text": " are",   "speaker": "1"}
{"text": " you",   "speaker": "1"}
{"text": "?",      "speaker": "1"}
{"text": "I",      "speaker": "2"}
{"text": " am",    "speaker": "2"}
{"text": " fan",   "speaker": "2"}
{"text": "tastic", "speaker": "2"}
{"text": ".",      "speaker": "2"}

You can group tokens by speaker in your application to create readable segments, or display speaker labels directly in your UI.
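For example, here is a minimal Python sketch of grouping a token list into speaker-labeled segments. The token objects follow the output format shown above; the helper name group_by_speaker is illustrative.

def group_by_speaker(tokens):
    # Collapse a list of {"text", "speaker"} tokens into ordered
    # (speaker, text) segments, starting a new segment on each change.
    segments = []
    for token in tokens:
        if segments and segments[-1][0] == token["speaker"]:
            segments[-1] = (token["speaker"], segments[-1][1] + token["text"])
        else:
            segments.append((token["speaker"], token["text"]))
    return segments

tokens = [
    {"text": "How", "speaker": "1"},
    {"text": " are", "speaker": "1"},
    {"text": " you", "speaker": "1"},
    {"text": "?", "speaker": "1"},
    {"text": "I", "speaker": "2"},
    {"text": " am", "speaker": "2"},
    {"text": " fan", "speaker": "2"},
    {"text": "tastic", "speaker": "2"},
    {"text": ".", "speaker": "2"},
]

for speaker, text in group_by_speaker(tokens):
    print(f"Speaker {speaker}: {text.strip()}")
# Speaker 1: How are you?
# Speaker 2: I am fantastic.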


Real-time considerations

Real-time speaker diarization is more challenging due to low-latency constraints. You may observe:

  • A higher speaker attribution error rate than in asynchronous mode.
  • Temporary speaker switches that stabilize as more context becomes available.

Even with these limitations, real-time diarization is valuable for live meetings, conferences, customer support calls, and conversational AI interfaces.
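
One common way to handle temporary speaker switches is to treat labels on not-yet-finalized tokens as provisional and rebuild the displayed transcript from the latest token list on every update, so a corrected label simply overwrites what was shown before. A minimal Python sketch of that pattern follows; the token structure matches the output format above, and the function name is illustrative.

def render_transcript(tokens):
    # Rebuild the display from scratch on each update, so a provisional
    # speaker label that later flips is corrected in place on screen.
    lines, current = [], None
    for token in tokens:
        if token["speaker"] != current:
            current = token["speaker"]
            lines.append(f"Speaker {current}: {token['text'].lstrip()}")
        else:
            lines[-1] += token["text"]
    return "\n".join(lines)

Calling this with the full, latest token list on every incoming result replaces the rendered transcript wholesale, which is usually simpler and more robust than patching individual lines as labels change.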


Number of supported speakers

  • Up to 15 different speakers are supported per transcription session.
  • Accuracy may decrease when many speakers have similar voice characteristics.

Best practice

For the most accurate and reliable speaker separation, use asynchronous transcription — it provides significantly higher diarization accuracy because the model has access to the full audio context. Real-time diarization is best when you need immediate speaker attribution, but expect lower accuracy due to low-latency constraints.


Supported languages

Speaker diarization is available for all supported languages.