Speaker diarization
Learn how to separate speakers in both real-time and asynchronous processing.
Overview
Soniox Speech-to-Text AI supports speaker diarization, the ability to automatically detect and separate individual speakers within an audio stream. This feature enables you to generate speaker-labeled transcriptions for conversations, meetings, interviews, and other multi-speaker audio content — without the need for manual labeling or additional metadata.
Speaker diarization is designed to work seamlessly in both real-time and asynchronous transcription modes.
What is speaker diarization?
Speaker diarization answers the question: Who spoke when?
When enabled, Soniox identifies speaker changes throughout the audio and assigns a speaker label (e.g., Speaker 1, Speaker 2) to each token. This allows you to organize the transcription into coherent speaker segments.
Example
Suppose the audio contains three speakers with the following content:
how are you I am fantastic what about you feeling great today hey everyone
With speaker diarization enabled, Soniox may return the transcript like this:
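One plausible speaker-labeled output is shown below; the exact attribution of each phrase depends on the voices in the audio, so treat this as an illustration rather than a guaranteed result:

```
Speaker 1: how are you
Speaker 2: I am fantastic what about you
Speaker 1: feeling great today
Speaker 3: hey everyone
```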
This makes it easy to follow multi-speaker conversations and attribute statements accurately.
How to enable speaker diarization
To enable speaker separation, set the following parameter in your API request:
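A minimal request configuration might look like the sketch below. The parameter name `enable_speaker_diarization` and the model identifier are assumptions here; confirm the exact field names against the current Soniox API reference.

```python
# Minimal request configuration sketch (field names are assumptions,
# not confirmed API parameters).
transcription_request = {
    "model": "stt-async-preview",        # assumed model identifier
    "enable_speaker_diarization": True,  # turns speaker diarization on
}
```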
Speaker diarization is supported in both:
- Asynchronous transcription
- Real-time transcription
Output format
Each transcribed token includes a `speaker` field when speaker diarization is enabled:
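The snippet below shows an illustrative (not verbatim) shape of diarized tokens after parsing the JSON response; field names other than `speaker` may differ in practice.

```python
# Illustrative token shape after parsing the response JSON.
# Only "speaker" is the documented field here; the other names are assumptions.
tokens = [
    {"text": "how", "start_ms": 120, "end_ms": 300, "speaker": 1},
    {"text": "are", "start_ms": 300, "end_ms": 420, "speaker": 1},
    {"text": "you", "start_ms": 420, "end_ms": 560, "speaker": 1},
    {"text": "I",   "start_ms": 900, "end_ms": 980, "speaker": 2},
    {"text": "am",  "start_ms": 980, "end_ms": 1100, "speaker": 2},
]
```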
You can use the `speaker` field to group tokens into coherent speaker segments in your application or UI.
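One simple way to do this is to merge consecutive tokens that share the same speaker label. The sketch below continues from the illustrative `tokens` list above and joins token text with spaces, which is an assumption about how your tokens are formatted.

```python
def group_by_speaker(tokens):
    """Merge consecutive tokens with the same speaker label into segments."""
    segments = []
    for token in tokens:
        if segments and segments[-1]["speaker"] == token["speaker"]:
            segments[-1]["text"] += " " + token["text"]
        else:
            segments.append({"speaker": token["speaker"], "text": token["text"]})
    return segments

for segment in group_by_speaker(tokens):
    print(f"Speaker {segment['speaker']}: {segment['text']}")
```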
Real-time considerations
Real-time speaker diarization is inherently more challenging due to low-latency constraints. In real-time mode, the model may have less context to distinguish speakers, which can lead to:
- Slightly higher speaker attribution errors
- Temporary speaker switches that stabilize over time
- Potential delays in assigning speaker labels
Despite this, real-time diarization remains highly useful for live transcription, meetings, and voice interfaces.
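If you consume a live token stream, one way to keep a stable speaker-labeled view is to build segments only from finalized tokens, so that provisional speaker switches do not flicker in the UI. The sketch below assumes each incoming token dict exposes `text`, `speaker`, and `is_final` fields; these are illustrative names, not confirmed real-time API fields.

```python
# Sketch of maintaining speaker segments from a live token stream.
class LiveDiarizedTranscript:
    def __init__(self):
        self.final_segments = []  # segments built only from finalized tokens

    def on_token(self, token):
        # Ignore provisional tokens whose speaker label may still change.
        if not token.get("is_final"):
            return
        if self.final_segments and self.final_segments[-1]["speaker"] == token["speaker"]:
            self.final_segments[-1]["text"] += " " + token["text"]
        else:
            self.final_segments.append(
                {"speaker": token["speaker"], "text": token["text"]}
            )
```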
Number of supported speakers
The model supports up to 15 different speakers in a single transcription session. However, as the number of speakers grows, so does the likelihood of similar-sounding voices, which can reduce separation accuracy.
Use cases
| Use case | Description |
|---|---|
| Meeting transcription | Attribute dialogue to participants. |
| Interview transcription | Identify interviewer vs. guest. |
| Medical transcription | Identify doctor vs. patient. |
| Customer support calls | Distinguish agent and caller for training/QA. |
| Podcast editing | Separate hosts and guests for structured transcripts. |
| Legal proceedings | Track speaker statements for accurate documentation. |
Example
This example demonstrates how to transcribe a file with speaker separation and create a speaker-labeled transcript.
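The sketch below outlines one possible end-to-end flow: submit an audio file for asynchronous transcription with diarization enabled, wait for it to finish, and print a speaker-labeled transcript. The endpoints, parameter names, and response fields are assumptions based on a typical REST flow; check the Soniox API reference before using them.

```python
# End-to-end sketch (endpoints, parameters, and response fields are assumptions).
import time
import requests

API_BASE = "https://api.soniox.com/v1"  # assumed base URL
API_KEY = "<YOUR_API_KEY>"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

# 1. Request a transcription of a remote audio file with diarization enabled.
create = requests.post(
    f"{API_BASE}/transcriptions",
    headers=HEADERS,
    json={
        "model": "stt-async-preview",        # assumed model identifier
        "audio_url": "https://example.com/meeting.mp3",
        "enable_speaker_diarization": True,  # assumed parameter name
    },
)
create.raise_for_status()
transcription_id = create.json()["id"]

# 2. Poll until the transcription has completed (or failed).
while True:
    status = requests.get(
        f"{API_BASE}/transcriptions/{transcription_id}", headers=HEADERS
    ).json()
    if status["status"] in ("completed", "error"):
        break
    time.sleep(2)

# 3. Fetch the tokens and group consecutive tokens by speaker label.
result = requests.get(
    f"{API_BASE}/transcriptions/{transcription_id}/transcript", headers=HEADERS
).json()

segments = []
for token in result["tokens"]:
    speaker = token.get("speaker")
    if segments and segments[-1]["speaker"] == speaker:
        segments[-1]["text"] += " " + token["text"]
    else:
        segments.append({"speaker": speaker, "text": token["text"]})

for segment in segments:
    print(f"Speaker {segment['speaker']}: {segment['text']}")
```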