
Separate speakers

This page explains how to use Soniox Speaker Diarization to get transcriptions with speaker labels.

To get transcriptions with speaker labels, use speaker diarization, which recognizes different speakers in the audio and assigns each recognized token a speaker tag.

Example

Speaker-1: Hi, how are you?
Speaker-2: I am great. What about you?
Speaker-1: Fantastic, thank you.

Speaker diarization recognizes different speakers, but it does not identify the speakers. In the example above, speaker diarization recognized two speakers in the audio ("Speaker-1" and "Speaker-2"), but it does not know who these speakers are.

Also, speaker diarization does not require any additional input to recognize different speakers. Audio input alone is sufficient.

Two modes of operation

Mode | Config field | Description
Global speaker diarization | enable_global_speaker_diarization | Optimized for highest accuracy. In this mode, transcription results will be returned only after all audio has been sent to the service.
Streaming speaker diarization | enable_streaming_speaker_diarization | Optimized for real-time, low-latency transcription. There is no significant added latency.

Speaker diarization is enabled by setting either the enable_global_speaker_diarization or the enable_streaming_speaker_diarization TranscriptionConfig field to true. If low-latency results are not needed, it is recommended to use global speaker diarization to achieve higher accuracy. Note that the accuracy of speech recognition is not affected by enabling speaker diarization.

When speaker diarization is enabled, a valid speaker number (>=1) will be assigned to tokens in the Word.speaker field.

With streaming speaker diarization, speaker recognition has slightly higher latency than speech recognition itself; a non-final token may first be returned with speaker number 0 and, a short time later, be returned again with a valid speaker number.
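
For example, when consuming streaming results you may want to separate tokens that already carry a speaker number from tokens that are still pending. The helper below is a minimal sketch (the name split_by_speaker_assignment is ours, not part of the SDK); it relies only on the Word.speaker field described above.

def split_by_speaker_assignment(result):
    # Tokens with speaker >= 1 have been assigned a speaker;
    # tokens with speaker == 0 are still awaiting assignment.
    assigned = [w for w in result.words if w.speaker >= 1]
    pending = [w for w in result.words if w.speaker == 0]
    return assigned, pending

You could call this helper on each result yielded by transcribe_stream (see the streaming example below) and, for instance, render pending tokens differently until their speaker number arrives.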

Number of speakers

The min_num_speakers and max_num_speakers TranscriptionConfig fields specify the minimum and maximum number of speakers in the audio. Specifying numbers that are closer to the actual number of speakers may result in higher accuracy. It is important not to specify an incorrect range; for example, specifying 1-to-2 or 4-to-5 speakers for an audio with 3 actual speakers is likely to result in much lower accuracy.

If you are not sure about the number of speakers in the audio, it is best not to set exact minimum and maximum values, but rather to keep a "loose" range (see the sketch after the table below).

Field | Default value (if 0) | Permitted values
min_num_speakers | 1 | <= max_num_speakers
max_num_speakers | 10 | <= 20
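
For example, if you expect roughly three speakers but are not certain, a loose range such as 2 to 5 is safer than pinning the count exactly. The snippet below is a minimal sketch; it reuses the test file and en_v2 model from the global example that follows, with only the speaker-range fields changed.

from soniox.transcribe_file import transcribe_file_short
from soniox.speech_service import SpeechClient

# Do not forget to set your API key in the SONIOX_API_KEY environment variable.
with SpeechClient() as client:
    result = transcribe_file_short(
        "../test_data/test_audio_sd.flac",
        client,
        model="en_v2",
        enable_global_speaker_diarization=True,
        # Loose range around an estimate of about 3 speakers.
        min_num_speakers=2,
        max_num_speakers=5,
    )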

Global speaker diarization

global_speaker_diarization.py
from soniox.transcribe_file import transcribe_file_short
from soniox.speech_service import SpeechClient
 
 
# Do not forget to set your API key in the SONIOX_API_KEY environment variable.
def main():
    with SpeechClient() as client:
        result = transcribe_file_short(
            "../test_data/test_audio_sd.flac",
            client,
            model="en_v2",
            enable_global_speaker_diarization=True,
            min_num_speakers=1,
            max_num_speakers=6,
        )
 
    # Print results with each speaker segment on its own line.
 
    speaker = None
    line = ""
 
    for word in result.words:
        if word.speaker != speaker:
            if len(line) > 0:
                print(line)
 
            speaker = word.speaker
            line = f"Speaker {speaker}: "
 
            if word.text == " ":
                # Avoid printing leading space at speaker change.
                continue
 
        line += word.text
 
    print(line)
 
 
if __name__ == "__main__":
    main()

Run

Terminal
python3 global_speaker_diarization.py

Output

Speaker 1: First forward, a nationwide program started ...
Speaker 2: I would love to see all 115 community colleges ...
Speaker 3: If we can make that happen, it'll be fabulous.
Speaker 1: These students say college offers a chance to ...

Streaming speaker diarization

This example demonstrates how to recognize speech and diarize speakers from a live stream in a real-time, low-latency setting. We simulate the stream by reading a file in small chunks.

streaming_speaker_diarization.py
from typing import Iterable
from soniox.transcribe_live import transcribe_stream
from soniox.speech_service import SpeechClient
 
 
def iter_audio() -> Iterable[bytes]:
    # This function should yield audio bytes from your stream.
    # Here we simulate the stream by reading a file in small chunks.
    with open("../test_data/test_audio_sd.flac", "rb") as fh:
        while True:
            audio = fh.read(1024)
            if len(audio) == 0:
                break
            yield audio
 
 
# Do not forget to set your API key in the SONIOX_API_KEY environment variable.
def main():
    with SpeechClient() as client:
        for result in transcribe_stream(
            iter_audio(),
            client,
            model="en_v2_lowlatency",
            include_nonfinal=True,
            enable_streaming_speaker_diarization=True,
        ):
            print(" ".join(f"'{w.text}'/{w.speaker}" for w in result.words))
 
 
if __name__ == "__main__":
    main()

Run

Terminal
python3 streaming_speaker_diarization.py

Output

The script prints recognized tokens with assigned speaker numbers from a live audio stream. Speaker number 0 means that a speaker has not yet been assigned to that token. The sketch after the sample output below shows one way to assemble a final, speaker-segmented transcript from these results.

'First'/0
'First'/1
'First'/1 ' '/1 'forward'/1
'First'/1 ' '/1 'forward'/1
'First'/1 ' '/1 'forward'/1
'First'/1 ' '/1 'forward,'/1 ' '/0 'a'/0
'First'/1 ' '/1 'forward,'/1 ' '/1 'a'/1 ' '/0 'nation'/0
'First'/1 ' '/1 'forward,'/1 ' '/1 'a'/1 ' '/1 'nationwide'/1
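
To turn these streaming results into the same speaker-segmented output as the global example, you can keep only final words and group them by speaker. The following is a minimal sketch, assuming each Word exposes an is_final flag (final vs. non-final tokens when include_nonfinal=True); it repeats iter_audio() from the example above so that it is self-contained.

from typing import Iterable
from soniox.transcribe_live import transcribe_stream
from soniox.speech_service import SpeechClient


def iter_audio() -> Iterable[bytes]:
    # Simulate a live stream by reading a file in small chunks.
    with open("../test_data/test_audio_sd.flac", "rb") as fh:
        while True:
            audio = fh.read(1024)
            if len(audio) == 0:
                break
            yield audio


# Do not forget to set your API key in the SONIOX_API_KEY environment variable.
def main():
    final_words = []
    with SpeechClient() as client:
        for result in transcribe_stream(
            iter_audio(),
            client,
            model="en_v2_lowlatency",
            include_nonfinal=True,
            enable_streaming_speaker_diarization=True,
        ):
            # Keep only final words; non-final words may still change,
            # and their speaker number may still be 0.
            final_words.extend(w for w in result.words if w.is_final)

    # Print one line per speaker segment, as in the global example.
    speaker = None
    line = ""
    for word in final_words:
        if word.speaker != speaker:
            if len(line) > 0:
                print(line)
            speaker = word.speaker
            line = f"Speaker {speaker}: "
            if word.text == " ":
                # Avoid printing a leading space at a speaker change.
                continue
        line += word.text
    print(line)


if __name__ == "__main__":
    main()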
