Separate Speakers#

To get transcriptions with speaker labels, use speaker diarization, which recognizes different speakers in the audio and assigns each recognized token a speaker tag.

Example#

Speaker-1: Hi, how are you?
Speaker-2: I am great. What about you?
Speaker-1: Fantastic, thank you.

Note: Speaker diarization recognizes different speakers, but it does not identify the speakers. In the example above, speaker diarization recognized two speakers in the audio (“Speaker-1” and “Speaker-2”), but it does not know who these speakers are.

Also, speaker diarization does not require any additional input to recognize different speakers. Audio input alone is sufficient.

Two Modes of Operation#

| Mode | Config field | Description |
| --- | --- | --- |
| Global speaker diarization | `enable_global_speaker_diarization` | Optimized for highest accuracy. In this mode, transcription results are returned only after all audio has been sent to the service. |
| Streaming speaker diarization | `enable_streaming_speaker_diarization` | Optimized for real-time, low-latency transcription. There is no significant added latency. |

Speaker diarization is enabled by setting either the enable_global_speaker_diarization or the enable_streaming_speaker_diarization TranscriptionConfig field to true. If low-latency results are not needed, it is recommended to use global speaker diarization in order to achieve higher accuracy. Note that the accuracy of speech recognition is not affected by enabling speaker diarization.

When speaker diarization is enabled, a valid speaker number (>=1) will be assigned to tokens in the Word.speaker field.

With streaming speaker diarization, speaker recognition has slightly greater latency than speech recognition itself; a non-final token may first be returned with speaker number 0 and, a short time later, be returned again with a valid speaker number.
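One simple way to handle this is to ignore tokens that are still awaiting a speaker assignment. The sketch below uses plain `(text, speaker)` tuples as a simplified stand-in for the `Word` objects returned by the service, not the actual API type:

```python
# Simplified stand-in for the Word objects returned by the service:
# (text, speaker). Speaker 0 means streaming diarization has not yet
# assigned a speaker to that (non-final) token.
def assigned_only(words):
    """Keep only tokens that already carry a valid speaker number (>= 1)."""
    return [(text, speaker) for text, speaker in words if speaker >= 1]

words = [("First", 1), (" ", 1), ("forward,", 1), (" ", 0), ("a", 0)]
print(assigned_only(words))
# [('First', 1), (' ', 1), ('forward,', 1)]
```

Tokens dropped this way reappear with a valid speaker number in a later result, so no text is lost.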

Number of Speakers#

The min_num_speakers and max_num_speakers TranscriptionConfig fields specify the minimum and maximum number of speakers in the audio. Specifying numbers that are closer to the actual number of speakers may result in higher accuracy. It is important to not specify an incorrect range, for example, to specify that there are 1-to-2 or 4-to-5 speakers in an audio with 3 actual speakers, as that is likely to result in much lower accuracy.

If you are not sure about the number of speakers in the audio, it is best not to set the exact min and max number of speakers, but rather keep a “loose” range.

| Field | Default Value (if 0) | Permitted Value |
| --- | --- | --- |
| `min_num_speakers` | 1 | <= `max_num_speakers` |
| `max_num_speakers` | 10 | <= 20 |
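The constraints in the table can be summed up in a small check. The helper below is illustrative only and is not part of the Soniox client library:

```python
def valid_speaker_range(min_num_speakers: int, max_num_speakers: int) -> bool:
    """Check a (min, max) speaker range against the permitted values above.

    A value of 0 means "use the default": 1 for min_num_speakers and
    10 for max_num_speakers.
    """
    if min_num_speakers == 0:
        min_num_speakers = 1
    if max_num_speakers == 0:
        max_num_speakers = 10
    return 1 <= min_num_speakers <= max_num_speakers <= 20

print(valid_speaker_range(1, 6))   # True: the range used in the example below
print(valid_speaker_range(4, 25))  # False: max_num_speakers must be <= 20
```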

Global Speaker Diarization#

The following example transcribes a file with global speaker diarization using the transcribe_file_short function.

global_speaker_diarization.py

from soniox.transcribe_file import transcribe_file_short
from soniox.speech_service import SpeechClient


# Do not forget to set your API key in the SONIOX_API_KEY environment variable.
def main():
    with SpeechClient() as client:
        result = transcribe_file_short(
            "../test_data/test_audio_sd.flac",
            client,
            model="en_v2",
            enable_global_speaker_diarization=True,
            min_num_speakers=1,
            max_num_speakers=6,
        )

    # Print results with each speaker segment on its own line.

    speaker = None
    line = ""

    for word in result.words:
        if word.speaker != speaker:
            if len(line) > 0:
                print(line)

            speaker = word.speaker
            line = f"Speaker {speaker}: "

            if word.text == " ":
                # Avoid printing leading space at speaker change.
                continue

        line += word.text

    print(line)


if __name__ == "__main__":
    main()

Run

python3 global_speaker_diarization.py

Output

Speaker 1: First forward, a nationwide program started ...
Speaker 2: I would love to see all 115 community colleges ...
Speaker 3: If we can make that happen, it'll be fabulous.
Speaker 1: These students say college offers a chance to ...

Streaming Speaker Diarization#

This example demonstrates how to recognize speech and diarize speakers from a live stream with real-time, low-latency results. We simulate the stream by reading a file in small chunks.

streaming_speaker_diarization.py

from typing import Iterable
from soniox.transcribe_live import transcribe_stream
from soniox.speech_service import SpeechClient


def iter_audio() -> Iterable[bytes]:
    # This function should yield audio bytes from your stream.
    # Here we simulate the stream by reading a file in small chunks.
    with open("../test_data/test_audio_sd.flac", "rb") as fh:
        while True:
            audio = fh.read(1024)
            if len(audio) == 0:
                break
            yield audio


# Do not forget to set your API key in the SONIOX_API_KEY environment variable.
def main():
    with SpeechClient() as client:
        for result in transcribe_stream(
            iter_audio(),
            client,
            model="en_v2_lowlatency",
            include_nonfinal=True,
            enable_streaming_speaker_diarization=True,
        ):
            print(" ".join(f"'{w.text}'/{w.speaker}" for w in result.words))


if __name__ == "__main__":
    main()

Run

python3 streaming_speaker_diarization.py

Output

The script prints recognized tokens with assigned speaker numbers from a live audio stream. Speaker number 0 means a speaker has not yet been assigned to that token.

'First'/0
'First'/1
'First'/1 ' '/1 'forward'/1
'First'/1 ' '/1 'forward'/1
'First'/1 ' '/1 'forward'/1
'First'/1 ' '/1 'forward,'/1 ' '/0 'a'/0
'First'/1 ' '/1 'forward,'/1 ' '/1 'a'/1 ' '/0 'nation'/0
'First'/1 ' '/1 'forward,'/1 ' '/1 'a'/1 ' '/1 'nationwide'/1