Separate Speakers (Speaker Diarization)

To get transcriptions with speaker labels, use speaker diarization, which recognizes different speakers in the audio and assigns each recognized word a speaker label.

Example

     Speaker-1: Hi, how are you?
     Speaker-2: I am great. What about you?
     Speaker-1: Fantastic, thank you.

Note: Speaker diarization recognizes different speakers, but it does not identify them. In the example above, speaker diarization recognized that there are two different speakers in the audio ("Speaker-1" and "Speaker-2"), but it does not know who these speakers are.

Also, speaker diarization does not require any additional input to recognize different speakers. Audio input alone is sufficient.
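
Because diarization only produces anonymous labels, any real names have to come from the application itself. The following is a minimal sketch of that idea; the label_for helper and the names mapping are hypothetical and not part of the Soniox library.

```python
def label_for(speaker, names):
    """Return a display name for a speaker number, or a generic label if unknown."""
    return names.get(speaker, f"Speaker-{speaker}")


# Hypothetical mapping supplied by the application, not by the service.
names = {1: "Alice"}

print(label_for(1, names))
print(label_for(2, names))
```

Speakers without an entry in the mapping keep their generic "Speaker-N" label.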

Two Modes of Operation

Mode | Config | Description
Global speaker diarization | enable_global_speaker_diarization | Optimized for highest accuracy. In this mode, transcription results will be returned only after all audio has been sent to the service.
Streaming speaker diarization | enable_streaming_speaker_diarization | Optimized for real-time low-latency transcription.

Speaker diarization is enabled by setting enable_global_speaker_diarization or enable_streaming_speaker_diarization to true. If low-latency results are not needed, global speaker diarization is recommended because it achieves the highest possible accuracy. Note that enabling speaker diarization does not affect the accuracy of speech recognition.
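
The choice between the two modes can be captured in a small helper. This is a sketch only: diarization_options is a hypothetical name, and enabling exactly one of the two flags at a time is an assumption based on the description above.

```python
def diarization_options(low_latency):
    """Return the diarization flag to enable for the given latency requirement."""
    if low_latency:
        # Real-time use: results arrive while audio is still being sent.
        return {"enable_streaming_speaker_diarization": True}
    # Otherwise prefer the mode optimized for highest accuracy.
    return {"enable_global_speaker_diarization": True}


print(diarization_options(low_latency=False))
```

The returned dictionary can be expanded into keyword arguments of the transcription call, in the same way the flags are passed in the examples later in this document.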

When speaker diarization is enabled, each recognized word is assigned a valid speaker number (>=1) in the Word.speaker field. In streaming mode, a word may temporarily carry speaker number 0 until a speaker has been assigned to it.
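
Since the speaker label is attached to each word, consecutive words with the same speaker number can be collapsed into speaker turns. The sketch below illustrates this with a stand-in Word class that models only the text and speaker fields; it is not the library's own type.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Word:
    # Hypothetical stand-in for a recognized word; only the two fields
    # used in this document are modeled.
    text: str
    speaker: int


def speaker_turns(words):
    """Collapse consecutive words by the same speaker into (speaker, text) turns."""
    return [
        (speaker, " ".join(w.text for w in group))
        for speaker, group in groupby(words, key=lambda w: w.speaker)
    ]


words = [
    Word("Hi,", 1), Word("how", 1), Word("are", 1), Word("you?", 1),
    Word("I", 2), Word("am", 2), Word("great.", 2),
    Word("Fantastic,", 1), Word("thank", 1), Word("you.", 1),
]

for speaker, text in speaker_turns(words):
    print(f"Speaker-{speaker}: {text}")
```

A speaker who talks again later starts a new turn, so the same speaker number can appear in several turns.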

Number of Speakers

The min_num_speakers and max_num_speakers options tell the service the minimum and maximum number of speakers in the audio. Specifying a range that is closer to the actual number of speakers may result in higher accuracy. It is important not to specify an incorrect range, for example 1-to-2 or 4-to-5 speakers for audio that actually has 3 speakers, as that is likely to result in much lower accuracy.

If you are not sure of the number of speakers in the audio, it is best not to set exact minimum and maximum values, but rather to keep a "loose" range that is certain to contain the actual number.

Field | Default Value | Permitted Value
min_num_speakers | 1 | <=20
max_num_speakers | 10 | >=min_num_speakers
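
One way to keep the range loose is to widen it around a rough estimate. The following sketch is illustrative only: speaker_range and its slack parameter are hypothetical, the lower bound of 1 is assumed from the default value, and the limits are taken from the table above.

```python
MAX_MIN_NUM_SPEAKERS = 20  # from the table above: min_num_speakers <= 20


def speaker_range(estimate=None, slack=2):
    """Build a loose min/max speaker range around an optional estimate."""
    if estimate is None:
        # Unknown speaker count: set nothing and rely on the service defaults.
        return {}
    if estimate < 1:
        raise ValueError("estimate must be at least 1")
    # Widen the range on both sides, staying within the permitted values.
    low = min(max(1, estimate - slack), MAX_MIN_NUM_SPEAKERS)
    high = max(low, estimate + slack)
    return {"min_num_speakers": low, "max_num_speakers": high}


print(speaker_range(3))
```

With an estimate of 3 and the default slack, the range becomes 1 to 5, which still contains the actual count if the estimate is off by one or two.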

Code: Global Speaker Diarization

An example of transcribing a file with global speaker diarization using the transcribe_file_stream function.

global_speaker_diarization.py

from soniox.transcribe_file import transcribe_file_stream
from soniox.speech_service import SpeechClient, set_api_key

set_api_key("<YOUR-API-KEY>")


def main():
    with SpeechClient() as client:
        result = transcribe_file_stream(
            "../test_data/test_audio_sd.flac",
            client,
            enable_global_speaker_diarization=True,
            min_num_speakers=1,
            max_num_speakers=6,
        )

    speaker = None
    for word in result.words:
        if word.speaker != speaker:
            if speaker is not None:
                print()
            speaker = word.speaker
            print(f"Speaker {speaker}: ", end="")
        else:
            print(" ", end="")
        print(word.text, end="")
    print()


if __name__ == "__main__":
    main()

Run

python3 global_speaker_diarization.py

Output

Speaker 1: First forward , a nationwide program started by the ...
Speaker 2: I would love to see all 115 community colleges in the state ...
Speaker 3: All students should have access to these kinds of resources ...
Speaker 1: These students say college offers a chance to change ...

An example of transcribing a file with global speaker diarization using the transcribeStream function.

global_speaker_diarization.js

const fs = require("fs");
const { SpeechClient } = require("@soniox/soniox-node");

// Do not forget to set your Soniox API key.
const speechClient = new SpeechClient();

(async function () {
    const onDataHandler = async (result) => {
        let speaker = "";
        let sentence = "";

        for (const word of result.words) {
            if (word.speaker !== speaker) {
                // Do not print the empty sentence before the first speaker.
                if (sentence) console.log(sentence);
                speaker = word.speaker;
                sentence = `Speaker ${speaker}: ${word.text}`;
            } else {
                sentence += ` ${word.text}`;
            }
        }

        console.log(sentence);
    };

    const onEndHandler = (error) => {
        if (error) {
            console.log(error);
        }
    };

    // transcribeStream() returns an object with ".writeAsync()" and ".end()" methods.
    // Use them to send data and end the stream when done.
    const stream = speechClient.transcribeStream(
        {
            enable_global_speaker_diarization: true,
            min_num_speakers: 1,
            max_num_speakers: 6,
        },
        onDataHandler,
        onEndHandler
    );

    // Here we simulate the stream by reading a file in small chunks.
    const CHUNK_SIZE = 1024;
    const readable = fs.createReadStream("../test_data/test_audio_sd.flac", {
        highWaterMark: CHUNK_SIZE,
    });

    for await (const chunk of readable) {
        await stream.writeAsync(chunk);
    }

    stream.end();
})();

Run

node global_speaker_diarization.js

Output

Speaker 1: First forward , a nationwide program started by the ...
Speaker 2: I would love to see all 115 community colleges in the state ...
Speaker 3: All students should have access to these kinds of resources ...
Speaker 1: These students say college offers a chance to change ...

Code: Streaming Speaker Diarization

This example demonstrates how to recognize speech and diarize speakers from a live stream in real time with low latency. We simulate the stream by reading a file in small chunks.

streaming_speaker_diarization.py

from soniox.transcribe_live import transcribe_capture
from soniox.capture_device import SimulatedCaptureDevice
from soniox.speech_service import SpeechClient, set_api_key

set_api_key("<YOUR-API-KEY>")


def main():
    with SpeechClient() as client:
        sim_capture = SimulatedCaptureDevice("../test_data/test_audio_sd.raw")
        for result in transcribe_capture(
            sim_capture, client, enable_streaming_speaker_diarization=True
        ):
            print(" ".join(f"{w.text}/{w.speaker}" for w in result.words))


if __name__ == "__main__":
    main()

Run

python3 streaming_speaker_diarization.py

Output

The script prints recognized words with their assigned speaker numbers from a live audio stream. A speaker number of 0 means that no speaker has been assigned to that word yet.

First/0
First/1 forward/0
First/1 forward/1
First/1 forward/1
First/1 forward/1 ,/1
First/1 forward/1 ,/1 a/0
First/1 forward/1 ,/1 a/1 nation/0
First/1 forward/1 ,/1 a/1 nationwide/1
First/1 forward/1 ,/1 a/1 nationwide/1 program/1

streaming_speaker_diarization.js

const fs = require("fs");
const { SpeechClient } = require("@soniox/soniox-node");

// Do not forget to set your Soniox API key.
const speechClient = new SpeechClient();

(async function () {
    const onDataHandler = async (result) => {
        console.log(result.words.map((word) => `${word.text}/${word.speaker}`).join(" "));
    };

    const onEndHandler = (error) => {
        if (error) {
            console.log(error);
        }
    };

    // transcribeStream() returns an object with ".writeAsync()" and ".end()" methods.
    // Use them to send data and end the stream when done.
    const stream = speechClient.transcribeStream(
        { 
            enable_streaming_speaker_diarization: true,
            include_nonfinal: true
        },
        onDataHandler,
        onEndHandler
    );

    // Here we simulate the stream by reading a file in small chunks.
    const CHUNK_SIZE = 1024;
    const readable = fs.createReadStream("../test_data/test_audio_sd.flac", {
        highWaterMark: CHUNK_SIZE,
    });

    for await (const chunk of readable) {
        await stream.writeAsync(chunk);
    }

    stream.end();
})();

Run

node streaming_speaker_diarization.js

Output

The script prints recognized words with their assigned speaker numbers from a live audio stream. A speaker number of 0 means that no speaker has been assigned to that word yet.

First/0
First/1 forward/0
First/1 forward/1
First/1 forward/1
First/1 forward/1 ,/1
First/1 forward/1 ,/1 a/0
First/1 forward/1 ,/1 a/1 nation/0
First/1 forward/1 ,/1 a/1 nationwide/1
First/1 forward/1 ,/1 a/1 nationwide/1 program/1
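
An application that displays streaming results may want to treat speaker number 0 separately instead of showing it as a real speaker. The sketch below marks such words with "?"; the Word class is a stand-in that models only the text and speaker fields, and render and all_assigned are hypothetical helpers.

```python
from dataclasses import dataclass


@dataclass
class Word:
    # Hypothetical stand-in for a recognized word.
    text: str
    speaker: int


def render(words):
    """Format words as text/speaker, showing '?' while no speaker is assigned (0)."""
    return " ".join(
        f"{w.text}/{w.speaker if w.speaker > 0 else '?'}" for w in words
    )


def all_assigned(words):
    """Return True once every word carries a speaker number of at least 1."""
    return all(w.speaker > 0 for w in words)


snapshots = [
    [Word("First", 0)],
    [Word("First", 1), Word("forward", 0)],
    [Word("First", 1), Word("forward", 1)],
]

for words in snapshots:
    status = "assigned" if all_assigned(words) else "pending"
    print(f"{render(words)} ({status})")
```

Each snapshot corresponds to one line of the output above; only the last one has a speaker for every word.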