Identify Speakers

Speaker identification associates recognized speakers with unique speaker identities (e.g. Speaker-1 is Mike, Speaker-2 is John). Speaker identification works by matching recognized speakers against pre-registered voice profiles. Note that speaker diarization alone does NOT require voice profiles.

Product                                        Transcript
Speaker Diarization                            Speaker-1: I prescribed an antibiotic for your infection.
                                               Speaker-2: Thank you doctor!
Speaker Diarization + Speaker Identification   Dr. Spiegel: I prescribed an antibiotic for your infection.
                                               Patient Bal: Thank you doctor!

Step 1: Register Voice Profiles

Before using speaker identification, speakers must first be registered using the Soniox speaker management API.

It is easiest to register speakers using the manage_speakers.py command-line application, which is included with the Soniox Python package. Below is an example of adding speakers and their voice samples. To see all capabilities of this application, run it with the --help flag.

# Clone soniox_examples GitHub repository
cd ~/
git clone https://github.com/soniox/soniox_examples.git

# Set API key
export SONIOX_API_KEY="<YOUR-API-KEY>"

# Add speakers with their voice samples
python3 -m soniox.manage_speakers --add_speaker --speaker_name John
python3 -m soniox.manage_speakers --add_audio --speaker_name John --audio_name test --audio_fn ~/soniox_examples/test_data/test_audio_sd_spk1.flac

python3 -m soniox.manage_speakers --add_speaker --speaker_name Judy
python3 -m soniox.manage_speakers --add_audio --speaker_name Judy --audio_name test --audio_fn ~/soniox_examples/test_data/test_audio_sd_spk2.flac

python3 -m soniox.manage_speakers --list

Note that in a real application, speaker names can be arbitrary identifiers.
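For example, registration can be scripted around the same CLI. The sketch below only builds the command lines and prints them rather than executing them; the speaker names and audio paths are hypothetical placeholders:

```python
# Sketch: batch-register speakers via the manage_speakers CLI.
# Speaker names are arbitrary identifiers (e.g. application user IDs);
# the names and audio paths below are hypothetical.
speakers = {
    "user-1042": ["/audio/user-1042/enroll.flac"],
    "user-2077": ["/audio/user-2077/enroll.flac"],
}

def build_commands(speakers):
    """Build one --add_speaker command per speaker, then one
    --add_audio command per voice sample. Commands are not executed here."""
    cmds = []
    for name, audio_files in speakers.items():
        cmds.append(["python3", "-m", "soniox.manage_speakers",
                     "--add_speaker", "--speaker_name", name])
        for i, audio_fn in enumerate(audio_files):
            cmds.append(["python3", "-m", "soniox.manage_speakers",
                         "--add_audio", "--speaker_name", name,
                         "--audio_name", f"sample{i}", "--audio_fn", audio_fn])
    return cmds

for cmd in build_commands(speakers):
    print(" ".join(cmd))
```

From here the commands could be run with subprocess, or the same operations performed directly against the speaker management gRPC API.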

For more information about speaker management, refer to the Speaker Identification gRPC API.

Step 2: Use Speaker Identification

To use speaker identification when transcribing audio, the following must be done:

  • Speaker diarization must be enabled (set either enable_global_speaker_diarization or enable_streaming_speaker_diarization to true).
  • Speaker identification must be enabled by setting enable_speaker_identification to true.
  • The names of registered speakers that may occur in the audio must be specified using the cand_speaker_names field.

At most 50 candidate speakers can be specified. Speaker identification considers only the specified candidate speakers, not all registered speakers.

When speaker identification is enabled, the Result.speakers field provides the mapping from each word's speaker number to a speaker name. Note that this field contains no entries for recognized speakers that were not matched to any of the specified candidate speaker voice profiles. See Response for more info.
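The mapping logic can be sketched with mock objects standing in for the API response; the speaker, name, and text fields match the response structure, while the values are made up:

```python
# Sketch: map word speaker numbers to names via Result.speakers.
# SimpleNamespace objects stand in for the API response types.
from types import SimpleNamespace

result = SimpleNamespace(
    speakers=[SimpleNamespace(speaker=1, name="John"),
              SimpleNamespace(speaker=2, name="Judy")],
    words=[SimpleNamespace(text="Hello", speaker=1),
           SimpleNamespace(text="Hi", speaker=2),
           SimpleNamespace(text="Yes", speaker=3)],  # speaker 3 was not matched
)

speaker_num_to_name = {e.speaker: e.name for e in result.speakers}
for word in result.words:
    # Unmatched speakers have no entry, so fall back to "unknown".
    name = speaker_num_to_name.get(word.speaker, "unknown")
    print(f"{word.text}/{word.speaker}({name})")
```

This prints Hello/1(John), Hi/2(Judy), and Yes/3(unknown), one per line.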

Global Speaker Identification

This example demonstrates how to transcribe a file with speaker identification. Make sure you have successfully completed Step 1 and registered voice profiles for speakers John and Judy. The input audio contains three speakers: two of them (John and Judy) are identified, while the third is recognized but not identified, which is the correct output.

global_speaker_diarization_speaker_id.py

from soniox.transcribe_file import transcribe_file_stream
from soniox.speech_service import SpeechClient, set_api_key

set_api_key("<YOUR-API-KEY>")


def main():
    with SpeechClient() as client:
        result = transcribe_file_stream(
            "../test_data/test_audio_sd.flac",
            client,
            enable_global_speaker_diarization=True,
            min_num_speakers=1,
            max_num_speakers=6,
            enable_speaker_identification=True,
            cand_speaker_names=["John", "Judy"],
        )

    speaker = None
    speaker_num_to_name = {entry.speaker: entry.name for entry in result.speakers}
    for word in result.words:
        if word.speaker != speaker:
            if speaker is not None:
                print()
            speaker = word.speaker
            if speaker in speaker_num_to_name:
                speaker_name = speaker_num_to_name[speaker]
            else:
                speaker_name = "unknown"
            print(f"Speaker {speaker} ({speaker_name}): ", end="")
        else:
            print(" ", end="")
        print(word.text, end="")
    print()


if __name__ == "__main__":
    main()

Run

python3 global_speaker_diarization_speaker_id.py

Output

Speaker 1 (John): First forward , a nationwide program started ...
Speaker 2 (Judy): I would love to see all 115 community colleges ...
Speaker 3 (unknown): All students should have access to these kinds ...
Speaker 1 (John): These students say college offers a chance to change ...

global_speaker_diarization_speaker_id.js

const fs = require("fs");
const { SpeechClient } = require("@soniox/soniox-node");

// Do not forget to set your Soniox API key.
const speechClient = new SpeechClient();

(async function () {
    const onDataHandler = async (result) => {
        let speaker = "";
        let speaker_name = "";
        let sentence = "";
        let speaker_num_to_name = {}
        for (const entry of result.speakers) {
            speaker_num_to_name[entry.speaker] = entry.name
        }

        for (const word of result.words) {
            if (word.speaker !== speaker) {
                speaker = word.speaker;
                if (speaker in speaker_num_to_name) {
                    speaker_name = speaker_num_to_name[speaker]
                } else {
                    speaker_name = "unknown"
                }
                console.log(sentence);
                sentence = `Speaker ${speaker} (${speaker_name}): ${word.text}`;
            } else {
                sentence += ` ${word.text}`;
            }
        }

        console.log(sentence);
    };

    const onEndHandler = (error) => {
        if (error) {
            console.log(error);
        }
    };

    // transcribeStream() returns object with ".writeAsync()" and ".end()" methods.
    // Use them to send data and end the stream when done.
    const stream = speechClient.transcribeStream(
        {
            enable_global_speaker_diarization: true,
            min_num_speakers: 1,
            max_num_speakers: 6,
            enable_speaker_identification: true,
            cand_speaker_names: ["John", "Judy"]
        },
        onDataHandler,
        onEndHandler
    );

    // Here we simulate the stream by reading a file in small chunks.
    const CHUNK_SIZE = 1024;
    const readable = fs.createReadStream("../test_data/test_audio_sd.flac", {
        highWaterMark: CHUNK_SIZE,
    });

    for await (const chunk of readable) {
        await stream.writeAsync(chunk);
    }

    stream.end();
})();

Run

node global_speaker_diarization_speaker_id.js

Output

Speaker 1 (John): First forward , a nationwide program started ...
Speaker 2 (Judy): I would love to see all 115 community colleges ...
Speaker 3 (unknown): All students should have access to these kinds ...
Speaker 1 (John): These students say college offers a chance to change ...

Streaming Speaker Identification

Our API also supports streaming speaker diarization and identification. This example demonstrates how to recognize speech and diarize and identify speakers from a live stream in a real-time, low-latency setting. We simulate the live stream by reading a file in small chunks.

streaming_speaker_diarization_speaker_id.py

from soniox.transcribe_live import transcribe_capture
from soniox.capture_device import SimulatedCaptureDevice
from soniox.speech_service import SpeechClient, set_api_key

set_api_key("<YOUR-API-KEY>")


def main():
    with SpeechClient() as client:
        sim_capture = SimulatedCaptureDevice("../test_data/test_audio_sd.raw")

        for result in transcribe_capture(
            sim_capture,
            client,
            enable_streaming_speaker_diarization=True,
            enable_speaker_identification=True,
            cand_speaker_names=["John", "Judy"],
        ):
            speaker_num_to_name = {entry.speaker: entry.name for entry in result.speakers}

            def get_name(speaker):
                if speaker in speaker_num_to_name:
                    return speaker_num_to_name[speaker]
                else:
                    return "unknown"

            print(
                " ".join(f"{w.text}/{w.speaker}({get_name(w.speaker)})" for w in result.words)
            )


if __name__ == "__main__":
    main()

Run

python3 streaming_speaker_diarization_speaker_id.py

Output

The script prints recognized words with assigned speaker numbers and names from a live audio stream. Speaker number 0 means the speaker has not been assigned yet to that recognized word.

First/0(unknown)
First/1(John) forward/0(unknown)
First/1(John) forward/1(John)
First/1(John) forward/1(John)
First/1(John) forward/1(John) ,/1(John)
First/1(John) forward/1(John) ,/1(John) a/0(unknown)
First/1(John) forward/1(John) ,/1(John) a/1(John) nation/0(unknown)
First/1(John) forward/1(John) ,/1(John) a/1(John) nationwide/1(John)
First/1(John) forward/1(John) ,/1(John) a/1(John) nationwide/1(John) program/1(John)
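The per-word formatting shown above can be reproduced without the API; a minimal sketch with mock words, where speaker number 0 is rendered as unknown:

```python
# Sketch: format streaming words, where speaker 0 means "not yet assigned".
from types import SimpleNamespace

def render(words, speaker_num_to_name):
    """Format each word as text/speaker(name), defaulting to 'unknown'
    for any speaker number without a name mapping (including 0)."""
    return " ".join(
        f"{w.text}/{w.speaker}({speaker_num_to_name.get(w.speaker, 'unknown')})"
        for w in words
    )

names = {1: "John"}
words = [SimpleNamespace(text="First", speaker=1),
         SimpleNamespace(text="forward", speaker=0)]  # not yet assigned
print(render(words, names))  # First/1(John) forward/0(unknown)
```

On each streaming result, rebuilding the name mapping from result.speakers and re-rendering the words yields the progressively refined lines shown above.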

streaming_speaker_diarization_speaker_id.js

const fs = require("fs");
const { SpeechClient } = require("@soniox/soniox-node");

// Do not forget to set your Soniox API key.
const speechClient = new SpeechClient();

(async function () {
    const onDataHandler = async (result) => {
        let speaker_num_to_name = {}
        for (const entry of result.speakers) {
            speaker_num_to_name[entry.speaker] = entry.name
        }

        const getName = (speaker) => {
            if (speaker in speaker_num_to_name) {
                return speaker_num_to_name[speaker]
            } else {
                return "unknown"
            }
        };

        console.log(result.words.map((word) =>
            `${word.text}/${word.speaker}(${getName(word.speaker)})`).join(" ")
        );
    };

    const onEndHandler = (error) => {
        if (error) {
            console.log(error);
        }
    };

    // transcribeStream() returns object with ".writeAsync()" and ".end()" methods.
    // Use them to send data and end the stream when done.
    const stream = speechClient.transcribeStream(
        {
            enable_streaming_speaker_diarization: true,
            enable_speaker_identification: true,
            cand_speaker_names: ["John", "Judy"],
            include_nonfinal: true
        },
        onDataHandler,
        onEndHandler
    );

    // Here we simulate the stream by reading a file in small chunks.
    const CHUNK_SIZE = 1024;
    const readable = fs.createReadStream("../test_data/test_audio_sd.flac", {
        highWaterMark: CHUNK_SIZE,
    });

    for await (const chunk of readable) {
        await stream.writeAsync(chunk);
    }

    stream.end();
})();

Run

node streaming_speaker_diarization_speaker_id.js

Output

The script prints recognized words with assigned speaker numbers and names from a live audio stream. Speaker number 0 means the speaker has not been assigned yet to that recognized word.

First/0(unknown)
First/1(John) forward/0(unknown)
First/1(John) forward/1(John)
First/1(John) forward/1(John)
First/1(John) forward/1(John) ,/1(John)
First/1(John) forward/1(John) ,/1(John) a/0(unknown)
First/1(John) forward/1(John) ,/1(John) a/1(John) nation/0(unknown)
First/1(John) forward/1(John) ,/1(John) a/1(John) nationwide/1(John)
First/1(John) forward/1(John) ,/1(John) a/1(John) nationwide/1(John) program/1(John)