Identify Speakers

Speaker identification associates each recognized speaker with a unique speaker identity (e.g. Speaker-1 is Mike, Speaker-2 is John). It works by using pre-registered voice profiles. Note that speaker diarization on its own does NOT require voice profiles.

Product: Speaker Diarization
Transcript:
  Speaker-1: I prescribed an antibiotic for your infection.
  Speaker-2: Thank you doctor!

Product: Speaker Diarization + Speaker Identification
Transcript:
  Dr. Spiegel: I prescribed an antibiotic for your infection.
  Patient Bal: Thank you doctor!

Step 1: Register Voice Profiles

Before using speaker identification, speakers must be registered using the Speaker Management API.

For testing, the Soniox Python package includes manage_speakers, a command-line tool for speaker management. If you have not installed the Soniox Python package yet, refer to Quickstart (Python Lib).

Below is an example of adding speakers and their voice samples. To see all capabilities of this tool, run it with the --help flag.

# Clone the soniox_examples GitHub repository if you have not already.
git clone https://github.com/soniox/soniox_examples.git

# Enter the test_data directory in soniox_examples with test audio files.
cd soniox_examples/test_data

# Make sure to set your API key.
export SONIOX_API_KEY="<YOUR-API-KEY>"

# Add speakers with their voice samples.
python3 -m soniox.manage_speakers --add_speaker --speaker_name John
python3 -m soniox.manage_speakers --add_audio --speaker_name John --audio_name test --audio_fn test_audio_sd_spk1.flac

python3 -m soniox.manage_speakers --add_speaker --speaker_name Judy
python3 -m soniox.manage_speakers --add_audio --speaker_name Judy --audio_name test --audio_fn test_audio_sd_spk2.flac

python3 -m soniox.manage_speakers --list

Note that in a real application, speaker names can be arbitrary identifiers.
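When several speakers need to be registered, the commands above can be generated rather than typed by hand. The following is a minimal sketch under stated assumptions: build_registration_commands is a hypothetical helper (it is not part of the Soniox package), the names user-0001 and user-0002 are illustrative identifiers, and only the manage_speakers flags shown above are used.

```python
import shlex
from typing import Dict, List


def build_registration_commands(
    samples: Dict[str, str], audio_name: str = "test"
) -> List[List[str]]:
    """Return manage_speakers invocations that add each speaker and one voice sample.

    Hypothetical helper: `samples` maps a speaker name to an audio file name.
    """
    base = ["python3", "-m", "soniox.manage_speakers"]
    commands = []
    for speaker_name, audio_fn in samples.items():
        # First create the speaker, then attach the voice sample to it.
        commands.append(base + ["--add_speaker", "--speaker_name", speaker_name])
        commands.append(
            base
            + ["--add_audio", "--speaker_name", speaker_name]
            + ["--audio_name", audio_name, "--audio_fn", audio_fn]
        )
    # Finish by listing the registered speakers.
    commands.append(base + ["--list"])
    return commands


if __name__ == "__main__":
    samples = {
        "user-0001": "test_audio_sd_spk1.flac",
        "user-0002": "test_audio_sd_spk2.flac",
    }
    for cmd in build_registration_commands(samples):
        # Only prints the commands; pass cmd to subprocess.run(cmd, check=True) to execute.
        print(shlex.join(cmd))
```

The sketch only builds and prints the command lines, so it can be reviewed before anything is sent to the Speaker Management API.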

Step 2: Use Speaker Identification

To use speaker identification when transcribing audio, the following must be done in TranscriptionConfig:

  • Speaker diarization must be enabled (set either enable_streaming_speaker_diarization or enable_global_speaker_diarization to true).
  • Speaker identification must be enabled by setting enable_speaker_identification to true.
  • The names of registered speakers that might occur in the audio must be specified using cand_speaker_names.

The maximum number of specified candidate speakers is 50. Speaker identification only considers the specified candidate speakers, not all the registered speakers.
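The requirements above can be collected in one place before a transcription call. The following is a minimal sketch, not part of the Soniox package: speaker_id_options is a hypothetical helper, and the keyword names mirror those passed in the examples on this page.

```python
from typing import Any, Dict, List

MAX_CANDIDATE_SPEAKERS = 50  # limit stated in the text above


def speaker_id_options(
    cand_speaker_names: List[str], streaming: bool = False
) -> Dict[str, Any]:
    """Build speaker-related keyword arguments (hypothetical helper).

    Enables one diarization mode plus speaker identification and checks
    the candidate speaker limit.
    """
    if len(cand_speaker_names) > MAX_CANDIDATE_SPEAKERS:
        raise ValueError(
            f"{len(cand_speaker_names)} candidate speakers given, "
            f"at most {MAX_CANDIDATE_SPEAKERS} are allowed"
        )
    # Speaker identification requires diarization, so exactly one mode is enabled.
    if streaming:
        diarization_flag = "enable_streaming_speaker_diarization"
    else:
        diarization_flag = "enable_global_speaker_diarization"
    return {
        diarization_flag: True,
        "enable_speaker_identification": True,
        "cand_speaker_names": list(cand_speaker_names),
    }


if __name__ == "__main__":
    print(speaker_id_options(["John", "Judy"]))
```

The returned dictionary is meant to be unpacked into a call such as transcribe_file_short(..., **speaker_id_options(["John", "Judy"])), assuming the function accepts these keyword arguments as in the examples on this page.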

When speaker identification is enabled, the Result.speakers field contains the associations between speaker numbers and speaker names. Note that this field does not contain entries for recognized speakers that were not associated with any of the specified candidate speaker voice profiles. See Transcription Results for more info.
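Because unidentified speakers have no entry, any lookup by speaker number needs a fallback. The following is a minimal sketch of that lookup: SpeakerEntry is a hypothetical stand-in for one entry of Result.speakers with the same two fields (speaker and name), so that the snippet runs without calling the API.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class SpeakerEntry:
    # Hypothetical stand-in for one entry of Result.speakers.
    speaker: int  # speaker number assigned by diarization
    name: str  # name of the matched voice profile


def name_lookup(entries: Iterable[SpeakerEntry]) -> Dict[int, str]:
    """Map speaker numbers to names for the speakers that were identified."""
    return {entry.speaker: entry.name for entry in entries}


def display_name(lookup: Dict[int, str], speaker: int, default: str = "unknown") -> str:
    """Return the identified name, or a default for speakers without an entry."""
    return lookup.get(speaker, default)


if __name__ == "__main__":
    lookup = name_lookup([SpeakerEntry(1, "John"), SpeakerEntry(2, "Judy")])
    for number in (1, 2, 3):
        # Speaker 3 was diarized but matched no candidate profile.
        print(number, display_name(lookup, number))
```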

Global Speaker Identification

This example demonstrates how to transcribe a file with speaker identification. Make sure you have completed Step 1 and registered voice profiles for speakers John and Judy. The input audio has three speakers: two of them are identified (John and Judy), and the third is recognized but not identified. This is the correct output, since only John and Judy are specified as candidate speakers.

global_speaker_diarization_speaker_id.py

from soniox.transcribe_file import transcribe_file_short
from soniox.speech_service import SpeechClient


# Do not forget to set your API key in the SONIOX_API_KEY environment variable.
def main():
    with SpeechClient() as client:
        result = transcribe_file_short(
            "../test_data/test_audio_sd.flac",
            client,
            model="en_v2",
            enable_global_speaker_diarization=True,
            min_num_speakers=1,
            max_num_speakers=6,
            enable_speaker_identification=True,
            cand_speaker_names=["John", "Judy"],
        )

    # Build map from speaker number to name.
    speaker_num_to_name = {entry.speaker: entry.name for entry in result.speakers}

    # Print results with each speaker segment on its own line.

    speaker = None
    line = ""

    for word in result.words:
        if word.speaker != speaker:
            if len(line) > 0:
                print(line)

            speaker = word.speaker

            if speaker in speaker_num_to_name:
                speaker_name = speaker_num_to_name[speaker]
            else:
                speaker_name = "unknown"

            line = f"Speaker {speaker} ({speaker_name}): "

            if word.text == " ":
                continue

        line += word.text

    print(line)


if __name__ == "__main__":
    main()

Run

python3 global_speaker_diarization_speaker_id.py

Output

Speaker 1 (John): First forward, a nationwide program started ...
Speaker 2 (Judy): I would love to see all 115 community colleges ...
Speaker 3 (unknown): All students should have access to these kinds ...
Speaker 1 (John): These students say college offers a chance to change ...

Streaming Speaker Identification

Our API also supports streaming speaker diarization and identification. This example demonstrates how to recognize speech and how to diarize and identify speakers from a live stream in real-time, low-latency settings. We simulate the live stream by reading a file in small chunks.

streaming_speaker_diarization_speaker_id.py

from typing import Iterable
from soniox.transcribe_live import transcribe_stream
from soniox.speech_service import SpeechClient


def iter_audio() -> Iterable[bytes]:
    # This function should yield audio bytes from your stream.
    # Here we simulate the stream by reading a file in small chunks.
    with open("../test_data/test_audio_sd.flac", "rb") as fh:
        while True:
            audio = fh.read(1024)
            if len(audio) == 0:
                break
            yield audio


# Do not forget to set your API key in the SONIOX_API_KEY environment variable.
def main():
    with SpeechClient() as client:
        for result in transcribe_stream(
            iter_audio(),
            client,
            model="en_v2_lowlatency",
            include_nonfinal=True,
            enable_streaming_speaker_diarization=True,
            enable_speaker_identification=True,
            cand_speaker_names=["John", "Judy"],
        ):
            speaker_num_to_name = {entry.speaker: entry.name for entry in result.speakers}

            def get_name(speaker):
                # Speakers that matched no candidate voice profile have no entry.
                return speaker_num_to_name.get(speaker, "unknown")

            print(" ".join(f"'{w.text}'/{w.speaker}({get_name(w.speaker)})" for w in result.words))


if __name__ == "__main__":
    main()

Run

python3 streaming_speaker_diarization_speaker_id.py

Output

The script prints the recognized tokens from a live audio stream, each with its assigned speaker number and name. Speaker number 0 means that no speaker has been assigned to that token yet.

'First'/0(unknown)
'First'/1(John)
'First'/1(John) ' '/1(John) 'forward'/1(John)
'First'/1(John) ' '/1(John) 'forward'/1(John)
'First'/1(John) ' '/1(John) 'forward'/1(John)
'First'/1(John) ' '/1(John) 'forward,'/1(John) ' '/0(unknown) 'a'/0(unknown)
'First'/1(John) ' '/1(John) 'forward,'/1(John) ' '/1(John) 'a'/1(John) ' '/0(unknown) 'nation'/0(unknown)
'First'/1(John) ' '/1(John) 'forward,'/1(John) ' '/1(John) 'a'/1(John) ' '/1(John) 'nationwide'/1(John)