Identify speakers

In this example, we will identify different speakers using Soniox Speaker Identification.

Speaker identification associates each recognized speaker with a unique speaker identity (e.g. Speaker-1 is Mike, Speaker-2 is John). It relies on pre-registered voice profiles.

Note that speaker diarization does NOT require voice profiles.

Speaker diarization

Speaker-1: I prescribed an antibiotic for your infection.
Speaker-2: Thank you doctor!

Speaker diarization + speaker identification

Speaker-1: I prescribed an antibiotic for your infection.
Speaker-2: Thank you doctor!

Register voice profiles

Before using speaker identification, speakers must be registered using the Speaker Management API.

For testing, a command-line tool manage_speakers for speaker management is included in the Soniox Python package. If you have not installed the Soniox Python package yet, refer to Quickstart (Python Lib).

Below is an example of adding speakers and their voice samples. To see all capabilities of this tool, run it with the --help flag.

Terminal
# Clone the soniox_examples GitHub repository if you have not already.
git clone https://github.com/soniox/soniox_examples.git
 
# Enter the test_data directory in soniox_examples with test audio files.
cd soniox_examples/test_data
 
# Make sure to set your API key.
export SONIOX_API_KEY="<YOUR-API-KEY>"
 
# Add speakers with their voice samples.
python3 -m soniox.manage_speakers --add_speaker --speaker_name John
python3 -m soniox.manage_speakers --add_audio --speaker_name John --audio_name test --audio_fn test_audio_sd_spk1.flac
 
python3 -m soniox.manage_speakers --add_speaker --speaker_name Judy
python3 -m soniox.manage_speakers --add_audio --speaker_name Judy --audio_name test --audio_fn test_audio_sd_spk2.flac
 
python3 -m soniox.manage_speakers --list

Note that in a real application, speaker names can be arbitrary identifiers.
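
When several speakers need to be registered, the commands above can be generated from a mapping rather than typed by hand. The following is a minimal sketch that only builds the command lines, using the flags shown in the example above. The helper name `build_registration_commands` and the shape of the mapping are illustrative assumptions, not part of the Soniox package.

```python
import shlex
from typing import Dict, List


def build_registration_commands(
    samples: Dict[str, str], audio_name: str = "test"
) -> List[List[str]]:
    """Build manage_speakers command lines for each speaker.

    `samples` maps a speaker name to the path of one voice sample.
    Only the flags shown in the terminal example above are used.
    """
    base = ["python3", "-m", "soniox.manage_speakers"]
    commands = []
    for speaker_name, audio_fn in samples.items():
        # First create the speaker, then attach the voice sample to it.
        commands.append(base + ["--add_speaker", "--speaker_name", speaker_name])
        commands.append(
            base
            + ["--add_audio", "--speaker_name", speaker_name]
            + ["--audio_name", audio_name, "--audio_fn", audio_fn]
        )
    return commands


def main() -> None:
    samples = {
        "John": "test_audio_sd_spk1.flac",
        "Judy": "test_audio_sd_spk2.flac",
    }
    # Print the commands so they can be reviewed before running them.
    for command in build_registration_commands(samples):
        print(shlex.join(command))


if __name__ == "__main__":
    main()
```

Each returned command is a list of arguments, so it could be passed to `subprocess.run(command, check=True)` once the Soniox Python package is installed and `SONIOX_API_KEY` is set.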

Use speaker identification

To use speaker identification when transcribing audio, the following must be done in TranscriptionConfig:

  • Speaker diarization must be enabled (set either enable_streaming_speaker_diarization or enable_global_speaker_diarization to true).
  • Speaker identification must be enabled by setting enable_speaker_identification to true.
  • The names of registered speakers that might occur in the audio must be specified using cand_speaker_names.

The maximum number of specified candidate speakers is 50. Speaker identification only considers the specified candidate speakers, not all the registered speakers.
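
The requirements above can be checked on the client side before a request is sent. The sketch below is one possible way to do that. The function `check_speaker_id_options` is hypothetical, its keyword names mirror those used in the Python examples on this page, and treating an empty candidate list as a problem is an assumption based on the requirement that candidates must be specified.

```python
from typing import List

MAX_CANDIDATE_SPEAKERS = 50  # Limit stated in the text above.


def check_speaker_id_options(
    enable_streaming_speaker_diarization: bool,
    enable_global_speaker_diarization: bool,
    enable_speaker_identification: bool,
    cand_speaker_names: List[str],
) -> List[str]:
    """Return a list of problems; an empty list means the options look consistent."""
    problems = []
    if not (enable_streaming_speaker_diarization or enable_global_speaker_diarization):
        problems.append("speaker diarization must be enabled")
    if not enable_speaker_identification:
        problems.append("enable_speaker_identification must be true")
    if not cand_speaker_names:
        # Assumption: at least one candidate is needed for identification.
        problems.append("no candidate speaker names were given")
    if len(cand_speaker_names) > MAX_CANDIDATE_SPEAKERS:
        problems.append(
            f"{len(cand_speaker_names)} candidates given, "
            f"maximum is {MAX_CANDIDATE_SPEAKERS}"
        )
    return problems


if __name__ == "__main__":
    print(check_speaker_id_options(False, True, True, ["John", "Judy"]))
```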

When speaker identification is enabled, the Result.speakers field contains the associations between speaker numbers and speaker names. Note that this field has no entry for a recognized speaker that was not matched to any of the specified candidate voice profiles. See Transcription Results for more info.
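
Because unmatched speakers have no entry, it can be useful to list which diarized speakers remain unidentified. The sketch below uses small stand-in classes instead of the real result types; only the `speaker`, `name`, and `text` fields used in the examples on this page are assumed, and the helper `unidentified_speakers` is hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass
class SpeakerEntry:
    # Stand-in for an entry of Result.speakers.
    speaker: int
    name: str


@dataclass
class Word:
    # Stand-in for a recognized word or token.
    text: str
    speaker: int


def unidentified_speakers(
    words: Iterable[Word], speakers: Iterable[SpeakerEntry]
) -> Set[int]:
    """Return speaker numbers that occur in words but have no name entry."""
    identified = {entry.speaker for entry in speakers}
    # Speaker number 0 is skipped: it marks tokens with no speaker assigned yet.
    seen = {word.speaker for word in words if word.speaker != 0}
    return seen - identified


if __name__ == "__main__":
    words = [Word("Hello", 1), Word("Hi", 2), Word("Hey", 3)]
    speakers = [SpeakerEntry(1, "John"), SpeakerEntry(2, "Judy")]
    print(unidentified_speakers(words, speakers))
```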

Global speaker identification

This example demonstrates how to transcribe a file with speaker identification. Make sure you have registered the voice profiles for speakers John and Judy as described in Register voice profiles. The input audio has three speakers. Two of them are identified (John and Judy); the third is recognized but not identified, which is the correct output because no voice profile was registered for that speaker.

global_speaker_diarization_speaker_id.py
from soniox.transcribe_file import transcribe_file_short
from soniox.speech_service import SpeechClient
 
 
# Do not forget to set your API key in the SONIOX_API_KEY environment variable.
def main():
    with SpeechClient() as client:
        result = transcribe_file_short(
            "../test_data/test_audio_sd.flac",
            client,
            model="en_v2",
            enable_global_speaker_diarization=True,
            min_num_speakers=1,
            max_num_speakers=6,
            enable_speaker_identification=True,
            cand_speaker_names=["John", "Judy"],
        )
 
    # Build map from speaker number to name.
    speaker_num_to_name = {entry.speaker: entry.name for entry in result.speakers}
 
    # Print results with each speaker segment on its own line.
 
    speaker = None
    line = ""
 
    for word in result.words:
        if word.speaker != speaker:
            if len(line) > 0:
                print(line)
 
            speaker = word.speaker
 
            # Speakers without a matching voice profile have no entry in the map.
            speaker_name = speaker_num_to_name.get(speaker, "unknown")
 
            line = f"Speaker {speaker} ({speaker_name}): "
 
            if word.text == " ":
                continue
 
        line += word.text
 
    print(line)
 
 
if __name__ == "__main__":
    main()

Run

Terminal
python3 global_speaker_diarization_speaker_id.py

Output

Speaker 1 (John): First forward, a nationwide program started ...
Speaker 2 (Judy): I would love to see all 115 community colleges ...
Speaker 3 (unknown): All students should have access to these kinds ...
Speaker 1 (John): These students say college offers a chance to change ...

Streaming speaker identification

Our API also supports streaming speaker diarization and identification. This example demonstrates how to recognize speech and to diarize and identify speakers from a live stream in real time with low latency. We simulate the live stream by reading a file in small chunks.

streaming_speaker_diarization_speaker_id.py
from typing import Iterable
from soniox.transcribe_live import transcribe_stream
from soniox.speech_service import SpeechClient
 
 
def iter_audio() -> Iterable[bytes]:
    # This function should yield audio bytes from your stream.
    # Here we simulate the stream by reading a file in small chunks.
    with open("../test_data/test_audio_sd.flac", "rb") as fh:
        while True:
            audio = fh.read(1024)
            if len(audio) == 0:
                break
            yield audio
 
 
# Do not forget to set your API key in the SONIOX_API_KEY environment variable.
def main():
    with SpeechClient() as client:
        for result in transcribe_stream(
            iter_audio(),
            client,
            model="en_v2_lowlatency",
            include_nonfinal=True,
            enable_streaming_speaker_diarization=True,
            enable_speaker_identification=True,
            cand_speaker_names=["John", "Judy"],
        ):
            speaker_num_to_name = {entry.speaker: entry.name for entry in result.speakers}
 
            def get_name(speaker):
                if speaker in speaker_num_to_name:
                    return speaker_num_to_name[speaker]
                else:
                    return "unknown"
 
            print(" ".join(f"'{w.text}'/{w.speaker}({get_name(w.speaker)})" for w in result.words))
 
 
if __name__ == "__main__":
    main()

Run

Terminal
python3 streaming_speaker_diarization_speaker_id.py

Output

The script prints the recognized tokens from a live audio stream, each with its assigned speaker number and name. Speaker number 0 means that no speaker has been assigned to that token yet.

'First'/0(unknown)
'First'/1(John)
'First'/1(John) ' '/1(John) 'forward'/1(John)
'First'/1(John) ' '/1(John) 'forward'/1(John)
'First'/1(John) ' '/1(John) 'forward'/1(John)
'First'/1(John) ' '/1(John) 'forward,'/1(John) ' '/0(unknown) 'a'/0(unknown)
'First'/1(John) ' '/1(John) 'forward,'/1(John) ' '/1(John) 'a'/1(John) ' '/0(unknown) 'nation'/0(unknown)
'First'/1(John) ' '/1(John) 'forward,'/1(John) ' '/1(John) 'a'/1(John) ' '/1(John) 'nationwide'/1(John)