7. Speaker AI

In this example, we will use Soniox Speaker AI features:

  • Speaker diarization distinguishes speakers in the audio, assigning a speaker number to each recognized word.
  • Speaker identification associates speaker numbers with named speakers based on voice samples provided in advance.

Learn more about the differences between Speaker Diarization and Speaker Identification.

Speaker Diarization

Soniox Speaker AI supports two modes of speaker diarization:

  • Global speaker diarization is optimized for highest accuracy. In this mode, transcription results will be returned only after all audio has been sent to the service.
  • Streaming speaker diarization is optimized for real-time low-latency transcription.

To enable speaker diarization in a given mode, set the corresponding option to true: enable_global_speaker_diarization or enable_streaming_speaker_diarization. If low-latency results are not needed, use global speaker diarization to achieve the highest possible accuracy. Note that enabling speaker diarization does not affect the accuracy of speech recognition.
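The choice between the two modes can be sketched as a small helper. The function name diarization_options is hypothetical (it is not part of the Soniox package), and passing the result as keyword arguments to a transcription function is an assumption based on the option names above.

```python
def diarization_options(low_latency: bool) -> dict:
    """Return keyword arguments that enable exactly one diarization mode.

    Hypothetical helper: only the two option names come from the text above.
    """
    if low_latency:
        # Real-time use: results arrive while audio is still being sent.
        return {"enable_streaming_speaker_diarization": True}
    # Offline use: results arrive after all audio has been sent.
    return {"enable_global_speaker_diarization": True}


print(diarization_options(low_latency=False))
# {'enable_global_speaker_diarization': True}
```

The returned dictionary could then be unpacked into a transcription call with **, assuming the call accepts these options as keyword arguments.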

When speaker diarization is enabled, a valid speaker number (>=1) will be returned with each transcribed word through the speaker field.
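As a rough illustration of how the per-word speaker number might be consumed, the sketch below collapses consecutive words from the same speaker into turns. The Word class is a stand-in defined only for this example; it assumes that real result words expose text and speaker attributes.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Word:
    # Stand-in for a transcribed word; only the two fields used here.
    text: str
    speaker: int


def speaker_turns(words):
    """Collapse consecutive words by the same speaker into (speaker, text) turns."""
    return [
        (speaker, " ".join(w.text for w in group))
        for speaker, group in groupby(words, key=lambda w: w.speaker)
    ]


words = [Word("Hello", 1), Word("there", 1), Word("Hi", 2), Word("Bye", 1)]
for speaker, text in speaker_turns(words):
    print(f"Speaker {speaker}: {text}")
# Speaker 1: Hello there
# Speaker 2: Hi
# Speaker 1: Bye
```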

The min_num_speakers and max_num_speakers options specify the minimum and maximum expected number of speakers in the audio. If these options are not specified (i.e., are zero), they default to 1 and 10, respectively. Specifying a range closer to the actual number of speakers may result in higher accuracy. Do not specify a range that excludes the actual number of speakers (for example, 1-to-2 or 4-to-5 speakers for audio with 3 speakers), as that is likely to result in much lower accuracy. The maximum permitted value of max_num_speakers is 20.
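The defaulting and limit rules above can be checked locally before a request is sent. This is a minimal sketch: resolve_speaker_range is a hypothetical helper, and the errors it raises are a client-side sanity check, not a description of how the service itself reacts to invalid values.

```python
DEFAULT_MIN_SPEAKERS = 1     # used when min_num_speakers is zero
DEFAULT_MAX_SPEAKERS = 10    # used when max_num_speakers is zero
MAX_PERMITTED_SPEAKERS = 20  # upper limit for max_num_speakers


def resolve_speaker_range(min_num_speakers=0, max_num_speakers=0):
    """Apply the documented defaults and reject obviously invalid ranges."""
    lo = min_num_speakers or DEFAULT_MIN_SPEAKERS
    hi = max_num_speakers or DEFAULT_MAX_SPEAKERS
    if lo < 1:
        raise ValueError("min_num_speakers must be at least 1")
    if hi > MAX_PERMITTED_SPEAKERS:
        raise ValueError(f"max_num_speakers must not exceed {MAX_PERMITTED_SPEAKERS}")
    if lo > hi:
        raise ValueError("min_num_speakers must not exceed max_num_speakers")
    return lo, hi


print(resolve_speaker_range())      # (1, 10)
print(resolve_speaker_range(2, 4))  # (2, 4)
```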

Below is an example of transcribing a file with global speaker diarization using the transcribe_file_stream function.

from soniox.transcribe_file import transcribe_file_stream
from soniox.speech_service import Client, set_api_key
from soniox.test_data import TEST_AUDIO_SD_FLAC

set_api_key("<YOUR-API-KEY>")

def main():
    with Client() as client:
        results = list(
            transcribe_file_stream(
                TEST_AUDIO_SD_FLAC,
                client,
                enable_global_speaker_diarization=True,
                min_num_speakers=1,
                max_num_speakers=6,
            )
        )

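    # Print the transcript, starting a new line whenever the speaker changes.
    speaker = None
    for result in results:
        for word in result.words:
            if word.speaker != speaker:
                # Speaker changed: end the previous line and start a new one.
                if speaker is not None:
                    print()
                speaker = word.speaker
                print(f"Speaker {speaker}: ", end="")
            else:
                # Same speaker: separate words with a space.
                print(" ", end="")
            print(word.text, end="")
    print()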

if __name__ == "__main__":
    main()

Run!

examples/global_speaker_diarization.py GitHub

python3 global_speaker_diarization.py

Speaker Identification

Before using speaker identification, known speakers must first be registered using the Soniox speaker management API.

The easiest way to register speakers is the manage_speakers command-line application, which is included with the Soniox Python package. Below is an example of adding speakers and their voice samples. Note that in a real application, speaker names can be arbitrary identifiers. To see all capabilities of this application, run it with the --help option.

python3 -m soniox.examples.manage_speakers --add_speaker --speaker_name John
python3 -m soniox.examples.manage_speakers --add_audio --speaker_name John --audio_name test --audio_fn ~/soniox_python/soniox/test_data_files/test_audio_sd_spk1.flac

python3 -m soniox.examples.manage_speakers --add_speaker --speaker_name Judy
python3 -m soniox.examples.manage_speakers --add_audio --speaker_name Judy --audio_name test --audio_fn ~/soniox_python/soniox/test_data_files/test_audio_sd_spk2.flac

python3 -m soniox.examples.manage_speakers --list

For more information about speaker management, refer to the Speaker Identification gRPC API.

To use speaker identification when transcribing audio, the following must be done:

  • Speaker diarization must be enabled (either global or streaming mode).
  • Speaker identification must be enabled by setting the enable_speaker_identification option to true.
  • The names of candidate speakers must be specified using the candidate_speaker_names option.

The maximum permitted number of candidate speakers is 50. Speaker identification only considers the specified candidate speakers, not other registered speakers.
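The identification options and the candidate limit can be assembled in one place. In this sketch, identification_options is a hypothetical helper; rejecting an empty candidate list is an assumption made here as a local check, while the option names and the limit of 50 come from the text above.

```python
MAX_CANDIDATE_SPEAKERS = 50  # limit stated in the text above


def identification_options(candidate_speaker_names):
    """Build speaker identification options after a local sanity check."""
    names = list(candidate_speaker_names)
    if not names:
        # Assumption: identification without candidates is treated as an error.
        raise ValueError("at least one candidate speaker name is required")
    if len(names) > MAX_CANDIDATE_SPEAKERS:
        raise ValueError(
            f"at most {MAX_CANDIDATE_SPEAKERS} candidate speakers are permitted"
        )
    return {
        "enable_speaker_identification": True,
        "candidate_speaker_names": names,
    }


options = identification_options(["John", "Judy"])
print(options["candidate_speaker_names"])  # ['John', 'Judy']
```

Speaker diarization still has to be enabled separately, since identification builds on the speaker numbers that diarization assigns.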

When speaker identification is enabled, the speakers field in each result lists the determined speaker-number-to-speaker-name associations. This field does not contain entries for speakers in the audio that were not associated with a candidate speaker.

Below is an example of transcribing a file with speaker identification.

from soniox.transcribe_file import transcribe_file_stream
from soniox.speech_service import Client, set_api_key
from soniox.test_data import TEST_AUDIO_SD_FLAC

set_api_key("<YOUR-API-KEY>")

def main():
    with Client() as client:
        results = list(
            transcribe_file_stream(
                TEST_AUDIO_SD_FLAC,
                client,
                enable_global_speaker_diarization=True,
                min_num_speakers=1,
                max_num_speakers=6,
                enable_speaker_identification=True,
                candidate_speaker_names=["John", "Judy"],
            )
        )

    # Print the transcript with speaker names, one line per speaker turn.
    speaker = None
    speaker_num_to_name = {}
    for result in results:
        # Map speaker numbers to the names identified in this result.
        speaker_num_to_name = {entry.speaker: entry.name for entry in result.speakers}
        for word in result.words:
            if word.speaker != speaker:
                if speaker is not None:
                    print()
                speaker = word.speaker
                # Speakers not matched to a candidate are labeled "unknown".
                if speaker in speaker_num_to_name:
                    speaker_name = speaker_num_to_name[speaker]
                else:
                    speaker_name = "unknown"
                print(f"Speaker {speaker} ({speaker_name}): ", end="")
            else:
                print(" ", end="")
            print(word.text, end="")
    print()

if __name__ == "__main__":
    main()

Run!

examples/global_speaker_diarization_speaker_id.py GitHub

python3 global_speaker_diarization_speaker_id.py