Separate speakers
This page explains how to use Soniox Speaker Diarization to get transcriptions with speaker labels.
Speaker diarization recognizes different speakers in the audio and assigns each recognized token a speaker tag.
Example
Speaker diarization recognizes different speakers, but it does not identify the speakers. In the example above, speaker diarization recognized two speakers in the audio ("Speaker-1" and "Speaker-2"), but it does not know who these speakers are.
Also, speaker diarization does not require any additional input to recognize different speakers. Audio input alone is sufficient.
Two modes of operation
Mode | Config field | Description |
---|---|---|
Global speaker diarization | enable_global_speaker_diarization | Optimized for highest accuracy. In this mode, transcription results will be returned only after all audio has been sent to the service. There is no significant added latency. |
Streaming speaker diarization | enable_streaming_speaker_diarization | Optimized for real-time low-latency transcription. There is no significant added latency. |
Speaker diarization is enabled by setting either the enable_global_speaker_diarization or the enable_streaming_speaker_diarization TranscriptionConfig field to true.
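As a rough sketch, the choice between the two modes can be expressed as a small helper that returns the flag to set. The helper name diarization_fields and the use of a plain dictionary are assumptions made for illustration; only the two field names come from this page, and how they are applied to TranscriptionConfig depends on the client library you use.

```python
def diarization_fields(low_latency: bool) -> dict:
    """Return the diarization flag for the chosen mode.

    Hypothetical helper: only the two field names are taken from this page.
    """
    if low_latency:
        # Real-time use: streaming speaker diarization.
        return {"enable_streaming_speaker_diarization": True}
    # Low latency not needed: global speaker diarization.
    return {"enable_global_speaker_diarization": True}


print(diarization_fields(low_latency=False))
```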
If low-latency results are not needed, use global speaker diarization to achieve higher accuracy.
Note that the accuracy of speech recognition is not affected by enabling speaker diarization.
When speaker diarization is enabled, a valid speaker number (>=1) will be assigned to tokens in the Word.speaker field.
With streaming speaker diarization, speaker recognition has a slightly greater latency than speech recognition itself; a token might first be returned as non-final with speaker number 0 and a short time later be returned with a valid speaker number.
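The following minimal sketch shows one way to turn tokens with speaker numbers into speaker-labeled lines. The Token class is a hypothetical stand-in for a recognized token, not the service's actual response type, and joining tokens with spaces assumes each token is a whole word. Tokens whose speaker number is still 0 are grouped under a placeholder label.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Token:
    """Hypothetical stand-in for a recognized token."""
    text: str
    speaker: int  # 0 = not assigned yet, >= 1 = valid speaker number


def speaker_label(speaker: int) -> str:
    # Speaker number 0 means no speaker has been assigned to the token yet.
    return f"Speaker-{speaker}" if speaker >= 1 else "Speaker-?"


def to_turns(tokens):
    """Group consecutive tokens with the same speaker number into labeled lines."""
    lines = []
    for speaker, group in groupby(tokens, key=lambda t: t.speaker):
        text = " ".join(t.text for t in group)
        lines.append(f"{speaker_label(speaker)}: {text}")
    return lines


tokens = [
    Token("Hello", 1), Token("there", 1),
    Token("Hi", 2),
    Token("how", 0), Token("are", 0),
]
for line in to_turns(tokens):
    print(line)
```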
Number of speakers
The min_num_speakers and max_num_speakers TranscriptionConfig fields specify the minimum and maximum number of speakers in the audio.
Specifying numbers that are closer to the actual number of speakers may result in higher accuracy. It is important not to specify an incorrect range: for example, specifying 1-to-2 or 4-to-5 speakers for audio with 3 actual speakers is likely to result in much lower accuracy.
If you are not sure about the number of speakers in the audio, it is best not to set the exact min and max number of speakers, but rather keep a "loose" range.
Field | Default Value (if 0) | Permitted Value |
---|---|---|
min_num_speakers | 1 | <= max_num_speakers |
max_num_speakers | 10 | <= 20 |
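The table can be read as a small validation routine. The sketch below is a client-side check under stated assumptions, not the service's own validation: the function name resolve_speaker_range is hypothetical, rejecting negative values is an assumption, and the range check is applied after the defaults for 0 have been substituted.

```python
DEFAULT_MIN_SPEAKERS = 1   # used when min_num_speakers is 0
DEFAULT_MAX_SPEAKERS = 10  # used when max_num_speakers is 0
MAX_ALLOWED_SPEAKERS = 20  # upper bound for max_num_speakers


def resolve_speaker_range(min_num_speakers: int = 0, max_num_speakers: int = 0):
    """Substitute the defaults for 0 and check the permitted values from the table."""
    if min_num_speakers < 0 or max_num_speakers < 0:
        # Assumption: negative values are treated as invalid.
        raise ValueError("speaker counts must not be negative")
    lo = min_num_speakers or DEFAULT_MIN_SPEAKERS
    hi = max_num_speakers or DEFAULT_MAX_SPEAKERS
    if hi > MAX_ALLOWED_SPEAKERS:
        raise ValueError(f"max_num_speakers must be <= {MAX_ALLOWED_SPEAKERS}")
    if lo > hi:
        raise ValueError("min_num_speakers must be <= max_num_speakers")
    return lo, hi


print(resolve_speaker_range())      # both fields left at 0
print(resolve_speaker_range(2, 4))  # a "loose" range around 3 expected speakers
```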
Global speaker diarization
Streaming speaker diarization
This example demonstrates how to recognize speech and diarize speakers from a live stream in real-time and low-latency settings. We simulate the stream by reading a file in small chunks.
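Simulating a stream from a file can be sketched with a plain generator, independent of any client library. The function name iter_chunks, the chunk size, and the optional delay are illustrative assumptions; the in-memory buffer stands in for an audio file opened in binary mode.

```python
import io
import time
from typing import BinaryIO, Iterator


def iter_chunks(stream: BinaryIO, chunk_size: int = 1024,
                delay_s: float = 0.0) -> Iterator[bytes]:
    """Yield successive chunks of at most chunk_size bytes until the stream ends."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk
        if delay_s > 0:
            # Optional pause between chunks to mimic real-time arrival.
            time.sleep(delay_s)


# Stand-in for open("audio_file", "rb"): 2500 bytes of dummy data.
fake_audio = io.BytesIO(b"\x00" * 2500)
sizes = [len(chunk) for chunk in iter_chunks(fake_audio, chunk_size=1024)]
print(sizes)  # [1024, 1024, 452]
```

Each yielded chunk would then be sent to the service in order, with the last chunk typically shorter than the rest.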
streaming_speaker_diarization.py
Run
Output
The script prints recognized tokens with assigned speaker numbers from a live audio stream. Speaker number 0 means the speaker has not been assigned yet to that token.