Separate Speakers#
To get transcriptions with speaker labels, use speaker diarization, which recognizes different speakers in the audio and assigns a speaker tag to each recognized token.
Example#
Note: Speaker diarization recognizes different speakers, but it does not identify the speakers. In the example above, speaker diarization recognized two speakers in the audio (“Speaker-1” and “Speaker-2”), but it does not know who these speakers are.
Also, speaker diarization does not require any additional input to recognize different speakers. Audio input alone is sufficient.
Two Modes of Operation#
Mode | Config field | Description |
---|---|---|
Global speaker diarization | enable_global_speaker_diarization | Optimized for highest accuracy. In this mode, transcription results will be returned only after all audio has been sent to the service. |
Streaming speaker diarization | enable_streaming_speaker_diarization | Optimized for real-time low-latency transcription. There is no significant added latency. |
Speaker diarization is enabled by setting either the enable_global_speaker_diarization or the enable_streaming_speaker_diarization TranscriptionConfig field to true.
If low-latency results are not needed, it is recommended to use global speaker diarization in order to achieve higher accuracy.
Note that the accuracy of speech recognition is not affected by enabling speaker diarization.
When speaker diarization is enabled, a valid speaker number (>= 1) will be assigned to tokens in the Word.speaker field.
With streaming speaker diarization, speaker recognition has slightly greater latency than speech recognition itself; a non-final token might first be returned with speaker number 0 and, a short time later, be returned again with a valid speaker number.
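For example, a client consuming streaming results might separate tokens that already have a valid speaker from those still pending. A minimal sketch (the dicts below are hypothetical stand-ins for Word objects, which carry the text and speaker fields described above):

```python
# Minimal sketch: split tokens into those with an assigned speaker
# (speaker >= 1) and those still pending assignment (speaker == 0).
def split_by_speaker_status(words):
    """Return (assigned, pending) lists of (text, speaker) tuples."""
    assigned = [(w["text"], w["speaker"]) for w in words if w["speaker"] >= 1]
    pending = [(w["text"], w["speaker"]) for w in words if w["speaker"] == 0]
    return assigned, pending

# Hypothetical tokens standing in for Word objects:
words = [
    {"text": "First", "speaker": 1},
    {"text": " ", "speaker": 1},
    {"text": "forward", "speaker": 0},  # speaker not assigned yet
]
assigned, pending = split_by_speaker_status(words)
print(assigned)  # [('First', 1), (' ', 1)]
print(pending)   # [('forward', 0)]
```

In a real client, pending tokens would simply be re-checked on the next result, since the service re-sends them with a valid speaker number shortly afterward.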
Number of Speakers#
The min_num_speakers and max_num_speakers TranscriptionConfig fields specify the minimum and maximum number of speakers in the audio.
Specifying numbers close to the actual number of speakers may result in higher accuracy. It is important not to specify an incorrect range (for example, 1-to-2 or 4-to-5 speakers for audio that actually contains 3 speakers), as that is likely to result in much lower accuracy.
If you are not sure about the number of speakers in the audio, it is best not to set exact min and max values, but rather keep a "loose" range.
Field | Default Value (if 0) | Permitted Value |
---|---|---|
min_num_speakers | 1 | <=max_num_speakers |
max_num_speakers | 10 | <=20 |
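The defaults and limits in the table can be checked client-side before sending a request. A minimal illustrative helper (not part of the Soniox API; the defaults below mirror the table):

```python
# Minimal sketch: resolve and validate a speaker range against the
# documented constraints (defaults 1 and 10 apply when a field is 0,
# min must not exceed max, and max must not exceed 20).
def resolve_speaker_range(min_num_speakers=0, max_num_speakers=0):
    min_n = min_num_speakers or 1   # default when 0
    max_n = max_num_speakers or 10  # default when 0
    if not (1 <= min_n <= max_n <= 20):
        raise ValueError("require 1 <= min_num_speakers <= max_num_speakers <= 20")
    return min_n, max_n

print(resolve_speaker_range())      # (1, 10) -- a "loose" default range
print(resolve_speaker_range(2, 6))  # (2, 6)
```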
Global Speaker Diarization#
An example of transcribing a file with global speaker diarization using the transcribe_file_short function.
from soniox.transcribe_file import transcribe_file_short
from soniox.speech_service import SpeechClient

# Do not forget to set your API key in the SONIOX_API_KEY environment variable.

def main():
    with SpeechClient() as client:
        result = transcribe_file_short(
            "../test_data/test_audio_sd.flac",
            client,
            model="en_v2",
            enable_global_speaker_diarization=True,
            min_num_speakers=1,
            max_num_speakers=6,
        )

        # Print results with each speaker segment on its own line.
        speaker = None
        line = ""
        for word in result.words:
            if word.speaker != speaker:
                if len(line) > 0:
                    print(line)
                speaker = word.speaker
                line = f"Speaker {speaker}: "
                if word.text == " ":
                    # Avoid printing leading space at speaker change.
                    continue
            line += word.text
        print(line)

if __name__ == "__main__":
    main()
Run
python3 global_speaker_diarization.py
Output
Speaker 1: First forward, a nationwide program started ...
Speaker 2: I would love to see all 115 community colleges ...
Speaker 3: If we can make that happen, it'll be fabulous.
Speaker 1: These students say college offers a chance to ...
An example of transcribing a file with global speaker diarization using the transcribeFileShort function.
const fs = require("fs");
const { SpeechClient } = require("@soniox/soniox-node");

// Do not forget to set your API key in the SONIOX_API_KEY environment variable.
const speechClient = new SpeechClient();

(async function () {
  const result = await speechClient.transcribeFileShort(
    "../test_data/test_audio_sd.flac",
    {
      model: "en_v2",
      enable_global_speaker_diarization: true,
      min_num_speakers: 1,
      max_num_speakers: 6,
    }
  );

  // Print results with each speaker segment on its own line.
  let speaker = 0;
  let line = "";
  for (const word of result.words) {
    if (word.speaker !== speaker) {
      if (line.length > 0) {
        console.log(line);
      }
      speaker = word.speaker;
      line = `Speaker ${speaker}: `;
      if (word.text == " ") {
        // Avoid printing leading space at speaker change.
        continue;
      }
    }
    line += word.text;
  }
  console.log(line);
})();
Run
node global_speaker_diarization.js
Output
Speaker 1: First forward, a nationwide program started ...
Speaker 2: I would love to see all 115 community colleges ...
Speaker 3: If we can make that happen, it'll be fabulous.
Speaker 1: These students say college offers a chance to ...
Streaming Speaker Diarization#
This example demonstrates how to recognize speech and diarize speakers from a live stream in real time with low latency. We simulate the stream by reading a file in small chunks.
streaming_speaker_diarization.py
from typing import Iterable
from soniox.transcribe_live import transcribe_stream
from soniox.speech_service import SpeechClient

def iter_audio() -> Iterable[bytes]:
    # This function should yield audio bytes from your stream.
    # Here we simulate the stream by reading a file in small chunks.
    with open("../test_data/test_audio_sd.flac", "rb") as fh:
        while True:
            audio = fh.read(1024)
            if len(audio) == 0:
                break
            yield audio

# Do not forget to set your API key in the SONIOX_API_KEY environment variable.

def main():
    with SpeechClient() as client:
        for result in transcribe_stream(
            iter_audio(),
            client,
            model="en_v2_lowlatency",
            include_nonfinal=True,
            enable_streaming_speaker_diarization=True,
        ):
            print(" ".join(f"'{w.text}'/{w.speaker}" for w in result.words))

if __name__ == "__main__":
    main()
Run
python3 streaming_speaker_diarization.py
Output
The script prints recognized tokens with assigned speaker numbers from a live audio stream. Speaker number 0 means a speaker has not yet been assigned to that token.
'First'/0
'First'/1
'First'/1 ' '/1 'forward'/1
'First'/1 ' '/1 'forward'/1
'First'/1 ' '/1 'forward'/1
'First'/1 ' '/1 'forward,'/1 ' '/0 'a'/0
'First'/1 ' '/1 'forward,'/1 ' '/1 'a'/1 ' '/0 'nation'/0
'First'/1 ' '/1 'forward,'/1 ' '/1 'a'/1 ' '/1 'nationwide'/1
streaming_speaker_diarization.js
const fs = require("fs");
const { SpeechClient } = require("@soniox/soniox-node");

// Do not forget to set your API key in the SONIOX_API_KEY environment variable.
const speechClient = new SpeechClient();

(async function () {
  const onDataHandler = async (result) => {
    console.log(
      result.words.map((word) => `'${word.text}'/${word.speaker}`).join(" ")
    );
  };

  const onEndHandler = (error) => {
    if (error) {
      console.log(`Transcription error: ${error}`);
    }
  };

  // transcribeStream() returns an object with ".writeAsync()" and ".end()" methods.
  // Use them to send data and end the stream when done.
  const stream = speechClient.transcribeStream(
    {
      model: "en_v2_lowlatency",
      include_nonfinal: true,
      enable_streaming_speaker_diarization: true,
    },
    onDataHandler,
    onEndHandler
  );

  // Here we simulate the stream by reading a file in small chunks.
  const CHUNK_SIZE = 1024;
  const readable = fs.createReadStream("../test_data/test_audio_sd.flac", {
    highWaterMark: CHUNK_SIZE,
  });
  for await (const chunk of readable) {
    await stream.writeAsync(chunk);
  }
  stream.end();
})();
Run
node streaming_speaker_diarization.js
Output
The script prints recognized tokens with assigned speaker numbers from a live audio stream. Speaker number 0 means a speaker has not yet been assigned to that token.
'First'/0
'First'/1
'First'/1 ' '/1 'forward'/1
'First'/1 ' '/1 'forward'/1
'First'/1 ' '/1 'forward'/1
'First'/1 ' '/1 'forward,'/1 ' '/0 'a'/0
'First'/1 ' '/1 'forward,'/1 ' '/1 'a'/1 ' '/0 'nation'/0
'First'/1 ' '/1 'forward,'/1 ' '/1 'a'/1 ' '/1 'nationwide'/1
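Once tokens have valid speaker numbers, the streaming output above can be post-processed into speaker-labeled lines, just like in the global examples. A minimal sketch (the (text, speaker) tuples are hypothetical stand-ins for Word objects):

```python
# Minimal sketch: turn (text, speaker) tokens, as printed by the
# streaming examples above, into one line per speaker segment,
# skipping tokens whose speaker is still 0 (not yet assigned).
def format_speaker_lines(tokens):
    lines = []
    current_speaker = None
    line = ""
    for text, speaker in tokens:
        if speaker == 0:
            continue  # speaker not assigned yet; wait for a later result
        if speaker != current_speaker:
            if line:
                lines.append(line)
            current_speaker = speaker
            line = f"Speaker {speaker}: "
            if text == " ":
                # Avoid printing leading space at speaker change.
                continue
        line += text
    if line:
        lines.append(line)
    return lines

tokens = [("First", 1), (" ", 1), ("forward,", 1), (" ", 1), ("a", 1),
          (" ", 2), ("nationwide", 2)]
print(format_speaker_lines(tokens))
# ['Speaker 1: First forward, a', 'Speaker 2: nationwide']
```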