Multi-Channel Audio#

Some audio recordings have multiple channels. For example, a recording of a phone call between two people may contain two channels, with each line recorded separately.

By default, multiple audio channels are mixed into a single channel before transcription, but it is also possible to transcribe each channel individually.

To transcribe audio with separate recognition per channel, specify the following TranscriptionConfig fields:

  1. Set num_audio_channels to the number of channels in your audio.
  2. Set enable_separate_recognition_per_channel to true.

Separate recognition per channel is supported by all transcription APIs: Transcribe (short audio), TranscribeAsync (files), and TranscribeStream (live streams). With separate recognition per channel enabled, Transcribe and TranscribeAsync return a list of Result objects, one per channel, instead of a single result. In all cases, each Result object has a channel field indicating which channel it belongs to.

The maximum number of channels for separate recognition per channel is 4. If you require a higher limit, contact our support team.

Using separate recognition per channel increases transcription cost. Transcribing audio with N channels and separate recognition per channel is billed the same as transcribing N times that duration of audio without it. For example, a 10-minute two-channel recording is billed as 20 minutes of audio.

Transcribe Short Audio#

In this example, we will transcribe a short audio file (< 60 seconds) with two channels using separate recognition per channel.

transcribe_file_short_separate_recognition.py

from soniox.transcribe_file import transcribe_file_short
from soniox.speech_service import SpeechClient


# Do not forget to set your API key in the SONIOX_API_KEY environment variable.
def main():
    with SpeechClient() as client:
        channel_results = transcribe_file_short(
            "../test_data/test_audio_multi_channel.flac",
            client,
            model="en_v2",
            num_audio_channels=2,
            enable_separate_recognition_per_channel=True,
        )
        for result in channel_results:
            print(f"Channel {result.channel}: " + "".join(word.text for word in result.words))


if __name__ == "__main__":
    main()

Note that transcribe_file_short() now returns a list of Result objects, one for each channel.

Run

python3 transcribe_file_short_separate_recognition.py

Output

Channel 0: But there is always a stronger sense of life. And now he is pouring down his beams
Channel 1: He was two years out from the east.

Transcribe Files#

transcribe_file_async_separate_recognition.py

The code is nearly identical to the previous example, except that num_audio_channels and enable_separate_recognition_per_channel must be set in the transcribe_file_async() call.

Note that GetTranscribeAsyncResult() returns a list of Result objects (one for each channel) when separate recognition per channel is enabled.
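
Since the full script is not reproduced here, the sketch below illustrates the overall flow. It is based on the client method names that appear in the output further down (GetTranscribeAsyncFileStatus, GetTranscribeAsyncResult, DeleteTranscribeAsyncFile); the exact helper signatures, the shape of the status object, and the reference_name argument are assumptions and may differ in your SDK version.

from time import sleep

from soniox.transcribe_file import transcribe_file_async
from soniox.speech_service import SpeechClient


# Do not forget to set your API key in the SONIOX_API_KEY environment variable.
def main():
    with SpeechClient() as client:
        # Upload the file for asynchronous transcription. The multi-channel
        # keyword arguments mirror the short-audio example; reference_name is
        # an assumed argument used to label the uploaded file.
        print("Uploading file.")
        file_id = transcribe_file_async(
            "../test_data/test_audio_multi_channel.flac",
            client,
            reference_name="test_audio_multi_channel",
            model="en_v2",
            num_audio_channels=2,
            enable_separate_recognition_per_channel=True,
        )
        print(f"File ID: {file_id}")

        # Poll until the file has been transcribed. Depending on the SDK
        # version, the status may be a plain string or an object; adjust the
        # attribute access accordingly. You may also want to handle error
        # statuses here.
        while True:
            print("Calling GetTranscribeAsyncFileStatus.")
            status = client.GetTranscribeAsyncFileStatus(file_id)
            print(f"Status: {status.status}")
            if status.status == "COMPLETED":
                break
            sleep(2.0)

        # Retrieve one Result per channel and print each transcript.
        print("Calling GetTranscribeAsyncResult")
        channel_results = client.GetTranscribeAsyncResult(file_id)
        for result in channel_results:
            print(f"Channel {result.channel}: " + "".join(word.text for word in result.words))

        # Delete the uploaded file when done.
        print("Calling DeleteTranscribeAsyncFile.")
        client.DeleteTranscribeAsyncFile(file_id)


if __name__ == "__main__":
    main()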

Run

python3 transcribe_file_async_separate_recognition.py

Output

Uploading file.
File ID: 3476
Calling GetTranscribeAsyncFileStatus.
Status: QUEUED
Calling GetTranscribeAsyncFileStatus.
Status: TRANSCRIBING
Calling GetTranscribeAsyncFileStatus.
Status: COMPLETED
Calling GetTranscribeAsyncResult
Channel 0: But there is always a stronger sense of life. And now he is pouring down his beams
Channel 1: He was two years out from the east.
Calling DeleteTranscribeAsyncFile.

Transcribe Streams#

When transcribing a multi-channel stream with separate recognition per channel, all channels are transcribed in parallel, in real time and with low latency. This makes it especially suitable for transcribing a meeting with a fixed number of participants, where each participant's channel is transcribed independently of the others to achieve the highest accuracy.

In this example, we will transcribe a live stream with two channels, each channel being transcribed independently. We will simulate the stream by reading a file in small chunks.

transcribe_stream_separate_recognition.py

from typing import Iterable
from soniox.transcribe_live import transcribe_stream
from soniox.speech_service import SpeechClient


def iter_audio() -> Iterable[bytes]:
    # This function should yield audio bytes from your stream.
    # Here we simulate the stream by reading a file in small chunks.
    with open("../test_data/test_audio_multi_channel.flac", "rb") as fh:
        while True:
            audio = fh.read(1024)
            if len(audio) == 0:
                break
            yield audio


# Do not forget to set your API key in the SONIOX_API_KEY environment variable.
def main():
    with SpeechClient() as client:
        for result in transcribe_stream(
            iter_audio(),
            client,
            model="en_v2_lowlatency",
            include_nonfinal=True,
            num_audio_channels=2,
            enable_separate_recognition_per_channel=True,
        ):
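            # Results for both channels are interleaved; result.channel
            # indicates which channel each (non-final or final) result is for.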
            print(f"Channel {result.channel}: " + "".join(w.text for w in result.words))


if __name__ == "__main__":
    main()

Run

python3 transcribe_stream_separate_recognition.py

Output

Channel 0: But
Channel 1:
Channel 0: But there
Channel 1:
Channel 1:
Channel 0: But there is
Channel 1:
Channel 0: But there is always
Channel 1:
Channel 0: But there is always a
Channel 1:
Channel 0: But there is always a strong
Channel 1:
Channel 1:
Channel 0: But there is always a stronger
Channel 1:
Channel 0: But there is always a stronger sense
Channel 1:
Channel 0: But there is always a stronger sense of
Channel 1:
Channel 0: But there is always a stronger sense of life
Channel 0: But there is always a stronger sense of life
Channel 1:
Channel 0: But there is always a stronger sense of life
Channel 1: He was
Channel 0: But there is always a stronger sense of life
Channel 1: He was
Channel 0: But there is always a stronger sense of life
Channel 1: He was two