Transcribe Streams
In this example, we transcribe a stream in bidirectional streaming mode. We simulate the stream by reading a file in small chunks, which demonstrates how to transcribe any stream of data, including real-time streams.
from typing import Iterable

from soniox.transcribe_live import transcribe_stream
from soniox.speech_service import SpeechClient


def iter_audio() -> Iterable[bytes]:
    # This function should yield audio bytes from your stream.
    # Here we simulate the stream by reading a file in small chunks.
    with open("../test_data/test_audio_long.flac", "rb") as fh:
        while True:
            audio = fh.read(1024)
            if len(audio) == 0:
                break
            yield audio


# Do not forget to set your API key in the SONIOX_API_KEY environment variable.
def main():
    with SpeechClient() as client:
        for result in transcribe_stream(
            iter_audio(),
            client,
            model="en_v2_lowlatency",  # Do not forget to specify the model!
            include_nonfinal=True,
        ):
            print("".join(w.text for w in result.words))


if __name__ == "__main__":
    main()
To transcribe any stream, you need to provide an iterable over successive audio chunks. In our example, we define a generator function iter_audio that reads audio chunks from a file.

We start transcription by calling transcribe_stream(), which returns an iterable over transcription results. We iterate over it to obtain the results as soon as they become available.
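For a genuinely real-time source, the same generator pattern works if the chunks are pulled from a thread-safe queue that an audio capture callback fills. The sketch below is illustrative: the names audio_queue and the None end-of-stream sentinel are our own conventions, not part of the Soniox API.

```python
import queue
from typing import Iterable

# Thread-safe buffer that an audio capture callback (e.g. a microphone
# library running in another thread) would fill with raw audio bytes.
audio_queue: "queue.Queue" = queue.Queue()


def iter_audio() -> Iterable[bytes]:
    # Yield chunks as they arrive; a None sentinel marks the end of the stream.
    while True:
        chunk = audio_queue.get()
        if chunk is None:
            break
        yield chunk


# A capture thread would call audio_queue.put(chunk) for each buffer of raw
# audio, and audio_queue.put(None) when recording stops.
```

The generator blocks on audio_queue.get() until data is available, so transcribe_stream() naturally paces itself to the real-time source.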
Run
python3 transcribe_any_stream.py
Output
But
But there
But there is always
But there is always
But there is always a
But there is always a stronger
But there is always a stronger sense
But there is always a stronger sense of
const fs = require("fs");
const { SpeechClient } = require("@soniox/soniox-node");

// Do not forget to set your API key in the SONIOX_API_KEY environment variable.
const speechClient = new SpeechClient();

(async function () {
  const onDataHandler = async (result) => {
    console.log(result.words.map((word) => word.text).join(""));
  };

  const onEndHandler = (error) => {
    if (error) {
      console.log(`Transcription error: ${error}`);
    }
  };

  // transcribeStream() returns an object with ".writeAsync()" and ".end()" methods.
  // Use them to send data and end the stream when done.
  const stream = speechClient.transcribeStream(
    {
      model: "en_v2_lowlatency", // Do not forget to specify the model!
      include_nonfinal: true,
    },
    onDataHandler,
    onEndHandler
  );

  // Here we simulate the stream by reading a file in small chunks.
  const CHUNK_SIZE = 1024;
  const readable = fs.createReadStream("../test_data/test_audio_long.flac", {
    highWaterMark: CHUNK_SIZE,
  });

  for await (const chunk of readable) {
    await stream.writeAsync(chunk);
  }

  stream.end();
})();
To transcribe any stream, first start the stream transcription by calling speechClient.transcribeStream(), providing the transcription configuration and the required user-defined callbacks. This returns an object representing the stream (stream). Then call await stream.writeAsync(chunk) for successive audio chunks as they become available. At the end, call stream.end() to indicate the end of audio.

Consecutive transcription results are delivered through the onDataHandler callback. When the transcription has finished, the user-supplied onEndHandler callback is called. Any error is indicated by the error argument of this callback.
Run
node transcribe_any_stream.js
Output
But
But there
But there is always
But there is always
But there is always a
But there is always a stronger
But there is always a stronger sense
But there is always a stronger sense of
using System.Runtime.CompilerServices;
using Soniox.Client;
using Soniox.Client.Proto;

// Do not forget to set your API key in the SONIOX_API_KEY environment variable.
using var client = new SpeechClient();

// TranscribeStream requires the user to provide the audio to transcribe
// as an IAsyncEnumerable<byte[]> instance. This can be implemented as
// an async function that uses "yield return". This example function
// reads a file in chunks.
async IAsyncEnumerable<byte[]> EnumerateAudioChunks(
    [EnumeratorCancellation] CancellationToken cancellationToken = default(CancellationToken)
)
{
    string filePath = "../../test_data/test_audio_long.flac";
    int bufferSize = 1024;
    await using var fileStream = new FileStream(
        filePath, FileMode.Open, FileAccess.Read, FileShare.Read,
        bufferSize: bufferSize, useAsync: true
    );
    while (true)
    {
        byte[] buffer = new byte[bufferSize];
        int numRead = await fileStream.ReadAsync(buffer, cancellationToken);
        if (numRead == 0)
        {
            break;
        }
        Array.Resize(ref buffer, numRead);
        yield return buffer;
    }
}

IAsyncEnumerable<Result> resultsEnumerable = client.TranscribeStream(
    EnumerateAudioChunks(),
    new TranscriptionConfig
    {
        Model = "en_v2_lowlatency", // Do not forget to specify the model!
        IncludeNonfinal = true,
    });

await foreach (var result in resultsEnumerable)
{
    // Note: result.Words contains both final and non-final tokens;
    // we do not print this distinction in this example.
    var text = string.Join("", result.Words.Select(word => word.Text).ToArray());
    Console.WriteLine(text);
}
To transcribe any stream, you need to provide an async generator (IAsyncEnumerable<byte[]>) over successive audio chunks. In our example, we define a generator function EnumerateAudioChunks that reads audio chunks from a file.

We start transcription by calling TranscribeStream(), which returns an async iterable over transcription results (IAsyncEnumerable<Result>). We iterate over it to obtain the results as soon as they become available.
Run
cd TranscribeAnyStream
dotnet run
Output
But
But there
But there is always
But there is always
But there is always a
But there is always a stronger
But there is always a stronger sense
But there is always a stronger sense of
Minimizing Latency
When transcribing a real-time stream, the lowest latency is achieved with raw audio encoded as PCM 16-bit little-endian (pcm_s16le) at a 16 kHz sample rate. The examples below show how to transcribe such audio.
transcribe_any_stream_audio_format.py
for result in transcribe_stream(
    iter_audio(),
    client,
    model="en_v2_lowlatency",
    include_nonfinal=True,
    audio_format="pcm_s16le",
    sample_rate_hertz=16000,
    num_audio_channels=1,
):
transcribe_any_stream_audio_format.js
const stream = speechClient.transcribeStream(
  {
    model: "en_v2_lowlatency",
    audio_format: "pcm_s16le",
    sample_rate_hertz: 16000,
    num_audio_channels: 1,
    include_nonfinal: true,
  },
  onDataHandler,
  onEndHandler
);
TranscribeAnyStreamAudioFormat.cs
IAsyncEnumerable<Result> resultsEnumerable = client.TranscribeStream(
    EnumerateAudioChunks(),
    new TranscriptionConfig
    {
        Model = "en_v2_lowlatency",
        IncludeNonfinal = true,
        AudioFormat = "pcm_s16le",
        SampleRateHertz = 16000,
        NumAudioChannels = 1,
    });
It is possible to use other PCM formats and configurations, as listed here, at the cost of a small increase in latency.
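As a concrete illustration of the pcm_s16le format, the snippet below synthesizes a short 440 Hz tone as 16-bit little-endian mono PCM at 16 kHz using only the Python standard library, and slices it into 1024-byte chunks (32 ms of audio at this rate, mirroring the chunk size used in the examples above). It is a sketch of how raw PCM chunks for iter_audio() might be produced, not part of the Soniox API.

```python
import math
import struct

SAMPLE_RATE = 16000   # samples per second
CHUNK_BYTES = 1024    # 512 samples = 32 ms of s16le mono audio at 16 kHz


def make_pcm_s16le(freq_hz=440.0, seconds=1.0, sample_rate=SAMPLE_RATE):
    # Pack each sample as a signed 16-bit little-endian integer ("<h"),
    # at half of full scale to leave headroom.
    samples = (
        int(32767 * 0.5 * math.sin(2 * math.pi * freq_hz * t / sample_rate))
        for t in range(int(seconds * sample_rate))
    )
    return b"".join(struct.pack("<h", s) for s in samples)


def iter_pcm_chunks(pcm: bytes, chunk_bytes=CHUNK_BYTES):
    # Yield successive fixed-size chunks; the last chunk may be shorter.
    for i in range(0, len(pcm), chunk_bytes):
        yield pcm[i:i + chunk_bytes]
```

A generator like iter_pcm_chunks(make_pcm_s16le()) could stand in for iter_audio() when audio_format="pcm_s16le", sample_rate_hertz=16000, and num_audio_channels=1 are set.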