Transcribe Streams

In this example, we will transcribe a stream in bidirectional streaming mode. We will simulate the stream by reading a file in small chunks, which demonstrates how to transcribe any stream of data, including real-time streams.

transcribe_any_stream.py

from typing import Iterable
from soniox.transcribe_live import transcribe_stream
from soniox.speech_service import SpeechClient, set_api_key

set_api_key("<YOUR-API-KEY>")


def iter_audio() -> Iterable[bytes]:
    # This function should yield audio bytes from your stream.

    # Here we simulate the stream by reading a file in small chunks.
    with open("../test_data/test_audio_long.flac", "rb") as fh:
        while True:
            audio = fh.read(1024)
            if len(audio) == 0:
                break
            yield audio


def main():
    with SpeechClient() as client:
        for result in transcribe_stream(iter_audio(), client):
            print(" ".join(w.text for w in result.words))


if __name__ == "__main__":
    main()

To transcribe any stream, you only need to define a generator over the audio chunks from the stream. In our example, we simulate this by reading audio chunks from a file. We then pass this generator to transcribe_stream(), which returns transcription results as soon as they become available.
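For a genuinely real-time source, the same generator interface still applies. Below is a minimal sketch of adapting a callback-based audio source (for example, a microphone library that invokes a callback with each captured buffer) to the generator expected by transcribe_stream(). The AudioBridge class and on_audio method are illustrative names, not part of the Soniox API.

```python
import queue
from typing import Iterable, Optional


class AudioBridge:
    """Bridges a callback-based audio source to a chunk generator."""

    def __init__(self) -> None:
        # A None item is used as an end-of-stream sentinel.
        self._chunks: "queue.Queue[Optional[bytes]]" = queue.Queue()

    def on_audio(self, chunk: bytes) -> None:
        # Called by the audio capture thread with each new buffer.
        self._chunks.put(chunk)

    def close(self) -> None:
        # Signal the end of the stream.
        self._chunks.put(None)

    def iter_audio(self) -> Iterable[bytes]:
        # Yield chunks until the sentinel is seen.
        while True:
            chunk = self._chunks.get()
            if chunk is None:
                break
            yield chunk
```

You would then pass bridge.iter_audio() to transcribe_stream() in place of the file-based generator above, while the capture library feeds bridge.on_audio() from its own thread.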

Run

python3 transcribe_any_stream.py

Output

But
But there
But there is
But there is always
But there is always a
But there is always a strong
But there is always a stronger
But there is always a stronger sense
But there is always a stronger sense of
But there is always a stronger sense of life
But there is always a stronger sense of life

transcribe_any_stream.js

const fs = require("fs");
const { SpeechClient } = require("@soniox/soniox-node");

// Do not forget to set your Soniox API key.
const speechClient = new SpeechClient();

(async function () {
    const onDataHandler = async (result) => {
        console.log(`Words: ${result.words.map((word) => word.text).join(" ")}`);
    };

    const onEndHandler = (error) => {
        console.log("END!", error);
    };

    // transcribeStream() returns an object with ".writeAsync()" and ".end()" methods.
    // Use them to send data and end the stream when done.
    const stream = speechClient.transcribeStream(
        { include_nonfinal: true },
        onDataHandler,
        onEndHandler
    );

    // Here we simulate the stream by reading a file in small chunks.
    const CHUNK_SIZE = 1024;
    const readable = fs.createReadStream("../test_data/test_audio_long.flac", {
        highWaterMark: CHUNK_SIZE,
    });

    for await (const chunk of readable) {
        await stream.writeAsync(chunk);
    }

    stream.end();
})();

To transcribe any stream, you only need to define a generator over the audio chunks from the stream. In our example, we simulate this by reading audio chunks from a file. As audio chunks become available, they are passed to the stream.writeAsync() function for transcription, and as soon as transcription results become available, onDataHandler() is called.

Run

node transcribe_any_stream.js

Output

Words: 
Words: But
Words: But there
Words: But there is
Words: But there is always
Words: But there is always a
Words: But there is always a strong
Words: But there is always a stronger
Words: But there is always a stronger sense
Words: But there is always a stronger sense of
Words: But there is always a stronger sense of life
Words: But there is always a stronger sense of life

TranscribeAnyStream.cs

using System.Linq;
using System.Runtime.CompilerServices;
using Soniox.Types;
using Soniox.Client;
using Soniox.Client.Proto;

using var client = new SpeechClient();

// TranscribeStream requires the user to provide the audio to transcribe
// as an IAsyncEnumerable<byte[]> instance. This can be implemented as
// an async function that uses "yield return". This example function
// reads a file in chunks.
async IAsyncEnumerable<byte[]> EnumerateAudioChunks(
    [EnumeratorCancellation] CancellationToken cancellationToken = default(CancellationToken)
)
{
    string filePath = "../../test_data/test_audio_long.flac";
    int bufferSize = 1024;

    await using var fileStream = new FileStream(
        filePath, FileMode.Open, FileAccess.Read, FileShare.Read,
        bufferSize: bufferSize, useAsync: true
    );

    while (true)
    {
        byte[] buffer = new byte[bufferSize];
        int numRead = await fileStream.ReadAsync(buffer, cancellationToken);
        if (numRead == 0)
        {
            break;
        }
        Array.Resize(ref buffer, numRead);
        yield return buffer;
    }
}

IAsyncEnumerable<Result> resultsEnumerable = client.TranscribeStream(
    EnumerateAudioChunks(),
    new TranscriptionConfig
    {
        IncludeNonfinal = true,
    });

await foreach (var result in resultsEnumerable)
{
    // Note: result.Words contains both final and non-final words;
    // we do not distinguish them in this example.
    var wordsStr = string.Join(" ", result.Words.Select(word => word.Text).ToArray());
    Console.WriteLine($"Words: {wordsStr}");
}

To transcribe any stream, you only need to define an async enumerable over the audio chunks from the stream. In our example, we simulate this by reading audio chunks from a file. We then pass this enumerable to TranscribeStream(), which returns transcription results as soon as they become available.

Run

cd soniox_examples/csharp/TranscribeAnyStream
dotnet run

Output

Words: 
Words: But
Words: But there
Words: But there is
Words: But there is always
Words: But there is always a
Words: But there is always a strong
Words: But there is always a stronger
Words: But there is always a stronger sense
Words: But there is always a stronger sense of
Words: But there is always a stronger sense of life
Words: But there is always a stronger sense of life

Minimizing Latency

When transcribing a real-time stream, the lowest latency is achieved with raw audio encoded using PCM 16-bit little endian (pcm_s16le) at 16 kHz sample rate. The example below shows how to transcribe such audio.

transcribe_any_stream_audio_format.py

for result in transcribe_stream(
        iter_audio(), 
        client, 
        audio_format="pcm_s16le",
        sample_rate_hertz=16000,
        num_audio_channels=1):

transcribe_any_stream_audio_format.js

const stream = speechClient.transcribeStream(
    { 
        audio_format: "pcm_s16le",
        sample_rate_hertz: 16000,
        num_audio_channels: 1,
        include_nonfinal: true
    },
    onDataHandler,
    onEndHandler
);

TranscribeAnyStreamAudioFormat.cs

IAsyncEnumerable<Result> resultsEnumerable = client.TranscribeStream(
    EnumerateAudioChunks(),
    new TranscriptionConfig
    {
        IncludeNonfinal = true,
        AudioFormat = "pcm_s16le",
        SampleRateHertz = 16000,
        NumAudioChannels = 1,
    });

It is possible to use other PCM formats and configurations, as listed here, at the cost of a small increase in latency.
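As a rough sketch of the arithmetic behind raw PCM chunk sizing: pcm_s16le at a 16 kHz sample rate with one channel produces 16000 × 2 = 32000 bytes per second, so the chunk size for a given send interval follows directly. The helper below is illustrative only; the 20 ms interval is an example choice, not a Soniox requirement.

```python
SAMPLE_RATE_HZ = 16000   # 16 kHz sample rate
BYTES_PER_SAMPLE = 2     # 16-bit PCM
NUM_CHANNELS = 1         # mono


def chunk_size_bytes(chunk_duration_ms: int) -> int:
    """Number of bytes of pcm_s16le audio covering chunk_duration_ms."""
    bytes_per_second = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * NUM_CHANNELS
    return bytes_per_second * chunk_duration_ms // 1000


# Sending audio in 20 ms chunks means 640-byte buffers:
# chunk_size_bytes(20) == 640
```

Smaller chunks reduce the delay before audio reaches the service, at the cost of more frequent writes.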
