
Transcribe stream

This API is deprecated and is being phased out. Please switch to our new multilingual Speech-to-Text API.

In this example, we will transcribe a stream in bidirectional streaming mode with non-final words. We will simulate the stream by reading a file in small chunks, which demonstrates how to transcribe any stream of data, including real-time streams.

transcribe_any_stream.py

from typing import Iterable
from soniox.transcribe_live import transcribe_stream
from soniox.speech_service import SpeechClient


def iter_audio() -> Iterable[bytes]:
    # This function should yield audio bytes from your stream.
    # Here we simulate the stream by reading a file in small chunks.
    with open("../test_data/test_audio_long.flac", "rb") as fh:
        while True:
            audio = fh.read(1024)
            if len(audio) == 0:
                break
            yield audio


# Do not forget to set your API key in the SONIOX_API_KEY environment variable.
def main():
    with SpeechClient() as client:
        for result in transcribe_stream(
            iter_audio(),
            client,
            model="en_v2_lowlatency", # Do not forget to specify the model!
            include_nonfinal=True,
        ):
            print("".join(w.text for w in result.words))


if __name__ == "__main__":
    main()

To transcribe any stream, you need to provide an iterable over successive audio chunks. In our example, we define a generator function iter_audio that reads audio chunks from a file.

We start transcription by calling transcribe_stream(), which returns an iterable over transcription results. We iterate this to obtain the results as soon as they become available.
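For a live source, the same generator pattern applies: yield chunks as your capture pipeline produces them instead of reading them from a file. As a rough sketch (the queue, its name, and the None sentinel are our own illustration, not part of the Soniox SDK), a thread-safe queue fed by an audio capture callback works well:

```python
import queue
from typing import Iterable

# Hypothetical buffer filled by your audio capture callback;
# not part of the Soniox SDK.
audio_queue: "queue.Queue[bytes | None]" = queue.Queue()


def iter_live_audio() -> Iterable[bytes]:
    # Yield chunks as they arrive; a None sentinel marks end of stream.
    while True:
        chunk = audio_queue.get()
        if chunk is None:
            break
        yield chunk


# Example: a producer enqueues two chunks, then the sentinel.
for data in (b"\x00\x01", b"\x02\x03", None):
    audio_queue.put(data)
print(list(iter_live_audio()))  # -> [b'\x00\x01', b'\x02\x03']
```

You would pass iter_live_audio() to transcribe_stream() in place of iter_audio(); the blocking queue.get() naturally paces the generator to the rate of the incoming audio.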

Run

Terminal
python3 transcribe_any_stream.py

Output

But
But there
But there is always
But there is always
But there is always a
But there is always a stronger
But there is always a stronger sense
But there is always a stronger sense of

transcribe_any_stream.js

const fs = require("fs");
const { SpeechClient } = require("@soniox/soniox-node");

// Do not forget to set your API key in the SONIOX_API_KEY environment variable.
const speechClient = new SpeechClient();

(async function () {
    const onDataHandler = async (result) => {
        console.log(result.words.map((word) => word.text).join(""));
    };

    const onEndHandler = (error) => {
        if (error) {
            console.log(`Transcription error: ${error}`);
        }
    };

    // transcribeStream() returns an object with ".writeAsync()" and ".end()" methods.
    // Use them to send data and to end the stream when done.
    const stream = speechClient.transcribeStream(
        {
            model: "en_v2_lowlatency", // Do not forget to specify the model!
            include_nonfinal: true,
        },
        onDataHandler,
        onEndHandler
    );

    // Here we simulate the stream by reading a file in small chunks.
    const CHUNK_SIZE = 1024;
    const readable = fs.createReadStream("../test_data/test_audio_long.flac", {
        highWaterMark: CHUNK_SIZE,
    });

    for await (const chunk of readable) {
        await stream.writeAsync(chunk);
    }

    stream.end();
})();

To transcribe any stream, first start the stream transcription by calling speechClient.transcribeStream(), providing the transcription configuration and the user-defined callbacks. This returns an object representing the stream (stream). Then call await stream.writeAsync(chunk) for successive audio chunks as they become available. Finally, call stream.end() to indicate the end of audio.

Consecutive transcription results are returned by calling the onDataHandler callback. When the transcription has finished, the user-supplied onEndHandler callback is called. Any error will be indicated using the error argument of this callback.

Run

Terminal
node transcribe_any_stream.js

Output

But
But there
But there is always
But there is always
But there is always a
But there is always a stronger
But there is always a stronger sense
But there is always a stronger sense of

TranscribeAnyStream.cs

using System.Runtime.CompilerServices;
using Soniox.Client;
using Soniox.Client.Proto;

// Do not forget to set your API key in the SONIOX_API_KEY environment variable.
using var client = new SpeechClient();

// TranscribeStream requires the user to provide the audio to transcribe
// as an IAsyncEnumerable<byte[]> instance. This can be implemented as
// an async function that uses "yield return". This example function
// reads a file in chunks.
async IAsyncEnumerable<byte[]> EnumerateAudioChunks(
    [EnumeratorCancellation] CancellationToken cancellationToken = default(CancellationToken)
)
{
    string filePath = "../../test_data/test_audio_long.flac";
    int bufferSize = 1024;

    await using var fileStream = new FileStream(
        filePath, FileMode.Open, FileAccess.Read, FileShare.Read,
        bufferSize: bufferSize, useAsync: true
    );

    while (true)
    {
        byte[] buffer = new byte[bufferSize];
        int numRead = await fileStream.ReadAsync(buffer, cancellationToken);
        if (numRead == 0)
        {
            break;
        }
        Array.Resize(ref buffer, numRead);
        yield return buffer;
    }
}

IAsyncEnumerable<Result> resultsEnumerable = client.TranscribeStream(
    EnumerateAudioChunks(),
    new TranscriptionConfig
    {
        Model = "en_v2_lowlatency", // Do not forget to specify the model!
        IncludeNonfinal = true,
    });

await foreach (var result in resultsEnumerable)
{
    // Note: result.Words contains both final and non-final tokens;
    // we do not distinguish them in this example.
    var text = string.Join("", result.Words.Select(word => word.Text).ToArray());
    Console.WriteLine(text);
}

To transcribe any stream, you need to provide an async generator (IAsyncEnumerable<byte[]>) over successive audio chunks. In our example, we define a generator function EnumerateAudioChunks that reads audio chunks from a file.

We start transcription by calling TranscribeStream(), which returns an async iterable over transcription results (IAsyncEnumerable<Result>). We iterate this to obtain the results as soon as they become available.

Run

Terminal
cd TranscribeAnyStream
dotnet run

Output

But
But there
But there is always
But there is always
But there is always a
But there is always a stronger
But there is always a stronger sense
But there is always a stronger sense of

Minimizing latency

When transcribing a real-time stream, the lowest latency is achieved with raw audio encoded using PCM 16-bit little endian (pcm_s16le) at 16 kHz sample rate. The example below shows how to transcribe such audio.

transcribe_any_stream_audio_format.py

for result in transcribe_stream(
    iter_audio(),
    client,
    model="en_v2_lowlatency",
    include_nonfinal=True,
    audio_format="pcm_s16le",
    sample_rate_hertz=16000,
    num_audio_channels=1,
):

transcribe_any_stream_audio_format.js

const stream = speechClient.transcribeStream(
    {
        model: "en_v2_lowlatency",
        audio_format: "pcm_s16le",
        sample_rate_hertz: 16000,
        num_audio_channels: 1,
        include_nonfinal: true
    },
    onDataHandler,
    onEndHandler
);

TranscribeAnyStreamAudioFormat.cs

IAsyncEnumerable<Result> resultsEnumerable = client.TranscribeStream(
    EnumerateAudioChunks(),
    new TranscriptionConfig
    {
        Model = "en_v2_lowlatency",
        IncludeNonfinal = true,
        AudioFormat = "pcm_s16le",
        SampleRateHertz = 16000,
        NumAudioChannels = 1,
    });

It is possible to use other supported PCM formats and configurations, at the cost of a small increase in latency.
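To get a feel for chunk sizing at this format: pcm_s16le at 16 kHz mono is 2 bytes per sample, so the stream carries 32,000 bytes per second, and the 1024-byte chunks used in the examples above each hold about 32 ms of audio. A quick sanity check of this arithmetic (plain Python, no SDK involved):

```python
SAMPLE_RATE_HZ = 16000   # sample_rate_hertz
BYTES_PER_SAMPLE = 2     # pcm_s16le: 16-bit samples
CHANNELS = 1             # num_audio_channels

bytes_per_second = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS
print(bytes_per_second)  # -> 32000

# Duration of one 1024-byte chunk, in milliseconds.
chunk_ms = 1024 / bytes_per_second * 1000
print(round(chunk_ms))   # -> 32

# Conversely, bytes needed for a 20 ms chunk.
print(int(bytes_per_second * 0.020))  # -> 640
```

Smaller chunks mean the service sees audio sooner, but each write carries per-message overhead; a chunk in the tens of milliseconds is a reasonable starting point for low-latency streaming.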
