4. Transcribe Any Stream

In this example, we transcribe a file in bidirectional streaming mode with non-final words included. It demonstrates how to transcribe any stream of data, including real-time streams.

examples/transcribe_any_stream.py

from soniox.transcribe_live import transcribe_stream
from soniox.speech_service import Client, set_api_key
from soniox.test_data import TEST_AUDIO_LONG_FLAC

set_api_key("<YOUR-API-KEY>")

def iter_audio():
    with open(TEST_AUDIO_LONG_FLAC, "rb") as fh:
        while True:
            audio = fh.read(1024)
            if len(audio) == 0:
                break
            yield audio

def main():
    with Client() as client:
        for result in transcribe_stream(iter_audio(), client):
            # Variable result contains final and non-final words.
            print(" ".join(w.text for w in result.words))

if __name__ == "__main__":
    main()

To transcribe any stream, we only need to define a generator that yields audio chunks from the stream. In our example, we simulate this by reading audio chunks from a file. We then pass this generator to transcribe_stream(), which returns the transcription results as they become available.
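
As a rough sketch of the same idea, the generator below reads fixed-size chunks from any binary file-like object rather than from a named file. The helper name iter_chunks and the in-memory buffer standing in for a live source are assumptions made for illustration; they are not part of the Soniox library.

```python
import io
from typing import BinaryIO, Iterator


def iter_chunks(stream: BinaryIO, chunk_size: int = 1024) -> Iterator[bytes]:
    """Yield successive chunks from a binary stream until it is exhausted.

    Hypothetical helper: any object with a read(n) method that returns
    bytes (an open file, sys.stdin.buffer, io.BytesIO) should work.
    """
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            # An empty read signals the end of the stream.
            break
        yield chunk


# Simulate a stream with an in-memory buffer of 2500 bytes.
fake_stream = io.BytesIO(b"\x00" * 2500)
chunk_sizes = [len(chunk) for chunk in iter_chunks(fake_stream)]
print(chunk_sizes)  # [1024, 1024, 452]
```

A generator of this shape could presumably be passed to transcribe_stream() in place of iter_audio(), provided the bytes it yields are audio in a format the service accepts.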

The difference between transcribe_stream() and transcribe_file_stream() is that the former returns both final and non-final words, whereas the latter returns only final words.

Minimizing Latency

When transcribing a real-time stream, the lowest latency is achieved with raw audio encoded as 16-bit little-endian PCM (pcm_s16le) at a 16 kHz sample rate with one audio channel. The example below shows how to transcribe such audio. Other PCM formats or configurations, as listed here, can also be used at the cost of a small increase in latency.

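# Assumes client is an open Client and iter_audio() yields raw pcm_s16le audio.
for result in transcribe_stream(
        iter_audio(),
        client,
        audio_format="pcm_s16le",
        sample_rate_hertz=16000,
        num_audio_channels=1):
    # Variable result contains final and non-final words.
    print(" ".join(w.text for w in result.words))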
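
When feeding raw PCM, it can help to know how many bytes correspond to a given duration of audio, so that each chunk ends on a whole sample. The sketch below works through that arithmetic for pcm_s16le at 16 kHz with one channel. The helper names and the 100 ms chunk duration are assumptions chosen for illustration, not values recommended by Soniox.

```python
from typing import Iterator


def pcm_chunk_bytes(duration_ms: int,
                    sample_rate_hertz: int = 16000,
                    num_audio_channels: int = 1,
                    bytes_per_sample: int = 2) -> int:
    """Return the number of bytes covering duration_ms of raw PCM audio.

    bytes_per_sample is 2 for 16-bit formats such as pcm_s16le.
    """
    frames = sample_rate_hertz * duration_ms // 1000
    return frames * num_audio_channels * bytes_per_sample


def iter_pcm_chunks(pcm: bytes, chunk_bytes: int) -> Iterator[bytes]:
    """Split a raw PCM buffer into consecutive chunks of chunk_bytes."""
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]


# One second of silence: 16000 samples * 2 bytes * 1 channel = 32000 bytes.
one_second = bytes(pcm_chunk_bytes(1000))

# 100 ms chunks (illustrative choice): 1600 samples * 2 bytes = 3200 bytes.
chunks = list(iter_pcm_chunks(one_second, pcm_chunk_bytes(100)))
print(len(chunks), len(chunks[0]))  # 10 3200
```

Smaller chunks mean audio is handed over sooner after it is captured, at the price of more frequent writes; a suitable size depends on the application.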