3. Transcribe From Microphone

In the previous two examples, the audio had already been captured and stored in an audio file, which was then transcribed. In this example, we will perform low-latency, real-time transcription of audio from your microphone.

examples/transcribe_microphone_split.py

from soniox.transcribe_live import transcribe_microphone
from soniox.speech_service import Client, set_api_key

set_api_key("<YOUR-API-KEY>")

def main():
    with Client() as client:
        print("Transcribing from your microphone ...")
        all_final_words = []
        for result in transcribe_microphone(client):
            # Split current result response into final words and non-final words.
            final_words = []
            non_final_words = []
            for word in result.words:
                if word.is_final:
                    final_words.append(word.text)
                else:
                    non_final_words.append(word.text)

            # Append current final words to the list of all final words.
            all_final_words += final_words

            # Print all final words and current non-final words.
            all_final_words_str = " ".join(all_final_words)
            non_final_words_str = " ".join(non_final_words)
            print(f"Final: {all_final_words_str}")
            print(f"Non-final: {non_final_words_str}")
            print("-----")

if __name__ == "__main__":
    main()

We iterate over the transcribe_microphone() generator in a for loop. The generator captures audio from your microphone, sends it to Soniox Cloud for transcription, and yields transcription results as they become available.

On each received result, we split the recognized words into final and non-final words. We then add the new final words to the list of all final words, and print out all final words and the current non-final words.

Final vs Non-final Words

Final words are words whose recognition is complete; they will not change in later results.

Non-final words are words that can change in the future once more audio is available. In each received result, non-final words always follow any final words.

The full transcript of a stream can be obtained by joining:

  1. all final words from all previous results,
  2. all words from the last received result (final and non-final words).

Typically, a word is first returned as non-final, possibly several times in a row, and after a certain period it is returned as final. However, a word returned as non-final may later be returned as a different non-final word, or may disappear altogether. Do not assume any relationship between the non-final words of one result and those of subsequent results.
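The joining rule described in this section can be sketched as a small helper. This is a minimal illustration, not part of the Soniox library: the Word dataclass is a hypothetical stand-in that carries only the text and is_final fields used in the example above, and TranscriptAssembler is a name introduced here. The helper keeps final words only and discards the non-final words of each result, so it never relies on how non-final words relate between results.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    # Hypothetical stand-in with only the two fields used in the example above.
    text: str
    is_final: bool


class TranscriptAssembler:
    """Rebuilds the full transcript after each received result."""

    def __init__(self) -> None:
        # Final words collected from all results received before the current one.
        self.final_words: List[str] = []

    def update(self, words: List[Word]) -> str:
        # Full transcript = earlier final words + every word of the current result.
        transcript = self.final_words + [w.text for w in words]
        # Keep only the final words of the current result for later calls;
        # its non-final words are dropped and never matched against later results.
        self.final_words += [w.text for w in words if w.is_final]
        return " ".join(transcript)


if __name__ == "__main__":
    assembler = TranscriptAssembler()
    # A non-final word may come back later as a different word.
    print(assembler.update([Word("helo", False)]))
    print(assembler.update([Word("hello", True), Word("can", False)]))
    print(assembler.update([Word("can", True), Word("you", False)]))
```

In the microphone loop above, the same helper could be fed with result.words on each iteration to obtain the transcript as of that result.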

Run!

python3 transcribe_microphone_split.py

Output (sample)

Transcribing from your microphone ...  
Final:   
Non-final:  hello can you hear me   
-----
Final: hello  
Non-final: can you hear me   
-----
Final: hello can you hear me  
Non-final:   
-----

Processed Audio Duration

The total duration of the audio processed so far, including audio that has only produced non-final words, is available in the total_proc_time_ms field. The duration of the audio that has been transcribed into final words is available in the final_proc_time_ms field.

final_duration_ms = result.final_proc_time_ms
total_duration_ms = result.total_proc_time_ms
non_final_duration_ms = total_duration_ms - final_duration_ms

The following invariants always hold with regard to processed audio durations and word timestamps:

  1. final_duration_ms <= total_duration_ms
  2. final_duration_ms and total_duration_ms never decrease from one result to the next.
  3. All final words end before or at final_duration_ms.
  4. All non-final words start at or after final_duration_ms.
  5. All final and non-final words in subsequent results will start at or after final_duration_ms.
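The invariants above can be expressed as a small checker, which may be useful when debugging code that consumes results. This is a sketch under stated assumptions: Word and Result are hypothetical stand-ins rather than the library's own types, and the timestamp field names start_ms and duration_ms are assumptions (the text above only says that words carry timestamps). Only final_proc_time_ms, total_proc_time_ms, words, and is_final come from this page.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    # Hypothetical stand-in; start_ms and duration_ms are assumed field names.
    text: str
    start_ms: int
    duration_ms: int
    is_final: bool


@dataclass
class Result:
    # Hypothetical stand-in for one received result.
    words: List[Word]
    final_proc_time_ms: int
    total_proc_time_ms: int


def check_invariants(result: Result, previous: Optional[Result] = None) -> List[str]:
    """Return a message for each violated invariant (empty list if all hold)."""
    violations: List[str] = []
    final_ms = result.final_proc_time_ms
    total_ms = result.total_proc_time_ms

    if final_ms > total_ms:
        violations.append("1: final duration exceeds total duration")
    if previous is not None and (
        final_ms < previous.final_proc_time_ms
        or total_ms < previous.total_proc_time_ms
    ):
        violations.append("2: a processed duration decreased")

    for word in result.words:
        end_ms = word.start_ms + word.duration_ms
        if word.is_final and end_ms > final_ms:
            violations.append(f"3: final word '{word.text}' ends after final duration")
        if not word.is_final and word.start_ms < final_ms:
            violations.append(f"4: non-final word '{word.text}' starts before final duration")
        if previous is not None and word.start_ms < previous.final_proc_time_ms:
            violations.append(f"5: word '{word.text}' starts before the previous final duration")
    return violations


if __name__ == "__main__":
    first = Result([Word("hello", 0, 400, False)], 0, 500)
    second = Result(
        [Word("hello", 0, 400, True), Word("can", 450, 200, False)], 400, 700
    )
    print(check_invariants(first))
    print(check_invariants(second, previous=first))
```

Passing the previously received result as previous enables the checks that span two results (invariants 2 and 5); without it, only the checks within a single result are performed.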