Transcription Results

This page describes data structures used to represent transcription results.

These data structures are defined based on Google Protocol Buffers (protobuf). If you are using one of the Soniox client libraries, you do not need to deal with protobuf directly, since the client library provides the data structure definitions and integration with protobuf. Data structures are called “messages” in protobuf nomenclature; numbers at the end of field definitions (e.g. “= 1;”) are field numbers and are not relevant for API users.


Result

The Result structure is returned when transcribing audio and contains recognized words and other data. It represents either a complete or partial transcription result, depending on the API call or client library function used.

message Result {
    repeated Word words = 1;
    int32 final_proc_time_ms = 2;
    int32 total_proc_time_ms = 3;
    repeated ResultSpeaker speakers = 6;
    int32 channel = 7;
}

The words field contains a sequence of Word structures representing recognized words (see the Word section).

The final_proc_time_ms and total_proc_time_ms fields indicate the duration of audio, in milliseconds from the start, that has been processed into final words and into all words (final and non-final), respectively. The two values can differ only with streaming transcription; refer to Final vs Non-Final Words.
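As a rough illustration of how the two time fields relate, the sketch below computes how much processed audio is not yet covered by final words. The Result dataclass and the nonfinal_audio_ms helper are hypothetical stand-ins written for this example; they are not part of any Soniox client library.

```python
from dataclasses import dataclass


@dataclass
class Result:
    # Hypothetical stand-in mirroring two fields of the protobuf Result message.
    final_proc_time_ms: int = 0
    total_proc_time_ms: int = 0


def nonfinal_audio_ms(result: Result) -> int:
    """Milliseconds of processed audio that have not yet produced final words."""
    return result.total_proc_time_ms - result.final_proc_time_ms


streaming = Result(final_proc_time_ms=4200, total_proc_time_ms=5000)
print(nonfinal_audio_ms(streaming))  # 800
```

Outside of streaming transcription the two fields are equal, so this difference would be 0.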

For the speakers field see the ResultSpeaker section below.

When separate recognition per channel is enabled, the channel field indicates the audio channel that the result is associated with (starting with 0).
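One way to use the channel field is to collect words per channel. The sketch below assumes that each result carries words not already seen in earlier results, and it joins word texts with spaces as a simplification. The dataclasses and the text_by_channel helper are hypothetical stand-ins, not client library types.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Word:
    # Hypothetical stand-in for the protobuf Word message (text only).
    text: str = ""


@dataclass
class Result:
    # Hypothetical stand-in for the protobuf Result message.
    words: list[Word] = field(default_factory=list)
    channel: int = 0


def text_by_channel(results: list[Result]) -> dict[int, str]:
    """Collect word texts per channel, preserving the order of arrival."""
    collected: dict[int, list[str]] = defaultdict(list)
    for result in results:
        collected[result.channel].extend(word.text for word in result.words)
    return {channel: " ".join(texts) for channel, texts in sorted(collected.items())}


results = [
    Result(words=[Word("hello")], channel=0),
    Result(words=[Word("good"), Word("morning")], channel=1),
    Result(words=[Word("there")], channel=0),
]
print(text_by_channel(results))  # {0: 'hello there', 1: 'good morning'}
```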


Word

message Word {
    string text = 1;
    int32 start_ms = 2;
    int32 duration_ms = 3;
    bool is_final = 4;
    int32 speaker = 5;
    string orig_text = 8;
    double confidence = 9;
}

The Word structure represents an individual recognized word, which is given in the text field.

The start_ms and duration_ms fields represent the time interval of the word in the audio, in milliseconds. When these are understood as half-open intervals [start_ms, start_ms + duration_ms), it is guaranteed that there are no overlaps between transcribed words.
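The half-open interval convention can be sketched as follows: a word ends at start_ms + duration_ms, and a word that starts exactly where the previous one ends does not overlap it. The Word dataclass and both helpers are hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass
class Word:
    # Hypothetical stand-in mirroring three fields of the protobuf Word message.
    text: str
    start_ms: int
    duration_ms: int


def end_ms(word: Word) -> int:
    """Exclusive end of the half-open interval [start_ms, start_ms + duration_ms)."""
    return word.start_ms + word.duration_ms


def has_overlap(words: list[Word]) -> bool:
    """Return True if any two word intervals overlap."""
    ordered = sorted(words, key=lambda w: w.start_ms)
    # After sorting by start time, checking neighbouring pairs is sufficient.
    return any(end_ms(a) > b.start_ms for a, b in zip(ordered, ordered[1:]))


words = [Word("hello", 0, 400), Word("world", 400, 300)]
print(has_overlap(words))  # False: "world" starts exactly where "hello" ends
```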

The is_final field specifies if the word is final. This distinction is relevant only when using TranscribeStream with include_nonfinal=true; in other cases is_final is always true. Refer to Final vs Non-Final Words.
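A minimal sketch of separating final from non-final words in a single result is shown below. The Word dataclass and the split_by_finality helper are hypothetical stand-ins; how non-final words are replaced across successive streaming results is covered in Final vs Non-Final Words and is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class Word:
    # Hypothetical stand-in mirroring two fields of the protobuf Word message.
    text: str
    is_final: bool = True


def split_by_finality(words: list[Word]) -> tuple[list[Word], list[Word]]:
    """Return (final_words, nonfinal_words), each in original order."""
    final = [w for w in words if w.is_final]
    nonfinal = [w for w in words if not w.is_final]
    return final, nonfinal


words = [Word("good"), Word("morning"), Word("every", is_final=False)]
final, nonfinal = split_by_finality(words)
print([w.text for w in final])     # ['good', 'morning']
print([w.text for w in nonfinal])  # ['every']
```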

The speaker field indicates the speaker number. Valid speaker numbers are greater than 0. Speaker information is only available when using Speaker Diarization.
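Speaker numbers can be used to group consecutive words into speaker turns. The sketch below uses a hypothetical Word dataclass and speaker_turns helper, and joins word texts with spaces as a simplification.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Word:
    # Hypothetical stand-in mirroring two fields of the protobuf Word message.
    text: str
    speaker: int


def speaker_turns(words: list[Word]) -> list[tuple[int, str]]:
    """Group consecutive words with the same speaker number into turns."""
    return [
        (speaker, " ".join(w.text for w in group))
        for speaker, group in groupby(words, key=lambda w: w.speaker)
    ]


words = [Word("hi", 1), Word("there", 1), Word("hello", 2), Word("bye", 1)]
print(speaker_turns(words))  # [(1, 'hi there'), (2, 'hello'), (1, 'bye')]
```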

The orig_text field indicates the original word when the word in text was masked for content moderation, otherwise it is empty. Refer to Moderate Content.
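The relationship between text and orig_text can be sketched as a fallback: use orig_text when it is non-empty, otherwise text. The Word dataclass, the unmasked_text helper, and the "***" mask shown here are assumptions made for illustration; the actual form of masked text may differ.

```python
from dataclasses import dataclass


@dataclass
class Word:
    # Hypothetical stand-in mirroring two fields of the protobuf Word message.
    text: str
    orig_text: str = ""


def unmasked_text(word: Word) -> str:
    """Return the original word if it was masked, otherwise the text as given."""
    return word.orig_text or word.text


# "***" is an assumed mask used only for this example.
words = [Word("that"), Word("***", orig_text="darn"), Word("cat")]
print([unmasked_text(w) for w in words])  # ['that', 'darn', 'cat']
```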

The confidence field is the estimated probability that the word was recognized correctly and has a value between 0 and 1 inclusive. A value of 0 means that confidence is not available for this word, which can happen in rare cases.
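When filtering on confidence, a value of 0 should be treated as "not available" rather than as the lowest possible confidence. The sketch below illustrates this with a hypothetical Word dataclass and helper; the 0.8 threshold is an arbitrary value chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Word:
    # Hypothetical stand-in mirroring two fields of the protobuf Word message.
    text: str
    confidence: float = 0.0


def low_confidence_words(words: list[Word], threshold: float = 0.8) -> list[Word]:
    """Return words whose confidence is known and below the threshold.

    A confidence of exactly 0 means "not available", so such words are skipped.
    """
    return [w for w in words if 0 < w.confidence < threshold]


words = [Word("meet", 0.97), Word("at", 0.0), Word("nine", 0.55)]
print([w.text for w in low_confidence_words(words)])  # ['nine']
```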


ResultSpeaker

message ResultSpeaker {
    int32 speaker = 1;
    string name = 2;
}

If using Speaker Identification, the Result.speakers field contains associations between speaker numbers and names of candidate speakers as specified in the transcription configurations. These associations should be used to map speaker numbers appearing in recognized words to candidate speakers. Note that this field does not contain entries for recognized speakers that were not associated with any of the candidate speakers.

With streaming transcription, the Result.speakers field contains the latest or best associations for all words from the start of the audio, not just for words in the latest result. It is important to understand that speaker associations may change at any time, including for speaker numbers with one or more final words.
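The mapping from speaker numbers to candidate names can be sketched as a dictionary lookup with a fallback for speakers that have no association. The dataclasses, the helper names, and the "Speaker N" fallback label are hypothetical choices made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Word:
    # Hypothetical stand-in mirroring two fields of the protobuf Word message.
    text: str
    speaker: int = 0


@dataclass
class ResultSpeaker:
    # Hypothetical stand-in for the protobuf ResultSpeaker message.
    speaker: int
    name: str


@dataclass
class Result:
    # Hypothetical stand-in for the protobuf Result message.
    words: list[Word] = field(default_factory=list)
    speakers: list[ResultSpeaker] = field(default_factory=list)


def speaker_names(result: Result) -> dict[int, str]:
    """Build a speaker-number-to-name mapping from Result.speakers."""
    return {entry.speaker: entry.name for entry in result.speakers}


def speaker_label(word: Word, names: dict[int, str]) -> str:
    """Return the candidate name, or a generic label if there is no association."""
    return names.get(word.speaker, f"Speaker {word.speaker}")


result = Result(
    words=[Word("hi", 1), Word("hello", 2)],
    speakers=[ResultSpeaker(speaker=1, name="Alice")],
)
names = speaker_names(result)
print([speaker_label(w, names) for w in result.words])  # ['Alice', 'Speaker 2']
```

With streaming transcription, the mapping would be rebuilt from the latest result and reapplied to all words received so far, since associations may change.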