Transcription Results

This page describes data structures used to represent transcription results.

These data structures are defined based on Google Protocol Buffers (protobuf). If you are using one of the Soniox client libraries, you do not need to deal with protobuf directly, since the client library provides the data structure definitions and integration with protobuf. Data structures are called “messages” in protobuf nomenclature; numbers at the end of field definitions (e.g., = 1;) are field numbers and are not relevant for users.

Result

The Result structure is returned when transcribing audio and contains recognized tokens and other data. It represents either a complete or partial transcription result, depending on the API call or client library function used.

message Result {
    repeated Word words = 1;
    int32 final_proc_time_ms = 2;
    int32 total_proc_time_ms = 3;
    repeated ResultSpeaker speakers = 6;
    int32 channel = 7;
}

The words field contains a sequence of Word structures representing recognized tokens (see the Word section). These are not only words: they also include punctuation and spaces. The term “word” is kept in the data structures for historical reasons.

The final_proc_time_ms and total_proc_time_ms fields indicate how much audio, measured in milliseconds from the start, has been processed to produce final tokens and all tokens, respectively. These values can differ only with streaming transcription; refer to Final vs Non-Final Tokens.

For the speakers field, see the ResultSpeaker section below.

When separate recognition per channel is enabled, the channel field indicates the audio channel that the result is associated with (starting with 0).
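When transcribing multi-channel audio, results for different channels can arrive interleaved, so a typical consumer groups them by the channel field. A minimal sketch, using a stand-in class with the documented channel and words fields (the Result class and sample data here are illustrative, not the actual client-library types):

```python
from collections import defaultdict
from dataclasses import dataclass

# Stand-in for the protobuf Result message (illustration only).
@dataclass
class Result:
    channel: int
    words: list  # token texts, simplified to plain strings here

# Results may arrive interleaved across channels.
results = [Result(0, ["Hello"]), Result(1, ["Hi"]), Result(0, ["there"])]

# Group tokens by the channel they belong to (channels start at 0).
by_channel = defaultdict(list)
for r in results:
    by_channel[r.channel].extend(r.words)

print(dict(by_channel))  # → {0: ['Hello', 'there'], 1: ['Hi']}
```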

Word

message Word {
    string text = 1;
    int32 start_ms = 2;
    int32 duration_ms = 3;
    bool is_final = 4;
    int32 speaker = 5;
    double confidence = 9;
}

The Word structure represents an individual recognized token.

The text field is the text of the token. It is defined such that concatenating text from consecutive tokens (without adding anything in between) yields the transcribed text. This works because spaces are returned explicitly as tokens with text equal to a single space.
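The concatenation property can be seen in a small sketch, using a stand-in class with only the documented text field (the Word class and sample tokens here are illustrative):

```python
from dataclasses import dataclass

# Minimal stand-in for the protobuf Word message (illustration only).
@dataclass
class Word:
    text: str

# Tokens as they might be returned for the phrase "Hello, world."
# Note that the space is its own token.
tokens = [Word("Hello"), Word(","), Word(" "), Word("world"), Word(".")]

# Concatenating token texts directly, with nothing added in between,
# yields the transcribed text.
transcript = "".join(w.text for w in tokens)
print(transcript)  # → Hello, world.
```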

The start_ms and duration_ms fields represent the time interval of the token in the audio. When these are understood as half-open intervals [start_ms, start_ms + duration_ms), it is guaranteed that there are no overlaps between returned tokens. Note that space tokens do not have meaningful time information (time information for space tokens should be ignored).
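The non-overlap guarantee for the half-open intervals [start_ms, start_ms + duration_ms) can be checked as follows. This is a sketch with a stand-in Word class; it assumes tokens arrive sorted by start time and skips space tokens, whose timing should be ignored:

```python
from dataclasses import dataclass

# Stand-in for the protobuf Word message (illustration only).
@dataclass
class Word:
    text: str
    start_ms: int
    duration_ms: int

def check_no_overlap(tokens):
    """Verify that non-space tokens occupy disjoint half-open
    intervals [start_ms, start_ms + duration_ms)."""
    timed = [t for t in tokens if t.text != " "]  # ignore space tokens
    for prev, cur in zip(timed, timed[1:]):
        if prev.start_ms + prev.duration_ms > cur.start_ms:
            return False
    return True

tokens = [
    Word("Hello", 0, 400),
    Word(" ", 0, 0),          # space token: timing not meaningful
    Word("world", 400, 350),  # starts exactly where "Hello" ends
]
print(check_no_overlap(tokens))  # → True
```

Because the intervals are half-open, a token starting exactly where the previous one ends does not count as an overlap.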

The is_final field specifies whether the token is final. This distinction is relevant only when using streaming transcription in low-latency mode (include_nonfinal=true); in all other cases is_final is always true. Refer to Final vs Non-Final Tokens.
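A common way to consume low-latency streaming results is to accumulate final tokens permanently and append the latest non-final tokens as a provisional tail. This is a sketch under the assumption that final tokens are appended in order and non-final tokens are re-sent (and may change) with every partial result; the Word class and handler below are illustrative, not client-library API:

```python
from dataclasses import dataclass

# Stand-in for the protobuf Word message (illustration only).
@dataclass
class Word:
    text: str
    is_final: bool

# Text from final tokens, which never changes once received.
final_text = ""

def handle_result(words):
    """Return the current best transcript: accumulated final text
    plus the latest non-final (provisional) tail."""
    global final_text
    final_text += "".join(w.text for w in words if w.is_final)
    nonfinal_text = "".join(w.text for w in words if not w.is_final)
    return final_text + nonfinal_text

print(handle_result([Word("Hel", False)]))  # → Hel
print(handle_result([Word("Hello", True), Word(" ", True),
                     Word("wor", False)]))  # → Hello wor
```

Note how the non-final "Hel" from the first result is simply discarded once the second result arrives with the final tokens that replace it.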

The speaker field indicates the speaker number. Valid speaker numbers are greater than 0. Speaker information is only available when using Speaker Diarization.

The confidence field is the estimated probability that the token was recognized correctly and has a value between 0 and 1 inclusive.
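One use of the confidence field is flagging tokens that may need review. A minimal sketch with a stand-in Word class and a hypothetical threshold of 0.5 (the threshold is an illustrative choice, not a recommendation from the API):

```python
from dataclasses import dataclass

# Stand-in for the protobuf Word message (illustration only).
@dataclass
class Word:
    text: str
    confidence: float  # estimated probability of correct recognition, 0..1

tokens = [Word("Hello", 0.98), Word("wrld", 0.42)]

# Flag tokens below an application-chosen confidence threshold.
low_conf = [w.text for w in tokens if w.confidence < 0.5]
print(low_conf)  # → ['wrld']
```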

ResultSpeaker

message ResultSpeaker {
    int32 speaker = 1;
    string name = 2;
}

When using Speaker Identification, the Result.speakers field contains associations between speaker numbers and the names of candidate speakers specified in the transcription configuration. Use these associations to map speaker numbers appearing in recognized tokens to candidate speakers. Note that this field contains no entries for recognized speakers that were not matched to any candidate speaker.

With streaming transcription, the Result.speakers field contains the latest or best associations for all tokens from the start of the audio, not just for the tokens in the latest result. It is important to understand that speaker associations may change at any time, including for speaker numbers that already have one or more final tokens.
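The mapping described above can be sketched as follows. The field names come from the documented messages; the speaker names and tokens are hypothetical, and speaker numbers absent from the mapping are shown as unmatched:

```python
from dataclasses import dataclass

# Stand-ins for the protobuf messages (illustration only).
@dataclass
class Word:
    text: str
    speaker: int

@dataclass
class ResultSpeaker:
    speaker: int
    name: str

# Associations as they might appear in Result.speakers.
speakers = [ResultSpeaker(1, "Alice"), ResultSpeaker(2, "Bob")]
name_by_number = {s.speaker: s.name for s in speakers}

# Speaker 3 has no entry: it was not matched to any candidate speaker.
tokens = [Word("Hi", 1), Word("Hey", 3)]
for t in tokens:
    name = name_by_number.get(t.speaker, "unknown")
    print(f"{name}: {t.text}")
# → Alice: Hi
# → unknown: Hey
```

Because associations may change between streaming results, the mapping should be rebuilt from the latest Result.speakers rather than cached once.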