Transcription Results#
This page describes data structures used to represent transcription results.
These data structures are defined based on Google Protocol Buffers (protobuf).
If you are using one of the Soniox client libraries, you do not need to deal with protobuf directly,
since the client library provides the data structure definitions and integration with protobuf.
Data structures are called “messages” in protobuf nomenclature; the numbers at the end of field definitions (e.g., `= 1;`) are field numbers and are not relevant for users.
Result#
The `Result` structure is returned when transcribing audio and contains recognized tokens and other data. It represents either a complete or a partial transcription result, depending on the API call or client library function used.
```proto
message Result {
    repeated Word words = 1;
    int32 final_proc_time_ms = 2;
    int32 total_proc_time_ms = 3;
    repeated ResultSpeaker speakers = 6;
    int32 channel = 7;
}
```
The `words` field contains a sequence of `Word` structures representing recognized tokens (see the Word section). These are not just words, as they also include punctuation and spaces; the term “word” is used in the data structures for historical reasons.
The `final_proc_time_ms` and `total_proc_time_ms` fields indicate the duration of processed audio from the start, in milliseconds, corresponding to final tokens and to all tokens respectively. These values can differ only with streaming transcription; refer to Final vs Non-Final Tokens.
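For example, the difference between the two values is the span of audio covered only by non-final tokens. A minimal sketch, using a hypothetical dataclass standing in for the protobuf `Result` message (real results come from the Soniox client library):

```python
from dataclasses import dataclass

# Hypothetical stand-in for the Result message, for illustration only.
@dataclass
class Result:
    final_proc_time_ms: int
    total_proc_time_ms: int

def nonfinal_audio_ms(result: Result) -> int:
    # All processed audio minus the audio whose tokens are already final
    # leaves the span currently covered only by non-final tokens.
    return result.total_proc_time_ms - result.final_proc_time_ms

print(nonfinal_audio_ms(Result(final_proc_time_ms=4000, total_proc_time_ms=5200)))  # 1200
```

With non-streaming transcription the two fields are equal, so this difference is always 0.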
For the `speakers` field, see the ResultSpeaker section below.
When separate recognition per channel is enabled, the `channel` field indicates the audio channel that the result is associated with (starting with 0).
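With per-channel recognition, results for different channels can arrive interleaved, so a common step is to group them by channel. A sketch using plain dicts as hypothetical stand-ins for `Result` messages:

```python
from collections import defaultdict

# Group recognized tokens by audio channel. Results are plain dicts
# standing in for the protobuf Result message (hypothetical shape).
def tokens_by_channel(results):
    channels = defaultdict(list)
    for result in results:
        channels[result["channel"]].extend(result["words"])
    return dict(channels)

results = [
    {"channel": 0, "words": ["Hello"]},
    {"channel": 1, "words": ["Hi"]},
    {"channel": 0, "words": [" ", "there"]},
]
print(tokens_by_channel(results))  # {0: ['Hello', ' ', 'there'], 1: ['Hi']}
```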
Word#
```proto
message Word {
    string text = 1;
    int32 start_ms = 2;
    int32 duration_ms = 3;
    bool is_final = 4;
    int32 speaker = 5;
    double confidence = 9;
}
```
The `Word` structure represents an individual recognized token. The `text` field contains the text of the token. It is defined such that concatenating the text of consecutive tokens (without adding anything in between) yields the transcribed text; this works because spaces are returned explicitly as tokens whose text is a single space.
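Reconstructing the transcript is therefore a plain concatenation. A sketch, with dicts as hypothetical stand-ins for `Word` messages:

```python
# Rebuild the transcribed text by concatenating token texts in order.
# No separators are added: spaces arrive as their own tokens.
def transcript_text(words):
    return "".join(w["text"] for w in words)

words = [
    {"text": "Hello"},
    {"text": ","},
    {"text": " "},
    {"text": "world"},
    {"text": "."},
]
print(transcript_text(words))  # Hello, world.
```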
The `start_ms` and `duration_ms` fields represent the time interval of the token in the audio. When these are understood as half-open intervals `[start_ms, start_ms + duration_ms)`, it is guaranteed that there are no overlaps between returned tokens. Note that space tokens do not have meaningful time information, and their time information should be ignored.
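The non-overlap guarantee can be illustrated with a small check over the half-open intervals, skipping space tokens whose timing is not meaningful (dicts used as hypothetical stand-ins for `Word` messages):

```python
# Verify that token intervals [start_ms, start_ms + duration_ms) do not
# overlap, ignoring space tokens whose timing carries no meaning.
def intervals_disjoint(words):
    prev_end = 0
    for w in words:
        if w["text"] == " ":
            continue  # space tokens: time information should be ignored
        if w["start_ms"] < prev_end:
            return False
        prev_end = w["start_ms"] + w["duration_ms"]
    return True

words = [
    {"text": "Hello", "start_ms": 100, "duration_ms": 400},
    {"text": " ", "start_ms": 0, "duration_ms": 0},
    {"text": "world", "start_ms": 500, "duration_ms": 350},
]
print(intervals_disjoint(words))  # True
```

Because the intervals are half-open, one token may start exactly where the previous one ends, as in the example above.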
The `is_final` field specifies whether the token is final. This distinction is relevant only when using streaming transcription in low-latency mode (`include_nonfinal=true`); in all other cases `is_final` is always true. Refer to Final vs Non-Final Tokens.
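A typical streaming consumer appends final tokens to the transcript permanently and treats non-final tokens as provisional. A sketch, again with dicts as hypothetical `Word` stand-ins:

```python
# Split tokens into final and non-final parts, as needed when streaming
# with include_nonfinal=true. Final tokens are stable; non-final tokens
# may still change in later results.
def split_tokens(words):
    final = [w for w in words if w["is_final"]]
    nonfinal = [w for w in words if not w["is_final"]]
    return final, nonfinal

words = [
    {"text": "Hello", "is_final": True},
    {"text": " ", "is_final": True},
    {"text": "wor", "is_final": False},
]
final, nonfinal = split_tokens(words)
print([w["text"] for w in final], [w["text"] for w in nonfinal])  # ['Hello', ' '] ['wor']
```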
The `speaker` field indicates the speaker number. Valid speaker numbers are greater than 0. Speaker information is only available when using Speaker Diarization.
The `confidence` field is the estimated probability that the token was recognized correctly; it has a value between 0 and 1 inclusive.
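One use of this field is flagging uncertain tokens for review. A sketch with a hypothetical threshold (the 0.8 cutoff is an arbitrary illustrative choice, not a recommended value):

```python
# Collect non-space tokens whose confidence falls below a threshold,
# e.g. to highlight them for manual review. Dicts stand in for Word
# messages; the 0.8 default is an arbitrary example value.
def low_confidence_tokens(words, threshold=0.8):
    return [w["text"] for w in words
            if w["text"] != " " and w["confidence"] < threshold]

words = [
    {"text": "Hello", "confidence": 0.99},
    {"text": " ", "confidence": 0.0},
    {"text": "wrld", "confidence": 0.42},
]
print(low_confidence_tokens(words))  # ['wrld']
```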
ResultSpeaker#
```proto
message ResultSpeaker {
    int32 speaker = 1;
    string name = 2;
}
```
When using Speaker Identification, the `Result.speakers` field contains associations between speaker numbers and the names of candidate speakers specified in the transcription configuration. These associations should be used to map the speaker numbers appearing in recognized tokens to candidate speakers. Note that this field does not contain entries for recognized speakers that were not associated with any of the candidate speakers.
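The mapping step can be sketched as follows, with dicts standing in for `Word` and `ResultSpeaker` messages; unmatched speaker numbers fall back to a generic label of our own choosing:

```python
# Map speaker numbers from tokens to candidate-speaker names using the
# Result.speakers associations. Speakers without an association get a
# generic "Speaker N" label (an illustrative fallback, not part of the API).
def speaker_names(tokens, result_speakers):
    names = {s["speaker"]: s["name"] for s in result_speakers}
    return [(names.get(t["speaker"], f"Speaker {t['speaker']}"), t["text"])
            for t in tokens]

tokens = [{"text": "Hello", "speaker": 1}, {"text": "Hi", "speaker": 2}]
speakers = [{"speaker": 1, "name": "Alice"}]
print(speaker_names(tokens, speakers))  # [('Alice', 'Hello'), ('Speaker 2', 'Hi')]
```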
With streaming transcription, the `Result.speakers` field contains the latest or best associations for all tokens from the start of the audio, not just for the tokens in the latest result. It is important to understand that speaker associations may change at any time, including for speaker numbers that already have one or more final tokens.
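Because each result carries associations for the whole audio so far, a streaming consumer should replace its speaker mapping on every update rather than merge. A sketch with a hypothetical tracker class:

```python
# Maintain the current speaker-number-to-name mapping across a stream of
# results. Result.speakers covers all tokens from the start of the audio,
# so each update replaces the mapping wholesale.
class SpeakerTracker:
    def __init__(self):
        self.names = {}

    def update(self, result_speakers):
        # Replace, don't merge: associations may change at any time,
        # even for speaker numbers that already have final tokens.
        self.names = {s["speaker"]: s["name"] for s in result_speakers}

tracker = SpeakerTracker()
tracker.update([{"speaker": 1, "name": "Alice"}])
tracker.update([{"speaker": 1, "name": "Bob"}, {"speaker": 2, "name": "Alice"}])
print(tracker.names)  # {1: 'Bob', 2: 'Alice'}
```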