Transcription Results#
This page describes data structures used to represent transcription results.
These data structures are defined based on Google Protocol Buffers (protobuf). If you are using one of the Soniox client libraries, you do not need to deal with protobuf directly, since the client library provides the data structure definitions and integration with protobuf. Data structures are called “messages” in protobuf nomenclature; numbers at the end of field definitions (e.g. “= 1;”) are field numbers and are not relevant for API users.
Result#
The Result structure is returned when transcribing audio and contains recognized words and other data. It represents either a complete or partial transcription result, depending on the API call or client library function used.
message Result {
    repeated Word words = 1;
    int32 final_proc_time_ms = 2;
    int32 total_proc_time_ms = 3;
    repeated ResultSpeaker speakers = 6;
    int32 channel = 7;
}
The words field contains a sequence of Word structures representing recognized words (see the Word section).
The final_proc_time_ms and total_proc_time_ms fields give the duration of audio, in milliseconds from the start, that has been processed into final words and into all words, respectively. These can only differ with streaming transcription; refer to Final vs Non-Final Words.
For the speakers field, see the ResultSpeaker section below.
When separate recognition per channel is enabled, the channel field indicates the audio channel that the result is associated with (starting with 0).
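As a rough illustration of how the two processing-time fields relate, the sketch below uses plain Python dataclasses as stand-ins for the protobuf messages. These classes and the nonfinal_audio_ms helper are hypothetical and are not part of any Soniox client library; only the field names come from the definition above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Word:
    # Stand-in for the protobuf Word message (subset of fields).
    text: str
    start_ms: int
    duration_ms: int
    is_final: bool = True


@dataclass
class Result:
    # Stand-in for the protobuf Result message (subset of fields).
    words: List[Word] = field(default_factory=list)
    final_proc_time_ms: int = 0
    total_proc_time_ms: int = 0
    channel: int = 0


def nonfinal_audio_ms(result: Result) -> int:
    """Milliseconds of processed audio not yet covered by final words.

    Hypothetical helper: the difference between audio processed into
    all words and audio processed into final words.
    """
    return result.total_proc_time_ms - result.final_proc_time_ms


# Example: 1500 ms processed in total, of which 1200 ms is final.
result = Result(
    words=[Word("hello", 0, 400), Word("wor", 1250, 200, is_final=False)],
    final_proc_time_ms=1200,
    total_proc_time_ms=1500,
)
pending_ms = nonfinal_audio_ms(result)
```

Outside of streaming transcription the two fields are equal, so this helper would return 0.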
Word#
message Word {
    string text = 1;
    int32 start_ms = 2;
    int32 duration_ms = 3;
    bool is_final = 4;
    int32 speaker = 5;
    string orig_text = 8;
    double confidence = 9;
}
The Word structure represents an individual recognized word, which is given in the text field.
The start_ms and duration_ms fields represent the time interval of the word in the audio. When these are understood as half-open intervals [start_ms, start_ms + duration_ms), it is guaranteed that there are no overlaps between transcribed words.
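The half-open interpretation can be sketched as follows. The function names are hypothetical and only illustrate the interval arithmetic; they are not part of the API.

```python
from typing import Tuple

Interval = Tuple[int, int]


def word_interval(start_ms: int, duration_ms: int) -> Interval:
    """Return the half-open interval [start_ms, start_ms + duration_ms)."""
    return (start_ms, start_ms + duration_ms)


def intervals_overlap(a: Interval, b: Interval) -> bool:
    """True if two half-open intervals share at least one instant."""
    return a[0] < b[1] and b[0] < a[1]


# A word ending at 500 ms and a word starting at 500 ms do not overlap,
# because the end point is excluded from the first interval.
first = word_interval(0, 500)
second = word_interval(500, 300)
adjacent_overlap = intervals_overlap(first, second)
```

Treating the end point as excluded is what allows one word to start at exactly the millisecond where the previous word ends.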
The is_final field specifies if the word is final. This distinction is relevant only when using TranscribeStream with include_nonfinal=true; in other cases is_final is always true. Refer to Final vs Non-Final Words.
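A minimal sketch of separating final from non-final words in a single result is shown below. The Word dataclass is a stand-in for the protobuf message and split_by_finality is a hypothetical helper; how results are delivered over time is not assumed here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    # Stand-in for the protobuf Word message (subset of fields).
    text: str
    is_final: bool = True


def split_by_finality(words: List[Word]) -> Tuple[List[Word], List[Word]]:
    """Partition words into (final, non-final), preserving order."""
    final = [w for w in words if w.is_final]
    nonfinal = [w for w in words if not w.is_final]
    return final, nonfinal


words = [Word("hello"), Word("wor", is_final=False)]
final_words, nonfinal_words = split_by_finality(words)
```

When non-final words are not requested, the second list is always empty.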
The speaker field indicates the speaker number. Valid speaker numbers are greater than 0. Speaker information is only available when using Speaker Diarization.
The orig_text field indicates the original word when the word in text was masked for content moderation, otherwise it is empty. Refer to Moderate Content.
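One way to work with masked words is sketched below. The Word dataclass is a stand-in for the protobuf message, the helper names are hypothetical, and the "****" mask string in the example is made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Word:
    # Stand-in for the protobuf Word message (subset of fields).
    text: str
    orig_text: str = ""


def was_masked(word: Word) -> bool:
    """A non-empty orig_text means the text field holds a masked form."""
    return word.orig_text != ""


def unmasked_text(word: Word) -> str:
    """Return the original word if it was masked, otherwise the text."""
    return word.orig_text if was_masked(word) else word.text


plain = Word("hello")
masked = Word("****", orig_text="darn")  # mask string is illustrative only
```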
The confidence field is the estimated probability that the word was recognized correctly and has a value between 0 and 1 inclusive. A value of 0 means that confidence is not available for this word, which can happen in rare cases.
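Because 0 means "not available" rather than "certainly wrong", code that filters by confidence may want to handle that value separately. The sketch below assumes a stand-in Word dataclass; the helper names and the threshold value are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    # Stand-in for the protobuf Word message (subset of fields).
    text: str
    confidence: float = 0.0


def confidence_or_none(word: Word) -> Optional[float]:
    """Return the confidence, or None when it is reported as 0."""
    return word.confidence if word.confidence > 0 else None


def low_confidence_words(words: List[Word], threshold: float) -> List[Word]:
    """Words with a known confidence below the threshold.

    Words whose confidence is 0 (unavailable) are deliberately skipped.
    """
    return [w for w in words if 0 < w.confidence < threshold]


words = [Word("the", 0.98), Word("quick", 0.41), Word("fox", 0.0)]
flagged = low_confidence_words(words, threshold=0.5)
```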
ResultSpeaker#
message ResultSpeaker {
    int32 speaker = 1;
    string name = 2;
}
If using Speaker Identification, the Result.speakers field contains associations between speaker numbers and names of candidate speakers as specified in the transcription configuration. These associations should be used to map speaker numbers appearing in recognized words to candidate speakers. Note that this field does not contain entries for recognized speakers that were not associated with any of the candidate speakers.
With streaming transcription, the Result.speakers field contains the latest or best associations for all words from the start of the audio, not just for words in the latest result. It is important to understand that speaker associations may change at any time, including for speaker numbers with one or more final words.
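The mapping from speaker numbers to candidate names could be applied along these lines. The dataclasses are stand-ins for the protobuf messages, the helper names are hypothetical, and the "Speaker N" fallback label for unassociated speakers is an assumption made for this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ResultSpeaker:
    # Stand-in for the protobuf ResultSpeaker message.
    speaker: int
    name: str


@dataclass
class Word:
    # Stand-in for the protobuf Word message (subset of fields).
    text: str
    speaker: int = 0


@dataclass
class Result:
    # Stand-in for the protobuf Result message (subset of fields).
    words: List[Word] = field(default_factory=list)
    speakers: List[ResultSpeaker] = field(default_factory=list)


def speaker_names(result: Result) -> Dict[int, str]:
    """Build a speaker-number-to-name mapping from Result.speakers."""
    return {s.speaker: s.name for s in result.speakers}


def speaker_label(word: Word, names: Dict[int, str]) -> Optional[str]:
    """Resolve a display label for the word's speaker.

    Returns None when the word has no valid speaker number (not > 0).
    Falls back to a generic label when the speaker has no associated
    candidate name; that fallback format is an assumption.
    """
    if word.speaker <= 0:
        return None
    return names.get(word.speaker, f"Speaker {word.speaker}")


result = Result(
    words=[Word("hi", 1), Word("there", 2), Word("ok", 0)],
    speakers=[ResultSpeaker(1, "Alice")],
)
names = speaker_names(result)
labels = [speaker_label(w, names) for w in result.words]
```

With streaming transcription, rebuilding the mapping from each new result and re-labeling all words received so far, including final ones, keeps the labels consistent with associations that may have changed.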