Stored Data#
We store data in a structured format for ease of use in downstream applications. See the sections below for details.
StoredObject#
StoredObject is a structure that contains all data about the given object (except the audio).
It contains information specified by the user in StorageConfig at the start of the
transcription request, as well as information filled in when the object is stored.
message StoredObject {
  string object_id = 1;
  map<string, string> metadata = 2;
  string title = 3;
  google.protobuf.Timestamp datetime = 4;
  google.protobuf.Timestamp stored_datetime = 5;
  int32 duration_ms = 6;
  int32 num_audio_channels = 10;
  bool audio_stored = 11;
  Transcript transcript = 7;
}
- object_id is the object ID as specified by the user in StorageConfig, or auto-generated if it was not specified.
- metadata, title and datetime are as specified by the user in StorageConfig, except that if datetime was not specified, it will be equal to stored_datetime.
- stored_datetime is the datetime when the object was stored.
- duration_ms is the duration of the audio in milliseconds.
- num_audio_channels is the number of audio channels transcribed and possibly stored.
- audio_stored indicates whether the audio is stored.
- transcript is a structured representation of the transcript. If the transcript is not stored, this field is not present.
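The datetime fallback rule can be illustrated in plain Python (a minimal sketch using an ordinary dict in place of the protobuf message; the helper function and field values are illustrative, not part of the API):

```python
from datetime import datetime, timezone

def make_stored_object(object_id, stored_datetime, user_datetime=None, **fields):
    """Build a dict mirroring StoredObject's fields.

    If the user did not specify datetime in StorageConfig, it is set
    equal to stored_datetime, as described above.
    """
    return {
        "object_id": object_id,
        "datetime": user_datetime if user_datetime is not None else stored_datetime,
        "stored_datetime": stored_datetime,
        **fields,
    }

now = datetime(2024, 1, 1, tzinfo=timezone.utc)
obj = make_stored_object("obj-1", stored_datetime=now, duration_ms=12_000)
# datetime was not specified, so it equals stored_datetime.
assert obj["datetime"] == obj["stored_datetime"]
```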
Transcript#
Transcript
contains all information about recognized speech and speakers in a structured format.
message Transcript {
  string text = 1;
  repeated Token tokens = 2;
  map<int32, string> speaker_names = 7;
}
Transcript.text is the entire transcript as one string. It represents recognized speech in written form. The transcript is composed of tokens; Token represents a recognized unit in the audio.
message Token {
  int32 idx = 1;
  int32 text_start = 2;
  int32 text_end = 3;
  string text = 4;
  int32 start_ms = 5;
  int32 duration_ms = 6;
  double confidence = 7;
  int32 speaker_id = 8;
}
Tokens here are much like those in Transcription Results. Notably, spaces in text are explicitly represented as space tokens.
- Token.idx is the index of the token in the transcript, starting at 0.
- Token.text_start and Token.text_end denote the range in Transcript.text that corresponds to the token (inclusive and exclusive, respectively).
- Token.text is the token text. Tokens are typically words, possibly with adjacent punctuation, or spaces as their own tokens.
- Token.start_ms and Token.duration_ms denote where the token was recognized in the audio.
- Token.confidence is the estimated probability that the token was recognized correctly; it has a value between 0 and 1 inclusive.
- Token.speaker_id is the speaker ID, or 0 if not available; refer to Speaker Information below.
Speaker Information#
Transcript may contain speaker information, in the form of Token.speaker_id and Transcript.speaker_names, where the latter contains mappings from speaker ID to speaker name as available.
Speaker information will be available in two cases:
- If separate recognition per channel was used. In this case, each audio channel
  is considered to represent a distinct speaker. Speaker IDs 1, 2, 3… represent
  consecutive audio channels 0, 1, 2…, and Transcript.speaker_names contains an
  entry for each channel in the form (N+1, “Channel N”).
- If speaker diarization was used but not separate recognition per channel,
  then speaker information is based on speaker diarization. If speaker
  identification was also used, then Transcript.speaker_names will contain any
  determined speaker ID to speaker name mappings; otherwise it will be empty.
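For the per-channel case, the (N+1, “Channel N”) rule can be sketched in plain Python (a dict standing in for the protobuf map field; the helper name and channel count are illustrative):

```python
def channel_speaker_names(num_audio_channels):
    """Mirror the speaker_names mapping produced when separate recognition
    per channel is used: channel N gets speaker ID N+1 and name "Channel N".
    """
    return {ch + 1: f"Channel {ch}" for ch in range(num_audio_channels)}

# Two audio channels yield speaker IDs 1 and 2.
names = channel_speaker_names(2)
assert names == {1: "Channel 0", 2: "Channel 1"}
```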