Stored Data
We store data in a structured format for ease of use in downstream applications. See the sections below for details.
StoredObject

StoredObject is a structure that contains all data about the given object (except the audio). It contains information specified by the user in StorageConfig at the start of the transcription request, as well as other information.
```proto
message StoredObject {
  string object_id = 1;
  map<string, string> metadata = 2;
  string title = 3;
  google.protobuf.Timestamp datetime = 4;
  google.protobuf.Timestamp stored_datetime = 5;
  int32 duration_ms = 6;
  int32 num_audio_channels = 10;
  bool audio_stored = 11;
  Transcript transcript = 7;
}
```
object_id is the object ID as specified by the user in StorageConfig, or auto-generated if it was not specified.

metadata, title and datetime are as specified by the user in StorageConfig, except that if datetime was not specified, it will be equal to stored_datetime.

stored_datetime is the datetime when the object was stored.

duration_ms is the duration of the audio in milliseconds.

num_audio_channels is the number of audio channels transcribed and possibly stored.

audio_stored indicates whether audio is stored.

transcript is a structured representation of the transcript. If the transcript is not stored, this field is not present.
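As a sketch of how these fields fit together, the snippet below inspects a StoredObject-like record. A plain Python dict stands in for the generated protobuf class; the field names mirror the message above, and all values are made up for illustration.

```python
# A dict standing in for a StoredObject message (illustrative values).
stored = {
    "object_id": "obj-123",              # user-supplied or auto-generated
    "metadata": {"source": "call-center"},
    "title": "Support call",
    "duration_ms": 754_000,
    "num_audio_channels": 2,
    "audio_stored": False,
}

# duration_ms converted to minutes and seconds for display.
minutes, seconds = divmod(stored["duration_ms"] // 1000, 60)
summary = (
    f'{stored["title"]}: {minutes}m{seconds:02d}s, '
    f'{stored["num_audio_channels"]} channel(s)'
)
print(summary)  # Support call: 12m34s, 2 channel(s)
```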
Transcript
Transcript
contains all information about recognized speech and speakers in a structured format.
```proto
message Transcript {
  string text = 1;
  repeated Token tokens = 2;
  repeated Sentence sentences = 3;
  repeated Paragraph paragraphs = 4;
  repeated Keyterm keyterms = 6;
  map<int32, string> speaker_names = 7;
}
```
Transcript.text
is the entire transcript as one string. It represents recognized speech in written form (not spoken).
The transcript is composed of tokens, which are then grouped into sentences, which are grouped into paragraphs, similar to how we would write a document.
Token
represents a recognized unit in the audio.
```proto
message Token {
  int32 idx = 1;
  int32 text_start = 2;
  int32 text_end = 3;
  // Spoken text.
  string text = 4;
  int32 start_ms = 5;
  int32 duration_ms = 6;
  double confidence = 7;
  int32 speaker_id = 8;
  bool profane = 9;
}
```
Token.idx is the index of the token in the transcript, starting at 0.

Token.text_start and Token.text_end denote a range in Transcript.text that corresponds to the token (inclusive and exclusive, respectively).

Token.text is the token in spoken form. Tokens are typically words and punctuation and are always in spoken form. For example, “twenty” and “three” are two tokens in spoken form, while “23” could be their corresponding written form in Transcript.text.

Token.start_ms and Token.duration_ms denote where the token was recognized in the audio.

Token.confidence is the estimated probability that the token was recognized correctly and has a value between 0 and 1 inclusive. If it is equal to 0, it means that confidence is not available for this token, which can happen in rare cases.

Token.speaker_id is the speaker ID, or 0 if not available; refer to Speaker Information below.

Token.profane indicates whether the token is part of a word or phrase that was masked as per content moderation settings; refer to Moderate Content. Note that Token.text is not masked in that case; only the associated part of Transcript.text is.
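To illustrate the spoken/written split, the sketch below slices Transcript.text with each token's [text_start, text_end) range. The sample text and offsets are invented, and the convention shown for “twenty three” (both spoken tokens sharing the written span of “23”) is an illustrative assumption, not something the message definition guarantees.

```python
# An invented transcript in written form.
transcript_text = "I waited 23 minutes."

tokens = [
    # (text_start, text_end, spoken text) -- end is exclusive.
    (0, 1, "I"),
    (2, 8, "waited"),
    (9, 11, "twenty"),   # spoken "twenty three" ...
    (9, 11, "three"),    # ... assumed to share the written span "23"
    (12, 19, "minutes"),
    (19, 20, "."),
]

# Recover each token's written form by slicing Transcript.text.
written_forms = [transcript_text[start:end] for start, end, _ in tokens]
print(written_forms)
```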
Sentence
represents a sentence, containing one or more tokens.
```proto
message Sentence {
  int32 token_start = 1;
  int32 token_end = 2;
}
```
Sentence.token_start and Sentence.token_end denote a range of tokens that are part of the sentence (inclusive and exclusive, respectively).
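A sentence's written text can be recovered by chaining its token range into the tokens' text offsets, as in this sketch. Plain tuples stand in for the protobuf messages, and all offsets are invented for illustration.

```python
# Invented transcript with two sentences.
transcript_text = "Hello there. How are you?"

tokens = [  # (text_start, text_end), end exclusive
    (0, 5), (6, 11), (11, 12),               # "Hello", "there", "."
    (13, 16), (17, 20), (21, 24), (24, 25),  # "How", "are", "you", "?"
]
sentences = [(0, 3), (3, 7)]  # (token_start, token_end), end exclusive

def sentence_text(sentence):
    """Slice Transcript.text from the first token's start to the last token's end."""
    token_start, token_end = sentence
    first, last = tokens[token_start], tokens[token_end - 1]
    return transcript_text[first[0]:last[1]]

texts = [sentence_text(s) for s in sentences]
print(texts)  # ['Hello there.', 'How are you?']
```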
Paragraph
represents a paragraph, containing one or more sentences.
```proto
message Paragraph {
  int32 sentence_start = 1;
  int32 sentence_end = 2;
}
```
Paragraph.sentence_start and Paragraph.sentence_end denote a range of sentences that are part of the paragraph (inclusive and exclusive, respectively).
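The same [start, end) convention chains one level up: a paragraph's token range follows from its first and last sentences. A minimal sketch with made-up ranges:

```python
# Invented sentence and paragraph ranges; end indexes are exclusive.
sentences = [(0, 4), (4, 9), (9, 12)]   # (token_start, token_end)
paragraphs = [(0, 2), (2, 3)]           # (sentence_start, sentence_end)

def paragraph_token_range(paragraph):
    """Resolve a paragraph to its span of token indexes."""
    sentence_start, sentence_end = paragraph
    first_sentence = sentences[sentence_start]
    last_sentence = sentences[sentence_end - 1]
    return (first_sentence[0], last_sentence[1])

ranges = [paragraph_token_range(p) for p in paragraphs]
print(ranges)  # [(0, 9), (9, 12)]
```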
Keyterm represents an important word in the transcript. The top 10 keyterms can give a quick summary of the transcript. Each keyterm has a score, indicating its importance, and token_start_indexes, which are indexes into tokens where the keyterm occurs.
```proto
message Keyterm {
  string text = 1;
  double score = 2;
  repeated int32 token_start_indexes = 3;
}
```
Speaker Information
Transcript
may contain speaker information, in the form of Token.speaker_id
and Transcript.speaker_names, where the latter contains mappings from speaker ID
to speaker name as available.
Speaker information will be available in two cases:
- If separate recognition per channel was used. In this case, each audio channel is considered to represent a distinct speaker. Speaker IDs 1, 2, 3… represent consecutive audio channels 0, 1, 2…, and Transcript.speaker_names contains an entry for each channel in the form (N+1, “Channel N”).
- If speaker diarization was used but not separate recognition per channel,
then speaker information is based on speaker diarization. If speaker
identification was also used, then
Transcript.speaker_names
will contain any determined speaker ID to speaker name mappings, otherwise it will be empty.
When speaker information is available, a speaker change creates a new paragraph. Thus, all the tokens within any paragraph are for the same speaker.
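Putting this together, the sketch below labels paragraphs by speaker, with a fallback for IDs that have no entry in speaker_names. The names and IDs are made up; with separate recognition per channel, the map would instead look like {1: "Channel 0", 2: "Channel 1", …}.

```python
# Invented speaker_names map (speaker ID -> name); ID 3 is deliberately absent.
speaker_names = {1: "Alice", 2: "Bob"}

# One token per paragraph suffices for labeling, since all tokens in a
# paragraph share the same speaker.
paragraph_first_speaker_ids = [1, 2, 3]

labels = [
    speaker_names.get(speaker_id, f"Speaker {speaker_id}")
    for speaker_id in paragraph_first_speaker_ids
]
print(labels)  # ['Alice', 'Bob', 'Speaker 3']
```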