Stored Data#

We store data in a structured format for ease of use in downstream applications. See the sections below for details.


StoredObject is a structure that contains all data about a given object (except the audio itself). It contains information specified by the user in StorageConfig at the start of the transcription request, as well as information generated during processing.

message StoredObject {
    string object_id = 1;
    map<string, string> metadata = 2;
    string title = 3;
    google.protobuf.Timestamp datetime = 4;
    google.protobuf.Timestamp stored_datetime = 5;
    int32 duration_ms = 6;
    int32 num_audio_channels = 10;
    bool audio_stored = 11;
    Transcript transcript = 7;
}
  • object_id is the object ID as specified by the user in StorageConfig or auto-generated if it was not specified.
  • metadata, title and datetime are as specified by the user in StorageConfig, except that if datetime was not specified, it will be equal to stored_datetime. stored_datetime is the datetime when the object was stored.
  • duration_ms is the duration of the audio in milliseconds. num_audio_channels is the number of audio channels transcribed and possibly stored.
  • audio_stored indicates whether audio is stored.
  • transcript is a structured representation of the transcript. If the transcript is not stored, this field is not set.
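As a minimal sketch of how these fields might be consumed, the snippet below summarizes a stored object. A plain dict stands in for a deserialized StoredObject; the field names match the message above, but the values are made up for illustration.

```python
# A plain dict stands in for a deserialized StoredObject (illustrative only).
stored = {
    "object_id": "obj-123",          # user-specified or auto-generated
    "metadata": {"source": "phone"},
    "title": "Support call",
    "duration_ms": 61_500,
    "num_audio_channels": 2,
    "audio_stored": False,
}

def summarize(obj: dict) -> str:
    """Render a one-line summary of a stored object."""
    seconds = obj["duration_ms"] / 1000
    return (f"{obj['object_id']}: '{obj['title']}', "
            f"{seconds:.1f}s, {obj['num_audio_channels']} channel(s)")

print(summarize(stored))
```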


Transcript contains all information about recognized speech and speakers in a structured format.

message Transcript {
    string text = 1;
    repeated Token tokens = 2;
    repeated Sentence sentences = 3;
    repeated Paragraph paragraphs = 4;
    repeated Keyterm keyterms = 6;
    map<int32, string> speaker_names = 7;
}

Transcript.text is the entire transcript as one string. It represents recognized speech in written form (not spoken).

The transcript is composed of tokens, which are then grouped into sentences, which are grouped into paragraphs, similar to how we would write a document.

Token represents a recognized unit in the audio.

message Token {
    int32 idx = 1;
    int32 text_start = 2;
    int32 text_end = 3;
    // Spoken text.
    string text = 4;
    int32 start_ms = 5;
    int32 duration_ms = 6;
    double confidence = 7;
    int32 speaker_id = 8;
    bool profane = 9;
}
  • Token.idx is the index of the token in the transcript starting with 0.
  • Token.text_start and Token.text_end denote a range in Transcript.text that corresponds to the token (inclusive and exclusive, respectively).
  • Token.text is the token in spoken form. Tokens are typically words and punctuation marks and are always in spoken form. For example, “twenty”, “three” are two tokens in spoken form, while “23” could be their corresponding written form in Transcript.text.
  • Token.start_ms and Token.duration_ms denote where the token was recognized in the audio.
  • Token.confidence is the estimated probability that the token was recognized correctly and has a value between 0 and 1 inclusive. If it is equal to 0, it means that confidence is not available for this token, which can happen in rare cases.
  • Token.speaker_id is the speaker ID or 0 if not available; refer to Speaker Information below.
  • Token.profane indicates if the token is part of a word or phrase that was masked as per content moderation settings; refer to Moderate Content. Note that Token.text is not masked in that case, only the associated part of Transcript.text is.
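The spoken-versus-written relationship can be sketched as follows. Plain dicts stand in for Token messages, and the toy data (including the assumption that both spoken tokens of a written number share one text range) is illustrative, not taken from the API.

```python
# Transcript.text holds the written form; each token's text_start/text_end
# index into it (start inclusive, end exclusive).
transcript_text = "It costs 23 dollars."
tokens = [
    {"idx": 0, "text": "it",      "text_start": 0,  "text_end": 2},
    {"idx": 1, "text": "costs",   "text_start": 3,  "text_end": 8},
    {"idx": 2, "text": "twenty",  "text_start": 9,  "text_end": 11},
    {"idx": 3, "text": "three",   "text_start": 9,  "text_end": 11},
    {"idx": 4, "text": "dollars", "text_start": 12, "text_end": 19},
    {"idx": 5, "text": ".",       "text_start": 19, "text_end": 20},
]

# Map each spoken token to its written-form slice of Transcript.text.
for tok in tokens:
    written = transcript_text[tok["text_start"]:tok["text_end"]]
    print(f'{tok["text"]!r} -> {written!r}')
```

Note how the spoken tokens “twenty” and “three” both resolve to the written substring “23”.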

Sentence represents a sentence, containing one or more tokens.

message Sentence {
    int32 token_start = 1;
    int32 token_end = 2;
}
  • Sentence.token_start and Sentence.token_end denote a range of tokens that are part of the sentence (inclusive and exclusive, respectively).

Paragraph represents a paragraph, containing one or more sentences.

message Paragraph {
    int32 sentence_start = 1;
    int32 sentence_2 = 2;
}
  • Paragraph.sentence_start and Paragraph.sentence_end denote a range of sentences that are part of the paragraph (inclusive and exclusive, respectively).
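Walking the paragraph → sentence → token nesting can be sketched like this. Plain dicts stand in for the protobuf messages, and the data is illustrative.

```python
# Toy transcript structure: 1 paragraph, 2 sentences, 5 tokens.
tokens = [{"text": w} for w in ["hello", "world", "how", "are", "you"]]
sentences = [
    {"token_start": 0, "token_end": 2},   # tokens 0-1
    {"token_start": 2, "token_end": 5},   # tokens 2-4
]
paragraphs = [{"sentence_start": 0, "sentence_end": 2}]

def paragraph_tokens(para, sentences, tokens):
    """Collect every token in a paragraph via its sentence range."""
    out = []
    for sent in sentences[para["sentence_start"]:para["sentence_end"]]:
        out.extend(tokens[sent["token_start"]:sent["token_end"]])
    return out

words = [t["text"] for t in paragraph_tokens(paragraphs[0], sentences, tokens)]
print(" ".join(words))  # all five tokens, in order
```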

Keyterm represents an important word in the transcript. The top 10 keyterms can give a quick summary about the transcript. Each keyterm has a score, indicating its importance, and token_start_indexes, which are indexes into tokens where the keyterm occurs.

message Keyterm {
    string text = 1;
    double score = 2;
    repeated int32 token_start_indexes = 3;
}
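Ranking keyterms by score, as described above, might look like the following sketch. The terms, scores, and token indexes are made up for illustration.

```python
# Toy keyterms; dicts stand in for Keyterm messages.
keyterms = [
    {"text": "invoice",  "score": 0.91, "token_start_indexes": [4, 37]},
    {"text": "refund",   "score": 0.78, "token_start_indexes": [12]},
    {"text": "shipping", "score": 0.85, "token_start_indexes": [20, 29]},
]

# Take the highest-scoring keyterms as a quick summary of the transcript.
top = sorted(keyterms, key=lambda k: k["score"], reverse=True)[:10]
for k in top:
    print(f'{k["text"]}: score={k["score"]}, '
          f'occurs at tokens {k["token_start_indexes"]}')
```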

Speaker Information#

Transcript may contain speaker information, in the form of Token.speaker_id and Transcript.speaker_names, where the latter maps speaker IDs to speaker names, where available.

Speaker information will be available in two cases:

  • If separate recognition per channel was used. In this case, each audio channel is considered to represent a distinct speaker. Speaker IDs 1, 2, 3… represent consecutive audio channels 0, 1, 2…, and Transcript.speaker_names contains an entry for each channel in the form (N+1, “Channel N”).
  • If speaker diarization was used but not separate recognition per channel, then speaker information is based on speaker diarization. If speaker identification was also used, then Transcript.speaker_names will contain any determined speaker ID to speaker name mappings, otherwise it will be empty.

When speaker information is available, a speaker change creates a new paragraph. Thus, all the tokens within any paragraph are for the same speaker.
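Both rules can be sketched together: the channel-to-speaker naming scheme (speaker N+1 is “Channel N”) and the guarantee that all tokens in a paragraph share one speaker. Dicts stand in for the protobuf messages; the token data is illustrative.

```python
# Build speaker_names as described for separate recognition per channel:
# speaker ID N+1 corresponds to audio channel N.
num_channels = 2
speaker_names = {n + 1: f"Channel {n}" for n in range(num_channels)}

# Toy tokens from two paragraphs; a speaker change started the second one.
tokens = [
    {"text": "hello", "speaker_id": 1},
    {"text": "there", "speaker_id": 1},
    {"text": "hi",    "speaker_id": 2},
]

def paragraph_speaker(para_tokens, names):
    """All tokens in a paragraph share one speaker, so read it from any token."""
    sid = para_tokens[0]["speaker_id"]
    return names.get(sid, f"Speaker {sid}")

print(speaker_names)                           # {1: 'Channel 0', 2: 'Channel 1'}
print(paragraph_speaker(tokens[:2], speaker_names))
```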