Stored Data#

We store data in a structured format for ease of use in downstream applications. See the sections below for details.


StoredObject is a structure that contains all data about a given object (except the audio). It contains information specified by the user in StorageConfig at the start of the transcription request, as well as other information.

message StoredObject {
    string object_id = 1;
    map<string, string> metadata = 2;
    string title = 3;
    google.protobuf.Timestamp datetime = 4;
    google.protobuf.Timestamp stored_datetime = 5;
    int32 duration_ms = 6;
    int32 num_audio_channels = 10;
    bool audio_stored = 11;
    Transcript transcript = 7;
}

  • object_id is the object ID as specified by the user in StorageConfig or auto-generated if it was not specified.
  • metadata, title, and datetime are as specified by the user in StorageConfig, except that if datetime was not specified, it is set equal to stored_datetime. stored_datetime is the datetime at which the object was stored.
  • duration_ms is the duration of the audio in milliseconds. num_audio_channels is the number of audio channels transcribed and possibly stored.
  • audio_stored indicates whether audio is stored.
  • transcript is a structured representation of the transcript. If the transcript is not stored, this field is not present.
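The datetime defaulting rule above can be sketched in Python. The StoredObject dataclass and effective_datetime helper below are hypothetical stand-ins for illustration, not the generated protobuf classes or any real API:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class StoredObject:
    """Hypothetical stand-in for the StoredObject message (fields abridged)."""
    object_id: str
    stored_datetime: datetime
    datetime_: Optional[datetime] = None  # user-specified datetime, if any
    metadata: dict = field(default_factory=dict)
    title: str = ""

def effective_datetime(obj: StoredObject) -> datetime:
    # If datetime was not specified in StorageConfig,
    # it is equal to stored_datetime.
    return obj.datetime_ if obj.datetime_ is not None else obj.stored_datetime

obj = StoredObject(
    object_id="my-object",
    stored_datetime=datetime(2024, 1, 1, tzinfo=timezone.utc),
)
print(effective_datetime(obj))  # falls back to stored_datetime
```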


Transcript contains all information about recognized speech and speakers in a structured format.

message Transcript {
    string text = 1;
    repeated Token tokens = 2;
    map<int32, string> speaker_names = 7;
}

Transcript.text is the entire transcript as one string. It represents recognized speech in written form.

The transcript is composed of tokens; Token represents a recognized unit in the audio.

message Token {
    int32 idx = 1;
    int32 text_start = 2;
    int32 text_end = 3;
    string text = 4;
    int32 start_ms = 5;
    int32 duration_ms = 6;
    double confidence = 7;
    int32 speaker_id = 8;
}

Tokens here are much like those in Transcription Results. Notably, spaces in text are explicitly represented as space tokens.

  • Token.idx is the index of the token in the transcript starting with 0.
  • Token.text_start and Token.text_end denote a range in Transcript.text that corresponds to the token (inclusive and exclusive, respectively).
  • Token.text is the token text. Tokens are typically words, possibly with adjacent punctuation, or spaces as their own tokens.
  • Token.start_ms and Token.duration_ms denote where the token was recognized in the audio.
  • Token.confidence is the estimated probability that the token was recognized correctly and has a value between 0 and 1 inclusive.
  • Token.speaker_id is the speaker ID or 0 if not available; refer to Speaker Information below.
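A sketch of how these fields relate, using a hand-built Python dict as a hypothetical stand-in for the protobuf messages (real objects come from the service): each token's text_start/text_end slices into Transcript.text, and concatenating the token texts, including the explicit space tokens, reproduces the full transcript.

```python
# Hypothetical hand-built Transcript for illustration only.
transcript = {
    "text": "Hello world.",
    "tokens": [
        {"idx": 0, "text_start": 0, "text_end": 5, "text": "Hello",
         "start_ms": 120, "duration_ms": 300, "confidence": 0.98, "speaker_id": 1},
        # Spaces are explicitly represented as their own tokens.
        {"idx": 1, "text_start": 5, "text_end": 6, "text": " ",
         "start_ms": 420, "duration_ms": 40, "confidence": 1.0, "speaker_id": 0},
        {"idx": 2, "text_start": 6, "text_end": 12, "text": "world.",
         "start_ms": 460, "duration_ms": 350, "confidence": 0.95, "speaker_id": 1},
    ],
}

# text_start is inclusive and text_end is exclusive:
for tok in transcript["tokens"]:
    assert transcript["text"][tok["text_start"]:tok["text_end"]] == tok["text"]

# Concatenating all token texts yields Transcript.text:
assert "".join(tok["text"] for tok in transcript["tokens"]) == transcript["text"]
print("token ranges are consistent")
```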

Speaker Information#

Transcript may contain speaker information, in the form of Token.speaker_id and Transcript.speaker_names, where the latter contains mappings from speaker ID to speaker name as available.

Speaker information will be available in two cases:

  • If separate recognition per channel was used. In this case, each audio channel is considered to represent a distinct speaker. Speaker IDs 1, 2, 3… represent consecutive audio channels 0, 1, 2…, and Transcript.speaker_names contains an entry for each channel in the form (N+1, “Channel N”).
  • If speaker diarization was used but not separate recognition per channel, then speaker information is based on speaker diarization. If speaker identification was also used, then Transcript.speaker_names will contain any determined speaker ID to speaker name mappings, otherwise it will be empty.
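For the per-channel case, the speaker_names mapping described above can be sketched as follows. The build_channel_speaker_names and speaker_label helpers are hypothetical, written here for illustration and not part of the API:

```python
def build_channel_speaker_names(num_audio_channels: int) -> dict:
    # Speaker ID N+1 corresponds to audio channel N,
    # named "Channel N" in Transcript.speaker_names.
    return {n + 1: f"Channel {n}" for n in range(num_audio_channels)}

def speaker_label(speaker_id: int, speaker_names: dict) -> str:
    # A speaker_id of 0 means no speaker information is available.
    if speaker_id == 0:
        return "(unknown)"
    # Fall back to a generic label when no name mapping exists
    # (e.g. diarization was used without speaker identification).
    return speaker_names.get(speaker_id, f"Speaker {speaker_id}")

names = build_channel_speaker_names(2)
print(names)                    # {1: 'Channel 0', 2: 'Channel 1'}
print(speaker_label(1, names))  # Channel 0
print(speaker_label(3, {}))     # Speaker 3
```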