gRPC API Reference

Basic Transcription

These requests are used to transcribe audio. The Transcribe request is suitable for transcription of shorter audio segments, while the TranscribeStream request is suitable for transcription of possibly long audio as well as real-time audio.

rpc Transcribe

rpc Transcribe(TranscribeRequest) returns (TranscribeResponse) {}

message TranscribeRequest {
  string api_key = 1;
  TranscribeConfig config = 2;
  bytes audio = 3;
}

message TranscribeConfig {
  string audio_format = 1;
  int32 sample_rate_hertz = 2;
  int32 num_audio_channels = 3;

  SpeechContext speech_context = 4;

  bool enable_profanity_filter = 8;

  bool enable_global_speaker_diarization = 12;
  bool enable_streaming_speaker_diarization = 5;
  int32 min_num_speakers = 13;
  int32 max_num_speakers = 14;

  bool enable_speaker_identification = 16;
  repeated string cand_speaker_names = 17;
}

message TranscribeResponse {
  Result result = 1;
}

The Transcribe request transcribes an audio of limited duration and returns the complete transcription at once.

The audio is provided in the audio field of the request message. By default, the audio is assumed to use a container format which is automatically inferred. The maximum audio size is 5 * 1024^2 bytes and the maximum audio duration is 60 seconds (exceeding these limits will result in an error and no transcription). The result of the transcription is available in the result field in the response, which is a Result message. Words returned in a Transcribe are always final (is_final=true).

The following container formats are supported: aac, aiff, amr, asf, flac, mp3, ogg, wav. If config.audio_format is set to one of these formats, that format will be used without auto-detection.

PCM audio can be provided if config.audio_format is set to one of the supported PCM formats, which are: pcm_f32le, pcm_f32be, pcm_s32le, pcm_s32be, pcm_s16le, pcm_s16be. For example, pcm_f32le means float-32 little endian. When using a PCM format, config.sample_rate_hertz and config.num_audio_channels must be set. Supported sample rates are 2000 to 96000 Hz and supported numbers of channels are 1 and 2.
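
The following is a minimal Python sketch of a Transcribe call, assuming the service's .proto definitions have been compiled with grpcio-tools into modules named speech_service_pb2 and speech_service_pb2_grpc exposing a SpeechService stub; the module, stub, and endpoint names are placeholder assumptions, not part of this reference.

import grpc
import speech_service_pb2 as pb
import speech_service_pb2_grpc as pb_grpc

# Placeholder endpoint; substitute the actual service host.
channel = grpc.secure_channel("<SERVICE_HOST>:443", grpc.ssl_channel_credentials())
stub = pb_grpc.SpeechServiceStub(channel)

# Raw PCM audio: sample rate and channel count must be set in the config.
with open("audio.raw", "rb") as f:
    audio = f.read()

response = stub.Transcribe(pb.TranscribeRequest(
    api_key="<YOUR_API_KEY>",
    config=pb.TranscribeConfig(
        audio_format="pcm_s16le",
        sample_rate_hertz=16000,
        num_audio_channels=1,
    ),
    audio=audio,
))
print(" ".join(w.text for w in response.result.words))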

If desired, a speech context can be specified using config.speech_context. There are two methods of specifying a speech context. First, the speech context entries can be provided directly in config.speech_context.entries. Second, a speech context stored in the context of the user's Soniox account can be referenced using config.speech_context.name. It is an error if both entries and name are non-empty. For information about speech context, refer to Speech Adaptation.

The profanity filter can be enabled by setting config.enable_profanity_filter to true. If enabled, the service will attempt to detect profane words and return a masked word, which is only the first letter followed by asterisks (for example, f***). If this field is set to false or not set, the service will not attempt to filter profanities. When a word is masked by the profanity filter, the original non-masked word can be obtained from the orig_text field in Word.

For information about speaker diarization and identification, refer to Speaker AI.

rpc TranscribeStream

rpc TranscribeStream(stream TranscribeStreamRequest) returns (stream TranscribeStreamResponse) {}

message TranscribeStreamRequest {
  string api_key = 1;
  TranscribeStreamConfig config = 2;
  bytes audio = 3;
}

message TranscribeStreamConfig {
  string audio_format = 1;
  int32 sample_rate_hertz = 2;
  int32 num_audio_channels = 3;

  bool include_nonfinal = 5;

  SpeechContext speech_context = 4;

  bool enable_profanity_filter = 9;

  bool enable_global_speaker_diarization = 13;
  bool enable_streaming_speaker_diarization = 6;
  int32 min_num_speakers = 14;
  int32 max_num_speakers = 15;

  bool enable_speaker_identification = 17;
  repeated string cand_speaker_names = 18;
}

message TranscribeStreamResponse {
  Result result = 1;
}

The TranscribeStream request transcribes audio in streaming mode, optimized either for throughput or for low latency.

The client sends a sequence of TranscribeStreamRequest requests to the service. The api_key and config are provided in the first request and must not be included in later requests. The audio is provided in chunks using the audio field, which may be empty in any request.

TranscribeStream supports the same audio formats as Transcribe. If a container format is used, the audio chunks together must form a single container (audio file) and the boundaries between them have no significance. If a PCM format is used, then each audio chunk must contain a whole number of samples. The maximum audio size in a single request is 5 * 1024^2 bytes.

TranscribeStream can use a speech context in the same way as Transcribe.

The service returns a sequence of TranscribeStreamResponse messages, with transcription results available in the result field. The result field may be missing, and it is important for the client to check for its presence before interpreting it (e.g. has_result() in C++, HasField("result") in Python). If result is not present, then any non-final words returned in the previous response (if any) are still valid.

The client should not make any assumptions about the correspondence between requests and responses.

If include_nonfinal is false, TranscribeStream returns only final words and does not offer any latency guarantees. This mode is optimized for throughput and is essentially a streaming version of Transcribe, enabling transcription of audio of longer duration. The complete sequence of words is obtained by joining the words from all results.

If include_nonfinal is true, TranscribeStream returns both final and non-final words while minimizing the recognition latency. This mode is suitable for transcription of real-time audio streams where transcribed words need to be received as soon as possible after the audio where they are pronounced has been sent to the service. The currently available transcription from the start of the audio is obtained by joining final words in responses before the last one followed by all words in the last response (equivalently, by joining final words from all responses followed by non-final words in the last response).

Additionally, if include_nonfinal is true:

  • Audio should not be sent at a rate faster than real-time. If it is, the service may throttle processing or return an error. There are margins such that this should not occur for real-time streams under normal circumstances.
  • Minimum latency is achieved when using the PCM format pcm_s16le with sample rate 16 kHz and one audio channel. Alternatively any supported PCM format can be used with only a small increase of latency. Using a container format in this mode is not recommended due to possibly large latency resulting from audio decoding.
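
The following Python sketch streams a PCM file to TranscribeStream with include_nonfinal set to true and maintains the current transcription by combining the accumulated final words with the non-final words of the latest response. Module, stub, and endpoint names are the same placeholder assumptions as in the Transcribe sketch above.

import grpc
import speech_service_pb2 as pb
import speech_service_pb2_grpc as pb_grpc

stub = pb_grpc.SpeechServiceStub(
    grpc.secure_channel("<SERVICE_HOST>:443", grpc.ssl_channel_credentials()))

def request_stream(path, chunk_size=4096):
    # First request: api_key and config only; later requests: audio chunks.
    # With pcm_s16le and one channel, chunk_size must be even so that every
    # chunk contains a whole number of samples.
    yield pb.TranscribeStreamRequest(
        api_key="<YOUR_API_KEY>",
        config=pb.TranscribeStreamConfig(
            audio_format="pcm_s16le",
            sample_rate_hertz=16000,
            num_audio_channels=1,
            include_nonfinal=True,
        ),
    )
    with open(path, "rb") as f:
        # For a real-time source, audio should be sent no faster than real-time.
        while chunk := f.read(chunk_size):
            yield pb.TranscribeStreamRequest(audio=chunk)

final_words = []
for response in stub.TranscribeStream(request_stream("audio.raw")):
    if not response.HasField("result"):
        continue  # non-final words from the previous response remain valid
    nonfinal_words = []
    for word in response.result.words:
        (final_words if word.is_final else nonfinal_words).append(word.text)
    print(" ".join(final_words + nonfinal_words))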

For information about speaker diarization and identification, refer to Speaker AI.

message Result

message Result {
  repeated Word words = 1;
  int32 final_proc_time_ms = 2;
  int32 total_proc_time_ms = 3;
  repeated ResultSpeaker speakers = 6;
}

message ResultSpeaker {
  int32 speaker = 1;
  string name = 2;
}

The Result message represents a speech recognition result received in a response from Transcribe or TranscribeStream, containing transcribed words and other data.

The words field contains a sequence of Word messages representing transcribed words.

The final_proc_time_ms and total_proc_time_ms fields give the duration, in milliseconds, of the processed audio that resulted in final words and in all words, respectively. In a Transcribe request, both values are equal. In a TranscribeStream request, these behave as described in Processed Audio Duration.

If using Speaker Identification, the speakers field contains the latest associations between speaker numbers and candidate speakers (for all words from the start of the audio, not just words in this result).

message Word

message Word {
  string text = 1;
  int32 start_ms = 2;
  int32 duration_ms = 3;
  bool is_final = 4;
  int32 speaker = 5;
  string orig_text = 8;
}

The Word message represents an individual recognized word, which is given in the text field.

start_ms and duration_ms represent the time interval of the word in the audio. Understood as the half-open interval [start_ms, start_ms+duration_ms), the intervals of transcribed words from the start of the audio are guaranteed not to overlap.

is_final specifies whether the word is final. This distinction is relevant only when using TranscribeStream with include_nonfinal=true; in other cases is_final is always true. Refer to Final vs Non-final Words.

The orig_text field indicates the original word when the word in text was masked by the profanity filter, otherwise it is empty. Refer to Transcribe.

If using Speaker Diarization, the speaker field indicates the speaker number. Valid speaker numbers are greater than 0.
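
As an illustrative fragment, the fields of each word can be read as follows; response here stands for a TranscribeResponse or TranscribeStreamResponse obtained as in the sketches above, and the speaker field is meaningful only when speaker diarization is enabled.

# 'response' is a response obtained as shown in the earlier sketches.
for word in response.result.words:
    end_ms = word.start_ms + word.duration_ms
    finality = "final" if word.is_final else "non-final"
    masked = f" (masked from {word.orig_text})" if word.orig_text else ""
    print(f"[{word.start_ms}-{end_ms} ms] speaker {word.speaker} ({finality}): {word.text}{masked}")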

Asynchronous Transcription

These requests allow a client to upload files to be transcribed asynchronously and to retrieve the transcription results later. This feature supports a variety of media file formats including video files.

A file is uploaded for transcription using TranscribeAsync. The status of the transcription can be queried using GetTranscribeAsyncStatus. The result of the transcription is retrieved using GetTranscribeAsyncResult.

A file that has been uploaded (and not yet deleted) is in one of the following states:

  • QUEUED: The file is queued to be transcribed.
  • TRANSCRIBING: The file is being transcribed.
  • COMPLETED: The file has been transcribed successfully, the result is available.
  • FAILED: Transcription has failed, the result is not and will not be available.

A file that is not in the TRANSCRIBING state can be deleted using DeleteTranscribeAsyncFile. It is the responsibility of the user to delete files; they are not deleted automatically.

There is a limit on the number of files in the QUEUED and TRANSCRIBING states (by default 20) and also a limit on the total number of files (by default 200). If any of these limits is reached, further TranscribeAsync requests will be rejected with gRPC status code RESOURCE_EXHAUSTED and details message starting with <too_many_pending_files> or <too_many_files> respectively. It is the responsibility of the user to prevent or handle these errors.

rpc TranscribeAsync

rpc TranscribeAsync(stream TranscribeAsyncRequest) returns (TranscribeAsyncResponse) {}

message TranscribeAsyncRequest {
  string api_key = 1;
  string reference_name = 3;
  bytes audio = 4;
}

message TranscribeAsyncResponse {
  string file_id = 1;
}

The TranscribeAsync request is used to upload a file for asynchronous transcription. The client sends a sequence of TranscribeAsyncRequest messages, where the first message specifies the api_key and reference_name and may contain an audio chunk, and any further messages contain only an audio chunk. The audio chunks are concatenated to form the complete audio file. The maximum size of an audio chunk is 5 * 1024^2 bytes. The maximum total size is 100 * 1024^2 bytes. The maximum total duration of audio is 2 hours.

Audio is extracted from the file as a part of TranscribeAsync. For larger files, it may take a few seconds to decode the audio after all the audio data has been received by the service. If there is an error during decoding, the TranscribeAsync request will fail with gRPC status code UNKNOWN and details message starting with <invalid_media_file>.

The reference_name specified in the first request is intended to enable the user to identify the file after it has been uploaded. It can be any string not longer than 256 characters, including the empty string, and duplicates are allowed. The service does not use this field, it is only for reference.

If TranscribeAsync succeeds, the automatically assigned file_id is returned. The current state of the file can be queried by calling GetTranscribeAsyncStatus with the file_id.

The file will initially be in the QUEUED state and will transition to the TRANSCRIBING state when the transcription starts. The time that this takes depends on the current service load. When transcription has completed, the file will transition to the COMPLETED state and the results can be retrieved using GetTranscribeAsyncResult. If transcription fails, the file will instead transition to the FAILED state. The file will then remain in the COMPLETED or FAILED state until it is deleted using DeleteTranscribeAsyncFile.
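
The following Python sketch uploads a media file and polls its status until transcription finishes. Module, stub, and endpoint names are placeholder assumptions, as in the earlier sketches.

import time
import grpc
import speech_service_pb2 as pb
import speech_service_pb2_grpc as pb_grpc

stub = pb_grpc.SpeechServiceStub(
    grpc.secure_channel("<SERVICE_HOST>:443", grpc.ssl_channel_credentials()))

def upload_requests(path, chunk_size=1024 * 1024):
    # The first request carries api_key and reference_name; every request may carry
    # an audio chunk, and each chunk must stay under the 5 * 1024^2-byte limit.
    with open(path, "rb") as f:
        first = True
        while chunk := f.read(chunk_size):
            if first:
                yield pb.TranscribeAsyncRequest(
                    api_key="<YOUR_API_KEY>",
                    reference_name="example upload",
                    audio=chunk,
                )
                first = False
            else:
                yield pb.TranscribeAsyncRequest(audio=chunk)

file_id = stub.TranscribeAsync(upload_requests("recording.mp4")).file_id

# Poll until the file leaves the QUEUED and TRANSCRIBING states.
while True:
    status = stub.GetTranscribeAsyncStatus(pb.GetTranscribeAsyncStatusRequest(
        api_key="<YOUR_API_KEY>", file_id=file_id)).files[0].status
    if status in ("COMPLETED", "FAILED"):
        break
    time.sleep(5)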

rpc GetTranscribeAsyncStatus

rpc GetTranscribeAsyncStatus(GetTranscribeAsyncStatusRequest) returns (GetTranscribeAsyncStatusResponse) {}

message GetTranscribeAsyncStatusRequest {
  string api_key = 1;
  string file_id = 2;
}

message GetTranscribeAsyncStatusResponse {
  repeated TranscribeAsyncFileStatus files = 1;
}

message TranscribeAsyncFileStatus {
  string file_id = 1;
  string reference_name = 2;
  // One of: QUEUED, TRANSCRIBING, COMPLETED, FAILED
  string status = 3;
  // UTC timestamp
  google.protobuf.Timestamp created_time = 4;
}

The GetTranscribeAsyncStatus request returns the state and other information for a specific file or for all existing files.

If file_id in the request is non-empty, information for a file with that ID is returned. In this case, if there is no file with that ID, the request fails with gRPC status code NOT_FOUND and details message starting with <file_id_not_found>. If file_id is empty, information about all existing files is returned.

The files field in the response is a sequence of TranscribeAsyncFileStatus messages. If file_id in the request was non-empty, there will be exactly one element, otherwise there will be one element for each existing file ordered by increasing created_time.

The following information is returned for each file in the response: file_id, reference_name, status (state) and created_time (UTC timestamp of when TranscribeAsync completed).
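
As a short sketch (with pb and stub set up as in the earlier sketches), all files can be listed by leaving file_id empty:

# Empty file_id: return information about all existing files, ordered by created_time.
response = stub.GetTranscribeAsyncStatus(pb.GetTranscribeAsyncStatusRequest(
    api_key="<YOUR_API_KEY>", file_id=""))
for f in response.files:
    print(f.file_id, f.reference_name, f.status, f.created_time.ToDatetime())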

rpc GetTranscribeAsyncResult

rpc GetTranscribeAsyncResult(GetTranscribeAsyncResultRequest) returns (stream GetTranscribeAsyncResultResponse) {}

message GetTranscribeAsyncResultRequest {
  string api_key = 1;
  string file_id = 2;
}

message GetTranscribeAsyncResultResponse {
  Result result = 1;
}

The GetTranscribeAsyncResult request retrieves the transcription results for a file in the COMPLETED state.

The file for which to retrieve results is specified by file_id. If there is no file with that ID, the request fails with gRPC status code NOT_FOUND and details message starting with <file_id_not_found>.

If the file is still in the QUEUED or TRANSCRIBING state, the request fails with gRPC status code FAILED_PRECONDITION and details message starting with <file_not_transcribed_yet>. If the file is in the FAILED state, it fails with gRPC status code FAILED_PRECONDITION and details message starting with <file_transcription_failed>.

The transcription results are returned as a sequence of Result messages embedded in a sequence of GetTranscribeAsyncResultResponse responses, similar to TranscribeStream. The user can assemble the complete result by concatenating the words from all responses, which are guaranteed to be final, and taking final_proc_time_ms and total_proc_time_ms from the last response, which are equal and represent the audio duration.
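
A sketch of this assembly (with pb, stub and file_id as in the earlier sketches):

words = []
audio_duration_ms = 0
for response in stub.GetTranscribeAsyncResult(pb.GetTranscribeAsyncResultRequest(
        api_key="<YOUR_API_KEY>", file_id=file_id)):
    words.extend(response.result.words)                      # all words are final
    audio_duration_ms = response.result.total_proc_time_ms   # last response holds the duration
print(" ".join(w.text for w in words))
print(f"audio duration: {audio_duration_ms} ms")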

rpc DeleteTranscribeAsyncFile

rpc DeleteTranscribeAsyncFile(DeleteTranscribeAsyncFileRequest) returns (DeleteTranscribeAsyncFileResponse) {}

message DeleteTranscribeAsyncFileRequest {
  string api_key = 1;
  string file_id = 2;
}

message DeleteTranscribeAsyncFileResponse {
}

The DeleteTranscribeAsyncFile request deletes a specific file.

The file to delete is specified by file_id. If there is no file with that ID, the request fails with gRPC status code NOT_FOUND and details message starting with <file_id_not_found>.

A file can be deleted as long as it is not in the TRANSCRIBING state. If it is, the request fails with gRPC status code FAILED_PRECONDITION and details message starting with <file_being_transcribed>.

Transcription of Meetings

The TranscribeMeeting request is provided for the purpose of transcribing a meeting with a separate audio stream for each participant in real time. A requirement for using this is that the application performs voice activity detection and sends only segments of audio with voice activity detected.

The term stream is used synonymously with meeting participant. The term segment means a contiguous audio recording sent to the service in the context of a specific stream. Each segment is itself sent to the service in a number of requests, to enable low-latency operation.

rpc TranscribeMeeting

rpc TranscribeMeeting(stream TranscribeMeetingRequest) returns (stream TranscribeMeetingResponse) {}

message TranscribeMeetingRequest {
  string api_key = 1;
  TranscribeMeetingConfig config = 2;
  int32 seq_num = 3;
  int32 stream_id = 4;
  bool start_of_segment = 5;
  bytes audio = 6;
  bool end_of_segment = 7;
}

message TranscribeMeetingConfig {
  string audio_format = 1;
  int32 sample_rate_hertz = 2;
  int32 num_audio_channels = 3;
  SpeechContext speech_context = 4;
  bool enable_profanity_filter = 5;
}

message TranscribeMeetingResponse {
  int32 seq_num = 1;
  int32 stream_id = 2;
  bool start_of_segment = 3;
  bool end_of_segment = 4;
  Result result = 5;
  string error = 6;
}

The client sends a sequence of TranscribeMeetingRequest requests to the service. The api_key and config are provided in the first request and must not be included in later requests. The configuration specified in config applies to all streams and has the same meaning as in TranscribeStreamRequest. However, non-final results are always enabled, so there is no include_nonfinal option.

The seq_num is an opaque value which is returned in the response. More information about responses is given below.

The stream_id determines which stream the fields start_of_segment, audio and end_of_segment apply to. A stream_id of 0 means no stream, and in that case these fields must have default values. It is recommended to send at least one request every 10 seconds to prevent the session from timing out; using stream_id 0 enables doing so when no audio needs to be sent to the service.

Assuming that stream_id is not 0, the audio field contains new audio data for the stream, if any. The start_of_segment and end_of_segment flags indicate whether an audio segment starts before audio, or ends after audio, respectively. These flags must be consistent within a stream. Specifically: start_of_segment must be true in the first request for the stream, and start_of_segment must also be true if end_of_segment was true in the previous request for the same stream. If this is not the case, some of the audio will not be processed.

Note that, effectively, an audio segment is defined as the concatenation of audio data starting from a request where start_of_segment is true up to the first request (the same or a later one) where end_of_segment is true, considering only requests for the same stream.

IMPORTANT: Each audio segment is decoded into audio samples independently. If using a container format (e.g., OGG), each audio segment must be encoded independently of previous segments in the same stream.

IMPORTANT: Active streams, that is streams where end_of_segment was false in the last request for that stream, occupy resources on the service. Make sure to terminate an active stream when it is no longer relevant by sending a request with end_of_segment equal to true (for example, when the meeting participant disconnects).

Transcription results are returned as a stream of TranscribeMeetingResponse responses. For each request, the service will send exactly one response, which will have the same seq_num, stream_id, start_of_segment and end_of_segment. Within the same stream, the order of responses will match the order of requests, but this is not generally true across different streams. The seq_num field can be used to reliably match responses to requests.

IMPORTANT: In the future, the behavior may change such that there might not be one response for each request, but one response could represent a number of consecutive requests for the same stream belonging to the same segment. In such a response, start_of_segment would be that of the first of these requests, while end_of_segment and seq_num would be that of the last of these requests.

The actual transcription results are given in the result field in the same manner as for TranscribeStream, but they must be interpreted in the context of the specific stream. Note that the client must check whether result is present before interpreting it (refer to TranscribeStream). For the special stream ID 0, result will never be present.

The transcript is always finalized at the end of each segment. Specifically, in a response where end_of_segment is true, result will be present and will not contain any non-final words.
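
The following Python sketch sends a single voice-activity segment for one participant (stream_id 1) and prints the words returned for that stream. Module, stub, and endpoint names are placeholder assumptions, as in the earlier sketches; the segment chunks stand in for audio produced by the application's voice activity detection.

import grpc
import speech_service_pb2 as pb
import speech_service_pb2_grpc as pb_grpc

stub = pb_grpc.SpeechServiceStub(
    grpc.secure_channel("<SERVICE_HOST>:443", grpc.ssl_channel_credentials()))

# Placeholder: five 100 ms chunks of silence (pcm_s16le, 16 kHz, mono).
segment_chunks = [b"\x00" * 3200] * 5

def meeting_requests(chunks):
    # First request: api_key and config only; stream_id 0 carries no audio.
    yield pb.TranscribeMeetingRequest(
        api_key="<YOUR_API_KEY>",
        config=pb.TranscribeMeetingConfig(
            audio_format="pcm_s16le",
            sample_rate_hertz=16000,
            num_audio_channels=1,
        ),
        stream_id=0,
    )
    # One segment for participant 1, split into chunks for low-latency operation.
    for i, chunk in enumerate(chunks):
        yield pb.TranscribeMeetingRequest(
            seq_num=i + 1,
            stream_id=1,
            start_of_segment=(i == 0),
            audio=chunk,
            end_of_segment=(i == len(chunks) - 1),
        )

for response in stub.TranscribeMeeting(meeting_requests(segment_chunks)):
    if response.HasField("result"):
        text = " ".join(w.text for w in response.result.words)
        print(f"stream {response.stream_id}, seq {response.seq_num}: {text}")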

Speech Context

These requests are used to manage the user's speech contexts stored in the Soniox Cloud. Storing a speech context enables using it in a Transcribe or TranscribeStream request as an alternative to directly specifying it. For general information about speech contexts, refer to Speech Adaptation.

Stored speech contexts exist in the context of a user account, where they are uniquely identified by a user-specified name. The user account is inferred from the API key used in the request.

rpc CreateSpeechContext

rpc CreateSpeechContext(CreateSpeechContextRequest) returns (CreateSpeechContextResponse) {}

message CreateSpeechContextRequest {
  string api_key = 1;
  SpeechContext speech_context = 2;
}

message CreateSpeechContextResponse {
}

The CreateSpeechContext request creates a stored speech context.

The name of the speech context to create is specified as a part of the speech context in speech_context.name, which must be non-empty.

If a speech context with that name already exists, an error with status code ALREADY_EXISTS is returned. If the speech_context does not satisfy the SpeechContext requirements, an error with status code INVALID_ARGUMENT and a message describing the problem is returned.
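
A short sketch (with pb and stub as in the earlier sketches) that stores a speech context under the hypothetical name product-terms:

stub.CreateSpeechContext(pb.CreateSpeechContextRequest(
    api_key="<YOUR_API_KEY>",
    speech_context=pb.SpeechContext(
        name="product-terms",
        entries=[pb.SpeechContextEntry(phrases=["soniox", "speech adaptation"], boost=10.0)],
    ),
))

The stored context can then be referenced from a Transcribe or TranscribeStream request by setting config.speech_context.name to "product-terms".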

rpc UpdateSpeechContext

rpc UpdateSpeechContext(UpdateSpeechContextRequest) returns (UpdateSpeechContextResponse) {}

message UpdateSpeechContextRequest {
  string api_key = 1;
  SpeechContext speech_context = 2;
}

message UpdateSpeechContextResponse {
}

The UpdateSpeechContext request updates an existing stored speech context.

The name of the speech context to update is specified as a part of the speech context in speech_context.name.

If there is no speech context with that name, an error with status code NOT_FOUND is returned. If the speech_context does not satisfy the SpeechContext requirements, an error with status code INVALID_ARGUMENT and a message describing the problem is returned.

rpc DeleteSpeechContext

rpc DeleteSpeechContext(DeleteSpeechContextRequest) returns (DeleteSpeechContextResponse) {}

message DeleteSpeechContextRequest {
  string api_key = 1;
  string name = 2;
}

message DeleteSpeechContextResponse {
}

The DeleteSpeechContext request deletes a stored speech context.

The name of the speech context to delete is specified in name. If there is no speech context with that name, an error with status code NOT_FOUND is returned.

rpc ListSpeechContextNames

rpc ListSpeechContextNames(ListSpeechContextNamesRequest) returns (ListSpeechContextNamesResponse) {}

message ListSpeechContextNamesRequest {
  string api_key = 1;
}

message ListSpeechContextNamesResponse {
  repeated string names = 1;
}

The ListSpeechContextNames request returns the names of all stored speech contexts.

The names are returned in the names field in no specific order.

rpc GetSpeechContext

rpc GetSpeechContext(GetSpeechContextRequest) returns (GetSpeechContextResponse) {}

message GetSpeechContextRequest {
  string api_key = 1;
  string name = 2;
}

message GetSpeechContextResponse {
  SpeechContext speech_context = 1;
}

The GetSpeechContext request retrieves a stored speech context.

The name of the speech context to retrieve is specified in the name field. If there is no speech context with that name, an error with status code NOT_FOUND is returned.

message SpeechContext

message SpeechContext {
  repeated SpeechContextEntry entries = 1;
  string name = 2;
}

The SpeechContext message represents a speech context.

The entries field contains the entries defining the speech context, represented by SpeechContextEntry messages. The name field represents the name of the speech context.

The presence of entries and name depends on the context:

  • When a speech context is sent in CreateSpeechContext or UpdateSpeechContext, both must be non-empty; when returned in GetSpeechContext, both are guaranteed to be non-empty.
  • In a Transcribe or TranscribeStream request, at most one of them may be non-empty. If both are empty, no speech context is used; if entries is non-empty, those entries are used; if name is non-empty, the stored speech context with that name is used.

Requirements:

  • The size of the name must not exceed 50 characters.
  • Each entry must satisfy the SpeechContextEntry requirements.
  • The number of phrases in the entire speech context must not exceed 100.
  • There must be no duplicate phrases in the entire speech context, after removing leading, trailing and repeated spaces.

message SpeechContextEntry

message SpeechContextEntry {
  repeated string phrases = 1;
  double boost = 2;
}

The SpeechContextEntry message represents an entry in a speech context, defined by a list of phrases and a single boost value that applies to these phrases. Words in each phrase are separated by spaces.

Requirements:

  • There must be at least one phrase.
  • The size of a phrase must not exceed 100 characters.
  • The number of words in a phrase must be between 1 and 5.
  • The size of a word must not exceed 25 characters.
  • A phrase may contain only the following characters: a-z (lower-case only), ' (apostrophe), - (hyphen/minus), and the space character.
  • The boost value must be between -30 and 30 inclusive.

Speaker AI

Speaker Diarization

Speaker diarization distinguishes speakers based on their voice. Please refer to the Speaker AI tutorial page for an introduction and general information.

Speaker diarization is available for the Transcribe and TranscribeStream requests. It is enabled by setting config.enable_global_speaker_diarization or config.enable_streaming_speaker_diarization to true, to use global or streaming speaker diarization mode respectively. When speaker diarization is enabled, a speaker number is included with each returned word (speaker field in Word).

When global speaker diarization is used with TranscribeStream, specific restrictions and considerations apply:

  • config.include_nonfinal must be false. Therefore, real-time (low-latency) operation is not possible.
  • Transcription results will be returned only after the end of the request stream. It may take some time before these are returned, depending on the audio duration.
  • The total audio duration is limited to no more than 2 hours.

Streaming speaker diarization does not have the restrictions above, but generally has lower accuracy, since it is optimized for low-latency real-time transcription.

When using speaker diarization, the minimum and maximum number of speakers can be specified by setting config.min_num_speakers and config.max_num_speakers respectively. By default (if these values are 0), the service assumes that there are between 1 and 10 speakers. The maximum permitted value of max_num_speakers is 20. Note that if the actual number of speakers in the audio is outside of the specified (or default) range, the accuracy of speaker diarization may be low.
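
As a sketch (with pb, stub and a PCM audio buffer as in the earlier sketches), global speaker diarization expecting between 2 and 4 speakers can be requested as follows:

# 'audio' holds raw pcm_s16le samples, read as in the Transcribe sketch above.
response = stub.Transcribe(pb.TranscribeRequest(
    api_key="<YOUR_API_KEY>",
    config=pb.TranscribeConfig(
        audio_format="pcm_s16le",
        sample_rate_hertz=16000,
        num_audio_channels=1,
        enable_global_speaker_diarization=True,
        min_num_speakers=2,
        max_num_speakers=4,
    ),
    audio=audio,
))
for word in response.result.words:
    print(f"speaker {word.speaker}: {word.text}")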

Speaker Identification

Speaker identification works together with speaker diarization to associate numbered speakers with named candidate speakers based on voice samples provided in advance by the user.

A set of gRPC API calls are available for speaker management:

  • AddSpeaker: Add a new speaker.
  • GetSpeaker: Return information about a specific speaker.
  • RemoveSpeaker: Remove a specific speaker.
  • ListSpeakers: Return a list of registered speakers.
  • AddSpeakerAudio: Add a new audio for a specific speaker.
  • GetSpeakerAudio: Retrieve a specific audio of a specific speaker.
  • RemoveSpeakerAudio: Remove a specific audio of a specific speaker.

Each speaker is identified by a speaker name, and each of the speaker's audios is identified by an audio name. Speaker names are unique in the context of the Soniox user account, while audio names are unique in the context of the speaker they belong to.

A simple command-line application manage_speakers is provided as a frontend to the speaker management API. This application can be used to add speakers and audios for testing purposes, and it is also a good reference for using these API calls directly.

In order to use speaker identification with Transcribe or TranscribeStream, the following must be done:

  • Speaker Diarization must be enabled (either global or streaming mode).
  • config.enable_speaker_identification must be set to true.
  • Names of candidate speakers must be provided in config.cand_speaker_names.

Each of the candidate speakers specified must be an existing speaker as added using AddSpeaker (or manage_speakers --add_speaker). If this is not the case, an error will be returned. However, if some of these speakers do not have any audios, no error will be returned, but it will not be possible to identify those speakers.

Results of speaker identification are provided in the speakers field in the Result structure. This is a list of associations between speaker numbers and candidate speakers. This list will not include entries for speaker numbers that were not associated with a candidate speaker.
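
As a sketch (with pb, stub and a PCM audio buffer as in the earlier sketches), the speakers field can be used to label words with candidate speaker names; "John" and "Judy" are hypothetical speakers assumed to have been added previously with AddSpeaker and AddSpeakerAudio.

# 'audio' holds raw pcm_s16le samples, read as in the Transcribe sketch above.
response = stub.Transcribe(pb.TranscribeRequest(
    api_key="<YOUR_API_KEY>",
    config=pb.TranscribeConfig(
        audio_format="pcm_s16le",
        sample_rate_hertz=16000,
        num_audio_channels=1,
        enable_global_speaker_diarization=True,
        enable_speaker_identification=True,
        cand_speaker_names=["John", "Judy"],
    ),
    audio=audio,
))
names = {s.speaker: s.name for s in response.result.speakers}
for word in response.result.words:
    label = names.get(word.speaker, f"speaker {word.speaker}")
    print(f"{label}: {word.text}")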