gRPC API Reference#

This document specifies the Soniox gRPC API in terms of RPC calls and data structures. For an introduction to the Soniox gRPC API, refer to gRPC. If you are just getting started with Soniox, it is highly recommended to go through the How-to Guides first.

The API specification consists of definitions of data structures (message), API calls (rpc), and information about their meaning and behavior.

Transcription Configuration#

message TranscriptionConfig#

TranscriptionConfig is used with all transcription API calls and encapsulates various configuration parameters.

message TranscriptionConfig {
    // Optional field to enable the client to identify the transcription
    // request in API logs.
    string client_request_reference = 19;

    // Input options
    string audio_format = 1;
    int32 sample_rate_hertz = 2;
    int32 num_audio_channels = 3;

    // Output options
    bool include_nonfinal = 4;
    bool enable_separate_recognition_per_channel = 16;

    // Customization
    SpeechContext speech_context = 5;

    // Speaker diarization
    bool enable_streaming_speaker_diarization = 8;
    bool enable_global_speaker_diarization = 9;
    int32 min_num_speakers = 10;
    int32 max_num_speakers = 11;

    // Speaker identification
    bool enable_speaker_identification = 12;
    repeated string cand_speaker_names = 13;

    // Model options
    string model = 14;

    // Storage and Search options
    StorageConfig storage_config = 1006;
}

The client_request_reference field can be used to identify the transcription request in API logs. It can be any string value up to 256 characters long. Uniqueness of this value is not verified or enforced.

The audio_format, sample_rate_hertz and num_audio_channels fields specify information about the input audio. Refer to How-to Guides / Audio Format.

include_nonfinal specifies whether to enable low-latency recognition and include non-final tokens in results. It is only valid for TranscribeStream. Refer to How-to Guides / Final vs Non-final Tokens.

enable_separate_recognition_per_channel specifies whether to perform separate speech recognition for each audio channel. When used with Transcribe, a separate Result for each channel is returned in the response. When used with TranscribeStream, results for different channels are multiplexed in the response stream, and result.channel must be used to associate the result to a specific channel. The same applies when retrieving the result of an asynchronous transcription using GetTranscribeAsyncResult.

speech_context is used to configure customization; refer to How-to Guides / Customization.

enable_streaming_speaker_diarization and enable_global_speaker_diarization are used to enable either mode of speaker diarization; refer to How-to Guides / Separate Speakers. If speaker diarization is enabled, min_num_speakers and max_num_speakers can be used to set the minimum and maximum number of speakers, where 0 means to use the default, which is 1 and 10 respectively.

enable_speaker_identification is used to enable speaker identification; refer to How-to Guides / Identify Speakers. This requires also enabling either mode of speaker diarization. When using speaker identification, the names of candidate speakers for which voice profiles were generated must be specified in cand_speaker_names (though specifying none is not an error).

The model field specifies the model to use. Refer to How-to Guides / Models and Languages for available models. It is important to specify a model, as specifying no model will result in a legacy English model being used.

storage_config enables and configures Storage and Search. Refer to How-to Guides / Storage and Search.
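
For illustration, this is how a client might populate the configuration for low-latency transcription with streaming speaker diarization. It is a minimal sketch in Python assuming message classes generated from the Soniox proto; the module name speech_service_pb2 and the model name are assumptions, so consult your generated code and How-to Guides / Models and Languages for the actual values.

import speech_service_pb2 as pb  # assumed name of the generated module

config = pb.TranscriptionConfig(
    client_request_reference="meeting-2024-06-01",  # any string up to 256 characters
    audio_format="pcm_s16le",
    sample_rate_hertz=16000,
    num_audio_channels=1,
    include_nonfinal=True,                 # low-latency mode (TranscribeStream only)
    enable_streaming_speaker_diarization=True,
    min_num_speakers=0,                    # 0 = default minimum (1)
    max_num_speakers=0,                    # 0 = default maximum (10)
    model="en_v2_lowlatency",              # illustrative model name
)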

Transcription Results#

These data structures are used to represent transcription results. For more information, refer to How-to Guides / Transcription Results.

message Result {
    repeated Word words = 1;
    int32 final_proc_time_ms = 2;
    int32 total_proc_time_ms = 3;
    repeated ResultSpeaker speakers = 6;
    int32 channel = 7;
}

message Word {
    string text = 1;
    int32 start_ms = 2;
    int32 duration_ms = 3;
    bool is_final = 4;
    int32 speaker = 5;
    double confidence = 9;
}

message ResultSpeaker {
    int32 speaker = 1;
    string name = 2;
}
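
To make the relationships between these structures concrete, the following Python sketch renders a Result as a speaker-labeled transcript. It uses only the fields defined above; joining words to speaker names through the speaker number is how a client would typically combine the two lists.

def render_result(result) -> str:
    # Map speaker number -> speaker name. Names are available when speaker
    # identification is enabled; otherwise fall back to the speaker number.
    names = {s.speaker: s.name for s in result.speakers}
    lines = []
    for word in result.words:
        label = names.get(word.speaker) or f"speaker {word.speaker}"
        lines.append(f"{word.start_ms:>8} ms  [{label}]  {word.text}")
    return "\n".join(lines)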

message TranscriptionMetadata#

TranscriptionMetadata is returned at the start or end of a transcription (depending on the API call) and contains supplementary information about the transcription.

message TranscriptionMetadata {
    string package_version = 1;
}

package_version is the version of the Soniox speech recognition package used for transcription. This is an informational field intended to help diagnose issues.

Synchronous Transcription#

These API calls are used to transcribe audio synchronously. The Transcribe call is suitable for transcription of short audio, while the TranscribeStream call is suitable for transcription of possibly long audio as well as real-time audio.

rpc Transcribe#

Transcribe transcribes the provided audio and returns the complete transcription result at once.

rpc Transcribe(TranscribeRequest) returns (TranscribeResponse) {}

message TranscribeRequest {
    string api_key = 1;
    TranscriptionConfig config = 4;
    bytes audio = 3;
}

message TranscribeResponse {
    Result result = 1;
    repeated Result channel_results = 2;
    TranscriptionMetadata metadata = 3;
}

Transcription configuration is specified using the config field in the request; refer to TranscriptionConfig.

Audio data is provided in the audio field. By default, the audio is assumed to use a container format, which is inferred. For raw PCM formats, the specific format, sample rate and number of channels must be specified. For more information about formats and related configuration parameters, refer to How-to Guides / Audio Format.

The maximum size of the audio field is 5 MB, while the maximum audio duration is 60 seconds. Exceeding these limits will result in an error and no transcription result.

The result of the transcription is returned in the result field of the response as a Result structure. However, if separate recognition per channel is enabled, a separate Result for each audio channel is returned, in channel order, in the channel_results field, and the result field is not present.

The metadata field in the response is used to return supplementary information about the transcription process; refer to TranscriptionMetadata.
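
A minimal end-to-end sketch of a Transcribe call in Python follows. The endpoint address, the generated module names, the SpeechServiceStub class name and the model name are all assumptions; substitute the values from your own Soniox client setup.

import grpc
import speech_service_pb2 as pb            # assumed generated module names
import speech_service_pb2_grpc as pb_grpc

channel = grpc.secure_channel("api.soniox.com:443",  # assumed endpoint
                              grpc.ssl_channel_credentials())
stub = pb_grpc.SpeechServiceStub(channel)  # assumed service name

with open("short.flac", "rb") as f:
    audio = f.read()  # must be at most 5 MB and at most 60 seconds

response = stub.Transcribe(pb.TranscribeRequest(
    api_key="<YOUR_API_KEY>",
    config=pb.TranscriptionConfig(model="en_v2"),  # container format is inferred
    audio=audio,
))
print("".join(word.text for word in response.result.words))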

rpc TranscribeStream#

TranscribeStream transcribes audio in streaming mode, optimized either for throughput or for low latency.

rpc TranscribeStream(stream TranscribeStreamRequest) returns (stream TranscribeStreamResponse) {}

message TranscribeStreamRequest {
    string api_key = 1;
    TranscriptionConfig config = 4;
    bytes audio = 3;
}

message TranscribeStreamResponse {
    Result result = 1;
    TranscriptionMetadata metadata = 2;
}

Transcription configuration is specified using the config field in the first request; refer to TranscriptionConfig.

The client sends a sequence of TranscribeStreamRequest requests to the service. The api_key and config are specified in the first request and must not be present in later requests. The audio is sent in chunks using the audio field, which may be empty or non-empty in any of the requests.

TranscribeStream supports the same audio formats as Transcribe; refer to How-to Guides / Audio Format. If a container format is used, data from chunks in consecutive requests is effectively concatenated (the precise locations where the client splits the audio into chunks are not important for the final result of the transcription). However, if one of the supported raw PCM formats is used, each audio chunk must contain a whole number of frames (a frame being a sequence of samples, one for each channel).
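
For raw PCM input, the whole-frames requirement is easy to satisfy by chunking on frame boundaries. Below is a sketch for pcm_s16le (2 bytes per sample); the chunk size is an arbitrary choice, and the input is assumed to contain a whole number of frames.

def iter_pcm_chunks(data: bytes, num_channels: int, frames_per_chunk: int = 1600):
    # pcm_s16le: one 2-byte sample per channel per frame.
    frame_size = 2 * num_channels
    chunk_size = frames_per_chunk * frame_size  # always a whole number of frames
    for offset in range(0, len(data), chunk_size):
        yield data[offset:offset + chunk_size]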

The maximum size of the audio field in a single request is 5 MB, while the maximum total audio duration is 5 hours. If the audio field is too large, the call will return an error. If the maximum total audio duration is exceeded, audio up to the maximum duration will be processed and then the call will return an error.

The call returns a sequence of TranscribeStreamResponse responses, with transcription results available in the result field. The result field may not be present, and it is important for the client to check for its presence before interpreting it (e.g. response.has_result() in C++, response.HasField("result") in Python). This is important for correct handling of non-final tokens (see below).

The client should not make any assumptions about the correspondence of requests to responses or the presence of result. Even if the service appears to generate responses in a specific manner, there is no guarantee that any such unspecified behavior will be maintained.

If config.include_nonfinal is false, TranscribeStream returns only final tokens and does not offer any latency guarantees. This mode is optimized for throughput and is essentially a streaming version of Transcribe, enabling transcription of longer audio. The complete sequence of tokens is obtained by joining the tokens from all results.

If config.include_nonfinal is true, TranscribeStream returns both final and non-final tokens while minimizing the recognition latency. This mode is suitable for transcription of real-time audio where transcribed tokens need to be received as soon as possible after the associated audio has been sent. For more information, refer to How-to Guides / Final vs Non-final Tokens.

If config.include_nonfinal is true, the following also applies:

  • Audio should not be sent at a rate faster than real-time. If it is, the service may throttle processing or return an error. There are margins such that this should not occur for real-time streams under normal circumstances.
  • Minimum latency is achieved when using the PCM format pcm_s16le with sample rate 16 kHz and one audio channel. Alternatively, any supported PCM format can be used with a negligible effect on latency. Using a container format in this mode is not recommended due to the latency introduced by audio decoding.

If separate recognition per channel is enabled, different audio channels are transcribed independently (as if each channel was transcribed with its own TranscribeStream call). The result.channel field indicates which channel a result is for; consulting this field is essential to correctly interpret the results. There are no guarantees about the relative order or timing of results for different channels.

metadata is present in the first response (only) and provides supplementary information about the transcription process; refer to TranscriptionMetadata.
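
Putting this together, here is a sketch of a low-latency TranscribeStream session in Python. As in the Transcribe example, the module, stub and endpoint names are assumptions, and the chunking helper is the one sketched in the audio-format discussion above.

import grpc
import speech_service_pb2 as pb            # assumed generated module names
import speech_service_pb2_grpc as pb_grpc

def requests(config, audio_chunks):
    # First request carries api_key and config only; later requests carry audio only.
    yield pb.TranscribeStreamRequest(api_key="<YOUR_API_KEY>", config=config)
    for chunk in audio_chunks:
        # For real-time audio with include_nonfinal, pace chunks at real-time rate.
        yield pb.TranscribeStreamRequest(audio=chunk)

config = pb.TranscriptionConfig(audio_format="pcm_s16le", sample_rate_hertz=16000,
                                num_audio_channels=1, include_nonfinal=True,
                                model="en_v2_lowlatency")  # illustrative model name
channel = grpc.secure_channel("api.soniox.com:443", grpc.ssl_channel_credentials())
stub = pb_grpc.SpeechServiceStub(channel)  # assumed service name

with open("audio.raw", "rb") as f:
    audio_chunks = iter_pcm_chunks(f.read(), num_channels=1)  # helper sketched above

final_words = []
for response in stub.TranscribeStream(requests(config, audio_chunks)):
    if not response.HasField("result"):    # result may be absent; always check
        continue
    for word in response.result.words:
        if word.is_final:
            final_words.append(word.text)
        # Non-final words would be displayed provisionally here.
print("".join(final_words))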

Asynchronous Transcription#

These API calls allow a user to upload files to be transcribed asynchronously and to retrieve the transcription results later. This feature supports a variety of media file formats including video files.

A file is uploaded for transcription using TranscribeAsync. The status of the transcription can be queried using GetTranscribeAsyncStatus. The result of the transcription is retrieved using GetTranscribeAsyncResult.

A file that has been uploaded (and not yet deleted) is in one of the following statuses:

  • QUEUED: The file is queued for transcription.
  • TRANSCRIBING: The file is being transcribed.
  • COMPLETED: The file has been transcribed successfully, the result is available.
  • FAILED: Transcription has failed, the result is not and will not be available.

A file that is not in the TRANSCRIBING status can be deleted using DeleteTranscribeAsyncFile. It is the responsibility of the user to delete files; they are not deleted automatically.

There are limits on the maximum number of files based on the file status. It is the responsibility of the user to prevent or handle the resulting errors:

  • The maximum number of uploaded files pending transcription (in QUEUED or TRANSCRIBING status) is 100. If this limit is reached, further TranscribeAsync calls will be rejected with gRPC status code RESOURCE_EXHAUSTED and an error message starting with <too_many_pending_files>.
  • The maximum number of uploaded, non-deleted files (in any status) is 2000. If this limit is reached, further TranscribeAsync calls will be rejected with gRPC status code RESOURCE_EXHAUSTED and an error message starting with <too_many_files>.

rpc TranscribeAsync#

TranscribeAsync is used to upload a file for asynchronous transcription.

rpc TranscribeAsync(stream TranscribeAsyncRequest) returns (TranscribeAsyncResponse) {}

message TranscribeAsyncRequest {
    string api_key = 1;
    string reference_name = 3;
    TranscriptionConfig config = 5;
    bool enable_eof = 6;
    bool eof = 7;
    bytes audio = 4;
}

message TranscribeAsyncResponse {
    string file_id = 1;
}

The client sends a sequence of TranscribeAsyncRequest requests, where the first request specifies the api_key, reference_name and config and may contain an audio chunk; any further requests contain only an audio chunk. The audio chunks are concatenated to form the complete audio file. The maximum size of an audio chunk is 5 MB, the maximum total size is 500 MB, and the maximum total audio duration is 5 hours.

Audio is extracted (decoded) from the file as a part of TranscribeAsync. For larger files, it may take a few seconds to extract the audio after all audio chunks have been sent. If there is an error extracting audio, the TranscribeAsync call will return an error with gRPC status code INVALID_ARGUMENT and error message starting with <invalid_media_file>.

The reference_name specified in the first request allows the user to identify the file after it has been uploaded. It can be any string not longer than 256 characters, including the empty string. This field does not affect transcription and there is no requirement regarding uniqueness.

In the first request, enable_eof must be set to true. End-of-file must be indicated by setting eof to true in the last request (which may be a request without audio). The service considers the upload successful only if it has received a request with eof equal to true. This mechanism ensures that only completely uploaded files are transcribed, so any interrupted upload is detected as an error.

If TranscribeAsync succeeds, the automatically assigned file_id is returned. The current status of the file can be queried by calling GetTranscribeAsyncStatus with the file_id.
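
A sketch of the upload sequence, including the end-of-file handshake, follows; stub setup is as in the earlier examples, and the 4 MB chunk size is simply a choice that stays under the 5 MB per-request limit.

import speech_service_pb2 as pb  # assumed generated module name

def upload_file(stub, path, config):
    def requests():
        # First request: api_key, reference_name, config and enable_eof (no audio).
        yield pb.TranscribeAsyncRequest(api_key="<YOUR_API_KEY>",
                                        reference_name=path, config=config,
                                        enable_eof=True)
        with open(path, "rb") as f:
            while chunk := f.read(4 * 1024 * 1024):
                yield pb.TranscribeAsyncRequest(audio=chunk)
        # Last request: eof=True marks the upload as complete.
        yield pb.TranscribeAsyncRequest(eof=True)
    return stub.TranscribeAsync(requests()).file_id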

The file will initially be in QUEUED status and will transition to TRANSCRIBING status when the transcription starts. The time that this takes depends on the current service load. When transcription has completed, the file will transition to COMPLETED status and the results can be retrieved using GetTranscribeAsyncResult. If transcription fails, the file will instead transition to FAILED status. The file will then remain in COMPLETED or FAILED status until it is deleted using DeleteTranscribeAsyncFile.

rpc GetTranscribeAsyncStatus#

GetTranscribeAsyncStatus returns the status and other information for a specific file or for all existing files.

rpc GetTranscribeAsyncStatus(GetTranscribeAsyncStatusRequest) returns (GetTranscribeAsyncStatusResponse) {}

message GetTranscribeAsyncStatusRequest {
    string api_key = 1;
    string file_id = 2;
}

message GetTranscribeAsyncStatusResponse {
    repeated TranscribeAsyncFileStatus files = 1;
}

message TranscribeAsyncFileStatus {
    string file_id = 1;
    string reference_name = 2;
    string status = 3; // one of: QUEUED, TRANSCRIBING, COMPLETED, FAILED
    google.protobuf.Timestamp created_time = 4;
    string error_message = 5;
}

If file_id in the request is non-empty, information for a file with that ID is returned. In this case, if there is no file with that ID, the call fails with gRPC status code NOT_FOUND and error message starting with <file_id_not_found>. If file_id is empty, information about all existing files is returned.

The files field in the response is a sequence of TranscribeAsyncFileStatus structures. If file_id in the request was non-empty, there will be exactly one element, otherwise there will be one element for each existing file ordered by increasing created_time.

The following information is returned for each file in the response: file_id, reference_name, status, created_time (the UTC timestamp when TranscribeAsync completed) and error_message (error information if the status is FAILED).
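
For example, a client might poll the status until the file reaches a terminal state; the following sketch assumes the stub setup from the earlier examples, and the poll interval is arbitrary.

import time
import speech_service_pb2 as pb  # assumed generated module name

def wait_until_done(stub, file_id, poll_interval_s=2.0):
    while True:
        response = stub.GetTranscribeAsyncStatus(pb.GetTranscribeAsyncStatusRequest(
            api_key="<YOUR_API_KEY>", file_id=file_id))
        file = response.files[0]  # exactly one element for a non-empty file_id
        if file.status == "COMPLETED":
            return
        if file.status == "FAILED":
            raise RuntimeError(f"transcription failed: {file.error_message}")
        time.sleep(poll_interval_s)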

rpc GetTranscribeAsyncResult#

GetTranscribeAsyncResult retrieves the transcription results for a file in the COMPLETED status.

rpc GetTranscribeAsyncResult(GetTranscribeAsyncResultRequest) returns (stream GetTranscribeAsyncResultResponse) {}

message GetTranscribeAsyncResultRequest {
    string api_key = 1;
    string file_id = 2;
}

message GetTranscribeAsyncResultResponse {
    bool separate_recognition_per_channel = 2;
    Result result = 1;
    TranscriptionMetadata metadata = 3;
}

The file for which to retrieve results is specified by file_id. If there is no file with that ID, the call fails with gRPC status code NOT_FOUND and error message starting with <file_id_not_found>.

If the file is still in the QUEUED or TRANSCRIBING status, the call fails with gRPC status code FAILED_PRECONDITION and error message starting with <file_not_transcribed_yet>. If the file is in the FAILED status, the call fails with gRPC status code FAILED_PRECONDITION and error message starting with <file_transcription_failed>.

The transcription results are returned as a sequence of Result structures embedded in a sequence of GetTranscribeAsyncResultResponse responses, similar to TranscribeStream. The user can assemble the complete result by concatenating the tokens (words) from all responses, which are guaranteed to be final, and taking the fields final_proc_time_ms, total_proc_time_ms and speakers from the last response.

If separate recognition per channel was enabled, which is indicated by the separate_recognition_per_channel field (having the same value in all responses), the assembly of results described above must be done on a per-channel basis according to result.channel in each response.
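
The per-channel assembly might look as follows (a sketch; stub setup as in the earlier examples). When separate recognition per channel was not enabled, result.channel is simply left at its default (0) and the same code applies.

from collections import defaultdict
import speech_service_pb2 as pb  # assumed generated module name

def fetch_results(stub, file_id):
    words = defaultdict(list)  # channel -> all words (every token is final)
    last_result = {}           # channel -> last Result (proc times, speakers)
    for response in stub.GetTranscribeAsyncResult(pb.GetTranscribeAsyncResultRequest(
            api_key="<YOUR_API_KEY>", file_id=file_id)):
        result = response.result
        words[result.channel].extend(result.words)
        last_result[result.channel] = result
    return words, last_result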

rpc DeleteTranscribeAsyncFile#

DeleteTranscribeAsyncFile deletes a specific file.

rpc DeleteTranscribeAsyncFile(DeleteTranscribeAsyncFileRequest) returns (DeleteTranscribeAsyncFileResponse) {}

message DeleteTranscribeAsyncFileRequest {
    string api_key = 1;
    string file_id = 2;
}

message DeleteTranscribeAsyncFileResponse {
}

The file to delete is specified by file_id. If there is no file with that ID, the call fails with gRPC status code NOT_FOUND and error message starting with <file_id_not_found>.

A file can be deleted as long as it is not in the TRANSCRIBING status. If it is, the call fails with gRPC status code FAILED_PRECONDITION and error message starting with <file_being_transcribed>.
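
A deletion sketch that tolerates the file still being transcribed (stub setup as in the earlier examples):

import grpc
import speech_service_pb2 as pb  # assumed generated module name

def try_delete(stub, file_id) -> bool:
    try:
        stub.DeleteTranscribeAsyncFile(pb.DeleteTranscribeAsyncFileRequest(
            api_key="<YOUR_API_KEY>", file_id=file_id))
        return True
    except grpc.RpcError as error:
        if error.code() == grpc.StatusCode.FAILED_PRECONDITION:
            return False  # <file_being_transcribed>: retry once transcription ends
        raise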

Speaker Management#

This section defines the API for managing speakers (voice profiles) for Speaker Identification. Refer to How-to Guides / Identify Speakers.

Each speaker is identified by a speaker name, and each of the speaker’s audios is identified by an audio name. Speaker names are unique in the context of the Soniox project, while audio names are unique in the context of the speaker. Speaker and audio names can be up to 100 characters long.

For an example of using these API calls, check the manage_speakers command-line application in the Soniox Python library.

rpc AddSpeaker#

AddSpeaker adds a new speaker with the specified name. The speaker will not have any audios, those need to be added separately using AddSpeakerAudio.

rpc AddSpeaker(AddSpeakerRequest) returns (AddSpeakerResponse) {}

message AddSpeakerRequest {
    string api_key = 1;
    string name = 2;
}

message AddSpeakerResponse {
    string name = 1;
    google.protobuf.Timestamp created = 2;
}

If a speaker with the specified name already exists, the call will fail with gRPC status code ALREADY_EXISTS.

rpc GetSpeaker#

GetSpeaker returns information about the speaker with the specified name, including audios for this speaker.

rpc GetSpeaker(GetSpeakerRequest) returns (GetSpeakerResponse) {}

message GetSpeakerRequest {
    string api_key = 1;
    string name = 2;
}

message GetSpeakerResponse {
    string name = 1;
    google.protobuf.Timestamp created = 2;
    repeated GetSpeakerResponseAudio audios = 3;
}

message GetSpeakerResponseAudio {
    string audio_name = 1;
    google.protobuf.Timestamp created = 2;
    int32 duration_ms = 3;
}

If a speaker with the specified name does not exist, the call will fail with gRPC status code NOT_FOUND.

rpc RemoveSpeaker#

RemoveSpeaker removes the speaker with the specified name, including audios for this speaker.

rpc RemoveSpeaker(RemoveSpeakerRequest) returns (RemoveSpeakerResponse) {}

message RemoveSpeakerRequest {
    string api_key = 1;
    string name = 2;
}

message RemoveSpeakerResponse {
}

If a speaker with the specified name does not exist, the call will fail with gRPC status code NOT_FOUND.

rpc ListSpeakers#

ListSpeakers returns the list of registered speakers, including audios for each speaker.

rpc ListSpeakers(ListSpeakersRequest) returns (ListSpeakersResponse) {}

message ListSpeakersRequest {
    string api_key = 1;
}

message ListSpeakersResponse {
    repeated ListSpeakersResponseSpeaker speakers = 1;
}

message ListSpeakersResponseSpeaker {
    string name = 1;
    google.protobuf.Timestamp created = 2;
    int32 num_audios = 3;
}

rpc AddSpeakerAudio#

AddSpeakerAudio adds a new audio with the specified name for the speaker with the specified name.

rpc AddSpeakerAudio(AddSpeakerAudioRequest) returns (AddSpeakerAudioResponse) {}

message AddSpeakerAudioRequest {
    string api_key = 1;
    string speaker_name = 2;
    string audio_name = 3;
    bytes audio = 4;
}

message AddSpeakerAudioResponse {
    string speaker_name = 1;
    string audio_name = 2;
    google.protobuf.Timestamp created = 3;
    int32 duration_ms = 4;
}

If a speaker with the specified name does not exist, the call will fail with gRPC status code NOT_FOUND. If an audio with the specified name already exists for the speaker, the call will fail with gRPC status code ALREADY_EXISTS.
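
A typical registration flow adds the speaker and then one or more audios. The sketch below assumes the module, stub and endpoint names from the earlier examples, and the speaker and file names are merely examples; the speaker name can then be listed in cand_speaker_names of TranscriptionConfig when speaker identification is enabled.

import grpc
import speech_service_pb2 as pb            # assumed generated module names
import speech_service_pb2_grpc as pb_grpc

channel = grpc.secure_channel("api.soniox.com:443", grpc.ssl_channel_credentials())
stub = pb_grpc.SpeechServiceStub(channel)  # assumed service name

stub.AddSpeaker(pb.AddSpeakerRequest(api_key="<YOUR_API_KEY>", name="john"))
with open("john_sample.wav", "rb") as f:
    stub.AddSpeakerAudio(pb.AddSpeakerAudioRequest(
        api_key="<YOUR_API_KEY>", speaker_name="john",
        audio_name="sample1", audio=f.read()))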

rpc GetSpeakerAudio#

GetSpeakerAudio retrieves the audio with the specified name for the speaker with the specified name.

rpc GetSpeakerAudio(GetSpeakerAudioRequest) returns (GetSpeakerAudioResponse) {}

message GetSpeakerAudioRequest {
    string api_key = 1;
    string speaker_name = 2;
    string audio_name = 3;
}

message GetSpeakerAudioResponse {
    string speaker_name = 1;
    string audio_name = 2;
    google.protobuf.Timestamp created = 3;
    int32 duration_ms = 4;
    bytes audio = 5;
}

If a speaker with the specified name does not exist, or an audio with the specified name does not exist for the speaker, the call will fail with gRPC status code NOT_FOUND.

rpc RemoveSpeakerAudio#

RemoveSpeakerAudio removes the audio with the specified name for the speaker with the specified name.

rpc RemoveSpeakerAudio(RemoveSpeakerAudioRequest) returns (RemoveSpeakerAudioResponse) {}

message RemoveSpeakerAudioRequest {
    string api_key = 1;
    string speaker_name = 2;
    string audio_name = 3;
}

message RemoveSpeakerAudioResponse {
}

If a speaker with the specified name does not exist, or an audio with the specified name does not exist for the speaker, the call will fail with gRPC status code NOT_FOUND.

Temporary API Keys#

Temporary API keys allow authorizing a specific API call without revealing the regular Soniox API key to the client performing the call.

rpc CreateTemporaryApiKey#

CreateTemporaryApiKey creates a new temporary API key.

rpc CreateTemporaryApiKey(CreateTemporaryApiKeyRequest) returns (CreateTemporaryApiKeyResponse) {}

message CreateTemporaryApiKeyRequest {
    string api_key = 1;
    string usage_type = 2;
    int32 expires_in_s = 4;
    string client_request_reference = 3;
}

message CreateTemporaryApiKeyResponse {
    string key = 1;
    google.protobuf.Timestamp expires_datetime = 2;
}

The intended usage of the temporary API key must be specified in the usage_type field. Currently there is only one option:

  • transcribe_websocket: Create a temporary API key for transcription using the WebSocket API.

The expiration time of the temporary API key must be specified in the expires_in_s field. The maximum expiration time is 3600 seconds.

The optional client_request_reference field can be used to identify the future API call (that uses the returned temporary API key) in API logs. Refer to the same field in TranscriptionConfig.

The created temporary API key is returned in the key field, and its actual expiration datetime is indicated in the expires_datetime field.
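
For example, a server holding the regular API key might mint a short-lived key for a browser client (a sketch; module, stub and endpoint names are assumptions as in the earlier examples):

import grpc
import speech_service_pb2 as pb            # assumed generated module names
import speech_service_pb2_grpc as pb_grpc

channel = grpc.secure_channel("api.soniox.com:443", grpc.ssl_channel_credentials())
stub = pb_grpc.SpeechServiceStub(channel)  # assumed service name

response = stub.CreateTemporaryApiKey(pb.CreateTemporaryApiKeyRequest(
    api_key="<YOUR_API_KEY>",          # the regular key never leaves the server
    usage_type="transcribe_websocket",
    expires_in_s=600,                  # must be at most 3600
))
temporary_key = response.key           # hand this to the client
expires = response.expires_datetime    # actual expiration time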

When using a temporary API key for an API call, the following error codes may be returned (the error code will appear at the beginning of the error message):

  • <invalid_temp_api_key>: The temporary API key is not valid or has expired.
  • <invalid_temp_api_key_usage>: The temporary API key does not allow performing the requested action.