gRPC API reference
This document specifies the Soniox gRPC API in terms of RPC calls and data structures. For an introduction to the Soniox gRPC API, refer to the gRPC API page. If you are just getting started with Soniox, it is highly recommended to go through the How-to guides first.
The API specification consists of definitions of data structures (message), API calls (rpc) and information about their meaning and behavior.
Transcription Configuration
message TranscriptionConfig
TranscriptionConfig is used with all transcription API calls and encapsulates various configuration parameters.

The client_request_reference field can be used to identify the transcription request in API logs. It can be any string value up to 256 characters long. Uniqueness of this value is not verified or enforced.
The audio_format, sample_rate_hertz and num_audio_channels fields specify information about the input audio. Refer to How-to Guides / Audio Format.
include_nonfinal specifies whether to enable low-latency recognition and include non-final tokens in results. It is only valid for TranscribeStream. Refer to How-to Guides / Speech Context.
enable_separate_recognition_per_channel specifies whether to perform separate speech recognition for each audio channel. When used with Transcribe, a separate Result for each channel is returned in the response. When used with TranscribeStream, results for different channels are multiplexed in the response stream, and result.channel must be used to associate each result with a specific channel. The same applies when retrieving the result of an asynchronous transcription using GetTranscribeAsyncResult.
speech_context is used to configure customization; refer to How-to Guides / Customization.
enable_streaming_speaker_diarization and enable_global_speaker_diarization are used to enable either mode of speaker diarization; refer to How-to Guides / Separate Speakers.
If speaker diarization is enabled, min_num_speakers and max_num_speakers can be used to set the minimum and maximum number of speakers, where 0 means use the default (1 and 10, respectively).
enable_speaker_identification is used to enable speaker identification; refer to How-to Guides / Identify Speakers. It also requires enabling either mode of speaker diarization. When using speaker identification, the names of candidate speakers for which voice profiles were generated must be specified in cand_speaker_names (though specifying none is not an error).
The model field specifies the model to use; refer to How-to Guides / Models and Languages for available models. It is important to specify a model, as specifying no model will result in a legacy English model being used.
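The fields above can be mirrored in a small client-side structure, which is a convenient place to enforce the documented constraints before a request is sent. This is an illustrative sketch, not the actual generated protobuf class; field names follow the documentation, and the defaults shown are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class TranscriptionConfigSketch:
    """Client-side mirror of the documented TranscriptionConfig fields."""
    client_request_reference: str = ""
    audio_format: str = ""
    sample_rate_hertz: int = 0
    num_audio_channels: int = 0
    include_nonfinal: bool = False
    enable_separate_recognition_per_channel: bool = False
    enable_streaming_speaker_diarization: bool = False
    enable_global_speaker_diarization: bool = False
    min_num_speakers: int = 0  # 0 = service default (1)
    max_num_speakers: int = 0  # 0 = service default (10)
    enable_speaker_identification: bool = False
    cand_speaker_names: list = field(default_factory=list)
    model: str = ""

    def validate(self) -> None:
        # client_request_reference may be any string up to 256 characters.
        if len(self.client_request_reference) > 256:
            raise ValueError("client_request_reference longer than 256 characters")
        # Speaker identification requires one of the diarization modes.
        if self.enable_speaker_identification and not (
            self.enable_streaming_speaker_diarization
            or self.enable_global_speaker_diarization
        ):
            raise ValueError("speaker identification requires diarization")
```

A real client would copy these values into the generated TranscriptionConfig message after validation.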
Transcription Results
These data structures are used to represent transcription results. For more information, refer to How-to Guides / Transcription Results.
message TranscriptionMetadata
TranscriptionMetadata is returned at the start or end of a transcription (depending on the API call) and contains supplementary information about the transcription.

package_version is the version of the Soniox Speech Recognition package used for transcription. This is an informational field intended to help diagnose issues.
Synchronous Transcription
These API calls are used to transcribe audio synchronously. The Transcribe call is suitable for transcription of short audio, while the TranscribeStream call is suitable for transcription of possibly long audio as well as real-time audio.
rpc Transcribe
Transcribe transcribes the provided audio and returns the complete transcription result at once.

Transcription configuration is specified using the config field in the request; refer to TranscriptionConfig.
Audio data is provided in the audio field. By default, the audio is assumed to use a container format, which is inferred. For raw PCM formats, the specific format, sample rate and number of channels must be specified. For more information about formats and related configuration parameters, refer to How-to Guides / Audio Format.
The maximum size of the audio field is 5 MB, while the maximum audio duration is 60 seconds. Exceeding these limits will result in an error and no transcription result.
The result of the transcription is returned in the result field of the response as a Result structure. However, if separate recognition per channel is enabled, results for consecutive audio channels are returned in the channel_results field, and the result field is not present.
The metadata field in the response is used to return supplementary information about the transcription process; refer to TranscriptionMetadata.
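Because exceeding the size or duration limits yields an error and no result, it is worth rejecting oversized audio on the client before calling Transcribe. A minimal sketch, assuming the 5 MB limit means 5 × 1024 × 1024 bytes (the exact byte count is an assumption):

```python
MAX_AUDIO_BYTES = 5 * 1024 * 1024   # documented 5 MB limit (exact byte count assumed)
MAX_AUDIO_SECONDS = 60              # documented 60 second limit

def check_transcribe_limits(audio: bytes, duration_seconds: float) -> None:
    """Raise ValueError before sending a Transcribe request that would be rejected."""
    if len(audio) > MAX_AUDIO_BYTES:
        raise ValueError(f"audio field is {len(audio)} bytes, limit is {MAX_AUDIO_BYTES}")
    if duration_seconds > MAX_AUDIO_SECONDS:
        raise ValueError(f"audio is {duration_seconds:.1f} s, limit is {MAX_AUDIO_SECONDS} s")
```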
rpc TranscribeStream
TranscribeStream transcribes audio in streaming mode, optimized either for throughput or for low latency.
Transcription configuration is specified using the config field in the first request; refer to TranscriptionConfig.
Audio data is provided in the audio field. By default, the audio is assumed to use a container format, which is inferred. For raw PCM formats, the specific format, sample rate and number of channels must be specified. For more information about formats and related configuration parameters, refer to How-to Guides / Audio Format.
The maximum size of the audio field is 5 MB, while the maximum audio duration is 60 seconds. Exceeding these limits will result in an error and no transcription result.
The result of the transcription is returned in the result field of the response as a Result structure. However, if separate recognition per channel is enabled, results for consecutive audio channels are returned in the channel_results field, and the result field is not present.
The metadata field in the first response (only) is used to return supplementary information about the transcription process; refer to TranscriptionMetadata.
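The convention that the config rides in the first request and subsequent requests carry only audio can be expressed as a request generator. A sketch, with plain dicts standing in for the protobuf TranscribeStreamRequest messages:

```python
def transcribe_stream_requests(config: dict, audio_chunks):
    """Yield TranscribeStream-style requests: the first carries the config,
    the rest carry only audio. Dicts stand in for protobuf messages."""
    first = True
    for chunk in audio_chunks:
        if first:
            yield {"config": config, "audio": chunk}
            first = False
        else:
            yield {"audio": chunk}
    if first:
        # No audio at all: still send the config so the call is well-formed.
        yield {"config": config, "audio": b""}
```

With a generated gRPC stub, the generator would be passed directly to the streaming call as the request iterator.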
Asynchronous Transcription
These API calls allow a user to upload files to be transcribed asynchronously and to retrieve the transcription results later. This feature supports a variety of media file formats including video files.
A file is uploaded for transcription using TranscribeAsync. The status of the transcription can be queried using GetTranscribeAsyncStatus. The result of the transcription is retrieved using GetTranscribeAsyncResult.
A file that has been uploaded (and not yet deleted) is in one of the following statuses:

- QUEUED: The file is queued for transcription.
- TRANSCRIBING: The file is being transcribed.
- COMPLETED: The file has been transcribed successfully; the result is available.
- FAILED: Transcription has failed; the result is not and will not be available.
A file that is not in the TRANSCRIBING status can be deleted using DeleteTranscribeAsyncFile. It is the responsibility of the user to delete files; they are not deleted automatically.
There are limits on the maximum number of files based on the file status:

- The maximum number of uploaded files pending transcription (in QUEUED or TRANSCRIBING status) is 100. If this limit is reached, further TranscribeAsync calls will be rejected with gRPC status code RESOURCE_EXHAUSTED and an error message starting with <too_many_pending_files>.
- The maximum number of uploaded non-deleted files (in any status) is 2000. If this limit is reached, further TranscribeAsync calls will be rejected with gRPC status code RESOURCE_EXHAUSTED and an error message starting with <too_many_files>.

It is the responsibility of the user to prevent or handle these errors.
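Error messages throughout this API start with a machine-readable code in angle brackets (for example <too_many_pending_files>). A client can branch on that code rather than matching the full message text. A small parsing sketch:

```python
def parse_error_code(message: str):
    """Extract the leading machine-readable code from a Soniox error message,
    e.g. '<too_many_pending_files> ...' -> 'too_many_pending_files'.
    Returns None if the message does not start with such a code."""
    if message.startswith("<"):
        end = message.find(">")
        if end > 1:
            return message[1:end]
    return None
```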
rpc TranscribeAsync
TranscribeAsync is used to upload a file for asynchronous transcription.

The client sends a sequence of TranscribeAsyncRequest requests, where the first request specifies the api_key, reference_name and config and may contain an audio chunk, and any further requests contain only an audio chunk. The audio chunks are concatenated to form the complete audio file. The maximum size of an audio chunk is 5 MB. The maximum total size is 500 MB. The maximum total duration of audio is 5 hours.
Audio is extracted (decoded) from the file as part of TranscribeAsync. For larger files, it may take a few seconds to extract the audio after all audio chunks have been sent. If there is an error extracting audio, the TranscribeAsync call will return an error with gRPC status code INVALID_ARGUMENT and an error message starting with <invalid_media_file>.
The reference_name specified in the first request allows the user to identify the file after it has been uploaded. It can be any string not longer than 256 characters, including the empty string. This field does not affect transcription and there is no requirement regarding uniqueness.
In the first request, enable_eof must be set to true. End-of-file must be indicated by setting eof to true in the last request (which may be a request without audio). The service will only consider the upload to be successful if it has received a request with eof equal to true. This mechanism ensures that only completely uploaded files will be transcribed, so that any interruption is detected as an error.
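The chunked upload with the eof convention can be sketched as follows. Plain dicts stand in for the protobuf TranscribeAsyncRequest messages, and the assumption that 5 MB means 5 × 1024 × 1024 bytes is noted in the code:

```python
CHUNK_BYTES = 5 * 1024 * 1024  # documented per-chunk maximum (exact byte count assumed)

def transcribe_async_requests(api_key: str, reference_name: str, config: dict,
                              data: bytes, chunk_bytes: int = CHUNK_BYTES):
    """Build the TranscribeAsync request sequence: the first request carries
    api_key, reference_name, config and enable_eof=True; the last request
    sets eof=True. Dicts stand in for protobuf messages."""
    chunks = [data[i:i + chunk_bytes] for i in range(0, len(data), chunk_bytes)] or [b""]
    requests = []
    for i, chunk in enumerate(chunks):
        req = {"audio": chunk}
        if i == 0:
            req.update(api_key=api_key, reference_name=reference_name,
                       config=config, enable_eof=True)
        if i == len(chunks) - 1:
            req["eof"] = True
        requests.append(req)
    return requests
```

Building the sequence eagerly is fine for files up to the 500 MB limit; a generator variant would avoid holding all chunks in memory at once.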
If TranscribeAsync succeeds, the automatically assigned file_id is returned. The current status of the file can be queried by calling GetTranscribeAsyncStatus with the file_id.
The file will initially be in QUEUED status and will transition to TRANSCRIBING status when the transcription starts. The time that this takes depends on the current service load. When transcription has completed, the file will transition to COMPLETED status and the results can be retrieved using GetTranscribeAsyncResult.
If transcription fails, the file will instead transition to FAILED status. The file will then remain in COMPLETED or FAILED status until it is deleted using DeleteTranscribeAsyncFile.
rpc GetTranscribeAsyncStatus
GetTranscribeAsyncStatus returns the status and other information for a specific file or for all existing files.
If file_id in the request is non-empty, information for the file with that ID is returned. In this case, if there is no file with that ID, the call fails with gRPC status code NOT_FOUND and an error message starting with <file_id_not_found>. If file_id is empty, information about all existing files is returned.
The files field in the response is a sequence of TranscribeAsyncFileStatus structures. If file_id in the request was non-empty, there will be exactly one element; otherwise there will be one element for each existing file, ordered by increasing created_time.
The following information is returned for each file in the response: file_id, reference_name, status, created_time (UTC timestamp when TranscribeAsync completed) and error_message (error information if the status is FAILED).
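A typical client polls this call until the file reaches a terminal status. A sketch in which `get_status` is any callable mapping a file_id to its current status string (in real code it would wrap the GetTranscribeAsyncStatus gRPC call):

```python
import time

def wait_for_transcription(get_status, file_id: str, poll_seconds: float = 5.0) -> str:
    """Poll until the file leaves QUEUED/TRANSCRIBING, then return the
    terminal status (COMPLETED or FAILED)."""
    while True:
        status = get_status(file_id)
        if status in ("COMPLETED", "FAILED"):
            return status
        time.sleep(poll_seconds)
```

The polling interval is a client-side choice; a few seconds is usually enough given that queue time depends on service load.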
rpc GetTranscribeAsyncResult
GetTranscribeAsyncResult retrieves the transcription results for a file in the COMPLETED status.
The file for which to retrieve results is specified by file_id. If there is no file with that ID, the call fails with gRPC status code NOT_FOUND and an error message starting with <file_id_not_found>.
If the file is still in the QUEUED or TRANSCRIBING status, the call fails with gRPC status code FAILED_PRECONDITION and an error message starting with <file_not_transcribed_yet>. If the file is in the FAILED status, the call fails with gRPC status code FAILED_PRECONDITION and an error message starting with <file_transcription_failed>.
The transcription results are returned as a sequence of Result structures embedded in a sequence of GetTranscribeAsyncResultResponse responses, similar to TranscribeStream. The user can assemble the complete result by concatenating the tokens (words) from all responses, which are guaranteed to be final, and taking the fields final_proc_time_ms, total_proc_time_ms and speakers from the last response.
If separate recognition per channel was enabled, as indicated by the field separate_recognition_per_channel in each response (having the same value in all responses), the assembly of results described above must be done on a per-channel basis according to result.channel in each response.
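The single-channel assembly rule can be sketched directly. Responses are modeled as dicts with a "result" entry mirroring the Result structure; the real code would read the same fields from the protobuf responses:

```python
def assemble_async_result(responses):
    """Combine GetTranscribeAsyncResult responses: concatenate the tokens
    (words) from every response, and take final_proc_time_ms,
    total_proc_time_ms and speakers from the last response."""
    words = []
    for resp in responses:
        words.extend(resp["result"]["words"])
    last = responses[-1]["result"]
    return {
        "words": words,
        "final_proc_time_ms": last["final_proc_time_ms"],
        "total_proc_time_ms": last["total_proc_time_ms"],
        "speakers": last["speakers"],
    }
```

For the per-channel case, the same function would be applied separately to the responses grouped by result.channel.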
rpc DeleteTranscribeAsyncFile
DeleteTranscribeAsyncFile deletes a specific file.
The file to delete is specified by file_id. If there is no file with that ID, the call fails with gRPC status code NOT_FOUND and an error message starting with <file_id_not_found>.
A file can be deleted as long as it is not in the TRANSCRIBING status. If it is, the call fails with gRPC status code FAILED_PRECONDITION and an error message starting with <file_being_transcribed>.
Speaker Management
This section defines the API for managing speakers (voice profiles) for Speaker Identification. Refer to How-to Guides / Identify Speakers.
Each speaker is identified by a speaker name, and each of the speaker's audios is identified by an audio name. Speaker names are unique in the context of the Soniox project, while audio names are unique in the context of the speaker. Speaker and audio names can be up to 100 characters long.
For an example of using these API calls, check the manage_speakers command-line application in the Soniox Python library.
rpc AddSpeaker
AddSpeaker adds a new speaker with the specified name. The speaker will not have any audios; those need to be added separately using AddSpeakerAudio.

If a speaker with the specified name already exists, the call will fail with gRPC status code ALREADY_EXISTS.
rpc GetSpeaker
GetSpeaker returns information about the speaker with the specified name, including audios for this speaker.

If a speaker with the specified name does not exist, the call will fail with gRPC status code NOT_FOUND.
rpc RemoveSpeaker
RemoveSpeaker removes the speaker with the specified name, including audios for this speaker.

If a speaker with the specified name does not exist, the call will fail with gRPC status code NOT_FOUND.
rpc ListSpeakers
ListSpeakers returns the list of registered speakers, including audios for each speaker.
rpc AddSpeakerAudio
AddSpeakerAudio adds a new audio with the specified name for the speaker with the specified name.

If a speaker with the specified name does not exist, the call will fail with gRPC status code NOT_FOUND. If an audio with the specified name already exists for the speaker, the call will fail with gRPC status code ALREADY_EXISTS.
rpc GetSpeakerAudio
GetSpeakerAudio retrieves the audio with the specified name for the speaker with the specified name.

If a speaker with the specified name does not exist, or an audio with the specified name does not exist for the speaker, the call will fail with gRPC status code NOT_FOUND.
rpc RemoveSpeakerAudio
RemoveSpeakerAudio removes the audio with the specified name for the speaker with the specified name.

If a speaker with the specified name does not exist, or an audio with the specified name does not exist for the speaker, the call will fail with gRPC status code NOT_FOUND.
Temporary API Keys
Temporary API keys allow authorizing a specific API call without revealing the regular Soniox API key to the client performing the call.
rpc CreateTemporaryApiKey
CreateTemporaryApiKey creates a new temporary API key.
The intended usage of the temporary API key must be specified in the usage_type field. Currently there is only one option:

- transcribe_websocket: Create a temporary API key for transcription using the WebSocket API.
The expiration time of the temporary API key must be specified in the expires_in_s field. The maximum expiration time is 3600 seconds.
The optional client_request_reference field can be used to identify the future API call (that uses the returned temporary API key) in API logs. Refer to the same field in TranscriptionConfig.
The created temporary API key is returned in the key field, and its actual expiration datetime is indicated in the expires_datetime field.
When using a temporary API key for an API call, the following error codes may be returned (the error code will appear at the beginning of the error message):
<invalid_temp_api_key>
: The temporary API key is not valid or has expired.<invalid_temp_api_key_usage>
: The temporary API key does not allow performing the requested action.