gRPC API Reference
Basic Transcription
These requests are used to transcribe audio synchronously. The
Transcribe
request is suitable for transcription of short audio,
while the TranscribeStream
request is suitable for
transcription of possibly long audio as well as real-time audio.
rpc Transcribe
rpc Transcribe(TranscribeRequest) returns (TranscribeResponse) {}
message TranscribeRequest {
string api_key = 1;
TranscriptionConfig config = 4;
bytes audio = 3;
}
message TranscribeResponse {
Result result = 1;
repeated Result channel_results = 2;
}
The Transcribe
request transcribes the provided audio and returns the complete
transcription at once.
Transcription configuration is specified using the config
field in the request;
refer to TranscriptionConfig
.
Audio data is provided in the audio
field. By default, the audio is assumed to use
a container format, which is inferred. For raw PCM formats, the specific format, sample
rate and number of channels must be specified. For more information about formats and
related configuration parameters, refer to
Audio Format.
The maximum size of the audio
field is 5 MB, while the maximum audio duration is 60
seconds. Exceeding these limits will result in an error and no transcription.
The result of the transcription is returned in the result
field of the response as a
Result
message. However, if separate recognition per channel is
enabled, results for consecutive audio channels are returned in the channel_results
field, and the result
field is not present.
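For illustration, here is a minimal Python sketch of a synchronous Transcribe call. The module names (speech_service_pb2, speech_service_pb2_grpc), the stub class name (SpeechServiceStub), the endpoint and the file name are assumptions made for the example and are not defined by this reference; adjust them to your setup.

import grpc
import speech_service_pb2 as ss
import speech_service_pb2_grpc as ss_grpc

# Assumed endpoint and generated module/stub names.
channel = grpc.secure_channel("api.soniox.com:443", grpc.ssl_channel_credentials())
stub = ss_grpc.SpeechServiceStub(channel)

with open("short_audio.flac", "rb") as f:
    audio = f.read()  # at most 5 MB / 60 seconds of audio

response = stub.Transcribe(ss.TranscribeRequest(
    api_key="<YOUR_API_KEY>",
    config=ss.TranscriptionConfig(),  # container format is inferred by default
    audio=audio,
))
print([w.text for w in response.result.words])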
rpc TranscribeStream
rpc TranscribeStream(stream TranscribeStreamRequest) returns (stream TranscribeStreamResponse) {}
message TranscribeStreamRequest {
string api_key = 1;
TranscriptionConfig config = 4;
bytes audio = 3;
}
message TranscribeStreamResponse {
Result result = 1;
}
The TranscribeStream
request transcribes audio in streaming mode, either optimized for
throughput or low-latency.
Transcription configuration is specified using the config
field in the first request;
refer to TranscriptionConfig
.
The client sends a sequence of TranscribeStreamRequest
requests to the service. The
api_key
and config
are specified in the first request and must not be present in
later requests. The audio is provided in chunks using the audio
field, which may be
empty or non-empty in any of the requests.
TranscribeStream
supports the same audio formats as Transcribe
;
refer to Audio Format.
If a container format is used, data from chunks in consecutive requests is
effectively concatenated (the precise locations where the client splits the audio
into chunks are not important). However, if one of the supported raw PCM formats is used,
then each audio chunk must contain a whole number of frames (a frame is a sequence of
samples for each channel).
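As a sketch of the frame-alignment requirement, the helper below splits a raw pcm_s16le buffer into chunks that each contain a whole number of frames; the target chunk size is an arbitrary value chosen for the example.

# pcm_s16le: 2 bytes per sample, one sample per channel per frame.
def pcm_s16le_chunks(data: bytes, num_channels: int, target_chunk_bytes: int = 65536):
    frame_size = 2 * num_channels
    # Round the chunk size down to a whole number of frames.
    chunk_size = (target_chunk_bytes // frame_size) * frame_size
    for offset in range(0, len(data), chunk_size):
        yield data[offset:offset + chunk_size]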
The maximum size of the audio
field in a single request is 5 MB, while the maximum
total audio duration is 5 hours. If the audio
field is too large, an error will be
returned immediately. If the maximum total audio duration is exceeded, audio up to the
maximum duration will be processed and then an error will be returned.
The service returns a sequence of TranscribeStreamResponse
messages, with transcription
results available in the result
field. The result
field may not be present, and it is
important for the client to check for its presence before interpreting it (e.g.
response.has_result()
in C++, response.HasField("result")
in Python). This is
important for correct handling of non-final words (see below).
The client should not make any assumptions about the correspondence of requests to
responses or the presence of result
. Even if it appears that the service generates
responses in a specific manner, there are no guarantees that any such unspecified
behavior would be maintained.
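To make the request/response flow concrete, below is a Python sketch of a TranscribeStream client. As in the Transcribe example above, the module, stub and endpoint names are assumptions, and the audio file and chunk size are placeholders.

import grpc
import speech_service_pb2 as ss
import speech_service_pb2_grpc as ss_grpc

def requests(api_key, config, audio_path, chunk_size=65536):
    # The first request carries api_key and config (and may carry audio).
    yield ss.TranscribeStreamRequest(api_key=api_key, config=config)
    with open(audio_path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # Later requests carry only audio.
            yield ss.TranscribeStreamRequest(audio=chunk)

channel = grpc.secure_channel("api.soniox.com:443", grpc.ssl_channel_credentials())
stub = ss_grpc.SpeechServiceStub(channel)
config = ss.TranscriptionConfig(include_nonfinal=False)
for response in stub.TranscribeStream(requests("<YOUR_API_KEY>", config, "long_audio.mp3")):
    # result may be absent; always check for its presence first.
    if response.HasField("result"):
        for word in response.result.words:
            print(word.text, word.is_final)

With include_nonfinal set to true and a raw PCM format, the same loop also works for real-time audio, provided audio is not sent faster than real-time.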
If config.include_nonfinal
is false, TranscribeStream
returns only final words
and does not offer any latency guarantees. This mode is optimized for throughput and is
essentially a streaming version of Transcribe
, enabling transcription of longer audio.
The complete sequence of words is obtained by joining the words from all results.
If config.include_nonfinal
is true, TranscribeStream
returns both final and
non-final words while minimizing the recognition latency. This mode is suitable for
transcription of real-time audio where transcribed words need to be received as
soon as possible after the associated audio has been sent to the service. The current
transcription from the start of the audio can be constructed by joining (in order) final
words from results before the last result and all words from the last result. For more
information, refer to
Final vs Non-final Words.
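The assembly rule can be sketched as follows: final words accumulated from earlier results are permanent, and all words from the latest result are appended to them (the names here are illustrative, not part of the API).

final_texts = []  # texts of final words accumulated so far

def current_transcript(result):
    # Final words from this result become permanent.
    final_texts.extend(w.text for w in result.words if w.is_final)
    # Non-final words from the latest result are provisional and may change later.
    nonfinal_texts = [w.text for w in result.words if not w.is_final]
    return final_texts + nonfinal_texts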
If config.include_nonfinal
is true, the following also applies:
- Audio should not be sent at a rate faster than real-time. If it is, the service may throttle processing or return an error. There are margins such that this should not occur for real-time streams under normal circumstances.
- Minimum latency is achieved when using the PCM format
pcm_s16le
with sample rate 16 kHz and one audio channel. Alternatively, any supported PCM format can be used with a negligible effect on latency. Using a container format in this mode is not recommended due to the latency introduced by audio decoding.
If separate recognition per channel is enabled, different audio channels are transcribed
independently (as if each channel was transcribed with its own TranscribeStream
request). The result.channel
field indicates which channel a result is for; consulting
this field is essential to correctly interpret the results. There are no guarantees about
the relative order or timing of results for different channels.
Transcription Configuration
message TranscriptionConfig
message TranscriptionConfig {
// Optional field to enable the client to identify this request
// in API logs.
string client_request_reference = 19;
// Input options
string audio_format = 1;
int32 sample_rate_hertz = 2;
int32 num_audio_channels = 3;
// Output options
bool include_nonfinal = 4;
bool enable_separate_recognition_per_channel = 16;
bool enable_endpoint_detection = 18;
// Speech adaptation
SpeechContext speech_context = 5;
// Content moderation
bool enable_profanity_filter = 6;
repeated string content_moderation_phrases = 7;
// Speaker diarization
bool enable_streaming_speaker_diarization = 8;
bool enable_global_speaker_diarization = 9;
int32 min_num_speakers = 10;
int32 max_num_speakers = 11;
// Speaker identification
bool enable_speaker_identification = 12;
repeated string cand_speaker_names = 13;
// Model options
string model = 14;
bool enable_dictation = 15;
}
The TranscriptionConfig
message is used with all transcription requests and specifies various
configuration parameters.
The client_request_reference
field can be used to identify the request in
API logs. It can be any value up to 32 characters long. Uniqueness of this
value is not verified or enforced.
The audio_format
, sample_rate_hertz
and num_audio_channels
fields specify information
about the input audio. Refer to
Audio Format.
include_nonfinal
specifies whether to enable low-latency recognition and include non-final
words in results. It is only valid for TranscribeStream
. Also
refer to Final vs Non-final Words.
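For example, a configuration for low-latency streaming of raw 16 kHz mono PCM might look like the following sketch.

# ss refers to the generated speech_service_pb2 module, as assumed in the earlier examples.
config = ss.TranscriptionConfig(
    audio_format="pcm_s16le",   # raw PCM, so sample rate and channels are required
    sample_rate_hertz=16000,
    num_audio_channels=1,
    include_nonfinal=True,      # low-latency mode; valid only for TranscribeStream
)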
enable_separate_recognition_per_channel
specifies whether to perform separate speech
recognition for each audio channel. When used with Transcribe
, a separate
result for each channel is explicitly returned in the response. When used with
TranscribeStream
, results for different channels are multiplexed
in the response stream, and result.channel
must be used to associate the result to a
specific channel. The same applies when retrieving the result of an asynchronous
transcription using GetTranscribeAsyncResult
.
enable_endpoint_detection
enables endpoint detection for interactive voice applications.
When the end of the utterance is detected, an <end>
word is returned, and all words
up to and including the <end>
word are returned as final. This feature is designed to
be used together with the IVR Domain,
but can also be used with other models.
speech_context
is used to specify the custom vocabulary; refer to
Custom Vocabulary for general
information. There are two methods of specifying a speech context. First, speech context
entries can be included directly in speech_context.entries
. Second, a speech context
stored in Soniox Cloud (in the context of the user account) can be referenced using
speech_context.name
; the API for managing stored speech contexts is described in
Speech Context Management. It is not allowed to specify
both entries
and name
.
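The two methods might be used as in the following sketch (module name assumed as in the earlier examples; the name and phrases are placeholders).

# ss refers to the generated speech_service_pb2 module, as assumed in the earlier examples.

# Method 1: include the entries directly in the request config.
config_inline = ss.TranscriptionConfig(
    speech_context=ss.SpeechContext(
        entries=[ss.SpeechContextEntry(phrases=["acme corp"], boost=10)],
    ),
)

# Method 2: reference a speech context stored in Soniox Cloud by name.
config_stored = ss.TranscriptionConfig(
    speech_context=ss.SpeechContext(name="my_vocabulary"),
)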
enable_profanity_filter
enables the profanity filter to mask profane words/phrases,
while content_moderation_phrases
specifies custom words/phrases that should be
masked. These features can be used on their own or together, and both result in
specific words being masked. Masking is performed such that all characters in a word
except the first are replaced by asterisks (for example, f***); the original
words can still be obtained using the orig_text
field in Word
.
Words in content_moderation_phrases
may contain only the following characters:
lower-case English letters, -
and '
.
For fields related to speaker diarization and identification, refer to Speaker AI.
The model
field specifies the model to use. Valid models are precision
,
precision_medical
and precision_ivr
. Refer to
Medical Domain
and IVR Domain.
Transcription Results
message Result
message Result {
repeated Word words = 1;
int32 final_proc_time_ms = 2;
int32 total_proc_time_ms = 3;
repeated ResultSpeaker speakers = 6;
int32 channel = 7;
}
message ResultSpeaker {
int32 speaker = 1;
string name = 2;
}
The Result
message represents a speech recognition result, containing transcribed words
and other data.
The words
field contains a sequence of Word
messages representing
transcribed words.
The final_proc_time_ms
and total_proc_time_ms
fields give the duration of processed
audio in milliseconds corresponding to final words and to all words, respectively. In a Transcribe
request, both values are equal. In a TranscribeStream
request, these are consistent with
the timestamps of words, such that final words are in the interval from 0 to
final_proc_time_ms
and non-final words are in the interval from final_proc_time_ms
to total_proc_time_ms
. These values never decrease in subsequent results for the
same transcription.
If using Speaker Identification, the speakers
field
contains the latest associations between speaker numbers and candidate speakers (for
all words from the start of the audio, not just words in this result).
When separate recognition per channel is enabled, the channel
field indicates the
audio channel that the result is associated with. Audio channels are numbered starting
with 0.
message Word
message Word {
string text = 1;
int32 start_ms = 2;
int32 duration_ms = 3;
bool is_final = 4;
int32 speaker = 5;
string orig_text = 8;
double confidence = 9;
}
The Word
message represents an individual recognized word, which is given in the
text
field.
Punctuation symbols are represented as individual words. When converting a transcription
result to text for human consumption, it is important to recognize punctuation symbols
and not add a space between the previous word and the punctuation symbol. For this
purpose, the following text
values should be treated as punctuation symbols:
, (comma), ; (semicolon), . (period), ? (question mark) and ! (exclamation mark).
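A small sketch of converting a list of Word messages to display text while respecting this rule:

PUNCTUATION = {",", ";", ".", "?", "!"}

def words_to_text(words):
    text = ""
    for word in words:
        if word.text in PUNCTUATION or not text:
            text += word.text          # no space before punctuation or at the start
        else:
            text += " " + word.text
    return text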
start_ms
and duration_ms
represent the time interval of the word in the audio.
These are consistent with the order of words, such that start_ms
of the next word
is always greater than or equal to start_ms
+ duration_ms
of the current word.
is_final
specifies if the word is final. This distinction is relevant only when using
TranscribeStream
with include_nonfinal
equal to true; in other cases is_final
is
always true. Refer to
Final vs Non-final Words.
The orig_text
field indicates the original word if the word in text
is masked.
Refer to Profanity Filter and
Custom Content Moderation.
If using Speaker Diarization, the speaker
field indicates the
speaker number. Valid speaker numbers are greater than or equal to 1.
The confidence
field specifies the confidence, in the range between 0 and 1. The value
of 0 means that confidence information is not available for the word (this may be the
case for punctuation words).
Asynchronous Transcription
These requests allow a client to upload files to be transcribed asynchronously and to retrieve the transcription results later. This feature supports a variety of media file formats including video files.
A file is uploaded for transcription using TranscribeAsync
. The
status of the transcription can be queried using
GetTranscribeAsyncStatus
. The result of the transcription
is retrieved using GetTranscribeAsyncResult
.
A file that has been uploaded (and not yet deleted) is in one of the following states:
- QUEUED: The file is queued to be transcribed.
- TRANSCRIBING: The file is being transcribed.
- COMPLETED: The file has been transcribed successfully; the result is available.
- FAILED: Transcription has failed; the result is not and will not be available.
A file that is not in the TRANSCRIBING
state can be deleted using
DeleteTranscribeAsyncFile
. It is the responsibility
of the user to delete files; they are not deleted automatically.
There is a limit of 100 on the number of files that have been uploaded but not yet
deleted. If the limit is reached, further TranscribeAsync
requests will be rejected
with gRPC status code RESOURCE_EXHAUSTED
and details message starting with
<too_many_files>
. It is the responsibility of the user to prevent or handle these errors.
rpc TranscribeAsync
rpc TranscribeAsync(stream TranscribeAsyncRequest) returns (TranscribeAsyncResponse) {}
message TranscribeAsyncRequest {
string api_key = 1;
string reference_name = 3;
TranscriptionConfig config = 5;
bytes audio = 4;
}
message TranscribeAsyncResponse {
string file_id = 1;
}
The TranscribeAsync
request is used to upload a file for asynchronous transcription.
The client sends a sequence of TranscribeAsyncRequest
messages, where the first message
specifies the api_key
, reference_name
and config
and may contain an audio
chunk,
and any further messages contain only an audio
chunk. The audio chunks are concatenated
to form the complete audio file. The maximum size of an audio chunk is 5 MB.
The maximum total size is 500 MB. The maximum total duration of audio is 5 hours.
Audio is extracted from the file as a part of TranscribeAsync
. For larger files, it may
take a few seconds to decode the audio after all the audio data has been received by the
service. If there is an error during decoding, the TranscribeAsync
request will fail with
gRPC status code UNKNOWN
and details message starting with <invalid_media_file>
.
The reference_name
specified in the first request is intended to enable the user to
identify the file after it has been uploaded. It can be any string not longer than 256
characters, including the empty string, and duplicates are allowed. The service does
not use this field; it is only for reference.
If TranscribeAsync
succeeds, the automatically assigned file_id
is returned. The
current state of the file can be queried by calling
GetTranscribeAsyncStatus
with the file_id
.
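A Python sketch of the upload flow is shown below; the module, stub and endpoint names are assumptions as in the earlier examples, and the file name, reference name and chunk size are placeholders.

import grpc
import speech_service_pb2 as ss
import speech_service_pb2_grpc as ss_grpc

def upload_requests(api_key, reference_name, config, path, chunk_size=1024 * 1024):
    with open(path, "rb") as f:
        first = True
        while True:
            chunk = f.read(chunk_size)  # well under the 5 MB per-chunk limit
            if first:
                # The first message carries api_key, reference_name and config,
                # and may also carry an audio chunk.
                yield ss.TranscribeAsyncRequest(
                    api_key=api_key, reference_name=reference_name,
                    config=config, audio=chunk)
                first = False
            elif chunk:
                yield ss.TranscribeAsyncRequest(audio=chunk)
            if not chunk:
                break

channel = grpc.secure_channel("api.soniox.com:443", grpc.ssl_channel_credentials())
stub = ss_grpc.SpeechServiceStub(channel)
response = stub.TranscribeAsync(upload_requests(
    "<YOUR_API_KEY>", "meeting-recording", ss.TranscriptionConfig(), "meeting.mp4"))
print("file_id:", response.file_id)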
The file will initially be in the QUEUED
state and will transition to the TRANSCRIBING
state when the transcription starts. The time that this takes depends on the current service
load. When transcription has completed, the file will transition to the COMPLETED
state
and the results can be retrieved using
GetTranscribeAsyncResult
. If transcription fails, the
file will instead transition to the FAILED
state. The file will then remain in the
COMPLETED
or FAILED
state until it is deleted using
DeleteTranscribeAsyncFile.
rpc GetTranscribeAsyncStatus
rpc GetTranscribeAsyncStatus(GetTranscribeAsyncStatusRequest) returns (GetTranscribeAsyncStatusResponse) {}
message GetTranscribeAsyncStatusRequest {
string api_key = 1;
string file_id = 2;
}
message GetTranscribeAsyncStatusResponse {
repeated TranscribeAsyncFileStatus files = 1;
}
message TranscribeAsyncFileStatus {
string file_id = 1;
string reference_name = 2;
// One of: QUEUED, TRANSCRIBING, COMPLETED, FAILED
string status = 3;
// UTC timestamp
google.protobuf.Timestamp created_time = 4;
string error_message = 5;
}
The GetTranscribeAsyncStatus
request returns the state and other information for a
specific file or for all existing files.
If file_id
in the request is non-empty, information for a file with that ID is returned.
In this case, if there is no file with that ID, the request fails with gRPC status code
NOT_FOUND
and details message starting with <file_id_not_found>
. If file_id
is empty,
information about all existing files is returned.
The files
field in the response is a sequence of TranscribeAsyncFileStatus
messages.
If file_id
in the request was non-empty, there will be exactly one element; otherwise
there will be one element for each existing file ordered by increasing created_time
.
The following information is returned for each file in the response: file_id
,
reference_name
, status
(state), created_time
(UTC timestamp when TranscribeAsync
has completed) and error_message
(error information if the status is FAILED
).
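For example, a client might poll the status until transcription finishes, as in this sketch (ss and stub set up as in the earlier examples):

import time

def wait_for_completion(stub, api_key, file_id, poll_seconds=5.0):
    while True:
        response = stub.GetTranscribeAsyncStatus(
            ss.GetTranscribeAsyncStatusRequest(api_key=api_key, file_id=file_id))
        status = response.files[0].status
        if status == "COMPLETED":
            return
        if status == "FAILED":
            raise RuntimeError(response.files[0].error_message)
        time.sleep(poll_seconds)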
rpc GetTranscribeAsyncResult
rpc GetTranscribeAsyncResult(GetTranscribeAsyncResultRequest) returns (stream GetTranscribeAsyncResultResponse) {}
message GetTranscribeAsyncResultRequest {
string api_key = 1;
string file_id = 2;
}
message GetTranscribeAsyncResultResponse {
bool separate_recognition_per_channel = 2;
Result result = 1;
}
The GetTranscribeAsyncResult
request retrieves the transcription results for a file
in the COMPLETED
state.
The file for which to retrieve results is specified by file_id
. If there is no file
with that ID, the request fails with gRPC status code NOT_FOUND
and details message
starting with <file_id_not_found>
.
If the file is still in the QUEUED
or TRANSCRIBING
state, the request fails with
gRPC status code FAILED_PRECONDITION
and details message starting with
<file_not_transcribed_yet>
. If the file is in the FAILED
state, it fails with
gRPC status code FAILED_PRECONDITION
and details message starting with
<file_transcription_failed>
.
The transcription results are returned as a sequence of Result
messages embedded in a sequence of GetTranscribeAsyncResultResponse
responses, similar
to TranscribeStream
. The user can assemble the complete result
by concatenating the words
from all responses, which are guaranteed to be final, and
taking the fields final_proc_time_ms
, total_proc_time_ms
and speakers
from the
last response.
If separate recognition per channel was enabled, as is indicated by the field
separate_recognition_per_channel
in each response (having the same value in
all responses), the assembly of results described above must be done on a
per-channel basis according to result.channel
in each response.
rpc DeleteTranscribeAsyncFile
rpc DeleteTranscribeAsyncFile(DeleteTranscribeAsyncFileRequest) returns (DeleteTranscribeAsyncFileResponse) {}
message DeleteTranscribeAsyncFileRequest {
string api_key = 1;
string file_id = 2;
}
message DeleteTranscribeAsyncFileResponse {
}
The DeleteTranscribeAsyncFile
request deletes a specific file.
The file to delete is specified by file_id
. If there is no file with that ID, the request
fails with gRPC status code NOT_FOUND
and details message starting with
<file_id_not_found>
.
A file can be deleted as long as it is not in the TRANSCRIBING
state. If it is, the
request fails with gRPC status code FAILED_PRECONDITION
and details message starting
with <file_being_transcribed>
.
Transcription of Meetings
The TranscribeMeeting
request is provided for the purpose of real-time low-latency
transcription of a meeting with a separate audio stream for each participant. A requirement
for using this is that the application performs voice activity detection and sends only
segments of audio with voice activity detected.
The term stream is used synonymously with meeting participant. The term segment means a contiguous audio recording sent to the service in the context of a specific stream. Each segment is itself sent to the service in a number of requests, to enable low-latency operation.
rpc TranscribeMeeting
rpc TranscribeMeeting(stream TranscribeMeetingRequest) returns (stream TranscribeMeetingResponse) {}
message TranscribeMeetingRequest {
string api_key = 1;
TranscriptionConfig config = 10;
int32 seq_num = 3;
int32 stream_id = 4;
bool start_of_segment = 5;
bytes audio = 6;
bool end_of_segment = 7;
}
message TranscribeMeetingResponse {
int32 seq_num = 1;
int32 stream_id = 2;
bool start_of_segment = 3;
bool end_of_segment = 4;
Result result = 5;
string error = 6;
}
The client sends a sequence of TranscribeMeetingRequest
requests to the service. The
api_key
and config
are provided in the first request and must not be included in
later requests. The configuration specified in config
applies to all streams.
Since TranscribeMeeting
is intended only for the real-time low-latency use case,
it is required that config.include_nonfinal
is set to true.
The seq_num
is an opaque value which is returned in each response. More information
about responses is given below.
The stream_id
determines which stream the fields start_of_segment
, audio
and
end_of_segment
apply to. A stream_id
of 0 means no stream, and in that case these
fields must have default values. It is recommended to send at least one request every
10 seconds to prevent the session from timing out; using stream_id
0 enables doing
so when no audio needs to be sent to the service.
Assuming that stream_id
is not 0, the audio
field contains new audio data for the
stream, if any. The start_of_segment
and end_of_segment
flags indicate whether an
audio segment starts before audio
, or ends after audio
, respectively. These flags
must be consistent within a stream. Specifically: start_of_segment
must be true in
the first request for the stream, and start_of_segment
must also be true if
end_of_segment
was true in the previous request for the same stream. If this is not
the case, some of the audio will not be processed.
Note that, effectively, an audio segment is defined as the concatenation of audio
data starting from a request where start_of_segment
is true up to the first request
(the same or a later one) where end_of_segment
is true, considering only requests
for the same stream.
IMPORTANT: Each audio segment is decoded into audio samples independently. If using a container format, each audio segment must be encoded independently of previous segments in the same stream.
IMPORTANT: Active streams, that is streams where end_of_segment
was false in the
last request for that stream, occupy resources on the service. Make sure to terminate an
active stream when it is no longer relevant by sending a request with end_of_segment
equal to true (for example, when the meeting participant disconnects).
Transcription results are returned as a stream of TranscribeMeetingResponse
responses.
For each request, the service will send exactly one response, which will have the same
seq_num
, stream_id
, start_of_segment
and end_of_segment
. Within the same stream,
the order of responses will match the order of requests, but this is not generally true
across different streams. The seq_num
field can be used to reliably match responses to
requests.
IMPORTANT: In the future, the behavior may change such that there might not be one
response for each request, but one response could represent a number of consecutive
requests for the same stream belonging to the same segment. In such a response,
start_of_segment
would be that of the first of these requests, while end_of_segment
and seq_num
would be that of the last of these requests.
Errors specific to a stream generally do not result in the entire TranscribeMeeting request
failing, but are reported using the error
field in the response. If the error
field
is non-empty, an error has occurred, and the value of the field is the error message.
It is important that the client application checks for and reports these errors.
The actual transcription results are given in the result
field in the same manner as
for TranscribeStream
, but they must be interpreted in the context of the specific
stream. Note that the client must check whether result
is present before interpreting
it (refer to TranscribeStream
). For the special stream ID 0, result
will never be
present.
The transcript is always finalized at the end of each segment. Specifically, in a response
where end_of_segment
is true, result
will be present and will not contain any non-final
words.
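Below is a rough Python sketch of the request flow for a single participant stream; ss and stub are set up as in the earlier examples, pcm_chunks stands in for raw PCM chunks produced by your voice activity detection, and all other names are placeholders.

pcm_chunks = []  # raw PCM chunks for one voice-active segment (placeholder)

def meeting_requests(api_key, pcm_chunks):
    # First request: api_key and config only (include_nonfinal must be true).
    config = ss.TranscriptionConfig(
        audio_format="pcm_s16le", sample_rate_hertz=16000,
        num_audio_channels=1, include_nonfinal=True)
    yield ss.TranscribeMeetingRequest(api_key=api_key, config=config, seq_num=1)
    seq_num = 2
    first = True
    for i, chunk in enumerate(pcm_chunks):
        last = (i == len(pcm_chunks) - 1)
        yield ss.TranscribeMeetingRequest(
            seq_num=seq_num, stream_id=1,
            start_of_segment=first,   # true in the first request of the segment
            audio=chunk,
            end_of_segment=last)      # terminate the segment when done
        first = False
        seq_num += 1

for response in stub.TranscribeMeeting(meeting_requests("<YOUR_API_KEY>", pcm_chunks)):
    if response.error:
        print("stream", response.stream_id, "error:", response.error)
    elif response.HasField("result"):
        print("stream", response.stream_id, [w.text for w in response.result.words])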
Speech Context Management
These requests are used to manage the user's speech contexts stored in the Soniox Cloud.
Storing a speech context enables using it in a Transcribe
or TranscribeStream
request
as an alternative to directly specifying it. For general information about speech contexts,
refer to Custom Vocabulary.
Stored speech contexts exist in the context of a user account, where they are uniquely identified by a user-specified name. The user account is inferred from the API key used in the request.
rpc CreateSpeechContext
rpc CreateSpeechContext(CreateSpeechContextRequest) returns (CreateSpeechContextResponse) {}
message CreateSpeechContextRequest {
string api_key = 1;
SpeechContext speech_context = 2;
}
message CreateSpeechContextResponse {
}
The CreateSpeechContext
request creates a stored speech context.
The name of the speech context to create is specified as a part of the speech context in
speech_context.name
, which must be non-empty.
If a speech context with that name already exists, an error with status code
ALREADY_EXISTS
is returned. If the speech_context
does not satisfy the
SpeechContext
requirements, an error with status code
INVALID_ARGUMENT
and a message describing the problem is returned.
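A sketch of such a request (ss and stub set up as in the earlier examples; the name and phrases are placeholders):

stub.CreateSpeechContext(ss.CreateSpeechContextRequest(
    api_key="<YOUR_API_KEY>",
    speech_context=ss.SpeechContext(
        name="my_vocabulary",  # must be non-empty
        entries=[ss.SpeechContextEntry(phrases=["acme corp", "widget pro"], boost=10)],
    ),
))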
rpc UpdateSpeechContext
rpc UpdateSpeechContext(UpdateSpeechContextRequest) returns (UpdateSpeechContextResponse) {}
message UpdateSpeechContextRequest {
string api_key = 1;
SpeechContext speech_context = 2;
}
message UpdateSpeechContextResponse {
}
The UpdateSpeechContext
request updates an existing stored speech context.
The name of the speech context to update is specified as a part of the speech context in
speech_context.name
.
If there is no speech context with that name, an error with status code NOT_FOUND
is
returned. If the speech_context
does not satisfy the
SpeechContext
requirements, an error with status code
INVALID_ARGUMENT
and a message describing the problem is returned.
rpc DeleteSpeechContext
rpc DeleteSpeechContext(DeleteSpeechContextRequest) returns (DeleteSpeechContextResponse) {}
message DeleteSpeechContextRequest {
string api_key = 1;
string name = 2;
}
message DeleteSpeechContextResponse {
}
The DeleteSpeechContext
request deletes a stored speech context.
The name of the speech context to delete is specified in name
. If there is no speech
context with that name, an error with status code NOT_FOUND
is returned.
rpc ListSpeechContextNames
rpc ListSpeechContextNames(ListSpeechContextNamesRequest) returns (ListSpeechContextNamesResponse) {}
message ListSpeechContextNamesRequest {
string api_key = 1;
}
message ListSpeechContextNamesResponse {
repeated string names = 1;
}
The ListSpeechContextNames
request returns the names of all stored speech contexts.
The names are returned in the names
field in no specific order.
rpc GetSpeechContext
rpc GetSpeechContext(GetSpeechContextRequest) returns (GetSpeechContextResponse) {}
message GetSpeechContextRequest {
string api_key = 1;
string name = 2;
}
message GetSpeechContextResponse {
SpeechContext speech_context = 1;
}
The GetSpeechContext
request retrieves a stored speech context.
The name of the speech context to retrieve is specified in the name
field. If there is
no speech context with that name, an error with status code NOT_FOUND
is returned.
message SpeechContext
message SpeechContext {
repeated SpeechContextEntry entries = 1;
string name = 2;
}
The SpeechContext
message represents a speech context.
The entries
field contains the entries defining the speech context, represented
by SpeechContextEntry
messages. The name
field
represents the name of the speech context.
The presence of entries
and name
depends on the context:
- When a speech context is sent in CreateSpeechContext or UpdateSpeechContext, or returned in GetSpeechContext, both are required (or guaranteed) to be non-empty, respectively.
- In a Transcribe or TranscribeStream request, either both must be empty or exactly one must be non-empty. If both are empty, no speech context is used; if entries is non-empty, these entries are used; if name is non-empty, a stored speech context with that name is used.
Requirements:
- The size of the name must not exceed 50 characters.
- Each entry must satisfy the SpeechContextEntry requirements.
- The number of phrases in the entire speech context must not exceed 100.
- There must be no duplicate phrases in the entire speech context, after removing leading, trailing and repeated spaces.
message SpeechContextEntry
message SpeechContextEntry {
repeated string phrases = 1;
double boost = 2;
}
The SpeechContextEntry
message represents an entry in a speech context, defined by a list
of phrases and a single boost value that applies to these phrases. Words in each phrase are
separated by spaces.
Requirements:
- There must be at least one phrase.
- The size of a phrase must not exceed 100 characters.
- The number of words in a phrase must be between 1 and 5.
- The size of a word must not exceed 25 characters.
- A phrase may contain only the following characters: a-z (lower-case only), ' (apostrophe), - (hyphen/minus) and the space character.
- The boost value must be between -30 and 30 inclusive.
Speaker AI
Speaker Diarization
Speaker diarization distinguishes speakers based on their voice. Please refer to the Speaker AI guide for general information.
Speaker diarization is available for the Transcribe, TranscribeStream and TranscribeAsync requests.
It is enabled by setting config.enable_global_speaker_diarization
or
config.enable_streaming_speaker_diarization
to true, to use global or streaming
speaker diarization mode respectively. When speaker diarization is enabled, a speaker
number is included with each returned word (speaker
field in Word
).
When global speaker diarization is used with TranscribeStream
, specific restrictions
and considerations apply:
- config.include_nonfinal must be false. Therefore, real-time recognition is not possible.
- Transcription results will be returned only after the end of the request stream. It may take some time before these are returned, depending on the audio duration.
- The total audio duration is limited to no more than 2 hours.
Streaming speaker diarization does not have the restrictions above, but generally has lower accuracy, since it is optimized for low-latency real-time transcription.
When using speaker diarization, the minimum and maximum number of speakers can be
specified by setting config.min_num_speakers
and config.max_num_speakers
respectively. By default (if these values are 0), the service assumes that there
are between 1 and 10 speakers. The maximum permitted value of max_num_speakers
is 20. Note that if the actual number of speakers in the audio is outside of
the specified (or default) range, the accuracy of speaker diarization may be low.
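For example, a configuration enabling global speaker diarization for an expected two to four speakers might look like this sketch (ss as in the earlier examples):

config = ss.TranscriptionConfig(
    enable_global_speaker_diarization=True,
    min_num_speakers=2,
    max_num_speakers=4,  # must not exceed 20
)
# After transcription, word.speaker gives the speaker number (>= 1) for each word.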
Speaker Identification
Speaker identification works together with speaker diarization to associate numbered speakers with named candidate speakers based on voice samples provided in advance by the user.
A set of gRPC API calls are available for speaker management:
- AddSpeaker: Add a new speaker.
- GetSpeaker: Return information about a specific speaker.
- RemoveSpeaker: Remove a specific speaker.
- ListSpeakers: Return a list of registered speakers.
- AddSpeakerAudio: Add a new audio for a specific speaker.
- GetSpeakerAudio: Retrieve a specific audio of a specific speaker.
- RemoveSpeakerAudio: Remove a specific audio of a specific speaker.
Each speaker is identified by a speaker name, and each of the speaker's audios is identified by an audio name. Speaker names are unique in the context of the Soniox user account, while audio names are unique in the context of the speaker they belong to.
A simple command-line application
manage_speakers
is provided as a frontend to the speaker management API. This application can be
used to add speakers and audios for testing purposes, and it is also a good reference
for using these API calls directly.
In order to use speaker identification with Transcribe
or
TranscribeStream
, the following must be done:
- Speaker Diarization must be enabled (either global or streaming mode).
- config.enable_speaker_identification must be set to true.
- Names of candidate speakers must be provided in config.cand_speaker_names.
Each of the candidate speakers specified must be an existing speaker as added using
AddSpeaker
(or manage_speakers --add_speaker
). If this is not the case, an error will
be returned. However, if some of these speakers do not have any audios, no error will be
returned, but it will not be possible to identify those speakers.
Results of speaker identification are provided in the speakers
field in the
Result
structure. This is a list of associations between speaker
numbers and candidate speakers. This list will not include entries for speaker
numbers that were not associated with a candidate speaker.
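A sketch combining these requirements and reading the resulting associations (ss as in the earlier examples; the candidate speaker names are placeholders and must already exist):

config = ss.TranscriptionConfig(
    enable_global_speaker_diarization=True,  # diarization must be enabled
    enable_speaker_identification=True,
    cand_speaker_names=["John", "Judy"],     # previously added with AddSpeaker
)

# After transcription, map speaker numbers to candidate names via result.speakers.
def speaker_names(result):
    by_number = {s.speaker: s.name for s in result.speakers}
    # Speaker numbers not associated with any candidate are absent from result.speakers.
    return [(w.text, by_number.get(w.speaker, "unknown")) for w in result.words]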