Models and Languages

We build speech recognition AI focused on high accuracy, so you can transcribe any audio and get back highly accurate transcripts. For comparison with other providers, please see our benchmarks.

We support two types of models: default and low-latency. The default model provides the highest accuracy (much higher than low-latency) and should be used for almost all use cases (e.g. transcribing files, meetings, phone calls, voice interactions). The low-latency model should be used only when instant recognition is required (e.g. live captioning).

Language             | Model name       | Model type  | Speaker AI supported
---------------------|------------------|-------------|---------------------
English              | en_v2            | Default     | Yes
English              | en_v2_lowlatency | Low-latency | Yes
Korean               | ko_v2            | Default     | No
Korean               | ko_v2_lowlatency | Low-latency | No
Chinese (simplified) | zh_v2            | Default     | No
Chinese (simplified) | zh_v2_lowlatency | Low-latency | No
Spanish              | es_v2            | Default     | No
Spanish              | es_v2_lowlatency | Low-latency | No
French               | fr_v2            | Default     | No
French               | fr_v2_lowlatency | Low-latency | No
Italian              | it_v2            | Default     | No
Italian              | it_v2_lowlatency | Low-latency | No
Portuguese           | pt_v2            | Default     | No
Portuguese           | pt_v2_lowlatency | Low-latency | No
German               | de_v2            | Default     | No
German               | de_v2_lowlatency | Low-latency | No

You must select a model by setting the model field of TranscriptionConfig to a valid model name (e.g., en_v2). Refer to Configure Requests. Always specify the model: if no model is specified, a legacy English model is used.
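The naming pattern in the table above can be captured in a small helper. The following is a minimal sketch, not part of the Soniox client library: the function names (model_name, supports_speaker_ai, build_config) and the plain dictionary standing in for TranscriptionConfig are hypothetical and only illustrate how the model field might be filled in explicitly.

```python
# Language prefixes taken from the model names in the table above.
SUPPORTED_LANGUAGES = {"en", "ko", "zh", "es", "fr", "it", "pt", "de"}


def model_name(language: str, low_latency: bool = False) -> str:
    """Return the model name for a language prefix, e.g. "ko" -> "ko_v2"."""
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {language!r}")
    suffix = "_lowlatency" if low_latency else ""
    return f"{language}_v2{suffix}"


def supports_speaker_ai(model: str) -> bool:
    """Per the table, only the English models support Speaker AI."""
    return model in ("en_v2", "en_v2_lowlatency")


def build_config(language: str, low_latency: bool = False) -> dict:
    """Hypothetical stand-in for TranscriptionConfig: a dict with the model set.

    The model is always set explicitly, so the request never falls back to
    the legacy English model.
    """
    return {"model": model_name(language, low_latency)}


if __name__ == "__main__":
    print(build_config("ko"))
    print(build_config("en", low_latency=True))
```

With a real client, the resulting model name would be passed to whatever mechanism the client offers for setting the TranscriptionConfig model field; see Configure Requests for the actual interface.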

Bilingual Solutions

For all non-English languages, Soniox’s speech recognition AI is a bilingual solution, meaning that each model can recognize both its native language and English. For example, the model ko_v2 can recognize both Korean and English with high accuracy.

Note that the English models (en_v2, en_v2_lowlatency) have higher accuracy on English-only audio. We therefore recommend using an English model when the entire audio, or a large segment of it, is in English.

Default vs Low-Latency Model

You should consider using the low-latency model only when instant recognition of words is required. Typical use cases include live captioning and live dictation. To enable low-latency processing, set the model to a low-latency model (e.g. en_v2_lowlatency) and set the enable_nonfinal field to true. The low-latency model can only be used with streaming API calls.

Our streaming API calls also support the default model, which lets you transcribe the stream with maximum accuracy, although the recognized words may then arrive with many seconds of latency. Streaming with the default model is particularly useful for applications where you can send the audio in real time and want the entire transcript as soon as possible after the end of the audio. Typical use cases include voice interactions.

In all other API calls (e.g. Transcribe Short Audio and Transcribe Files), only the default model is supported.
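The rules in this section can be checked before a request is sent. The sketch below is an illustration under stated assumptions: check_request and its parameters are hypothetical names, not Soniox API calls, and the check covers only the two constraints stated above (low-latency models require a streaming call, and low-latency processing requires enable_nonfinal to be true).

```python
from typing import List


def is_low_latency(model: str) -> bool:
    """Low-latency model names in the table end with "_lowlatency"."""
    return model.endswith("_lowlatency")


def check_request(model: str, streaming: bool, enable_nonfinal: bool = False) -> List[str]:
    """Return a list of problems with a planned request (empty if none found).

    Hypothetical helper: it only encodes the constraints described in this
    section and does not contact any service.
    """
    problems = []
    if is_low_latency(model):
        if not streaming:
            problems.append(
                f"{model} can only be used with streaming API calls"
            )
        if not enable_nonfinal:
            problems.append(
                "low-latency processing requires enable_nonfinal to be true"
            )
    return problems


if __name__ == "__main__":
    # Live captioning: low-latency model on a stream with non-final results.
    print(check_request("en_v2_lowlatency", streaming=True, enable_nonfinal=True))
    # File transcription with a low-latency model is reported as a problem.
    print(check_request("en_v2_lowlatency", streaming=False, enable_nonfinal=True))
    # Default model works for both streaming and non-streaming calls.
    print(check_request("en_v2", streaming=False))
```

The helper returns messages rather than raising, so a caller can report every problem at once; that is a design choice for the sketch, not a requirement of the API.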

Spaces in Chinese Models

With Chinese models, predicted space tokens should be treated somewhat differently. A space token represents either a physical space or a suggestion where a line break may occur. While the distinction is not provided by the model, a simple heuristic can be used: treat the space as a possible line break when the characters on both sides of it are each either a Chinese character or a Chinese (full-width) punctuation symbol.
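The heuristic can be sketched as follows. This is a minimal illustration that works on an already assembled transcript string rather than on individual tokens. The function names are hypothetical, and the Unicode ranges used to decide what counts as a Chinese character or full-width punctuation symbol are an approximation chosen for this example, not a definition provided by the model.

```python
from typing import List


def is_chinese_or_fullwidth(ch: str) -> bool:
    """Approximate test for a Chinese character or full-width punctuation."""
    code = ord(ch)
    return (
        0x4E00 <= code <= 0x9FFF      # CJK Unified Ideographs
        or 0x3400 <= code <= 0x4DBF   # CJK Unified Ideographs Extension A
        or 0x3001 <= code <= 0x303F   # CJK Symbols and Punctuation
        or 0xFF01 <= code <= 0xFF5E   # Fullwidth ASCII variants
    )


def possible_line_breaks(text: str) -> List[int]:
    """Return indices of spaces that are possible line breaks.

    A space qualifies when the characters on both sides of it are Chinese
    characters or full-width punctuation. All other spaces, including
    leading and trailing ones, are treated as physical spaces.
    """
    breaks = []
    for i, ch in enumerate(text):
        if ch != " " or i == 0 or i == len(text) - 1:
            continue
        if is_chinese_or_fullwidth(text[i - 1]) and is_chinese_or_fullwidth(text[i + 1]):
            breaks.append(i)
    return breaks


if __name__ == "__main__":
    sample = "你好， 世界 hello world"
    print(possible_line_breaks(sample))
```

How the possible line breaks are rendered (for example, dropped when text is laid out on one line, or used as wrap points) is left to the application.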