Models and languages

We build high-accuracy speech recognition AI that lets you transcribe any audio and get back highly accurate transcripts. For a comparison with other providers, please see our benchmarks.

We support two types of models: default and low-latency. The default model provides the highest accuracy (much higher than low-latency) and should be used for almost all use cases (e.g. transcribing files, meetings, phone calls, voice interactions). The low-latency model should be used only when instant recognition is required (e.g. live captioning).

Language	Model name	Model type	Speaker AI Supported
English	en_v2	Default	Yes
English	en_v2_lowlatency	Low-latency	Yes
Korean	ko_v2	Default	No
Korean	ko_v2_lowlatency	Low-latency	No
Chinese (simplified)	zh_v2	Default	No
Chinese (simplified)	zh_v2_lowlatency	Low-latency	No
Spanish	es_v2	Default	No
Spanish	es_v2_lowlatency	Low-latency	No
French	fr_v2	Default	No
French	fr_v2_lowlatency	Low-latency	No
Italian	it_v2	Default	No
Italian	it_v2_lowlatency	Low-latency	No
Portuguese	pt_v2	Default	No
Portuguese	pt_v2_lowlatency	Low-latency	No
German	de_v2	Default	No
German	de_v2_lowlatency	Low-latency	No

You must configure the model by setting the model field of TranscriptionConfig to a valid model name (e.g. en_v2). Refer to Configure Requests. Always specify the model: if no model is specified, a legacy English model is used.
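As an illustration of how the names in the table are formed, the sketch below builds a model name from a language and a model type. The helper functions and the LANGUAGE_CODES mapping are hypothetical and not part of the Soniox API; only the resulting model names and the Speaker AI column come from the table above.

```python
# Hypothetical helpers, not part of the Soniox API.
# The language prefixes and model names are copied from the table above.
LANGUAGE_CODES = {
    "English": "en",
    "Korean": "ko",
    "Chinese (simplified)": "zh",
    "Spanish": "es",
    "French": "fr",
    "Italian": "it",
    "Portuguese": "pt",
    "German": "de",
}


def model_name(language: str, low_latency: bool = False) -> str:
    """Return the v2 model name for a language listed in the table."""
    try:
        code = LANGUAGE_CODES[language]
    except KeyError:
        raise ValueError(f"unsupported language: {language!r}") from None
    suffix = "_lowlatency" if low_latency else ""
    return f"{code}_v2{suffix}"


def supports_speaker_ai(model: str) -> bool:
    """Per the table, only the English models support Speaker AI."""
    return model in {"en_v2", "en_v2_lowlatency"}


if __name__ == "__main__":
    print(model_name("Korean"))
    print(model_name("English", low_latency=True))
```

The returned string is the value you would place in the model field of TranscriptionConfig; how that object is constructed depends on the client library, so see Configure Requests.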

Bilingual solutions

For all non-English languages, Soniox's speech recognition AI is a bilingual solution, meaning that it can recognize both the native language and English. For example, the model ko_v2 can recognize both Korean and English with high accuracy.

Note that the English models (en_v2, en_v2_lowlatency) have higher accuracy on English-only audio. We therefore recommend using the English models when the entire audio, or a large segment of it, is in English.

Default vs low-latency model

Consider using the low-latency model only when instant recognition of words is required. Typical use cases include live captioning and live dictation. The low-latency processing mode is enabled by setting the model to a low-latency model (e.g. en_v2_lowlatency) and by setting the enable_nonfinal field to true. The low-latency model can only be used with streaming API calls.

Streaming API calls also support the default model, which lets you transcribe the stream with maximum accuracy, although the recognized words may arrive with many seconds of latency. The default model is particularly useful in streaming applications where you send the audio in real time and want the entire transcript as soon as possible after the audio ends. Typical use cases include voice interactions.

In all other API calls (e.g. Transcribe Short Audio and Transcribe Files), only the default model is supported.
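The rules in this section can be summarized as a small client-side check. This is a minimal sketch, not Soniox code: the function name and the call-type labels ("streaming", "short_audio", "files") are assumptions chosen for illustration, and the service's actual behavior on an invalid combination is not described here.

```python
# Hypothetical client-side check; the call-type labels are illustrative only.
NON_STREAMING_CALLS = {"short_audio", "files"}


def check_model_for_call(model: str, call_type: str, enable_nonfinal: bool = False) -> None:
    """Raise ValueError if the combination contradicts the rules described above."""
    if call_type != "streaming" and call_type not in NON_STREAMING_CALLS:
        raise ValueError(f"unknown call type: {call_type!r}")

    is_low_latency = model.endswith("_lowlatency")
    if not is_low_latency:
        # The default model is supported by every API call.
        return

    if call_type != "streaming":
        raise ValueError("low-latency models can only be used with streaming API calls")
    if not enable_nonfinal:
        raise ValueError("low-latency processing also requires enable_nonfinal to be true")


if __name__ == "__main__":
    check_model_for_call("en_v2", "files")
    check_model_for_call("en_v2_lowlatency", "streaming", enable_nonfinal=True)
    print("both combinations are allowed")
```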

Spaces in Chinese models

With Chinese models, predicted space tokens should be treated somewhat differently. A space token represents either a physical space or a suggestion where a line break may occur. While the distinction is not provided by the model, a simple heuristic can be used: treat the space as a possible line break when it is, on both sides, adjacent to a Chinese character or a Chinese (full-width) punctuation symbol.
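The heuristic above can be sketched as follows. The Unicode ranges used to decide what counts as a Chinese character or full-width punctuation symbol are an approximation chosen for this example, not a definition taken from Soniox, and the function names are hypothetical. Dropping the hint spaces when rendering on a single line is one possible way to use the result.

```python
from typing import List, Tuple

# Approximate ranges (an assumption of this sketch):
# CJK Extension A, CJK Unified Ideographs, CJK Symbols and Punctuation,
# and Halfwidth and Fullwidth Forms.
_CHINESE_RANGES = [
    (0x3400, 0x4DBF),
    (0x4E00, 0x9FFF),
    (0x3000, 0x303F),
    (0xFF00, 0xFFEF),
]


def _is_chinese(ch: str) -> bool:
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in _CHINESE_RANGES)


def classify_spaces(text: str) -> List[Tuple[int, str]]:
    """Return (index, kind) for every space; kind is 'line_break_hint' or 'space'."""
    result = []
    for i, ch in enumerate(text):
        if ch != " ":
            continue
        left = text[i - 1] if i > 0 else ""
        right = text[i + 1] if i + 1 < len(text) else ""
        # A hint requires a Chinese character or punctuation symbol on BOTH sides.
        both = bool(left) and bool(right) and _is_chinese(left) and _is_chinese(right)
        result.append((i, "line_break_hint" if both else "space"))
    return result


def strip_line_break_hints(text: str) -> str:
    """Remove spaces classified as line-break hints, keeping physical spaces."""
    kinds = dict(classify_spaces(text))
    return "".join(ch for i, ch in enumerate(text) if kinds.get(i) != "line_break_hint")


if __name__ == "__main__":
    print(strip_line_break_hints("你好 世界"))
    print(strip_line_break_hints("你好 hello 世界"))
```

In the second call both spaces are kept, because each has a Latin letter on one side.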
