Models and Languages

We build high-accuracy speech recognition AI that lets you transcribe any audio and get back highly accurate transcripts. For a comparison with other providers, see our benchmarks.

We support two types of models: default and low-latency. The default model provides the highest accuracy (much higher than the low-latency model) and should be used for almost all use cases (e.g., transcribing files, meetings, phone calls, voice interactions). Use the low-latency model only when instant recognition is required (e.g., live captioning).

| Language             | Model name       | Model type  | Speaker AI supported |
|----------------------|------------------|-------------|----------------------|
| English              | en_v2            | Default     | Yes                  |
| English              | en_v2_lowlatency | Low-latency | Yes                  |
| Korean               | ko_v2            | Default     | No                   |
| Korean               | ko_v2_lowlatency | Low-latency | No                   |
| Chinese (simplified) | zh_v2            | Default     | No                   |
| Chinese (simplified) | zh_v2_lowlatency | Low-latency | No                   |
| Spanish              | es_v2            | Default     | No                   |
| Spanish              | es_v2_lowlatency | Low-latency | No                   |
| French               | fr_v2            | Default     | No                   |
| French               | fr_v2_lowlatency | Low-latency | No                   |
| Italian              | it_v2            | Default     | No                   |
| Italian              | it_v2_lowlatency | Low-latency | No                   |
| Portuguese           | pt_v2            | Default     | No                   |
| Portuguese           | pt_v2_lowlatency | Low-latency | No                   |
| German               | de_v2            | Default     | No                   |
| German               | de_v2_lowlatency | Low-latency | No                   |

You must configure the model by setting the model field of TranscriptionConfig to a valid model name (e.g., en_v2). Refer to Configure Requests. Do not forget to specify the model: if no model is specified, a legacy English model is used.
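The model names in the table follow a regular pattern (a language code, _v2, and an optional _lowlatency suffix), so the name can be derived rather than typed by hand. The following is a minimal sketch of that idea, not part of any Soniox client library: LANGUAGE_CODES, model_name, and build_config are hypothetical helpers, and the plain dict only stands in for TranscriptionConfig, of which just the model field name is taken from this page.

```python
# Language names and codes copied from the model table on this page.
LANGUAGE_CODES = {
    "English": "en",
    "Korean": "ko",
    "Chinese (simplified)": "zh",
    "Spanish": "es",
    "French": "fr",
    "Italian": "it",
    "Portuguese": "pt",
    "German": "de",
}


def model_name(language: str, low_latency: bool = False) -> str:
    """Return the model name for a language, e.g. 'en_v2' or 'en_v2_lowlatency'."""
    try:
        code = LANGUAGE_CODES[language]
    except KeyError:
        raise ValueError(f"Unsupported language: {language!r}") from None
    suffix = "_lowlatency" if low_latency else ""
    return f"{code}_v2{suffix}"


def build_config(language: str, low_latency: bool = False) -> dict:
    """Build a dict that always carries an explicit model.

    Hypothetical stand-in for TranscriptionConfig: setting "model" explicitly
    avoids silently falling back to the legacy English model.
    """
    return {"model": model_name(language, low_latency)}


if __name__ == "__main__":
    print(build_config("Korean"))
    print(build_config("English", low_latency=True))
```

Raising on an unknown language, instead of returning a default, keeps a typo from turning into a request that omits the model.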

Bilingual Solutions

For all non-English languages, Soniox’s speech recognition AI is bilingual: each model recognizes both its native language and English. For example, the model ko_v2 can recognize both Korean and English with high accuracy.

Note that the English models (en_v2, en_v2_lowlatency) have higher accuracy on English-only audio. We therefore recommend using an English model when the entire audio, or a large segment of it, is in English.

Default vs Low-Latency Model

Consider using the low-latency model only when instant recognition of words is required. Typical use cases include live captioning and live dictation. To enable low-latency processing, set the model to a low-latency model (e.g., en_v2_lowlatency) and set the enable_nonfinal field to true. The low-latency model can only be used with streaming API calls.

Streaming API calls also support the default model, which lets you transcribe the stream with maximum accuracy, although the recognized words may then arrive with many seconds of latency. This combination is particularly useful for applications that can send the audio in real time and need the entire transcript as soon as possible after the audio ends. Typical use cases include voice interactions.

In all other API calls (e.g. Transcribe Short Audio and Transcribe Files), only the default model is supported.
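The rules in this section can be checked before a request is sent. The sketch below assumes only what is stated above: low-latency models are identified by the _lowlatency suffix, they require a streaming call, and low-latency processing requires enable_nonfinal to be true. The function check_request and its streaming parameter are hypothetical names used for illustration, not part of the API.

```python
def check_request(model: str, streaming: bool, enable_nonfinal: bool = False) -> list:
    """Return a list of problems for a model / API-call combination.

    An empty list means the combination is consistent with the rules on this page.
    """
    problems = []
    low_latency = model.endswith("_lowlatency")
    if low_latency and not streaming:
        problems.append(
            "low-latency models can only be used with streaming API calls"
        )
    if low_latency and not enable_nonfinal:
        problems.append(
            "low-latency processing requires enable_nonfinal to be true"
        )
    return problems


if __name__ == "__main__":
    # Default model on a file transcription call: no problems reported.
    print(check_request("en_v2", streaming=False))
    # Low-latency model on a non-streaming call without enable_nonfinal.
    for problem in check_request("en_v2_lowlatency", streaming=False):
        print("problem:", problem)
```

The default model passes in both streaming and non-streaming calls, which matches the text: it is supported everywhere, and only the low-latency model is restricted.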

Spaces in Chinese Models

With Chinese models, predicted space tokens should be treated somewhat differently. A space token represents either a physical space or a suggestion of where a line break may occur. The model does not provide this distinction, but a simple heuristic can be used: treat the space as a possible line break when the characters on both sides of it are Chinese characters or Chinese (full-width) punctuation symbols, and as a physical space otherwise.
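The heuristic can be sketched as follows. This is one possible reading, not the reference implementation: the Unicode ranges used to decide what counts as a Chinese character or full-width punctuation symbol are assumptions, and the function names are hypothetical. Removing the soft spaces for single-line display is likewise a design choice; an application that wraps text could instead use the returned positions as allowed break points.

```python
def _is_chinese(ch: str) -> bool:
    """Assumed definition of 'Chinese character or full-width punctuation'."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3400 <= cp <= 0x4DBF   # CJK Unified Ideographs Extension A
        or 0x3000 <= cp <= 0x303F   # CJK Symbols and Punctuation
        or 0xFF00 <= cp <= 0xFFEF   # Halfwidth and Fullwidth Forms
    )


def line_break_positions(text: str) -> list:
    """Indices of spaces whose neighbours on both sides are Chinese."""
    return [
        i
        for i, ch in enumerate(text)
        if ch == " "
        and 0 < i < len(text) - 1
        and _is_chinese(text[i - 1])
        and _is_chinese(text[i + 1])
    ]


def remove_soft_spaces(text: str) -> str:
    """Drop possible-line-break spaces; keep physical spaces."""
    breaks = set(line_break_positions(text))
    return "".join(ch for i, ch in enumerate(text) if i not in breaks)


if __name__ == "__main__":
    sample = "你好 世界 hello world 再见"
    print(line_break_positions(sample))
    print(remove_soft_spaces(sample))
```

In the sample, only the space between 你好 and 世界 has Chinese characters on both sides, so it is the only one treated as a possible line break; the spaces next to the English words are kept as physical spaces.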