Models and languages

This page describes models offered by Soniox Speech-to-Text (legacy).

We build only high-accuracy speech recognition AI solutions: you can transcribe any audio and get back highly accurate transcripts. For a comparison with other providers, please see our benchmarks.

We support two types of models: default and low-latency. The default model provides the highest accuracy (much higher than low-latency) and should be used for almost all use cases (e.g. transcribing files, meetings, phone calls, voice interactions). The low-latency model should be used only when instant recognition is required (e.g. live captioning).

| Language | Model name | Model type | Speaker AI supported |
|---|---|---|---|
| English | en_v2 | Default | Yes |
| English | en_v2_lowlatency | Low-latency | Yes |
| Korean | ko_v2 | Default | No |
| Korean | ko_v2_lowlatency | Low-latency | No |
| Chinese (simplified) | zh_v2 | Default | No |
| Chinese (simplified) | zh_v2_lowlatency | Low-latency | No |
| Spanish | es_v2 | Default | No |
| Spanish | es_v2_lowlatency | Low-latency | No |
| French | fr_v2 | Default | No |
| French | fr_v2_lowlatency | Low-latency | No |
| Italian | it_v2 | Default | No |
| Italian | it_v2_lowlatency | Low-latency | No |
| Portuguese | pt_v2 | Default | No |
| Portuguese | pt_v2_lowlatency | Low-latency | No |
| German | de_v2 | Default | No |
| German | de_v2_lowlatency | Low-latency | No |

You must configure the model by setting the model field of TranscriptionConfig to a valid model name (e.g., en_v2). Refer to Configure Requests. Always specify the model: if no model is specified, a legacy English model is used.
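As a rough illustration of the naming scheme in the table above, the sketch below derives a model name from a language and a model type. It does not use the Soniox SDK: the helper names (`model_name`, `make_config`) and the plain dictionary standing in for TranscriptionConfig are assumptions made for this example only.

```python
# Language codes as they appear in the model names listed on this page.
LANGUAGE_CODES = {
    "English": "en",
    "Korean": "ko",
    "Chinese (simplified)": "zh",
    "Spanish": "es",
    "French": "fr",
    "Italian": "it",
    "Portuguese": "pt",
    "German": "de",
}


def model_name(language: str, low_latency: bool = False) -> str:
    """Return the model name for a language, e.g. 'ko_v2' or 'ko_v2_lowlatency'."""
    if language not in LANGUAGE_CODES:
        raise ValueError(f"unsupported language: {language!r}")
    code = LANGUAGE_CODES[language]
    return f"{code}_v2_lowlatency" if low_latency else f"{code}_v2"


def make_config(language: str, low_latency: bool = False) -> dict:
    """Hypothetical stand-in for TranscriptionConfig: a dict with the model field set."""
    # Setting the model explicitly avoids falling back to the legacy English model.
    return {"model": model_name(language, low_latency)}


if __name__ == "__main__":
    print(make_config("Korean"))
    print(make_config("English", low_latency=True))
```

Failing early on an unknown language keeps a typo from silently producing a request with an invalid model name.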

Bilingual solutions

For all non-English languages, Soniox's speech recognition AI is a bilingual solution, meaning that each model can recognize both its native language and English. For example, the model ko_v2 can recognize both Korean and English with high accuracy.

Note that the English models (en_v2, en_v2_lowlatency) have higher accuracy on English-only audio. We therefore recommend using an English model when the entire audio, or a large segment of it, is in English.
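The recommendation above could be encoded as a small selection rule. This is only a sketch: the page gives no numeric cutoff for what counts as a "large" English segment, so the `threshold` parameter and its default value, like the function name `pick_model`, are hypothetical.

```python
def pick_model(native_code: str, english_fraction: float, threshold: float = 0.9) -> str:
    """Pick a default model given the expected share of English in the audio.

    native_code      -- language code used in model names, e.g. "ko"
    english_fraction -- estimated share of the audio that is English (0.0 to 1.0)
    threshold        -- hypothetical cutoff, not a value documented by Soniox
    """
    if not 0.0 <= english_fraction <= 1.0:
        raise ValueError("english_fraction must be between 0.0 and 1.0")
    # Mostly-English audio: the English model has higher accuracy on it.
    if native_code == "en" or english_fraction >= threshold:
        return "en_v2"
    # Mixed audio: the bilingual model handles both the native language and English.
    return f"{native_code}_v2"


if __name__ == "__main__":
    print(pick_model("ko", 0.2))
    print(pick_model("ko", 1.0))
```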

Default vs low-latency model

You should consider using the low-latency model only when instant recognition of words is required. Typical use cases include live captioning and live dictation. The low-latency processing mode is enabled by setting the model to a low-latency model (e.g. en_v2_lowlatency) and by setting the enable_nonfinal field to true. The low-latency model can only be used with streaming API calls.

Our streaming API calls also support the default model, allowing you to transcribe the stream with maximum accuracy, although the recognized words may arrive with many seconds of latency. This is particularly useful for applications that can send the audio in real time and need the entire transcript as soon as possible after the end of the audio. A typical use case is voice interactions.

In all other API calls (e.g. Transcribe Short Audio and Transcribe Files), only the default model is supported.
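The rules in this section can be summarized as a small validation step run before a request is made. The sketch below is not part of the Soniox SDK: the function `build_config`, the API labels ("streaming", "short_audio", "files"), and the dictionary used in place of TranscriptionConfig are assumptions for illustration. Only the model and enable_nonfinal field names come from this page.

```python
def build_config(model: str, api: str) -> dict:
    """Build a config dict, enforcing the model/API rules described above.

    api is a hypothetical label: "streaming", "short_audio" or "files".
    """
    if api not in ("streaming", "short_audio", "files"):
        raise ValueError(f"unknown API kind: {api!r}")

    low_latency = model.endswith("_lowlatency")

    # Low-latency models can only be used with streaming API calls.
    if low_latency and api != "streaming":
        raise ValueError(f"{model} requires a streaming API call, got {api!r}")

    config = {"model": model}
    if low_latency:
        # Low-latency processing also needs enable_nonfinal set to true.
        config["enable_nonfinal"] = True
    return config


if __name__ == "__main__":
    print(build_config("en_v2_lowlatency", "streaming"))
    print(build_config("en_v2", "streaming"))  # default model: maximum accuracy, more latency
    print(build_config("en_v2", "files"))
```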

Spaces in Chinese models

With Chinese models, predicted space tokens should be treated somewhat differently. A space token represents either a physical space or a suggestion where a line break may occur. While the distinction is not provided by the model, a simple heuristic can be used: treat the space as a possible line break when it is, on both sides, adjacent to a Chinese character or a Chinese (full-width) punctuation symbol.
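A minimal sketch of this heuristic follows. The Unicode ranges used to decide what counts as a Chinese character or full-width punctuation are an assumption of this example, not something the page specifies. Removing the possible-line-break spaces when rendering on a single line is likewise just one way to use the classification.

```python
from typing import List, Tuple


def is_chinese_or_fullwidth(ch: str) -> bool:
    """Assumed definition: CJK ideographs plus CJK and full-width punctuation."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3400 <= cp <= 0x4DBF   # CJK Unified Ideographs Extension A
        or 0x3001 <= cp <= 0x303F   # CJK symbols and punctuation (without U+3000 space)
        or 0xFF01 <= cp <= 0xFF60   # full-width forms such as "，" and "！"
    )


def classify_spaces(text: str) -> List[Tuple[int, str]]:
    """Label each space as 'line_break' (possible line break) or 'space'."""
    labels = []
    for i, ch in enumerate(text):
        if ch != " ":
            continue
        left = text[i - 1] if i > 0 else ""
        right = text[i + 1] if i + 1 < len(text) else ""
        # Possible line break only if BOTH neighbours are Chinese or full-width.
        both_chinese = bool(left and right) and \
            is_chinese_or_fullwidth(left) and is_chinese_or_fullwidth(right)
        labels.append((i, "line_break" if both_chinese else "space"))
    return labels


def drop_line_break_spaces(text: str) -> str:
    """Render on one line: keep physical spaces, drop possible line breaks."""
    breaks = {i for i, kind in classify_spaces(text) if kind == "line_break"}
    return "".join(ch for i, ch in enumerate(text) if i not in breaks)


if __name__ == "__main__":
    sample = "我喜欢 Python 编程， 也喜欢 音乐"
    print(classify_spaces(sample))
    print(drop_line_break_spaces(sample))
```

Spaces next to Latin text (here around "Python") are kept as physical spaces, while spaces between two Chinese characters, or between Chinese punctuation and a Chinese character, are treated as optional break points.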
