Soniox conducted a comprehensive evaluation of the accuracy of leading speech recognition providers. The benchmarking results are summarized as follows:
- Providers evaluated: Soniox, OpenAI, Google, AWS, Azure, NVIDIA, Deepgram, AssemblyAI, Speechmatics, and ElevenLabs.
- Languages evaluated: 60 languages.
- Evaluation datasets: Real-world datasets of YouTube videos for each language, covering diverse acoustic conditions, speaking styles, accents, topics, and speaker variations.
- Ground truth transcriptions: Transcribed and double-reviewed by humans, then normalized to ensure a fair evaluation across providers.
- Processing mode evaluated: asynchronous (file/batch) transcription.
- Results: Soniox achieved the highest speech recognition accuracy across most languages by a significant margin.
Evaluation methodology
To assess the accuracy of speech recognition providers across multiple languages, we conducted a rigorous benchmarking study using Word Error Rate (WER) and Character Error Rate (CER) as the primary evaluation metrics. These industry-standard metrics quantify transcription accuracy, with lower values indicating better performance. WER was used for most languages, while CER was used for Korean, Chinese, Japanese, and Thai.
Our evaluation was based on 45 to 70 minutes of real-world audio per language, sourced from YouTube to ensure a diverse and challenging dataset. The selected samples encompass various acoustic conditions, speaking styles, accents, and topics, providing a robust assessment of model performance in real-world scenarios. Ground truth transcriptions were carefully transcribed and double-reviewed by humans, then normalized to ensure consistency and fairness across all providers.
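Both metrics reduce to a normalized edit distance; they differ only in the token unit (words vs. characters). The sketch below is illustrative and is not the scoring code used in the benchmark:

```python
def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance between two token sequences (single-row DP)."""
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edits divided by reference word count."""
    return edit_distance(reference.split(), hypothesis.split()) / len(reference.split())

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edits divided by reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 edit / 6 words ≈ 0.167
```

In practice, off-the-shelf libraries such as jiwer implement the same computation.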
Evaluation process
- Dataset selection: For each language, real-world YouTube videos were chosen to reflect diverse speech patterns, varying acoustic conditions, and multiple speaker types.
- Transcription & ground truth creation: All dataset transcriptions were manually created, double-reviewed by humans, and normalized to provide a consistent reference for evaluation. The normalization process included removing punctuation and ignoring capitalization; otherwise, the ground truth transcription remained unchanged.
- Model integration: Each provider's API was carefully integrated according to its official documentation, ensuring a fair, like-for-like comparison (a hypothetical harness sketch follows the model table below).
- Evaluation metrics:
- WER (Word Error Rate): Measures transcription errors at the word level.
- CER (Character Error Rate): Used for logographic or non-space-separated languages to provide a finer-grained accuracy measurement.
- Processing mode: All models were evaluated in asynchronous (file/batch) transcription mode to ensure consistency in testing.
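As a concrete reference for the normalization step above, here is a minimal sketch. The punctuation and casing rules come from the description above; the whitespace cleanup and the use of Unicode category P* for punctuation are assumptions, as the benchmark's exact character classes are not specified:

```python
import re
import unicodedata

def normalize(text: str) -> str:
    """Strip punctuation and casing before scoring, leaving words intact.
    Unicode category P* approximates "punctuation"; the benchmark's exact
    rules are not specified, so this is an approximation."""
    text = "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
    return re.sub(r"\s+", " ", text).strip().lower()

print(normalize("Well, it's 3 PM -- time to go!"))  # "well its 3 pm time to go"
```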
Models evaluated
| Provider | Model evaluated |
|---|---|
| Soniox | stt-async-preview |
| OpenAI | Whisper large-v3 |
| Google | long (for supported languages), chirp_2 (for other languages) |
| AWS | Best/Default |
| Azure | Best/Default |
| NVIDIA | conformer-{lang}-asr-offline-asr-bls-ensemble (for supported languages), parakeet-1.1b-unified-ml-cs-universal-multi-asr-offline-asr-bls-ensemble (for other languages) |
| Deepgram | nova-3 (for English), nova-2 (for other languages) |
| AssemblyAI | best (for supported languages), nano (for other languages) |
| Speechmatics | enhanced |
| ElevenLabs | scribe_v1 |
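To keep a comparison like this fair across ten different APIs, each provider can sit behind one common interface so that dataset handling, normalization, and scoring are identical for all of them. The sketch below is hypothetical: the Transcriber, DummyTranscriber, and score_provider names are invented for illustration, not any provider's actual SDK, and it reuses the normalize() and wer() helpers sketched earlier:

```python
from dataclasses import dataclass
from typing import Protocol

class Transcriber(Protocol):
    """Common interface each provider integration implements (illustrative)."""
    name: str
    def transcribe_file(self, audio_path: str) -> str: ...

@dataclass
class DummyTranscriber:
    """Stand-in for a real integration, e.g. an HTTP client that uploads
    the file, polls the provider's async job, and returns the transcript."""
    name: str
    def transcribe_file(self, audio_path: str) -> str:
        return "hypothetical transcript"

def score_provider(t: Transcriber, dataset: list[tuple[str, str]]) -> float:
    """Mean WER of one provider over (audio_path, reference) pairs,
    using the normalize() and wer() helpers sketched above."""
    errors = [wer(normalize(ref), normalize(t.transcribe_file(path)))
              for path, ref in dataset]
    return sum(errors) / len(errors)
```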
This evaluation provides a transparent and rigorous comparison of speech recognition performance across industry-leading providers for 60 languages.