Soniox conducted a comprehensive evaluation of the accuracy of leading speech recognition providers. The benchmarking results are summarized as follows:
- Providers evaluated: Soniox, OpenAI, Google, AWS, Azure, NVIDIA, Deepgram, AssemblyAI, Speechmatics, and ElevenLabs.
- Languages evaluated: 60 languages.
- Evaluation datasets: Real-world datasets of YouTube videos for each language, covering diverse acoustic conditions, speaking styles, accents, topics, and speaker variations.
- Ground truth transcriptions: Transcribed and double-reviewed by humans, then normalized to ensure a fair evaluation across providers.
- Processing mode evaluated: asynchronous (file/batch) transcription.
- Results: Soniox achieved the highest speech recognition accuracy across most languages by a significant margin.
Evaluation methodology
To assess the accuracy of speech recognition providers across multiple languages, we conducted a rigorous benchmarking study using Word Error Rate (WER) and Character Error Rate (CER) as the primary evaluation metrics. These industry-standard metrics quantify transcription accuracy, with lower values indicating better performance. WER was used for most languages, while CER was used for Korean, Chinese, Japanese, and Thai.
Our evaluation was based on 45 to 70 minutes of real-world audio per language, sourced from YouTube to ensure a diverse and challenging dataset. The selected samples encompass various acoustic conditions, speaking styles, accents, and topics, providing a robust assessment of model performance in real-world scenarios. Ground truth transcriptions were carefully transcribed and double-reviewed by humans, then normalized to ensure consistency and fairness across all providers.
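Both metrics reduce to a normalized edit distance; they differ only in the token unit (words vs. characters). The sketch below is illustrative and is not the scoring code used in the benchmark:

```python
def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance between two token sequences (single-row DP)."""
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edits divided by reference word count."""
    return edit_distance(reference.split(), hypothesis.split()) / len(reference.split())

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edits divided by reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 edit / 6 words ≈ 0.167
```

In practice, off-the-shelf libraries such as jiwer implement the same computation.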
Evaluation process
- Dataset selection: For each language, real-world YouTube videos were chosen to reflect diverse speech patterns, varying acoustic conditions, and multiple speaker types.
- Transcription & ground truth creation: All dataset transcriptions were manually created, double-reviewed by humans, and normalized to provide a consistent reference for evaluation. The normalization process included removing punctuation and ignoring capitalization; otherwise, the ground truth transcription remained unchanged.
- Model integration: Each provider's API was carefully integrated according to its official documentation, ensuring a fair, like-for-like comparison (a hypothetical harness sketch follows the model table below).
- Evaluation metrics:
- WER (Word Error Rate): Measures transcription errors at the word level.
- CER (Character Error Rate): Used for logographic or non-space-separated languages to provide a finer-grained accuracy measurement.
- Processing mode: All models were evaluated in asynchronous (file/batch) transcription mode to ensure consistency in testing.
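As a concrete reference for the normalization step above, here is a minimal sketch. The punctuation and casing rules come from the description above; the whitespace cleanup and the use of Unicode category P* for punctuation are assumptions, as the benchmark's exact character classes are not specified:

```python
import re
import unicodedata

def normalize(text: str) -> str:
    """Strip punctuation and casing before scoring, leaving words intact.
    Unicode category P* approximates "punctuation"; the benchmark's exact
    rules are not specified, so this is an approximation."""
    text = "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
    return re.sub(r"\s+", " ", text).strip().lower()

print(normalize("Well, it's 3 PM -- time to go!"))  # "well its 3 pm time to go"
```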
Models evaluated
| Provider | Model evaluated |
|---|---|
| Soniox | stt-async-preview |
| OpenAI | Whisper large-v3 |
| Google | long (for supported languages), chirp_2 (for other languages) |
| AWS | Best/Default |
| Azure | Best/Default |
| NVIDIA | conformer-{lang}-asr-offline-asr-bls-ensemble (for supported languages), parakeet-1.1b-unified-ml-cs-universal-multi-asr-offline-asr-bls-ensemble (for other languages) |
| Deepgram | nova-3 (for English), nova-2 (for other languages) |
| AssemblyAI | best (for supported languages), nano (for other languages) |
| Speechmatics | enhanced |
| ElevenLabs | scribe_v1 |
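To keep a comparison like this fair across ten different APIs, each provider can sit behind one common interface so that dataset handling, normalization, and scoring are identical for all of them. The sketch below is hypothetical: the Transcriber, DummyTranscriber, and score_provider names are invented for illustration, not any provider's actual SDK, and it reuses the normalize() and wer() helpers sketched earlier:

```python
from dataclasses import dataclass
from typing import Protocol

class Transcriber(Protocol):
    """Common interface each provider integration implements (illustrative)."""
    name: str
    def transcribe_file(self, audio_path: str) -> str: ...

@dataclass
class DummyTranscriber:
    """Stand-in for a real integration, e.g. an HTTP client that uploads
    the file, polls the provider's async job, and returns the transcript."""
    name: str
    def transcribe_file(self, audio_path: str) -> str:
        return "hypothetical transcript"

def score_provider(t: Transcriber, dataset: list[tuple[str, str]]) -> float:
    """Mean WER of one provider over (audio_path, reference) pairs,
    using the normalize() and wer() helpers sketched above."""
    errors = [wer(normalize(ref), normalize(t.transcribe_file(path)))
              for path, ref in dataset]
    return sum(errors) / len(errors)
```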
This evaluation provides a transparent and rigorous comparison of speech recognition performance across industry-leading providers for 60 languages.