Speech-to-text benchmarks 2025

Soniox conducted a comprehensive evaluation of the transcription accuracy of major speech recognition providers. The benchmarking results are summarized as follows:

  • Providers evaluated: Soniox, OpenAI, Google, AWS, Azure, NVIDIA, Deepgram, AssemblyAI, Speechmatics, and ElevenLabs.
  • Languages evaluated: 60 languages.
  • Evaluation datasets: Real-world datasets of YouTube videos for each language, covering diverse acoustic conditions, speaking styles, accents, topics, and speaker variations.
  • Ground truth transcriptions: Transcribed and double-reviewed by humans, then normalized to ensure a fair evaluation across providers.
  • Processing mode evaluated: Asynchronous (file/batch) transcription.
  • Results: Soniox achieved the highest speech recognition accuracy across most languages by a significant margin.

Evaluated providers and languages

To assess the accuracy of speech recognition providers across multiple languages, we conducted a rigorous benchmarking study using Word Error Rate (WER) and Character Error Rate (CER) as the primary evaluation metrics. These industry-standard metrics provide a quantitative measure of transcription accuracy, with lower values indicating superior performance. CER was used for the following languages: Korean, Chinese, Japanese, and Thai.
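
For readers who want to see the metrics concretely, below is a minimal, self-contained sketch of how WER and CER are typically computed from Levenshtein edit distance; the transcripts shown are illustrative examples, not samples from the benchmark data.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (of words or characters)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,         # deletion
                dp[j - 1] + 1,     # insertion
                prev + (r != h),   # substitution (free if tokens match)
            )
    return dp[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edit distance / reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

print(wer("the quick brown fox", "the quick brown dog"))  # 0.25 (1 error / 4 words)
print(cer("speech", "speach"))                            # 0.1666... (1 error / 6 chars)
```

Lower is better for both metrics; CER sidesteps tokenization issues in languages where word boundaries are not marked by spaces.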

Our evaluation was based on 45 to 70 minutes of real-world audio per language, sourced from YouTube to ensure a diverse and challenging dataset. The selected samples encompass various acoustic conditions, speaking styles, accents, and topics, providing a robust assessment of model performance in real-world scenarios. Ground truth transcriptions were carefully transcribed and double-reviewed by humans, then normalized to ensure consistency and fairness across all providers.

Evaluation process

  1. Dataset selection: For each language, real-world YouTube videos were chosen to reflect diverse speech patterns, varying acoustic conditions, and multiple speaker types.
  2. Transcription & ground truth creation: All dataset transcriptions were manually created, double-reviewed by humans, and normalized to provide a consistent reference for evaluation. Normalization consisted of removing punctuation and ignoring capitalization; otherwise, the ground truth transcription remained unchanged (a minimal sketch of this step follows the list).
  3. Model integration: Each provider's API was carefully integrated according to official documentation, ensuring a fair and accurate comparison.
  4. Evaluation metrics:
    • WER (Word Error Rate): Measures transcription errors at the word level.
    • CER (Character Error Rate): Used for logographic or non-space-separated languages to provide a finer-grained accuracy measurement.
  5. Processing mode: All models were evaluated in asynchronous (file/batch) transcription mode to ensure consistency in testing.
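
To make step 2 concrete, here is a minimal sketch of the kind of normalization described above (punctuation removed, capitalization ignored). The function name and whitespace handling are our own illustrative choices, not a description of Soniox's exact pipeline.

```python
import string

def normalize(transcript: str) -> str:
    """Lowercase the text and strip punctuation; words are otherwise unchanged."""
    lowered = transcript.lower()
    # str.maketrans with a third argument deletes each ASCII punctuation mark.
    # A multilingual pipeline would also handle language-specific punctuation.
    stripped = lowered.translate(str.maketrans("", "", string.punctuation))
    # Collapse whitespace runs left behind by removed punctuation.
    return " ".join(stripped.split())

print(normalize("Hello, world!  It's 2025."))
# -> "hello world its 2025"
```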

Models evaluated

Provider       Model evaluated
Soniox         stt-async-preview
OpenAI         Whisper large-v3
Google         long (for supported languages),
               chirp_2 (for other languages)
AWS            Best/Default
Azure          Best/Default
NVIDIA         conformer-{lang}-asr-offline-asr-bls-ensemble (for supported languages),
               parakeet-1.1b-unified-ml-cs-universal-multi-asr-offline-asr-bls-ensemble (for other languages)
Deepgram       nova-3 (for English),
               nova-2 (for other languages)
AssemblyAI     best (for supported languages),
               nano (for other languages)
Speechmatics   enhanced
ElevenLabs     scribe_v1
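
Several providers required a per-language model choice. As an illustration only, the selection rules from the table can be encoded as a small lookup; the language codes and the ASSEMBLYAI_BEST set below are hypothetical placeholders, not the providers' actual coverage lists.

```python
# Hypothetical helpers encoding the per-language selection rules from the table.
DEEPGRAM_ENGLISH = {"en"}                    # nova-3 was used for English only
ASSEMBLYAI_BEST = {"en", "es", "fr", "de"}   # placeholder, not the real supported list

def deepgram_model(lang: str) -> str:
    """nova-3 for English, nova-2 for all other languages."""
    return "nova-3" if lang in DEEPGRAM_ENGLISH else "nova-2"

def assemblyai_model(lang: str) -> str:
    """best for supported languages, nano for the rest."""
    return "best" if lang in ASSEMBLYAI_BEST else "nano"

print(deepgram_model("en"), deepgram_model("sl"))      # nova-3 nova-2
print(assemblyai_model("de"), assemblyai_model("sw"))  # best nano
```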

This evaluation provides a transparent and rigorous comparison of speech recognition performance across industry-leading providers for 60 languages.

Transcription accuracy comparison of speech-to-text providers (2025)