Soniox | How to Choose a Speech-to-Text Provider?

1. Speech Recognition Accuracy

What is the Word Error Rate (WER) for your specific use case?
- Domains: medical, contact center, sales, legal, etc.
- Channels: telephony, video conference, in-person or hybrid meetings
- Accents
- Background noise: chatter, traffic, music, applause, cheers, etc.
- Interjections & crosstalk
- Speaking formats: monologues, 1:1 conversations, group conversations, presentations, dictations
- Audio quality
Be sure to use double-reviewed and normalized ground truth transcriptions when benchmarking.
Can the accuracy be improved with customization?
Does the accuracy hold up in real-time vs batch processing?
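
As a rough illustration of the benchmarking step above, the following is a minimal sketch of a WER calculation. It assumes a very simple normalization (lowercasing and stripping punctuation); the function names `normalize` and `word_error_rate` are hypothetical, and a real benchmark would likely need a richer normalizer, for example one that handles numbers and abbreviations consistently.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, and split into words.

    This is an assumed, deliberately simple normalization; apostrophes are
    kept so that contractions such as "don't" stay a single word.
    """
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)
    return text.split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only one row at a time.
    # prev[j] holds the distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One of five reference words is missing from the hypothesis.
    print(word_error_rate("The patient has a fever.", "the patient has fever"))
```

Running the same function over recordings from each domain, channel, and noise condition listed above gives a per-condition WER rather than a single average that can hide weak spots.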

Is the punctuation accurate?
Is the capitalization accurate?
Are numbers and units formatted properly?
- Example: “$7.72” vs “seven dollars and seventy-two cents”

Are there different tiered speech-to-text models at different price points?
Does the provider charge for each stream/channel separately, or only for a single combined stream/channel?
- Example:
  1. 1 hour of a 5-person meeting = 5 hours of billed audio, OR
  2. 1 hour of a 5-person meeting = 1 hour of billed audio
Is the price quoted for real-time speech recognition or batch processing?
Are there further discounts for non-urgent file processing (e.g. within 24 hours)?
Are there volume discounts?
Are there additional fees for features such as speaker diarization, speaker identification or content moderation?
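
The difference between per-channel and single-stream billing in the 5-person meeting example can be worked through with a short sketch. The function names and the rate of 0.50 per audio hour are hypothetical placeholders chosen for illustration, not any provider's actual pricing.

```python
def billed_hours(duration_hours: float, channels: int, per_channel_billing: bool) -> float:
    """Return the audio hours a provider would bill for one recording."""
    if duration_hours < 0 or channels < 1:
        raise ValueError("duration must be non-negative and channels at least 1")
    # Per-channel billing multiplies the duration by the number of channels;
    # single-stream billing charges the duration once.
    return duration_hours * channels if per_channel_billing else duration_hours


def recording_cost(duration_hours: float, channels: int,
                   per_channel_billing: bool, rate_per_hour: float) -> float:
    """Cost of one recording at an assumed flat hourly rate."""
    return billed_hours(duration_hours, channels, per_channel_billing) * rate_per_hour


if __name__ == "__main__":
    HYPOTHETICAL_RATE = 0.50  # assumed price per audio hour, for illustration only
    for per_channel in (True, False):
        hours = billed_hours(1, 5, per_channel)
        cost = recording_cost(1, 5, per_channel, HYPOTHETICAL_RATE)
        print(f"per_channel_billing={per_channel}: {hours} billed hours, cost {cost:.2f}")
```

At the same quoted hourly rate, the per-channel model is five times more expensive for this meeting, which is why the billing unit matters as much as the headline price.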

What are the available deployment options?
- Public cloud API
- Private cloud API
- On-premises
- On-device

How long will it take your developers to integrate with the speech-to-text provider?
Are there SDKs for your chosen programming language?

Does your application require additional features such as speaker separation (diarization), speaker identification or content moderation?
- If so, be sure to benchmark the performance of these features.
It is one thing for a provider to say “Yes, we have feature X” and a completely different thing to actually deliver feature X with high accuracy.
- Example: most providers support speaker diarization, yet the accuracy of their speaker diarization is usually low, around 75%.
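
One simple way to benchmark diarization yourself is to count the fraction of words attributed to the correct speaker. The sketch below makes several assumptions: the reference and hypothesis transcripts are already aligned word for word, each word has exactly one speaker label, and the number of speakers is small enough to try every label mapping. The function name `speaker_attribution_accuracy` is hypothetical, and this word-level measure is a simplification rather than the time-based diarization error rate often used in formal evaluations.

```python
from itertools import permutations


def speaker_attribution_accuracy(ref_labels: list[str], hyp_labels: list[str]) -> float:
    """Fraction of words whose speaker is correct under the best label mapping.

    Provider labels (e.g. "1", "2") are arbitrary, so every assignment of
    hypothesis speakers to reference speakers is tried and the best is kept.
    """
    if not ref_labels or len(ref_labels) != len(hyp_labels):
        raise ValueError("label lists must be non-empty and of equal length")

    ref_speakers = sorted(set(ref_labels))
    hyp_speakers = sorted(set(hyp_labels))
    # Pad with None so that extra hypothesis speakers can remain unmapped.
    extra = max(0, len(hyp_speakers) - len(ref_speakers))
    targets = ref_speakers + [None] * extra

    best_correct = 0
    for assignment in permutations(targets, len(hyp_speakers)):
        mapping = dict(zip(hyp_speakers, assignment))
        correct = sum(mapping[h] == r for r, h in zip(ref_labels, hyp_labels))
        best_correct = max(best_correct, correct)
    return best_correct / len(ref_labels)


if __name__ == "__main__":
    reference = ["A", "A", "B", "B"]    # ground truth speaker for each word
    hypothesis = ["1", "1", "2", "1"]   # provider's speaker label for each word
    print(speaker_attribution_accuracy(reference, hypothesis))
```

Because the search tries every mapping, it is only practical for a handful of speakers; larger meetings would need an assignment algorithm instead of brute force.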