1. Speech Recognition Accuracy
- What is the Word Error Rate (WER) for your specific use case?
- Domains: medical, contact center, sales, legal, etc.
- Channels: telephony, video conference, in-person or hybrid meetings
- Background noise: chatter, traffic, music, applause, cheers, etc.
- Interjections and crosstalk
- Speaking formats: monologues, 1:1 conversations, group conversations, presentations, dictations
- Audio quality
- Be sure to benchmark against double-reviewed, normalized ground-truth transcriptions
- Can the accuracy be improved with customization?
- Does the accuracy hold up in real-time vs batch processing?
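To make the WER question above concrete, here is a minimal sketch of how WER is commonly computed: the word-level edit distance (substitutions + deletions + insertions) divided by the number of words in the reference. The function names `normalize` and `word_error_rate` are illustrative, and the normalization here (lowercasing and stripping punctuation) is an assumed, deliberately simple one; a real benchmark would likely also normalize numbers, abbreviations and spelling variants.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation (except apostrophes) with spaces, split into words."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)
    return text.split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "The patient takes ten milligrams daily."
    hypothesis = "the patient took ten milligrams"
    # One substitution (takes -> took) and one deletion (daily) over 6 reference words.
    print(f"WER: {word_error_rate(reference, hypothesis):.3f}")
```

Running the same function over transcripts from each domain, channel and noise condition listed above gives a per-condition WER rather than a single blended number.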
2. Comprehension Quality
- Is the punctuation accurate?
- Is the capitalization accurate?
- Are numbers and units formatted properly?
- Example: “$7.72” vs. “seven dollars and seventy-two cents”
3. Real-Time Performance
- How high is the latency when transcribing live?
- How does latency impact user experience?
- How does the accuracy of non-final words compare to final words?
- Does latency increase further when a speaker talks fast?
- How does the accuracy in real time compare to batch processing?
4. Pricing
- Are there tiered speech-to-text models at different price points?
- Does the provider charge separately for each stream/channel, or only once for a single stream/channel?
- 1 hour of a 5-person meeting = 5 hours of billed audio, OR
- 1 hour of a 5-person meeting = 1 hour of billed audio
- Is the price quoted for real-time speech recognition or batch processing?
- Are there further discounts for non-urgent file processing (e.g. within 24 hours)?
- Are there volume discounts?
- Are there additional fees for features such as speaker diarization, speaker identification or content moderation?
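The per-channel versus single-stream billing question above is simple arithmetic, but it can change the bill several times over. The sketch below works through the 5-person meeting example; the function names and the rate of 0.50 per audio hour are hypothetical placeholders, not any provider's actual price.

```python
def billed_hours(meeting_hours: float, channels: int, per_channel: bool) -> float:
    """Audio hours billed for one meeting under the two billing models."""
    return meeting_hours * channels if per_channel else meeting_hours


def meeting_cost(meeting_hours: float, channels: int,
                 rate_per_hour: float, per_channel: bool) -> float:
    """Cost of one meeting at a given (hypothetical) rate per billed audio hour."""
    return billed_hours(meeting_hours, channels, per_channel) * rate_per_hour


if __name__ == "__main__":
    RATE = 0.50  # hypothetical price per billed audio hour
    # 1-hour meeting with 5 participants, each on a separate channel
    print(meeting_cost(1.0, 5, RATE, per_channel=True))   # 2.5 (5 billed hours)
    print(meeting_cost(1.0, 5, RATE, per_channel=False))  # 0.5 (1 billed hour)
```

When comparing quotes, it helps to run both models against your real meeting sizes, since a lower headline rate billed per channel can still cost more.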
5. Deployment Options
- What are the available deployment options?
- Public cloud API
- Private cloud API
6. Scalability & Reliability
- What is the expected uptime?
- Are the deployments multi-region/multi-continent?
- What is the maximum number of concurrent streams/files supported?
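An uptime percentage is easier to judge once it is converted into allowed downtime. The following is a minimal sketch of that conversion; the function name and the 30-day window are assumptions chosen for illustration, and a provider's SLA may define its measurement window differently.

```python
def allowed_downtime_minutes(uptime_percent: float, days: int = 30) -> float:
    """Minutes of downtime permitted over `days` at the given uptime percentage."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    total_minutes = days * 24 * 60
    return total_minutes * (1 - uptime_percent / 100)


if __name__ == "__main__":
    for uptime in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(uptime)
        print(f"{uptime}% uptime allows about {minutes:.1f} minutes of downtime per 30 days")
```

For example, 99.9% uptime over 30 days corresponds to roughly 43 minutes of downtime, which may or may not be acceptable for a live-captioning product.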
7. Ease of Integration
- How long is it going to take your developers to integrate with the speech-to-text provider?
- Are there SDKs in your chosen language?
8. Customization
- What type of customization does the provider support?
- Can you customize transcription on the fly, without any model retraining?
- Or, does the provider require additional data for model retraining?
- What is the accuracy gain of using customization?
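One way to quantify the accuracy gain from customization is the relative WER reduction between a baseline run and a customized run on the same test set. The sketch below assumes you have already measured both WER values; the function name and the example figures are hypothetical.

```python
def relative_wer_reduction(baseline_wer: float, customized_wer: float) -> float:
    """Fraction of baseline errors removed by customization (negative if it got worse)."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - customized_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical measurements: 20% WER before customization, 15% after.
    gain = relative_wer_reduction(0.20, 0.15)
    print(f"Relative WER reduction: {gain:.0%}")
```

Reporting the relative figure alongside the absolute one avoids confusion: a drop from 20% to 15% WER is 5 points absolute but a 25% relative reduction.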
9. Accuracy of Additional Features
- Does your application require additional features such as speaker separation (diarization), speaker identification or content moderation?
- If so, be sure to benchmark the performance of these features.
- It is one thing for a provider to say “Yes, we have feature X” and a completely different thing to actually deliver feature X with high accuracy.
- Example: most providers support speaker diarization, yet the accuracy of their speaker diarization is usually low, at around 75%.
10. Ownership of the Stack
- Does the speech-to-text provider use any third-party vendors under the hood?
- Was the product built from scratch?
- How quickly can the provider deliver new features?