Soniox

November 21, 2022 by Cathy Xi, Klemen Simonic

How to Choose a Speech-to-Text Provider?

1. Speech Recognition Accuracy

  • What is the Word Error Rate (WER) for your specific use case?
    • Domains: medical, contact center, sales, legal, etc.
    • Channels: telephony, video conference, in-person or hybrid meetings
    • Accents
    • Background noise: chatter, traffic, music, applause, cheers, etc.
    • Interjections & crosstalk
    • Speaking formats: monologues, 1:1 conversations, group conversations, presentations, dictations
    • Audio quality
  • Be sure to use double-reviewed and normalized ground truth transcriptions when benchmarking
  • Can the accuracy be improved with customization?
  • Does the accuracy hold up in real-time processing as well as in batch processing?
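
To make the WER question concrete, here is a minimal sketch of how WER could be computed on your own test set. It assumes the standard definition (substitutions + deletions + insertions, divided by the number of reference words) and a deliberately simple normalization step (lowercasing and stripping punctuation). The function names `normalize` and `word_error_rate` are illustrative, not part of any provider's API, and a real benchmark would likely need richer normalization (numbers, abbreviations, spelling variants).

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation (except apostrophes) with spaces, split into words."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)
    return text.split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between hypothesis and reference, divided by reference length."""
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion (reference word missing)
                curr[j - 1] + 1,                  # insertion (extra hypothesis word)
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "The cat sat on the mat."
    hypothesis = "the cat sit on mat"
    print(f"WER: {word_error_rate(reference, hypothesis):.2%}")
```

Running the same function over the ground truth and each provider's output, split by domain, channel, and accent, gives comparable numbers per condition rather than a single headline figure.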

2. Comprehension Quality

  • Is the punctuation accurate?
  • Is the capitalization accurate?
  • Are numbers and units formatted properly?
    • Example: “$7.72” vs “seven dollars and seventy-two cents”

3. Real-Time Performance

  • How high is the latency when transcribing live?
  • How does latency impact user experience?
  • How does the accuracy of non-final words compare to final words?
  • Does latency increase further when a speaker talks fast?
  • How does the accuracy in real-time compare to batch processing?

4. Pricing

  • Are there different tiered speech-to-text models at different price points?
  • Does the provider charge for each stream/channel separately, or only for a single combined stream/channel?
    • Example:
      1. 1 hour of a 5-person meeting = 5 hours of audio charged, OR
      2. 1 hour of a 5-person meeting = 1 hour of audio charged
  • Is the price quoted for real-time speech recognition or batch processing?
  • Are there further discounts for non-urgent file processing (e.g. within 24 hours)?
  • Are there volume discounts?
  • Are there additional fees for features such as speaker diarization, speaker identification or content moderation?
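
The per-channel billing question is easiest to see as arithmetic. The sketch below compares the two billing models from the 5-person meeting example. The function names and the price of 2.0 per audio hour are placeholders chosen for illustration, not a quote from any provider.

```python
def billed_hours(meeting_hours: float, channels: int, per_channel: bool) -> float:
    """Audio hours charged for one meeting under the given billing model."""
    if per_channel:
        # Every stream/channel is billed separately.
        return meeting_hours * channels
    # All streams/channels are billed as one.
    return meeting_hours


def meeting_cost(meeting_hours: float, channels: int, per_channel: bool,
                 price_per_hour: float) -> float:
    """Total charge for one meeting; price_per_hour is a hypothetical list price."""
    return billed_hours(meeting_hours, channels, per_channel) * price_per_hour


if __name__ == "__main__":
    hypothetical_price = 2.0  # placeholder price per audio hour
    for per_channel in (True, False):
        cost = meeting_cost(1.0, 5, per_channel, hypothetical_price)
        model = "per channel" if per_channel else "single stream"
        print(f"{model}: {cost:.2f}")
```

At the same quoted hourly rate, the per-channel model costs five times as much for this meeting, so the headline price alone is not enough to compare providers.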

5. Deployment Options

  • What are the available deployment options?
    • Public cloud API
    • Private cloud API
    • On-premises
    • On-device

6. Scalability & Reliability

  • What is the expected uptime?
  • Are the deployments multi-region/multi-continent?
  • What is the maximum number of concurrent streams/files supported?
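
An uptime percentage is easier to judge once it is converted into allowed downtime. The following is a small sketch of that conversion, assuming a 30-day period by default. The function name is illustrative, and how a provider actually measures uptime (and over what window) should be checked in its service-level agreement.

```python
def allowed_downtime_minutes(uptime_percent: float, period_days: int = 30) -> float:
    """Minutes of downtime permitted in the period at the given uptime percentage."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - uptime_percent) / 100


if __name__ == "__main__":
    for uptime in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(uptime)
        print(f"{uptime}% uptime allows about {minutes:.1f} minutes of downtime per 30 days")
```

For instance, 99.9% uptime still allows roughly 43 minutes of downtime in a 30-day month, which matters if your application transcribes live calls.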

7. Ease of Integration

  • How long will it take your developers to integrate with the speech-to-text provider?
  • Are there SDKs in your chosen language?

8. Customization

  • What type of customization does the provider support?
  • Can you customize transcription on the fly, without any model retraining?
  • Or does the provider require additional data for model retraining?
  • How much accuracy do you gain by using customization?

9. Accuracy of Additional Features

  • Does your application require additional features such as speaker separation (diarization), speaker identification or content moderation?
    • If so, be sure to benchmark the performance of these features.
  • It is one thing for a provider to say “Yes, we have feature X” and a completely different thing to actually deliver feature X with high accuracy.
    • Example: most providers support speaker diarization, yet the accuracy of their speaker diarization is usually low, around 75%.

10. Ownership of the Stack

  • Does the speech-to-text provider use any third-party vendors under the hood?
  • Was the product built from scratch?
  • How quickly can the provider deliver new features?