September 13, 2023 by Soniox Team

Soniox Speech AI Achieves Extreme Accuracy

We have released groundbreaking speech recognition AI, achieving extreme levels of accuracy and unlocking new possibilities in human-machine interaction.


  • We have launched new foundational AI models for speech recognition, achieving extremely high accuracy rates.
  • Our AI models often surpass human performance, delivering more accurate speech recognition and generating properly formatted text.
  • Soniox’s speech recognition AI consistently outperforms OpenAI, Google, and other providers, with accuracy improvements ranging from 24% to 78%, making it a game-changer for voice and speech applications.
  • We have released the Soniox mobile app and Soniox Playground, allowing you to experience the new era of voice AI firsthand.

Engineering Breakthrough

Foundational AI breakthroughs are challenging to achieve in a startup environment due to the costs and complexity associated with processing and training large models on internet-scale data. However, we did not shy away from the challenge and built a ground-up infrastructure to efficiently process and train large models on massive amounts of audio and text.

Specifically, we processed over 1 million hours of audio data for training. The entire training process was completed on a single A100 server (8xA100 GPUs) in less than 4 weeks! This achievement in engineering innovation alone saved millions of dollars in processing and training costs.

Novel AI Models

We also had to design and implement new model architectures and training criteria. Why? Achieving high accuracy under low-latency constraints is one of the most challenging problems in AI today. The model has to constantly make decisions (e.g., output words) in real time while dealing with a high level of uncertainty and missing information. Existing neural networks fail to address this problem effectively.

To solve this problem, we had to address several shortcomings of transformer and convolutional architectures, designing new, more efficient architectures that inherently prioritize low-latency decision-making without sacrificing accuracy. Although we had been training these models for the past year, the improvements were incremental until the breakthrough moment about 6 months ago.

Inference Engine

We also had to develop our own inference engine to support streaming and low-latency processing with our proprietary neural networks. Our inference engine runs on CPUs and efficiently processes massive amounts of audio with commodity CPU servers. We are also working on power-efficient on-device deployment with one of the world’s largest mobile manufacturers.

Path Towards Human-Parity

In the last year, we have witnessed the release of speech recognition models from Google, Meta, and other companies that support one thousand or more languages. What all of these approaches fail to address is accuracy. Speech recognition is all about accuracy, period. Achieving human-parity or superhuman accuracy is of paramount importance; a solution with a misrecognition rate of 20% or higher is useless for most applications.

We are introducing highly accurate models for nine languages, starting with English and Korean. For many of these languages, this will mark the first introduction of highly accurate speech recognition AI. We are looking forward to collaborations with various companies worldwide. We believe this could represent a breakthrough moment for numerous voice and speech applications.

You can access the benchmark report here:

Why Does This Matter?

If you are in the call automation business, accurately recognizing every single word during phone calls is critical; otherwise, automation quickly fails, and the experience deteriorates.

If you are involved in creating documents from audio, such as in the medical and legal industries, then accurately recognizing domain-specific words and properly formatting the text is crucial for making transcriptionists more efficient and saving costs.

Additionally, there is a rising trend in human-machine voice interaction. We know that LLMs work well for text-based communication. The next step is voice communication with LLMs, which requires extremely high speech recognition accuracy and very low-latency responses. This has been the missing component, and it is what Soniox brings to the table with our new voice AI models.
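To make the pipeline concrete, here is a minimal sketch of one conversational voice turn. The three stage functions below are hypothetical stand-ins (not the Soniox API): in a real system they would call a streaming speech recognizer, an LLM, and a text-to-speech engine, and the recognition stage is where accuracy and latency make or break the experience.

```python
# Hypothetical stand-ins for the three stages of a voice-with-LLM pipeline.
# These are illustrative placeholders only, not real API calls.

def transcribe(audio_chunk: bytes) -> str:
    # Placeholder: a real recognizer would stream partial transcripts
    # in real time with low latency.
    return "what is the weather today"

def ask_llm(prompt: str) -> str:
    # Placeholder: a real LLM call would generate a response here.
    return f"You asked: {prompt!r}"

def speak(text: str) -> bytes:
    # Placeholder: a real TTS engine would synthesize audio here.
    return text.encode("utf-8")

def voice_turn(audio_chunk: bytes) -> bytes:
    """One conversational turn: speech -> text -> LLM -> speech.

    A misrecognized word at the first stage propagates through the
    whole turn, which is why recognition accuracy matters most.
    """
    text = transcribe(audio_chunk)
    reply = ask_llm(text)
    return speak(reply)
```

Because the stages run strictly in sequence, the user-perceived delay is the sum of the per-stage latencies, so a slow or inaccurate first stage degrades every turn.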

Try it Yourself

Download the Soniox mobile app and try out the new voice AI experience:

Try also the Soniox Playground, where you can quickly upload a file or transcribe a YouTube video:

Lastly, we have powerful and easy-to-use APIs. You can integrate your applications with Soniox in just a few minutes:
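As a rough illustration of what a minimal integration can look like, the sketch below assembles a one-shot transcription request and parses a JSON reply. The endpoint URL, header usage, and response field here are hypothetical placeholders, not the actual Soniox API; consult the official API documentation for the real interface.

```python
import json
import urllib.request

# Hypothetical endpoint for illustration only -- not the real Soniox API.
API_URL = "https://api.example.com/v1/transcribe"

def build_request(audio: bytes, api_key: str) -> urllib.request.Request:
    """Assemble a one-shot transcription request (shape is illustrative)."""
    return urllib.request.Request(
        API_URL,
        data=audio,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "audio/wav",
        },
        method="POST",
    )

def parse_response(body: bytes) -> str:
    """Extract the transcript from a JSON response of the assumed shape."""
    return json.loads(body)["text"]
```

The point is the overall shape: authenticate, send audio, read back text; with a real API key and the documented endpoint, an integration of this kind typically fits in a few dozen lines.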

We would love to hear about your experience and what you can do with our new speech AI models.