Soniox Launches Near Error-Free Speech Recognition for Free and for Anyone to Use

Speech recognition systems are typically built in two stages. The first stage consists of collecting a large amount of labeled speech data: audio paired with transcripts produced by humans. Typically, 10,000 hours or more of labeled audio is required to build a reasonably good speech recognition system, which makes the collection process extremely expensive and time-consuming (millions of dollars over one or more years). There are also many other limitations that come with this approach*. In the second stage, a speech recognition model is trained on the collected labeled data.

We invented a novel approach to training speech recognition models on unlabeled data. It is an iterative process, with each iteration yielding higher-accuracy models. We built a Soniox speech model with no human-labeled data in less than 6 months.

The Soniox speech recognition system has, on average, 24% higher recognition accuracy than the leading speech-to-text system in the industry. In practice, 24% more spoken words are recognized correctly, words that other speech-to-text systems misrecognize.

Our speech system is near error-free on most domains of human knowledge, including people, places, geography, education, technology, engineering, medicine, health, law, science, art, history, food and sports. It achieves super-human recognition on many domains when compared to human listeners who are not experts in those domains. The speech system robustly recognizes speakers with accents from most countries in the world. It works well under challenging acoustic conditions, including strong background noise and far-field recording environments. It is one extremely accurate speech model for all kinds of audio.

We decided to offer our near error-free speech recognition for free for anyone to use. With a Soniox account, the user gets 5 hours per month of free speech recognition. These free hours can be used in the Soniox Cloud web application to transcribe live audio from the microphone or to upload and transcribe files. Developers can also use these hours to transcribe audio through the Soniox API.

We believe that Soniox speech recognition should be ubiquitous and accessible to everyone. With Soniox Web Voice, anyone can embed real-time, low-latency speech recognition right into their website or web application. We offer an unlimited number of free recognition sessions of up to 30 seconds each. Soniox Web Voice is a unique product that enables developers to make their websites and web applications accessible and interactive via voice.

Integrating speech recognition into applications is often difficult and error-prone. There are two main reasons: it requires expertise in audio and speech processing, and it is non-trivial to write bug-free concurrent code that simultaneously captures audio, sends requests to the server, and handles the server's responses. To address this, we wrote Soniox Docs, which include a multi-part tutorial that explains all the core concepts, data structures and usage patterns of the Soniox speech recognition API. We also built an easy-to-use Python client library: with only a few lines of code, the developer can transcribe files or audio streams in real-time, low-latency scenarios. We also provide solutions for the most common speech recognition use cases, which developers can use as templates when building their own applications.
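To give a feel for the concurrency problem described above, here is a minimal sketch using only the Python standard library. It is not the Soniox client library or API: the names capture_audio, fake_server, run_pipeline and the CHUNK_BYTES value are hypothetical, and the "server" is a local stand-in that merely reports the size of each chunk it receives. The sketch shows one common pattern, in which a capture thread and a recognizer thread communicate through queues while the main thread handles responses.

```python
import queue
import threading

# Hypothetical chunk size: 100 ms of 16 kHz, 16-bit mono audio.
CHUNK_BYTES = 3200
# Sentinel object that marks the end of a stream on either queue.
_END = None


def capture_audio(audio: bytes, requests: queue.Queue) -> None:
    """Producer: slice audio into chunks, as a microphone callback might."""
    for start in range(0, len(audio), CHUNK_BYTES):
        requests.put(audio[start:start + CHUNK_BYTES])
    requests.put(_END)


def fake_server(requests: queue.Queue, responses: queue.Queue) -> None:
    """Stand-in for a streaming recognizer (an assumption, not a real API)."""
    index = 0
    while True:
        chunk = requests.get()
        if chunk is _END:
            break
        responses.put(f"chunk {index}: {len(chunk)} bytes")
        index += 1
    responses.put(_END)


def run_pipeline(audio: bytes) -> list[str]:
    """Run capture and recognition concurrently and collect the responses."""
    # Bounded queue: capture blocks instead of buffering without limit.
    requests: queue.Queue = queue.Queue(maxsize=8)
    responses: queue.Queue = queue.Queue()
    threads = [
        threading.Thread(target=capture_audio, args=(audio, requests)),
        threading.Thread(target=fake_server, args=(requests, responses)),
    ]
    for thread in threads:
        thread.start()

    # The main thread consumes responses while the other two keep running.
    results = []
    while (item := responses.get()) is not _END:
        results.append(item)

    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    for line in run_pipeline(bytes(8000)):
        print(line)
```

Even in this toy version, correct shutdown depends on passing an end-of-stream marker through both queues and joining every thread. A real integration would also have to handle network errors, timeouts and partial results, which is the kind of detail a client library can hide.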

Our mission is to accelerate the adoption of speech-based applications and spark innovation in human-machine voice interaction. We strongly believe we have made a significant contribution in this direction and will continue to do so as we release new technologies that will change the way speech and voice applications are built.