Our goal was to build a highly accurate speech recognition system with no or very little human labeling. If that were possible, the cost of building a fairly accurate speech recognition system would be zero or extremely low. Furthermore, one could apply the same approach to other languages and accurately recognize speech in languages where no system exists or where existing systems do not work well.
To provide more context, let us take a quick look at the typical process of building a state-of-the-art speech recognition system:
The major drawbacks of this approach are:
It takes a significant amount of time (one or more years) to collect a large amount of labeled data. Humans have to listen to the audio and write down or verify every word that is said. This labeling process is very expensive, and not many organizations can afford to pay for the labeled data.
Human knowledge is vast and consists of a large number of different words. Limited labeled data covers only a small subset of the vocabulary. Thus many topics and domains of human knowledge are not well represented; many words do not occur in the labeled data or occur only a small number of times.
Even with an extremely large budget for labeling data (millions of dollars), one can collect only a relatively small amount of labeled data compared to the “infinite” amount of unlabeled audio and text data that is available on the internet.
The same process has to be repeated for every single language. The cost of collecting labeled data scales linearly with the number of languages. This may be feasible for the top languages, but it is prohibitively expensive for many others.
Due to the relatively small amount of labeled data, the diversity of audio is limited. Typically, the labeled data includes only a small number of noises, speakers, and accents compared to real-world acoustic scenarios.
These were the key factors that motivated our research work. Our goal was to explore whether it is possible to build a fairly accurate speech recognition system with no or very little human labeling.
We invented a novel approach to building speech recognition systems on unlabeled data.
We built the Soniox speech model with no human-labeled data, i.e., we did not pay a single dollar to label the data used to train the speech model.
The model has been exposed to real-world acoustic conditions, which include tens of thousands of speakers with accents from most countries in the world and a huge variety of background noise.
It is an iterative, never-ending learning process, with each iteration yielding higher-accuracy models. The model keeps improving automatically with more training iterations.
The approach is language agnostic and can be applied to non-English languages.
The model is trained on a vast amount of human knowledge including people, places, geography, education, technology, engineering, medicine, health, law, science, art, history, food and sports.
It achieves super-human recognition in many domains when compared to humans who are not experts in those domains.
We thoroughly benchmarked our system across many different domains and audio conditions against the leading speech-to-text system. The Soniox speech recognition system has, on average, 24% higher accuracy.
In other words, of the words that the leading speech-to-text system misrecognizes, the Soniox speech system correctly recognizes 24% on average.
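To make the arithmetic behind such a figure concrete, here is a minimal sketch that assumes the 24% is a relative reduction in word errors compared to the baseline system. The word and error counts below are hypothetical and chosen only for illustration; they are not taken from our benchmark, and the function name is likewise an illustrative choice.

```python
def relative_error_reduction(baseline_errors: int, system_errors: int) -> float:
    """Fraction of the baseline's word errors that the other system avoids."""
    if baseline_errors <= 0:
        raise ValueError("baseline_errors must be positive")
    return (baseline_errors - system_errors) / baseline_errors


# Hypothetical counts for illustration only (not benchmark data).
total_words = 1000      # words in the reference transcripts
baseline_errors = 100   # words misrecognized by the baseline system
system_errors = 76      # words misrecognized by the compared system

reduction = relative_error_reduction(baseline_errors, system_errors)
baseline_accuracy = 1 - baseline_errors / total_words
system_accuracy = 1 - system_errors / total_words

print(f"relative error reduction: {reduction:.0%}")
print(f"word accuracy: {baseline_accuracy:.1%} vs {system_accuracy:.1%}")
```

Under this reading, a 24% relative reduction in errors corresponds to a smaller change in absolute word accuracy (in the hypothetical numbers above, from 90.0% to 92.4%), which is why it matters which of the two quantities a reported percentage refers to.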