Self-Learning Speech Recognition AI

We have developed the world’s first self-learning speech recognition artificial intelligence that leverages vast amounts of unlabeled audio and unlabeled text to teach itself how to recognize speech. Our speech AI can accurately recognize speech in all kinds of real-world environments and on most topics of human knowledge.

Soniox Breakthrough AI Technology

Our goal was to build a highly accurate speech recognition system with little or no human labeling. If that were possible, the cost of building a fairly accurate speech recognition system would be zero or extremely low. Furthermore, one could apply the same approach to other languages and accurately recognize speech in languages where no system exists or where existing systems do not work well.


To provide more context, let us take a quick look at the typical process of building a state-of-the-art speech recognition system:

  1. The first stage consists of collecting a large amount of labeled data. To build a reasonably good system, about 10,000 hours or more of labeled data is required. Labeled data consists of an audio recording and its transcript. The transcript is typically obtained by asking humans to manually transcribe the audio, i.e., listen to the audio and write down or verify every single spoken word.
  2. Once a sufficient amount of labeled data has been collected, we train a speech recognition model on the collected labeled data.

The major drawbacks of this approach are:

1. Time Consuming and Expensive

It takes a significant amount of time (one or more years) to collect a large amount of labeled data. Humans have to listen to the audio and write down or verify every word that is said. This labeling process is very expensive, and not many organizations can afford to pay for the labeled data.

2. Small Amount of Data

Even with an extremely large labeling budget (millions of dollars), one can collect only a relatively small amount of labeled data compared to the “infinite” amount of unlabeled audio and text data that is available on the internet.

3. Limited Acoustics

Because the amount of labeled data is relatively small, the diversity of the audio is limited. Typically, the labeled data includes only a small number of noises, speakers, and accents compared to real-world acoustic scenarios.

4. Limited Vocabulary

Human knowledge is vast and consists of a large number of different words. Limited labeled data covers only a small subset of the vocabulary. Thus, many topics and domains of human knowledge are not well represented; many words do not occur in the labeled data or occur only a small number of times.

5. Limited Languages

The same process has to be repeated for every single language. The cost of collecting labeled data scales linearly with the number of languages. This may be feasible for the top languages, but it is prohibitively expensive for many others.

These were the key factors that motivated our research work. Our goal was to explore whether it is possible to build a speech recognition system with little or no human labeling that is still fairly accurate.

Soniox Solution

We invented a novel approach to building speech recognition systems on unlabeled data.

1. No Human Labeled Data

We built the Soniox speech model with no human-labeled data, i.e., we did not pay a single dollar to label data for training the speech model.

2. Iterative Process

It is an iterative, never-ending learning process, with each iteration yielding a more accurate model. The model keeps improving automatically with more training iterations.

3. Vast Amount of Data

The model is trained on a vast amount of human knowledge, including people, places, geography, education, technology, engineering, medicine, health, law, science, art, history, food, and sports.

4. Real-World Acoustics

The model has been exposed to real-world acoustic conditions, which include tens of thousands of speakers with accents from most countries in the world and a huge amount of background noise.

5. Language Agnostic

The approach is language agnostic and can be applied to non-English languages.

6. Super-Human Recognition

It achieves super-human recognition in many domains when compared to non-domain human experts.

Soniox Benchmark

We thoroughly benchmarked our system on many different domains and audio conditions against the leading speech-to-text system. On average, the Soniox speech recognition system has 24% higher accuracy.


On average, 24% more of the words are correctly recognized by the Soniox speech system, words that are otherwise misrecognized by the leading speech-to-text system.
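
To make the notion of "higher accuracy" concrete, here is a minimal sketch of one common way to score speech recognition output: word-level accuracy derived from the edit distance between a reference transcript and a system's output. The benchmark above does not state which metric or normalization was used, so the function names (`edit_distance`, `word_accuracy`, `relative_gain`), the lowercase whitespace tokenization, and the relative-gain formula are all illustrative assumptions, not the Soniox benchmark procedure.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref to hyp[:j]
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def word_accuracy(reference, hypothesis):
    """Assumed metric: 1 - (word errors / reference length).

    Can be negative if the hypothesis contains many extra words.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    return 1.0 - edit_distance(ref, hyp) / len(ref)


def relative_gain(acc_a, acc_b):
    """Relative accuracy improvement of system A over system B (assumed definition)."""
    return (acc_a - acc_b) / acc_b


# Made-up transcripts for illustration only; these are not benchmark data.
reference = "the quick brown fox jumps over the lazy dog"
system_a = "the quick brown fox jumps over the lazy dog"
system_b = "the quick brown box jumps over lazy dog"

acc_a = word_accuracy(reference, system_a)
acc_b = word_accuracy(reference, system_b)
print(f"A: {acc_a:.3f}  B: {acc_b:.3f}  relative gain: {relative_gain(acc_a, acc_b):.1%}")
```

In practice, scores like this would be averaged over many recordings per domain, and the result depends on how transcripts are normalized (punctuation, casing, numerals) before comparison.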