We believe the first step in audio understanding is speech recognition. Current “speech-to-text” solutions require significant investment to obtain large amounts of paired audio-transcript data, because humans have to transcribe the audio manually. This makes building an accurate speech recognition system extremely time-consuming and expensive. The same process has to be repeated for each language, which makes it infeasible to scale to all spoken languages. It was critical for us to rethink how speech recognition AI is built.
We invented a novel approach to training speech recognition AI systems. Our speech AI learns from vast amounts of unlabeled audio and unlabeled text that are publicly available on the internet. It learns to recognize words by exploring different interpretations of spoken words in unlabeled audio and their usage in unlabeled written text. Our speech AI has learned to recognize most English words nearly error-free, without any direct human supervision.
Accurately recognizing speech in all kinds of environments is only the first step in our mission. When we speak, we use different emotions, intonations, and spacing between words. All of this is valuable information “hidden” in the audio.
We took the first step in this direction by developing acoustic sentence boundary detection. Sentence boundary or punctuation models (models that add missing punctuation) typically work exclusively on the recognized words (text) and do not take the underlying acoustics into consideration. However, to accurately recognize sentence boundaries in speech, it is necessary to include acoustic information as well.
Our speech AI detects sentence boundaries based on acoustic information. It leverages the intonation and spacing of the spoken words to recognize a sentence boundary and its type (period or question mark). For example, our speech AI can detect the spoken difference between “really?” and “really.” We believe this is the first AI of its kind in the industry.
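To make the idea concrete, here is a minimal sketch of how spacing and intonation could mark a sentence boundary and its type. It is a toy heuristic for illustration only, not our model: the `Word` fields, the `pitch_slope` feature, the `punctuate` function, and both thresholds are hypothetical and assume that word timings and a pitch estimate are already available.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float        # word start time in seconds
    end: float          # word end time in seconds
    pitch_slope: float  # hypothetical feature: pitch change (Hz/s) at the end of the word


def punctuate(words, min_pause=0.5, rising_slope=20.0):
    """Add '.' or '?' after words that are followed by a long pause.

    Both thresholds are illustrative assumptions, not tuned values.
    """
    out = []
    for i, w in enumerate(words):
        is_last = i == len(words) - 1
        # Treat the end of the audio as an infinitely long pause.
        pause = float("inf") if is_last else words[i + 1].start - w.end
        if pause >= min_pause:
            # Rising pitch suggests a question, otherwise a statement.
            mark = "?" if w.pitch_slope >= rising_slope else "."
            out.append(w.text + mark)
        else:
            out.append(w.text)
    return " ".join(out)


statement = [Word("really", 0.0, 0.4, -35.0)]
question = [Word("really", 0.0, 0.4, 60.0)]
print(punctuate(statement))
print(punctuate(question))
```

The same word receives a different mark depending only on its acoustic features, which a text-only punctuation model cannot see.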
Recognizing and identifying speakers is also extremely valuable for many downstream applications. We are developing a novel self-supervised approach to speaker diarization and speaker identification.
Detecting audio events (e.g. dog barking, clapping) can help us better understand the speaker's surroundings. We plan to develop AI models that detect thousands of real-world audio events in the background environment.
Our mission is not just to recognize various kinds of information in audio, but to deeply understand the audio content. To achieve this, we need to couple the information recognized from the audio with natural language understanding. Spoken language has different characteristics and carries more information (timing, intonation, speaker) than written language. Most current natural language AI has been developed for written language only. There is large potential to improve the understanding of spoken language by coupling the audio information with the proper natural language AI models.
Audio should be processed into a structured form so that it is accessible and useful for all kinds of downstream applications. Making audio searchable is just the first step. The main challenge is converting audio into a “universal” representation that can be used across different applications and offers added value to the consumer. We have started to explore different audio and text representations, coupled with smart indexing, that enable “deep insights” into audio content.
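As a small illustration of the “searchable” first step, the sketch below builds an inverted index from recognized words to their timestamps. It is an assumption-laden example rather than a description of our indexing: the `(word, start_time)` transcript format and the `build_index` and `search` helpers are hypothetical.

```python
from collections import defaultdict


def build_index(transcript):
    """Map each lowercased word to the times (in seconds) at which it is spoken."""
    index = defaultdict(list)
    for word, start in transcript:
        index[word.lower()].append(start)
    return index


def search(index, query):
    """Return every timestamp at which the query word occurs (empty list if none)."""
    return index.get(query.lower(), [])


# Hypothetical recognizer output: (word, start time in seconds).
transcript = [("Our", 0.0), ("speech", 0.3), ("AI", 0.8), ("detects", 1.1), ("speech", 2.4)]
index = build_index(transcript)
hits = search(index, "Speech")
print(hits)
```

A richer representation would attach speaker, punctuation, and audio-event fields to each entry, but a word-to-time lookup is already enough to jump to the point in a recording where a term is spoken.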