October 15, 2024

Unleashing General Speech Intelligence to Empower Conversational AI

by Klemen Simonic

Advances in artificial intelligence are already transforming interactions between humans and machines. But even as this technology evolves, we are still only scratching the surface of how we communicate with AI through audio. This kind of conversational AI, which just a few years ago might have felt like something out of a sci-fi film, is coming to life today, and the future in which we seamlessly talk back and forth with machines is not far off.

However, to achieve this future, we need to rethink how we train AI and how these models interpret audio signals and produce similar outputs. Today, most AI models interpret audio by transcribing speech to text and then extracting meaning from text tokens rather than from the audio itself. This limits their capabilities. In the same way that text messages between friends are often misinterpreted, these models reason over the written word rather than speech, and they lack the accuracy and deeper context that come with tone, speech patterns, emotion, and other verbal signals. The result is AI that feels more robotic than natural and falls short of the true audio understanding and reasoning that we as humans take for granted.

We believe the missing piece for evolving conversational AI is general speech intelligence. That is, building AI that can natively understand audio and speech the way we as humans do, rather than by converting it to text to extract meaning. General speech intelligence is the next frontier for conversational AI, but it has eluded the industry to date.

When I joined Meta in 2015, I initially worked on natural language understanding, but that focus quickly shifted in 2016 when I joined the three-person Speech team. Our goal: build accurate and scalable speech recognition AI. Within months, we built the first production-ready prototype of its kind: a state-of-the-art automatic speech recognition (ASR) engine trained end-to-end without intermediate representations, such as phonemes. At the time, it was groundbreaking.

Our work laid a foundation for modern speech recognition and for how AI could be applied to the problem. Over four years, we scaled the engine across several languages and integrated it into the company's suite of products. However, I knew there was a better way to train ASR to be more accurate, more conversational, and more human-like. In 2020, I left Meta to explore the idea that AI could, in fact, converse with us with the same levels of intelligence, precision, context, and nuance that humans can. This idea would prove to be the foundation for Soniox.

The Art of Unsupervised Learning

Believe it or not, my goal after leaving Meta wasn't to start a company. I simply loved working with raw signal and sequential data and figuring out how it could be used to build general-purpose intelligence. I partnered with an old classmate from the University of Ljubljana and former Cosylab engineer, Ambroz Bizjak. Ambroz is a brilliant programmer and problem solver, and we spent the next 14 months on a self-funded research project.

We quickly discovered that almost every company building its own speech intelligence and recognition tools followed the same strategy: partner with another company, invest vast resources in transcribing and labeling data, and then use that data to train their systems. But there was never enough data, it couldn't scale across languages, and it was riddled with quality issues. I realized this bottleneck would stymie progress toward general speech intelligence, but that it could be solved with an unsupervised learning technique and a new approach to producing the data on which to train an algorithm.

Our next step was to build a system that could produce copious amounts of high-quality paired audio-text data for training speech AI models. We called this our data factory. Eventually, our data factory grew to hundreds of thousands of hours of audio and billions of words of text. As more of this data was used to train our model via unsupervised learning, the algorithm learned to structure, classify, and process audio at an impressive rate.

In 2021, we officially launched Soniox and our speech recognition product, Soniox Speech Recognition, built on our unsupervised learning technique and backed by the largest audio dataset on the market. At launch, the product surpassed everything else available: in a 2023 benchmark study on speech recognition, we conducted an extensive, multi-language evaluation of the accuracy of each leading speech recognition provider in the industry, including OpenAI, Google, AWS, Microsoft, and more. We provided each tool with real-world datasets spanning varying acoustic conditions, speaking styles, accents, and topics. The results: Soniox achieved the highest accuracy in English, 24% higher than the runner-up, OpenAI.

Over the past three years, Soniox has grown into a multi-million dollar business, serving customers like Samsung, DeepScribe, DeliverHealth, and many others. Leading companies across technology, healthcare, legal, and other industries rely on our speech recognition product each day and we continue to grow rapidly.

Introducing Omnio

State-of-the-art speech recognition AI proved to be an inflection point for us, demonstrating the power of our unsupervised learning approach and data factory, but it was only the start. Today, we're unveiling an all-new conversational AI solution: Omnio. The first release of Omnio supports the Chat Completion API, to be followed by the Omnio Real Time API, currently in beta.
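
To give a sense of how an application might talk to such an API, here is a minimal sketch of a chat-completions-style request carrying audio. Everything specific in it, including the endpoint URL, model name, payload fields, and authentication header, is an illustrative assumption rather than documented behavior; Soniox's own documentation defines the actual interface.

```python
import base64
import requests

# A minimal sketch of calling a chat-completions-style API with audio input.
# NOTE: the endpoint URL, model name, payload fields, and auth scheme below
# are illustrative assumptions, not documented values.

API_URL = "https://api.soniox.com/v1/chat/completions"  # assumed endpoint
API_KEY = "YOUR_API_KEY"

# Read an audio file and base64-encode it for transport in a JSON payload.
with open("meeting.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("ascii")

payload = {
    "model": "omnio",  # assumed model identifier
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "input_audio", "audio": audio_b64},  # assumed fields
                {"type": "text", "text": "Identify the speakers and summarize the key points."},
            ],
        }
    ],
}

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=120,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```

The appeal of this request shape is that audio sits alongside text in a single conversation turn, so the model reasons over the signal directly instead of over a separate transcription step.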

Omnio is the world’s first multimodal AI with general audio and speech intelligence. It natively processes the audio signal and provides human-like reasoning over audio and speech.

Omnio excels at identifying speakers, their roles, and even the nuances of their interactions, including emotions, sentiment, and speaking styles. Beyond words, Omnio also recognizes sounds and non-verbal cues, providing unprecedented comprehension of the auditory environment. Omnio can thoroughly understand the full audio experience, just like we as humans do.

Doctors can share a quick audio clip and Omnio will produce high-quality medical notes nearly instantaneously. Lawyers can drop in audio from depositions and quickly receive transcripts and briefs with sentiment and emotional analysis, complete with timestamped quotes as evidence. Educators can outline key points from their class lectures. Sales teams can ensure that every follow-up item is tracked neatly post-meeting. Customer service teams can use Omnio as a passive listening tool that provides feedback and quality assurance on an agent's performance, with specific examples from each recorded call.

Omnio sets the standard for a new class of audio understanding and reasoning with customizable speech recognition, speaker identification, emotion and intent analysis, audio summarization, and more. Just as we as humans process audio and sounds in all their richness, Omnio understands our natural, conversational style with context and nuance.

We're experiencing an exciting moment in AI's history, and we believe general speech intelligence is the next step forward. With Omnio, we're excited to unleash the next generation of conversational AI, empowering enterprise and consumer applications like never before.

For additional details about Omnio, visit https://soniox.com/blog/omnio.