October 15, 2024

Introducing Omnio

The first AI model that can natively reason over audio like humans.

Omnio is the first multimodal AI model to comprehensively understand both conversations and human behavior through audio. Omnio excels at identifying speakers, their roles, and even the nuances of their interactions, including emotions, sentiment, and speaking styles. Beyond words, Omnio also recognizes sounds and non-verbal cues, providing unprecedented comprehension of the auditory environment.

In addition to processing audio, Omnio is also a powerful AI model for text reasoning. On text benchmarks, Omnio performs on par with GPT-4o and other leading models.

Capabilities

Healthcare: Create medical documentation

How it works

Most existing audio applications use speech-to-text technology to convert audio into text. This process often results in the loss of essential information from the audio, such as the identity of the speakers, their roles, tone, emotions, non-verbal cues, and background sounds.

In contrast, Omnio processes the audio signal directly and has been trained to recognize and understand foundational audio and speech concepts the way humans do. This capability enables deep understanding of audio, speech, and conversations, and opens up a whole new range of applications. We believe Omnio marks a significant step toward general speech and audio intelligence.
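To make the contrast concrete, here is a minimal, illustrative Python sketch (not Omnio's actual API; all field names are assumptions) of what a speech-to-text pipeline discards versus what an audio-native model can surface:

```python
from dataclasses import dataclass, field

# What a conventional speech-to-text pipeline returns: words only.
@dataclass
class Transcript:
    text: str

# What an audio-native model can additionally surface (illustrative fields).
@dataclass
class AudioUnderstanding:
    text: str
    speakers: list = field(default_factory=list)   # who is talking, and their roles
    emotions: dict = field(default_factory=dict)   # per-speaker tone and sentiment
    non_verbal: list = field(default_factory=list) # laughter, sighs, pauses
    background: list = field(default_factory=list) # ambient sounds

# The same exchange, seen through each lens.
stt_output = Transcript(text="I'm fine, really.")

native_output = AudioUnderstanding(
    text="I'm fine, really.",
    speakers=[{"id": "spk_1", "role": "patient"}],
    emotions={"spk_1": "hesitant, anxious"},
    non_verbal=["long pause before 'fine'"],
    background=["hospital intercom"],
)

# The words match, but the STT pipeline has no way to represent the rest.
print(stt_output.text == native_output.text)  # True
```

The point of the sketch is only that a transcript is a strict subset of what the audio carries: everything outside the `text` field is lost the moment audio is reduced to words.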

Benchmarks

Omnio is also a powerful AI model for text-to-text applications, performing on par with GPT-4o, Mistral Large 2, and Claude 3.5 Sonnet on text benchmarks.

The benchmarks shown in the plot were run with OpenAI's simple-evals library (MMLU, MATH, GPQA, DROP, HumanEval), while the remaining benchmarks were run with our in-house evaluation library (Arc-c, OpenBookQA, CommonsenseQA, AGIEval). At the time of this release, none of the other providers (OpenAI, Anthropic, Mistral) offered an audio-native AI model that we could compare against Omnio's audio capabilities.

Model             | MMLU  | MATH  | GPQA  | DROP  | HumanEval | Arc-c | OpenBookQA | CommonsenseQA | AGIEval
omnio             | 84.6% | 76.0% | 48.1% | 90.9% | 85.0%     | 95.6% | 97.2%      | 84.0%         | 72.2%
gpt-4o            | 88.3% | 76.0% | 52.4% | 90.2% | 90.2%     | 96.4% | 96.2%      | 81.4%         | 75.1%
claude-3-5-sonnet | 87.9% | 69.0% | 55.8% | 90.0% | 89.0%     | 96.0% | 96.6%      | 81.0%         | 67.5%
mistral-large-2   | 84.1% | 67.0% | 48.7% | 90.0% | 83.5%     | 96.0% | 95.2%      | 82.6%         | 64.0%

Ready for the enterprise now

Omnio is more than just a general AI model for audio and text. It supports a wide range of industry-specific tasks, enabling businesses and enterprises to use it reliably in their operations without requiring any fine-tuning or modifications.

Over the past four years, we have amassed extensive industry-specific knowledge and proprietary datasets from real-world operations. For example, in the healthcare industry, we have a vast collection of doctor dictations and doctor-patient conversations, which enabled us to build accurate and robust AI capabilities that align with the everyday needs of doctors and healthcare organizations.

As a result, Omnio is a uniquely powerful AI model that supports a range of industry-specific tasks with high accuracy and reliability and can be integrated into business workflows to drive real-world impact. To learn more about Omnio’s capabilities, watch the demo videos in the section above.

Availability & pricing

The Omnio API begins rolling out today in public beta to all developers, offering $5.00 in free credits.

Text capabilities in the Chat Completion API are powered by our new model omnio-chat-text-preview. Text input tokens are priced at $2.00 per 1M tokens, and text output tokens at $5.00 per 1M tokens.

Audio capabilities in the Chat Completion API are powered by our new model omnio-chat-audio-preview. The audio model takes both text tokens and audio tokens as input, and produces text tokens as output. Text input tokens are priced at $2.00 per 1M tokens, and text output tokens at $10.00 per 1M tokens. Audio input tokens are priced at $50.00 per 1M tokens, which equates to approximately $0.03 per minute of audio input.
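As a sanity check on the published rates, a short Python calculation shows how the per-minute audio cost falls out and estimates the cost of a hypothetical request. Only the prices quoted above are official; the tokens-per-minute figure and the sample request sizes are derived or assumed.

```python
# Published rates for omnio-chat-audio-preview (USD per 1M tokens).
TEXT_IN = 2.00
TEXT_OUT = 10.00
AUDIO_IN = 50.00

# $0.03 per minute of audio at $50.00 per 1M audio tokens implies
# roughly 600 audio tokens per minute (derived, not an official figure).
tokens_per_minute = 0.03 / (AUDIO_IN / 1_000_000)
print(tokens_per_minute)  # 600.0

def request_cost(audio_minutes, text_in_tokens, text_out_tokens):
    """Estimate the cost of one audio request in USD."""
    audio_tokens = audio_minutes * tokens_per_minute
    return (audio_tokens * AUDIO_IN
            + text_in_tokens * TEXT_IN
            + text_out_tokens * TEXT_OUT) / 1_000_000

# Hypothetical request: 10 minutes of audio, a 200-token text prompt,
# and a 500-token text answer.
print(round(request_cost(10, 200, 500), 4))  # 0.3054
```

At these rates, audio dominates the bill: in the 10-minute example above, the audio tokens account for about 98% of the total cost.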

Getting started

Developers can start building with Omnio right away in the playground or by using our docs.
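As a starting point, the sketch below builds a Chat Completion request body for omnio-chat-audio-preview in Python. The payload shape follows common chat-completion conventions with base64-encoded audio; the exact field names (`type`, `audio`, `data`) are assumptions, so check the docs for the authoritative schema.

```python
import base64
import json

# Stand-in for the raw bytes of a real audio file.
audio_bytes = b"\x00\x01..."

# Assumed request shape, modeled on common chat-completion conventions.
payload = {
    "model": "omnio-chat-audio-preview",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Who is speaking, and how do they sound?"},
                {"type": "audio", "data": base64.b64encode(audio_bytes).decode("ascii")},
            ],
        }
    ],
}

# The request body is ordinary JSON, ready to POST to the API endpoint.
body = json.dumps(payload)
print(payload["model"])  # omnio-chat-audio-preview
```

The response would arrive as text tokens, since the audio model outputs text only.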

Try now