Soniox v5 Real-Time: Turning live conversations into structured intelligence

Last week, we introduced Soniox v5 Async: a major leap forward in turning real-world audio into structured, machine-readable data.

Today, we are launching Soniox v5 Real-Time.

Soniox v5 Real-Time brings the same structured speech capabilities to live audio, where the challenge is even harder. A real-time model cannot wait for the full recording or analyze the entire conversation after the fact. It has to listen, understand, separate speakers, detect languages, produce stable text, translate, identify endpoints, and return useful output while the conversation is still happening.

That requires a fundamentally different kind of speech AI system. Real-time speech understanding is not batch transcription with lower latency. It requires streaming model architecture, streaming training, and streaming inference designed from the ground up for live interaction. Soniox v5 Real-Time was built for that world: voice agents, meetings, captions, translation, dictation, command-and-response systems, customer support, contact centers, and multilingual products where every few hundred milliseconds matters.

Native-speaker accuracy for real-world live speech

Real conversations are messy. People interrupt each other, speak over each other, hesitate, restart sentences, switch languages, mention unfamiliar names, and say important numbers quickly. They speak through laptop microphones, phones, conference room systems, headsets, car audio, and noisy public spaces.

This is where live speech AI often breaks down. Models miss words, lose accents, misidentify languages, corrupt names, or produce unstable partial text that keeps changing. The transcript may look acceptable in a demo, but it becomes unreliable in production.

Soniox v5 Real-Time is engineered for these conditions: noise, telephony audio, far-field microphones, multi-speaker conversations, accented speech, multilingual speech, interruptions, and overlapping voices. It pushes accuracy higher not just for English, but across more than 60 languages.

This matters because speech AI should not only work for English and easy conditions. It should work for people everywhere, in every language, accent, environment, and device setting where real conversations happen.

That has always been the core Soniox direction: building speech AI for the world as it actually speaks.

Real-time speaker separation: Knowing who said what

A live transcript is far more useful when you know who said what.

In meetings, calls, interviews, medical conversations, legal workflows, customer support, and voice agents, speaker identity is essential context. Without speaker separation, a system can mix instructions, questions, answers, commitments, symptoms, objections, or action items into one stream of text.

Speaker diarization is extremely hard in real time. The model has to identify speaker changes while the conversation is happening, often with interruptions, overlapping speech, background noise, similar voices, and changing microphone conditions. It cannot wait until the end of the recording to reconstruct the speaker structure.

Soniox v5 Real-Time introduces a major step forward: a live speech model that understands both what was said and who said it. It uses acoustic information, conversational context, and the flow of dialogue to produce speaker-aware output during the conversation.

This is foundational for the next generation of voice AI. A meeting assistant needs to know who assigned the action item. A medical assistant needs to know whether the doctor or patient said something. A customer support system needs to distinguish between the agent and the customer. A multilingual conversation system needs to track different speakers across different languages.

The novelty is not just transcription. It is understanding human conversation as a structured interaction.
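
As a rough illustration of what speaker-aware output makes possible downstream, the sketch below merges a stream of tokens into speaker turns. The `Token` shape (a `text` field and a `speaker` label) is an assumption made for this example, not the documented response schema.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Token:
    """Hypothetical token shape for illustration; not the documented schema."""
    text: str
    speaker: str


def speaker_turns(tokens):
    """Merge consecutive tokens from the same speaker into (speaker, text) turns."""
    turns = []
    for speaker, run in groupby(tokens, key=lambda t: t.speaker):
        # Join the token texts of one uninterrupted run and trim outer whitespace.
        turns.append((speaker, "".join(t.text for t in run).strip()))
    return turns


tokens = [
    Token("Can you ", "1"),
    Token("send the report?", "1"),
    Token(" Yes, ", "2"),
    Token("by Friday.", "2"),
]

for speaker, text in speaker_turns(tokens):
    print(f"Speaker {speaker}: {text}")
```

With turns in this form, an application can attribute an action item or a commitment to the person who actually said it.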

Real-time spoken language identification

Multilingual speech is normal. People switch languages mid-conversation, mix English with local languages, use domain terms from another language, and speak English with accents shaped by their native language.

For global voice AI, language identification is not optional. If the model gets the language wrong, transcription gets worse, translation gets worse, formatting gets worse, and downstream AI becomes less reliable.

Soniox v5 Real-Time performs spoken language identification natively across more than 60 languages. It can identify language changes in live multilingual conversations and handle heavily accented speech more reliably, including cases where other systems confuse similar-sounding languages or struggle with non-native English.

This is important because accented speech is not an edge case. Most English speakers in the world are not native English speakers. A voice AI system that only works well for standard American or British English is not a global system. To understand people in the real world, the model must understand English spoken with Indian, French, Korean, Arabic, Spanish, Portuguese, Japanese, German, and many other accents.

Better language identification improves the entire live speech pipeline: transcription, translation, endpointing, speaker understanding, and downstream structured output.

Real-time translation built into speech recognition

Soniox v5 Real-Time does more than transcribe. It can translate while it transcribes.

Instead of sending audio to one system for transcription and then sending text to another system for translation, Soniox performs speech recognition and translation together in the live stream. This allows translations to follow the conversation with low latency while people are still speaking.

With Soniox v5 Real-Time, translation quality is significantly improved, especially on the details that matter most in real conversations: names, entities, pronouns, numbers, domain terms, and hard-to-translate speech where context is needed to preserve meaning.

This is critical because real-time translation is not word replacement. Spoken language is messy: people pause, restart, interrupt each other, switch languages, use accents, and refer to people or entities indirectly. A useful translation system has to preserve who or what is being discussed, not just produce fluent text.

Soniox supports real-time translation across more than 60 languages and 3,600 language pairs. It can be used for one-way translation into a target language or two-way multilingual conversations where people speaking different languages need to understand each other in real time.
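
The difference between the two modes can be sketched as a small configuration helper. This is a minimal sketch, not the API reference: the key names (`type`, `target_language`, `language_a`, `language_b`) and the helper itself are assumptions made for illustration.

```python
def translation_config(target=None, pair=None):
    """Build a hypothetical translation setting for a real-time session.

    Pass `target` for one-way translation into a single language, or `pair`
    (two language codes) for a two-way conversation. All key names here are
    assumed for illustration.
    """
    if (target is None) == (pair is None):
        raise ValueError("give exactly one of target or pair")
    if target is not None:
        # One-way: everything spoken is translated into the target language.
        return {"type": "one_way", "target_language": target}
    language_a, language_b = pair
    if language_a == language_b:
        raise ValueError("two-way translation needs two different languages")
    # Two-way: each language is translated into the other.
    return {"type": "two_way", "language_a": language_a, "language_b": language_b}


captions = translation_config(target="es")
conversation = translation_config(pair=("en", "ja"))
print(captions)
print(conversation)
```

One-way fits captions and broadcast-style use cases. Two-way fits a conversation between two people who do not share a language.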

This unlocks multilingual voice agents, live translated meetings, global customer support, real-time captions, accessibility tools, education products, travel communication, and communication systems for international teams.

Semantic endpointing for instant voice agents

In live voice applications, knowing when a speaker has finished speaking is just as important as knowing what they said.

If the system waits too long, the product feels slow. If it triggers too early, it interrupts the user or responds before the thought is complete. This balance is especially important for voice agents, command systems, dictation, and conversational apps.

Soniox already supported semantic endpointing, where the speech model uses pauses, intonation, and conversational context to determine when an utterance has ended. With Soniox v5 Real-Time, endpointing is faster, more accurate, and more reliable across accents, languages, speaking styles, and noisy environments.

With v5, developers can also control endpoint behavior using endpoint_sensitivity. Higher values make endpoints more likely, which can finalize segments sooner. Lower values make endpoints less likely, which helps the system wait longer before finalizing.

This gives developers more control over the live experience: aggressive endpointing for command systems, more patient endpointing for dictation, and a balanced setting for conversational assistants.
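
One way to organize these choices in application code is a small preset table. In the sketch below, only the `endpoint_sensitivity` name and the `stt-rt-v5` model name come from this post. The numeric values and the 0.0 to 1.0 scale are placeholders, since the accepted range is not stated here, and the `session_config` helper is hypothetical.

```python
# Placeholder values on an assumed 0.0-1.0 scale. Only their ordering reflects
# the text above: higher means endpoints are more likely (segments finalize sooner).
ENDPOINT_PRESETS = {
    "command": 0.8,    # aggressive: respond as soon as the command is complete
    "assistant": 0.5,  # balanced: conversational turn-taking
    "dictation": 0.2,  # patient: tolerate long pauses while the user thinks
}


def session_config(use_case, model="stt-rt-v5"):
    """Return a hypothetical session config for the given use case."""
    if use_case not in ENDPOINT_PRESETS:
        raise ValueError(f"unknown use case: {use_case!r}")
    return {
        "model": model,
        "endpoint_sensitivity": ENDPOINT_PRESETS[use_case],
    }


print(session_config("dictation"))
```

Keeping the presets in one place makes it easy to tune each product surface separately once real values have been chosen through testing.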

Alphanumeric precision in real time

Numbers, codes, names, dates, emails, addresses, product IDs, account numbers, tracking numbers, flight numbers, confirmation codes, and license plates are everywhere in speech. They are often the most important part of the conversation.

A transcript can be almost entirely readable and still fail if it gets one digit wrong in a phone number, one character wrong in an email, or one code wrong in a customer support workflow. For production systems, these details are not cosmetic. They determine whether automation works.

Soniox v5 Real-Time brings major improvements in alphanumeric recognition and formatting. The model is built to capture structured expressions across languages, accents, speaking styles, and noisy environments, then return them in clean, usable formats.

This matters for support calls, sales calls, healthcare documentation, financial workflows, logistics, travel, identity verification, form filling, dictation, CRM updates, and enterprise search.

Alphanumeric precision is not a niche feature. It is one of the core requirements for making speech AI useful in real products.

Native context for personalized speech AI

No speech model can know every customer name, company term, medical phrase, product SKU, internal acronym, or regional translation preference in advance. In real applications, these details often determine whether the output is truly usable.

Soniox v5 Real-Time lets developers provide context when opening the real-time API connection. The model uses that context while listening, improving recognition of session-specific names, terms, phrases, product identifiers, and domain vocabulary.

Context also applies to translation. Developers can specify preferred translation terms, localized vocabulary, or regional and dialect preferences when the translation needs to be adapted for a specific market, customer, or product.

Because Soniox v5 handles context natively inside the speech model, it is not a post-processing trick. Context improves speech recognition and translation while the audio is being understood, making the output more accurate, personalized, and production-ready across noisy audio, accents, multilingual speech, rare terms, and grammatical variations.
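
As a rough sketch of what providing context at connection time might look like, the helper below assembles a configuration message with recognition terms and preferred translations. The `context`, `terms`, and `translation_terms` key names and their layout are assumptions for illustration, and the example terms are invented. The actual field names belong to the API reference.

```python
import json


def config_with_context(terms, translation_terms=None, model="stt-rt-v5"):
    """Build a hypothetical opening config that carries session context.

    terms: names and domain vocabulary to favor during recognition.
    translation_terms: optional mapping of source term -> preferred translation.
    """
    context = {"terms": list(terms)}
    if translation_terms:
        context["translation_terms"] = [
            {"source": source, "target": target}
            for source, target in translation_terms.items()
        ]
    return {"model": model, "context": context}


config = config_with_context(
    terms=["Soniox", "SKU-4417", "Dr. Okafor"],
    translation_terms={"invoice": "factura"},
)
# Serialized once and sent as the first message when the connection opens.
first_message = json.dumps(config)
print(first_message)
```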

What’s new compared to Soniox v4 Real-Time

Soniox v4 Real-Time established the foundation for high-accuracy multilingual live transcription. Soniox v5 Real-Time is a major leap forward in accuracy, speaker understanding, language identification, endpointing, translation quality, contextual recognition, and structured output.

The largest improvements are visible where live speech is hardest: noisy environments, telephone calls, far-field microphones, multi-speaker meetings, accented speech, interruptions, overlapping voices, and natural language switching. Across more than 60 supported languages, v5 delivers substantially better accuracy and consistency under these real-world conditions.

Speaker separation has been completely reinvented. Soniox v5 Real-Time can more accurately identify who said what as the conversation happens, even with interruptions, speaker changes, laughter, background noise, and overlapping speech.

v5 also improves real-time spoken language identification, especially for heavily accented and multilingual speech. The model tracks language changes more reliably as people speak, reducing labeling errors and making global voice products easier to build.

Endpointing is faster, more accurate, and more reliable in v5. Developers can also tune endpoint behavior with endpoint_sensitivity: higher values finalize speech sooner, while lower values make the model wait longer before emitting an endpoint.

Real-time translation quality is another major upgrade. Soniox v5 Real-Time produces higher-quality translations overall, with especially strong improvements on names, entities, pronouns, domain terms, and difficult speech where context is required to preserve meaning.

Alphanumeric precision has also improved. Soniox v5 Real-Time is better at capturing and formatting the details that matter most in production workflows: numbers, dates, times, emails, account IDs, tracking codes, product SKUs, names, and addresses.

Context injection is significantly more robust as well. Soniox v5 Real-Time applies session-specific context more reliably across noisy audio, accents, multilingual speech, rare terms, and grammatical variations. Names, domain vocabulary, product terms, and custom phrases are recognized more consistently and in the right form.

In short, Soniox v5 Real-Time is not just a more accurate version of v4. It transforms live human conversation into structured intelligence: speaker-aware, language-aware, machine-readable output produced as people speak.

Soniox v5 Real-Time replaces stt-rt-v4 and is fully compatible with the existing Soniox Real-Time API. To upgrade, simply change the model name in your request to stt-rt-v5. The stt-rt-v4 model will be retired on June 30, 2026; after that date, requests using stt-rt-v4 will automatically route to stt-rt-v5 with no service interruption and no API changes required.
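
In code, the upgrade amounts to a one-field change. The sketch below assumes the request configuration is held as a dictionary with a `model` key. The model names come from this post, while the helper and the extra `audio_format` field are only illustrative.

```python
def upgrade_model(config):
    """Return a copy of a session config with stt-rt-v4 replaced by stt-rt-v5."""
    upgraded = dict(config)  # shallow copy so the caller's config is untouched
    if upgraded.get("model") == "stt-rt-v4":
        upgraded["model"] = "stt-rt-v5"
    return upgraded


old_config = {"model": "stt-rt-v4", "audio_format": "auto"}
new_config = upgrade_model(old_config)
print(new_config["model"])
```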

Availability

Soniox v5 Real-Time is available starting today through the Soniox API.

For recorded audio workflows, read more in the Soniox v5 Async blog post.
