Soniox v5 Async: Turning real-world speech into structured data

In a quiet room with one speaker, many modern STT systems perform reasonably well. But real-world audio is rarely clean. It is a hybrid meeting with cross-talk, a customer calling from a noisy street over a telephone line, a doctor and patient speaking over each other, or an engineer switching between English and Korean in the same sentence. It is full of accents, interruptions, background noise, unfamiliar names, domain-specific terms, and critical details such as emails, dates, account numbers, and product codes.

This is where traditional speech-to-text still breaks down. It may produce a transcript, but the result is often an unstructured wall of text: speakers are mixed together, language switches are missed, alphanumeric strings are corrupted, and context-dependent words are lost. The transcript may look readable, but it is not reliable enough for automation, analytics, compliance, search, or downstream AI systems.

Today, we are introducing Soniox v5 Async: a deeply specialized speech-to-text AI model built to convert real-world audio into clean, structured, machine-readable data. Soniox v5 processes acoustic signals and linguistic context together in a single end-to-end model. It is not a generic text model wrapped around audio. It is a speech-native system designed for the hardest production environments: multilingual speech, accented speakers, noisy audio, overlapping voices, domain-specific vocabulary, and precise formatting of numbers, names, codes, emails, dates, and other structured entities.

The result is a new foundation for speech intelligence: accurate transcription, speaker separation, language identification, timestamps, contextual vocabulary, and normalized structured output from one model.

1. Native-speaker accuracy across 60+ languages

Most speech systems still treat English as the primary language and everything else as secondary. Soniox v5 was built differently. It delivers breakthrough accuracy across more than 60 languages, including many that modern STT systems often underserve, such as Danish, Hungarian, Turkish, Arabic, Korean, and Japanese.

Soniox v5 is engineered for real-world acoustic difficulty: background noise, telephony, far-field microphones, reverberant rooms, device recordings, call centers, meetings, and other production audio conditions where standard models degrade quickly. It also handles heavily accented speech and natural language mixing, including English spoken with strong regional or non-native accents, as well as code-switching such as Hinglish, Spanglish, Korean-English, Japanese-English, Arabic-English, and other multilingual combinations that occur naturally in global conversations.

The model performs spoken language identification natively across the full transcript, allowing it to identify the active language even when speakers switch languages inside the same conversation or sentence. This eliminates the need to route audio through separate language-specific models and makes Soniox v5 especially effective for global products, multilingual teams, contact centers, meetings, and voice AI applications.
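To make per-token language identification concrete, here is a minimal sketch of how a client might collapse a transcript into language runs. It assumes each token arrives as a dictionary with `text` and `language` keys; that token shape and the sample data are illustrative assumptions, not a confirmed response schema.

```python
from itertools import groupby

def language_runs(tokens):
    """Collapse consecutive tokens that share a language label into (language, text) runs."""
    runs = []
    for lang, group in groupby(tokens, key=lambda t: t["language"]):
        runs.append((lang, "".join(t["text"] for t in group).strip()))
    return runs

# Hypothetical tokens for one sentence that switches from English to Spanish.
tokens = [
    {"text": "Send ", "language": "en"},
    {"text": "it ", "language": "en"},
    {"text": "el ", "language": "es"},
    {"text": "viernes", "language": "es"},
]

runs = language_runs(tokens)
for lang, text in runs:
    print(f"[{lang}] {text}")
```

Because the switch is visible at the token level, a mid-sentence change of language shows up as a new run rather than requiring the audio to be re-routed to another model.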

2. Breakthrough speaker separation

A transcript is only useful if you know who said what.

In meetings, calls, interviews, medical conversations, and legal recordings, speaker separation is not a nice-to-have feature. It is the foundation that makes the transcript usable. Without it, a model may mix the doctor’s instructions with the patient’s symptoms, blend multiple meeting participants into one paragraph, or assign a critical statement to the wrong person.

This has been one of the hardest problems in speech AI, especially in real-world conversations where people interrupt each other, laugh, speak at the same time, or move around the room. Traditional diarization systems often rely heavily on acoustic similarity, which can break down when voices overlap, recording conditions change, or the conversation contains multiple speakers with similar voices.

Soniox v5 introduces a major step forward in speaker separation. Because the model processes acoustic information and conversational context together, it can use both the sound of each speaker’s voice and the semantic flow of the conversation to determine who said what. This allows v5 to separate speakers more accurately in noisy, multi-person, multilingual, and overlapping speech environments.

The output is not just a transcript. It is a structured conversation with speaker labels that preserve the flow of human interaction.
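As an illustration of what a structured conversation can look like downstream, the sketch below merges speaker-labeled tokens into turns with start and end times. The token keys (`text`, `speaker`, `start_ms`, `end_ms`) and the `Turn` type are assumptions made for this example; the actual response format should be taken from the API documentation.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    speaker: str
    start_ms: int
    end_ms: int
    text: str

def speaker_turns(tokens):
    """Merge consecutive tokens from the same speaker into a single turn."""
    turns = []
    for tok in tokens:
        if turns and turns[-1].speaker == tok["speaker"]:
            # Same speaker keeps talking: extend the current turn.
            turns[-1].end_ms = tok["end_ms"]
            turns[-1].text += tok["text"]
        else:
            turns.append(Turn(tok["speaker"], tok["start_ms"], tok["end_ms"], tok["text"]))
    for turn in turns:
        turn.text = turn.text.strip()
    return turns

# Hypothetical doctor/patient exchange.
tokens = [
    {"text": "Any ", "speaker": "1", "start_ms": 0, "end_ms": 200},
    {"text": "chest pain?", "speaker": "1", "start_ms": 200, "end_ms": 900},
    {"text": " Only ", "speaker": "2", "start_ms": 1100, "end_ms": 1400},
    {"text": "at night.", "speaker": "2", "start_ms": 1400, "end_ms": 2000},
    {"text": " Okay.", "speaker": "1", "start_ms": 2300, "end_ms": 2600},
]

turns = speaker_turns(tokens)
for t in turns:
    print(f"Speaker {t.speaker} [{t.start_ms}-{t.end_ms} ms]: {t.text}")
```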

3. Native context injection

No speech model can know every company name, medical term, legal phrase, product SKU, customer name, or internal acronym in advance. In real enterprise workflows, these details matter. A transcript that misses a drug name, product identifier, customer name, or technical term may require human review before it can be used.

Traditional systems often try to solve this with post-processing: transcribe first, then patch the text afterward. But that approach is brittle. If the model hears the wrong word in the first place, a text-only correction system often cannot recover. It also struggles with grammatical variants, inflections, compound terms, and multilingual usage.

Soniox v5 handles context natively. You can provide session-specific context directly to the model, and v5 uses that information while listening to the audio. This allows the model to bias recognition toward relevant names, terms, phrases, product identifiers, and domain-specific vocabulary at the moment of transcription.

Because context is integrated into the speech model itself, Soniox v5 can apply it more naturally. You do not need to enumerate every possible variant of a word. The model can use the surrounding sentence to choose the correct form based on how the term is spoken. For enterprises, this means fewer manual corrections, better domain accuracy, and transcripts that are much closer to production-ready immediately after processing.
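A rough sketch of what passing session-specific context could look like on the client side follows. Only the model name `stt-async-v5` comes from this announcement. The payload keys (`audio_url`, `context`, `terms`, `text`), the helper name, and the example URL are hypothetical placeholders, so check the Soniox API reference for the real request schema.

```python
import json

def build_transcription_request(audio_url, terms, background=""):
    """Build a request body that carries session-specific context.

    ASSUMPTION: the keys "audio_url", "context", "terms", and "text" are
    illustrative names, not a confirmed API schema.
    """
    request = {"model": "stt-async-v5", "audio_url": audio_url}
    # Drop blank entries and duplicates while preserving the caller's order.
    cleaned = list(dict.fromkeys(t.strip() for t in terms if t.strip()))
    context = {}
    if cleaned:
        context["terms"] = cleaned
    if background.strip():
        context["text"] = background.strip()
    if context:
        request["context"] = context
    return request

request = build_transcription_request(
    "https://example.com/visit.wav",  # hypothetical audio location
    ["metoprolol", " metoprolol ", "", "SKU-4471"],
    background="Cardiology follow-up visit.",
)
print(json.dumps(request, indent=2))
```

Note that the term list holds only base forms. Per the description above, the model is expected to choose the right inflection from the spoken sentence, so the client does not enumerate variants.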

4. Universal alphanumeric precision

Alphanumerics are everywhere in speech. People say phone numbers, account numbers, dates, times, addresses, emails, confirmation codes, tracking IDs, invoice numbers, product SKUs, license plates, flight numbers, and serial numbers every day. These details are often the most important part of the conversation.

They are also where many speech models fail. A transcript can be almost entirely correct and still be useless if it gets one digit wrong in a phone number, one letter wrong in an email address, or one character wrong in a confirmation code. For many production workflows, these are not minor errors; they are the difference between automation working and automation failing.

Soniox v5 was built to handle alphanumeric speech as a core capability, not an afterthought. The model recognizes and formats structured expressions across languages, accents, and speaking styles. It can process spoken numbers, letters, symbols, emails, dates, times, and mixed character sequences, then output them in clean, standard business formats. This is critical for real-world automation. A support call, medical note, financial workflow, logistics transcript, or enterprise search system depends on these details being captured correctly.
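To show why normalized formatting matters for automation, here is a small downstream check a consumer might run on transcript text. The email pattern is a deliberately simple approximation, and the confirmation-code format (two letters, a hyphen, six digits) is a made-up example, not a Soniox output guarantee.

```python
import re

# Simplified email pattern; real-world validation is usually stricter.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Hypothetical confirmation-code format used only for this example.
CODE_RE = re.compile(r"\b[A-Z]{2}-\d{6}\b")

def extract_entities(text):
    """Pull machine-usable entities out of transcript text."""
    return {"emails": EMAIL_RE.findall(text), "codes": CODE_RE.findall(text)}

normalized = "Please send the invoice to jana.novak@example.com and quote code QX-204817."
verbatim = "Please send the invoice to jana dot novak at example dot com and quote code q x two zero four eight one seven."

entities = extract_entities(normalized)
missed = extract_entities(verbatim)
print(entities)
print(missed)
```

The same utterance yields usable entities only when the transcript is already in standard written form; the verbatim rendering would need a separate, error-prone normalization step.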

The new standard for structured speech-to-text

Soniox v5 Async is designed for one purpose: to turn real-world speech into structured data that machines and people can use.

From a single audio file, Soniox v5 can produce accurate text, speaker labels, language labels, timestamps, domain-aware vocabulary, and normalized structured entities such as numbers, dates, emails, addresses, names, and codes. Instead of receiving a raw transcript that must be cleaned, corrected, segmented, labeled, and reformatted after the fact, developers can build directly on structured speech data from the start.

That matters for every product that depends on spoken language: voice agents, contact centers, meeting intelligence, healthcare documentation, legal transcription, financial services, media workflows, education, accessibility, analytics, compliance, and search. The world does not speak in clean text. People interrupt, switch languages, speak with accents, mention unfamiliar names, and say critical numbers that must be exact.

Soniox v5 Async was built for that world.

What’s new compared to Soniox v4 Async

Soniox v4 Async established the foundation for high-accuracy multilingual transcription.

Soniox v5 Async is a major leap forward in real-world robustness, speaker understanding, language identification, contextual accuracy, and structured output.

The largest improvements are visible in the hardest audio conditions: noisy recordings, telephone calls, far-field microphones, multi-speaker meetings, accented speech, and conversations where people switch languages naturally. Across more than 60 supported languages, v5 delivers substantially better accuracy and consistency, especially in cases where traditional STT systems degrade quickly.

Speaker separation has been completely reengineered in Soniox v5. The model is significantly better at identifying who said what in real-world conversations such as meetings, interviews, and calls, even when they include interruptions, rapid speaker changes, laughter, background noise, and overlapping speech. This turns the transcript from a block of text into a speaker-aware record of the conversation, making it far more useful for summaries, action items, analytics, compliance, customer intelligence, and downstream AI workflows.

v5 also improves spoken language identification, especially for heavily accented speech and multilingual conversations. The model can more reliably track language changes across a transcript, reducing language-labeling errors and making it easier to process global conversations without separate routing logic.

Another major upgrade is alphanumeric precision. Soniox v5 is better at capturing and formatting the details that matter most in production workflows: numbers, dates, times, emails, account IDs, tracking codes, product SKUs, names, and addresses.

Context injection is also significantly more robust. Soniox v5 applies session-specific context more reliably across noisy audio, accents, multilingual speech, rare terms, and grammatical variations. Names, domain vocabulary, product terms, and custom phrases are recognized more consistently and in the right form.

In short, Soniox v5 Async is not just a more accurate version of v4. It is a major step toward structured speech-to-text: transforming real-world audio into clean, speaker-aware, language-aware, machine-readable data.

Soniox v5 Async replaces stt-async-v4 and is fully compatible with the existing Soniox Async API. To upgrade, simply change the model name in your request to stt-async-v5. The stt-async-v4 model will be retired on June 30, 2026; after that date, requests using stt-async-v4 will automatically route to stt-async-v5 with no service interruption and no API changes required.
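The upgrade path described above can be sketched in a few lines. The model names and the retirement date come from this announcement. The request dictionary, its `audio_url` key, and the helper functions are hypothetical stand-ins for whatever client code you already have.

```python
from datetime import date

V4_MODEL = "stt-async-v4"
V5_MODEL = "stt-async-v5"
V4_RETIREMENT = date(2026, 6, 30)

def upgrade_request(request):
    """Return a copy of the request with v4 swapped for v5; other fields are untouched."""
    upgraded = dict(request)
    if upgraded.get("model") == V4_MODEL:
        upgraded["model"] = V5_MODEL
    return upgraded

def effective_model(requested_model, on):
    """Model expected to serve a request on a given date, per the retirement policy."""
    if requested_model == V4_MODEL and on > V4_RETIREMENT:
        return V5_MODEL
    return requested_model

# Hypothetical existing request body; only the model name changes.
old_request = {"model": V4_MODEL, "audio_url": "https://example.com/call.wav"}
new_request = upgrade_request(old_request)
print(new_request["model"])
print(effective_model(V4_MODEL, date(2026, 7, 1)))
```

Upgrading explicitly before the retirement date keeps the model choice visible in your own configuration instead of relying on automatic routing.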

Availability

Soniox v5 Async is available starting today through the Soniox API.

For teams building live voice experiences, the ultra-low-latency Soniox v5 Real-Time model will launch in the coming weeks.
