Today, we’re releasing Soniox v4 Real-Time, a speech recognition model purpose-built for low-latency voice interactions.
Unlike traditional STT systems that trade accuracy for speed or are optimized around a single language at a time, Soniox v4 delivers speaker-native accuracy across 60+ languages with no trade-offs. It’s built for mission-critical voice agents, live captioning, and real-time global communication, setting a new standard for real-time speech AI.
Soniox v4 Real-Time is built for teams shipping voice-first products where latency, accuracy, and multilingual reliability are non-negotiable.
1. Speaker-native accuracy for 60+ languages
For too long, speech AI has been “English-first,” with everything else an afterthought. We’ve officially ended that era.
Soniox v4 reaches speaker-native accuracy across 60+ languages. We didn’t just optimize for the major languages; we equalized the product experience for every supported language. Whether your users are speaking French, Hindi, Portuguese, or Japanese, they receive the same level of accuracy that was previously reserved for English.
In real-time voice interactions, “close enough” isn’t good enough. If a model misses a word, a voice agent loses the thread. v4 provides the high-fidelity foundation required for an AI to converse naturally, without constant interruptions.
2. Millisecond finality: Speed without sacrifice
In conversation, a 500 ms delay feels like an eternity. For a voice agent to feel alive, the transition from speech to text must be nearly instantaneous.
Soniox v4 delivers industry-leading low latency for final transcriptions, producing high-accuracy final text just milliseconds after speech ends. This allows your system to trigger the next action, whether that’s an LLM prompt or a spoken response, before the user even wonders if the AI is listening.
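A minimal sketch of how a client might act on final text the instant it lands. The token shape (`text`, `is_final`) and the callback wiring are illustrative assumptions, not the documented Soniox response schema; consult the API reference for the real field names.

```python
from typing import Callable, Iterable


def on_final_tokens(tokens: Iterable[dict], trigger: Callable[[str], None]) -> str:
    """Accumulate final tokens from a real-time stream and fire a
    downstream action (e.g. an LLM call) as soon as final text arrives.

    Assumed token shape: {"text": str, "is_final": bool}.
    """
    final_parts: list[str] = []
    for tok in tokens:
        if tok.get("is_final"):
            final_parts.append(tok["text"])
            # Final text is stable -- safe to hand off immediately.
            trigger("".join(final_parts))
        # Non-final tokens are provisional; render them in the UI only.
    return "".join(final_parts)


# Example with a synthetic token stream:
stream = [
    {"text": "Hello", "is_final": False},
    {"text": "Hello,", "is_final": True},
    {"text": " world", "is_final": True},
]
calls: list[str] = []
result = on_final_tokens(stream, calls.append)
print(result)  # -> Hello, world
```

The point of the pattern: because finals arrive milliseconds after speech ends, the downstream action starts before the user perceives any gap.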
3. Semantic endpointing: Listen like a human
Traditional Voice Activity Detection (VAD) is a blunt, acoustic tool. It listens for silence and cuts the audio when the pause lasts too long. That’s why AI assistants often interrupt you while you’re slowly reading a phone number or an address.
Soniox v4 introduces Semantic Endpointing. Instead of relying on silence alone, the model understands context, rhythm, and intent. Endpointing shifts from an acoustic problem to a semantic one.
- The VAD way: You say “555… [pause] …0192” → the system cuts you off.
- The Soniox way: The model understands the sequence is incomplete and waits patiently. When it detects real conversational finality, it ends the turn immediately.
This results in fewer interruptions, lower downstream compute costs, and a more natural conversational experience across your entire voice stack.
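A sketch of what turn segmentation driven by semantic finality looks like on the client side. The special `"<end>"` marker token is an assumption made for illustration; the actual endpointing signal and field names are defined in the Soniox API documentation.

```python
def split_turns(tokens: list[dict]) -> list[str]:
    """Group streamed tokens into conversational turns.

    Assumes the model emits an end-of-turn marker token (spelled
    "<end>" here) when it detects semantic finality -- a pause alone
    never ends the turn.
    """
    turns: list[str] = []
    current: list[str] = []
    for tok in tokens:
        if tok["text"] == "<end>":
            if current:
                turns.append("".join(current).strip())
                current = []
        else:
            current.append(tok["text"])
    if current:  # stream closed mid-turn
        turns.append("".join(current).strip())
    return turns


stream = [
    {"text": "My number is 555"},
    {"text": " 0192"},   # a pause here would NOT end the turn
    {"text": "<end>"},   # semantic finality detected
    {"text": "Thanks, bye"},
    {"text": "<end>"},
]
print(split_turns(stream))  # -> ['My number is 555 0192', 'Thanks, bye']
```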
4. Real-time global translation
With the Soniox API, you can transcribe and translate simultaneously in a single real-time stream.
Unlike sentence-level systems, Soniox translates speech in low-latency streaming chunks, with translation appearing continuously as the speaker talks. Translation quality has been significantly improved across both major and minor languages, supporting 3,600+ language pairs.
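To make the single-stream setup concrete, here is a hedged sketch of a start-request payload combining transcription and translation. Every field name except the model name `stt-rt-v4` is a hypothetical placeholder; the exact request schema lives in the Soniox API reference.

```python
def build_start_request(api_key: str, target_language: str) -> dict:
    """Build a start message for a combined transcribe-and-translate
    real-time stream.

    NOTE: "audio_format", "translation", "type", and
    "target_language" are illustrative assumptions, not the
    documented schema.
    """
    return {
        "api_key": api_key,
        "model": "stt-rt-v4",    # model name from this release
        "audio_format": "auto",  # assumed option
        "translation": {
            "type": "one_way",   # assumed: source -> target only
            "target_language": target_language,
        },
    }


req = build_start_request("YOUR_API_KEY", "es")
print(req["model"])  # -> stt-rt-v4
```

With a payload like this, translated text streams back in the same connection as the transcript, chunk by chunk, rather than waiting for sentence boundaries.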
5. Built for the enterprise
Soniox v4 Real-Time is available immediately via API and is engineered for demanding, enterprise-grade workloads. It supports uninterrupted real-time audio streams of up to 5 hours, making it suitable for production deployments at global scale.
Soniox v4 Real-Time is fully backward-compatible with Soniox v3 Real-Time. Upgrading is seamless: simply switch the model name to stt-rt-v4 and immediately benefit from improved accuracy, latency, and reliability.
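Since v4 is backward-compatible, the upgrade can be a one-line config change. A minimal sketch, assuming the v3 model is named `stt-rt-v3` (the v3 name is an assumption; `stt-rt-v4` is given above):

```python
def upgrade_to_v4(config: dict) -> dict:
    """Return a copy of a v3 streaming config pointed at v4.

    Assumes the v3 model name is "stt-rt-v3"; no other fields need
    to change because v4 is backward-compatible with v3.
    """
    upgraded = dict(config)
    if upgraded.get("model", "").startswith("stt-rt-v3"):
        upgraded["model"] = "stt-rt-v4"
    return upgraded


v3_config = {"model": "stt-rt-v3", "audio_format": "auto"}
v4_config = upgrade_to_v4(v3_config)
print(v4_config["model"])  # -> stt-rt-v4
```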
The practical shift
Soniox v4 Real-Time isn’t about adding features; it’s about removing the friction that has historically made real-time voice AI hard to deploy globally.
By solving for latency, language parity, and semantic understanding at the model level, we’re providing the infrastructure needed for truly reliable, human-like voice systems.
You can start building with Soniox v4 Real-Time today via the Soniox API.