Soniox named best-in-class STT for voice agents

At Soniox, our mission has always been clear: to build the most accurate, lowest-latency speech AI in the world. Today, we are thrilled to share that a recent independent benchmark has validated that mission.

In a comprehensive study by Daily (Pipecat), Soniox was recognized as a top-tier provider for real-time voice agents, sitting firmly on the "Pareto frontier": the set of providers that no competitor beats on both speed and accuracy at the same time.

The results: Best-in-class performance

The Daily.co benchmark focused on the two metrics that matter most for the next generation of voice agents: latency and Word Error Rate (WER).

Soniox delivered exceptional results across the board:

Speed: A median latency of just 249ms, putting us at the top of the industry.
Accuracy: A 1.25% WER, proving that you don’t have to sacrifice precision for speed.

Figure: Daily.co benchmark results (TTFS median latency vs. accuracy)

Source: daily.co/blog/benchmarking-stt-for-voice-agents
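
To make the "Pareto frontier" idea concrete, here is a minimal sketch of how such a frontier can be computed from latency and WER pairs. Only the Soniox figures (249 ms, 1.25% WER) come from this post; the other three entries are made-up placeholders for illustration, not results from the Daily.co benchmark, and the `Result` and `pareto_frontier` names are our own.

```python
from typing import NamedTuple


class Result(NamedTuple):
    name: str
    latency_ms: float  # median latency, lower is better
    wer_pct: float     # word error rate in percent, lower is better


def pareto_frontier(results: list[Result]) -> list[Result]:
    """Return the results that no other result beats on both axes."""

    def dominates(a: Result, b: Result) -> bool:
        # a dominates b if it is at least as good on both axes
        # and strictly better on at least one of them.
        return (
            a.latency_ms <= b.latency_ms
            and a.wer_pct <= b.wer_pct
            and (a.latency_ms < b.latency_ms or a.wer_pct < b.wer_pct)
        )

    return [r for r in results if not any(dominates(o, r) for o in results)]


results = [
    Result("soniox", 249, 1.25),         # figures quoted in this post
    Result("hypothetical_a", 200, 3.0),  # made-up: faster, less accurate
    Result("hypothetical_b", 400, 1.0),  # made-up: slower, more accurate
    Result("hypothetical_c", 450, 2.5),  # made-up: slower and less accurate
]

frontier_names = [r.name for r in pareto_frontier(results)]
print(frontier_names)  # ['soniox', 'hypothetical_a', 'hypothetical_b']
```

In this toy data, `hypothetical_c` drops off the frontier because another entry is both faster and more accurate; every remaining entry represents a trade-off that nothing else strictly improves on.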

We are grateful to the team at Daily.co for including us in this rigorous study. Seeing our technology perform at the highest level alongside other industry leaders fills us with a sense of humble pride. It’s a testament to the hard work of our engineering team and our unwavering commitment to the community of builders creating the future of voice AI.

Why speed and accuracy matter for voice agents

In the world of voice AI, milliseconds aren't just technical specs; they are the difference between a seamless conversation and a frustrating failure. To understand why Soniox's performance is a game-changer, you have to look at the "latency budget."

A natural human conversation usually has a "turn-taking" gap of about 500ms. To stay close to that gold standard, a voice agent must transcribe the audio (STT), reason through a response (LLM), and generate speech (TTS), all in less than a second. STT is the first hop in this chain: if it alone takes 500ms, half of that budget is gone and the experience is already lagging before the AI even starts "thinking."
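
The latency budget can be sketched as simple arithmetic. In the example below, only the 249 ms STT figure and the one-second target come from this post; the LLM and TTS latencies are assumed placeholder values, and `remaining_budget` is a helper name we made up for illustration.

```python
# Latency budget for one conversational turn, in milliseconds.
TURN_BUDGET_MS = 1000  # the "less than a second" target described above

# Only the STT figure comes from the benchmark quoted in this post;
# the LLM and TTS figures are hypothetical placeholders.
stage_latency_ms = {
    "stt": 249,
    "llm": 400,  # assumed
    "tts": 200,  # assumed
}


def remaining_budget(stages: dict[str, int], budget_ms: int) -> int:
    """Return the milliseconds left after all stages run back to back."""
    return budget_ms - sum(stages.values())


headroom = remaining_budget(stage_latency_ms, TURN_BUDGET_MS)
slow_stt = remaining_budget({**stage_latency_ms, "stt": 500}, TURN_BUDGET_MS)

print(f"Headroom with 249 ms STT: {headroom} ms")  # 151 ms
print(f"Headroom with 500 ms STT: {slow_stt} ms")  # -100 ms
```

With these assumed numbers, a 500 ms STT stage pushes the turn over budget on its own, while a 249 ms stage leaves room to spare for the LLM and TTS.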

The "cascading error" problem

Accuracy is equally critical because of how errors propagate. If an STT engine misinterprets a single word, that mistake is fed directly into the LLM. The LLM then "reasons" based on bad data, leading to a response that is at best confusing and at worst dangerous.

Example: The "fifteen vs. fifty" risk

Imagine a voice agent in a healthcare setting. A patient says, "I need to take fifteen milligrams of my medication."

Low-quality STT: Records "fifty" (WER looks low, but the semantic error is massive).
Result: The LLM confirms a potentially toxic dosage.
With Soniox: Our 1.25% WER and alphanumeric precision are designed to capture high-stakes details like numbers and names reliably.
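
The gap between WER and real-world risk can be shown with a small sketch. The function below computes WER as word-level edit distance divided by the number of reference words, which is the standard definition; the function name and the second, "harmless" hypothesis are our own illustrative choices rather than anything taken from the benchmark.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


reference = "I need to take fifteen milligrams of my medication"
dangerous = "I need to take fifty milligrams of my medication"
harmless = "I need to take fifteen milligrams of the medication"

wer_dangerous = word_error_rate(reference, dangerous)
wer_harmless = word_error_rate(reference, harmless)

print(f"{wer_dangerous:.1%}")  # 11.1%
print(f"{wer_harmless:.1%}")   # 11.1%
```

Both hypotheses contain exactly one substituted word out of nine, so WER scores them identically even though only one of them changes the dosage. In a longer transcript, that single error would be diluted further, which is why an aggregate WER can look low while a critical number is still wrong.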

By delivering sub-250ms latency and best-in-class accuracy, Soniox provides the rock-solid foundation that allows voice agents to feel truly human, intelligent, and reliable.

The end of "English-first" voice AI

While we are proud of these results, this benchmark is only the beginning. The study was conducted solely in English, the current industry standard. However, we believe that for voice AI to truly be "next-gen," it must be global.

If this benchmark had included multiple languages, we believe the results would have been even more dramatic. Soniox performs equally well across all 60+ supported languages, while other providers would crumble under the pressure of non-English audio. The reality is that most "industry leaders" are English-first. Their accuracy drops off a cliff the moment you move into other languages, and their latency spikes as they struggle to process complex audio.

Soniox is the only speech API provider you can use today to build voice agents for all 8 billion people on Earth. We deliver the same best-in-class accuracy and speed regardless of the language being spoken. Others simply do not even come close.

Built for the realities of global voice AI

We outperform the competition by focusing on the unique challenges of real-world speech and enterprise requirements:

Native accuracy: Precise recognition across 60+ languages, including complex accents and dialects.
Code-switching: The only API that handles mid-sentence language switching automatically.
Alphanumeric precision: Perfect capture of emails, addresses, and phone numbers as they are spoken.
Advanced endpointing: We use tone and meaning—not just silence—to know exactly when a user is done speaking.
Real-time translation: True streaming, any-to-any translation for 3,600 language pairs.
Global infrastructure: Flexible, local deployments in the US, EU, and Japan (with more regions coming online) to ensure data residency and ultra-low latency.

See the difference: Compare live

Don't just take our word for the benchmark results; test them yourself. We believe in total transparency, which is why we’ve built a live comparison tool.

You can run your own audio side by side to see how Soniox performs against other STT providers in real time. Same audio. Same conditions. Live, transparent results.

Looking ahead

Being recognized as a leader in speed and accuracy is a milestone, but we aren't stopping here. We are more committed than ever to building the next generation of voice AI: technology that is faster, more accurate, and more inclusive of the world’s diverse languages.

To our community of developers and partners: thank you for being on this journey with us. We can’t wait to see what you build next.

Ready to build the fastest voice agent on the market? Explore the Soniox API and experience the best-in-class performance for yourself.