Speech Recognition AI

Most accurate speech recognition AI

We invented a new AI learning algorithm that recognizes speech with near-human accuracy and robustness in real-world environments. Compared to other providers, Soniox speech recognition AI is in a league of its own.

Any audio input and format


Upload files and get back highly accurate transcripts within seconds to minutes. Turnaround stays fast even with a large number of files.

Explore docs

Audio format

Soniox automatically detects the most common audio formats, including mp3, wav, flac, ogg, aac, aiff, amr, asf, and raw PCM samples.

Explore docs
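
As an illustration of what automatic format detection can involve (this is not a description of how Soniox implements it), many container formats can be recognized from the first bytes of a file. The sketch below checks a few well-known signatures. The function name `sniff_audio_format` is hypothetical, and headerless data such as raw PCM simply falls through to `None`.

```python
from typing import Optional


def sniff_audio_format(header: bytes) -> Optional[str]:
    """Guess a container format from the first bytes of a file (illustrative only)."""
    if header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return "wav"
    if header[:4] == b"fLaC":
        return "flac"
    if header[:4] == b"OggS":
        return "ogg"
    if header[:4] == b"FORM" and header[8:12] in (b"AIFF", b"AIFC"):
        return "aiff"
    if header[:5] == b"#!AMR":
        return "amr"
    if header[:3] == b"ID3":
        # An ID3v2 tag usually means MP3; untagged MP3 files need frame-sync checks.
        return "mp3"
    # Unknown format, or headerless data such as raw PCM.
    return None
```

In practice, the first 12 bytes of a file are enough for these checks, for example `sniff_audio_format(open(path, "rb").read(12))`.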

Live streams

Transcribe live streams with the highest accuracy and sub-200 ms latency. Get the best auto-captioning experience with the highest comprehension quality.

Explore docs

Multi-channel audio

Merge multiple channels into one channel, or transcribe each channel independently, with a single API call.

Explore docs
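
To make the two multi-channel options concrete, here is a minimal sketch of what "merge" and "independent channels" mean for interleaved PCM samples. It works on plain Python lists and is not the Soniox API. The helper names `split_channels` and `merge_to_mono` are hypothetical, and averaging is only one possible way to downmix.

```python
from typing import List


def split_channels(interleaved: List[int], num_channels: int) -> List[List[int]]:
    """De-interleave samples: [L0, R0, L1, R1, ...] -> [[L0, L1, ...], [R0, R1, ...]]."""
    if num_channels < 1 or len(interleaved) % num_channels != 0:
        raise ValueError("sample count must be a multiple of num_channels")
    return [interleaved[ch::num_channels] for ch in range(num_channels)]


def merge_to_mono(interleaved: List[int], num_channels: int) -> List[int]:
    """Downmix by taking the integer (floor) average of each frame across channels."""
    channels = split_channels(interleaved, num_channels)
    return [sum(frame) // num_channels for frame in zip(*channels)]
```

Transcribing each channel independently corresponds to feeding each list from `split_channels` to the recognizer separately, which keeps speakers on different channels apart.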

Complete transcription result

Soniox returns a complete transcription result, including the recognized words, timestamps, confidence scores, and speaker tags.

In streaming speech recognition, Soniox returns "interim results" containing final words and non-final words. Non-final words can still change as more audio is transcribed.

Explore docs

text: "YouTube";
start_ms: 1450;
duration_ms: 350;
is_final: true;
speaker: 1;
confidence: 0.98;
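
One way a client might consume interim results is to append final words to a stable transcript and replace the non-final words with each new result. The sketch below uses the field names from the sample above. The `Word` and `TranscriptBuffer` classes are hypothetical, and it assumes each final word is delivered only once.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    start_ms: int
    duration_ms: int
    is_final: bool
    speaker: int
    confidence: float


class TranscriptBuffer:
    """Accumulates final words and keeps only the latest non-final words."""

    def __init__(self) -> None:
        self.final_words: List[Word] = []
        self.pending_words: List[Word] = []

    def update(self, interim_words: List[Word]) -> None:
        # Assumption: a result carries newly finalized words plus the current
        # non-final guess, and final words are not repeated in later results.
        self.final_words.extend(w for w in interim_words if w.is_final)
        self.pending_words = [w for w in interim_words if not w.is_final]

    def text(self) -> str:
        return " ".join(w.text for w in self.final_words + self.pending_words)
```

A captioning UI could render `final_words` in a stable style and `pending_words` in a lighter style, since only the latter may still change.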

Speech customization

We invented a novel procedure that customizes speech recognition AI to a specified context on the fly. Simply provide a list of words and phrases, and Soniox will automatically recognize them when they are spoken in the audio.

You can also reformat the words or phrases to your liking. For example, the entry "twenty three and me => 23andMe" makes the spoken phrase appear as "23andMe" in the transcript.

We also support storing speech customizations in our cloud: create a speech customization once, then reuse it across different audio files.

Explore docs

# Create speech context on-the-fly.
speech_context = SpeechContext(
    phrases=["twenty three and me => 23andMe"],
)

# Pass speech context to transcribe API call.
result = transcribe_file_short(
    ...  # remaining arguments, including the speech context, are not shown in the original snippet
)
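
To show what a "spoken => written" entry expresses, here is a small text-level sketch. Soniox applies customization during recognition itself; this post-processing version only illustrates the mapping semantics. The helper names `parse_phrase` and `apply_phrase` are hypothetical.

```python
import re
from typing import Tuple


def parse_phrase(entry: str) -> Tuple[str, str]:
    """Split 'spoken => written' into its two sides; a plain phrase maps to itself."""
    spoken, sep, written = entry.partition("=>")
    spoken = spoken.strip()
    return (spoken, written.strip()) if sep else (spoken, spoken)


def apply_phrase(transcript: str, entry: str) -> str:
    """Replace whole-word occurrences of the spoken form with the written form."""
    spoken, written = parse_phrase(entry)
    pattern = re.compile(r"\b" + re.escape(spoken) + r"\b", re.IGNORECASE)
    # A function replacement avoids backslash interpretation in the written form.
    return pattern.sub(lambda _match: written, transcript)
```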

Dictation mode

Dictation mode enables you to use voice to type and format text. When a dictation command is recognized, it is mapped to a corresponding symbol. For example, the word "period" appears in the transcript as "." and "dollar sign" as "$".

Explore docs

# Input:
This is cool period new line I am voice typing

# Output:
This is cool . [NEW_LINE] I am voice typing
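
The command-to-symbol mapping can be sketched as a simple lookup over word sequences. This is an illustration, not Soniox's implementation: the `COMMANDS` table and the `apply_dictation` function are hypothetical, and the table contains only the three commands mentioned in this section.

```python
from typing import Dict, List, Tuple

# Hypothetical command table built from the examples in this section.
COMMANDS: Dict[Tuple[str, ...], str] = {
    ("period",): ".",
    ("dollar", "sign"): "$",
    ("new", "line"): "[NEW_LINE]",
}


def apply_dictation(text: str) -> str:
    """Replace recognized dictation commands with their symbols."""
    words = text.split()
    out: List[str] = []
    i = 0
    while i < len(words):
        for length in (2, 1):  # prefer the longest matching command
            key = tuple(w.lower() for w in words[i:i + length])
            if len(key) == length and key in COMMANDS:
                out.append(COMMANDS[key])
                i += length
                break
        else:
            # No command starts at this position; keep the word as spoken.
            out.append(words[i])
            i += 1
    return " ".join(out)
```

Applied to the input above, `apply_dictation` yields the output line shown in the example.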

Content moderation

Profanity filter

Detects and censors profane words and phrases as audio is being transcribed. All letters except the first are masked, for example "f***".

Explore docs

Custom content moderation

Define any word or phrase you consider inappropriate to moderate content. The defined words and phrases are then automatically masked, except for the first letter.

Explore docs
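
The masking rule described above (keep the first letter, mask the rest) can be sketched as follows. This is a text-level illustration under stated assumptions, not the Soniox implementation. The names `mask_word` and `moderate` are hypothetical, and the way multi-word phrases are masked here (as one unit, spaces included) is an assumption.

```python
import re
from typing import Iterable


def mask_word(word: str) -> str:
    """Keep the first character and replace the rest with asterisks."""
    return word[:1] + "*" * (len(word) - 1)


def moderate(transcript: str, blocked: Iterable[str]) -> str:
    """Mask every whole-word, case-insensitive occurrence of a blocked term."""
    # Longer terms first, so a phrase wins over a shorter word it contains.
    terms = sorted((re.escape(t) for t in blocked), key=len, reverse=True)
    if not terms:
        return transcript
    pattern = re.compile(r"\b(?:" + "|".join(terms) + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: mask_word(m.group(0)), transcript)
```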

Domain specific models

Medical domain model

We offer a medical domain model for recognizing words that are common in medical settings, such as diagnoses, medications, symptoms, treatments, diseases, and anatomical parts.

Explore docs

IVR domain model

We offer an IVR speech model for applications that require capturing user data via voice. The IVR speech model recognizes and formats letters, digits, numbers, names, email addresses, phone numbers and zip codes.

Explore docs

Support for major languages

We build only high-accuracy speech and speaker AI solutions that let you transcribe any audio and get back highly accurate transcripts.

Soniox supports major languages, including English, Spanish, and German. More languages will be released in the following weeks.

See English benchmarks

See German and Spanish benchmarks

Ready to get started?

Explore Soniox Docs or create an account and start building your audio AI application. You can also contact us to design a custom package for your business.

Always know what you pay

Pay only for what you use. Integrated per-usage pricing with no hidden fees.

Pricing details

Start your integration

Get up and running with Soniox in as little as 5 minutes.

API reference