Speech Recognition AI
Most accurate speech recognition AI
We invented a new AI learning algorithm that recognizes speech with near-human accuracy and robustness in real-world environments. Compared to other providers, Soniox speech recognition AI is in a league of its own.
Any audio input and format
Files
Upload files and get back highly accurate transcripts within seconds to minutes. Turnaround stays fast even with a large number of files.
Audio format
Soniox automatically detects most common audio formats including mp3, wav, flac, ogg, aac, aiff, amr, asf, and raw PCM samples.
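To illustrate what automatic format detection involves, here is a minimal sketch that guesses a format from the first bytes of a file using well-known signatures. It is not Soniox's implementation; the function name `sniff_audio_format` is hypothetical. Raw PCM has no header, so it cannot be identified this way and falls through to "unknown".

```python
def sniff_audio_format(header: bytes) -> str:
    """Guess an audio format from the leading bytes of a file (illustrative only)."""
    if header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return "wav"
    if header[:4] == b"fLaC":
        return "flac"
    if header[:4] == b"OggS":
        return "ogg"
    if header[:4] == b"FORM" and header[8:12] in (b"AIFF", b"AIFC"):
        return "aiff"
    if header[:5] == b"#!AMR":
        return "amr"
    if header[:4] == bytes.fromhex("3026b275"):  # start of the ASF header GUID
        return "asf"
    if header[:3] == b"ID3":  # ID3v2 tag, typically found at the start of mp3 files
        return "mp3"
    if len(header) >= 2 and header[0] == 0xFF:
        if (header[1] & 0xF6) == 0xF0:  # ADTS sync word with layer bits 00
            return "aac"
        if (header[1] & 0xE0) == 0xE0:  # MPEG audio frame sync (11 set bits)
            return "mp3"
    return "unknown"  # headerless data such as raw PCM lands here
```

With headerless input, the caller would have to state the sample rate, sample width, and channel count explicitly.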
Live streams
Transcribe live streams with the highest accuracy and sub-200 ms latency, for an auto-captioning experience with the highest comprehension quality.
Multi-channel audio
Merge multiple channels into a single channel, or transcribe each channel independently, with a single API call.
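The difference between the two modes can be sketched on interleaved PCM samples. This is an illustration only, assuming samples arrive as a flat list of integers in channel order; the helper names are hypothetical and do not come from the Soniox API.

```python
def split_channels(interleaved, num_channels):
    """Return one sample list per channel from interleaved PCM samples."""
    return [interleaved[c::num_channels] for c in range(num_channels)]


def merge_to_mono(interleaved, num_channels):
    """Average the channels of each frame into a single mono sample."""
    channels = split_channels(interleaved, num_channels)
    # Floor division keeps the result an integer, like the input samples.
    return [sum(frame) // num_channels for frame in zip(*channels)]
```

Splitting keeps each speaker's channel separate (useful for two-sided call recordings), while merging produces a single stream to transcribe.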
Complete transcription result
Soniox returns a complete transcription result, including the recognized words, timestamps, confidence scores and speaker tags.
In streaming speech recognition, Soniox returns "interim results" containing final words and non-final words. Non-final words can still change as more audio is transcribed.
{
  "text": "YouTube",
  "start_ms": 1450,
  "duration_ms": 350,
  "is_final": true,
  "speaker": 1,
  "confidence": 0.98
}
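A client consuming interim results usually keeps final words and replaces non-final ones on every update. The sketch below shows one way to do that using the fields from the example above. It assumes each interim result carries only newly finalized words plus the current non-final words; the `Word` and `TranscriptBuffer` classes are hypothetical and not part of the Soniox library.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start_ms: int
    duration_ms: int
    is_final: bool
    speaker: int
    confidence: float


class TranscriptBuffer:
    """Accumulates final words and tracks the latest non-final words."""

    def __init__(self):
        self.final_words = []
        self.pending_words = []

    def update(self, words):
        # Final words never change, so they are appended once.
        self.final_words.extend(w for w in words if w.is_final)
        # Non-final words may be revised, so each result replaces the old ones.
        self.pending_words = [w for w in words if not w.is_final]

    def text(self):
        return " ".join(w.text for w in self.final_words + self.pending_words)
```

A caption display could render `final_words` in a stable style and `pending_words` in a lighter one, so viewers can see which text may still change.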
Speech customization
We invented a novel procedure that customizes the speech recognition AI to a specified context on the fly. Simply provide a list of words and phrases, and Soniox will recognize them when they are spoken in the audio.
You can also re-format the words or phrases to your liking. For example, "twenty three and me => 23andMe" renders the spoken phrase "twenty three and me" as "23andMe".
We also support storing speech customizations in our cloud: create a speech customization once, then reuse it across many audio files.
speech_context = SpeechContext(
    entries=[
        SpeechContextEntry(
            phrases=["twenty three and me => 23andMe"],
            boost=10,
        )
    ]
)
# Pass speech context to transcribe API call.
result = transcribe_file_short(
    "../test_data/youtube_23andme.flac",
    client,
    speech_context=speech_context,
)
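The "spoken => written" phrase syntax can be illustrated with a small text-level sketch. Soniox applies customization during recognition itself, so this is only an approximation of the visible effect on a transcript; `parse_phrase` and `apply_phrases` are hypothetical helpers, not Soniox functions.

```python
import re


def parse_phrase(phrase):
    """Split 'spoken => written' into its two parts.

    A phrase without '=>' maps to itself.
    """
    spoken, sep, written = phrase.partition("=>")
    spoken = spoken.strip()
    return spoken, (written.strip() if sep else spoken)


def apply_phrases(text, phrases):
    """Rewrite each spoken form found in text to its written form."""
    for phrase in phrases:
        spoken, written = parse_phrase(phrase)
        pattern = r"\b" + re.escape(spoken) + r"\b"
        # A function replacement avoids backslash interpretation in `written`.
        text = re.sub(pattern, lambda m, w=written: w, text, flags=re.IGNORECASE)
    return text
```

For instance, `apply_phrases("a kit from twenty three and me", ["twenty three and me => 23andMe"])` rewrites the spoken form to "23andMe".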
Dictation mode
Dictation mode enables you to use your voice to type and format text. When a dictation command is recognized, it is mapped to the corresponding symbol. For example, the word "period" appears in the transcript as "." and "dollar sign" as "$".
# Input:
This is cool period new line I am voice typing
# Output:
This is cool . [NEW_LINE] I am voice typing
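The mapping from spoken commands to symbols can be sketched as a longest-match lookup over the recognized words. The command table below contains only the three commands mentioned in this section; the table and the `apply_dictation` function are illustrative assumptions, not the Soniox command set or API.

```python
# Spoken command (as a tuple of lowercase words) -> symbol in the transcript.
DICTATION_COMMANDS = {
    ("new", "line"): "[NEW_LINE]",
    ("dollar", "sign"): "$",
    ("period",): ".",
}


def apply_dictation(words):
    """Replace dictation commands in a word list, preferring longer commands."""
    output = []
    i = 0
    longest = max(len(command) for command in DICTATION_COMMANDS)
    while i < len(words):
        # Try the longest possible command starting at position i first.
        for n in range(min(longest, len(words) - i), 0, -1):
            key = tuple(word.lower() for word in words[i:i + n])
            if key in DICTATION_COMMANDS:
                output.append(DICTATION_COMMANDS[key])
                i += n
                break
        else:
            # No command starts here, so keep the word unchanged.
            output.append(words[i])
            i += 1
    return output
```

Applied to the words of the input sentence above and joined with spaces, this produces the output line shown.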
Content moderation
Profanity filter
Detects and censors profane words and phrases as audio is being transcribed. All letters except the first are masked, for example "f***".
Custom content moderation
Define any inappropriate word or phrase to moderate content. The defined words and phrases are then automatically masked except for the first letter.
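The masking rule (keep the first letter, mask the rest) can be sketched as follows. This is a text-level illustration under the assumption that each word of a multi-word phrase is masked separately; `mask_word` and `moderate` are hypothetical names, not Soniox functions.

```python
import re


def mask_word(word):
    """Keep the first character and replace the rest with asterisks."""
    return word[:1] + "*" * (len(word) - 1)


def moderate(text, blocked_terms):
    """Mask every occurrence of the blocked words or phrases in text."""
    if not blocked_terms:
        return text
    # Longer terms come first so phrases win over the words they contain.
    terms = sorted(blocked_terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(
        lambda m: " ".join(mask_word(w) for w in m.group(0).split(" ")),
        text,
    )
```

The same function covers both the built-in profanity filter idea and a custom list: only the contents of `blocked_terms` differ.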
Domain specific models
Medical domain model
We offer a medical domain model for recognizing words that are common in medical settings, such as diagnoses, medications, symptoms, treatments, diseases and anatomical parts.
IVR domain model
We offer an IVR speech model for applications that require capturing user data via voice. The IVR speech model recognizes and formats letters, digits, numbers, names, email addresses, phone numbers and zip codes.
Support for major languages
We build only high accuracy speech and speaker AI solutions that enable you to transcribe any audio and get back highly accurate transcripts.
Soniox supports major languages, including English, Spanish and German. More languages will be released in the coming weeks.
Ready to get started?
Explore Soniox Docs or create an account and start building your audio AI application. You can also contact us to design a custom package for your business.
Always know what you pay
Pay only for what you use. Integrated per-usage pricing with no hidden fees.