Speech Recognition

World’s most accurate speech recognition AI, built from scratch.

SamsungDeepScribeAvodahMedOneAIScribeTranscribeMeAgoraDeliverHealth

Use any audio input and format

Live streams

Transcribe live streams with the highest accuracy and sub-200 ms latency.
Get the best auto-captioning experience, with the highest comprehension quality.

Files

Upload files and get back highly accurate transcripts within seconds to minutes. Turnaround stays fast even with a large number of files.

Audio format

Soniox automatically detects most common audio formats, including mp3, wav, flac, ogg, aac, aiff, amr, asf, and raw PCM samples.

Complete result

Soniox returns a complete transcription result, including the recognized words, timestamps, confidence scores, and speaker tags.

In streaming speech recognition, Soniox returns "interim results" containing final words and non-final words. Non-final words can still change as more audio is transcribed.

{
  text: "YouTube";
  start_ms: 1450;
  duration_ms: 350;
  is_final: true;
  speaker: 1;
  confidence: 0.98;
}
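
How final and non-final words combine into a running transcript can be sketched as follows. This is a minimal illustration, not the Soniox client library: the `Word` dataclass and the `TranscriptBuffer` helper are hypothetical names that use only the fields shown in the example result above. It assumes that each interim result carries the newly finalized words plus the current non-final words.

```python
from dataclasses import dataclass, field


@dataclass
class Word:
    # Hypothetical container mirroring the fields of the example result.
    text: str
    start_ms: int
    duration_ms: int
    is_final: bool
    speaker: int = 0
    confidence: float = 1.0


@dataclass
class TranscriptBuffer:
    # Hypothetical helper: accumulates final words, tracks the latest non-final words.
    final_words: list = field(default_factory=list)
    non_final_words: list = field(default_factory=list)

    def update(self, words):
        # Final words never change, so they are appended once.
        self.final_words.extend(w for w in words if w.is_final)
        # Non-final words may be revised, so each result replaces the previous set.
        self.non_final_words = [w for w in words if not w.is_final]

    def text(self):
        return " ".join(w.text for w in self.final_words + self.non_final_words)


buffer = TranscriptBuffer()

# First interim result: "love" is still a guess.
buffer.update([Word("I", 0, 100, True), Word("love", 150, 200, False)])
first = buffer.text()

# Second interim result: the guess was revised to "like" and finalized.
buffer.update([Word("like", 150, 200, True), Word("YouTube", 1450, 350, False)])
second = buffer.text()
```

A live-captioning display would typically render the non-final words in a lighter style, since they may be replaced by the next result.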

Speech customization

We invented a novel procedure that customizes the speech recognition AI to a specified context on the fly. Simply provide a list of words and phrases, and Soniox will recognize them when they are spoken in the audio.

# Create speech context on-the-fly.
speech_context = SpeechContext(
    entries=[
        SpeechContextEntry(
            phrases=["acetylcarnitine", "Zestoretic"],
            boost=15,
        )
    ]
)
# Pass speech context to transcribe API call.
result = transcribe_file_short(
    "../test_data/acetylcarnitine_zestoretic.flac",
    client,
    model="en_v2",
    speech_context=speech_context,
)

Speaker diarization

Speaker diarization recognizes the different speakers in audio and outputs a speaker-attributed transcription result. It requires no additional input: speakers are distinguished based on the audio alone.

Speaker 1: Hi, good morning!

Speaker 2: Hi! What can I get for you?

Speaker 1: A latte with almond milk, please.

Speaker 2: Sure. Would you like any flavouring?

Speaker 1: Caramel sounds good. Yes, please.

Speaker 2: Great. Your total comes to $4.50.
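
A speaker-attributed transcript like the one above can be assembled from per-word speaker tags. The following is a minimal sketch, not Soniox's own formatting code: the `speaker_turns` function is a hypothetical helper, and the input is assumed to be a list of `(speaker, text)` pairs taken from the `speaker` and `text` fields of each recognized word.

```python
from itertools import groupby


def speaker_turns(words):
    """Merge consecutive words with the same speaker tag into one line per turn."""
    turns = []
    # groupby only merges adjacent items, so a speaker who talks again later
    # starts a new turn rather than being folded into an earlier one.
    for speaker, group in groupby(words, key=lambda w: w[0]):
        sentence = " ".join(text for _, text in group)
        turns.append(f"Speaker {speaker}: {sentence}")
    return turns


# Hypothetical word-level output: (speaker tag, word text).
words = [
    (1, "Hi,"), (1, "good"), (1, "morning!"),
    (2, "Hi!"), (2, "What"), (2, "can"), (2, "I"), (2, "get"), (2, "for"), (2, "you?"),
    (1, "A"), (1, "latte"), (1, "with"), (1, "almond"), (1, "milk,"), (1, "please."),
]

turns = speaker_turns(words)
for line in turns:
    print(line)
```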

Major languages

We have the world's most accurate speech recognition AI for English, Korean, Chinese, Japanese, Vietnamese, German, Spanish, French, Italian, and Portuguese (see benchmarks). For each language, we offer AI models for both async (batch) and low-latency speech recognition.

Soniox’s speech recognition AI is bilingual: each model can recognize both its native language and English in the same audio.

Reliable and scalable cloud service

We built the entire cloud service infrastructure from scratch to process massive volumes of audio with large AI models.

The Soniox cloud service scales automatically with real-time load and gracefully handles daily peaks and busy days.

100M+ minutes per month processed
99.99% historical uptime
<200ms latency of speech recognition
100+ trusted customers

Getting started

Developers can start building with the Speech Recognition playground or by following our docs.
