Use any audio input and format
Live streams
Transcribe live streams with the highest accuracy and sub-200 ms latency.
Best auto-captioning experience with the highest comprehension quality.
Files
Upload files and get back highly accurate transcripts within seconds to minutes. Turnaround stays fast even with a large number of files.
Audio format
Soniox automatically detects most common audio formats, including mp3, wav, flac, ogg, aac, aiff, amr, asf, and raw PCM samples.
Complete result
Soniox returns a complete transcription result that includes the recognized words, timestamps, confidence scores, and speaker tags.
In streaming speech recognition, Soniox returns "interim results" as more audio is transcribed. Each interim result contains final words, which will not change, and non-final words, which may still change.
{
  text: "YouTube";
  start_ms: 1450;
  duration_ms: 350;
  is_final: true;
  speaker: 1;
  confidence: 0.98;
}
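To illustrate how a client might consume interim results, here is a minimal sketch in Python. It is not the Soniox client library: the `Word` and `TranscriptBuffer` names are hypothetical, and the fields simply mirror the sample result above. It also assumes that each interim result delivers newly finalized words once, followed by the current non-final words, which should be checked against the docs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    # Hypothetical container; field names mirror the sample result above.
    text: str
    start_ms: int
    duration_ms: int
    is_final: bool
    speaker: int = 0
    confidence: float = 1.0

    @property
    def end_ms(self) -> int:
        # End time derived from start time plus duration.
        return self.start_ms + self.duration_ms


class TranscriptBuffer:
    """Accumulates final words and keeps only the latest non-final words."""

    def __init__(self) -> None:
        self.final_words: List[Word] = []
        self.non_final_words: List[Word] = []

    def update(self, interim_words: List[Word]) -> None:
        # Assumption: final words arrive exactly once, so they are appended.
        # Non-final words may change, so the previous ones are discarded.
        self.final_words.extend(w for w in interim_words if w.is_final)
        self.non_final_words = [w for w in interim_words if not w.is_final]

    def text(self) -> str:
        # Current best transcript: stable words followed by tentative ones.
        words = self.final_words + self.non_final_words
        return " ".join(w.text for w in words)


if __name__ == "__main__":
    buf = TranscriptBuffer()
    buf.update([Word("Hello", 0, 300, True), Word("word", 350, 300, False)])
    print(buf.text())
    buf.update([Word("world", 350, 320, True)])
    print(buf.text())
```

Keeping final and non-final words in separate lists lets a caption display render the stable part once and redraw only the tentative tail.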
Speech customization
We invented a novel procedure that effectively and on-the-fly customizes speech recognition AI to the specified context. Simply provide a list of words and phrases and Soniox will automatically recognize them when spoken in audio.
# Create speech context on-the-fly.
speech_context = SpeechContext(
    entries=[
        SpeechContextEntry(
            phrases=["acetylcarnitine", "Zestoretic"],
            boost=15,
        )
    ]
)
# Pass speech context to transcribe API call.
result = transcribe_file_short(
    "../test_data/acetylcarnitine_zestoretic.flac",
    client,
    model="en_v2",
    speech_context=speech_context,
)
Speaker diarization
Speaker diarization recognizes different speakers in audio and outputs a speaker-attributed transcription result. It requires no additional input: speakers are distinguished based on the audio alone.
Speaker 1: Hi, good morning!
Speaker 2: Hi! What can I get for you?
Speaker 1: A latte with almond milk, please.
Speaker 2: Sure. Would you like any flavouring?
Speaker 1: Caramel sounds good. Yes, please.
Speaker 2: Great. Your total comes to $4.50.
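A speaker-attributed transcript like the one above can be assembled from per-word speaker tags. The following is a minimal sketch under stated assumptions: words are plain dictionaries with `text` and `speaker` keys (mirroring the result fields shown earlier), and `speaker_turns` is a hypothetical helper, not part of the Soniox library.

```python
from itertools import groupby
from typing import Dict, List


def speaker_turns(words: List[Dict]) -> List[str]:
    """Group consecutive words with the same speaker tag into one line each."""
    lines = []
    # groupby only merges adjacent items, so a speaker who talks again
    # later starts a new turn instead of being merged with an earlier one.
    for speaker, group in groupby(words, key=lambda w: w["speaker"]):
        text = " ".join(w["text"] for w in group)
        lines.append(f"Speaker {speaker}: {text}")
    return lines


if __name__ == "__main__":
    sample = [
        {"text": "Hi,", "speaker": 1},
        {"text": "good", "speaker": 1},
        {"text": "morning!", "speaker": 1},
        {"text": "Hi!", "speaker": 2},
        {"text": "A", "speaker": 1},
        {"text": "latte,", "speaker": 1},
        {"text": "please.", "speaker": 1},
    ]
    for line in speaker_turns(sample):
        print(line)
```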
Major languages
We have the world's most accurate speech recognition AI for English, German, Korean, Chinese, Spanish, French, Italian, and Portuguese (see benchmarks). For each language, we offer AI models for async (batch) and low-latency speech recognition.
Soniox’s speech recognition AI is bilingual: each model can recognize both its native language and English in the same audio.
Reliable and scalable cloud service
We built the entire cloud service infrastructure from scratch to process massive volumes of audio with large AI models.
The Soniox cloud service autoscales with real-time load and gracefully handles peaks during the day and on busy days.
Getting started
Developers can start building with the Speech Recognition playground or by following our docs.