Language identification
Learn how to identify one or more spoken languages in audio.
Overview
Soniox Speech-to-Text AI can automatically identify spoken languages within an audio stream — whether the audio contains a single language or multiple mixed languages. This powerful feature allows you to handle real-world, multilingual speech naturally and accurately, without requiring the user to specify languages in advance.
Language identification is designed to work seamlessly in both real-time and asynchronous transcription modes.
How it works
Language identification in Soniox is performed at the token level, meaning each token in the transcript carries its own language. However, the model is trained to assign languages in a way that is consistent with the surrounding sentence, not just based on isolated words or short phrases.
This means:
- Each token is labeled individually, but the model favors sentence-level coherence when assigning language codes.
- Short phrases or embedded words in a different language (e.g., greetings, interjections) do not typically result in a language switch unless the majority of the sentence is in that language.
- The goal is to produce natural, intelligible output that reflects how humans interpret language shifts in real speech.
Examples
Example 1: Embedded foreign phrase
All tokens are labeled as English, even though “amigo” is Spanish.
Example 2: Distinct sentences in different languages
This sentence-aligned behavior ensures transcripts remain natural and easy to interpret, especially in real-world multilingual conversations where code-switching is common.
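Because language is assigned per token but stays consistent within a sentence, consecutive tokens can be merged into per-language segments. The following is a minimal sketch of that idea, not Soniox's own code. The sample tokens are hypothetical, and the text field is an assumption; only the language field is described on this page.

```python
from itertools import groupby

# Hypothetical tokens for illustration. Only the "language" field is described
# on this page; the "text" field and the sample sentences are assumptions.
tokens = [
    {"text": "How", "language": "en"},
    {"text": " are", "language": "en"},
    {"text": " you?", "language": "en"},
    {"text": " Muy", "language": "es"},
    {"text": " bien,", "language": "es"},
    {"text": " gracias.", "language": "es"},
]


def group_by_language(tokens):
    """Merge consecutive tokens that share a language code into segments."""
    segments = []
    for language, run in groupby(tokens, key=lambda t: t["language"]):
        text = "".join(t["text"] for t in run).strip()
        segments.append((language, text))
    return segments


for language, text in group_by_language(tokens):
    print(f"[{language}] {text}")
```

Grouping on consecutive runs, rather than collecting all tokens per language, preserves the order in which the speaker switched languages.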
Enabling language identification
To enable automatic language identification, set the language identification parameter in your API request.
This feature is supported in both:
- Asynchronous transcription
- Real-time transcription
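As a rough sketch, the request configuration might be built like this. The parameter name enable_language_identification is an assumption here, since the original snippet is not part of this page; check the API reference for the exact key before relying on it.

```python
import json


def build_config(enable_language_identification=True):
    """Return a request config dict. The key name is an assumption, not confirmed by this page."""
    return {
        # Assumed parameter name for turning on language identification.
        "enable_language_identification": enable_language_identification,
    }


config = build_config()
print(json.dumps(config, indent=2))
```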
Output format
When enabled, each token in the response includes a language field.
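To illustrate reading that field, the sketch below parses a hypothetical response and lists the language codes it contains. The payload shape (a tokens array whose entries have a text field) is an assumption made for the example; only the per-token language field comes from this page.

```python
import json

# Hypothetical payload: the "tokens" wrapper and "text" field are assumptions.
raw_response = """
{
  "tokens": [
    {"text": "Good", "language": "en"},
    {"text": " morning.", "language": "en"},
    {"text": " Guten", "language": "de"},
    {"text": " Morgen.", "language": "de"}
  ]
}
"""


def languages_in_response(raw):
    """Return the language codes found in a response, in order of first appearance."""
    seen = []
    for token in json.loads(raw).get("tokens", []):
        code = token.get("language")
        if code is not None and code not in seen:
            seen.append(code)
    return seen


print(languages_in_response(raw_response))
```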
Real-time considerations
Real-time language identification is inherently more challenging due to low-latency constraints. The model has less future context to rely on when making decisions, which can lead to:
- Temporary misidentification of language
- Language code revisions as more speech context becomes available
Despite this, Soniox remains highly effective in recognizing language switches even in live scenarios.
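One way a client could cope with revised language codes is to treat non-final tokens as provisional and replace them on every update. The sketch below assumes each token carries an is_final flag and that every update resends the current provisional tokens; both are assumptions for illustration, and the updates are simulated rather than read from a live stream.

```python
def apply_update(final_tokens, update):
    """Fold one streaming update into the running transcript.

    Returns (final_tokens, provisional_tokens). Provisional tokens from earlier
    updates are discarded, so a revised language code replaces the old one.
    """
    provisional = []
    for token in update:
        if token["is_final"]:  # assumed flag, not described on this page
            final_tokens.append(token)
        else:
            provisional.append(token)
    return final_tokens, provisional


# Simulated, hypothetical updates: the first guess for "Bonjour" is revised
# from "en" to "fr" once more speech context is available.
updates = [
    [{"text": "Bonjour", "language": "en", "is_final": False}],
    [{"text": "Bonjour", "language": "fr", "is_final": False},
     {"text": " tout", "language": "fr", "is_final": False}],
    [{"text": "Bonjour", "language": "fr", "is_final": True},
     {"text": " tout", "language": "fr", "is_final": True},
     {"text": " le", "language": "fr", "is_final": False}],
]

final_tokens, provisional = [], []
for update in updates:
    final_tokens, provisional = apply_update(final_tokens, update)
    print([(t["text"], t["language"]) for t in final_tokens + provisional])
```

Rendering final and provisional tokens together on each update means a temporary misidentification disappears from the display as soon as the revised tokens arrive.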
Best practices
- Use language_hints when you know the likely languages ahead of time (for improved accuracy)
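As a hedged sketch, hints could be added to a request configuration as shown below. The language_hints name comes from this page, but its value format (a list of ISO language codes) and the enable_language_identification key are assumptions; with_language_hints is a hypothetical helper, not part of any SDK.

```python
def with_language_hints(config, hints):
    """Return a copy of config with a normalized, de-duplicated language_hints list."""
    normalized = []
    for code in hints:
        code = code.strip().lower()
        if code and code not in normalized:
            normalized.append(code)
    updated = dict(config)  # copy so the caller's config is left untouched
    if normalized:
        # Assumed format: a list of ISO language codes.
        updated["language_hints"] = normalized
    return updated


base = {"enable_language_identification": True}  # assumed key name
config = with_language_hints(base, ["EN", "es", " en "])
print(config)
```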
Supported languages
Soniox supports 60+ languages for automatic detection. See the full list and ISO codes on the Supported languages page.
Example
This example demonstrates how to transcribe a stream with automatic language identification.