Language identification

Learn how to identify one or more spoken languages within an audio stream.

Overview

Soniox Speech-to-Text AI can automatically identify spoken languages within an audio stream — whether the speech is entirely in one language or mixes multiple languages. This lets you handle real-world multilingual conversations naturally and accurately, without requiring users to specify languages in advance.


How it works

Language identification in Soniox is performed at the token level: each token in the transcript is tagged with a language code. The model is also trained to maintain sentence-level coherence rather than deciding word by word, so a single borrowed word usually does not change the language tag of the sentence around it.


Examples

Example 1: embedded foreign word

[en] Hello, my dear amigo, how are you doing?

All tokens are labeled as English (en), even though “amigo” is Spanish.

Example 2: distinct sentences in multiple languages

[en] How are you?
[de] Guten Morgen!
[es] Cómo está everyone?
[en] Great! Let’s begin with the agenda.

Here, language tags align with sentence boundaries, making the transcript easier to read and interpret in multilingual conversations.


Enabling language identification

Enable automatic language identification by setting the enable_language_identification flag to true in your request:

{
  "enable_language_identification": true
}
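
As a minimal sketch, the flag can be merged into whatever request configuration you already build. The helper name with_language_identification and the "model" placeholder below are illustrative assumptions, not part of the Soniox API; only enable_language_identification comes from this page.

```python
import json


def with_language_identification(config):
    """Return a copy of a request config with language identification enabled.

    `config` is any dict of request options; the input is not modified.
    """
    new_config = dict(config)
    new_config["enable_language_identification"] = True
    return new_config


# Hypothetical base config; replace the placeholder with your real options.
base_config = {"model": "your-model-name"}
request_config = with_language_identification(base_config)

# Serialize the config the way it would appear in a JSON request body.
print(json.dumps(request_config, indent=2))
```

Returning a copy keeps the base configuration reusable for requests that should not carry the flag.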

Output format

When enabled, each token includes a language field alongside the text:

{"text": "How",     "language": "en"}
{"text": " are",    "language": "en"}
{"text": " you",    "language": "en"}
{"text": "?",       "language": "en"}
{"text": "Gu",      "language": "de"}
{"text": "ten",     "language": "de"}
{"text": " Morgen", "language": "de"}
{"text": "!",       "language": "de"}
{"text": "Cómo",    "language": "es"}
{"text": " está",   "language": "es"}
{"text": " every",  "language": "es"}
{"text": "one",     "language": "es"}
{"text": "?",       "language": "es"}
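
To turn token-level tags back into a readable, language-labeled transcript, consecutive tokens that share a language can be merged into runs. The following is a minimal sketch that assumes the tokens have already been parsed into Python dicts with the text and language fields shown above; group_by_language is a hypothetical helper, not a Soniox SDK function.

```python
from itertools import groupby

# Tokens as shown in the output format above, parsed into dicts.
tokens = [
    {"text": "How", "language": "en"},
    {"text": " are", "language": "en"},
    {"text": " you", "language": "en"},
    {"text": "?", "language": "en"},
    {"text": "Gu", "language": "de"},
    {"text": "ten", "language": "de"},
    {"text": " Morgen", "language": "de"},
    {"text": "!", "language": "de"},
    {"text": "Cómo", "language": "es"},
    {"text": " está", "language": "es"},
    {"text": " every", "language": "es"},
    {"text": "one", "language": "es"},
    {"text": "?", "language": "es"},
]


def group_by_language(tokens):
    """Merge consecutive tokens with the same language into (language, text) runs."""
    runs = []
    for language, group in groupby(tokens, key=lambda t: t.get("language")):
        # Token texts carry their own leading spaces, so plain concatenation works.
        text = "".join(t["text"] for t in group).strip()
        runs.append((language, text))
    return runs


for language, text in group_by_language(tokens):
    print(f"[{language}] {text}")
```

Because groupby only merges adjacent tokens, a speaker who returns to an earlier language later in the audio starts a new run instead of being folded into the first one.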

Language hints

Use language hints whenever possible to improve the accuracy of language identification.


Real-time considerations

Real-time language identification is more challenging because of low-latency constraints. The model has less context available, which may cause:

  • Temporary misclassification of language.
  • Language tags being revised as more speech context arrives.

Despite this, Soniox provides highly reliable detection of language switches in real time.
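
One way to cope with revised language tags is to keep tentative tokens separate from settled ones and act only on the settled part. The sketch below assumes that each streaming update delivers a list of token dicts, and that each token carries an is_final flag marking whether it can still change. That flag and the LiveTranscript class are assumptions made for illustration and are not described on this page; check the real-time API reference for the actual response shape.

```python
class LiveTranscript:
    """Tracks settled and tentative tokens across streaming updates."""

    def __init__(self):
        self.final_tokens = []      # tokens that will no longer change
        self.tentative_tokens = []  # tokens that a later update may revise

    def update(self, tokens):
        """Apply one update: keep final tokens, replace all tentative ones."""
        self.tentative_tokens = []
        for token in tokens:
            if token.get("is_final"):  # assumed field, see the lead-in
                self.final_tokens.append(token)
            else:
                self.tentative_tokens.append(token)

    def stable_language(self):
        """Language of the most recent settled token, or None if none yet."""
        if not self.final_tokens:
            return None
        return self.final_tokens[-1].get("language")


live = LiveTranscript()

# First update: little context, so the tag is tentative and may be wrong.
live.update([{"text": "Gu", "language": "en", "is_final": False}])
print(live.stable_language())

# Second update: more context arrived and the tag was revised and settled.
live.update([
    {"text": "Gu", "language": "de", "is_final": True},
    {"text": "ten", "language": "de", "is_final": False},
])
print(live.stable_language())
```

Driving language-dependent behavior, such as routing or translation, from stable_language() rather than from tentative tokens avoids reacting to a temporary misclassification.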


Supported languages

Language identification is available for all supported languages.