Language identification

Learn how to identify one or more spoken languages within an audio stream.

Overview

Soniox Speech-to-Text AI can automatically identify spoken languages within an audio stream — whether the audio contains a single language or multiple mixed languages. This powerful feature allows you to handle real-world, multilingual speech naturally and accurately, without requiring the user to specify languages in advance.

Language identification is designed to work seamlessly in both real-time and asynchronous transcription modes.


How it works

Language identification in Soniox is performed at the token level, meaning each token in the transcript carries its own language. However, the model is trained to assign languages in a way that is consistent with the surrounding sentence — not just based on isolated words or short phrases.

This means:

  • Each token is labeled individually, but the model favors sentence-level coherence when assigning language codes.

  • Short phrases or embedded words in a different language (e.g., greetings, interjections) do not typically result in a language switch unless the majority of the sentence is in that language.

  • The goal is to produce natural, intelligible output that reflects how humans interpret language shifts in real speech.


Examples

Example 1: Embedded foreign phrase

[en] Hello, my dear amigo, how are you doing?

All tokens are labeled as English, even though “amigo” is Spanish.
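
At the token level, the labels for this utterance could look roughly like the following (an illustrative sample rather than verbatim API output; the actual tokenization may differ):

{
  "tokens": [
    { "text": "Hello,", "language": "en" },
    { "text": "my", "language": "en" },
    { "text": "dear", "language": "en" },
    { "text": "amigo,", "language": "en" },
    { "text": "how", "language": "en" },
    { "text": "are", "language": "en" },
    { "text": "you", "language": "en" },
    { "text": "doing?", "language": "en" }
  ]
}

Note that "amigo," still carries "language": "en" because the surrounding sentence is English.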

Example 2: Distinct sentences in different languages

[en] How are you?
[de] Guten Morgen!
[es] Cómo está everyone?
[en] Great! Let’s begin with the agenda.

This sentence-aligned behavior ensures transcripts remain natural and easy to interpret, especially in real-world multilingual conversations where code-switching is common.


Enabling language identification

To enable automatic language identification, set the following parameter in your API request:

{
  "enable_language_identification": true
}

This feature is supported in both:

  • Asynchronous transcription
  • Real-time transcription
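
In the real-time WebSocket API, for example, the parameter is passed as part of the start request alongside the rest of the configuration (a minimal sketch based on the full example at the end of this page; the model name and audio settings shown are illustrative):

import json

start_request = {
    "api_key": "<SONIOX_API_KEY>",
    "model": "stt-rt-preview",
    "audio_format": "pcm_s16le",
    "sample_rate": 16000,
    "num_channels": 1,
    "enable_language_identification": True,
}

# Sent as the first message after the WebSocket connection is opened:
# ws.send(json.dumps(start_request))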

Output format

When enabled, each token in the response includes a language field:

{
  "tokens": [
    { "text": "How", "language": "en" },
    { "text": "are", "language": "en" },
    { "text": "you?", "language": "en" },
    { "text": "Guten", "language": "de" },
    { "text": "Morgen!", "language": "de" },
    { "text": "Cómo", "language": "es" },
    { "text": "está", "language": "es" },
    { "text": "everyone?", "language": "es" }
  ]
}
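
To reproduce the sentence-aligned view shown in the examples above, consecutive tokens that share a language code can be merged into labeled segments. A minimal sketch, assuming token text includes its leading whitespace (as in the streaming example at the end of this page):

def group_by_language(tokens):
    # Merge consecutive tokens with the same language into (language, text) segments.
    segments = []
    for token in tokens:
        language = token.get("language")
        if segments and segments[-1][0] == language:
            segments[-1] = (language, segments[-1][1] + token["text"])
        else:
            segments.append((language, token["text"].lstrip()))
    return segments


tokens = [
    {"text": "How", "language": "en"},
    {"text": " are", "language": "en"},
    {"text": " you?", "language": "en"},
    {"text": " Guten", "language": "de"},
    {"text": " Morgen!", "language": "de"},
]

for language, text in group_by_language(tokens):
    print(f"[{language}] {text}")
# [en] How are you?
# [de] Guten Morgen!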

Real-time considerations

Real-time language identification is inherently more challenging due to low-latency constraints. The model has less future context to rely on when making decisions, which can lead to:

  • Temporary misidentification of language
  • Language code revisions as more speech context becomes available

Despite this, Soniox remains highly effective in recognizing language switches even in live scenarios.
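
A common way to handle this in a live display is to treat language codes on non-final tokens as provisional and only commit them once the tokens become final, which is what the full example at the end of this page does. A minimal sketch (the handle_response name and the render callback are illustrative):

final_tokens = []  # committed tokens: text and language will no longer change

def handle_response(res, render):
    # Non-final tokens (and their language codes) may still be revised,
    # so they are rebuilt from scratch on every message.
    non_final_tokens = []
    for token in res.get("tokens", []):
        if token.get("is_final"):
            final_tokens.append(token)
        else:
            non_final_tokens.append(token)
    render(final_tokens, non_final_tokens)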


Best practices

  • Use language_hints when you know the likely languages ahead of time to improve accuracy (see the sketch below)
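
Hints are passed alongside the language identification flag in the request configuration (mirroring the settings used in the full example at the end of this page):

config = {
    "model": "stt-rt-preview",
    "language_hints": ["en", "es"],          # likely languages for this audio
    "enable_language_identification": True,  # per-token language labels in the output
}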

Supported languages

Soniox supports 60+ languages for automatic detection. See the full list and ISO codes on the Supported languages page.


Example

This example demonstrates how to transcribe a stream with automatic language identification.

import json
import os
import threading
import time
 
from websockets import ConnectionClosedOK
from websockets.sync.client import connect
 
# Retrieve the API key from environment variable (ensure SONIOX_API_KEY is set)
api_key = os.environ.get("SONIOX_API_KEY")
websocket_url = "wss://stt-rt.soniox.com/transcribe-websocket"
file_to_transcribe = "two_way_translation.pcm_s16le"
 
 
def stream_audio(ws):
    with open(file_to_transcribe, "rb") as fh:
        while True:
            data = fh.read(3840)
            if len(data) == 0:
                break
            ws.send(data)
            time.sleep(0.12)  # sleep for 120 ms
    ws.send("")  # signal end of stream
 
 
def render_tokens(final_tokens, non_final_tokens):
    # Render the tokens in the terminal using ANSI escape codes.
    text = ""
    text += "\033[2J\033[H"  # clear the screen, move to top-left corner
    is_final = True
    language = ""
    for token in final_tokens + non_final_tokens:
        token_text = token["text"]
        if not token["is_final"] and is_final:
            text += "\033[34m"  # change text color to blue
            is_final = False
        if token.get("language") and token["language"] != language:
            if language:
                text += "\n"
            language = token["language"]
            text += f"[{language}] "
            token_text = token_text.lstrip()
        text += token_text
    text += "\033[39m"  # reset text color
    print(text)
 
 
def main():
    print("Opening WebSocket connection...")
 
    with connect(websocket_url) as ws:
        # Send start request
        ws.send(
            json.dumps(
                {
                    "api_key": api_key,
                    "audio_format": "pcm_s16le",
                    "sample_rate": 16000,
                    "num_channels": 1,
                    "model": "stt-rt-preview",
                    "language_hints": ["en", "es"],
                    "enable_language_identification": True,
                }
            )
        )
 
        # Start streaming audio in background
        threading.Thread(target=stream_audio, args=(ws,), daemon=True).start()
 
        print("Transcription started")
 
        final_tokens = []
 
        try:
            while True:
                message = ws.recv()
                res = json.loads(message)
 
                if res.get("error_code"):
                    print(f"Error: {res['error_code']} - {res['error_message']}")
                    break
 
                non_final_tokens = []
 
                for token in res.get("tokens", []):
                    if token.get("text"):
                        if token.get("is_final"):
                            final_tokens.append(token)
                        else:
                            non_final_tokens.append(token)
 
                render_tokens(final_tokens, non_final_tokens)
 
                if res.get("finished"):
                    print("\nTranscription complete.")
        except ConnectionClosedOK:
            pass
        except Exception as e:
            print(f"Error: {e}")
 
 
if __name__ == "__main__":
    main()