Soniox
IntegrationsLangChain

LangChain (Python)

Soniox document loader for LangChain

Soniox x Langchain

Overview

LangChain is a popular framework for building applications powered by large language models (LLMs). The langchain-soniox package provides a document loader that transcribes audio files using Soniox's speech-to-text API, making it easy to incorporate audio transcription into your LangChain pipelines.

Setup

Install the package:

pip install langchain-soniox

Credentials

Get your Soniox API key from the Soniox Console and set it as an environment variable:

export SONIOX_API_KEY=your_api_key

Usage

Basic transcription

Transcribe audio files using the SonioxDocumentLoader:

from langchain_soniox import SonioxDocumentLoader

# Using a URL
loader = SonioxDocumentLoader(
    file_url="https://soniox.com/media/examples/coffee_shop.mp3"
)

docs = list(loader.lazy_load())
print(docs[0].page_content)  # Transcribed text

You can also load audio from a local file or from bytes:

# Using a local file path
loader = SonioxDocumentLoader(file_path="/path/to/audio.mp3")

# Using binary data
with open("/path/to/audio.mp3", "rb") as f:
    audio_bytes = f.read()
loader = SonioxDocumentLoader(file_data=audio_bytes)

Async loading

For async operations, use alazy_load():

import asyncio
from langchain_soniox import SonioxDocumentLoader

async def transcribe_async():
    loader = SonioxDocumentLoader(
        file_url="https://soniox.com/media/examples/coffee_shop.mp3"
    )

    docs = [doc async for doc in loader.alazy_load()]
    print(docs[0].page_content)

asyncio.run(transcribe_async())

Advanced usage

Language hints

Soniox automatically detects and transcribes speech in 60+ languages. When you know which languages are likely to appear in your audio, provide language_hints to improve accuracy by biasing recognition toward those languages.

Language hints do not restrict recognition — they only bias the model toward the specified languages, while still allowing other languages to be detected if present.

from langchain_soniox import (
    SonioxDocumentLoader,
    SonioxTranscriptionOptions,
)

loader = SonioxDocumentLoader(
    file_url="https://soniox.com/media/examples/coffee_shop.mp3",
    options=SonioxTranscriptionOptions(
        language_hints=["en", "es"],
    ),
)

docs = list(loader.lazy_load())

For more details, see the Soniox language hints documentation.

Speaker diarization

Enable speaker identification to distinguish between different speakers:

from langchain_soniox import (
    SonioxDocumentLoader,
    SonioxTranscriptionOptions,
)

loader = SonioxDocumentLoader(
    file_url="https://soniox.com/media/examples/coffee_shop.mp3",
    options=SonioxTranscriptionOptions(
        enable_speaker_diarization=True,
    ),
)

docs = list(loader.lazy_load())

# Access speaker information in the metadata
current_speaker = None
output = ""
for token in docs[0].metadata["tokens"]:
    if current_speaker != token["speaker"]:
        current_speaker = token["speaker"]
        output += f"\nSpeaker {current_speaker}: {token['text'].lstrip()}"
    else:
        output += token["text"]
print(output)

Language identification

Enable automatic language detection and identification:

from langchain_soniox import (
    SonioxDocumentLoader,
    SonioxTranscriptionOptions,
)

loader = SonioxDocumentLoader(
    file_url="https://soniox.com/media/examples/coffee_shop.mp3",
    options=SonioxTranscriptionOptions(
        enable_language_identification=True,
    ),
)

docs = list(loader.lazy_load())

# Access language information in the metadata
current_language = None
output = ""
for token in docs[0].metadata["tokens"]:
    if current_language != token["language"]:
        current_language = token["language"]
        output += f"\n[{current_language}] {token['text'].lstrip()}"
    else:
        output += token["text"]
print(output)

Context for improved accuracy

Provide domain-specific context to improve transcription accuracy. Context helps the model understand your domain, recognize important terms, and apply custom vocabulary.

The context object supports four optional sections:

from langchain_soniox import (
    SonioxDocumentLoader,
    SonioxTranscriptionOptions,
    StructuredContext,
    StructuredContextGeneralItem,
    StructuredContextTranslationTerm,
)

loader = SonioxDocumentLoader(
    file_url="https://soniox.com/media/examples/coffee_shop.mp3",
    options=SonioxTranscriptionOptions(
        context=StructuredContext(
            # Structured key-value information (domain, topic, intent, etc.)
            general=[
                StructuredContextGeneralItem(key="domain", value="Healthcare"),
                StructuredContextGeneralItem(
                    key="topic", value="Diabetes management consultation"
                ),
                StructuredContextGeneralItem(key="doctor", value="Dr. Martha Smith"),
            ],
            # Longer free-form background text or related documents
            text="The patient has a history of...",
            # Domain-specific or uncommon words
            terms=["Celebrex", "Zyrtec", "Xanax"],
            # Custom translations for ambiguous terms
            translation_terms=[
                StructuredContextTranslationTerm(
                    source="Mr. Smith", target="Sr. Smith"
                ),
                StructuredContextTranslationTerm(source="MRI", target="RM"),
            ],
        ),
    ),
)

docs = list(loader.lazy_load())

For more details, see the Soniox context documentation.

Translation

Translate from any detected language to a target language:

from langchain_soniox import (
    SonioxDocumentLoader,
    SonioxTranscriptionOptions,
    TranslationConfig,
)

loader = SonioxDocumentLoader(
    file_url="https://soniox.com/media/examples/coffee_shop.mp3",
    options=SonioxTranscriptionOptions(
        translation=TranslationConfig(
            type="one_way",
            target_language="fr",
        ),
        language_hints=["en"],
    ),
)

docs = list(loader.lazy_load())

for token in docs[0].metadata["tokens"]:
    if token["translation_status"] == "translation":
        translated_text += token["text"]
    else:
        original_text += token["text"]

print(original_text)
print(translated_text)

You can also transcribe and translate between two languages simultaneously using two_way translation type. Learn more about translation here.

API reference

Constructor parameters

ParameterTypeRequiredDefaultDescription
file_pathstrNo*NonePath to local audio file to transcribe
file_databytesNo*NoneBinary data of audio file to transcribe
file_urlstrNo*NoneURL of audio file to transcribe
api_keystrNoSONIOX_API_KEY env varSoniox API key
base_urlstrNohttps://api.soniox.com/v1API base URL (see regional endpoints)
optionsSonioxTranscriptionOptionsNoSonioxTranscriptionOptions()Transcription options
polling_interval_secondsfloatNo1.0Time between status polls (seconds)
timeout_secondsfloatNo300.0 (5 minutes)Maximum time to wait for transcription
http_request_timeout_secondsfloatNo60.0Timeout for individual HTTP requests

* You must specify exactly one of: file_path, file_data, or file_url.

Transcription options

The SonioxTranscriptionOptions class supports these parameters:

ParameterTypeDescription
modelstrAsync model to use (see available models)
language_hintslist[str]Language hints for transcription (ISO language codes)
language_hints_strictboolEnforce strict language hints
enable_speaker_diarizationboolEnable speaker identification
enable_language_identificationboolEnable language detection
translationTranslationConfigTranslation configuration
contextStructuredContextContext for improved accuracy
client_reference_idstrCustom reference ID for your records
webhook_urlstrWebhook URL for completion notifications
webhook_auth_header_namestrCustom auth header name for webhook
webhook_auth_header_valuestrCustom auth header value for webhook

Browse the API documentation for a full list of supported options.

Return value

The lazy_load() and alazy_load() methods yield a single Document object:

Document(
    page_content=str,  # The transcribed text
    metadata={
        "source": str,  # File URL, path, or "file_upload"
        "transcription_id": str,  # Unique transcription ID
        "audio_duration_ms": int,  # Audio duration in milliseconds
        "model": str,  # Model used for transcription
        "created_at": str,  # ISO 8601 timestamp
        "tokens": list[dict],  # Detailed token-level information
    }
)

The tokens array in metadata includes detailed information for each transcribed word:

  • text: The transcribed text
  • start_ms: Start time in milliseconds
  • end_ms: End time in milliseconds
  • speaker: Speaker ID (if diarization enabled), for example "1", "2", etc.
  • language: Detected language (if identification enabled), for example "en", "fr", etc.
  • translation_status: Translation status ("original", "translated" or "none")

Learn more about the Soniox API reference.