Audio formats

Overview

Soniox Speech-to-Text AI supports a wide range of audio formats for both file-based and real-time transcription. In most cases, Soniox automatically detects the format using file or stream headers, requiring no additional configuration.

This page lists the formats supported in each mode and explains how to configure raw audio formats, which have no header and therefore cannot be detected automatically.

Automatic audio format detection

Soniox can automatically detect common audio and video container formats by inspecting the file or stream header. No configuration is needed in these cases.

Supported formats (auto-detected)

Mode	Supported formats
File transcription	aac, aiff, amr, asf, flac, mp3, ogg, wav, webm, m4a, mp4
Real-time transcription	aac, aiff, amr, asf, flac, mp3, ogg, wav, webm

No configuration required — Soniox automatically detects the format based on the file or stream header.
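Because detection relies on the header, a file extension is only a hint. Still, as a rough client-side pre-check, the table above can be encoded as a lookup. This is a minimal sketch: the helper name `is_auto_detected` and the mode keys `file` and `real_time` are illustrative assumptions, not part of the Soniox API.

```python
from pathlib import Path

# Container formats from the table above, keyed by transcription mode.
# The mode keys are hypothetical labels chosen for this sketch.
AUTO_DETECTED = {
    "file": {"aac", "aiff", "amr", "asf", "flac", "mp3", "ogg", "wav", "webm", "m4a", "mp4"},
    "real_time": {"aac", "aiff", "amr", "asf", "flac", "mp3", "ogg", "wav", "webm"},
}


def is_auto_detected(filename: str, mode: str) -> bool:
    """Guess from the extension whether the format is in the auto-detected list.

    This only inspects the file name; the service itself inspects the header.
    """
    extension = Path(filename).suffix.lower().lstrip(".")
    return extension in AUTO_DETECTED[mode]
```

For instance, `is_auto_detected("talk.mp4", "real_time")` is false, since mp4 and m4a appear only in the file transcription row.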

Raw audio formats (manual configuration required)

For raw audio formats that do not include headers (such as PCM), you must manually specify the format using the following parameters:

audio_format: The encoding of the raw audio data (e.g., pcm_s16le, pcm_f32be, mulaw)
sample_rate: The number of audio samples per second, in Hz (e.g., 16000)
num_channels: The number of audio channels (e.g., 1 for mono, 2 for stereo)
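One way to keep the three parameters together on the client side is a small container that validates them before a request is built. This is a sketch under stated assumptions: `RawAudioConfig` and `to_request_fields` are hypothetical names, and only the three field names come from this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RawAudioConfig:
    """Hypothetical holder for the three raw-audio parameters."""

    audio_format: str  # e.g. "pcm_s16le"
    sample_rate: int   # samples per second, in Hz
    num_channels: int  # 1 for mono, 2 for stereo

    def __post_init__(self):
        # Basic sanity checks only; the service performs its own validation.
        if self.sample_rate <= 0:
            raise ValueError("sample_rate must be a positive number of Hz")
        if self.num_channels < 1:
            raise ValueError("num_channels must be at least 1")

    def to_request_fields(self) -> dict:
        """Return the parameters as a dict that can be merged into a request."""
        return {
            "audio_format": self.audio_format,
            "sample_rate": self.sample_rate,
            "num_channels": self.num_channels,
        }
```

The resulting dict can be merged into the rest of a start request, for example `{**config.to_request_fields(), "api_key": api_key}`.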

Supported raw formats

Soniox supports a wide range of raw audio encodings, including:

Format	Description
pcm_s8	Signed 8-bit
pcm_s16le	Signed 16-bit, little-endian
pcm_s16be	Signed 16-bit, big-endian
pcm_s24le	Signed 24-bit, little-endian
pcm_s24be	Signed 24-bit, big-endian
pcm_s32le	Signed 32-bit, little-endian
pcm_s32be	Signed 32-bit, big-endian
pcm_u8	Unsigned 8-bit
pcm_u16le	Unsigned 16-bit, little-endian
pcm_u16be	Unsigned 16-bit, big-endian
pcm_u24le	Unsigned 24-bit, little-endian
pcm_u24be	Unsigned 24-bit, big-endian
pcm_u32le	Unsigned 32-bit, little-endian
pcm_u32be	Unsigned 32-bit, big-endian
pcm_f32le	32-bit float, little-endian
pcm_f32be	32-bit float, big-endian
pcm_f64le	64-bit float, little-endian
pcm_f64be	64-bit float, big-endian
mulaw	μ-law encoding (typically 8000 Hz sample rate, 1 channel)
alaw	A-law encoding (typically 8000 Hz sample rate, 1 channel)

These formats require explicit configuration of format, sample rate, and channel count.
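The format name also determines how many bytes each sample occupies, which is useful when choosing a chunk size for streaming. The sketch below builds a lookup for the 20 formats in the table and converts a chunk size into a duration. The names `BYTES_PER_SAMPLE` and `chunk_duration_seconds` are illustrative, and the one-byte size for mulaw and alaw assumes standard 8-bit companded samples.

```python
# Bytes per sample for each raw format in the table above.
BYTES_PER_SAMPLE = {"pcm_s8": 1, "pcm_u8": 1, "mulaw": 1, "alaw": 1}

# Signed and unsigned integer PCM: 16, 24, and 32 bits in both byte orders.
for bits in (16, 24, 32):
    for sign in ("s", "u"):
        for endian in ("le", "be"):
            BYTES_PER_SAMPLE[f"pcm_{sign}{bits}{endian}"] = bits // 8

# Floating-point PCM: 32 and 64 bits in both byte orders.
for bits in (32, 64):
    for endian in ("le", "be"):
        BYTES_PER_SAMPLE[f"pcm_f{bits}{endian}"] = bits // 8


def chunk_duration_seconds(
    chunk_bytes: int, audio_format: str, sample_rate: int, num_channels: int
) -> float:
    """Return how many seconds of audio a chunk of raw bytes represents."""
    if audio_format not in BYTES_PER_SAMPLE:
        raise ValueError(f"unknown raw audio format: {audio_format}")
    # One frame holds one sample for every channel.
    frame_bytes = BYTES_PER_SAMPLE[audio_format] * num_channels
    return chunk_bytes / (frame_bytes * sample_rate)
```

With pcm_s16le at 16000 Hz and 1 channel, one second of audio is 32000 bytes, so a 3840-byte chunk covers 0.12 seconds.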

Example

The following example demonstrates how to transcribe an audio stream encoded as signed 16-bit little-endian PCM (pcm_s16le), with a 16 kHz sample rate and 1 channel:

import json
import os
import threading
import time
 
import requests
from websockets import ConnectionClosedOK
from websockets.sync.client import connect
 
# Retrieve the API key from environment variable (ensure SONIOX_API_KEY is set)
api_key = os.environ.get("SONIOX_API_KEY")
websocket_url = "wss://stt-rt.soniox.com/transcribe-websocket"
audio_url = "https://soniox.com/media/examples/coffee_shop.pcm_s16le"
 
 
def stream_audio(ws):
    with requests.get(audio_url, stream=True) as res:
        res.raise_for_status()
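        # 3840 bytes = 1920 16-bit mono samples = 120 ms of audio at 16 kHz
        for chunk in res.iter_content(chunk_size=3840):
            if chunk:
                ws.send(chunk)
                time.sleep(0.12)  # pace the upload at roughly real time (120 ms per chunk)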
    ws.send("")  # signal end of stream
 
 
def main():
    print("Opening WebSocket connection...")
 
    with connect(websocket_url) as ws:
        # Send start request
        ws.send(
            json.dumps(
                {
                    "api_key": api_key,
                    "audio_format": "pcm_s16le",
                    "sample_rate": 16000,
                    "num_channels": 1,
                    "model": "stt-rt-preview",
                    "language_hints": ["en", "es"],
                }
            )
        )
 
        # Start streaming audio in background
        threading.Thread(target=stream_audio, args=(ws,), daemon=True).start()
 
        print("Transcription started")
 
        final_text = ""
 
        try:
            while True:
                message = ws.recv()
                res = json.loads(message)
 
                if res.get("error_code"):
                    print(f"Error: {res['error_code']} - {res['error_message']}")
                    break
 
                non_final_text = ""
 
                for token in res.get("tokens", []):
                    if token.get("text"):
                        if token.get("is_final"):
                            final_text += token["text"]
                        else:
                            non_final_text += token["text"]
 
                print(
                    "\033[2J\033[H"  # clear the screen, move to top-left corner
                    + final_text  # write final text
                    + "\033[34m"  # change text color to blue
                    + non_final_text  # write non-final text
                    + "\033[39m"  # reset text color
                )
 
                if res.get("finished"):
                    print("\nTranscription complete.")
                    break  # no further responses are expected after "finished"
        except ConnectionClosedOK:
            pass
        except Exception as e:
            print(f"Error: {e}")
 
 
if __name__ == "__main__":
    main()
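To try the example above with other audio, one option is to strip the header from an uncompressed WAV file and read the three parameters from that header. The sketch below uses Python's standard `wave` module. The helper name `wav_to_raw` is hypothetical, and the width-to-format mapping assumes plain PCM WAV data, which stores 8-bit samples as unsigned and wider samples as signed little-endian.

```python
import wave

# Sample width in bytes -> raw format name, assuming plain PCM WAV data.
_WIDTH_TO_FORMAT = {1: "pcm_u8", 2: "pcm_s16le", 3: "pcm_s24le", 4: "pcm_s32le"}


def wav_to_raw(wav_file):
    """Return (params, raw_bytes) for a PCM WAV file path or file-like object.

    `params` holds audio_format, sample_rate, and num_channels read from
    the WAV header; `raw_bytes` is the headerless sample data.
    """
    with wave.open(wav_file, "rb") as wav:
        params = {
            "audio_format": _WIDTH_TO_FORMAT[wav.getsampwidth()],
            "sample_rate": wav.getframerate(),
            "num_channels": wav.getnchannels(),
        }
        raw_bytes = wav.readframes(wav.getnframes())
    return params, raw_bytes
```

The returned parameters can replace the hard-coded values in the start request, and the raw bytes can be sent in chunks the same way `stream_audio` does.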

