Audio formats

Information about audio formats supported by Soniox Speech-to-Text AI.

Overview

Soniox Speech-to-Text AI supports a wide range of audio formats for both file-based and real-time transcription. In most cases, Soniox automatically detects the format using file or stream headers, requiring no additional configuration.

This page outlines which formats are supported in each mode and how to configure raw audio formats when automatic detection is not applicable.


Automatic audio format detection

Soniox can automatically detect common audio and video container formats by inspecting the file or stream header. No configuration is needed in these cases.

Supported formats (auto-detected)

Mode                      Supported formats
File transcription        aac, aiff, amr, asf, flac, mp3, ogg, wav, webm, m4a, mp4
Real-time transcription   aac, aiff, amr, asf, flac, mp3, ogg, wav, webm

No configuration required — Soniox automatically detects the format based on the file or stream header.
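
To make "inspecting the file or stream header" concrete, the sketch below checks the leading bytes of a buffer against a few well-known container signatures. It is not Soniox's detection logic, which this page does not describe; the `sniff_container` helper and its small signature list are assumptions for illustration only.

```python
from typing import Optional


def sniff_container(header: bytes) -> Optional[str]:
    """Illustrative only: guess a container from its leading bytes.

    Hypothetical helper, not part of any Soniox SDK.
    """
    if header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return "wav"
    if header[:4] == b"fLaC":
        return "flac"
    if header[:4] == b"OggS":
        return "ogg"
    if header[:4] == b"FORM" and header[8:12] in (b"AIFF", b"AIFC"):
        return "aiff"
    if header[:4] == b"\x1a\x45\xdf\xa3":  # EBML magic shared by WebM and Matroska
        return "webm"
    # Headerless data such as raw PCM matches no signature
    return None
```

Raw PCM carries no signature of this kind, so a check like this returns `None` for it, which is why such streams need explicit configuration.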


Raw audio formats (manual configuration required)

For raw audio formats that do not include headers (such as PCM), you must manually specify the format using the following parameters:

  • audio_format: The encoding of the raw audio data (e.g., pcm_s16le, pcm_f32be, mulaw)
  • sample_rate: The number of audio samples per second, in Hz (e.g., 16000)
  • num_channels: The number of audio channels (e.g., 1 for mono, 2 for stereo)
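
As a minimal sketch, the three parameters can be collected into a start request like the one used in the example on this page. The `raw_audio_start_request` helper and its validation rules are hypothetical; only the field names `api_key`, `model`, `audio_format`, `sample_rate`, and `num_channels` come from this page, and the checks the service itself performs may differ.

```python
import json


def raw_audio_start_request(api_key, model, audio_format, sample_rate, num_channels):
    """Build a start-request dict for headerless audio (hypothetical helper)."""
    # Basic sanity checks chosen for this sketch, not taken from the API
    if not isinstance(sample_rate, int) or sample_rate <= 0:
        raise ValueError("sample_rate must be a positive integer (Hz)")
    if not isinstance(num_channels, int) or num_channels <= 0:
        raise ValueError("num_channels must be a positive integer")
    return {
        "api_key": api_key,
        "model": model,
        "audio_format": audio_format,
        "sample_rate": sample_rate,
        "num_channels": num_channels,
    }


request = raw_audio_start_request("<SONIOX_API_KEY>", "stt-rt-preview", "pcm_s16le", 16000, 1)
payload = json.dumps(request)  # JSON text to send as the first WebSocket message
```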

Supported raw formats

Soniox supports a wide range of raw audio encodings, including:

Format      Description
pcm_s8      Signed 8-bit
pcm_s16le   Signed 16-bit, little-endian
pcm_s16be   Signed 16-bit, big-endian
pcm_s24le   Signed 24-bit, little-endian
pcm_s24be   Signed 24-bit, big-endian
pcm_s32le   Signed 32-bit, little-endian
pcm_s32be   Signed 32-bit, big-endian
pcm_u8      Unsigned 8-bit
pcm_u16le   Unsigned 16-bit, little-endian
pcm_u16be   Unsigned 16-bit, big-endian
pcm_u24le   Unsigned 24-bit, little-endian
pcm_u24be   Unsigned 24-bit, big-endian
pcm_u32le   Unsigned 32-bit, little-endian
pcm_u32be   Unsigned 32-bit, big-endian
pcm_f32le   32-bit float, little-endian
pcm_f32be   32-bit float, big-endian
pcm_f64le   64-bit float, little-endian
pcm_f64be   64-bit float, big-endian
mulaw       μ-law encoding (usually sample rate 8000 and 1 channel)
alaw        A-law encoding (usually sample rate 8000 and 1 channel)

These formats require explicit configuration of format, sample rate, and channel count.
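
To show what these encoding names mean at the byte level, the sketch below packs the same three sample values in a few of the listed layouts using Python's standard `struct` module. The sample values are arbitrary, and scaling by 32768 for the float variant is one common convention assumed here, not something this page specifies.

```python
import struct

samples = [0, 1000, -1000]  # arbitrary 16-bit sample values for illustration

# struct codes: "<" little-endian, ">" big-endian, "h" signed 16-bit, "f" 32-bit float
pcm_s16le = struct.pack("<3h", *samples)
pcm_s16be = struct.pack(">3h", *samples)

# Float PCM is assumed here to use the range [-1.0, 1.0)
pcm_f32le = struct.pack("<3f", *(s / 32768 for s in samples))

print(pcm_s16le.hex())  # 0000e80318fc
print(pcm_s16be.hex())  # 000003e8fc18
print(len(pcm_f32le))   # 12 (3 samples x 4 bytes)
```

The same audio produces different byte streams in each layout, so a mismatch between the declared audio_format and the actual data cannot be detected from the bytes alone.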

Example

The following example demonstrates how to transcribe an audio stream encoded in 16-bit PCM (little-endian), with a 16 kHz sample rate and 1 channel:

import json
import os
import threading
import time
 
import requests
from websockets import ConnectionClosedOK
from websockets.sync.client import connect
 
# Retrieve the API key from environment variable (ensure SONIOX_API_KEY is set)
api_key = os.environ.get("SONIOX_API_KEY")
websocket_url = "wss://stt-rt.soniox.com/transcribe-websocket"
audio_url = "https://soniox.com/media/examples/coffee_shop.pcm_s16le"
 
 
def stream_audio(ws):
    with requests.get(audio_url, stream=True) as res:
        res.raise_for_status()
        for chunk in res.iter_content(chunk_size=3840):
            if chunk:
                ws.send(chunk)
                time.sleep(0.12)  # sleep for 120 ms
    ws.send("")  # signal end of stream
 
 
def main():
    print("Opening WebSocket connection...")
 
    with connect(websocket_url) as ws:
        # Send start request
        ws.send(
            json.dumps(
                {
                    "api_key": api_key,
                    "audio_format": "pcm_s16le",
                    "sample_rate": 16000,
                    "num_channels": 1,
                    "model": "stt-rt-preview",
                    "language_hints": ["en", "es"],
                }
            )
        )
 
        # Start streaming audio in background
        threading.Thread(target=stream_audio, args=(ws,), daemon=True).start()
 
        print("Transcription started")
 
        final_text = ""
 
        try:
            while True:
                message = ws.recv()
                res = json.loads(message)
 
                if res.get("error_code"):
                    print(f"Error: {res['error_code']} - {res['error_message']}")
                    break
 
                non_final_text = ""
 
                for token in res.get("tokens", []):
                    if token.get("text"):
                        if token.get("is_final"):
                            final_text += token["text"]
                        else:
                            non_final_text += token["text"]
 
                print(
                    "\033[2J\033[H"  # clear the screen, move to top-left corner
                    + final_text  # write final text
                    + "\033[34m"  # change text color to blue
                    + non_final_text  # write non-final text
                    + "\033[39m"  # reset text color
                )
 
                if res.get("finished"):
                    print("\nTranscription complete.")
        except ConnectionClosedOK:
            pass
        except Exception as e:
            print(f"Error: {e}")
 
 
if __name__ == "__main__":
    main()
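
The example above sends 3840-byte chunks and sleeps 120 ms between them. A short calculation suggests these values pace the upload at roughly real-time speed for this format, although the page does not state that intent. The `chunk_duration_seconds` helper is hypothetical.

```python
def chunk_duration_seconds(chunk_bytes, sample_rate, num_channels, bytes_per_sample):
    """Return how many seconds of audio a raw PCM chunk of this size holds."""
    bytes_per_second = sample_rate * num_channels * bytes_per_sample
    return chunk_bytes / bytes_per_second


# pcm_s16le uses 2 bytes per sample; the example streams 16 kHz mono audio
duration = chunk_duration_seconds(3840, 16000, 1, 2)
print(duration)  # 0.12
```

For a different raw format, the chunk size or the sleep interval would need to change accordingly to keep the same pacing.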
