WebSocket API

Overview

The Soniox WebSocket API provides real-time Text-to-Speech with low latency over a persistent WebSocket connection. A single connection can host up to 5 concurrent streams multiplexed by stream_id. Ideal for voice agents, interactive assistants, and LLM-driven applications where audio must start playing before the full text is generated.

WebSocket endpoint

Connect to the API using:

wss://tts-rt.soniox.com/tts-websocket

Configuration

Before streaming text for a stream, send a configuration message on the WebSocket connection. Send one config message per stream_id you want to start.

{
  "api_key": "<SONIOX_API_KEY|SONIOX_TEMPORARY_API_KEY>",
  "model": "tts-rt-v1",
  "language": "en",
  "voice": "Adrian",
  "audio_format": "wav",
  "sample_rate": 24000,
  "stream_id": "stream-001"
}

Parameters

api_keyRequiredstring

Your Soniox API key. Create API keys in the Soniox Console. For client apps, generate a temporary API key from your server to keep secrets secure.

stream_idRequiredstring

Client-generated stream identifier. Must be unique among active streams on the same WebSocket connection. You may reuse a stream_id only after its previous stream is terminated.

modelRequiredstring

Text-to-Speech model to use. See models.

Example: "tts-rt-v1"

languageRequiredstring

Language code. See the list of supported languages and their ISO codes.

Example: "en"

voiceRequiredstring

Voice. See voices.

Example: "Adrian"

audio_formatRequiredstring

Audio format of the stream. See audio formats.

Example: "wav"

sample_ratenumber

Optional output sample rate in Hz. See audio formats.

Example: 24000

bitratenumber

Codec bitrate in bps (for lossy compressed formats). See audio formats.

Example: 128000

client_reference_idstring

Optional client-defined identifier recorded with this request in usage logs. Does not need to be unique. Ignored if the request authenticates with a temporary API key.

Text streaming

After sending the configuration message for a stream, send text messages for that same stream_id:

{
  "text": "Hello there, this is chunk one.",
  "text_end": false,
  "stream_id": "stream-001"
}

Final text chunk:

{
  "text": "And this is the end.",
  "text_end": true,
  "stream_id": "stream-001"
}

Ending the stream

A stream ends with a three-step handshake: the client sends a text message with text_end: true, the server sends the last audio payload with audio_end: true, then {"terminated": true}.

For the full lifecycle, including error completion and client-initiated cancellation see Stream termination.

Normal completion

A stream completes in this order:

The client sends a text message with text_end: true for the target stream_id.

{
  "text": "And this is the end.",
  "text_end": true,
  "stream_id": "stream-001"
}

The server sends the last audio payload with audio_end: true.

{
  "audio": "<base64-audio-bytes>",
  "audio_end": true,
  "stream_id": "stream-001"
}

The server sends a final stream event with terminated: true.

{
  "terminated": true,
  "stream_id": "stream-001"
}

What `audio_end` means

audio_end: true marks the last audio chunk for that stream. No more audio payloads will follow. You should still keep the stream open and wait for the terminal terminated: true event.

What `terminated` means

terminated: true indicates the server has fully closed the stream and released all stream resources. Only after terminated: true it is safe to:

reuse the same stream_id
stop tracking stream state
consider the stream lifecycle complete

Treat the stream as complete only after you receive terminated: true.

Error completion

If an error occurs for a stream:

The server sends an error response for that stream_id.

{
  "stream_id": "stream-001",
  "error_code": 400,
  "error_type": "invalid_request",
  "error_message": "Missing model",
  "more_info": "https://soniox.com/docs/api-reference/errors#invalid-request",
  "request_id": "3d37a3bd-5078-47ee-a369-b204e3bbedda"
}

The server sends {"terminated": true} for that same stream_id.

{
  "terminated": true,
  "stream_id": "stream-001"
}

The failed stream is removed, but the WebSocket connection stays open and other streams can continue.

Client-initiated cancellation

To cancel a stream, send a cancel message. The server finalizes the stream and does not send audio chunks.

Cancel request:

{
  "stream_id": "stream-001",
  "cancel": true
}

Finalization response:

{
  "terminated": true,
  "stream_id": "stream-001"
}

Error handling

One failed stream does not close the whole WebSocket connection.

Stream-level runtime errors (inside a running stream):
- The server sends an error response for that stream_id.
- The server then sends {"terminated": true} for that same stream_id.
- Only that stream ends; other active streams continue.
Validation/input errors (invalid start/text message, unknown stream, malformed stream message):
- The server sends an error response.
- The WebSocket message loop stays alive, so valid streams can continue.
Connection-level failures (WebSocket disconnect/read/write failure, forced shutdown):
- The WebSocket connection closes.
- All streams on that connection end.

Response

Server messages are JSON and include stream_id for stream-specific events. Successful audio messages include audio (base64 chunk), and terminal messages include terminated.

{
  "audio": "<base64-audio-bytes>",
  "audio_end": false,
  "stream_id": "stream-001"
}

Terminal stream message

{
  "terminated": true,
  "stream_id": "stream-001"
}

Error response

If an error occurs, the server returns an error message:

{
  "stream_id": "stream-001",
  "error_code": 400,
  "error_type": "invalid_request",
  "error_message": "Missing model",
  "more_info": "https://soniox.com/docs/api-reference/errors#invalid-request",
  "request_id": "3d37a3bd-5078-47ee-a369-b204e3bbedda"
}

error_codenumber

HTTP status code of the error.

error_typestring

Stable, machine-readable identifier of the error. Branch on this, not on error_message. See the Errors reference for the full catalog and recovery steps.

error_messagestring

Human-readable description of the error.

more_infostring

Link to the section on the Errors page describing this error_type.

request_idstring

Unique identifier of this request. Include it when contacting support@soniox.com; server logs are keyed on it.

For error scoping and isolation behavior when one WebSocket hosts multiple streams, see Streams. For the full catalog of error_type values across all Soniox APIs, see the Errors reference.

Full list of possible error codes and messages

The request is malformed or contains invalid parameters. error_type is one of invalid_request, invalid_stream_state, max_concurrent_streams_reached, or model_not_available.

API key is too long (max length 250).
Audio format is too long (max length 50).
Expected a text message. Binary frames are not accepted on this endpoint.
Invalid language '<language>' for model '<model>'.
Invalid message format. Expected JSON matching a start, text, or keep-alive request.
Invalid voice '<voice>' for model '<model>'.
Language is required.
Language is too long (max length 50).
Maximum concurrent streams per connection (N) reached. Send a cancel message for one of your active streams to free a slot, or open a new WebSocket connection.
Missing audio_format
Missing language
Missing model
Missing stream_id
Missing voice
Model name is too long (max length 50).
Stream <stream_id> has already been cancelled. Start a new stream to send more text.
Stream <stream_id> has already received text_end and is closed for input. Start a new stream to send more text.
Stream <stream_id> is already active on this connection. Choose a different stream_id, or cancel the existing stream first.
Stream <stream_id> not found. Send a start message first.
Stream ID is too long (max length 256).
Text is too long (max length 5000).
The 'cancel' field cannot be combined with 'text' or 'text_end'. Send 'cancel' on its own to stop a stream.
The requested model is not available. See https://soniox.com/docs/tts/models for the list of supported TTS models.
Voice is too long (max length 50).

Authentication is missing or incorrect. Ensure a valid API key is provided before retrying. error_type: unauthenticated.

Incorrect API key provided. You can get an API key at https://console.soniox.com
Invalid or expired temporary API key. Create a new temporary API key and retry. See https://soniox.com/docs/guides/temporary-api-keys for details.
Missing API key. Provide API key as a header (i.e. Authorization: Bearer <SONIOX_API_KEY>). You can get an API key at https://console.soniox.com
The temporary API key cannot be used for this action. Each temporary API key is scoped to a specific `usage_type`; create a new key with the correct usage type.

The organization's balance or monthly usage limit has been reached. error_type is one of organization_balance_exhausted, organization_monthly_budget_exhausted, or project_monthly_budget_exhausted.

Organization balance exhausted. Please either add funds manually or enable autopay.
Organization monthly budget exhausted. Please increase it.
Project monthly budget exhausted. Please increase it.

The temporary API key in use was created with a max_session_duration_seconds cap, and that duration has elapsed for the current session. error_type: temp_api_key_session_expired.

Temporary API key session duration limit exceeded. Create a new temporary API key to start a new session.

A backend call exceeded its deadline before completing. Retry the request. error_type: request_timeout.

A usage or rate limit has been exceeded. You may retry after a delay or request an increase in limits via the Soniox Console. error_type: limit_exceeded.

Concurrent requests limit for text-to-speech has been exceeded for your organization.
Concurrent requests limit for text-to-speech has been exceeded for your project.
Requests per minute limit for text-to-speech has been exceeded for your organization.
Requests per minute limit for text-to-speech has been exceeded for your project.

An unexpected server-side error occurred. The request may be retried. error_type: internal_error.

The server had an error processing your request. Sorry about that! You can retry your request, or contact us through our support email support@soniox.com if you keep seeing this error.

The service cannot accept the request right now (upstream overload, cache exhausted, shutdown). Retry with backoff. The numeric (code N) in the message identifies the sub-cause for support triage. error_type: service_unavailable.

Cannot continue request (code N). Please restart the request. Refer to: https://soniox.com/url/cannot-continue-request

Code example

Prerequisite: Complete the steps in Get started.

See on GitHub: soniox_sdk_realtime.py.

import argparse
import os
import threading
import time
from pathlib import Path
from uuid import uuid4

from soniox import SonioxClient
from soniox.errors import SonioxRealtimeError
from soniox.types import RealtimeTTSConfig
from soniox.utils import output_file_for_audio_format

VALID_SAMPLE_RATES = [8000, 16000, 24000, 44100, 48000]
VALID_BITRATES = [32000, 64000, 96000, 128000, 192000, 256000, 320000]
VALID_AUDIO_FORMATS = [
    "pcm_f32le",
    "pcm_s16le",
    "pcm_mulaw",
    "pcm_alaw",
    "wav",
    "aac",
    "mp3",
    "opus",
    "flac",
]

DEFAULT_LINES = [
    "Welcome to Soniox real-time Text-to-Speech. ",
    "As text is streamed in, audio streams back in parallel with high accuracy, ",
    "so your application can start playing speech ",
    "within milliseconds of the first word.",
]


def get_config(
    model: str,
    language: str,
    voice: str,
    audio_format: str,
    sample_rate: int | None,
    bitrate: int | None,
    stream_id: str | None,
) -> RealtimeTTSConfig:
    config = RealtimeTTSConfig(
        # Stream id for this realtime TTS session.
        # If omitted, a random id is generated.
        stream_id=stream_id or f"tts-{uuid4()}",
        #
        # Select the model to use.
        # See: soniox.com/docs/tts/models
        model=model,
        #
        # Set the language of the input text.
        # See: soniox.com/docs/tts/concepts/supported-languages
        language=language,
        #
        # Select the voice to use.
        # See: soniox.com/docs/tts/concepts/voices
        voice=voice,
        #
        # Set output audio format and optional encoding parameters.
        # See: soniox.com/docs/api-reference/tts/websocket-api
        audio_format=audio_format,
        sample_rate=sample_rate,
        bitrate=bitrate,
    )

    return config


def run_session(
    client: SonioxClient,
    lines: list[str],
    model: str,
    language: str,
    voice: str,
    audio_format: str,
    sample_rate: int | None,
    bitrate: int | None,
    stream_id: str | None,
    output_path: str | None,
) -> None:
    # Build a realtime Text-to-Speech session configuration.
    config = get_config(
        model=model,
        language=language,
        voice=voice,
        audio_format=audio_format,
        sample_rate=sample_rate,
        bitrate=bitrate,
        stream_id=stream_id,
    )
    sanitized_lines = [line.strip() for line in lines if line.strip()]
    if not sanitized_lines:
        raise ValueError("Text is empty after parsing.")

    destination = (
        Path(output_path)
        if output_path
        else output_file_for_audio_format(audio_format, "tts_realtime")
    )
    print("Connecting to Soniox...")
    audio_chunks: list[bytes] = []
    try:
        with client.realtime.tts.connect(config=config) as session:
            print("Session started.")
            send_errors: list[Exception] = []

            def send_worker() -> None:
                try:
                    for line in sanitized_lines:
                        session.send_text_chunk(line, text_end=False)
                        time.sleep(0.1)
                    session.finish()
                except Exception as exc:
                    send_errors.append(exc)

            threading.Thread(target=send_worker, daemon=True).start()
            # Receive streamed audio chunks from the websocket.
            for audio_chunk in session.receive_audio_chunks():
                audio_chunks.append(audio_chunk)
            if send_errors:
                raise RuntimeError(f"Failed to send realtime text: {send_errors[0]}")
            print("Session finished.")
    finally:
        audio = b"".join(audio_chunks)
        if audio:
            destination.write_bytes(audio)
            print(f"Wrote {len(audio)} bytes to {destination.resolve()}")
        else:
            print("No audio file was written.")


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--line",
        action="append",
        default=None,
        help="Line to send to realtime TTS (repeat --line for multiple lines).",
    )
    parser.add_argument("--model", default="tts-rt-v1")
    parser.add_argument("--language", default="en")
    parser.add_argument("--voice", default="Adrian")
    parser.add_argument("--audio_format", default="wav")
    parser.add_argument("--sample_rate", type=int)
    parser.add_argument("--bitrate", type=int)
    parser.add_argument("--stream_id", help="Optional stream id.")
    parser.add_argument(
        "--output_path",
        help="Optional output file path. If omitted, a timestamped path is generated.",
    )
    args = parser.parse_args()

    if args.audio_format not in VALID_AUDIO_FORMATS:
        raise ValueError(f"audio_format must be one of {VALID_AUDIO_FORMATS}")
    if args.sample_rate is not None and args.sample_rate not in VALID_SAMPLE_RATES:
        raise ValueError(f"sample_rate must be None or one of {VALID_SAMPLE_RATES}")
    if args.bitrate is not None and args.bitrate not in VALID_BITRATES:
        raise ValueError(f"bitrate must be None or one of {VALID_BITRATES}")

    api_key = os.environ.get("SONIOX_API_KEY")
    if not api_key:
        raise RuntimeError(
            "Missing SONIOX_API_KEY.\n"
            "1. Get your API key at https://console.soniox.com\n"
            "2. Run: export SONIOX_API_KEY=<YOUR_API_KEY>"
        )

    client = SonioxClient(api_key=api_key)

    try:
        run_session(
            client=client,
            lines=args.line or DEFAULT_LINES,
            model=args.model,
            language=args.language,
            voice=args.voice,
            audio_format=args.audio_format,
            sample_rate=args.sample_rate,
            bitrate=args.bitrate,
            stream_id=args.stream_id,
            output_path=args.output_path,
        )
    except SonioxRealtimeError as exc:
        print("Soniox realtime error:", exc)
    finally:
        client.close()


if __name__ == "__main__":
    main()

Terminal

# Generate speech with default settings (wav output)
python soniox_sdk_realtime.py --line "Hello from Soniox realtime Text-to-Speech."

# Generate raw PCM output
python soniox_sdk_realtime.py --audio_format pcm_s16le --sample_rate 24000 --output_path tts-output.pcm

See on GitHub: soniox_sdk_realtime.js.

import { RealtimeError, SonioxNodeClient } from "@soniox/node";
import fs from "fs";
import path from "path";
import { parseArgs } from "node:util";
import process from "process";

const VALID_SAMPLE_RATES = [8000, 16000, 24000, 44100, 48000];
const VALID_BITRATES = [32000, 64000, 96000, 128000, 192000, 256000, 320000];
const VALID_AUDIO_FORMATS = [
  "pcm_f32le",
  "pcm_s16le",
  "pcm_mulaw",
  "pcm_alaw",
  "wav",
  "aac",
  "mp3",
  "opus",
  "flac",
];
const RAW_PCM_FORMATS = ["pcm_s16le", "pcm_f32le", "pcm_mulaw", "pcm_alaw"];

const DEFAULT_LINES = [
  "Welcome to Soniox real-time Text-to-Speech. ",
  "As text is streamed in, audio streams back in parallel with high accuracy, ",
  "so your application can start playing speech ",
  "within milliseconds of the first word.",
];

// Initialize the client.
// The API key is read from the SONIOX_API_KEY environment variable.
const client = new SonioxNodeClient();

// Resolve a concrete output file path.
// If the provided path has no extension, derive one from audio_format:
//   * pcm_s16le  -> .wav  (we wrap the bytes in a WAV container below)
//   * other pcm_* -> .pcm (raw, no container)
//   * anything else -> the format name (e.g. .flac, .mp3, .opus)
function resolveOutputPath(outputPath, audioFormat) {
  if (outputPath && path.extname(outputPath)) {
    return outputPath;
  }
  const ext =
    audioFormat === "pcm_s16le"
      ? "wav"
      : RAW_PCM_FORMATS.includes(audioFormat)
        ? "pcm"
        : audioFormat;
  const base = outputPath || "tts_realtime";
  return `${base}.${ext}`;
}

function pcmS16leToWav(pcm, { sampleRate, numChannels = 1 }) {
  const bitsPerSample = 16;
  const byteRate = sampleRate * numChannels * (bitsPerSample / 8);
  const blockAlign = numChannels * (bitsPerSample / 8);
  const dataSize = pcm.byteLength;
  const header = Buffer.alloc(44);
  header.write("RIFF", 0, "ascii");
  header.writeUInt32LE(36 + dataSize, 4);
  header.write("WAVE", 8, "ascii");
  header.write("fmt ", 12, "ascii");
  header.writeUInt32LE(16, 16);
  header.writeUInt16LE(1, 20);
  header.writeUInt16LE(numChannels, 22);
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(byteRate, 28);
  header.writeUInt16LE(blockAlign, 32);
  header.writeUInt16LE(bitsPerSample, 34);
  header.write("data", 36, "ascii");
  header.writeUInt32LE(dataSize, 40);
  return Buffer.concat([header, Buffer.from(pcm)]);
}

// Build a realtime TTS stream config.
function getStreamConfig({
  model,
  language,
  voice,
  audioFormat,
  sampleRate,
  bitrate,
  streamId,
}) {
  const config = {
    // Client-defined stream id (auto-generated if omitted).
    ...(streamId && { stream_id: streamId }),

    // Select the model to use.
    // See: soniox.com/docs/tts/models
    model,

    // Set the language of the input text.
    // See: soniox.com/docs/tts/concepts/supported-languages
    language,

    // Select the voice to use.
    // See: soniox.com/docs/tts/concepts/voices
    voice,

    // Set output audio format and optional encoding parameters.
    // See: soniox.com/docs/api-reference/tts/websocket-api
    audio_format: audioFormat,
  };

  if (sampleRate !== undefined) config.sample_rate = sampleRate;
  if (bitrate !== undefined) config.bitrate = bitrate;

  return config;
}

async function runSession({
  lines,
  model,
  language,
  voice,
  audioFormat,
  sampleRate,
  bitrate,
  streamId,
  outputPath,
}) {
  const sanitizedLines = lines
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
  if (sanitizedLines.length === 0) {
    throw new Error("Text is empty after parsing.");
  }

  const destination = resolveOutputPath(outputPath, audioFormat);
  const config = getStreamConfig({
    model,
    language,
    voice,
    audioFormat,
    sampleRate,
    bitrate,
    streamId,
  });

  console.log("Connecting to Soniox...");
  const stream = await client.realtime.tts(config);
  console.log("Session started.");

  // Send text chunks in the background while receiving audio.
  let sendError = null;
  const sendPromise = (async () => {
    try {
      for (const line of sanitizedLines) {
        stream.sendText(line);
        // Sleep for 100 ms to simulate real-time streaming.
        await new Promise((res) => setTimeout(res, 100));
      }
      stream.finish();
    } catch (err) {
      sendError = err;
    }
  })();

  // Collect streamed audio chunks.
  const audioChunks = [];
  try {
    for await (const chunk of stream) {
      audioChunks.push(chunk);
    }
  } finally {
    await sendPromise;
    stream.close();
  }

  if (sendError) {
    throw new Error(`Failed to send realtime text: ${sendError.message}`);
  }

  console.log("Session finished.");

  const audio = Buffer.concat(audioChunks.map((c) => Buffer.from(c)));
  if (audio.length > 0) {
    // Wrap raw pcm_s16le in a WAV container so the .wav file plays everywhere.
    const bytes =
      audioFormat === "pcm_s16le" &&
      path.extname(destination).toLowerCase() === ".wav"
        ? pcmS16leToWav(audio, { sampleRate })
        : audio;
    fs.writeFileSync(destination, bytes);
    console.log(`Wrote ${bytes.length} bytes to ${path.resolve(destination)}`);
  } else {
    console.log("No audio file was written.");
  }
}

async function main() {
  const { values: argv } = parseArgs({
    options: {
      line: { type: "string", multiple: true },
      model: { type: "string", default: "tts-rt-v1" },
      language: { type: "string", default: "en" },
      voice: { type: "string", default: "Adrian" },
      audio_format: { type: "string", default: "pcm_s16le" },
      sample_rate: { type: "string" },
      bitrate: { type: "string" },
      stream_id: { type: "string" },
      output_path: { type: "string" },
    },
  });

  if (!VALID_AUDIO_FORMATS.includes(argv.audio_format)) {
    throw new Error(
      `audio_format must be one of ${VALID_AUDIO_FORMATS.join(", ")}`,
    );
  }
  let sampleRate =
    argv.sample_rate !== undefined ? Number(argv.sample_rate) : undefined;
  if (sampleRate === undefined && RAW_PCM_FORMATS.includes(argv.audio_format)) {
    sampleRate = 24000;
  }
  if (sampleRate !== undefined && !VALID_SAMPLE_RATES.includes(sampleRate)) {
    throw new Error(
      `sample_rate must be one of ${VALID_SAMPLE_RATES.join(", ")}`,
    );
  }
  const bitrate = argv.bitrate !== undefined ? Number(argv.bitrate) : undefined;
  if (bitrate !== undefined && !VALID_BITRATES.includes(bitrate)) {
    throw new Error(`bitrate must be one of ${VALID_BITRATES.join(", ")}`);
  }

  try {
    await runSession({
      lines: argv.line && argv.line.length > 0 ? argv.line : DEFAULT_LINES,
      model: argv.model,
      language: argv.language,
      voice: argv.voice,
      audioFormat: argv.audio_format,
      sampleRate,
      bitrate,
      streamId: argv.stream_id,
      outputPath: argv.output_path,
    });
  } catch (err) {
    if (err instanceof RealtimeError) {
      console.error("Soniox realtime error:", err.message);
    } else {
      throw err;
    }
  }
}

main().catch((err) => {
  console.error("Error:", err.message);
  process.exit(1);
});

Terminal

# Generate speech with default settings (wav output)
node soniox_sdk_realtime.js --line "Hello from Soniox realtime Text-to-Speech."

# Generate raw PCM output
node soniox_sdk_realtime.js --audio_format pcm_s16le --sample_rate 24000 --output_path tts-output.pcm

See on GitHub: soniox_realtime.py.

import argparse
import base64
import json
import os
import threading
import time
from typing import Any

from websockets import ConnectionClosedOK
from websockets.sync.client import connect

SONIOX_TTS_WEBSOCKET_URL = "wss://tts-rt.soniox.com/tts-websocket"
MODEL = "tts-rt-v1"
VALID_SAMPLE_RATES = [8000, 16000, 24000, 44100, 48000]
VALID_BITRATES = [32000, 64000, 96000, 128000, 192000, 256000, 320000]
VALID_AUDIO_FORMATS = [
    "pcm_f32le",
    "pcm_s16le",
    "pcm_mulaw",
    "pcm_alaw",
    "wav",
    "aac",
    "mp3",
    "opus",
    "flac",
]
DEFAULT_LINES = [
    "Welcome to Soniox real-time Text-to-Speech. ",
    "As text is streamed in, audio streams back in parallel with high accuracy, ",
    "so your application can start playing speech ",
    "within milliseconds of the first word.",
]


def get_output_path(*, output_path: str, audio_format: str) -> str:
    """
    Generates the resulting output path for the given audio format.
    """
    if "." in os.path.basename(output_path):
        return output_path
    ext = "pcm" if audio_format in ("pcm_s16le", "pcm_s16be") else audio_format
    return f"{output_path}.{ext}"


# Get Soniox TTS config.
def get_config(
    api_key: str,
    stream_id: str,
    language: str,
    voice: str,
    audio_format: str,
    sample_rate: int | None,
    bitrate: int | None,
) -> dict:
    config: dict[str, Any] = {
        # Get your API key at console.soniox.com, then run: export SONIOX_API_KEY=<YOUR_API_KEY>
        "api_key": api_key,
        #
        # Client-defined stream id to identify this realtime request.
        "stream_id": stream_id,
        #
        # Select the model to use.
        # See: soniox.com/docs/tts/models
        "model": MODEL,
        #
        # Set the language of the input text.
        # See: soniox.com/docs/tts/concepts/supported-languages
        "language": language,
        #
        # Select the voice to use.
        # See: soniox.com/docs/tts/concepts/voices
        "voice": voice,
        #
        # Audio format.
        # See: soniox.com/docs/tts/concepts/audio-formats
        "audio_format": audio_format,
    }

    if sample_rate is not None:
        config["sample_rate"] = sample_rate
    if bitrate is not None:
        config["bitrate"] = bitrate

    return config


def get_text_request(text: str, stream_id: str, text_end: bool) -> dict:
    return {
        "text": text,
        "text_end": text_end,
        "stream_id": stream_id,
    }


# Stream text lines to the websocket.
def stream_text(lines: list[str], stream_id: str, ws) -> None:
    for line in lines:
        clean_line = line.strip()
        if not clean_line:
            continue
        ws.send(json.dumps(get_text_request(clean_line, stream_id, text_end=False)))
        # Sleep for 100 ms to simulate real-time streaming.
        time.sleep(0.1)

    # Send text_end=true after the last chunk.
    ws.send(json.dumps(get_text_request("", stream_id, text_end=True)))


def send_requests(
    ws,
    api_key: str,
    lines: list[str],
    language: str,
    voice: str,
    audio_format: str,
    sample_rate: int | None,
    bitrate: int | None,
    stream_id: str,
) -> None:
    config = get_config(
        api_key=api_key,
        stream_id=stream_id,
        language=language,
        voice=voice,
        audio_format=audio_format,
        sample_rate=sample_rate,
        bitrate=bitrate,
    )
    ws.send(json.dumps(config))
    stream_text(lines, stream_id, ws)


def run_session(
    api_key: str,
    lines: list[str],
    language: str,
    voice: str,
    audio_format: str,
    sample_rate: int | None,
    bitrate: int | None,
    stream_id: str,
    output_path: str,
) -> None:
    print("Connecting to Soniox...")
    with connect(SONIOX_TTS_WEBSOCKET_URL) as ws:
        send_errors: list[Exception] = []

        def send_worker() -> None:
            try:
                send_requests(
                    ws,
                    api_key,
                    lines,
                    language,
                    voice,
                    audio_format,
                    sample_rate,
                    bitrate,
                    stream_id,
                )
            except Exception as exc:
                send_errors.append(exc)

        # Send config and text in the background while receiving responses.
        threading.Thread(
            target=send_worker,
            daemon=True,
        ).start()

        print("Session started.")
        audio_chunks: list[bytes] = []

        try:
            while True:
                if send_errors:
                    raise RuntimeError(f"Failed to send realtime requests: {send_errors[0]}")
                message = ws.recv()
                res = json.loads(message)

                # Error from server.
                if res.get("error_code") is not None:
                    print(f"Error: {res['error_code']} - {res['error_message']}")
                    break

                # Collect audio bytes from base64-encoded chunks.
                audio_b64 = res.get("audio")
                if audio_b64:
                    audio_chunks.append(base64.b64decode(audio_b64))

                # Session finished.
                if res.get("terminated"):
                    break

        except ConnectionClosedOK:
            # Normal, server closed after finished.
            pass
        except KeyboardInterrupt:
            print("\nInterrupted by user.")
        except Exception as e:
            print(f"Error: {e}")
        finally:
            audio_data = b"".join(audio_chunks)
            if audio_data:
                destination = get_output_path(output_path=output_path, audio_format=audio_format)
                with open(destination, "wb") as fh:
                    fh.write(audio_data)
                print(f"Wrote {len(audio_data)} bytes to {destination}")
            else:
                print("No audio file was written.")


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--line",
        action="append",
        default=None,
        help="Line to send to realtime TTS (repeat --line for multiple lines).",
    )
    parser.add_argument("--language", default="en")
    parser.add_argument("--voice", default="Adrian")
    parser.add_argument("--audio_format", default="wav")
    parser.add_argument("--stream_id", default="stream-1")
    parser.add_argument("--output_path", default="tts-ws")
    parser.add_argument("--sample_rate", type=int)
    parser.add_argument("--bitrate", type=int)
    args = parser.parse_args()

    if args.audio_format not in VALID_AUDIO_FORMATS:
        raise ValueError(f"audio_format must be one of {VALID_AUDIO_FORMATS}")
    if args.sample_rate is not None and args.sample_rate not in VALID_SAMPLE_RATES:
        raise ValueError(f"sample_rate must be None or one of {VALID_SAMPLE_RATES}")
    if args.bitrate is not None and args.bitrate not in VALID_BITRATES:
        raise ValueError(f"bitrate must be None or one of {VALID_BITRATES}")

    api_key = os.environ.get("SONIOX_API_KEY")
    if not api_key:
        raise RuntimeError(
            "Missing SONIOX_API_KEY.\n"
            "1. Get your API key at https://console.soniox.com\n"
            "2. Run: export SONIOX_API_KEY=<YOUR_API_KEY>"
        )

    run_session(
        api_key=api_key,
        lines=args.line or DEFAULT_LINES,
        language=args.language,
        voice=args.voice,
        audio_format=args.audio_format,
        sample_rate=args.sample_rate,
        bitrate=args.bitrate,
        stream_id=args.stream_id,
        output_path=args.output_path,
    )


if __name__ == "__main__":
    main()

Terminal

# Generate speech with default settings (wav output)
python soniox_realtime.py --line "Hello from Soniox websocket Text-to-Speech."

# Generate raw PCM output
python soniox_realtime.py --audio_format pcm_s16le --sample_rate 24000 --output_path tts-output

See on GitHub: soniox_realtime.js.

import fs from "fs";
import path from "path";
import WebSocket from "ws";
import { parseArgs } from "node:util";
import process from "process";

const SONIOX_TTS_WEBSOCKET_URL = "wss://tts-rt.soniox.com/tts-websocket";
const MODEL = "tts-rt-v1";
const VALID_SAMPLE_RATES = [8000, 16000, 24000, 44100, 48000];
const VALID_BITRATES = [32000, 64000, 96000, 128000, 192000, 256000, 320000];
const VALID_AUDIO_FORMATS = [
  "pcm_f32le",
  "pcm_s16le",
  "pcm_mulaw",
  "pcm_alaw",
  "wav",
  "aac",
  "mp3",
  "opus",
  "flac",
];
const RAW_PCM_FORMATS = ["pcm_s16le", "pcm_f32le", "pcm_mulaw", "pcm_alaw"];

const DEFAULT_LINES = [
  "Welcome to Soniox real-time Text-to-Speech. ",
  "As text is streamed in, audio streams back in parallel with high accuracy, ",
  "so your application can start playing speech ",
  "within milliseconds of the first word.",
];

// Resolve a concrete output file path.
// If the provided path has no extension, derive one from audio_format:
//   * pcm_s16le  -> .wav  (we wrap the bytes in a WAV container below)
//   * other pcm_* -> .pcm (raw, no container)
//   * anything else -> the format name (e.g. .flac, .mp3, .opus)
function resolveOutputPath(outputPath, audioFormat) {
  if (outputPath && path.extname(outputPath)) {
    return outputPath;
  }
  const ext =
    audioFormat === "pcm_s16le"
      ? "wav"
      : RAW_PCM_FORMATS.includes(audioFormat)
        ? "pcm"
        : audioFormat;
  const base = outputPath || "tts-ws";
  return `${base}.${ext}`;
}

function pcmS16leToWav(pcm, { sampleRate, numChannels = 1 }) {
  const bitsPerSample = 16;
  const byteRate = sampleRate * numChannels * (bitsPerSample / 8);
  const blockAlign = numChannels * (bitsPerSample / 8);
  const dataSize = pcm.byteLength;
  const header = Buffer.alloc(44);
  header.write("RIFF", 0, "ascii");
  header.writeUInt32LE(36 + dataSize, 4);
  header.write("WAVE", 8, "ascii");
  header.write("fmt ", 12, "ascii");
  header.writeUInt32LE(16, 16);
  header.writeUInt16LE(1, 20);
  header.writeUInt16LE(numChannels, 22);
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(byteRate, 28);
  header.writeUInt16LE(blockAlign, 32);
  header.writeUInt16LE(bitsPerSample, 34);
  header.write("data", 36, "ascii");
  header.writeUInt32LE(dataSize, 40);
  return Buffer.concat([header, Buffer.from(pcm)]);
}

// Get Soniox TTS config.
function getConfig({
  apiKey,
  streamId,
  language,
  voice,
  audioFormat,
  sampleRate,
  bitrate,
}) {
  const config = {
    // Get your API key at console.soniox.com, then run: export SONIOX_API_KEY=<YOUR_API_KEY>
    api_key: apiKey,

    // Client-defined stream id to identify this realtime request.
    stream_id: streamId,

    // Select the model to use.
    // See: soniox.com/docs/tts/models
    model: MODEL,

    // Set the language of the input text.
    // See: soniox.com/docs/tts/concepts/supported-languages
    language,

    // Select the voice to use.
    // See: soniox.com/docs/tts/concepts/voices
    voice,

    // Audio format.
    // See: soniox.com/docs/tts/concepts/audio-formats
    audio_format: audioFormat,
  };

  if (sampleRate !== undefined) config.sample_rate = sampleRate;
  if (bitrate !== undefined) config.bitrate = bitrate;

  return config;
}

function getTextRequest(text, streamId, textEnd) {
  return {
    text,
    text_end: textEnd,
    stream_id: streamId,
  };
}

// Stream text lines to the websocket.
async function streamText(lines, streamId, ws) {
  for (const line of lines) {
    const cleanLine = line.trim();
    if (!cleanLine) continue;
    ws.send(JSON.stringify(getTextRequest(cleanLine, streamId, false)));
    // Sleep for 100 ms to simulate real-time streaming.
    await new Promise((res) => setTimeout(res, 100));
  }

  // Send text_end=true after the last chunk.
  ws.send(JSON.stringify(getTextRequest("", streamId, true)));
}

function runSession({
  apiKey,
  lines,
  language,
  voice,
  audioFormat,
  sampleRate,
  bitrate,
  streamId,
  outputPath,
}) {
  return new Promise((resolve, reject) => {
    console.log("Connecting to Soniox...");
    const ws = new WebSocket(SONIOX_TTS_WEBSOCKET_URL);

    const audioChunks = [];

    const finalize = (err) => {
      const destination = resolveOutputPath(outputPath, audioFormat);
      if (audioChunks.length > 0) {
        const audio = Buffer.concat(audioChunks);
        // Wrap raw pcm_s16le in a WAV container so the .wav file plays everywhere.
        const bytes =
          audioFormat === "pcm_s16le" &&
          path.extname(destination).toLowerCase() === ".wav"
            ? pcmS16leToWav(audio, { sampleRate })
            : audio;
        fs.writeFileSync(destination, bytes);
        console.log(`Wrote ${bytes.length} bytes to ${path.resolve(destination)}`);
      } else {
        console.log("No audio file was written.");
      }
      if (err) reject(err);
      else resolve();
    };

    ws.on("open", () => {
      const config = getConfig({
        apiKey,
        streamId,
        language,
        voice,
        audioFormat,
        sampleRate,
        bitrate,
      });

      // Send first request with config.
      ws.send(JSON.stringify(config));

      // Start streaming text in the background.
      streamText(lines, streamId, ws).catch((err) => {
        console.error("Text stream error:", err);
      });
      console.log("Session started.");
    });

    ws.on("message", (msg) => {
      let res;
      try {
        res = JSON.parse(msg.toString());
      } catch {
        return;
      }

      // Error from server.
      // See: https://soniox.com/docs/api-reference/tts/websocket-api#error-response
      if (res.error_code) {
        console.error(`Error: ${res.error_code} - ${res.error_message}`);
        ws.close();
        return;
      }

      // Collect audio bytes from base64-encoded chunks.
      if (res.audio) {
        audioChunks.push(Buffer.from(res.audio, "base64"));
      }

      // Session finished.
      if (res.terminated) {
        console.log("Session finished.");
        ws.close();
      }
    });

    ws.on("close", () => {
      finalize(null);
    });

    ws.on("error", (err) => {
      console.error("WebSocket error:", err.message);
      finalize(err);
    });
  });
}

async function main() {
  const { values: argv } = parseArgs({
    options: {
      line: { type: "string", multiple: true },
      language: { type: "string", default: "en" },
      voice: { type: "string", default: "Adrian" },
      audio_format: { type: "string", default: "pcm_s16le" },
      stream_id: { type: "string", default: "stream-1" },
      output_path: { type: "string", default: "tts-ws" },
      sample_rate: { type: "string" },
      bitrate: { type: "string" },
    },
  });

  if (!VALID_AUDIO_FORMATS.includes(argv.audio_format)) {
    throw new Error(
      `audio_format must be one of ${VALID_AUDIO_FORMATS.join(", ")}`,
    );
  }
  let sampleRate =
    argv.sample_rate !== undefined ? Number(argv.sample_rate) : undefined;
  if (sampleRate === undefined && RAW_PCM_FORMATS.includes(argv.audio_format)) {
    sampleRate = 24000;
  }
  if (sampleRate !== undefined && !VALID_SAMPLE_RATES.includes(sampleRate)) {
    throw new Error(
      `sample_rate must be one of ${VALID_SAMPLE_RATES.join(", ")}`,
    );
  }
  const bitrate = argv.bitrate !== undefined ? Number(argv.bitrate) : undefined;
  if (bitrate !== undefined && !VALID_BITRATES.includes(bitrate)) {
    throw new Error(`bitrate must be one of ${VALID_BITRATES.join(", ")}`);
  }

  const apiKey = process.env.SONIOX_API_KEY;
  if (!apiKey) {
    throw new Error(
      "Missing SONIOX_API_KEY.\n" +
        "1. Get your API key at https://console.soniox.com\n" +
        "2. Run: export SONIOX_API_KEY=<YOUR_API_KEY>",
    );
  }

  await runSession({
    apiKey,
    lines: argv.line && argv.line.length > 0 ? argv.line : DEFAULT_LINES,
    language: argv.language,
    voice: argv.voice,
    audioFormat: argv.audio_format,
    sampleRate,
    bitrate,
    streamId: argv.stream_id,
    outputPath: argv.output_path,
  });
}

main().catch((err) => {
  console.error("Error:", err.message);
  process.exit(1);
});

Terminal

# Generate speech with default settings (wav output)
node soniox_realtime.js --line "Hello from Soniox websocket Text-to-Speech."

# Generate raw PCM output
node soniox_realtime.js --audio_format pcm_s16le --sample_rate 24000 --output_path tts-output

WebSocket API

400Bad Request

401Unauthorized

402Payment Required

403Forbidden

408Request Timeout

429Too Many Requests

500Internal Server Error

503Service Unavailable

Code

Run

Code

Run

Code

Run

Code

Run

On this page