Real-time speech generation

Overview

Soniox Text-to-Speech generates native-speaker-quality speech in 60+ languages, with hallucination-free output and accurate pronunciation of alphanumerics like phone numbers, email addresses, and IDs.

It is optimized for ultra-low latency and can begin generating speech from the first few words, before the full sentence is available.

Endpoint

Connect to the WebSocket endpoint to start real-time generation.

wss://tts-rt.soniox.com/tts-websocket

Once connected, send JSON messages to start streams, send text, and receive audio.

Minimal client loop

Open the WebSocket connection.

Send config for stream-001.

Start reading server messages while sending text chunks.

Send the final chunk with text_end: true.

Decode and play or write audio chunks as they arrive.

Stop receiving audio after terminated: true.

Handle errors per stream, not per connection — check the error_code on each stream_id.

Connection keepalive

During idle periods, send keepalive messages to prevent the WebSocket connection from timing out. For details, see Connection keepalive.

Supported features

Voice selection
Audio output controls (audio_format, sample_rate, bitrate)
Incremental text streaming
Multiple streams on one WebSocket connection (using stream_id, up to 5 active streams)

End-to-end flow

For each stream, follow this exact order:

Send a config message to start a stream.

Send one or more text messages with text_end: false.

Send a final text message with text_end: true.

Read server messages in parallel with sending text chunks until you receive terminated: true.

Start a stream with a config message

{
  "api_key": "<SONIOX_API_KEY|SONIOX_TEMPORARY_API_KEY>",
  "model": "tts-rt-v1-preview",
  "language": "en",
  "voice": "Adrian",
  "audio_format": "pcm_s16le",
  "sample_rate": 24000,
  "stream_id": "stream-001"
}

Configuration parameters

Field	Type	Required	Description
`api_key`	string	Yes	Soniox API key or temporary API key.
`stream_id`	string	Yes	Client-generated stream identifier, unique among active streams on one socket.
`model`	string	Yes	Text-to-Speech model from models.
`language`	string	Yes	Language code, from languages.
`voice`	string	Yes	Voice name from voices.
`audio_format`	string	Yes	Audio format from audio formats.
`sample_rate`	number	Optional	Output sample rate in Hz. Required for raw formats.
`bitrate`	number	Optional	Codec bitrate in bps for compressed formats.

Send text chunks

{
  "text": "Hello there. This is the first chunk.",
  "text_end": false,
  "stream_id": "stream-001"
}

Send more chunks as needed with the same stream_id.

Finish the stream

{
  "text": "And this is the end.",
  "text_end": true,
  "stream_id": "stream-001"
}

Receive server messages until termination

Typical audio message:

{
  "audio": "<base64-audio-bytes>",
  "audio_end": false,
  "stream_id": "stream-001"
}

Final audio message (if the stream produced audio):

{
  "audio": "<base64-audio-bytes>",
  "audio_end": true,
  "stream_id": "stream-001"
}

Final stream event:

{
  "terminated": true,
  "stream_id": "stream-001"
}

Important

audio_end: true means no more audio payloads for the stream.
terminated: true means the stream lifecycle is closed.
Consider the stream complete only after receiving terminated: true.

Cancel a stream

Send a cancel message when you need to stop generation early (for example, user interruption or timeout).

When a stream is canceled:

Cancellation applies only to the specified stream_id.
The server stops generation for that stream and sends no additional audio.
The server sends terminated: true to confirm the stream is fully closed.
The WebSocket connection and other active streams remain available.

Cancel request:

{
  "stream_id": "stream-001",
  "cancel": true
}

Finalization response:

{
  "terminated": true,
  "stream_id": "stream-001"
}

Error handling

Error response format

If a stream fails due to a validation or generation error, the server returns an error message for that stream:

{
  "stream_id": "stream-001",
  "error_code": 400,
  "error_message": "Missing required field: model"
}

Stream error lifecycle

After an error, the server sends a termination message for the affected stream:

{
  "terminated": true,
  "stream_id": "stream-001"
}

Handle errors by stream_id, and release stream state only after terminated: true.

A stream error does not close the WebSocket connection. Other streams on the same connection continue normally.

Error details

For the full error list and messages, see the WebSocket API reference.

Code example

Prerequisite: Complete the steps in Get started.

See on GitHub: soniox_sdk_realtime.py.

import argparse
import os
import threading
import time
from pathlib import Path
from uuid import uuid4

from soniox import SonioxClient
from soniox.errors import SonioxRealtimeError
from soniox.types import RealtimeTTSConfig
from soniox.utils import output_file_for_audio_format

VALID_SAMPLE_RATES = [8000, 16000, 24000, 44100, 48000]
VALID_BITRATES = [32000, 64000, 96000, 128000, 192000, 256000, 320000]
VALID_AUDIO_FORMATS = [
    "pcm_f32le",
    "pcm_s16le",
    "pcm_mulaw",
    "pcm_alaw",
    "wav",
    "aac",
    "mp3",
    "opus",
    "flac",
]

DEFAULT_LINES = [
    "Welcome to Soniox real-time Text-to-Speech. ",
    "As text is streamed in, audio streams back in parallel with high accuracy, ",
    "so your application can start playing speech ",
    "within milliseconds of the first word.",
]


def get_config(
    model: str,
    language: str,
    voice: str,
    audio_format: str,
    sample_rate: int | None,
    bitrate: int | None,
    stream_id: str | None,
) -> RealtimeTTSConfig:
    config = RealtimeTTSConfig(
        # Stream id for this realtime TTS session.
        # If omitted, a random id is generated.
        stream_id=stream_id or f"tts-{uuid4()}",
        #
        # Select the model to use.
        # See: soniox.com/docs/tts/models
        model=model,
        #
        # Set the language of the input text.
        # See: soniox.com/docs/tts/languages
        language=language,
        #
        # Select the voice to use.
        # See: soniox.com/docs/tts/voices
        voice=voice,
        #
        # Set output audio format and optional encoding parameters.
        # See: soniox.com/docs/tts/api-reference/websocket-api
        audio_format=audio_format,
        sample_rate=sample_rate,
        bitrate=bitrate,
    )

    return config


def run_session(
    client: SonioxClient,
    lines: list[str],
    model: str,
    language: str,
    voice: str,
    audio_format: str,
    sample_rate: int | None,
    bitrate: int | None,
    stream_id: str | None,
    output_path: str | None,
) -> None:
    # Build a realtime Text-to-Speech session configuration.
    config = get_config(
        model=model,
        language=language,
        voice=voice,
        audio_format=audio_format,
        sample_rate=sample_rate,
        bitrate=bitrate,
        stream_id=stream_id,
    )
    sanitized_lines = [line.strip() for line in lines if line.strip()]
    if not sanitized_lines:
        raise ValueError("Text is empty after parsing.")

    destination = (
        Path(output_path)
        if output_path
        else output_file_for_audio_format(audio_format, "tts_realtime")
    )
    print("Connecting to Soniox...")
    audio_chunks: list[bytes] = []
    try:
        with client.realtime.tts.connect(config=config) as session:
            print("Session started.")
            send_errors: list[Exception] = []

            def send_worker() -> None:
                try:
                    for line in sanitized_lines:
                        session.send_text_chunk(line, text_end=False)
                        time.sleep(0.1)
                    session.finish()
                except Exception as exc:
                    send_errors.append(exc)

            threading.Thread(target=send_worker, daemon=True).start()
            # Receive streamed audio chunks from the websocket.
            for audio_chunk in session.receive_audio_chunks():
                audio_chunks.append(audio_chunk)
            if send_errors:
                raise RuntimeError(f"Failed to send realtime text: {send_errors[0]}")
            print("Session finished.")
    finally:
        audio = b"".join(audio_chunks)
        if audio:
            destination.write_bytes(audio)
            print(f"Wrote {len(audio)} bytes to {destination.resolve()}")
        else:
            print("No audio file was written.")


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--line",
        action="append",
        default=None,
        help="Line to send to realtime TTS (repeat --line for multiple lines).",
    )
    parser.add_argument("--model", default="tts-rt-v1-preview")
    parser.add_argument("--language", default="en")
    parser.add_argument("--voice", default="Adrian")
    parser.add_argument("--audio_format", default="wav")
    parser.add_argument("--sample_rate", type=int)
    parser.add_argument("--bitrate", type=int)
    parser.add_argument("--stream_id", help="Optional stream id.")
    parser.add_argument(
        "--output_path",
        help="Optional output file path. If omitted, a timestamped path is generated.",
    )
    args = parser.parse_args()

    if args.audio_format not in VALID_AUDIO_FORMATS:
        raise ValueError(f"audio_format must be one of {VALID_AUDIO_FORMATS}")
    if args.sample_rate is not None and args.sample_rate not in VALID_SAMPLE_RATES:
        raise ValueError(f"sample_rate must be None or one of {VALID_SAMPLE_RATES}")
    if args.bitrate is not None and args.bitrate not in VALID_BITRATES:
        raise ValueError(f"bitrate must be None or one of {VALID_BITRATES}")

    api_key = os.environ.get("SONIOX_API_KEY")
    if not api_key:
        raise RuntimeError(
            "Missing SONIOX_API_KEY.\n"
            "1. Get your API key at https://console.soniox.com\n"
            "2. Run: export SONIOX_API_KEY=<YOUR_API_KEY>"
        )

    client = SonioxClient(api_key=api_key)

    try:
        run_session(
            client=client,
            lines=args.line or DEFAULT_LINES,
            model=args.model,
            language=args.language,
            voice=args.voice,
            audio_format=args.audio_format,
            sample_rate=args.sample_rate,
            bitrate=args.bitrate,
            stream_id=args.stream_id,
            output_path=args.output_path,
        )
    except SonioxRealtimeError as exc:
        print("Soniox realtime error:", exc)
    finally:
        client.close()


if __name__ == "__main__":
    main()

Terminal

# Generate speech with default settings (wav output)
python soniox_sdk_realtime.py --line "Hello from Soniox realtime Text-to-Speech."

# Generate raw PCM output
python soniox_sdk_realtime.py --audio_format pcm_s16le --sample_rate 24000 --output_path tts-output.pcm

See on GitHub: soniox_sdk_realtime.js.

import { RealtimeError, SonioxNodeClient } from "@soniox/node";
import fs from "fs";
import path from "path";
import { parseArgs } from "node:util";
import process from "process";

const VALID_SAMPLE_RATES = [8000, 16000, 24000, 44100, 48000];
const VALID_BITRATES = [32000, 64000, 96000, 128000, 192000, 256000, 320000];
const VALID_AUDIO_FORMATS = [
  "pcm_f32le",
  "pcm_s16le",
  "pcm_mulaw",
  "pcm_alaw",
  "wav",
  "aac",
  "mp3",
  "opus",
  "flac",
];
const RAW_PCM_FORMATS = ["pcm_s16le", "pcm_f32le", "pcm_mulaw", "pcm_alaw"];

const DEFAULT_LINES = [
  "Welcome to Soniox real-time Text-to-Speech. ",
  "As text is streamed in, audio streams back in parallel with high accuracy, ",
  "so your application can start playing speech ",
  "within milliseconds of the first word.",
];

// Initialize the client.
// The API key is read from the SONIOX_API_KEY environment variable.
const client = new SonioxNodeClient();

// Resolve a concrete output file path.
// If the provided path has no extension, derive one from audio_format:
//   * pcm_s16le  -> .wav  (we wrap the bytes in a WAV container below)
//   * other pcm_* -> .pcm (raw, no container)
//   * anything else -> the format name (e.g. .flac, .mp3, .opus)
function resolveOutputPath(outputPath, audioFormat) {
  if (outputPath && path.extname(outputPath)) {
    return outputPath;
  }
  const ext =
    audioFormat === "pcm_s16le"
      ? "wav"
      : RAW_PCM_FORMATS.includes(audioFormat)
        ? "pcm"
        : audioFormat;
  const base = outputPath || "tts_realtime";
  return `${base}.${ext}`;
}

function pcmS16leToWav(pcm, { sampleRate, numChannels = 1 }) {
  const bitsPerSample = 16;
  const byteRate = sampleRate * numChannels * (bitsPerSample / 8);
  const blockAlign = numChannels * (bitsPerSample / 8);
  const dataSize = pcm.byteLength;
  const header = Buffer.alloc(44);
  header.write("RIFF", 0, "ascii");
  header.writeUInt32LE(36 + dataSize, 4);
  header.write("WAVE", 8, "ascii");
  header.write("fmt ", 12, "ascii");
  header.writeUInt32LE(16, 16);
  header.writeUInt16LE(1, 20);
  header.writeUInt16LE(numChannels, 22);
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(byteRate, 28);
  header.writeUInt16LE(blockAlign, 32);
  header.writeUInt16LE(bitsPerSample, 34);
  header.write("data", 36, "ascii");
  header.writeUInt32LE(dataSize, 40);
  return Buffer.concat([header, Buffer.from(pcm)]);
}

// Build a realtime TTS stream config.
function getStreamConfig({
  model,
  language,
  voice,
  audioFormat,
  sampleRate,
  bitrate,
  streamId,
}) {
  const config = {
    // Client-defined stream id (auto-generated if omitted).
    ...(streamId && { stream_id: streamId }),

    // Select the model to use.
    // See: soniox.com/docs/tts/models
    model,

    // Set the language of the input text.
    // See: soniox.com/docs/tts/languages
    language,

    // Select the voice to use.
    // See: soniox.com/docs/tts/voices
    voice,

    // Set output audio format and optional encoding parameters.
    // See: soniox.com/docs/tts/api-reference/websocket-api
    audio_format: audioFormat,
  };

  if (sampleRate !== undefined) config.sample_rate = sampleRate;
  if (bitrate !== undefined) config.bitrate = bitrate;

  return config;
}

async function runSession({
  lines,
  model,
  language,
  voice,
  audioFormat,
  sampleRate,
  bitrate,
  streamId,
  outputPath,
}) {
  const sanitizedLines = lines
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
  if (sanitizedLines.length === 0) {
    throw new Error("Text is empty after parsing.");
  }

  const destination = resolveOutputPath(outputPath, audioFormat);
  const config = getStreamConfig({
    model,
    language,
    voice,
    audioFormat,
    sampleRate,
    bitrate,
    streamId,
  });

  console.log("Connecting to Soniox...");
  const stream = await client.realtime.tts(config);
  console.log("Session started.");

  // Send text chunks in the background while receiving audio.
  let sendError = null;
  const sendPromise = (async () => {
    try {
      for (const line of sanitizedLines) {
        stream.sendText(line);
        // Sleep for 100 ms to simulate real-time streaming.
        await new Promise((res) => setTimeout(res, 100));
      }
      stream.finish();
    } catch (err) {
      sendError = err;
    }
  })();

  // Collect streamed audio chunks.
  const audioChunks = [];
  try {
    for await (const chunk of stream) {
      audioChunks.push(chunk);
    }
  } finally {
    await sendPromise;
    stream.close();
  }

  if (sendError) {
    throw new Error(`Failed to send realtime text: ${sendError.message}`);
  }

  console.log("Session finished.");

  const audio = Buffer.concat(audioChunks.map((c) => Buffer.from(c)));
  if (audio.length > 0) {
    // Wrap raw pcm_s16le in a WAV container so the .wav file plays everywhere.
    const bytes =
      audioFormat === "pcm_s16le" &&
      path.extname(destination).toLowerCase() === ".wav"
        ? pcmS16leToWav(audio, { sampleRate })
        : audio;
    fs.writeFileSync(destination, bytes);
    console.log(`Wrote ${bytes.length} bytes to ${path.resolve(destination)}`);
  } else {
    console.log("No audio file was written.");
  }
}

async function main() {
  const { values: argv } = parseArgs({
    options: {
      line: { type: "string", multiple: true },
      model: { type: "string", default: "tts-rt-v1-preview" },
      language: { type: "string", default: "en" },
      voice: { type: "string", default: "Adrian" },
      audio_format: { type: "string", default: "pcm_s16le" },
      sample_rate: { type: "string" },
      bitrate: { type: "string" },
      stream_id: { type: "string" },
      output_path: { type: "string" },
    },
  });

  if (!VALID_AUDIO_FORMATS.includes(argv.audio_format)) {
    throw new Error(
      `audio_format must be one of ${VALID_AUDIO_FORMATS.join(", ")}`,
    );
  }
  let sampleRate =
    argv.sample_rate !== undefined ? Number(argv.sample_rate) : undefined;
  if (sampleRate === undefined && RAW_PCM_FORMATS.includes(argv.audio_format)) {
    sampleRate = 24000;
  }
  if (sampleRate !== undefined && !VALID_SAMPLE_RATES.includes(sampleRate)) {
    throw new Error(
      `sample_rate must be one of ${VALID_SAMPLE_RATES.join(", ")}`,
    );
  }
  const bitrate = argv.bitrate !== undefined ? Number(argv.bitrate) : undefined;
  if (bitrate !== undefined && !VALID_BITRATES.includes(bitrate)) {
    throw new Error(`bitrate must be one of ${VALID_BITRATES.join(", ")}`);
  }

  try {
    await runSession({
      lines: argv.line && argv.line.length > 0 ? argv.line : DEFAULT_LINES,
      model: argv.model,
      language: argv.language,
      voice: argv.voice,
      audioFormat: argv.audio_format,
      sampleRate,
      bitrate,
      streamId: argv.stream_id,
      outputPath: argv.output_path,
    });
  } catch (err) {
    if (err instanceof RealtimeError) {
      console.error("Soniox realtime error:", err.message);
    } else {
      throw err;
    }
  }
}

main().catch((err) => {
  console.error("Error:", err.message);
  process.exit(1);
});

Terminal

# Generate speech with default settings (wav output)
node soniox_sdk_realtime.js --line "Hello from Soniox realtime Text-to-Speech."

# Generate raw PCM output
node soniox_sdk_realtime.js --audio_format pcm_s16le --sample_rate 24000 --output_path tts-output.pcm

See on GitHub: soniox_realtime.py.

import argparse
import base64
import json
import os
import threading
import time
from typing import Any

from websockets import ConnectionClosedOK
from websockets.sync.client import connect

SONIOX_TTS_WEBSOCKET_URL = "wss://tts-rt.soniox.com/tts-websocket"
MODEL = "tts-rt-v1-preview"
VALID_SAMPLE_RATES = [8000, 16000, 24000, 44100, 48000]
VALID_BITRATES = [32000, 64000, 96000, 128000, 192000, 256000, 320000]
VALID_AUDIO_FORMATS = [
    "pcm_f32le",
    "pcm_s16le",
    "pcm_mulaw",
    "pcm_alaw",
    "wav",
    "aac",
    "mp3",
    "opus",
    "flac",
]
DEFAULT_LINES = [
    "Welcome to Soniox real-time Text-to-Speech. ",
    "As text is streamed in, audio streams back in parallel with high accuracy, ",
    "so your application can start playing speech ",
    "within milliseconds of the first word.",
]


def get_output_path(*, output_path: str, audio_format: str) -> str:
    """
    Generates the resulting output path for the given audio format.
    """
    if "." in os.path.basename(output_path):
        return output_path
    ext = "pcm" if audio_format in ("pcm_s16le", "pcm_s16be") else audio_format
    return f"{output_path}.{ext}"


# Get Soniox TTS config.
def get_config(
    api_key: str,
    stream_id: str,
    language: str,
    voice: str,
    audio_format: str,
    sample_rate: int | None,
    bitrate: int | None,
) -> dict:
    config: dict[str, Any] = {
        # Get your API key at console.soniox.com, then run: export SONIOX_API_KEY=<YOUR_API_KEY>
        "api_key": api_key,
        #
        # Client-defined stream id to identify this realtime request.
        "stream_id": stream_id,
        #
        # Select the model to use.
        # See: soniox.com/docs/tts/models
        "model": MODEL,
        #
        # Set the language of the input text.
        # See: soniox.com/docs/tts/languages
        "language": language,
        #
        # Select the voice to use.
        # See: soniox.com/docs/tts/voices
        "voice": voice,
        #
        # Audio format.
        # See: soniox.com/docs/tts/audio-formats
        "audio_format": audio_format,
    }

    if sample_rate is not None:
        config["sample_rate"] = sample_rate
    if bitrate is not None:
        config["bitrate"] = bitrate

    return config


def get_text_request(text: str, stream_id: str, text_end: bool) -> dict:
    return {
        "text": text,
        "text_end": text_end,
        "stream_id": stream_id,
    }


# Stream text lines to the websocket.
def stream_text(lines: list[str], stream_id: str, ws) -> None:
    for line in lines:
        clean_line = line.strip()
        if not clean_line:
            continue
        ws.send(json.dumps(get_text_request(clean_line, stream_id, text_end=False)))
        # Sleep for 100 ms to simulate real-time streaming.
        time.sleep(0.1)

    # Send text_end=true after the last chunk.
    ws.send(json.dumps(get_text_request("", stream_id, text_end=True)))


def send_requests(
    ws,
    api_key: str,
    lines: list[str],
    language: str,
    voice: str,
    audio_format: str,
    sample_rate: int | None,
    bitrate: int | None,
    stream_id: str,
) -> None:
    config = get_config(
        api_key=api_key,
        stream_id=stream_id,
        language=language,
        voice=voice,
        audio_format=audio_format,
        sample_rate=sample_rate,
        bitrate=bitrate,
    )
    ws.send(json.dumps(config))
    stream_text(lines, stream_id, ws)


def run_session(
    api_key: str,
    lines: list[str],
    language: str,
    voice: str,
    audio_format: str,
    sample_rate: int | None,
    bitrate: int | None,
    stream_id: str,
    output_path: str,
) -> None:
    print("Connecting to Soniox...")
    with connect(SONIOX_TTS_WEBSOCKET_URL) as ws:
        send_errors: list[Exception] = []

        def send_worker() -> None:
            try:
                send_requests(
                    ws,
                    api_key,
                    lines,
                    language,
                    voice,
                    audio_format,
                    sample_rate,
                    bitrate,
                    stream_id,
                )
            except Exception as exc:
                send_errors.append(exc)

        # Send config and text in the background while receiving responses.
        threading.Thread(
            target=send_worker,
            daemon=True,
        ).start()

        print("Session started.")
        audio_chunks: list[bytes] = []

        try:
            while True:
                if send_errors:
                    raise RuntimeError(f"Failed to send realtime requests: {send_errors[0]}")
                message = ws.recv()
                res = json.loads(message)

                # Error from server.
                if res.get("error_code") is not None:
                    print(f"Error: {res['error_code']} - {res['error_message']}")
                    break

                # Collect audio bytes from base64-encoded chunks.
                audio_b64 = res.get("audio")
                if audio_b64:
                    audio_chunks.append(base64.b64decode(audio_b64))

                # Session finished.
                if res.get("terminated"):
                    break

        except ConnectionClosedOK:
            # Normal, server closed after finished.
            pass
        except KeyboardInterrupt:
            print("\nInterrupted by user.")
        except Exception as e:
            print(f"Error: {e}")
        finally:
            audio_data = b"".join(audio_chunks)
            if audio_data:
                destination = get_output_path(output_path=output_path, audio_format=audio_format)
                with open(destination, "wb") as fh:
                    fh.write(audio_data)
                print(f"Wrote {len(audio_data)} bytes to {destination}")
            else:
                print("No audio file was written.")


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--line",
        action="append",
        default=None,
        help="Line to send to realtime TTS (repeat --line for multiple lines).",
    )
    parser.add_argument("--language", default="en")
    parser.add_argument("--voice", default="Adrian")
    parser.add_argument("--audio_format", default="wav")
    parser.add_argument("--stream_id", default="stream-1")
    parser.add_argument("--output_path", default="tts-ws")
    parser.add_argument("--sample_rate", type=int)
    parser.add_argument("--bitrate", type=int)
    args = parser.parse_args()

    if args.audio_format not in VALID_AUDIO_FORMATS:
        raise ValueError(f"audio_format must be one of {VALID_AUDIO_FORMATS}")
    if args.sample_rate is not None and args.sample_rate not in VALID_SAMPLE_RATES:
        raise ValueError(f"sample_rate must be None or one of {VALID_SAMPLE_RATES}")
    if args.bitrate is not None and args.bitrate not in VALID_BITRATES:
        raise ValueError(f"bitrate must be None or one of {VALID_BITRATES}")

    api_key = os.environ.get("SONIOX_API_KEY")
    if not api_key:
        raise RuntimeError(
            "Missing SONIOX_API_KEY.\n"
            "1. Get your API key at https://console.soniox.com\n"
            "2. Run: export SONIOX_API_KEY=<YOUR_API_KEY>"
        )

    run_session(
        api_key=api_key,
        lines=args.line or DEFAULT_LINES,
        language=args.language,
        voice=args.voice,
        audio_format=args.audio_format,
        sample_rate=args.sample_rate,
        bitrate=args.bitrate,
        stream_id=args.stream_id,
        output_path=args.output_path,
    )


if __name__ == "__main__":
    main()

Terminal

# Generate speech with default settings (wav output)
python soniox_realtime.py --line "Hello from Soniox websocket Text-to-Speech."

# Generate raw PCM output
python soniox_realtime.py --audio_format pcm_s16le --sample_rate 24000 --output_path tts-output

See on GitHub: soniox_realtime.js.

import fs from "fs";
import path from "path";
import WebSocket from "ws";
import { parseArgs } from "node:util";
import process from "process";

const SONIOX_TTS_WEBSOCKET_URL = "wss://tts-rt.soniox.com/tts-websocket";
const MODEL = "tts-rt-v1-preview";
const VALID_SAMPLE_RATES = [8000, 16000, 24000, 44100, 48000];
const VALID_BITRATES = [32000, 64000, 96000, 128000, 192000, 256000, 320000];
const VALID_AUDIO_FORMATS = [
  "pcm_f32le",
  "pcm_s16le",
  "pcm_mulaw",
  "pcm_alaw",
  "wav",
  "aac",
  "mp3",
  "opus",
  "flac",
];
const RAW_PCM_FORMATS = ["pcm_s16le", "pcm_f32le", "pcm_mulaw", "pcm_alaw"];

const DEFAULT_LINES = [
  "Welcome to Soniox real-time Text-to-Speech. ",
  "As text is streamed in, audio streams back in parallel with high accuracy, ",
  "so your application can start playing speech ",
  "within milliseconds of the first word.",
];

// Resolve a concrete output file path.
// If the provided path has no extension, derive one from audio_format:
//   * pcm_s16le  -> .wav  (we wrap the bytes in a WAV container below)
//   * other pcm_* -> .pcm (raw, no container)
//   * anything else -> the format name (e.g. .flac, .mp3, .opus)
function resolveOutputPath(outputPath, audioFormat) {
  if (outputPath && path.extname(outputPath)) {
    return outputPath;
  }
  const ext =
    audioFormat === "pcm_s16le"
      ? "wav"
      : RAW_PCM_FORMATS.includes(audioFormat)
        ? "pcm"
        : audioFormat;
  const base = outputPath || "tts-ws";
  return `${base}.${ext}`;
}

function pcmS16leToWav(pcm, { sampleRate, numChannels = 1 }) {
  const bitsPerSample = 16;
  const byteRate = sampleRate * numChannels * (bitsPerSample / 8);
  const blockAlign = numChannels * (bitsPerSample / 8);
  const dataSize = pcm.byteLength;
  const header = Buffer.alloc(44);
  header.write("RIFF", 0, "ascii");
  header.writeUInt32LE(36 + dataSize, 4);
  header.write("WAVE", 8, "ascii");
  header.write("fmt ", 12, "ascii");
  header.writeUInt32LE(16, 16);
  header.writeUInt16LE(1, 20);
  header.writeUInt16LE(numChannels, 22);
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(byteRate, 28);
  header.writeUInt16LE(blockAlign, 32);
  header.writeUInt16LE(bitsPerSample, 34);
  header.write("data", 36, "ascii");
  header.writeUInt32LE(dataSize, 40);
  return Buffer.concat([header, Buffer.from(pcm)]);
}

// Get Soniox TTS config.
function getConfig({
  apiKey,
  streamId,
  language,
  voice,
  audioFormat,
  sampleRate,
  bitrate,
}) {
  const config = {
    // Get your API key at console.soniox.com, then run: export SONIOX_API_KEY=<YOUR_API_KEY>
    api_key: apiKey,

    // Client-defined stream id to identify this realtime request.
    stream_id: streamId,

    // Select the model to use.
    // See: soniox.com/docs/tts/models
    model: MODEL,

    // Set the language of the input text.
    // See: soniox.com/docs/tts/languages
    language,

    // Select the voice to use.
    // See: soniox.com/docs/tts/voices
    voice,

    // Audio format.
    // See: soniox.com/docs/tts/audio-formats
    audio_format: audioFormat,
  };

  if (sampleRate !== undefined) config.sample_rate = sampleRate;
  if (bitrate !== undefined) config.bitrate = bitrate;

  return config;
}

function getTextRequest(text, streamId, textEnd) {
  return {
    text,
    text_end: textEnd,
    stream_id: streamId,
  };
}

// Stream text lines to the websocket.
async function streamText(lines, streamId, ws) {
  for (const line of lines) {
    const cleanLine = line.trim();
    if (!cleanLine) continue;
    ws.send(JSON.stringify(getTextRequest(cleanLine, streamId, false)));
    // Sleep for 100 ms to simulate real-time streaming.
    await new Promise((res) => setTimeout(res, 100));
  }

  // Send text_end=true after the last chunk.
  ws.send(JSON.stringify(getTextRequest("", streamId, true)));
}

function runSession({
  apiKey,
  lines,
  language,
  voice,
  audioFormat,
  sampleRate,
  bitrate,
  streamId,
  outputPath,
}) {
  return new Promise((resolve, reject) => {
    console.log("Connecting to Soniox...");
    const ws = new WebSocket(SONIOX_TTS_WEBSOCKET_URL);

    const audioChunks = [];

    const finalize = (err) => {
      const destination = resolveOutputPath(outputPath, audioFormat);
      if (audioChunks.length > 0) {
        const audio = Buffer.concat(audioChunks);
        // Wrap raw pcm_s16le in a WAV container so the .wav file plays everywhere.
        const bytes =
          audioFormat === "pcm_s16le" &&
          path.extname(destination).toLowerCase() === ".wav"
            ? pcmS16leToWav(audio, { sampleRate })
            : audio;
        fs.writeFileSync(destination, bytes);
        console.log(`Wrote ${bytes.length} bytes to ${path.resolve(destination)}`);
      } else {
        console.log("No audio file was written.");
      }
      if (err) reject(err);
      else resolve();
    };

    ws.on("open", () => {
      const config = getConfig({
        apiKey,
        streamId,
        language,
        voice,
        audioFormat,
        sampleRate,
        bitrate,
      });

      // Send first request with config.
      ws.send(JSON.stringify(config));

      // Start streaming text in the background.
      streamText(lines, streamId, ws).catch((err) => {
        console.error("Text stream error:", err);
      });
      console.log("Session started.");
    });

    ws.on("message", (msg) => {
      let res;
      try {
        res = JSON.parse(msg.toString());
      } catch {
        return;
      }

      // Error from server.
      // See: https://soniox.com/docs/tts/api-reference/websocket-api#error-response
      if (res.error_code) {
        console.error(`Error: ${res.error_code} - ${res.error_message}`);
        ws.close();
        return;
      }

      // Collect audio bytes from base64-encoded chunks.
      if (res.audio) {
        audioChunks.push(Buffer.from(res.audio, "base64"));
      }

      // Session finished.
      if (res.terminated) {
        console.log("Session finished.");
        ws.close();
      }
    });

    ws.on("close", () => {
      finalize(null);
    });

    ws.on("error", (err) => {
      console.error("WebSocket error:", err.message);
      finalize(err);
    });
  });
}

async function main() {
  const { values: argv } = parseArgs({
    options: {
      line: { type: "string", multiple: true },
      language: { type: "string", default: "en" },
      voice: { type: "string", default: "Adrian" },
      audio_format: { type: "string", default: "pcm_s16le" },
      stream_id: { type: "string", default: "stream-1" },
      output_path: { type: "string", default: "tts-ws" },
      sample_rate: { type: "string" },
      bitrate: { type: "string" },
    },
  });

  if (!VALID_AUDIO_FORMATS.includes(argv.audio_format)) {
    throw new Error(
      `audio_format must be one of ${VALID_AUDIO_FORMATS.join(", ")}`,
    );
  }
  let sampleRate =
    argv.sample_rate !== undefined ? Number(argv.sample_rate) : undefined;
  if (sampleRate === undefined && RAW_PCM_FORMATS.includes(argv.audio_format)) {
    sampleRate = 24000;
  }
  if (sampleRate !== undefined && !VALID_SAMPLE_RATES.includes(sampleRate)) {
    throw new Error(
      `sample_rate must be one of ${VALID_SAMPLE_RATES.join(", ")}`,
    );
  }
  const bitrate = argv.bitrate !== undefined ? Number(argv.bitrate) : undefined;
  if (bitrate !== undefined && !VALID_BITRATES.includes(bitrate)) {
    throw new Error(`bitrate must be one of ${VALID_BITRATES.join(", ")}`);
  }

  const apiKey = process.env.SONIOX_API_KEY;
  if (!apiKey) {
    throw new Error(
      "Missing SONIOX_API_KEY.\n" +
        "1. Get your API key at https://console.soniox.com\n" +
        "2. Run: export SONIOX_API_KEY=<YOUR_API_KEY>",
    );
  }

  await runSession({
    apiKey,
    lines: argv.line && argv.line.length > 0 ? argv.line : DEFAULT_LINES,
    language: argv.language,
    voice: argv.voice,
    audioFormat: argv.audio_format,
    sampleRate,
    bitrate,
    streamId: argv.stream_id,
    outputPath: argv.output_path,
  });
}

main().catch((err) => {
  console.error("Error:", err.message);
  process.exit(1);
});

Terminal

# Generate speech with default settings (wav output)
node soniox_realtime.js --line "Hello from Soniox websocket Text-to-Speech."

# Generate raw PCM output
node soniox_realtime.js --audio_format pcm_s16le --sample_rate 24000 --output_path tts-output

Code

Run

Code

Run

Code

Run

Code

Run

On this page