Real-time speech generation with Web SDK

The Soniox Web SDK supports real-time Text-to-Speech generation over WebSocket directly in the browser. You send text — all at once or incrementally — and receive decoded audio chunks as they arrive, so playback can start before generation is complete. This is the ideal transport for narrating LLM output and building voice agents in the browser.

If you already have the full text up front and don't need chunk-by-chunk playback, use REST speech generation — it's a single HTTP request.

Set up your temporary API key endpoint

Create a temporary key endpoint on your server using the Soniox Node SDK. Real-time TTS keys use the tts_rt usage type.

To attribute browser-side TTS traffic to an end user or session, pass client_reference_id to createTemporaryKey. Every request authenticated with the key is recorded under that identifier in usage logs, and clients cannot override it.

import express from 'express';
import { SonioxNodeClient } from '@soniox/node';

const app = express();
const client = new SonioxNodeClient(); // reads SONIOX_API_KEY from env

app.get('/tts-rt-tmp-key', async (_req, res) => {
  try {
    const { api_key, expires_at } = await client.auth.createTemporaryKey({
      usage_type: 'tts_rt',
      expires_in_seconds: 300,
    });
    res.json({ api_key, expires_at });
  } catch (err) {
    res.status(500).json({ error: err instanceof Error ? err.message : 'Failed to create temporary key' });
  }
});

app.listen(3000);
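The attribution described above can be sketched as a small helper that builds the options object for createTemporaryKey. The helper name `buildTtsKeyOptions` and the way the user id reaches it are assumptions for illustration; only the `usage_type`, `expires_in_seconds`, and `client_reference_id` field names come from this page.

```typescript
// Hypothetical helper, not part of the SDK: builds the options object
// passed to createTemporaryKey. Field names follow this page.
interface TtsKeyOptions {
  usage_type: "tts_rt";
  expires_in_seconds: number;
  client_reference_id?: string;
}

function buildTtsKeyOptions(userId?: string, ttlSeconds = 300): TtsKeyOptions {
  const options: TtsKeyOptions = {
    usage_type: "tts_rt",
    expires_in_seconds: ttlSeconds,
  };
  // Attach the identifier only when the caller is known, so anonymous
  // traffic is not recorded under an empty reference.
  if (userId) options.client_reference_id = userId;
  return options;
}
```

Inside the route handler this would be used as `client.auth.createTemporaryKey(buildTtsKeyOptions(sessionUserId))`, where `sessionUserId` stands for whatever your own auth layer provides.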

Quickstart

Create a SonioxClient with a config resolver, then call client.realtime.tts() to open a single-stream session. Send text, consume audio by async iteration, and play it back.

import { SonioxClient } from "@soniox/client";

const client = new SonioxClient({
  config: async () => {
    const res = await fetch("/tts-rt-tmp-key");
    const { api_key } = await res.json();
    return { api_key };
  },
});

const stream = await client.realtime.tts({
  voice: "Adrian",
  model: "tts-rt-v1",
  language: "en",
  audio_format: "wav",
});

stream.sendText("Hello from Soniox real-time text-to-speech.", { end: true });

const chunks: Uint8Array[] = [];
for await (const chunk of stream) {
  chunks.push(chunk);
}

const blob = new Blob(chunks, { type: "audio/wav" });
await new Audio(URL.createObjectURL(blob)).play();

The stream closes itself (and the underlying WebSocket) once terminated fires. You never have to call close() in single-stream mode.
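The collect-then-play step from the quickstart can be factored into a reusable helper. This is a minimal sketch that relies only on the stream being an async iterable of `Uint8Array` chunks, as shown above; `collectAudio` is a hypothetical name, not an SDK function.

```typescript
// Hypothetical helper: drains an async iterable of audio chunks and
// returns them concatenated into a single buffer.
async function collectAudio(source: AsyncIterable<Uint8Array>): Promise<Uint8Array> {
  const chunks: Uint8Array[] = [];
  let total = 0;
  for await (const chunk of source) {
    chunks.push(chunk);
    total += chunk.byteLength;
  }
  // Copy every chunk into one contiguous buffer, in arrival order.
  const out = new Uint8Array(total);
  let offset = 0;
  for (const chunk of chunks) {
    out.set(chunk, offset);
    offset += chunk.byteLength;
  }
  return out;
}
```

The returned bytes can then be wrapped in a `Blob` and played exactly as in the quickstart.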

Play audio as it arrives

For the lowest-latency playback, feed chunks into a MediaSource instead of waiting for the full payload.

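const mediaSource = new MediaSource();
const audioEl = new Audio(URL.createObjectURL(mediaSource));

// Attach the handler before starting playback so "sourceopen" cannot be missed.
mediaSource.addEventListener("sourceopen", async () => {
  // MSE support is container-specific and "audio/wav" is often not accepted;
  // check MediaSource.isTypeSupported(...) for the MIME type you use.
  const sourceBuffer = mediaSource.addSourceBuffer("audio/wav");

  const stream = await client.realtime.tts({
    voice: "Adrian",
    audio_format: "wav",
  });
  stream.sendText("Streaming audio for low-latency playback.", { end: true });

  for await (const chunk of stream) {
    // appendBuffer is asynchronous: wait for "updateend" before the next append.
    await new Promise<void>((resolve) => {
      sourceBuffer.addEventListener("updateend", () => resolve(), { once: true });
      sourceBuffer.appendBuffer(chunk);
    });
  }

  mediaSource.endOfStream();
}, { once: true });

// Don't await play() before data arrives: its promise resolves only once playback starts.
audioEl.play().catch((err) => console.error("Playback failed:", err));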
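Later sections call a `playback(chunk)` function without defining it. One way to build it on top of a `SourceBuffer` is a small append queue, because a `SourceBuffer` throws if `appendBuffer` is called while a previous append is still in progress. The sketch below is an assumption about how you might structure this, not SDK code; `SourceBufferLike` and `createAppendQueue` are hypothetical names, and the interface lists only the members the queue needs.

```typescript
// Minimal structural view of SourceBuffer (hypothetical type for this sketch).
interface SourceBufferLike {
  readonly updating: boolean;
  appendBuffer(data: Uint8Array): void;
  addEventListener(type: "updateend", listener: () => void): void;
}

// Queues chunks and appends them one at a time, in arrival order.
function createAppendQueue(sourceBuffer: SourceBufferLike) {
  const pending: Uint8Array[] = [];

  const flush = (): void => {
    // Only one append may be in flight; wait for "updateend" otherwise.
    if (sourceBuffer.updating || pending.length === 0) return;
    sourceBuffer.appendBuffer(pending.shift()!);
  };

  // Each finished append triggers the next one.
  sourceBuffer.addEventListener("updateend", flush);

  return {
    push(chunk: Uint8Array): void {
      pending.push(chunk);
      flush();
    },
    get size(): number {
      return pending.length;
    },
  };
}
```

With a real `SourceBuffer`, `const queue = createAppendQueue(sourceBuffer)` followed by `const playback = (chunk: Uint8Array) => queue.push(chunk)` gives a non-blocking playback function.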

Send text incrementally

Call sendText(text) for each chunk as it becomes available, then mark the last chunk with { end: true } or invoke finish() explicitly. This is the pattern for narrating an LLM response token-by-token.

const stream = await client.realtime.tts({ voice: "Adrian", audio_format: "wav" });

stream.sendText("Hello from Soniox ");
stream.sendText("real-time TTS. ");
stream.sendText("This is the final chunk.", { end: true });

for await (const chunk of stream) {
  playback(chunk); // your playback function (see "Play audio as it arrives" above)
}
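When the pieces are already available as a list, the "mark the last chunk" rule can be wrapped in a helper. This is a sketch under the assumption that the stream exposes `sendText(text, { end })` and `finish()` as described above; `TextSink` and `sendAll` are hypothetical names introduced for illustration.

```typescript
// Minimal structural type covering only the calls used on this page.
interface TextSink {
  sendText(text: string, options?: { end?: boolean }): void;
  finish(): void;
}

// Sends every part in order and marks the final one with { end: true }.
function sendAll(sink: TextSink, parts: readonly string[]): void {
  if (parts.length === 0) {
    // Nothing to say: still signal "no more text" so the stream can end.
    sink.finish();
    return;
  }
  parts.forEach((part, index) => {
    const isLast = index === parts.length - 1;
    if (isLast) sink.sendText(part, { end: true });
    else sink.sendText(part);
  });
}
```

For example, `sendAll(stream, ["Hello from Soniox ", "real-time TTS."])` is equivalent to two `sendText` calls with `{ end: true }` on the second.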

Pipe from an async iterable

stream.sendStream(source) pipes any AsyncIterable<string> into the TTS session and auto-finishes when the iterable completes. Sending and receiving run concurrently.

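async function* llmTokens(prompt: string): AsyncIterable<string> {
  const res = await fetch("/llm/stream", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt }),
  });
  if (!res.ok || !res.body) throw new Error(`LLM request failed: ${res.status}`);
  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    // stream: true keeps multi-byte characters intact across chunk boundaries
    yield decoder.decode(value, { stream: true });
  }
  const tail = decoder.decode(); // flush any buffered bytes
  if (tail) yield tail;
}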

const stream = await client.realtime.tts({ voice: "Adrian", audio_format: "wav" });
stream.sendStream(llmTokens("Tell me a story."));

for await (const chunk of stream) {
  playback(chunk);
}
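Because `sendStream` accepts any `AsyncIterable<string>`, the source can be wrapped before it is handed over. A common need is showing the text on screen while it is being spoken. The sketch below assumes nothing about the SDK beyond that input type; `tapTokens` is a hypothetical helper name.

```typescript
// Passes every token through unchanged while also reporting it to a callback,
// for example to append it to an on-screen transcript.
async function* tapTokens(
  source: AsyncIterable<string>,
  onToken: (token: string) => void,
): AsyncGenerator<string> {
  for await (const token of source) {
    onToken(token);
    yield token;
  }
}
```

Usage would look like `stream.sendStream(tapTokens(llmTokens("Tell me a story."), (t) => { transcript += t; }))`, where `transcript` is a variable in your own app.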

Event-based consumption

RealtimeTtsStream is also a typed event emitter. When you prefer an event-driven style over async iteration, listen for TtsStreamEvents:

Event	Payload	Description
`audio`	`Uint8Array`	Decoded audio chunk.
`audioEnd`	—	Server marked the final audio payload for this stream.
`terminated`	—	Stream fully closed by the server.
`error`	`Error`	Stream-level error.

const stream = await client.realtime.tts({ voice: "Adrian", audio_format: "wav" });

stream.on("audio", (chunk) => playback(chunk));
stream.on("audioEnd", () => console.log("last audio payload received"));
stream.on("error", (err) => console.error("Stream error:", err));
stream.on("terminated", () => console.log("stream done"));

stream.sendText("Hello from event-based TTS.", { end: true });

Choose either async iteration or event listeners — not both. The async iterator consumes audio events internally.
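If you use the event style but still want a single awaitable result, the events in the table above can be wrapped in a Promise. This is a sketch that assumes the stream's `on(event, handler)` method matches the event names and payloads listed above; `TtsEmitterLike` and `audioFromEvents` are hypothetical names, not SDK exports.

```typescript
// Minimal structural view of the event emitter, based on the events table.
interface TtsEmitterLike {
  on(event: "audio", handler: (chunk: Uint8Array) => void): unknown;
  on(event: "audioEnd" | "terminated", handler: () => void): unknown;
  on(event: "error", handler: (err: Error) => void): unknown;
}

// Resolves with all audio chunks once the stream terminates,
// or rejects with the first stream-level error.
function audioFromEvents(stream: TtsEmitterLike): Promise<Uint8Array[]> {
  return new Promise((resolve, reject) => {
    const chunks: Uint8Array[] = [];
    stream.on("audio", (chunk) => {
      chunks.push(chunk);
    });
    stream.on("error", (err) => reject(err));
    // If an error was reported first, this resolve is a no-op.
    stream.on("terminated", () => resolve(chunks));
  });
}
```

Register the listeners by calling `audioFromEvents(stream)` before `sendText`, then `await` the returned promise.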

Multi-stream connection

A single WebSocket connection can carry up to 5 concurrent TTS streams. Use client.realtime.tts.multiStream() to open a RealtimeTtsConnection, then call connection.stream() for each stream — each with its own voice, model, and audio format.

const connection = await client.realtime.tts.multiStream();

const s1 = await connection.stream({ voice: "Adrian", audio_format: "wav" });
// Enumerate available voices via the Node SDK's `client.tts.listModels()`.
const s2 = await connection.stream({ voice: "<another-voice>", audio_format: "wav" });

s1.sendText("Hello from stream 1.", { end: true });
s2.sendText("Hello from stream 2.", { end: true });

// Consume both streams concurrently. `playback(chunk, streamId)` is your
// app-specific playback function (e.g. feeding two `MediaSource` instances).
await Promise.all([
  (async () => { for await (const c of s1) playback(c, "s1"); })(),
  (async () => { for await (const c of s2) playback(c, "s2"); })(),
]);

connection.close();

Call connection.close() when you're done — this ends all active streams and closes the WebSocket.
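The two-stream `Promise.all` pattern above generalizes to any number of streams up to the connection limit. The following is a sketch, not SDK code: `consumeAll` is a hypothetical helper, and the constant simply mirrors the limit of 5 stated in this section.

```typescript
// Mirrors the per-connection limit stated on this page.
const MAX_STREAMS_PER_CONNECTION = 5;

// Consumes every stream concurrently and routes chunks by stream id.
async function consumeAll(
  streams: ReadonlyMap<string, AsyncIterable<Uint8Array>>,
  onChunk: (chunk: Uint8Array, streamId: string) => void,
): Promise<void> {
  if (streams.size > MAX_STREAMS_PER_CONNECTION) {
    throw new RangeError(
      `At most ${MAX_STREAMS_PER_CONNECTION} streams per connection, got ${streams.size}`,
    );
  }
  await Promise.all(
    Array.from(streams, async ([id, stream]) => {
      for await (const chunk of stream) onChunk(chunk, id);
    }),
  );
}
```

With the example above this becomes `await consumeAll(new Map([["s1", s1], ["s2", s2]]), playback)`, followed by `connection.close()`.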

Cancel, finish, and close

Method	Behavior
`stream.finish()`	Signals "no more text". The server finishes generating audio and sends `terminated`.
`stream.cancel()`	Aborts generation immediately. The server stops producing audio and sends `terminated`.
`stream.close()`	Terminates the stream. In single-stream mode this also closes the WebSocket.
`connection.close()`	Closes the WebSocket and terminates all streams on a multi-stream connection.

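// Use one or the other on a given stream:
stream.finish();  // graceful stop: the server finishes generating audio, then sends `terminated`
stream.cancel();  // user-triggered cancel: generation is aborted immediately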
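A user-triggered cancel is often wired through an `AbortSignal`, so the same "Stop" control can abort the LLM request and the speech. This is a minimal sketch assuming only the `cancel()` method from the table above; `cancelOnAbort` is a hypothetical helper.

```typescript
// Calls stream.cancel() when the signal aborts (or at once if it already has).
function cancelOnAbort(stream: { cancel(): void }, signal: AbortSignal): void {
  if (signal.aborted) {
    stream.cancel();
    return;
  }
  // { once: true } removes the listener after the first abort event.
  signal.addEventListener("abort", () => stream.cancel(), { once: true });
}
```

In an app, create an `AbortController`, call `cancelOnAbort(stream, controller.signal)`, and call `controller.abort()` from the stop button's click handler.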

Error handling

A failed stream does not close the whole WebSocket connection by default. Stream-level errors finalize only that stream (terminated fires for the same stream id), while other streams on the same connection can continue. Connection-level failures end the whole connection and all active streams.

import { RealtimeError, SonioxError } from "@soniox/client";

try {
  const stream = await client.realtime.tts({ voice: "Adrian" });
  stream.sendText("Hello!", { end: true });
  for await (const _ of stream) {
    // consume audio
  }
} catch (err) {
  if (err instanceof RealtimeError) {
    console.error(`Realtime TTS error (${err.code}):`, err.message);
  } else if (err instanceof SonioxError) {
    console.error("Soniox SDK error:", err.message);
  } else {
    throw err;
  }
}
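On a multi-stream connection, the isolation described above means one failed stream should not discard the audio from the others. The sketch below assumes that iterating a failed stream throws, as the `try`/`catch` example suggests; `drainIsolated` and `StreamOutcome` are hypothetical names for illustration.

```typescript
interface StreamOutcome {
  id: string;
  ok: boolean;
  bytes: number; // bytes received before the stream ended or failed
  error?: unknown;
}

// Drains every stream concurrently; a failure in one is recorded
// in its outcome instead of rejecting the whole batch.
function drainIsolated(
  streams: ReadonlyMap<string, AsyncIterable<Uint8Array>>,
): Promise<StreamOutcome[]> {
  const tasks = Array.from(streams, async ([id, stream]): Promise<StreamOutcome> => {
    let bytes = 0;
    try {
      for await (const chunk of stream) bytes += chunk.byteLength;
      return { id, ok: true, bytes };
    } catch (error) {
      return { id, ok: false, bytes, error };
    }
  });
  return Promise.all(tasks);
}
```

The returned list keeps the insertion order of the map, so outcomes can be matched back to the streams that produced them.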

Server-driven defaults

There's no first-class endpoint for TTS defaults — you own them. Keep them on your server next to the temporary-key endpoint and return them via SonioxConnectionConfig.tts_defaults. The SDK merges them as the base layer when opening TTS streams, and caller-provided fields on client.realtime.tts(...) / connection.stream(...) override the defaults.

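// `nodeClient` is the SonioxNodeClient instance created for the temporary-key
// endpoint earlier on this page (named `client` there).
app.get('/tts-rt-tmp-key', async (_req, res) => {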
  const { api_key, expires_at } = await nodeClient.auth.createTemporaryKey({
    usage_type: 'tts_rt',
    expires_in_seconds: 300,
  });

  res.json({
    api_key,
    expires_at,
    tts_defaults: {
      model: 'tts-rt-v1',
      language: 'en',
      voice: 'Adrian',
      audio_format: 'wav',
    },
  });
});

The browser client consumes the defaults automatically:

const client = new SonioxClient({
  config: async () => {
    const res = await fetch("/tts-rt-tmp-key");
    return await res.json(); // { api_key, tts_defaults, ... }
  },
});

const stream = await client.realtime.tts({}); // uses server-provided defaults
const override = await client.realtime.tts({ voice: "<another-voice>" }); // overrides voice
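The layering described above (server defaults as the base, caller fields on top) behaves like a shallow merge. The sketch below only illustrates that precedence and is not the SDK's implementation; in particular, skipping `undefined` overrides is an assumption, and `TtsConfig` and `mergeTtsConfig` are hypothetical names.

```typescript
// Illustrative subset of the fields shown in tts_defaults on this page.
interface TtsConfig {
  model?: string;
  language?: string;
  voice?: string;
  audio_format?: string;
}

// Shallow merge: defaults first, then any caller-provided field wins.
function mergeTtsConfig(defaults: TtsConfig, overrides: TtsConfig): TtsConfig {
  const merged: TtsConfig = { ...defaults };
  for (const key of Object.keys(overrides) as (keyof TtsConfig)[]) {
    const value = overrides[key];
    // Assumption: an explicit undefined does not erase a default.
    if (value !== undefined) merged[key] = value;
  }
  return merged;
}
```

For example, merging the defaults above with `{ voice: "<another-voice>" }` keeps `model`, `language`, and `audio_format` from the server and replaces only `voice`.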
