Real-time speech generation with Web SDK
Stream text to speech in the browser with the Soniox Web SDK over WebSocket
The Soniox Web SDK supports real-time Text-to-Speech generation over WebSocket directly in the browser. You send text — all at once or incrementally — and receive decoded audio chunks as they arrive, so playback can start before generation is complete. This is the ideal transport for narrating LLM output and building voice agents in the browser.
If you already have the full text up front and don't need chunk-by-chunk playback, use REST speech generation — it's a single HTTP request.
Set up your temporary API key endpoint
Create a temporary key endpoint on your server using the Soniox Node SDK. Real-time TTS keys use the tts_rt usage type.
Quickstart
Create a SonioxClient with a config resolver, then call client.realtime.tts() to open a single-stream session. Send text, consume audio by async iteration, and play it back.
The stream closes itself (and the underlying WebSocket) once terminated fires. You never have to call close() in single-stream mode.
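As a rough illustration of the "consume audio by async iteration" step, the sketch below drains any async iterable of `Uint8Array` chunks into one contiguous buffer. `collectAudio` is a hypothetical helper, not part of the SDK, and it only assumes the stream iterates decoded audio chunks as described above.

```typescript
// Hypothetical helper (not an SDK function): drains an async iterable of
// audio chunks and joins them into a single contiguous buffer.
async function collectAudio(chunks: AsyncIterable<Uint8Array>): Promise<Uint8Array> {
  const parts: Uint8Array[] = [];
  let total = 0;
  for await (const chunk of chunks) {
    parts.push(chunk);
    total += chunk.byteLength;
  }
  // Copy every chunk into one buffer, preserving arrival order.
  const joined = new Uint8Array(total);
  let offset = 0;
  for (const part of parts) {
    joined.set(part, offset);
    offset += part.byteLength;
  }
  return joined;
}
```

In the browser, the joined buffer could then be wrapped in a `Blob` whose MIME type matches the audio format you requested and played through an `Audio` element. This trades latency for simplicity, since playback starts only after the last chunk arrives.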
Play audio as it arrives
For the lowest-latency playback, feed chunks into a MediaSource instead of waiting for the full payload.
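One detail that tends to matter here: a `SourceBuffer` rejects `appendBuffer()` while a previous append is still in progress (`updating` is `true`), so chunks usually need to be queued. The sketch below shows one way to serialize appends. `AppendQueue` and `AppendTarget` are hypothetical names, and the interface lists only the `SourceBuffer` members the queue relies on.

```typescript
// Minimal subset of SourceBuffer used by the queue (assumed shape for this sketch).
interface AppendTarget {
  readonly updating: boolean;
  appendBuffer(data: Uint8Array): void;
  addEventListener(type: "updateend", listener: () => void): void;
}

// Hypothetical helper: appends one chunk at a time and waits for "updateend"
// before appending the next one.
class AppendQueue {
  private pending: Uint8Array[] = [];
  private target: AppendTarget;

  constructor(target: AppendTarget) {
    this.target = target;
    target.addEventListener("updateend", () => this.flush());
  }

  // Number of chunks still waiting to be appended.
  get size(): number {
    return this.pending.length;
  }

  push(chunk: Uint8Array): void {
    this.pending.push(chunk);
    this.flush();
  }

  private flush(): void {
    if (this.target.updating || this.pending.length === 0) return;
    this.target.appendBuffer(this.pending.shift()!);
  }
}
```

You would typically create the queue inside the `MediaSource` `sourceopen` handler, after `addSourceBuffer()` succeeds, and call `push()` for every audio chunk. The requested audio format has to be one the browser accepts for Media Source Extensions, which can be checked with `MediaSource.isTypeSupported()`.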
Send text incrementally
Call sendText(text) for each chunk as it becomes available, then mark the last chunk with { end: true } or invoke finish() explicitly. This is the pattern for narrating an LLM response token-by-token.
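A minimal sketch of the token-by-token pattern, under stated assumptions: `TextSink` is a hypothetical interface containing only the two methods named above, the `{ end: true }` flag is assumed to be a second argument to `sendText`, and `narrate` is an illustrative helper rather than an SDK function. Because the last token of an LLM response is usually not known in advance, the sketch calls `finish()` explicitly instead of marking a final chunk.

```typescript
// Assumed minimal shape of the stream's text-sending side (not the SDK's full type).
interface TextSink {
  sendText(text: string, options?: { end?: boolean }): void | Promise<void>;
  finish(): void | Promise<void>;
}

// Hypothetical helper: forwards each token as it arrives, then signals "no more text".
// Returns the number of chunks actually sent.
async function narrate(sink: TextSink, tokens: AsyncIterable<string>): Promise<number> {
  let sent = 0;
  for await (const token of tokens) {
    if (token.length === 0) continue; // skip empty deltas
    await sink.sendText(token);
    sent += 1;
  }
  await sink.finish();
  return sent;
}
```

When the full list of chunks is known up front, passing `{ end: true }` with the last `sendText` call achieves the same result without a separate `finish()`.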
Pipe from an async iterable
stream.sendStream(source) pipes any AsyncIterable<string> into the TTS session and auto-finishes when the iterable completes. Sending and receiving run concurrently.
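Any async generator that yields strings qualifies as a source. As an illustration, the sketch below regroups raw LLM token deltas into sentence-sized chunks before they reach the TTS session. `toSentences` is a hypothetical helper, and the sentence boundary rule (punctuation followed by whitespace) is deliberately naive; whether sentence-level chunking helps your use case is something to measure rather than assume.

```typescript
// Sentence boundary: shortest prefix ending in ., ! or ? followed by whitespace.
const SENTENCE = /^[\s\S]*?[.!?]\s+/;

// Hypothetical helper: buffers token deltas and yields complete sentences.
// Trailing whitespace is kept so that concatenated chunks reproduce the input text.
async function* toSentences(tokens: AsyncIterable<string>): AsyncGenerator<string> {
  let buffer = "";
  for await (const token of tokens) {
    buffer += token;
    let match = SENTENCE.exec(buffer);
    while (match !== null) {
      yield match[0];
      buffer = buffer.slice(match[0].length);
      match = SENTENCE.exec(buffer);
    }
  }
  // Flush whatever remains once the source is exhausted.
  if (buffer.length > 0) yield buffer;
}
```

The result can be handed to `stream.sendStream(...)` in place of the raw token iterable, since it is itself an `AsyncIterable<string>`.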
Event-based consumption
RealtimeTtsStream is also a typed event emitter. When you prefer an event-driven style over async iteration, listen for TtsStreamEvents:
| Event | Payload | Description |
|---|---|---|
| audio | Uint8Array | Decoded audio chunk. |
| audioEnd | (none) | Server marked the final audio payload for this stream. |
| terminated | (none) | Stream fully closed by the server. |
| error | Error | Stream-level error. |
Choose either async iteration or event listeners — not both. The async iterator consumes audio events internally.
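The event-driven style can be sketched as a promise that resolves when `terminated` fires and rejects on `error`. The event names and payloads come from the table above; the `on(event, listener)` method name and the `TtsEventSource` interface are assumptions made for this sketch, so check the `RealtimeTtsStream` reference for the actual subscription method.

```typescript
// Event names and payloads as listed in the table above.
type TtsEvents = {
  audio: (chunk: Uint8Array) => void;
  audioEnd: () => void;
  terminated: () => void;
  error: (err: Error) => void;
};

// Assumed emitter shape: the `on` method name is a guess, not confirmed SDK API.
interface TtsEventSource {
  on<K extends keyof TtsEvents>(event: K, listener: TtsEvents[K]): void;
}

// Hypothetical helper: gathers audio chunks until the stream terminates.
function collectViaEvents(stream: TtsEventSource): Promise<Uint8Array[]> {
  return new Promise((resolve, reject) => {
    const chunks: Uint8Array[] = [];
    stream.on("audio", (chunk) => {
      chunks.push(chunk);
    });
    stream.on("terminated", () => resolve(chunks));
    stream.on("error", (err) => reject(err));
  });
}
```

Listeners are registered synchronously inside the promise executor, so no event emitted after the call can be missed.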
Multi-stream connection
A single WebSocket connection can carry up to 5 concurrent TTS streams. Use client.realtime.tts.multiStream() to open a RealtimeTtsConnection, then call connection.stream() for each stream — each with its own voice, model, and audio format.
Call connection.close() when you're done — this ends all active streams and closes the WebSocket.
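If you need more than five voices or utterances at once, one option is to plan the work in batches that respect the per-connection limit. The sketch below is plain batching logic with hypothetical names (`planConnections`, `MAX_STREAMS_PER_CONNECTION`); each batch would map to one multi-stream connection, or to one round of streams on a reused connection.

```typescript
// Per-connection stream limit stated above.
const MAX_STREAMS_PER_CONNECTION = 5;

// Hypothetical helper: splits stream requests into batches of at most `limit` items.
function planConnections<T>(requests: T[], limit: number = MAX_STREAMS_PER_CONNECTION): T[][] {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < requests.length; i += limit) {
    batches.push(requests.slice(i, i + limit));
  }
  return batches;
}
```

For example, twelve pending requests yield three batches of five, five, and two items.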
Cancel, finish, and close
| Method | Behavior |
|---|---|
| stream.finish() | Signals "no more text". The server finishes generating audio and sends terminated. |
| stream.cancel() | Aborts generation immediately. The server stops producing audio and sends terminated. |
| stream.close() | Terminates the stream. In single-stream mode this also closes the WebSocket. |
| connection.close() | Closes the WebSocket and terminates all streams on a multi-stream connection. |
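A common way to wire `cancel()` into UI code is through a standard `AbortSignal`, for example when the user clicks a stop button or navigates away. The sketch below assumes only that the stream exposes `cancel()` as listed in the table; `Cancellable` and `cancelOnAbort` are hypothetical names.

```typescript
// Only the method this sketch needs from the stream.
interface Cancellable {
  cancel(): void;
}

// Hypothetical helper: cancels the stream when the signal aborts.
// Returns a function that detaches the listener once the stream has ended normally.
function cancelOnAbort(stream: Cancellable, signal: AbortSignal): () => void {
  if (signal.aborted) {
    stream.cancel();
    return () => {};
  }
  const onAbort = () => stream.cancel();
  signal.addEventListener("abort", onAbort, { once: true });
  return () => signal.removeEventListener("abort", onAbort);
}
```

Calling the returned function after `terminated` fires avoids invoking `cancel()` on a stream that has already finished.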
Error handling
A failed stream does not close the whole WebSocket connection by default. Stream-level errors finalize only that stream (terminated fires for the same stream id), while other streams on the same connection can continue. Connection-level failures end the whole connection and all active streams.
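On the client side, the same isolation can be mirrored by awaiting each stream's work independently instead of failing fast. The sketch below uses the standard `Promise.allSettled` so that one rejected stream does not discard the results of the others; `settleStreams` and `StreamOutcome` are hypothetical names, and each promise stands in for whatever per-stream consumer you run.

```typescript
interface StreamOutcome<T> {
  id: string;
  ok: boolean;
  value?: T;
  error?: unknown;
}

// Hypothetical helper: waits for every per-stream job and reports
// successes and failures side by side, keyed by stream id.
async function settleStreams<T>(jobs: Record<string, Promise<T>>): Promise<StreamOutcome<T>[]> {
  const ids = Object.keys(jobs);
  const results = await Promise.allSettled(ids.map((id) => jobs[id]));
  return results.map((result, i): StreamOutcome<T> =>
    result.status === "fulfilled"
      ? { id: ids[i], ok: true, value: result.value }
      : { id: ids[i], ok: false, error: result.reason }
  );
}
```

A connection-level failure would still reject every job, which matches the behavior described above.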
Server-driven defaults
There's no first-class endpoint for TTS defaults — you own them. Keep them on your server next to the temporary-key endpoint and return them via SonioxConnectionConfig.tts_defaults. The SDK merges them as the base layer when opening TTS streams, and caller-provided fields on client.realtime.tts(...) / connection.stream(...) override the defaults.
The browser client consumes the defaults automatically.
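The layering described above (server defaults as the base, caller-provided fields on top) can be pictured as a shallow merge. This is only a sketch of the idea, not the SDK's actual merge code: `mergeTtsOptions` is a hypothetical function, the field names in the usage note are illustrative, and skipping `undefined` overrides is a choice made here rather than documented behavior.

```typescript
type Options = Record<string, unknown>;

// Hypothetical helper: defaults form the base layer, defined caller values win.
function mergeTtsOptions(defaults: Options, overrides: Options): Options {
  const merged: Options = { ...defaults };
  for (const [key, value] of Object.entries(overrides)) {
    // Assumption of this sketch: an undefined override does not erase a default.
    if (value !== undefined) merged[key] = value;
  }
  return merged;
}
```

With defaults of `{ voice: "A", model: "m1" }` and a caller passing `{ voice: "B" }`, the merged options keep `model: "m1"` and use `voice: "B"`.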
See also
- REST speech generation — single-request HTTP TTS.
- RealtimeTtsStream reference
- RealtimeTtsConnection reference
- TtsStreamInput, TtsStreamEvents
- TTS WebSocket API