Real-time speech generation with Node SDK
Stream text to speech with the Soniox Node SDK over WebSocket
The Soniox Node SDK supports real-time Text-to-Speech generation over WebSocket. You send text — all at once or incrementally — and receive decoded audio chunks as they are generated. This is the lowest-latency path for voice agents, LLM output narration, and any scenario where text arrives progressively.
If you already have the full text up front and don't need chunk-by-chunk streaming, use REST speech generation instead — it's a single HTTP request.
Quickstart
client.realtime.tts() creates a single-stream session: it opens a WebSocket, configures a stream, and returns a RealtimeTtsStream. Send text, then consume audio by async iteration.
The stream closes itself (and the underlying WebSocket) once terminated fires. You never have to call close() in single-stream mode.
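A minimal sketch of the send-then-iterate flow, written against a structural stand-in rather than the SDK's own types. `TtsStreamLike`, `synthesize`, and `createFakeStream` are hypothetical names introduced for illustration; only `sendText`, the `{ end: true }` flag, and async iteration over audio chunks are taken from this page. In real use you would pass the stream returned by `client.realtime.tts(...)` instead of the fake.

```typescript
// Structural stand-in for the parts of RealtimeTtsStream used here.
// This is an assumption for illustration, not the SDK's declared type.
interface TtsStreamLike extends AsyncIterable<Uint8Array> {
  sendText(text: string, options?: { end?: boolean }): void;
}

// Send the full text in one call, then join every audio chunk into one buffer.
async function synthesize(stream: TtsStreamLike, text: string): Promise<Uint8Array> {
  stream.sendText(text, { end: true });

  const chunks: Uint8Array[] = [];
  let total = 0;
  for await (const chunk of stream) {
    chunks.push(chunk);
    total += chunk.length;
  }

  const audio = new Uint8Array(total);
  let offset = 0;
  for (const chunk of chunks) {
    audio.set(chunk, offset);
    offset += chunk.length;
  }
  return audio;
}

// Hypothetical in-memory fake: yields one zero byte per character sent,
// in chunks of up to 4 bytes, so the helper can run without a network.
function createFakeStream(): TtsStreamLike {
  let pendingBytes = 0;
  return {
    sendText(text: string): void {
      pendingBytes += text.length;
    },
    async *[Symbol.asyncIterator](): AsyncGenerator<Uint8Array> {
      while (pendingBytes > 0) {
        const size = Math.min(4, pendingBytes);
        pendingBytes -= size;
        yield new Uint8Array(size);
      }
    },
  };
}
```

There is no `close()` call in the helper: per the note above, a single-stream session is expected to close itself once the iteration ends.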
Send text incrementally
Use sendText(text) for each chunk as it becomes available, then either set { end: true } on the last call or invoke finish() explicitly. This is the pattern for narrating an LLM response token-by-token.
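As a rough sketch of the `{ end: true }` variant: when the chunks are known up front, the last call can carry the end flag. `TextSink`, `sendChunks`, and `recorder` are hypothetical names, and the sketch assumes `sendText` can be called without awaiting; if the SDK's method returns a promise, await each call instead.

```typescript
// Stand-in for the sending side of a TTS stream (assumed shape).
interface TextSink {
  sendText(text: string, options?: { end?: boolean }): void;
}

// Send every chunk in order and mark only the final one with { end: true }.
// With an empty list nothing is sent, so the stream is left open.
function sendChunks(stream: TextSink, chunks: readonly string[]): void {
  chunks.forEach((chunk, index) => {
    const isLast = index === chunks.length - 1;
    if (isLast) {
      stream.sendText(chunk, { end: true });
    } else {
      stream.sendText(chunk);
    }
  });
}

// Recording fake so the call sequence can be inspected without a network.
const calls: Array<{ text: string; end: boolean }> = [];
const recorder: TextSink = {
  sendText(text: string, options?: { end?: boolean }): void {
    calls.push({ text, end: options?.end === true });
  },
};

sendChunks(recorder, ["Hello, ", "world", "."]);
```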
Equivalent with an explicit finish():
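A minimal sketch of the explicit-finish variant, which suits sources where the last chunk is not known in advance. `FinishableStream`, `sendThenFinish`, `tokens`, and `recorder` are hypothetical names; `sendText` and `finish` are the methods described above, modelled here as synchronous calls (an assumption).

```typescript
// Stand-in for the two methods used here (assumed shape).
interface FinishableStream {
  sendText(text: string): void;
  finish(): void;
}

// Forward chunks as they arrive, then signal "no more text" once the source ends.
function sendThenFinish(stream: FinishableStream, chunks: Iterable<string>): number {
  let sent = 0;
  for (const chunk of chunks) {
    stream.sendText(chunk);
    sent += 1;
  }
  stream.finish();
  return sent;
}

// Example source whose length is not known to the sender ahead of time.
function* tokens(): Generator<string> {
  yield "Hello";
  yield ", ";
  yield "world.";
}

// Recording fake so the call order can be inspected.
const log: string[] = [];
const recorder: FinishableStream = {
  sendText(text: string): void {
    log.push(`text:${text}`);
  },
  finish(): void {
    log.push("finish");
  },
};

const sentCount = sendThenFinish(recorder, tokens());
```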
Pipe from an async iterable
stream.sendStream(source) pipes any AsyncIterable<string> into the TTS session and auto-finishes when the iterable completes. This is the idiomatic way to connect an LLM token stream directly to speech output — sending and receiving run concurrently.
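The concurrency can be sketched as follows: start `sendStream` without awaiting it, read audio in the same function, then await the send promise at the end. `PipeableStream`, `speak`, `llmTokens`, and `createFakeStream` are hypothetical, and the `Promise<void>` return type of `sendStream` is an assumption. The fake emits one zero byte per character so the sketch runs without a network.

```typescript
// Assumed shape: sendStream pipes text in, async iteration yields audio out.
interface PipeableStream extends AsyncIterable<Uint8Array> {
  sendStream(source: AsyncIterable<string>): Promise<void>;
}

// Stand-in for an LLM token stream.
async function* llmTokens(): AsyncGenerator<string> {
  yield "Hello";
  yield ", ";
  yield "world.";
}

// Pipe text and consume audio at the same time; returns total bytes received.
async function speak(stream: PipeableStream, source: AsyncIterable<string>): Promise<number> {
  const sending = stream.sendStream(source); // not awaited yet, so receiving can start
  let bytes = 0;
  for await (const chunk of stream) {
    bytes += chunk.length;
  }
  await sending; // surface any error raised while sending
  return bytes;
}

// Hypothetical fake with a small queue between the sending and receiving sides.
function createFakeStream(): PipeableStream {
  const queue: Uint8Array[] = [];
  let done = false;
  let wake: (() => void) | null = null;
  const notify = (): void => {
    const resume = wake;
    wake = null;
    if (resume) resume();
  };
  return {
    async sendStream(source: AsyncIterable<string>): Promise<void> {
      for await (const text of source) {
        queue.push(new Uint8Array(text.length));
        notify();
      }
      done = true; // auto-finish once the source completes
      notify();
    },
    async *[Symbol.asyncIterator](): AsyncGenerator<Uint8Array> {
      while (true) {
        const next = queue.shift();
        if (next !== undefined) {
          yield next;
        } else if (done) {
          return;
        } else {
          await new Promise<void>((resolve) => {
            wake = resolve;
          });
        }
      }
    },
  };
}
```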
Event-based consumption
RealtimeTtsStream is also a TypedEmitter. When you prefer an event-driven style over async iteration, listen for TtsStreamEvents:
| Event | Payload | Description |
|---|---|---|
| audio | Uint8Array | Decoded audio chunk. |
| audioEnd | (none) | Server marked the final audio payload for this stream. |
| terminated | (none) | Stream fully closed by the server. |
| error | Error | Stream-level error. |
Choose either async iteration or event listeners — not both. The async iterator consumes audio events internally.
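A sketch of the event-driven style, assuming the emitter exposes an `on(event, listener)` method with the four events from the table. `TtsEventSource`, `PlaybackState`, `trackPlayback`, and `FakeTtsStream` are hypothetical names; the fake only exists so the listeners can be exercised locally.

```typescript
// Overloads modelled on the event table above. The interface name and the
// void return type are assumptions, not the SDK's declarations.
interface TtsEventSource {
  on(event: "audio", listener: (chunk: Uint8Array) => void): void;
  on(event: "audioEnd", listener: () => void): void;
  on(event: "terminated", listener: () => void): void;
  on(event: "error", listener: (error: Error) => void): void;
}

interface PlaybackState {
  bytesReceived: number;
  audioEnded: boolean;
  terminated: boolean;
  lastError: Error | null;
}

// Attach one listener per event and expose what has happened so far.
function trackPlayback(stream: TtsEventSource): PlaybackState {
  const state: PlaybackState = {
    bytesReceived: 0,
    audioEnded: false,
    terminated: false,
    lastError: null,
  };
  stream.on("audio", (chunk) => {
    state.bytesReceived += chunk.length;
  });
  stream.on("audioEnd", () => {
    state.audioEnded = true;
  });
  stream.on("terminated", () => {
    state.terminated = true;
  });
  stream.on("error", (error) => {
    state.lastError = error;
  });
  return state;
}

// Hypothetical fake emitter used only to drive the listeners in this sketch.
class FakeTtsStream implements TtsEventSource {
  private readonly handlers = new Map<string, Array<(payload?: unknown) => void>>();

  on(event: "audio", listener: (chunk: Uint8Array) => void): void;
  on(event: "audioEnd", listener: () => void): void;
  on(event: "terminated", listener: () => void): void;
  on(event: "error", listener: (error: Error) => void): void;
  on(event: string, listener: (payload: never) => void): void {
    const list = this.handlers.get(event) ?? [];
    list.push(listener as (payload?: unknown) => void);
    this.handlers.set(event, list);
  }

  emit(event: string, payload?: unknown): void {
    for (const handler of this.handlers.get(event) ?? []) handler(payload);
  }
}
```

The helper never iterates the stream, in line with the rule above about choosing one consumption style.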
Multi-stream connection
A single WebSocket connection can carry up to 5 concurrent TTS streams. Use client.realtime.tts.multiStream() to open a RealtimeTtsConnection, then call connection.stream() for each stream. Each stream has its own streamId and can have different voice, model, and audio format settings.
Call connection.close() when you're done — this ends all active streams and closes the WebSocket.
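Because of the 5-stream limit, a workload with more utterances than that needs more than one connection. One conservative way to plan this is to split the jobs into batches of at most five, one batch per connection. `planConnections` and `MAX_STREAMS_PER_CONNECTION` are hypothetical names; the sketch does not call the SDK.

```typescript
// Limit stated above: up to 5 concurrent TTS streams per WebSocket connection.
const MAX_STREAMS_PER_CONNECTION = 5;

// Split jobs into batches that each fit on one multi-stream connection.
function planConnections<T>(
  jobs: readonly T[],
  limit: number = MAX_STREAMS_PER_CONNECTION,
): T[][] {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const batches: T[][] = [];
  for (let start = 0; start < jobs.length; start += limit) {
    batches.push(jobs.slice(start, start + limit));
  }
  return batches;
}

// Twelve utterances need three connections under this plan.
const utterances = Array.from({ length: 12 }, (_, i) => `utterance ${i + 1}`);
const plan = planConnections(utterances);
```

Each batch would map to one `multiStream()` connection with one `connection.stream()` per job, followed by `connection.close()`. A pool that reuses a slot as soon as a stream terminates would use connections more efficiently, at the cost of more bookkeeping.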
Cancel, finish, and close
| Method | Behavior |
|---|---|
| stream.finish() | Signals "no more text". The server finishes generating audio and sends terminated. |
| stream.cancel() | Aborts generation immediately. The server stops producing audio and sends terminated. |
| stream.close() | Terminates the stream. In single-stream mode (client.realtime.tts(...)) this also closes the WebSocket. |
| connection.close() | Closes the WebSocket and terminates all streams on a multi-stream connection. |
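In a voice agent, the choice between `finish()` and `cancel()` usually follows from how the turn ends: the text is complete, or the user interrupts. A small sketch of that decision, with a guard so a turn is only ended once. `SpeechTurn`, `StoppableStream`, and `recorder` are hypothetical; the three stream methods are modelled as synchronous calls, which is an assumption.

```typescript
// Stand-in for the stream methods from the table above (assumed shape).
interface StoppableStream {
  sendText(text: string): void;
  finish(): void;
  cancel(): void;
}

// One spoken turn: text goes in until the turn completes or is interrupted.
class SpeechTurn {
  private readonly stream: StoppableStream;
  private ended = false;

  constructor(stream: StoppableStream) {
    this.stream = stream;
  }

  say(text: string): void {
    if (!this.ended) this.stream.sendText(text);
  }

  // All text has been sent: let the server finish the remaining audio.
  complete(): void {
    if (this.ended) return;
    this.ended = true;
    this.stream.finish();
  }

  // The user started talking: stop generation right away.
  interrupt(): void {
    if (this.ended) return;
    this.ended = true;
    this.stream.cancel();
  }
}

// Recording fake so the resulting calls can be inspected.
const events: string[] = [];
const recorder: StoppableStream = {
  sendText(text: string): void {
    events.push(`text:${text}`);
  },
  finish(): void {
    events.push("finish");
  },
  cancel(): void {
    events.push("cancel");
  },
};

const turn = new SpeechTurn(recorder);
turn.say("Your order has shipped");
turn.interrupt(); // barge-in
turn.say(" and will arrive on Friday."); // ignored after the turn ended
turn.complete(); // ignored, the turn was already cancelled
```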
Error handling
A failed stream does not close the whole WebSocket connection by default. Stream-level errors finalize only that stream (terminated fires for the same streamId), while other streams on the same connection can continue. Connection-level failures end the whole connection and all active streams.
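One way to keep track of this on a multi-stream connection is to record outcomes per stream, so that a failure on one stream does not affect how the others are reported. The sketch below relies on the behavior described above, where a stream-level error is followed by terminated for the same stream. `StreamOutcomes` and the stream IDs are hypothetical; in real code the two `record...` methods would be called from each stream's error and terminated listeners.

```typescript
type StreamOutcome = "active" | "completed" | "failed";

// Per-stream bookkeeping keyed by stream ID.
class StreamOutcomes {
  private readonly outcomes = new Map<string, StreamOutcome>();
  private readonly errors = new Map<string, Error>();

  open(streamId: string): void {
    this.outcomes.set(streamId, "active");
  }

  // Remember the error; the stream is finalized when terminated arrives.
  recordError(streamId: string, error: Error): void {
    this.errors.set(streamId, error);
  }

  // A terminated stream is "failed" if an error was seen first, else "completed".
  recordTerminated(streamId: string): void {
    this.outcomes.set(streamId, this.errors.has(streamId) ? "failed" : "completed");
  }

  outcomeOf(streamId: string): StreamOutcome | undefined {
    return this.outcomes.get(streamId);
  }

  errorOf(streamId: string): Error | undefined {
    return this.errors.get(streamId);
  }
}

// Three streams on one connection: "a" fails, "b" completes, "c" is still running.
const outcomes = new StreamOutcomes();
outcomes.open("a");
outcomes.open("b");
outcomes.open("c");
outcomes.recordError("a", new Error("synthesis failed"));
outcomes.recordTerminated("a");
outcomes.recordTerminated("b");
```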
Server-driven defaults
Set shared TTS fields once on the client via tts_defaults and they'll be merged as the base layer every time you open a stream. Caller-provided fields on client.realtime.tts(...) / connection.stream(...) override the defaults, so you never need to spread them manually.
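The layering can be pictured as an object spread in which caller-provided fields win. This is a sketch of the described semantics, not the SDK's implementation: `TtsFields`, `resolveTtsConfig`, the field names, and all values below are hypothetical placeholders.

```typescript
// Hypothetical subset of TTS fields; the real key names may differ.
interface TtsFields {
  model?: string;
  voice?: string;
  audio_format?: string;
}

// Defaults form the base layer; per-call fields override them key by key.
function resolveTtsConfig(defaults: TtsFields, overrides: TtsFields): TtsFields {
  return { ...defaults, ...overrides };
}

// Placeholder values, not real model or voice identifiers.
const ttsDefaults: TtsFields = {
  model: "example-model",
  voice: "example-voice-a",
  audio_format: "example-format",
};

// The call only names the field it wants to change.
const resolved = resolveTtsConfig(ttsDefaults, { voice: "example-voice-b" });
```

Note that with a plain spread, an override explicitly set to `undefined` would replace the default; whether the SDK treats that case the same way is not stated here.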
tts_defaults is also accepted on RealtimeOptions if you want to scope defaults to a specific realtime namespace.
On the Web and React SDKs, the equivalent is SonioxConnectionConfig.tts_defaults — return it from the async config resolver alongside the temporary api_key so the server owns the defaults.
See also
- REST speech generation — single-request HTTP TTS.
- RealtimeTtsStream reference
- RealtimeTtsConnection reference
- TtsStreamInput, TtsStreamEvents
- TTS WebSocket API