Timestamps
Learn how to receive character-level audio timestamps from the Soniox Text-to-Speech WebSocket API for subtitle highlighting and agent-interruption flows.
Overview
The Text-to-Speech WebSocket API can return character-level timestamps alongside the generated audio. For each character of the spoken text, you receive the start and end time (in seconds) of the audio that pronounces it.
This makes it possible to line up the audio you play with the exact text that produced it, in real time, as the stream arrives.
Timestamps are available on the WebSocket API only. The REST endpoint streams raw audio bytes with no JSON envelope, so it has nowhere to carry alignment data.
Why use timestamps
- Subtitle and caption highlighting. Drive karaoke-style highlighting that follows the voice character by character, or word by word. Because timestamps stream back with the audio, you can highlight live instead of waiting for the full clip.
- Agent-interruption flows. When a user interrupts a voice agent mid-sentence, the timestamps tell you how far into the text the audio actually reached. You can compute the exact interruption point and feed the spoken-so-far text back to the LLM, so the agent knows what the user did and did not hear.
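As an illustration of the highlighting use case, here is a minimal sketch of finding which character is being spoken at a given playback time. The function name active_char_index and the sample values are hypothetical. It assumes you have already collected the start and end times into two parallel lists, and that start times are non-decreasing.

```python
from bisect import bisect_right


def active_char_index(start_times, end_times, playback_seconds):
    """Return the index of the character being spoken at playback_seconds.

    Returns None if playback is before the first character, in a gap
    between characters, or past the end of the last one.
    Assumes start_times is sorted in non-decreasing order.
    """
    # Last character whose start time is <= playback_seconds.
    i = bisect_right(start_times, playback_seconds) - 1
    if i < 0:
        return None
    # The character is only active until its end time.
    if playback_seconds >= end_times[i]:
        return None
    return i


# Illustrative values only, not real API output.
starts = [0.0, 0.1, 0.2]
ends = [0.1, 0.2, 0.35]
index = active_char_index(starts, ends, 0.15)
```

A binary search keeps each lookup cheap, which matters if you call it on every animation frame while the clip plays.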
Quick start
Set return_timestamps: true in the configuration message when you start a stream.
Audio responses then include a timestamps object.
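The two steps above can be sketched in Python as follows. This is a sketch under stated assumptions, not the official client: the helper names are hypothetical, the other fields of your configuration message are whatever you already send, and the code assumes server messages are JSON text with the timestamps object under a top-level "timestamps" key. Check the WebSocket API reference for the actual message shape.

```python
import json


def with_timestamps(base_config):
    """Return a copy of a configuration message with timestamps enabled.

    base_config is your existing configuration (start) message as a dict.
    Only the return_timestamps field comes from this guide.
    """
    config = dict(base_config)
    config["return_timestamps"] = True
    return config


def extract_timestamps(raw_message):
    """Parse one server message and return its timestamps object, or None.

    Assumption: the message is JSON text and the timestamps object, when
    present, is stored under a top-level "timestamps" key.
    """
    frame = json.loads(raw_message)
    return frame.get("timestamps")
```

You would serialize the result of with_timestamps with json.dumps before sending it as the first message on the socket, then call extract_timestamps on each message you receive.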
Request
return_timestamps is an optional field on the configuration (start) message. It defaults to false and has no effect on text, keep_alive, or cancel messages.
return_timestamps (boolean): Request character-level timestamps in the responses. Defaults to false.
Response
When timestamps are active, a timestamps object is attached to audio response frames. It contains three equal-length, parallel arrays:
characters (array<string>): One entry per character (Unicode codepoint) of the spoken text.
character_start_times_seconds (array<number>): Start time of each character, in seconds.
character_end_times_seconds (array<number>): End time of each character, in seconds.
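Because the three arrays are parallel, it is often convenient to zip them into one record per character. A minimal sketch, assuming the timestamps object has already been parsed into a Python dict; CharTiming and to_char_timings are hypothetical names, and the length check is a defensive addition rather than documented behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CharTiming:
    char: str
    start: float  # seconds
    end: float    # seconds


def to_char_timings(timestamps):
    """Convert a parsed timestamps object into a list of CharTiming records."""
    chars = timestamps["characters"]
    starts = timestamps["character_start_times_seconds"]
    ends = timestamps["character_end_times_seconds"]
    # The arrays are documented as equal-length; fail loudly if they are not.
    if not (len(chars) == len(starts) == len(ends)):
        raise ValueError("timestamp arrays must have equal length")
    return [CharTiming(c, s, e) for c, s, e in zip(chars, starts, ends)]
```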
How timestamps are delivered
- Streaming and chunked. Timestamps arrive incrementally, interleaved with audio. Each timestamps object covers only the characters in that frame. Concatenating the characters arrays across all frames, in order, reconstructs the full spoken text.
- Timestamps always ride with audio. A timestamps object is always attached to an audio frame, never sent on its own. The service may, however, emit an audio frame that carries no timestamps (audio-only), so clients must handle audio frames both with and without alignment data.
- Monotonic timing. Times are non-decreasing across the whole stream, including across chunk boundaries, and never exceed the duration of audio generated so far.
- Omitted when empty. The timestamps object is left out of frames that carry no alignment data, such as the terminal audio_end and terminated frames.
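The delivery rules above suggest a small accumulator on the client side. The sketch below is one possible shape, assuming each frame has been parsed into a dict and that the timestamps object, when present, sits under a "timestamps" key. The class name and the is_monotonic helper are hypothetical.

```python
class TimestampAccumulator:
    """Collects per-frame timestamps into stream-wide parallel lists."""

    def __init__(self):
        self.characters = []
        self.start_times = []
        self.end_times = []

    def add_frame(self, frame):
        """Merge one parsed frame; frames without timestamps are skipped."""
        ts = frame.get("timestamps")
        if not ts:
            return  # audio-only frame, or a terminal frame
        self.characters.extend(ts["characters"])
        self.start_times.extend(ts["character_start_times_seconds"])
        self.end_times.extend(ts["character_end_times_seconds"])

    @property
    def text(self):
        """The spoken text reconstructed so far, in arrival order."""
        return "".join(self.characters)

    def is_monotonic(self):
        """Check that start times never decrease across the stream."""
        pairs = zip(self.start_times, self.start_times[1:])
        return all(a <= b for a, b in pairs)
```

Call add_frame for every frame as it arrives. Because frames without timestamps are skipped, the same code path handles audio-only frames and the terminal frames.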
Preprocessed text
Timestamps map to the preprocessed text, not the raw input you sent.
Before generation, the model preprocesses the input: for example, it normalizes whitespace and removes characters it cannot pronounce, such as emojis. The characters you receive reflect that normalized form, so they may differ from your original string. There is currently no mapping back to the original input text; only the preprocessed alignment is returned.
For agent-interruption use cases this is usually sufficient: feed the normalized text, up to the interruption point, back to the LLM.
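A minimal sketch of that computation, assuming you have accumulated the characters and their end times for the whole stream and know how many seconds of audio were played before the interruption. The function name is hypothetical. Counting a character as heard only once its end time has passed is a design choice; comparing against start times instead would include the character that was cut off.

```python
def spoken_text_until(characters, end_times, playback_seconds):
    """Return the text whose characters finished playing by playback_seconds.

    characters and end_times are the parallel lists gathered from the
    stream, in order. Relies on times being non-decreasing, so it can
    stop at the first character that had not finished playing.
    """
    spoken = []
    for ch, end in zip(characters, end_times):
        if end > playback_seconds:
            break
        spoken.append(ch)
    return "".join(spoken)


# Illustrative values only, not real API output.
chars = list("Hi there")
ends = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
heard = spoken_text_until(chars, ends, 0.35)
```

The returned string is already in the normalized form, so it can be passed to the LLM as the portion of the reply the user heard.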
End-to-end example
Client → Server (start with return_timestamps: true):
Client → Server (text, then end of text):
Server → Client (audio with timestamps):
Server → Client (last audio chunk, carrying the trailing audio with no new characters, so no timestamps):
Server → Client (stream terminated):
REST API
The REST endpoint returns raw audio bytes as the HTTP response body (for example, audio/mpeg), with no JSON wrapper. There is nowhere to place alignment data without corrupting the audio stream, so timestamps are not available over REST, and return_timestamps is ignored there.
Use the WebSocket API if you need timestamps.
API reference
For the full message schema, configuration parameters, and error codes, see the WebSocket API reference.