Timestamps

Overview

The Text-to-Speech WebSocket API can return character-level timestamps alongside the generated audio. For each character of the spoken text, you receive the start and end time (in seconds) of the audio that pronounces it.

This makes it possible to line up the audio you play with the exact text that produced it, in real time, as the stream arrives.

Timestamps are available on the WebSocket API only. The REST endpoint streams raw audio bytes with no JSON envelope, so it has nowhere to carry alignment data.

Why use timestamps

Subtitle and caption highlighting. Drive karaoke-style highlighting that follows the voice character by character, or word by word. Because timestamps stream back with the audio, you can highlight live instead of waiting for the full clip.
Agent-interruption flows. When a user interrupts a voice agent mid-sentence, the timestamps tell you how far into the text the audio actually reached. You can compute the exact interruption point and feed the spoken-so-far text back to the LLM, so the agent knows what the user did and did not hear.

Quick start

Set return_timestamps: true in the configuration message when you start a stream:

{
  "api_key": "<SONIOX_API_KEY|SONIOX_TEMPORARY_API_KEY>",
  "model": "tts-rt-v1",
  "language": "en",
  "voice": "Adrian",
  "audio_format": "pcm_s16le",
  "sample_rate": 24000,
  "stream_id": "stream-001",
  "return_timestamps": true
}

Audio responses then include a timestamps object:

{
  "stream_id": "stream-001",
  "audio": "<base64-encoded-audio-chunk>",
  "timestamps": {
    "characters": ["H", "e", "l", "l", "o"],
    "character_start_times_seconds": [0.0, 0.1, 0.2, 0.3, 0.4],
    "character_end_times_seconds": [0.1, 0.2, 0.3, 0.4, 0.5]
  }
}
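As a rough illustration of consuming such a response, the sketch below decodes one message into raw audio bytes and a list of (character, start, end) triples. The helper name parse_audio_frame is hypothetical and not part of any SDK; the sketch assumes only the field names shown above and that the audio field is base64-encoded.

```python
import base64
import json
from typing import List, Tuple

CharTiming = Tuple[str, float, float]


def parse_audio_frame(message: str) -> Tuple[bytes, List[CharTiming]]:
    """Decode one server message into audio bytes and (char, start, end) triples.

    Hypothetical helper: assumes the JSON field names shown in this page.
    """
    frame = json.loads(message)
    # Frames without an "audio" field (e.g. terminated) yield empty bytes.
    audio = base64.b64decode(frame.get("audio", ""))

    ts = frame.get("timestamps")
    if ts is None:
        # Audio-only or terminal frame: no alignment data attached.
        return audio, []

    chars = ts["characters"]
    starts = ts["character_start_times_seconds"]
    ends = ts["character_end_times_seconds"]
    if not (len(chars) == len(starts) == len(ends)):
        raise ValueError("timestamp arrays must have equal length")
    return audio, list(zip(chars, starts, ends))
```

A frame that carries no timestamps object simply yields an empty list, so the same function can be applied to every message on the stream.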

Request

return_timestamps is an optional field on the configuration (start) message. It defaults to false and has no effect on text, keep_alive, or cancel messages.

return_timestamps (boolean)

Request character-level timestamps in the responses. Defaults to false.

Response

When timestamps are active, a timestamps object is attached to audio response frames. It contains three equal-length, parallel arrays:

characters (array<string>)

One entry per character (Unicode code point) of the spoken text.

character_start_times_seconds (array<number>)

Start time of each character, in seconds.

character_end_times_seconds (array<number>)

End time of each character, in seconds.
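For word-by-word highlighting, the three parallel arrays can be folded into per-word timings. The following is a minimal sketch, assuming words are separated by whitespace characters; the Word type and group_into_words function are hypothetical names introduced here for illustration.

```python
from typing import List, NamedTuple, Sequence


class Word(NamedTuple):
    text: str
    start: float  # start time of the word's first character, in seconds
    end: float    # end time of the word's last character, in seconds


def group_into_words(
    characters: Sequence[str],
    starts: Sequence[float],
    ends: Sequence[float],
) -> List[Word]:
    """Merge per-character timings into per-word timings, splitting on whitespace."""
    words: List[Word] = []
    current = ""
    word_start = 0.0
    word_end = 0.0
    for ch, start, end in zip(characters, starts, ends):
        if ch.isspace():
            # Whitespace closes the word in progress, if any.
            if current:
                words.append(Word(current, word_start, word_end))
                current = ""
            continue
        if not current:
            word_start = start
        current += ch
        word_end = end
    if current:
        words.append(Word(current, word_start, word_end))
    return words
```

Because a word can straddle two frames, it is probably safer to run this over the arrays accumulated so far rather than over each frame in isolation, and to treat the last word as provisional until more characters or the end of the stream arrive.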

How timestamps are delivered

Streaming and chunked. Timestamps arrive incrementally, interleaved with audio. Each timestamps object covers only the characters in that frame. Concatenating the characters arrays across all frames, in order, reconstructs the full spoken text.
Timestamps always ride with audio. A timestamps object is always attached to an audio frame, never sent on its own. The service may, however, emit an audio frame that carries no timestamps (audio-only), so clients must handle audio frames both with and without alignment data.
Monotonic timing. Times are non-decreasing across the whole stream, including across chunk boundaries, and never exceed the duration of audio generated so far.
Omitted when empty. The timestamps object is left out of frames that carry no alignment data, such as the terminal audio_end and terminated frames.
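
The delivery rules above can be sketched as a small client-side accumulator. It appends the characters from each frame in order, skips frames that carry no timestamps object, and sanity-checks that start times never go backwards. TimestampAccumulator is a hypothetical class, not part of the API, and it takes already-parsed JSON frames as plain dictionaries.

```python
from typing import Any, Dict, List


class TimestampAccumulator:
    """Collects alignment data across frames (hypothetical client-side helper)."""

    def __init__(self) -> None:
        self.characters: List[str] = []
        self.starts: List[float] = []
        self.ends: List[float] = []

    def add_frame(self, frame: Dict[str, Any]) -> None:
        ts = frame.get("timestamps")
        if not ts:
            # Audio-only frame, or a terminal audio_end / terminated frame.
            return
        triples = zip(
            ts["characters"],
            ts["character_start_times_seconds"],
            ts["character_end_times_seconds"],
        )
        for ch, start, end in triples:
            # Sanity check based on the documented non-decreasing guarantee.
            if self.starts and start < self.starts[-1]:
                raise ValueError("start times went backwards")
            self.characters.append(ch)
            self.starts.append(start)
            self.ends.append(end)

    @property
    def text(self) -> str:
        """Spoken (preprocessed) text reconstructed so far."""
        return "".join(self.characters)
```

Calling add_frame on every message, regardless of type, keeps the calling code simple: frames without alignment data are ignored, and text always holds the concatenation of everything received so far.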

Preprocessed text

Timestamps map to the preprocessed text, not the raw input you sent.

Before generation, the model normalizes the input, for example by normalizing whitespace and removing characters it cannot pronounce, such as emojis. The characters you receive reflect that normalized form, so they may differ from your original string. There is currently no mapping back to the original input text; only the preprocessed alignment is returned.

For agent-interruption use cases this is usually sufficient: feed the normalized text, up to the interruption point, back to the LLM.
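
One way to compute the spoken-so-far text is to binary-search the accumulated end times for the playback position at the moment of interruption. This is a sketch under two assumptions that the API does not dictate: the client tracks its own playback position (elapsed_seconds), and a character counts as heard only once its end time has passed. The function name spoken_text_at is hypothetical.

```python
from bisect import bisect_right
from typing import Sequence


def spoken_text_at(
    characters: Sequence[str],
    end_times: Sequence[float],
    elapsed_seconds: float,
) -> str:
    """Return the characters fully pronounced by the given playback position.

    Relies on end_times being non-decreasing, as the stream guarantees.
    """
    # Number of characters whose end time is <= elapsed_seconds.
    count = bisect_right(end_times, elapsed_seconds)
    return "".join(characters[:count])
```

For a more generous cut-off, the same search can be run over the start times instead, which also counts the character that was being pronounced when the user interrupted.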

End-to-end example

Client → Server (start with return_timestamps: true):

{
  "api_key": "<SONIOX_API_KEY|SONIOX_TEMPORARY_API_KEY>",
  "model": "tts-rt-v1",
  "language": "en",
  "voice": "Adrian",
  "audio_format": "mp3",
  "bitrate": 128000,
  "stream_id": "stream-001",
  "return_timestamps": true
}

Client → Server (text, then end of text):

{
  "text": "Hi",
  "text_end": true,
  "stream_id": "stream-001"
}

Server → Client (audio with timestamps):

{
  "stream_id": "stream-001",
  "audio": "<base64-audio-bytes>",
  "timestamps": {
    "characters": ["H", "i"],
    "character_start_times_seconds": [0.0, 0.1],
    "character_end_times_seconds": [0.1, 0.25]
  }
}

Server → Client (last audio chunk, carrying the trailing audio with no new characters, so no timestamps):

{
  "stream_id": "stream-001",
  "audio": "<base64-audio-bytes>",
  "audio_end": true
}

Server → Client (stream terminated):

{
  "stream_id": "stream-001",
  "terminated": true
}
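
The server side of this exchange could be consumed roughly as follows. The sketch deliberately leaves out the network layer and takes the received messages as an iterable of JSON strings, since the connection details belong to the WebSocket API reference. drain_stream is a hypothetical name, and the audio field is assumed to be base64-encoded as shown above.

```python
import base64
import json
from typing import Iterable, List, Tuple


def drain_stream(messages: Iterable[str]) -> Tuple[bytes, str]:
    """Consume server messages until terminated; return (audio bytes, spoken text)."""
    audio = bytearray()
    chars: List[str] = []
    for message in messages:
        frame = json.loads(message)
        if "audio" in frame:
            # Every audio frame contributes bytes, with or without timestamps.
            audio.extend(base64.b64decode(frame["audio"]))
        ts = frame.get("timestamps")
        if ts:
            chars.extend(ts["characters"])
        if frame.get("terminated"):
            # Final frame for this stream: stop reading.
            break
    return bytes(audio), "".join(chars)
```

Note that the audio_end frame still contributes audio bytes even though it has no timestamps, which matches the last audio chunk in the example above.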

REST API

The REST endpoint returns raw audio bytes as the HTTP response body (for example, audio/mpeg), with no JSON wrapper. There is nowhere to place alignment data without corrupting the audio stream, so timestamps are not available over REST, and return_timestamps is ignored there.

Use the WebSocket API if you need timestamps.

API reference

For the full message schema, configuration parameters, and error codes, see the WebSocket API reference.
