
Timestamps

Learn how to use timestamps and understand their granularity.

Overview

Soniox Speech-to-Text AI provides precise timestamps for each recognized token (word or sub-word) in your transcription. These timestamps allow you to align text with the original audio for use cases like subtitles, audio indexing, keyword search, and real-time captioning.

Timestamps are returned by default — no configuration is required — and are supported in both asynchronous and real-time processing.


Output format

Each token in the response includes:

  • text: The recognized word or token
  • start_ms: The start time of the token in milliseconds
  • end_ms: The end time of the token in milliseconds

Example response

In this example, the word "beautiful" is split into three tokens with corresponding timestamp ranges.

{
  "tokens": [
    {"text": "Beau", "start_ms": 300, "end_ms": 420},
    {"text": "ti",  "start_ms": 420, "end_ms": 540},
    {"text": "ful", "start_ms": 540, "end_ms": 780}
  ]
}
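Because timestamps are per token, sub-word tokens like the ones above can be merged back into whole words when you need word-level timing. The sketch below is a minimal illustration, assuming (this page does not specify it) that a token beginning with a space starts a new word and that other tokens continue the previous word; the `merge_tokens` helper is hypothetical, not part of the Soniox API.

```python
# Sketch: merge sub-word tokens into word-level entries with timestamps.
# Assumption (not stated on this page): a token whose text starts with
# a space begins a new word; other tokens continue the previous word.

def merge_tokens(tokens):
    words = []
    for tok in tokens:
        text = tok["text"]
        if words and not text.startswith(" "):
            # Continuation of the previous word: extend text and end time.
            words[-1]["text"] += text
            words[-1]["end_ms"] = tok["end_ms"]
        else:
            words.append({
                "text": text.strip(),
                "start_ms": tok["start_ms"],
                "end_ms": tok["end_ms"],
            })
    return words

tokens = [
    {"text": "Beau", "start_ms": 300, "end_ms": 420},
    {"text": "ti", "start_ms": 420, "end_ms": 540},
    {"text": "ful", "start_ms": 540, "end_ms": 780},
]
print(merge_tokens(tokens))
# [{'text': 'Beautiful', 'start_ms': 300, 'end_ms': 780}]
```

Note how the merged word takes its start_ms from the first token and its end_ms from the last, preserving the full timestamp range.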

Use cases

  • Subtitles & captions: Sync spoken words with video playback.
  • Audio editing: Locate and extract segments of interest.
  • Keyword spotting: Jump to where specific words are spoken.
  • Visualization: Build real-time transcript viewers with time markers.
  • Live captioning: Stream partial results with timing for broadcast or accessibility tools.
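For the subtitle use case, the millisecond timestamps map directly onto standard cue formats. As one hedged sketch, the helpers below convert start_ms/end_ms values into SubRip (SRT) cue text; the function names are illustrative, not part of any Soniox SDK.

```python
# Sketch: format millisecond timestamps as SRT cue times (HH:MM:SS,mmm).

def srt_time(ms):
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"

def srt_cue(index, start_ms, end_ms, text):
    # An SRT cue: sequence number, time range, then the caption text.
    return f"{index}\n{srt_time(start_ms)} --> {srt_time(end_ms)}\n{text}\n"

print(srt_cue(1, 300, 780, "Beautiful"))
# 1
# 00:00:00,300 --> 00:00:00,780
# Beautiful
```

In practice you would group several words per cue rather than one, but the timestamp arithmetic is the same.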
