
Timestamps

Learn how to use timestamps and understand their granularity.

Overview

Soniox Speech-to-Text AI provides precise timestamps for each recognized token (word or sub-word) in your transcription. These timestamps allow you to align text with the original audio for use cases like subtitles, audio indexing, keyword search, and real-time captioning.

Timestamps are returned by default — no configuration is required — and are supported in both asynchronous and real-time processing.


Output format

Each token in the response includes:

  • text: The recognized word or token
  • start_ms: The start time of the token in milliseconds
  • end_ms: The end time of the token in milliseconds

Example response

In this example, the word "beautiful" is split into three tokens with corresponding timestamp ranges.

{
  "tokens": [
    {"text": "Beau", "start_ms": 300, "end_ms": 420},
    {"text": "ti",  "start_ms": 420, "end_ms": 540},
    {"text": "ful", "start_ms": 540, "end_ms": 780}
  ]
}
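Because timestamps are per token, sub-word tokens like the ones above can be merged back into whole words when you need word-level timing. The sketch below is a minimal illustration, assuming (this page does not specify it) that a token beginning with a space starts a new word and that other tokens continue the previous word; the `merge_tokens` helper is hypothetical, not part of the Soniox API.

```python
# Sketch: merge sub-word tokens into word-level entries with timestamps.
# Assumption (not stated on this page): a token whose text starts with
# a space begins a new word; other tokens continue the previous word.

def merge_tokens(tokens):
    words = []
    for tok in tokens:
        text = tok["text"]
        if words and not text.startswith(" "):
            # Continuation of the previous word: extend text and end time.
            words[-1]["text"] += text
            words[-1]["end_ms"] = tok["end_ms"]
        else:
            words.append({
                "text": text.strip(),
                "start_ms": tok["start_ms"],
                "end_ms": tok["end_ms"],
            })
    return words

tokens = [
    {"text": "Beau", "start_ms": 300, "end_ms": 420},
    {"text": "ti", "start_ms": 420, "end_ms": 540},
    {"text": "ful", "start_ms": 540, "end_ms": 780},
]
print(merge_tokens(tokens))
# [{'text': 'Beautiful', 'start_ms': 300, 'end_ms': 780}]
```

Note how the merged word takes its start_ms from the first token and its end_ms from the last, preserving the full timestamp range.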

Use cases

  • Subtitles & captions: Sync spoken words with video playback.
  • Audio editing: Locate and extract segments of interest.
  • Keyword spotting: Jump to where specific words are spoken.
  • Visualization: Build real-time transcript viewers with time markers.
  • Live captioning: Stream partial results with timing for broadcast or accessibility tools.
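For the subtitle use case, the millisecond timestamps map directly onto standard cue formats. As one hedged sketch, the helpers below convert start_ms/end_ms values into SubRip (SRT) cue text; the function names are illustrative, not part of any Soniox SDK.

```python
# Sketch: format millisecond timestamps as SRT cue times (HH:MM:SS,mmm).

def srt_time(ms):
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"

def srt_cue(index, start_ms, end_ms, text):
    # An SRT cue: sequence number, time range, then the caption text.
    return f"{index}\n{srt_time(start_ms)} --> {srt_time(end_ms)}\n{text}\n"

print(srt_cue(1, 300, 780, "Beautiful"))
# 1
# 00:00:00,300 --> 00:00:00,780
# Beautiful
```

In practice you would group several words per cue rather than one, but the timestamp arithmetic is the same.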
