Timestamps

Learn how to use timestamps and understand their granularity.

Overview

Soniox Speech-to-Text AI provides precise timestamps for every recognized token (word or sub-word). Timestamps let you align transcriptions with audio, so you know exactly when each word was spoken.

Timestamps are included in every response by default; no extra configuration is needed.


Output format

Each token in the response includes:

  • text → The recognized token.
  • start_ms → Token start time (in milliseconds).
  • end_ms → Token end time (in milliseconds).

Example response

In this example, the word “Beautiful” is split into three tokens, each with its own timestamp range:

{
  "tokens": [
    {"text": "Beau", "start_ms": 300, "end_ms": 420},
    {"text": "ti",   "start_ms": 420, "end_ms": 540},
    {"text": "ful",  "start_ms": 540, "end_ms": 780}
  ]
}
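
When word-level timing is needed, sub-word tokens can be merged back into a single span. The following Python sketch shows one way to do this for the example response above. It assumes that all tokens passed in belong to the same word and that the response is available as a JSON string; the helper name merge_tokens is hypothetical and not part of the Soniox API.

```python
import json

# The example response from above, embedded as a string for illustration.
response_json = """
{
  "tokens": [
    {"text": "Beau", "start_ms": 300, "end_ms": 420},
    {"text": "ti",   "start_ms": 420, "end_ms": 540},
    {"text": "ful",  "start_ms": 540, "end_ms": 780}
  ]
}
"""


def merge_tokens(tokens):
    """Combine sub-word tokens into one word-level span.

    Hypothetical helper: assumes every token in `tokens` belongs to
    the same word and that the tokens are in spoken order.
    """
    return {
        "text": "".join(t["text"] for t in tokens),
        "start_ms": tokens[0]["start_ms"],  # start of the first token
        "end_ms": tokens[-1]["end_ms"],     # end of the last token
    }


tokens = json.loads(response_json)["tokens"]
word = merge_tokens(tokens)
duration_ms = word["end_ms"] - word["start_ms"]

print(f'{word["text"]}: {word["start_ms"]} ms to {word["end_ms"]} ms ({duration_ms} ms)')
# prints "Beautiful: 300 ms to 780 ms (480 ms)"
```

In a real transcript, the caller would first need to decide which tokens belong to the same word before merging them; that grouping step is outside the scope of this sketch.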