Timestamps

Overview

Soniox Speech-to-Text AI provides precise timestamps for every recognized token (word or sub-word). Timestamps let you align transcriptions with audio, so you know exactly when each word was spoken.

Timestamps are included by default; no extra configuration is needed.

Output format

Each token in the response includes:

text → The recognized token.
start_ms → Token start time (in milliseconds).
end_ms → Token end time (in milliseconds).
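
As a minimal sketch of working with these fields, the Python snippet below computes a token's duration and formats a millisecond offset as a clock-style string. The helper names `token_duration_ms` and `format_ms` are illustrative and not part of the Soniox API, and the token is assumed to be a plain dictionary carrying the three fields listed above.

```python
def token_duration_ms(token: dict) -> int:
    """Return how long the token lasts, in milliseconds."""
    return token["end_ms"] - token["start_ms"]


def format_ms(ms: int) -> str:
    """Format a millisecond offset as HH:MM:SS.mmm."""
    seconds, millis = divmod(ms, 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"


# Hypothetical token shaped like the fields described above.
token = {"text": "Beau", "start_ms": 300, "end_ms": 420}
print(token_duration_ms(token))      # 120
print(format_ms(token["start_ms"]))  # 00:00:00.300
```

A formatter like this can be handy when turning token times into subtitle cues or log lines, though the exact output format depends on the target tool.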

Example response

In this example, the word “Beautiful” is split into three tokens, each with its own timestamp range:

{
  "tokens": [
    {"text": "Beau", "start_ms": 300, "end_ms": 420},
    {"text": "ti",   "start_ms": 420, "end_ms": 540},
    {"text": "ful",  "start_ms": 540, "end_ms": 780}
  ]
}
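
To recover a word-level timestamp from sub-word tokens like these, one possible approach is to join the token texts and take the first `start_ms` and the last `end_ms`. The sketch below is illustrative only: `merge_tokens` is a hypothetical helper, not a Soniox function, and it assumes the caller already knows that every token passed in belongs to the same word. How word boundaries are marked in a real response is not covered here.

```python
import json

# The example response from above, parsed into a Python dictionary.
response = json.loads("""
{
  "tokens": [
    {"text": "Beau", "start_ms": 300, "end_ms": 420},
    {"text": "ti",   "start_ms": 420, "end_ms": 540},
    {"text": "ful",  "start_ms": 540, "end_ms": 780}
  ]
}
""")


def merge_tokens(tokens: list[dict]) -> dict:
    """Combine the consecutive tokens of one word into a single span."""
    if not tokens:
        raise ValueError("expected at least one token")
    return {
        "text": "".join(t["text"] for t in tokens),
        "start_ms": tokens[0]["start_ms"],  # the word starts with its first token
        "end_ms": tokens[-1]["end_ms"],     # and ends with its last token
    }


word = merge_tokens(response["tokens"])
print(word)  # {'text': 'Beautiful', 'start_ms': 300, 'end_ms': 780}
```

Here the merged word "Beautiful" spans 300 ms to 780 ms, which matches the outer bounds of its three tokens.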
