Word timestamps and forced alignment in speech-to-text

Watch a captioned video and each line lands as its words are spoken, close enough that you stop noticing. Somebody had to know, to within a fraction of a second, when every word in that audio occurs. That knowledge is the timestamp, and there are exactly two ways to get it: as a byproduct of recognizing the words, or as a separate pass that times words you already have.

Timestamps from speech recognition

Most modern recognizers can hand you timing at little extra cost, because recognizing a word usually involves deciding where that word sits in the audio. The output is usually per token: the text, a start time, an end time, and often a confidence score (omitted from the example below).

{ "tokens": [
  { "text": "transfer", "start_ms": 1200, "end_ms": 1620 },
  { "text": "four",     "start_ms": 1700, "end_ms": 1880 },
  { "text": "hundred",  "start_ms": 1900, "end_ms": 2240 }
]}

These are recognition timestamps. They come from the same pass that produced the words, so they require no extra step, and you use them whenever the recognizer is the source of the transcript.
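As a minimal sketch of consuming that output, the Python below parses a payload shaped like the example above into typed tokens and checks that the times are sane. The field names `text`, `start_ms`, and `end_ms` are taken from the example; the `Token` class and `parse_tokens` function are hypothetical names for illustration, and a real recognizer's response format may differ.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    text: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def parse_tokens(payload: str) -> list[Token]:
    """Parse a response shaped like the example payload (assumed format)."""
    data = json.loads(payload)
    tokens = [
        Token(t["text"], int(t["start_ms"]), int(t["end_ms"]))
        for t in data["tokens"]
    ]
    # Reject tokens that end before they start.
    for tok in tokens:
        if tok.end_ms < tok.start_ms:
            raise ValueError(f"token {tok.text!r} ends before it starts")
    # Reject streams whose start times go backwards.
    for prev, cur in zip(tokens, tokens[1:]):
        if cur.start_ms < prev.start_ms:
            raise ValueError(f"token {cur.text!r} is out of time order")
    return tokens


payload = """
{ "tokens": [
  { "text": "transfer", "start_ms": 1200, "end_ms": 1620 },
  { "text": "four",     "start_ms": 1700, "end_ms": 1880 },
  { "text": "hundred",  "start_ms": 1900, "end_ms": 2240 }
]}
"""

tokens = parse_tokens(payload)
for tok in tokens:
    print(f"{tok.start_ms:>5} ms  {tok.text}  ({tok.duration_ms} ms)")
```

Validating order and duration up front is a design choice: everything downstream (captions, seeking, highlighting) assumes a time-ordered stream, so it is cheaper to fail here than to debug a misplaced cue later.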

Creating caption cues from timestamps

Captions are not raw words. They are short lines that appear and disappear on a schedule a viewer can read, and timestamps are the raw material. The grouping rule is roughly: start a new line when the current one grows past a caption's width (forty-some characters is common), when it has been building too long, or when a real gap opens in the speech. Each finished line then carries the start time of its first token and the end time of its last, and those two numbers become the caption cue.
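The grouping rule can be sketched in a few lines of Python. This is an illustration under assumptions, not a production captioner: the thresholds (`max_chars=42`, `max_duration_ms=5000`, `max_gap_ms=700`) are placeholder values chosen for the example, the `Token` and `Cue` classes are hypothetical, and the fourth token in the demo ("dollars", after a pause) is made up to show the gap rule firing. The sketch breaks a line just before a token would push it past the width or duration limit.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    start_ms: int
    end_ms: int


@dataclass
class Cue:
    text: str
    start_ms: int
    end_ms: int


def build_cues(tokens, max_chars=42, max_duration_ms=5000, max_gap_ms=700):
    """Group time-ordered tokens into caption cues (thresholds are assumptions)."""
    cues = []
    line = []  # tokens in the caption line currently being built

    def flush():
        # A cue takes the start of its first token and the end of its last.
        if line:
            text = " ".join(t.text for t in line)
            cues.append(Cue(text, line[0].start_ms, line[-1].end_ms))
            line.clear()

    for tok in tokens:
        if line:
            width_if_added = len(" ".join(t.text for t in line)) + 1 + len(tok.text)
            too_wide = width_if_added > max_chars
            too_long = tok.end_ms - line[0].start_ms > max_duration_ms
            real_gap = tok.start_ms - line[-1].end_ms > max_gap_ms
            if too_wide or too_long or real_gap:
                flush()
        line.append(tok)

    flush()
    return cues


demo_tokens = [
    Token("transfer", 1200, 1620),
    Token("four", 1700, 1880),
    Token("hundred", 1900, 2240),
    Token("dollars", 3400, 3800),  # invented token after a pause
]

demo_cues = build_cues(demo_tokens)
for cue in demo_cues:
    print(cue.start_ms, cue.end_ms, cue.text)
```

A single token longer than `max_chars` still becomes its own cue here; a real captioner would also need rules for punctuation, line balance, and minimum display time.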

The exact thresholds are a matter of readability craft, covered in captions and subtitles. Good captions come down to slicing a timestamped token stream sensibly.

Forced alignment of an existing transcript

Sometimes you do not need recognition at all, because you already have the words. An audiobook ships with its script. A song has lyrics. A news package has an approved transcript that must not change. What you lack in each case is the timing. This is the forced alignment case: given the audio and the known text, find when each word was spoken. Hand an aligner the first chapter of Moby-Dick and its audio, and it returns the same words with times attached: "Call" from 0.42 to 0.61 seconds, "me" ending at 0.74, "Ishmael" ending at 1.30.

The difference from recognition matters. Recognition asks "what words are these?" and gets the timing as a byproduct. Forced alignment is told the words up front and asks only "when?" Because the text is fixed, alignment can be more precise, and its output never disagrees with the script, which is what you want when the transcript is authoritative.
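To make the "when?" question concrete, here is a toy alignment in Python. It assumes you already have, for every short audio frame, a log-score for how well that frame matches each word of the known transcript; producing those scores is the job of an acoustic model and is not shown. The function `force_align`, the 20 ms frame size, and the score matrix are all hypothetical. Real aligners typically work on phoneme-level states and model silence, which this sketch does not.

```python
import math


def force_align(scores, frame_ms=20):
    """Return one (start_ms, end_ms) span per word, in transcript order.

    scores[t][w] is an assumed log-score that frame t belongs to word w.
    Every frame is assigned to exactly one word, words keep their order,
    and each word gets at least one frame.
    """
    n_frames = len(scores)
    n_words = len(scores[0])
    if n_frames < n_words:
        raise ValueError("need at least one frame per word")

    neg_inf = -math.inf
    # best[t][w]: best total score of any valid path ending at frame t in word w.
    best = [[neg_inf] * n_words for _ in range(n_frames)]
    # advanced[t][w]: True if that best path entered word w exactly at frame t.
    advanced = [[False] * n_words for _ in range(n_frames)]
    best[0][0] = scores[0][0]

    for t in range(1, n_frames):
        for w in range(n_words):
            stay = best[t - 1][w]
            move = best[t - 1][w - 1] if w > 0 else neg_inf
            if move > stay:
                best[t][w] = move + scores[t][w]
                advanced[t][w] = True
            else:
                best[t][w] = stay + scores[t][w]

    # Walk back from the last frame of the last word to recover boundaries.
    first_frame = [0] * n_words
    last_frame = [0] * n_words
    w = n_words - 1
    last_frame[w] = n_frames - 1
    for t in range(n_frames - 1, 0, -1):
        if advanced[t][w]:
            first_frame[w] = t
            w -= 1
            last_frame[w] = t - 1

    return [
        (first_frame[i] * frame_ms, (last_frame[i] + 1) * frame_ms)
        for i in range(n_words)
    ]


transcript = ["Call", "me", "Ishmael"]
# Made-up scores: 6 frames (rows) by 3 words (columns).
toy_scores = [
    [-0.1, -3.0, -3.0],
    [-0.2, -2.5, -3.0],
    [-2.5, -0.3, -2.0],
    [-3.0, -2.0, -0.2],
    [-3.0, -3.0, -0.1],
    [-3.0, -3.0, -0.1],
]

spans = force_align(toy_scores)
for word, (start_ms, end_ms) in zip(transcript, spans):
    print(f"{word}: {start_ms}-{end_ms} ms")
```

Note what the search cannot do: it never inserts, drops, or changes a word. The transcript fixes the columns, and the only freedom left is where the boundaries fall, which is why the output always matches the script.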

Timing errors in difficult audio

Timing is clean only when the speech is clean. Music and noise stretch the boundaries, because the recognizer cannot tell exactly where a word fades into a guitar. Overlapping speakers break the assumption that one word occupies one slice of time. And long gaps, whether silence, applause, or a held note, leave the aligner guessing whether the next word starts at the gap's end or somewhere inside it.^[1]

The practical defenses are to align at the granularity you need and no finer (word-level timing is steadier than phoneme-level on messy audio), lean on voice activity detection to mark the gaps, and treat timestamps as accurate to a tolerance rather than to the millisecond. Captions tolerate a little slack. A karaoke highlight tolerates less. A forced-alignment dataset for training tolerates least, so align on the cleanest audio you have.
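Two of those defenses are easy to sketch in Python. The first trims a word's boundaries to the speech region it overlaps most, assuming a voice activity detector has already produced a list of `(start_ms, end_ms)` speech regions; the second compares two timestamps within a tolerance instead of exactly. The function names, the region format, and the numbers in the demo are assumptions made for illustration.

```python
def clamp_to_speech(word_span, speech_regions):
    """Trim a word's (start_ms, end_ms) to the speech region it overlaps most.

    speech_regions is an assumed VAD output: a list of (start_ms, end_ms).
    A word that overlaps no region is returned unchanged.
    """
    start, end = word_span
    best_region = None
    best_overlap = 0
    for r_start, r_end in speech_regions:
        overlap = min(end, r_end) - max(start, r_start)
        if overlap > best_overlap:
            best_region = (r_start, r_end)
            best_overlap = overlap
    if best_region is None:
        return word_span
    return (max(start, best_region[0]), min(end, best_region[1]))


def within_tolerance(a_ms, b_ms, tolerance_ms):
    """Treat two timestamps as equal if they differ by at most tolerance_ms."""
    return abs(a_ms - b_ms) <= tolerance_ms


# Made-up example: speech, a one-second gap, then speech again.
regions = [(1000, 3000), (4000, 5200)]
# A word whose start was stretched back across the gap.
stretched = (2900, 4300)

fixed = clamp_to_speech(stretched, regions)
print(fixed)
print(within_tolerance(fixed[0], 4020, tolerance_ms=50))
```

The tolerance is a parameter on purpose: the same pair of timestamps can be close enough for a caption and too far apart for a karaoke highlight, so the caller decides.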

flowchart TB
  A[Audio] --> B[Recognizer]
  B --> C[Words + timestamps]
  A --> D[Forced aligner]
  E[Known transcript] --> D
  D --> F[Same words,<br/>now timed]

Two routes to timing. Recognition produces words and times together; forced alignment adds times to words you already trust.

Applications of aligned timestamps

Once the timestamps exist, everything downstream is a small step. Captions and subtitles are the obvious one. Click-to-seek transcripts follow, where tapping a word jumps the player to its start time, and so does read-along highlighting, produced by matching the playback clock against the token whose window contains it. Editors get precision: cut a sentence from the text and the matching span of audio goes with it. And researchers get clean training data, which is much of why forced alignment exists at all.
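Read-along highlighting and click-to-seek both reduce to a lookup against the token windows. A minimal Python sketch, assuming tokens are sorted by start time and do not overlap; the function names are hypothetical, and the times reuse the three tokens from the recognizer example earlier in this article.

```python
from bisect import bisect_right


def active_token_index(starts_ms, ends_ms, clock_ms):
    """Return the index of the token whose window contains clock_ms, or None.

    Assumes starts_ms is sorted and token windows do not overlap.
    A window is treated as [start, end): the end time itself is outside.
    """
    # Rightmost token that has started at or before the clock.
    i = bisect_right(starts_ms, clock_ms) - 1
    if i >= 0 and clock_ms < ends_ms[i]:
        return i
    return None  # before the first word, or in a gap between words


def seek_target_ms(starts_ms, token_index):
    """Click-to-seek: jump the player to the tapped token's start time."""
    return starts_ms[token_index]


words = ["transfer", "four", "hundred"]
starts = [1200, 1700, 1900]
ends = [1620, 1880, 2240]

for clock in (1000, 1300, 1650, 1750, 2239):
    i = active_token_index(starts, ends, clock)
    print(clock, words[i] if i is not None else "(no word)")
```

Binary search keeps the lookup cheap enough to run on every playback tick, even for an hour-long transcript. Returning `None` in the gaps lets the interface clear the highlight rather than hold it on the previous word.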

Common questions

What is the difference between timestamps and forced alignment?

Timestamps are the times a recognizer attaches to the words it transcribes, produced as a byproduct of recognition. Forced alignment starts from a transcript you already have and finds when each of those known words was spoken. Use recognition timestamps when the recognizer is your source of text; use forced alignment when the text is fixed and you only need timing.

How precise are word timestamps?

Good systems place words within a few tens of milliseconds on clean speech, which is tight enough for captions, seeking, and highlighting. Precision degrades with music, noise, and overlapping speakers, where the true boundary of a word is genuinely fuzzy, so timestamps are best treated as accurate to a tolerance rather than exact.

Can I get timestamps for each character or phoneme, not just each word?

Sometimes. Word-level timing is the common output and the most stable. Finer granularity at the phoneme or character level is available from some systems and from forced aligners, and it is useful for linguistics, dubbing, and tight read-along effects, but it is more sensitive to messy audio.

Do I need forced alignment if my recognizer already gives timestamps?

Only when you have a transcript that must not change. If you are generating the transcript with recognition, its timestamps are enough. Forced alignment earns its place when the text is authoritative, such as an approved script, published lyrics, or a legal transcript, and you need accurate timing without letting recognition alter a word.

References

Bain, M., Huh, J., Han, T., & Zisserman, A. (2023). WhisperX: Time-Accurate Speech Transcription of Long-Form Audio. arXiv preprint arXiv:2303.00747.
McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., & Sonderegger, M. (2017). Montreal Forced Aligner: Trainable Text-Speech Alignment Using Kaldi. Interspeech 2017.