Caption timing must align visual text with the corresponding acoustic event; recognition speed alone does not provide that alignment. Word timestamps and forced alignment supply timing for captions, seekable transcripts, and synchronized highlighting, although their precision depends on the audio and alignment method.
Timestamps from speech recognition
Most modern recognizers can return timing alongside the text, because recognizing a word involves locating it in the audio, at least approximately. The output is usually per token: the text, a start time, an end time, and often a confidence score.
{ "tokens": [
  { "text": "transfer", "start_ms": 1200, "end_ms": 1620 },
  { "text": "four", "start_ms": 1700, "end_ms": 1880 },
  { "text": "hundred", "start_ms": 1900, "end_ms": 2240 }
]}
These are recognition timestamps. They come from the same pass that produced the words, so they require no extra step, and you use them whenever the recognizer is the source of the transcript.
Creating caption cues from timestamps
Captions are not raw words. They are short lines that appear and disappear on a schedule a viewer can read, and timestamps are the raw material. The grouping rule is roughly: start a new line when the current line gets too long, when too much time has passed, or when there is a real gap in the speech.
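function toLines(tokens, maxChars = 42, maxGapMs = 700, maxDurMs = 6000) {
  // maxDurMs is an illustrative cap on how long one line may span.
  const lines = [];
  let cur = [];
  for (const t of tokens) {
    const prev = cur[cur.length - 1];
    const gap = prev ? t.start_ms - prev.end_ms : 0;
    // Length and duration the line would have if this token were appended.
    const text = cur.map(x => x.text).concat(t.text).join(" ");
    const dur = cur.length ? t.end_ms - cur[0].start_ms : 0;
    if (cur.length && (text.length > maxChars || gap > maxGapMs || dur > maxDurMs)) {
      lines.push(cur);
      cur = [];
    }
    cur.push(t);
  }
  if (cur.length) lines.push(cur);
  return lines; // each line spans its first token's start_ms to its last token's end_ms
}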
The exact thresholds are a matter of readability craft, covered in captions and subtitles. Good captions come down to slicing a timestamped token stream sensibly.
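As a rough illustration of the step after grouping, the sketch below renders grouped lines as SRT-style cues. It assumes each token is a dict with the `text`, `start_ms`, and `end_ms` fields shown earlier; `fmt_time` and `to_srt` are hypothetical helper names, and the sample line is grouped by hand.

```python
def fmt_time(ms: int) -> str:
    """Format milliseconds as an SRT timestamp, HH:MM:SS,mmm."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def to_srt(lines: list[list[dict]]) -> str:
    """Render grouped token lines as numbered SRT cues."""
    cues = []
    for i, line in enumerate(lines, start=1):
        start = fmt_time(line[0]["start_ms"])  # first token opens the cue
        end = fmt_time(line[-1]["end_ms"])     # last token closes it
        text = " ".join(tok["text"] for tok in line)
        cues.append(f"{i}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)  # blank line between consecutive cues


lines = [[
    {"text": "transfer", "start_ms": 1200, "end_ms": 1620},
    {"text": "four", "start_ms": 1700, "end_ms": 1880},
    {"text": "hundred", "start_ms": 1900, "end_ms": 2240},
]]
print(to_srt(lines))
```

The cue simply inherits its window from the first and last token. A production captioner would usually also pad short cues so they stay on screen long enough to read.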
Forced alignment of an existing transcript
Sometimes you do not need recognition at all, because you already have the words. An audiobook ships with its script. A song has lyrics. A news package has an approved transcript that must not change. What you lack in each case is the timing. This is the forced alignment case: given the audio and the known text, find when each word was spoken.
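# Conceptual: align a trusted transcript to its audio.
# `aligner` stands in for whichever alignment tool you use; it is not a specific library.
words = aligner.align(audio="chapter1.wav", transcript="chapter1.txt")
# -> [("Call", 0.42, 0.61), ("me", 0.61, 0.74), ("Ishmael", 0.74, 1.30), ...]
#    each tuple is (word, start, end), with times in seconds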
The difference from recognition matters. Recognition asks "what words are these?" and gets the timing as a byproduct. Forced alignment is given the words and asks only "when?" Because the text is fixed, alignment can be more precise, and its output never disagrees with the script, which is what you want when the transcript is authoritative.
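To make the "when?" question concrete, here is a toy sketch of the idea behind forced alignment: a dynamic program that assigns every audio frame to one of the known words, in order, so that the total score is highest. Everything here is an assumption for illustration. The `scores` matrix is made up, `force_align` is a hypothetical name, and the 20 ms frame size is arbitrary. Real aligners score frames against phonemes or characters from an acoustic model and also handle silence.

```python
import math


def force_align(scores, words, frame_ms=20):
    """Monotonic alignment of known words to frames (toy example).

    scores[t][k] is a made-up log-score that frame t belongs to words[k].
    Returns a list of (word, start_ms, end_ms) tuples.
    """
    n_frames, n_words = len(scores), len(words)
    if n_frames < n_words:
        raise ValueError("need at least one frame per word")

    neg_inf = -math.inf
    best = [[neg_inf] * n_words for _ in range(n_frames)]
    advanced = [[False] * n_words for _ in range(n_frames)]
    best[0][0] = scores[0][0]  # the first frame must belong to the first word

    for t in range(1, n_frames):
        for k in range(n_words):
            stay = best[t - 1][k]                              # same word continues
            move = best[t - 1][k - 1] if k > 0 else neg_inf    # next word begins
            if move > stay:
                best[t][k] = move + scores[t][k]
                advanced[t][k] = True
            else:
                best[t][k] = stay + scores[t][k]

    # Walk back from the last frame of the last word to recover boundaries.
    first = [0] * n_words
    last = [0] * n_words
    k = n_words - 1
    last[k] = n_frames - 1
    for t in range(n_frames - 1, 0, -1):
        if advanced[t][k]:
            first[k] = t
            k -= 1
            last[k] = t - 1

    return [
        (w, first[i] * frame_ms, (last[i] + 1) * frame_ms)
        for i, w in enumerate(words)
    ]


# Five frames: the first three favour "call", the last two favour "me".
scores = [
    [-0.1, -3.0],
    [-0.2, -2.5],
    [-0.3, -2.0],
    [-2.0, -0.2],
    [-3.0, -0.1],
]
print(force_align(scores, ["call", "me"]))
```

The word order is never in question. The search only decides where each boundary falls, which is why the result cannot disagree with the script.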
Timing errors in difficult audio
Timing is clean only when the speech is clean. Three situations bend it. Music and noise stretch the boundaries, because the recognizer cannot tell exactly where a word fades into a guitar. Overlapping speakers break the assumption that one word occupies one slice of time. And long gaps, whether silence, applause, or a held note, leave the aligner guessing whether the next word starts at the gap's end or somewhere inside it.
The practical defenses are to align at the granularity you need and no finer (word-level timing is steadier than phoneme-level on messy audio), lean on voice activity detection to mark the gaps, and treat timestamps as accurate to a tolerance rather than to the millisecond. Captions tolerate a little slack. A karaoke highlight tolerates less. A forced-alignment dataset for training tolerates least, so align on the cleanest audio you have.
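One way to lean on voice activity detection is to trim each word so it cannot extend into a region the detector marked as non-speech. The sketch below is a minimal version of that idea under stated assumptions: `snap_to_speech` is a hypothetical helper, words are `(text, start_ms, end_ms)` tuples, and `speech` is a list of `(start_ms, end_ms)` regions from whatever VAD you use.

```python
def snap_to_speech(words, speech):
    """Trim each word span to the speech region it overlaps most.

    words:  list of (text, start_ms, end_ms)
    speech: list of (start_ms, end_ms) regions reported by a VAD
    """
    snapped = []
    for text, start, end in words:
        best_region, best_overlap = None, 0
        for seg_start, seg_end in speech:
            overlap = min(end, seg_end) - max(start, seg_start)
            if overlap > best_overlap:
                best_region, best_overlap = (seg_start, seg_end), overlap
        if best_region is None:
            # No overlap with any speech region: leave the word untouched.
            snapped.append((text, start, end))
        else:
            snapped.append(
                (text, max(start, best_region[0]), min(end, best_region[1]))
            )
    return snapped


speech = [(0, 1000), (2500, 4000)]  # a 1.5 s gap sits between the regions
words = [("thank", 600, 1300), ("you", 2100, 2900)]
print(snap_to_speech(words, speech))
```

Here "thank" loses the tail that leaked into the gap and "you" no longer starts inside it. The trimmed times are still estimates, so downstream code should keep treating them as accurate to a tolerance.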
Applications of aligned timestamps
Once the timestamps exist, everything downstream is a small step. You get captions and subtitles. You get click-to-seek transcripts, where tapping a word jumps the player to its start time. You get highlighting that follows the voice, the read-along effect, produced by matching the current playback time to the token whose window contains it. You get precise editing, where cutting a sentence from the text cuts the matching span of audio. And you get clean training data for other models, which is much of why forced alignment exists in research at all.
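The highlighting lookup described above can be sketched as a binary search over token start times. This assumes tokens are sorted by `start_ms` and do not overlap; `token_at` is a hypothetical helper name, and the token fields follow the recognizer output shown earlier.

```python
from bisect import bisect_right

tokens = [
    {"text": "transfer", "start_ms": 1200, "end_ms": 1620},
    {"text": "four", "start_ms": 1700, "end_ms": 1880},
    {"text": "hundred", "start_ms": 1900, "end_ms": 2240},
]
starts = [t["start_ms"] for t in tokens]  # computed once, reused on every tick


def token_at(tokens, starts, time_ms):
    """Return the index of the token whose window contains time_ms, or None."""
    i = bisect_right(starts, time_ms) - 1  # last token starting at or before time_ms
    if i >= 0 and time_ms < tokens[i]["end_ms"]:
        return i
    return None  # before the first word, after the last, or between words


print(token_at(tokens, starts, 1950))  # index of the word to highlight
print(token_at(tokens, starts, 1650))  # falls in the pause between two words
```

Click-to-seek is the same table read in the other direction: given a token index, set the player position to `tokens[i]["start_ms"]`.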
Common questions
What is the difference between timestamps and forced alignment?
Timestamps are the times a recognizer attaches to the words it transcribes, produced as a byproduct of recognition. Forced alignment starts from a transcript you already have and finds when each of those known words was spoken. Use recognition timestamps when the recognizer is your source of text; use forced alignment when the text is fixed and you only need timing.
How precise are word timestamps?
Good systems place words within a few tens of milliseconds on clean speech, which is tight enough for captions, seeking, and highlighting. Precision degrades with music, noise, and overlapping speakers, where the true boundary of a word is genuinely fuzzy, so timestamps are best treated as accurate to a tolerance rather than exact.
Can I get timestamps for each character or phoneme, not just each word?
Sometimes. Word-level timing is the common output and the most stable. Finer granularity at the phoneme or character level is available from some systems and from forced aligners, and it is useful for linguistics, dubbing, and tight read-along effects, but it is more sensitive to messy audio.
Do I need forced alignment if my recognizer already gives timestamps?
Only when you have a transcript that must not change. If you are generating the transcript with recognition, its timestamps are enough. Forced alignment earns its place when the text is authoritative, such as an approved script, published lyrics, or a legal transcript, and you need accurate timing without letting recognition alter a word.
Related concepts
- Captions and subtitles: SRT, VTT, and timing
- Confidence scores
- Partial vs final results
- Voice activity detection
- How speech-to-text works