A caption may be lexically correct yet unusable if it contains too much text for its display interval. Caption construction therefore requires segmentation and timing in addition to accurate transcription. A timestamped transcript supplies the source text and temporal information from which SRT or VTT cues are constructed.
Required timestamp data
Captioning needs to know when each word was spoken, so it begins with a transcript carrying per-word timestamps. Assume tokens like these, with millisecond start and end times:
[
  { "text": "we", "start_ms": 1000, "end_ms": 1120 },
  { "text": "shipped", "start_ms": 1120, "end_ms": 1480 },
  { "text": "it", "start_ms": 1480, "end_ms": 1600 },
  { "text": "on", "start_ms": 1700, "end_ms": 1820 },
  { "text": "Friday", "start_ms": 1820, "end_ms": 2300 }
]
Everything downstream is grouping these into lines and putting timecodes on the groups.
Caption segmentation
This stage separates good captions from bad. Three constraints govern where a caption breaks, and they pull against each other.
Line length comes first: keep each line to roughly 37 to 42 characters, and rarely use more than two lines per caption, so the text does not crowd out the picture. Reading speed sets how long a caption must stay up: a viewer reads only so fast, commonly capped near 17 characters per second for adults (broadcast and streaming style guides each set their own ceiling), with a minimum duration around one second so a short line does not flash by and a maximum around seven seconds so a long one does not linger. Phrase boundaries decide where the break lands: at a natural seam, after a clause or punctuation mark, never mid-name and never between an article and its noun. Split "the / contract" across two lines and the eye has to start the sentence twice.
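function toCues(tokens, { maxChars = 42, maxGapMs = 700, maxDurMs = 7000 } = {}) {
  const cues = [];
  let cur = [];
  // Close the current cue, if any, and start a new one.
  const flush = () => { if (cur.length) { cues.push(cur); cur = []; } };
  for (const t of tokens) {
    const prev = cur[cur.length - 1];
    const text = cur.map(x => x.text).join(" ");
    // Silence between the previous word and this one.
    const gap = prev ? t.start_ms - prev.end_ms : 0;
    // Duration the cue would have if this word were added.
    const dur = prev ? t.end_ms - cur[0].start_ms : 0;
    // The +1 counts the space that would join this word to the text so far.
    const tooLong = text.length + 1 + t.text.length > maxChars;
    if (cur.length && (tooLong || gap > maxGapMs || dur > maxDurMs)) flush();
    cur.push(t);
  }
  flush();
  return cues; // each cue: array of tokens; start is the first token's start_ms, end is the last token's end_ms
}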
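The grouping above enforces length, gap, and duration limits but does not look at phrase boundaries. As a rough sketch of that third constraint, the helper below splits one cue's text into at most two lines, preferring a break after punctuation and refusing to strand an article at the end of a line. The name splitLines, the article list, and the scoring weights are illustrative assumptions, not values from any style guide.

```javascript
// Hypothetical helper: break one cue's text into at most two lines.
// Prefers a break after punctuation, penalizes a break right after an
// article, and otherwise picks the most balanced split.
const ARTICLES = new Set(["a", "an", "the"]);

function splitLines(text, maxLineChars = 42) {
  if (text.length <= maxLineChars) return [text];
  const words = text.split(" ");
  let best = null;
  for (let i = 1; i < words.length; i++) {
    const first = words.slice(0, i).join(" ");
    const second = words.slice(i).join(" ");
    if (first.length > maxLineChars || second.length > maxLineChars) continue;
    const last = words[i - 1];
    let score = Math.abs(first.length - second.length); // lower is better
    if (/[,.;:!?]$/.test(last)) score -= 20;            // reward a clause seam (assumed weight)
    if (ARTICLES.has(last.toLowerCase())) score += 100; // avoid "the / contract" (assumed weight)
    if (best === null || score < best.score) best = { score, lines: [first, second] };
  }
  // Fall back to one over-long line if no two-line split fits.
  return best ? best.lines : [text];
}
```

With a 37-character limit, "We signed the contract on Friday, and it ships Monday." breaks after "Friday," even though a more balanced split exists after "on". Names and other multi-word units would need a richer rule than this sketch provides.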
Write SRT
SubRip (.srt) is the oldest and most widely accepted format. Each cue is a number, a timecode line with comma as the decimal separator, the text, and a blank line.
1
00:00:01,000 --> 00:00:02,300
We shipped it on Friday.

2
00:00:03,100 --> 00:00:05,400
It should arrive Monday morning.
The format is unforgiving and gives no error when you get it wrong. The arrow is --> with a space on each side, the milliseconds take a comma and not a dot, and the blank line between cues is mandatory. Get one wrong and a strict player drops the cue silently.
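Because these rules are mechanical, it is safer to generate SRT than to type it. The following is a minimal sketch, assuming each cue has already been reduced to an object with start_ms, end_ms, and text fields; the function names srtTime and toSrt are illustrative.

```javascript
// Format non-negative integer milliseconds as an SRT timecode: HH:MM:SS,mmm.
function srtTime(ms) {
  const pad = (n, width) => String(n).padStart(width, "0");
  const h = Math.floor(ms / 3600000);
  const m = Math.floor((ms % 3600000) / 60000);
  const s = Math.floor((ms % 60000) / 1000);
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(ms % 1000, 3)}`;
}

// cues: array of { start_ms, end_ms, text } (an assumed shape).
// Each cue becomes index, timecode line, text; cues are joined by a blank line.
function toSrt(cues) {
  return cues
    .map((c, i) => `${i + 1}\n${srtTime(c.start_ms)} --> ${srtTime(c.end_ms)}\n${c.text}\n`)
    .join("\n");
}
```

Passing the two cues shown above reproduces that file, including the blank line between them.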
Write VTT
WebVTT (.vtt) is the web standard, used by HTML5 video through the track element. A file starts with a WEBVTT header followed by a blank line, uses a dot for the decimal separator, makes the cue index optional, and adds optional cue settings for positioning and alignment.
WEBVTT

00:00:01.000 --> 00:00:02.300
We shipped it on Friday.

00:00:03.100 --> 00:00:05.400 line:90% align:center
It should arrive Monday morning.
The two formats are close enough that converting between them is mostly swapping the comma for a dot, adding or removing the header and indices, and deciding what to do with VTT's extra settings.
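The SRT-to-VTT direction can be sketched in a few lines. This is a simplified illustration that assumes well-formed SRT input with blank-line-separated cues; the function name srtToVtt is hypothetical. It rewrites the separator only on the timecode line, so commas in the caption text are left alone.

```javascript
// Convert SRT text to WebVTT: add the header, drop numeric indices,
// and swap the comma for a dot on timecode lines only.
function srtToVtt(srt) {
  const blocks = srt.replace(/\r\n/g, "\n").trim().split(/\n{2,}/);
  const cues = blocks.map(block => {
    const lines = block.split("\n");
    // Drop the index line when it is followed by a timecode line.
    if (/^\d+$/.test(lines[0]) && lines.length > 1 && lines[1].includes("-->")) {
      lines.shift();
    }
    lines[0] = lines[0].replace(/(\d{2}:\d{2}:\d{2}),(\d{3})/g, "$1.$2");
    return lines.join("\n");
  });
  return "WEBVTT\n\n" + cues.join("\n\n") + "\n";
}
```

Going the other way requires the decision mentioned above: SRT has no standard place for cue settings such as line:90%, so a converter typically has to drop them.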
Common formatting problems
Clean prose is the easy part. Whether captions ship usually comes down to how you handle the messy cases.
Speaker labels, usually a >> or a name prefix, mark who is talking when the picture does not make it obvious. Non-speech cues, "[music]" and "[laughter]," belong in captions for accessibility and get dropped from subtitles, where a hearing viewer does not need them. Overlapping speech forces a choice: when two people talk at once and both lines will not fit, someone's words get cut. And positioning matters the moment a caption would sit on top of a lower-third or burned-in text, which is what VTT's cue settings exist to fix.
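Two of these cases are simple enough to sketch. The helpers below assume non-speech cues are written in square brackets and that each cue carries a speaker field; both the field name and the function names are assumptions made for illustration.

```javascript
// Remove bracketed non-speech cues such as "[music]" when producing subtitles.
// Returns null when nothing is left, so the caller can drop the cue entirely.
function stripNonSpeech(text) {
  const cleaned = text
    .replace(/\[[^\]]*\]/g, "")  // delete anything in square brackets
    .replace(/[ \t]{2,}/g, " ")  // collapse the doubled space left behind
    .trim();
  return cleaned.length ? cleaned : null;
}

// Prefix a cue with ">> " whenever the speaker changes (one common convention).
// cues: array of { speaker, text, ... }; `speaker` is an assumed field.
function labelSpeakers(cues) {
  let last = null;
  return cues.map(c => {
    const text = c.speaker !== last ? `>> ${c.text}` : c.text;
    last = c.speaker;
    return { ...c, text };
  });
}
```

Overlapping speech and positioning are editorial decisions more than string transforms, so they are left out of this sketch.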
Live captions
Everything above assumes a finished recording. Live captioning adds the constraints of streaming. You work from partial results that may still change, and you pick a presentation style. Roll-up captions scroll word by word as text firms up, which suits the provisional nature of live recognition. Pop-on captions show a complete line at once, which reads better but adds latency because you wait for the line to finalize. This is the trade-off between responsiveness and stability.
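A roll-up display can be modeled as a small buffer: finalized lines scroll up, and the newest line is a partial that gets overwritten until the recognizer marks it final. The sketch below assumes updates arrive as objects with text and isFinal fields, which is a hypothetical shape rather than any particular recognizer's API.

```javascript
// Hypothetical roll-up buffer for live captions.
class RollUpCaptions {
  constructor(maxLines = 2) {
    this.maxLines = maxLines;
    this.finalLines = []; // text that will no longer change
    this.partial = "";    // current provisional line
  }

  // result: { text, isFinal } (an assumed shape for a recognizer update)
  update(result) {
    if (result.isFinal) {
      this.finalLines.push(result.text);
      this.finalLines = this.finalLines.slice(-this.maxLines); // keep only what can be shown
      this.partial = "";
    } else {
      this.partial = result.text; // replace, do not append: each partial revises the last
    }
    return this.render();
  }

  // Lines currently on screen, oldest first.
  render() {
    const lines = this.partial ? [...this.finalLines, this.partial] : [...this.finalLines];
    return lines.slice(-this.maxLines);
  }
}
```

A pop-on variant of the same buffer would render only finalLines and ignore partial, which is where the added latency comes from.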
Common questions
What is the difference between captions and subtitles?
Captions are in the same language as the audio and include non-speech sounds like "[applause]," serving viewers who cannot hear. Subtitles assume the viewer can hear and usually translate into another language. The file formats, SRT and VTT, are the same. The content conventions differ.
Should I use SRT or VTT?
VTT for the web: HTML5 video supports it natively and it allows positioning and styling. SRT for the widest compatibility with players, editors, and platforms that expect the older format. They carry the same timed text and convert easily. The differences are the decimal separator, the header, and VTT's optional cue settings.
Why do my captions feel too fast to read?
Most likely they were segmented by recognition output rather than by reading speed. A caption must stay on screen long enough to read: reading speed is commonly capped near 17 characters per second, with a minimum duration around one second. Long lines that change too quickly outrun the viewer even when every word is correct.
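One way to find and repair such cues is to measure characters per second and extend the end time where there is room. This is a minimal sketch assuming cues shaped as { start_ms, end_ms, text }; the function names and the default limits (17 characters per second, 1000 ms) are assumptions taken from the figures in this article, and your style guide may set different ones.

```javascript
// Characters per second for one cue.
function charsPerSecond(cue) {
  const seconds = (cue.end_ms - cue.start_ms) / 1000;
  return seconds > 0 ? cue.text.length / seconds : Infinity;
}

// Extend a cue's end time to satisfy the minimum duration and the
// reading-speed cap, without running past the start of the next cue.
function fitDuration(cue, nextStartMs = Infinity, { maxCps = 17, minDurMs = 1000 } = {}) {
  const neededMs = Math.max(minDurMs, Math.ceil((cue.text.length / maxCps) * 1000));
  const wantedEnd = Math.max(cue.end_ms, cue.start_ms + neededMs);
  // Never shorten the cue, even if the next one already overlaps it.
  const end_ms = Math.max(cue.end_ms, Math.min(wantedEnd, nextStartMs));
  return { ...cue, end_ms };
}
```

If the next cue starts too soon for the extension to fit, the remaining options are to shorten the text or to re-segment, which this sketch does not attempt.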
How do I make live captions if the text keeps changing?
Work from partial results and pick a presentation style. Roll-up captions scroll as words firm up and tolerate the provisional text. Pop-on captions wait for a complete line, which reads better but adds delay. Either way you trade responsiveness against stability, the same trade-off as all streaming recognition.
Related concepts
- Word timestamps and forced alignment
- Streaming speech recognition
- Partial vs final results
- Punctuation, capitalization, and ITN
- Real-time vs async transcription