Generating captions and subtitles: SRT, VTT, and timing rules

A caption can be word-for-word correct and still unusable, because it puts more text on screen than a viewer can read before it disappears. Captioning is transcription plus two extra crafts: cutting the text into readable pieces, and timing each piece to the voice. The raw material for both is a transcript with per-word timestamps.

Required timestamp data

Captioning needs to know when each word was spoken, so it begins with a transcript carrying per-word timestamps. Assume tokens like these, with millisecond start and end times:

[
  { "text": "we",      "start_ms": 1000, "end_ms": 1120 },
  { "text": "shipped", "start_ms": 1120, "end_ms": 1480 },
  { "text": "it",      "start_ms": 1480, "end_ms": 1600 },
  { "text": "on",      "start_ms": 1700, "end_ms": 1820 },
  { "text": "Friday",  "start_ms": 1820, "end_ms": 2300 }
]

Everything downstream is grouping these into lines and putting timecodes on the groups.

Caption segmentation

This stage separates good captions from bad. The constraints that govern where a caption breaks pull against each other.

Line length limits how much text fits: each line stays short, around 37 to 42 characters, and a caption rarely runs past two lines, so the text does not crowd out the picture. Reading speed sets how long a caption must stay up: a viewer reads only so fast, so the rate is capped near 17 characters per second for adults (broadcast and streaming style guides each set their own ceiling), with a floor of about one second so a short line does not flash by and a ceiling of about seven seconds so a long one does not loiter. Phrase boundaries decide where the break lands: at a natural seam, after a clause or punctuation mark, never mid-name and never between an article and its noun. If you split "the / contract" across two lines, the eye has to start the phrase twice.
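The reading-speed rule reduces to a little arithmetic. Here is a minimal sketch, assuming the thresholds quoted above (17 characters per second, a one-second floor, a seven-second ceiling). The function name and parameter names are illustrative, not taken from any style guide or library.

```python
import math

def required_display_ms(text: str, max_cps: float = 17.0,
                        floor_ms: int = 1000, ceiling_ms: int = 7000) -> int:
    """Return how long (in ms) a caption should stay up, clamped to the limits."""
    # Count what the viewer actually reads; a line break counts as one space.
    chars = len(text.replace("\n", " ").strip())
    # Time needed to read the text at the capped reading speed, rounded up.
    needed_ms = math.ceil(chars / max_cps * 1000)
    # Clamp: short lines are held for the floor, long ones stop at the ceiling.
    # Hitting the ceiling is a hint that the cue should be split instead.
    return max(floor_ms, min(needed_ms, ceiling_ms))
```

With these defaults, the 24-character line "We shipped it on Friday." needs a little over 1.4 seconds on screen, while a two-letter caption is still held for the full one-second floor.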

In code, segmentation is a dozen unremarkable lines that accumulate tokens and flush a cue whenever a limit trips. The craft is entirely in the thresholds. Subtitle quality even has its own metric, SubER, which scores segmentation and timing together with the words, precisely because word accuracy alone misses what makes captions readable [1].
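The accumulate-and-flush loop might look like the sketch below. It assumes tokens shaped like the JSON shown earlier (text, start_ms, end_ms) and produces single-line cues. The Cue class, the segment function, and every threshold are hypothetical choices for illustration. The only phrase boundary it respects is sentence-ending punctuation, so a production segmenter would need more linguistic care.

```python
from dataclasses import dataclass

@dataclass
class Cue:
    start_ms: int
    end_ms: int
    text: str

def segment(tokens, max_chars=42, max_gap_ms=500, max_duration_ms=7000):
    """Group timestamped tokens into cues, flushing whenever a limit trips."""
    cues = []
    current = []  # tokens collected for the cue being built

    def flush():
        if current:
            text = " ".join(t["text"] for t in current)
            cues.append(Cue(current[0]["start_ms"], current[-1]["end_ms"], text))
            current.clear()

    for tok in tokens:
        if current:
            length = len(" ".join(t["text"] for t in current)) + 1 + len(tok["text"])
            gap = tok["start_ms"] - current[-1]["end_ms"]
            duration = tok["end_ms"] - current[0]["start_ms"]
            ended = current[-1]["text"].endswith((".", "?", "!"))
            # Break before this token if adding it would overflow the line,
            # follow a long pause, overrun the duration cap, or start a new sentence.
            if length > max_chars or gap > max_gap_ms or duration > max_duration_ms or ended:
                flush()
        current.append(tok)
    flush()  # emit whatever is left
    return cues
```

Run on the five sample tokens with the defaults, this yields a single cue from 1000 ms to 2300 ms. Tightening max_chars or max_gap_ms splits it, which is where the tuning effort goes.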

Write SRT

SubRip (.srt) is one of the oldest and most widely supported caption formats. Each cue is a sequence number, a timecode line with a comma as the decimal separator, the text, and a blank line.

1
00:00:01,000 --> 00:00:02,300
We shipped it on Friday.

2
00:00:03,100 --> 00:00:05,400
It should arrive Monday morning.

The format is unforgiving, and players rarely report an error when you get it wrong. The arrow is --> with a space on each side, the milliseconds take a comma and not a dot, and the blank line between cues is mandatory. Get one of these wrong and a strict player silently drops the cue.
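Because the rules are mechanical, it is safer to generate SRT from code than by hand. The following is a minimal sketch, assuming cues arrive as (start_ms, end_ms, text) tuples. The names srt_timestamp and to_srt are illustrative, not part of any library.

```python
def srt_timestamp(ms: int) -> str:
    """Format milliseconds as HH:MM:SS,mmm (comma before the milliseconds)."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"

def to_srt(cues) -> str:
    """Serialize (start_ms, end_ms, text) tuples as SRT text."""
    blocks = []
    for index, (start_ms, end_ms, text) in enumerate(cues, start=1):
        # Index line, timecode line with a spaced arrow, then the cue text.
        blocks.append(
            f"{index}\n{srt_timestamp(start_ms)} --> {srt_timestamp(end_ms)}\n{text}\n"
        )
    # Joining with a newline leaves the mandatory blank line between cues.
    return "\n".join(blocks)
```

Calling to_srt([(1000, 2300, "We shipped it on Friday."), (3100, 5400, "It should arrive Monday morning.")]) reproduces the two-cue example above.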

Write VTT

WebVTT (.vtt) is the web standard, used by HTML5 video. A file starts with a WEBVTT header line, uses a dot as the decimal separator, makes the cue index optional, and adds optional cue settings for position and alignment.

WEBVTT

00:00:01.000 --> 00:00:02.300
We shipped it on Friday.

00:00:03.100 --> 00:00:05.400 line:90% align:center
It should arrive Monday morning.

The two formats are close enough that converting between them is mostly swapping the comma for a dot, adding or removing the header and indices, and deciding what to do with VTT's extra settings.
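As an illustration of how small the conversion is, here is a hedged sketch of the SRT to VTT direction. It assumes well-formed SRT input with cues separated by blank lines. The function name srt_to_vtt is hypothetical, and the reverse direction would also have to decide what to do with cue settings, which this sketch does not attempt.

```python
import re

# Matches an SRT timestamp such as 00:00:01,000 and captures both halves.
TIMECODE = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3})")

def srt_to_vtt(srt_text: str) -> str:
    """Convert SRT text to WebVTT: add the header, drop indices, swap comma for dot."""
    out = ["WEBVTT", ""]
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        # Drop the numeric cue index, which WebVTT does not require.
        if lines and lines[0].strip().isdigit():
            lines = lines[1:]
        if not lines:
            continue
        # Only the timing line is rewritten, so commas in the cue text survive.
        lines[0] = TIMECODE.sub(r"\1.\2", lines[0])
        out.extend(lines)
        out.append("")  # blank line closes the cue
    return "\n".join(out)
```

Restricting the substitution to the timing line is the one design choice worth noting: a global comma-to-dot replacement would also mangle any timestamp-like text a speaker happened to say.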

flowchart LR
  A[Timestamped<br/>tokens] --> B[Segment into<br/>readable cues]
  B --> C[Apply timing<br/>rules]
  C --> D[SRT or VTT]

The pipeline. The middle stage, segmentation, is where readability is won or lost.

Common formatting problems

Clean prose is the easy part. Whether captions ship usually comes down to how you handle the messy cases.

Speaker labels, usually a >> or a name prefix, mark who is talking when the picture does not make it obvious. Non-speech cues, "[music]" and "[laughter]," belong in captions for accessibility and get dropped from subtitles, where a hearing viewer does not need them. Overlapping speech forces a choice: when two people talk at once and both lines will not fit, someone's words get cut. And positioning matters the moment a caption would sit on top of a lower-third or burned-in text, which is what VTT's cue settings exist to fix.
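Dropping non-speech cues for a subtitle track can be sketched in a few lines. This assumes non-speech cues are always written in square brackets, as in the examples above. The name for_subtitles is illustrative, and a real pipeline would also need rules for speaker labels and overlapping speech.

```python
import re

# Anything in square brackets is treated as a non-speech cue (an assumption).
NON_SPEECH = re.compile(r"\[[^\]]*\]")

def for_subtitles(text: str) -> str:
    """Remove bracketed non-speech cues and tidy the leftover whitespace."""
    stripped = NON_SPEECH.sub("", text)
    # Collapse the gaps left behind where a cue was removed.
    return " ".join(stripped.split())
```

A cue that consisted only of "[applause]" comes back as an empty string, so the caller should drop empty cues instead of writing them to the file.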

Live captions

Everything above assumes a finished recording. Live captioning adds the constraints of streaming. You work from partial results that may still change, and you pick a presentation style. Roll-up captions scroll word by word as text firms up, which suits the provisional nature of live recognition. Pop-on captions show a complete line at once, which reads better but adds latency because you wait for the line to finalize. This is the trade-off between responsiveness and stability.
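The two styles can be contrasted with a small sketch. It assumes the recognizer delivers a stream of (text, is_final) pairs, where each partial result replaces the previous one until a final result arrives. That event shape and both function names are hypothetical and do not reflect any particular streaming API.

```python
def pop_on(events):
    """Yield only finalized lines: stable text, at the cost of waiting."""
    for text, is_final in events:
        if is_final:
            yield text

def roll_up(events, max_lines=2):
    """Yield the visible lines after every update: responsive, but provisional."""
    committed = []  # lines that have been finalized
    for text, is_final in events:
        # Show the most recent finalized lines plus the line still in progress.
        yield (committed + [text])[-max_lines:]
        if is_final:
            committed.append(text)

events = [
    ("we", False),
    ("we shipped", False),
    ("we shipped it", True),
    ("on", False),
    ("on Friday", True),
]
pop_on_frames = list(pop_on(events))    # two frames, one per finalized line
roll_up_frames = list(roll_up(events))  # five frames, one per update
```

For the same five events, pop-on redraws the screen twice and roll-up five times, which makes the responsiveness-versus-stability trade concrete.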

Common questions

What is the difference between captions and subtitles?

Captions are in the same language as the audio and include non-speech sounds like "[applause]," serving viewers who cannot hear. Subtitles assume the viewer can hear and usually translate into another language. The file formats, SRT and VTT, are the same. The content conventions differ.

Should I use SRT or VTT?

VTT for the web: HTML5 video supports it natively and it allows positioning and styling. SRT for the widest compatibility with players, editors, and platforms that expect the older format. They carry the same timed text and convert easily. The differences are the decimal separator, the header, and VTT's optional cue settings.

Why do my captions feel too fast to read?

Most likely they were cut wherever the recognizer's output happened to break, not according to reading speed. A caption must stay on screen long enough to read: the rate is capped near 17 characters per second, with a minimum duration of about a second. Long lines that change too quickly outrun the viewer even when every word is correct.

How do I make live captions if the text keeps changing?

Work from partial results and pick a presentation style. Roll-up captions scroll as words firm up and tolerate the provisional text. Pop-on captions wait for a complete line, which reads better but adds delay. Either way you trade responsiveness against stability, the same trade-off as all streaming recognition.

References

[1] Wilken, P., Georgakopoulou, P., & Matusov, E. (2022). SubER: A Metric for Automatic Evaluation of Subtitle Quality. arXiv preprint arXiv:2205.05805.