AI dubbing: how automated voice-over works

In September 2023, Spotify began dubbing podcasts into Spanish, French, and German using an AI recreation of the host's own voice rather than a voice actor. A listener in Madrid could hear an American host speaking fluent Spanish in a voice unmistakably theirs. The episodes were translated, re-voiced, and shipped without anyone re-entering a booth.

Dubbing concentrates almost every hard problem in voice AI into one product, and the loudest question it raised was about consent rather than engineering.

AI dubbing pipeline

If you run speech-to-speech translation on recorded media instead of a live conversation, the priorities shift. The work is offline, which trades latency for quality and buys time per line. The base chain is familiar: recognize the source, translate it, synthesize the target. What sits on top are demands a live conversation never faces.
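The base chain can be sketched as three stages applied to each line while keeping the line's original time slot. This is a minimal sketch, not any vendor's implementation: `recognize`, `translate`, and `synthesize` are hypothetical stand-ins for whatever speech recognition, machine translation, and speech synthesis systems are in use, and the toy lambdas exist only so the example runs.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Line:
    start: float   # seconds into the source recording
    end: float
    audio: bytes   # source audio for this line


@dataclass
class DubbedLine:
    start: float
    end: float
    source_text: str
    target_text: str
    audio: bytes   # synthesized target-language audio


def dub(lines: List[Line],
        recognize: Callable[[bytes], str],
        translate: Callable[[str], str],
        synthesize: Callable[[str], bytes]) -> List[DubbedLine]:
    """Run recognize -> translate -> synthesize per line, keeping its slot."""
    out = []
    for line in lines:
        text = recognize(line.audio)
        target = translate(text)
        out.append(DubbedLine(line.start, line.end, text, target,
                              synthesize(target)))
    return out


# Toy stand-ins so the sketch runs; real stages would call models.
toy_lexicon = {"hello": "hola"}
result = dub(
    [Line(0.0, 1.5, b"hello")],
    recognize=lambda audio: audio.decode(),
    translate=lambda text: toy_lexicon.get(text, text),
    synthesize=lambda text: text.encode(),
)
```

Because the work is offline, each stage can be rerun or swapped per line without affecting the others, which is where the extra time per line gets spent.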

Voice preservation is the headline. If you replace a distinctive host with a generic voice, you lose the thing audiences came for. Carrying the speaker's vocal identity across languages, so the Spanish output sounds like the same person, is called cross-lingual voice transfer, an application of voice cloning.^[1] That is what made the Spotify pilot feel new.

Multiple speakers multiply the work. A film or interview has several voices, so the pipeline separates them with diarization, translates each, and re-voices each in a distinct, matching voice, so the listener can still tell who said what across the language change.
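One way to picture the multi-speaker step is as a mapping from diarization labels to target voices, applied in timeline order. The sketch below assumes diarization and translation have already run; the `Segment` fields, the speaker labels, and the voice identifiers are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Segment:
    speaker: str   # label from a diarization step, e.g. "S1" (assumed format)
    start: float   # seconds
    end: float
    text: str      # already-translated text for this segment


def assign_voices(segments: List[Segment],
                  voices: Dict[str, str]) -> List[Tuple[float, float, str, str]]:
    """Pair each segment with its speaker's target voice, in timeline order.

    Raises KeyError if a speaker has no assigned voice, rather than silently
    falling back to a shared voice that would blur who said what.
    """
    plan = []
    for seg in sorted(segments, key=lambda s: s.start):
        plan.append((seg.start, seg.end, voices[seg.speaker], seg.text))
    return plan


segments = [
    Segment("S2", 2.0, 3.5, "Gracias."),
    Segment("S1", 0.0, 1.8, "Bienvenidos."),
]
voices = {"S1": "host_clone_es", "S2": "guest_clone_es"}
plan = assign_voices(segments, voices)
```

Failing loudly on an unmapped speaker is a design choice in this sketch: a missing voice is treated as an error to fix, not something to paper over.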

And timing is the demand nobody anticipates until they try it.

Matching the original timing

Languages do not take the same time to say the same thing. A line that runs three seconds in English might need four in Spanish or two in Japanese, and the new audio has to fit the slot the old audio left behind. This constraint is called isochrony. Traditional dubbing studios have wrestled with it for decades, rewriting lines to fit the time and the lip movements on screen.

AI dubbing inherits the whole problem. The translated speech gets compressed or stretched to match the original's duration, by adjusting the speaking rate or picking a translation that fits the slot, all without sounding rushed or dragged. If the fit fails, the dub drifts out of sync with the picture: the classic badly-dubbed-movie effect, now produced by a model instead of a tight deadline.
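The two levers described above, adjusting the speaking rate and choosing a translation that fits, can be combined in a few lines. This is a sketch under stated assumptions: the rate limits `MIN_RATE` and `MAX_RATE` are illustrative values rather than measured thresholds, and the candidate durations are assumed to come from some duration estimate of the synthesized line at normal speed.

```python
from typing import List, Optional, Tuple

# Assumed limits on how far speech can be sped up or slowed down before it
# sounds rushed or dragged. Illustrative values, not measured thresholds.
MIN_RATE = 0.85
MAX_RATE = 1.20


def rate_needed(natural_duration: float, slot: float) -> float:
    """Speaking-rate multiplier that makes the line last exactly `slot` seconds."""
    return natural_duration / slot


def pick_candidate(candidates: List[Tuple[str, float]],
                   slot: float) -> Optional[Tuple[str, float]]:
    """Choose the translation needing the least rate change, or None if none fits.

    `candidates` pairs each translation with its estimated duration in seconds
    at a normal speaking rate.
    """
    best = None
    for text, duration in candidates:
        rate = rate_needed(duration, slot)
        if not (MIN_RATE <= rate <= MAX_RATE):
            continue  # would sound rushed or dragged
        if best is None or abs(rate - 1.0) < abs(best[1] - 1.0):
            best = (text, rate)
    return best


slot = 3.0  # seconds the original line occupied
candidates = [
    ("long literal translation", 4.0),  # would need about 1.33x: too fast
    ("tighter paraphrase", 3.3),        # would need about 1.10x: acceptable
]
choice = pick_candidate(candidates, slot)
```

When `pick_candidate` returns `None`, no candidate fits within the assumed limits, which is the point at which a line would need to be rewritten, much as a human dubbing writer would do.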

Lip synchronization

Harder than timing are the lips. A dub can be perfectly translated and perfectly timed and still look wrong, because the speaker's mouth is making English shapes while Spanish comes out of it.

Two responses exist. The audio side fits the speech to the existing visemes as closely as timing allows. The visual side, pursued by systems like Flawless AI's TrueSync, alters the video so the mouth movements match the new language, editing the picture to fit the dub rather than the dub to fit the picture. The second is more convincing, and more unsettling, because the recording no longer shows what the person's mouth actually did.

Consent and the Bourdain controversy

Two years before the Spotify pilot, AI dubbing's central problem arrived as a controversy. The 2021 documentary Roadrunner, about Anthony Bourdain, who had died in 2018, used an AI model trained on his voice to generate narration of lines he had written but never spoken aloud. When the director disclosed it, the reaction was sharp, and it had nothing to do with audio quality.

The objection was consent. Bourdain could not agree to be made to "say" words he never said, and listeners had not been told which lines were real.

Re-voicing a living, consenting host in their own voice, as in the Spotify pilot, is one thing. Synthesizing a dead person's voice without their agreement is another, and the technology does not distinguish them. The responsible position is the same one voice cloning and watermarking converge on: consent of the voice's owner, disclosure that the audio is synthetic, and provenance you can verify. For dubbing it also reaches the livelihoods of voice and dubbing actors, which is why digital-replica consent terms were a flashpoint in the 2023 entertainment-industry labor agreements.

Cheap dubbing changes what gets dubbed

AI dubbing collapses something slow and expensive, a studio and voice actors and weeks per language, into something cheap enough to dub the long tail of content that was never worth dubbing before. The capability is real and improving fast.

Whether a given dub is acceptable rather than merely possible turns on timing, performance, and above all consent. The pipeline solves recognition, translation, and synthesis. The questions that remain, who agreed to be re-voiced and who was told, sit outside it.

Common questions

How does AI dubbing work?

It runs the speech-to-speech translation chain (recognize, translate, synthesize) on recorded media, then adds three demands a live conversation never faces: voice preservation, multiple speakers, and timing. Offline processing makes it possible, because it buys time per line that a live conversation does not have.

Can AI dubbing keep the original speaker's voice?

Yes, through cross-lingual voice transfer, an application of voice cloning that carries vocal identity across languages. It made Spotify's 2023 pilot feel new, and it is the one feature that is only acceptable with the speaker's consent.

Why do dubbed translations sometimes go out of sync?

Because the same sentence can vary by 30 percent or more in spoken duration across languages, yet the dub has to fit the slot the original occupied. This fitting constraint is isochrony, and it is specific to dubbing because the picture is already cut and cannot stretch to accommodate the translation.

Is AI dubbing ethical?

It depends on consent and disclosure, because the technology cannot tell a consenting host from a dead man who never agreed. Re-voicing a living, informed speaker is widely accepted; the Bourdain documentary was not. The same consent question reaches the livelihoods of voice and dubbing actors, which made digital-replica consent a flashpoint in the 2023 entertainment-industry labor agreements. The line is drawn by the people running the system, not by the pipeline.

References

[1] Jia, Y., Ramanovich, M. T., Remez, T., & Pomerantz, R. (2022). Translatotron 2: High-quality direct speech-to-speech translation with voice preservation. arXiv preprint arXiv:2107.08661.