In September 2023, Spotify began dubbing podcasts into Spanish, French, and German in the host's own voice, recreated by AI, rather than a voice actor's. A listener in Madrid could hear an American host speaking fluent Spanish in a voice unmistakably theirs. The episodes were translated, re-voiced, and shipped without anyone re-entering a booth.
Dubbing concentrates almost every hard problem in voice AI into one product, and the loudest question it raised was about consent rather than engineering.
AI dubbing pipeline
Run speech-to-speech translation on recorded media instead of a live conversation and the priorities shift. The work is offline, which trades latency for quality and buys time per line. The base chain is familiar: recognize the source, translate it, synthesize the target. Three demands then sit on top that conversation never faces.
The first is voice preservation. Replace a distinctive host with a generic voice and you lose the thing audiences came for. Carrying the speaker's vocal identity across languages, so the Spanish output sounds like the same person, is a form of voice cloning known as cross-lingual voice transfer. That is what made the Spotify pilot feel new.
The second is multiple speakers. A film or interview has several voices. The pipeline separates them with diarization, translates each, and re-voices each in a distinct, matching voice, so the listener can still tell who said what across the language change.
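A minimal sketch of the first two demands, assuming diarization and recognition have already produced labeled segments. The `Segment` and `DubbedLine` types, the `translate` callable, and the `voices` mapping are all hypothetical stand-ins for illustration, not the API of any real dubbing system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Segment:
    speaker: str  # diarization label (hypothetical), e.g. "host"
    start: float  # seconds into the source recording
    end: float
    text: str     # recognized source-language text


@dataclass
class DubbedLine:
    speaker: str
    start: float
    end: float
    text: str      # translated text
    voice_id: str  # cloned voice that should speak this line


def dub_segments(
    segments: List[Segment],
    translate: Callable[[str], str],
    voices: Dict[str, str],
) -> List[DubbedLine]:
    """Translate each diarized segment and pair it with its speaker's voice."""
    lines = []
    for seg in sorted(segments, key=lambda s: s.start):
        if seg.speaker not in voices:
            # Refuse to fall back to a generic voice: that would lose
            # the speaker identity the dub is meant to preserve.
            raise KeyError(f"no voice registered for speaker {seg.speaker!r}")
        lines.append(
            DubbedLine(
                speaker=seg.speaker,
                start=seg.start,
                end=seg.end,
                text=translate(seg.text),
                voice_id=voices[seg.speaker],
            )
        )
    return lines


if __name__ == "__main__":
    toy_dictionary = {"hello": "hola", "thanks": "gracias"}
    segments = [
        Segment("guest", 2.0, 4.0, "thanks"),
        Segment("host", 0.0, 2.0, "hello"),
    ]
    voices = {"host": "voice-host", "guest": "voice-guest"}
    for line in dub_segments(segments, toy_dictionary.__getitem__, voices):
        print(line)
```

Each output line keeps the original speaker label and time slot, which is what lets a later stage synthesize it in the matching voice and place it back where the source line was.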
The third is the one nobody anticipates until they try it: timing.
Matching the original timing
Languages do not take the same length of time to say the same thing. A line that runs three seconds in English might need four in Spanish or two in Japanese, and the new audio has to fit the slot the old audio left behind. Matching the original's timing this way is called isochrony. Traditional dubbing studios have wrestled with it for decades, rewriting lines to fit the time and the lip movements on screen.
AI dubbing inherits the whole problem. The translated speech gets compressed or stretched to match the original's duration, by adjusting the speaking rate, trimming, or picking a translation that fits, all without sounding rushed or dragged. Get it wrong and the dub drifts out of sync with the picture: the classic badly-dubbed-movie effect, now produced by a model instead of a tight deadline.
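The fitting step can be sketched as a small selection problem: given several candidate translations with estimated spoken durations, pick the one that needs the least rate adjustment to fill the slot. The function names and the rate limits of 0.85 and 1.2 are illustrative assumptions, not values from any particular system.

```python
from typing import List, Optional, Tuple


def rate_factor(natural_duration: float, slot_duration: float) -> float:
    """Speaking-rate multiplier needed to fit the slot.

    Above 1.0 the speech must be sped up, below 1.0 slowed down.
    """
    if natural_duration <= 0 or slot_duration <= 0:
        raise ValueError("durations must be positive")
    return natural_duration / slot_duration


def pick_translation(
    candidates: List[Tuple[str, float]],
    slot_duration: float,
    min_rate: float = 0.85,  # assumed floor before speech sounds dragged
    max_rate: float = 1.2,   # assumed ceiling before speech sounds rushed
) -> Optional[Tuple[str, float]]:
    """Return (text, rate) for the candidate closest to natural pace.

    `candidates` holds (translated text, estimated duration in seconds).
    Returns None when no candidate fits within the rate limits, which
    signals that the line needs rewriting rather than more stretching.
    """
    best = None
    for text, duration in candidates:
        rate = rate_factor(duration, slot_duration)
        if not (min_rate <= rate <= max_rate):
            continue
        if best is None or abs(rate - 1.0) < abs(best[1] - 1.0):
            best = (text, rate)
    return best


if __name__ == "__main__":
    # A three-second English slot with two hypothetical Spanish renderings.
    options = [("longer rendering", 4.0), ("tighter rendering", 3.3)]
    print(pick_translation(options, slot_duration=3.0))
```

With a three-second slot, the four-second rendering would need a rate of about 1.33 and is rejected under these assumed limits, while the 3.3-second rendering fits at roughly 1.1.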
Lip synchronization
Harder than timing are the lips. A dub can be perfectly translated and perfectly timed and still look wrong, because the speaker's mouth is making English shapes while Spanish comes out of it.
Two responses exist. The audio side fits the speech to the existing visemes, the visible mouth shapes that correspond to speech sounds, as closely as timing allows. The visual side, pursued by systems like Flawless AI's TrueSync, alters the video so the mouth movements match the new language, editing the picture to fit the dub rather than the dub to fit the picture. The second is more convincing, and more unsettling, because the recording no longer shows what the person's mouth actually did.
Consent requirements
Two years before the Spotify pilot, AI dubbing's central problem arrived as a controversy. The 2021 documentary Roadrunner, about Anthony Bourdain, who had died in 2018, used an AI model trained on his voice to generate narration of lines he had written but never spoken aloud. When the director disclosed it, the reaction was sharp, and it had nothing to do with audio quality.
The objection was consent. Bourdain could not agree to be made to "say" words he never said, and listeners had not been told which lines were real.
Re-voicing a living, consenting host in their own voice, as in the Spotify pilot, is one thing. Synthesizing a dead person's voice without their agreement is another, and the technology does not distinguish them. The responsible position is the same one voice cloning and watermarking converge on: consent of the voice's owner, disclosure that the audio is synthetic, and provenance you can verify. For dubbing it also reaches the livelihoods of voice and dubbing actors, which is why digital-replica consent terms were a flashpoint in the 2023 entertainment-industry labor agreements.
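Because the pipeline itself cannot tell these cases apart, the three conditions have to be enforced around it. One hypothetical way to do that is a pre-release check like the sketch below. The `DubRequest` fields and the `release_blockers` function are assumptions made for illustration, and a real system would verify consent and provenance against actual records rather than trust flags.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DubRequest:
    speaker: str
    owner_consented: bool             # the voice's owner agreed to re-voicing
    labeled_synthetic: bool           # the audience is told the audio is synthetic
    provenance_record: Optional[str]  # e.g. an ID for a verifiable manifest


def release_blockers(request: DubRequest) -> List[str]:
    """Return the reasons a dub should not ship; an empty list means none."""
    blockers = []
    if not request.owner_consented:
        blockers.append("no consent from the voice's owner")
    if not request.labeled_synthetic:
        blockers.append("audio is not disclosed as synthetic")
    if not request.provenance_record:
        blockers.append("no verifiable provenance record")
    return blockers


if __name__ == "__main__":
    pilot_style = DubRequest("living host", True, True, "manifest-001")
    posthumous = DubRequest("deceased narrator", False, False, None)
    print(release_blockers(pilot_style))
    print(release_blockers(posthumous))
```

Returning every unmet condition, rather than stopping at the first, makes it clear to whoever runs the system exactly what is missing before a dub is published.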
Current limitations
AI dubbing will do to localization what it is doing to translation. It collapses something slow and expensive, a studio and voice actors and weeks per language, into something cheap enough to dub the long tail of content that was never worth dubbing before. The capability is real and improving fast.
Whether a given dub is acceptable rather than merely possible turns on timing, performance, and above all consent. The pipeline solves recognition, translation, and synthesis. The questions that remain, who agreed to be re-voiced and who was told, sit outside it.
Common questions
How does AI dubbing work?
It runs the speech-to-speech translation chain (recognize, translate, synthesize) on recorded media, then handles three demands that live conversation never faces: voice preservation, multiple speakers, and timing. Offline processing makes this possible, because it buys time per line that a live conversation does not have.
Can AI dubbing keep the original speaker's voice?
Yes, through cross-lingual voice transfer, the form of voice cloning that carries vocal identity across languages. It made Spotify's 2023 pilot feel new, and it is the one feature that is only acceptable with the speaker's consent.
Why do dubbed translations sometimes go out of sync?
Because the same sentence can vary by 30 percent or more in spoken duration across languages, yet the dub has to fit the slot the original occupied. This fitting problem, isochrony, is specific to dubbing because the picture is already cut and cannot stretch to accommodate the translation.
Is AI dubbing ethical?
It depends on consent and disclosure, because the technology cannot tell a consenting host from a dead man who never agreed. Re-voicing a living, informed speaker is widely accepted; the Bourdain documentary was not, and the same labor concern made digital-replica consent a flashpoint in the 2023 entertainment-industry agreements. The line is drawn by the people running the system, not by the pipeline.
Related concepts
- Speech-to-speech translation
- Voice cloning
- Audio watermarking and deepfakes
- Speaker diarization
- Multilingual TTS
References
- Jia, Y., Ramanovich, M. T., Remez, T., & Pomerantz, R. (2022). Translatotron 2: High-quality direct speech-to-speech translation with voice preservation. arXiv preprint arXiv:2107.08661.