What is voice cloning? How it works, and where consent comes in

On January 21, 2024, two days before the New Hampshire primary, thousands of voters answered the phone to hear what sounded exactly like President Biden telling Democrats not to vote, to "save your vote for November." It was not him. It was a cloned voice, reportedly generated with a commercial tool for a trivial sum, commissioned by a political consultant who later admitted it.^[6] Within weeks, the US Federal Communications Commission ruled that AI-generated voices in robocalls are illegal under existing law.^[5]

The episode illustrates the issues the field now debates: how easy cloning has become, how convincing it is, how cheaply it scales, and why the central problem is no longer technical.

Voice-cloning methods

A neural TTS system already separates what is said from who says it. The "who" is a speaker embedding, a compact set of numbers describing a voice's timbre and habits, and the model can be steered to any point in that space.^[4] Cloning is finding the point that corresponds to a particular real person.
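
As a rough illustration of what "a point in that space" means, the sketch below treats each speaker embedding as a short list of numbers and compares voices by cosine similarity. The vectors, speaker names, and function names are invented for the example; real embeddings typically have hundreds of dimensions and come from a trained speaker encoder.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def closest_voice(query, known_voices):
    """Return the name of the known voice most similar to the query embedding."""
    return max(known_voices, key=lambda name: cosine_similarity(query, known_voices[name]))

# Hypothetical 4-dimensional embeddings (made-up numbers).
known_voices = {
    "speaker_a": [0.9, 0.1, 0.3, 0.0],
    "speaker_b": [0.1, 0.8, 0.0, 0.5],
}
# An embedding extracted from a new clip (also made up).
query = [0.85, 0.15, 0.25, 0.05]

print(closest_voice(query, known_voices))
```

In this picture, cloning amounts to estimating the query vector for a specific person and then asking the synthesizer to speak from that point.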

There are two ways to get there. Zero-shot cloning, the kind that alarmed everyone around 2023, feeds the model a short reference clip and asks it to imitate that voice immediately, with no training. A few seconds is enough to capture the timbre. Fine-tuning takes more audio, minutes to hours, and adjusts the model toward the target voice, producing a more faithful and stable result that holds up across long passages where a zero-shot copy might drift.^[1]^[2]^[3]

The token-based models described in how neural TTS works made this natural: prompt the model with a sample the way you prompt a language model with an example, and it continues in that voice. Cloning stopped being a special pipeline and became a default capability of the architecture.
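
The prompting idea can be sketched without a real model. In the toy below, the model input is just a sequence: the text to be spoken, a separator, and audio tokens from the reference clip, after which a token-based model would be asked to continue the audio stream. The token values, the separator, and the function name are all assumptions made for illustration; real systems use a neural audio codec and a trained model, and their exact prompt layout varies.

```python
SEP = "<sep>"  # hypothetical separator token

def build_cloning_prompt(target_text_tokens, reference_audio_tokens):
    """Assemble a zero-shot prompt: target text, then the reference clip's audio tokens.

    A token-based TTS model would continue this sequence with new audio
    tokens in the reference speaker's voice. No model is called here.
    """
    return list(target_text_tokens) + [SEP] + list(reference_audio_tokens)

# Made-up codec tokens standing in for a few seconds of reference speech.
reference_audio_tokens = [412, 87, 903, 15]
# Made-up text tokens for the sentence to be spoken.
target_text_tokens = ["h", "e", "l", "l", "o"]

prompt = build_cloning_prompt(target_text_tokens, reference_audio_tokens)
print(prompt)
```

Nothing in this layout is specific to one speaker, which is why no training step is needed: swapping the reference tokens swaps the voice.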

	Zero-shot cloning	Fine-tuning
Audio needed	Seconds	Minutes to hours
Training	None; the clip is a prompt	The model is adjusted to the voice
Fidelity	Recognizable timbre	Faithful and stable
Where it drifts	Long passages	Rarely

Two routes to the same voice. The clip-as-prompt route is what made cloning cheap.

Beneficial and harmful uses

What makes cloning hard to govern is that the beneficial and harmful uses are the same technique, distinguished only by consent.

Several uses are legitimate. In 2021, the actor Val Kilmer, who had lost his speaking voice to throat cancer, had it rebuilt from old recordings, letting him perform again. People with degenerative conditions like ALS now "bank" their voice while they still can, so a synthetic version can speak for them later.^[7] Localization studios recreate an actor's voice across languages.^[8] Each is cloning with the subject's knowledge and permission.

The same technique that restores a voice to someone who lost it can also impersonate a candidate, or a CEO authorizing a wire transfer, and the "grandparent scam" now arrives with a real-sounding grandchild in fake distress. Nonconsensual cloning of public figures and private people alike has become cheap and fast.

What separates the two is whether the person being cloned agreed, something the audio itself does not reveal.

For most of TTS history, the hard problem was quality: making a voice sound human. Cloning solved enough of that to move the field's central question from whether a voice can be copied to whether it should be, and how to prove who did it. Two responses are emerging, one legal and one technical.

The legal response is moving quickly. Beyond the FCC's robocall ruling, Tennessee passed the ELVIS Act in 2024 to protect a person's voice as a property right,^[9] US lawmakers proposed federal "right of voice" legislation, the European Union's AI Act requires disclosure of AI-generated and manipulated media, and the 2023 Hollywood labor agreements added explicit consent terms for digital voice replicas.^[11] Some of these treat a voice as property that may not be copied without the owner's consent; others require that synthetic audio be disclosed.

The technical response is provenance: marking synthetic audio so it can be identified as synthetic, and detecting clones that are not marked. Both are difficult because a clone is designed to be indistinguishable from the real voice; the topic is covered in audio watermarking and deepfakes. Watermarking embeds an inaudible signal at generation time; detection tries to spot clones after the fact.^[10]^[12]^[13]^[14] Neither is solved, and as the robocall case showed, a fake can do its work before it is recognized.
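
To make the watermarking half concrete, here is a minimal sketch of one classic approach, a keyed spread-spectrum mark: a low-amplitude pseudorandom sequence is added at generation time and later detected by correlation. This is a toy under stated assumptions, not how any particular product works. The key, the strength value, the threshold, and the sine tone standing in for speech are all invented, and a real watermark must also survive compression, resampling, and deliberate removal, which this one would not.

```python
import math
import random

def watermark_sequence(key, n):
    """Pseudorandom +1/-1 sequence derived from a secret key."""
    rng = random.Random(key)
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]

def embed(samples, key, strength=0.02):
    """Add a low-amplitude keyed sequence to the audio samples."""
    mark = watermark_sequence(key, len(samples))
    return [s + strength * m for s, m in zip(samples, mark)]

def detect(samples, key, strength=0.02):
    """Correlate the audio with the keyed sequence.

    Marked audio scores near `strength`; unmarked audio scores near zero,
    so half of `strength` is used as the decision threshold.
    """
    mark = watermark_sequence(key, len(samples))
    score = sum(s * m for s, m in zip(samples, mark)) / len(samples)
    return score > strength / 2

# Stand-in for generated speech: three seconds of a 220 Hz tone at 16 kHz.
sample_rate = 16000
audio = [0.5 * math.sin(2 * math.pi * 220 * t / sample_rate)
         for t in range(3 * sample_rate)]

marked = embed(audio, key="generator-secret")
print(detect(marked, key="generator-secret"))  # marked audio, correct key
print(detect(audio, key="generator-secret"))   # unmarked audio
```

The sketch also shows the limit of the approach: it only identifies audio from generators that chose to embed a mark, which is why detection of unmarked clones remains a separate problem.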

Deployment requirements

The responsible end of the industry has converged on a stance: cloning a voice requires the consent of its owner, and synthetic speech should be disclosed and ideally watermarked. Systems, in turn, should make nonconsensual cloning harder rather than frictionless. This is increasingly a matter of law as well as ethics, and the legal exposure from cloning someone without permission is now real.

For most products the practical implication is to use licensed preset voices or voices you have explicit permission to clone, disclose that the audio is synthetic, and keep provenance.
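
One way a product might wire these three requirements together is sketched below: a consent check that runs before any audio is released, a disclosure flag, and a hash of the audio kept as a provenance record. The `ConsentRecord` fields, the function name, and the manifest layout are hypothetical, chosen only for illustration; they do not follow any particular standard or legal requirement.

```python
import hashlib
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class ConsentRecord:
    speaker_id: str
    granted_by_speaker: bool
    allowed_uses: tuple

def provenance_manifest(audio_bytes, speaker_id, use, consent):
    """Refuse to proceed without matching consent; otherwise return a disclosure record."""
    if consent is None or consent.speaker_id != speaker_id:
        raise PermissionError("no consent record for this speaker")
    if not consent.granted_by_speaker or use not in consent.allowed_uses:
        raise PermissionError("consent does not cover this use")
    return {
        "synthetic": True,  # disclosure flag shipped with the audio
        "speaker_id": speaker_id,
        "use": use,
        "audio_sha256": hashlib.sha256(audio_bytes).hexdigest(),
    }

# Hypothetical consent on file: this speaker agreed to audiobook use only.
consent = ConsentRecord("speaker_a", True, ("audiobook",))
manifest = provenance_manifest(b"fake-audio-bytes", "speaker_a", "audiobook", consent)
print(json.dumps(manifest, indent=2))
```

Requesting a use the speaker did not agree to, such as `"advertising"` here, raises `PermissionError` instead of returning a manifest, so the refusal happens in code rather than in policy text alone.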

Common questions

How much audio does it take to clone a voice?

A few seconds is enough for zero-shot cloning to capture a recognizable timbre, which makes short-call fraud feasible. A faithful, stable clone that holds up over long passages takes more: minutes to hours of audio, used to fine-tune the model. The short clip copies how a voice sounds without capturing the finer habits that make a long performance convincing.

Is voice cloning legal?

It depends on consent and use. Cloning your own voice, or someone's with their permission, for disclosed purposes is fine. Cloning a person without consent, especially to impersonate them, is increasingly illegal: the US FCC banned AI voices in robocalls, Tennessee's ELVIS Act protects voices as property, and the EU AI Act requires disclosure of synthetic media. The law is moving quickly.

What are legitimate uses of voice cloning?

Restoring a voice lost to illness or injury, "banking" a voice before a degenerative condition takes it, recreating an actor's voice across languages for localization, and personal or branded voices used with permission. Each relies on the subject's informed consent, the factor that separates these uses from impersonation.

Can cloned voices be detected?

Not reliably, because a good clone is built to be indistinguishable. There are two technical responses: watermarking, which embeds an inaudible marker when the audio is generated, and detection, which tries to spot clones after the fact. Both are active but imperfect areas of work, covered in audio watermarking and deepfakes.

References

[1] Pokhrel, K., et al. (2026). VOICY: A Privacy-Centric Modular Architecture for Zero-Shot Voice Cloning and Fine-Tuned Speech Synthesis. Proceedings of the ACM.
[2] Gorodetskii, A., & Ozhiganov, I. (2022). Zero-shot long-form voice cloning with dynamic convolution attention. arXiv preprint arXiv:2201.10375.
[3] Voice cloning: Comprehensive survey. arXiv preprint arXiv:2505.00579 (2025).
[4] Jia, Y., et al. (2018). Transfer learning from speaker verification to multispeaker text-to-speech synthesis. Advances in Neural Information Processing Systems, 31.
[5] Federal Communications Commission (2024). Declaratory Ruling FCC 24-17. Federal Communications Commission.
[6] Department of Justice, New Hampshire (2024). Steven Kramer Charged with Voter Suppression Over AI-Generated President Biden Robocalls. New Hampshire Department of Justice.
[7] Bunnell, H. T., et al. (2017). The ModelTalker Project: A Web-Based Voice Banking Pipeline for ALS/MND Patients. Interspeech 2017.
[8] Exploring automated voice casting for content localization using deep learning. IEEE Access (2021).
[9] Why Tennessee's ELVIS Act is the king of artificial intelligence protections. Vanderbilt Journal of Entertainment & Technology Law.
[10] Knott, A., et al. (2024). AI content detection in the emerging information ecosystem: new obligations for media and tech companies. Ethics and Information Technology.
[11] Shetler, K. (2024). AI and consent: What the SAG-AFTRA and WGA agreements tell us about the future of generative AI. Seton Hall University Student Scholarship.
[12] Human perception of audio deepfakes. Proceedings of the 30th ACM International Conference on Multimedia (2022).
[13] Wen, Y., et al. (2025). SoK: How robust is audio watermarking in generative AI models? arXiv preprint arXiv:2503.19176.
[14] Media Integrity and Authentication: Status, Directions, and Futures. arXiv preprint arXiv:2602.18681 (2026).