Multilingual TTS and language mixing

Speaker identity, pronunciation, and code-switching across languages

Updated June 29, 2026

Consider the sentence "I'll email François the São Paulo figures by Tuesday" and the two names in it. A weak multilingual voice flattens them into the surrounding accent, "Fran-coys," "Sao Paolo," or lurches into a full French accent for one word and back. Either way it fails to apply the name's own pronunciation, which holds regardless of the sentence around it.

That sentence is ordinary. Real speech is full of borrowed words and foreign names, and handling them is most of what separates a genuinely multilingual voice from one that supports several languages one at a time.

Pronunciation of borrowed words

English brand names appear in German sentences, French culinary terms in English ones, anglicisms everywhere. The voice has to decide: pronounce the loanword the way the source language would, or the way an assimilated speaker of the host language would?

What went wrong: there is no single right answer, and a naive system picks one rule and applies it everywhere. A German speaker says "Computer" with a German accent, not an American one, so reading every English word in full American English sounds wrong in a German sentence; but reading a brand meant to sound English with a heavy German accent sounds wrong too. The correct pronunciation of a loanword depends on how nativized it is, which is knowledge about usage, not spelling, and the model often does not have it.[4][5]

Pronunciation of foreign names

Names are the sharpest case because they resist the host language's rules entirely. "Siobhan," "Nguyen," "Xiomara," and "São Paulo" do not follow English spelling-to-sound rules, and should not be bent to them.

What went wrong: the voice applied the surrounding language's grapheme-to-phoneme rules to a word that does not obey them. This is the same failure that makes names hard in any TTS, amplified because the name comes from a different language. "São Paulo" read as if it were an English word is unrecognizable to anyone who knows the place. This is the case that most often needs explicit pronunciation control.[6][7][8][9]
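One common way to express explicit pronunciation control is the SSML phoneme element. The sketch below is illustrative only: it assumes the target engine accepts SSML with IPA (support varies by engine), the names NAME_IPA and add_phoneme_tags are hypothetical, and the IPA strings are approximations that should be checked against a trusted source before use.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical lexicon. The IPA transcriptions are illustrative approximations.
NAME_IPA = {
    "Siobhan": "ʃɪˈvɔːn",
    "São Paulo": "sɐ̃w ˈpawlu",
    "François": "fʁɑ̃swa",
}


def add_phoneme_tags(text, lexicon=NAME_IPA):
    """Wrap known names in SSML <phoneme> tags and escape everything else."""
    if not lexicon:
        return "<speak>" + escape(text) + "</speak>"
    # Longest names first so multi-word entries win over shorter ones.
    names = sorted(lexicon, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in names)
    # Lookarounds keep a name from matching inside a longer word.
    pattern = re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)")

    parts = []
    pos = 0
    for match in pattern.finditer(text):
        parts.append(escape(text[pos:match.start()]))
        name = match.group(0)
        parts.append(
            "<phoneme alphabet=\"ipa\" ph=" + quoteattr(lexicon[name]) + ">"
            + escape(name) + "</phoneme>"
        )
        pos = match.end()
    parts.append(escape(text[pos:]))
    return "<speak>" + "".join(parts) + "</speak>"


print(add_phoneme_tags("I'll email François the São Paulo figures by Tuesday"))
```

The lexicon approach only covers names someone has listed in advance. Anything not in the table still falls back to the engine's default grapheme-to-phoneme rules.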

Code-switched sentences

Some speakers mix languages within a single utterance as a matter of course: Hindi and English in "Hinglish," Spanish and English, Tagalog and English. "Main office mein meeting hai" is one sentence in two languages, and the speaker did not pause at the boundary.

What went wrong: a TTS that assumes one language per utterance cannot represent this. It reads the whole sentence with one language's sound system and mangles the other half. To do better, a system must detect the switch points and change phonetic systems mid-sentence while keeping the voice continuous. This is the synthesis mirror of code-switching in recognition, and it is hard because the model must apply two sets of pronunciation rules to one breath of speech.[1][2][3]
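To make "detecting the switch points" concrete, here is a deliberately naive sketch that tags each token against a word list and groups consecutive tokens of the same language into spans. The word list HINDI_WORDS and the function names are hypothetical, and a real system would use a trained language-identification model rather than a lookup table.

```python
from itertools import groupby

# Hypothetical toy list of romanized Hindi function words.
HINDI_WORDS = {"mein", "hai"}


def tag_language(token, default="en"):
    """Return "hi" for tokens in the toy word list, else the default language."""
    normalized = token.lower().strip(".,!?")
    return "hi" if normalized in HINDI_WORDS else default


def language_spans(text):
    """Group consecutive tokens that share a language tag into (lang, text) spans."""
    tokens = text.split()
    return [
        (lang, " ".join(group))
        for lang, group in groupby(tokens, key=tag_language)
    ]


print(language_spans("Main office mein meeting hai"))
```

The example also shows why word lists are not enough. In romanized text, "main" is both an English word and a common romanization of the Hindi word for "I", so the correct tag depends on context that a lookup table cannot see.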

Accent transfer between languages

The newest requirement placed on TTS voices is one voice, recognizably the same person, across many languages. Because identity and language are largely separate inside a neural model, a voice can in principle speak a language its source speaker never knew. In practice, projecting one voice into a new language can drag the source language's accent along.

What went wrong: the speaker embedding that defines the voice was learned mostly from one language, so it absorbed some of that language's sound, and carrying it into another language carries the accent too. The result is a voice that is consistent but subtly foreign in every language except its first, an accent pile-up. Keeping a voice's identity while shedding its origin accent remains frontier work and is not a solved feature.[10][11][12][15]

Language-specific prosody

Even with the sounds right, the melody can be wrong. Prosody differs by language: question intonation, stress placement, rhythm, and the very notion of a stressed syllable vary, and some languages are tonal, where pitch changes word meaning.

What went wrong: a model that learned its prosody mostly from one language and applies that shape to another produces speech that is intelligible but carries the wrong melody and stress. Tonal languages leave little room for error: getting the pitch contour wrong in Mandarin changes the word rather than adding an accent.[13][14]

Methods for improving multilingual TTS

The remedies all give the model the language information it lacks. Marking which language each span belongs to, explicitly or through reliable detection, lets the model apply the right rules at the right place instead of guessing. Strong multilingual models trained on many languages together, rather than stitched from monolingual ones, share a sound space that makes borrowed words and names degrade more gracefully. Pronunciation control handles the specific names and terms that must be exact. It also helps to work with the one-language-per-request design of most systems: if a system reads one primary language well and handles borrowings inside it, structuring the text to match works better than expecting a smooth mid-sentence switch it was not built to make.
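Explicit language marking can be sketched with the SSML lang element, which SSML 1.1 defines for tagging a span's language. This is a minimal illustration under assumptions: the function name spans_to_ssml is hypothetical, the output is a bare fragment without the version and namespace attributes a strict parser may require, and engines differ in whether and how they honor the lang element.

```python
from xml.sax.saxutils import escape, quoteattr


def spans_to_ssml(spans, primary="en-US"):
    """Render (language, text) spans as SSML, tagging only non-primary spans."""
    parts = []
    for lang, text in spans:
        if lang == primary:
            # Primary-language text inherits xml:lang from <speak>.
            parts.append(escape(text))
        else:
            parts.append(
                "<lang xml:lang=" + quoteattr(lang) + ">" + escape(text) + "</lang>"
            )
    return "<speak xml:lang=" + quoteattr(primary) + ">" + " ".join(parts) + "</speak>"


spans = [("en-US", "The"), ("fr-FR", "croissant"), ("en-US", "is fresh")]
print(spans_to_ssml(spans))
```

Tagging only the spans that differ from the primary language keeps the request aligned with a system built around one primary language per request, with borrowings marked inside it.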

Common questions

Can one TTS voice speak multiple languages?

Yes. In neural systems, the voice's identity and the language are largely separate, so a voice can be projected into languages its original speaker never spoke. Quality varies by language, and keeping the voice's identity without dragging its original accent into every other language is difficult.

Why does my TTS mispronounce foreign names?

Because it applies the surrounding language's spelling-to-sound rules to a name that follows different rules. Names like "Siobhan" and "São Paulo" do not follow English spelling-to-sound rules, so an English voice bends them into something unrecognizable. The fix is usually explicit pronunciation control, which specifies the actual sounds.

Can a TTS read a sentence that mixes two languages?

Often only partially. Most systems process one primary language per request and handle borrowed words and names within it, but a true mid-sentence switch between two full languages, like Hinglish, is harder, because the model must apply two pronunciation systems in one utterance. Support for this varies widely.

Why does a multilingual voice sound slightly foreign in some languages?

Because the voice's learned identity absorbed the sound of the language it was mostly trained on, and carrying it into another language carries some of that accent along. Shedding the origin accent while keeping the voice recognizable is an unsolved frontier problem, so many multilingual voices are consistent but subtly accented in every language other than their first.

References

  1. Zhou, X., et al. (2020). Towards natural bilingual and code-switched speech synthesis based on mix of monolingual recordings and cross-lingual voice conversion. arXiv preprint arXiv:2010.08136.
  2. CS-FLEURS: A Massively Multilingual and Code-Switched Speech Dataset. arXiv preprint arXiv:2509.14161 (2025).
  3. Enhancing Code-switched Text-to-Speech Synthesis Capability. arXiv preprint arXiv:2409.10969 (2024).
  4. Influence of L1 and L2 on the Pronunciation of Loanwords in Japanese. ResearchGate.
  5. Wells, D., et al. (2021). The CSTR entry to the Blizzard Challenge 2021. ISCA Archive.
  6. Polyglot neural language models: A case study in cross-lingual phonetic representation learning. ACL Anthology.
  7. Spiegel, M. F., et al. (2003). Proper Name Pronunciations for Speech Technology Applications. ResearchGate.
  8. Detection of foreign words and names in written text. Pace University Dissertations.
  9. Improving pronunciation accuracy of proper names with language origin classes. Carnegie Mellon University Master's Thesis.
  10. Chen, M., et al. (2019). Cross-Lingual, Multi-Speaker Text-To-Speech Synthesis Using Neural Speaker Embedding. Interspeech 2019.
  11. Wu, et al. (2024). Improving Multilingual Text-to-Speech with Mixture-of-Language-Experts and Accent Disentanglement. Interspeech 2024.
  12. Accent Vector: Controllable Accent Manipulation for Multilingual TTS Without Accented Data. arXiv preprint arXiv:2603.07534 (2026).
  13. Jiang, Z., et al. (2022). Dict-TTS: Learning to pronounce with prior dictionary knowledge for text-to-speech. Advances in Neural Information Processing Systems (NeurIPS) 2022.
  14. Chen, S. H., et al. (1998). An RNN-based prosodic information synthesizer for Mandarin text-to-speech. IEEE Transactions on Speech and Audio Processing.
  15. Multi-Scale Accent Modeling and Disentangling for Multi-Speaker TTS. Harvard ADS.