Multilingual speech AI: models, scripts, and code-switching

A speaker in Delhi says, "Main kal office jaa raha hoon, but I'll call you first." One sentence, two languages, no pause at the border. If you point a recognizer locked to Hindi at that audio, the English clause comes back as Hindi-shaped guesses. If you lock it to English instead, the Hindi half turns into word salad with good spelling. Nothing malfunctioned. The system was built on an assumption the sentence quietly broke: that a recording has a language, singular.

That assumption has relatives, and each one fails in its own way. What follows is a tour of the wreckage.

The language was chosen before anyone spoke

For most of the field's history, supporting a language meant training a separate recognizer for it and putting a language selector in front. The selector is the weak point. A transcription job configured for German will cheerfully transcribe a Slovene caller as German, and the output looks like a transcript: words and punctuation in convincing order. It is fiction with correct formatting.
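One way to make the selector less brittle is to compare the configured language with what a language-identification step hears before trusting the transcript. The sketch below assumes a hypothetical upstream step that returns a probability per language code. The function name, the threshold, and the example probabilities are illustrative and not taken from any particular library.

```python
from typing import Mapping


def check_configured_language(
    configured: str,
    detected_probs: Mapping[str, float],
    min_prob: float = 0.5,
) -> str:
    """Return "ok", "mismatch", or "uncertain" for a transcription job.

    `detected_probs` is assumed to come from some language-identification
    step (hypothetical here) and maps language codes to probabilities.
    """
    if not detected_probs:
        return "uncertain"
    best_lang = max(detected_probs, key=detected_probs.get)
    if detected_probs[best_lang] < min_prob:
        # No clear winner: possibly noise, or more than one language.
        return "uncertain"
    return "ok" if best_lang == configured else "mismatch"


# Made-up probabilities for a Slovene caller on a job configured for German.
slovene_call = {"sl": 0.81, "de": 0.06, "hr": 0.09, "en": 0.04}
print(check_configured_language("de", slovene_call))  # mismatch
```

A check like this only catches the whole-recording case. It says nothing useful about a recording that contains two languages.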

The newer design trains one model on many languages at once. Whisper learned from 680,000 hours of audio, 117,000 of them covering 96 languages other than English, and it infers the language from the signal itself. ^[1] Sharing one model has a quieter benefit that turned out to matter enormously: languages help each other. What a model learns about acoustics and phrase structure in a well-resourced language transfers, partly, to a related language with a fraction of the data. ^[2]^[4]

Approach	How it works	Needs the language up front	Survives a mid-sentence switch	Shares learning across languages
Fixed-language pipeline	One recognizer or voice per language, behind a selector	Yes	No	No
Joint multilingual model	One model trained on many languages at once	No	Depends on the model	Yes

Note the honest entry in the bottom row. A jointly trained model is a precondition for handling mixed speech, not a guarantee of it.

The switch happens between two words

Language identification sounds like a solved preliminary: run a classifier over the file, get a label, proceed. For a monolingual recording, it mostly is. The Delhi sentence defeats it, because no single label can be right for the whole utterance, and the switch lands between adjacent words, exactly where a pipeline that re-checks the language at pauses will never look.

Recognizing that sentence takes language decisions at the word level or finer, and a decoder willing to hold both vocabularies at once. This is code-switching, and it is a separate capability from multilingual coverage. A model can know ninety languages and still assume every sentence uses exactly one of them. Language identification covers the coarser problem, including the awkward circularity inside it: picking the right phonology sometimes requires already knowing the words.
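To make the difference concrete, here is a toy sketch that compares one file-level label with word-level labels for the Delhi sentence. The tiny lexicons, the helper names, and the majority-vote rule are assumptions for illustration. A real recognizer makes these decisions from audio, not from spelled-out words.

```python
from collections import Counter

# Toy lexicons (assumed, for illustration only).
HINDI = {"main", "kal", "jaa", "raha", "hoon"}
ENGLISH = {"office", "but", "i'll", "call", "you", "first"}


def tag_words(sentence: str) -> list[tuple[str, str]]:
    """Label each word "hi", "en", or "unk" by lexicon lookup."""
    tagged = []
    for raw in sentence.split():
        word = raw.strip('.,!?"').lower()
        if word in HINDI:
            tagged.append((word, "hi"))
        elif word in ENGLISH:
            tagged.append((word, "en"))
        else:
            tagged.append((word, "unk"))
    return tagged


def file_level_label(tagged: list[tuple[str, str]]) -> str:
    """Collapse the word labels into a single label by majority vote."""
    counts = Counter(lang for _, lang in tagged)
    return counts.most_common(1)[0][0]


def switch_points(tagged: list[tuple[str, str]]) -> list[int]:
    """Indices where a word's language differs from the previous word's."""
    return [
        i for i in range(1, len(tagged))
        if tagged[i][1] != tagged[i - 1][1]
    ]


sentence = "Main kal office jaa raha hoon, but I'll call you first."
tagged = tag_words(sentence)
print(file_level_label(tagged))  # en
print(switch_points(tagged))     # [2, 3, 6]
```

The single label comes out as English by six words to five and erases the Hindi half. The word-level view keeps all three switches, including the borrowed "office" in the middle of the Hindi clause.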

Right words, wrong alphabet

Hindi can be written in Devanagari (मैं) or transliterated into Latin script (main). Serbian uses Cyrillic and Latin interchangeably. Chinese has Traditional and Simplified character forms, and Japanese conventionally mixes three scripts in ordinary prose. In every one of these cases a transcript can be word-for-word correct and still useless, because it arrived in a writing system the application cannot use. A messaging app wants Latin-script Hinglish; a formal Hindi transcript wants Devanagari.
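Because script belongs to the output, converting between scripts can sometimes be a plain post-processing step. Serbian is the friendly case: each letter of its Cyrillic alphabet corresponds to one Latin letter or digraph. The sketch below is a minimal illustration of that direction only. The function name is made up here, and capitalization is handled simplistically (an uppercase digraph letter comes out as "Lj", "Nj", or "Dž" even inside an all-caps word).

```python
# Lowercase Serbian Cyrillic letters and their Latin counterparts.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}


def serbian_to_latin(text: str) -> str:
    """Transliterate Serbian Cyrillic to Latin, leaving other characters alone."""
    out = []
    for ch in text:
        lower = ch.lower()
        if lower not in CYR_TO_LAT:
            out.append(ch)  # punctuation, digits, or already-Latin text
            continue
        latin = CYR_TO_LAT[lower]
        if ch != lower:
            # The source letter was uppercase: capitalize the first Latin letter.
            latin = latin.capitalize()
        out.append(latin)
    return "".join(out)


print(serbian_to_latin("Београд"))  # Beograd
```

The Hindi and Chinese cases are harder, since neither conversion is a letter-by-letter table.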

Nothing in the audio resolves this. Script is a convention of the output, and so is spoken normalization in the other direction: a synthesizer reading "$5" says "five dollars" in English and "cinq dollars" in French, and "3/4", read as a date, is March 4 to one listener and 3 April to another. Any system that makes these choices silently has made them wrong for someone. Look for the configuration switch, and worry when there isn't one.
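A sketch of what an explicit switch might look like for spoken normalization: the language and the date order are required arguments, so neither choice can be made silently. The rule tables cover only the examples in the text, and the function names and option values are assumptions for illustration.

```python
import re

# Tiny number tables covering only the examples above (not exhaustive).
NUMBERS = {
    "en": {3: "three", 4: "four", 5: "five"},
    "fr": {3: "trois", 4: "quatre", 5: "cinq"},
}


def speak_dollars(text: str, lang: str) -> str:
    """Expand "$N" into words for the requested language."""
    match = re.fullmatch(r"\$(\d+)", text)
    if match is None:
        raise ValueError(f"not a dollar amount: {text!r}")
    amount = int(match.group(1))
    return f"{NUMBERS[lang][amount]} dollars"


def read_slash_date(text: str, order: str) -> tuple[int, int]:
    """Interpret "A/B" as a date and return (month, day)."""
    first, second = (int(part) for part in text.split("/"))
    if order == "month-first":
        return first, second
    if order == "day-first":
        return second, first
    raise ValueError(f"unknown date order: {order!r}")


print(speak_dollars("$5", "en"))              # five dollars
print(speak_dollars("$5", "fr"))              # cinq dollars
print(read_slash_date("3/4", "month-first"))  # (3, 4), i.e. March 4
print(read_slash_date("3/4", "day-first"))    # (4, 3), i.e. 3 April
```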

The voice that loses itself

Recognition fails quietly, in text nobody may ever read. Synthesis fails out loud. A multilingual voice has to stay recognizably the same person across languages, which turns out to be possible because modern models store who is speaking separately from what is being said. ^[5]

The failures cluster at the joints. A French sentence quoting an English product name forces a choice: pronounce it with French phonology and get it wrong, or snap into flawless English for two words and sound briefly possessed. Names and borrowed words sit permanently on this fault line. Prosody is its own trap. A voice can produce phonetically correct Japanese with the stress and melody of English, and every native listener will hear it instantly, even the ones who cannot say what exactly is off.
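For the quoted-name problem, one option is to mark the borrowed span in the input so that the choice is explicit and not left to the synthesizer. The sketch below builds SSML, whose 1.1 specification defines a `lang` element for marking the language of a span. Whether a given engine honors the element, and how it pronounces the marked span, varies by engine and is an assumption here. The helper name and the product name are made up for illustration.

```python
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def mark_foreign_span(sentence: str, span: str, span_lang: str, base_lang: str) -> str:
    """Wrap the first occurrence of `span` in an SSML <lang> element."""
    start = sentence.find(span)
    if start == -1:
        raise ValueError(f"{span!r} not found in sentence")
    # Escape text content so characters like "&" stay valid XML.
    before = escape(sentence[:start])
    after = escape(sentence[start + len(span):])
    marked = f"<lang xml:lang={quoteattr(span_lang)}>{escape(span)}</lang>"
    return (
        f'<speak version="1.1" xmlns="{SSML_NS}" '
        f"xml:lang={quoteattr(base_lang)}>{before}{marked}{after}</speak>"
    )


ssml = mark_foreign_span(
    "J'ai acheté un Cloud Speaker hier.",  # hypothetical product name
    "Cloud Speaker",
    span_lang="en-US",
    base_lang="fr-FR",
)
print(ssml)
```

Marking the span does not settle which pronunciation is right. It only moves the decision from the model's defaults to the application.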

The languages nobody recorded

More than 7,000 languages are spoken in the world, and for its first several decades speech technology covered a few dozen of them. The limit was never modeling talent. It was data. ^[2] Recognition needs speech paired with accurate transcripts; synthesis needs hours of clean, consistent audio from a single speaker. Outside the commercially loud languages, neither exists in quantity.

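The projects that broke the coverage barrier did it by finding data where nobody had thought to look. The Massively Multilingual Speech (MMS) project, for one, built its training set from recordings of religious texts read aloud, which exist in more than 1,100 languages. ^[2]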

There is a catch, and it has a name. Conneau and colleagues called it the curse of multilinguality: if you hold a model's capacity fixed and keep adding languages, per-language accuracy eventually falls, because the languages compete for the same parameters. ^[4] A bigger model pushes the ceiling up without removing it. This is why "supports a hundred languages" is a coverage claim, not a quality claim, and why serious multilingual work reports accuracy per language, the way the MMS project does ^[2], instead of advertising the count.
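Reporting per language is easy to do on your own evaluation set. The sketch below computes word error rate (word-level edit distance divided by the number of reference words) separately for each language, next to the pooled figure that hides the spread. The sample transcripts are invented for illustration, the function names are assumptions, and every reference is assumed to be non-empty.

```python
from collections import defaultdict


def word_edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance over words (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer_by_language(samples):
    """samples: iterable of (lang, reference, hypothesis) strings.

    Returns (per-language WER dict, pooled WER over all samples).
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for lang, ref, hyp in samples:
        ref_words, hyp_words = ref.split(), hyp.split()
        errors[lang] += word_edit_distance(ref_words, hyp_words)
        words[lang] += len(ref_words)
    per_lang = {lang: errors[lang] / words[lang] for lang in words}
    pooled = sum(errors.values()) / sum(words.values())
    return per_lang, pooled


# Invented evaluation samples: two clean English lines, one damaged Hindi line.
samples = [
    ("en", "call me back tomorrow morning", "call me back tomorrow morning"),
    ("en", "the meeting starts at noon", "the meeting starts at noon"),
    ("hi", "main kal office jaa raha hoon", "main call office ja raha hoon"),
]
per_lang, pooled = wer_by_language(samples)
print(per_lang["en"], round(per_lang["hi"], 3), pooled)  # 0.0 0.333 0.125
```

The pooled 12.5% looks respectable. The per-language view shows that one in three Hindi words is wrong.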

The place where these failures compound is a live phone call: a caller opens in one language and drops into another for a name or a phrase, while a voice agent has half a second to keep up. "Multilingual voice agents" picks up there.

Common questions

Can adding languages make a model worse at each one?

It can, and the effect has a name: the curse of multilinguality. At fixed model size, languages compete for capacity, so per-language accuracy eventually drops as coverage grows. ^[4] Bigger models and better-balanced data push the point of decline further out. The honest way to read "supports N languages" is as a coverage claim, so measure the languages you actually care about.

Why not detect the language first and then pick the right model?

That works only when each recording sticks to one language. Detection at the file level cannot represent a sentence that changes language in the middle, and a mostly French recording still contains English names and phrases. Code-switched speech needs language decisions at the word or segment level, made during recognition rather than before it. The same limit applies to synthesis when one sentence mixes material from two languages.

Can one synthetic voice speak several languages?

Yes, within limits. Models like AudioPaLM keep who is speaking separate from what is being said, so a short prompt in one language can drive speech in another. ^[5] Quality varies by language pair, and the weak spots are predictable: foreign names, and prosody that keeps the rhythm of the wrong language.

Is a heavy accent the same problem as a different language?

No. An accent varies pronunciation and rhythm inside one language's vocabulary and grammar. A language switch brings a different vocabulary and a different sound system with it. A recognizer can be excellent across accents of English and still fall over on code-switched speech, because that requires holding two languages in one utterance.

Who decides which script the transcript uses?

The application does, or should. The audio itself does not carry a writing system: spoken Hindi is compatible with Devanagari and with Latin transliteration, and spoken Chinese with Traditional and Simplified characters. When a system offers a script option, set it deliberately. When it does not, find out what its default is before it surprises you.
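One way to find out is to check the script of what actually comes back. The sketch below counts letters by the first word of their Unicode character names, using Python's standard `unicodedata` module. Treating that first word as the script is a simplifying assumption: it works for Devanagari, Latin, and Cyrillic, but it is not a full implementation of Unicode script properties, and it cannot tell Traditional from Simplified Chinese, since both are encoded as CJK unified ideographs.

```python
import unicodedata
from collections import Counter


def dominant_script(text: str) -> str:
    """Return the most common script prefix among the letters in `text`."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, spaces, punctuation, combining marks
        name = unicodedata.name(ch, "")
        if name:
            # e.g. "DEVANAGARI LETTER MA" -> "DEVANAGARI"
            counts[name.split()[0]] += 1
    if not counts:
        return "UNKNOWN"
    return counts.most_common(1)[0][0]


print(dominant_script("मैं कल ऑफिस जा रहा हूँ"))          # DEVANAGARI
print(dominant_script("main kal office jaa raha hoon"))  # LATIN
```

Running a check like this on a handful of real outputs before launch is cheaper than discovering the default in production.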

References

[1] Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2022). Robust Speech Recognition via Large-Scale Weak Supervision. arXiv preprint arXiv:2212.04356.
[2] Pratap, V., Tjandra, A., Shi, B., Tomasello, P., Babu, A., et al. (2023). Scaling Speech Technology to 1,000+ Languages. arXiv preprint arXiv:2305.13516.
[3] Seamless Communication, Barrault, L., Chung, Y.-A., Cora Meglioli, M., Dale, D., et al. (2023). SeamlessM4T: Massively Multilingual & Multimodal Machine Translation. arXiv preprint arXiv:2308.11596.
[4] Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., et al. (2020). Unsupervised Cross-lingual Representation Learning at Scale. Proceedings of ACL 2020; arXiv:1911.02116.
[5] Rubenstein, P. K., Asawaroengchai, C., Nguyen, D. D., Bapna, A., Borsos, Z., et al. (2023). AudioPaLM: A Large Language Model That Can Speak and Listen. arXiv preprint arXiv:2306.12925.