Over half the world's people speak more than one language, and they do not keep those languages in separate boxes. A Mumbai engineer says "Main kal office jaa raha hoon, but I'll call you first" in a single breath, Hindi and English with no seam between them. A Geneva receptionist switches to French the moment she hears a French bonjour. For billions of people this is ordinary, and almost every speech system ever shipped was built to assume it never happens.
The phrase "supports a hundred languages" carries that assumption. It can describe two products that share almost nothing, and which one you bought only becomes clear when something goes wrong.
Unified and separate language models
The first is a set of monolingual systems behind a menu. You pick the language and the system does that single job well, but it is helpless the moment you pick wrong or the speaker drifts into another language. The second is one model that reads the language off the audio itself and can change its answer word by word. Both can support the same number of languages. Only the single model can transcribe a sentence that switches between two of them.
The split comes down to where the language decision lives. A monolingual stack needs the language named before it starts, so the decision is yours, made before anyone has said a word. A true multilingual model treats language the way it treats the words, as something to predict from the signal, and it predicts it continuously, per token rather than per file.
| Approach | How it works | Needs the language up front | Handles a mid-sentence switch | Shares data across languages |
|---|---|---|---|---|
| One model per language | Separate recognizer and voice behind a picker | Yes | No | No |
| One multilingual model | A single model trained across all of them, language predicted from the signal | No | Yes | Yes |
The same choice decides how a system does in both directions. Recognition has to work out which language it is hearing; synthesis has to choose which language to speak, and in what voice. A design that asks you to name the language up front fails the same way going in and coming out.
Low-resource languages
Ethnologue (2024) counts 7,164 living languages, and the top two dozen are spoken by half the planet. Build excellent recognition and synthesis for a dozen of them and you serve most paying customers while covering a rounding error of the world's languages.
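As a rough check on the "rounding error" claim, the arithmetic can be sketched in a few lines. The 7,164 total is the figure quoted above; the supported-language counts are illustrative, and `coverage_share` is a hypothetical helper name.

```python
# Total taken from the figure quoted in the text (Ethnologue, 2024).
LIVING_LANGUAGES = 7_164


def coverage_share(supported: int, total: int = LIVING_LANGUAGES) -> float:
    """Fraction of living languages covered when `supported` are served."""
    return supported / total


# Illustrative product sizes: a dozen, two dozen, "a hundred languages".
for supported in (12, 24, 100):
    print(f"{supported:>3} languages -> {coverage_share(supported):.2%} of the list")
```

A dozen languages comes to under 0.2 percent of the list, and even a hundred stays under 2 percent.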
The bottleneck is data, and the two directions starve differently. A recognizer learns from transcribed audio, and the supply is wildly lopsided: hundreds of thousands of hours for English, a few hundred for a language with fifty million speakers, almost nothing for most of the list. Synthesis is worse, because a recognizer can learn from noisy audio scraped from anywhere, while a natural voice needs clean studio recordings of a consenting speaker, which barely exist past a handful of commercial languages.
Scaling past that scarcity has taken some unusual moves. OpenAI's Whisper (2022) was trained on 680,000 hours of audio, roughly 117,000 of them covering 96 languages other than English, by taking whatever the web offered instead of curated corpora. To reach languages with almost no transcribed audio, Meta went looking for the one text that has been recorded aloud in more languages than almost anything else: the New Testament, whose readings gave its Massively Multilingual Speech project (2023) paired audio and text in more than 1,100 languages.
Quality does not decline smoothly down the list. It drops sharply: strong for the first ten or fifteen languages, usable for the next thirty or so, and after that the transcript is more guess than record and the synthetic voice, where one exists at all, sounds foreign in the language it is reading.
Scripts
For many languages, even once the words are known, the script is a real decision. Hindi can be written in Devanagari (मैं) or transliterated into the Latin alphabet (main). Chinese has Traditional and Simplified characters for the same spoken Mandarin. Serbian uses Cyrillic and Latin interchangeably. Japanese mixes three scripts by convention.
For recognition there is no universal right script: a chat app may want Latin-script Hinglish because that is how people type, while a court transcript wants Devanagari. For generation the problem flips. The model reads whatever script it is handed and decides how to speak it, including the language-specific normalization that turns "$5" into "five dollars" in English and "cinq dollars" in French, or reads "3/4" as a different date depending on the convention. On both sides the failure is the same: deciding silently and giving the user no way to override. Script and spoken form are settings, not defaults you should have to reverse-engineer.
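One way to picture "settings, not defaults" is a small settings object that every text decision reads from. This is a minimal sketch, not any product's API: `TextSettings`, `speak_price`, `parse_short_date`, and `render_hindi` are hypothetical names, and the lookup tables are toy data just large enough for the examples in the paragraph above.

```python
from dataclasses import dataclass

# Toy lookup tables (assumed for illustration only).
NUMBER_WORDS = {"en": {5: "five"}, "fr": {5: "cinq"}}
DEVANAGARI_TO_LATIN = {"मैं": "main"}


@dataclass(frozen=True)
class TextSettings:
    language: str = "en"    # language whose reading rules apply
    date_order: str = "MD"  # "MD" = month/day, "DM" = day/month
    script: str = "Deva"    # "Deva" or "Latn" for Hindi output


def speak_price(text: str, settings: TextSettings) -> str:
    """Expand a price such as "$5" using the configured language."""
    amount = int(text.lstrip("$"))
    return f"{NUMBER_WORDS[settings.language][amount]} dollars"


def parse_short_date(text: str, settings: TextSettings) -> dict:
    """Interpret "a/b" as month/day or day/month, per the setting."""
    first, second = (int(part) for part in text.split("/"))
    if settings.date_order == "MD":
        return {"month": first, "day": second}
    return {"month": second, "day": first}


def render_hindi(word: str, settings: TextSettings) -> str:
    """Return a recognized Hindi word in the caller's chosen script."""
    if settings.script == "Latn":
        return DEVANAGARI_TO_LATIN.get(word, word)
    return word


english = TextSettings(language="en", date_order="MD")
french = TextSettings(language="fr", date_order="DM")
chat_app = TextSettings(script="Latn")
court = TextSettings(script="Deva")

print(speak_price("$5", english), "|", speak_price("$5", french))
print(parse_short_date("3/4", english), parse_short_date("3/4", french))
print(render_hindi("मैं", chat_app), render_hindi("मैं", court))
```

The point of the design is that the same input produces different output only because the caller said so, never because the system guessed.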
Multilingual speech recognition
Recognition starts in a chicken-and-egg bind. To transcribe accurately you want to bias the decoder toward the right language's sounds and words, but to know the language you often have to transcribe first. A system that demands a language code up front has not solved that problem. It has handed it to you, the developer, who often does not know either, because the call came in from a stranger.
The honest answer is language identification done by the model, per token rather than per file. Whole-file detection ("this recording is mostly Portuguese") throws away every word that disagreed with the majority. Per-token identification labels each word as it decodes, which is the resolution the next problem needs.
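The difference in resolution can be sketched with the Hindi-English sentence from the opening. The per-word labels below are hand-assigned for illustration, and the majority vote is only a stand-in for a real whole-file classifier, which would work on audio rather than on labels; `file_level_language` and `words_lost_to_majority` are hypothetical names.

```python
from collections import Counter

# Hypothetical per-token output: (word, language) pairs, labeled by hand.
TOKENS = [
    ("Main", "hi"), ("kal", "hi"), ("office", "en"), ("jaa", "hi"),
    ("raha", "hi"), ("hoon", "hi"), ("but", "en"), ("I'll", "en"),
    ("call", "en"), ("you", "en"), ("first", "en"),
]


def file_level_language(tokens):
    """Whole-file detection: collapse everything to one majority label."""
    counts = Counter(lang for _, lang in tokens)
    return counts.most_common(1)[0][0]


def words_lost_to_majority(tokens):
    """Words whose language disagrees with the single file-level label."""
    label = file_level_language(tokens)
    return [word for word, lang in tokens if lang != label]


print("file-level label:", file_level_language(TOKENS))
print("words the label ignores:", words_lost_to_majority(TOKENS))
```

With these labels the file comes out as English by six words to five, and the five Hindi words are exactly the ones the single label discards.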
That problem is code-switching: the speaker changing languages mid-utterance, like the Hindi-English sentence above, with no pause to switch models at. A pipeline that re-detects language during silences has no silence to use when the switch falls between two adjacent words. Only a model that can reassign the language token by token keeps the transcript intact across the seam, which is why code-switching is the cleanest test of whether a system is genuinely multilingual or just a stack of monolingual ones.
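A small sketch makes the "no silence to use" point concrete. The timestamps and labels below are invented for illustration, and the 0.2-second pause threshold is an arbitrary assumption; `Token`, `language_spans`, and `pause_boundaries` are hypothetical names, not part of any particular toolkit.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Token:
    word: str
    lang: str
    start: float  # seconds
    end: float    # seconds


# Invented timings: the words run together with no gap at the switch.
TOKENS = [
    Token("jaa", "hi", 0.90, 1.10),
    Token("raha", "hi", 1.10, 1.35),
    Token("hoon", "hi", 1.35, 1.60),
    Token("but", "en", 1.60, 1.80),
    Token("I'll", "en", 1.80, 2.00),
    Token("call", "en", 2.00, 2.30),
]


def language_spans(tokens):
    """Merge consecutive same-language tokens into (lang, text) spans."""
    return [
        (lang, " ".join(t.word for t in group))
        for lang, group in groupby(tokens, key=lambda t: t.lang)
    ]


def pause_boundaries(tokens, min_pause=0.2):
    """Indices where a silence-based pipeline could re-detect language."""
    return [
        i for i in range(1, len(tokens))
        if tokens[i].start - tokens[i - 1].end >= min_pause
    ]


print(language_spans(TOKENS))
print(pause_boundaries(TOKENS))
```

Per-token labels recover two clean spans, while the silence-based pass finds no boundary at all, because "hoon" ends at the same instant "but" begins.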
Multilingual speech synthesis
Generation has a problem recognition never faces: identity. A transcript has no accent. A synthetic voice does, and the current bar for multilingual synthesis is keeping that voice recognizably the same person across languages. It works because a neural model can represent speaker identity, typically as a speaker embedding, largely separately from language, so a voice can be projected into a language its original speaker never knew.
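The separation can be pictured as two independent inputs to the synthesizer. This is a toy sketch under strong assumptions: the vectors are made up, real systems learn them and combine them in model-specific ways, and `SPEAKERS`, `LANGUAGES`, and `conditioning` are hypothetical names.

```python
# Made-up vectors; in a real model both would be learned, not hand-written.
SPEAKERS = {"asha": [0.12, -0.40, 0.88]}
LANGUAGES = {"en": [1.0, 0.0], "de": [0.0, 1.0]}


def conditioning(speaker: str, language: str) -> list[float]:
    """Concatenate independent speaker and language vectors."""
    return SPEAKERS[speaker] + LANGUAGES[language]


english = conditioning("asha", "en")
german = conditioning("asha", "de")

# The speaker part is identical; only the language part changes.
speaker_dims = len(SPEAKERS["asha"])
print(english[:speaker_dims] == german[:speaker_dims])
print(english[speaker_dims:], german[speaker_dims:])
```

Because the identity part never changes, asking for German is a change to one input, not a request for a different voice.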
The other half is the words that do not belong to the language. A voice reading a French sentence with "WhatsApp," a German one with "Customer Success Manager," or any sentence with a foreign name has to pronounce those words with roughly the right foreign phonology, dropped into the surrounding rhythm, without lurching into a different accent for one word and back. Get it wrong and the voice either Frenchifies "WhatsApp" into something unsearchable or breaks character to say one word in flat American English. Alphanumerics are worse: a reference code, a phone number, or a price is grouped and read differently in each language, and the voice has to know whose rules apply to a string that has no language of its own. Underneath all of it sits prosody, the melody and stress that differ from language to language, which the voice has to switch along with the words. It can get every phoneme technically right and still sound wrong because it carried English intonation into a German sentence.
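The alphanumeric case can be sketched as a grouping step that takes the language as an explicit argument, since the digits themselves carry none. The grouping sizes below are simplified conventions assumed for illustration (digit by digit in English, pairs in French), and `group_digits` is a hypothetical helper.

```python
# Simplified reading conventions, assumed for illustration only.
GROUP_SIZE = {"en": 1, "fr": 2}


def group_digits(digits: str, language: str) -> list[str]:
    """Split a digit string into the chunks a voice would read aloud."""
    size = GROUP_SIZE[language]
    return [digits[i:i + size] for i in range(0, len(digits), size)]


number = "0612345678"
print(group_digits(number, "en"))
print(group_digits(number, "fr"))
```

The same string yields ten single digits under one rule and five pairs under the other, which is why the caller, or the surrounding sentence, has to say whose rules apply.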
Remaining technical problems
Even the single-model design has a ceiling. Researchers call it the curse of multilinguality: for a fixed model size, each language you add eventually starts taking quality from the others, and the only cure is a bigger model. Scale has stayed ahead of the curse so far, which is why adding languages has mostly helped rather than hurt the strong ones, but the trade is real and always present.
The hardest cases now are the human ones: a caller who switches languages mid-clause, a voice that has to stay one person across borrowed words and names, a voice agent that follows a customer between languages without dropping a turn. Each is handled by the same decision, made once at the start: predict the language from the audio instead of asking the developer to name it.
Common questions
Does supporting more languages make a model worse at its best one?
Usually not, and often the opposite. Sharing acoustic and linguistic structure across languages tends to help low-resource languages a lot and leave high-resource ones like English roughly unchanged. The limit is the curse of multilinguality, where a fixed-size model eventually trades quality between languages, but scaling the model offsets it, so a well-built multilingual model rarely pays for its breadth on the big languages.
Why can't I just detect the language first, then pick a model?
Because whole-file detection answers a coarser question than recognition or synthesis needs. It can call a clip "mostly French" and still mistranscribe every English name in it, and it is useless when the language changes mid-sentence. The same is true going out: a per-message language setting cannot handle a sentence that mixes two. Language has to be resolved at the word level.
Can one synthetic voice speak multiple languages?
Yes. In a neural voice, identity and language are largely separate, so a voice can be projected into languages its speaker never recorded. Quality varies by language, and there are two hard cases. One is accent: sounding native in each language rather than applying one accent to all. The other is borrowed words that come from a different language than the sentence around them.
Is a regional accent the same problem as a different language?
No, and conflating them causes trouble. An accent is one language spoken with another's sound habits, handled by training data broad enough to include it. A different language means different words and grammar entirely. A model can be strong on accents and still fail at code-switching, because handling accents never required a second language's vocabulary.
Which script will I get for languages that have more than one?
That depends on the system, and a good one lets you choose. Hindi can come back in Devanagari or Latin transliteration, Chinese in Traditional or Simplified, and the right choice depends on your application, not the audio. Avoid a system that decides silently, leaving you to reverse-engineer a default instead of setting it.
Related concepts
- Code-switching in speech recognition
- Spoken language identification
- What is speech recognition
- What is text-to-speech
- Multilingual TTS
- Multilingual voice agents
References
- Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2022). Robust Speech Recognition via Large-Scale Weak Supervision. arXiv preprint arXiv:2212.04356.