Building multilingual voice agents that switch languages live

A caller starts in English, the agent answers, and then the caller, more comfortable in Spanish, switches. A good agent switches with them. A common one keeps answering in English, politely and uselessly, because nothing in its pipeline noticed the language changed, or noticed too late, or could not act on it. The caller repeats themselves in Spanish, the agent replies in English again, and the caller hangs up.

That standoff is the signature failure of multilingual agents, and it comes from many places rather than one bug. An agent works like a relay team, recognition handing to the language model handing to synthesis, and every runner has to carry the same baton. Drop it at one exchange and the whole leg is lost, however fast the others run.

Fixed-language agents

The caller switches languages and the agent does not. It was configured for one language up front, with no live language identification, so recognition keeps decoding the new language as garbled versions of the old one and the rest of the pipeline never gets a chance to switch. A multilingual agent needs recognition that detects the language as it runs, not a setting chosen before the call, ideally labeling per token so a switch is visible the moment it happens. [1]
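
To make "labeling per token" concrete, here is a minimal sketch in Python. It assumes a recognizer that attaches a language code to every token it emits; the `Token` type and the `language_switches` helper are hypothetical names used for illustration, not any particular vendor's API.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class Token:
    text: str
    language: str  # assumed per-token label from the recognizer, e.g. "en" or "es"


def language_switches(tokens: Iterable[Token]) -> Iterator[tuple[int, str, str]]:
    """Yield (token_index, old_language, new_language) each time the label changes."""
    current: Optional[str] = None
    for index, token in enumerate(tokens):
        if current is not None and token.language != current:
            yield index, current, token.language
        current = token.language


# Illustrative stream: the caller moves from English to Spanish on the third token.
stream = [
    Token("thanks", "en"),
    Token("so", "en"),
    Token("necesito", "es"),
    Token("ayuda", "es"),
]
switches = list(language_switches(stream))
```

Because the helper consumes an iterable, it can run on a live token stream and report the switch at the token where it happens, rather than at the end of the turn.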

Delayed language identification

The agent does switch, but a beat too late. It answers the first Spanish sentence in English, then corrects itself. Language identification needs a few seconds of speech to be confident, so either the agent committed to a reply before detection was sure, or detection settled only after the reply was already under way. Live language detection races against the response clock, and on the first words of a switch, the audio that would settle the question has not all arrived yet. This is the same uncertainty as recognition's partial results, now deciding which language to respond in.
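
One way to frame that race is as a decision rule with a confidence bar and a deadline. The sketch below is an assumption-laden illustration: `LanguageEstimate`, the 0.8 threshold, and the 700 ms deadline are all invented for the example, and a real agent would tune them per language and per recognizer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguageEstimate:
    language: str
    confidence: float  # 0.0 to 1.0, assumed to be reported by the recognizer
    elapsed_ms: int    # how much of the turn's audio had been heard


def choose_reply_language(
    estimates: list[LanguageEstimate],
    previous_language: str,
    min_confidence: float = 0.8,  # hypothetical threshold
    deadline_ms: int = 700,       # hypothetical response deadline
) -> str:
    """Commit to the first confident estimate that arrives before the deadline.

    If nothing clears the bar in time, keep the previous language.
    """
    for estimate in estimates:
        if estimate.elapsed_ms > deadline_ms:
            break  # the response clock won the race
        if estimate.confidence >= min_confidence:
            return estimate.language
    return previous_language


estimates = [
    LanguageEstimate("en", 0.55, 200),
    LanguageEstimate("es", 0.62, 400),
    LanguageEstimate("es", 0.91, 600),
]
in_time = choose_reply_language(estimates, previous_language="en")
too_late = choose_reply_language(estimates, previous_language="en", deadline_ms=500)
```

With the default deadline the confident Spanish estimate arrives in time and wins; with a 500 ms deadline the agent stays in English, which is exactly the "beat too late" failure. Falling back to the previous language is one design choice among several; taking the best guess so far is the riskier alternative.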

Code-switching within an utterance

The caller mixes two languages in a single sentence, "I need to cancel mi reserva for Friday," and the agent picks one and mangles the rest. The agent assumed one language per turn; the user did not. Code-switching is normal for bilingual speakers, and it demands recognition that can label languages within an utterance and synthesis that can mix languages in a reply. An agent built around one-language-per-turn has no representation for a sentence that is two languages, so half of it falls through.
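
A representation for "a sentence that is two languages" can be as simple as a list of language spans. The sketch below assumes the recognizer supplies (word, language) pairs; the labels on the example sentence are hand-assigned for illustration, and `language_spans` is a hypothetical helper.

```python
from itertools import groupby


def language_spans(tokens: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Collapse (word, language) pairs into contiguous (language, text) spans."""
    return [
        (language, " ".join(word for word, _ in group))
        for language, group in groupby(tokens, key=lambda pair: pair[1])
    ]


# Hand-labeled version of the example sentence.
utterance = [
    ("I", "en"), ("need", "en"), ("to", "en"), ("cancel", "en"),
    ("mi", "es"), ("reserva", "es"),
    ("for", "en"), ("Friday", "en"),
]
spans = language_spans(utterance)
```

Here `spans` holds three entries (English, Spanish, English), so downstream stages can see that the turn is mixed instead of being forced to pick one label for all of it.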

Language mismatches between components

This is the subtlest failure, and the one unique to agents: recognition correctly hears Spanish, but the language model replies in English, or the model replies in Spanish and the voice cannot speak it, so it comes out wrong or defaults back to English. The language decision did not survive the trip through the pipeline. An agent is only as multilingual as its weakest stage: recognition has to detect the language, the orchestrator has to carry that decision forward, the model has to be instructed to reply in it, and the TTS has to have a voice that speaks it. A break anywhere (a model that drifts to English, a voice with no coverage for the language) collapses the whole thing, however good the other stages are.
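
As a rough sketch of what "carrying the decision forward" can look like, the orchestrator below turns one detected language into both the model instruction and the voice choice, and fails loudly when no voice covers the language. The voice catalog, the voice IDs, and the `plan_turn` helper are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical voice catalog: language code -> voice ID.
VOICES_BY_LANGUAGE = {"en": "voice-en-1", "es": "voice-es-1"}


@dataclass(frozen=True)
class TurnPlan:
    language: str
    system_instruction: str  # passed to the language model for this turn
    voice_id: str            # passed to synthesis for this turn


def plan_turn(detected_language: str) -> TurnPlan:
    """Derive the model instruction and the voice from a single language decision."""
    voice_id = VOICES_BY_LANGUAGE.get(detected_language)
    if voice_id is None:
        # Surface the coverage gap instead of silently defaulting to English.
        raise LookupError(f"no voice covers language {detected_language!r}")
    return TurnPlan(
        language=detected_language,
        system_instruction=f"Reply only in the language with code '{detected_language}'.",
        voice_id=voice_id,
    )


plan = plan_turn("es")
```

Deriving both downstream settings from one value means the model and the voice cannot disagree about the language, and raising on a missing voice makes the weakest stage visible in testing rather than on a live call.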

Speaker identity across languages

The agent switches languages and, with it, the voice: a different timbre, or the same voice now carrying a foreign accent in the new language. The agent used a different voice per language, or one voice that does not hold its identity across languages, the origin-accent problem. To a caller, a voice that changes identity when the language changes feels like being handed to a different agent. Keeping one recognizable voice across languages is a synthesis frontier, and an agent that ignores it sounds disjointed even when every word is right.

Quality differences between languages

The agent is excellent in English and noticeably worse in the caller's language: more recognition errors, clumsier replies, a stiffer voice. Nothing is broken, but per-language quality is not uniform. A low-resource language has weaker recognition, weaker model fluency, and a less polished voice, and an agent inherits all three at once, so the gap compounds across the pipeline. Know the asymmetry, set expectations per language, and test each language you claim to support rather than assuming English-level quality everywhere.
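
A little arithmetic shows why the gap compounds. The sketch below treats each stage as succeeding independently on a turn, which is a simplification, and every rate in it is invented purely for illustration, not measured from any system.

```python
import math


def end_to_end(stage_rates: list[float]) -> float:
    """Chance that a turn survives every stage, assuming the stages fail independently."""
    return math.prod(stage_rates)


# Invented per-stage success rates: recognition, model reply, synthesis.
english = end_to_end([0.97, 0.98, 0.98])       # about 0.93
low_resource = end_to_end([0.90, 0.92, 0.93])  # about 0.77
```

In this made-up case no single stage of the weaker language is below 0.90, yet fewer than four turns in five come through clean end to end, which is why per-language testing has to cover the whole pipeline and not each component alone.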

The agent...                         Where the baton dropped
-----------------------------------  --------------------------------------------------
Never switches                       No live detection; language fixed before the call
Switches a beat late                 Detection raced the response clock and lost
Mangles a mixed sentence             One-language-per-turn assumption
Hears Spanish, answers English       The decision died between components
Changes voice with the language      Identity not held across languages
Fluent in English, clumsy elsewhere  Per-language quality compounding through the stack

The failures look alike to the caller and have different fixes, which is why naming the dropped baton matters.

Requirements for multilingual agents

All but the last of the failures above are coordination failures rather than quality failures. The components can each be excellent in isolation and the agent can still answer Spanish in English, because the question was never whether a component speaks the language. It was whether the detected language reaches the next component before the user notices.

So the detection has to be live and per-token, visible the instant a switch happens, and the detected language has to travel all the way to the voice instead of dying at the orchestrator. Get the propagation right and the agent follows a caller across languages and within a single breath. Lose it at any one stage and you get the polite English answer to a Spanish question.

Common questions

What makes a voice agent multilingual?

Live detection, not a language menu. The bar is that the agent works out which language the user is speaking while the call is running and replies in it, even when the user switches mid-conversation or mid-sentence. Per-language coverage in the abstract is not enough; the detection has to happen fast enough that the user never hears the wrong language.

Why does my agent keep replying in the wrong language?

The detected language is dying somewhere in the pipeline before it reaches the reply. Recognition hears the switch, but the model or the voice stays on the old language, or there was no live detection to begin with. The fix is propagation: an agent is only as multilingual as its weakest stage, so trace which stage drops the language and make it carry the decision forward.
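
Tracing can be mechanical if every stage logs the language it acted on. The sketch below assumes such a per-turn trace exists as ordered (stage, language) pairs; the stage names and the `first_dropped_stage` helper are hypothetical.

```python
from typing import Optional


def first_dropped_stage(trace: list[tuple[str, str]]) -> Optional[str]:
    """Return the first stage whose language differs from the stage before it."""
    for (_, previous_language), (stage, language) in zip(trace, trace[1:]):
        if language != previous_language:
            return stage
    return None


# Illustrative trace: recognition heard Spanish, but the model replied in English.
trace = [
    ("recognition", "es"),
    ("orchestrator", "es"),
    ("model", "en"),
    ("tts", "en"),
]
dropped = first_dropped_stage(trace)
```

Here `dropped` is `"model"`, which points the fix at the model's instructions and away from the recognizer or the voice. A trace where every stage agrees returns `None`, meaning the language survived the trip.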

Can a voice agent handle a user who mixes languages in one sentence?

Only if it is built for code-switching from the recognizer up. It needs recognition that labels languages within a single utterance and a voice that can mix languages in one reply. An agent that assumes one language per turn has no way to represent "I need to cancel mi reserva," so it picks one language and mangles the rest.

Why is the agent worse in some languages than others?

Because quality is not uniform across the stack, and a low-resource language is weaker at every stage at once. Weaker recognition, less fluent model output, and a stiffer voice compound rather than average out, so the gap is larger than any one component would suggest. The practical rule: test each language you claim to support instead of assuming English-level quality everywhere.

References

[1] Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2022). Robust Speech Recognition via Large-Scale Weak Supervision. arXiv preprint arXiv:2212.04356.