Code-switching

Recognition of utterances that contain more than one language

Updated June 13, 2026

"Voy al store a comprar milk, y después te llamo." A Spanish-English bilingual in Texas says a sentence like that without noticing she did anything unusual. A system told to transcribe Spanish will mangle "store" and "milk." A system told to transcribe English will mangle everything else. Neither system is broken; each was configured for one language when the speaker used two.

Single-language speech recognition is a useful simplification that does not hold up in practice. It makes the engineering tractable, but a large share of the people the system serves do not speak one language at a time.

Fixed-language recognition

The most common failure starts before any audio arrives. The API asks for a language code. You pass es. From that point the decoder is biased toward Spanish: the acoustic model expects Spanish phonemes, the language model expects Spanish word sequences, and the search is pruned to keep Spanish hypotheses alive and starve the rest.

Then the speaker says "store." The decoder cannot return a word it has been told does not exist, so it returns the nearest Spanish-shaped thing: maybe "estor," maybe "es to." The error is not random noise. It is the system doing exactly what you configured, which is the worst kind of bug because the logs look clean.
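A toy sketch of that behavior, under loud assumptions: the lexicon below is a hypothetical handful of Spanish words, and spelling similarity (via Python's difflib) stands in for acoustic similarity. No real decoder works this way. The only point it makes is that the output is always drawn from the configured vocabulary, however poorly it fits.

```python
import difflib

# Hypothetical Spanish-only lexicon; a real decoder's vocabulary is far larger.
SPANISH_LEXICON = [
    "voy", "al", "a", "comprar", "y", "después",
    "te", "llamo", "estar", "esto", "mil", "más",
]


def decode_with_fixed_lexicon(heard: str, lexicon: list[str]) -> str:
    """Return the lexicon entry closest in spelling to what was heard.

    Spelling similarity is a stand-in for acoustic similarity. With
    cutoff=0.0 every entry qualifies, so something is always returned,
    even when nothing in the lexicon is a good match.
    """
    matches = difflib.get_close_matches(heard, lexicon, n=1, cutoff=0.0)
    return matches[0]


for word in ["voy", "store", "milk"]:
    # "voy" maps to itself; the English words are forced onto Spanish entries.
    print(word, "->", decode_with_fixed_lexicon(word, SPANISH_LEXICON))
```

Nothing raises and nothing is logged as a failure, which is the property the paragraph above describes.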

Whole-utterance language detection does not save you here. Detecting that the utterance is "mostly Spanish" still throws away every English word in it. Language has to be resolved at the word level, and sometimes at the morpheme level, not at the utterance level.
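To make the loss concrete, here is a minimal sketch that reduces whole-utterance detection to a majority vote over word labels. The labels are hand-assigned for the opening example, and the function names are illustrative rather than taken from any library.

```python
from collections import Counter

# Hand-labelled words from the opening example (labels are for illustration).
words = [
    ("Voy", "es"), ("al", "es"), ("store", "en"), ("a", "es"),
    ("comprar", "es"), ("milk", "en"), ("y", "es"), ("después", "es"),
    ("te", "es"), ("llamo", "es"),
]


def utterance_language(tagged):
    """Whole-utterance detection, reduced to a majority vote over word labels."""
    return Counter(lang for _, lang in tagged).most_common(1)[0][0]


def words_lost(tagged):
    """Words whose own language differs from the utterance-level decision."""
    decided = utterance_language(tagged)
    return [word for word, lang in tagged if lang != decided]


print(utterance_language(words))  # es
print(words_lost(words))          # ['store', 'milk']
```

The utterance-level answer is correct as a summary and still discards the two words a shopping-list application would care about most.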

Types of code-switching

Linguists split this into two cases, and the distinction matters for engineering.

Inter-sentential switching happens at sentence boundaries: one full sentence in Hindi, the next in English. This is the gentler case. A system that re-detects language per segment has a fighting chance, because each segment is internally monolingual.

Intra-sentential switching happens inside one sentence, with no pause and no punctuation to hide behind. "Main kal office jaa raha hoon, but I'll call you first" is one breath of Hindi-English, the variety people call Hinglish. There is no segment boundary to re-detect at, and a pipeline that switches models at silence has no silence to switch at. The language identity changes between adjacent words, sometimes between a Hindi verb stem and an English noun glued into Hindi grammar.
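The two cases can be told apart mechanically once every word carries a language tag. The sketch below assumes the input is already split into sentences and tagged per word (both are assumptions, and the tags are hand-assigned); it counts language changes at sentence boundaries separately from changes inside a sentence.

```python
Sentence = list[tuple[str, str]]  # (word, language tag)


def count_switches(sentences: list[Sentence]) -> dict[str, int]:
    """Count language changes at sentence boundaries and inside sentences."""
    counts = {"inter_sentential": 0, "intra_sentential": 0}
    previous_last_lang = None
    for sentence in sentences:
        if not sentence:
            continue
        langs = [lang for _, lang in sentence]
        # A change between the last word of one sentence and the first of the next.
        if previous_last_lang is not None and langs[0] != previous_last_lang:
            counts["inter_sentential"] += 1
        # A change between adjacent words inside the same sentence.
        counts["intra_sentential"] += sum(a != b for a, b in zip(langs, langs[1:]))
        previous_last_lang = langs[-1]
    return counts


two_sentences = [
    [("Main", "hi"), ("ghar", "hi"), ("jaa", "hi"), ("raha", "hi"), ("hoon", "hi")],
    [("I'll", "en"), ("call", "en"), ("you", "en"), ("later", "en")],
]
one_breath = [[
    ("Main", "hi"), ("kal", "hi"), ("office", "en"), ("jaa", "hi"),
    ("raha", "hi"), ("hoon", "hi"), ("but", "en"), ("I'll", "en"),
    ("call", "en"), ("you", "en"), ("first", "en"),
]]

print(count_switches(two_sentences))  # {'inter_sentential': 1, 'intra_sentential': 0}
print(count_switches(one_breath))     # {'inter_sentential': 0, 'intra_sentential': 3}
```

A pipeline that re-detects language only at sentence boundaries can, at best, act on the first count. The second count is invisible to it.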

Figure: two pipelines from the same audio input. Fixed-language path: Audio in → Pick one language → (Spanish) → Spanish decoder → "store" becomes "estor", "milk" becomes "mil". Multilingual path: Audio in → Multilingual model → Per-token language ID → "Voy al store" tagged es, es, en.
Whole-utterance language detection commits to one language and mistranscribes the rest. Per-token identification keeps every word in the language it was actually spoken in.

Borrowed words and proper nouns

Even a fully monolingual sentence is rarely monolingual at the edges. Brand names, product names, and personal names keep their original phonology. A French speaker says "WhatsApp" and "iPhone" with English-shaped sounds in the middle of clean French. A German says "Customer Success Manager" inside a German sentence and means it as a German job title.

These are not errors by the speaker, so they should not be errors in the transcript. But a French-only decoder has no good entry for "WhatsApp" and will spell it the way a French word would sound, breaking search, analytics, and any downstream system keyed on the literal string. Proper nouns hurt most, because a misspelled name is often the one token a human needed. This is where context biasing and custom vocabulary earn their keep: you can hand the system the names it should expect.
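One way to picture custom vocabulary is as a score bonus for hypotheses that contain expected terms. The sketch below applies that bonus after the fact to an n-best list, which is a simplification: production systems typically bias inside the decoder, and nothing here reflects a particular product's API. The hypotheses, scores, boost value, and function name are all invented for illustration.

```python
def rescore_with_vocabulary(hypotheses, vocabulary, boost=2.0):
    """Pick the best hypothesis after rewarding expected vocabulary terms.

    hypotheses: list of (text, score) pairs, higher score is better.
    vocabulary: single-word terms the caller expects to hear.
    """
    vocab = {term.lower() for term in vocabulary}
    rescored = []
    for text, score in hypotheses:
        hits = sum(
            1 for token in text.lower().split() if token.strip(".,!?") in vocab
        )
        rescored.append((text, score + boost * hits))
    return max(rescored, key=lambda pair: pair[1])[0]


# Invented n-best list: the French-shaped spelling scores slightly higher.
n_best = [
    ("envoie-moi ça sur ouate sape", -4.1),
    ("envoie-moi ça sur WhatsApp", -5.0),
]

print(rescore_with_vocabulary(n_best, vocabulary=[]))
print(rescore_with_vocabulary(n_best, vocabulary=["WhatsApp", "iPhone"]))
```

Without the vocabulary the misspelled hypothesis wins; with it, the literal string that downstream systems key on is recovered.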

Phonetically similar words across languages

Some failures are subtle because the wrong answer sounds almost right. Languages share sounds, so a word in one language can be a near-homophone of a word in another, and a model biased toward the wrong language picks the wrong-language word with high confidence.

Take English "see" and Spanish "sí," or English "no" and Spanish "no" (same sound, different word, sometimes different intent), or Mandarin syllables that land close to English function words. When the language prior is wrong, these collisions resolve the wrong way every time, and confidence stays high because the acoustic match is good. The model is confident and wrong, which is harder to catch than an outright garble.
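The arithmetic behind this is ordinary Bayes' rule: the score for a word is its acoustic likelihood multiplied by the prior on its language. The sketch below uses invented numbers for a near-homophone pair to show how a strong language prior overrides a slightly better acoustic match. The function and its inputs are illustrative assumptions, not a description of any specific recognizer.

```python
def posterior(acoustic_likelihood, language_prior, word_language):
    """Combine acoustic likelihood with a language prior and normalise."""
    unnormalised = {
        word: acoustic_likelihood[word] * language_prior[word_language[word]]
        for word in acoustic_likelihood
    }
    total = sum(unnormalised.values())
    return {word: value / total for word, value in unnormalised.items()}


# Invented numbers: the audio fits Spanish "sí" marginally better.
acoustic = {"see": 0.48, "sí": 0.52}
word_language = {"see": "en", "sí": "es"}

locked_to_english = posterior(acoustic, {"en": 0.95, "es": 0.05}, word_language)
no_language_lock = posterior(acoustic, {"en": 0.50, "es": 0.50}, word_language)

print({word: round(p, 3) for word, p in locked_to_english.items()})
print({word: round(p, 3) for word, p in no_language_lock.items()})
```

With the English-heavy prior, "see" ends up with a posterior of roughly 0.95 even though the acoustics slightly favored "sí". That is the confident-and-wrong case: the reported confidence reflects the prior, not the audio.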

Per-token language identification

The fix is to treat language as something the model predicts rather than something you set in advance. A model trained on many languages at once, decoding without a hard language lock, assigns a language to each token as it goes. "Voy al store" comes back as three tokens tagged Spanish, Spanish, English. Nobody had to declare the language up front, because the model infers it word by word from the audio it heard.
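As a sketch of what a per-token result might look like to a consumer, the snippet below models a token as text plus a language tag and groups adjacent tokens into same-language runs. The Token class and its field names are hypothetical and do not reproduce any particular API's response format.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Token:
    text: str
    language: str  # tag predicted by the model, not set by the caller


def language_runs(tokens):
    """Group adjacent tokens that share a language into (language, text) runs."""
    runs = []
    for language, group in groupby(tokens, key=lambda token: token.language):
        runs.append((language, " ".join(token.text for token in group)))
    return runs


tokens = [Token("Voy", "es"), Token("al", "es"), Token("store", "en")]
print(language_runs(tokens))  # [('es', 'Voy al'), ('en', 'store')]
```

Grouping into runs is a convenience for downstream steps such as per-language formatting or translation; the switch points fall out of the data instead of being guessed from pauses.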

This is the same machinery described in language identification, pushed down from the utterance to the token. It is harder to train: you need data that contains real code-switching, not two monolingual corpora stapled together, because the hard cases live at the switch points. But it is the only design that matches how bilingual people talk. A system that supports many languages but only one at a time has solved separate monolingual problems without handling the case where languages mix inside one utterance. (See multilingual speech AI for why this is hard to train and to evaluate.)

Script and transliteration

There is a second decision behind the words: which script to write them in. Hindi can be written in Devanagari (मैं) or transliterated into the Latin alphabet (main). Both are legitimate, and which one a user wants depends on the application: a chat product may want Latin Hinglish because that is how people type it, while a formal transcript may want Devanagari.

For a code-switched sentence this gets sharper. If half the words are English in Latin script and half are Hindi, do you render the Hindi in Devanagari and produce a mixed-script line, or transliterate everything into one alphabet? No single answer is universally right, but one is clearly wrong: deciding silently and giving the user no control. Script is a product choice, and it should be exposed as one.
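Exposing script as a choice can be as small as one parameter on the rendering step. The sketch below assumes per-token language tags are available and uses a tiny hand-written lookup table for the Hindi words in the earlier example. A real system would need a proper transliteration component, and the option names "latin" and "native" are invented here.

```python
# Tiny hand-written table covering the example only; not a transliteration system.
DEVANAGARI = {
    "main": "मैं",
    "kal": "कल",
    "jaa": "जा",
    "raha": "रहा",
    "hoon": "हूँ",
}


def render(tokens, script="latin"):
    """Render tagged tokens with Hindi in Latin script or in Devanagari.

    script="latin": leave every token as romanised text.
    script="native": write Hindi tokens in Devanagari, producing a mixed-script line.
    """
    if script not in ("latin", "native"):
        raise ValueError(f"unknown script option: {script!r}")
    rendered = []
    for text, language in tokens:
        if script == "native" and language == "hi":
            # Fall back to the romanised form when the table has no entry.
            rendered.append(DEVANAGARI.get(text.lower(), text))
        else:
            rendered.append(text)
    return " ".join(rendered)


tokens = [
    ("main", "hi"), ("kal", "hi"), ("office", "en"),
    ("jaa", "hi"), ("raha", "hi"), ("hoon", "hi"),
]

print(render(tokens, script="latin"))
print(render(tokens, script="native"))
```

The default here is arbitrary. What matters is that the caller can see the option and change it, instead of inheriting a silent decision.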

Common questions

Is code-switching the same as having an accent?

No. An accent is one language pronounced with the sound habits of another. Code-switching is using two languages, with both vocabularies and often both grammars, in the same stretch of speech. A system can handle accents well and still fail completely at code-switching, because accent handling does not require knowing a second language's words.

Can I just run two separate recognizers and merge the results?

You can, and it is a common stopgap, but it degrades exactly where you need it most. Each recognizer mistranscribes the other language's words with high confidence, so merging by confidence picks the wrong word at switch points. It also doubles cost and latency. A single multilingual model that does per-token language identification avoids the merge problem entirely.
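A small sketch of why confidence-based merging fails at the switch point. The word lists and confidence values are invented, and the merge assumes the two hypotheses are already aligned word for word, which real outputs rarely are.

```python
def merge_by_confidence(hypothesis_a, hypothesis_b):
    """Keep, position by position, whichever word has the higher confidence.

    Assumes both hypotheses are aligned one-to-one (a strong simplification).
    """
    return [a if a[1] >= b[1] else b for a, b in zip(hypothesis_a, hypothesis_b)]


# Invented outputs for "Voy al store": (word, confidence).
spanish_recognizer = [("voy", 0.97), ("al", 0.95), ("estor", 0.91)]
english_recognizer = [("boy", 0.62), ("all", 0.58), ("store", 0.88)]

merged = merge_by_confidence(spanish_recognizer, english_recognizer)
print(" ".join(word for word, _ in merged))  # voy al estor
```

The Spanish recognizer is confident about "estor" because the acoustic match to a Spanish-shaped string is good, so the merge keeps the wrong word exactly where the languages change.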

Why does whole-utterance language detection still fail on code-switched audio?

Because it answers the wrong question. Detecting that an utterance is "70 percent Hindi" tells you nothing useful about the 30 percent that is English, and it is those English words (often the names and nouns that carry the meaning) that get destroyed. Language has to be resolved at the word level to keep a code-switched transcript intact.

Which language pairs are hardest?

Pairs that share sounds and switch often. Mandarin-English, Spanish-English, and Hindi-English are heavily studied because the communities are large and the switching is dense. Difficulty rises with phonetic overlap (more near-homophones to confuse) and with how deep into the sentence the switching goes.

Building with Soniox? Code-switching is handled natively by the multilingual model, which identifies language per token without a fixed language setting. See the Soniox documentation.
