Custom vocabulary and context biasing in speech recognition

A recognizer has heard "nice" a million times and your company name zero times, so when someone says "Soniox," it writes "so nice" or "Sonic's": it chooses common real words over a name it has never encountered. A cardiologist's "amiodarone" gets the same treatment, and so does the surname of every third person on a customer call. The model is doing what it was trained to do: prefer the words it has already seen.

Context biasing changes what the recognizer expects without changing the model itself.

Preference for common words

A speech recognizer bets on the most probable transcription given the sound. That probability blends two things: how well the words match the audio, and how likely those words are as language. The second term is the language prior, and it lets recognition hold up against noise and accents. It also buries rare words. A made-up product name and a common phrase can be acoustically close, and the prior tips the scale toward the phrase every time.
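One common way to write that blend is the classic noisy-channel formulation below. It is a sketch, not a description of any particular recognizer: the symbols $X$ (the audio), $W$ (a candidate word sequence), and the interpolation weight $\lambda$ are notation introduced here, and many end-to-end models fold both terms into a single network without separating them this cleanly.

```latex
\hat{W} \;=\; \arg\max_{W} \Big[ \underbrace{\log P(X \mid W)}_{\text{acoustic match}} \;+\; \lambda \, \underbrace{\log P(W)}_{\text{language prior}} \Big]
```

With made-up numbers and $\lambda = 1$: if "so nice" scores $-4.0$ acoustically with a prior of $-3.0$, and "Soniox" scores $-3.5$ acoustically with a prior of $-9.0$, the totals are $-7.0$ and $-12.5$. The rare name matches the audio slightly better and still loses by a wide margin.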

The words that suffer are the ones that matter most in a deployment: personal and place names, brand and product names, medical and legal and technical jargon, and strings that are not really words at all, such as part numbers and reference codes (a separate problem, covered in alphanumerics in speech recognition). All are underrepresented in any general training set, so they sit low in the prior and rarely win against a common-word alternative.

How context biasing changes decoding

Biasing nudges that language prior at runtime. Instead of retraining, you hand the recognizer a small amount of context, and it temporarily raises the probability of the words and phrases you named. A close acoustic match to one of those terms can then win against the everyday word it used to lose to.

The mechanism goes by several names: shallow fusion, on-the-fly rescoring, keyword boosting. The implementations differ, but the effect is the same: the terms you supply get a thumb on the scale during decoding.[1] This is a bias, not a filter. A boosted word is made more likely, not mandatory, so audio that clearly is not your term still transcribes normally.
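A minimal sketch of the idea, assuming a simplified setup: the recognizer has already produced a short list of candidate transcriptions with log-domain scores, and biasing adds a fixed bonus for each boosted word. Real shallow fusion applies the bonus token by token inside the beam search, not over finished hypotheses. The function name `rescore`, the score values, and the boost scale are all hypothetical.

```python
import math

def rescore(hypotheses, boosts):
    """Return the best hypothesis after adding a bonus for each boosted word.

    hypotheses: list of (text, acoustic_logp, prior_logp) tuples
    boosts: dict mapping a lowercase word to an additive log-domain bonus
            (hypothetical scale, chosen for illustration)
    """
    best_text, best_score = None, -math.inf
    for text, acoustic_logp, prior_logp in hypotheses:
        # Sum the bonus for every boosted word in this hypothesis.
        bonus = sum(boosts.get(word, 0.0) for word in text.lower().split())
        score = acoustic_logp + prior_logp + bonus
        if score > best_score:
            best_text, best_score = text, score
    return best_text

# Acoustically close call: the rare name matches slightly better but has a poor prior.
close_call = [("so nice", -4.0, -3.0), ("Soniox", -3.5, -9.0)]
# Audio that clearly is not the boosted term.
clear_audio = [("so nice", -2.0, -3.0), ("Soniox", -9.0, -9.0)]

print(rescore(close_call, {}))                # so nice
print(rescore(close_call, {"soniox": 6.0}))   # Soniox
print(rescore(clear_audio, {"soniox": 6.0}))  # so nice
```

The third call is the "bias, not a filter" property: the bonus is large enough to flip a close call, but not large enough to override audio that plainly says something else.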

Types of context biasing

Context biasing comes as three related controls, from softest to hardest.

The everyday tool is a context or vocabulary list: the terms you expect, optionally with weights. A meeting bot loads the attendees' names, and a pharmacy line loads its drug formulary. Supplying the term, and sometimes a hint of how it is pronounced or used, flips it from a reliable miss to a reliable hit.

Language hints are a soft steer toward the languages you expect, without forbidding others. On multilingual audio this sharpens language identification and keeps the recognizer from drifting into the wrong language on ambiguous sounds, while it still copes if someone says something unexpected.

Language restrictions are the hard constraint: they forbid any language outside a named set. This is the strongest and most dangerous lever, because anything you leave off the list becomes impossible to transcribe correctly. Reach for it only when you are certain of the possible languages and a stray one would cause real harm, as with a regulated form that accepts exactly two languages.

Lever	What it does	When to use	Risk if misused
Vocabulary / context list	Raises odds of named terms	Names, jargon, codes	Over-triggers if weighted too high
Language hints	Softly favors expected languages	Known but not guaranteed languages	Low; it only biases
Language restrictions	Forbids unlisted languages	Certain, closed language set	Off-list language becomes untranscribable

Three levers, from softest to hardest. Each constrains the recognizer more, and risks more if you are wrong about what will be said.
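The difference between a hint and a restriction can be sketched on a toy language-identification posterior. This is an illustration under stated assumptions, not how any specific recognizer implements it: the function names, the multiplicative `boost` factor, and the example probabilities are all hypothetical.

```python
def apply_language_hints(posterior, hints, boost=3.0):
    """Soft steer: scale hinted languages up by `boost`, then renormalize."""
    scaled = {
        lang: p * (boost if lang in hints else 1.0)
        for lang, p in posterior.items()
    }
    total = sum(scaled.values())
    return {lang: p / total for lang, p in scaled.items()}

def apply_language_restriction(posterior, allowed):
    """Hard constraint: zero out every language outside `allowed`."""
    kept = {lang: (p if lang in allowed else 0.0) for lang, p in posterior.items()}
    total = sum(kept.values())
    if total == 0.0:
        raise ValueError("no allowed language has any probability mass")
    return {lang: p / total for lang, p in kept.items()}

# The speaker is clearly using Portuguese, which nobody expected.
posterior = {"en": 0.05, "es": 0.05, "pt": 0.90}

hinted = apply_language_hints(posterior, hints={"en", "es"})
restricted = apply_language_restriction(posterior, allowed={"en", "es"})

print(max(hinted, key=hinted.get))  # pt
print(restricted["pt"])             # 0.0
```

With hints, Portuguese loses some probability mass but still wins, so the unexpected speech is transcribed. With the restriction, Portuguese drops to zero and the recognizer is forced to choose between two languages that are both wrong.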

Avoiding excessive bias

The discipline of biasing is restraint. A short, accurate context list of the terms that actually appear beats a giant dump of every word in your industry, because every term you boost is a term the recognizer is now more willing to hear. Load the fifty product names that come up on your calls, not the ten thousand SKUs in the catalog. The catalog dump feels thorough but does real damage: the recognizer is now primed to mishear ordinary speech as part numbers nobody said.
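One way to keep a list short is to derive it from what is actually said. The sketch below assumes you have human-corrected reference transcripts (raw recognizer output would omit the very terms being missed) and a catalog of single-word terms. The function name, the thresholds, and the product names are hypothetical.

```python
from collections import Counter

def select_context_terms(transcripts, catalog, max_terms=50, min_count=2):
    """Keep only catalog terms that actually occur in reference transcripts."""
    # Map each lowercase form back to its catalog spelling.
    by_lower = {term.lower(): term for term in catalog}
    counts = Counter()
    for text in transcripts:
        for raw in text.split():
            # Strip surrounding punctuation before matching.
            word = raw.strip(".,!?;:\"'").lower()
            if word in by_lower:
                counts[by_lower[word]] += 1
    # Drop rarely seen terms, then cap the list at max_terms.
    frequent = [term for term, n in counts.most_common() if n >= min_count]
    return frequent[:max_terms]

transcripts = [
    "I ordered the Zentrix yesterday",
    "my zentrix stopped working",
    "do you sell the Quorvo?",
]
catalog = ["Zentrix", "Quorvo", "Palvex"]

print(select_context_terms(transcripts, catalog))  # ['Zentrix']
```

Multi-word terms would need phrase matching, which this sketch leaves out. The point is the filter itself: a term earns a place on the list by appearing in real audio, not by appearing in the catalog.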

Weights deserve the same caution. If a term keeps losing close calls, raise it a little. If a boosted term starts appearing where it should not, lower it. Tune against real transcripts of real audio, not the demo sentence; the same lesson runs through evaluating recognizers beyond WER. Context biasing is one of the most powerful knobs in a deployment and one of the easiest to overuse, because the failures it fixes are so predictable that it tempts you to overcorrect.
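The raise-a-little, lower-a-little rule can be written as a small update step. This is a sketch under assumptions: the miss and false-trigger counts come from comparing recognizer output against reference transcripts of real audio, and the step size and weight range are hypothetical values, not any vendor's scale. Letting false triggers take priority over misses is a design choice made here, not something the rule itself dictates.

```python
def adjust_weight(weight, misses, false_triggers, step=0.5, floor=0.0, ceiling=10.0):
    """Nudge one term's weight by a single small step.

    misses: times the term was spoken but transcribed as something else
    false_triggers: times the term appeared where it was not spoken
    """
    if false_triggers > 0:
        # Over-triggering corrupts ordinary speech, so back off first.
        weight -= step
    elif misses > 0:
        weight += step
    # Keep the weight inside the allowed range.
    return min(ceiling, max(floor, weight))

print(adjust_weight(2.0, misses=3, false_triggers=0))  # 2.5
print(adjust_weight(2.0, misses=3, false_triggers=1))  # 1.5
print(adjust_weight(2.0, misses=0, false_triggers=0))  # 2.0
```

One small step per evaluation round, followed by another pass over the same real audio, guards against the overcorrection described above.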

Common questions

What is the difference between custom vocabulary and context biasing?

Custom vocabulary is the most familiar form of context biasing: a list of names, terms, or codes the model should be ready to hear. Context biasing is the broader practice, with three levers of increasing strength: vocabulary lists, language hints, and language restrictions, all of which nudge the language prior at runtime instead of retraining the model.

Will adding a word to my vocabulary list guarantee it is transcribed?

No, by design, and you do not want it to. Biasing is a bias, not a filter: a boosted word is made more likely, not mandatory, so it still has to match the audio to win. If you push the weight high enough to guarantee a term, the recognizer starts hearing it where it is not, and "so anxious" comes out as your brand name.

When should I use language restrictions instead of hints?

Hints by default, restrictions almost never. A restriction is the strongest and most dangerous lever, because anything you leave off the list becomes impossible to transcribe correctly. Reach for it only when the language set is certain and closed and a stray language would cause real harm, such as a regulated form that accepts exactly two.

Does context biasing help with phone numbers and IDs?

Barely. Biasing works on words and named phrases, but the difficulty in phone numbers, account IDs, and emails is formatting digits and symbols, not recognizing rare words, so a vocabulary list has little to grip. That is a distinct problem, covered in alphanumerics in speech recognition.

References

[1] Gong, X., Lv, A., Wang, Z., Zhu, H., & Qian, Y. (2025). BR-ASR: Efficient and Scalable Bias Retrieval Framework for Contextual Biasing ASR in Speech LLM. arXiv preprint arXiv:2505.19179.