Voice AI glossary: Terms used throughout The Voice AI Wiki

A

Acoustic model

The part of a speech recognizer that scores how well a slice of audio matches each speech sound. Pairs with a language model, which scores which words plausibly follow which. See how speech-to-text works.

Alphanumerics

Strings that mix letters and digits: order IDs, license plates, reference codes. Hard to transcribe because digits sound alike and the right formatting depends on what kind of value it is. See alphanumerics.

ASR

Automatic speech recognition. The task of turning spoken audio into text, also called speech-to-text or STT. See what is speech recognition.

B

Backchannel

A short sound a listener makes ("mm-hm," "right," "yeah") to signal "I'm following, keep going," without claiming the floor. Naive turn detection mistakes it for a bid to speak. See turn-taking and barge-in.

Barge-in

A user interrupting the agent mid-reply, expecting it to stop and listen. The agent has to keep listening while it talks and cancel its own speech instantly. See turn-taking and barge-in.

Beam search

A decoding strategy that keeps the few most promising partial transcripts alive at once instead of committing to the single best word at each step, so a guess that starts weak can still win if it ends up more plausible. The beam width, how many candidates you carry, trades accuracy against speed.
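
A minimal sketch of the idea, assuming a hypothetical toy model (`TOY_MODEL`) that gives next-word probabilities for each prefix. The words and numbers are invented for illustration; a real recognizer scores candidates from audio.

```python
import math

# Hypothetical toy model: next-word probabilities conditioned on the prefix so far.
TOY_MODEL = {
    (): {"wreck": 0.6, "recognize": 0.4},
    ("wreck",): {"a": 0.5, "the": 0.5},
    ("recognize",): {"speech": 0.9, "peach": 0.1},
}


def beam_search(model, steps, beam_width):
    """Return the surviving (prefix, log-probability) pairs, best first."""
    beams = [((), 0.0)]
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            # Extend every surviving prefix by every word the model allows next.
            for word, prob in model.get(prefix, {}).items():
                candidates.append((prefix + (word,), score + math.log(prob)))
        if not candidates:
            break
        # Keep only the beam_width highest-scoring candidates.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_width]
    return beams


greedy = beam_search(TOY_MODEL, steps=2, beam_width=1)
wider = beam_search(TOY_MODEL, steps=2, beam_width=2)
```

With a width of 1 the search commits to "wreck" after the first step and never recovers. With a width of 2, "recognize" survives the first step and "recognize speech" finishes with the higher total score.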

C

Code-switching

Switching between two or more languages within one conversation, sometimes one sentence. Breaks recognizers that assume a single language. See code-switching.

Confidence score

A number a recognizer attaches to a word or segment estimating how sure it is. Useful for flagging uncertain output, but not a probability of correctness. See confidence scores.

Context biasing

Supplying a recognizer with words it should expect (names, products, jargon) so it favors them when the audio is ambiguous. Also called custom vocabulary. See context biasing.

CTC

Connectionist temporal classification. A training method (Graves et al., 2006) that maps audio frames to text without anyone first labeling which frame belongs to which sound, by allowing a blank symbol and collapsing repeats. It removed the hand-alignment step older systems needed and became a foundation of end-to-end recognition.
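
The collapsing rule can be sketched in a few lines. This shows only how a per-frame label sequence is reduced to text (merge repeats, then drop blanks), not the training loss itself. The `_` blank symbol and the sample frame string are assumptions for illustration.

```python
from itertools import groupby

BLANK = "_"  # assumed blank symbol for this sketch


def ctc_collapse(frame_labels, blank=BLANK):
    """Merge consecutive repeated labels, then remove blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != blank)


text = ctc_collapse("hh_e_ll_lloo")
```

The blank between the two runs of "l" is what lets the double letter in "hello" survive the merge; without it, the repeats would collapse into one.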

D

DER

Diarization Error Rate. The share of reference speech time that is attributed incorrectly: missed speech, false alarms, and speaker confusion added together and divided by the total speech time. The standard score for speaker diarization.
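
As arithmetic, the score is the three error durations added together over the total reference speech time. A minimal sketch, with made-up durations in seconds:

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """All arguments are durations in seconds; returns a fraction."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


# Illustrative numbers: 6 s missed, 3 s false alarm, 9 s confused, 120 s of speech.
der = diarization_error_rate(missed=6.0, false_alarm=3.0, confusion=9.0, total_speech=120.0)
```

Here 18 seconds of error over 120 seconds of speech gives a DER of 0.15, or 15%. Because false alarms count toward the numerator, the score can exceed 100% on badly over-segmented audio.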

Diarization

Partitioning audio by speaker to answer "who spoke when," using anonymous labels (Speaker 1, Speaker 2) rather than names. See speaker diarization.

E

Endpoint detection

Deciding when a speaker has finished an utterance so the system can finalize and respond. Also called endpointing or end-of-turn detection. See endpoint detection.
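
The simplest form is a silence timer: declare the utterance over once speech has been heard and a fixed stretch of silence follows. A minimal sketch, assuming per-frame speech/non-speech flags are already available (for example from a VAD); the 20 ms frame and 500 ms timeout are illustrative values, not recommendations.

```python
def find_endpoint(speech_flags, frame_ms=20, silence_ms=500):
    """Return the index of the frame where the utterance is declared over, or None."""
    frames_needed = silence_ms // frame_ms
    heard_speech = False
    silent_run = 0
    for index, is_speech in enumerate(speech_flags):
        if is_speech:
            heard_speech = True
            silent_run = 0  # any speech resets the timer
        elif heard_speech:
            silent_run += 1
            if silent_run >= frames_needed:
                return index
    return None


# A short pause mid-utterance (10 frames) does not trigger; the long trailing silence does.
flags = [True] * 3 + [False] * 10 + [True] * 3 + [False] * 25
endpoint = find_endpoint(flags)
```

The timeout is the whole trade-off: shorter values respond faster but cut off speakers who pause to think, longer values do the reverse.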

End-to-end model

One neural network trained to map audio directly to text (or audio to audio), replacing the older chain of separate acoustic model, dictionary, and language model. See what is speech recognition.

F

Forced alignment

Given audio and its known transcript, computing the start and end time of each word. The basis of word-level timestamps. See timestamps and forced alignment.

Formant

A resonance of the vocal tract, a band of frequencies the throat and mouth amplify as you shape them. Each vowel is essentially a pattern of two or three formants, which show up as horizontal bands of concentrated energy on a spectrogram.

G

G2P

Grapheme-to-phoneme conversion: mapping spelled words to the phonemes they sound like, needed because spelling does not predict pronunciation. See how neural TTS works.

H

Hallucination

When a recognizer outputs words that were never spoken, often fluent and confident, usually during silence or noise. See ASR hallucinations.

I

Interim results

See partial results.

ITN

Inverse text normalization. Converting spoken-form text into written form: "twenty twenty four" becomes "2024," "five dollars" becomes "$5." See punctuation and inverse text normalization.

K

Keyword spotting

Detecting specific words or phrases in audio without transcribing everything; wake words are the familiar case. See keyword spotting and wake words.

L

Language identification

Detecting which language is being spoken, ideally per word rather than per whole utterance. See language identification.

Latency

The delay between input and response. In streaming recognition it splits into time-to-first-token and finalization delay. See speech-to-text latency.

M

MFCC

Mel-frequency cepstral coefficients. A compact set of numbers (around 13 per frame) that summarize a spectrogram on a frequency scale tuned to human hearing, keeping what distinguishes one sound from another. The classic input feature, used everywhere before models learned to extract their own.
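
The "frequency scale tuned to human hearing" is the mel scale. One widely used conversion formula is sketched below; other variants exist, so treat the constants as one common convention rather than the only definition.

```python
import math


def hz_to_mel(hz):
    """Convert a frequency in hertz to mels using a common log-based formula."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


low_step = hz_to_mel(2000) - hz_to_mel(1000)
high_step = hz_to_mel(8000) - hz_to_mel(7000)
```

The same 1,000 Hz step covers far more mels at the low end than at the high end, which is the point: the scale spends its resolution where hearing is most sensitive to pitch differences.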

MOS

Mean Opinion Score. The average of listener ratings from 1 to 5, the long-standing measure of synthetic speech quality. Saturates once systems all score above ~4.3. See evaluating TTS.

Mu-law

A companding scheme that squeezes audio into 8 bits per sample by giving quiet sounds finer resolution than loud ones, the standard on North American and Japanese telephony. Part of why a phone call sounds thin and grainy compared to the original voice. See telephony transcription.
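
The companding curve can be sketched with the continuous mu-law formula, using mu = 255 and samples normalized to the range -1.0 to 1.0. Deployed telephony codecs use a piecewise-linear approximation of this curve plus 8-bit quantization, which this sketch leaves out.

```python
import math

MU = 255  # the value used for 8-bit mu-law


def mulaw_compress(x):
    """Map a sample in [-1, 1] onto the companded scale, also in [-1, 1]."""
    x = max(-1.0, min(1.0, x))
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mulaw_expand(y):
    """Invert mulaw_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


quiet = mulaw_compress(0.01)
```

A sample at 1% of full scale lands at roughly 23% of the companded range, so quiet sounds get a disproportionate share of the available code values.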

P

Partial results

The provisional transcript a streaming recognizer shows while you are still speaking, revised as more audio arrives, before it commits to a final. See partial vs final results.

PCM

Pulse-code modulation. Raw, uncompressed digital audio: a list of amplitude samples. The baseline format most recognizers accept. See audio formats.
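
To make "a list of amplitude samples" concrete, here is a minimal sketch that decodes raw bytes into integers. It assumes 16-bit signed little-endian mono, a common layout but not the only one; the sample bytes are made up.

```python
import struct


def decode_pcm16le(raw):
    """Decode 16-bit signed little-endian PCM bytes into a list of integers."""
    count = len(raw) // 2  # two bytes per sample; a trailing odd byte is ignored
    return list(struct.unpack("<%dh" % count, raw[: count * 2]))


samples = decode_pcm16le(b"\x00\x00\xff\x7f\x00\x80")
```

The three samples decode to silence, the maximum positive value, and the maximum negative value for 16-bit audio.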

Phoneme

The smallest unit of sound that can change a word's meaning: swap the /p/ in "pat" for /b/ and you get "bat," so /p/ and /b/ are separate phonemes. English has around 44; the count varies by language and by who is counting.

Prosody

The melody of speech: rhythm, stress, intonation. What makes synthetic speech sound engaged rather than flat. See prosody.

R

Real-time factor

Processing time divided by audio duration, the answer to "can it keep up?" An RTF of 0.5 means a minute of audio is transcribed in thirty seconds. An RTF of exactly 1.0 leaves no headroom, and anything above it falls behind a live stream with a backlog that only grows.
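
A minimal sketch of the arithmetic. The `lag_seconds` helper is a hypothetical illustration of how far behind a recognizer would be after a given amount of live audio, assuming a constant RTF.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Processing time divided by audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def lag_seconds(rtf, elapsed_audio_seconds):
    """How far behind live audio the output is, assuming a constant RTF."""
    return max(0.0, (rtf - 1.0) * elapsed_audio_seconds)


rtf = real_time_factor(processing_seconds=30.0, audio_seconds=60.0)
lag = lag_seconds(1.2, elapsed_audio_seconds=60.0)
```

At an RTF of 1.2, one minute into a live stream the recognizer is already about 12 seconds behind, and the gap widens by another 12 seconds every minute.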

S

Sample rate

How many times per second audio is measured, in hertz. Telephony is 8 kHz; most modern recognition uses 16 kHz; CD audio is 44.1 kHz. Higher rates keep more high-frequency detail, because a recording can only represent frequencies up to half its sample rate. See audio formats.

Semantic VAD

A common industry name for semantic endpointing: reading the partial transcript to judge whether a sentence is actually complete, rather than trusting a fixed silence timer. A misnomer by this wiki's own definitions, since the decision is about turn completion, not voice activity. See endpoint detection.

SIP

Session Initiation Protocol. The signaling standard that sets up and tears down voice-over-IP calls, the on-ramp for putting agents on the phone. See voice agents on the phone.

Speaker embedding

A compact vector describing what a voice sounds like, separate from what it says: the "who" that neural TTS, diarization, and cloning all work with. See TTS voices.

Speaker identification

Determining which known, enrolled person is speaking. Distinct from diarization, which stays anonymous. See diarization vs speaker identification vs verification.

Speaker verification

Confirming a person is who they claim to be from their voice, a yes/no decision against one enrolled voiceprint. See diarization vs speaker identification vs verification.
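
One common way to make that yes/no decision is to compare speaker embeddings with cosine similarity and apply a threshold. A minimal sketch: the three-dimensional vectors and the 0.7 threshold are invented for illustration, and real embeddings have hundreds of dimensions with thresholds tuned on held-out data.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / norm


def verify(enrolled, probe, threshold=0.7):
    """Accept the claim if the probe is close enough to the enrolled voiceprint."""
    return cosine_similarity(enrolled, probe) >= threshold


same_speaker = verify([1.0, 0.0, 0.0], [0.9, 0.1, 0.0])
different_speaker = verify([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

Raising the threshold rejects more impostors and more genuine speakers alike; where to set it depends on which mistake costs more.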

Spectrogram

A picture of audio with time along one axis, frequency up the other, and brightness for how much energy sits at each frequency moment to moment. Unlike a raw waveform, it shows the frequency patterns that distinguish one speech sound from another. See how speech-to-text works.

SRT

SubRip Subtitle, a plain-text caption format with numbered cues and start/end timestamps. See captions and subtitles.
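
A minimal sketch of writing one cue: a number, a timestamp line in `HH:MM:SS,mmm --> HH:MM:SS,mmm` form, the text, and a blank line. The function names are hypothetical helpers, not part of any library.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as HH:MM:SS,mmm (note the comma before milliseconds)."""
    total_ms = round(seconds * 1000)
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def srt_cue(index, start, end, text):
    """Build one numbered cue, ending with the blank line that separates cues."""
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n\n"


cue = srt_cue(1, 0.0, 2.5, "Hello, and welcome.")
```

Word-level timestamps from the recognizer supply the start and end values; grouping words into readable cues is a separate decision.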

SSML

Speech Synthesis Markup Language. XML tags that tell a TTS engine how to say something: pauses, emphasis, pronunciation. See pronunciation control.
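
A small illustrative document, held in a Python string and parsed with the standard library only to confirm it is well-formed XML. The `speak`, `emphasis`, `break`, and `phoneme` elements come from the SSML standard, but engines differ in which tags and attributes they honor, so treat this as a sketch rather than something every TTS service will accept unchanged.

```python
import xml.etree.ElementTree as ET

# Illustrative SSML; the sentence and the IPA string are made up for this example.
SSML = (
    "<speak>"
    'Your order ships <emphasis level="strong">today</emphasis>.'
    '<break time="500ms"/>'
    'It contains one <phoneme alphabet="ipa" ph="təˈmeɪtoʊ">tomato</phoneme>.'
    "</speak>"
)

root = ET.fromstring(SSML)
tags = [element.tag for element in root.iter()]
```

The three child tags cover the three kinds of control named above: `emphasis` for stress, `break` for a pause, and `phoneme` for an explicit pronunciation.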

Streaming

Processing audio incrementally as it arrives and emitting output before the input is complete, the opposite of batch processing a finished file. See streaming speech recognition.

STT

Speech-to-text. A synonym for ASR. See what is speech recognition.

T

Time-to-first-audio

The delay between requesting speech and hearing the first sound, the latency metric that decides whether a synthetic voice feels responsive. See streaming TTS.

TTS

Text-to-speech. Turning written text into spoken audio. See what is text-to-speech.

Turn detection

Deciding when control of a conversation should pass between participants: when the agent should start, keep listening, or stop because it was interrupted. Broader than endpointing. See VAD vs endpointing vs turn detection.

U

Utterance

A continuous stretch of speech from one speaker, bounded by silence on either side or by the other person taking over. It is the chunk an endpointer is trying to decide has ended.

V

VAD

Voice activity detection. Frame-by-frame classification of audio as speech or non-speech. The lowest layer beneath endpointing and turn detection. See voice activity detection.
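
Frame-by-frame classification can be sketched with a crude energy threshold. This is a stand-in for illustration: production VADs are usually trained models, and the frame length (160 samples, which is 10 ms at 16 kHz) and threshold here are assumptions.

```python
import math


def frame_is_speech(frame, threshold=0.02):
    """Label one frame as speech if its RMS energy reaches the threshold."""
    rms = math.sqrt(sum(sample * sample for sample in frame) / len(frame))
    return rms >= threshold


def energy_vad(samples, frame_len=160, threshold=0.02):
    """Split samples into whole frames and classify each one."""
    last_start = len(samples) - frame_len
    return [
        frame_is_speech(samples[start:start + frame_len], threshold)
        for start in range(0, last_start + 1, frame_len)
    ]


# Synthetic signal: one quiet frame, one loud frame, one quiet frame.
quiet = [0.001] * 160
loud = [0.1, -0.1] * 80
flags = energy_vad(quiet + loud + quiet)
```

An energy threshold cannot tell speech from a door slam or background music, which is the main reason learned VADs replaced it.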

Vocoder

The model that turns a spectrogram into an actual waveform, the final stage of classic neural TTS. The name comes from Homer Dudley's 1930s voice coder. See how neural TTS works.

Voice agent

Software you talk to that talks back in real time, wiring recognition, a decision model, and synthesis into a live loop with turn-taking. See what is a voice agent.

Voice cloning

Building a synthetic voice that imitates a specific person, sometimes from seconds of audio, which puts consent and provenance front and center. See voice cloning.

W

Wake word

The trigger phrase an always-on detector listens for before waking the full system: "Alexa," "Hey Siri." See keyword spotting and wake words.

WebSocket

A persistent, two-way connection over a single TCP socket, the usual transport for streaming audio up and transcripts down. See streaming speech recognition.

WER

Word Error Rate. The share of words a recognizer gets wrong, counting substitutions, deletions, and insertions against a reference. The headline accuracy metric, and a coarse one. See word error rate.
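
The count comes from a word-level edit distance between the reference and the recognizer's output, divided by the number of reference words. A minimal sketch that splits on whitespace and does no text normalization, which real scoring pipelines usually apply first:

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between the ref words seen so far and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

One dropped word out of six gives a WER of about 0.167. Every error costs the same regardless of which word it hits, which is part of what makes the metric coarse, and insertions mean the score can exceed 1.0.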

X

x-vector

A neural speaker embedding (Snyder et al., 2018) that boils any segment of speech down to a fixed-length vector capturing who is talking, not what they said. Two clips from the same person land near each other in that space, which makes it the workhorse behind a generation of diarization systems. See speaker diarization.

Z

Zero-shot cloning

Copying a voice from a few seconds of reference audio, with no per-voice training. Captures timbre fast; captures personality far less. See voice cloning.