A
Acoustic model
The part of a speech recognizer that scores how well a slice of audio matches each speech sound. Pairs with a language model, which scores which words plausibly follow which. See how speech-to-text works.
Alphanumerics
Strings that mix letters and digits: order IDs, license plates, reference codes. Hard to transcribe because digits sound alike and the right formatting depends on what kind of value it is. See alphanumerics.
ASR
Automatic speech recognition. The task of turning spoken audio into text, also called speech-to-text or STT. See what is speech recognition.
B
Backchannel
A short sound a listener makes ("mm-hm," "right," "yeah") to signal "I'm following, keep going," without claiming the floor. Naive turn detection mistakes it for a bid to speak. See turn-taking and barge-in.
Barge-in
A user interrupting the agent mid-reply, expecting it to stop and listen. The agent has to keep listening while it talks and cancel its own speech instantly. See turn-taking and barge-in.
Beam search
A decoding strategy that keeps the few most promising partial transcripts alive at once instead of committing to the single best word at each step, so a guess that starts weak can still win if it ends up more plausible. The beam width, how many candidates you carry, trades accuracy against speed.
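A minimal sketch of the idea, assuming a hypothetical toy model (`TOY_MODEL`) that gives next-token probabilities for each prefix. The token names and numbers are invented for illustration; a real recognizer would score candidates with its acoustic and language models.

```python
import math

# Hypothetical toy model: probability of the next token given the prefix so far.
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.55, "y": 0.45},
    ("b",): {"x": 0.9, "y": 0.1},
}

def beam_search(model, steps, beam_width):
    """Return the best-scoring token sequence found with the given beam width."""
    beams = [((), 0.0)]  # (prefix, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            for token, prob in model[prefix].items():
                candidates.append((prefix + (token,), score + math.log(prob)))
        # Keep only the beam_width most promising partial transcripts.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0][0]

greedy = beam_search(TOY_MODEL, steps=2, beam_width=1)
wide = beam_search(TOY_MODEL, steps=2, beam_width=2)
print(greedy, wide)
```

With a width of 1 the search commits to "a" and ends with a total probability of 0.33; with a width of 2 it keeps "b" alive, and "b x" wins at 0.36.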
C
Code-switching
Switching between two or more languages within one conversation, sometimes one sentence. Breaks recognizers that assume a single language. See code-switching.
Confidence score
A number a recognizer attaches to a word or segment estimating how sure it is. Useful for flagging uncertain output, but not necessarily a calibrated probability of correctness. See confidence scores.
Context biasing
Supplying a recognizer with words it should expect (names, products, jargon) so it favors them when the audio is ambiguous. Also called custom vocabulary. See context biasing.
CTC
Connectionist temporal classification. A training method (Graves et al., 2006) that maps audio frames to text without anyone first labeling which frame belongs to which sound, by allowing a blank symbol and collapsing repeats. It removed the hand-alignment step older systems needed and became a foundation of end-to-end recognition.
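The collapsing rule can be sketched as follows. This shows only how a frame-level label sequence reduces to text (merge repeats, then drop blanks), not the training loss itself; the `_` blank symbol and the function name are illustrative choices.

```python
from itertools import groupby

BLANK = "_"  # illustrative choice of blank symbol

def ctc_collapse(frame_labels, blank=BLANK):
    """Merge consecutive repeated labels, then remove blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != blank)

print(ctc_collapse("hh_e_ll_lo_"))
```

The blank between the two runs of "l" is what lets a genuine double letter survive the merge step.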
D
DER
Diarization Error Rate. The share of speech time that is attributed incorrectly: missed speech, false alarms, and speaker confusion summed, then divided by the total reference speech time. The standard score for speaker diarization.
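As a worked example, assuming the conventional definition in which the three error durations are divided by the total reference speech time (the function name and the sample durations are made up for illustration):

```python
def diarization_error_rate(missed, false_alarm, confusion, reference_speech):
    """All arguments are durations in seconds; returns DER as a fraction."""
    if reference_speech <= 0:
        raise ValueError("reference_speech must be positive")
    return (missed + false_alarm + confusion) / reference_speech

# Illustrative numbers: 120 s of reference speech, 18 s of it in error.
der = diarization_error_rate(missed=6.0, false_alarm=3.0, confusion=9.0,
                             reference_speech=120.0)
print(f"{der:.1%}")
```

Because false alarms add time that is not in the reference, the result can exceed 100% on badly segmented audio.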
Diarization
Partitioning audio by speaker to answer "who spoke when," using anonymous labels (Speaker 1, Speaker 2) rather than names. See speaker diarization.
E
Endpoint detection
Deciding when a speaker has finished an utterance so the system can finalize and respond. Also called endpointing or end-of-turn detection. See endpoint detection.
End-to-end model
One neural network trained to map audio directly to text (or audio to audio), replacing the older chain of separate acoustic model, dictionary, and language model. See what is speech recognition.
F
Forced alignment
Given audio and its known transcript, computing the start and end time of each word. The basis of word-level timestamps. See timestamps and forced alignment.
Formant
A resonance of the vocal tract, a band of frequencies the throat and mouth amplify as you shape them. Each vowel is essentially a pattern of two or three formants, which show up as horizontal bands of concentrated energy on a spectrogram.
H
Hallucination
When a recognizer outputs words that were never spoken, often fluent and confident, usually during silence or noise. See ASR hallucinations.
I
Interim results
See partial results.
ITN
Inverse text normalization. Converting spoken-form text into written form: "twenty twenty four" becomes "2024," "fifty cents" becomes "$0.50." See punctuation and inverse text normalization.
L
Language identification
Detecting which language is being spoken, ideally per word rather than per whole utterance. See language identification.
Latency
The delay between input and response. In streaming recognition it splits into time-to-first-token and finalization delay. See speech-to-text latency.
M
MFCC
Mel-frequency cepstral coefficients. A compact set of numbers (around 13 per frame) that summarize a spectrogram on a frequency scale tuned to human hearing, keeping what distinguishes one sound from another. The classic input feature, used everywhere before models learned to extract their own.
MOS
Mean Opinion Score. The average of listener ratings from 1 to 5, the long-standing measure of synthetic speech quality. Saturates once systems all score above ~4.3. See evaluating TTS.
Mu-law
A companding scheme that squeezes audio into 8 bits per sample by giving quiet sounds finer resolution than loud ones, the standard on North American and Japanese telephony. Part of why a phone call sounds thin and grainy compared to the original voice. See telephony transcription.
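A sketch of the continuous mu-law curve with the usual telephony value of 255. This is the textbook formula only; real telephone codecs use a segmented approximation of it and then quantize to 8 bits, which is not shown here.

```python
import math

MU = 255  # value used in North American and Japanese telephony

def mu_law_compress(x, mu=MU):
    """Map a sample in [-1, 1] to [-1, 1], expanding quiet amplitudes."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def mu_law_expand(y, mu=MU):
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)

quiet = mu_law_compress(0.01)
print(round(quiet, 3))
```

A sample at 1% of full scale lands at roughly 23% of the compressed range, which is how quiet sounds end up with more of the available code values.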
P
Partial results
The provisional transcript a streaming recognizer shows while you are still speaking, revised as more audio arrives, before it commits to a final. See partial vs final results.
PCM
Pulse-code modulation. Raw, uncompressed digital audio: a list of amplitude samples. The baseline format most recognizers accept. See audio formats.
Phoneme
The smallest unit of sound that can change a word's meaning: swap the /p/ in "pat" for /b/ and you get "bat," so /p/ and /b/ are separate phonemes. English has around 44; the count varies by language and by who is counting.
Prosody
The melody of speech: rhythm, stress, intonation. What makes synthetic speech sound engaged rather than flat. See prosody.
R
Real-time factor
Processing time divided by audio duration, the answer to "can it keep up?" An RTF of 0.5 means a minute of audio is transcribed in thirty seconds; at 1.0 there is no headroom left, and anything above it falls behind a live stream and the backlog only grows.
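The arithmetic, as a small sketch. The function names are illustrative, and the lag estimate assumes a constant RTF over the whole stream.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Processing time divided by audio duration."""
    return processing_seconds / audio_seconds

def lag_after(stream_seconds, rtf):
    """Seconds the recognizer trails a live stream, assuming a constant RTF."""
    return max(0.0, (rtf - 1.0) * stream_seconds)

rtf = real_time_factor(processing_seconds=30.0, audio_seconds=60.0)
print(rtf, lag_after(60.0, rtf), lag_after(60.0, 1.2))
```

At an RTF of 1.2, one minute of live audio leaves the recognizer about twelve seconds behind, and the gap keeps widening for as long as the stream runs.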
S
Sample rate
How many times per second audio is measured, in hertz. Traditional telephony is 8 kHz; most modern recognition uses 16 kHz; CD-quality music is 44.1 kHz. Higher rates keep more high-frequency detail. See audio formats.
Semantic VAD
Endpointing that reads the partial transcript to judge whether a sentence is actually complete, rather than trusting a fixed silence timer. See endpoint detection.
SIP
Session Initiation Protocol. The signaling standard that sets up and tears down voice-over-IP calls, the on-ramp for putting agents on the phone. See voice agents on the phone.
Speaker identification
Determining which known, enrolled person is speaking. Distinct from diarization, which stays anonymous. See diarization vs speaker identification vs verification.
Speaker verification
Confirming a person is who they claim to be from their voice, a yes/no decision against one enrolled voiceprint. See diarization vs speaker identification vs verification.
Spectrogram
A picture of audio with time along one axis, frequency up the other, and brightness for how much energy sits at each frequency moment to moment. Unlike a raw waveform, it shows the frequency patterns that distinguish one speech sound from another. See how speech-to-text works.
SRT
SubRip Subtitle, a plain-text caption format with numbered cues and start/end timestamps. See captions and subtitles.
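A minimal sketch of writing one cue, assuming timings arrive as seconds (for example from word-level timestamps). The helper names are illustrative.

```python
def srt_timestamp(seconds):
    """Format seconds as HH:MM:SS,mmm (SRT uses a comma before milliseconds)."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

def srt_cue(index, start, end, text):
    """Build one numbered cue: index line, timing line, then the caption text."""
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"

print(srt_cue(1, 3.5, 6.0, "Hello there."))
```

In a full file, cues are separated from one another by a blank line.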
SSML
Speech Synthesis Markup Language. XML tags that tell a TTS engine how to say something: pauses, emphasis, pronunciation. See pronunciation control.
Streaming
Processing audio incrementally as it arrives and emitting output before the input is complete, the opposite of batch processing a finished file. See streaming speech recognition.
STT
Speech-to-text. A synonym for ASR. See what is speech recognition.
T
TTS
Text-to-speech. Turning written text into spoken audio. See what is text-to-speech.
Turn detection
Deciding when control of a conversation should pass between participants: when the agent should start, keep listening, or stop because it was interrupted. Broader than endpointing. See VAD vs endpointing vs turn detection.
U
Utterance
A continuous stretch of speech from one speaker, bounded by silence on either side or by the other person taking over. It is the chunk an endpointer is trying to decide has ended.
V
VAD
Voice activity detection. Frame-by-frame classification of audio as speech or non-speech. The lowest layer beneath endpointing and turn detection. See voice activity detection.
Voice agent
Software you talk to that talks back in real time, wiring recognition, a decision model, and synthesis into a live loop with turn-taking. See what is a voice agent.
Voice cloning
Building a synthetic voice that imitates a specific person, sometimes from seconds of audio, which puts consent and provenance front and center. See voice cloning.
W
WebSocket
A persistent, two-way connection over a single TCP connection, the usual transport for streaming audio up and transcripts down. See WebSockets vs HTTP for audio.
WER
Word Error Rate. The share of words a recognizer gets wrong, counting substitutions, deletions, and insertions against a reference. The headline accuracy metric, and a coarse one. See word error rate.
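A sketch of the calculation using word-level edit distance. It assumes whitespace tokenization and does no text normalization (casing, punctuation), which real evaluations usually apply first.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

print(word_error_rate("the cat sat", "the hat sat down"))
```

Here one substitution and one insertion against three reference words give a WER of about 0.67. Since insertions count, WER can exceed 1.0.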
X
x-vector
A neural speaker embedding (Snyder et al., 2018) that boils any segment of speech down to a fixed-length vector capturing who is talking, not what they said. Two clips from the same person land near each other in that space, which makes it the workhorse behind a generation of diarization systems. See speaker diarization.
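To illustrate "land near each other," here is a sketch that compares embeddings with cosine similarity. The three-dimensional vectors are invented for readability (real embeddings have hundreds of dimensions), and cosine scoring is an assumption here, one common way to compare embeddings rather than the only one.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: two clips from speaker A, one from speaker B.
clip_a1 = [0.9, 0.1, 0.2]
clip_a2 = [0.8, 0.2, 0.1]
clip_b = [0.1, 0.9, 0.3]

same = cosine_similarity(clip_a1, clip_a2)
different = cosine_similarity(clip_a1, clip_b)
print(round(same, 2), round(different, 2))
```

A diarization system can cluster segments on scores like these, grouping those whose similarity clears a tuned threshold.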
Z
Zero-shot cloning
Copying a voice from a few seconds of reference audio, with no per-voice training. Captures timbre fast; captures personality far less. See voice cloning.