The Voice AI Wiki
A knowledge base for voice AI: speech recognition, synthesis, real-time translation, voice agents, and everything in between.
Foundations
Orientation: what the field is, what the words mean, and how it got here, before you drill into a topic.
Speech-to-text
The core. Every concept that sits between a microphone and a transcript.
- What is speech recognition? ASR explained from audio to text.
- A brief history of speech recognition: From Audrey to transformers.
- How speech-to-text works: From waveform to words, the neural way.
- Speech recognition vs voice recognition: Two phrases people use as synonyms. They are not.
- Real-time vs async transcription: Which one you actually need.
- Streaming speech recognition: How live transcription works over WebSockets.
- Partial vs final results: Why the text changes while you speak.
- What is endpoint detection? How machines know you stopped talking.
- What is voice activity detection (VAD)? Telling speech from silence.
- VAD vs endpoint detection vs turn detection: Three things everyone confuses, settled.
- What is speaker diarization? Who spoke when, explained.
- Diarization vs identification vs verification: Diarization, identification, verification.
- Spoken language identification: How AI detects the language you speak.
- Code-switching: Why most recognizers break mid-sentence.
- Custom vocabulary and context biasing: Teaching a recognizer your words.
- Alphanumerics in speech recognition: Why phone numbers, IDs, and emails go wrong.
- Word timestamps and forced alignment: Putting words on a clock.
- Confidence scores in speech recognition: What they actually tell you.
- Punctuation, capitalization, and ITN: Why "twenty three" becomes "23".
- Hallucinations in speech recognition: Why transcripts invent words.
- What is Word Error Rate (WER)? How STT accuracy is measured.
- Speech recognition evaluation beyond WER: Why a single number misleads.
- Speech-to-text latency: What sub-200 ms actually means.
- Audio formats for speech recognition: Sample rates, codecs, and telephony.
- Transcribing noisy audio: Far-field, overlapping, and hard speech.
- Captions and subtitles: SRT, VTT, and timing rules.
- Telephony transcription: 8 kHz audio, channels, and codecs.
Text-to-speech
Making machines speak: how voices are built, streamed, and judged.
- What is text-to-speech? How neural TTS turns text into voice.
- A brief history of text-to-speech: Two centuries of teaching machines to talk.
- How neural text-to-speech works: From text to waveform.
- Streaming TTS: Why time-to-first-audio decides UX.
- Prosody: What makes synthetic speech sound human.
- TTS voices: How voices are made, and what makes one good.
- What is voice cloning? How it works, and where consent comes in.
- Controlling pronunciation in TTS: SSML, phonemes, and beyond.
- Multilingual TTS and language mixing: Borrowed words, names, language mixing.
- TTS audio formats and sample rates: Choosing output quality.
- How TTS quality is evaluated: MOS scores and their limits.
- Audio watermarking and deepfakes: Provenance for synthetic audio.
Speech translation
Translating speech as it happens, before the sentence is even finished.
- Speech translation: How machines hear one language and speak another.
- Real-time speech translation: Translating before the sentence ends.
- Cascaded vs end-to-end translation: Two architectures compared.
- One-way vs two-way translation: Translation modes explained.
- Speech-to-speech translation: The full pipeline, end to end.
- AI dubbing: How automated voice-over works.
Voice agents
Closing the loop: software that listens, decides, and speaks back fast enough to hold a conversation.
- What is a voice agent? The architecture of AI that talks back.
- Voice agent architecture: STT → LLM → TTS, explained.
- Voice agent latency budget: Where your 800 ms goes, ms by ms.
- Turn-taking and barge-in: How agents know when to talk.
- Speech-to-speech models vs pipelines: The architecture debate, balanced.
- Voice agent frameworks: Pipecat, LiveKit Agents, and friends.
- Tool calling in voice agents: Actions from spoken requests.
- Voice agents on the phone: Twilio, SIP, and codecs.
- Testing voice agents: Evaluating one before launch.
- Multilingual voice agents: Switching languages live.
Audio intelligence
Everything you can pull out of audio once the words are no longer the point.
- What is audio intelligence? Beyond transcription.
- Conversation summarization: Turning calls into notes.
- Sentiment analysis on speech: Text signals vs acoustic signals.
- PII redaction in transcripts: Automated redaction in transcripts.
- Keyword spotting and wake words: Always-listening AI, explained.
- Audio event detection: When AI hears more than speech.
Glossary
Terms used throughout The Voice AI Wiki.