How to test and evaluate a voice agent before launch

The demo works. It worked because you spoke clearly, in a quiet room, and asked exactly the questions the agent was built for. The first real caller will do none of those things, and the second one will interrupt it mid-sentence. Testing a voice agent is the discipline of finding out what breaks before they do.

Why standard software tests are insufficient

Ordinary software testing assumes determinism: the same input gives the same output, so you assert on it. A voice agent breaks both halves of that assumption. Its language model is stochastic, so the same question yields different but valid replies and you cannot assert on exact strings. Its input is audio, which varies endlessly with accent, noise, device, and line quality, so a single clean test clip proves almost nothing. And it is a real-time, multi-turn system, so its behavior depends on timing and conversation state, not on the latest utterance alone. Voice-agent testing is therefore statistical and behavioral: run many varied scenarios and measure rates and outcomes, rather than running one and checking for equality.
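A rate measured over a few dozen runs is itself noisy, so it helps to report an interval next to it. The sketch below uses the Wilson score interval, one common choice for a binomial pass rate. The function names, the 95% level, and the pass target are assumptions made for illustration, not part of any particular test framework.

```python
import math

def wilson_interval(passes: int, runs: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a pass rate (z = 1.96 is roughly 95%)."""
    if runs <= 0:
        raise ValueError("need at least one run")
    p = passes / runs
    denom = 1 + z * z / runs
    centre = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

def meets_target(passes: int, runs: int, target: float) -> bool:
    """Pass only if the lower bound of the interval clears the target rate."""
    low, _ = wilson_interval(passes, runs)
    return low >= target

# Hypothetical result: 45 of 50 simulated calls reached the caller's goal.
low, high = wilson_interval(45, 50)
print(f"pass rate 0.90, interval {low:.3f} to {high:.3f}")
```

Gating on the lower bound rather than the raw rate is a deliberately conservative choice: 45 of 50 looks like 90%, but the interval shows the true rate could plausibly sit below 80%.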

Test individual components

When the live loop fails, you have six suspects: recognition, reasoning, the tool call, synthesis, timing, and the audio itself. Any can be the one that ruined the call, and a transcript of the wreckage rarely tells you which. Test the pieces in isolation first, while you can still see which is at fault.

Test recognition on your own audio, not a benchmark's: real recordings with your accents, your jargon, and your noise and line conditions, scored on the entities that matter rather than on word error rate (WER) alone, which is the "beyond WER" discipline. Test the voice for the pronunciations and prosody your content needs. Knowing each component's behavior turns a later loop failure from a mystery into "the recognizer missed the account number," which you can then fix.
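Scoring recognition on the entities that matter can be sketched as follows. This is a crude illustration: it assumes you already have recognizer transcripts paired with the entity values a human labeled for each clip, and it counts an entity as recognized when its normalized value appears in the normalized transcript. The entity names and sample transcripts are made up.

```python
import re

def normalize(text: str) -> str:
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())

def entity_recall(cases):
    """cases: list of (recognizer_transcript, {entity_name: expected_value})."""
    hits, totals = {}, {}
    for transcript, entities in cases:
        squashed = normalize(transcript)
        for name, expected in entities.items():
            totals[name] = totals.get(name, 0) + 1
            # Substring match after normalization: simple, but it can
            # false-match across word boundaries on short values.
            if normalize(expected) in squashed:
                hits[name] = hits.get(name, 0) + 1
    return {name: hits.get(name, 0) / totals[name] for name in totals}

# Hypothetical labeled clips.
cases = [
    ("my account number is 4 4 1 9 0 2", {"account_number": "441902"}),
    ("my account number is 4 4 1 9 0 to", {"account_number": "441902"}),
    ("I need to refill my metformin", {"drug": "metformin"}),
]
print(entity_recall(cases))
```

The second clip has only one wrong word, so its word error rate would look fine, yet the account number is unusable. Per-entity recall makes that visible.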

Simulate complete conversations

You cannot hand-test every path, so simulate users. Drive the agent with another model playing the caller: give an LLM a persona and a goal ("an impatient customer who wants a refund and keeps interrupting") and let it converse with your agent over many runs. A small roster covers a lot of ground: the caller who reschedules an appointment in a heavy accent and pauses mid-sentence, the one who disputes a charge and changes their mind halfway through, the one who reads the order number out of order.

Record from each run whether the caller's goal was achieved. Because the agent is stochastic, run each persona many times and report the result as a success rate, not a single pass or fail. Repetition surfaces the variance a single scripted test hides, and adversarial personas, such as the interrupter, the topic-changer, and the mumbler, find the edges your happy-path script never reaches.
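The shape of such a harness might look like the sketch below. Everything named here is hypothetical: `Persona`, `success_rates`, and especially `stub_run`, which stands in for the real work of letting an LLM caller converse with your agent and judging whether the goal was met. The stub is deterministic only so the example runs without any model or network access.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass(frozen=True)
class Persona:
    name: str
    goal: str
    style: str

def success_rates(
    personas: List[Persona],
    run_conversation: Callable[[Persona, int], bool],
    runs_per_persona: int = 20,
) -> Dict[str, float]:
    """Run every persona many times and return its goal-success rate."""
    rates = {}
    for persona in personas:
        passed = sum(
            1 for seed in range(runs_per_persona) if run_conversation(persona, seed)
        )
        rates[persona.name] = passed / runs_per_persona
    return rates

def stub_run(persona: Persona, seed: int) -> bool:
    """Placeholder for a real simulated call; fails some interrupter runs."""
    return not (persona.style == "interrupts" and seed % 4 == 0)

personas = [
    Persona("rescheduler", "move an appointment", "pauses mid-sentence"),
    Persona("refund_seeker", "get a refund", "interrupts"),
]
print(success_rates(personas, stub_run))
```

Passing a seed into each run is one way to make a failing conversation reproducible later, provided the real simulator honors it.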

Test with representative audio

Simulation has a trap. If the simulated caller's turns are synthesized as clean TTS, you are testing your agent on pristine audio it will never receive in production. Inject reality instead: run recorded human audio with real accents, feed in noisy and far-field clips, and test over an actual telephony path at 8 kHz. An agent that works on studio audio can still fail on a phone call, so the realism of your test audio sets a ceiling on how much the rest of the testing tells you.
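One inexpensive way to roughen clean test audio is to mix recorded noise into it at a chosen signal-to-noise ratio. The sketch below assumes audio is already decoded into lists of float samples at the same sample rate; the function names are illustrative. It does not reproduce codec or 8 kHz telephony effects, so it complements a real phone path and does not replace one.

```python
import math
import random

def rms(samples):
    """Root-mean-square level of a list of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so that speech RMS over noise RMS equals snr_db, then add."""
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")
    noise_rms = rms(noise)
    if noise_rms == 0:
        raise ValueError("noise is silent")
    gain = rms(speech) / (10 ** (snr_db / 20)) / noise_rms
    # Loop the noise if it is shorter than the speech. Clipping is not handled.
    return [s + gain * noise[i % len(noise)] for i, s in enumerate(speech)]

# Synthetic stand-ins for real recordings: a tone for speech, random noise.
rate = 8000
speech = [0.5 * math.sin(2 * math.pi * 440 * i / rate) for i in range(800)]
rng = random.Random(0)
noise = [rng.uniform(-1.0, 1.0) for _ in range(800)]
mixed = mix_at_snr(speech, noise, snr_db=10.0)
```

Sweeping `snr_db` downward and plotting task success against it shows where the agent starts to fail, which is more informative than a single noisy run.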

Test timing and turn-taking, not just content

Content correctness is not enough; the agent also has to feel right. Test the conversational behaviors that text testing ignores. Does it end turns correctly, or does it cut people off and leave dead air? Does it yield to barge-in promptly? Does its latency stay acceptable under concurrent load, not just on one call at a time? These are all measurable, and they are where an agent that "answers correctly" can still be unusable.
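Latency is best summarized by percentiles, because a good median can hide a slow tail. The sketch below assumes each turn was logged with two millisecond timestamps, when the caller stopped speaking and when the agent's audio started. The field names are hypothetical, and the percentile uses the simple nearest-rank method.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def response_latencies_ms(turns):
    """Gap between end of caller speech and start of agent audio, per turn."""
    return [t["agent_audio_start_ms"] - t["user_speech_end_ms"] for t in turns]

# Hypothetical turn log from one test call.
turns = [
    {"user_speech_end_ms": 1000, "agent_audio_start_ms": 1600},
    {"user_speech_end_ms": 5000, "agent_audio_start_ms": 5700},
    {"user_speech_end_ms": 9000, "agent_audio_start_ms": 9800},
    {"user_speech_end_ms": 14000, "agent_audio_start_ms": 16400},
    {"user_speech_end_ms": 20000, "agent_audio_start_ms": 20900},
]
latencies = response_latencies_ms(turns)
print("p50:", percentile(latencies, 50), "ms")
print("p95:", percentile(latencies, 95), "ms")
```

The same two functions apply to barge-in if you log when the caller started talking over the agent and when the agent's audio actually stopped. Collect the numbers while many calls run at once, not from a single call.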

Measure task completion in production

The metric that matters most is task success, not transcript accuracy. Did the agent achieve the caller's goal, call the right tool with the right parameters, fill the right fields, and escalate when it should have? [1] This is the voice-agent version of the beyond-WER argument: measure the outcome. A perfect transcript that leads to the wrong action is a failure, and an imperfect one that completes the task is a success.
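A task-success check can grade what the agent did and ignore the transcript entirely. The sketch below assumes each test scenario states the one tool call that must happen, or that the call should be escalated, and that the agent's tool calls are logged as name and parameter pairs. The class names, tool names, and fields are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple

@dataclass
class ExpectedOutcome:
    tool: str
    params: Dict[str, Any]
    should_escalate: bool = False

@dataclass
class ObservedCall:
    tool_calls: List[Tuple[str, Dict[str, Any]]] = field(default_factory=list)
    escalated: bool = False

def task_succeeded(expected: ExpectedOutcome, observed: ObservedCall) -> bool:
    """Grade the outcome: correct escalation, or the right tool with the right parameters."""
    if expected.should_escalate:
        return observed.escalated
    if observed.escalated:
        return False  # escalating a call the agent should have handled
    return any(
        name == expected.tool and params == expected.params
        for name, params in observed.tool_calls
    )

expected = ExpectedOutcome("issue_refund", {"order_id": "A1001", "amount_cents": 2500})
good = ObservedCall([
    ("lookup_order", {"order_id": "A1001"}),
    ("issue_refund", {"order_id": "A1001", "amount_cents": 2500}),
])
wrong_params = ObservedCall([
    ("issue_refund", {"order_id": "A1010", "amount_cents": 2500}),
])
print(task_succeeded(expected, good), task_succeeded(expected, wrong_params))
```

Exact parameter equality is strict on purpose. A refund issued against the wrong order is the failure this metric exists to catch.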

Testing does not stop at launch. Log real calls (with consent and care for privacy), track task-success and latency metrics over time, build an evaluation set from real failures, and watch for drift as you change prompts, models, or providers. Production is the largest and most honest test suite you have.
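Watching for drift can start as a comparison of task-success counts between a baseline window and the current one. The sketch below uses a pooled two-proportion z-test, one standard choice among several. The function name, the sample counts, and the alert threshold are assumptions.

```python
import math

def drift_z_score(base_pass: int, base_total: int, cur_pass: int, cur_total: int) -> float:
    """Positive z means the current success rate is below the baseline rate."""
    p_base = base_pass / base_total
    p_cur = cur_pass / cur_total
    pooled = (base_pass + cur_pass) / (base_total + cur_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / base_total + 1 / cur_total))
    if se == 0:
        return 0.0  # both windows all-pass or all-fail: no measurable difference
    return (p_base - p_cur) / se

ALERT_Z = 2.33  # roughly a one-sided 1% threshold; an assumed setting

# Hypothetical weekly counts: 460 of 500 before a prompt change, 430 of 500 after.
z = drift_z_score(460, 500, 430, 500)
print(f"z = {z:.2f}, alert = {z > ALERT_Z}")
```

A test like this only says that the rate moved. Finding out why still means replaying the failed calls, which is where an evaluation set built from real failures earns its keep.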

You test	You catch
Components in isolation	Which stage actually failed
Simulated personas, many runs	Variance, and the edges of the happy path
Real audio over a real phone path	The gap between the demo and production
Timing, turn-taking, and load	Agents that answer correctly but feel wrong
Logged production calls	Drift, and the failures nobody predicted

Each layer of testing catches what the one before it cannot see.

Common questions

Why can't I unit-test a voice agent like normal software?

Because the same input does not give the same output. The model samples differently from one run to the next, and the recognizer's borderline calls shift with the audio, so one passing case proves nothing. Measure pass rates over many varied runs, the way you would grade a flaky physical process, rather than checking equality on a single clip.

How do I test a voice agent without thousands of real calls?

Simulate callers. Give an LLM a persona and a goal, point it at your agent, and run each persona many times to surface the variance that one scripted test hides. Make some of them adversarial, such as the interrupter, the topic-changer, and the mumbler, since those find the edges. Feed them recorded human audio rather than clean TTS, or you are testing on audio the agent will never receive in production.

What should I measure when evaluating a voice agent?

Task success first: did it reach the caller's goal and call the right tool with the right parameters? A perfect transcript that drives the wrong action is a failure, and an imperfect one that completes the task is a success. After that, measure turn-taking, barge-in responsiveness, and whether latency holds up under concurrent load, not just on a single call.

Why does my agent pass testing but fail on real calls?

Your tests ran clean audio and scripted paths. The first real caller brings an accent, a barking dog, an 8 kHz phone line, and a request you never wrote. Test over an actual telephony path with realistic and adversarial audio, then keep grading logged calls after launch. Production is the largest and most honest test set you have.

References

[1] Okur, E., Sahay, S., Fuentes Alba, R., & Nachman, L. (2022). End-to-End Evaluation of a Spoken Dialogue System for Learning Basic Mathematics. arXiv preprint arXiv:2211.03511.