Turn-taking and barge-in: how voice agents know when to talk

You ask a question, the agent answers, and you realize halfway through that it misunderstood. You start to correct it, and it keeps talking, calmly, over you, until it finishes its wrong answer. Or the reverse: you pause for half a second to find a word and the agent leaps in, finishing your sentence for you incorrectly. Either way the machine does not understand how turns work, so instead of a conversation you get two parties talking past each other.

Turn-taking is the choreography humans do without thinking and agents have to be built to do.

Premature turn completion

The agent decides you are done before you are, ships a half-formed query, and starts answering the wrong question.

What went wrong: end-of-turn was called on silence alone, and your pause was a thinking pause rather than a finished thought. This is the endpoint detection problem in its native habitat. A silence threshold tight enough to feel responsive will fire during the half-second-and-longer gaps people leave mid-sentence, especially when reciting numbers ("my code is four, seven... two") or thinking aloud. An agent that cannot tell silence from completeness interrupts on a schedule.
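The trade-off can be made concrete with a minimal sketch of a silence-only endpointer. Everything here is an assumption for illustration: the 20 ms frame size, the per-frame speech flags (standing in for a real VAD's output), and the durations in the example utterance.

```python
from typing import List, Optional, Tuple

FRAME_MS = 20  # assumed frame length; real systems vary


def silence_endpoint(frames: List[bool], threshold_ms: int) -> Optional[int]:
    """Return the time in ms at which end-of-turn fires, or None if it never does.

    frames[i] is True when frame i contains speech, as a VAD might report.
    The endpointer fires once silence following speech reaches threshold_ms.
    """
    silence_ms = 0
    heard_speech = False
    for i, is_speech in enumerate(frames):
        if is_speech:
            heard_speech = True
            silence_ms = 0  # any speech resets the silence timer
        elif heard_speech:
            silence_ms += FRAME_MS
            if silence_ms >= threshold_ms:
                return (i + 1) * FRAME_MS
    return None


def utterance(segments: List[Tuple[bool, int]]) -> List[bool]:
    """Build a frame sequence from (is_speech, duration_ms) segments."""
    frames: List[bool] = []
    for is_speech, duration_ms in segments:
        frames.extend([is_speech] * (duration_ms // FRAME_MS))
    return frames


# Hypothetical timing for "my code is four, seven... two":
# 1200 ms of speech, a 700 ms thinking pause, 400 ms of speech, then silence.
frames = utterance([(True, 1200), (False, 700), (True, 400), (False, 1000)])

tight = silence_endpoint(frames, threshold_ms=500)  # fires inside the pause
loose = silence_endpoint(frames, threshold_ms=800)  # waits for the real end
print(tight, loose)
```

With these made-up numbers, the 500 ms threshold fires in the middle of the thinking pause, before "two" is spoken, while the 800 ms threshold survives the pause but then adds 800 ms of waiting after the speaker has actually finished.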

Delayed turn completion

The opposite failure, and it feels worse because it is so awkward. You finish, the line goes dead, you wonder if it heard you, you start to repeat yourself, and that is when it finally answers, now colliding with your second attempt.

What went wrong: the silence timeout was set long to avoid cutting people off, and the full timeout is added to every response as pure latency. Worse, the agent's own thinking-and-speaking delay pushes its late answer to the moment you give up waiting and start again, so the two of you collide. Dead air and the late collision are one failure: the same tuning dial that causes premature completion, pushed too far the other way.
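The arithmetic behind that collision is simple enough to write down. The sketch below is illustrative only: the `Pipeline` type, the lumped `processing_ms` figure, and the idea that a user retries after a fixed wait are all assumptions, and the numbers are invented.

```python
from dataclasses import dataclass


@dataclass
class Pipeline:
    silence_timeout_ms: int  # how long the agent waits before calling end-of-turn
    processing_ms: int       # recognition + reasoning + first audio, lumped together

    def response_delay_ms(self) -> int:
        # The silence timeout is paid in full on every turn, before any work starts.
        return self.silence_timeout_ms + self.processing_ms


def collides(pipeline: Pipeline, user_retry_after_ms: int) -> bool:
    """True if the agent's answer lands at or after the moment the user
    (hypothetically) gives up and starts repeating themselves."""
    return pipeline.response_delay_ms() >= user_retry_after_ms


cautious = Pipeline(silence_timeout_ms=1500, processing_ms=800)
snappy = Pipeline(silence_timeout_ms=500, processing_ms=800)

for name, p in [("cautious", cautious), ("snappy", snappy)]:
    print(name, p.response_delay_ms(), collides(p, user_retry_after_ms=2000))
```

The processing time is identical in both configurations; only the timeout differs, and in this made-up case it is the timeout alone that pushes the cautious agent past the point where the user starts again.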

Missed interruptions

You start speaking while the agent is mid-sentence, and it plows on, deaf, until it finishes. By then you have repeated yourself twice and the conversation is a mess.

What went wrong: there is no barge-in. A walkie-talkie agent treats its own turn as uninterruptible, so it has no mechanism to detect that you started and yield. Real conversation is full-duplex: both parties can produce sound at once, and the listener stops when the other takes the floor. An agent that only listens when it is not speaking cannot do this, which is why barge-in support does more than almost any other feature to let a conversation flow without collisions. It requires the agent to keep listening while it talks and to stop its own streaming TTS the instant you mean to take over.
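A minimal sketch of that control flow follows. It is not any particular framework's API: `FakeTTS` is a hypothetical stand-in for a streaming TTS player, the per-frame speech flag is assumed to come from a VAD running on the microphone, and the three-frame debounce is an arbitrary choice.

```python
from enum import Enum, auto


class State(Enum):
    LISTENING = auto()
    SPEAKING = auto()


class FakeTTS:
    """Hypothetical stand-in for a streaming TTS player."""

    def __init__(self) -> None:
        self.playing = False
        self.stopped_count = 0

    def start(self, text: str) -> None:
        self.playing = True

    def stop(self) -> None:
        if self.playing:
            self.playing = False
            self.stopped_count += 1


class BargeInController:
    def __init__(self, tts: FakeTTS, min_speech_frames: int = 3) -> None:
        self.tts = tts
        self.state = State.LISTENING
        self.min_speech_frames = min_speech_frames  # debounce against blips
        self._speech_run = 0

    def say(self, text: str) -> None:
        self.state = State.SPEAKING
        self._speech_run = 0
        self.tts.start(text)

    def on_mic_frame(self, is_user_speech: bool) -> None:
        """Called for every microphone frame, including while the agent speaks."""
        if self.state is not State.SPEAKING:
            return
        # Count consecutive speech frames; any non-speech frame resets the run.
        self._speech_run = self._speech_run + 1 if is_user_speech else 0
        if self._speech_run >= self.min_speech_frames:
            self.tts.stop()               # cut playback immediately
            self.state = State.LISTENING  # yield the floor


tts = FakeTTS()
agent = BargeInController(tts)
agent.say("Your balance is ...")
for frame_has_speech in [False, True, False, True, True, True]:
    agent.on_mic_frame(frame_has_speech)
print(agent.state, tts.playing)
```

The point of the design is that `on_mic_frame` keeps running during playback. A half-duplex agent would simply not call it while speaking, which is exactly the walkie-talkie behavior described above.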

False interruption detection

A stranger failure: the agent stops mid-sentence for no reason, as if someone spoke, but nobody did. Or on a phone call it keeps cutting out.

What went wrong: the agent's own audio leaked from the speaker back into the microphone, the voice activity detector saw "speech," and the barge-in logic concluded the user had interrupted, so the agent yielded to its own echo. This is why acoustic echo cancellation (AEC) is required in a full-duplex agent. Without it, the agent cannot tell its own voice from yours, and barge-in makes the agent stop itself whenever its own speaker output reaches the microphone. Telephony makes the problem worse, which is part of why agents feel different on the phone.
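To show the idea rather than a production canceller, here is a sketch using normalized least mean squares (NLMS), a textbook adaptive filter often used to introduce echo cancellation. The echo path, tap count, and step size are invented for the example. Real AEC also has to cope with double-talk, nonlinear speaker distortion, and changing delay, none of which this toy handles.

```python
import math
import random
from typing import List


def nlms_echo_cancel(far_end: List[float], mic: List[float],
                     taps: int = 8, mu: float = 0.5, eps: float = 1e-6) -> List[float]:
    """Subtract an adaptively estimated echo of far_end from mic.

    far_end: samples the agent sent to the speaker.
    mic:     samples captured by the microphone (here, echo only).
    Returns the residual signal after cancellation.
    """
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent far-end samples, newest first
    residual = []
    for x, d in zip(far_end, mic):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        error = d - echo_estimate
        norm = sum(h * h for h in history) + eps
        # NLMS update: step along the input vector, scaled by its energy.
        weights = [w + mu * error * h / norm for w, h in zip(weights, history)]
        residual.append(error)
    return residual


def rms(samples: List[float]) -> float:
    return math.sqrt(sum(s * s for s in samples) / len(samples))


random.seed(0)
far_end = [random.uniform(-1.0, 1.0) for _ in range(4000)]  # stand-in for agent audio

# Hypothetical echo path: a delayed, attenuated, slightly smeared copy.
echo_path = [0.0, 0.0, 0.6, 0.3, 0.1]
mic = [
    sum(echo_path[k] * far_end[n - k] for k in range(len(echo_path)) if n - k >= 0)
    for n in range(len(far_end))
]

residual = nlms_echo_cancel(far_end, mic)
print("mic level:     ", rms(mic[-500:]))
print("residual level:", rms(residual[-500:]))
```

An energy-based VAD looking at the raw `mic` signal would see activity the whole time the agent is talking. Looking at `residual` instead, once the filter has adapted, it sees close to nothing, so only the user's voice (absent in this echo-only demo) would be left to trigger barge-in.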

Backchannels mistaken for turns

You say "mm-hm" or "right" while the agent talks, the small noises that mean keep going, and the agent stops, thinking you wanted the floor.

What went wrong: a backchannel is sound from the listener that is not a bid to take the turn, and the agent treated all incoming speech as a turn-grab. Distinguishing "go on" from "stop, my turn" is subtle even for humans. An agent that yields on every backchannel becomes impossible to listen to, while one that ignores all incoming speech cannot be interrupted at all. The right behavior sits between the two, and depends on more than acoustics.
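One crude way to sit between the two extremes is to combine a duration check with a lexical one. The sketch below is a heuristic assumed for illustration: the word list, the 700 ms cutoff, and the function names are all hypothetical, and a real system would likely also weigh prosody and context.

```python
# Hypothetical, deliberately small list of listener noises that mean "keep going".
BACKCHANNELS = {"mm-hm", "mhm", "uh-huh", "right", "yeah", "ok", "okay", "i see"}


def is_backchannel(transcript: str, duration_ms: int, max_ms: int = 700) -> bool:
    """Treat speech as a backchannel only if it is both short and a known filler."""
    text = transcript.strip().lower().strip(".,!?")
    return duration_ms <= max_ms and text in BACKCHANNELS


def should_yield(transcript: str, duration_ms: int) -> bool:
    """Decide whether the agent should stop talking for this user speech."""
    return not is_backchannel(transcript, duration_ms)


examples = [
    ("Mm-hm.", 300),
    ("right", 400),
    ("right, but that's not what I asked", 1800),
    ("stop", 300),
]
for transcript, duration_ms in examples:
    print(repr(transcript), should_yield(transcript, duration_ms))
```

Note the asymmetry in the design: anything not recognized as a backchannel, including a short "stop", causes the agent to yield. Erring toward yielding costs a brief pause, while erring the other way reproduces the missed-interruption failure.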

Why silence alone is insufficient

The thread through these failures is that acoustic silence is not the end of a turn. Silence tells you the sound stopped, not whether the speaker is finished. The current generation of turn detection blends the silence timer with a semantic signal: a model reads the partial transcript and judges whether the utterance is complete, holding the turn open through the pause inside "my account number is four, seven..." and releasing it quickly after a finished sentence.^[1] This move from fixed endpointing to semantic end-of-turn detection lets an agent stop interrupting thinkers without adding lag for everyone else.
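The blending can be sketched as a silence timeout that shrinks or stretches with a completeness score. In the code below, `completeness_score` is a toy rule-based stand-in for the semantic model, not a description of how any real detector works, and the word list and timeout bounds are assumptions. It also leans on punctuation, which streaming transcripts often lack.

```python
import re

# Hypothetical words that rarely end a finished English sentence.
INCOMPLETE_ENDINGS = {"is", "and", "or", "but", "the", "a", "to", "my", "of", "because"}


def completeness_score(partial_transcript: str) -> float:
    """Toy stand-in for a semantic end-of-turn model; returns a value in 0..1."""
    text = partial_transcript.strip().lower()
    if not text:
        return 0.0
    if text.endswith(("...", ",")):
        return 0.1  # trailing off or mid-list
    words = re.findall(r"[a-z']+", text)
    if words and words[-1] in INCOMPLETE_ENDINGS:
        return 0.2  # dangling function word
    if text.endswith((".", "?", "!")):
        return 0.9  # looks like a finished sentence
    return 0.5


def silence_timeout_ms(score: float, short_ms: int = 300, long_ms: int = 1500) -> int:
    """Interpolate: complete-looking speech waits briefly, incomplete waits longer."""
    return round(long_ms - score * (long_ms - short_ms))


def end_of_turn(partial_transcript: str, silence_ms: int) -> bool:
    return silence_ms >= silence_timeout_ms(completeness_score(partial_transcript))


print(end_of_turn("my account number is four, seven...", silence_ms=700))
print(end_of_turn("What is my balance?", silence_ms=500))
```

The silence timer is still there; the semantic signal only moves its threshold. That keeps a fallback in place when the score is wrong: even an utterance judged incomplete is released once the long timeout elapses.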

A competent turn-taking stack is therefore several mechanisms cooperating: VAD to hear speech, endpointing plus a semantic check to decide the user's turn is over, echo cancellation so the agent does not trip on itself, barge-in logic to stop the agent and yield when the user takes the floor, and enough restraint to ignore backchannels. Each one fixes a specific failure above, and an agent missing any of them has a recognizable way of feeling wrong.

Common questions

What is barge-in in a voice agent?

It is the user interrupting the agent while it is speaking, and the agent stopping to listen. Supporting it requires the agent to keep listening during its own playback, detect that the user has started, and immediately stop its text-to-speech and yield the floor. Without barge-in, the agent talks over the user until it finishes, which feels broken.

Why does my voice agent interrupt me when I pause?

Because it treats silence as the end of your turn, and your pause was thought, not completion. The pause statistics and the failure catalog are covered under endpoint detection; the short version is that semantic end-of-turn detection, which checks whether your sentence is actually complete, holds the turn open where a bare silence timer would fire.

Why does the agent stop talking when no one interrupted it?

Most likely its own audio leaked from the speaker into the microphone, and it mistook the echo for the user speaking. Acoustic echo cancellation removes the agent's own voice from the incoming audio so it does not flinch at itself. This is essential in any full-duplex agent and especially on phone calls.

How do humans take turns so smoothly compared to agents?

People predict when a speaker will finish, using grammar and rhythm and, in person, gaze, and they begin their reply within about 200 milliseconds, sometimes overlapping. Agents have weaker prediction and, on the phone, no visual cues, and they must complete recognition, reasoning, and synthesis within roughly that same window, which is why turn-taking is so much harder for a machine.

References

Ekstedt, E., & Skantze, G. (2022). Voice Activity Projection: Self-supervised Learning of Turn-taking Events. arXiv preprint arXiv:2205.09812.
Stivers, T., Enfield, N. J., Brown, P., et al. (2009). Universals and Cultural Variation in Turn-Taking in Conversation. Proceedings of the National Academy of Sciences, 106(26).