Speech-to-speech models vs cascaded pipelines for voice agents

Native speech-to-speech models arrived in force in 2024: audio in, audio out, no transcription step in the middle. The immediate reaction was that the cascaded pipeline was finished. Why chain three models when one can do the whole job, faster and more naturally? Two years of building with both gives a more honest answer. Each approach gives up something real, and which one you want depends on whether you most need control or naturalness.

Think of a hi-fi system you can open up, where you swap the amplifier and tap any signal between the boxes, against a sealed smart speaker that sounds better out of the box but cannot be taken apart. Both play music, but only the open one lets you inspect and replace what is inside.

Cascaded voice-agent pipelines

The cascaded approach treats the agent as a pipeline of components you assemble and control. Every advantage follows from that separation.

It is controllable. Between recognition and the model, and between the model and synthesis, you can insert anything you need: tool calls, knowledge retrieval, business logic, compliance filters, guardrails, logging. For most real products, that logic between the stages is where the application lives.

It is debuggable and auditable. Every stage produces inspectable text, so when the agent says something wrong you read the transcript and see whether it misheard, misreasoned, or misspoke, and you keep a record for compliance.

It is modular. You can swap in a better recognizer or a different model without rebuilding, using the best component for each job. And it hands you the transcript for free, which you want anyway.
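A minimal sketch of that structure, assuming each stage can be treated as a plain function. The names here (`CascadedAgent`, `TurnTrace`, `redact_account_numbers`) are hypothetical and the stages are stand-in lambdas, not a real recognizer, model, or synthesizer. It shows the three properties together: hooks that run between stages, stages passed in as swappable parameters, and a text record of every hand-off.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List

# Stage signatures are assumptions for this sketch; real engines stream audio.
Recognizer = Callable[[bytes], str]    # audio -> transcript
Reasoner = Callable[[str], str]        # transcript -> reply text
Synthesizer = Callable[[str], bytes]   # reply text -> audio
TextHook = Callable[[str], str]        # inspect or rewrite text between stages


@dataclass
class TurnTrace:
    """Inspectable text record of one turn, kept for debugging and audit."""
    transcript: str = ""    # what the recognizer heard
    model_input: str = ""   # transcript after pre-model hooks
    model_output: str = ""  # what the model decided to say
    spoken_text: str = ""   # reply after post-model hooks, as sent to synthesis


@dataclass
class CascadedAgent:
    stt: Recognizer
    llm: Reasoner
    tts: Synthesizer
    pre_llm_hooks: List[TextHook] = field(default_factory=list)
    post_llm_hooks: List[TextHook] = field(default_factory=list)
    traces: List[TurnTrace] = field(default_factory=list)

    def run_turn(self, audio_in: bytes) -> bytes:
        trace = TurnTrace()
        trace.transcript = self.stt(audio_in)

        text = trace.transcript
        for hook in self.pre_llm_hooks:    # e.g. retrieval, input filters
            text = hook(text)
        trace.model_input = text

        trace.model_output = self.llm(text)

        reply = trace.model_output
        for hook in self.post_llm_hooks:   # e.g. compliance filters, guardrails
            reply = hook(reply)
        trace.spoken_text = reply

        self.traces.append(trace)
        return self.tts(reply)


def redact_account_numbers(text: str) -> str:
    """Hypothetical guardrail: mask any run of eight or more digits."""
    return re.sub(r"\b\d{8,}\b", "[REDACTED]", text)


# Stand-in stages so the sketch runs without any speech engine.
agent = CascadedAgent(
    stt=lambda audio: audio.decode("utf-8"),
    llm=lambda text: f"You said: {text}",
    tts=lambda text: text.encode("utf-8"),
    post_llm_hooks=[redact_account_numbers],
)
audio_out = agent.run_turn(b"my account is 1234567890")
print(agent.traces[0])
```

Comparing the fields of the trace localizes a failure: a wrong `transcript` means the agent misheard, a wrong `model_output` means it misreasoned, and a difference between `model_output` and `spoken_text` shows what a hook changed. Swapping a stage is a matter of passing a different callable.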

The cost has two parts. Latency stacks across three stages, though pipelining them (overlapping recognition, reasoning, and synthesis so each stage starts on the previous stage's partial output) recovers most of it. And transcription discards how things were said: by the time the model sees text, the user's tone, hesitation, and emotion are gone, flattened into words.
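The latency arithmetic can be sketched as follows. This is a simplified model, not a benchmark: the `StageTiming` type and both functions are hypothetical, and the millisecond figures are invented placeholders chosen only to make the sums easy to follow. It assumes that in the pipelined case each stage can begin as soon as the previous stage emits its first usable chunk.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class StageTiming:
    first_output_ms: int  # delay until the stage emits its first usable chunk
    full_output_ms: int   # delay until the stage has produced its whole output


def sequential_latency_ms(stages: Sequence[StageTiming]) -> int:
    """Each stage waits for the previous stage to finish completely."""
    return sum(stage.full_output_ms for stage in stages)


def pipelined_latency_ms(stages: Sequence[StageTiming]) -> int:
    """Each stage starts on the previous stage's first chunk (time to first audio)."""
    return sum(stage.first_output_ms for stage in stages)


# Placeholder numbers for illustration only, not measurements.
stages = [
    StageTiming(first_output_ms=100, full_output_ms=300),  # recognition
    StageTiming(first_output_ms=200, full_output_ms=900),  # reasoning
    StageTiming(first_output_ms=150, full_output_ms=700),  # synthesis
]

print(sequential_latency_ms(stages))  # 1900
print(pipelined_latency_ms(stages))   # 450
```

With these placeholder values the user waits 1900 ms if every stage runs to completion before the next begins, and 450 ms to the first audio when the stages overlap. Real figures depend entirely on the components and the network.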

End-to-end speech-to-speech models

The S2S model collapses the pipeline into one system that consumes audio and emits audio directly (Rubenstein et al., 2023). Its advantages mirror the pipeline's costs.

It is fast. One model does in a single pass what the chain does in three, with no hand-offs to coordinate. It is also expressive on both ends. Because it never reduces speech to text, it can hear paralinguistics such as tone, emotion, laughter, sarcasm, and hesitation, and it can produce them, responding to a frustrated user differently from a cheerful one, laughing, changing its delivery. It also handles the fine timing of interruptions and turn-taking more naturally, because the same model manages listening and speaking as one behavior.

The costs are the pipeline's strengths inverted. It is hard to control: inserting tool calls, retrieved knowledge, compliance rules, and guardrails into a model that goes straight from audio to audio is far less natural than dropping them between pipeline stages, and tool-calling and grounding for these models are less mature. It is hard to debug and audit, because there is no intermediate transcript to read or store unless the model is built to emit one. When such an agent misbehaves, you have the audio it produced and little else to explain why. It also tends toward lock-in: the whole agent is one vendor's model, with no way to swap a single stage.

Control, latency, and naturalness

The decision comes down to one axis. The pipeline gives you control, inspection, and modularity at some cost in latency and expressiveness. The S2S model gives you latency and expressiveness at a real cost in control and inspectability. Most other differences follow.

Dimension	Cascaded pipeline	Speech-to-speech model
Structure	STT → LLM → TTS, separate stages	One model, audio to audio
Latency	Stacks; pipelining recovers most	Lowest, single pass
Tone and emotion	Lost at transcription	Heard and produced
Tool calls, logic, guardrails	Easy, between stages	Harder, fewer hooks
Transcript	Free	Only if the model emits one
Debug and audit	Easy, read each stage's text	Hard, no intermediate text by default
Swap a component	Yes	No, one model
Interruption timing	Good, with work	Often more natural

Choosing an architecture

Choose the pipeline when you need control and accountability: an agent that calls tools, follows business rules, retrieves from a knowledge base, must be auditable for compliance, or has to let you swap components as the field moves. That describes most enterprise and task-oriented agents, which is why the pipeline remains the workhorse.

Choose the speech-to-speech model when naturalness and latency dominate and tight control matters less: an open-ended conversational companion, or a low-stakes assistant where expressiveness and snappy timing are the product. The middle ground is also filling in. Hybrid approaches pair an S2S model with a text side-channel for tools and logging, keeping the naturalness while regaining some control. Neither approach wins outright, so choose by which axis, control or naturalness, matters more for your product.
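One way to picture the hybrid design is as an S2S session that emits text events alongside its audio. The sketch below is an assumption about how such a side-channel could be consumed, not any vendor's API: the event types (`AudioChunk`, `TranscriptEvent`, `ToolCallEvent`), the `handle_events` function, and the `lookup_order` tool are all hypothetical, and the event stream is simulated with a list.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Union


@dataclass
class AudioChunk:
    pcm: bytes                 # audio produced by the model, played as-is


@dataclass
class TranscriptEvent:
    speaker: str               # "user" or "agent"
    text: str                  # text the model emitted alongside the audio


@dataclass
class ToolCallEvent:
    name: str
    arguments: Dict[str, str]


Event = Union[AudioChunk, TranscriptEvent, ToolCallEvent]


@dataclass
class SideChannelResult:
    audio: List[bytes] = field(default_factory=list)
    log: List[str] = field(default_factory=list)
    tool_results: List[str] = field(default_factory=list)


def handle_events(
    events: Iterable[Event],
    tools: Dict[str, Callable[..., str]],
) -> SideChannelResult:
    """Route audio to playback and text events to logging and tool execution."""
    result = SideChannelResult()
    for event in events:
        if isinstance(event, AudioChunk):
            result.audio.append(event.pcm)
        elif isinstance(event, TranscriptEvent):
            result.log.append(f"{event.speaker}: {event.text}")
        elif isinstance(event, ToolCallEvent):
            tool = tools.get(event.name)
            if tool is None:
                outcome = f"error: unknown tool {event.name}"
            else:
                outcome = tool(**event.arguments)
            result.log.append(f"tool {event.name} -> {outcome}")
            result.tool_results.append(outcome)  # would be sent back to the model
    return result


def lookup_order(order_id: str) -> str:
    """Hypothetical business tool used only for this example."""
    return f"order {order_id} shipped"


simulated_events: List[Event] = [
    TranscriptEvent("user", "where is order 42"),
    ToolCallEvent("lookup_order", {"order_id": "42"}),
    AudioChunk(b"\x00\x01"),
    TranscriptEvent("agent", "It shipped."),
]
result = handle_events(simulated_events, {"lookup_order": lookup_order})
print(result.log)
```

The audio path stays untouched, so the model's delivery is preserved, while the text events give the application a place to run tools and keep a log. How much control this regains depends on what the model actually chooses to emit on the side-channel.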

Common questions

What is a speech-to-speech model?

A single model that takes audio in and produces audio out directly, with no text conversion in between. Because it never transcribes, it can hear and express tone and emotion, and it tends to respond faster and sound more natural than a pipeline. The trade-off is that it is harder to control with tools and rules, and it gives you no transcript by default.

Is a speech-to-speech model better than a cascaded pipeline?

Neither is universally better. Speech-to-speech models win on latency and expressiveness; cascaded pipelines win on control, debuggability, tool use, and the transcript. Pick the pipeline if your agent needs tight control and auditability. Pick the speech-to-speech model if you need natural, low-latency conversation.

Why can't a cascaded pipeline react to my tone of voice?

Its speech-to-text stage reduces your speech to words and discards how you said them before the language model sees anything. The model gets text with no record of frustration, sarcasm, or hesitation. A speech-to-speech model keeps the audio and can respond to tone. The difference is structural, not a tuning issue.

Can I still use tools and business logic with a speech-to-speech model?

It is harder. A pipeline lets you insert tool calls, retrieval, and guardrails cleanly between stages. A model that goes straight from audio to audio offers fewer hooks, and its tool-calling support is less mature. Hybrid designs add a text side-channel to a speech-to-speech model to regain some of this control.

References

Rubenstein, P. K., Asawaroengchai, C., Nguyen, D. D., Bapna, A., Borsos, Z., et al. (2023). AudioPaLM: A Large Language Model That Can Speak and Listen. arXiv preprint arXiv:2306.12925.