Prosody: what makes synthetic speech sound human

Consider one sentence: "I didn't say she stole the money." Say it aloud several times, stressing a different word each time, and listen to the meaning move. Seven words, at least five distinct claims, and the text identical in every one; the only thing that changed is the prosody.^[7]^[8]

The words leave the meaning open and prosody settles it, so a TTS system reading flat text has to recover information the text does not contain.^[3]^[4]^[5]

Stress on	The sentence now implies
"I"	Someone else said it
"say"	I implied it without saying it
"she"	Someone else stole it
"stole"	She did something milder, like borrow it
"money"	She stole something else

One sentence, five meanings, zero changes to the text. Everything is carried by where the stress lands.
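One way to hand a synthesizer the stress decision explicitly is markup. The sketch below wraps one word at a time in the SSML `<emphasis>` element. The helper name `emphasis_variants` is hypothetical, and how strongly a given engine honors `<emphasis>` varies, so the output is best read as a request to the voice rather than a guarantee.

```python
from xml.sax.saxutils import escape


def emphasis_variants(sentence: str, targets: list[str]) -> dict[str, str]:
    """Return one SSML string per target word, with that word emphasized.

    Hypothetical helper for illustration. It assumes each target appears
    as a whitespace-separated token of the sentence.
    """
    words = sentence.split()
    variants = {}
    for target in targets:
        marked = [
            f'<emphasis level="strong">{escape(w)}</emphasis>' if w == target else escape(w)
            for w in words
        ]
        variants[target] = "<speak>" + " ".join(marked) + "</speak>"
    return variants


if __name__ == "__main__":
    sentence = "I didn't say she stole the money"
    stressed = ["I", "say", "she", "stole", "money"]
    for word, ssml in emphasis_variants(sentence, stressed).items():
        print(f"{word:>6}: {ssml}")
```

Each variant contains exactly the same words; only the position of the markup differs, which mirrors how the meaning shifts while the text stays fixed.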

Insufficient pitch variation

The most common failure is the absence of prosody: a voice that reads every sentence with the same even pitch and unvarying stress, technically correct and subtly deadening.

What went wrong: the model produced intelligible sounds but made no decision about what the sentence means, so it emphasized nothing. Listeners hear a flat read as bored or robotic rather than neutral, and it makes them do the work of figuring out which word mattered, a job the voice was supposed to do. Monotone therefore gives the voice away even when every phoneme is perfect.^[9]^[11]
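A rough way to check for this failure is to measure how much the pitch actually moves. The sketch below assumes a pitch track has already been extracted as a list of F0 values in Hz, one per frame, with 0 marking unvoiced frames. The function name `pitch_spread_semitones` is hypothetical, and no particular cutoff for "monotone" is implied; the number is only useful for comparing one reading against another.

```python
import math
import statistics


def pitch_spread_semitones(f0_hz: list[float]) -> float:
    """Standard deviation of pitch, in semitones, over voiced frames.

    Frames with f0 <= 0 are treated as unvoiced and skipped (an assumed
    convention; pitch trackers differ in how they mark unvoiced frames).
    """
    voiced = [f for f in f0_hz if f > 0]
    if len(voiced) < 2:
        return 0.0
    # Semitones relative to the median, so the measure does not depend on
    # whether the voice is low or high overall.
    reference = statistics.median(voiced)
    semitones = [12.0 * math.log2(f / reference) for f in voiced]
    return statistics.pstdev(semitones)


if __name__ == "__main__":
    # Made-up contours for illustration only.
    flat = [120.0, 121.0, 0.0, 119.5, 120.5, 120.0]
    lively = [110.0, 140.0, 0.0, 180.0, 150.0, 95.0]
    print(f"flat read:   {pitch_spread_semitones(flat):.2f} st")
    print(f"lively read: {pitch_spread_semitones(lively):.2f} st")
```

Working in semitones rather than Hz keeps the comparison fair across speakers, since pitch perception is roughly logarithmic.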

Incorrect word emphasis

Emphasis in the wrong place is worse than no emphasis. The voice stresses "the" or lands the pitch peak on a function word, and the sentence's meaning shifts or breaks down.

What went wrong: choosing which word carries the stress (the focus) requires knowing what the sentence is about and what came before it. In "I bought the red one," the stress on "red" implies a contrast with some other color, a fact that comes from the conversation rather than the sentence. A system with no model of the discourse guesses, and a wrong guess can imply a contrast that does not exist or hide the one that does.^[2]^[25]

Incorrect sentence intonation

"You're leaving." versus "You're leaving?" Same words; the only signal is the final pitch, rising for the question, falling for the statement.

What went wrong: when punctuation is missing, ambiguous, or ignored, the model defaults to a falling, declarative contour and turns genuine questions into flat assertions. Yes-no questions especially rely on the terminal rise, and a voice that drops the pitch at the end of "are you sure" makes a question sound like a verdict. Tag questions and uptalk complicate this further: a tag such as "isn't it?" can rise or fall depending on whether the speaker is really asking, and uptalk puts a rise on statements that are not questions at all.^[13]^[14]^[15]
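The terminal rise or fall can be checked on synthesized audio with a crude heuristic. The sketch below assumes an F0 track in Hz (0 for unvoiced frames) and compares the last stretch of voiced frames with everything before it. The names `terminal_contour`, `tail_fraction`, and `threshold_st` are illustrative, and their default values are guesses rather than calibrated settings; pitch naturally drifts downward over an utterance, so a serious check would need more care than this.

```python
import math


def terminal_contour(
    f0_hz: list[float], tail_fraction: float = 0.2, threshold_st: float = 1.0
) -> str:
    """Label the end of an utterance as 'rise', 'fall', or 'level'.

    Compares the mean pitch of the final tail_fraction of voiced frames
    with the mean of the voiced frames before it, in semitones.
    """
    voiced = [f for f in f0_hz if f > 0]
    if len(voiced) < 4:
        return "level"  # too little pitch data to judge
    # Keep at least one frame in the tail and at least one in the body.
    tail_len = min(len(voiced) - 1, max(1, int(len(voiced) * tail_fraction)))
    body, tail = voiced[:-tail_len], voiced[-tail_len:]
    body_mean = sum(body) / len(body)
    tail_mean = sum(tail) / len(tail)
    delta_st = 12.0 * math.log2(tail_mean / body_mean)
    if delta_st > threshold_st:
        return "rise"
    if delta_st < -threshold_st:
        return "fall"
    return "level"


if __name__ == "__main__":
    # Made-up contours for illustration only.
    question = [180, 175, 170, 172, 168, 170, 175, 190, 220, 250]
    statement = [180, 185, 178, 170, 165, 160, 150, 140, 120, 105]
    print("You're leaving?", terminal_contour(question))
    print("You're leaving.", terminal_contour(statement))
```

A test suite could run this over questions without question marks and flag any that come back as "fall".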

List intonation

Read "we need eggs, milk, bread, and cheese" and a human voice steps the pitch up through the items and lets it fall on the last, signaling "this is a list and it is ending." A weak TTS reads four disconnected nouns.

What went wrong: the model treated the items independently instead of as a structure with a beginning, middle, and end. List intonation, like the contrast in the emphasis example, is a property of the whole phrase, so a system that decides prosody locally, word by word, cannot produce the arc that tells the listener where they are in the list.^[16]^[17]

Incorrect pause placement

Speech includes pauses. They fall at clause boundaries, before important words, after questions. A bad TTS pauses mid-phrase, runs two sentences together, or breaks "New York" in half.

What went wrong: pausing is phrasing, and phrasing follows syntax and meaning. A pause in the wrong place sounds odd, and it can imply a boundary that changes how the sentence is parsed, the spoken version of a misplaced comma. Long stretches without any pause, common when a system reads run-on input, exhaust the listener because nothing marks the structure.^[6]^[18]^[19]^[20]^[21]
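Where the engine's own phrasing is unreliable, pauses can be requested explicitly. The sketch below inserts SSML `<break>` elements after punctuation that marks a phrase boundary. The helper `add_breaks` and the pause lengths in `PAUSE_MS` are assumptions made for illustration, not values taken from any engine, and the punctuation rule is deliberately naive (abbreviations such as "U.S. policy" would get a sentence-length pause).

```python
import re
from xml.sax.saxutils import escape

# Pause lengths in milliseconds are illustrative guesses, not engine defaults.
PAUSE_MS = {",": 200, ";": 300, ":": 300, ".": 500, "?": 500, "!": 500}


def add_breaks(text: str) -> str:
    """Wrap text in <speak> and add a <break> after each phrase boundary.

    A boundary is a punctuation mark followed by whitespace, so a name
    like "New York" is never split because nothing separates its words.
    """
    # The capture group keeps each punctuation mark in the result list:
    # [chunk, mark, chunk, mark, ..., last_chunk]
    parts = re.split(r"([,;:.?!])\s+", text.strip())
    pieces = []
    for i in range(0, len(parts), 2):
        chunk = parts[i]
        if i + 1 < len(parts):
            mark = parts[i + 1]
            pieces.append(f'{escape(chunk + mark)}<break time="{PAUSE_MS[mark]}ms"/>')
        else:
            pieces.append(escape(chunk))
    return "<speak>" + " ".join(pieces) + "</speak>"


if __name__ == "__main__":
    print(add_breaks("Wait, I think so. Are you sure?"))
```

Tying breaks to punctuation only approximates syntax, but it keeps pauses out of the middle of phrases and gives run-on input at least some audible structure.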

Long-form prosody

Single sentences are largely solved. Long-form reading (an audiobook, a long agent reply) exposes a higher-level failure: each sentence is fine, but the sentences do not connect. The pitch resets to the same starting point every sentence, nothing builds, and a page of competent sentences adds up to a flat performance.

What went wrong: prosody operates at scales above the sentence. Paragraphs have arcs and topics have emphasis, but most systems decide prosody within a window too small to see them.^[22]^[23]^[24] This is the same context-versus-latency tension that streaming TTS lives with: the more text the model considers, the better the prosodic shape, but the longer it waits to start.
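The context-versus-latency trade-off shows up concretely in how streamed text is cut into chunks for synthesis. The sketch below buffers incoming text fragments and releases a chunk at each sentence end, or earlier once a size cap is reached. The function `sentence_chunks` and the `max_chars` default are hypothetical, and the sentence-end pattern is a simplification that will misfire on abbreviations.

```python
import re
from typing import Iterable, Iterator

# A sentence ends at . ? or ! (optionally followed by a closing quote or
# bracket) and then whitespace. Deliberately simple.
SENTENCE_END = re.compile(r"[.?!][\"')\]]*\s")


def sentence_chunks(fragments: Iterable[str], max_chars: int = 200) -> Iterator[str]:
    """Yield text for synthesis at sentence ends, or early at max_chars.

    A larger max_chars gives the synthesizer more context per chunk but
    delays the first audio; a smaller one starts sooner with less context.
    """
    buffer = ""
    for fragment in fragments:
        buffer += fragment
        # Release every complete sentence currently in the buffer.
        match = SENTENCE_END.search(buffer)
        while match:
            yield buffer[: match.end()].strip()
            buffer = buffer[match.end():]
            match = SENTENCE_END.search(buffer)
        # No sentence end in sight: give up waiting and flush what we have.
        if len(buffer) >= max_chars and buffer.strip():
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()


if __name__ == "__main__":
    stream = ["Hello ", "there. ", "How are", " you? I", " am fine"]
    for chunk in sentence_chunks(stream):
        print(repr(chunk))
```

Raising the cap, or waiting for paragraph breaks instead of sentence ends, moves the same dial toward better prosodic shape at the cost of a later start.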

Why prosody is difficult to model

The common thread is that good prosody requires information the text does not contain: intent, contrast, discourse structure, emotion. Modern neural models predict prosody implicitly from large amounts of expressive speech and have become good at the default reading of ordinary sentences. They remain weakest where meaning diverges from text: the contrastive stress a human would choose from knowing the situation, or the sarcasm that inverts a sentence's plain reading.

The fixes work from both ends. Better context (feeding the model whole sentences or paragraphs rather than fragments) improves the structural prosody: questions, lists, phrasing. Explicit control (marking emphasis, pauses, and pitch with SSML and pronunciation controls) lets a human supply the intent the model cannot infer, which is how production systems get a specific line to sound a specific way. Prosody is also most of what evaluation struggles to capture, which is why mean opinion scores (MOS) saturate while voices still sound subtly off even when every word is pronounced correctly.^[26]^[27]

Common questions

What is prosody in speech?

The pitch, loudness, rhythm, and pauses layered over the words: the melody and timing of speech rather than the individual sounds. It carries meaning the words do not, such as whether something is a question, which word is emphasized, and the speaker's attitude. It is built from three measurable signals: pitch, duration, and energy.^[1]^[10]^[12]
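To make the three signals concrete, the sketch below measures each one on a synthetic tone using only the standard library. All function names are hypothetical, and the autocorrelation pitch estimate is a toy that works on a clean sine wave; real speech calls for a proper pitch tracker.

```python
import math


def sine_frame(freq_hz: float, amplitude: float, sample_rate: int, n_samples: int) -> list[float]:
    """Generate a test tone standing in for one voiced frame of speech."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n_samples)]


def duration_seconds(frame: list[float], sample_rate: int) -> float:
    """Duration: how long the segment lasts."""
    return len(frame) / sample_rate


def rms_energy(frame: list[float]) -> float:
    """Energy: root-mean-square amplitude, a rough proxy for loudness."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def autocorr_pitch_hz(frame: list[float], sample_rate: int,
                      fmin: float = 60.0, fmax: float = 400.0) -> float:
    """Pitch: the lag in [1/fmax, 1/fmin] with the highest autocorrelation.

    Returns 0.0 when no lag correlates positively (for example, silence).
    """
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), len(frame) - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else 0.0


if __name__ == "__main__":
    rate = 16000
    frame = sine_frame(200.0, 0.5, rate, 640)  # 40 ms of a 200 Hz tone
    print(f"duration: {duration_seconds(frame, rate) * 1000:.0f} ms")
    print(f"energy:   {rms_energy(frame):.3f}")
    print(f"pitch:    {autocorr_pitch_hz(frame, rate):.1f} Hz")
```

Tracked frame by frame across an utterance, these three measurements give the contours that the rest of this article describes as melody, timing, and emphasis.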

Why does synthetic speech sound robotic even when the words are clear?

Usually because of weak prosody. A voice can pronounce every word perfectly and still read flat, with no emphasis and a monotone pitch, which listeners hear as bored or mechanical. Naturalness comes less from clean phonemes than from the right melody, stress, and phrasing, which require understanding the meaning.

Can I control where a TTS voice puts emphasis?

Often yes, using SSML or similar markup to mark emphasis, insert pauses, or adjust pitch. This matters because the correct emphasis frequently depends on intent the model cannot infer from text alone, such as a contrast that exists only in the surrounding conversation. Explicit control supplies that intent.

Why is long-form narration harder than single sentences?

Because prosody has structure above the sentence: paragraphs build, topics carry emphasis, and the pitch should not reset identically every sentence. Many systems decide prosody in a window too small to see that structure, so they produce competent individual sentences that do not connect into a shaped performance.

References

[1] Yildirim, S., Bulut, M., Lee, C. M., Kazemzadeh, A., Busso, C., Deng, Z., et al. (2004). An acoustic study of emotions expressed in speech. Interspeech 2004.
[2] Shue, Y. L., Shattuck-Hufnagel, S., Iseli, M., Jun, S. A., Veilleux, N., & Alwan, A. (2010). On the acoustic correlates of high and low nuclear pitch accents in American English. Speech Communication, 52(2), 106–122.
[3] Van Santen, J. P. (1997). Prosodic modelling in text-to-speech synthesis. Proc. Eurospeech 1997.
[4] Hirschberg, J. (2006). Speech synthesis: prosody. Encyclopedia of Language & Linguistics.
[5] Kaur, N., & Singh, P. (2023). Conventional and contemporary approaches used in text to speech synthesis: a review. Artificial Intelligence Review, 56(6), 5837–5880.
[6] Ogden, R., Hawkins, S., House, J., Huckvale, M., Local, J., Carter, P., et al. (2000). ProSynth: an integrated prosodic approach to device-independent, natural-sounding speech synthesis. Computer Speech & Language, 14(3), 177–210.
[7] Yosha, I., Maimon, G., Adi, Y., & Keshet, J. (2026). Stresstest: Can your speech LM handle the stress? Findings of the Association for Computational Linguistics: ACL 2026.
[8] Bolinger, D. (1972). Accent is predictable (if you're a mind-reader). Language, 48(3), 633–644.
[9] Ehret, J., Bönsch, A., Aspöck, L., Röhr, C. T., Vorländer, M., & Kuhlen, T. W. (2021). Do prosody and embodiment influence the perceived naturalness of conversational agents' speech? ACM Transactions on Applied Perception, 18(4), 1–21.
[10] Prasetio, B. H., & Widasari, E. R. (2026). Acoustic Correlates of Affective Prosody Across Emotions and Stress Levels: A Phonetic Analysis With Interpretable AI-Assisted Modeling. IEEE Transactions on Audio, Speech, and Language Processing.
[11] Rosenberg, A. (2018). Speech, Prosody, and Machines: Nine Challenges. Speech Prosody 2018.
[12] Albert, A., & Niebuhr, O. (2018). Using periodic energy to enrich acoustic representations of pitch in speech. Speech Prosody 2018.
[13] Ladd, D. R. (2008). Intonational Phonology. Cambridge University Press.
[14] Pierrehumbert, J. B. (1980). The phonology and phonetics of English intonation. Massachusetts Institute of Technology.
[15] Cruttenden, A. (1997). Intonation. Cambridge University Press.
[16] Steindel Burin, M., & Tyler, A. (2018). List intonation. Research Handbook on Social Interaction.
[17] Local, J., Kelly, J., & Wells, W. H. (1986). Towards a phonology of conversation: turn-taking in Tyneside English. Journal of Linguistics, 22(2), 411–437.
[18] Price, P. J., Ostendorf, M., Shattuck-Hufnagel, S., & Fong, C. (1991). The use of prosody in syntactic disambiguation. The Journal of the Acoustical Society of America, 90(6), 2956–2970.
[19] Engelhardt, P. E., Bailey, K. G., & Ferreira, F. (2006). Do-it-yourself syntax: The role of prosody in syntactic disambiguation. Journal of Memory and Language, 54(1), 50–74.
[20] Lehiste, I. (1973). Phonetic disambiguation of syntactic ambiguity. Glossa, 7(2), 107–122.
[21] Fitzpatrick, E., & Bachenko, J. (1989). A computational grammar of discourse-neutral prosodic phrasing in English. Computational Linguistics, 15(4), 278–288.
[22] Peiró-Lilja, A., & Farrús, M. (2018). Paragraph prosodic patterns to enhance text-to-speech naturalness. Speech Prosody 2018.
[23] Farrús, M., & Hernando, J. (2016). Paragraph-based prosodic cues for speech synthesis applications. Speech Prosody 2016.
[24] Wang, X., Takaki, S., & Yamagishi, J. (2022). ParaTTS: Learning linguistic and prosodic cross-sentence information in paragraph-based TTS. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30, 2831–2844.
[25] Sluijter, A. M., & Van Heuven, V. J. (1996). Spectral balance as an acoustic correlate of linguistic stress. The Journal of the Acoustical Society of America, 100(4), 2471–2485.
[26] Wang, Y., Stanton, D., Zhang, Y., Ryan, R., Battenberg, E., Shor, J., et al. (2025). Towards Responsible Evaluation for Text-to-Speech. arXiv preprint arXiv:2510.06927.
[27] Cooper, E., & Yamagishi, J. (2023). The limits of the Mean Opinion Score for speech synthesis evaluation. Computer Speech & Language, 80, 101498.