Sentiment analysis on speech: text vs acoustic signals

"I'm fine." Read the transcript and that is neutral, maybe positive. Hear it said slowly, flatly, through gritted teeth, and it is anything but. The two readings differ because they measure different things: one the content of the speech, the other the manner. A system that uses only one will be wrong exactly where the two diverge.

The design decision is whether to choose between the two methods or combine them.

Text-based sentiment

Text-based sentiment runs on the transcript. It is ordinary natural-language sentiment analysis, using lexical cues, negation, context, and now large language models, applied to recognized speech.

Its strengths are real. It is mature and good at explicit content: a transcript full of "terrible," "refund," and "cancel" is unambiguously negative, and a language model can pick up nuance, topics, and intent that the acoustic signal never carries. It is also cheap, because the transcript usually exists already, and it scales.

Its weakness is everything that is not in the words. It misses sarcasm, because "oh, great, another outage" is positive on its face. It misses the polite mask, the furious "I'm fine." And it inherits every transcript error: if recognition mangled the words, the sentiment reads the mangling. Text-based sentiment is strong on what was said and blind to how.
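A minimal sketch of the lexical approach shows both the strength and the blind spot. The lexicon, its weights, and the function name `text_valence` are invented for illustration; a production system would use a trained model or an LLM, not a six-word dictionary.

```python
import re

# Toy lexicon: hypothetical words and weights, chosen only for illustration.
LEXICON = {
    "great": 1.0, "fine": 0.5, "thanks": 0.5,
    "terrible": -1.0, "refund": -0.5, "cancel": -0.5,
}
NEGATORS = {"not", "no", "never"}


def text_valence(transcript: str) -> float:
    """Sum lexicon weights over the transcript, flipping the sign of a word
    that directly follows a negator. The result is clamped to [-1, 1]."""
    tokens = re.findall(r"[a-z']+", transcript.lower())
    score = 0.0
    for i, token in enumerate(tokens):
        weight = LEXICON.get(token, 0.0)
        if i > 0 and tokens[i - 1] in NEGATORS:
            weight = -weight
        score += weight
    return max(-1.0, min(1.0, score))


print(text_valence("This is terrible, I want a refund and to cancel"))  # -1.0
print(text_valence("Oh, great, another outage"))                        # 1.0
print(text_valence("I'm fine."))                                        # 0.5
```

The explicit complaint scores as strongly negative, while the sarcastic line and the furious "I'm fine" both come out positive, because nothing in the words says otherwise.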

Acoustic sentiment

Acoustic sentiment, usually called speech emotion recognition (SER), ignores the words and reads the sound. It measures prosodic signals (pitch and its movement, loudness, speaking rate, voice quality) and maps them to emotion, either as categories (angry, happy, sad, neutral) or along continuous dimensions such as arousal and valence.

Its strength is the mirror of text's weakness: it hears how something was said. A raised voice, a clipped tempo, or a trembling pitch reveals anger or distress that the words hide, and it depends far less on the specific language, since tone travels across languages better than vocabulary does. It catches the gritted-teeth "fine."
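To make "reads the sound" concrete, here is a rough sketch of two of the measurements involved: loudness as root-mean-square energy and pitch from autocorrelation. It uses only the standard library and a synthetic tone in place of real speech. The function names, the 75 to 400 Hz search range, and the frame size are assumptions for illustration; real SER systems use dedicated audio toolkits and many more features.

```python
import math


def rms_energy(frame):
    """Root-mean-square amplitude, a rough proxy for loudness."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def estimate_pitch_hz(frame, sample_rate, fmin=75.0, fmax=400.0):
    """Estimate the fundamental frequency by picking the lag with the
    highest (unnormalized) autocorrelation inside a plausible pitch range."""
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    best_lag, best_corr = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        corr = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return sample_rate / best_lag


# Synthetic 200 Hz tone standing in for one voiced frame of speech.
SAMPLE_RATE = 8000
frame = [0.5 * math.sin(2 * math.pi * 200 * n / SAMPLE_RATE) for n in range(800)]

print(round(rms_energy(frame), 3))            # loudness proxy for the frame
print(estimate_pitch_hz(frame, SAMPLE_RATE))  # pitch estimate in Hz
```

Tracked frame by frame, values like these become the pitch contour and energy envelope that an emotion model consumes. Speaking rate needs word or syllable timing as well, which this sketch leaves out.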

Its weaknesses are equally real. It is less mature and far more sensitive to noise, microphone, and channel. It varies enormously by speaker and culture, so what reads as anger in one voice is normal animation in another. And it is ambiguous: high arousal looks the same whether someone is furious or thrilled, so the acoustic signal often cannot tell excitement from rage on its own.
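One common way to soften the speaker-variation problem is to score each measurement against the speaker's own baseline instead of a global one. The sketch below uses a plain z-score; the pitch values and the name `speaker_z_scores` are hypothetical, and this is one possible normalization, not a claim about how any particular product works.

```python
from statistics import mean, pstdev


def speaker_z_scores(values):
    """Express each measurement relative to the speaker's own baseline:
    (value - speaker mean) / speaker standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical per-utterance mean pitch in Hz for two speakers.
animated_speaker = [220.0, 240.0, 230.0, 250.0, 235.0]
quiet_speaker = [110.0, 112.0, 111.0, 150.0, 109.0]

print(speaker_z_scores(animated_speaker)[3])  # 250 Hz, near this speaker's usual range
print(speaker_z_scores(quiet_speaker)[3])     # 150 Hz, unusual for this speaker
```

In raw hertz the animated speaker's 250 Hz is higher than anything the quiet speaker produces, yet relative to each person's baseline it is the quiet speaker's 150 Hz utterance that stands out more.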

Causes of different sentiment results

When text and acoustics agree, sentiment is easy and either method works. The interesting cases are the disagreements, and they carry information, not noise. Polite words over an angry voice signal suppressed frustration. Cheerful words in a flat voice signal sarcasm. A system that reads only the words or only the sound throws away the signal the mismatch encodes.

This is why the best systems are multimodal, fusing both: text supplies valence and explicit content, acoustics supply arousal and the unspoken tone, and the relationship between them catches sarcasm and masking that neither sees alone.
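The fusion idea can be sketched as a small rule table over a text valence score and an acoustic arousal score. The thresholds, labels, and the `Reading` type are assumptions for illustration; real multimodal systems usually learn the combination from data. Because arousal alone cannot separate anger from excitement, the mismatch labels are deliberately phrased as "possible."

```python
from dataclasses import dataclass


@dataclass
class Reading:
    text_valence: float      # -1 (negative) .. +1 (positive), from the transcript
    acoustic_arousal: float  # 0 (calm) .. 1 (highly activated), from the audio


def fuse(reading: Reading, valence_cut: float = 0.2,
         high_arousal: float = 0.7, low_arousal: float = 0.3) -> str:
    """Combine the two scores, treating a mismatch as a signal of its own."""
    v, a = reading.text_valence, reading.acoustic_arousal
    if v > valence_cut and a < low_arousal:
        return "possible sarcasm"   # cheerful words, flat voice
    if abs(v) <= valence_cut and a > high_arousal:
        return "possible masking"   # neutral or polite words, activated voice
    if v < -valence_cut:
        return "negative"
    if v > valence_cut:
        return "positive"
    return "neutral"


print(fuse(Reading(text_valence=0.8, acoustic_arousal=0.1)))   # possible sarcasm
print(fuse(Reading(text_valence=0.1, acoustic_arousal=0.9)))   # possible masking
print(fuse(Reading(text_valence=-0.9, acoustic_arousal=0.8)))  # negative
```

When the two scores agree, the function falls through to the text label, which matches the observation that agreement is the easy case.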

Method comparison

	Text-based sentiment	Acoustic sentiment (SER)
Reads	The transcript (words)	The audio (pitch, energy, rhythm)
Captures	What was said	How it was said
Strong on	Valence, explicit content, intent	Arousal, tone, distress
Misses	Sarcasm, tone, masking	Valence (anger vs excitement)
Language	Tied to the language	Travels across languages
Maturity	High, reuses NLP and LLMs	Lower, noisy, speaker-variable
Depends on	Transcript accuracy	Audio quality

Limits of emotion inference

Emotion recognition deserves more humility than it usually gets. There is a serious scientific debate, argued by researchers including Lisa Feldman Barrett and colleagues, about whether emotions map reliably onto vocal or facial expressions at all, across people and cultures; the evidence that a given tone means a given inner emotion is weaker than products often imply.^[2] Add cultural bias, individual variation, and the consent and privacy questions that come with inferring people's feelings.^[1] The responsible stance is to treat acoustic emotion labels as uncertain signals, not facts, and to be especially careful when using them to judge or act on individuals. Sentiment analysis is useful for spotting trends and flagging calls for review, but it is shaky ground for deciding what a specific person "really felt."
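In practice, treating labels as signals can mean keeping the model's confidence next to every label and using the output only for aggregate trends and a human review queue. The sketch below assumes a hypothetical `(call_id, label, confidence)` tuple per call and an arbitrary review threshold of 0.8.

```python
def summarize_calls(calls, review_threshold=0.8):
    """calls: list of (call_id, label, confidence) tuples from any sentiment model.

    Returns the share of calls labelled negative (a trend figure) and the ids
    of confidently negative calls to queue for human review. Nothing here
    takes automatic action on an individual."""
    if not calls:
        return 0.0, []
    negative = [c for c in calls if c[1] == "negative"]
    negative_share = len(negative) / len(calls)
    review_queue = [call_id for call_id, _, confidence in negative
                    if confidence >= review_threshold]
    return negative_share, review_queue


calls = [
    ("a", "negative", 0.90),
    ("b", "neutral", 0.70),
    ("c", "negative", 0.55),
    ("d", "positive", 0.80),
]
print(summarize_calls(calls))  # (0.5, ['a'])
```

The low-confidence negative call still counts toward the trend but is not singled out, which keeps uncertain labels from being applied to a specific person.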

Common questions

What is the difference between text-based and acoustic sentiment analysis?

Text-based sentiment reads the transcript and infers feeling from the words, capturing what was said. Acoustic sentiment, or speech emotion recognition, reads the audio (pitch, energy, and rhythm) and captures how it was said. They use different evidence and often disagree, because words and tone can point in opposite directions.

Why do the two methods give different answers?

Because they measure different things. The words might be polite while the voice is angry, or cheerful while the tone is sarcastic. Text reads valence and content well but misses tone; acoustics read arousal and tone well but cannot reliably tell positive from negative. The disagreement itself signals sarcasm or suppressed emotion, which is why fusing both works best.

Can AI reliably tell how someone feels from their voice?

Only roughly, and with real caveats. Acoustic models detect arousal (how activated someone is) fairly well but valence (positive versus negative) poorly, and there is genuine scientific debate about whether vocal tone maps reliably onto specific emotions at all. Cultural and individual variation is large, so emotion labels are best treated as uncertain signals, not facts about a person.

Which should I use for analyzing customer calls?

Both, if you can. Text-based sentiment is mature, cheap, and good at explicit content and intent; acoustic sentiment catches the frustration that polite words hide. Fusing them gives the richest reading. Whatever you use, treat the output as a signal for spotting trends and flagging calls for review rather than a definitive judgment of an individual.

References

[1] Lin, Y.-C., Wu, H., Chou, H.-C., Lee, C.-C., & Lee, H.-Y. (2024). Emo-bias: A Large Scale Evaluation of Social Bias on Speech Emotion Recognition. arXiv preprint arXiv:2406.05065.
[2] Barrett, L. F., Adolphs, R., Marsella, S., Martinez, A. M., & Pollak, S. D. (2019). Emotional Expressions Reconsidered: Challenges to Inferring Emotion From Human Facial Movements. Psychological Science in the Public Interest, 20(1).