A vendor announces a "4.5 MOS, human-level" voice, and the claim reads like a settled result. It usually is not. The same model can score 4.5 on the vendor's chosen sentences and still stumble on your product's names, your questions, and your long paragraphs. An average over flattering samples is an inflated average.
The situation parallels speech recognition. MOS has played the same role for TTS that word error rate (WER) played for recognition, the case made in beyond WER: one number tracked a field's progress for decades, then stopped separating the top systems.[6][7]
How MOS works
MOS is a subjective listening test. You recruit listeners, play them speech samples, and have each one rate every sample on a five-point scale, usually for naturalness. Average the ratings and you have the score. Its appeal is that it measures human judgment directly, not through a proxy that only approximates it.
MOS also produced the field's most famous result. The 2016 WaveNet evaluation reported MOS for parametric, concatenative, and neural systems against human speech,[4] and the jump WaveNet made toward the human score is the number everyone cited, the same result shown in TTS voices. For tracking a field that was getting better year over year, MOS did its job.
Limitations of Mean Opinion Score
The trouble arrives at the top of the scale. When several systems all land between roughly 4.3 and 4.5, MOS stops telling them apart: the averages crowd together and the differences fall inside the noise of human rating.[1][2][3] A ruler whose smallest markings are wider than the differences you are trying to measure has stopped being useful, and modern TTS lives in exactly that region.
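To make the averaging and the crowding concrete, here is a minimal Python sketch. It assumes made-up ratings for two hypothetical systems and a normal-approximation 95% interval that treats every rating as independent, which real listening tests with repeated listeners do not satisfy. The function names are illustrative, not taken from any library.

```python
import math
import statistics

def mos_with_ci(ratings, z=1.96):
    """Return (mean, half_width) for a list of 1-5 ratings.

    The interval is a normal approximation that assumes independent
    ratings; it is a rough guide, not a full statistical analysis.
    """
    mean = statistics.fmean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width

def intervals_overlap(a, b):
    """True when two (mean, half_width) intervals share any point."""
    (mean_a, hw_a), (mean_b, hw_b) = a, b
    return abs(mean_a - mean_b) <= hw_a + hw_b

# Made-up ratings for two hypothetical systems, 20 ratings each.
system_a = [5, 4, 5, 4, 4, 5, 4, 5, 4, 4, 5, 4, 5, 4, 4, 5, 4, 5, 4, 5]
system_b = [4, 4, 5, 4, 4, 5, 4, 4, 4, 5, 5, 4, 4, 4, 5, 4, 4, 5, 4, 4]

a = mos_with_ci(system_a)
b = mos_with_ci(system_b)

print(f"System A: {a[0]:.2f} +/- {a[1]:.2f}")
print(f"System B: {b[0]:.2f} +/- {b[1]:.2f}")
print("Intervals overlap:", intervals_overlap(a, b))
```

With these invented numbers the two means differ, but the intervals overlap, so the test gives no clear basis for ranking one system above the other. Overlapping intervals are only a quick check, not a formal significance test.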
Three weaknesses compound the saturation.
First, MOS is not comparable across studies. The score depends on who the listeners were, what instructions they got, which samples were in the test, and what the scale was anchored against. The same audio scores differently depending only on what it was played next to, so a 4.4 from one paper and a 4.4 from another are not the same measurement.[1] Without a shared reference in the test, the absolute number means little.
Second, MOS hides what is wrong. One naturalness score blends intelligibility, prosody, pronunciation accuracy, and artifacts into a single figure, so it cannot tell you why a voice scored 4.2. Was the prosody flat? Did it mispronounce names? Was there a glitch every minute? The specific failure, the thing you most want to act on, is the one MOS averages away.[5]
Third, MOS rewards the easy sentence. Tests run on short, neutral sentences, and vendors pick favorable ones. That tells you nothing about long-form reading, contrastive emphasis, questions, or the names and numbers that break voices in production, where a voice that aces short demo lines can still mispronounce names and lose prosody over a full paragraph.
Additional evaluation methods
Nobody has replaced MOS with a single better number, for the same reason no single number replaced WER: quality is several things at once. Evaluation has become a toolkit instead.
Comparison MOS (CMOS) plays two systems on the same sentence and asks listeners which they prefer, and by how much. As a direct A/B with a shared reference, it is far more sensitive than absolute MOS in the saturated region, and it is now the standard way to show one system beats another. MUSHRA is a more elaborate listening test that presents many samples at once with a hidden high-quality reference and a deliberately degraded anchor, forcing finer discrimination.[3]
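A paired comparison can be analyzed in a few lines. The sketch below assumes each listener scores system B against system A on a -3 to +3 scale for the same sentence, a common convention but not the only one, and uses invented scores. The exact sign test is one simple choice of analysis, not a method prescribed by the sources cited here.

```python
import math
import statistics

def cmos_summary(scores, z=1.96):
    """Mean comparison score and normal-approximation 95% half-width.

    Each score is one judgment of system B against system A on the same
    sentence: negative favors A, positive favors B, 0 is no preference.
    """
    mean = statistics.fmean(scores)
    half_width = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, half_width

def sign_test_p(scores):
    """Two-sided exact sign test on the non-zero scores (ties dropped)."""
    wins = sum(1 for s in scores if s > 0)
    losses = sum(1 for s in scores if s < 0)
    n = wins + losses
    if n == 0:
        return 1.0
    k = max(wins, losses)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Invented judgments from 20 paired trials.
scores = [1, 2, 0, 1, 1, -1, 2, 1, 0, 1, 1, 2, -1, 1, 1, 0, 2, 1, 1, 1]

mean, half_width = cmos_summary(scores)
p_value = sign_test_p(scores)

print(f"CMOS (B vs A): {mean:+.2f} +/- {half_width:.2f}")
print(f"Sign test p-value: {p_value:.4f}")
```

Because both systems read the same sentence to the same listener, differences between sentences and between listeners largely cancel out, which is where the extra sensitivity comes from.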
Intelligibility tests measure whether listeners, or an ASR system run on the synthetic audio, can correctly transcribe what was said, catching the pronunciation and clarity failures naturalness ratings miss. Objective metrics like mel-cepstral distortion compare synthetic audio to a reference recording mathematically; they are cheap and repeatable, but weak proxies for human judgment. Automatic MOS predictors, neural models trained to estimate human MOS, are improving but inherit the limits of the scores they learned from.
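An intelligibility check of this kind usually reduces to a word error rate between the text you asked for and the transcript a listener or ASR system produced. The sketch below assumes you already have both strings; it makes no ASR call, and its normalization (lowercasing and whitespace splitting) is deliberately simplistic.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)

# Intended text versus a hypothetical transcript of the synthetic audio.
intended = "call doctor nguyen at nine"
transcribed = "call doctor win at nine"
wer = word_error_rate(intended, transcribed)
print(f"WER: {wer:.2f}")
```

When an ASR system produces the transcript, some of the errors belong to the recognizer and not the voice, so the number is best read as a relative signal across systems that were transcribed the same way.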
The most useful category for a builder is task and edge-case tests. Probe what your use case depends on: names, numbers, questions, lists, long passages, expressive lines, code-switching. Score those directly, along with the UX metrics a listening test ignores, like time-to-first-audio. For a voice cloning system, add speaker similarity: does it sound like the target?
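A small harness is enough to start. The sketch below assumes a streaming TTS client exposed as a Python generator that yields audio chunks; `fake_stream` is a stand-in for that client, not a real vendor API, and the pass/fail judgments are assumed to come from your own manual review of each edge case.

```python
import time
from collections import defaultdict

def time_to_first_audio(synthesize_stream, text):
    """Seconds from the request until the first non-empty chunk arrives."""
    start = time.perf_counter()
    for chunk in synthesize_stream(text):
        if chunk:
            return time.perf_counter() - start
    return None  # the stream ended without producing audio

def pass_rate_by_category(results):
    """Aggregate (category, passed) pairs into a pass rate per category."""
    totals = defaultdict(lambda: [0, 0])  # category -> [passed, total]
    for category, passed in results:
        totals[category][0] += int(passed)
        totals[category][1] += 1
    return {cat: passed / total for cat, (passed, total) in totals.items()}

# Hypothetical stand-in for a streaming TTS client; replace with your own.
def fake_stream(text):
    time.sleep(0.01)      # simulated model latency
    yield b"\x00" * 320   # first audio chunk
    yield b"\x00" * 320

latency = time_to_first_audio(fake_stream, "Order AB-1042 ships on 3 March.")

# Invented review results: one entry per edge-case sentence you listened to.
rates = pass_rate_by_category([
    ("names", True), ("names", False),
    ("numbers", True), ("questions", True),
])

print(f"Time to first audio: {latency * 1000:.0f} ms")
print(rates)
```

Reporting a pass rate per category keeps the specific failure visible: a voice that fails half of your names shows up as exactly that, where a single averaged score would hide it.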
Evaluating TTS for an application
The single most useful habit, repeated from TTS voices because it matters that much: ignore the sample on the website and feed the voice your own hardest text. A page of your real output, with your actual names, reference codes, questions, and the longest paragraph you will ever ask it to read, tells you more than any MOS on any leaderboard. Treat evaluation as a test you design around what you will use the voice for, not a number you collect from someone else, and choose the system that holds up on your own text.
Common questions
What is a Mean Opinion Score (MOS)?
A subjective measure of speech quality: listeners rate samples from 1 (bad) to 5 (excellent), usually for naturalness, and the ratings are averaged. MOS has been the standard TTS evaluation for decades and is still widely quoted, but it is saturating now that many systems cluster near the top of the scale.
Is a higher MOS always a better voice?
No. MOS is not comparable across studies, since it depends on the listeners, instructions, and samples used, and it averages away specific failures like bad prosody or mispronounced names. A high MOS on short, favorable sentences says little about long-form reading or your content.
What is the difference between MOS and CMOS?
MOS gives each sample an absolute rating and averages them. CMOS (comparison MOS) plays two systems on the same sentence and asks listeners which they prefer, and by how much. The shared reference makes CMOS discriminate far better when the systems all score high, which is why it is now the standard way to show one system beats another.
How should I evaluate a TTS system for my product?
Test what you will actually use it for. Run your real names, numbers, questions, lists, and longest passages through it rather than its demo sentences, score pronunciation and prosody on those directly, and measure UX factors like time-to-first-audio. A single naturalness score on easy text does not predict how the voice behaves on your content.
References
[1] Kirkland, A., Mehta, S., Lameris, H., Henter, G. E., Székely, É., & Gustafson, J. (2023). Stuck in the MOS pit: A critical analysis of MOS test methodology in TTS evaluation. Proceedings of the 12th ISCA Speech Synthesis Workshop (SSW2023), 41–47.
[2] Wan, V., Agiomyrgiannakis, Y., Silen, H., & Vit, J. (2017). Google's Next-Generation Real-Time Unit-Selection Synthesizer Using Sequence-to-Sequence LSTM-Based Autoencoders. INTERSPEECH 2017.
[3] Lajszczak, M., et al. (2024). Rethinking MUSHRA: Addressing Modern Challenges in Text-to-Speech Evaluation. arXiv preprint arXiv:2411.12719.
[4] van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., et al. (2016). WaveNet: A Generative Model for Raw Audio. arXiv preprint arXiv:1609.03499.
[5] Chen, Y., et al. (2025). Towards Responsible Evaluation for Text-to-Speech. arXiv preprint arXiv:2510.06927.
[6] Line, T.-H., & Sneha, H. (2026). Beyond Word Error Rate: Auditing the Diversity Tax in Speech Recognition through Dataset Cartography. arXiv preprint arXiv:2603.05267.
[7] Moving beyond word error rate to evaluate automatic speech recognition in clinical samples: Lessons from research into schizophrenia-spectrum disorders. Psychiatry Research (2025).