Most text-to-speech engines can render individual words successfully. However, as soon as the engine speaks a sentence, it is easy to identify the voice as synthesized because it lacks human prosody — that is, the inflection, accent, and timing of speech. For this reason, most text-to-speech voices are difficult to listen to and require concentration to understand, especially for more than a few words at a time.
Some engines allow an application to define text-to-speech segments with human prosody attached, which makes the synthesized voice much clearer. The engine provides this capability by letting the application developer prerecord a human voice and transfer its intonation and speed to the text being spoken.
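To make the idea concrete, the following sketch (in Python, purely for illustration) pairs each word of a segment with pitch and duration values measured from a human recording and passes the result to a placeholder engine call. The ProsodyMark structure and the speak_with_prosody function are assumptions made for this example; they do not correspond to any particular engine's interface.

# Hypothetical sketch: attaching prosody taken from a human recording to text.
# The ProsodyMark structure and speak_with_prosody() are illustrative only;
# real engines expose this capability through their own interfaces.

from dataclasses import dataclass
from typing import List

@dataclass
class ProsodyMark:
    word: str          # text to be spoken
    pitch_hz: float    # average pitch measured from the recording
    duration_ms: int   # how long the speaker took to say the word

# Values as they might be measured from a recording of
# "Please leave a message after the tone."
segment: List[ProsodyMark] = [
    ProsodyMark("Please",  210.0, 320),
    ProsodyMark("leave",   195.0, 260),
    ProsodyMark("a",       180.0,  90),
    ProsodyMark("message", 175.0, 450),
    ProsodyMark("after",   170.0, 300),
    ProsodyMark("the",     165.0, 110),
    ProsodyMark("tone",    150.0, 520),
]

def speak_with_prosody(marks: List[ProsodyMark]) -> None:
    """Placeholder for an engine call that renders text using the
    attached pitch and timing instead of its default prosody."""
    for m in marks:
        print(f"{m.word:10s} pitch={m.pitch_hz:6.1f} Hz  duration={m.duration_ms} ms")

speak_with_prosody(segment)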
In effect, this acts as a highly effective voice compression algorithm. Although text with prosody attached requires more storage than plain ASCII text (1K per minute compared with a few hundred bytes per minute), it requires considerably less storage than prerecorded speech, which takes at least 30K per minute. The following factors also influence the quality of a synthesized voice:
· Emotion. Although many text-to-speech engines can parse and interpret punctuation such as periods, commas, exclamation points, and question marks, no currently available engine can render the sound of human emotion.
· Mispronunciation. Text-to-speech engines use a set of pronunciation rules to translate text into phonemes. This is fairly easy for languages that are spelled phonetically, but it is very difficult for English, especially when last names must be pronounced correctly. (Pronunciation rules seldom fail on common words, but they almost always fail on odd or foreign-sounding names.)
If an engine mispronounces a word, the only way the user can correct it is to enter the phonemes directly, which is not an easy task, or to choose a series of "sound-alike" words that combine to produce the correct pronunciation. The sketch below illustrates both approaches.
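To illustrate both correction techniques, the following sketch (Python, purely for illustration) consults a user-supplied phoneme override first and a sound-alike respelling second. The ARPAbet-style phoneme strings and the dictionary entries are assumptions made for this example; a real engine defines its own phoneme notation and lexicon format.

# Hypothetical sketch of the two correction techniques described above.
# The phoneme strings use an ARPAbet-like notation for illustration only;
# a real engine defines its own phoneme set and lexicon format.

# Direct phoneme entry: precise, but hard for most users to write.
phoneme_overrides = {
    "Nguyen": "W IH N",
    "Sade":   "SH AA D EY",
}

# Sound-alike respelling: a series of ordinary words whose combined
# pronunciation approximates the intended one.
sound_alike_overrides = {
    "Dubois": "due bwah",
    "Siobhan": "shiv on",
}

def pronunciation_for(word: str) -> str:
    """Return the text or phonemes the engine should actually speak."""
    if word in phoneme_overrides:
        return phoneme_overrides[word]        # engine renders the phonemes directly
    if word in sound_alike_overrides:
        return sound_alike_overrides[word]    # engine applies its rules to the respelling
    return word                               # fall back to the engine's own rules

for name in ["Nguyen", "Dubois", "Smith"]:
    print(f"{name:8s} -> {pronunciation_for(name)}")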