What Can Speech Do?

Now that I've described reasonable expectations of speech technology, I'll tell you a bit about the current capabilities of text-to-speech and speech recognition.

Text-to-speech comes in two flavors: synthesized text-to-speech and concatenated text-to-speech.

Synthesized speech is what people typically think of when I mention text-to-speech. The engine analyzes the words in the text and works out their phonetic pronunciations. The resulting phonemes are then passed into a complex algorithm that simulates the human vocal tract and emits the sound. This method allows the text-to-speech to speak any word, even made-up ones like "Zamphoon," but it produces a voice that has very little emotion and is distinctly not human. You'd use this if you knew that the application had to speak, but you couldn't predict what it would need to say. Synthesized speech usually requires a 486/33 megahertz machine with 1 megabyte of working-set RAM.

Concatenated text-to-speech does something different. It analyzes the text and pulls recordings of words and phrases out of a prerecorded library, then concatenates the digital audio recordings. Because the voice is just a recording that you've made, it sounds good. Unfortunately, if the text includes a word or phrase that you didn't record, the text-to-speech can't say it. Concatenated text-to-speech can be viewed as a form of audio compression because words or common phrases have to be recorded only once. For example, many telephone applications will have a recording for, "Press 1 to play new message; press 2 to send a fax," and so on, and another recording for, "Press 1 to fast-forward; press 2 to rewind." A concatenated text-to-speech engine will have only one recording of "press" rather than four. If concatenated text-to-speech doesn't seem that much different to you from recording your own .WAV files, you're right. However, concatenated text-to-speech will save you development time and bugs, allowing you to add more features to your software. Because concatenated text-to-speech just plays .WAV files, it takes very little processor power and only a bit of memory, since most of the audio is stored on disk.
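
To make the lookup-and-concatenate idea concrete, here is a minimal sketch in Python. It is not how any particular engine works: the function name plan_playback, the LIBRARY table, the .wav file names, and the greedy longest-match strategy are all assumptions made for illustration. It only plans which recordings to play, in order, and fails on any word that was never recorded.

```python
import re


def plan_playback(text, library):
    """Return the recordings to play, in order, for `text`.

    `library` maps a lowercase phrase (one or more words) to a recording name.
    At each position the longest recorded phrase wins (greedy longest match).
    """
    words = re.findall(r"[a-z0-9'-]+", text.lower())
    longest = max(len(phrase.split()) for phrase in library)
    plan = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(longest, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            if phrase in library:
                plan.append(library[phrase])
                i += n
                break
        else:
            # Nothing recorded for this word: concatenated TTS cannot say it.
            raise KeyError(f"no recording for {words[i]!r}")
    return plan


# Hypothetical library: "press" is recorded once and reused in every prompt.
LIBRARY = {
    "press": "press.wav",
    "1": "one.wav",
    "2": "two.wav",
    "to play new message": "play_new.wav",
    "to send a fax": "send_fax.wav",
    "to fast-forward": "ffwd.wav",
    "to rewind": "rewind.wav",
}
```

Calling plan_playback("Press 1 to play new message; press 2 to send a fax.", LIBRARY) yields six entries that reuse press.wav twice, while a made-up word such as "Zamphoon" raises KeyError, which mirrors the limitation described above.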

Speech recognition is somewhat more complicated to classify than text-to-speech. Each speech recognition engine has three characteristics:

Continuous vs. discrete: If speech recognition is continuous, users can speak to the system naturally. If it's discrete, users need to pause between words. Obviously, continuous recognition is preferred over discrete recognition, but continuous recognition takes more processing power.

Vocabulary size: Speech recognition can support a small or large vocabulary. Small-vocabulary recognition allows users to give simple commands to their computers. To dictate a document, the system must have large-vocabulary recognition. Large-vocabulary recognition takes a lot more processor power and memory than small-vocabulary recognition.

Speaker dependency: Speaker-independent speech recognition works right out of the box, while speaker-dependent systems require that each user spend about 30 minutes training the system to his or her voice.
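
One way to picture the three characteristics is as three independent yes/no flags on an engine. The sketch below is only an illustration of that classification; the EngineProfile class and its field names are made up here and do not come from any speech API.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class EngineProfile:
    continuous: bool           # False means discrete (pause between words)
    large_vocabulary: bool     # False means small vocabulary (simple commands)
    speaker_independent: bool  # False means speaker dependent (needs training)

    def label(self):
        """Describe the profile using the terms from the list above."""
        return ", ".join([
            "continuous" if self.continuous else "discrete",
            "large vocabulary" if self.large_vocabulary else "small vocabulary",
            "speaker independent" if self.speaker_independent else "speaker dependent",
        ])


# Three two-way choices give eight possible combinations.
ALL_PROFILES = [EngineProfile(*flags) for flags in product([True, False], repeat=3)]

for profile in ALL_PROFILES:
    print(profile.label())
```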

Although any combination of the three characteristics is possible, two combinations are popular today.

"Command and Control" speech recognition is continuous, small vocabulary, and speaker independent. This means that users can use several hundred different commands or phrases. If a user says a command that is not in the list, the speech-recognition system will either return "not recognized" or think it heard a similar-sounding command. Because users of Command and Control can say only specific phrases, the phrases must be visible on the screen, or be so intuitive that all users will know what to say, or the users must learn what phrases they can say. Command and Control speech recognition requires a 486/66 megahertz machine with 1 megabyte of working-set RAM.

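The behavior of a Command and Control recognizer when it hears something outside its list can be imitated with a toy matcher. The sketch below is only an analogy: a real engine compares sounds, whereas this one compares spelling using Python's standard difflib module. The COMMANDS list, the recognize function, and the 0.75 cutoff are all assumptions chosen for illustration.

```python
import difflib

# Hypothetical fixed command list; a real application might have several hundred.
COMMANDS = ["open file", "close file", "print document", "send mail"]


def recognize(heard, commands=COMMANDS, cutoff=0.75):
    """Return the closest command to `heard`, or None for "not recognized".

    Text similarity stands in for acoustic similarity here, so an input that
    is not in the list can still be mistaken for a similar-looking command.
    """
    matches = difflib.get_close_matches(heard.lower(), commands, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

With these settings, recognize("open pile") comes back as "open file" (the similar-sounding mistake), while recognize("make me a sandwich") returns None (the "not recognized" case). Raising the cutoff trades fewer false matches for more rejections.
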
"Discrete Dictation" speech recognition is discrete, large vocabulary, and speaker dependent. It's used to dictate text into word processors or mail, or for natural-language commands. Although users may say anything they wish, they must leave pauses between words, making the speech unnatural. Discrete dictation requires a Pentium/60 megahertz machine with 8 megabytes of RAM.