Overview of Speech Technologies

Speech recognition is the ability of a computer to understand the spoken word so that it can receive commands and data input from the speaker. Text-to-speech is the ability of a computer to convert text information into synthetic speech output.

Speech recognition and text-to-speech use engines, which are the programs that do the actual work of recognizing speech or playing text. Most speech-recognition engines convert incoming audio data to engine-specific phonemes, which are then translated into text that an application can use. (A phoneme is the smallest structural unit of sound that can be used to distinguish one utterance from another in a spoken language.) A text-to-speech engine performs the same process, in reverse. Engines are supplied by vendors that specialize in speech software; they may be bundled with new audio-enabled computers and sound cards, purchased separately, or licensed from the vendor.
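
The second stage of that process, translating phonemes into text, can be sketched in a few lines of Python. This is a toy illustration only: the phoneme symbols, the PHONEMES_TO_WORD table, and the phonemes_to_text function are assumptions made up for the example, and a real engine uses its own phoneme set and far more sophisticated matching.

```python
# Hypothetical table mapping phoneme sequences to words.
# Real engines use engine-specific phonemes and statistical models.
PHONEMES_TO_WORD = {
    ("HH", "AH", "L", "OW"): "hello",
    ("W", "ER", "L", "D"): "world",
    ("OW", "P", "AH", "N"): "open",
    ("F", "AY", "L"): "file",
}


def phonemes_to_text(phonemes):
    """Greedy longest-match decode of a phoneme sequence into words."""
    words = []
    i = 0
    max_len = max(len(key) for key in PHONEMES_TO_WORD)
    while i < len(phonemes):
        # Try the longest possible phoneme run first, then shorter ones.
        for length in range(min(max_len, len(phonemes) - i), 0, -1):
            candidate = tuple(phonemes[i:i + length])
            if candidate in PHONEMES_TO_WORD:
                words.append(PHONEMES_TO_WORD[candidate])
                i += length
                break
        else:
            # No word starts here; mark it and skip one phoneme.
            words.append("<unk>")
            i += 1
    return " ".join(words)


print(phonemes_to_text(["OW", "P", "AH", "N", "F", "AY", "L"]))
```

The first stage, turning raw audio into phonemes, is deliberately left out because it depends entirely on the engine.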

The speech-recognition engine transcribes audio data received from an audio source, such as a microphone or a telephone line. The text-to-speech engine converts text to audio data, which is sent to an audio destination, such as a speaker, a headphone, or a telephone line. Under some circumstances, an engine may also be able to read audio data from a file or write audio data to a file.

An engine typically provides more than one mode for recognizing speech or playing text. For example, a speech-recognition engine will have a mode for each language or dialect that it can recognize. Likewise, a text-to-speech engine will have a mode for each voice, each of which plays text in a different speaking style or personality. Other modes may be optimized for a particular audio sampling rate, such as 8 kilohertz (kHz) for use over a telephone line.
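
One way an application might pick among an engine's modes is sketched below in Python. The EngineMode record, the mode names, and the find_mode helper are all hypothetical and do not correspond to any particular engine's interface; they only illustrate selecting a mode by language, sampling rate, and voice.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EngineMode:
    name: str                    # hypothetical mode name
    language: str                # language or dialect tag, e.g. "en-US"
    sample_rate_hz: int          # sampling rate the mode is tuned for
    voice: Optional[str] = None  # set only for text-to-speech modes


# Hypothetical list of modes that an engine might report.
MODES = [
    EngineMode("dictation-us", "en-US", 16000),
    EngineMode("telephony-us", "en-US", 8000),
    EngineMode("dictation-uk", "en-GB", 16000),
    EngineMode("tts-us-calm", "en-US", 16000, voice="calm"),
]


def find_mode(modes, language, sample_rate_hz, voice=None):
    """Return the first mode matching all criteria, or None."""
    for mode in modes:
        if (mode.language == language
                and mode.sample_rate_hz == sample_rate_hz
                and mode.voice == voice):
            return mode
    return None


# A telephone application would ask for the 8 kHz mode.
print(find_mode(MODES, "en-US", 8000))
```

Returning None when nothing matches leaves it to the application to fall back to another mode or to report that the language is unsupported.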

Speech recognition can be as simple as a predefined set of voice commands that an application can recognize. More complex speech recognition involves the use of a grammar, which defines a set of words and phrases that can be recognized. A grammar may use rules to predict the most likely words to follow the word just spoken, or it may define a context that identifies the subject of dictation and the expected style of language.
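
To make the idea of a rule-based grammar concrete, the following Python sketch defines a tiny command grammar and checks whether a phrase belongs to it. The rule names, the GRAMMAR table, and the match and is_recognized functions are assumptions for illustration; real engines define their own grammar formats and do the matching internally against audio, not text.

```python
# Hypothetical grammar: names in angle brackets are rules,
# everything else is a literal word.
GRAMMAR = {
    "<command>": [["<action>", "the", "<object>"], ["<action>", "<object>"]],
    "<action>": [["open"], ["close"], ["save"]],
    "<object>": [["file"], ["window"], ["document"]],
}


def match(symbol, words, start, grammar):
    """Return every position where `symbol` can end if it begins at `start`."""
    if symbol not in grammar:
        # Literal word: it must equal the next word in the phrase.
        if start < len(words) and words[start] == symbol:
            return {start + 1}
        return set()
    ends = set()
    for alternative in grammar[symbol]:
        positions = {start}
        for part in alternative:
            positions = {end
                         for pos in positions
                         for end in match(part, words, pos, grammar)}
            if not positions:
                break  # this alternative cannot match
        ends |= positions
    return ends


def is_recognized(phrase, grammar=GRAMMAR, root="<command>"):
    """True if the whole phrase can be derived from the root rule."""
    words = phrase.lower().split()
    return len(words) in match(root, words, 0, grammar)


print(is_recognized("open the file"))
print(is_recognized("delete file"))
```

A flat list of voice commands is the degenerate case of such a grammar: a single rule whose alternatives are the complete command phrases.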

Both speech-recognition and text-to-speech engines may make use of a pronunciation lexicon, which is a database of correct pronunciations for words and phrases to be recognized or played.
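
A pronunciation lexicon can be pictured as a lookup table from words or phrases to phoneme sequences. The Python sketch below is a minimal in-memory version; the PronunciationLexicon class, its methods, and the phoneme symbols are hypothetical and are not taken from any real engine.

```python
class PronunciationLexicon:
    """Maps words or phrases to one or more phoneme sequences."""

    def __init__(self):
        self._entries = {}

    @staticmethod
    def _key(text):
        # Normalize case and whitespace so lookups are forgiving.
        return " ".join(text.lower().split())

    def add(self, text, phonemes):
        """Register one pronunciation; a word may have several."""
        self._entries.setdefault(self._key(text), []).append(tuple(phonemes))

    def pronunciations(self, text):
        """Return all known pronunciations, or an empty list."""
        return list(self._entries.get(self._key(text), []))


lexicon = PronunciationLexicon()
lexicon.add("tomato", ["T", "AH", "M", "EY", "T", "OW"])
lexicon.add("tomato", ["T", "AH", "M", "AA", "T", "OW"])
lexicon.add("SQL", ["S", "IY", "K", "W", "AH", "L"])

print(lexicon.pronunciations("Tomato"))
```

Storing a list per entry lets a recognizer accept any of the variants, while a text-to-speech engine could simply use the first one.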

An engine's approach to recognizing speech or playing text determines the quality of speech in an application (that is, the accuracy of recognition or the clarity of playback) and the amount of effort required from the user to get good accuracy or clarity. The engine's approach also affects the processor speed and memory required by an application, and it may influence the application's features or the design of its user interface.

This section provides an overview of speech technology. An understanding of both speech recognition and text-to-speech will help you decide how to best incorporate speech in your application and how to choose a technology that supports what you want to do.