When an application uses the low-level speech-recognition interfaces, it talks directly to the engine. This gives the application much more control but also involves more work. Because the low-level API is more complex, this article won't go into detail about how the engine object is used. However, an architectural overview will give you an idea of the processes involved.
The low-level API consists of many more objects than the high-level interface (voice commands). Here's how the process works:
The application determines where the audio for speech recognition should come from and creates an audio-source object through which the engine acquires the data. Microsoft supplies an audio-source object that gets its audio from the multimedia wave-in device, but an application can use a customized audio source, such as one that acquires audio from a .WAV file or from a specialized hardware device.
The application, through a speech-recognition enumerator object (not shown here, but provided by Microsoft), locates a speech-recognition engine that it wants to use. It then creates an instance of the engine object and passes it the audio-source object.
The engine object negotiates with the audio-source object to find a common format for the digital audio data, such as pulse code modulation (PCM). Once a suitable format is established, the engine creates an audio-source notification sink and passes it to the audio-source object. From then on, the audio-source object submits digital audio data to the engine through that notification sink.
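To make these first steps concrete, here is a rough C++ sketch of how an application might create the wave-in audio source, enumerate the installed engines, and instantiate one. The class, interface, and structure names used (CLSID_MMAudioSource, CLSID_SREnumerator, ISREnumW, ISRCentral, SRMODEINFOW) and the method signatures are assumptions based on the low-level speech-recognition headers, not verified declarations; check the SDK's own header before relying on them.

```cpp
// Sketch only: all SAPI names and signatures below are assumptions; the
// actual declarations live in the SDK header (assumed here to be speech.h).
#include <windows.h>
#include <speech.h>

HRESULT CreateEngine(ISRCentral **ppCentral)
{
    IUnknown    *pAudio     = NULL;  // audio-source object (wave-in device)
    ISREnumW    *pEnum      = NULL;  // speech-recognition enumerator
    LPUNKNOWN    pEngineUnk = NULL;  // engine object, before QueryInterface
    SRMODEINFOW  mode;               // describes one engine mode
    DWORD        dwFound    = 0;
    HRESULT      hr;

    // 1. Create the multimedia wave-in audio source that Microsoft supplies.
    hr = CoCreateInstance(CLSID_MMAudioSource, NULL, CLSCTX_ALL,
                          IID_IUnknown, (void **)&pAudio);
    if (FAILED(hr))
        return hr;

    // 2. Create the speech-recognition enumerator and take the first
    //    engine mode it reports.
    hr = CoCreateInstance(CLSID_SREnumerator, NULL, CLSCTX_ALL,
                          IID_ISREnumW, (void **)&pEnum);
    if (SUCCEEDED(hr))
        hr = pEnum->Next(1, &mode, &dwFound);

    // 3. Instantiate that engine, handing it the audio-source object.  The
    //    engine then negotiates a wave format (such as PCM) with the source
    //    and registers its own audio-source notification sink internally.
    if (SUCCEEDED(hr) && dwFound > 0)
        hr = pEnum->Select(mode.gModeID, &pEngineUnk, pAudio);

    if (SUCCEEDED(hr) && pEngineUnk != NULL)
    {
        hr = pEngineUnk->QueryInterface(IID_ISRCentral, (void **)ppCentral);
        pEngineUnk->Release();
    }
    if (pEnum)
        pEnum->Release();
    if (pAudio)
        pAudio->Release();
    return hr;
}
```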
The application can then register a main notification sink that receives grammar-independent notifications, such as whether or not the user is speaking.
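The main notification sink is simply a COM object that the application implements and hands to the engine. The sketch below assumes the sink interface is called ISRNotifySink and is registered through an ISRCentral::Register call; those names and the method signatures shown are assumptions to be checked against the actual headers.

```cpp
// Sketch only: ISRNotifySink, its methods, and ISRCentral::Register are
// assumed names; verify them against the SDK headers.
class CMainNotify : public ISRNotifySink
{
public:
    // IUnknown (reference counting is trivialized because this sketch uses
    // a single static instance).
    STDMETHOD(QueryInterface)(REFIID riid, void **ppv)
    {
        if (riid == IID_IUnknown || riid == IID_ISRNotifySink)
        {
            *ppv = this;
            return S_OK;
        }
        *ppv = NULL;
        return E_NOINTERFACE;
    }
    STDMETHOD_(ULONG, AddRef)()  { return 2; }
    STDMETHOD_(ULONG, Release)() { return 1; }

    // Grammar-independent notifications from the engine.
    STDMETHOD(UtteranceBegin)(QWORD qTime)
    {
        // The user has started speaking.
        return S_OK;
    }
    STDMETHOD(UtteranceEnd)(QWORD qTimeBegin, QWORD qTimeEnd)
    {
        // The user has stopped speaking.
        return S_OK;
    }
    // The remaining ISRNotifySink methods (sound, interference, VU-meter,
    // attribute changes, and so on) would be stubbed out the same way.
};

// Registering the sink once the engine object exists:
//   static CMainNotify g_mainNotify;
//   DWORD dwRegKey;
//   pCentral->Register(&g_mainNotify, IID_ISRNotifySink, &dwRegKey);
```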
When it is ready, the application creates one or more grammar objects. These are similar to the voice-menu object in voice commands but are more flexible in the syntax they can recognize.
To find out what words the user spoke, the application creates a grammar-notification sink for every grammar object. When the grammar object recognizes a word or phrase, or has other grammar-specific information for the application, it calls functions in its grammar-notification sink. The application responds to the notifications and takes whatever actions are necessary.
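The grammar-notification sink follows the same pattern as the main notification sink. In the sketch below, the interface name ISRGramNotifySink, the PhraseFinish callback, the SRPHRASE structure, and the ISRCentral::GrammarLoad call are all assumptions about the low-level headers, shown only to illustrate how a recognition flows back to the application.

```cpp
// Sketch only: the names below (ISRGramNotifySink, PhraseFinish, SRPHRASE,
// GrammarLoad, SRGRMFMT_CFG) are assumptions; check the SDK declarations.
class CGramNotify : public ISRGramNotifySink
{
public:
    // IUnknown methods would follow the same pattern as in the main
    // notification sink above; they are omitted here.

    // Called when the grammar object has recognized a complete phrase.
    STDMETHOD(PhraseFinish)(DWORD dwFlags, QWORD qTimeBegin, QWORD qTimeEnd,
                            PSRPHRASE pPhrase, LPUNKNOWN pResultUnk)
    {
        // pPhrase holds the recognized words; pResultUnk is the results
        // object discussed below.  React to the recognition here.
        return S_OK;
    }
    // PhraseStart, PhraseHypothesis, BookMark, and the other notification
    // methods would be stubbed out as well.
};

// Loading a grammar and associating the sink with it (dGrammar would hold
// the grammar data, and SRGRMFMT_CFG names a context-free-grammar format):
//   static CGramNotify g_gramNotify;
//   LPUNKNOWN pGramUnk;
//   pCentral->GrammarLoad(SRGRMFMT_CFG, dGrammar, &g_gramNotify,
//                         IID_ISRGramNotifySink, &pGramUnk);
```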
Typically, when a grammar object recognizes speech, it sends the grammar-notification sink a string indicating what was spoken. However, the engine may have much more information than this, such as alternative phrases that may have been spoken, timing information, or even who spoke the phrase. An application can get at this information by requesting a results object for the phrase and interrogating that object.
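Continuing the assumptions above, the results object arrives as the IUnknown pointer passed to PhraseFinish, and the application queries it for whichever results interfaces the engine supports. The interface name ISRResBasic and its PhraseGet method in this sketch are assumptions for illustration; interfaces for audio, speaker, and alternative-phrase data would be queried for in the same way.

```cpp
// Sketch only: ISRResBasic and PhraseGet are assumed names for the basic
// results interface and its phrase-retrieval method.
void InspectResult(LPUNKNOWN pResultUnk)
{
    ISRResBasic *pRes = NULL;

    if (pResultUnk != NULL &&
        SUCCEEDED(pResultUnk->QueryInterface(IID_ISRResBasic, (void **)&pRes)))
    {
        BYTE  abPhrase[1024];   // receives an SRPHRASE structure
        DWORD dwNeeded = 0;

        // Ask the results object for the recognized phrase.  Timing, speaker
        // identity, and alternative phrases come from sibling results
        // interfaces obtained through further QueryInterface calls.
        pRes->PhraseGet(0, (PSRPHRASE)abPhrase, sizeof(abPhrase), &dwNeeded);
        pRes->Release();
    }
}
```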
For more information, see the "Low-Level Speech Recognition API" section.