A speech-recognition engine has a set of attributes that affect the interaction between the engine and the audio source. An application can query and set these attributes by using the ISRAttributes interface provided by the engine object. To get the address of ISRAttributes, call the ISRCentral::QueryInterface member function with the IID_ISRAttributes interface identifier.
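The call follows the standard COM QueryInterface pattern. A minimal sketch appears below; it assumes that pSRCentral already holds a valid ISRCentral pointer for the engine, and it abbreviates error handling.

    #include <windows.h>
    #include <speech.h>     // speech-recognition interfaces and IIDs

    // pSRCentral is assumed to be a valid ISRCentral* for an engine
    // that was created and selected elsewhere.
    ISRAttributes *pSRAttributes = NULL;
    HRESULT hr = pSRCentral->QueryInterface(IID_ISRAttributes,
                                            (void **)&pSRAttributes);
    if (SUCCEEDED(hr))
    {
        // ... query and set attributes through pSRAttributes ...
        pSRAttributes->Release();   // release the interface when done
    }

The sketches in the rest of this section reuse the pSRAttributes pointer obtained here.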
Many speech-recognition engines can automatically adjust the gain of the incoming audio signal (if the audio device supports it). The gain is the increase in signaling power, measured in decibels (dB), that occurs as a signal is boosted by an electronic device.
An application can use the ISRAttributes::AutoGainEnableSet member function to set how aggressively the engine adjusts the signaling power. When calling AutoGainEnableSet, the application specifies a value from 0 to 100. A value of 0 disables automatic gain, and a value of 100 causes the engine to set the gain to the value for the previous utterance so that if the next utterance is spoken at the same level, the gain is set perfectly. A value between 0 and 100 moderates the automatic adjustments on a linear scale. For example, a value of 50 adjusts the gain to 50 percent of the level for the previous utterance.
When automatic gain is enabled, the speech-recognition engine may increase or decrease the gain at the end of an utterance. The engine adjusts the gain by using the IAudio::LevelGet and IAudio::LevelSet member functions to communicate with the audio-source object that represents the incoming audio stream. If the audio source does not support the IAudio::LevelSet member function, the engine cannot adjust the gain automatically.
An application retrieves the current automatic gain value by using the ISRAttributes::AutoGainEnableGet member function.
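For example, the following sketch enables moderate automatic gain and then reads the value back. It assumes that both member functions traffic in a DWORD percentage; error handling is abbreviated.

    // Adjust the gain to 50 percent of the previous utterance's level.
    // (The DWORD parameter type is an assumption.)
    DWORD dwGain = 50;
    HRESULT hr = pSRAttributes->AutoGainEnableSet(dwGain);

    // Read the current automatic-gain value back.
    if (SUCCEEDED(hr))
        hr = pSRAttributes->AutoGainEnableGet(&dwGain);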
Echo canceling is a method of controlling echoes on an audio signal: the sender checks the inbound channel for a slightly delayed duplicate of its own transmission and, on finding one, adds an appropriately modified, reversed version of that transmission. The result is to erase the echo electronically but leave incoming data intact.
An application can use the ISRAttributes::EchoSet member function to indicate whether the engine should treat the incoming audio signal as though echo canceling has been applied to it. In general, if an application indicates that the audio source uses echo canceling, the speech-recognition engine ignores low-level audio signals so that it does not attempt to recognize the residual signal left by the echo cancellation.
An application can use the ISRAttributes::EchoGet member function to determine the current state of echo canceling.
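A sketch of both calls follows, under the same assumptions as before (a valid pSRAttributes pointer, and a BOOL parameter type):

    // Tell the engine that the audio source applies echo canceling.
    // (The BOOL parameter type is an assumption.)
    HRESULT hr = pSRAttributes->EchoSet(TRUE);

    // Query the current echo-canceling state.
    BOOL fEchoCanceling = FALSE;
    if (SUCCEEDED(hr))
        hr = pSRAttributes->EchoGet(&fEchoCanceling);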
The energy floor (or noise floor) is the noise value in the signal-to-noise ratio (SNR) for a particular environment. In general, the higher the noise value in the ratio, the more sensitive the speech-recognition engine is to background noise. For example, a quiet office would have a high SNR (low floor), whereas a noisy factory would have a low SNR (high floor).
If an application has information about the SNR that it will receive from the incoming audio stream, it can inform the speech-recognition engine by using the ISRAttributes::EnergyFloorSet member function. The engine can use the value internally to adjust its calculated noise floor in expectation of the audio stream's SNR. The energy floor is specified in negative decibels. For example, a value of 30 indicates an energy floor of –30 dB.
An application uses the ISRAttributes::EnergyFloorGet member function to retrieve the current noise floor expected by the engine.
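The following sketch tells the engine to expect an energy floor of –30 dB and then reads the value back; the DWORD parameter type is an assumption.

    // Expect an energy floor of -30 dB (the value is specified in
    // negative decibels, so 30 means -30 dB).
    HRESULT hr = pSRAttributes->EnergyFloorSet(30);

    // Retrieve the noise floor the engine currently expects.
    DWORD dwFloor = 0;
    if (SUCCEEDED(hr))
        hr = pSRAttributes->EnergyFloorGet(&dwFloor);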
If an audio source uses a microphone as the audio input device, the application can use the ISRAttributes::MicrophoneSet member function to set the name of the microphone for that source. An application can use the microphone name to save or retrieve information about the microphone, such as the type of microphone used to train the engine and the conditions in which the microphone is recording. For example, an application could use this information to preserve the original training when the user changes microphones.
An application uses the ISRAttributes::MicrophoneGet member function to retrieve the microphone name for an audio source.
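A sketch of both calls follows. The exact string and buffer parameters are assumptions: MicrophoneSet is assumed to take the new name, and MicrophoneGet is assumed to copy the name into a caller-supplied buffer and report the length required.

    // Record the name of the microphone attached to this audio source.
    // (The exact signature is an assumption.)
    HRESULT hr = pSRAttributes->MicrophoneSet(TEXT("Desktop headset"));

    // Retrieve the microphone name into a caller-supplied buffer.
    // (Whether the size is in bytes or characters is an assumption.)
    TCHAR szMicrophone[64];
    DWORD dwNeeded = 0;
    if (SUCCEEDED(hr))
        hr = pSRAttributes->MicrophoneGet(szMicrophone,
                                          sizeof(szMicrophone), &dwNeeded);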
The real-time setting is the percentage of processor time that the engine developer expects the engine to use during constant speech. For example, if the real-time setting is 100, the engine takes one full minute of processor time to process one minute of speech. If the real-time setting is 50, the engine takes 30 seconds of processor time to process the same minute of speech. This value is difficult to compute precisely, so it should be regarded as an estimate.
The real-time setting can be more than 100 for non-real-time applications (for example, applications that transcribe prerecorded speech). For most engines, the amount of processor time required diminishes markedly during periods of silence.
If an application changes the real-time setting for an engine by calling the ISRAttributes::RealTimeSet member function, the engine attempts to meet the new real-time expectation. However, a real-time setting of 1 is not possible for today's personal computers.
An application retrieves the current real-time setting for an engine by using the ISRAttributes::RealTimeGet member function.
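For example, an application that should stay responsive while recognition runs in the background might ask the engine to use no more than half of the processor. In the sketch below, the DWORD percentage parameter is an assumption.

    // Ask the engine to use at most 50 percent of the processor
    // during constant speech. (The DWORD parameter is an assumption.)
    HRESULT hr = pSRAttributes->RealTimeSet(50);

    // Retrieve the real-time setting currently in effect.
    DWORD dwRealTime = 0;
    if (SUCCEEDED(hr))
        hr = pSRAttributes->RealTimeGet(&dwRealTime);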
One way an application can improve the recognition accuracy of an engine is to train the engine to take into account the unique qualities of the user's voice. By using the ISRDialogs interface of the engine object, an application can direct the engine to display its training dialog box.
Typically, an engine's training dialog box displays a sequence of words and phrases that the user must speak into the audio input device. The engine processes the user's spoken input and saves information that helps the engine improve its recognition accuracy for that user. The engine saves the user's name (that is, the speaker's name) along with the user's training information, and it uses the name to load the information whenever that user becomes the speaker for an audio source.
An application changes the speaker name for an audio source by using the ISRAttributes::SpeakerSet member function. Changing the speaker name unloads all training for the previous speaker and loads the training for the new speaker. If no training exists for the new speaker, the engine starts with its default training. The ISRAttributes::SpeakerGet member function retrieves the name of the current speaker for an audio source.
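A sketch of a speaker change follows; as with the microphone functions, the exact string and buffer parameters are assumptions.

    // Load the training data for a different speaker.
    // (The exact signature is an assumption.)
    HRESULT hr = pSRAttributes->SpeakerSet(TEXT("Fred"));

    // Retrieve the name of the current speaker.
    TCHAR szSpeaker[64];
    DWORD dwNeeded = 0;
    if (SUCCEEDED(hr))
        hr = pSRAttributes->SpeakerGet(szSpeaker, sizeof(szSpeaker),
                                       &dwNeeded);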
An application sets and queries the threshold level of an audio source by using the ISRAttributes::ThresholdSet and ThresholdGet member functions. The threshold level is a value that indicates the point below which the engine rejects an utterance as unrecognized. A threshold level of 0 indicates that the engine should match any utterance to the closest phrase match. A value of 100 indicates that the engine should determine absolutely that an utterance is the recognized phrase or else reject it.
For example, suppose the engine is expecting "What is the time?" If the threshold is 100 and the user mumbles "What'z tha time" or has a cold, the command may not be recognized. However, if the threshold is too low and the user says a similar-sounding phrase that is not being listened for (such as "What is mine?"), the engine may recognize it as "What is the time?"
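An application might therefore choose an intermediate value. The following sketch sets a fairly strict threshold and reads it back; the DWORD parameter type is an assumption.

    // Require a fairly confident match before accepting an utterance.
    // (The DWORD parameter type is an assumption.)
    HRESULT hr = pSRAttributes->ThresholdSet(75);

    // Retrieve the current rejection threshold.
    DWORD dwThreshold = 0;
    if (SUCCEEDED(hr))
        hr = pSRAttributes->ThresholdGet(&dwThreshold);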
The engine object indicates whether an utterance is above or below the threshold level when it calls the ISRGramNotifySink interface to inform the application of the utterance. For more information about the ISRGramNotifySink interface, see "Grammar-Specific Notifications" later in this section.
A speech-recognition engine has two time-out values associated with it. The first, called the incomplete-phrase value, is the number of milliseconds that the speech-recognition engine waits before discarding an incomplete phrase because the user has stopped speaking. For example, if the incomplete-phrase value is 2000, a user speaking "Send mail to " could pause for two seconds before the engine would assume that the user has stopped speaking the phrase.
The second time-out value, called the complete-phrase value, is the number of milliseconds that the engine waits before regarding a phrase as complete after the user has stopped speaking. For example, if the complete-phrase value is 500, a user speaking "Send mail to Fred" would see results one-half second after he or she finished the phrase.
An application sets and queries the time-out values by using the ISRAttributes::TimeOutSet and TimeOutGet member functions.
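Using the example values above, a sketch might set a two-second incomplete-phrase time-out and a half-second complete-phrase time-out. Passing both values to a single call, and the DWORD millisecond parameters, are assumptions about the signature.

    // Wait up to 2000 ms during a pause mid-phrase and 500 ms after
    // the user stops speaking. (The paired-parameter signature and
    // DWORD millisecond types are assumptions.)
    HRESULT hr = pSRAttributes->TimeOutSet(2000, 500);

    // Retrieve the time-out values currently in effect.
    DWORD dwIncomplete = 0, dwComplete = 0;
    if (SUCCEEDED(hr))
        hr = pSRAttributes->TimeOutGet(&dwIncomplete, &dwComplete);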