Telephones don't have monitors, keyboards, or mice. Although this is not a limitation of the speech recognition and text-to-speech technologies themselves, it forces application designers to rely on them heavily, even for purposes where they don't perform as well as monitors, keyboards, and mice. The lack of other input and output devices significantly changes the user interface. Application designers should be aware of the following effects:
For many uses, speech recognition is not as good an input device as the keyboard and mouse. A user accessing the application over the phone is forced to use speech recognition exclusively, so the user interface must be designed with special care so that speech recognition's weaknesses don't preclude the use of the application.
Applications that have a GUI continually provide visual cues about the application's state. They have title bars, text displays, and various buttons and other controls that give users a clue about what they can do. Users accessing the application over the telephone do not get this information or, if they do, it is delivered to them at the much slower pace of speech. Users of a telephony application often forget the application's state, so it is helpful to remind them occasionally.
Speech is slower at communicating information than video, and it does not easily allow users to select which information they want detailed. This means that telephony applications need to give users trimmed-down slices of information and allow the user to specify which of the pieces he/she wants more information about. For example, an E-mail application designed for a GUI will display a list of hundreds of messages. A telephony E-mail application cannot read out the titles of hundreds of messages. It must provide a user interface that allows the user to focus in on the messages he/she wishes to hear. The E-mail application might first ask the user if he/she wants to hear new messages or ones that were already read. From there it could organize messages by priority, etc.
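The E-mail example above can be sketched roughly as follows. This is only an illustration: the `Message` class, the `summary_prompt` and `select_messages` functions, the spoken wording, and the limit of three messages are all assumptions made for the sketch, not part of any particular speech API.

```python
from dataclasses import dataclass


@dataclass
class Message:
    # Hypothetical message record used only for this sketch.
    sender: str
    subject: str
    is_new: bool
    high_priority: bool = False


def summary_prompt(messages):
    """Build a short spoken summary instead of reading every title."""
    new_count = sum(1 for m in messages if m.is_new)
    old_count = len(messages) - new_count
    return (f"New messages: {new_count}. Already read: {old_count}. "
            "Say 'new' or 'read'.")


def select_messages(messages, choice, limit=3):
    """Return at most `limit` messages from the chosen group,
    with high-priority messages first."""
    want_new = (choice == "new")
    chosen = [m for m in messages if m.is_new == want_new]
    # Stable sort: high-priority messages move to the front,
    # the original order is kept within each priority level.
    chosen.sort(key=lambda m: not m.high_priority)
    return chosen[:limit]
```

The application would speak the summary first, then read only the small slice returned by `select_messages`, letting the user ask for the next slice if needed.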
In any particular state a user might have hundreds of options. A GUI can visually display all of the options on the screen, but a telephony application cannot read them all out. When users enter a new state, telephony applications should read out an abridged list of options and allow users to ask for more options or for more detailed information about an option.
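One possible way to produce such an abridged list is to read the options a few at a time. The sketch below assumes a hypothetical `options_prompt` function, a page size of three, and the phrase "more options" as the command for hearing the rest; none of these come from a specific speech toolkit.

```python
def options_prompt(options, start=0, page_size=3):
    """Build a spoken prompt listing only a few options at a time.

    `options` is the full list of commands valid in the current state,
    `start` is the index of the first option to read out.
    """
    page = options[start:start + page_size]
    prompt = "You can say: " + ", ".join(page) + "."
    # Offer the rest only if there actually are more options to hear.
    if start + page_size < len(options):
        prompt += " Say 'more options' to hear others."
    return prompt
```

When the user says "more options", the application would call the function again with `start` advanced by `page_size`.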
When a user types a number or word into a field in an application, he/she can see the result. Telephony applications cannot display the result, so they must provide audio feedback to indicate that they heard the correct information. Because speech recognition often makes mistakes, telephony applications must also provide an easy mechanism for users to correct the mistakes.
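A minimal sketch of this confirm-and-correct loop is shown below. The `listen` and `speak` callables stand in for whatever speech recognition and text-to-speech calls the application actually uses; they, the function name, the spoken wording, and the limit of three attempts are assumptions for illustration only.

```python
def get_confirmed_value(listen, speak, question, max_attempts=3):
    """Ask a question, echo back what was heard, and let the user correct it.

    `listen` is a hypothetical callable returning the recognized text;
    `speak` is a hypothetical callable that plays text to the caller.
    Returns the confirmed value, or None if no answer was confirmed.
    """
    for _ in range(max_attempts):
        speak(question)
        heard = listen()
        # Audio feedback replaces the visual echo a GUI field would give.
        speak(f"I heard {heard}. Is that correct?")
        if listen().strip().lower() in ("yes", "correct"):
            return heard
        # Easy correction: simply ask again.
        speak("Sorry, let's try again.")
    return None
```

Returning `None` after a few failed attempts lets the caller fall back to another strategy instead of looping indefinitely.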
Because it is not always obvious when the computer has stopped talking or whether the computer has heard the user, the application should give audio feedback. The most effective audio feedback seems to be short beeps. Play one short beep at the end of a question to indicate that the user is expected to speak, another one when a response is recognized, and a third when speech is heard but unrecognized. This not only reassures the user that he/she is being heard, but it also hints to users that they're talking to a computer, since real people don't beep.
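The three beeps described above might be wired into a question-and-answer step along these lines. The `Cue` names, the `ask` function, and the `recognize` and `play` callables are hypothetical placeholders for the application's real audio and recognition calls; the sketch also assumes the recognizer returns `None` when speech was heard but not understood.

```python
from enum import Enum


class Cue(Enum):
    # Hypothetical names for the three short beeps.
    LISTENING = "beep: your turn to speak"
    RECOGNIZED = "beep: response recognized"
    UNRECOGNIZED = "beep: speech heard but not recognized"


def ask(question, recognize, play):
    """Speak a question and bracket the user's reply with audio cues.

    `play` is a hypothetical callable that outputs a prompt or a cue;
    `recognize` is a hypothetical callable returning recognized text,
    or None when speech was heard but could not be recognized.
    """
    play(question)
    play(Cue.LISTENING)       # end of question: user is expected to speak
    result = recognize()
    if result is not None:
        play(Cue.RECOGNIZED)  # reassure the user that he/she was heard
    else:
        play(Cue.UNRECOGNIZED)
    return result
```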