This chapter covers a handful of issues that must be addressed when designing and installing SR/TTS applications, including hardware requirements, and the state of current SR/TTS technology and its limits. The chapter also includes some tips for designing your SR/TTS applications.
SR/TTS applications can be resource hogs. The section on hardware shows you the minimal, recommended, and preferred processor and RAM requirements for the most common SR/TTS applications. Of course, speech applications also need special hardware, including audio cards, microphones, and speakers. In this chapter, you'll find a general list of compatible devices, along with tips on what other options you have and how to use them.
You'll also learn about the general state of SR/TTS technology and its limits. This will help you design applications that do not place unrealistic demands on the software or raise users' expectations beyond the capabilities of your application.
Finally, this chapter contains a set of tips and suggestions for designing and implementing SR/TTS services. You'll learn how to design SR and TTS interfaces that reduce the chance of engine errors, and increase the usability of your programs.
When you complete this chapter, you'll know just what hardware is needed for speech systems and how to design programs that can successfully implement SR/TTS services that really work.
Speech systems can be resource intensive. It is especially important that SR engines have enough RAM and disk space to respond quickly to user requests. Failure to respond quickly results in additional commands spoken into the system. This has the effect of creating a spiraling degradation in performance. The worse things get, the worse things get. It doesn't take too much of this before users decide your software is more trouble than it's worth!
Text-to-speech engines can also tax the system. While TTS engines do not always require a great deal of memory to operate, insufficient processor speed can result in halting or unintelligible playback of text.
For these reasons, it is important to establish clear hardware and software requirements when designing and implementing your speech-aware and speech-enabled applications. Not all PCs will have the memory, disk space, and hardware needed to properly implement SR and TTS services. There are three general categories of workstation resources that should be reviewed:

- Processor and RAM requirements
- Software requirements, including the operating system and the SR/TTS engines
- Additional hardware, including the sound card, microphone, and speakers or headphones
The following three sections provide some general guidelines to follow when establishing minimal resource requirements for your applications.
Speech systems can tax processor and RAM resources. SR services require varying levels of resources depending on the type of SR engine installed and the level of services implemented. TTS engine requirements are rather stable, but also depend on the TTS engine installed.
The SR and TTS engines currently available for SAPI systems usually can be implemented successfully using as little as a 486/33 processor and an additional 1MB of RAM. However, overall PC performance with this configuration is quite poor, and it is not recommended. A better starting point is a Pentium processor (P60 or better) with at least 16MB of total RAM. Systems that will support dictation SR services require the most computational power. It is not unreasonable to expect such a workstation to need 32MB of RAM and a P100 or faster processor. Obviously, the more resources, the better the performance.
In general, SR systems that implement command and control services will only need an additional 1MB of RAM (not counting the application's RAM requirement). Dictation services should get at least another 8MB of RAM, preferably more. The type of speech sampling, the analysis method, and the size of the recognition vocabulary all affect the minimal resource requirements. Table 16.1 shows published minimal processor and RAM requirements of speech recognition services.
Table 16.1  Minimal Processor and RAM Requirements for Speech-Recognition Services

Level of Speech-Recognition Service | Processor | RAM
Discrete, speaker-dependent, whole word, small vocabulary | |
Discrete, speaker-independent, whole word, small vocabulary | |
Continuous, speaker-independent, sub-word, small vocabulary | |
Discrete, speaker-dependent, whole word, large vocabulary | Pentium |
Continuous, speaker-independent, sub-word, large vocabulary | RISC processor |
These memory requirements are in addition to the requirements of the operating system and any loaded applications. For Windows 95, the minimum should be 12MB of RAM, with 16MB recommended and 24MB preferred. For Windows NT, the minimum should be 16MB, with 24MB recommended and 32MB preferred.
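To make these guidelines concrete, an application can check the installed physical RAM at startup and warn the user, or simply fall back to a non-speech interface, when the workstation is below the recommended level. The following is a minimal sketch using the standard Win32 GlobalMemoryStatus call; the 16MB threshold follows the recommendation above, and the warning text is only illustrative.

```cpp
#include <windows.h>
#include <stdio.h>

// Recommended physical RAM for command-and-control speech services,
// per the guidelines above.
static const DWORD kRecommendedBytes = 16UL * 1024UL * 1024UL;

int main(void)
{
    MEMORYSTATUS status;
    status.dwLength = sizeof(status);
    GlobalMemoryStatus(&status);            /* fills in total and available RAM */

    if (status.dwTotalPhys < kRecommendedBytes)
    {
        /* Below the recommended level: warn the user and let the
           application fall back to a keyboard/mouse-only interface. */
        MessageBoxA(NULL,
                    "This workstation has less than 16MB of RAM.\n"
                    "Speech services will be disabled to preserve performance.",
                    "Speech Services", MB_OK | MB_ICONINFORMATION);
        return 1;
    }
    printf("Sufficient RAM detected: %lu bytes\n",
           (unsigned long)status.dwTotalPhys);
    return 0;
}
```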
TTS engines do not place as much of a demand on workstation resources as SR engines. Usually TTS services require only a 486/33 processor and 1MB of additional RAM. TTS programs themselves are rather small, about 150K. However, the grammar and prosody rules can demand as much as another 1MB, depending on the complexity of the language being spoken. It is interesting to note that probably the most complex and demanding language for TTS processing is English. This is primarily due to the irregular spelling patterns of the language.
Most TTS engines use speech synthesis to produce the audio output. However, advanced systems can use diphone concatenation. Since diphone-based systems rely on a set of actual voice samples for reproducing written text, these systems can require an additional 1MB of RAM. To be safe, it is a good idea to suggest a requirement of 2MB of additional RAM, with a recommendation of 4MB for advanced TTS systems.
The general software requirements are rather simple. The Microsoft Speech API can be implemented only on 32-bit Windows operating systems. This means you'll need Windows 95 or Windows NT 3.5 or greater on the workstation.
Note: All the testing and programming examples covered in this book have been performed using Windows 95. It is assumed that Windows NT systems will not require any additional modifications.
The most important software requirements for implementing speech services are the SR and TTS engines. An SR/TTS engine is the back-end processing module in the SAPI model. Your application is the front end, and SPEECH.DLL acts as the broker between the two processes.
The new wave of multimedia PCs usually includes SR/TTS engines as part of the initial software package. For existing PCs, most sound cards now ship with SR/TTS engines.
Microsoft's Speech SDK does not include a set of SR/TTS engines. However, Microsoft does have an engine on the market. Their Microsoft Phone software system (available as part of modem/sound card packages) includes the Microsoft Voice SR/TTS engine. You can also purchase engines directly from third-party vendors.
Note: Refer to Appendix B, "SAPI Resources," for a list of vendors that support the Speech API. You can also check the CD-ROM that ships with this book for the most recent list of SAPI vendors. Finally, the Microsoft Speech SDK contains a list of SAPI engine providers in the ENGINE.DOC file.
Complete speech-capable workstations need three additional pieces of hardware:

- A compatible sound card
- A microphone for speech input
- Speakers or headphones for audio output
Just about any sound card can support SR/TTS engines. Any of the major vendors' cards are acceptable, including Sound Blaster and its compatibles, Media Vision, ESS technology, and others. Any card that is compatible with Microsoft's Windows Sound System is also acceptable.
Many vendors are now offering multifunction cards that provide speech, data, FAX, and telephony services all in one card. You can usually purchase one of these cards for about $250-$500. By installing one of these new cards, you can upgrade a workstation and reduce the number of hardware slots in use at the same time.
A few speech-recognition engines still need a DSP (digital signal processor) card. While it may be preferable to work with newer cards that do not require DSP handling, there are advantages to using DSP technology. DSP cards handle some of the computational work of interpreting speech input. This can actually reduce the resource requirements for providing SR services. In systems where speech is a vital source of process input, DSP cards can noticeably boost performance.
SR engines require the use of a microphone for audio input. This is usually handled by a directional microphone mounted on the PC base. Other options include a lavaliere microphone draped around the neck, or a headset microphone that includes headphones. Depending on the audio card installed, you may also be able to use a telephone handset for input.
Most multimedia systems ship with a suitable microphone built into the PC or as an external device that plugs into the sound card. It is also possible to purchase high-grade unidirectional microphones from audio retailers. Depending on the microphone and the sound card used, you may need an amplifier to boost the input to levels usable by the SR engine.
The quality of the audio input is one of the most important factors in the successful implementation of speech services on a PC. If the system will be used in a noisy environment, close-talk microphones should be used. This will reduce extraneous noise and improve the recognition capabilities of the SR engine.
Speakers or headphones are needed to play back TTS output. In private office spaces, free-standing speakers provide the best sound reproduction and the least risk of ear damage from high playback levels. However, in larger offices, or in areas where the playback can disturb others, headphones are preferred.
Tip: As mentioned earlier in this chapter, some systems can also provide audio playback through a telephone handset. Conversely, free-standing speakers and a microphone can be used successfully as a speakerphone system.
As advanced as SR/TTS technology is, it still has its limits. This section covers the general technology issues for SR and TTS engines along with a quick summary of some of the limits of the process and how this can affect perceived performance and system design.
Speech recognition technology can be measured by three factors:

- Word selection, or how the engine picks word items out of the input stream
- Speaker dependence, or how the engine deals with different speakers
- Word analysis, or how the engine matches word items against its vocabulary
Word selection deals with the process of actually perceiving "word items" as input. Any speech engine must have some method for listening to the input stream and deciding when a word item has been uttered. There are three different methods for selecting words from the input stream. They are:

- Discrete speech
- Word spotting
- Continuous speech
Discrete speech is the simplest form of word selection. Under discrete speech, the engine requires a slight pause between each word. This pause marks the beginning and end of each word item. Discrete speech requires the least amount of computational resources. However, discrete speech is not very natural for users. With a discrete speech system, users must speak in a halting voice. This may be adequate for short interactions with the speech system, but rather annoying for extended periods.
A much-preferred method of handling speech input is word spotting. Under word spotting, the speech engine listens for a list of key words along the input stream. This method allows users to use continuous speech. Since the system is "listening" for key words, users do not need to insert unnatural pauses while they speak. The advantage of word spotting is that it gives users the perception that the system is actually listening to every word while limiting the amount of resources required by the engine itself. The disadvantage of word spotting is that the system can easily misinterpret input. For example, if the engine is listening only for the word run, it will interpret the phrases "Run Excel" and "Run Access" as the same command. For this reason, it is important to design vocabularies for word-spotting systems that limit the possibility of confusion.
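The sketch below illustrates why a word-spotting vocabulary keyed only on a single word can confuse phrases. The keyword lists and the Spot routine are hypothetical, not part of any SAPI engine; the point is simply that spotting the whole phrase removes the collision.

```cpp
#include <cctype>
#include <cstdio>
#include <string>
#include <vector>

// Lower-case a copy of the recognized text so keyword matching is
// case-insensitive.
static std::string ToLower(const std::string& s)
{
    std::string out(s);
    for (size_t i = 0; i < out.size(); ++i)
        out[i] = (char)std::tolower((unsigned char)out[i]);
    return out;
}

// Hypothetical word-spotting pass: return the first keyword found in the
// utterance, or "(no match)".
static std::string Spot(const std::string& utterance,
                        const std::vector<std::string>& keywords)
{
    std::string text = ToLower(utterance);
    for (size_t i = 0; i < keywords.size(); ++i)
        if (text.find(keywords[i]) != std::string::npos)
            return keywords[i];
    return "(no match)";
}

int main()
{
    // Ambiguous vocabulary: both phrases collapse onto the keyword "run".
    std::vector<std::string> ambiguous;
    ambiguous.push_back("run");

    // Safer vocabulary: the keywords are whole phrases, so there is no collision.
    std::vector<std::string> distinct;
    distinct.push_back("run excel");
    distinct.push_back("run access");

    std::printf("ambiguous: %s / %s\n",
                Spot("Run Excel", ambiguous).c_str(),
                Spot("Run Access", ambiguous).c_str());
    std::printf("distinct:  %s / %s\n",
                Spot("Run Excel", distinct).c_str(),
                Spot("Run Access", distinct).c_str());
    return 0;
}
```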
The most advanced form of word selection is the continuous speech method. Under continuous speech, the SR engine attempts to recognize each word that is uttered in real time. This is the most resource-intensive of the word selection methods. For this reason, continuous speech is best reserved for dictation systems that require complete and accurate perception of every word.
The process of word selection can be affected by the speaker. Speaker dependence refers to the engine's ability to deal with different speakers. Systems can be speaker dependent, speaker independent, or speaker adaptive. The disadvantage of speaker-dependent systems is that they require extensive training by a single user before they become very accurate. This training can last as much as one hour before the system has an accuracy rate of over 90 percent. Another drawback to speaker-dependent systems is that each new user must re-train the system to reduce confusion and improve performance. However, speaker-dependent systems provide the greatest degree of accuracy while using the least amount of computing resources.
Speaker-adaptive systems are designed to perform adequately without training, but they improve with use. The advantage of speaker-adaptive systems is that users experience success without tedious training. Disadvantages include additional computing resource requirements and possible reduced performance on systems that must serve different people.
Speaker-independent systems provide a high degree of accuracy without any user training. Speaker-independent systems are a must for installations where multiple speakers need to use the same station. The drawback of speaker-independent systems is that they require the greatest degree of computing resources.
Once a word item has been selected, it must be analyzed. Word analysis techniques involve matching the word item to a list of known words in the engine's vocabulary. There are two methods for handling word analysis: whole-word matching and sub-word matching. Under whole-word matching, the SR engine matches the word item against a vocabulary of complete word templates. The advantage of this method is that the engine is able to make an accurate match very quickly, without the need for a great deal of computing power. The disadvantage of whole-word matching is that it requires extremely large vocabularies, often into the tens of thousands of entries. Also, these words must be stored as spoken templates. Each word can require as much as 512 bytes of storage.
An alternate word-matching method involves the use of sub-words called phonemes. Each language has a fixed set of phonemes that are used to build all words. By informing the SR engine of the phonemes and their representations, it is much easier to recognize a wider range of words. Under sub-word matching, the engine does not require an extensive vocabulary. An additional advantage of sub-word systems is that the pronunciation of a word can be determined from printed text. Phoneme storage requires only 5 to 20 bytes per phoneme. The disadvantage of sub-word matching is that it requires more processing resources to analyze input.
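A quick back-of-the-envelope calculation shows why the storage difference matters. Using the figures above (roughly 512 bytes per whole-word template versus 5 to 20 bytes per phoneme), a 20,000-word whole-word vocabulary needs on the order of 10MB, while the full phoneme set for English (around 40 to 45 phonemes) fits in well under 1KB. The vocabulary size and phoneme count in this sketch are illustrative.

```cpp
#include <cstdio>

int main()
{
    // Figures from the discussion above; the vocabulary size is illustrative.
    const long wordsInVocabulary    = 20000;  // whole-word templates
    const long bytesPerWordTemplate = 512;    // upper estimate per template
    const long phonemesInLanguage   = 45;     // approximate count for English
    const long bytesPerPhoneme      = 20;     // upper estimate per phoneme

    long wholeWordBytes = wordsInVocabulary * bytesPerWordTemplate;
    long phonemeBytes   = phonemesInLanguage * bytesPerPhoneme;

    std::printf("Whole-word vocabulary: %ld bytes (about %ld MB)\n",
                wholeWordBytes, wholeWordBytes / (1024L * 1024L));
    std::printf("Phoneme set:           %ld bytes\n", phonemeBytes);
    return 0;
}
```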
It is important to understand the limits of current SR technology and how these limits affect system performance. Three of the most vital limitations of current SR technology are:

- The inability to reliably tell when speech is directed at the computer or to separate multiple speakers
- The inability to process natural language or learn new words
- Recognition accuracy that depends heavily on pronunciation, dialect, microphone quality, and ambient noise
The first hurdle for SR engines is determining when the speaker is addressing the engine and when the words are directed to someone else in the room. This skill is beyond the SR systems currently on the market. Your program must allow users to explicitly inform the computer when they are addressing the engine. Also, SR engines cannot distinguish between multiple speakers. With speaker-independent systems, this is not a big problem. However, speaker-dependent systems cannot deal well with situations where multiple users may be addressing the same system.
Even speaker-independent systems can have a hard time when multiple speakers are involved. For example, a dictation system designed to transcribe a meeting will not be able to differentiate between speakers. Also, SR systems fail when two people are speaking at the same time.
SR engines also have limits regarding the processing of identified words. First, SR engines have no ability to process natural language. They can only recognize words in the existing vocabulary and process them based on known grammar rules. Thus, despite any perceived "friendliness" of speech-enabled systems, they do not really understand the speaker at all.
SR engines also are unable to hear a new word and derive its meaning from previously spoken words. The system is incapable of spelling or rendering words that are not already in its vocabulary.
Finally, SR engines are not able to deal with wide variations in pronunciation of the same word. For example, words such as either (ee-ther or I-ther) and potato (po-tay-toe or po-tah-toe) can easily confuse the system. Wide variations in pronunciation can greatly reduce the accuracy of SR systems.
Recognition accuracy can be affected by regional dialects, quality of the microphone, and the ambient noise level during a speech session. Much like the problem with pronunciation, dialect variations can hamper SR engine performance. If your software is implemented in a location where the common speech contains local slang or other region-specific words, these words may be misinterpreted or not recognized at all.
Poor microphones or noisy office spaces also affect accuracy. A system that works fine in a quiet, well-equipped office may be unusable in a noisy facility. In a noisy environment, the SR engine is more likely to confuse similar-sounding words such as out and pout, or in and when. For this reason it is important to emphasize the value of a good microphone and a quiet environment when performing SR activities.
TTS engines use two different techniques for turning text input into audio output: synthesis and diphone concatenation. Synthesis involves the creation of human speech through the use of stored phonemes. This method results in audio output that is understandable, but not very human-like. The advantages of synthesis systems are that they do not require a great deal of storage space to implement and that they allow for the modification of voice quality through the adjustment of only a few parameters.
Diphone-based systems produce output that is much closer to human speech. This is because the system stores actual human speech phoneme sets and plays them back. The disadvantage of this method is that it requires more computing and storage capacity. However, if your application is used to provide long sessions of audio output, diphone systems produce a speech quality much easier to understand.
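To see how diphone concatenation works at a conceptual level, the sketch below breaks a phoneme sequence into overlapping pairs, which is roughly the lookup key a diphone database uses. The phoneme spelling of "hello" and the pairing scheme are simplified illustrations, not the behavior of any particular TTS engine.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Build the list of overlapping phoneme pairs (diphones) for an utterance.
// Real engines also model silence at the start and end; that is omitted here.
static std::vector<std::string> ToDiphones(const std::vector<std::string>& phonemes)
{
    std::vector<std::string> diphones;
    for (size_t i = 0; i + 1 < phonemes.size(); ++i)
        diphones.push_back(phonemes[i] + "-" + phonemes[i + 1]);
    return diphones;
}

int main()
{
    // Simplified phoneme spelling of the word "hello".
    std::vector<std::string> phonemes;
    phonemes.push_back("h");
    phonemes.push_back("eh");
    phonemes.push_back("l");
    phonemes.push_back("ow");

    std::vector<std::string> diphones = ToDiphones(phonemes);
    for (size_t i = 0; i < diphones.size(); ++i)
        std::printf("lookup diphone: %s\n", diphones[i].c_str());
    // Each diphone would then be matched to a recorded sample in the
    // engine's database, and the samples concatenated into the final audio.
    return 0;
}
```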
TTS engines are limited in their ability to re-create the details of spoken language, including the rhythm, accent, and pitch inflection. This combination of properties is called the prosody of speech. TTS engines are not very good at adding prosody. For this reason, listening to TTS output can be difficult, especially for long periods of time. Most TTS engines allow users to edit text files with embedded control information that adds prosody to the ASCII text. This is useful for systems that are used to "read" text that is edited and stored for later retrieval.
TTS systems have their limits when it comes to producing individualized voices. Synthesis-based engines are relatively easy to modify to create new voice types. This modification involves the adjustment of general pitch and speed to produce new vocal personalities such as "old man," "child," "female," "male," and so on. However, these voices still use the same prosody and grammar rules.
Creating new voices for diphone-based systems is much more costly than for synthesis-based systems. Since each new vocal personality must be assembled from pre-recorded human speech, it can take quite a bit of time and effort to alter an existing voice set or to produce a new one. Diphone concatenation is costly for systems that must support multiple languages or need to provide flexibility in voice personalities.
There are a number of general issues to keep in mind when designing SR interfaces to your applications.
First, if you provide speech services within your application, you'll need to make sure you let the user know the services are available. This can be done by adding a graphic image to the display telling the user that the computer is "listening," or by adding caption or status items that indicate the current state of the SR engine.
It is also a good idea to make speech services an optional feature whenever possible. Some installations may not have the hardware or RAM required to implement speech services. Even if the workstation has adequate resources, the user may experience performance degradation with the speech services active. It is a good idea to have a menu option or some other method that allows users to turn off speech services entirely.
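One way to honor this is to gate every speech call behind a single flag that the user controls from a menu item. In the sketch below, EnableSpeechServices and DisableSpeechServices are hypothetical wrappers around whatever engine startup and shutdown calls your SAPI vendor provides; only the toggle pattern itself is the point.

```cpp
#include <cstdio>

// Hypothetical wrappers around the vendor-specific SR/TTS engine setup.
// EnableSpeechServices returns false if the engine cannot be started
// (missing hardware, insufficient RAM, and so on).
static bool EnableSpeechServices()  { std::printf("(engine started)\n");  return true; }
static void DisableSpeechServices() { std::printf("(engine released)\n"); }

static bool g_speechActive = false;

// Handler for a "Speech Services" menu item that the user can check or uncheck.
static void OnToggleSpeechMenu()
{
    if (g_speechActive)
    {
        DisableSpeechServices();
        g_speechActive = false;
    }
    else
    {
        g_speechActive = EnableSpeechServices();
        if (!g_speechActive)
            std::printf("Speech services unavailable on this workstation.\n");
    }
}

int main()
{
    OnToggleSpeechMenu();   // user turns speech on
    OnToggleSpeechMenu();   // user turns speech off again
    return 0;
}
```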
When you add speech services to your programs, it is important to give users realistic expectations regarding the capabilities of the installation. This is best done through user documentation. You needn't go into great detail, but you should give users general information about the state of SR technology, and make sure they do not expect to carry on extensive conversations with their new "talking electronic pal."
Along with indications that speech services are active, it is a good idea to provide users with a single speech command that displays a list of recognized speech inputs, and some general online help regarding the use and capabilities of the SR services of your program. Since the total number of commands might be quite large, you may want to provide a type of voice-activated help system that allows users to query the current command set and then ask additional questions to learn more about the various speech commands they can use.
It is also a good idea to add confirmations to especially dangerous or ambiguous speech commands. For example, if you have a voice command for "Delete," you should ask the user to confirm this option before continuing. This is especially important if you have other commands that may sound similar. If you have both "Delete" and "Repeat" in the command list, you will want to make sure the system has correctly identified which command was requested.
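A confirmation like this can be as simple as a Yes/No message box raised when the recognized command is destructive. The OnVoiceCommand routine below is a hypothetical callback standing in for whatever notification your SR engine provides; the MessageBoxA call is standard Win32.

```cpp
#include <windows.h>
#include <string.h>
#include <stdio.h>

// Hypothetical callback invoked when the SR engine reports a recognized
// command phrase. Destructive commands are confirmed before they run.
static void OnVoiceCommand(const char* phrase)
{
    if (strcmp(phrase, "Delete") == 0)
    {
        int answer = MessageBoxA(NULL,
                                 "Delete the selected item?",
                                 "Confirm Voice Command",
                                 MB_YESNO | MB_ICONQUESTION);
        if (answer != IDYES)
        {
            printf("Delete canceled by user.\n");
            return;
        }
    }
    printf("Executing command: %s\n", phrase);
}

int main(void)
{
    OnVoiceCommand("Repeat");   /* runs immediately */
    OnVoiceCommand("Delete");   /* asks for confirmation first */
    return 0;
}
```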
In general, it is a good idea to display the status of all speech processing. If the system does not understand a command, it is important to tell users rather than making them sit idle while your program waits for understandable input. If the system cannot identify a command, display a message telling the user to repeat the command, or bring up a dialog box that lists likely possibilities from which the user can select the requested command.
In some situations, background noise can hamper the performance of the SR engine. It is advisable to allow users to turn off speech services and only turn them back on when they are needed. This can be handled through a single button press or menu selection. In this way, stray noise will not be misinterpreted as speech input.
There are a few things to avoid when adding voice commands to an application. SR systems are not very successful when processing long series of numbers or single letters. "M" and "N" sound quite alike, and long lists of digits can confuse most SR systems. Also, although SR systems are capable of handling requests such as "move mouse left," "move mouse right," and so on, this is not a good use of voice technology. Using voice commands to handle a pointer device is a bit like using the keyboard to play musical notes. It is possible, but not desirable.
The key to designing good command menus is to make sure they are complete and consistent, and that the commands within each set are unique. Good command menus also contain more than just the list of items displayed on the physical menu. It is a good idea to think of voice commands as you would keyboard shortcuts.
Useful voice command menus will provide access to all the common operations that might be performed by the user. For example, the standard menu might offer a top-level menu option of Help. Under the Help menu might be an About item to display the basic information about the loaded application. It makes sense to add a voice command that provides direct access to the About box with a Help About command.
These shortcut commands may span several menu levels or even stand independent of any existing menu. For example, in an application that is used to monitor the status of manufacturing operations within a plant, you might add a command such as Display Statistics that would gather data from several locations and present a graph onscreen.
When designing menus, be sure to include commands for all dialog boxes. It is not a good idea to provide voice commands for only some dialog boxes and not for others.
Tip: You do not have to create menu commands for Windows-supplied dialog boxes (the Common Dialogs, the Message Box, and so on). Windows automatically supplies voice commands for these dialogs.
Be sure to include voice commands for the list and combo boxes within a dialog box, as well as the command buttons, check boxes, and option buttons.
In addition to creating menus for all the dialog boxes of your applications, you should consider creating a "global" menu that is active as long as the application is running. This would allow users to execute common operations such as Get New Mail or Display Status Log without having to first bring the application into the foreground.
Tip: It is advisable to limit this use of speech services to a few vital and unique commands, since other applications that provide speech services may also activate global commands.
It is also important to include common alternate wordings for commonly used operations, such as Get New Mail and Check for New Mail, and so on. Although you may not be able to include all possible alternatives, adding a few will greatly improve the accessibility of your speech interface.
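A simple way to support alternate wordings is to map several phrases onto one internal command identifier, so the rest of the program never cares which phrase was spoken. The phrase table and command IDs below are illustrative.

```cpp
#include <cstdio>
#include <map>
#include <string>

// Internal command identifiers (illustrative).
enum Command { CMD_UNKNOWN = 0, CMD_GET_MAIL, CMD_SHOW_STATUS };

// Several spoken phrases map onto the same command, so users can say
// whichever wording comes naturally.
static std::map<std::string, Command> BuildPhraseTable()
{
    std::map<std::string, Command> table;
    table["get new mail"]       = CMD_GET_MAIL;
    table["check for new mail"] = CMD_GET_MAIL;
    table["display status log"] = CMD_SHOW_STATUS;
    table["show status"]        = CMD_SHOW_STATUS;
    return table;
}

int main()
{
    std::map<std::string, Command> phrases = BuildPhraseTable();

    const char* heard = "check for new mail";   // text reported by the SR engine
    std::map<std::string, Command>::iterator it = phrases.find(heard);
    Command cmd = (it == phrases.end()) ? CMD_UNKNOWN : it->second;

    std::printf("phrase \"%s\" -> command %d\n", heard, (int)cmd);
    return 0;
}
```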
Use consistent word order in your menu design. For example, for action commands you should use the verb-noun construct, as in Save File or Check E-Mail. For questions, use a consistent preface such as How do I or Help me, as in How do I check e-mail? or Help me change font. It is also important to be consistent with the use of singular and plural. In the above example, you must be sure to use Font or Fonts throughout the application.
Since the effectiveness of the SR engine is determined by its ability to identify your voice input against a list of valid words, you can increase the accuracy of the SR engine by keeping the command lists relatively short. When a command is spoken, the engine will scan the list of valid inputs in this state and select the most likely candidate. The more words on the list, the greater the chance the engine will select the wrong command. By limiting the list, you can increase the odds of a correct "hit."
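Keeping the active list short usually means swapping vocabularies as the application changes state, so only the commands that make sense right now are registered with the engine. The sketch below models that with small per-state command lists; the state names and the ActivateVocabulary stub are hypothetical.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical stub: in a real program this would hand the list to the
// SR engine so only these phrases are considered for recognition.
static void ActivateVocabulary(const std::vector<std::string>& commands)
{
    std::printf("Active commands (%u):\n", (unsigned)commands.size());
    for (size_t i = 0; i < commands.size(); ++i)
        std::printf("  %s\n", commands[i].c_str());
}

int main()
{
    // Small, state-specific lists keep the engine's search space small.
    std::vector<std::string> mainWindow;
    mainWindow.push_back("open file");
    mainWindow.push_back("save file");
    mainWindow.push_back("check e-mail");

    std::vector<std::string> printDialog;
    printDialog.push_back("print");
    printDialog.push_back("cancel");

    ActivateVocabulary(mainWindow);   // when the main window has focus
    ActivateVocabulary(printDialog);  // swapped in when the dialog opens
    return 0;
}
```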
Finally, you can greatly increase the accuracy of the SR engine by avoiding similar-sounding words in commands. For example, repeat and delete are dangerously similar. Other words that are easily confused are go and no, and even on and off. You can still use these words in your application if you use them in separate states. In other words, do not use repeat in the same set of menu options as delete.
There are a few things to keep in mind when adding text-to-speech services to your applications. First, make sure you design your application to offer TTS as an option, not as a required service. Your application may be installed on a workstation that does not have the required resources, or the user may decide to turn off TTS services to improve overall performance. For this reason, it is also important to provide visual as well as aural feedback for all major operations. For example, when processing is complete, it is a good idea to inform the user with a dialog box as well as a spoken message.
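A small helper that always gives visual feedback and adds the spoken message only when TTS is active keeps the two channels in step. The SpeakText wrapper here is hypothetical and stands in for whatever call your TTS engine exposes; the MessageBoxA call is standard Win32.

```cpp
#include <windows.h>
#include <stdio.h>

static BOOL g_ttsEnabled = TRUE;      /* user-controlled option */

// Hypothetical wrapper around the vendor-specific text-to-speech call.
static void SpeakText(const char* text)
{
    printf("(speaking) %s\n", text);
}

// Every major notification is shown visually; speech is an optional extra.
static void NotifyUser(const char* message)
{
    if (g_ttsEnabled)
        SpeakText(message);
    MessageBoxA(NULL, message, "Status", MB_OK | MB_ICONINFORMATION);
}

int main(void)
{
    NotifyUser("Status report complete.");
    g_ttsEnabled = FALSE;                    /* user turned TTS off */
    NotifyUser("Status report complete.");   /* still visible, just silent */
    return 0;
}
```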
Because TTS engines typically produce a voice that is less than human-like, extended sessions of listening to TTS output can be tiring to users. It is a good idea to limit TTS output to short phrases. For example, if your application gathers status data on several production operations on the shop floor, it is better to have the program announce the completion of the process (for example, Status report complete) instead of announcing the details of the findings. Alternatively, your TTS application could announce a short summary of the data (for example, All operations on time and within specifications).
If your application must provide extended TTS sessions, you should consider using pre-recorded WAV files for output. For example, if your application gives users aural access to company regulations or documentation, it is better to record a person reading the documents, and then play back these recordings to users upon request. Also, if your application provides a limited set of vocal responses to the user, it is advisable to use WAV recordings instead of TTS output. A good example of this would be telephony applications that ask users questions and respond with fixed answers.
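Pre-recorded playback on Windows can be handled with the standard PlaySound call from the multimedia library (link with winmm.lib). The file name below is only a placeholder for whatever recording your application ships.

```cpp
#include <windows.h>
#include <mmsystem.h>   /* PlaySound; link with winmm.lib */

int main(void)
{
    /* Placeholder file name: a narrator's recording of a policy document.
       SND_FILENAME plays a .WAV file from disk; SND_SYNC waits until done. */
    BOOL ok = PlaySoundA("REGULATIONS.WAV", NULL, SND_FILENAME | SND_SYNC);

    if (!ok)
        MessageBoxA(NULL, "The recording could not be played.",
                    "Playback", MB_OK | MB_ICONEXCLAMATION);
    return 0;
}
```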
Finally, it is not advisable to mix WAV output and TTS output in the same session. This highlights the differences between the quality of recorded voice and computer-generated speech. Switching between WAV and TTS can also make it harder for users to understand the TTS voice since they may be expecting a familiar recorded voice and hear computer-generated TTS instead.
This chapter covered three main topics:

- The hardware and software requirements for SR/TTS applications
- The state of current SR/TTS technology and its limits
- Tips for designing and implementing SR/TTS services
The Microsoft Speech SDK only works on 32-bit operating systems. This means you will need Windows 95 or Windows NT version 3.5 or greater in order to run SAPI applications.
The minimum, recommended, and preferred processor and RAM requirements for SAPI applications vary depending on the level of services your application provides. A minimal SAPI-enabled system may need as little as 1MB of additional RAM and be able to run on a 486/33 processor. However, it is a good idea to require at least a Pentium 60 processor and an additional 8MB of RAM. This will give your applications the additional computational power needed for the most typical SAPI implementations.
SAPI systems can use just about any of the current sound cards on the market today. Any card that is compatible with the Windows Sound System or with Sound Blaster systems will work fine. You should use a close-talk, unidirectional microphone, and use either external speakers or headphones for monitoring audio output.
You learned that SR technology uses three basic processes for interpreting audio input:

- Word selection (discrete speech, word spotting, or continuous speech)
- Handling speaker dependence (speaker-dependent, speaker-adaptive, or speaker-independent systems)
- Word analysis (whole-word or sub-word matching)
You also learned that SR systems have their limits. SR engines cannot automatically distinguish between multiple speakers, learn new words, guess at spelling, or handle wide variations in word pronunciation (for example, to-may-toe or to-mah-toe).
TTS engine technology is based on two different types of implementation. Synthesis systems create audio output by generating audio tones using algorithms. This results in unmistakably computer-like speech. Diphone concatenation is an alternate method for generating speech. Diphones are a set of phoneme pairs collected from actual human speech samples. The TTS engine is able to convert text into phoneme pairs and match them to diphones in the TTS engine database. TTS engines are not able to mimic human speech patterns and rhythms (called prosody) and are not very good at communicating emotions. Also, most TTS engines experience difficulty with unusual words. This can result in odd-sounding phrases.
Finally, you learned some tips on designing and implementing speech services. Some of the tips covered here were:

- Make SR and TTS services optional, and show users the current state of the speech engine
- Keep command lists short, consistent, and free of similar-sounding words
- Confirm dangerous or ambiguous voice commands before acting on them
- Limit TTS output to short phrases, and use pre-recorded WAV files for extended playback
In the next chapter, you'll use the information you learned here to start creating SAPI-enabled applications.