Chapter 14 covered the key factors in creating and implementing a complete speech system for PCs. You also learned about the three major parts of a speech system.
In Chapter 15 you learned the details of the SR and TTS interfaces defined by the Microsoft SAPI model. You also learned that the SAPI model is based on the Component Object Model (COM) interface and that Microsoft has defined two distinct levels of SAPI services: high-level services that handle common voice-command and voice-text tasks, and low-level services that give direct access to the SR and TTS engines.
You learned that each level of SAPI service contains several COM interfaces that give C programmers access to speech services. These interfaces let you set and get engine attributes, turn the services on or off, display dialog boxes for user interaction, and perform direct TTS and SR functions.
Because the SAPI model is based on the COM interface, high-level languages such as Visual Basic cannot call its functions directly through standard API calls. Instead, Microsoft has developed OLE automation type libraries for use with Visual Basic and other VBA-compliant systems: VTXTAUTO.TLB for TTS services and VCMDAUTO.TLB for SR services.
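For example, a Visual Basic program reaches the speech services through late-bound automation objects rather than through direct COM calls. The following sketch is illustrative only; the ProgIDs shown are assumptions, so verify them against the class names your Speech SDK setup actually registers:

    ' Hypothetical sketch: create the speech automation objects
    ' from Visual Basic with late binding. The ProgIDs below are
    ' assumptions -- confirm them in your SDK documentation.
    Dim objVoiceText As Object      ' TTS automation object
    Dim objVoiceCmd As Object       ' SR automation object

    Set objVoiceText = CreateObject("Speech.VoiceText")
    Set objVoiceCmd = CreateObject("Speech.VoiceCommand")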
Chapter 16 focused on the hardware and software requirements of SAPI systems, the general technology and limits of SAPI services, and some design tips for creating successful SAPI implementations.
The Microsoft Speech SDK only works on 32-bit operating systems. This means you need Windows 95 or Windows NT Version 3.5 or greater in order to run SAPI applications.
The minimum, recommended, and preferred processor and RAM requirements for SAPI applications vary with the level of services your application provides. A minimal SAPI-enabled system may need as little as 1MB of additional RAM and can run on a 486/33 processor. However, it is a good idea to require at least a Pentium 60 processor and an additional 8MB of RAM; this gives your applications the computational power needed for the most typical SAPI implementations.
SAPI systems can use almost any sound card on the market today. Any card compatible with the Windows Sound System or with Sound Blaster systems will work fine. You should use a close-talk, unidirectional microphone, and you can use either external speakers or headphones to monitor audio output.
You learned that SR technology uses three basic processes to interpret audio input.
You also learned that SR systems have their limits. SR engines cannot automatically distinguish between multiple speakers, learn new words, guess at spelling, or handle wide variations in word pronunciation (for example, "toe-may-toe" versus "toe-mah-toe").
TTS engine technology is based on two different types of implementations. Synthesis systems create audio output by generating audio tones with algorithms, which results in unmistakably computer-like speech. Diphone concatenation is an alternative method for generating speech. Diphones are phoneme pairs collected from actual human speech samples; the TTS engine converts text into phoneme pairs and matches them against the diphones stored in its database. TTS engines are not able to mimic human speech patterns and rhythms (called prosody) and are not very good at communicating emotion. Also, most TTS engines have difficulty with unusual words, which can result in odd-sounding phrases.
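To picture how diphone matching works, consider a toy Visual Basic routine that splits a phoneme sequence into overlapping pairs, the way a TTS engine pairs phonemes before looking them up in its diphone database. This is purely an illustration of the pairing idea, not engine code:

    ' Toy illustration only: split a phoneme list into overlapping
    ' pairs (diphones). A real TTS engine matches each pair against
    ' recorded human speech samples in its diphone database.
    Function ListDiphones(Phonemes() As String) As String
        Dim i As Integer
        Dim strOut As String
        For i = LBound(Phonemes) To UBound(Phonemes) - 1
            strOut = strOut & Phonemes(i) & "-" & Phonemes(i + 1) & " "
        Next i
        ListDiphones = Trim(strOut)
    End Function

For the phonemes h, eh, l, and ow (the word "hello"), the routine returns the diphone list h-eh eh-l l-ow.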
Finally, you learned several tips for designing and implementing speech services.
In Chapter 17 you learned that the Microsoft Speech SDK contains a set of OLE library files for implementing SAPI services using Visual Basic and other VBA-compatible languages. There is an OLE automation library for TTS services (VTXTAUTO.TLB) and one for SR services (VCMDAUTO.TLB). Chapter 17 showed you how to use the objects, methods, and properties in these libraries to add SR and TTS services to your Windows applications.
You learned how to register and enable TTS services using the Voice Text object. You also learned how to adjust the speaking speed and how to control TTS output with the playback, rewind, fast-forward, and pause methods. Finally, you learned how to use the special Callback property to register a notification sink built from a Visual Basic class module, as recapped in the sketch below.
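Here is a compressed sketch of that Voice Text pattern. Treat the member names and argument lists as approximations; check them against the object browser's view of VTXTAUTO.TLB before relying on them:

    ' Hedged sketch of the Voice Text pattern from Chapter 17.
    ' Member names and arguments are approximations -- verify them
    ' against VTXTAUTO.TLB in the object browser.
    Dim objVoiceText As Object

    Set objVoiceText = CreateObject("Speech.VoiceText")  ' ProgID assumed
    objVoiceText.Callback = "TTSNotify"       ' class module name (assumed form)
    objVoiceText.Register "", "My SAPI App"   ' site, application name
    objVoiceText.Speed = 150                  ' playback speed
    objVoiceText.Speak "Welcome to the voice-enabled editor."
    objVoiceText.AudioPause                   ' pause playback
    objVoiceText.AudioResume                  ' resume playback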
You also learned how to register and enable SR services using the Voice Command and Voice Menu objects. You learned how to build temporary and permanent menu commands and how to link them to program operations. You also learned how to build commands that accept a list of possible choices and how to use that list in a program. Finally, you learned how to use the Callback property to register a notification sink built from a Visual Basic class module.
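The Voice Command pattern is similar. Again, the object names and argument lists below are assumptions to be checked against the SR automation type library:

    ' Hedged sketch of the Voice Command pattern from Chapter 17.
    ' Method names and argument lists are assumptions -- check them
    ' against the SR automation type library in the object browser.
    Dim objVoiceCmd As Object
    Dim objVoiceMenu As Object

    Set objVoiceCmd = CreateObject("Speech.VoiceCommand")  ' ProgID assumed
    objVoiceCmd.Register ""                       ' default audio site
    Set objVoiceMenu = objVoiceCmd.MenuCreate("MyApp", "Main")  ' args assumed
    objVoiceMenu.Add "Open File"                  ' add a spoken command
    objVoiceMenu.Active = True                    ' start listening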
In Chapter 18 you learned how the speech system uses grammar rules, control tags, and the International Phonetic Alphabet (IPA) to perform its key operations.
You built simple grammars and tested them using the tools that ship with the Speech SDK. You also learned how to load and enable those grammars for use in your SAPI applications.
You added control tag information to your TTS input to improve the prosody and overall performance of TTS interfaces. You used Speech SDK tools to create and play back text with control tags, and you learned how to edit the stored lexicon to maintain improved TTS performance over time.
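As a quick reminder of the form, control tags are embedded directly in the text you send to the Speak method. The backslash-delimited tags below follow the SDK's control-tag style, but confirm the exact tag names against your SDK documentation:

    ' Hedged example of embedding TTS control tags in spoken text.
    ' The tag names (\Spd=, \Pau=, \Emp\, \Rst\) are drawn from the
    ' Speech SDK's control-tag style -- confirm them before shipping.
    Dim strTagged As String

    strTagged = "\Spd=140\Welcome back. " & _
                "\Pau=500\ \Emp\Please\Rst\ choose a command."
    objVoiceText.Speak strTagged    ' Voice Text object created earlier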
Finally, you learned how the International Phonetic Alphabet is used to store and reproduce common speech patterns. The IPA can be used by SR and TTS engines as a source for analysis and playback.
In Chapter 19 you learned how to write simple TTS and SR applications using C++. Since many of the SAPI features are available only through C++ coding, this chapter gave you a quick review of how to use C++ to implement SAPI services.
You built a simple TTS program into which you can cut and paste any text for playback. You also built and tested a simple SR interface to illustrate the techniques required to add SR services to existing applications.
In Chapter 20 you used all the information from previous chapters to build a complete application that implements both TTS and SR services. The Voice-Activated Text Reader lets users select text files, loads them into the editor page, and then reads them back on command. All major operations can be performed with speech commands.
You also learned how to add SR services to other existing applications using a set of library modules that you can add to any Visual Basic project.
The future of SAPI is wide open. This section of the book gave you only a first glimpse of the possibilities ahead. At present, SAPI systems are most successful as command-and-control interfaces, which let users issue voice commands to start and stop basic operations that would otherwise require keyboard or mouse intervention. Current technology offers limited voice playback services: users can get quick replies or short readings of text without much trouble, but long stretches of text playback are still difficult to understand.
With the creation of the generalized interfaces defined by Microsoft in the SAPI model, it will not be long before new versions of TTS and SR engines appear on the market, ready to take advantage of the large installed base of Windows operating systems. With each new release of Windows, and each new version of the SAPI interface, speech services are bound to become more powerful and more user-friendly.
Although we have not yet arrived at the level of voice interaction depicted in Star Trek and other futuristic tales, the release of SAPI for Windows puts us more than one step closer to that reality!