Chapter 15

SAPI Architecture


CONTENTS


Introduction

The Speech API is implemented as a series of Component Object Model (COM) interfaces. This chapter identifies the top-level objects, their child objects, and their methods.

The SAPI model is divided into two distinct levels:

- High-level SAPI, which provides simple voice command (speech recognition) and voice text (text-to-speech) services
- Low-level SAPI, which provides direct, detailed access to the speech recognition and text-to-speech engines

Each of the two levels of SAPI services has its own set of objects and methods.

Along with the two sets of COM interfaces, Microsoft has also published an OLE Automation type library for the high-level SAPI objects. This set of OLE objects is discussed at the end of the chapter.

When you complete this chapter you'll understand the basic architecture of the SAPI model, including all the SAPI objects and their uses. Detailed information about the objects' methods and parameters will be covered in the next chapter-"SAPI Basics."

Note
Most of the Microsoft Speech API is accessible only through C++ code. For this reason, many of the examples shown in this chapter are expressed in Microsoft Visual C++ code. You do not need to be able to code in C++ in order to understand the information discussed here. At the end of this chapter, the OLE Automation objects available through Visual Basic are also discussed.

High-Level SAPI

The high-level SAPI services provide access to basic forms of speech recognition and text-to-speech services. This is ideal for providing voice-activated menus, command buttons, and so on. It is also sufficient for basic rendering of text into speech.

The high-level SAPI interface has two top-level objects-one for voice command services (speech recognition), and one for voice text services (text-to-speech). The following two sections describe each of these top-level objects, their child objects, and the interfaces available through each object.

Voice Command

The Voice Command object is used to provide speech recognition services. It is useful for providing simple command-and-control speech services such as implementing menu options, activating command buttons, and issuing other simple operating system commands.

The Voice Command object has one child object and one collection object. The child object is the Voice Menu object and the collection object is a collection of enumerated menu objects (see Figure 15.1).

Figure 15.1 : The Voice Command object.

Voice Command Object

The Voice Command object supports three interfaces:

- The Voice Command interface
- The Attributes interface
- The Dialogs interface

The Voice Command interface is used to enumerate, create, and delete voice menu objects. It is also used to register an application with the SR engine; an application must successfully complete this registration before the SR engine can be used. The Voice Command interface also defines the Mimic method, which plays a voice command back to the engine. In effect, Mimic lets an application "speak" commands directly to the SR engine, much like playing keystroke or mouse-action macros back to the operating system.

The Attributes interface is used to set and retrieve a number of basic parameters that control the behavior of the voice command system. You can enable or disable voice commands, adjust input gain, establish the SR mode, and control the input device (microphone or telephone).

The Dialogs interface gives you access to a series of dialog boxes that can be used as a standard set of input screens for setting and displaying SR engine information. The SAPI model identifies five different dialog boxes that should be available through the Dialogs interface. The exact layout and content of these dialog boxes is not dictated by Microsoft, but is determined by the developer of the speech recognition engine. However, Microsoft has established general guidelines for the contents of the SR engine dialog boxes. Table 15.1 lists each of the five defined dialog boxes along with short descriptions of their suggested contents.

Table 15.1. The Voice Command dialog boxes.

About Box: Displays a dialog box that identifies the SR engine and shows its copyright information.
Command Verification: Can be used as a verification pop-up window during a speech recognition session. When the engine identifies a word or phrase, this box can appear asking the user to confirm that the engine correctly understood the spoken command.
General Dialog: Can be used to provide general access to SR engine settings, such as identifying the speaker, controlling recognition parameters, and setting the amount of disk space allotted to the SR engine.
Lexicon Dialog: Can be used to let the speaker alter the pronunciation lexicon, including changing the phonetic spelling of troublesome words or adding and deleting personal vocabulary files.
Training Dialog: Can be used to walk the speaker through training sessions that help the engine adapt to the speaker's voice.

The Voice Menu Object and the Menu Object Collection

The Voice Menu object is the only child object of the Voice Command object. It is used to allow applications to define, add, and delete voice commands in a menu. You can also use the Voice Menu object to activate and deactivate menus and, optionally, to provide a training dialog box for the menu.

The voice menu collection object contains a set of all menu objects defined in the voice command database. Microsoft SAPI defines functions to select and copy menu collections for use by the voice command speech engine.

The Voice Command Notification Callback

In the process of registering the application to use a voice command object, a notification callback (or sink) is established. This callback receives messages regarding the SR engine activity. Typical messages sent out by the SR engine can include notifications that the engine has detected commands being spoken, that some attribute of the engine has been changed, or that spoken commands have been heard but not recognized.

Note
Notification callbacks require a pointer to the function that will receive all related messages. Callbacks cannot be registered using Visual Basic; you need C or C++. However, the voice command OLE Automation type library that ships with the Speech SDK has a notification callback built into it.

Voice Text

The SAPI model defines a basic text-to-speech service called voice text. This service has only one object-the Voice Text object. The Voice Text object supports three interfaces:

- The Voice Text interface
- The Attributes interface
- The Dialogs interface

The Voice Text interface is the primary interface of the TTS portion of the high-level SAPI model. It provides a set of methods to start, pause, resume, fast-forward, rewind, and stop the TTS engine while it is speaking text. This mirrors the VCR-style controls commonly used for PC video and audio playback.

The Voice Text interface is also used to register the application that will request TTS services. An application must successfully complete the registration before the TTS engine can be used. This registration function can optionally pass a pointer to a callback function to be used to capture voice text messages. This establishes a notification callback with several methods, which are triggered by messages sent from the underlying TTS engine.

Note
Notification callbacks require a pointer to the function that will receive all related messages. Callbacks cannot be registered using Visual Basic; you need C or C++. However, the voice text OLE Automation type library that ships with the Speech SDK has a notification callback built into it.

The Attributes interface provides access to settings that control the basic behavior of the TTS engine. For example, you can use the Attributes interface to set the audio device to be used, set the playback speed (in words per minute), and turn the speech services on and off. If the TTS engine supports it, you can also use the Attributes interface to select the TTS speaking mode. The TTS speaking mode usually refers to a predefined set of voices, each having its own character or style (for example, male, female, child, adult, and so on).

The Dialogs interface lets users set and retrieve information about the TTS engine. The exact contents and layout of the dialog boxes are determined not by Microsoft but by the TTS engine developer. Microsoft does, however, suggest the possible contents of each dialog box. Table 15.2 shows the four voice text dialog boxes defined by the SAPI model, along with short descriptions of their suggested contents.

Table 15.2. The Voice Text dialog boxes.
About Box: Displays a dialog box that identifies the TTS engine and shows its copyright information.
Lexicon Dialog: Can be used to let the user alter the pronunciation lexicon, including changing the phonetic spelling of troublesome words or adding and deleting personal vocabulary files.
General Dialog: Can be used to display general information about the TTS engine. Examples might include controlling the speed at which text is read, the character of the voice used for playback, and other user preferences as supported by the TTS engine.
Translate Dialog: Can be used to give the user the ability to alter the pronunciation of key words in the lexicon. For example, the TTS engine that ships with Microsoft Voice has a special entry that forces the speech engine to express all occurrences of "TTS" as "text to speech," instead of just reciting the letters "T-T-S."

Low-Level SAPI

The low-level SAPI services provide a much greater degree of control over Windows speech recognition and text-to-speech services. This level is best for implementing advanced SR and TTS services, including the creation of dictation systems.

Just as there are two basic service types for high-level SAPI, there are two primary COM interfaces defined for low-level SAPI-one for speech recognition and one for text-to-speech services. The rest of this chapter outlines each of the objects and their interfaces.

Note
This section of the chapter covers the low-level SAPI services. These services are available only from C or C++ programs-not Visual Basic. However, even if you do not program in C, you can still learn a lot from this section of the chapter. The material in this section can give you a good understanding of the details behind the SAPI OLE automation objects, and may also give you some ideas on how you can use the VB-level SAPI services in your programs.

Speech Recognition

The Speech Recognition object has several child objects and collections. There are two top-level objects in the SR system: the SR Engine Enumerator object and the SR Sharing object. These two objects are created using their unique CLSID (class ID) values. The purpose of both objects is to give an application information about the available speech recognition engines and to let the application register with the appropriate engine. Once the engine is selected, one or more grammar objects can be created, and an SR Results object is created for each phrase the engine hears. This temporary object contains details about the phrase that was captured by the speech recognition engine. Figure 15.2 shows how the different objects relate to each other, and how they are created.

Figure 15.2 : Mapping the low-level SAPI objects.

When an SR engine is created, a link to a valid audio input device is also created. While it is possible to create a custom audio input device, it is not required. The default audio input device is an attached microphone, but the input can also be set to point to a telephone device.

The rest of this section details the low-level SAPI SR objects and their interfaces.

The SR Enumerator and Engine Enumerator Objects

The role of the SR Enumerator and Engine Enumerator objects is to locate and select an appropriate SR engine for the requesting application. The Enumerator object lists all available speech recognition modes and their associated installed engines. This information is supplied by the child object of the Enumerator object: the Engine Enumerator object. The result of this search is a pointer to the SR engine interface that best meets the service request.

The Enumerator and Engine Enumerator objects support only two interfaces.

Note
The SR Enumerator and Engine Enumerator objects are used only to locate and select an engine object. Once that is done, these two objects can be discarded.

The SR Sharing Object

The SR Sharing object is a possible replacement for the SR Enumerator and Engine Enumerator objects. It uses only one interface, the ISRSharing interface, to locate and select an engine object that is shared with other applications on the PC. In essence, this allows the requesting application to register with an out-of-process SR server object. While often slower than creating an instance of a private SR object, using the Sharing object can reduce the strain on memory resources.

The SR Sharing interface is an optional feature of speech engines and may not be available depending on the design of the engine itself.

The SR Engine Object

The SR Engine object is the heart of the speech recognition system. This object represents the actual speech engine, and it supports several interfaces for monitoring speech activity. The SR Engine object is created using the Select method of the ISREnum interface of the SR Enumerator object described earlier. Table 15.3 lists the interfaces supported by the SR Engine object, along with short descriptions of their uses.

Table 15.3. The interfaces of the SR Engine object.
ISRCentral: The main interface for the SR Engine object. It allows the loading and unloading of grammars, checks status information for the engine, starts and stops the engine, and registers and releases the engine notification callback.
ISRDialogs: Used to display a series of dialog boxes that let users set engine parameters and engage in training to improve SR performance.
ISRAttributes: Used to set and get basic attributes of the engine, including the input device name and type, volume controls, and other information.
ISRSpeaker: Allows users to manage a list of speakers that use the engine. This is especially valuable when more than one person uses the same device. This is an optional interface.
ISRLexPronounce: Gives users access to modify the pronunciation or playback of certain words in the lexicon. This is an optional interface.

The SR Engine object also provides a notification callback interface (ISRNotifySink) to capture messages sent by the engine. These messages can be used to check on the performance status of the engine, and can provide feedback to the application (or speaker) that can be used to improve performance.

The Grammar Object

The Grammar object is a child object of the SR Engine object. It is used to load parsing grammars for use by the speech engine in analyzing audio input. The Grammar object contains all the rules, words, lists, and other parameters that control how the SR engine interprets human speech. Each phrase detected by the SR engine is processed using the loaded grammars.

The Grammar object supports three interfaces.

The Grammar object also supports a notification callback to handle messages regarding grammar events. Optionally, the grammar object can create an SR Results object. This object is discussed fully in the next section.

The SR Results Object

The SR Results object contains detailed information about the most recent speech recognition event. This could include a recorded representation of the speech, the interpreted phrase constructed by the engine, the name of the speaker, performance statistics, and so on.

Note
The SR Results object is optional and is not supported by all engines.

Table 15.4 shows the interfaces defined for the SR Results object, along with descriptions of their use. Only the first interface in the table is required (the ISRResBasic interface).

Table 15.4. The defined interfaces for the SR Results object.
ISRResBasic: Provides basic information about the results object, including an audio representation of the phrase, the selected interpretation of the audio, the grammar used to analyze the input, and the start and stop times of the recognition event.
ISRResAudio: Retrieves an audio representation of the recognized phrase. This audio can be played back to the speaker or saved as a WAV-format file for later review.
ISRResGraph: Produces a graphic representation of the recognition event. This graph could show the phonemes used to construct the phrase, show the engine's "score" for accurately detecting the phrase, and so on.
ISRResCorrection: Provides an opportunity to confirm that the interpretation was accurate, possibly allowing for a correction in the analysis.
ISRResEval: Re-evaluates the results of the previous recognition. The engine could use this to ask the speaker to repeat training phrases and use the new information to re-evaluate previous interpretations.
ISRResSpeaker: Identifies the speaker performing the dictation. Could be used to improve engine performance by comparing stored information from previous sessions with the same speaker.
ISRResModifyGUI: Provides a pop-up window asking the user to confirm the engine's interpretation. Could also provide a list of alternate results to choose from.
ISRResMerge: Merges data from two different recognition events into a single unit for evaluation purposes. This can be done to improve the system's knowledge about a speaker or phrase.
ISRResMemory: Allocates and releases memory used by results objects. This is strictly a housekeeping function.

Text-to-Speech

The low-level text-to-speech services are provided by one primary object-the TTS Engine object. Like the SR object set, the TTS object set has an Enumerator object and an Engine Enumerator object. These objects are used to locate and select a valid TTS Engine object and are then discarded (see Figure 15.3).

Figure 15.3 : Mapping the low-level SAPI objects.

The TTS services also use an audio output object. The default output device is the PC's speakers, but this can be set to the telephone device. Applications can also create their own output devices, including a WAV-format recording device that captures the TTS engine's output.

The rest of this section discusses the details of the low-level SAPI TTS objects.

The TTS Enumerator and Engine Enumerator Objects

The TTS Enumerator and Engine Enumerator objects are used to obtain a list of the available TTS engines and their speaking modes. They both support two interfaces.

Once the objects have provided a valid address to a TTS engine object, the TTS Enumerator and Engine Enumerator objects can be discarded.

The TTS Engine Object

The TTS Engine object is the primary object of low-level SAPI TTS services. The Engine object supports several interfaces. Table 15.5 lists the interfaces used for the translation of text into audible speech.

Table 15.5. The TTS Engine object interfaces.
ITTSCentral: The main interface for the TTS Engine object. It is used to register an application with the TTS system, to start, pause, and stop TTS playback, and so on.
ITTSDialogs: Used to provide a connection to several dialog boxes. The exact contents of each dialog box are determined by the engine provider, not by Microsoft. The dialog boxes defined for the interface are the About Box, General Dialog, Lexicon Dialog, and Training Dialog.
ITTSAttributes: Used to set and retrieve control parameters of the TTS engine, including playback speed and volume, the playback device, and so on.

In addition to the interfaces described in Table 15.5, the TTS Engine object supports two notification callbacks.

Speech Objects and OLE Automation

Microsoft supplies an OLE Automation type library with the Speech SDK. This type library can be used with any VBA-compliant software, including Visual Basic, Access, Excel, and others. The OLE Automation set provides high-level SAPI services only. The objects, properties, and methods are quite similar to the objects and interfaces provided by the high-level SAPI services described at the beginning of this chapter.

There are two type library files in the Microsoft Speech SDK:

- The Voice Command type library
- The Voice Text type library

You can load these libraries into a Visual Basic project by way of the Tools | References menu item (see Figure 15.4).

Figure 15.4 : Loading the Voice Command and Voice Text type libraries.

OLE Automation Speech Recognition Services

The OLE Automation speech recognition services are implemented using two objects:

- The Voice Command object
- The Voice Menu object

The OLE Voice Command object has three properties and two methods. Table 15.6 shows the Voice Command object's properties and methods, along with their parameters and short descriptions.

Table 15.6. The properties and methods of the OLE Voice Command object.

Register method: Registers the application with the SR engine. It must be called before any speech recognition will occur.
CallBack property (Project.Class as String): Visual Basic 4.0 programs can use this property to identify an existing class module that has two special methods defined. (See the following section, "Using the Voice Command Callback.")
Awake property (TRUE/FALSE): Use this property to turn speech recognition for the application on or off.
CommandSpoken property (cmdNum as Integer): Use this property to determine which command was heard by the SR engine. VB4 applications do not need to use this property if they have installed the callback routines. All other programming environments must poll this value (using a timer) to determine the command that has been spoken.
MenuCreate method (appName as String, state as String, langID as Integer, dialect as String, flags as Long): Use this method to create a new menu object. Menu objects are used to add new items to the list of valid commands to be recognized by the SR engine.

Using the Voice Command Callback

The Voice Command type library provides a unique and very efficient method for registering callbacks using a Visual Basic 4.0 class module. In order to establish an automatic notification from the SR engine, all you need to do is add a VB4 class module to your application. This class module must have two functions created:

- CommandRecognize, which is called when a spoken phrase is recognized as being from the application's command set
- CommandOther, which is called when a spoken phrase is recognized as being from another application's command set or is not recognized

Listing 15.1 shows how these two routines look in a class module.


Listing 15.1. Creating the notification routines for the Voice Command object.
'Sent when a spoken phrase was either recognized as being from another
'application's command set or was not recognized.
Function CommandOther(pszCommand As String, pszApp As String, pszState As String)
    If Len(pszCommand) = 0 Then
        VcintrForm.StatusMsg.Text = "Command unrecognized" & Chr(13) & Chr(10) & _
            VcintrForm.StatusMsg.Text
    Else
        VcintrForm.StatusMsg.Text = pszCommand & " was recognized from " & pszApp & _
            "'s " & pszState & " menu" & Chr(13) & Chr(10) & VcintrForm.StatusMsg.Text
    End If
End Function

'Sent when a spoken phrase is recognized as being from the application's
'command set.
Function CommandRecognize(pszCommand As String, dwID As Long)
    VcintrForm.StatusMsg.Text = pszCommand & Chr(13) & Chr(10) & _
        VcintrForm.StatusMsg.Text
End Function

Note
You'll learn more about how to use the Voice Command object in Chapter 19, "Creating SAPI Applications with C++."

The Voice Menu Object

The OLE Voice Menu object is used to add new commands to the list of valid items that can be recognized by the SR engine. The Voice Menu object has two properties and three methods. Table 15.7 shows the Voice Menu object's methods and properties, along with parameters and short descriptions.

Table 15.7. The properties and methods of the OLE Voice Menu object.
hWndMenu property (hWnd as Long): Sets the window handle for a voice menu. Whenever this window is the foreground window, the voice menu is automatically activated; otherwise, it is deactivated. If this property is set to NULL, the menu is global.
Active property (TRUE/FALSE): Use this property to turn the menu on or off. If it is set to TRUE, the menu is active. The menu must be active before its commands will be recognized by the SR engine.
Add method (id as Long, command as String, category as String, description as String): Adds a new command to the menu. The command parameter contains the actual phrase the SR engine will listen for. The id parameter is returned when the SR engine recognizes that the command has been spoken. The other parameters are optional.
Remove method (id as Long): Removes an item from the menu list. The id parameter is the same value used when the item was created with the Add method.
ListSet method (Name as String, Elements as Long, Data as String): Adds a list of possible entries for use with a command (see "Using Command Lists with the Voice Menu Object" later in this chapter). Name is the name of the list referred to in a command. Elements is the total number of elements in the list. Data is the set of elements, each separated by a Chr(0).

Using Command Lists with the Voice Menu Object

The Voice Menu object allows you to define a command that refers to a list. You can then load this list into the grammar using the ListSet method. For example, you can use the Add method to create a command to send e-mail messages. Then you can use the ListSet method to create a list of people to receive e-mail (see Listing 15.2).


Listing 15.2. Using the Add and ListSet methods of the Voice Menu object.
Dim Names As String
Dim szNULL As String
szNULL = Chr(0)

Call vMenu.Add(109, "Send email to <Names>")
Names = "Larry" & szNULL & "Mike" & szNULL & "Gib" & szNULL & "Doug" & szNULL & _
    "George" & szNULL
Call vMenu.ListSet("Names", 5, Names)

OLE Automation Text-to-Speech Services

You can gain access to the OLE Automation TTS services using only one object-the Voice Text object. The Voice Text object has four properties and seven methods. Table 15.8 shows the properties and methods, along with their parameters and short descriptions.

Table 15.8. The properties and methods of the Voice Text object.

Register method (AppName as String): Registers the application with the TTS engine. This must be called before any other methods are called.
Callback property (Project.Class as String): Establishes a callback interface between the Voice Text object and your program. See the "Using the Voice Text Callback" section later in this chapter.
Enabled property (TRUE/FALSE): Use this property to turn the TTS service on or off. This must be set to TRUE for the Voice Text object to speak text.
Speed property (lSpeed as Long): Controls the speed (in words per minute) at which text is spoken. Setting the value to 0 selects the slowest speed; setting the value to -1 selects the fastest speed.
IsSpeaking property (TRUE/FALSE): Indicates whether the TTS engine is currently speaking text. You can poll this read-only property to determine when the TTS engine is busy or idle. Note that VB4 programmers should use the Callback property instead of this property.
Speak method (cText as String, lFlags as Long): Use this method to have the TTS engine speak text. The lFlags parameter can contain a value to indicate that the text is a statement, a question, and so on.
StopSpeaking method (none): Use this method to force the TTS engine to stop speaking the current text.
AudioPause method (none): Use this method to pause all TTS activity. This affects all applications using TTS services on this PC.
AudioResume method (none): Use this method to resume TTS activity after calling AudioPause. This affects all applications using TTS services on this PC.
AudioRewind method (none): Use this method to back up the TTS playback approximately one phrase or sentence.
AudioFastForward method (none): Use this method to advance the TTS engine approximately one phrase or sentence.

Using the Voice Text Callback

The Voice Text type library provides a unique and very efficient method for registering callbacks using a Visual Basic 4.0 class module. In order to establish an automatic notification from the TTS engine, all you need to do is add a VB4 class module to your application. This class module must have two functions created:

- SpeakingStarted, which is called when the TTS engine begins speaking text
- SpeakingDone, which is called when the TTS engine finishes speaking text

Listing 15.3 shows how these two routines look in a class module.


Listing 15.3. Creating the notification routines for a Voice Text object.
Function SpeakingDone()
    VtintrForm.StatusMsg.Text = "Speaking Done notification" & Chr(13) & Chr(10) & _
        VtintrForm.StatusMsg.Text
End Function

Function SpeakingStarted()
    VtintrForm.StatusMsg.Text = "Speaking Started notification" & Chr(13) & Chr(10) & _
        VtintrForm.StatusMsg.Text
End Function

Only VB4 applications can use this method of establishing callbacks through class modules. If you are using the TTS objects with other VBA-compatible languages, you need to set up a routine, using a timer, that will regularly poll the IsSpeaking property. The IsSpeaking property is set to TRUE while the TTS engine is speaking text.

Summary

In this chapter you learned the details of the SR and TTS interfaces defined by the Microsoft SAPI model. You learned that the SAPI model is based on the Component Object Model (COM) interface and that Microsoft has defined two distinct levels of SAPI services:

- High-level SAPI, for basic voice command and voice text services
- Low-level SAPI, for direct access to the speech recognition and text-to-speech engines

You learned that the two levels of SAPI service each contain several COM interfaces that allow C programmers access to speech services. These interfaces include the ability to set and get engine attributes, turn the services on or off, display dialog boxes for user interaction, and perform direct TTS and SR functions.

Since the SAPI model is based on the COM interface, high-level languages such as Visual Basic cannot directly call functions using the standard API calls. Instead, Microsoft has developed OLE Automation type libraries for use with Visual Basic and other VBA-compliant systems. The two type libraries are:

- The Voice Command type library
- The Voice Text type library

You now have a good understanding of the types of speech recognition and text-to-speech services that are available with the Microsoft SAPI model. In the next chapter, you'll learn about details surrounding the design and implementation of SAPI applications, including typical hardware required, technology limits, and design considerations when building SAPI applications.