Chapter 18

SAPI Behind the Scenes


In this chapter, you'll learn about three aspects of the SAPI system that are not often used in the course of normal speech services operations:

You'll learn how to add control tags to your TTS text input in order to change the speed, pitch, volume, mood, gender, and other characteristics of TTS audio output. You'll learn how to use the 15 control tags to improve the sound of your TTS engine.

You'll also learn how grammar rules are used by the SR engine to analyze spoken input. You'll learn how to design your own grammars for specialized uses. You'll also learn how to code and compile your own grammars using tools from the Microsoft Speech SDK. Finally, you'll load your custom grammar into a test program and test the results of your newly designed grammar rules.

The last topic in this chapter is the International Phonetic Alphabet (IPA). The IPA is a
standard system for documenting the various sounds of human speech. The IPA is an implementation option for SAPI speech services under Unicode. For this reason, the IPA can be implemented only on WinNT systems. In this chapter, you'll learn how the IPA can be used to improve both TTS playback and SR recognition performance.

Control Tags

One of the most difficult tasks for a TTS system is rendering complete sentences. Most TTS systems do quite well when converting a single word into speech. When TTS systems begin to string words together into sentences, however, they do not perform as well, because human speech carries a set of inflections, pitches, and rhythms that are difficult to reproduce. These characteristics of human speech are called prosody.

There are several reasons that TTS engines are unsuccessful in matching the prosody of human speech. First, very little prosody information is written down in the text. Punctuation marks can be used to estimate some of it, but not all. Much of the inflection of a sentence is tied to subtle differences in the way individuals speak to each other: interjections, racing to complete a thought, a little added emphasis to make a point. These are all aspects of human prosody that are rarely found in written text.

When you consider the complexity involved in rendering a complete thought or sentence, the current level of technology in TTS engines is quite remarkable. Although the average output of TTS engines still sounds like a poor imitation of Darth Vader, it is amazingly close to human speech.

One of the ways that the SAPI model attempts to provide added control to TTS engines is the inclusion of what are called control tags in text that is to be spoken. These tags can be used to adjust the speed, pitch, and character of the voice used to render the text. By using control tags, you can greatly improve the perceived performance of the TTS engine.

The SAPI model defines 15 different control tags that can be used to modify the output of TTS engines. Microsoft defined these tags but does not determine how a TTS engine will respond to them. It is acceptable for TTS engines that comply with the SAPI model to ignore any and all control tags they do not understand. It is possible that the TTS engine you install on your system will not respond to some or all of these tags. It is also possible that the TTS engine will attempt to interpret them as part of the text instead of ignoring them. You will need to experiment with your TTS engine to determine its level of compliance with the SAPI model.

Note
All of the examples in this section were created using the Microsoft Voice TTS engine that ships with Microsoft Phone.

The SAPI control tags fall into three general categories:

The voice character tags can be used to set high-level general characteristics of the voice. The SAPI model allows users to select gender, dialect, accent, message context types, speaker's age, even the general mood of the speaker.

The phrase modification tags can be used to adjust the pronunciation at a word-by-word or phrase-by-phrase level. Users can control the word emphasis, pauses, pitch, speed, and volume of the playback.

The low-level TTS tags deal with attributes of the TTS engine itself. Users can add comments to the text, control the pronunciation of a word, turn prosody rules on and off, reset the engine to default settings, or even call a control tag based on its own GUID (globally unique identifier).

You add control tags to the text sent to the TTS engine by surrounding them with the backslash (\) character. For example, to raise the speed of the text playback from 150 to 200 words per minute, you would enter \Spd=150\ and \Spd=200\ control tags. The text below shows how this looks:

\Spd=150\This sentence is normal. \Spd=200\This sentence is faster.

Control tags are not case sensitive. For example, \spd=200\ is the same as \Spd=200\ or \SPD=200\. However, control tags are white-space sensitive. \Spd=200\ is not the same as \ Spd=200 \. As mentioned above, if the TTS engine encounters an unknown control tag, it ignores it. The next three sections of this chapter go into the details of each control tag and show you how to use them.
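Because tags share the backslash character with ordinary text, it can help to build tagged strings programmatically rather than by hand. The following Python sketch is purely illustrative (SAPI itself is a C/C++ API, and the helper names are invented here): it builds tags without stray spaces, and doubles any literal backslash in the spoken text on the assumption that the engine accepts \\ as an escaped backslash. Verify that assumption against your engine before relying on it.

```python
def tag(name, value=None):
    """Build a control tag such as \\Spd=200\\ with no stray spaces,
    since control tags are white-space sensitive."""
    body = name if value is None else f"{name}={value}"
    return f"\\{body}\\"

def escape_text(text):
    """Double literal backslashes so the engine does not mistake them
    for tag delimiters (assumption: the engine accepts \\\\ as an
    escaped backslash)."""
    return text.replace("\\", "\\\\")

def tagged(text, **tags):
    """Prefix escaped text with one control tag per keyword argument."""
    prefix = "".join(tag(k.capitalize(), v) for k, v in tags.items())
    return prefix + escape_text(text)
```

For example, `tagged("This sentence is faster.", spd=200)` produces the same string shown in the speed example above.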

Note
The following examples all use the TTSTEST.EXE program that is installed in the BIN\ANSI.API or the BIN\UNICODE.API folder of the SpeechSDK folder. These folders were created when you installed the Microsoft Speech SDK.

Before continuing, load the TTSTEST.EXE program from the SpeechSDK\BIN\ANSI.API directory (Win95) or the SpeechSDK\BIN\UNICODE.API directory (WinNT). This program will be used to illustrate examples throughout the chapter. After loading the program, press the Register button to start the TTS engine on your workstation. Then press the Add Mode button to select a voice for playback. Finally, make sure TTSDATAFLAG_TAGGED is checked. This informs the application that you will be sending control tags with your text. Your screen should now look something like the one in Figure 18.1.

Figure 18.1 : Starting the TTSTEST.EXE application.

Note
Even if you do not have a copy of the software, you can still learn a lot by reviewing the material covered in this section.

The Voice Character Control Tags

There are three control tags that allow you to alter the general character of the speaking voice. Microsoft has identified several characteristics of playback voices that can be altered using control tags. However, your TTS engine may not recognize all of them. The three control tags in this group are Chr, Ctx, and Vce.

Using the Chr Tag to Set the Voice Character

The Chr tag allows you to set the general character of the voice. The syntax for the Chr tag is

\Chr=string[[,string...]]\

More than one characteristic can be applied at the same time. The default value is Normal. Others that are recognized by the Microsoft Voice TTS engine are Monotone and Whisper. Additional characteristics suggested by Microsoft are

Angry      Business   Calm       Depressed  Excited
Falsetto   Happy      Loud       Perky      Quiet
Sarcastic  Scared     Shout      Tense

To test the Chr tag, enter the text shown in Listing 18.1 into the input box of TTSTEST.EXE.


Listing 18.1. Testing the Chr control tag.
\chr="monotone"\
How are you today?
\chr="whisper"\
I am fine.
\chr="normal"\
Good to hear.

Each sentence will be spoken using a different characteristic. After entering the text, press the TextData button to hear the results.

Using the Ctx Tag to Set the Message Context

Another valuable control tag is Ctx, the context tag. You can use this tag to tell the TTS engine the context of the message you are asking it to render. Like the Chr tag, the Ctx tag takes string as a parameter. Microsoft has defined the strings in Table 18.1 for the context tag.

Table 18.1. The context tag parameters.
Context Tag Parameter    Description
Address                  Addresses and/or phone numbers.
C                        Code in the C or C++ programming language.
Document                 Text document.
E-Mail                   Electronic mail.
Numbers                  Numbers, dates, times, and so on.
Spreadsheet              Spreadsheet document.
Unknown                  Context is unknown (default).

Setting the context helps the TTS engine better interpret the text. To test this, enter the text shown in Listing 18.2 into the text box.


Listing 18.2. Testing the Ctx control tag.
\ctx="Address"\
1204 W. 7th Street
Oak Ridge, TN
\ctx="E-Mail"\
BillGates@msn.com
\ctx="Unknown"\
129 W. First Avenue

When you press the TextData button to hear the results, you'll notice that the TTS engine automatically converts the "W." to "West" when given the \Ctx="Address"\ tag but fails to do so when the \Ctx="Unknown"\ tag is used. You'll also notice that the e-mail address is spoken using the phrase "Bill Gates at msn dot com" when the \Ctx="E-Mail"\ tag is used.

Using the Vce Tag to Control Additional Voice Characteristics

The last voice character control tag is the Vce tag. This tag can be used to set several aspects of a voice in a single control tag. The exact syntax of the Vce tag is

\Vce=chartype=string[[,chartype=string...]]\

Several character types can be set in a single call. Microsoft has defined six different character type classes. These classes, along with their possible settings and brief descriptions, are shown in Table 18.2.

Table 18.2. The Vce character types and their parameters.
Character Type    Description
Language=language Tells the TTS engine to speak in the specified language.
Accent=accent Tells the TTS engine to use the specified accent. For example, if Language="English" and Accent="French", the engine will speak English with a French accent.
Dialect=dialect Tells the TTS engine to speak in the specified dialect.
Gender=gender Used to set the gender of the voice as "Male," "Female," or "Neutral."
Speaker=speakername Specifies the name of the voice, or NULL if the name is unimportant. The Microsoft Voice engine can respond using the following names:
Peter
Sidney
Eager Eddie
Deep Douglas
Biff
Grandpa Amos
Melvin
Alex
Wanda
Julia
Age=age Sets the age of the voice, which can be one of the following values:
Baby (about 1 year old)
Toddler (about 3 years old)
Child (about 6 years old)
Adolescent (about 14 years old)
Adult (between 20 and 60 years old)
Elderly (over 60 years old)
Style=style Sets the personality of the voice. For example:
Business
Casual
Computer
Excited
Singsong

To test the Vce control tag, enter the text shown in Listing 18.3 and press TextData to hear the results.


Listing 18.3. Testing the Vce control tag.
\Vce=Speaker="Sidney"\
Hello there Peter.
\Vce=Speaker="Peter"\
Hi Sid. How are you?
\Vce=Speaker="Sidney"\
Not good really. Bad head cold.

You can use the Vce control tag to program the TTS engine to carry on a multiperson dialog.
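One way to manage such a dialog is to generate the tagged text from a list of speaker/sentence pairs. The Python helper below is a hypothetical sketch (the function name is invented for illustration); it emits a \Vce=Speaker="name"\ tag before each line, using whatever speaker names your engine supports.

```python
def dialog(lines):
    """Given (speaker, sentence) pairs, emit tagged text that switches
    the voice before each line using the Vce control tag. The speaker
    names (Sidney, Peter, ...) must be ones your engine recognizes."""
    parts = []
    for speaker, sentence in lines:
        parts.append(f'\\Vce=Speaker="{speaker}"\\')
        parts.append(sentence)
    return "\n".join(parts)
```

Feeding the pairs from Listing 18.3 into this helper reproduces the tagged text shown there.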

The Phrase Modification Control Tags

The second set of control tags-the phrase modification tags-can be used to modify words or phrases within the message stream. Phrase modification tags give you added control over TTS output. There are five phrase modification control tags: Emp, Pau, Pit, Spd, and Vol.

Using the Emp Tag to Add Emphasis to a Word

You can insert the \Emp\ tag before a word to force the TTS engine to give it added emphasis. Enter the text shown in Listing 18.4 and press TextData to hear the results.


Listing 18.4. Testing the Emp control tag.
I \Emp\told you never to go running in the street.

Didn't you \Emp\hear me?

You must listen to me when I tell you something \Emp\important.

You can quickly compare this phrase to one without emphasis by simply adding a space to each \Emp\ tag so that it looks like \ Emp\. Because this appears to be a new, unknown tag, the TTS engine will ignore it and speak the text with standard prosody.

Using the Pau Control Tag to Add Pauses to the Text

You can use the Pau tag to add pauses to the playback. The pause is measured in milliseconds. Here's an example of the Pau tag syntax:

\Pau=1000\

To test the Pau tag, add two tags to the speech you entered from the previous example. Your text should now look like the text in Listing 18.5.


Listing 18.5. Testing the Pau control tag.
I \Emp\told you never to go running in the street.
\pau=1000\
Didn't you \Emp\hear me?
\pau=2000\
You must listen to me when I tell you something \Emp\important.

Using the Pit Control Tag to Modify the Pitch of the Voice

The Pit control tag can be used to modify the base pitch of the voice. This base pitch sets the normal speaking pitch level. The actual pitch hovers above and below this value as the TTS engine mimics human speech prosody. The pitch is measured in hertz, with a minimum of 50 hertz and a maximum of 400 hertz.

Listing 18.6 modifies the previous text by adding \Pit\ control tags.


Listing 18.6. Testing the Pit control tag.
\Pit=100\
I \Emp\told you never to go running in the street.
\pau=1000\ \Pit=200\
Didn't you \Emp\hear me?
\pau=2000\ \Pit=400\
You must listen to me when I tell you something \Emp\important.
\Pit=50\

Tip
Notice that the last line of Listing 18.6 shows a pitch tag setting the pitch back to normal (\Pit=50\). This is done because the pitch setting does not automatically revert to the default level after a message has been spoken. If you want to return the pitch to its original level, you must do so using the Pit control tag.

Using the Spd Control Tag to Modify the Playback Speed

You can modify the playback speed of the TTS engine using the \Spd\ control tag. The speed is measured in words per minute (wpm). The minimum value is 50 wpm and the maximum is 250 wpm. Setting Spd to 0 sets the slowest possible speed. Setting Spd to -1 sets the fastest possible speed. Listing 18.7 shows additional modifications to the previous text. Enter this text and press the TextData button to hear the results.


Listing 18.7. Testing the Spd control tag.
\Spd=150\
I \Emp\told you never to go running in the street.
\pau=1000\ \Spd=75\
Didn't you \Emp\hear me?
\pau=2000\ \Spd=200\
You must listen to me when I tell you something \Emp\important.
\Spd=150\
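The Spd bounds and special values described above can be captured in a small helper. This Python sketch is illustrative only (the function name is invented): it clamps ordinary values to the documented 50-250 wpm range and passes the special values 0 (slowest) and -1 (fastest) through unchanged.

```python
def spd_tag(wpm):
    """Return an \\Spd=value\\ control tag. Ordinary values are clamped
    to the documented 50-250 wpm range; the special values 0 (slowest)
    and -1 (fastest) are passed through as-is."""
    if wpm not in (0, -1):
        wpm = max(50, min(250, wpm))
    return f"\\Spd={wpm}\\"
```

For example, `spd_tag(300)` yields \Spd=250\ rather than an out-of-range value the engine might reject or misread.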

Using the Vol Control Tag to Adjust Playback Volume

The Vol control tag can be used to adjust the baseline volume of the TTS playback. The value can range from 0 (the quietest) to 65535 (the loudest). The actual volume hovers above and below the value set by Vol. Make the changes to the text shown in Listing 18.8 and press TextData to hear the results.


Listing 18.8. Testing the Vol control tag.
\Spd=150\ \Vol=30000\
I \Emp\told you never to go running in the street.

\pau=1000\ \Spd=75\ \Vol=65000\
Didn't you \Emp\hear me?

\Vol=15000\ \pau=2000\ \Spd=200\
You must listen to me when I tell you something \Emp\important.
\Spd=150\ \Vol=65000\

The Low-Level TTS Control Tags

There are seven low-level TTS control tags, which are used to handle TTS adjustments not normally seen by TTS users. Most of these control tags are meant to be used by people who are designing and training complex TTS engines and grammars.

Of the seven low-level TTS control tags, only one is used frequently: the \Rst\ tag. This tag resets the control values to those that existed at the start of the current session.

The remaining control tags are summarized in Table 18.3.

Table 18.3. The low-level TTS control tags.
Control Tag    Syntax    Description
Com    \Com=string\    Use this tag to add comments to the text passed to the TTS engine. These comments are ignored by the TTS engine.
Eng    \Eng;[GUID]:command\    Use this tag to call an engine-specific command. This can be used to call special hardware-specific commands supported by third-party TTS engines.
Mrk    \Mrk=number\    Use this tag to fire the BookMark event of the ITTSBufNotifySink. You can use this to signal such things as page turns or slide changes once the place in the text is reached.
Prn    \Prn=text=IPA\    Use this tag to embed custom pronunciations of words using the International Phonetic Alphabet. This may not be supported by your engine.
Pro    \Pro=number\    Use this tag to turn the TTS prosody rules on and off. Setting the Pro value to 1 turns the rules on; setting it to 0 turns the rules off.
Prt    \Prt=string\    Use this tag to tell the engine what part of speech the current word is. Microsoft has defined these general categories:
Abbr (abbreviation)
Adj (adjective)
Adv (adverb)
Card (cardinal number)
Conj (conjunction)
Cont (contraction)
Det (determiner)
Interj (interjection)
N (noun)
Ord (ordinal number)
Prep (preposition)
Pron (pronoun)
Prop (proper noun)
Punct (punctuation)
Quant (quantifier)
V (verb)

Note
With the exception of the \Rst\ tag, none of the other tags produced noticeable results using the TTS engine that ships with Microsoft Voice. For this reason, there are no examples for these control tags.

Now that you know how to modify the way the TTS engine processes input, you are ready to learn how to use grammar rules to control the way SR engines behave.

Grammar Rules

The grammar of the SR engine controls how the SR engine interprets audio input. The grammar defines the objects for which the engine will listen and the rules used to analyze the objects. SR engines require that one or more grammars be loaded and activated before an engine can successfully interpret the audio stream.

As mentioned in earlier chapters, the SAPI model defines three types of SR grammars: context-free grammars, dictation grammars, and limited-domain grammars.

The context-free grammar format is the most commonly used format. It is especially good at interpreting command and control statements from the user. Context-free grammars also allow a great deal of flexibility since the creation of a set of rules is much easier than building and analyzing large vocabularies, as is done in dictation grammars. By defining a small set of general rules, the SR engine can successfully respond to hundreds (or even thousands) of valid commands-without having to actually build each command into the SR lexicon. The rest of this section deals with the design, compilation, and testing of context-free grammars for the SAPI SR engine model.

General Rules for the SAPI Context-Free Grammar

The SAPI Context-Free Grammar (CFG) operates on a limited set of rules. These rules are used to analyze all audio input. In addition to rules, CFGs also allow for the definition of individual words. These words become part of the grammar and can be recognized by themselves or as part of a defined rule.

Note
Throughout the rest of this section, you will be using NOTEPAD.EXE (or some other ASCII editor) to create grammar files that will be compiled using the GRAMCOMP.EXE grammar compiler that ships with the Microsoft Speech SDK. You will also need the SRTEST.EXE application that ships with the Speech SDK to test your compiled grammars. Even if you do not have the Microsoft Speech SDK, however, you can still learn a lot from this material.

Defining Words in a CFG

In SAPI CFGs, each defined word is assigned a unique ID number. This is done by listing each word, followed by a number. Listing 18.9 shows an example.


Listing 18.9. Defining words for a CFG file.
//
// defining names
//
Lee = 101 ;
Shannon = 102 ;
Jesse = 103 ;
Scott = 104 ;
Michelle = 105 ;
Sue = 106 ;

Notice that there are spaces between each item on the line. The Microsoft GRAMCOMP.EXE program requires that each item be separated by white space. Also note that a semicolon (;) must appear at the end of each definition.

Tip
If you are using the GRAMCOMP.EXE compiler, you are not required to define each word and give it a number. The GRAMCOMP.EXE program automatically assigns a number to each new word for you. However, it is a good idea to predefine words to prevent any potential conflicts at compile time.

The list of words can be as short or as long as you require. Keep in mind that the SR engine can only recognize words that appear in the vocabulary. If you fail to define the word "Stop," you can holler Stop! to the engine as long as you like, but it will have no idea what you are saying! Also, the longer the list, the more likely it is that the engine will confuse one word for another. As the list increases in size, the accuracy of the engine decreases. Try to keep your lists as short as possible.
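If you keep your vocabulary in a list, you can generate the definition lines mechanically and guarantee that no two words share an ID. The Python sketch below is illustrative only (the function name and starting ID are arbitrary); it emits lines in the same `Word = ID ;` form shown in Listing 18.9.

```python
def word_defs(words, start=101):
    """Emit one 'Word = ID ;' definition line per vocabulary word,
    assigning sequential IDs so no two words collide."""
    return [f"{word} = {start + i} ;" for i, word in enumerate(words)]
```

For example, `word_defs(["Lee", "Shannon", "Jesse"])` reproduces the first three lines of Listing 18.9.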

Defining Rules in a CFG

Along with words, CFGs require rules to interpret the audio stream. Each rule consists of two parts-the rule name and the series of operations that define the rule:

<RuleName> = [series of operations] ;

There are several possible operations within a rule. You can call another rule, list a set of recognizable words, or refer to an external list of words. There are also several special functions defined for CFGs. These functions define interpretation options for the input stream. There are four CFG functions recognized by the GRAMCOMP.EXE compiler: alt(), seq(), opt(), and rep().

Using the alt() Rule Function

When building a rule definition, you can tell the SR engine that only one of the items in the list is expected. Listing 18.10 shows how this is done.


Listing 18.10. An example of the alt() rule function.
<Names> = alt(
Scott
Wayne
Curt
)alt ;

The <Names> rule in Listing 18.10 defines three alternative names for the rule. This tells the SR engine that only one of the names will be spoken at a single occurrence.

Using the seq() Rule Function

You can also define a rule that indicates the sequence in which words will be spoken. Listing 18.11 shows how you can modify the <Names> rule to also include last names as part of the rule.


Listing 18.11. An example of the seq() rule function.
<Names> = alt(
    Scott
    seq( Scott Ivey )seq
    Wayne
    seq( Wayne Ivey )seq
    Curt
    seq( Curt Smith )seq
    )alt ;

The <Names> rule now lists six alternatives. Three of them include two-word phrases that must be spoken in the proper order to be recognized. For example, users could say Scott or Scott Ivey, and the SR engine would recognize the input. However, if the user said Ivey Scott, the system would not understand the input.

Using the opt() Rule Function

You can define rules that show that some of the input is optional-that it may or may not occur in the input stream. The opt() function can simplify rules while still giving them a great deal of flexibility. Listing 18.12 shows how to apply the opt() function to the <Names> rule.


Listing 18.12. An example of the opt() rule function.
<Names> = alt(
    seq( Scott opt( Ivey )opt )seq
    seq( Wayne opt( Ivey )opt )seq
    seq( Curt opt( Smith )opt )seq
    )alt ;

The <Names> rule now has only three alternative inputs again. This time, each input has an optional last name to match the first name.

Using the rep() Rule Function

The rep() rule function can be used to tell the SR engine to expect more than one of the objects within the context of the rule. A good example would be the creation of a phone-dialing rule. First, you can define a rule that dials each phone number (see Listing 18.13).


Listing 18.13. A phone-dialing rule.
<Dial> = alt(
    3215002
    4975501
    3336363
    )alt ;

Listing 18.13 meets all the requirements of a well-formed rule, but it has some problems. First, SR engines are not very good at recognizing objects such as "3336363" as individual words. Second, this list can easily grow to tens, even hundreds, of entries. As the list grows, accuracy will decrease, especially since it is likely that several phone numbers will sound alike.

Instead of defining a rule that contains all the phone numbers, you can define a rule using the rep() function that tells the engine to listen for a set of numbers. Listing 18.14 is an improved version of the <Dial> rule.


Listing 18.14. An improved Dial rule.
<Dial> = alt(
    seq( Dial rep( <Numbers> )rep )seq
    )alt ;

<Numbers> = alt(
    zero
    one
    two
    three
    four
    five
    six
    seven
    eight
    nine
    )alt ;

Now the <Dial> rule knows to wait for a series of numbers. This allows it to be used for any possible combination of digits that can be used to dial a telephone number.

Tip
The <Dial> rule described here is still not very good. The SR engine has a hard time interpreting long sets of numbers. It is better to define words that will aid in the dialing of phone numbers. For example, Dial New York Office is more likely to be understood than Dial 1-800-555-1212.
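If your application must still hand a raw phone number to the SR side, you can at least convert it into the digit words the <Numbers> rule listens for. This Python sketch is illustrative (the names are invented) and assumes punctuation in the dial string should simply be skipped.

```python
# Index position doubles as the digit value.
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

def spoken_digits(number):
    """Convert a dialable string such as '1-800-555-1212' into the
    sequence of digit words defined by the <Numbers> rule, ignoring
    any non-digit characters."""
    return [DIGIT_WORDS[int(ch)] for ch in number if ch.isdigit()]
```

For example, the first entry of Listing 18.13 expands to the word sequence "three two one five zero zero two".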

Using Run-Time Lists with CFG Rules

You can define a rule that uses the contents of a list built at run-time. This allows the SR engine to collect information about the workstation (loadable applications, available Word documents, and so on) while the system is up and running rather than having to build everything into the grammar itself. The list name is added to the rule surrounded by braces ({}). At run-time, programmers can use the SetList method of the Voice Menu object in the OLE library to create and populate the list. Listing 18.15 shows how to build a rule that refers to a run-time list.


Listing 18.15. An example of referring to a run-time list.
<RunProgram> = alt( seq( Run {ProgList} )seq )alt ;

<RunProgram> allows the user to say "Run name" where name is one of the program names that was loaded into the list at run-time.

Creating and Compiling a SAPI Context-Free Grammar

Now that you know the basic building blocks used to create CFGs, it is time to build and compile actual grammar rules using NOTEPAD.EXE and the GRAMCOMP.EXE grammar compiler that ships with the Microsoft Speech SDK. The first step in the process is to define the general scope and function of the grammar. For example, you might want a grammar that can handle typical customer requests for directions in a shopping center.

Once you define the scope and function of a grammar, you need to identify the words and rules needed to populate the CFG. Since an SR engine can only recognize words it already knows, you must be sure to include all the words needed to complete operations.

Tip
You do not, however, need to include all the possible words users may utter to the SR engine. The software package that is using the SR engine should have some type of error response in cases where the audio input cannot be interpreted.

To use the shopping center example, you'd need a list of all the locations that users might request. This would include all the stores, restaurants, major landmarks within the building, public services such as restrooms, exits, drinking fountains, security office, and so on. Then you need to collect a set of typical phrases that you expect users to utter. Examples might be "Where is the ....?" or "Show me ... on the map," or "How can I locate the ....?" After you have collected all this material, you are ready to create the grammar.

Coding the MALL.TXT Context-Free Grammar

Let's assume you have the job of building a workstation that will allow shoppers to ask directions in order to locate their favorite shops within the mall. Listing 18.16 shows a list of the store names and some other major landmarks in the mall. Load NOTEPAD.EXE and enter this information into a file called MALL.TXT.


Listing 18.16. Adding words to the MALL grammar.
// ********************************************************
// MALL GRAMMAR RULES
// ********************************************************
//
// Title:    MALL.TXT
// Version:    1.0 - 05/16/96 (MCA)
//
// Site:    Win95 SAPI
// Compiler:    GRAMCOMP.EXE
//
// Desc:    Used to direct customers to their favorite
//        shops in the mall.
//
// ********************************************************


//
// define words
//
J = 9900 ;
C = 9901 ;
Penneys = 9902 ;
Sears = 9903 ;
Bobs = 9904 ;
Bagels = 9905 ;
Michelles = 9906 ;
Supplies = 9920 ;
The = 9907 ;
Sports = 9908 ;
Barn = 9909 ;

Security = 9910 ;
Office = 9911 ;
Main = 9912 ;
Food = 9913 ;
Court = 9914 ;
Shops = 9915 ;
Specialty = 9916 ;

Exits = 9917 ;
Restroom = 9918 ;
Fountain = 9919 ;

Next, you need to define a top-level rule that calls all other rules. This one rule should be relatively simple and provide for branches to other more complicated rules. By creating branches, you can limit SR errors since the possible words or phrases are limited to those defined in a branch. In other words, by creating branches to other rules, you limit the scope of words and rules that must be analyzed by the SR engine at any one moment. This improves accuracy.

Listing 18.17 shows a top-level rule that calls several other possible rules. Add this to your MALL.TXT grammar file.


Listing 18.17. Adding the top-level rule to the MALL grammar file.
// **************************************
// Define starting rule
//
// This rule calls any one of the other
// internal rules.
//
<Start> = alt(
    <_Locations>
    <_TellMeWhere>
    <_HowCanIFind>
    <_ShowMe>
    <_WhereIs>
    )alt ;

Notice that each of the rules called by <Start> begins with an underscore (_). This underscore tells the compiler that this is an internal rule and should not be exported to the user. The more exported rules the SR engine has to review, the greater the chance of failure. It is a good idea to limit the number of exported rules to a bare minimum.

The first internal rule on the list is called <_Locations>. This rule contains a list of all the locations that customers may ask about. Notice the use of seq() and opt() in the rule. This allows customers to ask for the same locations in several different ways without having to add many items to the vocabulary. Enter the data shown in Listing 18.18 into the MALL.TXT grammar file.


Listing 18.18. Adding the Locations rule.
// *************************************
// Define Locations rule
//
// This rule lists all possible locations
//
<_Locations> = alt(
    // JC Penneys, Penneys
    seq( opt( seq( J C )seq )opt Penneys )seq

    // sears
    Sears

    // Bobs, Bobs Bagels
    seq( Bobs opt( Bagels )opt )seq

    // Michelles, Michelles Supplies
    seq( Michelles opt( Supplies )opt )seq

    // The Sports Barn, Sports Barn
    seq( opt( The )opt Sports Barn )seq

    // Security, Security Office
    seq( Security opt( Office )opt )seq

    // Main Office
    seq( Main Office )seq

    // Food, Food Court, Food Shops
    seq( Food opt( alt( Court Shops )alt )opt )seq

    // Specialty Shops
    seq( Specialty Shops )seq

    // Exits
    Exits

    // Restroom
    Restroom

    // Fountain
    Fountain

    )alt ;

The last step is to build a set of query rules. These are rules that contain the questions customers will commonly ask of the workstation. Each of these questions is really a short phrase followed by a store or location name. Listing 18.19 shows how you can implement the query rules defined in the <Start> rule.


Listing 18.19. Adding the query rules to the MALL.TXT grammar file.
// *************************************
// Define simple queries
//
// These rules respond to customer
// queries
//
<_TellMeWhere> = seq( Tell me where <_Locations> is )seq ;
<_HowCanIFind> = seq( How can I find opt( the )opt <_Locations> )seq ;
<_ShowMe> = seq( Show me opt( where )opt <_Locations> opt( is )opt )seq ;
<_WhereIs> = seq( Where is <_Locations> )seq ;

//
// eof
//

Notice the use of the opt() functions to widen the scope of the rules. These add flexibility to the grammar without adding extra rules.
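One way to see how much coverage opt() buys is to enumerate every phrase a rule accepts. The Python sketch below models a miniature version of the grammar notation as nested tuples; it mirrors, but is not, the GRAMCOMP format, and the function name is invented for illustration.

```python
from itertools import product

def expand(node):
    """Enumerate every phrase a miniature alt/seq/opt rule accepts.
    Nodes are ('alt', ...), ('seq', ...), ('opt', child), or plain
    word strings."""
    if isinstance(node, str):
        return [node]
    kind, *kids = node
    if kind == "alt":                      # exactly one child matches
        return [p for k in kids for p in expand(k)]
    if kind == "seq":                      # children match in order
        combos = product(*(expand(k) for k in kids))
        return [" ".join(w for w in c if w) for c in combos]
    if kind == "opt":                      # child may be absent
        return [""] + expand(kids[0])
    raise ValueError(f"unknown node kind: {kind}")
```

Modeling the <_ShowMe> rule with just two locations, `expand(("seq", "Show me", ("opt", "where"), ("alt", "Sears", "Exits"), ("opt", "is")))` yields eight distinct phrases from a single rule, which is exactly the flexibility the opt() functions provide.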

Note
Be sure to save the file as MALL.TXT before you continue on to the next step.

Compiling the MALL.TXT Grammar File

After you have constructed the grammar file, you are ready to compile it into a binary form understood by the SAPI SR engine.

Note
To do this, you need to use the GRAMCOMP.EXE program that ships with the Speech SDK. You can find this program in the SpeechSDK\BIN folder that was created when you installed the Speech SDK.

The GRAMCOMP.EXE program is a command-line application with no defined window. To run the program, open an MS-DOS window and move to the directory that contains the GRAMCOMP.EXE file. Then type the following on the command line:

gramcomp /ansi mall.txt mall.grm<return>

Warning
If you are using WinNT, do not include the /ansi portion of the command. This is needed only for Win95 workstations that do not support the default Unicode compilation mode.

You may need to include the directory path to locate the MALL.TXT file. Once the compiler is running, it reads the MALL.TXT file, compiles it into binary format, and saves the result as MALL.GRM. Your screen should look something like the one in Figure 18.2.

Figure 18.2 : Compiling the MALL grammar.

You should get a message telling you that one rule has been exported (Start). If you receive error messages, return to the MALL.TXT file to fix them and recompile. Once you complete compilation successfully, you are ready to test your grammar using SRTEST.EXE.

Loading and Testing SAPI Context-Free Grammars

You can test your new grammar by loading it into the SRTEST.EXE application that ships with the Speech SDK. The ANSI version of the program is in the SPEECHSDK\BIN\ANSI folder; the Unicode version is in the SPEECHSDK\BIN\UNICODE folder.

Once you load the grammar, you can use the same software to test the SR engine's response to your spoken queries.

Loading and Activating the MALL Grammar

When you first start the program, press the Add Mode button to select an engine mode. You should then see one or more modes available. It does not matter which one you pick as long as it supports the same language you used to build the grammar (see Figure 18.3).

Figure 18.3 : Selecting an SR mode.

After selecting the SR mode, you need to load the new MALL.GRM file. To do this, press the Rescan Files button and enter the directory path that contains the MALL.GRM grammar file (see Figure 18.4).

Figure 18.4 : Loading the MALL.GRM grammar file.

You will see the MALL.GRM file appear in the list of available grammars. Double-click the name to load it into the SR engine.

Next, you need to activate the MALL grammar. To do this, select the Grammar option button on the left side of the screen and bring up the ISRGramCom tab of the form. Set the Rule combo box to Start and the Window combo box to MainWnd and press Activate. The MALL.GRM grammar should activate, and your screen should look like the one in Figure 18.5.

Figure 18.5 : Activating the MALL grammar.

The status box at the bottom of the form should contain messages like the ones in Listing 18.20.


Listing 18.20. Messages showing successful grammar activation.
Grammar object created successfully.
Grammar mall.grm activated, hwnd: c0c, pause: False, rule: Start,

Testing the MALL Grammar

You are now ready to test the grammar by speaking to your system. The responses will appear in the status box at the lower left of the form.

For example, ask your system the following question: How can I find Sears? You should see the application flash a few status messages across the bottom of the screen and then return with the selected response. The message in the lower portion of the screen should look like the one in Figure 18.6.

Figure 18.6 : Testing the MALL grammar.

You can experiment with the grammar by speaking phrases and watching the response. You can also make changes to your MALL.TXT file and recompile it to add new features or refine the grammar to meet your needs.

International Phonetic Alphabet

The Unicode versions of the SAPI model can support the use of the International Phonetic Alphabet (IPA) to aid in the analysis and pronunciation of words. The IPA is a standardized set of symbols for documenting phonemes.

On IPA-supported TTS systems, the IPA values can be used to adjust the pronunciation of troublesome words. This is done by associating the Unicode strings with the word in the dictionary. Some TTS engines will store this set of IPA codes as a permanent part of the dictionary. In this way, the TTS system can be refined over time to handle difficult words.

Note
Since the IPA system is only supported through Unicode, Win95 systems do not support the IPA. You can check for IPA support on WinNT-based speech systems by inspecting the TTSMODEINFO structure of TTS engines and the ILexPronounce interface of the SR engine.

SR systems that support IPA will allow users to enter IPA codes into the SR lexicon as a way of teaching the system how some words will sound. These IPA values are then used to match audio input to words in the SR engine's vocabulary, thereby improving recognition performance.

The IPA defines a common set of English consonants and vowels, along with more complex sets of phonemes used to further describe English language sounds. Table 18.4 lists the IPA consonants and vowels with their associated Unicode values.

Table 18.4. IPA consonants and vowels.
Consonant          Examples                Unicode values
b                  big, able, tab          U+0062
ch                 chin, archer, march     U+0074 U+0283
d                  dig, idea, wad          U+0064
f                  fork, after, if         U+0066
g                  gut, angle, tag         U+0261
h                  help, ahead, hotel      U+0068
j                  joy, agile, edge        U+0064 U+0292
k                  cut, oaken, take        U+006B
l                  lid                     U+006C
                   elbow, sail             U+026B
m                  met, amid, aim          U+006D
n                  no, end, pan            U+006E
ng                 sing, anger, drink      U+014B
p                  put, open, tap          U+0070
r                  red, part, far          U+0072
s                  sit, cast, toss         U+0073
sh                 she, cushion, wash      U+0283
t                  talk, sat               U+0074
                   meter                   U+027E
th                 thin, nothing, truth    U+03B8
dh                 then, father, scythe    U+00F0
v                  vat, over, have         U+0076
w                  with, away, wit         U+0077
z                  zap, lazy, haze         U+007A
zh                 azure, measure          U+0292

Vowel              Examples                Unicode values
Neutral (schwa)    ago, comply             U+0259
a                  at, carry, gas          U+00E6
                   ate, day, tape          U+0065
                   ah, car, father         U+0251
e                  end, berry, ten         U+025B
                   eve, be, me             U+0069
i                  is, hit, lid            U+026A
                   ice, bite, high         U+0061 U+026A
o                  own, tone, go           U+006F
                   look, pull, good        U+028A
                   tool, crew, moo         U+0075
                   oil, coin, toy          U+0254
                   out, how, our           U+0061 U+028A
u                  up, bud, cut            U+028C
                   urn, fur, meter         U+025A
y                  yet, onion, yard        U+006A

The IPA system also defines a set of phonemes that describe the various complex sounds of a language. There are several general categories of sounds. Each has its own set of Unicode characters associated with it. The basic sound categories are

Additional information on the IPA and its use can be found in the appendix section of the Microsoft Speech SDK documentation.

Summary

In this chapter, you learned about three aspects of the SAPI interface that are usually not seen by the average user. You learned the following:

You learned that Microsoft has defined 15 control tags and that they fall into three general categories:

You learned that the SAPI model supports three types of grammar: context-free, dictation, and limited-domain. You learned how to create your own context-free grammar with defined words and rules. You compiled and tested that grammar, too.

You learned that the context-free grammar compiler supplied by Microsoft supports the definition of words, rules, and external lists that can be filled at run-time. You also learned that there are four compiler functions:

You designed, coded, compiled, and tested a grammar that could be used to support a voice-activated help kiosk at a shopping center.

Finally, you learned about the International Phonetic Alphabet (IPA) and how Unicode-based speech systems can use IPA to improve TTS and SR engine performance.