


34


Writing CGI Scripts


by William Robert Stanek

Using CGI scripts, you can create powerful, personalized, and professional Web publications that readers can really interact with. CGI scripts are external programs that act as gateways between the Web server and other applications. You can use CGI scripts to process input from readers and thus open a two-way communication channel with your readers. Reader input can be data from fill-out forms, keywords for a database query, or values that describe the reader's browser and connection.

Your CGI scripts can use this input to add entries to an index, to search databases, to create customized documents on the fly, and much more. Yet the most wonderful thing about CGI scripts is that they hide their complexities from users. If you've used a fill-out form or an image map on the Web, you've probably used a gateway script and probably didn't even know it. This is because everything seems to happen automatically. You enter data, click a mouse button, and a moment later a result is displayed. Learning what actually happens between the click of the mouse button and the display of the result is what this chapter is all about. This chapter explains what you need to know about CGI scripts—what they are, how to use them, and why to use them.

Although FrontPage enables you to easily add WebBots to pages that use forms, WebBots generally do not perform any post-submission processing. With CGI scripts, you can process input from forms automatically and generate output directly to the reader based on the results of the processing.

What Are CGI Scripts?


CGI scripts are external programs that run on the Web server. You can use CGI scripts to create highly interactive Web publications. The standard that defines how external programs are used on Web servers and how they interact with other applications is the common gateway interface. The three keywords that comprise the name of the standard—common, gateway, and interface—describe how the standard works:

The developers of CGI worked these key concepts into the CGI standard to create a powerful and extendible advanced feature for Web publishers that shields readers of your publications from its complexities. The reader need only click on an area of an image map or submit their fill-out form after completing it. Everything after the click of the mouse button seems to happen automatically, and the reader doesn't have to worry about the how or why. As a Web publisher, understanding how CGI scripts work is essential, especially if you want to take advantage of the ways CGI can be used to create powerful Web publications.

Although the reader sees only the result of their submission or query, behind the scenes many things are happening. Here is a summary of what is taking place:

  1. The reader's browser passes the input to the Web server.
  2. The server, in turn, passes the input to a CGI script.
  3. The CGI script processes the input, passes it off to another application if necessary, and sends the output to the Web server.
  4. The Web server passes the output back to the reader's browser. The output from a CGI script can be anything from the results of a database search to a completely new document generated based on the reader's input.

On UNIX systems, CGI scripts are located in a directory called cgi-bin in the usr file system and CGI utilities are located in a directory called cgi-src in the usr file system. On other systems, your Web server documentation will explain in what directories CGI scripts and utilities should be placed.

Choosing a Programming Language for Your CGI Scripts


CGI scripts are also called gateway scripts. The term script comes from the UNIX environment, in which shell scripts abound, but gateway scripts don't have to be in the format of a UNIX script. You can write gateway scripts in almost any computer language that produces an executable file. The most common languages for scripts are

Two up-and-coming scripting languages are

The best programming language to write your script in is one that works with your Web server and meets your needs. Preferably, the language should already be available on the Web server and you should be proficient in it (or at least have some knowledge of the language). Keep in mind, most user input is in the form of text that must be manipulated in some way, which makes support for text strings and their manipulation critically important.

The easiest way to determine if a language is available is to ask the Webmaster or system administrator responsible for the server. As most Web servers operate on UNIX systems, you might be able to use the following UNIX commands to check on the availability of a particular language:

You can use either which or whereis on UNIX systems. You would type which or whereis at the shell prompt and follow the command by a keyword on which you want to search, such as the name of the programming language you want to use. To see if your UNIX server supports Perl, you could type either




which perl

or




whereis perl
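The same check can be scripted. The following is a minimal sketch in Python, which is used here only for illustration and is not one of the languages this chapter covers. It relies on the standard library's shutil.which, which searches the directories on your PATH much as the which command does. The function name find_interpreters is hypothetical.

```python
import shutil


def find_interpreters(names):
    """Map each command name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # Report which of the languages discussed in this chapter are installed.
    for name, path in find_interpreters(["perl", "sh", "csh", "ksh"]).items():
        print(f"{name}: {path or 'not found'}")
```

A result of None for a name means only that the command is not on the search path of the account running the check; the system administrator remains the authoritative source.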

As Perl, C/C++, and UNIX shell are the most popular languages for scripts, the sections that follow will look briefly at these languages, with emphasis on why and when to use them. Each section contains a checklist for features and systems supported, which can be interpreted as follows:

The sections on common scripting languages are followed by close-ups on the newest scripting languages: JavaScript and VBScript. Both JavaScript and VBScript are hot topics on the Net right now. If you want to be on the cutting edge of Internet technologies, these are languages you want to keep both eyes on.

Using UNIX Shell


The UNIX operating system is in wide use in business, education, and research sectors. There are almost as many variations of the UNIX operating system as there are platforms that use it. You will even find that platforms produced by the same manufacturer use different variants of the UNIX operating system. For example, DEC has variants for the Dec-Alpha, Decstation, and Dec OSF.

What these operating systems have in common is the core environment on which they are based. Most UNIX operating systems are based on Berkeley UNIX (BSD), AT&T System V, or a combination of BSD and System V. Both BSD and System V support three shell scripting languages:


TIP

You can quickly identify the shell scripting language used by examining the first line of a script. Bourne shell scripts generally have this first line:




#!/bin/sh

C shell scripts generally have a blank first line or the following:




#!/bin/csh

Korn shell scripts generally have this first line:




#!/bin/ksh
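As an illustration of this tip, the sketch below (in Python, chosen only for illustration) inspects the first line of a script and reports which of the three shells it names. The function name identify_shell is hypothetical, and the sketch assumes the interpreter path is the first word after the #! characters. A blank first line, which the tip notes is common for C shell scripts, cannot be identified this way and yields None.

```python
def identify_shell(first_line):
    """Return the shell named on a script's first line, or None if unknown."""
    known = {"sh": "Bourne shell", "csh": "C shell", "ksh": "Korn shell"}
    line = first_line.strip()
    if not line.startswith("#!"):
        return None  # blank line or ordinary command: nothing to go on
    parts = line[2:].split()
    if not parts:
        return None
    # Keep only the final path component, e.g. "/bin/ksh" -> "ksh".
    interpreter = parts[0].rsplit("/", 1)[-1]
    return known.get(interpreter)
```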


All UNIX shells are interpreted languages, which means the scripts you create do not have to be compiled. Bourne shell is the most basic shell. C shell is an advanced shell with many features of the C programming language. Because Bourne shell uses a completely different syntax than C shell, scripts written for Bourne shell are not compatible with C shell. If you create a script in Bourne shell and later want to use C shell to interpret the script, you must rewrite the script for C shell.

Programmers often want to merge the simplicity of Bourne shell with the advanced features of C shell, and this is where Korn shell comes in handy. Korn shell has the same functionality as the Bourne shell and also incorporates many features of the C shell. Any scripts you've written in Bourne shell can be interpreted directly by the Korn interpreter. This saves you from rewriting a script when you later find you want to use a feature supported by Korn. Although the Korn shell is gaining popularity, Bourne and C shell are the two most widely used UNIX shells.

Some differences in Bourne, C, and Korn shell are visible only if you are at the shell prompt and using a particular shell. You can change your current shell any time from the shell prompt by typing:

Usually, you will see visible differences between the various shells immediately. For example, the default command prompt for Bourne shell is the dollar sign, while the default command prompt for C shell is usually your host name and user ID followed by a colon. Beyond this, C shell supports a history function, aliasing of commands, and many other controls that the Bourne shell does not. However, to the CGI programmer, these differences are generally not important. Your primary concern should be the features that the shell directly supports and how scripts behave when executed in it.

Bourne shell is the smallest of the shells and the most efficient. Consequently, a Bourne shell script will generally execute faster and use less system resources. When you want more advanced features, such as arrays, you will want to use Korn shell. Korn shell has more overhead than Bourne shell and requires slightly more system resources. When you want to make advanced function calls or assignments, you will want to use C shell. Because C shell is larger than Bourne and Korn shell, scripts written in C shell generally have higher overhead and use more system resources.

Although UNIX shells have good built-in facilities for handling text, such as sed, awk, and grep, they are not as powerful or extensible as traditional programming languages. You should consider using shell scripts when you want to perform simple tasks and moderately advanced text or file manipulation.

Using C/C++


When you want your scripts to perform complex tasks, you call in the big guns. Two of the most advanced languages used in CGI scripts are C and C++. C is the most popular programming language in use today. C++ is the object-oriented successor to C. Both C and C++ are advanced programming languages that require you to compile your scripts before you can use them. A major advantage of C and C++ is that they enjoy widespread use, and versions are available for virtually every operating system you can think of.

The primary time to use C (rather than C++) is when your scripts must execute swiftly and use minimal system resources. C was developed more than 20 years ago, and has been gaining popularity ever since. CGI programmers use C because compiled C programs are very small—tiny compared to programs with similar functionality programmed in other languages. Small programs use minimal system resources and execute quickly. However, C is a very complex language with difficult-to-use facilities for manipulating text. Therefore, if you are not proficient in C, you should be wary of using C to perform advanced text string processing.

The primary time to use C++ is when certain functions of your scripts will be reused and when long-term development costs are a major concern. C++ is an object-oriented language that enables you to use libraries of functions. These functions form the core of your CGI scripts and can be reused in other CGI scripts. For example, you can use one function to sort the user's input, another function to search a database using the input, and another function to display the output as an HTML document. However, C++ is an object-oriented language that is very different from other languages. If you have not used an object-oriented language before, are not familiar with C, and plan to use C++ for your CGI scripts, you should be prepared for a steep learning curve.

Using Perl


If you want to be on the inside track of CGI programming, you should learn and use the Practical Extraction and Report Language (Perl). Perl combines elements of C with UNIX shell features such as awk, sed, and grep to create a powerful language for processing text strings and generating reports. Because most of the processing done by CGI scripts involves text manipulation, Perl is rapidly becoming the most widely used language for CGI scripts. As with C and C++, a major advantage of Perl is its widespread use. Versions of Perl are available for virtually every operating system you can think of. You can use Perl to perform the following tasks:

Perl, like Bourne and C shell, is an interpreted language. However, Perl does not have the limitations of most interpreted languages. You can use Perl to manipulate extremely large amounts of data, and you can quickly scan files using sophisticated pattern-matching techniques. Perl strings are not limited in size. The entire contents of a file can be used as a single string. Perl's syntax is similar to C's. Many basic Perl constructs, like if, for, and while statements, are used just as you use them in C.


TIP

Like a UNIX shell script, a Perl script usually names its interpreter in the first line. Therefore, the first line of a Perl script should specify the path to where Perl is installed on the system. This path is usually




#!/usr/local/perl

or




#!/usr/local/bin/perl


Perl is surprisingly easy to learn and use, especially if you know the basics of C or UNIX shell. Perl scripts are usually faster than UNIX shell scripts and slightly slower than compiled C/C++ scripts. You should use Perl whenever you have large amounts of text to manipulate.

Using JavaScript


JavaScript is a scripting language based on the Java programming language developed by Sun Microsystems. This powerful up-and-coming scripting language is being developed by Netscape Communications Corporation, and as you might have guessed, the Netscape Navigator 2.0/3.0 fully supports JavaScript.

Netscape Navigator 2.0/3.0 interprets JavaScript programs embedded directly in an HTML page, and just like Java applets, these programs are fully interactive. JavaScript can recognize and respond to mouse clicks, form input, and page navigation. This means your pages can "intelligently" react to user input. The JavaScript language resembles the Java programming language—with a few important exceptions, as you can see from the comparisons in the following lists:

JavaScript is

Java is

JavaScript is designed to complement the Java language and has some terrific features for Web publishers. You could create a JavaScript program that passes parameters to a Java applet. This would enable you to use the JavaScript program as an easy-to-use front-end for your Java applets. Further, because a Web publisher is not required to know about classes to use JavaScript and to pass parameters to a Java applet, JavaScript provides a simple solution for publishers who want to use the features of the Java language but don't want to learn how to program in Java.

This powerful up-and-coming scripting language is featured in Part IX, "JavaScript and Java."

Using VBScript


With VBScript, Microsoft proves once again that it understands the tools developers need. Visual Basic Script is a subset of Visual Basic and is used to create highly interactive documents on the Web. Similar to JavaScript, programs written in VBScript are embedded in the body of your HTML documents.

Visual Basic Script also enables dynamic use of OLE scripting management with ActiveX Controls. The Object Linking and Embedding of scripts enables Web publishers to dynamically embed VBScript runtime environments. Basically, this enables you to use VBScripts as plug-in modules. You can, for example, embed a VBScript program in your Web document that calls other VBScript programs to use as plug-ins. The exact plug-in calls could be dynamically selected based on user input.

This powerful up-and-coming scripting language is featured in Part VIII, "VBScript and ActiveX."

Why Use CGI Scripts?


At this point, you might be worried about having to program. You might also be wondering why you would want to use gateway scripts at all. These are valid concerns. Learning a programming language isn't easy, but as you will see later, you might never have to program at all. Dozens of ready-to-use CGI scripts are freely available on the Web. Often you can use these existing programs to meet your needs.

The primary reason to use CGI scripts is to automate what would otherwise be a manual and probably time-consuming process. Using CGI scripts benefits both you and your reader. The reader gets simplicity, automated responses to input, easy ways to make submissions, and fast ways to conduct searches. Gateway scripts enable you to automatically process orders, queries, and much more. CGI programs are commonly used for the following purposes:

FrontPage WebBots perform many of the things that CGI scripts are used for. In fact, the only common CGI tasks FrontPage has not automated are the last three items in the previous list.

How CGI Scripts Work


Gateway scripts are used to process input submitted by readers of your Web publications. The input usually consists of environment variables that the Web server passes to the gateway script. Environment variables describe the information being passed, such as the version of CGI used on the server, the type of data, the size of the data, and other important information. Gateway scripts can also receive command-line arguments and standard input. To execute a CGI script, the script must exist on the server you are referencing. You must also have a server that is both capable of executing gateway scripts and configured to handle the type of script you plan to use.

Readers pass information to a CGI script by activating a link containing a reference to the script. The gateway script processes the input and formats the results as output that the Web server can use. The Web server takes the results and passes them back to the reader's browser. The browser displays the output for the reader.

The output from a gateway script begins with a header containing a directive to the server. Currently there are three valid server directives: Content-type, Location, and Status. The header consists of a directive in the format of an HTTP header followed by a blank line. The blank line separates the header from the data you are passing back to the browser. Output containing a Location or Status directive is usually a single line. This is because the directive contained on the Location or Status line is all that the server needs, and when there is no subsequent data, you do not need to insert a blank line. The server interprets the output, sets environment variables, and passes the output to the client.
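The three directives can be sketched as small helper functions. This is a minimal illustration in Python (chosen only for illustration); the function names are hypothetical. As a conservative choice, each helper ends the header with a blank line, including the Location and Status cases.

```python
def content_response(body, content_type="text/html"):
    """Header with a Content-type directive, a blank line, then the data."""
    return f"Content-type: {content_type}\n\n{body}"


def location_response(url):
    """Header that tells the server to send the reader to another document."""
    return f"Location: {url}\n\n"


def status_response(code, reason):
    """Header that reports a status code and reason phrase to the server."""
    return f"Status: {code} {reason}\n\n"


if __name__ == "__main__":
    # A gateway script would write one of these to standard output.
    print(content_response("<H1>Thank you</H1>"), end="")
```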

Any transaction between a client and server has many parts. These parts can be broken down into the following eight steps:

  1. Client passes input to a server.
  2. Server sets environment variables pertaining to input.
  3. Server passes input as variables to the named CGI script.
  4. Server passes command line input or standard input stream to CGI script if present.
  5. Script processes input.
  6. Script returns output to the server. This output always contains a qualified header, and a body if additional data is present.
  7. Server sets environment variables pertaining to output.
  8. Server passes output to client.
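The steps above can be imitated on a single machine. The sketch below, in Python (chosen only for illustration), plays the part of the server: it sets environment variables, starts a stand-in gateway script as a separate process, feeds it the standard input stream, and splits the returned output into header and body. The names run_gateway and GATEWAY_SCRIPT are hypothetical, and a real Web server sets many more variables than the three shown.

```python
import os
import subprocess
import sys

# Stand-in gateway script, run by the current Python interpreter.
GATEWAY_SCRIPT = r'''
import os, sys
length = int(os.environ.get("CONTENT_LENGTH") or 0)
data = sys.stdin.read(length)                      # step 5: process input
sys.stdout.write("Content-type: text/plain\n\n")   # step 6: header, blank line
sys.stdout.write("method=" + os.environ["REQUEST_METHOD"] + " data=" + data)
'''


def run_gateway(method, body=""):
    """Play the server's role for one request; return (header, body)."""
    env = dict(os.environ)
    env.update({                                   # step 2: set variables
        "REQUEST_METHOD": method,
        "CONTENT_LENGTH": str(len(body.encode("ascii"))),
        "GATEWAY_INTERFACE": "CGI/1.1",
    })
    completed = subprocess.run(                    # steps 3 and 4: pass input
        [sys.executable, "-c", GATEWAY_SCRIPT],
        input=body, env=env, capture_output=True, text=True, check=True,
    )
    # Steps 7 and 8: the server reads the header, then relays the body.
    header, _, payload = completed.stdout.partition("\n\n")
    return header, payload
```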

FrontPage enables you to set properties for forms using the Form Properties box. You can access this box whenever you add a push button to a form or by double-clicking on a push button in a form. While the push button's Properties box is displayed, click on the Form button to display the Form Properties box. This dialog box has two main areas. The Form Handler area defines the type of handler that will process the input from the form. The Hidden Fields area defines form fields not visible to the user.

To use a CGI script, select the Custom ISAPI, NSAPI, or CGI Script form handler. Next, click the Settings button. This opens the Settings For Custom Form Handler dialog box shown in Figure 34.1. As you can see, this dialog box has three fields: Action, Method and Encoding Type. The next three sections discuss the values you can use for these fields.

Figure 34.1. Using CGI Scripts

The Action Field

The Action field specifies the action to be performed when a form is submitted. As a form without a defined action will not be processed in any way, you should always specify a value for the Action field. You can define an action for your forms as the URL to a gateway script to be executed or as an actual action.

By specifying the URL to a gateway script, you can direct input to the script for processing. The URL provides a relative or an absolute path to the script. Scripts defined with relative URLs are located on your local server. Scripts defined with absolute URLs can be located on a remote or local server. Most CGI scripts are located in the cgi-bin directory. You could access a script in a cgi-bin directory by setting the Action field to




http://tvp.com/cgi-bin/your_script

You can also use the Action field to specify an actual action to be performed. The only action currently supported is mailto, which enables you to mail the contents of a form to anyone using their e-mail address. Most current browser and server software supports the mailto value. To use the mailto value, set the Action field as follows:




mailto:name@host

Here, name is the user name and host is the host machine the user is located on, as in the following example:




mailto:publisher@tvp.com

A form created using the previous example would be sent to publisher@tvp.com. The mailto value provides you with a simple solution for forms that do not need to be directed to a CGI script to be processed. This is great news for Web publishers who don't have access to CGI and can't use FrontPage server extensions. As the contents of the form are mailed directly to an intended recipient, the data can be processed off-line as necessary. You should consider using the mailto value for forms that don't need immediate processing and when you don't have access to CGI or FrontPage server extensions but would like to use forms in your Web publications.

The Method Field

The Method field specifies the way the form is submitted. There are currently two acceptable values:

The preferred submission method is POST, the default value used by FrontPage. POST sends the data as a separate input stream via the server to your gateway script. This enables the server to pass the information directly to the gateway script without assigning variables or arguments. The value of an environment variable called CONTENT_LENGTH tells the CGI script how much data to read from the standard input stream. Using this method, there is no limit on the amount of data that can be passed to the server.
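Reading POST data can be sketched as follows. This is a minimal illustration in Python (chosen only for illustration); the function name read_post_data is hypothetical. It reads exactly the number of bytes that CONTENT_LENGTH announces and no more.

```python
import io


def read_post_data(environ, stream):
    """Read CONTENT_LENGTH bytes of form data from the input stream."""
    # An unset or empty CONTENT_LENGTH means no data was sent.
    length = int(environ.get("CONTENT_LENGTH") or 0)
    return stream.read(length)


if __name__ == "__main__":
    # Simulated request: the stream holds more than the announced 8 bytes.
    fake_input = io.BytesIO(b"name=Ann&ignored")
    print(read_post_data({"CONTENT_LENGTH": "8"}, fake_input))
```

In a real gateway script you would pass os.environ and sys.stdin.buffer in place of the simulated values.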

GET appends the form data to the script URL. The script URL and the data are passed to the server as a single URL-encoded input. The server receiving the input passes it to two variables: the script URL to SCRIPT_NAME and the data to QUERY_STRING.

Assigning the data to variables on a UNIX system means passing the data through the UNIX shell. The number of characters you can send to UNIX shell in a single input is severely limited. Some servers restrict the length of this type of input to 255 characters. This means you can append only a limited amount of data to a URL before truncation occurs. You lose data when truncation occurs, and losing data is a bad thing. Consequently, if you use GET, you should always ensure that the length of data input is small.
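A script that receives GET data might guard against oversized input before decoding it. The sketch below is in Python (chosen only for illustration); the function name parse_query is hypothetical, and the 255-character default simply echoes the limit mentioned above rather than any particular server's setting. Decoding is done by the standard library's urllib.parse.parse_qs.

```python
from urllib.parse import parse_qs


def parse_query(environ, max_length=255):
    """Decode QUERY_STRING into a dict that maps field names to value lists."""
    query = environ.get("QUERY_STRING", "")
    if len(query) > max_length:
        # Refuse rather than work with data that may have been truncated.
        raise ValueError("query string too long; use POST for this form")
    return parse_qs(query)


if __name__ == "__main__":
    print(parse_query({"QUERY_STRING": "name=Ann+Lee&topic=cgi"}))
```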

The Encoding Type Field

The Encoding Type field specifies the MIME content type for encoding the form data. The client encodes the data before passing it to the server. The reason for encoding the data from a fill-out form is not to prevent the data from being read, but rather to ensure that input fields can be easily matched to key values. By default, the data is encoded as application/x-www-form-urlencoded. This encoding is also called URL encoding. If you do not specify an encoding type, the default value is used automatically.

Although in theory you can use any valid MIME type, such as text/plain, most forms on the Web use the default encoding, application/x-www-form-urlencoded. MIME stands for Multipurpose Internet Mail Extensions. HTTP uses MIME to identify the type of object being transferred across the Internet. The purpose of encoding is to prevent problems you would experience when trying to manipulate data that has not been encoded in some way.

You do not have to set a value for this field. However, if you wanted to strictly specify the default encoding, you would set the Encoding Type field to the following value:




application/x-www-form-urlencoded
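To see what URL encoding does to form data, consider the following sketch in Python (chosen only for illustration). It uses the standard library's urllib.parse.urlencode to join field names and values with = and &, replace spaces with plus signs, and escape reserved characters as %XX. The function name encode_form and the sample field values are hypothetical.

```python
from urllib.parse import urlencode


def encode_form(fields):
    """URL-encode a mapping of form field names to values."""
    return urlencode(fields)


if __name__ == "__main__":
    # The space becomes "+" and the "@" becomes "%40".
    print(encode_form({"name": "Ann Lee", "email": "ann@tvp.com"}))
```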

Input to CGI Scripts


When a user activates a link to a gateway script, input is sent to the server. The server formats this data into environment variables and checks to see whether additional data was submitted via the standard input stream.

Environment Variables


Input to CGI scripts is usually in the form of environment variables. The environment variables passed to gateway scripts are associated with the browser requesting information from the server, the server processing the request, and the data passed in the request. Environment variables are case-sensitive and are normally used as described in this section. Although some environment variables are system-specific, many environment variables are standard. The standard variables are shown in Table 11.1.

As later examples show, environment variables are set automatically whenever reader input is passed to a server. The primary reason to learn about these variables is to better understand how input is passed to CGI scripts, but you should also learn about these variables so you know how to take advantage of them when necessary.

Table 11.1. Standard environment variables.

AUTH_TYPE Specifies the authentication method and is used to validate a user's access.
CONTENT_LENGTH Used to provide a way of tracking the length of the data string as a numeric value.
CONTENT_TYPE Indicates the MIME type of data.
GATEWAY_INTERFACE Indicates the version of the CGI standard the server is using.
HTTP_ACCEPT Indicates the MIME content types the browser will accept, as passed to the gateway script via the server.
HTTP_USER_AGENT Indicates the type of browser used to send the request, as passed to the gateway script via the server.
PATH_INFO Identifies the extra information included in the URL after the identification of the CGI script.
PATH_TRANSLATED Set by the server based on the PATH_INFO variable. The server translates the PATH_INFO variable into this variable.
QUERY_STRING Set to the query string (if the URL contains a query string).
REMOTE_ADDR Identifies the Internet Protocol address of the remote computer making the request.
REMOTE_HOST Identifies the name of the machine making the request.
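REMOTE_IDENT Identifies the remote user name as reported by the identification (identd) service on the machine making the request, if the server supports it.
REMOTE_USER Identifies the user name as authenticated by the server.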
REQUEST_METHOD Indicates the method by which the request was made.
SCRIPT_NAME Identifies the virtual path to the script being executed.
SERVER_NAME Identifies the server by its host name, alias, or IP address.
SERVER_PORT Identifies the port number the server received the request on.
SERVER_PROTOCOL Indicates the protocol of the request sent to the server.
SERVER_SOFTWARE Identifies the Web server software.
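A gateway script can pick these variables out of its environment by name. The sketch below is in Python (chosen only for illustration); the function name describe_request is hypothetical. It returns only the standard variables that are actually set, because the server leaves some of them unset for a given request.

```python
import os

# Variable names taken from the table above.
STANDARD_VARIABLES = [
    "AUTH_TYPE", "CONTENT_LENGTH", "CONTENT_TYPE", "GATEWAY_INTERFACE",
    "HTTP_ACCEPT", "HTTP_USER_AGENT", "PATH_INFO", "PATH_TRANSLATED",
    "QUERY_STRING", "REMOTE_ADDR", "REMOTE_HOST", "REMOTE_IDENT",
    "REMOTE_USER", "REQUEST_METHOD", "SCRIPT_NAME", "SERVER_NAME",
    "SERVER_PORT", "SERVER_PROTOCOL", "SERVER_SOFTWARE",
]


def describe_request(environ):
    """Return the standard CGI variables present in the given environment."""
    return {name: environ[name] for name in STANDARD_VARIABLES if name in environ}


if __name__ == "__main__":
    for name, value in describe_request(os.environ).items():
        print(f"{name} = {value}")
```

Because the names are case-sensitive, a lookup such as request_method would not match REQUEST_METHOD.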

AUTH_TYPE

The AUTH_TYPE variable identifies the access control in effect for protected areas of the Web server and is used only on servers that support user authentication. If an area of the Web site has no access control, the AUTH_TYPE variable has no value associated with it. If an area of the Web site has access control, the AUTH_TYPE variable is set to a specific value that identifies the authentication scheme being used. A simple challenge-response authorization mechanism is implemented under current versions of HTTP.

Using this mechanism, the server can challenge a client's request and the client can respond. To do this, the server sets a value for the AUTH_TYPE variable and the client supplies a matching value. The next step is to authenticate the user. Using the basic authentication scheme, the user's browser must supply authentication information that uniquely identifies the user. This information includes a user ID and password.

Under the current implementation of HTTP, HTTP 1.0, the basic authentication scheme is the most commonly used authentication method. When this method is in use, the AUTH_TYPE variable is set as follows:




AUTH_TYPE = Basic
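Under the basic scheme, the browser sends the user ID and password joined by a colon and encoded in base64. The sketch below, in Python (chosen only for illustration), shows that encoding and its reverse. The function names are hypothetical. Note that a gateway script normally sees only the AUTH_TYPE and REMOTE_USER variables and not the credentials themselves, and that base64 is an encoding, not encryption.

```python
import base64


def basic_credentials(user_id, password):
    """Build the credentials string a browser sends for basic authentication."""
    raw = f"{user_id}:{password}".encode("latin-1")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def parse_basic(credentials):
    """Recover (user_id, password) from a basic credentials string."""
    scheme, _, token = credentials.partition(" ")
    if scheme.lower() != "basic":
        raise ValueError("not a basic authentication string")
    decoded = base64.b64decode(token).decode("latin-1")
    user_id, _, password = decoded.partition(":")
    return user_id, password
```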

CONTENT_LENGTH

The CONTENT_LENGTH variable provides a way of tracking the length of the data string. This tells the gateway script how much data to read on the standard input stream. The value of the variable corresponds to the number of characters in the data passed with the request. If no data is being passed, the variable has no value.

As long as the characters are represented as octets, the value of the CONTENT_LENGTH variable will be the precise number of characters passed as standard input or standard output. Thus, if 25 characters are passed and they are represented as octets, the CONTENT_LENGTH variable will have the following value:




CONTENT_LENGTH = 25
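The distinction between characters and octets matters once the data contains characters that need more than one octet. The following sketch, in Python (chosen only for illustration), counts octets after encoding. The function name content_length and the choice of encodings are assumptions made for the example.

```python
def content_length(data, encoding="iso-8859-1"):
    """Return the number of octets the text occupies in the given encoding."""
    return len(data.encode(encoding))


if __name__ == "__main__":
    print(content_length("a" * 25))           # one octet per character
    print(content_length("caf\u00e9", "utf-8"))  # accented letter takes two octets
```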

CONTENT_TYPE

The CONTENT_TYPE variable indicates the data's MIME type. MIME typing is a feature of HTTP 1.0 and is not available on servers using HTTP 0.9. The variable is set only when attached data is passed using the standard input or output stream. The value assigned to the variable identifies the MIME type and subtype as follows:




CONTENT_TYPE = type/subtype

MIME types are broken down into basic type categories. Each data type category has a primary subtype associated with it. The basic MIME types and their descriptions are shown in Table 11.2.

Table 11.2. Basic MIME types.

Type Description
application Binary data that can be executed or used with another application
audio A sound file that requires an output device to preview
image A picture that requires an output device to preview
message An encapsulated mail message
multipart Data consisting of multiple parts and possibly many data types
text Textual data that can be represented in any character set or formatting language
video A video file that requires an output device to preview
x-world Experimental data type for world files

MIME subtypes are defined in three categories: primary, additionally defined, and extended. The primary subtype is the primary type of data adopted for use as MIME Content-Types. Additionally defined data types are additional subtypes that have been officially adopted as MIME Content-Types. Extended data types are experimental subtypes that have not been officially adopted as MIME Content-Types. You can easily identify extended subtypes because they begin with the letter x followed by a hyphen. Table 11.3 lists common MIME types and their descriptions.

Table 11.3. Common MIME types.

Type/Subtype Description
application/mac-binhex40 Macintosh binary-formatted data
application/msword Microsoft Word document
application/octet-stream Binary data that can be executed or used with another application
application/pdf Adobe Acrobat PDF document
application/postscript PostScript-formatted data
application/rtf Rich Text Format (RTF) document
application/x-compress Data that has been compressed using UNIX compress
application/x-dvi Device-independent file
application/x-gzip Data that has been compressed using UNIX gzip
application/x-latex LaTeX document
application/x-tar Data that has been archived using UNIX tar
application/x-zip-compressed Data that has been compressed using PKZip or WinZip
audio/basic Audio in a nondescript format
audio/x-aiff Audio in Apple AIFF format
audio/x-wav Audio in Microsoft WAV format
image/gif Image in GIF format
image/jpeg Image in JPEG format
image/tiff Image in TIFF format
image/x-portable-bitmap Portable bitmap
image/x-portable-graymap Portable graymap
image/x-portable-pixmap Portable pixmap
image/x-xbitmap X-bitmap
image/x-xpixmap X-pixmap
message/external-body Message with external data source
message/partial Fragmented or partial message
message/rfc822 RFC-822-compliant message
multipart/alternative Data with alternative formats
multipart/digest Multipart message digest
multipart/mixed Multipart message with data in multiple formats
multipart/parallel Multipart data with parts that should be viewed simultaneously
text/html HTML-formatted text
text/plain Plain text with no HTML formatting included
video/mpeg Video in the MPEG format
video/quicktime Video in the Apple QuickTime format
video/x-msvideo Video in the Microsoft AVI format
x-world/x-vrml VRML world file

Some MIME Content-Types can be used with additional parameters. These Content-Types include: text/plain, text/html, and all multipart message data. The charset parameter is used with the text/plain type to identify the character set used for the data. The version parameter is used with the text/html type to identify the version of HTML used. The boundary parameter is used with multipart data to identify the boundary string that separates message parts.

The charset parameter for the text/plain type is optional. If a charset is not specified, the default value charset=us-ascii is assumed. Other values for charset include the character sets standardized by the International Organization for Standardization (ISO). These character sets are named ISO-8859-1 through ISO-8859-9 and are specified as follows:




CONTENT_TYPE = text/plain; charset=iso-8859-1

The version parameter for the text/html type is optional. If this parameter is set, the browser reading the data interprets the data if the browser supports the version of HTML specified. The following value identifies a document that conforms to the HTML 3.2 specification:




CONTENT_TYPE = text/html; version=3.2

The boundary parameter for multipart message types is required. The boundary value is set to a string of 1 to 70 characters. Although the string cannot end in a space, the string can contain any valid letter or number and can include spaces and a limited set of special characters. Boundary parameters are unique strings that are defined as follows:




CONTENT_TYPE = multipart/mixed; boundary=boundary_string
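
As a rough illustration of how a gateway script might separate these parameters from the type/subtype pair, here is a minimal Python sketch. The function names parse_content_type and charset_of are hypothetical, and the parsing is deliberately simple: it assumes that parameter values do not themselves contain semicolons.

```python
def parse_content_type(value):
    """Split a CONTENT_TYPE value into (type/subtype, parameters)."""
    pieces = [piece.strip() for piece in value.split(";")]
    media_type = pieces[0].lower()
    params = {}
    for piece in pieces[1:]:
        name, sep, setting = piece.partition("=")
        if sep:  # skip fragments that are not name=value pairs
            params[name.strip().lower()] = setting.strip().strip('"')
    return media_type, params


def charset_of(value):
    """Return the charset parameter, falling back to the us-ascii default."""
    _, params = parse_content_type(value)
    return params.get("charset", "us-ascii")
```

Calling parse_content_type on the multipart example above would yield the type multipart/mixed and a dictionary holding the boundary string, which a script would need before it could split the message into parts.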

GATEWAY_INTERFACE

The GATEWAY_INTERFACE variable indicates the version of the CGI specification the server is using. The value assigned to the variable identifies the name and version of the specification used as follows:




GATEWAY_INTERFACE = name/version

The current version of the CGI specification is 1.1. A server conforming to this version would set the GATEWAY_INTERFACE variable as follows:




GATEWAY_INTERFACE = CGI/1.1

HTTP_ACCEPT

The HTTP_ACCEPT variable defines the types of data the client will accept. The acceptable values are expressed as type/subtype pairs separated by commas, as in




type/subtype, type/subtype

Most clients accept dozens of MIME types. The following identifies all the MIME Content-Types accepted by this client:




HTTP_ACCEPT = application/msword, application/octet-stream,
    application/postscript, application/rtf, application/x-zip-compressed,
    audio/basic, audio/x-aiff, audio/x-wav, image/gif, image/jpeg, image/tiff,
    image/x-portable-bitmap, message/external-body, message/partial,
    message/rfc822, multipart/alternative,
    multipart/digest, multipart/mixed, multipart/parallel, text/html,
    text/plain, video/mpeg, video/quicktime, video/x-msvideo
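
A script can use this list to decide which format to return. The following Python sketch shows one way to test whether a client accepts a given type. The function names are hypothetical, and the handling of the * wildcard (as in */*) is an addition beyond what this chapter describes.

```python
def accepted_types(http_accept):
    """Return the type/subtype pairs listed in an HTTP_ACCEPT value."""
    types = []
    for item in http_accept.split(","):
        # Drop any parameters that follow a semicolon, then tidy the pair.
        media_type = item.split(";")[0].strip().lower()
        if media_type:
            types.append(media_type)
    return types


def client_accepts(http_accept, mime_type):
    """Report whether mime_type matches any accepted pair."""
    wanted_type, _, wanted_subtype = mime_type.lower().partition("/")
    for entry in accepted_types(http_accept):
        entry_type, _, entry_subtype = entry.partition("/")
        if entry_type in ("*", wanted_type) and entry_subtype in ("*", wanted_subtype):
            return True
    return False
```

For instance, a script that can produce either a GIF or a JPEG image could call client_accepts with the value of HTTP_ACCEPT before choosing which one to send.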

HTTP_USER_AGENT

The HTTP_USER_AGENT variable identifies the type of browser used to send the request. The acceptable values are expressed as software type/version or library/version. The following HTTP_USER_AGENT variable identifies Netscape Navigator version 2.0:




HTTP_USER_AGENT = Mozilla/2.0

As you can see, Netscape uses the alias Mozilla to identify itself. The primary types of clients that set this variable are browsers, Web spiders, and robots. Although this is a useful parameter for identifying the type of client used to access a script, keep in mind that not all clients set this variable.
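
Because the variable is not always set, a script should be prepared for it to be missing. Here is a small Python sketch, assuming the environment is available as a dictionary; the function name parse_user_agent is hypothetical.

```python
def parse_user_agent(environ):
    """Return (software, version) from HTTP_USER_AGENT, or None if unset."""
    tokens = environ.get("HTTP_USER_AGENT", "").split()
    if not tokens:
        return None  # the client did not identify itself
    # Only the first software/version token is examined in this sketch.
    software, _, version = tokens[0].partition("/")
    return software, version
```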

Here's a list of software type values used by popular browsers:

These values are used by Web spiders:


PATH_INFO

The PATH_INFO variable specifies extra path information and can be used to send additional information to a gateway script. The extra path information follows the URL to the gateway script referenced. Generally, this information is a virtual or relative path to a resource that the server must interpret. If the URL to the CGI script is specified in your document as




/usr/cgi-bin/formparse.pl/home.html

then the PATH_INFO variable would be set as follows:




PATH_INFO = /home.html

PATH_TRANSLATED

Servers translate the PATH_INFO variable into the PATH_TRANSLATED variable. They do this by inserting the default Web document's directory path in front of the extra path information. For example, if the PATH_INFO variable was set to /home.html and the default directory was /usr/documents/pubs, the PATH_TRANSLATED variable would be set as follows:




PATH_TRANSLATED = /usr/documents/pubs/home.html
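
The relationship between the two variables can be sketched in a few lines of Python. This is only an approximation of what a server does, using hypothetical function names; a real server would also normalize the path and refuse requests that try to climb out of the document directory.

```python
import posixpath


def split_extra_path(url_path, script_path):
    """Return the extra path information that follows the script's URL."""
    if not url_path.startswith(script_path):
        raise ValueError("URL does not reference the script")
    return url_path[len(script_path):]


def translate_path(document_directory, path_info):
    """Place the document directory in front of the extra path information."""
    return posixpath.join(document_directory, path_info.lstrip("/"))
```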

QUERY_STRING

The QUERY_STRING variable specifies a URL-encoded search string. The server sets this variable when you use the GET method to submit a fill-out form, or when you use an ISINDEX query to search a document. The query string is separated from the URL by a question mark. The user submits all the information following the question mark separating the URL from the query string. Here is an example:




/usr/cgi-bin/formparse.pl?string

When the query string is URL-encoded, the browser encodes key parts of the string. The plus sign is a placeholder between words, as a substitute for spaces:




/usr/cgi-bin/formparse.pl?word1+word2+word3

Equal signs separate keys assigned by the publisher from values entered by the user. In the following example, response is the key assigned by the publisher, and never is the value entered by the user:




/usr/cgi-bin/formparse.pl?response=never

Ampersand symbols separate sets of keys and values. In the following example, response is the first key assigned by the publisher, and sometimes is the value entered by the user. The second key assigned by the publisher is reason, and the value entered by the user is "I am not really sure". Here is an example:




/usr/cgi-bin/formparse.pl?response=sometimes&reason=I+am+not+really+sure

Finally, the percent sign is used to identify escape characters. Following the percent sign is an escape code for a special character expressed as a hexadecimal value. Here is how the previous query string could be rewritten using the escape code for an apostrophe:




/usr/cgi-bin/formparse.pl?response=sometimes&reason=I%27m+not+really+sure
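
To see how a script might reverse this encoding, consider the following Python sketch, which relies on the standard urllib.parse module. The function names decode_keywords and decode_pairs are hypothetical.

```python
from urllib.parse import parse_qsl, unquote


def decode_keywords(query_string):
    """Decode a keyword query in which plus signs separate the words."""
    return [unquote(word) for word in query_string.split("+") if word]


def decode_pairs(query_string):
    """Decode key=value sets separated by ampersands.

    parse_qsl turns plus signs into spaces and expands percent escapes.
    """
    return parse_qsl(query_string, keep_blank_values=True)
```

Applied to the last example above, decode_pairs would return the key response with the value sometimes, and the key reason with the value I'm not really sure.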

REMOTE_ADDR

The REMOTE_ADDR variable is set to the Internet Protocol (IP) address of the remote computer making the request. The IP address is a numeric identifier for a networked computer. The REMOTE_ADDR variable is associated with the host computer making the request for the client and could be set as follows:




REMOTE_ADDR = 205.1.20.11

REMOTE_HOST

The REMOTE_HOST variable specifies the name of the host computer making a request. This variable is set only if the server can figure out this information using a reverse lookup procedure. If this variable is set, the full domain and host name are used as follows:




REMOTE_HOST = www.tvp.com
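
Since REMOTE_HOST may be empty when the reverse lookup fails, a script that logs or displays the client's identity can fall back to REMOTE_ADDR. A minimal Python sketch, with a hypothetical function name:

```python
def client_label(environ):
    """Prefer the host name, then the IP address, then a placeholder."""
    return environ.get("REMOTE_HOST") or environ.get("REMOTE_ADDR") or "unknown"
```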

REMOTE_IDENT

The REMOTE_IDENT variable identifies the remote user making a request. The variable is set only if the server and the remote machine making the request support the identification protocol. Further, information on the remote user is not always available, so you should not rely on it even when it is available. If the variable is set, the associated value is a fully expressed name that contains the domain information as well, such as




REMOTE_IDENT = william.www.tvp.com

REMOTE_USER

The REMOTE_USER variable is the user name as authenticated by the server, and as such is the only variable you should rely upon to identify a user. As with other types of user authentication, this variable is set only if the server supports user authentication and if the gateway script is protected. If the variable is set, the associated value is the user's identification as sent by the client to the server, such as




REMOTE_USER = william

REQUEST_METHOD

The REQUEST_METHOD variable specifies the method by which the request was made. For HTTP 1.0, the methods include GET, HEAD, and POST, along with less common methods such as PUT, DELETE, LINK, and UNLINK.

The GET, HEAD, and POST methods are the most commonly used request methods. Both GET and POST are used to submit forms. The HEAD method could be specified as follows:




REQUEST_METHOD = HEAD

SCRIPT_NAME

The SCRIPT_NAME variable specifies the virtual path to the script being executed. This is useful if the script generates an HTML document that references the script. If the URL specified in your HTML document is




http://tvp.com/cgi-bin/formparse.pl

the SCRIPT_NAME variable is set as follows:




SCRIPT_NAME = /cgi-bin/formparse.pl

SERVER_NAME

The SERVER_NAME variable identifies the server by its host name, alias, or IP address. This variable is always set and could be specified as follows:




SERVER_NAME = tvp.com

SERVER_PORT

The SERVER_PORT variable specifies the port number on which the server received the request. This information can be interpreted from the URL to the script if necessary. However, most servers use the default port of 80 for HTTP requests. If the URL specified in your HTML document is




http://www.ncsa.edu:8080/cgi-bin/formparse.pl

the SERVER_PORT variable is set as follows:




SERVER_PORT = 8080
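
When a script generates an HTML document that refers back to itself, it can rebuild its own URL from SERVER_NAME, SERVER_PORT, and SCRIPT_NAME. The Python sketch below is one way to do this. The function name script_url is hypothetical, and the sketch assumes the script is reached over HTTP.

```python
def script_url(environ):
    """Rebuild the URL of the running script from server variables."""
    host = environ["SERVER_NAME"]
    port = str(environ.get("SERVER_PORT", "80"))
    # The default HTTP port is normally left out of the URL.
    location = host if port == "80" else host + ":" + port
    return "http://" + location + environ.get("SCRIPT_NAME", "")
```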

SERVER_PROTOCOL

The SERVER_PROTOCOL variable identifies the protocol used to send the request. The value assigned to the variable identifies the name and version of the protocol used. The format is name/version, such as HTTP/1.0. The variable is set as follows:




SERVER_PROTOCOL = HTTP/1.0

SERVER_SOFTWARE

The SERVER_SOFTWARE variable identifies the name and version of the server software. The format for values assigned to the variable is name/version, such as CERN/2.17. The variable is set as follows:




SERVER_SOFTWARE = CERN/2.17

CGI Standard Input


Most input sent to a Web server is used to set environment variables, yet not all input fits neatly into an environment variable. When a user submits actual data to be processed by a gateway script, this data is received as a URL-encoded search string or via the standard input stream. The server knows how to process actual data by the method used to submit the data.

Sending data as standard input is the most direct way to send data. The server simply tells the gateway script how many bytes of data to read from standard input by setting the CONTENT_LENGTH variable. The script opens the standard input stream and reads the specified amount of data. Although long URL-encoded search strings might get truncated, data sent on the standard input stream will not. Consequently, the standard input stream is the preferred way to pass data.

Clarifying CGI Input


You can identify a submission method when you create your fill-out forms. Under HTTP 1.0, there are two submission methods for forms: GET and POST.

Let's create a sample Web document containing a form with three key fields: NAME, ADDRESS, and PHONE_NUMBER. Assume the URL to the script is http://www.tvp.com/cgi-bin/survey.pl and the user responds as follows:




Sandy Brown
12 Sunny Lane WhoVille, USA
987-654-3210

Identical information submitted using the GET and POST methods is treated differently by the server. When the GET method is used, the server sets the following environment variables then passes the input to the survey.pl script:




PATH = /bin:/usr/bin:/usr/etc:/usr/ucb
SERVER_SOFTWARE = CERN/3.0
SERVER_NAME = www.tvp.com
GATEWAY_INTERFACE = CGI/1.1
SERVER_PROTOCOL = HTTP/1.0
SERVER_PORT = 80
REQUEST_METHOD = GET
HTTP_ACCEPT = text/plain, text/html, application/rtf, application/postscript,
    audio/basic, audio/x-aiff, image/gif, image/jpeg, image/tiff, video/mpeg
PATH_INFO =
PATH_TRANSLATED =
SCRIPT_NAME = /cgi-bin/survey.pl
QUERY_STRING = NAME=Sandy+Brown&ADDRESS=12+Sunny+Lane+WhoVille,+USA&PHONE_NUMBER=987-654-3210
REMOTE_HOST =
REMOTE_ADDR =
REMOTE_USER =
AUTH_TYPE =
CONTENT_TYPE =
CONTENT_LENGTH =

When the POST method is used, the server sets the following environment variables and then passes the input to the survey.pl script:




PATH = /bin:/usr/bin:/usr/etc:/usr/ucb
SERVER_SOFTWARE = CERN/3.0
SERVER_NAME = www.tvp.com
GATEWAY_INTERFACE = CGI/1.1
SERVER_PROTOCOL = HTTP/1.0
SERVER_PORT = 80
REQUEST_METHOD = POST
HTTP_ACCEPT = text/plain, text/html, application/rtf, application/postscript,
    audio/basic, audio/x-aiff, image/gif, image/jpeg, image/tiff, video/mpeg
PATH_INFO =
PATH_TRANSLATED =
SCRIPT_NAME = /cgi-bin/survey.pl
QUERY_STRING =
REMOTE_HOST =
REMOTE_ADDR =
REMOTE_USER =
AUTH_TYPE =
CONTENT_TYPE = application/x-www-form-urlencoded
CONTENT_LENGTH = 78

The following POST-submitted data is passed to the gateway script via the standard input stream:




NAME=Sandy+Brown&ADDRESS=12+Sunny+Lane+WhoVille,+USA&PHONE_NUMBER=987-654-3210
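
The difference between the two methods can be sketched in Python as follows. This is an illustration rather than a complete form handler: the function name read_form_fields is hypothetical, the environment is passed in as a dictionary, and the sample run substitutes an in-memory stream for the real standard input.

```python
import io
from urllib.parse import parse_qsl


def read_form_fields(environ, stdin):
    """Collect form fields from QUERY_STRING (GET) or standard input (POST)."""
    method = environ.get("REQUEST_METHOD", "GET").upper()
    if method == "POST":
        # Read exactly the number of bytes the server announced.
        length = int(environ.get("CONTENT_LENGTH") or 0)
        raw = stdin.read(length).decode("ascii")
    else:
        raw = environ.get("QUERY_STRING", "")
    return dict(parse_qsl(raw, keep_blank_values=True))


# Sample run using the survey data shown above.
sample = b"NAME=Sandy+Brown&ADDRESS=12+Sunny+Lane+WhoVille,+USA&PHONE_NUMBER=987-654-3210"
post_environ = {"REQUEST_METHOD": "POST", "CONTENT_LENGTH": str(len(sample))}
fields = read_form_fields(post_environ, io.BytesIO(sample))
```

In an actual gateway script, the call would probably look like read_form_fields(os.environ, sys.stdin.buffer), so that the same code serves forms submitted with either method.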

Output from CGI Scripts


After the script has completed processing the input, the script should return output to the server. The server will then return the output to the client. Generally, this output is in the form of an HTTP response that includes a header followed by a blank line and a body. Although the CGI header output is strictly formatted, the body of the output is formatted in the manner you specify in the header. For example, the body can contain an HTML document for the client to display.

CGI Headers


CGI headers contain directives to the server. Currently there are three valid server directives: Content-Type, Location, and Status.

A single header can contain one or all of the server directives. Your CGI script would output these directives to the server. Although the header is followed by a blank line that separates the header from the body, the output does not have to contain a body.

Content-Types Used in CGI Headers

The Content-Type field in a CGI header identifies the MIME type of the data you are sending back to the client. Usually the data output from a script is a fully formatted document, such as an HTML document. You could specify this in the header as follows:




Content-Type: text/html

Locations Used in CGI Headers

The output of your script doesn't have to be a document created within the script. You can reference any document on the Web using the Location field. The Location field references a file by its URL. Servers process location references either directly or indirectly depending on the location of the file. If the server can find the file locally, it passes the file to the client. Otherwise, the server redirects the URL to the client and the client has to retrieve the file. You can specify a location in a script as follows:




Location: http://www.tvpress.com/

NOTE

Some older browsers don't support automatic redirection. Consequently, you might want to consider adding an HTML-formatted message body to the output. This message body will only be displayed if a browser cannot use the location URL.
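
One way to produce a redirect with such a fallback body is sketched below in Python. The function name redirect_output and the wording of the message are hypothetical. The sketch also emits a Status field (described in the next section) with the 302 code, on the assumption that the server expects one when a redirect carries a body.

```python
from html import escape


def redirect_output(url):
    """Build CGI output that redirects to url and includes a fallback body."""
    header = (
        "Status: 302 Found\n"
        "Location: {}\n"
        "Content-Type: text/html\n"
        "\n"  # blank line separating the header from the body
    ).format(url)
    safe_url = escape(url, quote=True)
    body = (
        "<HTML><HEAD><TITLE>Document Moved</TITLE></HEAD>\n"
        '<BODY><P>The document is available at <A HREF="{0}">{0}</A>.</P></BODY></HTML>\n'
    ).format(safe_url)
    return header + body
```

A script would write the returned string to standard output, for example with sys.stdout.write(redirect_output("http://www.tvpress.com/")).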



Status Used in CGI Headers

The Status field passes a status line to the server for forwarding to the client. Status codes are expressed as a three-digit code followed by a string that generally explains what has occurred. The first digit of a status code shows the general status as follows: 1 is informational, 2 indicates success, 3 indicates redirection, 4 indicates a client error, and 5 indicates a server error.

Although many status codes are used by servers, the status codes you pass to a client via your CGI script are usually client error codes. For example, let's say the script could not find a file, and you have specified that in such cases, instead of returning nothing, it should output an error code. The client error codes you might want to use include 400 Bad Request, 401 Unauthorized, 403 Forbidden, and 404 Not Found.
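
The following Python sketch shows how a script might report such an error. The names status_class and error_output are hypothetical, and the class labels follow the general HTTP convention for the first digit of a status code.

```python
from html import escape

STATUS_CLASSES = {
    "1": "informational",
    "2": "success",
    "3": "redirection",
    "4": "client error",
    "5": "server error",
}


def status_class(code):
    """Classify a status code by its first digit."""
    return STATUS_CLASSES.get(str(code)[0], "unknown")


def error_output(code, reason, message):
    """Build CGI output consisting of a Status header and a short HTML body."""
    header = "Status: {} {}\nContent-Type: text/html\n\n".format(code, reason)
    body = (
        "<HTML><HEAD><TITLE>{0} {1}</TITLE></HEAD>\n"
        "<BODY><H1>{0} {1}</H1>\n<P>{2}</P></BODY></HTML>\n"
    ).format(code, reason, escape(message))
    return header + body
```

For the missing-file case described above, the script could write error_output(404, "Not Found", "The requested file could not be located.") to standard output.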


Clarifying CGI Output


Creating the output from a CGI script is easier than it might seem. All you have to do is format the output into a header and body using your favorite programming language. This section contains two examples. The first example is in the Perl programming language. The second example is in the UNIX Bourne shell.

If you wanted the script to output a simple HTML document using Perl, here is how you could do it:




#!/usr/bin/perl

# Create header followed by a blank line
print "Content-Type: text/html\n\n";

# Add body in HTML format
print <<"MAIN";
<HTML><HEAD><TITLE>Output from Script</TITLE></HEAD>
<BODY>
<H1>Top 10 Reasons for Using CGI</H1>
<P>10. Customer feedback.</P>
<P>9. Obtaining questionnaire and survey responses.</P>
<P>8. Tracking visitor count.</P>
<P>7. Automating searches.</P>
<P>6. Creating easy database interfaces.</P>
<P>5. Building gateways to other protocols.</P>
<P>4. HTML 2.0 Image maps.</P>
<P>3. User Authentication.</P>
<P>2. On-line order processing.</P>
<P>1. Generating documents on the fly.</P>
</BODY>
MAIN

If you wanted the script to output a simple HTML document in Bourne shell, here's how you could do it:




#!/bin/sh

# Create header followed by the blank line that ends the header
echo "Content-Type: text/html"
echo ""

# Add body in HTML format
cat << MAIN
<HTML><HEAD><TITLE>Output from Script</TITLE></HEAD>
<BODY>
<H1>Top 10 Reasons for Using CGI</H1>
<P>10. Customer feedback.</P>
<P>9. Obtaining questionnaire and survey responses.</P>
<P>8. Tracking visitor count.</P>
<P>7. Automating searches.</P>
<P>6. Creating easy database interfaces.</P>
<P>5. Building gateways to other protocols.</P>
<P>4. HTML 2.0 Image maps.</P>
<P>3. User Authentication.</P>
<P>2. On-line order processing.</P>
<P>1. Generating documents on the fly.</P>
</BODY>
MAIN

The server processing the output sets environment variables, creates an HTTP header, then sends the data on to the client. Here is how the HTTP header might look coming from a CERN Web Server:




HTTP/1.0 200 OK
MIME-Version: 1.0
Server: CERN/3.0
Date: Monday, 4-Mar-96 23:59:59 HST
Content-Type: text/html
Content-Length: 485

<HTML><HEAD><TITLE>Output from Script</TITLE></HEAD>
<BODY>
<H1>Top 10 Reasons for Using CGI</H1>
<P>10. Customer feedback.</P>
<P>9. Obtaining questionnaire and survey responses.</P>
<P>8. Tracking visitor count.</P>
<P>7. Automating searches.</P>
<P>6. Creating easy database interfaces.</P>
<P>5. Building gateways to other protocols.</P>
<P>4. HTML 2.0 Image maps.</P>
<P>3. User Authentication.</P>
<P>2. On-line order processing.</P>
<P>1. Generating documents on the fly.</P>
</BODY>

Summary


The common gateway interface opens the door for adding advanced features to your Web publications. This workhorse running quietly in the background enables fill-out forms, database queries, index searches, and creation of documents on the fly. FrontPage allows you to easily add WebBots to pages that use forms. However, WebBots generally do not perform any post-submission processing. With CGI scripts, you can process input from forms automatically and generate output directly to the reader based on the results of the processing.

Although CGI enhancement is a click of the mouse button away for most readers, it means extra work for Web publishers. Still, the payoff from adding CGI to your Web publications makes the extra effort truly worthwhile.
