by William Robert Stanek
Using CGI scripts, you can create powerful, personalized, and professional Web publications that readers can really interact with. CGI scripts are external programs that act as gateways between the Web server and other applications. You can use CGI
scripts to process input from readers and thus open a two-way communication channel with your readers. Reader input can be data from fill-out forms, keywords for a database query, or values that describe the reader's browser and connection.
Your CGI scripts can use this input to add entries to an index, to search databases, to create customized documents on the fly, and much more. Yet the most wonderful thing about CGI scripts is that they hide their complexities from users. If you've
used a fill-out form or an image map on the Web, you've probably used a gateway script and probably didn't even know it. This is because everything seems to happen automatically. You enter data, click a mouse button, and a moment later a result is
displayed. Learning what actually happens between the click of the mouse button and the display of the result is what this chapter is all about. This chapter explains what you need to know about CGI scripts: what they are, how to use them, and why to
use them.
Although FrontPage enables you to easily add WebBots to pages that use forms, WebBots generally do not perform any post-submission processing. With CGI scripts, you can process input from forms automatically and generate output directly to the reader
based on the results of the processing.
CGI scripts are external programs that run on the Web server. You can use CGI scripts to create highly interactive Web publications. The standard that defines how external programs are used on Web servers and how they interact with other applications
is the common gateway interface. The three keywords that comprise the name of the standard (common, gateway, and interface) describe how the standard works:
Common: By specifying a common way for scripts to be accessed, CGI enables anyone, no matter their platform, to pass information to a CGI script.
Gateway: By defining the link or gateway between the script, the server, and other applications, CGI makes it possible for external programs to accept generalized input and pass information to other applications.
Interface: By describing the interface or the way external programs can be accessed by users, CGI reduces the complex process of interfacing with external programs to a few basic procedures.
The developers of CGI worked these key concepts into the CGI standard to create a powerful and extendible advanced feature for Web publishers that shields readers of your publications from its complexities. The reader need only click on an area of an
image map or submit their fill-out form after completing it. Everything after the click of the mouse button seems to happen automatically, and the reader doesn't have to worry about the how or why. As a Web publisher, understanding how CGI scripts work is
essential, especially if you want to take advantage of the ways CGI can be used to create powerful Web publications.
Although the reader sees only the result of their submission or query, behind the scenes many things are happening. Here is a summary of what is taking place:
On UNIX systems, CGI scripts are located in a directory called cgi-bin in the usr file system and CGI utilities are located in a directory called cgi-src in the
usr file system. On other systems, your Web server documentation will explain in what directories CGI scripts and utilities should be placed.
CGI scripts are also called gateway scripts. The term script comes from the UNIX environment, in which shell scripts abound, but gateway scripts don't have to be in the format of a UNIX script. You can write gateway scripts in almost any
computer language that produces an executable file. The most common languages for scripts are
Bourne Shell
C Shell
C/C++
Perl
Python
Tcl
Visual Basic
Two up-and-coming scripting languages are
JavaScript
VBScript
The best programming language to write your script in is one that works with your Web server and meets your needs. Preferably, the language should already be available on the Web server and you should be proficient in it (or at least have some
knowledge of the language). Keep in mind, most user input is in the form of text that must be manipulated in some way, which makes support for text strings and their manipulation critically important.
The easiest way to determine if a language is available is to ask the Webmaster or system administrator responsible for the server. As most Web servers operate on UNIX systems, you might be able to use the following UNIX commands to check on the
availability of a particular language:
which
whereis
You can use either which or whereis on UNIX systems. You would type which or whereis at the shell prompt and follow the command by a
keyword on which you want to search, such as the name of the programming language you want to use. To see if your UNIX server supports Perl, you could type either
which perl
or
whereis perl
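The same check can also be made from a program rather than at the shell prompt. The following is a minimal sketch in Python, assuming Python 3 is available on the server; the helper name find_interpreters and the list of interpreter names are illustrative, not part of any standard. It relies on the standard library function shutil.which, which searches the directories in PATH much as the which command does.

```python
import shutil


def find_interpreters(names):
    """Map each interpreter name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # Interpreters commonly used for CGI scripts; adjust the list for your server.
    candidates = ["perl", "python3", "sh", "csh", "ksh", "tclsh"]
    for name, path in find_interpreters(candidates).items():
        print(f"{name}: {path if path else 'not found'}")
```

A result of None means only that the interpreter is not on the search path of the account running the check; the Webmaster might still have installed it elsewhere.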
As Perl, C/C++, and UNIX shell are the most popular languages for scripts, the sections that follow will look briefly at these languages, with emphasis on why and when to use them. Each section contains a checklist for features and systems supported,
which can be interpreted as follows:
The sections on common scripting languages are followed by close-ups on the newest scripting languages: JavaScript and VBScript. Both JavaScript and VBScript are hot topics on the Net right now. If you want to be on the cutting edge of Internet
technologies, these are languages you want to keep both eyes on.
Operating system support: UNIX
Programming level: Basic
Complexity of processing: Basic
Text-handling capabilities: Moderately Advanced
The UNIX operating system is in wide use in business, education, and research sectors. There are almost as many variations of the UNIX operating system as there are platforms that use it. You will even find that platforms produced by the same
manufacturer use different variants of the UNIX operating system. For example, DEC has variants for the Dec-Alpha, Decstation, and Dec OSF.
What these operating systems have in common is the core environment on which they are based. Most UNIX operating systems are based on Berkeley UNIX (BSD), AT&T System V, or a combination of BSD and System V. Both BSD and System V support three
shell scripting languages:
Bourne shell
C shell
Korn shell
TIP
You can quickly identify the shell scripting language used by examining the first line of a script. Bourne shell scripts generally have this first line:
#!/bin/sh
C shell scripts generally have a blank first line or the following:
#!/bin/csh
Korn shell scripts generally have this first line:
#!/bin/ksh
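As a rough illustration of this tip, the check can be automated. The sketch below is written in Python and is only one way to do it; the function name identify_shell and the SHELLS table are hypothetical. It looks at the interpreter named on the first line and reports which of the three shells it refers to.

```python
import posixpath

# Interpreter base names mapped to the shell they indicate.
SHELLS = {"sh": "Bourne shell", "csh": "C shell", "ksh": "Korn shell"}


def identify_shell(first_line):
    """Return the shell named on a script's #! line, or None if unrecognized."""
    line = first_line.strip()
    if not line.startswith("#!"):
        return None
    parts = line[2:].split()
    if not parts:
        return None
    # Use only the file name, so /bin/sh and /usr/bin/sh are treated alike.
    return SHELLS.get(posixpath.basename(parts[0]))


if __name__ == "__main__":
    for sample in ["#!/bin/sh", "#!/bin/csh -f", "#!/bin/ksh", ""]:
        print(repr(sample), "->", identify_shell(sample))
```

Note that a blank first line, which the tip associates with C shell scripts, cannot be told apart from a script with no interpreter line at all, so the sketch returns None in that case.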
All UNIX shells are interpreted languages, which means the scripts you create do not have to be compiled. Bourne shell is the most basic shell. C shell is an advanced shell with many features of the C programming language. Because Bourne shell uses a
completely different syntax than C shell, scripts written for Bourne shell are not compatible with C shell. If you create a script in Bourne shell and later want to use C shell to interpret the script, you must rewrite the script for C shell.
Many programmers often want to merge the simplicity of Bourne shell with the advanced features of C shell, and this is where Korn shell comes in handy. Korn shell has the same functionality as the Bourne shell and also incorporates many features of the
C shell. Any scripts you've written in Bourne shell can be interpreted directly by the Korn interpreter. This saves time rewriting a script when you later find you want to use a feature supported by Korn. Although the Korn shell is gaining popularity,
Bourne and C shell are the two most widely used UNIX shells.
Some differences in Bourne, C, and Korn shell are visible only if you are at the shell prompt and using a particular shell. You can change your current shell any time from the shell prompt by typing:
/bin/sh to change to Bourne shell
/bin/csh to change to C shell
/bin/ksh to change to Korn shell
Usually, you will see visible differences between the various shells immediately. For example, the default command prompt for Bourne shell is the dollar sign, while the default command prompt for C shell is usually your host name and user ID followed
by a colon. Beyond this, C shell supports a history function, aliasing of commands, and many other controls that the Bourne shell does not. However, to the CGI programmer, these differences are generally not important. Your primary concern should be the
features that the shell directly supports and how scripts behave when executed in it.
Bourne shell is the smallest of the shells and the most efficient. Consequently, a Bourne shell script will generally execute faster and use fewer system resources. When you want more advanced features, such as arrays, you will want to use Korn shell.
Korn shell has more overhead than Bourne shell and requires slightly more system resources. When you want to make advanced function calls or assignments, you will want to use C shell. Because C shell is larger than Bourne and Korn shell, scripts written in
C shell generally have higher overhead and use more system resources.
Although UNIX shells have good built-in facilities for handling text, such as sed, awk, and grep, they are not as powerful or extensible as traditional programming languages. You should consider using shell scripts when you want to perform simple tasks
and moderately advanced text or file manipulation.
Operating system support: UNIX, DOS, Windows, Macintosh, and others
Programming level: Advanced
Complexity of processing: Advanced
Text-handling capabilities: Difficult to use
When you want your scripts to perform complex tasks, you call in the big guns. Two of the most advanced languages used in CGI scripts are C and C++. C is the most popular programming language in use today. C++ is the object-oriented successor to C.
Both C and C++ are advanced programming languages that require you to compile your scripts before you can use them. A major advantage of C and C++ is that they enjoy widespread use, and versions are available for virtually every operating system you can
think of.
The primary time to use C (rather than C++) is when your scripts must execute swiftly and use minimal system resources. C was developed more than 20 years ago, and has been gaining popularity ever since. CGI programmers use C because compiled C
programs are very small, tiny compared to programs with similar functionality programmed in other languages. Small programs use minimal system resources and execute quickly. However, C is a very complex language with difficult-to-use facilities for
manipulating text. Therefore, if you are not proficient in C, you should be wary of using C to perform advanced text string processing.
The primary time to use C++ is when certain functions of your scripts will be reused and when long-term development costs are a major concern. C++ is an object-oriented language that enables you to use libraries of functions. These functions form the
core of your CGI scripts and can be reused in other CGI scripts. For example, you can use one function to sort the user's input, another function to search a database using the input, and another function to display the output as an HTML document. However,
C++ is an object-oriented language that is very different from other languages. If you have not used an object-oriented language before, are not familiar with C, and plan to use C++ for your CGI scripts, you should be prepared for a steep learning curve.
Operating system support: UNIX, DOS, Windows, Macintosh, and others
Programming level: Advanced
Complexity of processing: Advanced
Text handling capabilities: Easy to use
If you want to be on the inside track of CGI programming, you should learn and use the Practical Extraction and Report Language (Perl). Perl combines elements of C with UNIX shell features such as awk, sed, and grep to create a powerful language for
processing text strings and generating reports. Because most of the processing done by CGI scripts involves text manipulation, Perl is rapidly becoming the most widely used language for CGI scripts. As with C and C++, a major advantage of Perl is its
widespread use. Versions of Perl are available for virtually every operating system you can think of. You can use Perl to perform the following tasks:
Perl, like Bourne and C shell, is an interpreted language. However, Perl does not have the limitations of most interpreted languages. You can use Perl to manipulate extremely large amounts of data, and you can quickly scan files using sophisticated
pattern-matching techniques. Perl strings are not limited in size. The entire contents of a file can be used as a single string. Perl's syntax is similar to C's. Many basic Perl constructs, like if, for, and while statements, are used just as you use them in C.
TIP
Like a UNIX shell script, a Perl script usually specifies the path to its interpreter in the first line. Therefore, the first line of a Perl script should specify the path to where Perl is installed on the system. This path is usually
#!/usr/local/perl
or
#!/usr/local/bin/perl
Perl is surprisingly easy to learn and use, especially if you know the basics of C or UNIX shell. Perl scripts are usually faster than UNIX shell scripts and slightly slower than compiled C/C++ scripts. You should use Perl whenever you have large
amounts of text to manipulate.
JavaScript is a scripting language whose syntax resembles that of the Java programming language developed by Sun Microsystems. This powerful up-and-coming scripting language is being developed by Netscape Communications Corporation, and as you might have guessed, the
Netscape Navigator 2.0/3.0 fully supports JavaScript.
Netscape Navigator 2.0/3.0 interprets JavaScript programs embedded directly in an HTML page, and just like Java applets, these programs are fully interactive. JavaScript can recognize and respond to mouse clicks, form input, and page navigation. This
means your pages can "intelligently" react to user input. The JavaScript language resembles the Java programming language, with a few important exceptions, as you can see from the comparisons in the following lists:
JavaScript is
Java is
JavaScript is designed to complement the Java language and has some terrific features for Web publishers. You could create a JavaScript program that passes parameters to a Java applet. This would enable you to use the JavaScript program as an
easy-to-use front-end for your Java applets. Further, because a Web publisher is not required to know about classes to use JavaScript and to pass parameters to a Java applet, JavaScript provides a simple solution for publishers who want to use the features
of the Java language but don't want to learn how to program in Java.
This powerful up-and-coming scripting language is featured in Part IX, "JavaScript and Java."
With VBScript, Microsoft proves once again that it understands the tools developers need. Visual Basic Script is a subset of Visual Basic and is used to create highly interactive documents on the Web. Similar to JavaScript, programs written in VBScript
are embedded in the body of your HTML documents.
Visual Basic Script also enables dynamic use of OLE scripting management with ActiveX Controls. The Object Linking and Embedding of scripts enables Web publishers to dynamically embed VBScript runtime environments. Basically, this enables you to use
VBScripts as plug-in modules. You can, for example, embed a VBScript program in your Web document that calls other VBScript programs to use as plug-ins. The exact plug-in calls could be dynamically selected based on user input.
This powerful up-and-coming scripting language is featured in Part VIII, "VBScript and ActiveX."
At this point, you might be worried about having to program. You might also be wondering why you would want to use gateway scripts at all. These are valid concerns. Learning a programming language isn't easy, but as you will see later, you might never
have to program at all. Dozens of ready-to-use CGI scripts are freely available on the Web. Often you can use these existing programs to meet your needs.
The primary reason to use CGI scripts is to automate what would otherwise be a manual and probably time-consuming process. Using CGI scripts benefits both you and your reader. The reader gets simplicity, automated responses to input, easy ways to make
submissions, and fast ways to conduct searches. Gateway scripts enable you to automatically process orders, queries, and much more. CGI programs are commonly used for the following purposes:
FrontPage WebBots perform many of the things that CGI scripts are used for. In fact, the only common CGI tasks FrontPage has not automated are the last three items in the previous list.
Gateway scripts are used to process input submitted by readers of your Web publications. The input usually consists of environment variables that the Web server passes to the gateway script. Environment variables describe the information being passed,
such as the version of CGI used on the server, the type of data, the size of the data, and other important information. Gateway scripts can also receive command-line arguments and standard input. To execute a CGI script, the script must exist on the
server you are referencing. You must also have a server that is both capable of executing gateway scripts and configured to handle the type of script you plan to use.
Readers pass information to a CGI script by activating a link containing a reference to the script. The gateway script processes the input and formats the results as output that the Web server can use. The Web server takes the results and passes them
back to the reader's browser. The browser displays the output for the reader.
The output from a gateway script begins with a header containing a directive to the server. Currently there are three valid server directives: Content-type, Location, and Status. The header can consist of a directive in the format of an HTTP header
followed by a blank line. The blank line separates the header from the data you are passing back to the browser. Output containing a Location or Status directive is usually a single line. This is because the directive contained on the Location or Status
line is all that the server needs, and when there is no subsequent data, you do not need to insert a blank line. The server interprets the output, sets environment variables, and passes the output to the client.
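To make the header format concrete, here is a minimal sketch of a gateway script's output in Python. It covers only the Content-type case described above; the function name cgi_response and the sample HTML are assumptions made for illustration.

```python
def cgi_response(body, content_type="text/html"):
    """Build CGI output: a Content-type header, a blank line, then the data."""
    # The empty line produced by the second newline separates header from data.
    return f"Content-type: {content_type}\n\n{body}"


if __name__ == "__main__":
    page = "<html><body><h1>Thank you</h1></body></html>"
    # A gateway script writes its output to standard output for the server.
    print(cgi_response(page))
```

Everything before the blank line is read by the server as the header, and everything after it is passed along as the document the reader sees.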
Any transaction between a client and server has many parts. These parts can be broken down into the following eight steps:
FrontPage enables you to set properties for forms using the Form Properties box. You can access this box whenever you add a push button to a form or by double-clicking on a push button in a form. While the push button's Properties box is displayed,
click on the Form button to display the Form Properties box. This dialog box has two main areas. The Form Handler area defines the type of handler that will process the input from the form. The Hidden Fields area defines form fields not visible to the
user.
To use a CGI script, select the Custom ISAPI, NSAPI, or CGI Script form handler. Next, click the Settings button. This opens the Settings For Custom Form Handler dialog box shown in Figure 34.1. As you can see, this dialog box has three fields: Action,
Method and Encoding Type. The next three sections discuss the values you can use for these fields.
Figure 34.1. Using CGI Scripts
The Action field specifies the action to be performed when a form is submitted. As a form without a defined action will not be processed in any way, you should always specify a value for the Action field. You can define an action for your forms as the
URL to a gateway script to be executed or as an actual action.
By specifying the URL to a gateway script, you can direct input to the script for processing. The URL provides a relative or an absolute path to the script. Scripts defined with relative URLs are located on your local server. Scripts defined with
absolute URLs can be located on a remote or local server. Most CGI scripts are located in the cgi-bin directory. You could access a script in a cgi-bin directory by setting the Action field to
http://tvp.com/cgi-bin/your_script
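The difference between relative and absolute URLs can be seen by resolving each against the address of the page that holds the form. The sketch below uses Python's standard urllib.parse.urljoin; the page address http://tvp.com/forms/order.html and the helper name resolve_action are hypothetical and serve only as an example.

```python
from urllib.parse import urljoin


def resolve_action(page_url, action):
    """Return the full URL a form's Action value refers to from a given page."""
    return urljoin(page_url, action)


if __name__ == "__main__":
    page = "http://tvp.com/forms/order.html"  # hypothetical page containing the form
    for action in ["/cgi-bin/your_script",
                   "../cgi-bin/your_script",
                   "http://tvp.com/cgi-bin/your_script"]:
        print(action, "->", resolve_action(page, action))
```

With this page address, all three Action values resolve to the same script on the local server; an absolute URL naming a different host would instead point to a script on that remote server.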
You can also use the Action field to specify an actual action to be performed. The only action currently supported is mailto, which enables you to mail the contents of a form to anyone using their e-mail address. Most current
browser and server software support the mailto value. To use the mailto value, set the Action field as follows:
mailto:name@host
Here, name is the user name and host is the host machine the user is located on, as in the following example:
mailto:publisher@tvp.com
A form created using the previous example would be sent to publisher@tvp.com. The mailto value provides you with a simple solution for using forms that do not need to be directed to a CGI
script to be processed. This is great news for Web publishers who don't have access to CGI and can't use FrontPage server extensions. As the contents of the form are mailed directly to an intended recipient, the data can be processed off-line as necessary.
You should consider using the mailto value for forms that don't need immediate processing and when you don't have access to CGI or FrontPage server extensions but would like to use forms in your Web publications.
The Method field specifies the way the form is submitted. There are currently two acceptable values:
GET
POST
The preferred submission method is POST, the default value used by FrontPage. POST sends the data as a separate input stream via the server to your gateway script.
This enables the server to pass the information directly to the gateway script without assigning variables or arguments. The value of an environment variable called CONTENT_LENGTH tells the CGI script how much data to read from
the standard input stream. Using this method, there is no limit on the amount of data that can be passed to the server.
GET appends the retrieved data to the script URL. The script URL and the data are passed to the server as a single URL-encoded input. The server receiving the input passes it to two variables: the script URL to SCRIPT_NAME and the data to QUERY_STRING.
Assigning the data to variables on a UNIX system means passing the data through the UNIX shell. The number of characters you can send to UNIX shell in a single input is severely limited. Some servers restrict the length of this type of input to 255
characters. This means you can append only a limited amount of data to a URL before truncation occurs. You lose data when truncation occurs, and losing data is a bad thing. Consequently, if you use GET, you should always ensure
that the length of data input is small.
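From the script's point of view, the two methods differ only in where the data is found. The following Python sketch shows one way a gateway script might retrieve the raw form data under either method; the function name read_form_data is hypothetical, and the environment is passed in as a plain dictionary so the example can be run without a Web server.

```python
import io


def read_form_data(environ, stdin):
    """Return the raw URL-encoded form data for a GET or POST request."""
    method = environ.get("REQUEST_METHOD", "GET").upper()
    if method == "POST":
        # CONTENT_LENGTH tells the script how much data to read from standard input.
        length = int(environ.get("CONTENT_LENGTH") or 0)
        return stdin.read(length).decode("ascii")
    # With GET, the server has already placed the data in QUERY_STRING.
    return environ.get("QUERY_STRING", "")


if __name__ == "__main__":
    data = b"name=Ann&color=red"
    post_env = {"REQUEST_METHOD": "POST", "CONTENT_LENGTH": str(len(data))}
    print(read_form_data(post_env, io.BytesIO(data)))
    get_env = {"REQUEST_METHOD": "GET", "QUERY_STRING": "name=Ann&color=red"}
    print(read_form_data(get_env, io.BytesIO(b"")))
```

In a real gateway script you would pass os.environ and sys.stdin.buffer in place of the stand-in dictionary and byte stream used here.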
The Encoding Type field specifies the MIME content type for encoding the form data. The client encodes the data before passing it to the server. The reason for encoding the data from a fill-out form is not to prevent the data from being read, but rather
to ensure that input fields can be easily matched to key values. By default, the data is encoded as application/x-www-form-urlencoded. This encoding is also called URL encoding. If you do not specify an encoding type, the default value is used
automatically.
Although in theory you can use any valid MIME type, such as text/plain, most forms on the Web use the default encoding, application/x-www-form-urlencoded. MIME stands for Multipurpose Internet Mail Extensions.
HTTP uses MIME to identify the type of object being transferred across the Internet. The purpose of encoding is to prevent problems you would experience when trying to manipulate data that has not been encoded in some way.
You do not have to set a value for this field. However, if you wanted to strictly specify the default encoding, you would set the Encoding Type field to the following value:
application/x-www-form-urlencoded
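To see what URL encoding does to form data, consider the short Python sketch below. It uses the standard urllib.parse functions urlencode and parse_qs; the field names and values, and the helper names encode_form and decode_form, are made up for the example.

```python
from urllib.parse import urlencode, parse_qs


def encode_form(fields):
    """URL-encode a dictionary of form fields, as a browser does before sending."""
    return urlencode(fields)


def decode_form(data):
    """Decode URL-encoded data back into field names and lists of values."""
    return parse_qs(data)


if __name__ == "__main__":
    encoded = encode_form({"name": "Ann Lee", "email": "ann@tvp.com"})
    print(encoded)
    print(decode_form(encoded))
```

Each field becomes a key=value pair, pairs are joined with ampersands, spaces become plus signs, and other special characters become a percent sign followed by a hexadecimal code. This is what lets a script match input fields to key values reliably.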
When a user activates a link to a gateway script, input is sent to the server. The server formats this data into environment variables and checks to see whether additional data was submitted via the standard input stream.
Input to CGI scripts is usually in the form of environment variables. The environment variables passed to gateway scripts are associated with the browser requesting information from the server, the server processing the request, and the data passed in
the request. Environment variables are case-sensitive and are normally used as described in this section. Although some environment variables are system-specific, many environment variables are standard. The standard variables are shown in Table 11.1.
As later examples show, environment variables are set automatically whenever reader input is passed to a server. The primary reason to learn about these variables is to better understand how input is passed to CGI scripts, but you should also learn
about these variables so you know how to take advantage of them when necessary.
Variable | Description |
AUTH_TYPE | Specifies the authentication method and is used to validate a user's access. |
CONTENT_LENGTH | Used to provide a way of tracking the length of the data string as a numeric value. |
CONTENT_TYPE | Indicates the MIME type of data. |
GATEWAY_INTERFACE | Indicates the version of the CGI standard the server is using. |
HTTP_ACCEPT | Indicates the MIME content types the browser will accept, as passed to the gateway script via the server. |
HTTP_USER_AGENT | Indicates the type of browser used to send the request, as passed to the gateway script via the server. |
PATH_INFO | Identifies the extra information included in the URL after the identification of the CGI script. |
PATH_TRANSLATED | Set by the server based on the PATH_INFO variable. The server translates the PATH_INFO variable into this variable. |
QUERY_STRING | Set to the query string (if the URL contains a query string). |
REMOTE_ADDR | Identifies the Internet Protocol address of the remote computer making the request. |
REMOTE_HOST | Identifies the name of the machine making the request. |
REMOTE_IDENT | Identifies the remote user making the request, as reported by the remote machine's identification service. |
REMOTE_USER | Identifies the user name as authenticated by the server. |
REQUEST_METHOD | Indicates the method by which the request was made. |
SCRIPT_NAME | Identifies the virtual path to the script being executed. |
SERVER_NAME | Identifies the server by its host name, alias, or IP address. |
SERVER_PORT | Identifies the port number the server received the request on. |
SERVER_PROTOCOL | Indicates the protocol of the request sent to the server. |
SERVER_SOFTWARE | Identifies the Web server software. |
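A quick way to become familiar with these variables is to have a gateway script report the ones it receives. The following Python sketch is one possible approach, assuming the server passes the variables through the process environment; the function name describe_request is hypothetical.

```python
import os

# The standard CGI environment variables listed in the table above.
STANDARD_VARIABLES = [
    "AUTH_TYPE", "CONTENT_LENGTH", "CONTENT_TYPE", "GATEWAY_INTERFACE",
    "HTTP_ACCEPT", "HTTP_USER_AGENT", "PATH_INFO", "PATH_TRANSLATED",
    "QUERY_STRING", "REMOTE_ADDR", "REMOTE_HOST", "REMOTE_IDENT",
    "REMOTE_USER", "REQUEST_METHOD", "SCRIPT_NAME", "SERVER_NAME",
    "SERVER_PORT", "SERVER_PROTOCOL", "SERVER_SOFTWARE",
]


def describe_request(environ):
    """Return (name, value) pairs for the standard variables that have a value."""
    return [(name, environ[name]) for name in STANDARD_VARIABLES if environ.get(name)]


if __name__ == "__main__":
    # Emit a plain-text document listing whatever the server supplied.
    print("Content-type: text/plain")
    print()
    for name, value in describe_request(os.environ):
        print(f"{name} = {value}")
```

Run from a shell prompt, the script will usually print only the header, because most of these variables are set only when a Web server starts the script.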
The AUTH_TYPE variable is used for access control to protected areas of the Web server and can be used only on servers that support user authentication. If an area of the Web site has access control, the AUTH_TYPE variable is set to a specific value that identifies the authentication scheme being used.
Otherwise, the variable has no value associated with it. A simple challenge-response authorization mechanism is implemented under current versions of HTTP.
Using this mechanism, the server can challenge a client's request and the client can respond. To do this, the server sets a value for the AUTH_TYPE variable and the client supplies a matching value. The next step is to
authenticate the user. Using the basic authentication scheme, the user's browser must supply authentication information that uniquely identifies the user. This information includes a user ID and password.
Under the current implementation of HTTP, HTTP 1.0, the basic authentication scheme is the most commonly used authentication method. To specify this method, set the AUTH_TYPE variable as follows:
AUTH_TYPE = Basic
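Under the basic scheme, the browser joins the user ID and password with a colon and sends the result in Base64 form. The Python sketch below illustrates that encoding only; the function names are hypothetical, and in practice the Web server, not the gateway script, normally checks these credentials. Base64 is an encoding rather than encryption, so the basic scheme does not hide the password from anyone who can read the request.

```python
import base64


def encode_basic_credentials(user_id, password):
    """Encode a user ID and password the way the basic scheme transmits them."""
    token = f"{user_id}:{password}".encode("utf-8")
    return base64.b64encode(token).decode("ascii")


def decode_basic_credentials(encoded):
    """Recover the (user ID, password) pair from basic-scheme credentials."""
    text = base64.b64decode(encoded).decode("utf-8")
    user_id, _, password = text.partition(":")
    return user_id, password


if __name__ == "__main__":
    encoded = encode_basic_credentials("Aladdin", "open sesame")
    print(encoded)
    print(decode_basic_credentials(encoded))
```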
The CONTENT_LENGTH variable provides a way of tracking the length of the data string. This tells the client and server how much data to read on the standard input stream. The value of the variable corresponds to the number
of characters in the data passed with the request. If no data is being passed, the variable has no value.
As long as the characters are represented as octets, the value of the CONTENT_LENGTH variable will be the precise number of characters passed as standard input or standard output. Thus, if 25 characters are passed and they
are represented as octets, the CONTENT_LENGTH variable will have the following value:
CONTENT_LENGTH = 25
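The distinction between characters and octets matters once the data contains characters outside plain ASCII. The short Python sketch below, with a hypothetical helper named content_length, counts the octets a string occupies in a given encoding; the sample strings are made up for the example.

```python
def content_length(data, encoding="utf-8"):
    """Return the number of octets the data occupies in the given encoding."""
    return len(data.encode(encoding))


if __name__ == "__main__":
    samples = ["a" * 25, "name=Ann&color=red", "caf\u00e9"]
    for text in samples:
        print(len(text), "characters,", content_length(text), "octets")
```

For plain ASCII data the two counts agree, which is the case the text describes. A character such as an accented e occupies one octet in ISO-8859-1 but two in UTF-8, so the counts can differ.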
The CONTENT_TYPE variable indicates the data's MIME type. MIME typing is a feature of HTTP 1.0 and is not available on servers using HTTP 0.9. The variable is set only when attached data is passed using the standard input
or output stream. The value assigned to the variable identifies the MIME type and subtype as follows:
CONTENT_TYPE = type/subtype
MIME types are broken down into basic type categories. Each data type category has a primary subtype associated with it. The basic MIME types and their descriptions are shown in Table 11.2.
Type | Description |
application | Binary data that can be executed or used with another application |
audio | A sound file that requires an output device to preview |
image | A picture that requires an output device to preview |
message | An encapsulated mail message |
multipart | Data consisting of multiple parts and possibly many data types |
text | Textual data that can be represented in any character set or formatting language |
video | A video file that requires an output device to preview |
x-world | Experimental data type for world files |
MIME subtypes are defined in three categories: primary, additionally defined, and extended. The primary subtype is the primary type of data adopted for use as MIME Content-Types. Additionally defined data types are additional subtypes that have been
officially adopted as MIME Content-Types. Extended data types are experimental subtypes that have not been officially adopted as MIME Content-Types. You can easily identify extended subtypes because they begin with the letter x followed by a hyphen. Table
11.3 lists common MIME types and their descriptions.
Type/Subtype | Description |
application/mac-binhex40 | Macintosh binary-formatted data |
application/msword | Microsoft Word document |
application/octet-stream | Binary data that can be executed or used with another application |
application/pdf | Adobe Acrobat PDF document |
application/postscript | PostScript-formatted data |
application/rtf | Rich Text Format (RTF) document |
application/x-compress | Data that has been compressed using UNIX compress |
application/x-dvi | Device-independent file |
application/x-gzip | Data that has been compressed using UNIX gzip |
application/x-latex | LaTeX document |
application/x-tar | Data that has been archived using UNIX tar |
application/x-zip-compressed | Data that has been compressed using PKZip or WinZip |
audio/basic | Audio in a nondescript format |
audio/x-aiff | Audio in Apple AIFF format |
audio/x-wav | Audio in Microsoft WAV format |
image/gif | Image in GIF format |
image/jpeg | Image in JPEG format |
image/tiff | Image in TIFF format |
image/x-portable-bitmap | Portable bitmap |
image/x-portable-graymap | Portable graymap |
image/x-portable-pixmap | Portable pixmap |
image/x-xbitmap | X-bitmap |
image/x-xpixmap | X-pixmap |
message/external-body | Message with external data source |
message/partial | Fragmented or partial message |
message/rfc822 | RFC-822-compliant message |
multipart/alternative | Data with alternative formats |
multipart/digest | Multipart message digest |
multipart/mixed | Multipart message with data in multiple formats |
multipart/parallel | Multipart data with parts that should be viewed simultaneously |
text/html | HTML-formatted text |
text/plain | Plain text with no HTML formatting included |
video/mpeg | Video in the MPEG format |
video/quicktime | Video in the Apple QuickTime format |
video/x-msvideo | Video in the Microsoft AVI format |
x-world/x-vrml | VRML world file |
Some MIME Content-Types accept additional parameters: text/plain, text/html, and all multipart message data. The charset parameter is used with the text/plain type to identify the character set used for the data. The version parameter is used with the text/html type to identify the version of HTML used. The boundary parameter is used with multipart data to identify the boundary string that separates message parts.
The charset parameter for the text/plain type is optional. If a charset is not specified, the default value charset=us-ascii is assumed.
Other values for charset include any character set approved by the International Organization for Standardization (ISO). These character sets are named ISO-8859-1 through ISO-8859-9 and are specified as follows:
CONTENT_TYPE = text/plain; charset=iso-8859-1
The version parameter for the text/html type is optional. If this parameter is set, the browser reading the data interprets the data if it supports the version of HTML specified. The
following setting identifies a document that conforms to the HTML 3.2 specification:
CONTENT_TYPE = text/html; version=3.2
The boundary parameter for multipart message types is required. The boundary value is set to a string of 1 to 70 characters. Although the string cannot end in a space, the string can contain any
valid letter or number and can include spaces and a limited set of special characters. Boundary parameters are unique strings that are defined as follows:
CONTENT_TYPE = multipart/mixed; boundary=boundary_string
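A gateway script often needs only the type/subtype portion of a Content-Type value, or only one of its parameters. The following Bourne shell sketch shows one way to separate them. The function names media_type and type_param are hypothetical, and the sketch assumes parameters are written as name=value without quotation marks.

```shell
#!/bin/sh
# Sketch: split a Content-Type value into its type/subtype and one parameter.
# media_type and type_param are illustrative names, not part of CGI.

# Print the type/subtype portion (everything before the first semicolon).
media_type() {
    printf '%s\n' "${1%%;*}" | tr -d ' '
}

# Print the value of the named parameter, or nothing if it is absent.
# $1 = Content-Type value, $2 = parameter name
type_param() {
    printf '%s\n' "$1" | tr ';' '\n' | sed -n "s/^ *$2=//p"
}

media_type "text/plain; charset=iso-8859-1"
type_param "text/plain; charset=iso-8859-1" charset
```

Run as is, the two calls at the end print the media type and then the character set named in the sample value.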
The GATEWAY_INTERFACE variable indicates the version of the CGI specification the server is using. The value assigned to the variable identifies the name and version of the specification used as follows:
GATEWAY_INTERFACE = name/version
The current version of the CGI specification is 1.1. A server conforming to this version would set the GATEWAY_INTERFACE variable as follows:
GATEWAY_INTERFACE = CGI/1.1
The HTTP_ACCEPT variable defines the types of data the client will accept. Each acceptable value is expressed as a type/subtype pair, and the pairs are separated by commas, as in
type/subtype, type/subtype
Most clients accept dozens of MIME types. The following identifies all the MIME Content-Types accepted by this client:
HTTP_ACCEPT = application/msword, application/octet-stream, application/postscript, application/rtf, application/x-zip-compressed, audio/basic, audio/x-aiff, audio/x-wav, image/gif, image/jpeg, image/tiff, image/x-portable-bitmap, message/external-body, message/partial, message/rfc822, multipart/alternative, multipart/digest, multipart/mixed, multipart/parallel, text/html, text/plain, video/mpeg, video/quicktime, video/x-msvideo
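Before a script returns data in a particular format, it can check whether the client listed that format. The following is a minimal sketch in Bourne shell. The function name accepts is hypothetical, and the sketch assumes a plain comma-separated list like the one shown above, so wildcard entries are not handled.

```shell
#!/bin/sh
# Sketch: test whether a type/subtype pair appears in an HTTP_ACCEPT list.
# accepts is an illustrative name; wildcard entries are not handled.
accepts() {
    # $1 = type/subtype to look for, $2 = comma-separated list
    list=$(printf '%s' "$2" | tr -d ' ')
    case ",$list," in
        *",$1,"*) return 0 ;;
        *)        return 1 ;;
    esac
}

if accepts image/gif "${HTTP_ACCEPT:-}"; then
    echo "client accepts GIF images"
else
    echo "client did not list image/gif"
fi
```

Wrapping the list in commas lets the case pattern match whole entries only, so a search for image/gif does not match a longer entry that merely begins with the same text.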
The HTTP_USER_AGENT variable identifies the type of browser used to send the request. The acceptable values are expressed as software type/version or library/version. The following HTTP_USER_AGENT variable identifies Netscape Navigator version 2.0:
HTTP_USER_AGENT = Mozilla/2.0
As you can see, Netscape uses the alias Mozilla to identify itself. The primary types of clients that set this variable are browsers, Web spiders, and robots. Although this is a useful parameter for identifying the type of client used to access a
script, keep in mind that not all clients set this variable.
Here's a list of software type values used by popular browsers:
Arena
Enhanced NCSA Mosaic
Lynx
MacWeb
Mozilla
NCSA Mosaic
NetCruiser
WebExplorer
WinMosaic
These values are used by Web spiders:
Lycos
MOMSpider
WebCrawler
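If you want a script to treat browsers and Web spiders differently, you can test the software type that precedes the slash. The following Bourne shell sketch uses only the values listed above. The function name client_kind is hypothetical, and any value not in the lists is reported as unknown.

```shell
#!/bin/sh
# Sketch: classify a client from the software type in HTTP_USER_AGENT.
# client_kind is an illustrative name; the lists mirror the values above.
client_kind() {
    name=${1%%/*}    # keep the part before the first slash
    case $name in
        Arena|"Enhanced NCSA Mosaic"|Lynx|MacWeb|Mozilla|"NCSA Mosaic"|NetCruiser|WebExplorer|WinMosaic)
            echo browser ;;
        Lycos|MOMSpider|WebCrawler)
            echo spider ;;
        *)
            echo unknown ;;
    esac
}

client_kind "${HTTP_USER_AGENT:-}"
```

Because not all clients set this variable, the sketch falls back to unknown rather than guessing.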
The PATH_INFO variable specifies extra path information and can be used to send additional information to a gateway script. The extra path information follows the URL to the gateway script referenced. Generally, this
information is a virtual or relative path to a resource that the server must interpret. If the URL to the CGI script is specified in your document as
/usr/cgi-bin/formparse.pl/home.html
then the PATH_INFO variable would be set as follows:
PATH_INFO = /home.html
Servers translate the PATH_INFO variable into the PATH_TRANSLATED variable by inserting the default Web document directory path in front of the extra path information. For
example, if the PATH_INFO variable was set to /home.html and the default directory was /usr/documents/pubs, the PATH_TRANSLATED variable
would be set as follows:
PATH_TRANSLATED = /usr/documents/pubs/home.html
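The mapping the server performs can be sketched in a few lines of Bourne shell. This is only an illustration of the idea, not how any particular server is implemented. DOC_ROOT, extra_path, and translate_path are hypothetical names, and DOC_ROOT stands in for the server's configured document directory.

```shell
#!/bin/sh
# Sketch of the path mapping a server performs. DOC_ROOT is a hypothetical
# stand-in for the server's configured document directory.
DOC_ROOT=/usr/documents/pubs

# Print the extra path that follows the script's own path in a URL.
# $1 = requested URL path, $2 = path of the script
extra_path() {
    printf '%s\n' "${1#"$2"}"
}

# Print the file system path for a given extra path.
translate_path() {
    printf '%s%s\n' "${DOC_ROOT%/}" "$1"
}

info=$(extra_path /usr/cgi-bin/formparse.pl/home.html /usr/cgi-bin/formparse.pl)
echo "PATH_INFO = $info"
echo "PATH_TRANSLATED = $(translate_path "$info")"
```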
The QUERY_STRING variable specifies a URL-encoded search string. The server sets this variable when the GET method is used to submit a fill-out form, or when an ISINDEX
query is used to search a document. The query string is separated from the URL by a question mark, and everything following the question mark is passed as the query string. Here is an example:
/usr/cgi-bin/formparse.pl?string
When the query string is URL-encoded, the browser encodes key parts of the string. The plus sign is a placeholder between words, as a substitute for spaces:
/usr/cgi-bin/formparse.pl?word1+word2+word3
Equal signs separate keys assigned by the publisher from values entered by the user. In the following example, response is the key assigned by the publisher, and never is the value entered by
the user:
/usr/cgi-bin/formparse.pl?response=never
Ampersand symbols separate sets of keys and values. In the following example, response is the first key assigned by the publisher, and sometimes is the value entered by the user. The second key
assigned by the publisher is reason, and the value entered by the user is "I am not really sure".
Here is an example:
/usr/cgi-bin/formparse.pl?response=sometimes&reason=I+am+not+really+sure
Finally, the percent sign identifies escape characters. Following the percent sign is the hexadecimal code for a special character. Here is how the previous query string would look if the user had entered I'm instead of I am, using the escape code for
an apostrophe:
/usr/cgi-bin/formparse.pl?response=sometimes&reason=I%27m+not+really+sure
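The encoding rules above can be reversed in a gateway script. The following Bourne shell sketch splits a query string at the ampersands and equal signs, turns plus signs back into spaces, and replaces each percent escape with its character. The function names urldecode and parse_query are hypothetical. Because of the way the shell captures output, an encoded line break at the end of a value is dropped, so treat this as a starting point rather than a complete decoder.

```shell
#!/bin/sh
# Sketch: decode a URL-encoded query string into key=value lines.
# urldecode and parse_query are illustrative names, not standard commands.

# Decode one token: '+' becomes a space and %XX becomes the character
# whose hexadecimal code is XX.
urldecode() {
    rest=$(printf '%s' "$1" | tr '+' ' ')
    out=
    while [ -n "$rest" ]; do
        case $rest in
            %[0-9A-Fa-f][0-9A-Fa-f]*)
                hex=${rest#?}              # drop the percent sign
                hex=${hex%"${hex#??}"}     # keep the two hex digits
                octal=$(printf '%03o' "0x$hex")
                out=$out$(printf "\\$octal")
                rest=${rest#???}
                ;;
            *)
                out=$out${rest%"${rest#?}"}   # copy one character as is
                rest=${rest#?}
                ;;
        esac
    done
    printf '%s\n' "$out"
}

# Print one key=value line per pair, with both parts decoded.
parse_query() {
    printf '%s\n' "$1" | tr '&' '\n' | while IFS= read -r pair; do
        [ -n "$pair" ] || continue
        key=${pair%%=*}
        case $pair in
            *=*) value=${pair#*=} ;;
            *)   value= ;;
        esac
        printf '%s=%s\n' "$(urldecode "$key")" "$(urldecode "$value")"
    done
}

sample='response=sometimes&reason=I%27m+not+really+sure'
parse_query "${QUERY_STRING:-$sample}"
```

Plus signs are converted before the percent escapes so that an encoded plus sign (%2B) survives as a literal plus sign.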
The REMOTE_ADDR variable is set to the Internet Protocol (IP) address of the remote computer making the request. The IP address is a numeric identifier for a networked computer. The REMOTE_ADDR
variable is associated with the host computer making the request for the client and could be set as follows:
REMOTE_ADDR = 205.1.20.11
The REMOTE_HOST variable specifies the name of the host computer making a request. This variable is set only if the server can figure out this information using a reverse lookup procedure. If this variable is set, the full
domain and host name are used as follows:
REMOTE_HOST = www.tvp.com
The REMOTE_IDENT variable identifies the remote user making a request. The variable is set only if the server and the remote machine making the request support the identification protocol. Further, information on the remote
user is not always available, so you should not rely on it even when it is available. If the variable is set, the associated value is a fully expressed name that contains the domain information as well, such as
REMOTE_IDENT = william.www.tvp.com
The REMOTE_USER variable is the user name as authenticated by the server, and as such is the only variable you should rely upon to identify a user. As with other types of user authentication, this variable is set only if the
server supports user authentication and if the gateway script is protected. If the variable is set, the associated value is the user's identification as sent by the client to the server, such as
REMOTE_USER = william
The REQUEST_METHOD variable specifies the method by which the request was made. For HTTP 1.0, the methods could be any of the following:
GET
HEAD
POST
PUT
DELETE
LINK
UNLINK
The GET, HEAD, and POST methods are the most commonly used request methods. Both GET and POST are used to
submit forms. The HEAD method could be specified as follows:
REQUEST_METHOD = HEAD
The SCRIPT_NAME variable specifies the virtual path to the script being executed. This is useful if the script generates an HTML document that references the script. If the URL specified in your HTML document is
http://tvp.com/cgi-bin/formparse.pl
the SCRIPT_NAME variable is set as follows:
SCRIPT_NAME = /cgi-bin/formparse.pl
The SERVER_NAME variable identifies the server by its host name, alias, or IP address. This variable is always set and could be specified as follows:
SERVER_NAME = tvp.com
The SERVER_PORT variable specifies the port number on which the server received the request. This information can be interpreted from the URL to the script if necessary. However, most servers use the default port of 80 for
HTTP requests. If the URL specified in your HTML document is
http://www.ncsa.edu:8080/cgi-bin/formparse.pl
the SERVER_PORT variable is set as follows:
SERVER_PORT = 8080
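A script that generates a form pointing back at itself can assemble its own URL from SERVER_NAME, SERVER_PORT, and SCRIPT_NAME. The following Bourne shell sketch shows the idea. The function name self_url is hypothetical, the localhost and 80 defaults are placeholders for running the sketch outside a server, and the sketch assumes plain HTTP.

```shell
#!/bin/sh
# Sketch: build a URL that points back at the running script.
# self_url is an illustrative name; the defaults are placeholders.
self_url() {
    host=${SERVER_NAME:-localhost}
    port=${SERVER_PORT:-80}
    if [ "$port" = 80 ]; then
        # The default HTTP port can be left out of the URL.
        printf 'http://%s%s\n' "$host" "${SCRIPT_NAME:-}"
    else
        printf 'http://%s:%s%s\n' "$host" "$port" "${SCRIPT_NAME:-}"
    fi
}

echo "<FORM METHOD=\"POST\" ACTION=\"$(self_url)\">"
```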
The SERVER_PROTOCOL variable identifies the protocol used to send the request. The value assigned to the variable identifies the name and version of the protocol used. The format is name/version, such as HTTP/1.0. The variable is set as follows:
SERVER_PROTOCOL = HTTP/1.0
The SERVER_SOFTWARE variable identifies the name and version of the server software. The format for values assigned to the variable is name/version, such as CERN/2.17. The variable is set as follows:
SERVER_SOFTWARE = CERN/2.17
Most input sent to a Web server is used to set environment variables, yet not all input fits neatly into an environment variable. When a user submits actual data to be processed by a gateway script, this data is received as a URL-encoded search string
or via the standard input stream. The server knows how to process actual data by the method used to submit the data.
Sending data as standard input is the most direct way to send data. The server tells the gateway script, through the CONTENT_LENGTH variable, how many bytes (8-bit sets of data) to read from standard input. The script opens the standard input stream and reads the specified amount of data.
Although long URL-encoded search strings might get truncated, data sent on the standard input stream will not. Consequently, the standard input stream is the preferred way to pass data.
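You can identify a submission method when you create your fill-out forms. Under HTTP 1.0, there are two submission methods for forms:
GET: Sends the data as a URL-encoded search string
POST: Sends the data via the standard input stream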
Let's create a sample Web document containing a form with three key fields: NAME, ADDRESS, and PHONE_NUMBER. Assume the URL to the script is http://www.tvp.com/cgi-bin/survey.pl and the user responds as follows:
Sandy Brown
12 Sunny Lane WhoVille, USA
987-654-3210
Identical information submitted using the GET and POST methods is treated differently by the server. When the GET method is used, the server sets the following
environment variables then passes the input to the survey.pl script:
PATH=/bin:/usr/bin:/usr/etc:/usr/ucb
SERVER_SOFTWARE = CERN/3.0
SERVER_NAME = www.tvp.com
GATEWAY_INTERFACE = CGI/1.1
SERVER_PROTOCOL = HTTP/1.0
SERVER_PORT = 80
REQUEST_METHOD = GET
HTTP_ACCEPT = text/plain, text/html, application/rtf, application/postscript, audio/basic, audio/x-aiff, image/gif, image/jpeg, image/tiff, video/mpeg
PATH_INFO =
PATH_TRANSLATED =
SCRIPT_NAME = /cgi-bin/survey.pl
QUERY_STRING = NAME=Sandy+Brown&ADDRESS=12+Sunny+Lane+WhoVille,+USA&PHONE_NUMBER=987-654-3210
REMOTE_HOST =
REMOTE_ADDR =
REMOTE_USER =
AUTH_TYPE =
CONTENT_TYPE =
CONTENT_LENGTH =
When the POST method is used, the server sets the following environment variables and then passes the input to the survey.pl script:
PATH=/bin:/usr/bin:/usr/etc:/usr/ucb
SERVER_SOFTWARE = CERN/3.0
SERVER_NAME = www.tvp.com
GATEWAY_INTERFACE = CGI/1.1
SERVER_PROTOCOL = HTTP/1.0
SERVER_PORT = 80
REQUEST_METHOD = POST
HTTP_ACCEPT = text/plain, text/html, application/rtf, application/postscript, audio/basic, audio/x-aiff, image/gif, image/jpeg, image/tiff, video/mpeg
PATH_INFO =
PATH_TRANSLATED =
SCRIPT_NAME = /cgi-bin/survey.pl
QUERY_STRING =
REMOTE_HOST =
REMOTE_ADDR =
REMOTE_USER =
AUTH_TYPE =
CONTENT_TYPE = application/x-www-form-urlencoded
CONTENT_LENGTH = 78
The following POST-submitted data is passed to the gateway script via the standard input stream:
NAME=Sandy+Brown&ADDRESS=12+Sunny+Lane+WhoVille,+USA&PHONE_NUMBER=987-654-3210
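Because the two methods deliver the same data through different channels, a script usually begins by checking REQUEST_METHOD. The following Bourne shell sketch shows one way to collect the submitted data in either case. The function name read_form_data is hypothetical, and the sketch treats a missing REQUEST_METHOD as GET so that it can be run outside a server.

```shell
#!/bin/sh
# Sketch: fetch submitted form data regardless of the submission method.
# read_form_data is an illustrative name.
read_form_data() {
    case ${REQUEST_METHOD:-GET} in
        POST)
            # Read exactly CONTENT_LENGTH bytes from standard input.
            dd bs=1 count="${CONTENT_LENGTH:-0}" 2>/dev/null
            ;;
        *)
            # GET submissions and ISINDEX queries arrive in QUERY_STRING.
            printf '%s' "${QUERY_STRING:-}"
            ;;
    esac
}

form_data=$(read_form_data)
echo "Content-Type: text/plain"
echo
echo "Received: $form_data"
```

Reading a fixed number of bytes matters because the server is not required to signal the end of the input stream. The script should stop once it has read the amount given in CONTENT_LENGTH.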
After the script has completed processing the input, the script should return output to the server. The server will then return the output to the client. Generally, this output is in the form of an HTTP response that includes a header followed by a
blank line and a body. Although the CGI header output is strictly formatted, the body of the output is formatted in the manner you specify in the header. For example, the body can contain an HTML document for the client to display.
CGI headers contain directives to the server. Currently there are three valid server directives:
Content-Type
Location
Status
A single header can contain one or all of the server directives. Your CGI script would output these directives to the server. Although the header is followed by a blank line that separates the header from the body, the output does not have to contain a
body.
The Content-Type field in a CGI header identifies the MIME type of the data you are sending back to the client. Usually the data output from a script is a fully formatted document, such as an HTML document. You could specify this in the header as
follows:
Content-Type: text/html
The output of your script doesn't have to be a document created within the script. You can reference any document on the Web using the Location field. The Location field references a file by its URL. Servers process location references either directly
or indirectly depending on the location of the file. If the server can find the file locally, it passes the file to the client. Otherwise, the server redirects the URL to the client and the client has to retrieve the file. You can specify a location in a
script as follows:
Location: http://www.tvpress.com/
NOTE
Some older browsers don't support automatic redirection. Consequently, you might want to consider adding an HTML-formatted message body to the output. This message body will only be displayed if a browser cannot use the location URL.
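A redirect with a fallback body might be written as follows in Bourne shell. This is a minimal sketch. The function name redirect is hypothetical, and the wording of the fallback message is only an example.

```shell
#!/bin/sh
# Sketch: redirect the client and include a fallback body for browsers
# that do not follow the Location field. redirect is an illustrative name.
redirect() {
    echo "Location: $1"
    echo "Content-Type: text/html"
    echo
    echo "<HTML><HEAD><TITLE>Document Moved</TITLE></HEAD>"
    echo "<BODY><P>The document is available <A HREF=\"$1\">here</A>.</P></BODY></HTML>"
}

redirect "http://www.tvpress.com/"
```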
The Status field passes a status line to the server for forwarding to the client. Status codes are expressed as a three-digit code followed by a string that generally explains what has occurred. The first digit of a status
code shows the general status as follows:
1XX Not yet allocated
2XX Success
3XX Redirection
4XX Client error
5XX Server error
Although many status codes are used by servers, the status codes you pass to a client via your CGI script are usually client error codes. For example, let's say the script could not find a file, and you have specified that in such cases, instead of
returning nothing, it should output an error code. Here is a list of the client error codes you might want to use:
Status: 401 Unauthorized. Authentication has failed. User is not allowed to access the file and should try again.
Status: 403 Forbidden. The request is not acceptable. User is not permitted to access the file.
Status: 404 Not Found. The specified resource could not be found.
Status: 405 Method Not Allowed. The submission method used is not allowed.
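The file-not-found case described above could be handled as follows. This Bourne shell sketch returns the file when it can be read and a 404 status otherwise. The function name send_file is hypothetical, and using PATH_TRANSLATED to name the file is an assumption made for the example.

```shell
#!/bin/sh
# Sketch: send a file if it can be read, otherwise return a client error.
# send_file is an illustrative name.
send_file() {
    if [ -r "$1" ]; then
        echo "Content-Type: text/html"
        echo
        cat "$1"
    else
        echo "Status: 404 Not Found"
        echo "Content-Type: text/html"
        echo
        echo "<HTML><BODY><H1>404 Not Found</H1></BODY></HTML>"
    fi
}

send_file "${PATH_TRANSLATED:-}"
```

Sending a short HTML body along with the status gives the reader something to see even if the browser does not display its own error page.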
Creating the output from a CGI script is easier than it might seem. All you have to do is format the output into a header and body using your favorite programming language. This section contains two examples. The first example is in the Perl
programming language. The second example is in the UNIX Bourne shell.
If you wanted the script to output a simple HTML document using Perl, here is how you could do it:
#!/usr/bin/perl
#Create header with extra line space
print "Content-Type: text/html\n\n";
#Add body in HTML format
print <<"MAIN";
<HTML><HEAD><TITLE>Output from Script</TITLE></HEAD>
<BODY>
<H1>Top 10 Reasons for Using CGI</H1>
<P>10. Customer feedback.</P>
<P>9. Obtaining questionnaire and survey responses.</P>
<P>8. Tracking visitor count.</P>
<P>7. Automating searches.</P>
<P>6. Creating easy database interfaces.</P>
<P>5. Building gateways to other protocols.</P>
<P>4. HTML 2.0 Image maps.</P>
<P>3. User Authentication.</P>
<P>2. On-line order processing.</P>
<P>1. Generating documents on the fly.</P>
</BODY>
MAIN
If you wanted the script to output a simple HTML document in Bourne shell, here's how you could do it:
#!/bin/sh
#Create header with extra line space
echo "Content-Type: text/html"
echo ""
#Add body in HTML format
cat << MAIN
<HTML><HEAD><TITLE>Output from Script</TITLE></HEAD>
<BODY>
<H1>Top 10 Reasons for Using CGI</H1>
<P>10. Customer feedback.</P>
<P>9. Obtaining questionnaire and survey responses.</P>
<P>8. Tracking visitor count.</P>
<P>7. Automating searches.</P>
<P>6. Creating easy database interfaces.</P>
<P>5. Building gateways to other protocols.</P>
<P>4. HTML 2.0 Image maps.</P>
<P>3. User Authentication.</P>
<P>2. On-line order processing.</P>
<P>1. Generating documents on the fly.</P>
</BODY>
MAIN
The server processing the output sets environment variables, creates an HTTP header, then sends the data on to the client. Here is how the HTTP header might look coming from a CERN Web Server:
HTTP/1.0 200 OK
MIME-Version: 1.0
Server: CERN/3.0
Date: Monday, 4-Mar-96 23:59:59 HST
Content-Type: text/html
Content-Length: 485

<HTML><HEAD><TITLE>Output from Script</TITLE></HEAD>
<BODY>
<H1>Top 10 Reasons for Using CGI</H1>
<P>10. Customer feedback.</P>
<P>9. Obtaining questionnaire and survey responses.</P>
<P>8. Tracking visitor count.</P>
<P>7. Automating searches.</P>
<P>6. Creating easy database interfaces.</P>
<P>5. Building gateways to other protocols.</P>
<P>4. HTML 2.0 Image maps.</P>
<P>3. User Authentication.</P>
<P>2. On-line order processing.</P>
<P>1. Generating documents on the fly.</P>
</BODY>
The common gateway interface opens the door for adding advanced features to your Web publications. This workhorse running quietly in the background enables fill-out forms, database queries, index searches, and creation of documents on the fly.
FrontPage allows you to easily add WebBots to pages that use forms. However, WebBots generally do not perform any post-submission processing. With CGI scripts, you can process input from forms automatically and generate output directly to the reader based
on the results of the processing.
Although CGI enhancement is a click of the mouse button away for most readers, it means extra work for Web publishers. Still, the payoff from adding CGI to your Web publications makes the extra effort truly
worthwhile.