Composing Good HTML

This document attempts to address stylistic points of HTML composition,
both at the document and the web level. It is available on the Web at
http://www.cs.cmu.edu/~tilt/cgh/ (if you are reading this via a mirror, you
may want to check the original to make sure you're seeing an up-to-date
version).
---------------------------------------------------------------------------

New: This is version 2.0.5; version 1 is still available for those who are
interested. Now that Web Weaving is on shelves near you, it seems
appropriate for me to get off my duff and feed all of the changes back into
this document. See "Some History," below, for more information on what the
heck I'm talking about.
---------------------------------------------------------------------------

This document is divided into two main sections. The first section
discusses the document -- it should be recognizable as the revised version
of the original CGH. It discusses good practices to follow in creating your
documents, common errors and things to avoid when composing HTML, and
finally, a brief treatment of style sheets, which provide a mechanism for
greater control over how a document is rendered. The second section is
brand new -- it discusses style issues regarding your Web as a whole. How
it is divided and organized, how it is interlinked and intertwined; these
are the issues under consideration here.

This is not a beginner's guide; check the "For More Information" section
for pointers to more basic works, as well as for more advanced references
and tutorials. It is designed for the HTML author who has learned the
basics, and is ready to start thinking about the more advanced aspects of
Web document design.

Note: I'm not finished spiffing up this new version yet, but it's good
enough to be presentable, and I'd rather have the information available
than have it languish for lack of final polishing. At the very least, I
still need to:

   * Make some of the larger figures into a more manageable size
   * Provide rendered versions of the HTML examples
   * Add in some more useful links to other resources (suggestions
     appreciated!)
   * Break this into single and multipart versions by preparing multiview
     source documents

Unfortunately, the life of a grad student is not all cheese and wine (very
little of it, in fact), so these will have to come at a later date.
Besides, with the publication of Web Weaving (see the History section below
for background), it seems an appropriate time to also re-update this
document, so I won't let a little thing like a busy schedule stand in my
way.

Some History

I wrote the first version of "Composing Good HTML" in January of 1994. At
this point, the Web was just starting to explode, and Mosaic was the
browser on the tip of everyone's mouse. Being one of the strange few who
used Lynx as well as Mosaic (as well as Emacs-W3, when I was feeling
cocky), I noticed that different browsers dealt with incorrect usage of
HTML with varying degrees of success. When I pointed this out, the solution
suggested to me was to write a "lint" for HTML that would point out common
errors in documents. In preparation for this, I started making a list of
common errors, and turned that list into a human-readable document. That
document became "Composing Good HTML."

About that time the semester started, so I made the document publicly
available, and asked for comments and criticism. I got both, in spades! I
corrected errors (including a plethora of spelling and grammatical errors),
added some new sections, and revised pieces of existing sections. But, all
in all, CGH didn't really change much, even though things like Netscape and
HTML 3.0 (let alone Java and VRML) have snuck up in the meantime.

In January of 1995, Carl Steadman, Tyler Jones, and I got together with the
idea of writing a book about the Web (this was before the current explosion
of the market, so you'll pardon our naivete). Rather than writing a book
about HTML, we decided to write a book about creating and maintaining an
entire site -- including the stylistic points in CGH as a starting point.
The book is called Web Weaving, and it appeared on bookshelves on December
18th, 1995. The book is published by Addison-Wesley.

The side effect of all of this is that it gave me a reason to revise CGH to
reflect current practices for inclusion in Web Weaving. And now that we've
finally finished our book, this also means that the changes in CGH are
getting fed right back into the online version. Which, I'm proud to say, is
still freely available (and better than ever, I'd like to think). What you
see here is, by and large, Chapters 11 and 12 from Web Weaving, edited so
that they stand alone better. While I'd certainly recommend you read Web
Weaving for a full treatment of all the issues involved in building and
maintaining your Web site (and because every author hopes that his words
will be read), Composing Good HTML remains (I hope!) a useful resource for
HTML authors (and now Web designers) who want a slightly more sophisticated
treatment of the stylistic issues involved in, well, weaving your web.

I never did get around to writing that "lint" program, though.

Document Style Considerations

The World Wide Web has been a wildly successful experiment. It has filled a
need for both information users and for information providers: a tool which
allows information to be deployed to a wide variety of people over wide
geographic distances, regardless of what kind of computer they may be
running. All that is required to publish information is any one of a number
of Web servers, and all that is required to view that information is any
one of a number of Web clients. This is both an opportunity and a
challenge. This document discusses the ways in which you construct your
markup so that it is readable and usable for a wide range of browsers.

HTML provides a device-independent way of describing information. The
elements of HTML describe what your information is, not how it should be
displayed. This is a subtle point, and perhaps the most important one
presented here. HTML will let you describe this piece of information as a
header, or that piece of information as an address. It will not let you
describe this text as being in 24-point Helvetica, right justified. Your
challenge is to provide professional page layout and design without using
the traditional tools of professional page layout and design. Sound like a
paradox? Not really. All it involves is a bit of trust.

The trust you must have can be summarized by the following rule:

   * if you mark up a document so that your information is labeled as what
     it is instead of as how it should be displayed,
   * then browsers will render it in a way that is appropriate and
     professional-looking.

With the current diversity of clients for the Web (and we can only expect
to see more), it has become important to write HTML that will look good on
any client, and not just on the specific client which the author may have
access to. You must trust your markup. There is no way to anticipate how
every browser will (differently) render your HTML. If you follow this rule
you will get the best possible rendering with all browsers, instead of for
just one browser.

To this end, there are a few solutions. One approach is software based -- a
"lint"-like program for catching semantic errors in HTML, and perhaps even
correcting them. Two good examples of this are WebTech's HTML Validation
Service and WebLint. Another approach is the one taken by this document --
a style guide which points out common errors one might make in the
composition of HTML, and recommends good practices to follow.

Bear in mind when following these guidelines that your document may not end
up looking the best it possibly can on a particular browser. However, it
also will not look ugly on any browser, which is the risk you take by
disregarding these recommendations and tweaking your markup code for, say,
Netscape. Unfortunately, Netscape may render things differently from Lynx,
which may render things differently from Mosaic, and so on and so forth --
and even within a particular browser, a user may have chosen font or style
preferences different from the ones which you might assume. What these
guidelines should do, if followed, is make for a better presentation for
the most browsers (instead of the best presentation for only one) -- and
ensure that your documents reach the widest audience possible.

Good Practices

Things contained in this section are good practices for the generation of
any HTML document. Specifically, this would include anything which should
routinely be done in the creation of documents for the benefit of both
reader and author.

How to Use Non-Standard HTML

There are at least three major flavors of HTML currently in practice as
this is being written: HTML 2.0, HTML 3.0, and the Netscape extensions to
HTML 2.0. HTML 2.0 is the closest thing to current practice that is
available, and can be assumed to be "safe" for all browsers.

On the other hand, HTML 3.0 and the Netscape extensions are not widely
implemented, let alone standardized. Under most circumstances, this would
be a good reason not to use them until they were more widely available, but
there is the mitigating circumstance that all of the Netscape extensions
(and some of HTML 3.0, most notably tables) are supported by one of the
most popular Web browsers ... Netscape!

What should be done about this? Many Web authors take the approach that,
since most people use Netscape, it's acceptable to use the Netscape
elements, even if it is to the detriment of people using other browsers.
Others take the approach that nothing more than HTML 2.0 should ever be
used, which means that any benefit which might be derived from these
enhancements is lost.

The best road is a middle approach. Two good rules of thumb are:

   * If two or more popular browsers support the extension, it's probably
     fine to use. For instance, both Netscape and Mosaic (and Arena) now
     support tables, so any tables you use will be available to most of
     your audience.
   * If the extension is not widely supported, but it will not adversely
     affect your document if it is missing, it's probably fine to use. For
     instance, the FONT element changes the font size of text in the
     Netscape Navigator, but not in any other client. However, other
     clients will simply ignore tags they do not understand, so the text in
     the FONT element will still be readable. On the other hand, if the
     MATH element is ignored by a browser, the browser will display
     gibberish.
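
As a rough illustration of the second rule, a fragment like the following
should read sensibly whether or not the browser understands FONT. The
sentence and the SIZE value are invented for this sketch:

<P>
The sale ends <FONT SIZE="+2">Friday at noon</FONT>, so order soon.

A browser that ignores the FONT tags still displays "Friday at noon" as
ordinary text; only the size change is lost.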

In general, try to think about the effect that the non-standard elements
will have if they are not recognized. These elements can be used
intelligently, and on browsers that recognize them, can dramatically
enhance the presentation of your page. If it is not possible to use the
elements in such a way that rendering is still good on all clients, think
about providing multiple copies of the document (for instance, providing a
version of the table using the PRE element), and possibly using
content-negotiation on the server to provide the reader with the correct
version of the document.
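
As a sketch of what such an alternate version might look like, here is a
small table followed by the same information laid out with PRE. The rows
are only sample data, and the table markup assumes the HTML 3.0 table
elements as supported by Netscape:

<TABLE BORDER>
<TR><TH>Browser</TH><TH>Platform</TH></TR>
<TR><TD>Lynx</TD><TD>Text terminals</TD></TR>
<TR><TD>Arena</TD><TD>X Window System</TD></TR>
</TABLE>

<PRE>
Browser   Platform
-------   ---------------
Lynx      Text terminals
Arena     X Window System
</PRE>

The PRE version loses the borders, but it should remain readable on any
client.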

A final thought on the subject: try to avoid banners in your document that
claim that your document is "Enhanced for Netscape" or "Enhanced for HTML
3.0" (or the increasingly prevalent "Enhanced for Microsoft's Internet
Explorer." Ugh.) Rather, try to build your document so that if a reader
reads it in (for example) Netscape, it will be obvious that it uses the new
elements to good effect ... and if a reader reads it in another browser,
they can remain blissfully unaware of what they cannot see, and still be
impressed by what they do see.

(Opinion Alert: a general comment that may or may not place me on Bill
Gates' hit list -- while I have a healthy disregard for the cavalier
way in which most "extensions" are made de facto by the overwhelming will
of places like Netscape, I still have a healthy respect for those
extensions which attempt to solve an important problem in a useful way.
Many of the Netscape extensions, especially those involving tables, fit
this bill, and while Netscape has also produced many duds, it has
supported the valid HTML 3.0 alternatives that mirror its
extensions. However, in my opinion, every single one of the "Microsoft
extensions" is of dubious merit, and of certain incompatibility with any
evolving HTML 3.0 specification. Given the well developed state of HTML
3.0, introducing new and incompatible methods of doing the same thing is
irresponsible at the least. I highly recommend simply disregarding the
extensions introduced with Internet Explorer. Please note that I have the
highest respect for many of Microsoft's products; I even used Word and
Internet Assistant to compose this edition of this document [although I
edited the HTML afterward]. And, dear reader, this paragraph in particular
is highly opinion-ridden, so you must take it with a grain of salt as you
see fit. On with the useful stuff:)

Signing and time-stamping documents

One problem which faces anyone trying to find information using the
Internet is the question of "authoritativeness." The relative ease with
which WWW servers can be set up and populated with information means that
the traditional checks of the publishing process cannot act to filter out
information which is inaccurate or misleading. In addition, it can often be
hard to tell how current information found online is, or how actively it is
maintained and updated.

One thing which you can do to assist Web users is to sign and date all
documents in your infostructure, so that people viewing the documents can
form some impression of the authority of the document (i.e., how recent it
is, and how reliable the information provider is). This is not a complete
solution, but it is a large step forward.

For example:

<HR>
Last modified: March 6, 1995
<ADDRESS>
<A HREF="http://cs.cmu.edu/~tilt/">James Eric Tilton</A><BR>
<A HREF="mailto:tilt@cs.cmu.edu">tilt@cs.cmu.edu</A>
</ADDRESS>

Some notes about this example:

   * The date is given in an unambiguous format: "March 6, 1995". Why is
     this better than the more economical "3/6/95"? One reason is that for
     some of your audience, especially those from Europe, this means "June
     3, 1995".
   * A link to a home page is provided. If a reader is interested, she can
     follow it to find more information by this author. This provides a
     consistent centering function which helps keep a reader from becoming
     disoriented (see Main Roads and Scenic Paths, below).
   * A mailto: link anchors the document to the mail address of its
     creator. The mailto: URL specifies an e-mail address. Most browsers
     support this, allowing the reader to send e-mail to the address
     specified. This can be a useful way to get feedback. In addition, the
     mailto: link is separated from the link to the home page by a <BR>, so
     that the two links can be easily distinguished.

Another option for signing a document is to encode information about the
author in the document's header information. You can do this by including a
LINK element of type made in your HEAD element. For example:

<HEAD>
<TITLE>This is my Title</TITLE>
<LINK REV="made" HREF="mailto:author@some.site.org">
</HEAD>

This example uses the LINK element, which may be unfamiliar to you. This
element is equivalent to the A element; that is, it provides a link to some
other object. However, since it is part of the HEAD information (which is
information about the document, rather than part of the document itself),
this is a link from the entire document to another object. (Anchors, on the
other hand, are links from some small subset of the document, like a word
or a phrase, to another document.) This link, like most other HEAD
information, is typically not displayed by a browser, or followable by a
reader.

The fact that it is not displayed does not make it useless, however. Many
browsers, such as Lynx, supply a "reply to author" function. The
information about who the author is comes from using the LINK as above.
Other applications which can make use of the information include Web
spiders and other maintenance tools, which can benefit from having
authority information in machine readable format.

The format of the LINK element is the same as that of the A element. Notice
the use of the REV attribute, which describes this relationship as a
REVerse relationship of the type made. This means that this document was
made by the object at the other end of the link.
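
Putting the two approaches together, a signed document might look something
like the following sketch. The title, body text, author name, and addresses
are placeholders invented for the example, not anything prescribed by the
HTML specification:

<HTML>
<HEAD>
<TITLE>Notes on Baloney Sandwiches</TITLE>
<LINK REV="made" HREF="mailto:author@some.site.org">
</HEAD>
<BODY>
<H1>Notes on Baloney Sandwiches</H1>
<P>
The body of the document goes here.
<HR>
Last modified: March 6, 1995
<ADDRESS>
<A HREF="http://some.site.org/~author/">A. N. Author</A><BR>
<A HREF="mailto:author@some.site.org">author@some.site.org</A>
</ADDRESS>
</BODY>
</HTML>

The LINK element serves software that reads the HEAD, while the visible
signature at the bottom serves human readers.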

Device independence through better printing

One promise of the widespread availability of personal computers has been
the lessening of our reliance on paper. In some ways, this promise has been
realized; many trees (and municipal landfills) are no doubt grateful that
many of us are now committing our words to e-mail instead of to a
handwritten or typewritten letter or memo. On the other hand, until video
display technology produces results indistinguishable from paper, we will
no doubt continue to print out things. It's hard to curl up with a notebook
at night, especially if it has a coaxial cable jutting out the back of it.
Because of this, many people will want to print out the documents which you
have provided electronically. In effect, they will want to take the
document you have woven into a part of a web, and make it into a standalone
document.

Fortunately, HTML is well-suited to this. A document in HTML can
theoretically be rendered in many more formats besides simply on a screen.
Print is one obvious alternative, although speech and Braille are also
possible and desirable. We bring this up because it is important to
consider ways other than on-screen that a reader may encounter your
documents. Given that, thinking about your document as something that might
be printed can be a very useful tool for creating documents that aren't
tied to the specific requirements of a browser or display hardware.

Taking advantage of prose

One of the advantages of the World Wide Web over similar infosystems, like
Gopher, is that the Web makes no distinction between what is a menu and
what is a document. For instance, in Gopher, a document is "dead" -- it
can't lead anywhere, and, in order to continue exploration, a reader must
return one step back to a menu. In the same vein, a Gopher menu provides
only limited information about where links lead: often a menu item
must be retrieved and explored before any sense can be made of whether it
is appropriate to what a reader seeks.

On the other hand, a Web document is "live" -- there's no clear dividing
line between a menu container and its contents. This is a liberating
distinction, as a document can now be as verbose as necessary in providing
context for links. Consider the difference between these two documents in
Figures 1 and 2.
---------------------------------------------------------------------------

              [Figure 1: A menu list without context (Lynx)]

            [Figure 2: A prose description of resources (Lynx)]
---------------------------------------------------------------------------

The second example is much more satisfying, because it is more than simply
a list of pointers. Instead, an effort has been made to integrate the list
into prose that is (presumably) better tied into the subject of the
document as a whole.
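
In markup, the difference between the two approaches might look something
like the following sketch. The paths and descriptions are hypothetical.
First, the bare menu:

<UL>
<LI><A HREF="http://www.miskatonic.edu/library/">Library</A>
<LI><A HREF="http://www.miskatonic.edu/catalog/">Catalog</A>
<LI><A HREF="http://www.miskatonic.edu/map/">Map</A>
</UL>

And then the same three links woven into prose:

<P>
The university <A HREF="http://www.miskatonic.edu/library/">library</A>
holds a large collection of rare books. You can search its holdings
through the <A HREF="http://www.miskatonic.edu/catalog/">online
catalog</A>, or consult the
<A HREF="http://www.miskatonic.edu/map/">campus map</A> to find the
building.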

This is not to say that it is always preferable to force what is more
naturally a menu into prose for the sake of prose. If you are creating a
document that serves as a jumping-off point to other resources, your
readers might not want to get into the thick of text to find the resource
they're searching for. In this case, a definition list may be more
appropriate, as shown in Figure 3. This is a nice compromise, giving
context without becoming buried in a forest of words.
---------------------------------------------------------------------------

             [Figure 3: A menu using a definition list (Lynx)]
---------------------------------------------------------------------------
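
A definition-list menu of this kind might be marked up roughly as follows.
Again, the paths and descriptions are invented for the sketch:

<DL>
<DT><A HREF="http://www.miskatonic.edu/library/">Library</A>
<DD>Hours, services, and special collections.
<DT><A HREF="http://www.miskatonic.edu/catalog/">Online Catalog</A>
<DD>Search the library's holdings by author, title, or subject.
<DT><A HREF="http://www.miskatonic.edu/map/">Campus Map</A>
<DD>Find buildings and parking.
</DL>

Each DT carries the link itself, and each DD gives a line or two of
context, so a reader can scan the terms quickly and read further only
where interested.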

Meaningless link text

When creating documents, make sure that your links are meaningful -- that
is, that they avoid online-specific references, and that they don't detract
from readability. The text of your links should flow well in the context of
the rest of your text, and your text should also be able to stand alone as
a printable document. You should at all costs avoid the "Click Here"
syndrome, as shown in Figure 4.
---------------------------------------------------------------------------

               [Figure 4: The "Click Here" Syndrome (Arena)]
---------------------------------------------------------------------------

Figure 4 is also bad because it refers to "clicking", which assumes that
everyone is using a mouse with their browser, which is not always the case.
A much better alternative is demonstrated in Figure 5.
---------------------------------------------------------------------------

                 [Figure 5: Meaningful Link Text (Arena)]
---------------------------------------------------------------------------

Another point to consider about the choice of words selected for link text
("information about cows", in this example), is that often this link text
may be what is used as information for a reader's bookmark or hotlist
entry. When the word "here" is used as link text, the hotlist may become
cluttered with entries that read only "here", instead of information
about what the link is actually about.
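
In markup, the two styles might look like the following sketch. Only the
link text "information about cows" comes from the example above; the
surrounding sentences and the file name cows.html are assumptions. The
"Click Here" version:

<P>
For information about cows, click <A HREF="cows.html">here</A>.

And a version with meaningful link text:

<P>
The dairy also provides <A HREF="cows.html">information about cows</A>.

The second sentence still makes sense when printed, and a hotlist entry
taken from it says what the link is about.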

Organization through outlining

Headers provide a useful way to provide an outline for your document.
Headers of level 1 (H1) indicate major points, while headers of level 2
(H2) provide sub-topics to those points, and so on and so forth. It is
important to remember that the purpose of these headers is not to provide
specific kinds of fonts or layout, but rather to organize a document into
sections. To that end, here are some recommendations about heading usage:

   * A heading should not be more than one level below the heading which
     preceded it. That is, an H3 element should not follow an H1 element
     directly.
   * Also, one version of the HTML specification declares that "a heading
     element implies all the font changes, paragraph breaks before and
     after, and white space (for example) necessary to render the heading".
     Extra highlighting elements, like EM or B, are discouraged within the
     heading.
   * Do not mark up text as H2 or H3 simply because it provides the correct
     size and bolding of fonts on the browsers used by local readers. On
     another browser, that same text may be incredibly grotesque and large,
     not providing the desired effect at all. Figures 6 and 7 demonstrate
     this effect.

---------------------------------------------------------------------------

              [Figure 6: Expected headline rendering (Arena)]

           [Figure 7: Unexpected headline rendering (Netscape)]
---------------------------------------------------------------------------
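
An outline that follows these recommendations might be skeletoned like
this. The heading text is borrowed from this document purely for
illustration, and the levels are not necessarily the ones its actual source
uses:

<H1>Composing Good HTML</H1>
<H2>Good Practices</H2>
<H3>Signing and time-stamping documents</H3>
<H3>Organization through outlining</H3>
<H2>Common Errors</H2>
<H3>Paragraph element errors</H3>

No heading is more than one level below the heading before it, and none
contains extra highlighting such as EM or B.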

Physical versus logical character emphasis

Since HTML (and also SGML) is designed to be a device-independent language
for describing the content of documents, most of the elements within it
aren't intended to give direct control to the author over how the final
page layout will look. The major exceptions to this are in the character
highlighting elements.

There are two types of character highlighting elements -- physical and
logical. The physical styles involve things like "italic font", and
"boldface"; while the logical styles are things like "emphasis",
"citation", and "strong." It is strongly recommended that you employ the
logical styles rather than the physical styles in your documents. Using the
I element to render text in italics will only be effective on those
browsers which are capable of displaying italics -- which not all browsers
are guaranteed to be able to do. It is far better to encode semantic
content -- to describe things in terms of logical styles -- and then allow
the browser to display that semantic structure as best it can, given its
display capabilities.

So, instead of

<I>italics</I>

you might use

<EM>emphasized</EM>

or a

<CITE>citation</CITE>

and instead of

<B>bold</B>

you might use

<STRONG>strong</STRONG>

This also leaves the possibilities open in the future for more
sophisticated uses of these semantic encodings, which have much more
inherent meaning than font styles like bold or italic. For example, the
Lycos indexing system can take advantage of semantic encoding to create
abstracts of documents.

Note: Before you stop using B and I altogether, here's another viewpoint to
consider. One argument against logical character styles is that it turns
out to be a bottomless pit, a fruitless attempt to define logical styles
for every possibility. Physical styles, combined with the context of the
text in which they are placed, seem to provide a much richer set without a
huge number of tags. Consider the large space of context that can be
implied with only the typographical conventions of bold or italic. The only
problem is that that contextual space needs to have a human being to
interpret it, which would make some kinds of computer-based rendering
difficult, if not impossible (e.g. speech synthesis).

A picture is worth a thousand words (which is why it takes a thousand times
longer to load...)

The title of this section is somewhat facetious, but only somewhat. It's
more and more obvious from current Web development efforts that the main
attraction of the Web is not hypertext, and it's not an easy interface; the
main attraction is the flashy graphics and the alluring promise of
multimedia. We shall heroically refrain from commenting on whether this is
a good or a bad thing, for the fact remains that online multimedia is here
to stay. What we will comment on are the issues that must be considered
to use multimedia for best effect.

The first set of issues revolves around the faux sense of page design one
can get by using inline images. An early example of this was one of the
first commercial forays into the Web, a graphic design house which
advertised professional layout services for online brochures. They spent
quite a bit of time designing graphic images of the proper width so that
they could achieve page-layout effects like right justification and
centering, and created a page which was fairly well-designed. However, they
got bitten because this design relied on a browser's window being the
default width for X Mosaic. With a wider window, the carefully aligned logo
in the upper right corner was immediately followed by the image that should
have been left justified on the following line.

Current browsers implement some better forms of layout control for images.
For example, an author can specify the way in which text will flow around
an image with the ALIGN attribute of the IMG element. Figures 8 and 9
exemplify this; the former has no text-flow information, and the latter
does. This is not perfect, as using the ALIGN attribute can cause strange
stair-stepping effects if there is not enough text separating two images,
as Figure 10 illustrates. If the desired effect is of images with captions,
a table is probably the best approach for layout purposes (Figure 11).
---------------------------------------------------------------------------

          [Figure 8: IMG without the ALIGN attribute (Netscape)]

            [Figure 9: IMG with the ALIGN attribute (Netscape)]

            [Figure 10: Stair-stepping due to ALIGN (Netscape)]

              [Figure 11: Using TABLE for layout (Netscape)]
---------------------------------------------------------------------------
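
The two techniques might be sketched as follows. The image file names, ALT
text, and captions are hypothetical, and the first fragment assumes a
browser that honors ALIGN=LEFT on IMG, as Netscape does:

<P>
<IMG SRC="seal.gif" ALT="[University seal]" ALIGN=LEFT>
This text flows down the right-hand side of the image, instead of
beginning at its bottom edge.

For images with captions, a table keeps each caption under its picture:

<TABLE>
<TR>
<TD><IMG SRC="library.gif" ALT="[The library]"></TD>
<TD><IMG SRC="chapel.gif" ALT="[The chapel]"></TD>
</TR>
<TR>
<TD>The library</TD>
<TD>The chapel</TD>
</TR>
</TABLE>

A browser without table support will not line the captions up, so the
earlier advice about non-standard elements applies here as well.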

Another consideration is unnecessary duplication of effort. Many authors
swear by colored bullets and colorful horizontal rules, implementing both
effects by using inlined images rather than the structural markup. Doing
this can leave the portion of your audience which is unable (or unwilling)
to view inlined images out of the loop, and can also negate some of the
benefits provided by structural markup. There is also an unexpected side
effect to using many small images: the current way in which Web clients
retrieve documents requires that a separate connection to a Web server be
initiated for each image. The time involved in negotiating this connection
may actually be larger than the time involved in retrieving the image
itself. Consider whether the effect achieved by the "enhanced" layout
justifies the cost.
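
To make the comparison concrete, here is a sketch of a list built from
bullet images (the file name /icons/redball.gif is made up):

<IMG SRC="/icons/redball.gif" ALT="*"> First point<BR>
<IMG SRC="/icons/redball.gif" ALT="*"> Second point<BR>

and the same list using structural markup:

<UL>
<LI>First point
<LI>Second point
</UL>

The second version tells the browser that this is a list, needs no image
retrieval at all, and is rendered with whatever bullet suits the display.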

Another concern is the size of images. With the increasing home popularity
of the Internet, more and more users are purchasing dial-up connections of
one sort or another. This may be of the strict "shell-account" variety,
which means that your readers will not see images at all, or they may be of
the SLIP/PPP variety, which means that your readers will have an average of
only 14,400 bits of information per second sent to them. This is not a
large number, and huge images can take minutes to load. Bear this in mind
when selecting images; will the image take so long to load that your reader
will go somewhere else rather than wait?

The image size issue can be alleviated in several ways. First, the
increasing popularity of the JPEG format means that images can be
compressed to much smaller sizes, which provides dramatic speed-up in image
load time. Even better results can be achieved by using fewer colors (gray
scale, rather than full 24-bit color, for example). Another approach is to
use a small set of navigational icons which appear on every page in your
Web. Most browsers now cache documents and images; using the same icons
(and using the same URL to refer to them with, perhaps by maintaining an
/icons directory on your Web server) means that the reader will only incur
the cost of downloading once.

Also, when using the IMG element, don't forget to also use the ALT
attribute. The ALT attribute allows alternate text to be specified for an
inlined image. This is especially useful for images that have specific
meaning (and provide a link to other documents), as that meaning can be
lost on those who do not have images loaded. For example:

<IMG SRC="http://www.miskatonic.edu/icons/next.gif">

can be better represented with the addition of the following ALT attribute:

<IMG SRC="http://www.miskatonic.edu/icons/next.gif" ALT="[Next Page]">

as shown in Figures 12 through 16.
---------------------------------------------------------------------------

             [Figure 12: The Document As Expected (Netscape)]

           [Figure 13: Inlined Images Off/No ALT Tag (Netscape)]

                [Figure 14: Text Browser/No ALT Tag (Lynx)]

        [Figure 15: Inlined Images Off/ALT Tag Supplied (Netscape)]

             [Figure 16: Text Browser/ALT Tag Supplied (Lynx)]
---------------------------------------------------------------------------

Finally, don't rely entirely on image maps and graphic logos to build your
site. There are a few sites which have almost no textual content
whatsoever; when visited by readers who do not (or cannot) load images,
there is no information available. This is not to say that image maps must
be avoided altogether. Instead, provide alternative means of navigation
which supplement the image map, such as explanatory text which follows your
map.
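
One way this might look, as a sketch only (the map handler path and the
link targets are hypothetical, and how image maps are configured depends on
your server): the image map is followed by plain text links to the same
destinations.

<A HREF="/cgi-bin/imagemap/main-map"><IMG SRC="/icons/main-map.gif"
ALT="[Navigation map]" ISMAP></A>
<P>
[ <A HREF="/catalog/">Catalog</A> |
<A HREF="/news/">What's New</A> |
<A HREF="/search.html">Search</A> ]

Readers who cannot load the image still have a working route to every
destination on the map.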

Common Errors

This section details common errors in HTML composition that may lead to
documents which are not fully device-independent. The behavior of browsers
when they encounter these errors is undefined, so certain browsers may
render them as intended, but not all browsers are guaranteed to do so.
Therefore, these mistakes should be avoided, even if your browser of choice
renders your documents correctly.

These errors are, for the most part, artifacts of "raw" HTML authoring. Web
development has suffered from a lack of good authoring tools, a situation
which is only now beginning to be rectified. Many of these errors involve
typos or simple mistakes, although others deal with more fundamental
conceptual problems.

Paragraph element errors

The use of the paragraph element (P) can be confusing. When HTML was first
introduced, <P> served as a paragraph separator, not as an
end-of-paragraph marker; this confusion originally prompted this document.
However, more recent versions of the specification (HTML 2.0 and later)
have changed this behavior.

The currently recommended usage is to place the P element at the beginning
of each paragraph; for example:

<P> In this paragraph, our hero discovers that he really likes
baloney sandwiches. He also listens to some disco, and has a
lovely beverage. Ah, if only all paragraphs were this exciting!

This is in contrast to previous usage, where the <P> was usually placed at
the end of the paragraph.

Still, in certain contexts, use of <P> should be avoided, such as directly
before any other element which already implies a paragraph break.

To wit, the <P> element should not be placed before headings, HR, ADDRESS,
BLOCKQUOTE, or PRE.

It should also not be placed immediately before a list element of any
stripe. That is, a <P> should not be used to mark the end-of-text for <LI>,
<DT> or <DD>. These elements already imply paragraph breaks.
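
A sketch of the placement rules above (the text itself is made up for
illustration): <P> opens each ordinary paragraph, but is left out before
the heading, the list, the list items, and the rule.

<H2>Lunch</H2>
<P>Our hero makes a baloney sandwich. He needs:
<UL>
<LI>two slices of bread
<LI>baloney
</UL>
<P>After lunch, he listens to some more disco.
<HR>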

Caveats

Some clarifications on the above might be in order. One is the difficulty
a browser has in rendering appropriate white space. While it is true that
all of the elements mentioned above imply a paragraph break, this only
occasionally means that they also imply white space between sections --
this depends on the browser. So, while you might feel inclined to add a <P>
in order to fix white space problems, please think twice and avoid it if
you can.

Also, when using the glossary list (DL), please try to avoid using multiple
DDs (definitions of terms) in order to provide multiple entries for a term
(DT). Instead, use a <P> tag between paragraphs in a definition.
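
As a sketch of the glossary advice (the term and its definition are only an
illustration), a definition that runs to two paragraphs stays inside a
single DD:

<DL>
<DT>Euphonium
<DD>A brass instrument pitched in B-flat.
<P>It is sometimes confused with the baritone horn.
</DL>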

All clear now?

Character and entity reference errors

Simply put, a character reference and an entity reference are ways to
represent information that might otherwise be interpreted as a markup tag.
For example, consider the rendered HTML document in figure 17.
---------------------------------------------------------------------------

         [Figure 17: Properly escaping character entities (Arena)]
---------------------------------------------------------------------------

The source which produces this document, which uses entities, looks like:

In order to represent the &quot;&lt;P&gt;&quot; in this text, I had to use &amp;lt;P&amp;gt; in my raw HTML.

In this example, the &lt; becomes "<", the &gt; becomes ">", the &quot;
becomes a quotation mark, and the &amp; becomes "&" (which is needed in
order to represent the text &lt; in the document without the text being
turned into "<"). There are currently four entities for this purpose in
HTML, as well as several entities which allow encoding of the ISO Latin-1
Character Set.

The most common error in the use of entities is to leave off the trailing
semicolon. Also, no additional spaces are needed before or after the
entity/character reference. Here are some examples of incorrect usage:

Doug &amp Chris went out for a walk.
A paragraph break can be represented with
&quote; &lt; P &gt; &quote;

Can you spot the errors in the above examples? They are:

   * In the first line, "&amp" needs to have a semicolon after it.
   * In the third line, "&quote;" should be "&quot;" (this is subtle and
     annoying, much like the Unix system call, creat())
   * There should be no spaces in the third line, which should read:
     &quot;&lt;P&gt;&quot;.

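Corrected, the example lines above would read:

Doug &amp; Chris went out for a walk.
A paragraph break can be represented with
&quot;&lt;P&gt;&quot;

A browser should display the first line as "Doug & Chris went out for a
walk." and the last as "<P>", quotation marks included.
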
URL errors

Another misunderstood aspect of Web document composition is in the creation
of URLs.

Directory reference errors

One grey area involves references to directories. It is possible to request
an index of a directory from an HTTP server. The typical response from the
server is to either return a pre-generated index document (which is often
the document "index.html" in the referenced directory), or to construct an
HTML document on the fly which contains a listing of all files in the
directory. However, when making such a directory reference, it is important
to make sure to have a trailing slash on the URL. That is, if you were to
request the index of Willamette University's directory of HTML
documentation, you would want to refer to it as
http://www.willamette.edu/html-composition/, not as
http://www.willamette.edu/html-composition.

Many servers are able to catch these errors, and provide redirection to the
proper URL, but it's best to get the URL right in the first place --
notably because not all browsers support transparent redirection. Also,
getting this correct the first time means it will take less time for the
page to be loaded; your readers won't have to wait through the time needed
to open two (or more) HTTP connections.
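
In an anchor, the difference is a single character. A sketch using the
directory mentioned above (the link text is made up):

<A HREF="http://www.willamette.edu/html-composition/">HTML composition
documents</A>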

Not using fully qualified domain names

Problems can arise when the hostnames in URLs aren't fully qualified.
Within a local network, a machine can often be simply referred to by its
host name. For example, the domain miskatonic.edu might have in it a WWW
server with the host name www. Readers within that domain can refer to the
machine by this name. However, the server's fully qualified domain name is
www.miskatonic.edu. This fully qualified domain name provides enough
information that any host, anywhere on the Internet, can find this
particular machine.

What happens is that an HTML author might construct a link that looks like
this:

<A HREF="http://www/~tilt/metanoia/">Metanoia -- A Change In Spirit</A>

which produces a link to "Metanoia -- A Change In Spirit" that will only
work for people on the same local network as that machine. A correct link
would look like this, instead:

<A HREF="http://www.cs.cmu.edu/~tilt/metanoia/">Metanoia -- A Change In Spirit</A>

which would allow all of the readers who are interested in Metanoia -- even
those living in Freedonia -- to actually follow the link.

Along those same lines, be careful in using URLs of the scheme "file:".
It's possible to have a reference to file://localhost/some/file/pathname.
Such a URL refers to the named file on the local host of whoever is
browsing the document, which is why a reference to <A
HREF="file://localhost/etc/motd">the message of the day</A> will display
the message of the day on your machine, not the message of the day on my
machine. However, this makes several assumptions about your reader's local
machine and network which you probably shouldn't be making. Unless you know
what you are doing (and probably even then), references of this type will
really mess up your Web.

Missing quotes in start tags

One common error, especially with the current lack of widely available and
useful authoring tools, is to leave off the closing quote of an attribute
value. For example, this reference to the euphonium, king of instruments,
should look like:

<A HREF="http://www.cs.cmu.edu/~tilt/euphonium/">

but people composing "raw" HTML from a text editor will often instead type

<A HREF="http://www.cs.cmu.edu/~tilt/euphonium/>

It's likely that by the end of that huge URL, the author had forgotten it
was supposed to be quoted. The behavior of browsers upon encountering this
varies -- some display a proper link, but you can't follow it, while others
actually eat up huge portions of the following text, thinking everything up
until the next quotation mark to be part of the URL.

Missed end tags

Many of the HTML elements contain information within them. For example,
<EM>emphasized text</EM> would be rendered as emphasized text. There is a
start tag (<EM>), some content (which may include text, and in some cases,
other nested elements), and an end tag (</EM>, indicated by the </). A
common mistake is to miss the / in the end tag. Unless the specification
says otherwise (see below), an element must be terminated by an end tag --
otherwise, undefined behavior may occur.

Some HTML elements are empty, such as <HR>, <BR>, and <IMG>; these have no
content and no end tag. A few others, such as <P> and <LI>, have end tags
which may be omitted (the HTML 2.0 specification provides more information
about element content).
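
A short sketch (the wording is invented) showing nested containers closed
in the reverse of the order in which they were opened, followed by an empty
element which takes no end tag:

<STRONG>The <EM>euphonium</EM> is the king of instruments.</STRONG>
<HR>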

Using white space around element tags

In general, the use of white space around element tags should be avoided.
For example, if white space immediately follows a start tag, the style
changes implied by that element may be applied to the initial space as
well. For instance,

You really should
<A HREF="http://www.cs.cmu.edu/~tilt/"> CZeCh THIZ 0uT </A> !

would be rendered in Netscape as shown in figure 18, and in Lynx as shown
in figure 19.
---------------------------------------------------------------------------

 [Figure 18: Improper use of whitespace (and spelling and punctuation, too)
                                (Netscape)]

              [Figure 19: Improper use of whitespace (Lynx)]
---------------------------------------------------------------------------

On some browsers, there may be white space around the anchor, which makes
the rendering unsightly, and may lessen the impact of the document. (This
comment really applies to white space immediately following start tags, and
immediately preceding end tags.)
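
For comparison, a version of the example with the white space moved outside
the anchor (and the wording tidied up) might read:

You really should
<A HREF="http://www.cs.cmu.edu/~tilt/">check this out</A>!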

Stylesheets

The point has probably been well made by now that HTML is not a very good
vehicle for providing specific information about layout and presentation.
There are no mechanisms for an author to specify how she wants specific
elements rendered, or to control aspects of page layout. While one of the
strengths of HTML is this very independence from presentation details, it
has become clear that some form of presentation control is needed.

Stylesheets are the answer to this problem. They provide the other half of
the equation, the half that is currently not provided by HTML. While HTML
provides information about content, stylesheets will provide information
about how to render specific elements.

Unfortunately, while several mechanisms for providing stylesheets are under
development, there is no clear standard at the time of this writing. We
cannot tell you what stylesheet mechanism(s) will become standard, but we
can tell you about the current contenders. Keep your hopes up, though:
because of the importance of stylesheets, it is highly likely that a usable
standard will emerge within the next year.

Some Stylesheet Proposals

In these proposals, the stylesheets contain information about how elements
should be rendered, whether this is font information, justification
information, etc. At the time of this writing, the syntax for these
stylesheets has not yet been fully designed.

Arena/Cascading Style Sheets

The Arena browser is currently the only browser which supports a stylesheet
mechanism, and that mechanism is still very limited and very experimental.
The mechanism involves "cascading style sheets," which means that several
different style sheets, each with a different level of importance, are
combined in order of importance to create a presentation style. The reader
can specify her own preferences for rendering, as can document authors, and
these preferences are merged to produce the final document.

DSSSL/DSSSL Lite

DSSSL is the Document Style Semantics and Specification Language, which has
emerged from the SGML community as a potential stylesheet mechanism.
Because it is complex, work is being done to create "DSSSL Lite," a
modified subset of DSSSL which can be easily implemented by client
programmers, and easily used by HTML authors.

Alternatives to Stylesheets

While stylesheets are not currently usable, there are alternatives in
existing specifications, which can be used with existing browsers. While
the HTML 3.0 enhancements below are not yet widely propagated, it is likely
that they will be soon; and the Netscape enhancements are already available
(and are likely to be integrated into the evolving HTML 3.0 specification).

HTML 3.0

While HTML 3.0 does include the STYLE element for supporting whatever
mechanism is eventually deployed for stylesheets, HTML 3.0 also provides
some new elements for greater control over presentation. These elements
include BANNER, BIG, SMALL, TABLE, MATH, and TAB.

The BANNER element marks a region of the document that will always remain
on the screen. This might be a copyright notice, a toolbar, or any other
content which should always be available.

The BIG and SMALL elements allow for rendering text as bigger or smaller,
as compared to the default text size.

The TABLE and MATH elements provide for a more sophisticated means of
layout. The TABLE element allows the author to specify a spreadsheet-style
arrangement, with cells that can contain text, images, and even input
elements for FORMs. The MATH element allows for the description and
rendering of complex mathematical formulae.

The TAB element allows the author to specify tab stops within the document.

In addition, some entities have been added, such as "&emsp;", to provide
finer control over spacing.

For more information about these additional elements and entities, see the
HTML 3.0 specification.
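
As a rough sketch of two of these additions, based on the HTML 3.0 draft
(the data is made up, and since the draft may still change, check the
specification before relying on the details):

<P>The <BIG>euphonium</BIG> is often overlooked <SMALL>(sadly)</SMALL>.

<TABLE BORDER>
<CAPTION>Brass section</CAPTION>
<TR><TH>Instrument</TH><TH>Players</TH></TR>
<TR><TD>Euphonium</TD><TD>2</TD></TR>
<TR><TD>Tuba</TD><TD>1</TD></TR>
</TABLE>

Browsers which do not know these elements will ignore the tags and run the
cell text together, so check such pages in an older browser as well.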

Netscape

The Netscape approach cannot be called a "style sheet," per se. Rather, as
of the 1.1 release of Netscape Navigator, Netscape has provided several
"enhanced" elements to help control presentation. These elements include
FONT, BASEFONT, IMG, and BODY.

The FONT and BASEFONT elements allow changing the font size within a
document. The IMG element, on the other hand, has been enhanced to provide
text flow around images in documents.
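
A sketch of how these Netscape extensions might be used together (the image
file and the wording are hypothetical): BASEFONT sets the base size, FONT
adjusts it for a run of text, and ALIGN=LEFT on IMG lets the following text
flow beside the image until the BR with CLEAR.

<BASEFONT SIZE="3">
<P><FONT SIZE="+2">Welcome</FONT> to the Miskatonic University
<FONT SIZE="2">(unofficial)</FONT> home page.
<P><IMG SRC="/icons/seal.gif" ALT="[University seal]" ALIGN=LEFT>
This text flows down the right-hand side of the image.
<BR CLEAR=LEFT>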

The BODY element now allows control over the background. The author can
provide a background color or image for the document. In addition, the
author can specify different colors for hypertext links, in case the
default colors do not contrast sufficiently with the new background color.

If you would like more information, Netscape Communications has provided
documentation of their HTML extensions online (both for the Netscape HTML
2.0 extensions and the Netscape HTML 3.0 extensions).

Note: Be careful when changing colors for hypertext links. Most browsers
take the approach of using a bright color (such as bright blue), which has
high contrast to the default page background, for links which have not yet
been followed; and of using a dull color (such as dark blue), which has
less contrast to the default page background, for links which have already
been followed. Readers have become used to this high-contrast/low-contrast
visual cue, and changing the link colors can confuse readers.

The best approach is, first, not to change the link colors unless you have
to. With most background colors, the defaults should still be fine. If you
do need to change the link colors, use a color that is bright, and that
contrasts strongly with the background color, for links to pages which have
not yet been visited. Use a duller version of that same color for links
that have already been followed.
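
As an illustration of that advice, here is a sketch for a page with a black
background, using the Netscape BODY attributes (the particular color values
are only one possible choice): unvisited links are bright yellow, and
visited links are a duller shade of the same yellow.

<BODY BGCOLOR="#000000" TEXT="#FFFFFF" LINK="#FFFF00" VLINK="#999900">

A background image can be supplied as well, for example with
BACKGROUND="/icons/paper.gif" (a hypothetical file); setting BGCOLOR to a
similar shade should keep the text readable when images are not loaded.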

Netscape Frames

Given the proliferation of Netscape's frames, it seems appropriate to at
least add in a paragraph or so commenting on proper usage. Frames allow you
to break the browser's window into separate subwindows, with different
documents in different windows. This provides even greater control for the
author in terms of what the end document actually looks like (and, granted,
can be used to very good effect), but, as with all things, must be used
with care.

Some gotchas with frames include:

Navigational
     This has more to do with Netscape's current implementation, but may be
     more fundamentally related to the issues involved in providing
     frame-style mechanisms. Currently, when a reader encounters a space
     structured with frames, any further navigation they do does not make
     it onto the history stack. This means that the next time they hit the
     "back" arrow, they pop right out of the entire space, possibly going
     back several link selections. This can be jarring, to say the least.
     What this boils down to is that you must be even more careful to
     prepare a good navigational structure for your corpus of documents.
     (In fairness, Netscape has recognized the frame problem, and the 3.x
     version of Navigator addresses it.)
Layout
     Many sites have poorly laid-out frames; when a reader with a browser
     window of unexpected shape or size shows up, some of the frames are
     not completely readable. I don't understand enough about frames to
     know why this happens, yet, so all I can do is to warn you to watch
     out.

In general, the gotchas revolve around the fact that more control is
removed from the reader in a medium where the reader expects to have a good
deal of control. This doesn't mean don't use frames; it means that you must
carefully analyze why you are using them, and make sure that their use is
justified.

Another note: there is a NOFRAMES element which can be used to give
alternate text for those browsers which do not support frames; use it.
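
A minimal sketch of a frame document with such a fallback, using Netscape's
frame syntax, in which the fallback element is spelled NOFRAMES (the file
names are hypothetical). Sizing the columns as percentages rather than in
pixels is one way to cope with browser windows of unexpected size:

<HTML>
<HEAD><TITLE>Miskatonic University Library</TITLE></HEAD>
<FRAMESET COLS="25%,75%">
<FRAME SRC="contents.html" NAME="contents">
<FRAME SRC="welcome.html" NAME="main">
<NOFRAMES>
<BODY>
<P>This site uses frames. You can also go straight to the
<A HREF="contents.html">table of contents</A>.
</BODY>
</NOFRAMES>
</FRAMESET>
</HTML>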

More on this subject as I become more familiar with frames.

Web Style Considerations

A quick plug: Chapter 5 of Web Weaving discusses many of the issues you
should take into account in planning and administering your Web (in fact,
the entire book revolves around the subject in great detail). Here we will
also address that subject, considering the architecture of your
infostructure.

Organization

When organizing your infostructure, there are several important issues to
consider. These issues include:

Presenting a clear ordering of information by subject (table of contents),
or some other form of reasonable entry into the infostructure. Some useful
forms are:

   * Table of Contents
   * Searchable Index
   * What's New (with the organic nature of online documents, a
     time-oriented ordering will help the infonaut quickly orient herself
     with what is new and/or changed in otherwise familiar territory)

The reader needs to be able to find what they are looking for, and a good
overview that allows the reader to quickly find a particular topic or
document is invaluable.

Only making a document as long as it needs to be. If a document can be
logically decomposed into more than one file, do so, but only decompose a
document if the narrative branches from the linear structure of the current
document. An example of this is breaking a book-length work up into
chapters, and further breaking those chapters up into sections. Because of
the length of time involved in retrieving documents, making the document
available in readable chunks means that the reader can use the information
without being overwhelmed by loading times and by the correspondingly large
amount of information presented in a single, huge, scrolling document.

Correspondingly, make sure a document is richly cross-referenced, so that
if the reader wants to ask, "Why?", she can. If you can split up
supplementary information into separate documents, do so. This allows the
reader to follow a main flow of narrative, while still being able to look
up evidence and additional related stories and information as necessary.
But don't put in so many links that the reader gets lost trying to follow
them all.

Providing a clear, consistent navigation structure. The reader should
always be able to navigate easily to all documents which immediately
relate, but should also always be able to get to any other document in the
infostructure with a minimum of fuss. Always provide access to the original
table of contents, or its equivalent. This is especially important when
others create links to documents in your Web, but do not necessarily create
links to your main entry points; readers can find themselves in the middle
of what is obviously a larger document, but without any means of finding
additional information. See Main Roads and Scenic Paths, below.

Design Goals

Importance of content

Anyone working with HTML for any length of time will soon realize that the
markup language is composed of containers, which label content. It should
be obvious, then, that your web should be primarily about this content,
whatever it may be.

That's not to say that content only lies between HTML tags: content is also
found in other media types, of course, and, depending upon the type of
information you provide, sounds or images may be more important to both you
and your readers than other types of media.

Web sites, however, should be driven by content, not by vanity or the need
or desire to make a buck. Whatever your background, you have real "content"
-- information, discussion, narrative, ideas -- to publish on the Web.
People will visit your site to find this content. Provide it. Focus your
site around it.

The largest threat to the Web is that as it becomes insanely popular,
instead of becoming a world-wide information repository, as its founders
and proponents have hoped, it becomes a large intertwined mass of
self-referential sites unwittingly involved in meta-discussions on the
nature of the Web: home pages which say little more than "This is my home
page" (or "our home page", in the case of the corporate or organizational
"presence"), with a collection of links which (virtually) point to the same
collections of sites as the last page you visited did.

Main Roads and Scenic Paths: Issues of Navigability

As readers attempt to sail the seas of your infostructure, it is important
that you provide useful ways for them to move around in your infostructure.
Many readers complain about the proliferation of links in documents,
providing so many choices that it becomes impossible to decide where to go
next. The blessings of hypertext -- leaving control in the hands of the
reader -- can also be a curse, as the original thrust of the narrative
becomes awash in side tracks and dead ends.

A means of approaching this problem is to use the metaphor of "main roads"
and "scenic paths." This means categorizing the kinds of links you include
into two major groups: those which are recommended next destinations, and
those which lead off into explanatory side-trails and divergences. As an
example, a main path through a hypertext version of a book would be a
linear progression from first chapter to the last. A side trail, on the
other hand, would be a reference from (for example) Chapter 6's description
of CGI functionality in various HTTP servers to Chapter 8's extended
discussion of CGI scripting.

This is not to say that there is a single main path through a document --
there can be several (just as there are several ways to read a book,
including as a linear narrative, and as a random-access reference). And
side trails include references outside of the immediate document, such as
bibliographic references. In addition, side trails can become main paths if
the trail leads to another document instead of self-contained explanation.

The point, however, is that a document (in the extended sense of several
HTML pages collected and interlinked) should contain at least one
author-defined main path through the text, in order to provide a guidepost
for those exploring the information. These main paths should take the form
of "next" and "previous" anchors, links back to the table of contents and
index from any point within the document, and pointers to alternate main
paths which are available (where appropriate).
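
A sketch of how such a main path might be marked at the foot of one chapter
of a hypertext book (the file names are hypothetical, and the chapter
numbers follow the example above), with a side trail kept visibly separate
from the main path:

<HR>
<P>[ <A HREF="chapter5.html">Previous</A> |
<A HREF="chapter7.html">Next</A> |
<A HREF="contents.html">Table of Contents</A> |
<A HREF="bookindex.html">Index</A> ]
<P>See also: <A HREF="chapter8.html">Chapter 8, CGI scripting</A>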

Although hypertext is based on notions of non-linear text, readers do make
it linear as they read through it. And it doesn't hurt to provide at least
one sensible linear pathway through the document for readers who aren't
interested in wandering around in hyperspace.

Consistency

Consistency is what brings your site together so that it feels like a
cohesive whole -- it can unite otherwise disparate topics or content areas,
and it can be used to give your site a distinctive feel in comparison to
other sites, or a sense of personality. Consistency also eases the
maintenance of a site -- if you have a certain way of doing things
site-wide, it becomes much easier to make significant site-wide changes
without putting a great deal of time into it. You can achieve site-wide
consistency in a number of ways:

Headers and footers

A standard site-wide graphical banner or text-based header can be used to
easily identify the site or sponsoring organization. Your header doesn't
necessarily need to be static across the site; you can easily share
dimensions and a primary graphic element across banners while making each
one relate specifically to the content at hand.

Footers can be used in the same way; a standard method to sign documents
and/or a standard text-based or graphical menu bar can easily pull a site
together, not only as a design element, but also as an easy way to always
navigate to the table of contents or index of a site.

Server-side includes, supported by most HTTP servers, can simplify some of
this work, allowing you to create generic headers and footers which can be
modified once and included in all of your documents.
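
As a sketch, with NCSA-style server-side includes enabled, a document might
pull in shared files like this (the /includes paths and the page content
are hypothetical, and both the exact directive syntax and whether includes
are enabled at all depend on your server, so check its documentation):

<!--#include virtual="/includes/header.html" -->
<H1>Department of Medieval Metaphysics</H1>
<P>Course listings for the fall term follow.
<!--#include virtual="/includes/footer.html" -->

Changing header.html or footer.html then changes every document which
includes it.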

Graphic elements

A unifying theme for graphic elements throughout the site easily pulls it
together into a whole. A shared motif, such as bubbles, sign posts, or a
corporate logo, works, as does a site-wide color scheme or page
backgrounds. You can rely on sizing and positioning of graphic elements or
textual elements, as well, to achieve a unified feel.

Personality and style

Beyond images and design elements, sites come together because of
personality and style. A consistent feel or attitude for a site, conveyed
across textual and graphic elements, can not only make each piece feel as
if it's part of a larger whole, it can also attract readers who share the
same attitude or outlook (or are fascinated by yours). The best sites on
the Web aren't necessarily the most polished, but those that pull readers
back again and again not only because of informational content but also
because of the voice with which that content is presented.

For documents which should have a personality all their own, such as user
home pages, you can still pull all these different personalities and
outlooks together by presenting a common theme or launching point. All the
users of a particular Internet service provider, for example, have
something in common by the sheer fact of their being there -- and by the
mere fact of providing a top page view to user-maintained areas, the
service provider has begun to form a community around which a commonality
can develop.

Persistent URLs

Although Uniform Resource Names, or URNs, are being developed in order to
provide a naming system similar to the domain naming system for URLs, at
this point it remains desirable to use URLs as if they refer to the same
resource persistently through time.

As a content provider, you can help those who make links which point to
your site by developing a file structure which will allow you to manage
content as it grows and develops.

If your Web space is based on a hierarchical filing system, you can avoid
major reorganization of that file system by

   * thinking not only about organizing your current content, but how you
     plan on developing and expanding that content in the future
   * creating a file space which is neither too shallow nor too deep for
     your content.

An example might be an organization which has just created a new division,
Foobar. Currently, there's little information to publish about Foobar on
the Web: Foobar has a mission statement and little else. Though it might
logically follow to create a file, "foobar.html", to hold the mission
statement, and to store it in the same directory as your main
organization's web, it might be wiser to create a subdirectory named foobar
which could then contain foobar.html and other files, as Foobar expands.
This way, links don't have to be changed or redirected down the road when
Foobar adds additional files and perhaps chooses to design and administer
its own web space. If part of Foobar's mission statement is to spin off
into its own organization, you might even create a directory on the same
level as the parent organization's, to signify within the URL path the
relative autonomy of the division and its future direction.

Another way to manage URLs is to only publicize a few well-known entry
points to your Web: for example, the top view, or table of contents page,
and perhaps an index page, or a FAQ page.

When URLs do change, it's important that you not only provide links from
the old URLs to the new ones (or redirect the URLs to the new ones), but
you also make an attempt to notify those that have links into your Web
space, through general announcements or by contacting directly those who
have well-known links to your documents (such as Yahoo or Lycos).
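
When a server-level redirect is not available, a small stub document left
at the old URL can do the job. A sketch, with hypothetical paths on the
example host used earlier:

<HTML>
<HEAD><TITLE>Foobar Division: This Page Has Moved</TITLE></HEAD>
<BODY>
<H1>This page has moved</H1>
<P>The Foobar mission statement now lives at
<A HREF="http://www.miskatonic.edu/foobar/">http://www.miskatonic.edu/foobar/</A>.
Please update your links and bookmarks.
</BODY>
</HTML>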

Seamlessness

Your web space should not only be consistent with itself internally, it
should make references between the site and the outside world appear
seamless.

A good case in point is the corporate site which has made its product
information available via the Web, but, under the link for Ordering
Information, only provides an 800 number in order to purchase the
advertised commodity. Or the home page for a band which doesn't provide any
audio clips of the band's songs, but just a thumbnail image of the cover
art from their most recent album, available through some obscure indie
label. Or the online newspaper which provides news coverage, but doesn't
push the envelope and provide a real way to participate in the political
process.

Seamlessness is about bridging the gap between the world you create within
your web and the world outside it. Often, this means not carrying over from
traditional broadcast media restrictions or limitations that fail to make
sense in interactive media.

Macrocosms and Microcosms

The big picture: entire server structure

A site-wide strategy to organize information is never easy to invent, but
vitally important to your site's success as a place where information is
retrieved and used, versus simply being an area in which content is stored.

Finding a metaphor

Of course, there's no single recipe or structuring mechanism which you can
apply to all types of content to give you a well-designed web site. That
comes from thinking about the nature of your site and your content, and the
logical divisions that your content can be organized around. However,
finding an existing metaphor which you can work within, while also pushing
its boundaries, can be an effective way to plan for the organization of a
site.

There are many obvious metaphors upon which to base a web site: thinking of
your content as being organized like a book, building, or branching tree.

The book metaphor: pages of content

Books lend themselves easily to the Web; in fact, many books have been
"ported" to the Web, for better and for worse. Books have tables of
contents and indices, for quickly locating information; parts, chapters,
sections, and sub-sections, for organizing content; and footnotes,
endnotes, and bibliographies, for displaying links to other content.
Collections of books become "libraries", complete with card catalogs and
help desks.

However, books also have pages which display content statically, while
computer displays have a single, dynamic screen. A book metaphor quickly
falls apart when applied to the Web on a page level: you could choose to
consider a single HTML document a "page", causing you to break up content
into arbitrarily small pieces that are hard to manage and difficult to
navigate; or you could think of whatever text and graphics are currently
displayed on a screen as a "page", which could easily drown the user in a
sea of text without the benefit of traditional navigational tools such as
page breaks and page numbers. The screen is not a page.

The building metaphor: content as artifice

Sites can also be organized as if housed in a building, a collection of
buildings, or along some other spatial metaphor. The information you hope
to store and manage is divided for the user into content areas, which are
housed in different "buildings", which can then be further subdivided into
"rooms". Obviously, this can be effective for some types of content, such
as a large corporate site with many divisions, or a museum or gallery:
basically, any information which can be mapped into a spatial plane
consistently lends itself to this sort of view.

At the same time, a spatial metaphor in a largely text-driven medium, as
the Web is today, is often hard to pull off convincingly. VRML (Virtual
Reality Markup Language) and other such developments will allow for the
creation of virtual spaces; even then, the connecting points between rooms
or buildings -- hallways and walkways -- need to be considered
thoughtfully. It's also the case that, at many sites, the metaphor is
dropped too quickly: you're asked to select a content area based upon a
clickable map-based view, but then you're dropped into pages of descriptive
text. Not only can this be disconcerting for a user, it points out the fact
that oftentimes resources aren't allocated wisely across a Web site, with
too much attention and time spent on the top page of a site in comparison
to the remainder of the site.

The branching metaphor: regimented growth

A third way of thinking about a site as a whole is using a branching
metaphor, where all content springs from a common root and then branches
out into many divisions and content areas. This is an obvious metaphor to
use for web sites built atop file systems, since most file systems share
this organization of directories (or folders) branching into subdirectories
(or subfolders), and so on.

A branching metaphor shouldn't be pursued at the expense of the linear flow of
information, however: too many branches can be confusing or frustrating for
a user, especially if navigating those branches requires repeated jumps to
a monolithic top structure.

In general, there are some key issues you should keep in mind when
organizing a site on a macro level, including:

Providing a main entry point, or top view, which makes it easy for users to
find the content which they're most interested in. At times, you'll know
exactly what a user is looking for: if you run a site which provides audio
clips of theme songs from popular cartoon series of the '70s, users
probably expect to find a listing of available audio samples or a link to
such a listing from your site's top page. Other times, you can't be
expected to know: for a site covering a wide diversity of subjects, it may
be necessary to provide a search mechanism or user-customizable top view in
order for users to navigate your site comfortably.

Offering multiple paths to the same content. Not all readers seek the same
information in the same way. A good glossary or index will cross-reference
information: for example, you may be told to look under "automobiles" if
you seek information under "cars". That same information could probably be
found by looking through a table of contents. With hypertext links, you can
refer to the same information in many ways. Do so, where it facilitates the
user without overwhelming her.

Keep in mind, too, that a site, whether it be a file system or a database,
need not be organized as the user sees it: the underlying structure doesn't
have to be identical to the structure which the user navigates. However, a
close relationship between the two can make it easier to maintain a site,
as content is revised and expanded. A change in one part of your web space
can have an impact on other parts of your site which share links or other
references: the easier it is for you to see these relationships while
maintaining these underlying documents, the more likely it becomes that
your site as a whole is kept up-to-date and cohesive.
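
One way to see these relationships is to build a map of which documents
link to which. The following is a minimal Python sketch, not a finished
tool: it assumes your documents are available as a dictionary of path to
HTML text, and the names LinkCollector and inbound_links are made up for
this illustration. It uses only the standard html.parser and urllib.parse
modules.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect the HREF value of every <A> element in one document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def inbound_links(pages):
    """Map each link target to the sorted list of pages that link to it.

    `pages` maps a document's path (e.g. "/guide/index.html") to its HTML.
    This is a hypothetical helper for illustration only.
    """
    inbound = {}
    for path, html in pages.items():
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.hrefs:
            # Resolve relative links against the linking page and drop
            # any "#fragment" so links into a page count toward that page.
            target, _fragment = urldefrag(urljoin(path, href))
            if target != path:
                inbound.setdefault(target, set()).add(path)
    return {target: sorted(sources) for target, sources in inbound.items()}
```

Looking up a document in the result tells you which other pages to re-read
after you move or revise it.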

The little picture: a document corpus

Many of the decisions you make on a site-wide level to organize content
carry over to the management of "documents", whether they be single pages
of HTML, or a collection of such pages which cover a single topic. These
things include such obvious carry-overs as having an overview of the
information presented within the document available to the reader at the
"top" page, or expected entry point; making links available at appropriate
points (usually, at the tops or the bottoms of pages) to bring the reader
back to the overview for the document; and keeping your collection of
documents uniform in terms of both content and form.

Much of the management of documents, though, is the management of links.
Hypertext is all about links -- this should be patently obvious to most.
But producing hypertext is all about managing links from the perspective of
your potential reader. Too often, Web documents fail because they do not
manage links effectively -- either by delivering screenful after screenful
of ever-scrolling text, or by providing index-card-sized groupings of
hypertext which link in a myriad of directions to other index-card-sized
groupings of hypertext. Neither end of the spectrum allows the user to
navigate the content presented easily: in one case, one becomes
disoriented in a sea of text; in the other, in an ocean of links. Worse
yet, documents can become so overseasoned with random and senseless
connections to every possible place that the reader becomes lost in a sea
of text and links!

The key to managing links in your documents (besides simply verifying that
they are correct) is to organize them into classifications, and to employ
links of various classifications in a reasonable and intelligent way. The
next few sections describe some of the various classifications of links.

Footnotes

There are two traditional purposes for footnotes: for bibliographic
references, and for further commentary and/or elaboration of points within
the main text. Links to short explanatory text within a hypertext document
can be useful to readers, if it's clear from context that the link is a
digression.

Within your documents, the "footnote" style of link should be regarded as
an explanatory link which elaborates on the current discussion without
drawing the reader away from the main text. A footnote will draw the reader
away temporarily, explain something, and then allow the reader to return to
the main flow of text. While a footnote might offer further links to
further explanations of greater depth, the footnote itself is usually
nothing more than a brief explanation or glossary-style definition.

You can achieve this effect by context, by linking from a phrase (as in the
lemming example below) to a short explanation or parenthetical remark that
explains the text in question. If you are trying to achieve a more
traditional effect, you can also use numbered note references, by either
using a number surrounded by brackets ([1]), or by using the SUP element in
HTML 3 (<SUP>1</SUP>).

HTML 3 also defines the FN element for use in footnotes, which, "when
practical, [should be] rendered as pop-up notes":

<P>Nothing is certain about the <A HREF="#fn1">lemmings</A>,
other than that they left as they came, with nothing but a silly grin and
some lemon pies.

<FN ID="fn1">Lemmings: Small rodents that like to leap off of
cliffs if necessary for retrieving a really nice lemon pie.</FN>

Whole documents

Where the footnote provides brief elaboration, the link to a "whole
document" (whether it be a single document, or to the entry point for a
collection of documents) provides a whole new potential area of
exploration. This is the most common sort of link, which provides a
connection between your document and the outside world.

This sort of link should be used with care. It has the potential to draw
your reader completely away from your document, by providing supplementary
information that takes longer to read than the original document. It is
better to use footnote-style links for explanation and elaboration, and
from there to use links to outside documents to provide further reference
information for the curious (and insatiable) reader. Another danger is that
of peppering your document with random hypertext links that a reader feels
she must follow, without actually providing further explanations or further
reading that's germane to the context or the point of your own document.

On the other hand, if you are referring directly to another on-line
document, this is the kind of link to use. By providing direct access to
supplementary material for your readers, you can give them as much or as
little detail as they are willing to plow through.

Indices

Another form of link is the index. Unlike the previous two classifications,
which provide further information for the reader as they advance through
the text, the index allows the reader to enter the text from whatever point
she desires, so that she can get right to the meat of what she is
interested in. An index allows the reader to cut through the author's
pre-designed tour of the information, and get right to that vital
information on wildebeest's dietary habits.

There are several variations on this. The most popular is the full-text
searchable index, which allows readers to query a database of keywords and
retrieve those portions of your text which contain those keywords. Several
software packages provide full-text searching capability, and the WN server
has searching built in.

Another variation is often found in books: an enumerated list of keywords.
This differs from an index where the reader supplies the keywords in that
the author can provide a selection of keywords that are particularly useful
for finding information. This is important -- picking proper keywords can
be an arcane art, sometimes requiring intimate knowledge of the contents of
the collection being searched. Especially if the collection is a large one,
most keywords will return a large number of documents which may be only
partially related to what the reader had in mind.

Yet another variation provides even more refinement and selection: the
table of contents. A table of contents is a form of index, organized by
broad topic. Consider providing not just one, but multiple tables of
contents for your documents, especially if there is more than one
reasonable way in which to read the information.
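
The enumerated keyword list described above can be kept as a simple table
and turned into index entries mechanically. The sketch below is in Python
and is only an illustration: the function name build_index, the see_also
table, and the sample file names are assumptions, not part of any existing
package.

```python
def build_index(keywords_by_document, see_also=None):
    """Return a sorted list of (keyword, documents) index entries.

    `keywords_by_document` maps a document path to the keywords its author
    chose for it. `see_also` maps an alternate term to the preferred one,
    so a reader looking under "cars" is led to "automobiles".
    """
    index = {}
    for document, keywords in keywords_by_document.items():
        for keyword in keywords:
            # Fold case so "Diet" and "diet" share one entry.
            index.setdefault(keyword.lower(), set()).add(document)
    for alternate, preferred in (see_also or {}).items():
        # A cross-reference lists the same documents as the preferred term.
        index.setdefault(alternate.lower(), set()).update(
            index.get(preferred.lower(), set()))
    return [(keyword, sorted(docs)) for keyword, docs in sorted(index.items())]


# Hypothetical sample data: three documents and one cross-reference.
entries = build_index(
    {
        "wildebeest.html": ["Wildebeest", "diet"],
        "autos.html": ["automobiles"],
        "lemmings.html": ["diet"],
    },
    see_also={"cars": "automobiles"},
)
for keyword, documents in entries:
    print(keyword, "->", ", ".join(documents))
```

Because the author picks the keywords, each entry leads to documents that
are actually about the term, rather than to every page that happens to
mention it.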

Portability Between Server Platforms

One of the advantages of HTML, which most Web documents consist of, is that
HTML is based upon a number of other clearly defined, widely supported,
non-proprietary formats, such as ISO Latin-1 and Internet Media Types
(itself based on MIME). This approach makes it much more likely that, a
decade from now, your documents will not be part of some legacy system
which is, at best, difficult to maintain and expand.

If your documents do have that kind of lifespan, however, it's probable
that they will reside on multiple hosts in that timeframe: perhaps
concurrently, in the case of popular sites which are mirrored. A little
attention to the requirements of different filesystems during the initial
planning of your site could save a lot of time spent renaming files and
links in the future.

About filesystems: some make the argument that Web servers should sit atop
databases, instead of filesystems; databases certainly allow
non-hierarchical relationships between pieces of content and make it easier
to provide "dynamic" documents (documents which alter their appearance or
content based upon the user accessing the data or other conditions) than
traditional filesystem-based approaches. By the time this book sees print,
there will certainly be several HTTP-serving database systems which address
many of the issues raised here "automatically".

There are some very compelling reasons for using a database over a file
system. A database-oriented system might be utilized to maintain linkages
as documents move and change; to track documents as they grow old, alerting
maintainers to update the documents periodically so that they do not suffer
"bit-rot"; and to generate multiple representations of a collection of
information dynamically (allowing your readers to order your document
collections in ways that make sense to them). However, a database approach
is not required to get some of this functionality; other tools exist that
also do these sorts of things (Chapter 7 of Web Weaving covers these
sorts of tools in more detail; examples include MOMspider and the HTML
Validation Service).

But this automation may not come cheap: there will always be a learning
curve to mastering any system, proprietary or non-proprietary, and the
skills learned from managing a proprietary system are not easily
transferred to other systems. You, as an information provider, must rely on
your database solutions vendor to understand your needs and continue to
build the feature-set of the system to satisfy them as you develop and
grow. You may be risking the future of your documents -- by marrying your
content to a single-vendor methodology -- for some short-term gains in
manageability and ease of publishing content.

Please keep these sorts of considerations in mind: a fear of ours is that
the Web, as it moves forward almost exponentially, may lose any sense of
history as links fail and documents drop out of view because the cost of
maintenance and "keeping up" has grown too great. Pick simple solutions
over complex ones.
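
In that spirit, even the "bit-rot" check mentioned above does not need a
database. The following Python sketch is one hedged possibility: it assumes
your documents live in an ordinary directory tree and that a file's
modification time is a fair proxy for when its content was last reviewed.
The function name stale_documents and the age threshold are hypothetical.

```python
import os
import time


def stale_documents(root, max_age_days, now=None):
    """List files under `root` not modified in the last `max_age_days` days.

    Paths are returned relative to `root`, sorted alphabetically.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds in a day
    stale = []
    for directory, _subdirs, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(directory, filename)
            if os.path.getmtime(path) < cutoff:
                stale.append(os.path.relpath(path, root))
    return sorted(stale)
```

Run periodically, a report like this gives maintainers a list of pages to
re-read, which is most of what the database would have told them.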

Naming Space

Historically, most Web servers have been Unix-based, and have used the
naming space associated with that operating system. Many servers have since
been developed for other platforms, however, and it's prudent, as you
create documents, not to adhere so closely to the naming space of a
particular platform that you make it difficult to move your documents to
another platform.

  1. Some filesystems have naming spaces which are case-sensitive. Unix is
     a good example of an OS which would consider "document.html" a
     different file from "Document.html", while other file systems, such as
     the Mac OS, make no such distinction -- both names would refer to the
     same file. For the sake of portability, it's probably best to keep all
     the file and directory names within your web structure lowercase. An
     added benefit is that this makes your URLs much more
     human-communicable: it's much easier to read an all-lowercase URL over
     the phone than one which contains both uppercase and lowercase
     characters, when case is significant.
  2. Some filesystems require file extensions to properly type files.
     Servers running under the Mac OS could serve up files with proper
     Content-type headers based upon the file's creator and file type
     stored in the file's resource fork; other filesystems use extensions
     to do this typing. It's always wise to use the appropriate file
     extension for the content type -- such as .gif for GIF files --
     whenever possible.
  3. Some filesystems are restricted to a limited number of significant
     characters. DOS and Windows, of course, only allow eight characters,
     plus three characters for the file extension (although Windows 95 and
     NT eliminate this restriction). Everywhere else, filenames under 32
     characters should generally be fairly cross-platform. If you think
     your files may ever need to live on a DOS or Windows server, you may
     need to restrict yourself to 8 + 3 character filenames.
  4. Almost all filesystems define special characters. Almost all
     operating systems allow certain special characters in filenames,
     while disallowing others; the Mac OS, for example, allows slashes in
     file names, while Unix doesn't. It's best to avoid all characters but
     for the letters a through z, the numbers 0 through 9, and the
     underscore, hyphen, and period.
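
The guidelines above are mechanical enough to check with a short script
before you publish. Here is a minimal Python sketch; portability_problems
is a hypothetical helper, and its rules are just one reading of the four
points above (lowercase names, a file extension, fewer than 32 characters,
a conservative character set, and optionally the DOS 8 + 3 limit).

```python
import re

# Lowercase letters, digits, underscore, hyphen, and period only.
SAFE_NAME = re.compile(r"[a-z0-9_.-]+")
# DOS-style names: up to eight characters, then an optional extension of
# up to three characters.
DOS_NAME = re.compile(r"[a-z0-9_-]{1,8}(\.[a-z0-9_-]{1,3})?")


def portability_problems(filename, dos=False):
    """Return a list of reasons why `filename` may not move cleanly."""
    problems = []
    if filename != filename.lower():
        problems.append("contains uppercase letters")
    if not SAFE_NAME.fullmatch(filename.lower()):
        problems.append("contains characters outside a-z, 0-9, '_', '-', '.'")
    # A leading or trailing period does not count as an extension.
    if "." not in filename.strip("."):
        problems.append("has no file extension")
    if len(filename) >= 32:
        problems.append("is 32 characters or longer")
    if dos and not DOS_NAME.fullmatch(filename.lower()):
        problems.append("does not fit the DOS 8 + 3 pattern")
    return problems


# Hypothetical file names, checked without and with the DOS restriction.
for name in ("index.html", "Document.html", "my file.html", "readme"):
    print(name, portability_problems(name))
print("index.html", portability_problems("index.html", dos=True))
```

An empty list means the name passed every check; anything else is worth
renaming now, while only a few links point at it.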

Developing Content

Uniqueness

Uniqueness may not be seen as an important design goal at first glance:
after all, uniqueness -- not duplicating efforts by creating or compiling
same or similar content -- may appear to be more of a community issue than
an organizational one.

Providing a unique resource, however, increases traffic to your site, and
adds to the authoritativeness of your content (see below). It will also
require support, and a popular, unique resource can have a spill-over
effect on the other content you provide on your site, especially if your
site has a consistent feel and character.

In addition, redoing what has already been done elsewhere can add to
frustration on the part of readers. Providing yet another list of exciting
online resources means that there is simply more of the same sort of
content available, which readers must then evaluate and compare to other
such resources. Providing a unique resource (or a resource in short supply)
means that you are adding to the content of the network, instead of
duplicating it.

How to check for uniqueness of content? There are many search mechanisms on
the Web, such as Lycos. You can also check in relevant newsgroups and
mailing lists. (Chapter 10 of Web Weaving covers these sorts of issues in
more detail).

You can also produce your content so that it leans towards providing
unique, value-added content: instead of simply providing a list of poetry
sites, say, you could provide a list of poetry resources which you find
particularly compelling, with descriptions of why you think they are
compelling. Adding value and content means that you are being a good
network citizen, leaving the community with more than you found it with.

Authoritativeness

Authoritativeness has always been a fallacy, except when read as
author-itativeness; whatever claims to authority you or your organization
have ultimately boil down to status and reputation within the community.
One becomes a reputable source not by being non-refutable, but by putting a
stamp on what you write; by claiming authorship, and, thereby, author-ity.

This means that readers must take greater responsibility for critically
analyzing what documents they come across. But it also means that you must
be responsible in establishing credentials for what you claim, providing
source material and raw data to justify your conclusions.

In some sense, this is the end result of all of the things we discuss here
(and in Web Weaving). In building and maintaining your infostructure, what
you are aiming for is authoritativeness; for creating documents which are
well thought out and well designed; which do not become stale or
inaccurate; and which remain both internally and externally consistent.
Your mission now is to use the tools we have provided you with to place the
stamp of authority and relevance on your own works, and to truly create
infostructures on the Web which are compelling and creative. Good luck!

For More Information

There already exist documents on the Web which address this same topic, and
perhaps in more detail. For definitive reference information you may wish
to check the HTML specifications from the World Wide Web Consortium (W3C).
For a more detailed discussion of HTML composition style, you should also
check the Style Guide (especially the section on device-independent
formatting), which is also from the W3C.

If you're looking for a good document for learning the basics of HTML, you
will want to check out the Beginner's Guide to HTML, from NCSA.

Also useful is the Bibliography from Web Weaving, from Addison-Wesley (as
soon as this is placed on-line, I'll put a link to it here).

Finally, the somewhat creatively-minded among you can draw inspiration from
this page's evil twin, Composing Evil HTML. Officially, I don't endorse any
of these techniques. Unofficially ... well, let's just say someday I intend
to buy Andrew several beers.

Acknowledgements

I'd like to thank all of you who have visited this document and commented
on it, suggesting fixes, clarification, and even new sections. You know who
you are (even if I managed to lose your addresses in the flood of
information)! It is, in some senses, always a work in progress and is
always amenable to suggestion, modification, and repair. I appreciate your
help!

We (the authors of Web Weaving) would especially like to thank the folks at
Addison-Wesley, for helping us turn all of this into much more than I, at
least, ever thought it would be. There's something just so satisfying about
actually holding a book, hypertext be damned.
---------------------------------------------------------------------------

Copyright © 1994, 1995, 1996 by Eric Tilton. Permission is granted for
individual use and reproduction provided that this document remains intact,
with this copyright message clearly visible. Commercial use and
reproduction rights are held by Addison-Wesley, and this document may not
be resold or redistributed for compensation of any kind without prior
written permission from Addison Wesley -- contact me for details. Parts of
this document appear in a revised form in Web Weaving (ISBN 0-201-48959-7),
a book by Eric Tilton, Carl Steadman, and Tyler Jones, to be published by
Addison-Wesley. Look for it in a bookstore near you!

The upshot is, this document has always been meant as a public service, and
will remain a public service. I hope you've found it to be useful; I've had
fun providing it for your use.
---------------------------------------------------------------------------

Last modified: Dec 8, 1995
James "Eric" Tilton, HTML Guru Wannabee and Occasional Author,
tilt@cs.cmu.edu

(and with most of the Web style considerations contributed by Carl
Steadman, Guy Who Doesn't Suck, carl@freedonia.com)