WELCOME to the Java Developer Connection(TM)
(JDC) Tech Tips,
June 27, 2000. This issue covers some aspects of using the Java(TM) programming language with XML. First there's a short
introduction to XML, followed by tips on how to use two APIs designed for use
with XML.
This issue of the JDC Tech Tips is written by Stuart Halloway,
a Java specialist at DevelopMentor.
XML INTRODUCTION
The Extensible Markup Language (XML) is a way of specifying the
content elements of a page to a Web browser. XML is syntactically
similar to HTML. In fact, XML can be used in many of the places
in which HTML is used today. Here's an example. Imagine that the
JDC Tech Tip index was stored in XML instead of HTML. Instead of
HTML coding such as this:
<html>
  <body>
    <h1>JDC Tech Tip Index</h1>
    <ol><li>
      <a href="/developer/TechTips/2000/tt0509.html#tip1">
        Random Access for Files
      </a>
    </li></ol>
  </body>
</html>
It might look something like this:
<?xml version="1.0" encoding="UTF-8"?>
<tips>
  <author id="glen" fullName="Glen McCluskey"/>
  <tip title="Random Access for Files"
       author="glen"
       htmlURL="/developer/TechTips/2000/tt0509.html#tip1"
       textURL="/developer/TechTips/txtarchive/May00_GlenM.txt">
  </tip>
</tips>
Notice the coding similarities between XML and HTML. In each case,
the document is organized as a hierarchy of elements, where each
element is demarcated by angle brackets. As is true for most HTML
elements, each XML element consists of a start tag, followed by
some data, followed by an end tag:
<element>element data</element>
Also as in HTML, XML elements can be annotated with attributes.
In the XML example above, each <tip> element has several
attributes. The 'title' attribute is the name of the tip, the
'author' attribute gives a short form of the author's name, and
the 'htmlURL' and 'textURL' attributes contain links to different
archived formats of the tip.
The similarities between the two markup languages are an important
advantage as the world moves to XML, because hard-earned HTML
skills continue to be useful. However, this raises the question
"Why bother to switch to XML at all?" To answer this question,
look again at the XML example above, and this time consider the
semantics instead of the syntax. Where HTML tells you how to format
a document, XML tells you about the content of the document. This
capability is very powerful. In an XML world, clients can
reorganize data in a way most useful to them. They are not
restricted to the presentation format delivered by the server.
Importantly, the XML format has been designed for the convenience
of parsers, without sacrificing readability. XML imposes strong
guarantees about the structure of documents. To name a few: begin
tags must have end tags, elements must nest properly, and all
attributes must have values. This strictness makes parsing and
transforming XML much more reliable than attempting to manipulate
HTML.
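A short sketch can make these guarantees concrete. The class below is
not part of the original tip. It assumes a Java runtime that bundles
the JAXP parsing API introduced later in this issue, and the names
WellFormedCheck and isWellFormed are hypothetical. It asks a parser to
read three small documents and reports whether each one is accepted.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only
class WellFormedCheck {

    /** Returns true if the parser accepts the text as well-formed XML. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder db =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them
            db.setErrorHandler(new DefaultHandler());
            db.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the document
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Every tag closed and properly nested: accepted
        System.out.println(isWellFormed("<tips><tip title=\"a\"/></tips>"));
        // <tip> has no end tag: rejected
        System.out.println(isWellFormed("<tips><tip title=\"a\"></tips>"));
        // The title attribute has no value: rejected
        System.out.println(isWellFormed("<tips><tip title></tip></tips>"));
    }
}
```

An HTML browser would typically try to render all three inputs. An
XML parser has to stop at the first violation, which is what makes
downstream processing predictable.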
The similarities between XML and HTML stem from a shared history.
HTML is a simplified vocabulary of a powerful markup language
called SGML. SGML is the "kitchen sink" of markup, allowing you
to do almost anything, including the ability to define your own
domain-specific vocabularies. HTML is a dim shadow of SGML, with
a predefined vocabulary. Thus HTML is basically a static snapshot
of some presentation features that seemed useful circa 1992. Both
SGML and HTML are problematic: SGML does everything, but is too
complex. HTML is simple, but its parsing rules are loose, and its
vocabulary does not provide a standard mechanism for extension.
XML, by comparison, is a streamlined version of SGML. It aims to
meet the most important objectives of SGML without too much
complexity. If SGML is the "kitchen sink," XML is a "Swiss Army
knife."
Given its advantages, XML does far more than simply displace HTML
in some applications. It can also displace SGML, and open new
opportunities where the complexity of SGML had been a barrier.
Regardless of how you plan to use XML, the programming language of
choice is likely to be the Java programming language. Although you
could write your own code to parse XML directly, the Java language
provides higher-level tools to parse XML documents through the
Simple API for XML (SAX) and the Document Object Model (DOM)
interfaces. SAX and DOM are standard interfaces that are
implemented in several different languages. In the Java
programming language, you can instantiate the parsers by using the
Java(TM) API for XML Parsing (JAXP).
To execute the code in this tip, you will need to download JAXP
and a reference implementation of the SAX and DOM parsers. You will
also need to download SAX 2.0. Remember to update your class path
to include the jaxp, parser, and sax2 JAR files.
USING THE SAX API
The SAX API provides a serial mechanism for accessing XML
documents. It was developed by members of the XML-DEV mailing list
as a standard set of interfaces to allow different vendor
implementations. The SAX model keeps parsers simple: a parser reads
through a document in a linear way and calls an event handler every
time a markup event occurs.
The original SAX implementation was released in May 1998. It was
superseded by SAX 2.0 in May 2000. (The code in this tip is SAX2
compliant.)
To use SAX2 for notification of markup events, all you have to do
is implement a few methods and interfaces. The ContentHandler
interface is the most important of these interfaces. It declares
a number of methods for different steps in parsing an XML document.
In many cases, you will only be interested in a few of these methods.
For example, the code below handles only a single ContentHandler
method (startElement), and uses it to build an HTML page from the
XML Tech Tip Index:
import java.io.*;
import java.net.*;
import java.util.*;
import javax.xml.parsers.*;
import org.xml.sax.*;
import org.xml.sax.helpers.*;

/**
 * Builds a simple HTML page which lists tip titles
 * and provides links to HTML and text versions
 */
public class UseSAX2 extends DefaultHandler {
    StringBuffer htmlOut;

    public String toString() {
        if (htmlOut != null)
            return htmlOut.toString();
        return super.toString();
    }

    // Called by the parser for the start tag of every element
    public void startElement(String namespace,
                             String localName,
                             String qName,
                             Attributes atts) {
        if (localName.equals("tip")) {
            String title = atts.getValue("title");
            String html = atts.getValue("htmlURL");
            String text = atts.getValue("textURL");
            // Attribute values are quoted so that URLs containing
            // spaces or '>' do not break the generated markup
            htmlOut.append("<br>");
            htmlOut.append("<A HREF=\"");
            htmlOut.append(html);
            htmlOut.append("\">HTML</A> <A HREF=\"");
            htmlOut.append(text);
            htmlOut.append("\">TEXT</A> ");
            htmlOut.append(title);
        }
    }

    public void processWithSAX(String urlString) throws Exception {
        System.out.println("Processing URL " + urlString);
        htmlOut =
            new StringBuffer("<HTML><BODY><H1>JDC Tech Tips Archive</H1>");
        SAXParserFactory spf = SAXParserFactory.newInstance();
        SAXParser sp = spf.newSAXParser();
        // Wrap the SAX1 parser so that it delivers SAX2 events
        ParserAdapter pa = new ParserAdapter(sp.getParser());
        pa.setContentHandler(this);
        pa.parse(urlString);
        htmlOut.append("</BODY></HTML>");
    }

    public static void main(String[] args) {
        try {
            UseSAX2 us = new UseSAX2();
            us.processWithSAX(args[0]);
            String output = us.toString();
            System.out.println("Saving result to " + args[1]);
            FileWriter fw = new FileWriter(args[1]);
            fw.write(output, 0, output.length());
            fw.flush();
            fw.close();
        }
        catch (Throwable t) {
            t.printStackTrace();
        }
    }
}
To test the program, you can use the XML fragment in the XML
Introduction that precedes this tip, or download a longer version.
Save the XML fragment or the longer XML version in your local
directory as TechTipArchive.xml. You can then produce an HTML
version with the command:
java UseSAX2 file:TechTipArchive.xml SimpleList.html
Then use your browser of choice to view SimpleList.html, and
follow links to either text or HTML versions of recent Tech Tips.
(In a production scenario you would probably merge this code into
a client browser or into a servlet or JSP page on the server.)
There are several interesting points about the code above. Notice
the steps in creating the parser.
SAXParserFactory spf = SAXParserFactory.newInstance();
SAXParser sp = spf.newSAXParser();
In JAXP, the SAXParser class is not created directly, but instead
through the factory method newSAXParser(). This allows different
implementations to be plug-compatible without source code changes.
The factory also provides control over more advanced parsing
features such as namespace support and validation. Even after you
have the JAXP parser instance, you still aren't ready to parse.
The current JAXP parser only supports SAX 1.0; to get SAX 2.0
support, you must wrap the parser in a ParserAdapter.
ParserAdapter pa = new ParserAdapter(sp.getParser());
The ParserAdapter class adds SAX2 functionality to an existing
SAX1 parser and is part of the SAX2 download.
Notice that instead of implementing the ContentHandler interface,
UseSAX2 extends the DefaultHandler class. DefaultHandler is an
adapter class that provides an empty implementation of all the
ContentHandler methods, so only the methods that are of interest
need to be overridden.
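The factory settings and the adapter-class approach described above
can be sketched together. This example is not from the original
listing. It assumes a later JAXP release in which
SAXParser.getXMLReader() returns a SAX2 XMLReader directly, so no
ParserAdapter is needed. The names TitleLister and listTitles are
hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: overrides one callback, inherits the rest
class TitleLister extends DefaultHandler {
    private final List<String> titles = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        if ("tip".equals(localName)) {
            titles.add(atts.getValue("title"));
        }
    }

    /** Parses the XML text and returns the title of every <tip>. */
    static List<String> listTitles(String xml) {
        try {
            SAXParserFactory spf = SAXParserFactory.newInstance();
            spf.setNamespaceAware(true); // ensures localName is filled in
            spf.setValidating(false);    // no DTD validation requested
            // Assumption: a JAXP version that offers getXMLReader()
            XMLReader reader = spf.newSAXParser().getXMLReader();
            TitleLister handler = new TitleLister();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.titles;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<tips><tip title=\"Random Access for Files\"/></tips>";
        System.out.println(listTitles(xml));
    }
}
```

Because the handler extends DefaultHandler, callbacks such as
endElement and characters fall through to the inherited empty
implementations.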
The startElement() method does the real work. Because the program
only wants to list the tips by title, the <tip> element is
all-important, and the <tips> and <author> elements are ignored.
The startElement method checks the element name and continues
only if the current element is <tip>. The method also provides
access to an element's attributes via an Attributes reference, so
it is easy to extract the tip name, htmlURL, and textURL.
The end result of this exercise is an HTML document that allows you
to browse the list of recent Tech Tips. You could have done this
directly by coding in HTML. But doing this in XML and writing the
SAX code provides additional flexibility. If another person wanted
to view the Tech Tips sorted by date, or by author, or filtered by
some constraint, then various views could be generated from a
single XML file, with different parsing code for each view.
Unfortunately, as the XML data gets more complicated, the sample
above becomes more difficult to code and maintain. The example
suffers from two problems. First, the code to generate the HTML
output is just raw string manipulation, which makes it easy to
lose a '>' or a '/' somewhere. Second, the SAX API doesn't remember
much; if you need to refer back to some earlier element, then you
have to build your own state machine to remember the elements that
have already been parsed.
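To illustrate the second problem, here is a hedged sketch of the kind
of bookkeeping a SAX handler has to do for itself. It is not from the
original tip, and the names AuthorResolver and describe are
hypothetical. The handler remembers each <author> element as it goes
by, so that a later <tip> element can be matched with the author's
full name. It only works if the <author> elements appear before the
tips that refer to them, which is exactly the kind of ordering
constraint that serial access imposes.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that keeps its own state between events
class AuthorResolver extends DefaultHandler {
    // State carried between events: author id -> full name
    private final Map<String, String> authorNames = new HashMap<>();
    private final List<String> lines = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        // The default factory is not namespace aware, so test qName
        if ("author".equals(qName)) {
            authorNames.put(atts.getValue("id"), atts.getValue("fullName"));
        } else if ("tip".equals(qName)) {
            String name = authorNames.get(atts.getValue("author"));
            lines.add(atts.getValue("title") + " by "
                      + (name != null ? name : "unknown"));
        }
    }

    /** Returns one "title by author" line per <tip> element. */
    static List<String> describe(String xml) {
        try {
            AuthorResolver handler = new AuthorResolver();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.lines;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<tips>"
            + "<author id=\"glen\" fullName=\"Glen McCluskey\"/>"
            + "<tip title=\"Random Access for Files\" author=\"glen\"/>"
            + "</tips>";
        System.out.println(describe(xml));
    }
}
```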
The Document Object Model (DOM) API solves both of these problems.
USING THE DOM API
The DOM API is based on an entirely different model of document
processing than the SAX API. Instead of reading a document
one piece at a time (as with SAX), a DOM parser reads an entire
document. It then makes the tree for the entire document available
to program code for reading and updating. Simply put, the
difference between SAX and DOM is the difference between
sequential, read-only access, and random, read-write access.
At the core of the DOM API are the Document and Node interfaces.
A Document is a top level object that represents an XML document.
The Document holds the data as a tree of Nodes, where a Node is
a base type that can be an element, an attribute, or some other
type of content. The Document also acts as a factory for new
Nodes. Each Node represents a single piece of data in the tree, and
provide all of the popular tree operations. You can query nodes
for their parent, their siblings, or their children. You can also
modify the document by adding or removing Nodes.
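The tree operations described above can be tried out on a small
in-memory document. The following is a minimal sketch rather than
part of the original tip. The class name DomTour and its methods are
hypothetical, and the input string contains no whitespace between
elements so that every child of the root is an element.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class, for illustration only
class DomTour {

    /** Parses XML text into a DOM tree. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** Removes the root's first child and appends a new <tip> element. */
    static Element edit(Document doc, String newTitle) {
        Element root = doc.getDocumentElement();
        root.removeChild(root.getFirstChild());
        Element tip = doc.createElement("tip"); // the Document is the factory
        tip.setAttribute("title", newTitle);
        root.appendChild(tip);
        return root;
    }

    public static void main(String[] args) {
        Document doc = parse("<tips><author id=\"glen\"/>"
            + "<tip title=\"A\"/><tip title=\"B\"/></tips>");
        Element root = doc.getDocumentElement();
        Node author = root.getFirstChild();      // first child of <tips>
        Node firstTip = author.getNextSibling(); // sibling of <author>
        System.out.println(author.getNodeName());
        System.out.println(firstTip.getParentNode() == root);
        System.out.println(root.getChildNodes().getLength());
        edit(doc, "C"); // drops <author>, adds a third <tip>
        System.out.println(root.getChildNodes().getLength());
    }
}
```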
To demonstrate the DOM API, let's process the same XML document
that got "SAXed" above. This time, let's group the output by
author. This will take a little more work. Here's the code:
//UseDOM.java
import java.io.*;
import java.net.*;
import java.util.*;
import javax.xml.parsers.*;
import org.w3c.dom.*;

public class UseDOM {
    private Document outputDoc;
    private Element body;
    private Element html;
    // Maps each author id to the <OL> element that lists that author's tips
    private HashMap authors = new HashMap();

    // Note: relies on the parser's Element.toString() returning markup,
    // which the DOM specification does not guarantee.
    public String toString() {
        if (html != null) {
            return html.toString();
        }
        return super.toString();
    }

    public void processWithDOM(String urlString)
        throws Exception {
        System.out.println("Processing URL " + urlString);
        DocumentBuilderFactory dbf =
            DocumentBuilderFactory.newInstance();
        DocumentBuilder db = dbf.newDocumentBuilder();
        Document doc = db.parse(urlString);
        Element elem = doc.getDocumentElement();

        // Step 1: one heading and one empty list per author
        NodeList nl = elem.getElementsByTagName("author");
        for (int n=0; n<nl.getLength(); n++) {
            Element author = (Element)nl.item(n);
            String id = author.getAttribute("id");
            String fullName = author.getAttribute("fullName");
            Element h2 = outputDoc.createElement("H2");
            body.appendChild(h2);
            h2.appendChild(outputDoc.createTextNode("by " + fullName));
            Element list = outputDoc.createElement("OL");
            body.appendChild(list);
            authors.put(id, list);
        }

        // Step 2: add each tip to the list of its author
        NodeList nlTips =
            elem.getElementsByTagName("tip");
        for (int i=0; i<nlTips.getLength(); i++) {
            Element tip = (Element)nlTips.item(i);
            String title = tip.getAttribute("title");
            String htmlURL = tip.getAttribute("htmlURL");
            String author = tip.getAttribute("author");
            Node list = (Node) authors.get(author);
            Node item =
                list.appendChild(outputDoc.createElement("LI"));
            Element a = outputDoc.createElement("A");
            item.appendChild(a);
            a.appendChild(outputDoc.createTextNode(title));
            a.setAttribute("HREF", htmlURL);
        }
    }

    public void createHTMLDoc(String heading)
        throws ParserConfigurationException {
        DocumentBuilderFactory dbf =
            DocumentBuilderFactory.newInstance();
        DocumentBuilder db = dbf.newDocumentBuilder();
        outputDoc = db.newDocument();
        html = outputDoc.createElement("HTML");
        outputDoc.appendChild(html);
        body = outputDoc.createElement("BODY");
        html.appendChild(body);
        Element h1 = outputDoc.createElement("H1");
        body.appendChild(h1);
        h1.appendChild(outputDoc.createTextNode(heading));
    }

    public static void main(String[] args) {
        try {
            UseDOM ud = new UseDOM();
            ud.createHTMLDoc("JDC Tech Tips Archive");
            ud.processWithDOM(args[0]);
            String htmlOut = ud.toString();
            System.out.println("Saving result to " + args[1]);
            FileWriter fw = new FileWriter(args[1]);
            fw.write(htmlOut, 0, htmlOut.length());
            fw.flush();
            fw.close();
        }
        catch (Throwable t) {
            t.printStackTrace();
        }
    }
}
Assuming you save the XML as TechTipArchive.xml, you can run the
code with this command line:
java UseDOM file:TechTipArchive.xml ListByAuthor.html
Then point your browser to ListByAuthor.html to see a list of tips
organized by author.
To see how the code works, start by looking at the createHTMLDoc
method. This method creates the outputDoc Document, which will be
used to build the HTML output. Notice that just as with SAX, the
parser is created using factory methods. However, here the factory
method is in the DocumentBuilderFactory class. The second half of
createHTMLDoc builds the basic elements of an HTML page.
html = outputDoc.createElement("HTML");
outputDoc.appendChild(html);
body = outputDoc.createElement("BODY");
html.appendChild(body);
Element h1 = outputDoc.createElement("H1");
body.appendChild(h1);
h1.appendChild(outputDoc.createTextNode(heading));
Compare that code with the code in the SAX example that builds
the elements of an HTML page:
//direct string manipulation from SAX example
htmlOut =
    new StringBuffer("<HTML><BODY><H1>JDC Tech Tips Archive</H1>");
Using the DOM API to build documents isn't as terse or as fast as
direct String manipulation, but it is much less error-prone,
especially in larger documents.
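One caveat about the UseDOM listing: it obtains the HTML text by
calling toString() on an Element, and the result of that call depends
on the parser implementation. As a hedged alternative, the sketch
below assumes a Java runtime that includes the javax.xml.transform
package (which is not among the downloads this tip lists) and uses an
identity Transformer to turn a DOM tree into text. The names
DomToString, buildPage, and serialize are hypothetical.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper, for illustration only
class DomToString {

    /** Builds the same HTML/BODY/H1 skeleton as createHTMLDoc. */
    static Document buildPage(String heading) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
            Element html = doc.createElement("HTML");
            doc.appendChild(html);
            Element body = doc.createElement("BODY");
            html.appendChild(body);
            Element h1 = doc.createElement("H1");
            body.appendChild(h1);
            h1.appendChild(doc.createTextNode(heading));
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Writes a DOM node out as markup text. */
    static String serialize(Node node) {
        try {
            // A Transformer created without a stylesheet copies its input
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(node), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serialize(buildPage("JDC Tech Tips Archive")));
    }
}
```

A side benefit of going through the DOM is that special characters in
text nodes, such as '&' in a heading, are escaped by the serializer
instead of by hand.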
The important part of the UseDOM example is the processWithDOM
method. This method does two things: (1) it finds the author
elements and writes a heading for each one, and (2) it finds the
tips and writes them out grouped by their respective author.
Each of these steps requires access to the top-level element of
the document, which is obtained via the getDocumentElement() method.
The author information is in <author> elements. These elements
are found by calling getElementsByTagName("author") on the
top-level element. The getElementsByTagName method returns
a NodeList; this is a simple collection of Nodes. Each Node is
then cast to an Element in order to use the convenience method
getAttribute(). The getAttribute method gets the author's id and
fullName. Each author is listed as a second-level heading; to do
this, the output document is used to create an <H2> element
containing the author's fullName. Adding a Node requires
two steps. First the output document is used to create the Node
with a factory method such as createElement(). Then the node is
added with appendChild(). Nodes can only be added to the document
that created them.
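The last point can be demonstrated with two documents. The sketch
below is not from the original tip. It assumes a DOM Level 2
implementation, where Document.importNode() is available for copying
a node into another document, and the names ImportDemo, newDocument,
and tryAppend are hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical demo class, for illustration only
class ImportDemo {

    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    /**
     * Tries to append child to parent. Returns false if the DOM
     * refuses because the child belongs to a different document.
     */
    static boolean tryAppend(Node parent, Node child) {
        try {
            parent.appendChild(child);
            return true;
        } catch (DOMException e) {
            if (e.code == DOMException.WRONG_DOCUMENT_ERR) {
                return false;
            }
            throw e;
        }
    }

    public static void main(String[] args) {
        Document output = newDocument();
        Document other = newDocument();
        Element list = output.createElement("OL");
        output.appendChild(list);

        // Created by a different document, so it cannot be added as-is
        Element foreign = other.createElement("LI");
        System.out.println(tryAppend(list, foreign));

        // importNode makes a copy that is owned by the output document
        Node copy = output.importNode(foreign, true);
        System.out.println(tryAppend(list, copy));
    }
}
```

This is why UseDOM always calls createElement() and createTextNode()
on outputDoc, never on the parsed input document.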
After the author headings are in place, it is time to create the
links for individual tips. The <tip> elements are found in the
same way as the <author> elements, that is, via
getElementsByTagName(). The logic for extracting the tip attributes
is also similar. The only difference is deciding where to add the
Nodes. Different authors should be added to different lists. The
groundwork for this was laid back when the author elements were
processed by adding an <OL> node and storing it in a HashMap
indexed by author id. Now, the author id attribute of the tip can
be used to look up the appropriate <OL> node for adding the tip.
For more in-depth coverage of XML, see The XML Companion, by Neil
Bradley, Addison-Wesley 2000. For more information about JAXP,
see the Java(TM) Technology and XML page. For more information
about SAX2, see www.megginson.com. The DOM standard is available
at www.w3.org.