This issue presents tips, techniques, and sample code for the following topics:
    Extracting Links from an HTML File
    Sorting Arrays
This issue of the JDC Tech Tips is written by Patrick Chan, the author of the
publication "The Java(TM) Developers Almanac".
EXTRACTING LINKS FROM AN HTML FILE
There are many applications that fetch an HTML page from the Web and then
extract the links from the page. For example, a link-checker application
fetches a page, extracts the links, and then checks the links to see if they
refer to actual pages.
The HTML 3.2 support in the Java(TM) 2 platform makes it fairly easy to find
and parse links. This tip demonstrates how to use that support.
The first step is to create an editor kit. The purpose of an editor kit is to
parse data in some format, such as HTML or RTF, and store the information in a
data structure that fully represents the data. This data structure, called a
Document, allows you to examine and modify the data in a convenient way.
Let's look at an example. In the following example program, we're going to
examine the HTML data in a Document object. The program looks for A (anchor)
tags and extracts the HREF attribute information from these tags.
    import java.io.*;
    import java.net.*;
    import javax.swing.text.*;
    import javax.swing.text.html.*;
    class GetLinks {
        public static void main(String[] args) {
            EditorKit kit = new HTMLEditorKit();
            Document doc = kit.createDefaultDocument();
            // The Document class does not yet
            // handle charsets properly.
            doc.putProperty("IgnoreCharsetDirective",
                Boolean.TRUE);
            try {
                // Create a reader on the HTML content.
                Reader rd = getReader(args[0]);
                // Parse the HTML.
                kit.read(rd, doc, 0);
                // Iterate through the elements
                // of the HTML document.
                ElementIterator it = new ElementIterator(doc);
                javax.swing.text.Element elem;
                while ((elem = it.next()) != null) {
                    SimpleAttributeSet s = (SimpleAttributeSet)
                        elem.getAttributes().getAttribute(HTML.Tag.A);
                    if (s != null) {
                        System.out.println(
                            s.getAttribute(HTML.Attribute.HREF));
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
            // Exit with status 0; a nonzero status conventionally signals failure.
            System.exit(0);
        }
        // Returns a reader on the HTML data. If 'uri' begins
        // with "http:", it's treated as a URL; otherwise,
        // it's assumed to be a local filename.
        static Reader getReader(String uri)
                throws IOException {
            if (uri.startsWith("http:")) {
                // Retrieve from Internet.
                URLConnection conn =
                    new URL(uri).openConnection();
                return new
                    InputStreamReader(conn.getInputStream());
            } else {
                // Retrieve from file.
                return new FileReader(uri);
            }
        }
    }
This program takes one parameter from the command line. If the parameter
starts with "http:", the program treats the parameter as a URL and
fetches the HTML from that URL. Otherwise, the parameter is treated as a
filename and the HTML is fetched from that file.
For example,
$ java GetLinks http://java.sun.com
retrieves the HTML from the main page at java.sun.com.
The editor kit is an HTMLEditorKit object that contains an HTML parser. It
creates a Document object that can represent HTML. And it's the editor kit's
read() method that parses the HTML and stores the information in the Document.
Once the HTML data is saved in the Document object, we're ready to look for
links. This is done by creating an iterator (using ElementIterator) that
iterates over all the visible text pieces (called elements) in the HTML. For
each text piece, we check to see if it has been formatted for linking, in
other words, whether the text is formatted with the A (anchor) tag. We do this
by calling getAttributes().getAttribute(HTML.Tag.A). If the text piece has
been formatted with the A tag, the method call returns the set of attributes
of the A tag used to format that text piece. Otherwise the method call simply
returns null.
Note: The name getAttributes() is a little confusing because it has nothing
to do with HTML attributes; the "attributes" in this case are all the HTML
tags (such as an A tag) that were used to format that text piece.
Now we have the set of attributes of the A tag used to format a piece of text;
it's stored in a SimpleAttributeSet object. So we just need to get the value
of the HREF attribute and we're done. We can do this by calling
getAttribute(HTML.Attribute.HREF) on the A tag's attribute set.
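The same steps can also be packaged as a reusable method that works on any
Reader, which makes the technique easy to try on an in-memory HTML string.
The following is only a sketch: the class name LinkCollector and the method
collectLinks() are hypothetical and not part of the GetLinks program, and the
code assumes a Java version with generics and the diamond operator (Java 7 or
later). It returns the distinct HREF values it finds, since a link checker
usually needs to visit each link only once.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.swing.text.AttributeSet;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.Element;
import javax.swing.text.ElementIterator;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical helper class; not part of the original GetLinks program.
class LinkCollector {
    // Parses the HTML supplied by 'rd' and returns the distinct HREF
    // values of the A tags found in it.
    static Set<String> collectLinks(Reader rd)
            throws IOException, BadLocationException {
        HTMLEditorKit kit = new HTMLEditorKit();
        Document doc = kit.createDefaultDocument();
        // Same workaround as in GetLinks: ignore charset directives.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        kit.read(rd, doc, 0);

        Set<String> links = new LinkedHashSet<>();
        ElementIterator it = new ElementIterator(doc);
        Element elem;
        while ((elem = it.next()) != null) {
            // Non-null only for elements formatted with an A tag.
            Object a = elem.getAttributes().getAttribute(HTML.Tag.A);
            if (a instanceof AttributeSet) {
                Object href =
                    ((AttributeSet) a).getAttribute(HTML.Attribute.HREF);
                if (href != null) {
                    links.add(href.toString());
                }
            }
        }
        return links;
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><body><p><a href=\"a.html\">one</a> and "
                    + "<a href=\"b.html\">two</a></p></body></html>";
        for (String link : collectLinks(new StringReader(html))) {
            System.out.println(link);
        }
    }
}
```

Checking the attribute with instanceof instead of casting directly to
SimpleAttributeSet is a defensive choice in this sketch; anchors that carry
only a NAME attribute are skipped because their HREF is null.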
SORTING ARRAYS
This tip discusses how you can sort data in arrays. Sorting arrays of
primitive types is easy. There are seven methods in the class Arrays for
sorting arrays of each of the seven primitive types: byte, char, double,
float, int, long, and short. Here's an example that sorts an array of
doubles.
    import java.util.*;
    class Sort1 {
        // Sorts an array of random double values.
        public static void main(String[] args) {
            double[] dblarr = new double[10];
            for (int i=0; i<dblarr.length; i++) {
                dblarr[i] = Math.random();
            }
            // Sort the array.
            Arrays.sort(dblarr);
            // Print the array.
            for (int i=0; i<dblarr.length; i++) {
                System.out.println(dblarr[i]);
            }
        }
    }
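The other primitive types are sorted the same way, and each type also has a
variant of Arrays.sort() that sorts only part of an array. The sketch below
illustrates both; the class name SortPrimitives and the helper sortedCopy()
are hypothetical, and Arrays.toString() assumes Java 5 or later.

```java
import java.util.Arrays;

// Hypothetical demo class; the names are for illustration only.
class SortPrimitives {
    // Returns a sorted copy, leaving the caller's array untouched.
    static int[] sortedCopy(int[] values) {
        int[] copy = values.clone();
        Arrays.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        int[] ints = {42, 7, 19, 3};
        System.out.println(Arrays.toString(sortedCopy(ints)));
        // prints [3, 7, 19, 42]

        char[] chars = {'d', 'a', 'c', 'b'};
        Arrays.sort(chars);
        System.out.println(new String(chars));
        // prints abcd

        // Sort only the elements from index 1 (inclusive) to 4 (exclusive).
        long[] longs = {9L, 5L, 1L, 3L, 0L};
        Arrays.sort(longs, 1, 4);
        System.out.println(Arrays.toString(longs));
        // prints [9, 1, 3, 5, 0]
    }
}
```

Note that Arrays.sort() sorts in place, which is why sortedCopy() clones the
array first when the original order still matters.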
Sorting an array of objects is just as easy if the objects implement the
Comparable interface, java.lang.Comparable. This interface gives a natural
ordering for a class so that objects of that class can be sorted. Here's an
example that sorts an array of String objects; the String class implements
Comparable.
    import java.util.*;
    class Sort2 {
        // Sorts the arguments in args.
        public static void main(String[] args) {
            Arrays.sort(args);
            // Print the arguments in args.
            for (int i=0; i<args.length; i++) {
                System.out.println(args[i]);
            }
        }
    }
What if the objects do not implement Comparable? Well, you've got two choices:
you can modify the objects to implement Comparable, or you can supply a
Comparator to the sort method. Let's look at the first option first.
To make an object comparable you need to add Comparable to the object's
implements list. You then need to modify the object to implement the
compareTo() method. The compareTo() method compares the object with another
object of the same type. If the object should appear before the other object,
compareTo() should return a negative number. If the object should appear after
the other object, compareTo() should return a positive number. Zero should be
returned if the objects are equal.
Point is an AWT class that is not comparable. The following example creates a
version of Point that is comparable. It sorts points by distance from the
origin.
    import java.util.*;
    import java.awt.*;
    class MyPoint extends java.awt.Point implements
            Comparable {
        MyPoint(int x, int y) {
            super(x, y);
        }
        public int compareTo(Object o) {
            MyPoint p = (MyPoint)o;
            double d1 = Math.sqrt(x*x + y*y);
            double d2 = Math.sqrt(p.x*p.x + p.y*p.y);
            if (d1 < d2) {
                return -1;
            } else if (d2 < d1) {
                return 1;
            }
            return 0;
        }
    }
    class Sort3 {
        public static void main(String[] args) {
            Random rnd = new Random();
            MyPoint[] points = new MyPoint[10];
            for (int i=0; i<points.length; i++) {
                points[i] = new MyPoint(rnd.nextInt(100),
                    rnd.nextInt(100));
            }
            Arrays.sort(points);
            // Print the points.
            for (int i=0; i<points.length; i++) {
                System.out.println(points[i]);
            }
        }
    }
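On later versions of the platform (Java 7 or newer is assumed here), the same
idea can be written with the generic form of the interface, Comparable<T>,
which removes the cast inside compareTo(). The sketch below is one possible
variant, not a replacement for MyPoint: the names OriginPoint and
Sort3Generic are hypothetical. It compares squared distances, which gives the
same ordering as comparing the distances themselves and avoids the calls to
Math.sqrt().

```java
import java.awt.Point;
import java.util.Arrays;

// Hypothetical demo class that sorts a few fixed points.
class Sort3Generic {
    public static void main(String[] args) {
        OriginPoint[] points = {
            new OriginPoint(30, 40),
            new OriginPoint(1, 1),
            new OriginPoint(0, 5)
        };
        Arrays.sort(points);
        // Print the points, nearest to the origin first.
        for (OriginPoint p : points) {
            System.out.println(p);
        }
    }
}

// Hypothetical variant of MyPoint that uses Comparable<T>.
class OriginPoint extends Point implements Comparable<OriginPoint> {
    OriginPoint(int x, int y) {
        super(x, y);
    }

    // Squared distance from the origin, computed in long arithmetic
    // so that large coordinates do not overflow.
    long distanceSquared() {
        return (long) x * x + (long) y * y;
    }

    @Override
    public int compareTo(OriginPoint other) {
        // Negative, zero, or positive, as the compareTo() contract requires.
        return Long.compare(distanceSquared(), other.distanceSquared());
    }
}
```

As with MyPoint, two different points at the same distance from the origin
compare as equal here even though equals() reports them as different.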
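If you can't or don't want to make an object Comparable, you can supply a
Comparator object to the Arrays.sort() method. The Comparator object must
implement a method called compare(). The compare() method behaves like the
compareTo() method of the Comparable interface, except that it receives both
objects as parameters: it returns a negative number if the first object should
appear before the second, a positive number if it should appear after, and
zero if the two are equal.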
The next example is similar to the one above. However, instead of creating a
special kind of Point, we create a comparator that can sort Point objects.
    import java.util.*;
    import java.awt.*;
    class PointComparator implements Comparator {
        public int compare(Object o1, Object o2) {
            Point p1 = (Point)o1;
            Point p2 = (Point)o2;
            double d1 = Math.sqrt(p1.x*p1.x + p1.y*p1.y);
            double d2 = Math.sqrt(p2.x*p2.x + p2.y*p2.y);
            if (d1 < d2) {
                return -1;
            } else if (d2 < d1) {
                return 1;
            }
            return 0;
        }
    }
    class Sort4 {
        public static void main(String[] args) {
            Random rnd = new Random();
            Point[] points = new Point[10];
            for (int i=0; i<points.length; i++) {
                points[i] = new Point(rnd.nextInt(100),
                    rnd.nextInt(100));
            }
            Arrays.sort(points, new PointComparator());
            // Print the points.
            for (int i=0; i<points.length; i++) {
                System.out.println(points[i]);
            }
        }
    }
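If a newer platform version is available (Java 8 or later is assumed here), a
comparator like PointComparator can be written without a separate class by
using a lambda expression. The following is a sketch only; the class name
Sort4Lambda and the constant BY_DISTANCE are hypothetical. Like the example
above, it orders points by distance from the origin, but it compares squared
distances, which produces the same ordering.

```java
import java.awt.Point;
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical demo class; the names are for illustration only.
class Sort4Lambda {
    // Orders points by their squared distance from the origin.
    static final Comparator<Point> BY_DISTANCE =
        Comparator.comparingLong(
            (Point p) -> (long) p.x * p.x + (long) p.y * p.y);

    public static void main(String[] args) {
        Point[] points = {
            new Point(30, 40), new Point(1, 1), new Point(0, 5)
        };

        // Nearest to the origin first.
        Arrays.sort(points, BY_DISTANCE);
        for (Point p : points) {
            System.out.println(p);
        }

        // Farthest from the origin first.
        Arrays.sort(points, BY_DISTANCE.reversed());
        for (Point p : points) {
            System.out.println(p);
        }
    }
}
```

Because the ordering lives in the comparator rather than in the class being
sorted, several different orderings can be defined for the same Point objects
and passed to Arrays.sort() as needed.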