WELCOME to the Java Developer Connection(TM) (JDC) Tech Tips,
June 13, 2000. These tips were developed using Java(TM) 2 SDK, Standard Edition,
v 1.2.2.
This issue of the JDC Tech Tips is written by Glen McCluskey.
USING BREAKITERATOR TO PARSE TEXT
The standard Java(TM) packages such as java.util include several
classes that you can use to break text into words or other logical
units. One of these classes is java.util.StringTokenizer. When you
use StringTokenizer, you specify a set of delimiter characters;
instances of StringTokenizer then return words delimited by these
characters. java.io.StreamTokenizer is a class that does something
similar.
These classes are quite useful. However, they have some limitations.
This is especially true when you're trying to parse text that
represents human language. For example, the classes don't have
built-in knowledge of punctuation rules, and the classes might
define a "word" as simply a string of contiguous non-whitespace
characters.
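As a rough illustration of that limitation, the sketch below tokenizes a
short sentence with the default whitespace delimiters. It is not part of
the original tip, and it is written for a current JDK rather than the
1.2.2 SDK. The class name TokenizerLimits and the helper method tokens
are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizerLimits {
    // split text on StringTokenizer's default delimiters (whitespace)
    static List<String> tokens(String text) {
        List<String> result = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text);
        while (st.hasMoreTokens()) {
            result.add(st.nextToken());
        }
        return result;
    }

    public static void main(String[] args) {
        for (String t : tokens("Stop! Who goes there?")) {
            System.out.println("[" + t + "]");
        }
    }
}
```

Run as written, this prints [Stop!], [Who], [goes], and [there?], one per
line. The punctuation stays attached to the neighboring word because
StringTokenizer only knows about delimiter characters.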
java.text.BreakIterator is a class specifically designed to parse
human language text into words, lines, and sentences. To see how it
works, here's a simple example:
import java.text.BreakIterator;

public class BreakDemo1 {
    public static void main(String args[]) {

        // string to be broken into sentences
        String str = "\"Testing.\" \"???\" (This is a test.)";

        // create a sentence break iterator
        BreakIterator brkit =
            BreakIterator.getSentenceInstance();
        brkit.setText(str);

        // iterate across the string
        int start = brkit.first();
        int end = brkit.next();
        while (end != BreakIterator.DONE) {
            String sentence = str.substring(start, end);
            System.out.println(start + " " + sentence);
            start = end;
            end = brkit.next();
        }
    }
}
The input string is:
"Testing." "???" (This is a test.)
It is immediately apparent that parsing this input is not trivial.
For example, suppose you follow a simple rule that a sentence ends
with a period. Well, actually, it doesn't. The fact that it
doesn't is demonstrated by the following two sentences, both
of which are considered correct:
"This is a test."
"This is a test".
The first of these sentences is more standard relative to
long-standing English usage.
BreakIterator applies a set of rules to handle situations such as
this. When you run the BreakDemo1 program in the United States
locale, the result is:
0 "Testing."
11 "???"
17 (This is a test.)
The numbers are offsets into the string where each sentence starts.
In other words, BreakIterator returns a series of offsets that tell
where some particular unit (sentence, word) starts in a string.
BreakIterator is particularly useful in applications such as word
processing, where, for example, you might be trying to find the
location of the next sentence in some currently displayed text.
The demo program uses the default locale settings, but it could have
requested a specific locale, for example:
... BreakIterator.getSentenceInstance(Locale.GERMAN);
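To make the word-processing case a little more concrete, here is a
minimal sketch that asks where the next sentence starts after a given
offset. It uses BreakIterator.following(int) and passes Locale.US
explicitly so the result does not depend on the default locale. The
class name NextSentence, the method nextSentenceStart, and the choice of
-1 as a "no further sentence" value are assumptions made for this
example, not part of the original tip.

```java
import java.text.BreakIterator;
import java.util.Locale;

public class NextSentence {
    // return the offset at which the next sentence starts after
    // 'offset', or -1 if no further sentence follows
    static int nextSentenceStart(String text, int offset) {
        BreakIterator brkit =
            BreakIterator.getSentenceInstance(Locale.US);
        brkit.setText(text);

        // first boundary strictly after the given offset
        int next = brkit.following(offset);
        if (next == BreakIterator.DONE || next == text.length()) {
            return -1;
        }
        return next;
    }

    public static void main(String[] args) {
        String text = "First one. Second one. Third one.";
        int caret = 3; // hypothetical cursor position inside "First"
        System.out.println(nextSentenceStart(text, caret));
    }
}
```

For the sample text, the program prints 11, the offset of "Second".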
Another way you can use BreakIterator is to find line breaks,
that is, locations in text where a line could be broken for
text formatting. Here's an example:
import java.text.BreakIterator;

public class BreakDemo2 {
    public static void main(String args[]) {

        // string to be broken into lines
        String str = "This sen-tence con-tains hyphenation.";

        // create a line break iterator
        BreakIterator brkit =
            BreakIterator.getLineInstance();
        brkit.setText(str);

        // iterate across the string
        int start = brkit.first();
        int end = brkit.next();
        while (end != BreakIterator.DONE) {
            String line = str.substring(start, end);
            System.out.println(start + " " + line);
            start = end;
            end = brkit.next();
        }
    }
}
Program output is:
0 This
5 sen-
9 tence
15 con-
19 tains
25 hyphenation.
BreakIterator applies punctuation rules about where text can be
broken, such as between words or within a hyphenated word (but not
between a word and a following ".").
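One way these break positions might be put to work is a simple greedy
word wrap. The sketch below is only an illustration, written for a
current JDK. The class name LineWrap, the method wrap, and the width
parameter are invented for this example, and a piece longer than the
width is simply placed on a line of its own.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class LineWrap {
    // greedily pack the pieces between legal line breaks into
    // lines of at most 'width' characters
    static List<String> wrap(String text, int width) {
        BreakIterator brkit = BreakIterator.getLineInstance(Locale.US);
        brkit.setText(text);

        List<String> lines = new ArrayList<>();
        StringBuilder line = new StringBuilder();
        int start = brkit.first();
        for (int end = brkit.next(); end != BreakIterator.DONE;
                start = end, end = brkit.next()) {
            String piece = text.substring(start, end);
            // trailing whitespace on the piece does not count
            // against the width
            if (line.length() > 0
                    && line.length() + piece.trim().length() > width) {
                lines.add(line.toString().trim());
                line.setLength(0);
            }
            line.append(piece);
        }
        if (line.length() > 0) {
            lines.add(line.toString().trim());
        }
        return lines;
    }

    public static void main(String[] args) {
        String str = "This sen-tence con-tains hyphenation.";
        for (String l : wrap(str, 12)) {
            System.out.println(l);
        }
    }
}
```

Assuming the iterator reports the same break positions as in the
BreakDemo2 output above, a width of 12 yields the lines "This sen-",
"tence con-", "tains", and "hyphenation.".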
You can also use BreakIterator
to find word and character breaks.
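As a sketch of word breaking, the example below collects the words in
a string. A word iterator reports boundaries around punctuation and
whitespace as well as around words, so the code keeps only the pieces
that start with a letter or digit. That filter, the class name
WordBreaks, and the method words are choices made for this
illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordBreaks {
    // return the words in text, skipping the punctuation and
    // whitespace that also sit between word boundaries
    static List<String> words(String text) {
        BreakIterator brkit = BreakIterator.getWordInstance(Locale.US);
        brkit.setText(text);

        List<String> result = new ArrayList<>();
        int start = brkit.first();
        for (int end = brkit.next(); end != BreakIterator.DONE;
                start = end, end = brkit.next()) {
            if (Character.isLetterOrDigit(text.charAt(start))) {
                result.add(text.substring(start, end));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String w : words("Stop! Who goes there?")) {
            System.out.println(w);
        }
    }
}
```

This prints Stop, Who, goes, and there, one per line, without the
attached punctuation.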
It's important to note that in finding breaks, BreakIterator
analyzes characters independently of how they are stored.
A "character" in a human language is not necessarily equivalent to
a single Java 16-bit char. For example, an accented character might
be stored as a base character along with a mark. BreakIterator
analyzes these kinds of composite characters as a single character.
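The difference between stored chars and user-visible characters can be
seen with a character break iterator. The sketch below counts the
boundaries in a string whose accented letters are stored as a base
letter followed by the combining acute accent (U+0301). The class name
UserCharacters and the method count are made up for this example.

```java
import java.text.BreakIterator;

public class UserCharacters {
    // count characters as a reader would see them, treating a base
    // character plus its combining marks as one unit
    static int count(String text) {
        BreakIterator brkit = BreakIterator.getCharacterInstance();
        brkit.setText(text);

        int n = 0;
        brkit.first();
        while (brkit.next() != BreakIterator.DONE) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // "resume" with both e's followed by a combining acute accent
        String str = "re\u0301sume\u0301";
        System.out.println(str.length() + " chars, "
            + count(str) + " user characters");
    }
}
```

The string holds 8 Java chars, but the iterator reports 6 characters,
because each base letter and its accent are analyzed together.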
One final note about BreakIterator: it's intended for use with
human languages, not computer ones. For example, a "sentence" in
programming language source code has little meaning.
For more information about BreakIterator, see the BreakIterator
API documentation.
GOTO STATEMENTS AND JAVA(TM) PROGRAMMING
Suppose you write a C/C++ program that searches a 5 x 5 array
to find the first occurrence of a particular value. You might use
the following approach:
#include <stdio.h>

/* 5 x 5 array of numbers */
#define N 5
static int vec[N][N] = {
    {1, 2, 3, 4, 5},
    {2, 3, 4, 5, 6},
    {3, 4, 5, 6, 7},
    {4, 5, 6, 7, 8},
    {5, 6, 7, 8, 9}
};

/* target number to be searched for */
static int TARGET = 8;

int main() {
    int i = 0;
    int j = 0;
    int found = 0;

    /* iterate through the array,
       looking for the target */
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            if (vec[i][j] == TARGET) {
                found = 1;
                goto done;
            }
        }
    }

done:
    if (found) {
        printf("Found at %d %d\n", i, j);
    }

    return 0;
}
If you run the program, you get the result:
Found at 3 4
In this example, a loop nested in another loop is used to find
the matching array element. If the program finds the element, it
needs to "break" from the nested loops. It's not sufficient to
simply break from the inner loop. Doing that only takes the program
to the outer loop; it does not actually terminate both loops. So
a goto is used to jump out of the inner loop and transfer control
to the "done:" label. Using a goto is not the only way to solve the
problem in C/C++, but this is one place where a goto is sometimes
used.
Goto statements are controversial. One problem is that it's hard
to control the program logic effectively if you use these
statements. For example, look again at the program above. It's
clear that the "found" test that is just after the "done:" label
is intended for use after the loop has terminated (that is, after
the loop terminates normally or through the goto). But there's no
way to enforce this rule; control can be transferred to this label
from anywhere in the function.
In the Java(TM) programming language, goto is a reserved word, but
the language does not have a goto statement. However, there are
alternative statements that you can use in the Java programming
language in place of the goto statement. This tip demonstrates
three alternative statements.
The first of these is a rewrite of the above program:
public class ControlDemo1 {

    // 5 x 5 array of numbers
    static int vec[][] = {
        {1, 2, 3, 4, 5},
        {2, 3, 4, 5, 6},
        {3, 4, 5, 6, 7},
        {4, 5, 6, 7, 8},
        {5, 6, 7, 8, 9}
    };
    static final int N = 5;

    // target number to be searched for
    static final int TARGET = 8;

    public static void main(String args[]) {
        int i = 0;
        int j = 0;
        boolean found = false;

        // iterate through the array,
        // looking for the target
        outer:
        for (i = 0; i < N; i++) {
            for (j = 0; j < N; j++) {
                if (vec[i][j] == TARGET) {
                    found = true;
                    break outer;
                }
            }
        }

        if (found) {
            System.out.println("Found at " + i +
                " " + j);
        }
    }
}
The key point in this example is that break statements can be
labeled, that is, a break can designate a labeled loop. Specifying
"break outer" in the above example terminates the loop labeled
"outer". In other words, the break statement terminates both
loops.
The same idea applies to continue statements, for example:
public class ControlDemo2 {
    public static void main(String args[]) {
        outer:
        for (int i = 1; i <= 3; i++) {
            for (int j = 1; j <= 3; j++) {
                System.out.println(i + " " + j);
                if (i == 2 && j == 2) {
                    continue outer;
                }
            }
        }
    }
}
Output here is:
1 1
1 2
1 3
2 1
2 2
3 1
3 2
3 3
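A labeled continue is handy when an inner loop discovers that the
current outer iteration should be abandoned. As a small sketch (the
class ContinueRows and the method countNonNegativeRows are invented
for this illustration), the code below counts the rows of an array
that contain no negative values:

```java
public class ContinueRows {
    // count the rows that contain no negative values
    static int countNonNegativeRows(int[][] rows) {
        int count = 0;
        outer:
        for (int i = 0; i < rows.length; i++) {
            for (int j = 0; j < rows[i].length; j++) {
                if (rows[i][j] < 0) {
                    // abandon this row, skipping count++ below
                    continue outer;
                }
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        int[][] rows = {
            {1, 2},
            {3, -4},
            {5, 6}
        };
        System.out.println(countNonNegativeRows(rows));
    }
}
```

The program prints 2. The middle row is skipped as soon as the
negative value is found, without the need for a separate flag
variable.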
Break statements are normally used in loop and switch statements,
but you can also use them in any labeled block. Here's an example
that illustrates this idea:
public class ControlDemo3 {

    // add two numbers together,
    // a >= 0 and b >= 0
    // throw IllegalArgumentException
    // if a or b out of range
    static int add(int a, int b) {
        block1: {
            if (a < 0) {
                break block1;
            }
            if (b < 0) {
                break block1;
            }
            return a + b;
        }
        throw new IllegalArgumentException(
            "a < 0 || b < 0");
    }

    public static void main(String args[]) {

        // legal case
        try {
            int a = 37;
            int b = 47;
            int c = add(a, b);
            System.out.println(c);
        }
        catch (IllegalArgumentException e) {
            System.err.println(e);
        }

        // illegal case
        try {
            int a = 37;
            int b = -47;
            int c = add(a, b);
            System.out.println(c);
        }
        catch (IllegalArgumentException e) {
            System.err.println(e);
        }
    }
}
In this example there's a block labeled "block1". The program
handles errors by breaking out of the block. If there are no
errors, the program returns normally from within the block.
An error causes an exception to be thrown after the block is
exited. Note in this example that there are other ways of
structuring the code. For example, you might simply say:
    if (a < 0 || b < 0) {
        throw new IllegalArgumentException(
            "a < 0 || b < 0");
    }
    return a + b;
Which approach is "correct" depends a lot on the complexity of the
logic, and what style you prefer.
The final example illustrates the case where you'd like to perform
some actions, and then somehow gain control for cleanup processing.
You want to do this whether the actions succeed, fail, or trigger
an exception. This case is sometimes implemented in C/C++ by using
a goto to jump to the end of a function, where there is some
cleanup code.
Here's an example of how you can do this in a Java(TM) program:
public class ControlDemo4 {

    // add two numbers together,
    // a >= 0 and b >= 0
    // throw IllegalArgumentException
    // if a or b out of range
    static int traceadd(int a, int b) {
        try {
            if (a < 0 || b < 0) {
                throw new IllegalArgumentException(
                    "a < 0 || b < 0");
            }
            return a + b;
        }
        finally {
            System.out.println(
                "trace: leaving traceadd");
        }
    }

    public static void main(String args[]) {

        // legal case
        try {
            int a = 37;
            int b = 47;
            int c = traceadd(a, b);
            System.out.println(c);
        }
        catch (IllegalArgumentException e) {
            System.err.println(e);
        }

        // illegal case
        try {
            int a = 37;
            int b = -47;
            int c = traceadd(a, b);
            System.out.println(c);
        }
        catch (IllegalArgumentException e) {
            System.err.println(e);
        }
    }
}
This example does program tracing. It prints a message when the
traceadd method exits. The exit can be normal, through the return
statement, or abnormal, through an exception. Using try...finally
(no catch) like this:
    try {
        statement 1
        statement 2
        statement 3
        ...
    }
    finally {
        cleanup
    }
is a way to get control for cleanup, no matter what happens in the
try clause.
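To check the order of events, the following sketch records each step
in a list instead of printing it. The class name FinallyOrder, the log
field, and the method checkedAdd are hypothetical names used only for
this illustration, and the code is written for a current JDK.

```java
import java.util.ArrayList;
import java.util.List;

public class FinallyOrder {
    // records what happened, in order
    static final List<String> log = new ArrayList<>();

    static int checkedAdd(int a, int b) {
        log.add("enter");
        try {
            if (a < 0 || b < 0) {
                throw new IllegalArgumentException(
                    "a < 0 || b < 0");
            }
            return a + b;
        }
        finally {
            // runs on normal return and on exception alike
            log.add("cleanup");
        }
    }

    public static void main(String[] args) {
        System.out.println(checkedAdd(37, 47));
        try {
            checkedAdd(37, -47);
        }
        catch (IllegalArgumentException e) {
            log.add("caught");
        }
        System.out.println(log);
    }
}
```

The program prints 84 and then [enter, cleanup, enter, cleanup, caught].
In the failing call, the cleanup entry is logged before the caller's
catch clause runs.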
For further reading, see chapter 14 in The Java(TM) Language
Specification by James Gosling, Bill Joy, and Guy Steele.