

Because very few text editors are available that allow you to write in Unicode, Java source code files are written in ISO Latin-1. Furthermore, the Java compiler expects to see source code written in ISO Latin-1. If you actually have a text editor that works in Unicode and try to write Java files with it, the compiler will get hopelessly confused when it tries to compile your files.

In fact, Java can be written perfectly well with only ASCII. All Java keywords, operators, and literals, as well as all method, class, and field names in the java packages, can be written in pure ASCII. Because ISO Latin-1 makes your source code difficult to move between Macs and other platforms, you should probably restrict yourself to ASCII in your programs.

You can use Unicode characters in Java string and char literals as well as in identifiers. To embed a non-ASCII character in a string, prefix the four-digit hexadecimal number for the character with \u. For example, the division sign is Unicode character 247, which is F7 in hexadecimal. Therefore, you can make it part of a string by writing \u00F7. The Greek letter π is Unicode character 960, or hexadecimal 03C0, so it is written \u03C0. Because escapes are also legal in identifiers, the following declares a variable named π:

     double \u03C0 = 3.141592;

All Unicode characters can be encoded in this fashion, even those you could type literally. For example, the small letter t can also be written as \u0074. The backslash itself can be written as \u005C. Writing code this way is a very bad idea unless you’re deliberately trying to make it obscure.
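
As a small illustration of the points above, the following sketch uses escapes in an identifier, inside a string literal, and in place of an ordinary ASCII letter. The class and method names are invented for this example.

```java
class UnicodeEscapeDemo {
    // The field name is the single Greek letter pi, spelled as an escape.
    static final double \u03C0 = 3.141592;

    // Reads the field back through its escaped name.
    static double pi() {
        return \u03C0;
    }

    // The escape in the middle is the division sign, character 247.
    static String division() {
        return "6 \u00F7 3 = 2";
    }

    // An escape standing in for a plain ASCII letter: this string is just "t".
    static String smallT() {
        return "\u0074";
    }

    public static void main(String[] args) {
        System.out.println(pi());                        // prints "3.141592"
        System.out.println((int) division().charAt(2));  // prints "247"
        System.out.println(smallT().equals("t"));        // prints "true"
    }
}
```

Whether the division sign itself displays correctly when such a string is printed depends on the output stream and the platform.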

When a Java compiler reads Java source code, it first converts all such \u escapes to the actual characters, taking into account double backslash escapes as well. This pre-processing happens before anything else. For example, consider this statement:

     System.out.println("This is not a \\u0074");

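Because the second backslash immediately follows another backslash, it does not begin a Unicode escape. Instead, the double backslash is interpreted as a literal backslash. Thus you get “This is not a \u0074” instead of “This is not a \t” (a backslash followed by the letter t). To get the second effect, you would have to write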

     System.out.println("This is not a \\\u0074");

or better yet, just

     System.out.println("This is not a \\t");

Unicode escape translation is not cumulative. “\u005Cu0074” is translated to the six characters “\u0074” rather than the single character “t.”
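
The three spellings discussed above can be compared directly. This is a minimal sketch with made-up class and method names. It relies on the rule that a backslash begins a Unicode escape only when it is preceded by an even number of contiguous backslashes.

```java
class EscapeOrderDemo {
    // Two backslashes, then u0074: the second backslash follows an odd number
    // of backslashes, so no Unicode escape starts here. The ordinary string
    // escape then turns the pair into one backslash.
    static String doubled() {
        return "This is not a \\u0074";
    }

    // Three backslashes: the third follows an even number of backslashes,
    // so it begins a Unicode escape that is turned into the letter t first.
    static String tripled() {
        return "This is not a \\\u0074";
    }

    // The readable spelling of the same string as tripled().
    static String plain() {
        return "This is not a \\t";
    }

    public static void main(String[] args) {
        System.out.println(doubled().length());          // prints "20"
        System.out.println(plain().length());            // prints "16"
        System.out.println(tripled().equals(plain()));   // prints "true"
    }
}
```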

As if Unicode input to Java weren’t complex enough, Unicode output is equally troublesome. You already know that PrintStreams like System.out just chop off the high byte of a Unicode character. Although it varies from platform to platform, different output classes in the java package either chop off the high byte like PrintStream or output \u escapes.
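
To make "chopping off the high byte" concrete, here is an arithmetic sketch only. It does not call PrintStream or any other output class, whose actual behavior depends on the Java version and platform, and the names are hypothetical.

```java
class HighByteDemo {
    // Keeps only the low-order 8 bits of a 16-bit char, which is what
    // chopping off the high byte amounts to arithmetically.
    static char chopHighByte(char c) {
        return (char) (c & 0xFF);
    }

    public static void main(String[] args) {
        char pi = (char) 0x03C0;        // Greek small letter pi
        char division = (char) 0x00F7;  // division sign, already fits in 8 bits
        System.out.println((int) chopHighByte(pi));        // prints "192"
        System.out.println((int) chopHighByte(division));  // prints "247"
    }
}
```

A character whose high byte is zero survives unchanged, while any other character turns into an unrelated Latin-1 character.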

UTF8

To summarize what you have learned so far, characters in Java source code are 8-bit ISO Latin-1 characters. Internally, Java translates these characters and any embedded \u escapes into 16-bit Unicode characters.

Using 16-bit characters is relatively inefficient, however, when almost all the text you’re working with is likely to be regular 7-bit ASCII. Therefore, Java byte code embeds string literals in an intermediate format called “Universal Character Set Transformation Format 8-bit form.” Since that’s way more than a mouthful, this is almost always written as the acronym UTF8.

UTF8 encodes the most common characters (the ASCII character set) in a single byte each. Less common characters, including the upper 128 ISO Latin-1 characters (which normally take only one byte apiece), use two bytes. The least common characters of all, the 63,488 Unicode characters from \u0800 to \uFFFF, are encoded in three bytes.

The details are as follows. Characters between 1 and 127 (\u0001 to \u007F), that is, ASCII characters except null, are encoded as their low-order byte. The high byte (which is just zeroes anyway) is discarded. If the Unicode character is between 128 and 2,047 (\u0080 to \u07FF), that is, if its top five bits are zero, then it has 11 bits of data. These 11 bits are encoded as a pair of bytes like this:

     1 1 0 x x x x x   1 0 x x x x x x
           bits 6-10         bits 0-5

The null character is also encoded in two bytes as 1100000010000000.

Characters in the range \u0800 to \uFFFF have a full 16 bits of data. These are encoded in three bytes, like this:

     1 1 1 0 x x x x   1 0 x x x x x x   1 0 x x x x x x
             bits 12-15      bits 6-11         bits 0-5
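
The rules in this section can be sketched as a small encoder for a single 16-bit character: one byte for \u0001 to \u007F, two bytes (110 plus bits 6-10, then 10 plus bits 0-5) for null and for \u0080 to \u07FF, and three bytes (1110 plus bits 12-15, then 10 plus bits 6-11, then 10 plus bits 0-5) for everything else. The class and helper names are invented for illustration and are not part of the java packages.

```java
class JavaUtf8Encoder {
    // Encodes one 16-bit char using the one-, two-, and three-byte layouts.
    static byte[] encode(char c) {
        if (c >= 0x0001 && c <= 0x007F) {
            // One byte: the low-order byte as is.
            return new byte[] { (byte) c };
        } else if (c <= 0x07FF) {
            // Two bytes: 110 + bits 6-10, then 10 + bits 0-5.
            // The null character (0) also lands here.
            return new byte[] {
                (byte) (0xC0 | ((c >> 6) & 0x1F)),
                (byte) (0x80 | (c & 0x3F))
            };
        } else {
            // Three bytes: 1110 + bits 12-15, 10 + bits 6-11, 10 + bits 0-5.
            return new byte[] {
                (byte) (0xE0 | ((c >> 12) & 0x0F)),
                (byte) (0x80 | ((c >> 6) & 0x3F)),
                (byte) (0x80 | (c & 0x3F))
            };
        }
    }

    // Formats bytes as two-digit uppercase hex for display.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encode('A')));            // prints "41"
        System.out.println(hex(encode((char) 0)));       // prints "C080"
        System.out.println(hex(encode((char) 0x00F7)));  // prints "C3B7"
        System.out.println(hex(encode((char) 0x03C0)));  // prints "CF80"
        System.out.println(hex(encode((char) 0x20AC)));  // prints "E282AC"
    }
}
```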


Note:  This is not exactly the official UTF8 encoding. Java differs from the formal standard in that it uses two bytes to encode the null character (\u0000) rather than one. Furthermore, the real UTF8 standard has several more formats to handle four-byte characters as well. By using a 4-byte character set, it’s no longer necessary to unify the Chinese, Japanese, and Vietnamese scripts.

This encoding scheme is designed to be easy and quick to parse. Any byte that begins with a 0 bit is a 1-byte ASCII character. Any byte that begins with 110 starts a 2-byte character. Any byte that starts with 1110 is a 3-byte character. Finally, any byte that starts with 10 is the second or third byte of a multi-byte character.
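
A parser can apply these leading-bit tests with a few masks. The sketch below, with hypothetical names, classifies a byte and uses that to count the characters in a byte array without decoding them.

```java
class Utf8ByteKind {
    // Returns how many bytes the character starting at this byte occupies,
    // or 0 if the byte is a continuation (second or third) byte.
    // Returns -1 for lead bytes outside the three forms covered here.
    static int sequenceLength(int b) {
        b &= 0xFF;
        if ((b & 0x80) == 0x00) return 1; // 0xxxxxxx
        if ((b & 0xC0) == 0x80) return 0; // 10xxxxxx
        if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx
        return -1;
    }

    // Counts characters by counting the bytes that start a character.
    static int countChars(byte[] utf) {
        int count = 0;
        for (byte b : utf) {
            if (sequenceLength(b) > 0) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // One 1-byte, one 2-byte, and one 3-byte character.
        byte[] data = {
            0x41, (byte) 0xCF, (byte) 0x80, (byte) 0xE2, (byte) 0x82, (byte) 0xAC
        };
        System.out.println(countChars(data)); // prints "3"
    }
}
```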

The more ASCII characters in a text string, the more space that can be saved by UTF8. Pure ASCII text is only half as large in UTF8 as it is in true Unicode. In the worst case, where all characters occupy three bytes, a UTF8 string is only 50 percent larger than the equivalent Unicode string. However, the worst case is rarely seen in practice.
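
The best and worst cases can be checked with a little arithmetic. The following sketch, using invented names, counts the bytes a string needs under the encoding rules above and compares that with two bytes per character.

```java
class Utf8SizeDemo {
    // Bytes needed for the string's characters in Java's UTF8 form
    // (not counting any length prefix).
    static int utf8Length(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x0001 && c <= 0x007F) {
                n += 1;
            } else if (c <= 0x07FF) {
                n += 2; // includes the null character
            } else {
                n += 3;
            }
        }
        return n;
    }

    // Bytes needed at a flat two bytes per character.
    static int unicodeLength(String s) {
        return 2 * s.length();
    }

    public static void main(String[] args) {
        String ascii = "Hello, world";
        // Four euro signs, each of which needs three bytes.
        String euros = new String(new char[] { 0x20AC, 0x20AC, 0x20AC, 0x20AC });

        System.out.println(utf8Length(ascii) + " vs " + unicodeLength(ascii)); // prints "12 vs 24"
        System.out.println(utf8Length(euros) + " vs " + unicodeLength(euros)); // prints "12 vs 8"
    }
}
```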

The DataOutputStream and DataInputStream classes have writeUTF() and readUTF() methods, respectively, to handle UTF8 data. readUTF() first reads two bytes from the underlying stream. These are interpreted as an unsigned short specifying the number of bytes to read from the stream (not the number of characters to read from the stream). These bytes are then read and translated from UTF8 into Unicode, and a String containing the translated data is returned. We use this method in Chapter 4 to read the UTF8 strings stored in the constant pool of a byte code file.

The DataOutputStream writeUTF(String s) method writes a Unicode string onto the underlying output stream after translating the string to UTF8 format. The string is preceded by an unsigned short that gives the number of bytes that will be written.
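
A round trip through an in-memory buffer shows the length prefix and the encoded bytes. This is a minimal sketch: the class name and the write() and read() helpers are invented here, and wrapping IOException in UncheckedIOException is only a convenience for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class UtfRoundTrip {
    // Writes the string with writeUTF() and returns the raw bytes produced.
    static byte[] write(String s) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeUTF(s);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads one string back with readUTF().
    static String read(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            return in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String s = "pi=" + (char) 0x03C0;
        byte[] data = write(s);

        // 2-byte length prefix + 3 ASCII bytes + 2 bytes for pi.
        System.out.println(data.length); // prints "7"

        // The prefix counts encoded bytes, not characters.
        int prefix = ((data[0] & 0xFF) << 8) | (data[1] & 0xFF);
        System.out.println(prefix);      // prints "5"

        System.out.println(read(data).equals(s)); // prints "true"
    }
}
```

Note that the prefix is 5 even though the string has only four characters, because the letter pi occupies two bytes.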


Copyright (c) 1996-1999 EarthWeb Inc. All rights reserved. Reproduction in whole or in part in any form or medium without express written permission of EarthWeb is prohibited. Read EarthWeb's privacy statement.