

Table 2-4 Upper 128 characters of the ISO Latin-1 character set

Code  Character  Code  Character           Code  Character          Code  Character

128              160   non-breaking space  192   À                  224   à
129              161   ¡                   193   Á                  225   á
130   bph        162   ¢                   194   Â                  226   â
131   nbh        163   £                   195   Ã                  227   ã
132              164   ¤                   196   Ä                  228   ä
133   nel        165   ¥                   197   Å                  229   å
134   ssa        166   ¦                   198   Æ                  230   æ
135   esa        167   §                   199   Ç                  231   ç
136   hts        168   ¨                   200   È                  232   è
137   htj        169   ©                   201   É                  233   é
138   vts        170   ª                   202   Ê                  234   ê
139   pld        171   «                   203   Ë                  235   ë
140   plu        172   ¬                   204   Ì                  236   ì
141   ri         173   shy                 205   Í                  237   í
142   ss2        174   ®                   206   Î                  238   î
143   ss3        175   ¯                   207   Ï                  239   ï
144   dcs        176   °                   208   Ð                  240   ð
145   pu1        177   ±                   209   Ñ                  241   ñ
146   pu2        178   ²                   210   Ò                  242   ò
147   sts        179   ³                   211   Ó                  243   ó
148   cch        180   ´                   212   Ô                  244   ô
149   mw         181   µ                   213   Õ                  245   õ
150   spa        182   ¶                   214   Ö                  246   ö
151   epa        183   ·                   215   ×                  247   ÷
152   sos        184   ¸                   216   Ø                  248   ø
153              185   ¹                   217   Ù                  249   ù
154   sci        186   º                   218   Ú                  250   ú
155   csi        187   »                   219   Û                  251   û
156   st         188   ¼                   220   Ü                  252   ü
157   osc        189   ½                   221   Ý                  253   ý
158   pm         190   ¾                   222   Þ (capital thorn)  254   þ (little thorn)
159   apc        191   ¿                   223   ß                  255   ÿ

Programs that don’t support ISO Latin-1 characters often operate by ignoring the most significant bit of each character; that is, they presume that each byte begins with a zero bit. For example, the u with an umlaut (ü), ISO Latin-1 character 252, would be reduced to ASCII character 252-128, which is character 124, the vertical bar, |. The result is readable only if most of the text is ASCII, since every character above 127 turns into an unrelated ASCII character.
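The arithmetic in that example can be checked with a few lines of Java. This is only a sketch of what a 7-bit program effectively does; the class name HighBitStrip and the method stripHighBit are hypothetical names chosen for illustration and are not part of any Java library.

```java
final class HighBitStrip {
    // Clear the most significant bit of an 8-bit character code,
    // which is what a program that assumes 7-bit ASCII effectively does.
    static int stripHighBit(int code) {
        return code & 0x7F;
    }

    public static void main(String[] args) {
        int uUmlaut = 252;                    // ISO Latin-1 code for u with an umlaut
        int stripped = stripHighBit(uUmlaut); // 252 - 128 = 124
        System.out.println(stripped + " " + (char) stripped); // prints "124 |"
    }
}
```

Codes below 128 pass through unchanged, which is why pure ASCII text survives this treatment.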

Unicode

Just as ISO Latin-1 extends ASCII by adding an extra high-order bit, so too does Unicode extend ISO Latin-1 by adding an extra high-order byte. If the high-order byte is zero (00000000), then the Unicode character is identical to the ISO Latin-1 character in the low-order byte. You can do an approximate conversion from Unicode to ISO Latin-1 by chopping off the high-order byte of each character. This works as long as the text is composed only of ISO Latin-1 characters. Most of the time, especially when you’re working in English, this is a reasonable assumption. Many of Java’s classes that output text make this assumption, most notably PrintStream, which includes System.out.
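To see what chopping off the high-order byte does, here is a minimal sketch that works on 16-bit Java chars. The class ChopHighByte and its methods lowByte and isLatin1 are hypothetical names used only for this illustration.

```java
final class ChopHighByte {
    // Keep only the low-order byte of a 16-bit Java char.
    static int lowByte(char c) {
        return c & 0xFF;
    }

    // True if the high-order byte is zero, that is, the char is an ISO Latin-1 character.
    static boolean isLatin1(char c) {
        return (c >>> 8) == 0;
    }

    public static void main(String[] args) {
        char uUmlaut = '\u00FC'; // u with an umlaut, inside ISO Latin-1
        char alpha = '\u03B1';   // Greek small letter alpha, outside ISO Latin-1
        System.out.println(lowByte(uUmlaut) + " " + isLatin1(uUmlaut)); // prints "252 true"
        System.out.println(lowByte(alpha) + " " + isLatin1(alpha));     // prints "177 false"
    }
}
```

The u with an umlaut survives as character 252, but the Greek alpha is silently turned into character 177, the plus-or-minus sign. That is the sense in which the conversion is only approximate.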


Note: I’d love to show you a table of all the extra characters in Unicode, but it would be so lengthy that this book would be mostly that table and not much else. If you need to know more about the specific encodings of the different characters in Unicode, you should check out The Unicode Standard, Second edition, ISBN 0-201-48345-9, from Addison-Wesley. This 950-page book includes the complete Unicode 2.0 specification. Errata for this volume are on the Web at http://www.unicode.org/.

Mac Roman

Remember that I said there were two ways to encode these extra characters in the upper 128 values of a byte? The Macintosh uses a completely different character-encoding scheme called Mac Roman. It has most of the same glyphs as the ISO Latin-1 character set, but the glyphs are mapped to different numbers. If Java programs try to print the upper 128 characters on a Macintosh, they come out in the Mac Roman character set, not the ISO Latin-1 character set as they are supposed to.

This is a royal pain for more than just Java programs because it makes file translation between platforms excessively difficult. In fact, Java 1.1 provides one of the few class libraries that can translate between the Mac Roman and ISO Latin-1 character sets. This is especially painful to authors trying to write about ISO Latin-1 on a Macintosh.
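As a rough illustration of such a translation, the sketch below decodes Mac Roman bytes into Unicode and re-encodes them as ISO Latin-1. It makes two assumptions that go beyond the text: it uses the java.nio.charset API from later Java releases rather than the Java 1.1 classes, and it relies on the runtime shipping the optional charset named "x-MacRoman", which not every Java runtime includes. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MacRomanToLatin1 {
    // Assumption: this runtime provides the optional "x-MacRoman" charset.
    static final String MAC_ROMAN = "x-MacRoman";

    // Decode Mac Roman bytes to Unicode, then encode the text as ISO Latin-1.
    // Characters that have no ISO Latin-1 equivalent are replaced by the encoder.
    static byte[] macRomanToLatin1(byte[] macBytes) {
        String text = new String(macBytes, Charset.forName(MAC_ROMAN));
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        if (!Charset.isSupported(MAC_ROMAN)) {
            System.out.println("x-MacRoman is not available in this runtime");
            return;
        }
        byte[] mac = { (byte) 0x9F };          // u with an umlaut in Mac Roman
        byte[] latin1 = macRomanToLatin1(mac);
        System.out.println(latin1[0] & 0xFF);  // 252, u with an umlaut in ISO Latin-1
    }
}
```

Going through Unicode in the middle avoids a hand-built byte-to-byte lookup table, at the cost of losing any Mac Roman character that ISO Latin-1 cannot represent.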

When the Macintosh was created in the early 1980s, it was one of the very few computers that could handle non-ASCII text. ISO Latin-1 was not yet established. Therefore, Apple had to invent their own scheme for encoding the extra characters. Regrettably, backward-compatibility means that Macs will never get in sync with the rest of the world. That’s one of the disadvantages of pioneering new technology.

To make matters worse, it’s happening again. Apple developed their 2-byte WorldScript technology before Unicode was ready. Everyone who came after Apple standardized on Unicode. This means that we’re probably stuck with ASCII as the lowest common denominator for text data for the foreseeable future.


Copyright (c) 1996-1999 EarthWeb Inc. All rights reserved. Reproduction in whole or in part in any form or medium without express written permission of EarthWeb is prohibited. Read EarthWeb's privacy statement.