Table 2-4 Upper 128 characters of the ISO Latin-1 character set
| Code | Character | Code | Character | Code | Character | Code | Character |
| 128 | | 160 | non-breaking space | 192 | À | 224 | à |
| 129 | | 161 | ¡ | 193 | Á | 225 | á |
| 130 | bph | 162 | ¢ | 194 | Â | 226 | â |
| 131 | nbh | 163 | £ | 195 | Ã | 227 | ã |
| 132 | | 164 | ¤ | 196 | Ä | 228 | ä |
| 133 | nel | 165 | ¥ | 197 | Å | 229 | å |
| 134 | ssa | 166 | ¦ | 198 | Æ | 230 | æ |
| 135 | esa | 167 | § | 199 | Ç | 231 | ç |
| 136 | hts | 168 | ¨ | 200 | È | 232 | è |
| 137 | htj | 169 | © | 201 | É | 233 | é |
| 138 | vts | 170 | ª | 202 | Ê | 234 | ê |
| 139 | pld | 171 | « | 203 | Ë | 235 | ë |
| 140 | plu | 172 | ¬ | 204 | Ì | 236 | ì |
| 141 | ri | 173 | shy | 205 | Í | 237 | í |
| 142 | ss2 | 174 | ® | 206 | Î | 238 | î |
| 143 | ss3 | 175 | ¯ | 207 | Ï | 239 | ï |
| 144 | dcs | 176 | ° | 208 | Ð | 240 | ð |
| 145 | pu1 | 177 | ± | 209 | Ñ | 241 | ñ |
| 146 | pu2 | 178 | ² | 210 | Ò | 242 | ò |
| 147 | sts | 179 | ³ | 211 | Ó | 243 | ó |
| 148 | cch | 180 | ´ | 212 | Ô | 244 | ô |
| 149 | mw | 181 | µ | 213 | Õ | 245 | õ |
| 150 | spa | 182 | ¶ | 214 | Ö | 246 | ö |
| 151 | epa | 183 | · | 215 | × | 247 | ÷ |
| 152 | sos | 184 | ¸ | 216 | Ø | 248 | ø |
| 153 | | 185 | ¹ | 217 | Ù | 249 | ù |
| 154 | sci | 186 | º | 218 | Ú | 250 | ú |
| 155 | csi | 187 | » | 219 | Û | 251 | û |
| 156 | st | 188 | ¼ | 220 | Ü | 252 | ü |
| 157 | osc | 189 | ½ | 221 | Ý | 253 | ý |
| 158 | pm | 190 | ¾ | 222 | Þ (capital thorn) | 254 | þ (little thorn) |
| 159 | apc | 191 | ¿ | 223 | ß | 255 | ÿ |
Programs that don't support ISO Latin-1 characters often operate by ignoring the most significant bit of each character; that is, they presume that each byte begins with a zero bit. For example, the u with an umlaut (ü), ISO Latin-1 character 252, would be reduced to ASCII character 252-128, which is character 124, the vertical bar, |. This can be a reasonable approximation if most of the text is ASCII.
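The arithmetic above amounts to masking off the top bit of each byte. Here is a minimal sketch of that behavior; the class name HighBitStrip and the method name stripHighBit are made up for illustration and do not come from any library.

```java
final class HighBitStrip {
    // Clears the most significant bit of an 8-bit character code,
    // which is what a 7-bit-only program effectively does.
    // (Hypothetical helper, for illustration only.)
    static int stripHighBit(int latin1Code) {
        return latin1Code & 0x7F;
    }

    public static void main(String[] args) {
        int code = 252; // ü in ISO Latin-1
        int stripped = stripHighBit(code);
        // 252 - 128 = 124, the vertical bar
        System.out.println(code + " -> " + stripped + " (" + (char) stripped + ")");
    }
}
```

Pure ASCII codes (0 through 127) pass through the mask unchanged, which is why the damage is limited when most of the text is ASCII.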
Unicode
Just as ISO Latin-1 extends ASCII by adding an extra high-order bit, so too does Unicode extend ISO Latin-1 by adding an extra high-order byte. If the high-order byte is zero (00000000), then the Unicode character is identical to the ISO Latin-1 character in the low-order byte. You can do an approximate conversion from Unicode to ISO Latin-1 by chopping off the high-order byte of each character. This works as long as all the text is composed only of ISO Latin-1 characters. Most of the time, especially when you're working in English, this is a reasonable assumption. Many of Java's classes that output text make this assumption, most notably PrintStream, which includes System.out.
Note: I'd love to show you a table of all the extra characters in Unicode, but it would be so lengthy that this book would be mostly that table and not much else. If you need to know more about the specific encodings of the different characters in Unicode, you should check out The Unicode Standard, Second edition, ISBN 0-201-48345-9, from Addison-Wesley. This 950-page book includes the complete Unicode 2.0 specification. Errata for this volume are on the Web at http://www.unicode.org/.
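To make the "chop off the high-order byte" conversion concrete, here is a rough sketch. The class ChopHighByte and its method toLatin1Approx are hypothetical names invented for this example; this is not how PrintStream is actually implemented, only an illustration of the same idea.

```java
final class ChopHighByte {
    // Keeps only the low-order byte of each 16-bit Java char.
    // (Hypothetical helper, for illustration only.)
    static byte[] toLatin1Approx(String s) {
        byte[] out = new byte[s.length()];
        for (int i = 0; i < s.length(); i++) {
            out[i] = (byte) (s.charAt(i) & 0xFF);
        }
        return out;
    }

    public static void main(String[] args) {
        // \u00FC is ü, which has a zero high-order byte, so it survives intact.
        for (byte b : toLatin1Approx("Z\u00FCrich")) {
            System.out.print(String.format("%02X ", b & 0xFF));
        }
        System.out.println();
        // \u20AC has a nonzero high-order byte (0x20); only 0xAC remains,
        // which ISO Latin-1 reads as the not sign.
        System.out.println(String.format("%02X", toLatin1Approx("\u20AC")[0] & 0xFF));
    }
}
```

The second call shows why the conversion is only approximate: any character outside ISO Latin-1 silently turns into some unrelated Latin-1 character.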
Mac Roman
Remember that I said there were two ways to encode these extra characters in the upper 128 byte values? The Macintosh uses a completely different character-encoding scheme called Mac Roman. It has most of the same glyphs as the ISO Latin-1 character set, but the glyphs are mapped to different numbers. If Java programs try to print the upper 128 characters on a Macintosh, they come out in the Mac Roman character set, not the ISO Latin-1 character set as they are supposed to.
This is a royal pain for more than just Java programs because it makes file translation between platforms excessively difficult. In fact, Java 1.1 provides one of the few class libraries that can translate between the Mac Roman and ISO Latin-1 character sets. This is especially painful to authors trying to write about ISO Latin-1 on a Macintosh.
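As a rough sketch of such a translation, the following decodes Mac Roman bytes into a Java string and re-encodes the string as ISO Latin-1. It uses the current java.nio.charset API rather than the Java 1.1 classes, and it assumes the running JDK ships a charset named "x-MacRoman"; that name is not guaranteed on every Java runtime, so check Charset.isSupported first. The class name MacRomanToLatin1 is made up for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MacRomanToLatin1 {
    // Assumption: the charset is registered as "x-MacRoman" on this JDK.
    static final String MAC_ROMAN = "x-MacRoman";

    // Decodes the bytes as Mac Roman, then encodes the text as ISO Latin-1.
    // Characters with no ISO Latin-1 equivalent are replaced with '?'.
    static byte[] convert(byte[] macRomanBytes) {
        String text = new String(macRomanBytes, Charset.forName(MAC_ROMAN));
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        if (!Charset.isSupported(MAC_ROMAN)) {
            System.out.println(MAC_ROMAN + " is not available on this runtime");
            return;
        }
        // 0x9F is ü in Mac Roman; ISO Latin-1 puts ü at 0xFC (252).
        byte[] latin1 = convert(new byte[] { (byte) 0x9F });
        System.out.println(latin1[0] & 0xFF);
    }
}
```

Going through a String means the conversion is really Mac Roman to Unicode to ISO Latin-1, so the same two calls work for any pair of encodings the runtime supports.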
When the Macintosh was created in the early 1980s, it was one of the very few computers that could handle non-ASCII text. ISO Latin-1 was not yet established. Therefore, Apple had to invent their own scheme for encoding the extra characters. Regrettably, backward-compatibility means that Macs will never get in sync with the rest of the world. That's one of the disadvantages of pioneering new technology.
To make matters worse, it's happening again. Apple developed their 2-byte WorldScript technology before Unicode was ready. Everyone who came after Apple standardized on Unicode. This means that we're probably stuck with ASCII as the lowest common denominator for text data for the foreseeable future.