ASCII
Unicode is based on two character sets that predate it: ASCII and ISO Latin-1. ASCII is a 7-bit character set with 128 different characters. It was designed for communication in United States English, so it contains the lowercase letters a-z, the capital letters A-Z, the digits 0-9, various punctuation marks, and a number of non-printing control characters, many of which are closely tied to the terminals and printers that were in use when ASCII was invented. The characters in ASCII are numbered from 0 to 127. Character 0 is the non-printing null character. Character 127 is the delete character. Characters 48 through 57 are the digits 0 through 9. Characters 65 through 90 are the capital letters A through Z. Characters 97 through 122 are the lowercase letters a through z. The remaining ASCII characters are punctuation marks and non-printing control characters. Table 2-3 is a complete list.
Table 2-3 The ASCII character set
Code | Character | Code | Character | Code | Character | Code | Character
0 | null | 32 | space | 64 | @ | 96 | `
1 | soh | 33 | ! | 65 | A | 97 | a
2 | stx | 34 | " | 66 | B | 98 | b
3 | etx | 35 | # | 67 | C | 99 | c
4 | eot | 36 | $ | 68 | D | 100 | d
5 | enq | 37 | % | 69 | E | 101 | e
6 | ack | 38 | & | 70 | F | 102 | f
7 | bell | 39 | ' | 71 | G | 103 | g
8 | backspace | 40 | ( | 72 | H | 104 | h
9 | tab (\t) | 41 | ) | 73 | I | 105 | i
10 | linefeed (\n) | 42 | * | 74 | J | 106 | j
11 | vertical tab | 43 | + | 75 | K | 107 | k
12 | formfeed (\f) | 44 | , | 76 | L | 108 | l
13 | carriage return (\r) | 45 | - | 77 | M | 109 | m
14 | so | 46 | . | 78 | N | 110 | n
15 | si | 47 | / | 79 | O | 111 | o
16 | dle | 48 | 0 | 80 | P | 112 | p
17 | dc1 | 49 | 1 | 81 | Q | 113 | q
18 | dc2 | 50 | 2 | 82 | R | 114 | r
19 | dc3 | 51 | 3 | 83 | S | 115 | s
20 | dc4 | 52 | 4 | 84 | T | 116 | t
21 | nak | 53 | 5 | 85 | U | 117 | u
22 | syn | 54 | 6 | 86 | V | 118 | v
23 | etb | 55 | 7 | 87 | W | 119 | w
24 | can | 56 | 8 | 88 | X | 120 | x
25 | em | 57 | 9 | 89 | Y | 121 | y
26 | sub | 58 | : | 90 | Z | 122 | z
27 | escape | 59 | ; | 91 | [ | 123 | {
28 | is4 | 60 | < | 92 | \ | 124 | |
29 | is3 | 61 | = | 93 | ] | 125 | }
30 | is2 | 62 | > | 94 | ^ | 126 | ~
31 | is1 | 63 | ? | 95 | _ | 127 | delete
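The numeric ranges described for ASCII can be checked directly in Java, because a char converts to its numeric code with a simple cast. The following is only a minimal sketch: the class name AsciiRanges and the helper methods classify and isAscii are hypothetical names chosen for illustration, not part of any Java library.

```java
public class AsciiRanges {

    // Hypothetical helper: names the ASCII range a code falls into,
    // using the ranges given in the text (48-57, 65-90, 97-122, and so on).
    static String classify(int code) {
        if (code < 0 || code > 127) return "not ASCII";
        if (code >= 48 && code <= 57) return "digit";
        if (code >= 65 && code <= 90) return "capital letter";
        if (code >= 97 && code <= 122) return "lowercase letter";
        if (code < 32 || code == 127) return "control character";
        return "punctuation or space";
    }

    // Hypothetical helper: true if every character in s fits in 7 bits.
    static boolean isAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 127) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println((int) 'A');             // prints 65
        System.out.println((char) 97);             // prints a
        System.out.println(classify('7'));         // prints digit
        System.out.println(isAscii("caf\u00e9"));  // prints false (e with acute accent is 233)
    }
}
```

The range checks are written out by hand so that they mirror the table. In real code the methods of java.lang.Character are usually a better choice, but they classify all of Unicode rather than ASCII alone.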
ISO Latin-1
As I said, ASCII is designed to handle U.S. English. It does a reasonable job with other dialects of English, but it runs into problems with many other European languages, such as French and German. It has no cedillas, umlauts, or any of the other characters that these languages use but English does not.
When an ASCII character is stored in an 8-bit byte, the first (high-order) bit is always 0. You can define another 128 characters by using the bytes whose first bit is 1. Indeed, this is the scheme used in most modern computers. The characters with numeric values between 128 and 255 are used to encode the additional characters needed by most languages that are written in some approximation of the Latin alphabet. There are at least two common ways of extending ASCII into the upper 128 characters. The one around which Unicode and Java are built is the ISO 8859-1 Latin-1 character set, often just referred to as ISO Latin-1. Table 2-4 lists the upper 128 characters of the ISO Latin-1 character set. The lower 128 characters are exactly the same as they are for ASCII.
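To make the high-bit scheme concrete, here is a minimal sketch that tests whether a byte falls in the upper 128 values and decodes it as ISO Latin-1. It assumes a Java version that provides java.nio.charset.StandardCharsets (Java 7 or later). The class name Latin1Demo and the methods isUpperHalf and decodeLatin1 are hypothetical names used only for illustration.

```java
import java.nio.charset.StandardCharsets;

public class Latin1Demo {

    // Hypothetical helper: true if the high-order bit of the byte is set,
    // that is, if its unsigned value lies between 128 and 255.
    static boolean isUpperHalf(byte b) {
        return (b & 0x80) != 0;
    }

    // Hypothetical helper: decodes a single byte as ISO 8859-1 and
    // returns the numeric value of the resulting Java char.
    static int decodeLatin1(byte b) {
        String s = new String(new byte[] { b }, StandardCharsets.ISO_8859_1);
        return s.charAt(0);
    }

    public static void main(String[] args) {
        byte eAcute = (byte) 0xE9; // 233, e with acute accent in ISO Latin-1

        System.out.println(isUpperHalf(eAcute));       // prints true
        System.out.println(isUpperHalf((byte) 'A'));   // prints false
        System.out.println(decodeLatin1(eAcute));      // prints 233
        System.out.println(decodeLatin1((byte) 'A'));  // prints 65
    }
}
```

The decoded values equal the byte values because ISO Latin-1 bytes map to the Unicode characters with the same numbers. The mask with 0x80 is needed because Java bytes are signed, so a direct comparison such as b > 127 would never be true.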
|