Introduction to
Character Encodings,
Java and You
Agenda
Defining the problem
– Where webMethods products encounter character set problems.
– What the symptoms look like.
Understand core concepts
– What is a character set? What’s an encoding?
– What is Unicode, really?
Code Examples to avoid problems
Private and Confidential
Confusion Reigns
Generally, the most confusing aspect of internationalization.
1. Many, many standards to choose from.
2. Arcane terminology.
3. American programmers rarely (seem to) encounter it head-on.
– We’re presenting this because many of our products are encountering this problem now.
Problem Domain
webMethods products interface with:
– non-Java systems (for example, in the adapters)
– non-Java environments (file systems, databases, libraries, email, ftp, http, etc.).
Java’s Text Representation
Java provides a convenient text processing architecture centered on the Java String object.
– A Java String is basically an array of Java characters (char values).
Java Characters
Each Java Character object represents a Unicode character.
– (Currently) a 16-bit unsigned integer value between 0 and 65,535.
– Character class provides access to character properties.
    • UPPER, lower, and Titlecase mapping
    • Comparison
    • Directionality
    • Compatibility
    • C-TYPE values such as ‘alpha-ness’, ‘digit-ness’, ‘alphanumericness’
Non-Java Text
Non-Java files, applications, file systems, databases, et al. typically do not use Unicode. Java sees them as an array of bytes (byte[]).
Three Problems
– “?”: Bad Conversion
– (empty box): No glyph
– “ƒÃƒ\ǂكÃ\ǂÙ”: Random-seeming trash characters
Bad Conversion
Target character set doesn’t have this character in it. Java replaces each character with a “?”
Input String: 日本語
Output String: ???
Typically:
– Using the default encoding when we meant to specify one.
– Writing on a device (such as System.out) whose legacy encoding doesn’t support the characters.
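The “?” substitution is easy to reproduce. The following is a minimal sketch, not code from any webMethods product; the class and method names are made up for illustration. It encodes the sample string into ISO-8859-1, which contains no Japanese characters, and decodes the resulting bytes again.

```java
import java.nio.charset.StandardCharsets;

class BadConversionDemo {
    // Encode text into a legacy charset and decode it again.
    // String.getBytes(Charset) silently replaces unmappable characters
    // with the charset's replacement byte, which is '?' for ISO-8859-1.
    static String throughLatin1(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // The sample string 日本語, written with escapes so that the
        // encoding of this source file does not matter.
        String input = "\u65E5\u672C\u8A9E";
        System.out.println(throughLatin1(input)); // prints ???
    }
}
```

The original characters are gone after this step. No later conversion can bring them back, which is why this symptom usually has to be fixed at the point where the bytes are produced.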
“No Glyph”
Java knows what the character is and is handling it properly, but doesn’t have a picture of it to show you (in the current Font selected).
Input String: 日本語
Output String: □□□
Typically:
– Nothing is wrong, just using the wrong Font.
Random Trash
A byte[] was converted using the wrong character encoding. Bytes were mapped to the wrong characters.
Input String: 日本語
Output String: “ú–{Œê
Typically:
– Using the wrong encoding, the underlying bytes are mapped to different, random-seeming characters.
Examples
Same byte sequences, different results:
Shift JIS byte[] = 0xE0, 0x41, 0x83, 0x70 = “漓パ”
Latin-1 byte[] = 0xE0, 0x41, 0x83, 0x70 = “àAƒp”
Java String = 0xE0, 0x41, 0x83, 0x70 = “ 荰”
Java String = “漓パ” = U+6F13 U+30D1
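The same point can be shown in code. This is a rough sketch with invented class and method names; it decodes the four bytes from the lines above under two different charsets. It assumes the JDK in use ships the Shift_JIS charset, which is common but not guaranteed, so the code checks first. Note that strict ISO-8859-1 maps 0x83 to a control character; the “ƒ” shown above is what the Windows extension of Latin-1 displays.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class SameBytesDemo {
    // The four bytes from the example above.
    static final byte[] BYTES = { (byte) 0xE0, 0x41, (byte) 0x83, 0x70 };

    // Decode the same bytes under whatever charset the caller names.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // ISO-8859-1 maps every byte to exactly one character: 4 bytes, 4 chars.
        System.out.println(decode(BYTES, StandardCharsets.ISO_8859_1).length());

        // Shift_JIS is an optional charset, so check before using it.
        if (Charset.isSupported("Shift_JIS")) {
            // Here the bytes pair up as lead byte + trail byte: 4 bytes, 2 chars.
            System.out.println(decode(BYTES, Charset.forName("Shift_JIS")).length());
        }
    }
}
```

Nothing in the byte array itself says which reading is correct. The encoding has to come from somewhere else: a header, a configuration setting, or a convention.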
Character Set Terminology
What is a Character?
A character is a single, atomic unit of text.
The definition has a different meaning according to
the writing system and context.
Abstract characters
Some abstract characters include:
A Roman Letter Capital A
` Combining Accent Grave
に Hiragana character “ni”
語 CJK Ideograph
ي Arabic letter
앚 Hangul syllable
Ａ Fullwidth compatibility letter A
What is a Character Set?
A character set is a “set”: a collection of characters, usually organized in some fashion.
You’re probably most familiar with ASCII:
– 0x41 ‘A’
– 0x42 ‘B’
– Etc.
What is a Character Encoding?
Character set: a collection of characters, basically, a bucket.
Character encoding: the specific ones and zeroes assigned to each character in a character set.
Character Set: ‘A’ == 0x41
Character Encoding: ‘A’ == 0x41
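For ASCII the two notions look identical, which is why they are so often confused. A short sketch (class and helper names are illustrative only) makes the difference visible: the letter ‘A’ has one number in the character set, but different encodings write that number as different bytes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CodePointVsBytes {
    // Render the encoded form of a string as space-separated hex bytes.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String a = "A";
        // The character's number in the character set (its code point).
        System.out.println(Integer.toHexString(a.codePointAt(0))); // 41
        // The bytes chosen by three different encodings for that one number.
        System.out.println(hex(a, StandardCharsets.US_ASCII));     // 41
        System.out.println(hex(a, StandardCharsets.UTF_8));        // 41
        System.out.println(hex(a, StandardCharsets.UTF_16BE));     // 00 41
    }
}
```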
Eight Bit Encodings
8-bit encodings allow for 256 characters.
– 128 ASCII
– 32 ‘C1’ controls
– 96 extended
Latin-1
The standard for Western Europe is generally ISO-8859-1, AKA “Latin-1”.
Used by UNIX systems and the Web.
Extended version (code page 1252) used by Microsoft for Windows.
Let a Thousand Encodings Bloom…
Each language has its own character set…
– Everywhere: ASCII*
– Western European (like German or French): Latin-1
– Eastern European (like Polish or Slovak): Latin-2
– Simplified Chinese: GB2312
Actually, many for each language…
Other Writing Systems
Writing systems vary around the world (in order of increasing complexity, more or less):
– Latin-based alphabets
    • (ABCDEFG…) English
– Cyrillic and Greek-based alphabets
    • (АБВГДЕЖЩ...) Russian
– Ideographic writing systems have thousands of characters
    • (一丁勺両亀困...) Japanese
– Bi-directional (RTL) languages go right to left
    • (אבגדהוז...) Hebrew
– Complex scripts (everything else):
    • (ऋऌऍऎ) Devanagari
Expanded Character Sets
Most languages have alphabetic or phonetic writing systems:
– Russian, Greek, Slavic, (many) Native American, Bahasa, Hebrew, Arabic, Semitic, etc.: alphabetic
– Indian (subcontinent), Thai, Japanese kana, Korean: phonetic writing systems
– 8 bits is enough for all of the above (with some tricks)
Some languages use scripts based on Chinese ideographic writing (“Han” or “Hanja”):
– Chinese
– Korean
– Vietnamese (traditional)
– Japanese Kanji
“Double-Byte”
8-bit character encodings use eight bits per character.
– 2^8 = 256 characters
“Double-byte” character sets must be 2 bytes per character?
– 2^16 = 65,536 characters
Should actually be called “multi-byte” (MBCS).
– Each character can be ONE, TWO, THREE and sometimes FOUR bytes in length.
– MAY involve shift states.
Multibyte Encodings
A typical Japanese Character Set: JIS X 0208 (漢字)
Character Encodings of JIS X 0208:
Shift-JIS (CP932): 0x8A 0xBF 0x8E 0x9A
EUC-JP: 0xB4 0xC1 0xBB 0xFA
ISO 2022-JP: 0x1B 0x24 0x42 0x34 0x41 0x3B 0x7A 0x1B 0x28 0x4A
Non-Legacy:
UTF-16: (0x6F22 0x5B57)
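The byte sequences for 漢字 can be checked from Java. This is a sketch under assumptions: the class and helper names are invented, and the Japanese legacy charsets are optional parts of the JDK, so the helper reports when one is missing instead of failing. The charset names used are the ones Java recognizes, which may differ slightly from the labels above.

```java
import java.nio.charset.Charset;

class KanjiBytes {
    // Hex dump of text in the named charset, or a marker if this JVM lacks it.
    static String hex(String text, String charsetName) {
        if (!Charset.isSupported(charsetName)) {
            return "(charset not available)";
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(Charset.forName(charsetName))) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The two characters 漢字, written with escapes to stay source-encoding neutral.
        String kanji = "\u6F22\u5B57";
        for (String name : new String[] { "Shift_JIS", "EUC-JP", "ISO-2022-JP", "UTF-16BE" }) {
            System.out.println(name + ": " + hex(kanji, name));
        }
    }
}
```

For ISO-2022-JP, the escape sequence that switches back to single-byte mode at the end may differ from the one listed above, depending on the converter.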
An MBCS Example: Shift-JIS
• Character set used by DOS, Windows, Macs, and a few UNIX-like systems for Japanese.
– Code Page 932
– JIS X 0208:1997
Shift-JIS
In order to reach more characters, double byte values start with a limited range of “lead bytes”.
These can be followed by any byte value of 0x40 or above (“trail byte”).
Shift-JIS
Each “lead byte” provides a “window” onto additional characters.
Shift-JIS
Problems:
– Lead byte values are also valid as trail bytes.
– Common special characters (“\”!!) are valid trail bytes.
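The backslash problem is worth seeing in code. The sketch below is illustrative only: the class and method names are invented, and the lead-byte ranges (0x81 to 0x9F and 0xE0 to 0xFC) are the ones commonly given for code page 932, stated here as an assumption. It compares a naive byte scan for “\” with a scan that knows about lead bytes.

```java
class ShiftJisScan {
    // Lead-byte ranges assumed for Shift-JIS as used by code page 932.
    static boolean isLeadByte(int b) {
        return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
    }

    // Naive scan: counts every 0x5C byte, including those that are trail bytes.
    static int countBackslashesNaive(byte[] data) {
        int count = 0;
        for (byte b : data) {
            if ((b & 0xFF) == 0x5C) count++;
        }
        return count;
    }

    // Lead-byte-aware scan: skips the trail byte that follows each lead byte.
    static int countBackslashes(byte[] data) {
        int count = 0;
        for (int i = 0; i < data.length; i++) {
            int b = data[i] & 0xFF;
            if (isLeadByte(b)) {
                i++; // the next byte belongs to this character
            } else if (b == 0x5C) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // 0x83 0x5C is the katakana letter "so"; the final 0x5C is a real backslash.
        byte[] data = { (byte) 0x83, 0x5C, 0x5C };
        System.out.println(countBackslashesNaive(data)); // 2
        System.out.println(countBackslashes(data));      // 1
    }
}
```

Any byte-oriented code that treats “\” specially (path handling, escaping, quoting) has this problem. Converting to a Java String first and working on characters avoids it.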
Han
CJK scripts require up to 100,000 unique characters for complete representation.
– Four major variants:
    • Traditional Chinese
    • Simplified Chinese
    • Japanese Kanji
    • Korean (non-Hangul)
“Kanji”
Sometimes you hear Japanese called “kanji”
– Kanji is actually one of four writing systems used in Japan.
– Kanji should be avoided as a generic term for DBCS.
Kanji (“Han” or Chinese writing): 日本語
Hiragana (phonetic for Japanese words): にほんご
Katakana (phonetic for “foreign” words): ニホンゴ
Romaji (“Roman script”): nihongo
Chinese
Upper two are Traditional. Lower character is the Simplified variant.
Hangul
Korean Hangul is a syllabic phonetic system, which has thousands of combinations.
– Hangul is not related to Han ideographic writing.
Code Page Hell
With hundreds of encodings and character sets to
choose from, making internationalized code work in
the late 1980’s and early 1990’s was “hellish”.
Internationalization folks referred to this as “code
page hell”
Unicode and Java
To the Rescue
Unicode (ISO/IEC 10646)
• Unicode is a character set that supports all of the world’s languages and writing systems.*
• Originally designed as a “wide character set”: every character was represented by 16 bits. This allowed for 65,536 potential characters.
• Extended to allow about 1.1 million characters.
• Unicode is maintained by an industry consortium. ISO/IEC 10646 is maintained by WG2. The two are kept identical in character repertoire and code assignments.
It’s a character set?
Unicode is a character set. It has these encodings:
– UTF-32. (BE/LE)
    • A 32-bit encoding. All characters 32 bits.
– UTF-16. (BE/LE)
    • A 16-bit encoding. Most characters are a single 16-bit unit.
    • Characters above 0xFFFF (outside the “Basic Multilingual Plane”) require two special “surrogate” units.
– UTF-8.
    • An 8-bit variable width encoding. Characters are 1, 2, 3 or 4 bytes long. Always non-endian.
    • ASCII == ASCII
    • All other characters have a special bit pattern
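Surrogates are easiest to understand by looking at one. The sketch below uses an invented class name and relies on the code point methods that were added to the JDK in Java 5, so it assumes a newer runtime than the one this deck was written for. It builds a character above 0xFFFF and shows how the encodings size it.

```java
import java.nio.charset.StandardCharsets;

class SurrogateDemo {
    // Build a String from a single Unicode code point.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String s = fromCodePoint(0x10400); // a character above U+FFFF
        System.out.println(s.length());                                   // 2 UTF-16 units
        System.out.println(s.codePointCount(0, s.length()));              // 1 character
        System.out.println(Integer.toHexString(s.charAt(0)));             // d801, high surrogate
        System.out.println(Integer.toHexString(s.charAt(1)));             // dc00, low surrogate
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);    // 4 bytes
        System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length); // 4 bytes
    }
}
```

The practical consequence is that String.length() counts 16-bit units, not characters, so code that truncates or indexes strings can split a surrogate pair.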
UTF-8 Bit Pattern
ASCII == ASCII
– 0x41 == ‘A’
All other characters are multibyte.
– 110xxxxx == two bytes
– 1110xxxx == three bytes
– 11110xxx == four bytes
– 10xxxxxx == trail byte
– U+00C0 == À == 0xC3 0x80 (11000011 10000000)
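The bit patterns above translate almost directly into code. This hand-written encoder is only a teaching sketch with invented names; it does no validation (surrogates and out-of-range values are not rejected), and real code should use the JDK converters instead.

```java
import java.io.ByteArrayOutputStream;

class Utf8ByHand {
    // Encode one code point following the lead-byte and trail-byte patterns above.
    static byte[] encode(int cp) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (cp < 0x80) {                       // 0xxxxxxx (ASCII)
            out.write(cp);
        } else if (cp < 0x800) {               // 110xxxxx 10xxxxxx
            out.write(0xC0 | (cp >> 6));
            out.write(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {             // 1110xxxx 10xxxxxx 10xxxxxx
            out.write(0xE0 | (cp >> 12));
            out.write(0x80 | ((cp >> 6) & 0x3F));
            out.write(0x80 | (cp & 0x3F));
        } else {                               // 11110xxx and three trail bytes
            out.write(0xF0 | (cp >> 18));
            out.write(0x80 | ((cp >> 12) & 0x3F));
            out.write(0x80 | ((cp >> 6) & 0x3F));
            out.write(0x80 | (cp & 0x3F));
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // U+00C0 from the example above.
        for (byte b : encode(0x00C0)) {
            System.out.printf("%02X ", b & 0xFF); // C3 80
        }
        System.out.println();
    }
}
```

Because lead bytes and trail bytes never overlap, a reader can always find the start of the next character, which is one of the things Shift-JIS cannot offer.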
Convenience Method for UTF8
Almost True: readUTF and writeUTF allow direct access to UTF-8 DataInput/DataOutputStreams.
– This is not really UTF-8, but a Sun specialized version (“modified UTF-8”), written with a two-byte length prefix.
– Use InputStreamReader/OutputStreamWriter to do proper conversions.
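One visible difference between writeUTF output and real UTF-8 is the NUL character. The sketch below (class and method names are invented) writes the same one-character string both ways and compares the bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ModifiedUtf8Demo {
    // Bytes produced by DataOutputStream.writeUTF, including its 2-byte length prefix.
    static byte[] viaWriteUTF(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Bytes produced by a standard UTF-8 conversion.
    static byte[] viaUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String nul = "\0";
        System.out.println(viaWriteUTF(nul).length); // 4: prefix 00 02, then C0 80
        System.out.println(viaUtf8(nul).length);     // 1: a single 00 byte
    }
}
```

Data written with writeUTF should therefore only be read back with readUTF, and should not be handed to non-Java systems as if it were UTF-8.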
Java Uses Unicode
Every character in every Java String object is encoded as UTF-16 Unicode.
– Every string is converted from a legacy encoding, either by the compiler or by the String class.
– This is the reason for the native2ascii tool and the javac -encoding switch.
Once you have a String object, everything is Unicode UTF-16.
“Special” encodings
There are two encodings that the system treats as special:
– file.encoding (the system property that holds the platform default encoding)
– ISO-8859-1
All basic conversion functions use your system default encoding.
Most servlet conversion functions use ISO-8859-1 as the default.
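Both special cases can be demonstrated without a servlet container. The sketch below is an illustration with invented names: it prints the default charset, then simulates a component that decoded UTF-8 bytes as ISO-8859-1 and shows the usual repair. The repair only works because ISO-8859-1 maps every byte value to exactly one character, so no information is lost in the wrong decoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DefaultEncodingDemo {
    // Simulates a component that decoded UTF-8 bytes as ISO-8859-1.
    static String misdecode(String original) {
        return new String(original.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Undo the mistake: recover the raw bytes, then decode them as UTF-8.
    static String repair(String misdecoded) {
        return new String(misdecoded.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The charset used by new String(byte[]) and getBytes() when none is given.
        System.out.println(Charset.defaultCharset());

        // The sample string 日本語, written with escapes.
        String original = "\u65E5\u672C\u8A9E";
        String broken = misdecode(original);
        System.out.println(broken.length());                 // 9: one char per UTF-8 byte
        System.out.println(repair(broken).equals(original)); // true
    }
}
```

The repair is a workaround. Where the API allows it, naming the encoding explicitly before the data is read is the cleaner fix.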
Two File Encodings
Windows systems generally have two different file encodings:
– “ANSI” encoding is the Windows default code page for GUI applications.
– “OEM” encoding is the code page used by the ‘cmd’ or ‘command’ interpreter shells.
Stream Readers and Writers
InputStreamReader and OutputStreamWriter classes perform controlled conversion between byte[] and String.
– Always pass the encoding as a variable.
– Use the IANA preferred name for the encoding, if possible (see ftp://ftp.isi.edu/in-notes/iana/assignments/)
– Prefer UTF-8 for on-the-wire transport.
Code Sample
import java.io.*;

// use with any type of InputStream class
InputStream is = new FileInputStream(file);
InputStreamReader isr =
    new InputStreamReader(is, encoding);
// use BufferedReader for efficiency
BufferedReader br =
    new BufferedReader(isr);
StringBuffer sb = new StringBuffer();
int chr;
// read() returns -1 at end of stream
while ((chr = br.read()) > -1) {
    // cast to char; appending the int would append its decimal digits
    sb.append((char) chr);
}
br.close();
* Note: Try blocks eliminated for clarity.
OutputStreamWriter Code Sample
import java.io.*;

// use with any type of OutputStream class
OutputStream os =
    new FileOutputStream(file);
OutputStreamWriter osw =
    new OutputStreamWriter(os, encoding);
osw.write(myString, 0, myString.length());
osw.flush();
// close() also flushes and releases the underlying stream
osw.close();
* Note: Try blocks eliminated for clarity.
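The two samples can be combined into a complete program with the try blocks restored. This is a sketch rather than a reference implementation: the class and method names are invented, it uses in-memory streams instead of files so that it runs anywhere, and it relies on try-with-resources, which needs Java 7 or later.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class StreamRoundTrip {
    // Convert a String to bytes through an OutputStreamWriter with an explicit encoding.
    static byte[] write(String text, Charset encoding) {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        try (Writer osw = new OutputStreamWriter(os, encoding)) {
            osw.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return os.toByteArray(); // the writer is closed, so everything is flushed
    }

    // Convert bytes back to a String through an InputStreamReader with an explicit encoding.
    static String read(byte[] bytes, Charset encoding) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(bytes), encoding))) {
            int chr;
            while ((chr = br.read()) > -1) {
                sb.append((char) chr);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The sample string 日本語, written with escapes.
        String text = "\u65E5\u672C\u8A9E";
        byte[] bytes = write(text, StandardCharsets.UTF_8);
        System.out.println(bytes.length);                                     // 9
        System.out.println(read(bytes, StandardCharsets.UTF_8).equals(text)); // true
    }
}
```

The encoding is a parameter on both sides, which is the point of the “always pass the encoding as a variable” rule: the reader and the writer must agree, and neither should depend on the platform default.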
Character Class
Provides access to Unicode character properties.
– UnicodeBlock inside class
– Character getType (defined types)
– isDigit
– isLetter
– isLetterOrDigit
– isUpperCase/isLowerCase/isTitleCase
– toUpperCase/toLowerCase/toTitleCase
– isSpace/isWhitespace (isSpace is deprecated)
– isISOControl/isJavaIdentifierStart/isJavaIdentifierPart
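A few of these methods in action, as a small sketch (the class and helper names are invented). The point to notice is that the properties are Unicode-wide: digits and letters from any script are recognized, not just ASCII ones.

```java
class CharacterPropsDemo {
    // The Unicode block a character belongs to, for example HIRAGANA.
    static Character.UnicodeBlock blockOf(char c) {
        return Character.UnicodeBlock.of(c);
    }

    public static void main(String[] args) {
        char arabicThree = '\u0663'; // ARABIC-INDIC DIGIT THREE
        System.out.println(Character.isDigit(arabicThree));   // true
        System.out.println(Character.digit(arabicThree, 10)); // 3
        // A CJK ideograph counts as a letter.
        System.out.println(Character.isLetter('\u8A9E'));     // true
        // Case mapping works outside ASCII: e-acute maps to E-acute.
        System.out.println(Character.toUpperCase('\u00E9') == '\u00C9');          // true
        // The hiragana letter "ni" is in the HIRAGANA block.
        System.out.println(blockOf('\u306B') == Character.UnicodeBlock.HIRAGANA); // true
        System.out.println(Character.getType('A') == Character.UPPERCASE_LETTER); // true
    }
}
```

Hand-written range checks such as c >= '0' && c <= '9' miss all of these cases, which is the usual reason to prefer the Character methods.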
Normalization
Many characters have two (or more) representations in Unicode.
– Normalization makes the sequences the same.
– Simplifies user input parsing and validation.
ICU4J Normalizer Class
Four forms of Normalization:
– Form C (canonical composed)
– Form D (canonical decomposed)
– Form KC (compatibility composed)
– Form KD (compatibility decomposed)
– Special handling for Hangul characters!
– Note that there is a private Normalizer class in the JDK (java.text.Normalizer became public in Java 6).
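The forms are easier to tell apart with concrete input. The sketch below is an assumption-laden illustration: it uses the public java.text.Normalizer available since Java 6 rather than the ICU4J class named above, and the wrapper class and method names are invented.

```java
import java.text.Normalizer;

class NormalizeDemo {
    static String nfc(String s)  { return Normalizer.normalize(s, Normalizer.Form.NFC); }
    static String nfd(String s)  { return Normalizer.normalize(s, Normalizer.Form.NFD); }
    static String nfkc(String s) { return Normalizer.normalize(s, Normalizer.Form.NFKC); }

    public static void main(String[] args) {
        // 'e' followed by a combining acute accent composes to the single character U+00E9.
        System.out.println(nfc("e\u0301").equals("\u00E9")); // true
        // Fullwidth capital A folds to plain 'A' only under the compatibility (K) forms.
        System.out.println(nfkc("\uFF21").equals("A"));      // true
        System.out.println(nfc("\uFF21").equals("A"));       // false
        // A precomposed Hangul syllable decomposes into its three jamo.
        System.out.println(nfd("\uD55C").length());          // 3
    }
}
```

For input validation, normalizing both sides to the same form before comparing is usually enough; which form to pick depends on whether compatibility variants such as fullwidth letters should be treated as equal.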
Demo Programs
UnicodeDemo – a Java program that demonstrates
the byte sequences of different encodings and also
provides some code that shows ISR and OSW in
action.
Charsets – a Windows program by my buddy Bill
Hall for playing with encodings.
http://www.inter-locale.com -- my personal website,
with examples and demos of certain Java I18n
things.
Questions?
Addison Phillips
[email protected]