ENCODING AND DECODING
Experiencing one (or more) bytes
out of your A’s
Overview
• It’s not your father’s character set
– 8 bit characters
– ASCII
– The rest of the world wakes up to computers
• Unicode
– Character codes
– Different flavors
• Encoding and Decoding classes
• Example
The Good Old Days
•
•
•
•
•
•
•
Focus on unaccented, English letters
Every letter, number, capital, etc
Represented by codes 0-127
Space, 32; “A”, 65; “a”, 97
Used 7 bits, one bit free on most computers
Wordstar and the 8th bit
Below 32 – control bits  7, beep; 12, formfeed
8th bit, values 128-255
• Everybody had their own ideas
• OEM Character sets
• IBM-PC -> graphics (horizontal bars,
vertical bars, bars with dangles, etc.)
• Outside U.S.  different languages
– Code 130
8th bit, values 128-255
• Everybody had their own ideas
• OEM Character sets
• IBM-PC -> graphics (horizontal bars,
vertical bars, bars with dangles, etc.)
• Outside U.S.  different languages
– Code 130
8th bit, values 128-255
• Everybody had their own ideas
• OEM Character sets
• IBM-PC -> graphics (horizontal bars, vertical bars, bars
with dangles, etc.)
• Outside U.S.  different languages
– Code 130 é in US, Gimel ‫ ג‬character in Israel
– Difficult to exchange documents
• Code pages – regional definition of bit values 128-255
– Israel: Code page 862
– Greek: Code page 737
– ISO/ANSI code pages
• Asia – Alphabets had thousands of characters
– No way to store in one byte (8 bits)
Unicode
• Not a 16-bit code
• A new way of thinking about characters
• Old way:
– Character “A” maps to memory or disk bits
– A-> 0100 0001
• Unicode way:
–
–
–
–
Each letter in every alphabet maps to a “code point”
Abstract concept
“A” is Platonic “form” – just floats out there
A -> U+0639  code point
Unicode
• Hello -> U+0048 U+0065 U+006C U+006C U+006F
• Storing in 2 bytes each:
– 0048 0065 006C 006C 006F (big endian)
– Or 4800 6500 6C00 6C00 6F00 (little endian)
• Need to have a Byte Order Mark (BOM) at beginning of
stream
• UTF8 coding system
–
–
–
–
Stores Unicode points (magic numbers) as 8 bit bytes
Values 0-127 go into byte 1
Values 128+ go into bytes 2, 3, etc.
For characters up to 127, UTF8 looks just like ASCII
UNICODE Encodings
• UTF-8
• UTF-16 – characters stored in 2 byte, 16-bit
(halfword) sequences – also called UTF-2
• UTF-32 – characters stored in 4byte, 32 bit
sequences
• UTF-7 – forces a zero in high order bit - firewalls
• Ascii Encoding – everything above 7 bits is
dropped
Definitions
• .NET uses UTF-16 encoding internally to store
text
• Encoding:
– transfers a set of Unicode characters into a sequence
of bytes
– Send a string to a file or a network stream
• Decoding:
– transfers a sequence of bytes into a set of Unicode
characters
– Read a string from a file or a network stream
• StreamReader, StreamWriter default to UTF-8
Encoding/Decoding Classes
• UTF32Encoding class
– Convert characters to and from UTF-32 encoding
• UnicodeEncoding class
– Convert characters to and from UTF-16 encoding
• UTF8Encoding class to convert to and from
UTF-8 encoding – 1, 2, 3, or 4 bytes per char
• ASCIIEncoding class to convert to and from
ASCII Encoding – drops all values > 127
• System.Text.Encoding supports a wide range
of ANSI/ISO encodings
Convert a string into a stream of
encoded bytes
1. Get an encoding object
Encoding e = Encoding.GetEncoding(“Korean”);
2. use the encoding object’s GetBytes() method to
convert a string into its byte representation
byte[ ] encoded;
encoded = e.GetBytes(“I’m gonna be Korean!”);
Demo: D:\_Framework 2.0 Training Kits\70-536\Chapter 03\EncodingDemo
Write a file in encoded form
FileStream fs = new FileStream("text.txt", FileMode.OpenOrCreate);
...
StreamWriter t = new StreamWriter (fs, Encoding.UTF8); t.
Write("This is in UTF8");
Read an encoded file
FileStream fs = new FileStream("text.txt", FileMode.Open);
...
StreamReader t = new StreamReader(fs, Encoding.UTF8);
String s = t.ReadLine();
Summary
• ASCII is one of oldest encoding standards.
• UNICODE provides multilingual support
• System.Text.Encoding has static methods
for encoding and decoding text.
• Use an overloaded Stream constructor
that accepts an encoding object when
writing a file.
• Not necessary to specify Encoding object
when reading, will default.
References
• www.unicode.org
• Unicode and .Net – what does .NET Provide?
http://www.developerfusion.co.uk/show/4710/3/
• Hello Unicode, Goodbye ASCII
http://www.nicecleanexample.com/ViewArticle.as
px?TID=unicode_encoding
• The Absolute Minimum Every Software
Developer Absolutely, Positively Must Know
About Unicode and Character Sets (No
Excuses!)
http://www.joelonsoftware.com/articles/Unicode.
html
Descargar

ENCODING AND DECODING