Unicode 4.0
Mark Davis
President, The Unicode Consortium
Schedule
• April 2003: UCD/UAXes
    • Final data files available
    • Implementation can proceed
• September 2003: Book available
New Characters: 1,228
• Modern Scripts
    • (additions to) Indic, Khmer, Latin, Greek, Arabic, Syriac
    • (minority scripts) Limbu, Tai Le, Osmanya
• Historic Scripts
    • Linear B, Cypriot, Ugaritic, Shavian, Aegean Numbers
• Symbols
    • Monograms, digrams, tetragrams, other symbols
    • modifier & combining characters
New Characters (cont.)
• Special Characters
    • additional variation selectors (for future CJK variants), double-diacritics for dictionary use
• For a detailed list, see Derived Age in the UCD 4.0, and the beta Charts.
• Character repertoire corresponds to ISO/IEC 10646:2003.
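
The Derived Age data mentioned above can be queried with a few lines of code. The following is a minimal sketch, assuming the usual UCD line layout ("start..end ; version # comment"). The sample lines are typed by hand for illustration and are not copied from the real DerivedAge.txt; the names parse_derived_age and age are hypothetical helpers.

```python
# Sketch: find the Unicode version in which a code point was first assigned,
# from lines in the DerivedAge.txt layout ("start..end ; version # comment").

# Hand-typed sample for illustration only; consult the real DerivedAge.txt.
SAMPLE = """\
# comment lines and blank lines are skipped

0041..005A    ; 1.1 # LATIN CAPITAL LETTER A..Z
1900..191C    ; 4.0 # LIMBU letters
1940          ; 4.0 # LIMBU SIGN LOO
1950..196D    ; 4.0 # TAI LE letters
"""


def parse_derived_age(text):
    """Return a list of (start, end, version) tuples."""
    ranges = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        span, version = (field.strip() for field in line.split(";"))
        start, _, end = span.partition("..")
        # A single code point has no "..end" part.
        ranges.append((int(start, 16), int(end or start, 16), version))
    return ranges


def age(code_point, ranges):
    """Return the version string for code_point, or None if it is not listed."""
    for start, end, version in ranges:
        if start <= code_point <= end:
            return version
    return None


RANGES = parse_derived_age(SAMPLE)
print(age(0x1900, RANGES))  # a Limbu letter from the sample
```

A linear scan is enough for a sketch; a real lookup over the full file would more likely sort the ranges and use binary search.
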
Conformance
• Substantially improved specification of conformance requirements
    • Incorporated UTR #17: Character Encoding Model, clearly separating encoding forms and encoding schemes
    • Tightened definitions of UTF-8, UTF-16, UTF-32
    • Separate definition of Unicode String
• Clarified conformance status of Unicode Standard Annexes
• Formal definitions of properties & algorithms
    • Provisional properties: draft, NRFPT
UTF vs Unicode String
• UTF
    • Unique representation for each code point
    • All else illegal
    • Ill-formed examples: C0 80 (UTF-8), D800 0061 (UTF-16)
• Unicode String
    • Sequence of code units
    • Internal processing, not interchange
    • Not necessarily valid UTF
    • Permitted examples: C0 A0 (8-bit), D800 0061 (16-bit)
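
The distinction can be checked against any strict decoder. The sketch below uses Python's built-in codecs as a stand-in for a conformant UTF validator; the helper names is_valid_utf8 and is_valid_utf16 are hypothetical. The byte and code unit sequences from the slide are accepted as plain sequences (Unicode strings in the sense above) but rejected as UTF.

```python
# Sketch: a sequence of code units can always be stored, but only some
# sequences are well-formed UTF-8 or UTF-16.

def is_valid_utf8(data: bytes) -> bool:
    """True if the bytes form well-formed UTF-8 (per Python's strict decoder)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def is_valid_utf16(units) -> bool:
    """True if the 16-bit code units form well-formed UTF-16."""
    raw = b"".join(unit.to_bytes(2, "big") for unit in units)
    try:
        raw.decode("utf-16-be")
        return True
    except UnicodeDecodeError:
        return False


# Non-shortest form of U+0000, and another sequence starting with C0.
print(is_valid_utf8(bytes([0xC0, 0x80])))
print(is_valid_utf8(bytes([0xC0, 0xA0])))
# A high surrogate followed by "a" instead of a low surrogate.
print(is_valid_utf16([0xD800, 0x0061]))
# A properly paired surrogate sequence, for contrast.
print(is_valid_utf16([0xD800, 0xDC00]))
```
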
Conformance (cont.)
• Formalized policies for stability of the standard
• Clarification of semantics of important characters, including BOM
• Revised scope of enclosing combining marks
• Revised semantics of ZWJ for cursive scripts
• Normalization Corrections
    • U+2F868; U+2F874; U+2F91F; U+2F95F; U+2F9BF
Textual Clarifications
• Major changes to Chapters 2, 3, 6, 14 and 15
• Definitive terminology for code points:
    • graphic, format, control, private-use (= assigned characters)
    • surrogate, noncharacter, reserved (= not characters)
• Substantial improvements to many character block descriptions, especially Indic
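
The seven code point types can be approximated from the General_Category property. This is a rough sketch, assuming the mapping Cc = control, Cf/Zl/Zp = format, Co = private-use, Cs = surrogate, Cn = noncharacter or reserved; code_point_type is a hypothetical helper, and the answer for "reserved" depends on the Unicode version bundled with the Python runtime.

```python
import unicodedata


def code_point_type(cp: int) -> str:
    """Classify a code point into one of the seven basic types."""
    # Noncharacters: U+FDD0..U+FDEF and the last two code points of each plane.
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "noncharacter"
    gc = unicodedata.category(chr(cp))
    if gc == "Cs":
        return "surrogate"
    if gc == "Co":
        return "private-use"
    if gc == "Cc":
        return "control"
    if gc in ("Cf", "Zl", "Zp"):
        return "format"
    if gc == "Cn":
        return "reserved"  # unassigned in this runtime's Unicode version
    return "graphic"


for cp in (0x0041, 0x200D, 0x0000, 0xE000, 0xD800, 0xFFFF, 0x40000):
    print(f"U+{cp:04X}", code_point_type(cp))
```
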
Programming language identifiers
• Now backwards-compatible
    • Once a Unicode identifier, always a Unicode identifier
• Alternate definition for complete stability
    • Fix set of allowed characters
    • Allow all reserved code points
    • + Complete stability
    • - “Odd” characters
Case mappings now normative (but tailorable)
• Clearer definition of string functions:
    • isUpper(), isLower(), isTitle(), isFold()
    • toUpper(), toLower(), toTitle(), toFold()
• Definition of titlecase uses word boundaries
• Note that the Turkic mappings do not maintain canonical equivalence without additional processing.
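
The relationship between the two families of functions can be sketched as follows, assuming each isX(s) is defined as "toX leaves the NFD form of s unchanged". Python's untailored str methods stand in for the toX() mappings; note that str.title() uses Python's own simple word detection rather than the word boundaries the standard refers to, so to_title is only an approximation.

```python
import unicodedata

# Untailored mappings from the Python runtime stand in for toX().
def to_upper(s): return s.upper()
def to_lower(s): return s.lower()
def to_title(s): return s.title()      # approximation: not UAX #29 boundaries
def to_fold(s):  return s.casefold()


def _nfd(s):
    return unicodedata.normalize("NFD", s)


# Assumed definition: isX(s) holds when toX(NFD(s)) == NFD(s).
def is_upper(s): y = _nfd(s); return to_upper(y) == y
def is_lower(s): y = _nfd(s); return to_lower(y) == y
def is_title(s): y = _nfd(s); return to_title(y) == y
def is_fold(s):  y = _nfd(s); return to_fold(y) == y


print(to_upper("ß"))          # full case mapping may change string length
print(is_fold("straße"), is_fold("strasse"))
print(is_upper("123"), "123".isupper())
```

Under this definition a string with no cased characters counts as uppercase, lowercase and case-folded all at once, which differs from Python's built-in str.isupper().
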
UAX #9: The Bidirectional Algorithm
• canonical equivalence now preserved
    • data change, not algorithm
• shaping is done after reordering
    • but not across directional boundaries
• clarifications of:
    • ZWJ, ZWNJ
    • intermediate level processing
UAX #14: Line Breaking Properties
• Negative numbers and dates with hyphens will not break across lines
• Word-Joiner will link any characters (except hard line breaks)
• Behavior of soft hyphen clarified
    • marks opportunity for breaking, not specific graphic appearance.
• Rules for GL relaxed
    • SP and ZW override GL
• New Property Values: NL, WJ
UAX #15: Unicode Normalization Forms
• Description of Stable Code Points.
• Notation NFC(x) and isNFC(x), in Notation.
• Added pointer to UTN #5 Canonical Equivalences in Applications
• Rewrote Annex 12: Corrigenda for clarity, and to describe the use of Normalization Corrections.
• Added Annex 13: Canonical Equivalence.
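
The NFC(x) / isNFC(x) notation maps directly onto code. A minimal sketch using Python's unicodedata module follows; the function names mirror the notation, and canonically_equivalent is a hypothetical helper that compares NFD forms.

```python
import unicodedata


def NFC(x: str) -> str:
    """Return the NFC form of x."""
    return unicodedata.normalize("NFC", x)


def isNFC(x: str) -> bool:
    """True if x is already in NFC, i.e. NFC(x) == x."""
    return NFC(x) == x


def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


decomposed = "e\u0301"      # e + COMBINING ACUTE ACCENT
composed = "\u00e9"         # LATIN SMALL LETTER E WITH ACUTE
print(isNFC(decomposed), isNFC(composed))
print(canonically_equivalent(decomposed, composed))
```
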
UAX #29: Text Boundaries
• New: extracted from 3.0, but significantly revised
• Default definitions
    • Word, sentence: tailoring expected
    • Grapheme cluster (“user character”)
        • Hangul Syllable or other Base
        • plus (optionally) any number of NSMs
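
The "base plus any number of NSMs" shape can be sketched in a few lines. This is a deliberate simplification, not the UAX #29 rules: it treats general categories Mn and Me as the non-spacing marks, treats every other code point as a base, and ignores conjoining jamo sequences and CR LF. The name simple_grapheme_clusters is hypothetical.

```python
import unicodedata


def simple_grapheme_clusters(text: str):
    """Split text into base + non-spacing-mark runs (simplified sketch)."""
    clusters = []
    for ch in text:
        is_nsm = unicodedata.category(ch) in ("Mn", "Me")
        if is_nsm and clusters:
            clusters[-1] += ch      # attach the mark to the preceding base
        else:
            clusters.append(ch)     # start a new cluster
    return clusters


print(simple_grapheme_clusters("e\u0301a"))
print(simple_grapheme_clusters("\ud55c\u0301"))  # precomposed Hangul syllable + mark
```
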
No Substantive Changes
• UAX #11: East Asian Width
• UAX #24: Script Names
    • except now UAX!
Superseded UAXes
• Incorporated into and thus superseded by Unicode Version 4.0:
    • UAX #13: Unicode Newline Guidelines
    • UAX #19: UTF-32
    • UAX #21: Case Mappings
    • UAX #27: Unicode 3.1
    • UAX #28: Unicode 3.2
Unicode Character Database
• Documentation coalesced into UCD.html.
• New properties and values
    • Hangul_Syllable_Type, Unicode_Radical_Stroke
    • CJK numeric values added.
    • PropertyValueAliases adds block names
• UCD fallback props more precisely defined.
    • for code points not explicitly in data files
• New Characters
    • Appropriate properties assigned
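
For precomposed syllables, the Hangul_Syllable_Type value can be derived arithmetically rather than looked up. The sketch below covers only the LV and LVT values for U+AC00..U+D7A3, using the standard Hangul composition constants; the values for conjoining jamo (L, V, T) are left out, and hangul_syllable_type is a hypothetical helper name.

```python
# Constants from the Hangul syllable composition arithmetic.
S_BASE = 0xAC00    # first precomposed syllable
S_COUNT = 11172    # number of precomposed syllables
T_COUNT = 28       # trailing-consonant slots per LV pair (including "none")


def hangul_syllable_type(cp: int):
    """Return "LV" or "LVT" for a precomposed syllable, else None."""
    s_index = cp - S_BASE
    if not 0 <= s_index < S_COUNT:
        return None
    # Every 28th syllable has no trailing consonant.
    return "LV" if s_index % T_COUNT == 0 else "LVT"


for cp in (0xAC00, 0xAC01, 0xD55C, 0x0041):
    print(f"U+{cp:04X}", hangul_syllable_type(cp))
```
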
UCD 4.0 (cont.)
• Modifier letters
    • The general category of 02B9..02BA, 02C6..02CF changed to Lm.
• Khmer
    • Two Khmer characters are deprecated; four others strongly discouraged.
• Decimal Digits
    • Numeric_Type=decimal digit now aligned with General_Category=Nd
• Braille
    • Added script value
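
The decimal digit alignment can be spot-checked from code. This sketch assumes that unicodedata.decimal() succeeding is a fair proxy for Numeric_Type=decimal; the results reflect the Unicode version bundled with the Python runtime, not the 4.0 data files, and the helper names are hypothetical.

```python
import unicodedata


def has_decimal_value(ch: str) -> bool:
    """True if the character has a decimal digit value in the runtime's UCD."""
    try:
        unicodedata.decimal(ch)
        return True
    except ValueError:
        return False


def is_aligned(ch: str) -> bool:
    """True if "has a decimal value" agrees with General_Category == Nd."""
    return has_decimal_value(ch) == (unicodedata.category(ch) == "Nd")


samples = ["5", "\u0663", "\u00b2", "\u2167", "A"]  # 5, ARABIC-INDIC THREE, ², Ⅷ, A
for ch in samples:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), has_decimal_value(ch))
```

SUPERSCRIPT TWO shows why the distinction matters: it has a digit value but is not a decimal digit, so it falls outside Nd.
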
UCD 4.0 (cont. 2)
• Case Mapping
    • Fixed for Turkish, Lithuanian
• Default Ignorables
    • Hangul Filler characters
    • Soft-Hyphen, CGJ, ZWS
    • Arabic End of Ayah and Syriac Abbreviation Mark no longer DI, shaping classes fixed.
• Grapheme_Extend
    • removes halfwidth katakana marks, most Mc (except as needed for canonical equivalence)
Related Items
• UTS #10: Unicode Collation Algorithm
    • Not part of Unicode 4.0, but closely related
    • From 4.0 on, to be sync'ed in repertoire and version with the Unicode Standard.
• UTS #6: SCSU
    • Added suitability for XML
• Draft UTS #18: Unicode Regular Expressions
    • Draft as UTS with conformance requirements
• Draft UTR #23: Character Properties
    • Draft Character Property Model
Q & A
Background Slides
Unicode 3.2 (March, 2002)
• New Characters: 1,016
    • Symbols
        • Large collection of mathematical symbols, especially targeted at MathML, recycling symbols, ornamental brackets.
    • Special Characters
        • combining grapheme joiner, word joiner, invisible operators for math, variation selectors
    • Modern Scripts
        • minority scripts of the Philippines
Conformance
• Eliminates irregular UTF-8
• Defines variation sequences
• Replaces ZWNBSP with Word Joiner
• Clarifies scope of combining marks (further revised in 4.0)
• Clarifications of conjoining jamo behavior, hangul syllable structure, decomposables
Textual Clarifications
• Combined vowels in Khmer, characters discouraged in Khmer
• Use of dingbats
Unicode Standard Annexes
• UAX #21: Case Mappings (was UTR)
Unicode Character Database
• New properties:
    • IDS_Binary_Operator, IDS_Trinary_Operator, Radical, Unified_Ideograph
    • Default_Ignorable_Code_Point, Deprecated
    • Soft_Dotted, Logical_Order_Exception
    • Grapheme_Base, Grapheme_Extend, Grapheme_Link
    • DerivedAge
• Normalization Corrections
• Added Property & Property Value Aliases
• Adds StandardizedVariants.html
Related Items
• UTS #10: Unicode Collation Algorithm
    • Ignorable character handling, dual versioning, more conditions on well-formed weights, separate weights for CJK and unassigned characters, noncharacters
    • Note: base version still U3.1
• UTR #26: CESU-8
• Unicode Technical Notes
• Updated Character Encoding Stability Policy
• Added Public Review process
• Updated Glossary
Unicode 3.1 (March, 2001)
• New Characters: 44,946
    • First supplementaries encoded!
    • Modern scripts
        • CJK Ideographs (now totaling 71,039)
    • Historic scripts
        • Old Italic, Gothic, Deseret, Byzantine Musical Symbols
    • Symbols
        • Mathematical Alphanumeric Symbols, (Western) Musical Symbols
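
Supplementary code points such as those for Gothic or the musical symbols are represented in UTF-16 as surrogate pairs. The arithmetic is sketched below; to_surrogate_pair is a hypothetical helper, and the result can be cross-checked against Python's own UTF-16 encoder.

```python
def to_surrogate_pair(cp: int):
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = cp - 0x10000                 # 20-bit value
    high = 0xD800 + (offset >> 10)        # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits
    return high, low


def matches_encoder(cp: int) -> bool:
    """Cross-check the arithmetic against Python's UTF-16 encoder."""
    high, low = to_surrogate_pair(cp)
    expected = high.to_bytes(2, "big") + low.to_bytes(2, "big")
    return chr(cp).encode("utf-16-be") == expected


for cp in (0x10330, 0x1D11E):   # GOTHIC LETTER AHSA, MUSICAL SYMBOL G CLEF
    high, low = to_surrogate_pair(cp)
    print(f"U+{cp:04X} -> {high:04X} {low:04X}")
```
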
Conformance
• Non-shortest-form UTF-8 excluded
• Clarification of the stability of the standard,
    • code units vs. code points, non-characters, normative properties, informative properties, normative references
• Revisions of guidelines:
    • wchar_t, unassigned code points, identifiers
• Major revision of Georgian
• Use of ZWNJ and ZWJ for ligatures
• Language tag characters encoded
    • but discouraged
Unicode Standard Annexes
• UAX #19: UTF-32
Unicode Character Database
• Major revision of PropList properties:
    • White_Space, Bidi_Control, Join_Control, Hex_Digit
    • Alphabetic, Ideographic, Lowercase, Uppercase
    • ID_Start, ID_Continue, XID_Start, XID_Continue
    • Noncharacter_Code_Point
    • Quotation_Mark, Terminal_Punctuation, Math, Dash, Hyphen, Diacritic, Extender
• New properties: Case folding, Scripts
• Added DerivedProperties, NormalizationTest
Related Items
• Documented Character Encoding Stability Policy
• UTS #10: Unicode Collation Algorithm
    • Merged data files; updated to base version 3.1
• UTR #18: Unicode Regular Expression Guidelines
• UTR #20: Unicode in XML and other Markup Languages
• UTR #22: Character Mapping Tables
• UTR #24: Script Names