New in Unicode
Mark Davis, John Jenkins
Agenda







Unicode 4.1.0
UCA 4.1.0
Regular Expressions
Security Considerations
Character Mapping
Common Locale Data Repository
Expanded Role for Consortium
Unicode 4.1.0




Released 2005 March 31
New Characters
New Unicode Character Database
New Specifications
1,273 New Characters




Roundtripping for HKSCS and GB 18030
Five new currency signs
Additional characters for Indic and Korean
Eight new scripts
Changes in the Standard

Conformance Changes



Modifications to Default Case Operations
Clarification of Decomposition Mappings
Other Changes




SPACE not recommended as base for nonspacing marks
Use of CGJ (COMBINING GRAPHEME JOINER) to prevent reordering, and to
prevent contractions in sorting/matching (UCA)
Positioning of Meteg
Rendering of Thai Combining Marks
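The CGJ bullet above can be illustrated with a small sketch. It assumes Python's standard `unicodedata` module and an arbitrary sample sequence (an acute accent followed by a dot below) chosen only for illustration; it is not taken from the standard's own examples.

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER, canonical combining class 0

# Sample sequence (an assumption for illustration): a + acute (ccc 230) + dot below (ccc 220)
plain = "a\u0301\u0323"
guarded = "a\u0301" + CGJ + "\u0323"

def code_points(s):
    """Return the code points of s as space-separated hex."""
    return " ".join(f"{ord(c):04X}" for c in s)

# Canonical ordering sorts adjacent marks by combining class, so the marks swap.
reordered = unicodedata.normalize("NFD", plain)

# CGJ has combining class 0, so the marks on either side are not reordered across it.
kept = unicodedata.normalize("NFD", guarded)

print(code_points(reordered))
print(code_points(kept))
```

Without CGJ the two marks are swapped by normalization; with CGJ between them the original order survives.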
Unicode Character Database

Determines the behavior of characters in modern
software:


New properties



Grapheme_Cluster_Break, Sentence_Break, Word_Break,
Pattern_Syntax, and Pattern_White_Space
Revised Property Values


Alphabetics, Letters, Numbers, Identifiers, Scripts, …
E.g., Alphabetic ⊃ (Lowercase ∪ Uppercase)
Expanded documentation
Each release now complete, not delta
New Specifications

UAX #31: Identifier and Pattern Syntax

Basis for Backwards-Compatible Identifiers
 Programming Languages
 Resources and Services

Basis for Stable Syntax characters
 Whitespace
 Operators


UAX #34: Unicode Named Character Sequences


Mechanism for identifying/naming significant sequences
Standardized list
Major Revisions in Annexes

UAX #15: Unicode Normalization Forms
 Correction for Idempotency Problem
 Enhanced discussion of Hangul

UAX #14: Line Breaking Properties
 Modifications for Hangul
 Changes because SPACE not recommended as base for nonspacing marks
 Separated all suggested tailorings into separate section

UAX #29: Text Boundaries


Using new properties, adding Joiner/Non-Joiner
Modifications to Word Break
UTS #10: Unicode Collation Algorithm



Basis for language-sensitive sorting, searching, and
matching
Synchronized with Unicode 4.1.0
New:
Characters
 Revised Weights
 Specification: matching, ignorables, Thai, …

UTS #18: Unicode Regular Expressions




Regular expressions are used widely in programs for
matching patterns (e.g., wildcards)
Unicode expands the scope drastically
Explicit Conformance Clauses
POSIX-Conformance
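As a rough illustration of how Unicode widens the scope of familiar pattern classes, the sketch below uses Python's standard `re` module. This is an assumption chosen for convenience: `re` is not a full UTS #18 implementation (for example, it has no `\p{...}` property syntax). The sample text reuses the Greek date string shown later in these slides.

```python
import re
import unicodedata

# Greek date text from the CLDR examples, normalized to NFC so accented
# letters are single code points.
text = unicodedata.normalize("NFC", "Δευτέρα, 05 Σεπτεμβρίου 2005")

# For str patterns in Python 3, \w and \d are Unicode-aware by default.
words = re.findall(r"\w+", text)

# \d matches any Unicode decimal digit, here BENGALI DIGIT FOUR and TWO.
bengali = "\u09EA\u09E8"
unicode_match = re.fullmatch(r"\d+", bengali) is not None

# The re.ASCII flag restricts \d to [0-9], so the same string no longer matches.
ascii_match = re.fullmatch(r"\d+", bengali, flags=re.ASCII) is not None

print(words, unicode_match, ascii_match)
```

The same `\d+` pattern accepts or rejects the Bengali digits depending on whether Unicode matching is enabled, which is why explicit conformance clauses matter.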
UTR #36: Unicode Security



Incorrect usage of Unicode can expose programs or systems to
possible security attacks! Examples:
Numbers: ৪୨ = 42 !
 Bengali {০ ১ ২ ৩ ৪ ৫ ৬ ৭ ৮ ৯}, Oriya {୦ ୧ ୨ ୩ ୪ ୫ ୬ ୭ ୮ ୯}.
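The mixed-script number above can be reproduced with a short sketch. It relies on Python's built-in `int()`, which accepts any Unicode decimal digits. The helper `digit_scripts` is a hypothetical name, and it infers the script from character names as a heuristic rather than from the Script property.

```python
import unicodedata

# BENGALI DIGIT FOUR followed by ORIYA DIGIT TWO, as in the example above.
mixed = "\u09EA\u0B68"

# int() accepts decimal digits from any script, even mixed within one string.
value = int(mixed)

def digit_scripts(s):
    """Heuristic: derive a script label from each decimal digit's character name."""
    labels = set()
    for ch in s:
        if ch.isdecimal():
            prefix = unicodedata.name(ch).partition("DIGIT")[0].strip()
            labels.add(prefix or "ASCII")  # plain "DIGIT FOUR" has no prefix
    return labels

print(value, digit_scripts(mixed))
```

A validator could, under these assumptions, reject any number whose digits come from more than one script.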
Domain Names:
     String     UTF-16                                Internal - IDNA
1a   ät.com     0061 0308 0074 002E 0063 006F 006D    xn--t-zfa.com
1b   ät.com     00E4 0074 002E 0063 006F 006D         xn--t-zfa.com
2a   tοp.com    0074 03BF 0070 002E 0063 006F 006D    xn--tp-jbc.com
2b   top.com    0074 006F 0070 002E 0063 006F 006D    top.com
4a   so̷s.com    0073 006F 0337 0073 002E 0063 006F 006D    xn--sos-rjc.com
4b   søs.com    0073 00F8 0073 002E 0063 006F 006D    xn--ss-lka.com
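The domain-name rows above can be checked with a minimal sketch. It assumes Python's built-in `idna` codec, which implements the IDNA 2003 ToASCII operation (including Nameprep normalization). The dictionary keys simply reuse the row labels from the slide.

```python
# Each entry is built from the UTF-16 code units listed for that row.
rows = {
    "1a": "a\u0308t.com",   # a + COMBINING DIAERESIS
    "1b": "\u00E4t.com",    # precomposed a with diaeresis
    "2a": "t\u03BFp.com",   # GREEK SMALL LETTER OMICRON
    "2b": "top.com",        # LATIN SMALL LETTER O
    "4a": "so\u0337s.com",  # o + COMBINING SHORT SOLIDUS OVERLAY
    "4b": "s\u00F8s.com",   # LATIN SMALL LETTER O WITH STROKE
}

def to_ace(domain):
    """Convert a domain name to its ASCII-compatible (xn--) form."""
    return domain.encode("idna").decode("ascii")

ace = {label: to_ace(name) for label, name in rows.items()}

for label, name in ace.items():
    print(label, name)
```

Rows 1a and 1b collapse to the same name because normalization composes the sequence, while the visually similar pairs 2a/2b and 4a/4b remain distinct names. That difference is the spoofing risk the slide points out.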
Character Mapping ML



XML format for the interchange of mapping data for
character encodings and aliases.
Promoted to Unicode Technical Standard; with
new Conformance section (2).
Added explicit text about multi-character
mappings.
Common Locale Data Repository


Common, necessary
software locale data for
world languages
XML format for
effective interchange
Arabic – arabski
Bulgarian – bułgarski
Czech – czeski
…
Z<Å
Δευτέρα, 05 Σεπτεμβρίου 2005
Montag, 5. September 2005
¥1,234.57
AED – د.إ.
BHD – د.ب.
DZD – د.ج.
EGP – ج.م.
EUR – €
…
1 234,57руб.
Africa – 非洲
Central America – 中美洲
Eastern Africa – 东非
Northern Africa – 北非
…
Typical Locale Data




Dates/time formats
Number/Currency formats
Measurement Systems
Collation Specifications (UCA-based)



Used for sorting, searching, matching
Tailorings of translated names for language,
territory, script, timezones, currencies, …
...
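To make the number/currency item concrete, here is a minimal sketch of how locale-specific symbols change the rendering of one amount. The function `format_currency` and its parameters are hypothetical and are not the CLDR data format or any library API. The separators and symbols are taken from the two sample strings shown on the CLDR slide, and negative amounts are not handled.

```python
def format_currency(amount, group, decimal, symbol, symbol_first):
    """Format a non-negative amount with two decimals and the given symbols."""
    whole, frac = f"{amount:.2f}".split(".")
    # Split the integer part into groups of three digits, from the right.
    groups = []
    while whole:
        groups.insert(0, whole[-3:])
        whole = whole[:-3]
    number = group.join(groups) + decimal + frac
    return symbol + number if symbol_first else number + symbol

# Illustrative settings only, mirroring the sample strings in the slides.
yen_style = format_currency(1234.57, group=",", decimal=".", symbol="¥", symbol_first=True)
ruble_style = format_currency(1234.57, group=" ", decimal=",", symbol="руб.", symbol_first=False)

print(yen_style)
print(ruble_style)
```

In practice these choices would come from the locale data rather than from arguments written out by hand.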
Latest Release: CLDR 1.3

296 locales: 96 languages, 130 territories



Complete set of generated POSIX-format data


Plus tool to generate versions tuned for different platforms.
Expanded locale data




Languages: Afar [Qafar]; Afrikaans; Albanian [shqipe]; Amharic [አማርኛ];
Arabic [العربية]; Armenian [Հայերէն]; …
Territories: Afghanistan [افغانستان]; Albania [Shqipëria]; Algeria [الجزائر];
Argentina; Armenia [Հայաստանի Հանրապետութիւն]; Australia;
Austria [Österreich]; Azerbaijan [Azərbaycan, Азәрбајҹан]; …
Timezone localizations
Including UN M.49 continents and regions
Many other revisions and additions of data
New Tests & Tools
Expanded Role for Consortium

Dedicated to the goal that all the world's
languages can be used on computers
everywhere, from mobile phones to mainframes.

Providing the fundamental specifications for full
software globalization, full interoperability
Full Members
Institutional & Supporting Members
(New Membership Categories)
Associate Members
Liaison Members









Center of Computer and Information Development (CCID), Beijing, China
High Council of Informatics (HCI), Iran
Information and Communication Technology Agency of Sri Lanka (ICTA)
The International Forum for Information Technology in Tamil (INFITT)
The Internet Engineering Task Force (IETF)
ISO/IEC JTC1/SC2 and WG2
Linguistic Society of America (LSA)
National Endowment for the Humanities (NEH)
National Information Standards Organization (NISO)
NSAI/ICTSCC/SC4: Irish standardization: Codes, Character Sets, and Int’lization
OpenI18N.org: The Free Standards Group Open Internationalization Initiative
Research Institute for ILCAA, Tokyo University of Foreign Studies
Research Institute for the Languages of Finland (RILF)
Special Libraries Association (SLA)
Technical Committee on Information Technology (TCVN/TC1), Hanoi, Viet Nam
United Nations Group of Experts on Geographical Names (UNGEGN)
World Wide Web Consortium - W3C I18N Core Working Group
Unicode Technical Committee

Multiple Globalization Standards
 The Unicode Standard, including UAXes
 Unicode Technical Standards: Collation, …
 Unicode Technical Notes: Best Practices, Background Information



Quarterly F2F Meetings
Email Discussion
CLDR Technical Committee

Meetings
Short, frequent: Telecon + Instant Messaging
 Email Discussion


Data
All additions / revisions in bug database
 Anyone can file; committee assesses, vets

Why Join?

Support the technology
 That enables your success in international, technical, and emerging markets.

Protect your investment
 The stability you need
 The extensions you require
 The developments you call for: security, …

Demonstrate your leadership
 For the goal that all the world's languages can be used on computers
 everywhere, from mobile phones to mainframes.