An ICU Overview
Mark Davis
Chief Globalization Architect, IBM
IBM Globalization Center of Competency
10/7/2015
© 2003 IBM Corporation
Business Unit or Product Name
Agenda
▪ What is ICU?
▪ Architecture Overview
▪ Significant New ICU Features
▪ Near Future Features
▪ References
▪ Q and A
Why Globalization?
Unicode
▪ All world languages
▪ Efficient and effective processing
▪ Lossless data exchange
▪ Enables single-binary global software
▪ But… all languages ⇒ large, complex standard
– 1,400 pages + Annexes + additional standards
– 90,000+ characters
– Major update every 3 years
– 70 character properties, many multi-valued
– Affects many processes: display, line-break, regex, …
Locales
▪ Features vary widely across languages & countries
– Sorting, line breaks, date/time/number/currency formatting, codepage conversion, …
– Performance is key: easy to do the right thing; hard to do it fast
What is ICU?
▪ Globalization / Unicode / Locales
▪ Mature, widely used set of C/C++ and Java libraries
– Basis for Java 1.1 internationalization – but goes far beyond
▪ Very portable – identical results on all platforms / programming languages
– C/C++: 30+ platforms/compilers
– Java: IBM & Sun JDK
▪ Full threading model; customizable; modular
▪ Open source – but not viral
▪ ICU 3.0: 78 languages; 118 countries; 870 codepages
Who uses ICU?
▪ Products Within IBM
– PSD Print Architecture, DB2, COBOL, Host Access Client, InfoPrint
Manager, Informix GLS version 4.0, iSeries, Lotus Notes, Lotus
Extended Search, Lotus Workplace, MQ Integrator Endeavour, NUMAQ,
OTI, Pervasive Computing WECMS, SS&S WebSphere Banking
Solutions, Tivoli Presentation Services, WBI Adapter/Connect/Modeler
and Monitor/Solution Technology Development/WBI-Financial TePI,
WebSphere Application Server/Studio Workload Simulator/Transcoding
Publisher, XML Parser
▪ Other Companies and Organizations
– Adobe, Apple (Mac OS X), Avaya, BEA, BroadJump, Business Objects,
caris, CERN, Cognos, Debian, Gentoo, HP, Inktomi, JD Edwards, Jikes,
Macromedia, Mathworks, Mozilla, NCR, OpenOffice, Parrot, PayPal,
Python, QNX, Rogue Wave, SAP, Siebel, SIL, Software AG, Sun
Microsystems (Solaris, Java), SuSE, Sybase, Virage, webMethods,
Wine, Leica Geosystems GIS & Mapping, LLC.
ICU Features
▪ Unicode text handling
▪ Charset conversions (700+)
▪ Collation & Searching
▪ Locales (170+)
▪ Resource Bundles
▪ Calendar & Time zones
▪ Complex-text layout engine
▪ Unicode Regular Expressions
▪ Breaks: word, line, …
▪ Formatting
– Date & time
– Messages
– Numbers & currencies
▪ Transforms
– Normalization
– Casing
– Transliterations
Architecture Overview 1
▪ Locale Based Services
– Locale is an identifier, not a container
– Keywords for variants: de@collation=phonebook
▪ Resource inheritance: shared resources
Inheritance tree (root → Language → Script → Country):
root
  en
    US
    IE
  de
    DE
    CH
  zh
    Hant
      TW
    Hans
      CN
    CN
    TW
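
The inheritance idea on this slide can be sketched in a few lines of Python. This is only an illustration, not ICU's resource bundle API: the `fallback_chain` and `lookup` helpers and the `BUNDLES` data are hypothetical, and the sketch assumes that lookup walks from the most specific locale ID up to root, with each level storing only what differs from its parent.

```python
def fallback_chain(locale_id):
    """Return the lookup order for a locale ID, most specific first."""
    base = locale_id.split("@", 1)[0]          # drop keywords such as @collation=...
    parts = base.split("_")
    chain = ["_".join(parts[:n]) for n in range(len(parts), 0, -1)]
    return chain + ["root"]


# Hypothetical bundles: each level holds only its own overrides.
BUNDLES = {
    "root": {"ok": "OK", "colour": "color"},
    "en": {"greeting": "Hello"},
    "en_IE": {"colour": "colour"},
    "de": {"greeting": "Hallo"},
}


def lookup(locale_id, key):
    """Return the first value found while walking up the fallback chain."""
    for name in fallback_chain(locale_id):
        bundle = BUNDLES.get(name, {})
        if key in bundle:
            return bundle[key]
    raise KeyError(key)
```

With this data, `lookup("en_IE", "greeting")` is answered by the shared `en` bundle, while `lookup("de_CH", "ok")` falls all the way through to root.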
Architecture Overview 2
▪ Open and Close Service Model
– Better performance by avoiding setup costs per operation
– Warning: reuse opened services to get the performance benefit
▪ ICU Threading Model
– Multiple versions in use simultaneously
– Large resources shared in read-only cache
Architecture Overview 3
▪ Data Driven Services
– Customize at build-time or run-time
– Interchange with other platforms
• same results on each
– Rule-based
• Collation, Word-breaks, Transforms
– Pattern-based
• Formats, UnicodeSet
– Table-based
• Character Conversion
Architecture Overview – ICU4C
▪ Simple Error Handling
– C++ subset for portability
– Support for multi-threaded environment
▪ Version Management
– Multiple versions at the same time
– Data and library versioning
▪ String Buffer Management
– Preflighting and overflow protection
▪ Misc: Load/Unload ICU
▪ Recent Additions:
– Runtime-settable memory allocation and mutex functions
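
Preflighting is easiest to see in code. The following Python sketch is not an ICU4C function; `to_utf8` is a hypothetical helper that mimics the pattern under the assumption that a conversion call always reports the length it needs and writes nothing when the destination is too small.

```python
def to_utf8(text, dest, capacity):
    """Write text as UTF-8 into dest if it fits.

    Always returns (needed_length, overflow). Passing dest=None with
    capacity 0 is the preflight call: it only measures.
    """
    encoded = text.encode("utf-8")
    needed = len(encoded)
    if dest is None or needed > capacity:
        return needed, needed > capacity       # nothing is written
    dest[:needed] = encoded
    return needed, False


# Step 1: preflight to learn the required size.
pre_length, pre_overflow = to_utf8("Grüße", None, 0)

# Step 2: allocate exactly that much and convert for real.
buffer = bytearray(pre_length)
length, overflow = to_utf8("Grüße", buffer, len(buffer))
```

The caller never has to guess a buffer size, and an undersized buffer is reported instead of being overrun.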
ICU4J: Supplement for Java
▪ Core globalization (no character conversion, no GUI components)
– We do supply complex text support for Sun
▪ Modularized: products may add just the functionality they need
▪ CLDR 1.1 (Common Locale Data Repository)
▪ Up-to-date globalization: standards-compliant; latest Unicode
– Supplementary character support (GB 18030, JIS X 0213, HKSCS)
– Full properties – JDK has only a fraction
– Local calendars (Thailand, Japan, …); ISO dates
– Currencies, String Search, Int’l Domain Names
– Transforms: Case, Scripts, Normalization
▪ Much faster turn-around on bug-fixes, enhancements
Unicode Text Handling
▪ C
– UChar*: null-terminated or with length
▪ C++
– UnicodeString: full featured string class
▪ Java
– Uses normal JDK String, adds utilities
▪ All handle supplementary characters
– Required for GB 18030 and JIS X 0213 repertoire
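
Supplementary characters matter because one character can occupy two UTF-16 code units. The Python sketch below is not ICU code; the helper names are hypothetical, and it only demonstrates the surrogate-pair arithmetic that any UTF-16 string type has to respect.

```python
def utf16_units(text):
    """Number of 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


def surrogate_pair(code_point):
    """Split a supplementary code point (>= 0x10000) into (high, low) surrogates."""
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


sample = "a\U00020000"            # 'a' plus U+20000, an ideograph outside the BMP
code_points = len(sample)         # Python strings count code points
code_units = utf16_units(sample)  # a UTF-16 buffer counts code units
pair = surrogate_pair(0x20000)
```

Code that indexes a UTF-16 buffer by unit rather than by code point can split such a character in half, which is why string utilities need to be supplementary-aware.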
Unicode Text Handling 2
▪ All Unicode 4.0 properties
– direct API
• values, names, enumerations
– UnicodeSet
• fast, compact set operations
• all properties:
– [\p{lowercase}-[a-z]]
– [\p{greek} & \p{uppercase}]
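
The two set patterns above can be approximated with plain Python sets. This is a rough sketch, not UnicodeSet: it assumes general category `Ll`/`Lu` as a stand-in for the Lowercase and Uppercase properties, approximates the Greek script by character names starting with "GREEK", and scans only code points below U+3000. All helper names are hypothetical.

```python
import unicodedata


def chars_where(predicate, limit=0x3000):
    """Collect every character below limit that satisfies predicate."""
    return {chr(cp) for cp in range(limit) if predicate(chr(cp))}


def is_lowercase_letter(ch):
    return unicodedata.category(ch) == "Ll"    # approximation of \p{lowercase}


def is_uppercase_letter(ch):
    return unicodedata.category(ch) == "Lu"    # approximation of \p{uppercase}


def is_greek(ch):
    # Approximation of \p{greek}: the script property is not in unicodedata.
    return unicodedata.name(ch, "").startswith("GREEK ")


ascii_lower = set("abcdefghijklmnopqrstuvwxyz")

# [\p{lowercase}-[a-z]]: set difference
lower_not_ascii = chars_where(is_lowercase_letter) - ascii_lower

# [\p{greek} & \p{uppercase}]: set intersection
greek_upper = chars_where(is_greek) & chars_where(is_uppercase_letter)
```

A real UnicodeSet stores ranges rather than individual characters, which is what makes it compact; the sketch trades that for brevity.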
Data: Recent Additions
▪ Conforms to CLDR 1.1
– 50% more data than CLDR 1.0: adding many translated terms for
languages, scripts, countries, currencies, and time zones
– improved collation for Eastern Europe, Chinese pinyin
▪ Reduced multiplatform install image size
▪ Improved XLIFF-ICU conversion tools
▪ Locale canonicalization spec defined and implemented (C+J)
– Provides interoperability with POSIX and .NET locale IDs, more
RFC 3066 support
Character Set Conversion
▪ Precise alias information:
– When you ask for “SJIS”, you can request the precise
definition by platform:
• windows, ibm, solaris, …
▪ Buffer management
– automatically handles characters that cross buffers
▪ Customizations allowed for:
– illegal sequences
– undefined characters
▪ Unicode Text Compression – SCSU, BOCU
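
Two of the points above, characters that cross buffers and customizable handling of bad input, can be illustrated with Python's standard `codecs` module. This is a stand-in, not ICU's converter API; `decode_chunks` is a hypothetical helper, and the error handlers shown are Python's, chosen only because they play a similar role.

```python
import codecs


def decode_chunks(chunks, encoding="utf-8"):
    """Decode a stream delivered in arbitrary chunks, keeping state between calls."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    parts = [decoder.decode(chunk) for chunk in chunks]
    parts.append(decoder.decode(b"", final=True))   # flush; fails on a dangling partial character
    return "".join(parts)


# The three bytes of the euro sign are split across two buffers.
euro_split = [b"price: \xe2\x82", b"\xac 5"]
text = decode_chunks(euro_split)

# Illegal sequence: substitute U+FFFD instead of failing.
replaced = b"abc\xff".decode("utf-8", errors="replace")

# Undefined character in the target charset: emit an escape instead of failing.
escaped = "x\u20ac".encode("latin-1", errors="backslashreplace")
```

The decoder object carries the incomplete byte sequence from one call to the next, so the caller can feed whatever buffer sizes the I/O layer happens to deliver.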
Collation and Searching
▪ Fast international comparison and string search; fully UCA compliant
– Compressed sort keys, optimized string comparison, sublinear string search
– incremental sortkeys for radix-sort
▪ Precise binary sortkey stability over time
▪ Fully data driven
▪ API / rule customizations
– strength, normalization, upper vs. lowercase first, ignore punctuation, …
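
The idea of sort keys and comparison strength can be shown with a toy example. The sketch below is emphatically not the UCA and not ICU's collator: `sort_key` is a hypothetical function that assumes three levels (base letters, then accents, then case with lowercase first) and ignores where in the string an accent occurs.

```python
import unicodedata


def sort_key(text, strength=3):
    """Build a simplified multi-level key, truncated to the requested strength."""
    decomposed = unicodedata.normalize("NFD", text)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    marks = "".join(ch for ch in decomposed if unicodedata.combining(ch))
    primary = base.casefold()                        # level 1: base letters
    secondary = marks                                # level 2: accents
    tertiary = tuple(ch.isupper() for ch in base)    # level 3: case, lowercase first
    return (primary, secondary, tertiary)[:strength]


words = ["role", "r\u00e9sum\u00e9", "Resume", "resume"]
ordered = sorted(words, key=sort_key)

# At primary strength, accents and case are ignored entirely.
same_at_primary = sort_key("r\u00e9sum\u00e9", 1) == sort_key("Resume", 1)
```

Because the key is computed once per string, sorting a large list compares precomputed keys rather than re-running the comparison logic, which is the motivation for sort keys in general.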
Collation and Searching: Recent Additions
▪ Numeric sorting: sequences of digits can be sorted
numerically instead of alphabetically
– e.g., filenames would sort "ab-2" < "ab-10"
– without material performance cost
– with reduced sortkey length
▪ Significantly improved sorting orders for many other languages
▪ Data in separate tree, for easier modularization and maintenance
▪ getFunctionalEquivalent API allows for better caching and UI support
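
The numeric-sorting behaviour can be reproduced with a small key function. This Python sketch is only an illustration of the ordering, not of how ICU builds its sort keys; `numeric_key` is a hypothetical helper that treats each run of decimal digits as one number.

```python
import re


def numeric_key(text):
    """Split text into alternating text and digit runs; compare digit runs as integers."""
    tokens = re.split(r"(\d+)", text)
    # re.split with a capturing group puts digit runs at the odd indices.
    return [int(tok) if i % 2 else tok for i, tok in enumerate(tokens)]


names = ["ab-10", "ab-2", "ab-1"]
alphabetic = sorted(names)
numeric = sorted(names, key=numeric_key)
```

Plain string comparison puts "ab-10" before "ab-2" because "1" sorts before "2"; the key function restores the order a person would expect for filenames.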
Calendar & Time Zones
▪ International Calendars – Arabic, Buddhist, Hebrew, Japanese
– Required for correct presentation of dates in some countries
▪ Olson timezone support, with localizations
▪ Recent Additions:
– RFC822 time zone format support in DateFormat (C+J) for compatibility
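
For readers unfamiliar with the RFC 822 zone format, it is the numeric offset form such as `+0530`. The Python sketch below uses the standard library rather than ICU's DateFormat, and the fixed offset and date are arbitrary values picked for illustration.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime

# A fixed UTC+05:30 offset, chosen only as an example.
offset = timezone(timedelta(hours=5, minutes=30))
moment = datetime(2004, 6, 15, 12, 0, tzinfo=offset)

rfc822_zone = moment.strftime("%z")        # numeric zone only
rfc822_date = format_datetime(moment)      # full RFC 822 style date with that zone
```

The numeric form avoids ambiguous zone abbreviations, which is why it is useful for interchange even when a localized zone name is shown to the user.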
Formatting
▪ Date & time: 8 formats per locale
▪ Messages
– Completely localizable, Plural support
▪ Numbers & currencies
– Scientific Notation, Spelled-out (checks, etc.)
– Full Orthogonal Currency support (INR shown in three locales)
• In Hindi:
• In English: Rs. 1,234.57
• In German: Rs. 1.234,57
▪ Recent Additions
– POSIX migration library
– Allows parsing multiple currencies with one formatter
– Short and stand-alone month/day names
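
"Orthogonal" here means the currency and the locale vary independently: the same INR amount takes its separators from the display locale. The Python sketch below is not ICU's NumberFormat; the `SEPARATORS` table and `format_amount` helper are hypothetical and cover only the two locales whose output appears on the slide.

```python
# Hypothetical per-locale symbols: (grouping separator, decimal separator).
SEPARATORS = {"en": (",", "."), "de": (".", ",")}


def format_amount(value, locale_id, currency_symbol):
    """Format value with two decimals using the locale's separators."""
    grouping, decimal = SEPARATORS[locale_id]
    text = f"{value:,.2f}"                 # e.g. "1,234.57" with fixed separators
    # Swap separators through a placeholder so the two replacements cannot collide.
    text = text.replace(",", "\x00").replace(".", decimal).replace("\x00", grouping)
    return f"{currency_symbol} {text}"


in_english = format_amount(1234.567, "en", "Rs.")
in_german = format_amount(1234.567, "de", "Rs.")
```

The currency symbol is passed in separately from the locale, so the same function could format any currency for either locale.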
Transforms
▪ Unicode Normalization
– Highly optimized for performance
– performance utilities: concatenation, detection, comparison
▪ Casing (upper, lower, title, folding)
▪ General Transforms
– Script transliterations
– Half-width/Full-width, Hex, etc.
– Chain transforms together, filter source characters
– Rule-based, customizable at runtime
▪ IDNA: Internationalized Domain Names in Applications
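
Several of these transforms have direct counterparts in Python's standard library, which makes for a compact illustration. This is a stand-in, not ICU's Normalizer or Transliterator API; `canonically_equal` is a hypothetical helper, and Python's built-in `idna` codec implements the older IDNA 2003 rules.

```python
import unicodedata


def canonically_equal(a, b):
    """Compare two strings after bringing both to the same normalization form."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


composed = unicodedata.normalize("NFC", "a\u0301")      # a + combining acute
decomposed = unicodedata.normalize("NFD", "\u00e1")     # precomposed a-acute

folded = "Stra\u00dfe".casefold()                       # case folding, not just lowercasing

# Full-width Latin letters mapped to their ordinary forms via compatibility normalization.
narrow = unicodedata.normalize("NFKC", "\uff21\uff22\uff23")

ascii_domain = "b\u00fccher.example".encode("idna")     # IDNA ToASCII
```

Normalization is what lets `a` followed by a combining accent compare equal to the precomposed letter, the same canonical-equivalence issue that affects searching, collation, and layout.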
Segmentation: word, line & sentence
▪ Fast state-table implementation
▪ Customizable
– Rule-based – customizable at runtime
– Special customizations, e.g. Thai
▪ Recent Additions:
– Greatly improved performance when going backwards
(common case when doing line break)
– Java
• The rules syntax has been extended. Rules can now return
information about the types of characters they encountered.
• Common compiled (binary) rule format with ICU4C
Unicode Regular Expressions
▪ Full Regex Implementation
– C only: Java 1.4 has its own package (though not as powerful)
▪ All Unicode 4.0 Properties
– supported through UnicodeSet
▪ Good performance
– competitive with non-Unicode regex
▪ Recent Additions
– Now features a C API, instead of just C++
Complex-text layout engine
▪ Glyph processing, positioning & adjustment
– ligature substitution, contextual forms, kerning, accent placement,
Bidi scripts, etc.
▪ Support for:
– Drawing
– Caret Display
– Hit Testing
– Selection Highlighting
– Caret Movement
– Layout Metrics
– Line Break
▪ ICU 3.0: Canonical Equivalence: a + ´ or á
References
▪ ICU main site:
– http://oss.software.ibm.com/icu/
– Links to
• Download ICU
• User Guide, Technical FAQ, Support, Bug Reports
▪ Unicode Consortium
– http://www.unicode.org
• Unicode glossary, Unicode character database
Questions and Answers