Collation in ICU
Mark Davis
Chief SW Globalization Architect
IBM
What is ICU?
Premier Unicode Enablement Library
Open-source: non-viral license
Full-Featured, Cross-Platform
C, C++, Java APIs
Collation, Charset Conversion, Resources,
Boundaries, Calendars, Transforms (case,
norm., translit., …), Format/Parse (dates, times,
msgs, nums., curr., …), Unicode strings/props
http://oss.software.ibm.com/icu/
21st International Unicode Conference
2
Dublin, Ireland — 10/3/2015
Collation = Sorting Order
How hard can it be?
A<B<C<…
Complications
Languages are complex and varied
Unicode is a big set of characters
Performance is crucial
21st International Unicode Conference
3
Dublin, Ireland — 10/3/2015
Varies By:
Customizations
Language
A<a
a <A
Swedish: z < ö
German: ö < z
Versioning
Usage
Fixes
New Gov. Stds
New Characters
Dictionary: öf < of
Telephone: of < öf
21st International Unicode Conference
4
Dublin, Ireland — 10/3/2015
Strength Levels: L1, L2, L3
1. Base characters: a < b
2. Accents: as < às < at
ignored if there is a L1 character difference
3. Case: ao < Ao < aò
ignored if there is a L1 or L2 difference
4. Punctuation: ab < a-b < aB
ignored* if there is a L1, L2, or L3 difference
5. Tie-breaker: NFD code point order
21st International Unicode Conference
5
Dublin, Ireland — 10/3/2015
Context Sensitivity
Contractions
H < Z, but CZ < CH
Expansions
OE < Π< OF
Both
カー < カイ
キー > キイ
21st International Unicode Conference
6
Dublin, Ireland — 10/3/2015
Canonical Equivalence
≡
≡
x+.+^ ≡
ự
≡
≡
≡
≡
≡
Å
21st International Unicode Conference
Å
A+º
x+^+.
u+’
ư+.
ụ +’
u+.+’
u + ’+ .
7
Dublin, Ireland — 10/3/2015
Oddities
Normal accents
cote < coté < côte < côté
• first accent difference determines order
French accents
cote < côte < coté < côté
• last accent difference determines order
Logical Order Exception (Thai, Lao)
เ ก sorts like ก เ
21st International Unicode Conference
8
Dublin, Ireland — 10/3/2015
Merging Database Fields
F1 = LastName, F2 = FirstName
Sequential
F1, then F2
Weak 1st
F1 (L1), F2
Merged
L1, L2, L3
diSilva, John
diSilva, Fred
di Silva, John
di Silva, Fred
dísilva, John
dísilva, Fred
diSilva, John
dísilva, John
di Silva, John
di Silva, Fred
diSilva, Fred
dísilva, Fred
diSilva, John
di Silva, John
dísilva, John
diSilva, Fred
di Silva, Fred
dísilva, Fred
21st International Unicode Conference
9
Dublin, Ireland — 10/3/2015
Customizations
Parameters that change collation
behavior
Choice of language (locale)
Runtime choices
Examples to follow
21st International Unicode Conference
10
Dublin, Ireland — 10/3/2015
Parametric Customizations
Strength
Case:
Base
Base+Accent
Base+Accent+
Case
&c.
21st International Unicode Conference
A<a
a <A
Punctuation:
di Silva < diSilva
diSilva < di Silva
11
Dublin, Ireland — 10/3/2015
Punctuation / Spaces (Alternates)
Base Character
Ignoreable
di silva
di Silva
Di silva
Di Silva
Dickens
disilva
diSilva
Disilva
DiSilva
Dickens
di silva
disilva
di Silva
diSilva
Di silva
Disilva
Di Silva
DiSilva
21st International Unicode Conference
12
Dublin, Ireland — 10/3/2015
Extended Customizations
User-defined
Script Order
“&” ≡
“ampersand”
Merging tailorings
Iranian + French
21st International Unicode Conference
b<‫<ב‬β<б
β<b<б<‫ב‬
Numbers
A-10 < A-2
A-2 < A-10
13
Dublin, Ireland — 10/3/2015
Other Uses: String Searching
Match according to locale
conventions:
e.g. w = v for Swedish
Use collation options:
ignore case, accent
other customizations
21st International Unicode Conference
14
Dublin, Ireland — 10/3/2015
Other Uses: Selection Bounds
Return all records where:
Zoë ≤ name < Zorma
Ignore case / accents
Zoe / zoe / Zoë / zoë / …
21st International Unicode Conference
15
Dublin, Ireland — 10/3/2015
UCA
UTS #10: Unicode Collation Algorithm
Levels, Expansions, Contractions, Punctuation,
Canonical Equivalence, etc.
Default ordering: all Unicode code points
Provides for tailoring to given languages
Also see: The Unicode Standard, §5.17: Sorting
and Searching
Aligned with ISO 14651
21st International Unicode Conference
16
Dublin, Ireland — 10/3/2015
APIs
String Compare
Sort Keys
String Search
Selection Boundaries
Merged sortkeys
21st International Unicode Conference
17
Dublin, Ireland — 10/3/2015
Sort Keys
Transform string into series of bytes
which will binary-compare
a:
A:
á:
ab:
b:
06 C3 01 20 01 02 00
06 C3 01 20 01 08 00
06 C3 01 20 32 01 02 02 00
06 C3 06 D7 01 20 20 01 02 02 00
06 D7 01 20 01 02 00
Level 1 Level 2Level 3
21st International Unicode Conference
18
Dublin, Ireland — 10/3/2015
String Compare vs. Sort Keys
Same results in either case
SC faster for single comparisons
average 5 to 10 times!
SK faster for multiple comparisons
index once
binary compare many times
21st International Unicode Conference
19
Dublin, Ireland — 10/3/2015
String Search
Naïve Approach
key matches in target at <x, y>
iff target.substring(x, y) ≡ key
Boundary Complications
Ignorables: “a” matches in “(a)”?
• at <0,2> & <1, 2> & <0,3> & <1,3>?
Contractions: “c” matches in “churo”?
Normalization: “å” matches in “a¸˚”?
21st International Unicode Conference
20
Dublin, Ireland — 10/3/2015
WARNING 1: Basics
Not aligned with character set or repertoire
Latin-1: Swedish and German sorting differs
Not code point (binary) order
Binary:
English:
Z<a<v<w
Z>a
Swedish: v ≡ w
Not a property of strings: same Database
Swedish user: views/select
German user: views/selects
21st International Unicode Conference
21
Dublin, Ireland — 10/3/2015
WARNING 2: Operations
Order not preserved under
concatenation / substringing
x<y
x<y
xz < yz
zx < zy
21st International Unicode Conference
↛
↛
↛
↛
xz < yz
zx < zy
x<y
x<y
22
Dublin, Ireland — 10/3/2015
WARNING 3: Dependence
Collation is a relation over strings
Sort keys embody part of that
relation
Thus, comparing sort keys from
different tailorings (or parameters)
gives undefined results.
21st International Unicode Conference
23
Dublin, Ireland — 10/3/2015
WARNING 4: Stability
Stable Sort
Records with equal comparison come out in
original order
Property of algorithm, not comparison
Semi-Stable Comparison
x≠y→x≢y
Property of comparison, not algorithm
Degrades performance
Doesn’t do what people think (or really want)!
21st International Unicode Conference
24
Dublin, Ireland — 10/3/2015
ICU/Java Collation Architecture
L1-3, contractions, expansions, …
Locale tailorings
Fully rule-based specification
Arbitrary runtime user customizations
& ‘?’ = ‘question mark’
& ‘$’ = ‘dollar sign’
& z < ‘george’
21st International Unicode Conference
25
Dublin, Ireland — 10/3/2015
Java
Sun licensed and includes an early
version of ICU collation in Java
ICU version:
Dramatically faster
Much reduced memory consumption
Halved sort-key length
Many additional features
21st International Unicode Conference
26
Dublin, Ireland — 10/3/2015
ICU Collation I
Full UCA compliance
Full supplementary character support
Solid performance
Small Sort-Keys
Small Memory Footprint
21st International Unicode Conference
27
Dublin, Ireland — 10/3/2015
ICU Collation II
Parametric control
Tailorable to any language
Simultaneous Multiple Versions
Merging Sort Keys
Selection Bounds
21st International Unicode Conference
28
Dublin, Ireland — 10/3/2015
Memory-Mappable, Fast Init
Old: separate
allocations
21st International Unicode Conference
New: offsets
within mem-map
29
Dublin, Ireland — 10/3/2015
Delta Tailoring:
Minimize Memory Usage
input
FR
found
not
not
UCA:
One Copy;
≈80K
code
found
output
21st International Unicode Conference
synthesize
30
Dublin, Ireland — 10/3/2015
Simultaneous Multiple Versions
Programs can link against different
versions of ICU, simultaneously.
Preserves exact binary order over time.
Application
New
DB
ICU
2.1
21st International Unicode Conference
Old
DB
ICU
2.0
31
Dublin, Ireland — 10/3/2015
Performance
Checks for identical prefixes first
Invokes normalization only when
needed
Fast paths for common cases
Minimizes comparison time
Minimizes sort key length
21st International Unicode Conference
32
Dublin, Ireland — 10/3/2015
Sort Key Compression
Common weights are 1-byte
Primary, secondary, tertiary, quarternary
Sequences are compressed
UTF-16 Values for “Märk Davis” (22 bytes)
004D 00E4 0072 006B 0020 0044 0061
0076 0069 0073 0000
Sort Key (L3, ignorable punctuation - 19 bytes)
2F 17 39 2B 1D 17 41 27 3B 01
77 96 0A 01
8F 80 8F 07 00
21st International Unicode Conference
33
Dublin, Ireland — 10/3/2015
ICU vs. Windows, glibc
Full UCA!
String comparison: comparable speed
≈ -20% .. +400%
Sort keys: much shorter
≈ 50%
Warning: speed comparisons are approximate!
Depends on data, parameters, features, CPU
21st International Unicode Conference
34
Dublin, Ireland — 10/3/2015
More Information
ICU
http://oss.software.ibm.com/icu/
Design Document
http://oss.software.ibm.com/cvs/icu/icuht
ml/design/collation/
Latest Version of these slides
http://www.macchiato.com
21st International Unicode Conference
35
Dublin, Ireland — 10/3/2015
Q&A
21st International Unicode Conference
36
Dublin, Ireland — 10/3/2015
Fast C or D (FCD)
Accepts all NFD, most NFC,
without normalization
X
FCD NFC NFD
A- ring
Angstrom
A + ring
A + grave
A-ring + grave
A + cedilla + ring
A + ring + cedilla
A-ring + cedilla
21st International Unicode Conference
Y
Y
Y
Y
Y
Y
Y
Y
Y
Y
Y
37
Dublin, Ireland — 10/3/2015
Backup Slides
Not used in the presentation,
except in response to questions
21st International Unicode Conference
38
Dublin, Ireland — 10/3/2015
Performance: Coding
Avoided unnecessary function calls.
Example: strlen too expensive!
Avoided use of objects
Rewrote core code in C
C++ API wraps the C core code.
Fast-pathed common cases
Used stack memory buffers
(with expansion if necessary)
Made inner loops as tight as possible
21st International Unicode Conference
39
Dublin, Ireland — 10/3/2015
WARNING 5: Math. Relation
S = {Unicode Strings}
Reflexive
∀a ∊ S: a ≤ a
Antisymmetric
∀a, b ∊ S: a ≤ b & b ≤ a → a = b
Transitive
∀a, b ∊ S: a ≤ b & b ≤ c → a ≤ c
Total
∀a, b ∊ S: a ≤ b ∨ b ≤ a
21st International Unicode Conference
40
Dublin, Ireland — 10/3/2015
Identical Prefixes
Sorting / Searching Databases
Many comparisons to “close” strings
Check initial prefixes with binary
compare
Drop into collation loop at first
difference
Complication…
21st International Unicode Conference
41
Dublin, Ireland — 10/3/2015
Initial Prefix Complication
Need to backup if in “bad”
position:
Type
Example
Contraction (Spanish) c
h
Normalization
a
°
Surrogate Pair
<L> <T>
21st International Unicode Conference
42
Dublin, Ireland — 10/3/2015
Fractional UCA
Fractional weights for compression
Gaps for tailoring, future UCA additions
Only stores differences in tailoring file
Reduces memory footprint
UCA
Frac. UCA
a
æ
b
a
æ
b
ɒ
ɒ
primary 0861 0865 0871 0875 17 18 60 18 66 19
secondary 20 20 20 20 03
03
03
03
tertiary 02 02 02 02 03
03
03
03
21st International Unicode Conference
43
Dublin, Ireland — 10/3/2015
Exceptional Values
Normal weight storage
P P P P P P P P P P P P P P P P S S S S S S S S C C T T T T T T
16b
8b
1 1
6b
Special Weight Storage
NOT_FOUND, EXPANSION,
CONTRACTION, THAI, …
F F F F T T T T d d d d d d d d d d d d d d d d d d d d d d d d
4b
4b Tag
24 bit data
21st International Unicode Conference
44
Dublin, Ireland — 10/3/2015
Minimal Memory
Flat-file (memory mapped)
speeds initialization
reduces memory footprint
(next slide)
Delta Tailoring
Single copy of UCA (≈80K)
Small delta files per locale
21st International Unicode Conference
45
Dublin, Ireland — 10/3/2015
Descargar

Collation in ICU