Bell Laboratories
Intrinsic complexity of classification problems
Tin Kam Ho
With contributions from
Mitra Basu, Ester Bernado-Mansilla, Richard Baumgartner, Martin Law,
Erinija Pranckeviciene, Albert Orriols-Puig, Nuria Macia
Supervised Learning:
Many Methods, Data Dependent Performances
Bayesian classifiers,
logistic regression,
linear & polynomial discriminators,
nearest-neighbors,
decision trees & forests,
neural networks,
support vector machines,
ensemble methods, …
• No clear winner that is good for all problems
• Often, accuracy reaches a limit for a
practical problem, even with the best
known method
Classification accuracy (%) by data set:

Data    ZeroR   NN1     NNK     NB      C4.5    PART    SMO     XCS
aud     25.3    76.0    68.4    69.6    79.0    81.2    -       57.7
aus     55.5    81.9    85.4    77.5    85.2    83.3    84.9    85.7
bal     45.0    76.2    87.2    90.4    78.5    81.9    -       79.8
bpa     58.0    63.5    60.6    54.3    65.8    65.8    58.0    68.2
bps     51.6    83.2    82.8    78.6    80.1    79.0    86.4    83.3
bre     65.5    96.0    96.7    96.0    95.4    95.3    96.7    96.0
cmc     42.7    44.4    46.8    50.6    52.1    49.8    -       52.3
gls     34.6    66.3    66.4    47.6    65.8    69.0    -       72.6
h-c     54.5    77.4    83.2    83.6    73.6    77.9    -       79.9
hep     79.3    79.9    80.8    83.2    78.9    80.0    83.9    83.2
irs     33.3    95.3    95.3    94.7    95.3    95.3    -       94.7
krk     52.2    89.4    94.9    87.0    98.3    98.4    96.1    98.6
lab     65.4    81.1    92.1    95.2    73.3    73.9    93.2    75.4
led     10.5    62.4    75.0    74.9    74.9    75.1    -       74.8
lym     55.0    83.3    83.6    85.6    77.0    71.5    -       79.0
mmg     56.0    63.0    65.3    64.7    64.8    61.9    67.0    63.4
mus     51.8    100.0   100.0   96.4    100.0   100.0   100.0   99.8
mux     49.9    78.6    99.8    61.9    99.9    100.0   61.6    100.0
pmi     65.1    70.3    73.9    75.4    73.1    72.6    76.7    76.0
prt     24.9    34.5    42.5    50.8    41.6    39.8    -       43.7
seg     14.3    97.4    96.1    80.1    97.2    96.8    -       96.1
sick    93.8    96.1    96.3    93.3    98.4    97.0    93.8    96.7
soyb    13.5    89.5    90.3    92.8    91.4    90.3    -       76.2
tao     49.8    96.1    96.0    80.8    95.1    93.6    83.6    88.4
thy     19.5    68.1    65.1    80.6    92.1    92.1    -       86.3
veh     25.1    69.4    69.7    46.2    73.6    72.6    -       72.2
vote    61.4    92.4    92.6    90.1    96.3    96.5    95.6    95.4
vow     9.1     99.1    96.6    65.3    80.7    78.3    -       87.6
wne     39.8    95.6    96.8    97.8    94.6    92.9    -       96.3
zoo     41.7    94.6    92.5    95.4    91.6    92.5    -       92.6
Avg     44.8    80.0    82.4    78.0    82.1    81.8    84.1    81.7

All Rights Reserved © Alcatel-Lucent 2008
Accuracy Depends on the Goodness of Match between
Classifiers and Problems
[Figure: class boundaries learned by XCS and NN on two problems]
Problem A: XCS error = 1.9%, NN error = 0.06% (NN is better)
Problem B: XCS error = 0.6%, NN error = 0.7% (XCS is better)
Measuring Geometrical Complexity of Classification Problems
Our goal: tools and languages for studying
Characteristics of geometry & topology of high-dim data sets
How they change with feature transformations and sampling
How they interact with classifier geometry
We want to know:
What are real-world problems like?
What is my problem like?
What can be expected of a method on a specific problem?
Parameterization of Data Complexity
Some Useful Measures of Geometric Complexity
Degree of Linear Separability
Find separating hyperplane by linear
programming
Error counts and
distances to plane
measure separability
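A minimal sketch of this test, assuming labels in {+1, −1}: strict separability is equivalent to the feasibility of a linear program seeking w, b with y_i(w · x_i + b) ≥ 1 for every point. The function below is illustrative, not the exact formulation used in the study.

```python
import numpy as np
from scipy.optimize import linprog

def is_linearly_separable(X, y):
    """LP feasibility check: is there (w, b) with y_i*(w.x_i + b) >= 1?

    Feasible iff the two classes are strictly linearly separable
    (the margin of 1 is arbitrary, since w and b can be rescaled).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)              # labels in {+1, -1}
    n, d = X.shape
    # Variables z = [w_1..w_d, b]; constraints: -y_i*(x_i.w + b) <= -1
    A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    b_ub = -np.ones(n)
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (d + 1))
    return res.success
```

Two well-separated clusters come back separable; the XOR configuration does not.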
Length of Class Boundary
Compute minimum
spanning tree
Count class-crossing
edges
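A minimal sketch of this measure using SciPy (illustrative; it builds the full Euclidean distance matrix, so it only suits small samples):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def boundary_fraction(X, y):
    """Fraction of minimum-spanning-tree edges joining opposite classes.

    Builds the Euclidean MST over all points and counts class-crossing
    edges; higher values suggest a longer, more tangled class boundary.
    """
    D = squareform(pdist(np.asarray(X, dtype=float)))
    mst = minimum_spanning_tree(D).tocoo()      # n - 1 edges
    y = np.asarray(y)
    crossing = np.sum(y[mst.row] != y[mst.col])
    return crossing / len(mst.row)
```

For two tight, well-separated clusters exactly one of the n − 1 MST edges crosses classes.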
Fisher’s Discriminant Ratio
Classical measure of
class separability
Maximize over all features to find the most discriminating:
f = (μ1 − μ2)² / (σ1² + σ2²)
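Per feature, the ratio can be computed directly; a sketch assuming two classes labeled 0 and 1:

```python
import numpy as np

def max_fisher_ratio(X, y):
    """Fisher's discriminant ratio f = (mu1 - mu2)^2 / (var1 + var2),
    computed for each feature; return the maximum over features."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    a, b = X[y == 0], X[y == 1]
    f = (a.mean(axis=0) - b.mean(axis=0)) ** 2 / (a.var(axis=0) + b.var(axis=0))
    return f.max()
```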
Shapes of Class Manifolds
Cover same-class pts
with maximal balls
Ball counts describe
shape of class
manifold
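A greedy sketch of this idea (illustrative, not the exact maximal-ball construction from the study): give every point a ball reaching to its nearest opposite-class point, then discard balls wholly contained in a kept same-class ball.

```python
import numpy as np
from scipy.spatial.distance import cdist

def ball_cover_count(X, y):
    """Approximate count of maximal same-class balls covering the data.

    Each point's ball extends to its nearest opposite-class point; balls
    fully contained in a larger kept ball of the same class are dropped.
    Fewer balls suggest simpler class-manifold shapes.
    """
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    D = cdist(X, X)
    r = np.array([D[i, y != y[i]].min() for i in range(len(y))])
    kept = []
    for i in np.argsort(-r):                  # consider largest balls first
        contained = any(y[j] == y[i] and D[i, j] + r[i] <= r[j] + 1e-9
                        for j in kept)
        if not contained:
            kept.append(i)
    return len(kept)
```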
Using Complexity Measures to Study Problem Distributions
Real-World Data Sets:
Benchmarking data from UC-Irvine archive
844 two-class problems
452 are linearly separable, 392 non-separable
Synthetic Data Sets:
Random labeling of randomly located points
100 problems in 1-100 dimensions
[Figure: problems plotted by Complexity Metric 1 vs. Metric 2; groups: linearly separable real-world data, linearly nonseparable real-world data, random labeling]
Measures of Geometrical Complexity
Distribution of Problems in Complexity Space
[Scatter plots: linearly separable, linearly nonseparable, and randomly labeled problems]
The First 6 Principal Components
Interpretation of the First 4 PCs
PC 1 (50% of variance): linearity of boundary and proximity of opposite-class neighbor
PC 2 (12% of variance): balance between within-class scatter and between-class distance
PC 3 (11% of variance): concentration & orientation of intrusion into opposite class
PC 4 (9% of variance): within-class scatter
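The numbers above come from principal component analysis of a standardized problems-by-measures matrix. A self-contained sketch on synthetic stand-in data (the matrix M below is made up for illustration):

```python
import numpy as np

def explained_variance_ratios(M):
    """PCA of a problems-by-measures matrix: standardize each column,
    then take singular values; returns per-PC variance fractions."""
    Z = (M - M.mean(axis=0)) / M.std(axis=0)
    s = np.linalg.svd(Z, compute_uv=False)   # singular values, descending
    return s ** 2 / np.sum(s ** 2)

# Synthetic stand-in: 100 "problems" x 3 correlated "measures"
rng = np.random.default_rng(0)
M = rng.normal(size=(100, 3))
M[:, 2] = M[:, 0] + 0.1 * rng.normal(size=100)   # measure 2 echoes measure 0
ratios = explained_variance_ratios(M)
print(ratios)    # first PC dominates because two measures are correlated
```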
Problem Distribution in 1st & 2nd Principal Components
• Continuous distribution
• Known easy & difficult problems occupy opposite ends
• Few outliers
• Empty regions
[Figure: problems in the first two PCs; linearly separable problems and randomly labeled problems sit at opposite ends]
Relating Classifier Behavior to Data Complexity
Class Boundaries Inferred by Different Classifiers
[Figure: class boundaries inferred by XCS (a genetic algorithm), a nearest neighbor classifier, and a linear classifier]
Domains of Competence of Classifiers
• Which classifier works the best for a given classification problem?
• Can data complexity give us a hint?
[Figure: conjectured domains of competence of XCS, LC, Decision Forest, and NN in complexity space (Complexity Metric 1 vs. Metric 2)]
Domain of Competence Experiment
Use a set of 9 complexity measures
Boundary, Pretop, IntraInter, NonLinNN, NonLinLP,
Fisher, MaxEff, VolumeOverlap, Npts/Ndim
Characterize 392 two-class problems from UCI data,
all shown to be linearly non-separable
Evaluate 6 classifiers
NN    (1-nearest neighbor)
LP    (linear classifier by linear programming)
Odt   (oblique decision tree)
Pdfc  (random subspace decision forest; ensemble method)
Bdfc  (bagging based decision forest; ensemble method)
XCS   (a genetic-algorithm based classifier)
Identifiable Domains of Competence by NN and LP
Best Classifier for Benchmarking Data
Less Identifiable Domains of Competence
Regions in complexity space where the best classifier is (nn, lp, or odt)
vs. an ensemble technique
[Figure: best-classifier scatter in the Boundary-NonLinNN, IntraInter-Pretop, and MaxEff-VolumeOverlap planes; • ensemble, + nn/lp/odt]
Difficulties in Estimating Data Complexity
Apparent vs. True Complexity:
Uncertainty in Measures due to Sampling Density
Problem may appear deceptively simple or complex with small samples
[Figure: boundary estimates from samples of 2, 10, 100, 500, and 1000 points]
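The sampling effect can be reproduced numerically: estimate a simple complexity measure (a one-dimensional Fisher ratio is used here as a stand-in) from samples of increasing size and watch the spread of estimates shrink. The distributions and sample sizes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def fisher_ratio_1d(x0, x1):
    """Fisher ratio (mu0 - mu1)^2 / (var0 + var1) for 1-D samples."""
    return (x0.mean() - x1.mean()) ** 2 / (x0.var() + x1.var())

# Two overlapping Gaussian classes; population ratio = (2-0)^2/(1+1) = 2
def estimate(n):
    return fisher_ratio_1d(rng.normal(0, 1, n), rng.normal(2, 1, n))

for n in (5, 50, 500):
    est = [estimate(n) for _ in range(200)]
    print(f"n={n:3d}  mean={np.mean(est):.2f}  std={np.std(est):.2f}")
```

The standard deviation of the estimate falls as the sample grows, mirroring the deceptive simplicity or complexity seen at small sample sizes.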
Uncertainty of Estimates at Two Levels
Sparse training data in each problem &
complex geometry cause ill-posedness of
class boundaries
(uncertainty in feature space)
Sparse sample of problems causes
difficulty in identifying regions of
dominant competence
(uncertainty in complexity space)
Complexity Estimates and Dimensionality Reduction
Feature selection/transformation may change the difficulty
of a classification problem:
• Widening the gap between classes
• Compressing the discriminatory information
• Removing irrelevant dimensions
It is often unclear to what extent these happen
We seek quantitative description of such changes
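A sketch of greedy forward feature selection with a generic subset score (the Fisher-style score and its use here are illustrative assumptions, not the exact procedure from the study):

```python
import numpy as np

def subset_score(X, y):
    """Fisher-style score for a feature subset: squared distance between
    class means over the sum of within-class variances."""
    a, b = X[y == 0], X[y == 1]
    return np.sum((a.mean(axis=0) - b.mean(axis=0)) ** 2) / (
        np.sum(a.var(axis=0)) + np.sum(b.var(axis=0)))

def forward_select(X, y, k):
    """Greedy forward selection: repeatedly add the single feature that
    most improves the subset score, until k features are chosen."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < k and remaining:
        best = max(remaining,
                   key=lambda j: subset_score(X[:, selected + [j]], y))
        selected.append(best)
        remaining.remove(best)
    return selected
```

Running it on data where only one feature carries class information picks that feature first, which is exactly the kind of change in apparent complexity the slide discusses.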
[Diagram: feature selection and discrimination]
Spread of classification accuracy and geometrical
complexity due to forward feature selection
[Figure: FFS subsets, all data sets; Boundary measure versus 1NN classification error. x-axis: Boundary (10 to 90); y-axis: 1NN error (0 to 0.7). Data sets: spectra1, spectra2, spectra3, colon, ovarian]
Conclusions
Summary: Early Discoveries
• Problems distribute in a continuum in complexity space
• Several key measures provide independent characterization
• There exist identifiable domains of classifiers' dominant competence
• Sparse sampling, feature selection, and feature transformation induce variability in complexity estimates
For the Future
Further progress in statistical learning will need systematic,
scientific evaluation of the algorithms with problems that are
difficult for different reasons.
A “problem synthesizer” will be useful to provide a complete
evaluation platform, and reveal the “blind spots” of current
learning algorithms.
Rigorous statistical characterization of complexity estimates from
limited training data will help gauge the uncertainty, and
determine applicability of data complexity methods.
Ongoing:
DCol: Data Complexity Library
ICPR 2010 Contest on Domain of Dominant Competence