Discrimination and
Classification
Discrimination
Situation:
We have two or more populations, π1, π2, etc. (possibly p-variate normal).
The populations are known (or we have data from each population).
We have data for a new case (population unknown) and we want to identify the population to which the new case belongs.
The Basic Problem
Suppose that the data from a new case, x1, … , xp, has joint density function either:
π1: g(x1, … , xp) or
π2: h(x1, … , xp)
We want to make the decision
D1: Classify the case in π1 (g is the correct distribution), or
D2: Classify the case in π2 (h is the correct distribution).
The Two Types of Errors
1. Misclassifying the case in π1 when it actually lies in π2.
   Let P[1|2] = P[D1|π2] = probability of this type of error.
2. Misclassifying the case in π2 when it actually lies in π1.
   Let P[2|1] = P[D2|π1] = probability of this type of error.
This is similar to Type I and Type II errors in hypothesis testing.
Note:
A discrimination scheme is defined by splitting p-dimensional space into two regions:
1. C1 = the region where we make the decision D1 (the decision to classify the case in π1).
2. C2 = the region where we make the decision D2 (the decision to classify the case in π2).
There are several approaches to determining the regions C1 and C2, all concerned with taking into account the probabilities of misclassification P[2|1] and P[1|2]:
1. Set up the regions C1 and C2 so that one of the probabilities of misclassification, P[2|1] say, is at some low acceptable value α. Accept the resulting level of the other probability of misclassification, P[1|2] = β.
2. Set up the regions C1 and C2 so that the total probability of misclassification,
   P[Misclassification] = P[1] P[2|1] + P[2] P[1|2],
   is minimized, where
   P[1] = P[the case belongs to π1] and
   P[2] = P[the case belongs to π2].
3. Set up the regions C1 and C2 so that the total expected cost of misclassification,
   E[Cost of Misclassification] = ECM = c2|1 P[1] P[2|1] + c1|2 P[2] P[1|2],
   is minimized, where P[1] and P[2] are as above, and
   c2|1 = the cost of misclassifying the case in π2 when the case belongs to π1,
   c1|2 = the cost of misclassifying the case in π1 when the case belongs to π2.
A numerical sketch of the last two criteria follows this list.
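For concreteness, here is a small arithmetic sketch of approaches 2 and 3 in Python. All values (priors, costs, and error rates of some given rule) are hypothetical, chosen only for illustration:

# Hypothetical values chosen only for illustration.
P1, P2 = 0.6, 0.4        # P[1], P[2]: prior probabilities
c2_1, c1_2 = 10.0, 50.0  # c2|1, c1|2: misclassification costs
P2_1, P1_2 = 0.05, 0.10  # P[2|1], P[1|2]: error rates of a given rule

# Approach 2: total probability of misclassification
P_mis = P1 * P2_1 + P2 * P1_2               # 0.6*0.05 + 0.4*0.10 = 0.070

# Approach 3: expected cost of misclassification
ECM = c2_1 * P1 * P2_1 + c1_2 * P2 * P1_2   # 10*0.030 + 50*0.040 = 2.3

print(P_mis, ECM)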
The Optimal Classification Rule
Suppose that the data x1, … , xp has joint density function

f(x1, … , xp ; θ)

where θ is either θ1 or θ2. Let

g(x1, … , xp) = f(x1, … , xp ; θ1) and
h(x1, … , xp) = f(x1, … , xp ; θ2).

We want to make the decision
D1: θ = θ1 (g is the correct distribution) against
D2: θ = θ2 (h is the correct distribution).
Then the optimal regions (minimizing ECM, the expected cost of misclassification) for making the decisions D1 and D2, respectively, are

$$C_1 = \left\{ (x_1, \ldots, x_p) \;:\; \frac{L(\theta_1)}{L(\theta_2)} = \frac{g(x_1, \ldots, x_p)}{h(x_1, \ldots, x_p)} \geq k \right\}$$

and

$$C_2 = \left\{ (x_1, \ldots, x_p) \;:\; \frac{L(\theta_1)}{L(\theta_2)} = \frac{g(x_1, \ldots, x_p)}{h(x_1, \ldots, x_p)} < k \right\}$$

where

$$k = \frac{c_{1|2}\, P[2]}{c_{2|1}\, P[1]}$$
Proof:
ECM = E[Cost of Misclassification] = c2|1 P[1] P[2|1] + c1|2 P[2] P[1|2]

Now

$$P[1|2] = \int_{C_1} h(x_1, \ldots, x_p)\, dx_1 \cdots dx_p$$

and

$$P[2|1] = \int_{C_2} g(x_1, \ldots, x_p)\, dx_1 \cdots dx_p = 1 - \int_{C_1} g(x_1, \ldots, x_p)\, dx_1 \cdots dx_p$$

so that

$$\mathrm{ECM} = c_{2|1} P[1] \left( 1 - \int_{C_1} g(x_1, \ldots, x_p)\, dx_1 \cdots dx_p \right) + c_{1|2} P[2] \int_{C_1} h(x_1, \ldots, x_p)\, dx_1 \cdots dx_p$$

Therefore

$$\mathrm{ECM} = \int_{C_1} \left[ c_{1|2} P[2]\, h(x_1, \ldots, x_p) - c_{2|1} P[1]\, g(x_1, \ldots, x_p) \right] dx_1 \cdots dx_p \;+\; c_{2|1} P[1]$$

Thus ECM is minimized if C1 contains all of the points (x1, …, xp) such that the integrand is negative:

$$c_{1|2} P[2]\, h(x_1, \ldots, x_p) - c_{2|1} P[1]\, g(x_1, \ldots, x_p) < 0$$

that is,

$$\frac{g(x_1, \ldots, x_p)}{h(x_1, \ldots, x_p)} \geq \frac{c_{1|2} P[2]}{c_{2|1} P[1]} = k$$

(points where the integrand is zero contribute nothing to the ECM and may be assigned to either region).
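A minimal sketch of this minimum-ECM rule in Python. Everything concrete here (the univariate normal densities and the particular costs and priors) is a hypothetical choice made only so the snippet runs:

from scipy.stats import norm

g = norm(0.0, 1.0).pdf   # density under pi_1 (hypothetical)
h = norm(2.0, 1.0).pdf   # density under pi_2 (hypothetical)
P1, P2 = 0.5, 0.5        # prior probabilities
c2_1, c1_2 = 1.0, 1.0    # misclassification costs

k = (c1_2 * P2) / (c2_1 * P1)   # k = c1|2 P[2] / (c2|1 P[1])

def classify(x):
    """Decide D1 (classify in pi_1) iff g(x)/h(x) >= k."""
    return "pi_1" if g(x) / h(x) >= k else "pi_2"

for x in (-0.5, 1.0, 2.5):
    print(x, classify(x))   # with k = 1 the boundary falls at x = 1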
Fisher's Linear Discriminant Function
Suppose that x = (x1, … , xp)′ is data from a p-variate normal distribution with mean vector either μ1 or μ2, and that the covariance matrix Σ is the same for both populations π1 and π2. Then

$$g(\mathbf{x}) = \frac{1}{(2\pi)^{p/2} \left|\Sigma\right|^{1/2}}\, e^{-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_1)' \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_1)}$$

and

$$h(\mathbf{x}) = \frac{1}{(2\pi)^{p/2} \left|\Sigma\right|^{1/2}}\, e^{-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_2)' \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_2)}$$
The Neyman-Pearson Lemma states that we should classify into populations π1 and π2 using the likelihood ratio

$$\lambda = \frac{g(\mathbf{x})}{h(\mathbf{x})} = \frac{\dfrac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}}\, e^{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_1)'\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}_1)}}{\dfrac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}}\, e^{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_2)'\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}_2)}} = e^{\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_2)'\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}_2) - \frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_1)'\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}_1)}$$
That is, make the decision
D1: population is π1
if λ ≥ k, or

$$\ln \lambda = \tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_2)'\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}_2) - \tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_1)'\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}_1) \geq \ln k$$

or

$$(\mathbf{x}-\boldsymbol{\mu}_2)'\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}_2) - (\mathbf{x}-\boldsymbol{\mu}_1)'\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}_1) \geq 2\ln k$$

or, expanding both quadratic forms,

$$\mathbf{x}'\Sigma^{-1}\mathbf{x} - 2\mathbf{x}'\Sigma^{-1}\boldsymbol{\mu}_2 + \boldsymbol{\mu}_2'\Sigma^{-1}\boldsymbol{\mu}_2 - \mathbf{x}'\Sigma^{-1}\mathbf{x} + 2\mathbf{x}'\Sigma^{-1}\boldsymbol{\mu}_1 - \boldsymbol{\mu}_1'\Sigma^{-1}\boldsymbol{\mu}_1 \geq 2\ln k$$

and

$$(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)'\Sigma^{-1}\mathbf{x} \geq \ln k + \tfrac{1}{2}\left(\boldsymbol{\mu}_1'\Sigma^{-1}\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2'\Sigma^{-1}\boldsymbol{\mu}_2\right)$$
Finally we make the decision
D1: population is π1
if

$$\mathbf{a}'\mathbf{x} \geq K$$

where

$$\mathbf{a} = \Sigma^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2), \qquad K = \ln k + \tfrac{1}{2}\left(\boldsymbol{\mu}_1'\Sigma^{-1}\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2'\Sigma^{-1}\boldsymbol{\mu}_2\right), \qquad k = \frac{c_{1|2}\, P[2]}{c_{2|1}\, P[1]}$$

Note: k = 1 and ln k = 0 if c1|2 = c2|1 and P[1] = P[2], in which case

$$K = \tfrac{1}{2}\left(\boldsymbol{\mu}_1'\Sigma^{-1}\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2'\Sigma^{-1}\boldsymbol{\mu}_2\right) = \tfrac{1}{2}(\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2)'\Sigma^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$$
The function

$$\mathbf{a}'\mathbf{x} = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)'\Sigma^{-1}\mathbf{x}$$

is called Fisher's linear discriminant function.
[Figure: the two populations π1 and π2, with means μ1 and μ2, separated by the boundary a′x = (μ1 − μ2)′Σ⁻¹x = K.]
In the case where the populations are unknown but estimated from data, Fisher's linear discriminant function is

$$\hat{\mathbf{a}}'\mathbf{x} = (\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)' S^{-1} \mathbf{x}$$

where x̄1 and x̄2 are the two sample mean vectors and S is the pooled sample covariance matrix.
[Figure: a pictorial representation of Fisher's procedure for two populations, showing the two samples scattered in the (x1, x2) plane and the line separating the regions "Classify as π1" and "Classify as π2".]
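A minimal sketch of the estimated rule in Python with NumPy (the two small samples are synthetic, generated only so the snippet runs; equal costs and priors are assumed, so ln k = 0):

import numpy as np

rng = np.random.default_rng(0)
cov = [[1.0, 0.3], [0.3, 1.0]]
X1 = rng.multivariate_normal([2.0, 3.0], cov, size=30)  # sample from pi_1
X2 = rng.multivariate_normal([0.0, 1.0], cov, size=30)  # sample from pi_2

xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
n1, n2 = len(X1), len(X2)

# Pooled sample covariance matrix S
S = ((n1 - 1) * np.cov(X1, rowvar=False) +
     (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)

a_hat = np.linalg.solve(S, xbar1 - xbar2)  # a-hat = S^{-1}(xbar1 - xbar2)
K_hat = 0.5 * a_hat @ (xbar1 + xbar2)      # K-hat with ln k = 0

def classify(x):
    """Decide pi_1 iff a-hat' x >= K-hat."""
    return "pi_1" if a_hat @ x >= K_hat else "pi_2"

print(classify(np.array([2.0, 3.0])))  # a point near mu_1: expect pi_1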
Example 1

π1: Riding-mower owners              π2: Nonowners
x1 (Income    x2 (Lot size           x1 (Income    x2 (Lot size
in $1000s)    in 1000 sq ft)         in $1000s)    in 1000 sq ft)
  20.0           9.2                   25.0           9.8
  28.5           8.4                   17.6          10.4
  21.6          10.8                   21.6           8.6
  20.5          10.4                   14.4          10.2
  29.0          11.8                   28.0           8.8
  36.7           9.6                   16.4           8.8
  36.0           8.8                   19.8           8.0
  27.6          11.2                   22.0           9.2
  23.0          10.0                   15.8           8.2
  31.0          10.4                   11.0           9.4
  17.0          11.0                   17.0           7.0
  27.0          10.0                   21.0           7.4
[Figure: scatter plot of Lot Size (in thousands of square feet) versus Income (in thousands of dollars), with Riding Mower owners and Nonowners marked separately.]
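The same computation applied to the riding-mower data (a sketch; the arrays transcribe the tables above, and equal costs and priors are assumed):

import numpy as np

# pi_1: riding-mower owners (income in $1000s, lot size in 1000 sq ft)
X1 = np.array([[20.0, 9.2], [28.5, 8.4], [21.6, 10.8], [20.5, 10.4],
               [29.0, 11.8], [36.7, 9.6], [36.0, 8.8], [27.6, 11.2],
               [23.0, 10.0], [31.0, 10.4], [17.0, 11.0], [27.0, 10.0]])
# pi_2: nonowners
X2 = np.array([[25.0, 9.8], [17.6, 10.4], [21.6, 8.6], [14.4, 10.2],
               [28.0, 8.8], [16.4, 8.8], [19.8, 8.0], [22.0, 9.2],
               [15.8, 8.2], [11.0, 9.4], [17.0, 7.0], [21.0, 7.4]])

xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
n1, n2 = len(X1), len(X2)
S = ((n1 - 1) * np.cov(X1, rowvar=False) +
     (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)

a_hat = np.linalg.solve(S, xbar1 - xbar2)
K_hat = 0.5 * a_hat @ (xbar1 + xbar2)

# Resubstitution check: count correctly classified cases
correct = int((X1 @ a_hat >= K_hat).sum() + (X2 @ a_hat < K_hat).sum())
print(a_hat, K_hat, f"{correct}/{n1 + n2} correct")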
Example 2
Annual financial data are collected for firms approximately 2 years prior to bankruptcy and for financially sound firms at about the same point in time. The data on the four variables
• x1 = CF/TD = (cash flow)/(total debt),
• x2 = NI/TA = (net income)/(total assets),
• x3 = CA/CL = (current assets)/(current liabilities), and
• x4 = CA/NS = (current assets)/(net sales)
are given in the following tables (a sketch of the corresponding discriminant analysis follows them):
Bankrupt Firms

Firm   x1 (CF/TD)   x2 (NI/TA)   x3 (CA/CL)   x4 (CA/NS)
 1      -0.4485      -0.4106       1.0865       0.4526
 2      -0.5633      -0.3114       1.5314       0.1642
 3       0.0643       0.0156       1.0077       0.3978
 4      -0.0721      -0.0930       1.4544       0.2589
 5      -0.1002      -0.0917       1.5644       0.6683
 6      -0.1421      -0.0651       0.7066       0.2794
 7       0.0351       0.0147       1.5046       0.7080
 8      -0.6530      -0.0566       1.3737       0.4032
 9       0.0724      -0.0076       1.3723       0.3361
10      -0.1353      -0.1433       1.4196       0.4347
11      -0.2298      -0.2961       0.3310       0.1824
12       0.0713       0.0205       1.3124       0.2497
13       0.0109       0.0011       2.1495       0.6969
14      -0.2777      -0.2316       1.1918       0.6601
15       0.1454       0.0500       1.8762       0.2723
16       0.3703       0.1098       1.9914       0.3828
17      -0.0757      -0.0821       1.5077       0.4215
18       0.0451       0.0263       1.6756       0.9494
19       0.0115      -0.0032       1.2602       0.6038
20       0.1227       0.1055       1.1434       0.1655
21      -0.2843      -0.2703       1.2722       0.5128

Nonbankrupt Firms

Firm   x1 (CF/TD)   x2 (NI/TA)   x3 (CA/CL)   x4 (CA/NS)
 1       0.5135       0.1001       2.4871       0.5368
 2       0.0769       0.0195       2.0069       0.5304
 3       0.3776       0.1075       3.2651       0.3548
 4       0.1933       0.0473       2.2506       0.3309
 5       0.3248       0.0718       4.2401       0.6279
 6       0.3132       0.0511       4.4500       0.6852
 7       0.1184       0.0499       2.5210       0.6925
 8      -0.0173       0.0233       2.0538       0.3484
 9       0.2169       0.0779       2.3489       0.3970
10       0.1703       0.0695       1.7973       0.5174
11       0.1460       0.0518       2.1692       0.5500
12      -0.0985      -0.0123       2.5029       0.5778
13       0.1398      -0.0312       0.4611       0.2643
14       0.1379       0.0728       2.6123       0.5151
15       0.1486       0.0564       2.2347       0.5563
16       0.1633       0.0486       2.3080       0.1978
17       0.2907       0.0597       1.8381       0.3786
18       0.5383       0.1064       2.3293       0.4835
19      -0.3330      -0.0854       3.0124       0.4730
20       0.4875       0.0910       1.2444       0.1847
21       0.5603       0.1112       4.2918       0.4443
22       0.2029       0.0792       1.9936       0.3018
23       0.4746       0.1380       2.9166       0.4487
24       0.1661       0.0351       2.4527       0.1370
25       0.5808       0.0371       5.0594       0.1268
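As a rough cross-check on the SPSS output, here is a sketch of the same analysis with scikit-learn's LinearDiscriminantAnalysis. Only the first three firms of each group are transcribed, to keep the snippet short; fill in the remaining rows from the tables above:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

bankrupt = np.array([[-0.4485, -0.4106, 1.0865, 0.4526],   # firm 1
                     [-0.5633, -0.3114, 1.5314, 0.1642],   # firm 2
                     [ 0.0643,  0.0156, 1.0077, 0.3978]])  # firm 3 ...
sound = np.array([[0.5135, 0.1001, 2.4871, 0.5368],        # firm 1
                  [0.0769, 0.0195, 2.0069, 0.5304],        # firm 2
                  [0.3776, 0.1075, 3.2651, 0.3548]])       # firm 3 ...

X = np.vstack([bankrupt, sound])
y = np.array([0] * len(bankrupt) + [1] * len(sound))  # 0 = bankrupt

lda = LinearDiscriminantAnalysis()  # pooled-covariance linear rule
lda.fit(X, y)
print(lda.coef_, lda.intercept_)    # estimated discriminant function
print(lda.predict(X))               # resubstitution classifications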
Examples using SPSS
Classification or Cluster Analysis
Situation:
• We have multivariate (or univariate) data from one or several populations (the number of populations is unknown).
• We want to determine the number of populations and identify the populations.
Example
Table: Numerals in eleven languages

English  Norwegian  Danish  Dutch  German  French  Spanish  Italian  Polish    Hungarian  Finnish
one      en         en      een    ein     un      uno      uno      jeden     egy        yksi
two      to         to      twee   zwei    deux    dos      due      dwa       ketto      kaksi
three    tre        tre     drie   drei    trois   tres     tre      trzy      harom      kolme
four     fire       fire    vier   vier    quatre  cuatro   quattro  cztery    negy       nelja
five     fem        fem     vijf   funf    cinq    cinco    cinque   piec      ot         viisi
six      seks       seks    zes    sechs   six     seis     sei      szesc     hat        kuusi
seven    sju        syv     zeven  sieben  sept    siete    sette    siedem    het        seitseman
eight    atte       otte    acht   acht    huit    ocho     otto     osiem     nyolc      kahdeksan
nine     ni         ni      negen  neun    neuf    nueve    nove     dziewiec  kilenc     yhdeksan
ten      ti         ti      tien   zehn    dix     diez     dieci    dziesiec  tiz        kymmenen
Distance Matrix
Distance = # of numerals (1 to 10) differing in first letter
     E   N   Da  Du  G   Fr  Sp  I   P   H   Fi
E    0
N    2   0
Da   2   1   0
Du   7   5   6   0
G    6   4   5   5   0
Fr   6   6   6   9   7   0
Sp   6   6   5   9   7   2   0
I    6   6   5   9   7   1   1   0
P    7   7   6  10   8   5   3   4   0
H    9   8   8   8   9  10  10  10  10   0
Fi   9   9   9   9   9   9   9   9   9   8   0
Hierarchical Clustering Methods
The following are the steps in the agglomerative hierarchical clustering algorithm for grouping N objects (items or variables); a runnable sketch follows the list.
1. Start with N clusters, each consisting of a single entity, and an N × N symmetric matrix (table) of distances (or similarities) D = (dij).
2. Search the distance matrix for the nearest (most similar) pair of clusters. Let the distance between the "most similar" clusters U and V be dUV.
3. Merge clusters U and V. Label the newly formed cluster (UV). Update the entries in the distance matrix by
   a) deleting the rows and columns corresponding to clusters U and V, and
   b) adding a row and column giving the distances between cluster (UV) and the remaining clusters.
4. Repeat steps 2 and 3 a total of N − 1 times. (All objects will be in a single cluster at termination of this algorithm.) Record the identity of clusters that are merged and the levels (distances or similarities) at which the mergers take place.
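A minimal sketch of these steps using SciPy's hierarchical clustering routines (the eight 2-variable objects are synthetic, generated only so the snippet runs):

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 2))      # N = 8 objects, 2 variables

d = pdist(X)                     # step 1: pairwise distance matrix
Z = linkage(d, method='single')  # steps 2-4; also 'complete', 'average'

# Each row of Z records one merger: the two clusters joined,
# the level (distance) of the merger, and the new cluster's size.
print(Z)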
Different methods of computing inter-cluster distance
[Figure: two clusters, {1, 2} and {3, 4, 5}, with the distance used by each method marked.]
Single linkage: the cluster distance is the smallest distance between a member of one cluster and a member of the other (d24 in the figure).
Complete linkage: the cluster distance is the largest such distance (d15 in the figure).
Average linkage: the cluster distance is the average of all pairwise distances, here (d13 + d14 + d15 + d23 + d24 + d25) / 6.
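In code form the three rules differ only in how they aggregate the pairwise distances between members. A small NumPy sketch (the five synthetic points stand in for the figure's clusters {1, 2} and {3, 4, 5}; indices are 0-based):

import numpy as np

def cluster_distance(A, B, d, method="single"):
    """Distance between clusters A and B (lists of object indices),
    given a full symmetric distance matrix d."""
    pair = [d[i, j] for i in A for j in B]
    if method == "single":    # smallest pairwise distance
        return min(pair)
    if method == "complete":  # largest pairwise distance
        return max(pair)
    return sum(pair) / len(pair)  # average linkage

rng = np.random.default_rng(2)
pts = rng.normal(size=(5, 2))
d = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
for m in ("single", "complete", "average"):
    print(m, cluster_distance([0, 1], [2, 3, 4], d, m))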
Example
To illustrate the single linkage algorithm, we consider the
hypothetical distance matrix between pairs of five objects given
below:
D = {dik} =

        1    2    3    4    5
   1    0
   2    9    0
   3    3    7    0
   4    6    5    9    0
   5   11   10    2    8    0
Treating each object as a cluster, the clustering
begins by merging the two closest items (3 & 5).
To implement the next level of clustering we
need to compute the distances between cluster
(35) and the remaining objects:
d(35)1 = min{3,11} = 3
d(35)2 = min{7,10} = 7
d(35)4 = min{9,8} = 8
The new distance matrix becomes:

        (35)   1    2    4
 (35)    0
   1     3    0
   2     7    9    0
   4     8    6    5    0
The next two closest clusters ((35) & 1) are merged to form cluster (135). Distances between this cluster and the remaining clusters become:
d(135)2 = min{7,9} = 7
d(135)4 = min{8,6} = 6
The distance matrix now becomes:

        (135)   2    4
 (135)    0
    2     7    0
    4     6    5    0
Continuing, the next two closest clusters (2 & 4) are merged, at distance 5, to form cluster (24). The distance between this cluster and the remaining cluster becomes:
d(135)(24) = min{d(135)2, d(135)4} = min{7,6} = 6
The final distance matrix now becomes:

        (135)  (24)
 (135)    0
  (24)    6    0

At the final step, clusters (135) and (24) are merged, at distance 6, to form the single cluster (12345) of all five items. The results of this algorithm can be summarized graphically in the following "dendrogram".
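The same merge history can be reproduced with SciPy; the condensed distance vector lists d12, d13, d14, d15, d23, d24, d25, d34, d35, d45 from the matrix above:

from scipy.cluster.hierarchy import linkage

d = [9, 3, 6, 11, 7, 5, 10, 9, 2, 8]  # condensed distances, objects 1-5
Z = linkage(d, method='single')
print(Z)
# Expected merge levels (SciPy labels the objects 0-4):
# 2 (3 & 5), 3 ((35) & 1), 5 (2 & 4), 6 ((135) & (24))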
[Figure: dendrograms for clustering the 11 languages on the basis of the ten numerals.]
Example 2: Public Utility data

Company                            X1    X2    X3    X4    X5     X6    X7    X8
 1 Arizona Public Service         1.06   9.2  151  54.4   1.6   9077   0.0  0.628
 2 Boston Edison Co               0.89  10.3  202  57.9   2.2   5088  25.3  1.555
 3 Central Louisiana Electric Co  1.43  15.4  113  53.0   3.4   9212   0.0  1.058
 4 Commonwealth Edison Co         1.02  11.2  168  56.0   0.3   6423  34.3  0.700
 5 Consolidated Edison Co (NY)    1.49   8.8  192  51.2   1.0   3300  15.6  2.044
 6 Florida Power & Light Co       1.32  13.5  111  60.0  -2.2  11127  22.5  1.241
 7 Hawaiian Electric Co           1.22  12.2  175  67.6   2.2   7642   0.0  1.652
 8 Idaho Power Co                 1.10   9.2  245  57.0   3.3  13082   0.0  0.309
 9 Kentucky Utilities Co          1.34  13.0  168  60.4   7.2   8406   0.0  0.862
10 Madison Gas & Electric Co      1.12  12.4  197  53.0   2.7   6455  39.2  0.623
11 Nevada Power Co                0.75   7.5  173  51.5   6.5  17441   0.0  0.768
12 New England Electric Co        1.13  10.9  178  62.0   3.7   6154   0.0  1.897
13 Northern States Power Co       1.15  12.7  199  53.7   6.4   7179  50.2  0.527
14 Oklahoma Gas & Electric Co     1.09  12.0   96  49.8   1.4   9673   0.0  0.588
15 Pacific Gas & Electric Co      0.96   7.6  164  62.2  -0.1   6468   0.9  1.400
16 Puget Sound Power & Light Co   1.16   9.9  252  56.0   9.2  15991   0.0  0.620
17 San Diego Gas & Electric Co    0.76   6.4  136  61.9   9.0   5714   8.3  1.920
18 The Southern Co                1.05  12.6  150  56.7   2.7  10140   0.0  1.108
19 Texas Utilities Co             1.16  11.7  104  54.0  -2.1  13507   0.0  0.636
20 Wisconsin Electric Power Co    1.20  11.8  148  59.9   3.5   7287  41.1  0.702
21 United Illuminating Co         1.04   8.6  204  61.0   3.5   6650   0.0  2.116
22 Virginia Electric & Power Co   1.07   9.3  174  54.3   5.9  10093  26.6  1.306
X1: Fixed charge coverage ratio (income/debt)
X2: Rate of return on capital
X3: Cost per KW capacity in place
X4: Annual load factor
X5: Peak KWH demand growth from 1974 to 1975
X6: Sales (KWH per year)
X7: Percent Nuclear
X8: Total fuel costs (cents per KWH)
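The distance table that follows was presumably computed from these eight variables after standardizing each one (an assumption; the source does not state the preprocessing). A sketch of that computation:

import numpy as np
from scipy.spatial.distance import pdist, squareform

# X: the 22 x 8 data matrix from the table above (rows = companies).
# Only the first two rows are transcribed; fill in the rest.
X = np.array([[1.06,  9.2, 151, 54.4, 1.6, 9077,  0.0, 0.628],
              [0.89, 10.3, 202, 57.9, 2.2, 5088, 25.3, 1.555]])

Z = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize each variable
D = squareform(pdist(Z))                  # Euclidean firm-to-firm distances
print(np.round(D, 2))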
Table: Distances between 22 Utilities (lower triangle, shown in two panels)

Firm    1     2     3     4     5     6     7     8     9    10    11
 1   0.00
 2   3.10  0.00
 3   3.68  4.92  0.00
 4   2.46  2.16  4.11  0.00
 5   4.12  3.85  4.47  4.13  0.00
 6   3.61  4.22  2.99  3.20  4.60  0.00
 7   3.90  3.45  4.22  3.97  4.60  3.35  0.00
 8   2.74  3.89  4.99  3.69  5.16  4.91  4.36  0.00
 9   3.25  3.96  2.75  3.75  4.49  3.73  2.80  3.59  0.00
10   3.10  2.71  3.93  1.49  4.05  3.83  4.51  3.67  3.57  0.00
11   3.49  4.79  5.90  4.86  6.46  6.00  6.00  3.46  5.18  5.08  0.00
12   3.22  2.43  4.03  3.50  3.60  3.74  1.66  4.06  2.74  3.94  5.21
13   3.96  3.43  4.39  2.58  4.76  4.55  5.01  4.14  3.66  1.41  5.31
14   2.11  4.32  2.74  3.23  4.82  3.47  4.91  4.34  3.82  3.61  4.32
15   2.59  2.50  5.16  3.19  4.26  4.07  2.93  3.85  4.11  4.26  4.74
16   4.03  4.84  5.26  4.97  5.82  5.84  5.04  2.20  3.63  4.53  3.43
17   4.40  3.62  6.36  4.89  5.63  6.10  4.58  5.43  4.90  5.48  4.75
18   1.88  2.90  2.72  2.65  4.34  2.85  2.95  3.24  2.43  3.07  3.95
19   2.41  4.63  3.18  3.46  5.13  2.58  4.52  4.11  4.11  4.13  4.52
20   3.17  3.00  3.73  1.82  4.39  2.91  3.54  4.09  2.95  2.05  5.35
21   3.45  2.32  5.09  3.88  3.64  4.63  2.68  3.98  3.74  4.36  4.88
22   2.51  2.42  4.11  2.58  3.77  4.03  4.00  3.24  3.21  2.56  3.44

Firm   12    13    14    15    16    17    18    19    20    21    22
12   0.00
13   4.50  0.00
14   4.34  4.39  0.00
15   2.33  5.10  4.24  0.00
16   4.62  4.41  5.17  5.18  0.00
17   3.50  5.61  5.56  3.40  5.56  0.00
18   2.45  3.78  2.30  3.00  3.97  4.43  0.00
19   4.41  5.01  1.88  4.03  5.23  6.09  2.47  0.00
20   3.43  2.23  3.74  3.78  4.82  4.87  2.92  3.90  0.00
21   1.38  4.94  4.93  2.10  4.57  3.10  3.19  4.97  4.15  0.00
22   3.00  2.74  3.51  3.35  3.46  3.63  2.55  3.97  2.62  3.01  0.00
[Figure: dendrogram for the cluster analysis of the N = 22 utility companies; Euclidean distance, average linkage.]
[Figure: dendrogram for the cluster analysis of the N = 22 utility companies; Euclidean distance, single linkage.]