Decision-Tree Induction & Decision-Rule Induction
Evgueni Smirnov
Overview
1. Instances, Concepts, Languages
2. Decision Trees
3. Decision Rules
4. Evaluation Techniques
5. Intro to Weka
Instances, Concepts, Languages
A concept is a set of objects in a world that are unified by a reason. A reason may be a similar appearance, structure or function, as in the concept of friendly robots.
Example: the set {children, photos, cat, diplomas} can be viewed as a concept "Most important things to take out of your apartment when it catches fire".
Instances, Concepts, Languages

[Figure: the friendly-robots example. Instances are described in an instance language Li by attribute-value pairs such as body = round, smiling = yes, holding = flag, color = yellow. Concepts such as "friendly robots" are described in a concept language Lc (e.g., smiling = yes). A matching relation M connects descriptions in Lc with the instances in Li that they cover.]

A concept-learning task is specified as <Li, Lc, M, <I+, I->>, where I+ is a set of positive instances and I- is a set of negative instances of the target concept. When the instances are not split into positive and negative, the task is written <Li, Lc, M, I>.
Decision Trees
• Decision trees
• Appropriate problems for decision trees
• Entropy and Information Gain
• The ID3 algorithm
• Avoiding Overfitting via Pruning
• Handling Continuous-Valued Attributes
• Handling Missing Attribute Values
Decision Trees
Definition: A decision tree is a tree such that:
• Each internal node tests an attribute
• Each branch corresponds to an attribute value
• Each leaf node assigns a classification
[Figure: a decision tree. The root tests Outlook: Sunny → Humidity (High → no, Normal → yes); Overcast → yes; Rainy → Windy (False → yes, True → no).]
Data Set for Playing Tennis

Outlook   Temp.  Humidity  Windy  Play
Sunny     Hot    High      False  no
Sunny     Mild   High      False  no
Sunny     Hot    High      True   no
Sunny     Cool   Normal    False  yes
Overcast  Hot    High      False  yes
Rainy     Mild   Normal    False  yes
Rainy     Mild   High      False  yes
Sunny     Mild   Normal    True   yes
Rainy     Cool   Normal    False  yes
Overcast  Mild   High      True   yes
Rainy     Cool   Normal    True   no
Overcast  Hot    Normal    False  yes
Overcast  Cool   Normal    True   yes
Rainy     Mild   High      True   no
Decision Tree For Playing Tennis

[Figure: the induced tree. Outlook: Sunny → Humidity (High → no, Normal → yes); Overcast → yes; Rainy → Windy (False → yes, True → no).]
When to Consider Decision Trees
• Each instance is described by attributes with discrete values (e.g. Outlook = Sunny, etc.).
• The classification is over discrete values (e.g. yes/no).
• It is okay to have disjunctive descriptions: each path in the tree represents a conjunction of attribute values, and the tree as a whole represents a disjunction of such conjunctions. Any Boolean function can be represented!
• It is okay for the training data to contain errors: decision trees are robust to classification errors in the training data.
• It is okay for the training data to contain missing values: decision trees can be used even if instances have missing attributes.
Rules in Decision Trees

[Figure: the Playing Tennis tree. Outlook: Sunny → Humidity (High → no, Normal → yes); Overcast → yes; Rainy → Windy (False → yes, True → no).]
If Outlook = Sunny & Humidity = High then Play = no
If Outlook = Sunny & Humidity = Normal then Play = yes
If Outlook = Overcast then Play = yes
If Outlook = Rainy & Windy = False then Play = yes
If Outlook = Rainy & Windy = True then Play = no
Decision Tree Induction
Basic Algorithm:
1. A ← the "best" decision attribute for a node N.
2. Assign A as the decision attribute for the node N.
3. For each value of A, create a new descendant of the node N.
4. Sort the training examples to the leaf nodes.
5. IF the training examples are perfectly classified, THEN STOP;
ELSE iterate over the new leaf nodes.
Decision Tree Induction

[Figure: the root test Outlook partitions the training set:]

Outlook = Sunny:
Temp  Humidity  Windy  Play
Hot   High      False  no
Hot   High      True   no
Mild  High      False  no
Cool  Normal    False  yes
Mild  Normal    True   yes

Outlook = Overcast:
Temp  Humidity  Windy  Play
Hot   High      False  yes
Cool  Normal    True   yes

Outlook = Rainy:
Temp  Humidity  Windy  Play
Mild  High      False  yes
Cool  Normal    False  yes
Cool  Normal    True   no
Mild  Normal    False  yes
Mild  High      True   no
Entropy
Let S be a sample of training examples, p+ the proportion of positive examples in S, and p- the proportion of negative examples in S. Then entropy measures the impurity of S:

E(S) = -p+ log2 p+ - p- log2 p-
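As a concrete illustration (my own sketch, not from the slides), the formula translates to a few lines of Python:

```python
import math

def entropy(pos, neg):
    """Binary entropy: E(S) = -p+ log2 p+ - p- log2 p-.
    A zero-count class contributes 0 by the usual convention."""
    total = pos + neg
    return -sum((c / total) * math.log2(c / total)
                for c in (pos, neg) if c > 0)

# The Playing Tennis data set has 9 "yes" and 5 "no" instances:
print(round(entropy(9, 5), 2))  # 0.94
```

Entropy is 0 for a pure sample and reaches its maximum of 1 when the two classes are equally represented.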
Entropy Example from the Dataset
The Playing Tennis data set above has 9 yes and 5 no instances, so:

-(9/14) log2(9/14) ≈ 0.41
-(5/14) log2(5/14) ≈ 0.53
E(S) = 0.41 + 0.53 = 0.94
Information Gain
Information Gain is the expected reduction in entropy
caused by partitioning the instances according to a given
attribute.
Gain(S, A) = E(S) - Σ_{v ∈ Values(A)} (|Sv| / |S|) E(Sv)

where Sv = { s ∈ S | A(s) = v }. For a two-valued attribute, S is partitioned into Sv1 = { s ∈ S | A(s) = v1 } and Sv2 = { s ∈ S | A(s) = v2 }.
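The gain formula can likewise be sketched in Python (the dict-based instance encoding is an assumption of mine, not the slides'):

```python
import math
from collections import Counter

def entropy(labels):
    """E(S) over a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def gain(examples, attr, target="Play"):
    """Gain(S, A) = E(S) - sum over v of |Sv|/|S| * E(Sv)."""
    total = len(examples)
    g = entropy([e[target] for e in examples])
    for v in {e[attr] for e in examples}:
        sv = [e[target] for e in examples if e[attr] == v]
        g -= len(sv) / total * entropy(sv)
    return g

# The five Outlook = Sunny instances from the data set:
sunny = [
    {"Temp": "Hot",  "Humidity": "High",   "Windy": "False", "Play": "no"},
    {"Temp": "Hot",  "Humidity": "High",   "Windy": "True",  "Play": "no"},
    {"Temp": "Mild", "Humidity": "High",   "Windy": "False", "Play": "no"},
    {"Temp": "Cool", "Humidity": "Normal", "Windy": "False", "Play": "yes"},
    {"Temp": "Mild", "Humidity": "Normal", "Windy": "True",  "Play": "yes"},
]
print(round(gain(sunny, "Humidity"), 2))  # 0.97
print(round(gain(sunny, "Windy"), 2))     # 0.02
```

These values match the worked example below: Humidity splits the Sunny subset perfectly, while Windy barely reduces its entropy.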
Example

[Figure: the root test Outlook partitions the training set:]

Outlook = Sunny:
Temp  Humidity  Windy  Play
Hot   High      False  No
Hot   High      True   No
Mild  High      False  No
Cool  Normal    False  Yes
Mild  Normal    True   Yes

Outlook = Overcast:
Temp  Humidity  Windy  Play
Hot   High      False  Yes
Cool  Normal    True   Yes

Outlook = Rainy:
Temp  Humidity  Windy  Play
Mild  High      False  Yes
Cool  Normal    False  Yes
Cool  Normal    True   No
Mild  Normal    False  Yes
Mild  High      True   No
Which attribute should be tested here?
Gain(Ssunny, Humidity) = .970 - (3/5) 0.0 - (2/5) 0.0 = .970
Gain(Ssunny, Temperature) = .970 - (2/5) 0.0 - (2/5) 1.0 - (1/5) 0.0 = .570
Gain(Ssunny, Windy) = .970 - (2/5) 1.0 - (3/5) .918 = .019
ID3 Algorithm
Informally:
– Determine the attribute with the highest
information gain on the training set.
– Use this attribute as the root, create a branch for
each of the values the attribute can have.
– For each branch, repeat the process with subset
of the training set that is classified by that
branch.
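The informal procedure above can be rendered as a short recursive sketch. This is an assumed implementation, not the author's code; trees are nested dicts, leaves are class labels, and the gain helper is repeated here so the block is self-contained:

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def gain(examples, attr, target):
    g = entropy([e[target] for e in examples])
    for v in {e[attr] for e in examples}:
        sv = [e[target] for e in examples if e[attr] == v]
        g -= len(sv) / len(examples) * entropy(sv)
    return g

def id3(examples, attrs, target="Play"):
    """ID3 sketch: a leaf is a class label, an internal node a nested dict."""
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:            # perfectly classified: stop
        return labels[0]
    if not attrs:                        # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(examples, a, target))
    return {best: {v: id3([e for e in examples if e[best] == v],
                          [a for a in attrs if a != best], target)
                   for v in {e[best] for e in examples}}}

# On the Outlook = Sunny subset, Humidity has the highest gain,
# so it becomes the root of the subtree: High -> no, Normal -> yes.
sunny = [
    {"Temp": "Hot",  "Humidity": "High",   "Windy": "False", "Play": "no"},
    {"Temp": "Hot",  "Humidity": "High",   "Windy": "True",  "Play": "no"},
    {"Temp": "Mild", "Humidity": "High",   "Windy": "False", "Play": "no"},
    {"Temp": "Cool", "Humidity": "Normal", "Windy": "False", "Play": "yes"},
    {"Temp": "Mild", "Humidity": "Normal", "Windy": "True",  "Play": "yes"},
]
print(id3(sunny, ["Temp", "Humidity", "Windy"]))
```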
Hypothesis Space Search in ID3
• The hypothesis space is the
set of all decision trees
defined over the given set of
attributes.
• ID3's hypothesis space is a complete space; i.e., the target description is there!
• ID3 performs a simple-to-complex, hill-climbing search through this space.
Hypothesis Space Search in ID3
• The evaluation function is the information gain.
• ID3 maintains only a single current decision tree.
• ID3 performs no backtracking in its search.
• ID3 uses all training instances at each step of the search.
Overfitting
Definition: Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists some hypothesis h' ∈ H such that h has smaller error than h' over the training instances, but h' has a smaller error than h over the entire distribution of instances.
Reasons for Overfitting

[Figure: the Playing Tennis tree. Outlook: sunny → Humidity (high → no, normal → yes); overcast → yes; rainy → Windy (false → yes, true → no).]

• Noisy training instances. Consider a noisy training example:
Outlook = Sunny; Temp = Hot; Humidity = Normal; Wind = True; PlayTennis = No
This instance is sorted to the same leaf as the training instances:
Outlook = Sunny; Temp = Cool; Humidity = Normal; Wind = False; PlayTennis = Yes
Outlook = Sunny; Temp = Mild; Humidity = Normal; Wind = True; PlayTennis = Yes
Reasons for Overfitting

[Figure: to fit the noisy instance, the Sunny → Humidity = normal branch is grown further: a Windy test (false → yes; true → a Temp test with mild → yes, hot → no, cool → ?) replaces the single leaf yes.]

Outlook = Sunny; Temp = Hot; Humidity = Normal; Wind = True; PlayTennis = No
Outlook = Sunny; Temp = Cool; Humidity = Normal; Wind = False; PlayTennis = Yes
Outlook = Sunny; Temp = Mild; Humidity = Normal; Wind = True; PlayTennis = Yes
Reasons for Overfitting
• A small number of instances is associated with leaf nodes. In this case coincidental regularities may occur that are unrelated to the actual target concept.

[Figure: a scatter of positive and negative instances; a sparsely populated region is marked "area with probably wrong predictions".]
Approaches to Avoiding Overfitting
• Pre-pruning: stop growing the tree earlier,
before it reaches the point where it perfectly
classifies the training data
• Post-pruning: Allow the tree to overfit the
data, and then post-prune the tree.
Validation Set
• A validation set is a set of instances used to evaluate the utility of nodes in decision trees. The validation set has to be chosen so that it is unlikely to suffer from the same errors or fluctuations as the training set.
• Usually, before pruning, the training data is split randomly into a growing set and a validation set.
Reduced-Error Pruning
Split the data into growing and validation sets.
Pruning a decision node d consists of:
1. removing the subtree rooted at d,
2. making d a leaf node,
3. assigning d the most common classification of the training instances associated with d.

[Figure: the Playing Tennis tree. The Humidity node covers 3 instances on its high branch and 2 on its normal branch. Accuracy of the tree on the validation set is 90%.]
Reduced-Error Pruning

[Figure: the tree after pruning the Humidity node. Outlook: sunny → no; overcast → yes; rainy → Windy (false → yes, true → no). Accuracy of the tree on the validation set is 92.4%.]
Reduced-Error Pruning
Do until further pruning is harmful:
1. Evaluate the impact on the validation set of pruning each possible node (plus those below it).
2. Greedily remove the one that most improves validation set accuracy.

[Figure: the pruned tree. Outlook: sunny → no; overcast → yes; rainy → Windy (false → yes, true → no). Accuracy of the tree on the validation set is 92.4%.]
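One way to sketch the pruning operation in Python, assuming trees are nested dicts with class labels at the leaves. This is a simplified variant of my own that checks each subtree against the validation set bottom-up, rather than greedily re-evaluating the whole tree at every step:

```python
from collections import Counter

def classify(tree, inst):
    """Follow attribute tests until a leaf (a plain class label) is reached."""
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr][inst[attr]]
    return tree

def accuracy(tree, examples, target="Play"):
    return sum(classify(tree, e) == e[target] for e in examples) / len(examples)

def prune(tree, train, val, target="Play"):
    """Replace a node by a leaf labeled with the most common training class
    at that node, if validation accuracy does not drop. Assumes every branch
    of a grown tree has at least one training instance sorted to it."""
    if not isinstance(tree, dict):
        return tree
    attr = next(iter(tree))
    for v in tree[attr]:                          # prune children first
        subset = [e for e in train if e[attr] == v]
        tree[attr][v] = prune(tree[attr][v], subset, val, target)
    leaf = Counter(e[target] for e in train).most_common(1)[0][0]
    if accuracy(leaf, val, target) >= accuracy(tree, val, target):
        return leaf
    return tree

# A tiny example: the Windy test misclassifies half of the validation set,
# so it is replaced by the majority leaf "yes".
tree = {"Windy": {"False": "yes", "True": "no"}}
train = [{"Windy": "False", "Play": "yes"},
         {"Windy": "False", "Play": "yes"},
         {"Windy": "True",  "Play": "no"}]
val   = [{"Windy": "True",  "Play": "yes"},
         {"Windy": "False", "Play": "yes"}]
print(prune(tree, train, val))  # yes
```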
Rule Post-Pruning
1. Convert the tree to an equivalent set of rules.
2. Prune each rule independently of the others.
3. Sort the final rules by their estimated accuracy, and consider them in this sequence when classifying subsequent instances.

[Figure: the Playing Tennis tree and the rules read off its paths:]
IF (Outlook = Sunny) & (Humidity = High) THEN PlayTennis = No
IF (Outlook = Sunny) & (Humidity = Normal) THEN PlayTennis = Yes
……….
Continuous Valued Attributes
1. Create a set of discrete attributes that test the continuous attribute against candidate thresholds.
2. Apply Information Gain in order to choose the best attribute.

Temperature: 40  48  60  72  80  90
PlayTennis:  No  No  Yes Yes Yes No

Candidate tests: Temp > 54 and Temp > 85.
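The candidate thresholds (54 and 85 above) sit midway between adjacent values where the class label changes. A small sketch of that convention (the function name is mine):

```python
def candidate_thresholds(values, labels):
    """Midpoints between adjacent attribute values whose class labels differ."""
    pairs = sorted(zip(values, labels))
    return [(a + b) / 2
            for (a, la), (b, lb) in zip(pairs, pairs[1:])
            if la != lb]

temps = [40, 48, 60, 72, 80, 90]
plays = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(candidate_thresholds(temps, plays))  # [54.0, 85.0]
```

Each resulting threshold t defines a Boolean attribute Temp > t, which then competes on information gain like any discrete attribute.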
Missing Attribute Values
Strategies:
1. Assign the most common value of A among the other instances belonging to the same concept.
2. If node n tests the attribute A, assign the most common value of A among the other instances sorted to node n.
3. If node n tests the attribute A, assign a probability to each of the possible values of A, estimated from the observed frequencies of the values of A. These probabilities are used in the information gain measure (via entropy: E(S) = -p+ log2 p+ - p- log2 p-).
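Strategy 2 might be sketched as follows (dict-based instances and None as the missing-value marker are my assumptions):

```python
from collections import Counter

def impute_most_common(instances, attr):
    """Strategy 2: replace a missing value (None) of attr with the most
    common value of attr among the other instances sorted to this node."""
    observed = [e[attr] for e in instances if e[attr] is not None]
    fill = Counter(observed).most_common(1)[0][0]
    return [dict(e, **{attr: fill}) if e[attr] is None else e
            for e in instances]

# Instances sorted to a node that tests Humidity; one value is missing:
node = [{"Humidity": "High"}, {"Humidity": "High"},
        {"Humidity": None}, {"Humidity": "Normal"}]
print(impute_most_common(node, "Humidity")[2])  # {'Humidity': 'High'}
```

The original instances are left untouched; copies with the filled-in value are returned instead.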
Summary Points
1. Decision tree learning provides a practical method for
concept learning.
2. ID3-like algorithms search a complete hypothesis space.
3. The inductive bias of decision trees is preference (search)
bias.
4. Overfitting the training data is an important issue in
decision tree learning.
5. A large number of extensions of the ID3 algorithm have
been proposed for overfitting avoidance, handling missing
attributes, handling numerical attributes, etc.
Learning Decision Rules
• Decision Rules
• Basic Sequential Covering Algorithm
• Learn-One-Rule Procedure
• Pruning
Definition of Decision Rules
Definition: Decision rules are rules with the following form:
if <conditions> then concept C.
Example: If you run the Prism algorithm from Weka on the weather
data you will get the following set of decision rules:
if outlook = overcast then PlayTennis = yes
if humidity = normal and windy = FALSE then PlayTennis = yes
if temperature = mild and humidity = normal then PlayTennis = yes
if outlook = rainy and windy = FALSE then PlayTennis = yes
if outlook = sunny and humidity = high then PlayTennis = no
if outlook = rainy and windy = TRUE then PlayTennis = no
Why Decision Rules?
• Decision rules are more compact.
• Decision rules are more understandable.

Example: Let X ∈ {0,1}, Y ∈ {0,1}, Z ∈ {0,1}, W ∈ {0,1}. The rules are:
if X = 1 and Y = 1 then 1
if Z = 1 and W = 1 then 1
otherwise 0

[Figure: the equivalent decision tree tests X at the root and must replicate the Z-W subtree under several branches, so it is much larger than the two rules.]
Why Decision Rules?

[Figure: decision boundaries. A decision tree partitions the whole instance space with axis-parallel splits; decision rules carve out axis-parallel regions that need not cover the whole space and may overlap.]
How to Learn Decision Rules?
1. We can convert trees to rules
2. We can use specific rule-learning methods
Sequential Covering Algorithms

function LearnRuleSet(Target, Attrs, Examples, Threshold):
    LearnedRules := ∅
    Rule := LearnOneRule(Target, Attrs, Examples)
    while performance(Rule, Examples) > Threshold do
        LearnedRules := LearnedRules ∪ {Rule}
        Examples := Examples \ {examples covered by Rule}
        Rule := LearnOneRule(Target, Attrs, Examples)
    sort LearnedRules according to performance
    return LearnedRules
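The pseudocode might be rendered in Python as below. The greedy top-down LearnOneRule is my own simple choice, since the slides leave it abstract; rules are lists of (attribute, value) literals:

```python
def covers(rule, inst):
    """A rule covers an instance if every literal matches."""
    return all(inst[a] == v for a, v in rule)

def learn_one_rule(examples, attrs, target):
    """Greedy top-down: repeatedly add the literal with the best accuracy."""
    rule, pool, remaining = [], list(examples), list(attrs)

    def literal_accuracy(lit):
        cov = [e for e in pool if e[lit[0]] == lit[1]]
        return (sum(e["class"] == target for e in cov) / len(cov)
                if cov else 0.0)

    while remaining and any(e["class"] != target for e in pool):
        best = max(((a, v) for a in remaining for v in {e[a] for e in pool}),
                   key=literal_accuracy)
        rule.append(best)
        remaining.remove(best[0])
        pool = [e for e in pool if e[best[0]] == best[1]]
    return rule

def learn_rule_set(examples, attrs, target, threshold=0.9):
    rules, pool = [], list(examples)
    while pool:
        rule = learn_one_rule(pool, attrs, target)
        covered = [e for e in pool if covers(rule, e)]
        if sum(e["class"] == target for e in covered) / len(covered) < threshold:
            break                      # performance too low: stop
        rules.append(rule)
        pool = [e for e in pool if not covers(rule, e)]
    return rules

# A toy data set: "overcast" perfectly predicts the "yes" class.
data = [
    {"outlook": "overcast", "humidity": "high",   "class": "yes"},
    {"outlook": "overcast", "humidity": "normal", "class": "yes"},
    {"outlook": "sunny",    "humidity": "high",   "class": "no"},
    {"outlook": "sunny",    "humidity": "normal", "class": "no"},
    {"outlook": "rainy",    "humidity": "high",   "class": "no"},
]
print(learn_rule_set(data, ["outlook", "humidity"], "yes"))
# [[('outlook', 'overcast')]]
```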
Illustration

[Figure sequence: a cloud of positive and negative instances. Rule learning starts from the maximally general rule IF true THEN pos, which covers everything. The rule is specialized to IF A THEN pos and then to IF A & B THEN pos, which covers only positive instances. The instances it covers are removed, and a second rule is grown the same way: IF true THEN pos, then IF C THEN pos, then IF C & D THEN pos, covering another group of positives.]
Learning One Rule
To learn one rule we use one of the strategies below:
• Top-down:
– Add literals one by one
• Bottom-up:
– Remove literals one by one
• Combination of top-down and bottom-up:
– Candidate-elimination algorithm.
Bottom-up vs. Top-down

[Figure: bottom-up search typically yields more specific rules that cover few instances; top-down search typically yields more general rules that cover many instances.]
Learning One Rule
Bottom-up:
• Example-driven (the AQ family).
Top-down:
• Generate-then-test (CN2).
Heuristics for Learning One Rule
– When is a rule "good"?
• High accuracy;
• Less important: high coverage.
– Possible evaluation functions:
• Relative frequency: nc/n, where nc is the number of correctly classified instances and n is the number of instances covered by the rule;
• m-estimate of accuracy: (nc + mp)/(n + m), where nc is the number of correctly classified instances, n is the number of instances covered by the rule, p is the prior probability of the class predicted by the rule, and m is the weight of p;
• Entropy.
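Both accuracy estimates are one-liners. The sample numbers (a rule covering n = 5 instances, nc = 4 of them correctly, prior p = 9/14, weight m = 2) are illustrative choices of mine, not from the slides:

```python
def relative_frequency(nc, n):
    """nc correctly classified out of n covered instances."""
    return nc / n

def m_estimate(nc, n, p, m):
    """(nc + m*p) / (n + m): shrinks the raw frequency toward the prior p."""
    return (nc + m * p) / (n + m)

print(relative_frequency(4, 5))               # 0.8
print(round(m_estimate(4, 5, 9 / 14, 2), 3))  # 0.755
```

The m-estimate is useful for rules with small coverage: when n = 0 it falls back to the prior p, instead of being undefined like the relative frequency.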
How to Arrange the Rules
1. The rules are ordered in the order they were learned. This order is used for instance classification.
2. The rules are ordered according to their accuracy. This order is used for instance classification.
3. The rules are not ordered, but there exists a strategy for applying them (e.g., an instance covered by conflicting rules gets the classification of the rule that correctly classifies more training instances; an instance not covered by any rule gets the classification of the majority class in the training data).
Approaches to Avoiding Overfitting
• Pre-pruning: stop learning the decision rules before they reach the point where they perfectly classify the training data.
• Post-pruning: allow the decision rules to overfit the training data, and then post-prune the rules.
Post-Pruning
1. Split the instances into a Growing Set and a Pruning Set;
2. Learn a set SR of rules using the Growing Set;
3. Find the best simplification BSR of SR;
4. while Accuracy(BSR, Pruning Set) > Accuracy(SR, Pruning Set) do
   4.1 SR = BSR;
   4.2 Find the best simplification BSR of SR;
5. return BSR.
Incremental Reduced Error Pruning

[Figure: post-pruning splits the data once into fixed parts; incremental reduced-error pruning re-splits the remaining data into growing and validation parts after each rule is learned.]
Incremental Reduced Error Pruning
1. Split the Training Set into a Growing Set and a Validation Set;
2. Learn a rule R using the Growing Set;
3. Prune the rule R using the Validation Set;
4. if performance(R, Training Set) > Threshold then
   4.1 add R to the Set of Learned Rules;
   4.2 remove from the Training Set the instances covered by R;
   4.3 go to 1;
5. else return the Set of Learned Rules.
Summary Points
1. Decision rules are easier for human comprehension
than decision trees.
2. Decision rules have simpler decision boundaries than
decision trees.
3. Decision rules are learned by sequential covering of
the training instances.
Model Evaluation Techniques
• Evaluation on the training set: too optimistic.
[Figure: the classifier is built and evaluated on the same training set.]

Model Evaluation Techniques
• Hold-out method: depends on the make-up of the test set.
[Figure: the data is split into a training set for building the classifier and a test set for evaluating it.]
• To improve the precision of the hold-out method, it is repeated many times.
Model Evaluation Techniques
• k-fold Cross-Validation
[Figure: the data is split into k folds; in each of k runs, one fold serves as the test set and the remaining k-1 folds form the training set.]
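A minimal index-based sketch of the k-fold split (assigning folds by stride is my choice; implementations often shuffle the data or use contiguous blocks instead):

```python
def k_fold_splits(n, k):
    """Yield (train_indices, test_indices) for k-fold cross-validation;
    every instance appears in exactly one test fold."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]

for train, test in k_fold_splits(6, 3):
    print(test, train)
# [0, 3] [1, 4, 2, 5]
# [1, 4] [0, 3, 2, 5]
# [2, 5] [0, 3, 1, 4]
```

The classifier is trained k times, once per split, and the k test-fold accuracies are averaged into a single estimate.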
Intro to Weka
@relation weather.symbolic
@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {TRUE, FALSE}
@data
sunny,hot,high,FALSE,FALSE
sunny,hot,high,TRUE,FALSE
overcast,hot,high,FALSE,TRUE
rainy,mild,high,FALSE,TRUE
rainy,cool,normal,FALSE,TRUE
rainy,cool,normal,TRUE,FALSE
overcast,cool,normal,TRUE,TRUE
………….