Top Down FP-Growth for
Association Rule Mining
By
Ke Wang
Introduction
• Classically, for a rule A → B :
– support: computed as count( AB )
• frequent: if it passes the minimum support threshold
– confidence: computed as count( AB ) / count( A )
• confident: if it passes the minimum confidence threshold
• How to mine association rules?
– find all frequent patterns
– generate rules from the frequent patterns
Introduction
• Limitations of current research
– use a uniform minimum support threshold
– use only support as a pruning measure
• Our contribution
– improve efficiency
– adopt multiple minimum supports
– introduce confidence pruning
Related work
-- Frequent pattern mining
• Apriori algorithm
– method: use the anti-monotone property of
support for pruning, i.e.
• if a length-k pattern is infrequent, its length-(k+1)
super-patterns can never be frequent
• FP-growth algorithm -- better than Apriori
– method:
• build an FP-tree to store the database
• mine the FP-tree in bottom-up order
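The anti-monotone pruning step can be sketched as follows (a minimal illustration on a toy database, not the paper's implementation; variable names are ours):

```python
from itertools import combinations

# Toy database; 'd' appears only once and drops out at level 1.
transactions = [{'b','e'}, {'a','b','c','e'}, {'b','c','e'}, {'a','c','d'}, {'a'}]
minsup = 2

def support(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

# Level 1: frequent single items.
items = {i for t in transactions for i in t}
frequent = [frozenset([i]) for i in sorted(items) if support({i}) >= minsup]
all_frequent = list(frequent)

# Level k+1: a candidate survives only if ALL of its length-k subsets are
# frequent (anti-monotone pruning); it is kept only if it passes minsup itself.
while frequent:
    prev = set(frequent)
    candidates = {a | b for a in prev for b in prev if len(a | b) == len(a) + 1}
    survivors = [c for c in candidates
                 if all(frozenset(s) in prev for s in combinations(c, len(c) - 1))]
    frequent = [c for c in survivors if support(c) >= minsup]
    all_frequent += frequent

print(sorted(tuple(sorted(f)) for f in all_frequent))
# → [('a',), ('a', 'c'), ('b',), ('b', 'c'), ('b', 'c', 'e'), ('b', 'e'),
#    ('c',), ('c', 'e'), ('e',)]
```

Note how the candidate {a, b, c} is discarded without counting its support, because its subset {a, b} is already infrequent.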
Related work
-- Association rule mining
• Fast algorithms trying to guarantee
completeness of frequent patterns
• Parallel algorithms & association rule
based query languages
• Various association rule mining problems
– multi-level, multi-dimensional rules
– constraints on specific items
TD-FP-Growth for frequent
pattern mining
• Similar tree structure to FP-growth
– compressed tree storing the database
– nodes on each path of the tree are globally
ordered
• Different mining method vs. FP-growth
– FP-growth: bottom-up tree mining
– TD-FP-Growth: top-down tree mining
TD-FP-Growth for frequent
pattern mining
Example database (minsup = 2):
{b, e}, {a, b, c, e}, {b, c, e}, {a, c, d}, {a}

Construct an FP-tree:

Header table H (entry value : count, each entry with a side-link):
a : 3, b : 3, c : 3, e : 3

[Figure: FP-tree and header table H. Item d is infrequent and dropped;
the tree paths are root–a:3–b:1–c:1–e:1, root–a:3–c:1,
root–b:2–e:1, and root–b:2–c:1–e:1.]
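The two-pass construction above might look like this in code (a simplified sketch; class and variable names are ours, and ties in item frequency are broken alphabetically, which on this example reproduces the global order a, b, c, e):

```python
class Node:
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}

transactions = [['b','e'], ['a','b','c','e'], ['b','c','e'], ['a','c','d'], ['a']]
minsup = 2

# Pass 1: global item counts; infrequent items (here 'd') are dropped.
counts = {}
for t in transactions:
    for i in t:
        counts[i] = counts.get(i, 0) + 1
order = [i for i in sorted(counts) if counts[i] >= minsup]  # a, b, c, e

# Pass 2: insert each transaction along one tree path, items in global order.
# The header table keeps, per item, its side-linked list of tree nodes.
root = Node(None, None)
header = {i: [] for i in order}
for t in transactions:
    node = root
    for i in [j for j in order if j in t]:
        if i not in node.children:
            node.children[i] = Node(i, node)
            header[i].append(node.children[i])   # append to side-link list
        node = node.children[i]
        node.count += 1

for i in order:
    print(i, counts[i], [n.count for n in header[i]])
# → a 3 [3]
#   b 3 [2, 1]
#   c 3 [1, 1, 1]
#   e 3 [1, 1, 1]
```

The per-node counts match the figure: one a:3 node, two b nodes (b:2 and b:1), and three c and e nodes of count 1 each.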
TD-FP-Growth for frequent
pattern mining
FP-growth: bottom-up mining
(mining order: e, c, b, a, following each header entry's node-links)

Example database (minsup = 2):
{b, e}, {a, b, c, e}, {b, c, e}, {a, c, d}, {a}

[Figure: the FP-tree and header table H from the previous slide,
with e's node-link followed through the three e: 1 nodes.]

e's conditional pattern base:
(b: 1)
(b: 1, c: 1)
(a: 1, b: 1, c: 1)
TD-FP-Growth for frequent
pattern mining
• FP-growth: bottom-up mining
– e's conditional pattern base:
(b: 1), (b: 1, c: 1), (a: 1, b: 1, c: 1)
– e's conditional FP-tree: root – b: 3 – c: 2,
with its own header table (items b, c)
• both e's conditional pattern base and e's
conditional FP-tree are stored in memory
→ drawback!
• e's conditional FP-tree is mined recursively
• conditional pattern bases and FP-trees are
built for all other items and their super-patterns
TD-FP-Growth for frequent
pattern mining
• TD-FP-Growth: adopts a top-down mining
strategy
– motivation: avoid building the extra databases
and sub-trees that FP-growth builds
– method: process nodes on an upper level
before those on a lower level
– result: modifications made to upper-level
nodes never affect lower-level nodes
See example →
TD-FP-Growth for frequent
pattern mining
Mining order: a, b, c, e

Example database (minsup = 2):
{b, e}, {a, b, c, e}, {b, c, e}, {a, c, d}, {a}

Header table H (entry value : count, each entry with a side-link):
a : 3, b : 3, c : 3, e : 3

[Figure: CT-tree and header table H: the same tree as before
(paths root–a:3–b:1–c:1–e:1, root–a:3–c:1, root–b:2–e:1,
root–b:2–c:1–e:1), now mined top-down.]
CT-tree for frequent pattern
mining
Example database (minsup = 2):
{b, e}, {a, b, c, e}, {b, c, e}, {a, c, d}, {a}

Mining entry c of header table H: walk up once from each c node
(c: 1 under a–b, c: 1 under a, c: 1 under b).

sub-header-table H_c (entry value : count, each entry with a side-link):
a : 2, b : 2

[Figure: CT-tree and header table H, with the walked-up prefix
counts a: 2 and b: 2 accumulated into H_c.]
CT-tree for frequent pattern
mining
• Completeness
– for each entry i in H, we mine exactly the
frequent patterns that end with item i, no more
and no less
• Complete set of frequent patterns:
{a }
{b }
{c }, {b, c }, {a, c }
{e }, {b, e }, {c, e }, {b, c, e }
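The completeness claim can be checked by brute force: enumerate all frequent patterns of the example database and group each one under its last item in the global order (a verification sketch, not part of the algorithm itself):

```python
from itertools import combinations

transactions = [{'b','e'}, {'a','b','c','e'}, {'b','c','e'}, {'a','c','d'}, {'a'}]
minsup = 2
order = ['a', 'b', 'c', 'e']   # global item order of the tree

def support(itemset):
    return sum(1 for t in transactions if set(itemset) <= t)

# Brute-force all frequent patterns over the ordered items.
frequent = [c for k in range(1, len(order) + 1)
              for c in combinations(order, k) if support(c) >= minsup]

# Entry i of header table H owns exactly the frequent patterns whose
# last item (in the global order) is i -- no more and no less.
groups = {i: [p for p in frequent if p[-1] == i] for i in order}
for i in order:
    print(i, groups[i])
# → a [('a',)]
#   b [('b',)]
#   c [('c',), ('a', 'c'), ('b', 'c')]
#   e [('e',), ('b', 'e'), ('c', 'e'), ('b', 'c', 'e')]
```

The four groups partition the complete set of nine frequent patterns, matching the list on this slide.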
TD-FP-Growth for frequent
pattern mining
• Compared with FP-growth, TD-FP-Growth is:
– space-saving:
• only one tree and a few header tables
• no extra databases or sub-trees
– time-saving:
• does not build extra databases or sub-trees
• walks up each path only once to update count
information for nodes on the tree and to build
sub-header-tables
TD-FP-Growth for association
rule mining
• Assumptions:
– there is a class-attribute in the database
– items in the class-attribute are called class-items;
the others are non-class-items
– each transaction is associated with a class-item
– only a class-item may appear in the right-hand
side of a rule
Transaction ID | non-class-attribute | class-attribute
1              | a, b, …             | C1
2              | d, …                | C2
3              | e, d, f, …          | C3
…              | …                   | …

example rule: a, b → Ci
TD-FP-Growth for association
rule mining -- multiple minimum supports
• Why?
– with a uniform minimum support, the computation
of count considers only the number of appearances
– a uniform minimum support is unfair to items
that appear less often but are worth more
• e.g. responders vs. non-responders
• How?
– use a different support threshold for each
class
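One way to read "a different threshold per class" in code (a sketch under our own reading: the same relative threshold maps to a different absolute count per class, and an itemset is kept if it reaches the threshold of at least one class; the data is the six class-labeled transactions of the next example):

```python
import math

# Class-labeled transactions (C1 has 4 transactions, C2 has 2).
transactions = [
    (['c','f'], 'C1'), (['b','e'], 'C2'), (['b','e','f'], 'C1'),
    (['a','c','f'], 'C1'), (['c','e'], 'C2'), (['b','c','d'], 'C1'),
]
relative_minsup = 0.5

class_sizes = {}
for _, c in transactions:
    class_sizes[c] = class_sizes.get(c, 0) + 1

# The same 50% relative threshold becomes a different absolute
# threshold per class: minsup_C1 = 2, minsup_C2 = 1.
minsup = {c: math.ceil(relative_minsup * n) for c, n in class_sizes.items()}

def class_count(itemset, cls):
    """count(A, cls): transactions of class cls containing all of A."""
    return sum(1 for t, c in transactions if c == cls and set(itemset) <= set(t))

# Keep an itemset if it reaches the threshold of at least one class.
for itemset in (['a'], ['e'], ['c','f']):
    per_class = {c: class_count(itemset, c) for c in sorted(minsup)}
    print(itemset, per_class, any(per_class[c] >= minsup[c] for c in minsup))
# → ['a'] {'C1': 1, 'C2': 0} False
#   ['e'] {'C1': 1, 'C2': 2} True
#   ['c', 'f'] {'C1': 2, 'C2': 0} True
```

Under a uniform absolute minsup of 1, {a} would have been kept; the per-class thresholds prune it.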
TD-FP-Growth for association
rule mining -- multiple minimum supports
• multiple vs. uniform minimum supports
– example database (class counts C1: 4, C2: 2):
{c, f, C1}, {b, e, C2}, {b, e, f, C1},
{a, c, f, C1}, {c, e, C2}, {b, c, d, C1}
– rules mined with relative minsup = 50%,
proportional to each class (the setting also
used in the performance experiments)
• uniform minimum support: absolute minsup = 1
– 11-node tree, 23 rules
• multiple minimum supports: absolute minsup1 = 2,
absolute minsup2 = 1
– 7-node tree, 9 rules
– more effective and space-saving
– time-saving (shown in the performance section)
TD-FP-Growth for association
rule mining -- confidence pruning
• Motivation
– make use of the other constraint on association
rules, confidence, to speed up mining
• Method
– confidence itself is not anti-monotone
– introduce an acting constraint of confidence,
which is anti-monotone
– push it inside the mining process
TD-FP-Growth for association
rule mining -- confidence pruning
conf(A → B) = count(AB) / count(A) >= minconf
⇒ count(AB) >= count(A) * minconf
⇒ count(AB) >= minsup * minconf
(since count(A) >= minsup; anti-monotone & weaker)

--- the acting constraint of confidence for the original
confidence constraint of rule A → B

• the support of a rule is computed by count(A)
• count(AB): the class-count of itemset A related to class B
TD-FP-Growth for association
rule mining -- confidence pruning
Example database (minsup = 2, minconf = 60%):
{c, f, C1}, {b, e, C2}, {b, e, f, C1}, {a, c, f, C1}, {a, c, d, C2}

Header table H, where count(i) = count(i, C1) + count(i, C2):

entry value i | count(i) | count(i, C1) | count(i, C2) | side-link
a             | 2        | 1            | 1            | …
b             | 2        | 1            | 1            | …
c             | 3        | 2            | 1            | …
e             | 2        | 1            | 1            | …
f             | 3        | 3            | 0            | …

count(e) >= minsup; however,
both count(e, C1) and count(e, C2) < minsup * minconf;
⇒ terminate mining for e!

If no confidence pruning ⇒ the sub-header-table H_e is still built:

entry value i | count(i) | count(i, C1) | count(i, C2) | side-link
b             | 2        | 1            | 1            | …

[Figure: CT-tree for the example database, with nodes b: 2 and e: 2
visible on the path walked up for entry e.]
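The pruning test on entry e can be reproduced numerically (a sketch; `class_count` and `count` are our names for the header-table columns):

```python
# The five class-labeled transactions of the example (minsup = 2, minconf = 60%).
transactions = [
    (['c','f'], 'C1'), (['b','e'], 'C2'), (['b','e','f'], 'C1'),
    (['a','c','f'], 'C1'), (['a','c','d'], 'C2'),
]
minsup, minconf = 2, 0.6
classes = ['C1', 'C2']

def class_count(itemset, cls):
    """count(A, cls): transactions of class cls containing all of A."""
    return sum(1 for t, c in transactions if c == cls and set(itemset) <= set(t))

def count(itemset):
    """count(A) = sum of the per-class counts."""
    return sum(class_count(itemset, c) for c in classes)

# Entry e passes the support test, but the acting constraint
# count(e, Ci) >= minsup * minconf = 1.2 fails for BOTH classes, so no
# frequent confident rule can contain e: mining for e stops early.
threshold = minsup * minconf
print(count(['e']), [class_count(['e'], c) for c in classes])   # → 2 [1, 1]
print(all(class_count(['e'], c) < threshold for c in classes))  # → True

# Entry f, by contrast, survives: count(f, C1) = 3 >= 1.2.
print(class_count(['f'], 'C1') >= threshold)                    # → True
```

Any superset A of {e} has count(A, Ci) <= count(e, Ci), so the early termination is safe.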
Performance
• We chose several data sets from the UC Irvine Machine
Learning Repository:
http://www.ics.uci.edu/~mlearn/MLRepository.html

name of   | # of         | # of items in    | class distribution          | # of distinct
dataset   | transactions | each transaction |                             | items
Dna-train | 2000         | 61               | 23.2%, 24.25%, 52.55%       | 240
Connect-4 | 67557        | 43               | 9.55%, 24.62%, 65.83%       | 126
Forest    | 581012       | 13               | 0.47%, 1.63%, 2.99%, 3.53%, | 15916
          |              |                  | 6.15%, 36.36%, 48.76%       |
Performance — frequent pattern mining
Performance — mine rules with
multiple minimum supports
(FP-growth is included here only for frequent
pattern mining; it does not mine rules)
relative minsup, proportional to each class
Performance — mine rules with
confidence pruning
Conclusions and future work
• Conclusions on the TD-FP-Growth algorithm
– more efficient in finding both frequent
patterns and association rules
– more effective in mining rules, by using
multiple minimum supports
– introduces a new pruning method, confidence
pruning, and pushes it inside the mining process,
further speeding up mining
Conclusions and future work
• Future work
– Explore other constraint-based association
rule mining methods
– Mine association rules with item concept
hierarchy
– Apply TD-FP-Growth to applications based on
association rule mining
• Clustering
• Classification
References
(1) R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of
items in large databases. Proc. 1993 ACM-SIGMOD Int. Conf. on Management of Data
(SIGMOD’93), pages 207-216, Washington, D.C., May 1993.
(2) U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (eds.). Advances in
Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
(3) H. Toivonen. Sampling large databases for association rules. Proc. 1996 Int. Conf.
Very Large Data Bases (VLDB’96), pages 134-145, Bombay, India, September 1996.
(4) R. Agrawal and S. Srikant. Mining sequential patterns. Proc. 1995 Int. Conf. Data
Engineering (ICDE’95), pages 3-14, Taipei, Taiwan, March 1995.
(5) J. Han, J. Pei and Y. Yin. Mining Frequent Patterns without Candidate Generation.
Proc. 2000 ACM-SIGMOD Int. Conf. on Management of Data (SIGMOD’00), pages 1-12,
Dallas, TX, May 2000.
(6) J. Han, J. Pei, G. Dong, and K. Wang. Efficient Computation of Iceberg Cubes with
Complex Measures. Proc. 2001 ACM-SIGMOD Int. Conf., Santa Barbara, CA, May 2001.
And more!