```Learning the Structure of
Markov Logic Networks
Stanley Kok & Pedro Domingos
Dept. of Computer Science and Eng.
University of Washington
1
Overview
Motivation
 Background
 Structure Learning Algorithm
 Experiments
 Future Work & Conclusion

2
Motivation

Statistical Relational Learning (SRL)
combines the benefits of:


Statistical Learning: uses probability to handle
uncertainty in a robust and principled way
Relational Learning: models domains with
multiple relations
3
Motivation

Many SRL approaches combine a logical
language and Bayesian networks

e.g. Probabilistic Relational Models
[Friedman et al., 1999]
The need to avoid cycles in Bayesian networks
causes many difficulties [Taskar et al., 2002]
 Started using Markov networks instead

4
Motivation

Relational Markov Networks



conjunctive database queries + Markov networks
Require space exponential in the size of the cliques
Markov Logic Networks



[Richardson & Domingos, 2004]
First-order logic + Markov networks
Compactly represent large cliques
Did not learn structure (used external ILP system)
5
Motivation

Relational Markov Networks






conjunctive database queries + Markov networks
Require space exponential in the size of the cliques
Markov Logic Networks

[Richardson & Domingos, 2004]
First-order logic + Markov networks
Compactly represent large cliques
Did not learn structure (used external ILP system)
This paper develops a fast algorithm that
learns MLN structure

Most powerful SRL learner to date
6
Overview
Motivation
 Background
 Structure Learning Algorithm
 Experiments
 Future Work & Conclusion

7
Markov Logic Networks

First-order KB: set of hard constraints


Violate one formula, a world has zero probability
MLNs soften constraints



OK to violate formulas
The fewer formulas a world violates,
the more probable it is
Gives each formula a weight,
reflects how strong a constraint it is
8
MLN Definition

A Markov Logic Network (MLN) is a set of
pairs (F, w) where



F is a formula in first-order logic
w is a real number
Together with a finite set of constants,
it defines a Markov network with


One node for each grounding of each predicate
in the MLN
One feature for each grounding of each formula F
in the MLN, with the corresponding weight w
9
Ground Markov Network
2.7
constants: STAN, PEDRO
Student(STAN)
Professor(STAN)
Professor(PEDRO)
Student(PEDRO)
10
MLN Model
11
MLN Model
Vector of
value assignments to
ground predicates
12
MLN Model
Vector of
value assignments to
ground predicates
Partition function.
Sums over all possible
value assignments to
ground predicates
13
MLN Model
Vector of
Weight of ith formula
value assignments to
ground predicates
Partition function.
Sums over all possible
value assignments to
ground predicates
14
MLN Model
Vector of
Weight of ith formula
value assignments to
ground predicates
Partition function.
# of true groundings
Sums over all possible
of ith formula
value assignments to
ground predicates
15
MLN Weight Learning
Likelihood is concave function of weights
 Quasi-Newton methods to find optimal weights


e.g. L-BFGS [Liu & Nocedal, 1989]
16
MLN Weight Learning
Likelihood is concave function of weights
 Quasi-Newton methods to find optimal weights


e.g. L-BFGS [Liu & Nocedal, 1989]
SLOW
#P-complete
17
MLN Weight Learning
Likelihood is concave function of weights
 Quasi-Newton methods to find optimal weights


e.g. L-BFGS [Liu & Nocedal, 1989]
SLOW
#P-complete
SLOW
#P-complete
18
MLN Weight Learning

R&D used pseudo-likelihood
[Besag, 1975]
19
MLN Weight Learning

R&D used pseudo-likelihood
[Besag, 1975]
20
MLN Structure Learning

R&D “learned” MLN structure in two disjoint steps:



Learn first-order clauses with an off-the-shelf
ILP system (CLAUDIEN [De Raedt & Dehaspe, 1997])
Learn clause weights by optimizing
pseudo-likelihood
Unlikely to give best results because CLAUDIEN


find clauses that hold with some accuracy/frequency
in the data
don’t find clauses that maximize data’s
(pseudo-)likelihood
21
Overview
Motivation
 Background
 Structure Learning Algorithm
 Experiments
 Future Work & Conclusion

22
MLN Structure Learning

This paper develops an algorithm that:



Learns first-order clauses by directly optimizing
pseudo-likelihood
Is fast enough
Performs better than R&D, pure ILP,
purely KB and purely probabilistic approaches
23
Structure Learning Algorithm

High-level algorithm
REPEAT
MLN Ã MLN [ FindBestClauses(MLN)
UNTIL FindBestClauses(MLN) returns NULL

FindBestClauses(MLN)
Create candidate clauses
FOR EACH candidate clause c
Compute increase in evaluation measure
RETURN k clauses with greatest increase
24
Structure Learning
Evaluation measure
 Clause construction operators
 Search strategies
 Speedup techniques

25
Evaluation Measure
 R&D
used pseudo-log-likelihood

 This
gives undue weight to predicates
with large # of groundings
26
Evaluation Measure
 Weighted
pseudo-log-likelihood (WPLL)

 Gaussian
weight prior
 Structure prior
27
Evaluation Measure
 Weighted
pseudo-log-likelihood (WPLL)

weight given to predicate r
 Gaussian
weight prior
 Structure prior
28
Evaluation Measure
 Weighted
pseudo-log-likelihood (WPLL)

weight given to predicate r
sums over groundings of predicate r
 Gaussian
weight prior
 Structure prior
29
Evaluation Measure
 Weighted
pseudo-log-likelihood (WPLL)

weight given to predicate r
CLL: conditional
log-likelihood
sums over groundings of predicate r
 Gaussian
weight prior
 Structure prior
30
Clause Construction Operators
 Remove a literal
 Flip signs of literals
 Limit # of distinct variables to restrict
search space

31
Beam Search


Same as that used in ILP & rule induction
Repeatedly find the single best clause
32
Shortest-First Search (SFS)
1.
2.
3.
4.

Start from empty or hand-coded MLN
FOR L Ã 1 TO MAX_LENGTH
Apply each literal addition & deletion to
each clause to create clauses of length L
Repeatedly add K best clauses of length L
to the MLN until no clause of length L
improves WPLL
Similar to Della Pietra et al. (1997),
McCallum (2003)
33
Speedup Techniques

FindBestClauses(MLN)
Creates candidate clauses
FOR EACH candidate clause c
Compute increase in WPLL (using L-BFGS)
RETURN k clauses with greatest increase
34
Speedup Techniques

FindBestClauses(MLN)
Creates candidate clauses
FOR EACH candidate clause c
Compute increase in WPLL (using L-BFGS)
RETURN k clauses with greatest increase
SLOW
Many candidates
35
Speedup Techniques

FindBestClauses(MLN)
Creates candidate clauses
FOR EACH candidate clause c
Compute increase in WPLL (using L-BFGS)
RETURN k clauses with greatest increase
SLOW
Many candidates
SLOW
Many CLLs
SLOW
Each CLL involves a
#P-complete problem
36
Speedup Techniques

FindBestClauses(MLN)
Creates candidate clauses
FOR EACH candidate clause c
Compute increase in WPLL (using L-BFGS)
RETURN k clauses with greatest increase
NOT THAT FAST
SLOW
Many candidates
SLOW
Many CLLs
SLOW
Each CLL involves a
#P-complete problem
37
Speedup Techniques
Clause Sampling
 Predicate Sampling
 Avoid Redundancy
 Loose Convergence Thresholds
 Ignore Unrelated Clauses
 Weight Thresholding

38
Speedup Techniques
Clause Sampling
 Predicate Sampling
 Avoid Redundancy
 Loose Convergence Thresholds
 Ignore Unrelated Clauses
 Weight Thresholding

39
Speedup Techniques
Clause Sampling
 Predicate Sampling
 Avoid Redundancy
 Loose Convergence Thresholds
 Ignore Unrelated Clauses
 Weight Thresholding

40
Speedup Techniques
Clause Sampling
 Predicate Sampling
 Avoid Redundancy
 Loose Convergence Thresholds
 Ignore Unrelated Clauses
 Weight Thresholding

41
Speedup Techniques
Clause Sampling
 Predicate Sampling
 Avoid Redundancy
 Loose Convergence Thresholds
 Ignore Unrelated Clauses
 Weight Thresholding

42
Speedup Techniques
Clause Sampling
 Predicate Sampling
 Avoid Redundancy
 Loose Convergence Thresholds
 Ignore Unrelated Clauses
 Weight Thresholding

43
Overview
Motivation
 Background
 Structure Learning Algorithm
 Experiments
 Future Work & Conclusion

44
Experiments

UW-CSE domain





22 predicates, e.g., AdvisedBy(X,Y), Student(X), etc.
10 types, e.g., Person, Course, Quarter, etc.
# ground predicates ¼ 4 million
# true ground predicates ¼ 3000
Handcrafted KB with 94 formulas



Each student has at most one advisor
If a student is an author of a paper, so is her advisor
Cora domain


Computer science research papers
Collective deduplication of author, venue, title
45
Systems
MLN(SLB): structure learning with beam search
MLN(SLS): structure learning with SFS
46
Systems
MLN(SLB)
MLN(SLS)
KB: hand-coded KB
CL: CLAUDIEN
FO: FOIL
AL: Aleph
47
Systems
MLN(SLB)
MLN(SLS)
MLN(KB)
MLN(CL)
MLN(FO)
MLN(AL)
KB
CL
FO
AL
48
Systems
MLN(SLB)
MLN(SLS)
MLN(KB)
MLN(CL)
MLN(FO)
MLN(AL)
KB
CL
FO
AL
NB:
BN:
Naïve Bayes
Bayesian
networks
49
Methodology

UW-CSE domain



DB divided into 5 areas:
AI, Graphics, Languages, Systems, Theory
Leave-one-out testing by area
Measured


average CLL of the ground predicates
average area under the precision-recall curve
of the ground predicates (AUC)
50
UW-CSE
UW-CSE
CLL
0.0
-0.061 -0.088
-0.151
-0.208 -0.223
-0.142
-0.574
-0.5
-0.661
-0.579
-0.812
-1.0
0.6
0.533
AUC
0.472
0.429
0.306
0.266
0.3
0.140
0.148
0.170
0.131
0.117
51
0.0
UW-CSE
UW-CSE
CLL
0.0
-0.061 -0.088
-0.151
-0.208 -0.223
-0.142
-0.574
-0.5
-0.661
-0.579
-0.812
-1.0
0.6
0.533
AUC
0.472
0.429
0.306
0.266
0.3
0.140
0.148
0.170
0.131
0.117
52
0.0
UW-CSE
UW-CSE
CLL
0.0
-0.061 -0.088
-0.151
-0.208 -0.223
-0.142
-0.574
-0.5
-0.661
-0.579
-0.812
-1.0
0.6
0.533
AUC
0.472
0.429
0.306
0.266
0.3
0.140
0.148
0.170
0.131
0.117
53
0.0
UW-CSE
UW-CSE
CLL
0.0
-0.061 -0.088
-0.151
-0.208 -0.223
-0.142
-0.574
-0.5
-0.661
-0.579
-0.812
-1.0
0.6
0.533
AUC
0.472
0.429
0.306
0.266
0.3
0.140
0.148
0.170
0.131
0.117
54
0.0
UW-CSE
UW-CSE
0.0
-0.061
-0.088
-0.166
CLL
-0.370
-0.5
-1.0
0.6
0.533
0.472
AUC
0.390
0.397
0.3
55
0.0
Timing

MLN(SLS) on UW-CSE



Cluster of 15 dual-CPUs 2.8 GHz Pentium 4
machines
Without speedups: did not finish in 24 hrs
With speedups: 5.3 hrs
56
Lesion Study

Disable one speedup technique at a time; SFS
UW-CSE (one-fold)
30.0
24.8
21.6
Hour
20.0
8.4
10.0
6.5
4.1
4.0
0.0
all
speedups
no clause no predicate don’t avoid
sampling
sampling redundancy
no weight
no loose
57
converg. thresholding
threshold
Overview
Motivation
 Background
 Structure Learning Algorithm
 Experiments
 Future Work & Conclusion

58
Future Work
 Speed
up counting of # true
groundings of clause
 Probabilistically bound the loss in
accuracy due to subsampling
 Probabilistic predicate discovery
59
Conclusion





Markov logic networks: a powerful combination
of first-order logic and probability
Richardson & Domingos (2004) did not learn
MLN structure
We develop an algorithm that automatically learns
both first-order clauses and their weights
We develop speedup techniques to make our
algorithm fast enough to be practical
We show experimentally that our algorithm
outperforms





Richardson & Domingos
Pure ILP
Purely KB approaches
Purely probabilistic approaches
(For software, email: koks@cs.washington.edu)
60
```