Learning the Structure of Markov Logic Networks Stanley Kok & Pedro Domingos Dept. of Computer Science and Eng. University of Washington 1 Overview Motivation Background Structure Learning Algorithm Experiments Future Work & Conclusion 2 Motivation Statistical Relational Learning (SRL) combines the benefits of: Statistical Learning: uses probability to handle uncertainty in a robust and principled way Relational Learning: models domains with multiple relations 3 Motivation Many SRL approaches combine a logical language and Bayesian networks e.g. Probabilistic Relational Models [Friedman et al., 1999] The need to avoid cycles in Bayesian networks causes many difficulties [Taskar et al., 2002] Started using Markov networks instead 4 Motivation Relational Markov Networks conjunctive database queries + Markov networks Require space exponential in the size of the cliques Markov Logic Networks [Taskar et al., 2002] [Richardson & Domingos, 2004] First-order logic + Markov networks Compactly represent large cliques Did not learn structure (used external ILP system) 5 Motivation Relational Markov Networks conjunctive database queries + Markov networks Require space exponential in the size of the cliques Markov Logic Networks [Taskar et al., 2002] [Richardson & Domingos, 2004] First-order logic + Markov networks Compactly represent large cliques Did not learn structure (used external ILP system) This paper develops a fast algorithm that learns MLN structure Most powerful SRL learner to date 6 Overview Motivation Background Structure Learning Algorithm Experiments Future Work & Conclusion 7 Markov Logic Networks First-order KB: set of hard constraints Violate one formula, a world has zero probability MLNs soften constraints OK to violate formulas The fewer formulas a world violates, the more probable it is Gives each formula a weight, reflects how strong a constraint it is 8 MLN Definition A Markov Logic Network (MLN) is a set of pairs (F, w) where F is a formula in first-order logic w is a real number Together with a finite set of constants, it defines a Markov network with One node for each grounding of each predicate in the MLN One feature for each grounding of each formula F in the MLN, with the corresponding weight w 9 Ground Markov Network 2.7 AdvisedBy(S,P) ) Student(S) ^ Professor(P) constants: STAN, PEDRO AdvisedBy(STAN,STAN) Student(STAN) Professor(STAN) AdvisedBy(PEDRO,STAN) AdvisedBy(STAN,PEDRO) Professor(PEDRO) Student(PEDRO) AdvisedBy(PEDRO,PEDRO) 10 MLN Model 11 MLN Model Vector of value assignments to ground predicates 12 MLN Model Vector of value assignments to ground predicates Partition function. Sums over all possible value assignments to ground predicates 13 MLN Model Vector of Weight of ith formula value assignments to ground predicates Partition function. Sums over all possible value assignments to ground predicates 14 MLN Model Vector of Weight of ith formula value assignments to ground predicates Partition function. # of true groundings Sums over all possible of ith formula value assignments to ground predicates 15 MLN Weight Learning Likelihood is concave function of weights Quasi-Newton methods to find optimal weights e.g. L-BFGS [Liu & Nocedal, 1989] 16 MLN Weight Learning Likelihood is concave function of weights Quasi-Newton methods to find optimal weights e.g. L-BFGS [Liu & Nocedal, 1989] SLOW #P-complete 17 MLN Weight Learning Likelihood is concave function of weights Quasi-Newton methods to find optimal weights e.g. L-BFGS [Liu & Nocedal, 1989] SLOW #P-complete SLOW #P-complete 18 MLN Weight Learning R&D used pseudo-likelihood [Besag, 1975] 19 MLN Weight Learning R&D used pseudo-likelihood [Besag, 1975] 20 MLN Structure Learning R&D “learned” MLN structure in two disjoint steps: Learn first-order clauses with an off-the-shelf ILP system (CLAUDIEN [De Raedt & Dehaspe, 1997]) Learn clause weights by optimizing pseudo-likelihood Unlikely to give best results because CLAUDIEN find clauses that hold with some accuracy/frequency in the data don’t find clauses that maximize data’s (pseudo-)likelihood 21 Overview Motivation Background Structure Learning Algorithm Experiments Future Work & Conclusion 22 MLN Structure Learning This paper develops an algorithm that: Learns first-order clauses by directly optimizing pseudo-likelihood Is fast enough Performs better than R&D, pure ILP, purely KB and purely probabilistic approaches 23 Structure Learning Algorithm High-level algorithm REPEAT MLN Ã MLN [ FindBestClauses(MLN) UNTIL FindBestClauses(MLN) returns NULL FindBestClauses(MLN) Create candidate clauses FOR EACH candidate clause c Compute increase in evaluation measure of adding c to MLN RETURN k clauses with greatest increase 24 Structure Learning Evaluation measure Clause construction operators Search strategies Speedup techniques 25 Evaluation Measure R&D used pseudo-log-likelihood This gives undue weight to predicates with large # of groundings 26 Evaluation Measure Weighted pseudo-log-likelihood (WPLL) Gaussian weight prior Structure prior 27 Evaluation Measure Weighted pseudo-log-likelihood (WPLL) weight given to predicate r Gaussian weight prior Structure prior 28 Evaluation Measure Weighted pseudo-log-likelihood (WPLL) weight given to predicate r sums over groundings of predicate r Gaussian weight prior Structure prior 29 Evaluation Measure Weighted pseudo-log-likelihood (WPLL) weight given to predicate r CLL: conditional log-likelihood sums over groundings of predicate r Gaussian weight prior Structure prior 30 Clause Construction Operators Add a literal (negative/positive) Remove a literal Flip signs of literals Limit # of distinct variables to restrict search space 31 Beam Search Same as that used in ILP & rule induction Repeatedly find the single best clause 32 Shortest-First Search (SFS) 1. 2. 3. 4. Start from empty or hand-coded MLN FOR L Ã 1 TO MAX_LENGTH Apply each literal addition & deletion to each clause to create clauses of length L Repeatedly add K best clauses of length L to the MLN until no clause of length L improves WPLL Similar to Della Pietra et al. (1997), McCallum (2003) 33 Speedup Techniques FindBestClauses(MLN) Creates candidate clauses FOR EACH candidate clause c Compute increase in WPLL (using L-BFGS) of adding c to MLN RETURN k clauses with greatest increase 34 Speedup Techniques FindBestClauses(MLN) Creates candidate clauses FOR EACH candidate clause c Compute increase in WPLL (using L-BFGS) of adding c to MLN RETURN k clauses with greatest increase SLOW Many candidates 35 Speedup Techniques FindBestClauses(MLN) Creates candidate clauses FOR EACH candidate clause c Compute increase in WPLL (using L-BFGS) of adding c to MLN RETURN k clauses with greatest increase SLOW Many candidates SLOW Many CLLs SLOW Each CLL involves a #P-complete problem 36 Speedup Techniques FindBestClauses(MLN) Creates candidate clauses FOR EACH candidate clause c Compute increase in WPLL (using L-BFGS) of adding c to MLN RETURN k clauses with greatest increase NOT THAT FAST SLOW Many candidates SLOW Many CLLs SLOW Each CLL involves a #P-complete problem 37 Speedup Techniques Clause Sampling Predicate Sampling Avoid Redundancy Loose Convergence Thresholds Ignore Unrelated Clauses Weight Thresholding 38 Speedup Techniques Clause Sampling Predicate Sampling Avoid Redundancy Loose Convergence Thresholds Ignore Unrelated Clauses Weight Thresholding 39 Speedup Techniques Clause Sampling Predicate Sampling Avoid Redundancy Loose Convergence Thresholds Ignore Unrelated Clauses Weight Thresholding 40 Speedup Techniques Clause Sampling Predicate Sampling Avoid Redundancy Loose Convergence Thresholds Ignore Unrelated Clauses Weight Thresholding 41 Speedup Techniques Clause Sampling Predicate Sampling Avoid Redundancy Loose Convergence Thresholds Ignore Unrelated Clauses Weight Thresholding 42 Speedup Techniques Clause Sampling Predicate Sampling Avoid Redundancy Loose Convergence Thresholds Ignore Unrelated Clauses Weight Thresholding 43 Overview Motivation Background Structure Learning Algorithm Experiments Future Work & Conclusion 44 Experiments UW-CSE domain 22 predicates, e.g., AdvisedBy(X,Y), Student(X), etc. 10 types, e.g., Person, Course, Quarter, etc. # ground predicates ¼ 4 million # true ground predicates ¼ 3000 Handcrafted KB with 94 formulas Each student has at most one advisor If a student is an author of a paper, so is her advisor Cora domain Computer science research papers Collective deduplication of author, venue, title 45 Systems MLN(SLB): structure learning with beam search MLN(SLS): structure learning with SFS 46 Systems MLN(SLB) MLN(SLS) KB: hand-coded KB CL: CLAUDIEN FO: FOIL AL: Aleph 47 Systems MLN(SLB) MLN(SLS) MLN(KB) MLN(CL) MLN(FO) MLN(AL) KB CL FO AL 48 Systems MLN(SLB) MLN(SLS) MLN(KB) MLN(CL) MLN(FO) MLN(AL) KB CL FO AL NB: BN: Naïve Bayes Bayesian networks 49 Methodology UW-CSE domain DB divided into 5 areas: AI, Graphics, Languages, Systems, Theory Leave-one-out testing by area Measured average CLL of the ground predicates average area under the precision-recall curve of the ground predicates (AUC) 50 UW-CSE UW-CSE CLL 0.0 -0.061 -0.088 -0.151 -0.208 -0.223 -0.142 -0.574 -0.5 -0.661 -0.579 -0.812 -1.0 0.6 0.533 AUC 0.472 0.429 0.306 0.266 0.3 0.140 0.148 0.170 0.131 0.117 51 0.0 UW-CSE UW-CSE CLL 0.0 -0.061 -0.088 -0.151 -0.208 -0.223 -0.142 -0.574 -0.5 -0.661 -0.579 -0.812 -1.0 0.6 0.533 AUC 0.472 0.429 0.306 0.266 0.3 0.140 0.148 0.170 0.131 0.117 52 0.0 UW-CSE UW-CSE CLL 0.0 -0.061 -0.088 -0.151 -0.208 -0.223 -0.142 -0.574 -0.5 -0.661 -0.579 -0.812 -1.0 0.6 0.533 AUC 0.472 0.429 0.306 0.266 0.3 0.140 0.148 0.170 0.131 0.117 53 0.0 UW-CSE UW-CSE CLL 0.0 -0.061 -0.088 -0.151 -0.208 -0.223 -0.142 -0.574 -0.5 -0.661 -0.579 -0.812 -1.0 0.6 0.533 AUC 0.472 0.429 0.306 0.266 0.3 0.140 0.148 0.170 0.131 0.117 54 0.0 UW-CSE UW-CSE 0.0 -0.061 -0.088 -0.166 CLL -0.370 -0.5 -1.0 0.6 0.533 0.472 AUC 0.390 0.397 0.3 55 0.0 Timing MLN(SLS) on UW-CSE Cluster of 15 dual-CPUs 2.8 GHz Pentium 4 machines Without speedups: did not finish in 24 hrs With speedups: 5.3 hrs 56 Lesion Study Disable one speedup technique at a time; SFS UW-CSE (one-fold) 30.0 24.8 21.6 Hour 20.0 8.4 10.0 6.5 4.1 4.0 0.0 all speedups no clause no predicate don’t avoid sampling sampling redundancy no weight no loose 57 converg. thresholding threshold Overview Motivation Background Structure Learning Algorithm Experiments Future Work & Conclusion 58 Future Work Speed up counting of # true groundings of clause Probabilistically bound the loss in accuracy due to subsampling Probabilistic predicate discovery 59 Conclusion Markov logic networks: a powerful combination of first-order logic and probability Richardson & Domingos (2004) did not learn MLN structure We develop an algorithm that automatically learns both first-order clauses and their weights We develop speedup techniques to make our algorithm fast enough to be practical We show experimentally that our algorithm outperforms Richardson & Domingos Pure ILP Purely KB approaches Purely probabilistic approaches (For software, email: [email protected]) 60

Descargar
# Document