Graph-Based Methods for "Open Domain" Information Extraction
William W. Cohen
Machine Learning Dept. and Language Technologies Institute
School of Computer Science, Carnegie Mellon University

Traditional IE vs. Open-Domain IE

Traditional IE:
• Goal: recognize people, places, companies, times, dates, … in natural-language text.
• Supervised learning from a corpus completely annotated with the target entity class (e.g., "people")
• Linear-chain CRFs
• Language- and genre-specific extractors

Open-domain IE:
• Goal: recognize arbitrary entity sets in text
  – Minimal information about the entity class
  – Example 1: "ICML, NIPS"
  – Example 2: "machine learning conferences"
• Semi-supervised learning from very large corpora (the WWW)
• Graph-based learning methods
• Techniques are largely language-independent (!)
  – The graph abstraction fits many languages

Examples with three seeds (screenshots)

Outline
• History
  – Open-domain IE by pattern-matching
• The bootstrapping-with-noise problem
  – Bootstrapping as a graph walk
• Open-domain IE as finding nodes "near" seeds on a graph
  – Approach 1: a "natural" graph derived from a smaller corpus, plus a learned similarity measure
  – Approach 2: a carefully engineered graph derived from a huge corpus (the examples above)

History: Open-Domain IE by Pattern-Matching (Hearst, 1992)
• Start with seeds: "NIPS", "ICML"
• Look through a corpus for certain patterns:
  – "… at NIPS, AISTATS, KDD and other learning conferences …"
  – "… on the PC of KDD, SIGIR, … and …"
• Expand from the seeds to new instances
• Repeat … until done

Bootstrapping as Graph Proximity
• Seed nodes (NIPS) connect through context nodes ("… at NIPS, AISTATS, KDD and other learning conferences …", "For skiers, NIPS, SNOWBIRD, … and …", "on the PC of KDD, SIGIR, … and …") to new candidate nodes: AISTATS, KDD, SIGIR, SNOWBIRD, …
• Shorter paths ~ earlier bootstrapping iterations
• Many paths ~ additional evidence

Outline (continued)
• Open-domain IE as finding nodes "near" seeds on a graph
  – Approach 1: a "natural" graph derived from a smaller corpus + learned similarity, with Einat Minkov (CMU, now Nokia)
  – Approach 2: a carefully engineered graph derived from a huge corpus, with Richard Wang
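The Hearst-style bootstrapping loop above can be illustrated in a few lines. This is a toy sketch, not the original system: the mini-corpus, the capitalized-list regex, and the stopping criterion (a fixed number of rounds) are all our own assumptions.

```python
import re

# Hypothetical mini-corpus, echoing the slide's example sentences.
corpus = [
    "... at NIPS, AISTATS, KDD and other learning conferences ...",
    "... on the PC of KDD, SIGIR, and other venues ...",
]

def expand_seeds(seeds, corpus, iterations=2):
    """One naive bootstrapping loop: find comma-separated lists of
    capitalized tokens that mention a known seed, and harvest the
    other members of the list as new instances."""
    known = set(seeds)
    for _ in range(iterations):
        for sentence in corpus:
            # Match coordinate lists like "NIPS, AISTATS, KDD"
            for m in re.finditer(r"[A-Z][A-Za-z]+(?:, [A-Z][A-Za-z]+)+", sentence):
                items = [x.strip() for x in m.group(0).split(",")]
                if known & set(items):   # the list mentions a seed ...
                    known.update(items)  # ... so adopt its other members
    return known

expanded = expand_seeds({"NIPS", "ICML"}, corpus)
```

Note the noise problem the talk addresses: any spurious list member adopted early gets treated as a seed in later rounds, which is what motivates recasting bootstrapping as graph proximity.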
Learning Similarity Measures for Parsed Text (Minkov & Cohen, EMNLP 2008)
• A dependency-parsed sentence (e.g., "boys like playing with all kinds of cars", with edges such as nsubj, partmod, prep.with, det, and prep.of over POS tags NN, VB, DT) is naturally represented as a tree.
• A dependency-parsed corpus is "naturally" represented as a graph.

Open IE goal:
• Find "coordinate terms" (e.g., girl/boy, dolls/cars) in the graph, or equivalently
• Find a similarity measure S such that S(girl, boy) is high.
• What about off-the-shelf similarity measures?
  – Random Walk with Restart (RWR)
  – Hitting time
  – Commute time
  – …?

Personalized PageRank / RWR
• A query language: Q = {query nodes, target node type}.
• The graph: typed nodes; labeled, weighted, directed edges. Graph-walk parameters: edge weights Θ, walk length K, and reset probability γ.
• M[x, y] = probability of reaching y from x in one step: the weight of the edge from x to y, divided by the total outgoing weight of x.
• "Personalized PageRank": the reset probability is biased towards the initial distribution.
• Returns a list of nodes (of the target type) ranked by the graph-walk probabilities.
• Approximate with power iteration, cut off after a fixed number of iterations K.
• Example walk: girls → girls1 → like1 → boys2 → boys (via mention, nsubj, nsubj-inverse, and mention-inverse edges); a similar path reaches "dolls" from "girls" via partmod and prep.with edges through "playing".
• Useful, but not our goal here…

Learning a Better Similarity Metric
• Task T (query class): a set of queries, each paired with its relevant answers (query a + answers, query b + answers, …, query q + answers).
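The walk just described, with transition probabilities M[x, y], reset probability γ, and K power-iteration steps, can be sketched as follows. The adjacency-dict representation and the parameter defaults are illustrative assumptions, not the paper's implementation.

```python
def personalized_pagerank(adj, seeds, gamma=0.5, K=10):
    """Power-iteration approximation of Personalized PageRank / RWR.
    adj: {node: {neighbor: edge_weight}} (every node here has out-edges);
    seeds: query nodes; gamma: reset probability toward the seeds."""
    nodes = set(adj)
    for nbrs in adj.values():
        nodes |= set(nbrs)
    reset = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    p = dict(reset)
    for _ in range(K):
        nxt = {v: gamma * reset[v] for v in nodes}
        for x, nbrs in adj.items():
            total = sum(nbrs.values())
            for y, w in nbrs.items():
                # M[x, y] = weight of edge x->y over x's total outgoing weight
                nxt[y] += (1.0 - gamma) * p[x] * w / total
        p = nxt
    return p

# Tiny chain mimicking the slide's mention/nsubj path girls -> ... -> boys:
graph = {
    "girls":  {"girls1": 1.0},
    "girls1": {"girls": 1.0, "like1": 1.0},
    "like1":  {"girls1": 1.0, "boys2": 1.0},
    "boys2":  {"like1": 1.0, "boys": 1.0},
    "boys":   {"boys2": 1.0},
}
scores = personalized_pagerank(graph, ["girls"])
```

Nodes closer to the seed (fewer hops, more paths) end up with higher probability, which is exactly the "graph proximity" reading of bootstrapping.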
The graph walk returns a ranked list of nodes: the seed words ("girl", "boy", …) rank near the top, along with potential new instances of the target concept ("doll", "child", "toddler", …).

Learning methods:
• Weight tuning: weights are learned per edge type [Diligenti et al., 2005]
• Reranking: re-order the retrieved list using global features of all paths from source to destination [Minkov et al., 2006]
  – Features: edge-label sequences (e.g., nsubj.nsubj-inv reaching "boys"; nsubj.partmod.prep.in reaching "dolls") and lexical unigrams on the path (e.g., "like", "playing")

Learning methods: Path-Constrained Graph Walks (PCW)
• For each node x, learn P(x→z : relevant(z) | history(Vq, x))
• history(Vq, x) = the sequence of edge labels leading from the query nodes Vq to x, with all histories stored in a tree
• Example: from Vq = "girls", the path nsubj.nsubj-inv reaches "boys", and nsubj.partmod.prep.in reaches "dolls".

City and Person Name Extraction
Corpus    Labeling         Words     Nodes     Edges     NEs
MUC       complete         140K      82K       244K      3K (true)
MUC+AP    partial/noisy    2,440K    1,030K    3,550K    36K (auto)

• City-name seeds: Vq = {sydney, stamford, greenville, los_angeles}
• Person-name seeds: Vq = {carter, dave_kingman, pedro_ramos, florio}
• 10 (x4) queries for each task; train on queries q1-q5, test on q6-q10
• Extract nodes of type NE.
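As a rough illustration of the path-tree idea (not the actual PCW algorithm, which interleaves the walk with pruning), one can estimate P(relevant | history) from labeled edge-label sequences; the training paths below are made up to mirror the slide's girls/boys/dolls example.

```python
from collections import defaultdict

def train_path_tree(labeled_paths):
    """Estimate, for each edge-label history (and each of its prefixes,
    i.e., each node of the history tree), the fraction of walks with
    that history that ended at a relevant node.
    labeled_paths: list of (tuple_of_edge_labels, is_relevant)."""
    counts = defaultdict(lambda: [0, 0])  # history -> [relevant, total]
    for path, relevant in labeled_paths:
        for i in range(1, len(path) + 1):
            hist = path[:i]               # every prefix is a tree node
            counts[hist][1] += 1
            if relevant:
                counts[hist][0] += 1
    return {h: r / t for h, (r, t) in counts.items()}

# Hypothetical labeled paths, in the spirit of the slide's example:
paths = [
    (("nsubj", "nsubj-inv"), True),                          # girls -> like -> boys
    (("nsubj", "partmod", "prep.in"), True),                 # girls -> ... -> dolls
    (("nsubj", "partmod", "partmod-inv", "nsubj-inv"), True),
    (("mention", "mention-inv"), False),                     # lexical loop: noise
]
tree = train_path_tree(paths)
```

A walk constrained by this tree would then prune any branch whose history falls below the threshold (0.5 in the experiments).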
• GW: 6 steps, uniform/learned weights
• Reranking: top 200 nodes (using learned weights)
• Path trees: 20 correct / 20 incorrect paths; threshold 0.5

Results (MUC): precision-at-rank curves (top ~100 ranks) for city names and person names.
• Learning (weight tuning, PCW, reranking) improves over the plain graph walk.
• Weight tuning assigns high weight to edge types such as conj-and, prep-in, nn, and appos for cities, and subj, obj, poss, and nn for persons.
• PCW learns discriminative paths involving prep-in-inv, conj-and, nn-inv, and nn for cities, and nsubj, nsubj-inv, appos, and nn-inv for persons.
• Reranking helps further, using lexical features such as LEX."based" and LEX."downtown" for cities, and LEX."mr" and LEX."president" for persons.

Vector-space baselines
• Co-occurrence vectors (counts; window: +/- 2 tokens)
• Dependency vectors [Padó & Lapata, Computational Linguistics 2007]
  – A path value function:
    • length-based value: 1 / length(path)
    • relation-based value: subj = 5, obj = 4, obl = 3, gen = 2, else 1
  – A context selection function:
    • minimal: verbal predicate-argument paths (length 1)
    • medium: adds coordination, genitive constructions, and noun compounds (length <= 3)
    • maximal: combinations of the above (length <= 4)
  – A similarity function: cosine, or Lin's measure
• Only the top nodes retrieved by reranking are scored (~1,000 overall).

Graph walks vs. vector models (MUC): precision-at-rank for city and person names, comparing PCW, reranking, co-occurrence vectors (CO), and dependency vectors (DV).
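A minimal sketch of the co-occurrence-vector baseline (counts in a +/- 2 token window, compared with cosine similarity); the example sentence and function names are ours, not the paper's code.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (dicts)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def cooccurrence_vectors(tokens, window=2):
    """Count, for each word, the words appearing within +/- window tokens."""
    vecs = defaultdict(lambda: defaultdict(int))
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                vecs[w][tokens[j]] += 1
    return vecs

# Toy corpus of one sentence, echoing the girls/boys/dolls examples:
tokens = "girls like playing with dolls and boys like playing with cars".split()
vecs = cooccurrence_vectors(tokens)
```

Coordinate terms ("girls"/"boys") share contexts like "like" and "playing", so they score higher than non-coordinate pairs such as "girls"/"dolls"; the dependency-vector model replaces the flat window with weighted syntactic paths.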
• On MUC, the graph-based methods (syntactic structure + learning) are best.

Graph walks vs. vector models (MUC + AP): precision-at-rank for city and person names, comparing reranking, PCW, dependency vectors (DV), and co-occurrence vectors (CO).
• The advantage of the graph-based models diminishes as the amount of data grows.
• This is hard to evaluate at high ranks.

Outline
• Open-domain IE as finding nodes "near" seeds on a graph
  – Approach 1: a "natural" graph derived from a smaller corpus + learned similarity, with Einat Minkov (CMU, now Nokia)
  – Approach 2: a carefully engineered graph derived from a huge corpus, with Richard Wang (CMU)

Set Expansion for Any Language (SEAL) (Wang & Cohen, ICDM 2007)
• Basic ideas:
  – Dynamically build the graph using queries to the web
  – Constrain the graph to be as useful as possible
    • Be smart about queries
    • Be smart about "patterns": use clever methods for finding meaningful structure on web pages

System architecture example: three seeds (1. Canon, 2. Nikon, 3. Olympus) are expanded into a ranked list of camera brands.
Expanded list (ranked): 4. Pentax, 5. Sony, 6. Kodak, 7. Minolta, 8. Panasonic, 9. Casio, 10. Leica, 11. Fuji, 12. Samsung, 13. …

System Architecture
• Fetcher: download web pages from the Web that contain all the seeds
• Extractor: learn wrappers from the web pages
• Ranker: rank the entities extracted by the wrappers

The Extractor
• Learn wrappers from web documents and seeds on the fly
  – Utilize semi-structured documents
  – Wrappers are defined at the character level
    • Very fast
    • No tokenization required; thus language-independent
  – Wrappers derived from a document d are applied to d only
  – See the ICDM 2007 paper for details

Example page (seeds "ford" and "honda"; the surrounding markup, e.g. the anchor tags and logos, is noise):

<li class="ford"><a href="http://www.curryauto.com/">
<img src="/common/logos/ford/logo-horiz-rgb-lg-dkbg.gif" alt="3"></a>
<ul><li class="last"><a href="http://www.curryauto.com/">
<span class="dName">Curry Ford</span>...</li></ul></li>
<li class="honda"><a href="http://www.curryauto.com/">
<img src="/common/logos/honda/logo-horiz-rgb-lg-dkbg.gif" alt="4"></a>
<ul><li><a href="http://www.curryhonda-ga.com/">
<span class="dName">Curry Honda Atlanta</span>...</li>
<li><a href="http://www.curryhondamass.com/">
<span class="dName">Curry Honda</span>...</li>
<li class="last"><a href="http://www.curryhondany.com/">
<span class="dName">Curry Honda Yorktown</span>...</li></ul></li>
<li class="acura"><a href="http://www.curryauto.com/">
<img src="/curryautogroup/images/logo-horiz-rgb-lg-dkbg.gif" alt="5"></a>
<ul><li class="last"><a href="http://www.curryacura.com/">
<span class="dName">Curry Acura</span>...</li></ul></li>
<li class="nissan"><a href="http://www.curryauto.com/">
<img src="/common/logos/nissan/logo-horiz-rgb-lg-dkbg.gif" alt="6"></a>
<ul><li class="last"><a href="http://www.geisauto.com/">
<span class="dName">Curry Nissan</span>...</li></ul></li>
<li class="toyota"><a href="http://www.curryauto.com/">
<img src="/common/logos/toyota/logo-horiz-rgb-lg-dkbg.gif" alt="7"></a>
<ul><li class="last"><a href="http://www.geisauto.com/toyota/">
<span class="dName">Curry Toyota</span>...</li></ul></li>

The Ranker
• Rank candidate entity mentions based on "similarity" to the seeds
  – Noisy mentions should be ranked lower
• Random Walk with Restart (graph walk), as before…
• What's the graph?
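A character-level wrapper in the spirit of SEAL's extractor can be sketched as follows: take one occurrence of each seed (the "E2" idea described later), keep the longest left and right character contexts they share, and extract everything bracketed by those contexts. The toy document and helper names are our own; the real extractor is considerably more sophisticated (see the ICDM 2007 paper).

```python
def learn_wrapper(doc, seeds):
    """Learn (left, right) character contexts from one occurrence of
    each seed in doc. Returns None if some seed is absent."""
    occurrences = []
    for s in seeds:
        i = doc.find(s)
        if i < 0:
            return None
        occurrences.append((doc[:i], doc[i + len(s):]))

    def longest_common_prefix(strings):
        out = []
        for chars in zip(*strings):
            if len(set(chars)) != 1:
                break
            out.append(chars[0])
        return "".join(out)

    # left context: longest common *suffix* of the text before each seed
    left = longest_common_prefix([pre[::-1] for pre, _ in occurrences])[::-1]
    # right context: longest common *prefix* of the text after each seed
    right = longest_common_prefix([post for _, post in occurrences])
    return left, right

def apply_wrapper(doc, wrapper):
    """Extract every string bracketed by the learned contexts in doc."""
    left, right = wrapper
    if not left or not right:
        return []                 # degenerate wrapper: refuse to extract
    found, i = [], 0
    while True:
        a = doc.find(left, i)
        if a < 0:
            return found
        start = a + len(left)
        b = doc.find(right, start)
        if b < 0:
            return found
        found.append(doc[start:b])
        i = b + 1                 # right context may overlap the next left context

# Toy semi-structured page:
page = '<b>ford</b> x <b>honda</b> y <b>toyota</b> z'
wrapper = learn_wrapper(page, ["ford", "honda"])
mentions = apply_wrapper(page, wrapper)
```

Because the contexts are raw character strings, nothing here depends on tokenization, which is the source of the claimed language independence.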
Building a Graph
• Example: seeds ("ford", "nissan", "toyota") find documents (e.g., northpointcars.com, curryauto.com); documents derive wrappers; wrappers extract mentions (e.g., "acura" 34.6%, "honda" 26.1%, "chevrolet" 22.5%, "volvo chicago" 8.4%, "bmw pittsburgh" 8.4%).
• A graph consists of a fixed set of…
  – Node types: {seeds, document, wrapper, mention}
  – Labeled directed edges: {find, derive, extract}
    • Each edge asserts that a binary relation r holds
    • Each edge has an inverse relation r-1 (the graph is cyclic)
  – Intuition: good extractions are extracted by many good wrappers, and good wrappers extract many good extractions.

Evaluation datasets: closed sets.

Evaluation Method
• Mean Average Precision (MAP)
  – Commonly used for evaluating ranked lists in IR
  – Contains recall- and precision-oriented aspects
  – Sensitive to the entire ranking
  – The mean of the average precisions over the ranked lists, where for a ranked list L of extracted mentions:

    AvgPrec(L) = (1 / #TrueEntities) * sum over ranks r of [ Prec(r) * NewEntity(r) ]

    Prec(r) = precision at rank r; #TrueEntities = total number of true entities in the dataset; NewEntity(r) = 1 if both (a) the extracted mention at rank r matches some true mention, and (b) no extracted mention at a rank less than r is of the same entity as the one at r; otherwise 0.
• Evaluation procedure (per dataset):
  1. Randomly select three true entities and use their first listed mentions as seeds
  2. Expand the three seeds obtained in step 1
  3. Repeat steps 1 and 2 five times
  4. Compute MAP over the five ranked lists

Experimental Results: 3 Seeds
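The average-precision variant defined on the evaluation slide, with the NewEntity(r) condition, can be sketched as below; the mention/entity naming scheme is illustrative.

```python
def average_precision(ranked, entity_of, num_true_entities):
    """AvgPrec with the slide's NewEntity(r) condition: rank r earns
    credit Prec(r) only the first time its true entity appears.
    ranked: extracted mentions in rank order;
    entity_of: mention -> true entity id (mentions absent = noise)."""
    seen, hits, ap = set(), 0, 0.0
    for r, mention in enumerate(ranked, start=1):
        ent = entity_of.get(mention)
        if ent is not None:
            hits += 1                  # correct extractions in top r
            if ent not in seen:        # NewEntity(r) = 1
                seen.add(ent)
                ap += hits / r         # Prec(r) * NewEntity(r)
    return ap / num_true_entities

def mean_average_precision(runs):
    """runs: list of (ranked, entity_of, num_true_entities) triples,
    e.g. the five seed-sampling trials of the evaluation procedure."""
    return sum(average_precision(*run) for run in runs) / len(runs)

# Toy run: two true entities, one noise mention, one duplicate mention.
ranked = ["Canon", "junk", "Nikon Corp", "Canon Inc"]
entity_of = {"Canon": "canon", "Canon Inc": "canon", "Nikon Corp": "nikon"}
ap = average_precision(ranked, entity_of, num_true_entities=2)
```

Duplicate mentions of an already-found entity still raise Prec(r) at later ranks but earn no credit themselves, so the measure rewards both precision and coverage of distinct entities.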
Overall MAP vs. various methods (bar chart): MAP values range from 14.59% up to 94.18% across E1+EF+100, E2+EF+100, E2+GW+100, E2+GW+200, E2+GW+300, and Bayesian Sets (G.Sets, and G.Sets restricted to English).

Methods vary along three dimensions: [Extractor] + [Ranker] + [Top N URLs]
• Extractor:
  – E1: baseline extractor (longest common context of all seed occurrences)
  – E2: smarter extractor (longest common context of one occurrence of each seed)
• Ranker: { EF: baseline (most frequent), GW: graph walk }
• Top N URLs: { 100, 200, 300 }

Side-by-side comparisons:
• vs. Talukdar, Brants, Liberman & Pereira, CoNLL 2006
• EachMovie vs. WWW, and NIPS vs. WWW: Ghahramani & Heller, NIPS 2005

A limitation of the original SEAL: a preliminary study of seed sizes (MAP vs. number of seeds, 2 to 6, for rankers RW, PR, BS, WL) shows that performance is sensitive to the choice and number of seeds.

Proposed Solution: Iterative SEAL (iSEAL) (Wang & Cohen, ICDM 2008)
• Makes several calls to SEAL; each call…
  – Expands a couple of seeds
  – Aggregates statistics
• Evaluate iSEAL using…
  – Two iterative processes: supervised vs. unsupervised (bootstrapping)
  – Two seeding strategies: fixed seed size vs. increasing seed size
  – Five ranking methods

iSEAL (fixed seed size, supervised)
• Start from the initial seeds; finally, rank nodes by proximity to the seeds in the full graph
• Refinement (ISS): increase the size of the seed set for each expansion over time: 2, 3, 4, 4, …
• Variant (Bootstrap): use high-confidence extractions when seeds run out

Ranking Methods
• Random Graph Walk with Restart (RW): H. Tong, C. Faloutsos, and J.-Y. Pan. Fast random walk with restart and its application. ICDM 2006.
• PageRank (PR): L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. 1998.
• Bayesian Sets (BS, over a flattened graph): Z. Ghahramani and K. A. Heller. Bayesian sets. NIPS 2005.
• Wrapper Length (WL): weights each item by the length of the common contextual string shared by that item and the seeds
• Wrapper Frequency (WF): weights each item by the number of wrappers that extract it

Results (MAP vs. cumulative expansion iterations, 1 to 10, for RW, PR, BS, WL, WF):
• Fixed seed size, supervised: curves lie roughly in the 89-98% band.
• Fixed seed size, bootstrapping: roughly 86-92%.
• Increasing seed size, supervised: roughly 90-97%.
• Increasing seed size, bootstrapping: roughly 89-94%.
• There is little difference between the ranking methods in the supervised case (all seeds correct); the differences are large when bootstrapping.
• Increasing the seed size {2, 3, 4, 4, …} makes all ranking methods improve steadily in the bootstrapping case.
Current Work
• Start with the name of a concept (e.g., "NFL teams")
• Look for (language-dependent) patterns:
  – "… for successful NFL teams (e.g., Pittsburgh Steelers, New York Giants, …)"
• Take the most frequent answers as seeds
• Run bootstrapping iSEAL with seed sizes 2, 3, 4, 4, …
• Evaluated on datasets with concept names, comparing against direct use of text patterns.

Summary/Conclusions
• Open-domain IE as finding nodes "near" seeds on a graph: shorter paths ~ earlier bootstrapping iterations; many paths ~ additional evidence.
• Approach 1 (Minkov & Cohen, EMNLP 2008): the graph is a dependency-parsed corpus.
  – Off-the-shelf distance metrics are not great.
  – With learning, results are significantly better than the state of the art on small corpora (e.g., a personal email corpus), and competitive on 2M+ word corpora.
• Approach 2 (Wang & Cohen, ICDM 2007, 2008): the graph is built on the fly with web queries.
  – A good graph matters! Off-the-shelf distance metrics work.
  – Differences between similarity measures are minimal for clean seeds; learning with clean seeds (e.g., reranking, not described here) gives modest improvements.
  – Differences between similarity measures are bigger with noisy seeds.

Thanks
• DARPA PAL program (Minkov, Cohen, Wang); Yahoo! Research Labs (Minkov); Google Research Grant program (Wang)
• The organizers for inviting me!
• Sponsored link: http://boowa.com (Richard's demo)
