Graph-Based Methods for
“Open Domain”
Information Extraction
William W. Cohen
Machine Learning Dept. and Language Technologies Institute
School of Computer Science
Carnegie Mellon University
Traditional IE vs Open Domain IE

Traditional IE:
• Goal: recognize people, places, companies, times, dates, … in NL text.
• Supervised learning from a corpus completely annotated with the target entity class (e.g., “people”)
• Linear-chain CRFs
• Language- and genre-specific extractors

Open Domain IE:
• Goal: recognize arbitrary entity sets in text
  – Minimal info about the entity class
  – Example 1: “ICML, NIPS”
  – Example 2: “Machine learning conferences”
• Semi-supervised learning from very large corpora (WWW)
• Graph-based learning methods
• Techniques are largely language-independent (!)
  – Graph abstraction fits many languages
Examples with three seeds
Outline
• History
– Open-domain IE by pattern-matching
• The bootstrapping-with-noise problem
– Bootstrapping as a graph walk
• Open-domain IE as finding nodes “near”
seeds on a graph
– Approach 1: A “natural” graph derived from
a smaller corpus + learned similarity
– Approach 2: A carefully-engineered graph
derived from a huge corpus (examples above)
History: Open-domain IE by pattern-matching (Hearst, 92)

• Start with seeds: “NIPS”, “ICML”
• Look through a corpus for certain patterns:
  – … “at NIPS, AISTATS, KDD and other learning conferences…”
  – … “on PC of KDD, SIGIR, … and…”
• Expand from seeds to new instances
• Repeat… until ___
Bootstrapping as graph proximity

[Figure: a graph linking conference names (NIPS, SNOWBIRD, AISTATS, SIGIR, KDD) to the sentence contexts that mention them, e.g. “…at NIPS, AISTATS, KDD and other learning conferences…”, “For skiers, NIPS, SNOWBIRD,… and…”, “…on PC of KDD, SIGIR, … and…”, “…AISTATS, KDD,…”]

shorter paths ~ earlier iterations
many paths ~ additional evidence
Outline
• Open-domain IE as finding nodes “near”
seeds on a graph
– Approach 1: A “natural” graph derived from
a smaller corpus + learned similarity
“with” Einat Minkov (CMU → Nokia)
– Approach 2: A carefully-engineered graph
derived from huge corpus (above)
“with” Richard Wang (CMU → ?)
Learning Similarity Measures for Parsed Text
(Minkov & Cohen, EMNLP 2008)

[Figure: dependency parse of “boys like playing with all kinds of cars”, with POS tags (NN, VB, DT) and labeled edges nsubj, partmod, prep.with, prep.of, det]

A dependency-parsed sentence is naturally represented as a tree.

Learning Similarity Measures for Parsed Text
(Minkov & Cohen, EMNLP 2008)

A dependency-parsed corpus is “naturally” represented as a graph.
Learning Similarity Measures for Parsed Text
(Minkov & Cohen, EMNLP 2008)

Open IE Goal:
• Find “coordinate terms” (e.g., girl/boy, dolls/cars) in the graph, or
• Find a similarity measure S so that S(girl, boy) is high
• What about off-the-shelf similarity measures:
  – Random Walk with Restart (RWR)
  – Hitting time
  – Commute time
  – …?
Personalized PR/RWR

• A query language: Q = {set of query nodes}
• The graph: nodes, node types, edge labels, edge weights
• Graph walk parameters: edge weights Θ, walk length K, and reset probability γ
• M[x,y] = prob. of reaching y from x in one step: the edge weight from x to y, out of the total outgoing weight from x
• “Personalized PageRank”: reset probability biased towards the initial distribution
• Returns a list of nodes (of the requested type) ranked by the graph walk probabilities
• Approximate with power iteration, cut off after a fixed number of iterations K
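The walk just described can be sketched as a short power iteration. This is a minimal illustration, not the paper’s implementation; the graph encoding (adjacency dict of weighted edges) and the default parameter values are assumptions.

```python
def personalized_pagerank(adj, seeds, gamma=0.5, K=6):
    """adj[x] = list of (y, weight) edges; seeds = query nodes Vq."""
    # Initial distribution: uniform over the seed nodes.
    p0 = {s: 1.0 / len(seeds) for s in seeds}
    p = dict(p0)
    for _ in range(K):  # power iteration, cut off after K steps
        nxt = {}
        for x, px in p.items():
            out = adj.get(x, [])
            total = sum(w for _, w in out)
            if total == 0:
                continue  # dangling node: drop its mass in this sketch
            for y, w in out:
                # M[x,y] = edge weight from x to y / total outgoing weight of x
                nxt[y] = nxt.get(y, 0.0) + (1 - gamma) * px * (w / total)
            # reset step: probability mass gamma flows back to the seeds
        for s, ps in p0.items():
            nxt[s] = nxt.get(s, 0.0) + gamma * ps
        p = nxt
    return sorted(p.items(), key=lambda kv: -kv[1])
```

Ranking the resulting distribution gives the “nodes near the seeds” used throughout the talk.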
[Figure: fragment of the word/mention graph for the query “girls”: the mention girls1 connects via mention and nsubj edges to like1 and the word “like”, and onward through like2 and inverse (nsubj-1, mention-1) edges to boys2 and the word “boys”; a longer walk through partmod and prep.with reaches playing1 and dolls1, hence the word “dolls”]

Useful but not our goal here…
Learning a better similarity metric

[Figure: training setup for a task T (query class). Each training query (a, b, …, q) comes with its relevant answers. A graph walk from the seed words (“girl”, “boy”, …) returns, per query, a ranked list of nodes (ranks 1–50); the ranked nodes are potential new instances of the target concept (“doll”, “child”, “toddler”, …).]
Learning methods

• Weight tuning – weights learned per edge type [Diligenti et al., 2005]
• Reranking – re-order the retrieved list using global features of all paths from source to destination [Minkov et al., 2006]

FEATURES
• Edge label sequences
• Lexical unigrams

Example – boys: edge label sequences nsubj·nsubj-inv; nsubj → partmod → partmod-inv → nsubj-inv; … with unigrams “like”, “playing”
Example – dolls: edge label sequence nsubj → partmod → prep.in; with unigrams “like”, “playing”
Learning methods:
Path-Constrained Graph Walk

PCW (summary): for each node x, learn
  P(x→z : relevant(z) | history(Vq, x))
where History(Vq, x) = the sequence of edge labels leading from Vq to x, with all histories stored in a tree.

[Figure: path tree rooted at Vq = “girls”: nsubj leads to x1; from x1, nsubj-inv reaches boys, while partmod leads to x2; from x2, partmod-inv → nsubj-inv reaches boys, and prep.in leads to x3 and then dolls. The paths for boys are nsubj·nsubj-inv and nsubj → partmod → partmod-inv → nsubj-inv; the path for dolls is nsubj → partmod → prep.in.]
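The history-tree bookkeeping behind PCW can be sketched as follows. This is a toy illustration of storing edge-label histories in a tree and estimating P(relevant | history); the actual training and pruning procedure in the paper is more involved, and all names here are made up.

```python
class PathTree:
    """Tree of edge-label histories from the query nodes Vq."""

    def __init__(self):
        self.children = {}   # edge label -> subtree
        self.relevant = 0    # training walks through this history
        self.total = 0       # ... that ended at a relevant node

    def add(self, history, is_relevant):
        """Record one training path (a sequence of edge labels)."""
        node = self
        for label in history:
            node = node.children.setdefault(label, PathTree())
            node.total += 1
            node.relevant += int(is_relevant)

    def prob(self, history):
        """Estimated P(relevant | history); None if the history is unseen."""
        node = self
        for label in history:
            if label not in node.children:
                return None
            node = node.children[label]
        return node.relevant / node.total if node.total else None
```

A walk can then be constrained by pruning any partial history whose estimated probability falls below a threshold (0.5 in the experiments below).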
City and person name extraction

Corpus statistics:
          words     nodes     edges     NEs             labeling
MUC       140K      82K       244K      3K (true)       complete
MUC+AP    2,440K    1,030K    3,550K    36K (auto)      partial/noisy

City names:   Vq = {sydney, stamford, greenville, los_angeles}
Person names: Vq = {carter, dave_kingman, pedro_ramos, florio}

• 10 (×4) queries for each task
• Train queries q1–q5 / test queries q6–q10
• Extract nodes of type NE
• GW: 6 steps, uniform/learned weights
• Reranking: top 200 nodes (using learned weights)
• Path trees: 20 correct / 20 incorrect; threshold 0.5
MUC

[Figure: precision vs. rank curves (ranks 1–100) for city names and person names, graph walk with uniform weights]
MUC

[Figure: precision vs. rank curves for city names and person names, comparing Graph Walk and Weight Tuning. Top-weighted edge types – city names: conj-and, prep-in, nn, appos, …; person names: subj, obj, poss, nn, …]
MUC

[Figure: precision vs. rank curves for city names and person names, comparing Graph Walk, Weight Tuning, and PCW. Top learned paths – city names: prep-in-inv → conj-and, nn-inv → nn; person names: nsubj → nsubj-inv, appos → nn-inv]
MUC

[Figure: precision vs. rank curves for city names and person names, comparing Graph Walk, Weight Tuning, PCW, and Reranking. Useful lexical features – city names: LEX.“based”, LEX.“downtown”; person names: LEX.“mr”, LEX.“president”]
Vector-space models

• Co-occurrence vectors (counts; window: +/- 2)
• Dependency vectors [Padó & Lapata, Comp. Ling. 07]
  – A path value function:
    • Length-based value: 1 / length(path)
    • Relation-based value: subj-5, obj-4, obl-3, gen-2, else-1
  – Context selection function:
    • Minimal: verbal predicate-argument (length 1)
    • Medium: coordination, genitive construction, noun compounds (<=3)
    • Maximal: combinations of the above (<=4)
  – Similarity function:
    • Cosine
    • Lin

 Only score the top nodes retrieved with reranking (~1000 overall)
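The two path value functions above can be sketched directly. Note one assumption: the slide gives the per-relation scores but not how they are aggregated over a path, so this sketch takes the highest-ranked relation on the path; the function names are illustrative.

```python
# Per-relation scores from the slide: subj-5, obj-4, obl-3, gen-2, else-1.
REL_VALUE = {'subj': 5, 'obj': 4, 'obl': 3, 'gen': 2}

def length_value(path):
    """Length-based value: 1 / length(path)."""
    return 1.0 / len(path)

def relation_value(path):
    """Relation-based value (assumed aggregation: best relation on the path)."""
    return max(REL_VALUE.get(rel, 1) for rel in path)
```

Each context path contributes its value to the co-occurrence count of the (word, context) pair it connects, and the resulting vectors are compared with cosine or Lin similarity.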
GWs – Vector models
MUC

[Figure: precision vs. rank curves for city names and person names, comparing PCW, Rerank, co-occurrence vectors (CO), and dependency vectors (DV)]

 The graph-based methods are best (syntactic + learning)
GWs – Vector models
MUC + AP

[Figure: precision vs. rank curves for city names and person names, comparing Rerank, PCW, DV, and CO]

 The advantage of the graph-based models diminishes with the amount of data.
 This is hard to evaluate at high ranks.
Outline
• Open-domain IE as finding nodes “near”
seeds on a graph
– Approach 1: A “natural” graph derived from
a smaller corpus + learned similarity
“with” Einat Minkov (CMU → Nokia)
– Approach 2: A carefully-engineered graph
derived from huge corpus
“with” Richard Wang (CMU → ?)
Set Expansion for Any Language (SEAL) –
(Wang & Cohen, ICDM 07)
• Basic ideas
– Dynamically build the graph using queries to
the web
– Constrain the graph to be as useful as
possible
• Be smart about queries
• Be smart about “patterns”: use clever methods
for finding meaningful structure on web pages
System Architecture

[Figure: the seeds 1. Canon, 2. Nikon, 3. Olympus are expanded to 4. Pentax, 5. Sony, 6. Kodak, 7. Minolta, 8. Panasonic, 9. Casio, 10. Leica, 11. Fuji, 12. Samsung, 13. …]
• Fetcher: download web pages from the Web
that contain all the seeds
• Extractor: learn wrappers from web pages
• Ranker: rank entities extracted by wrappers
The Extractor
• Learn wrappers from web documents
and seeds on the fly
– Utilize semi-structured documents
– Wrappers defined at character level
• Very fast
• No tokenization required; thus language
independent
• Wrappers derived from doc d applied to d only
– See ICDM 2007 paper for details
<li class="ford"><a href="http://www.curryauto.com/">
<img src="/common/logos/ford/logo-horiz-rgb-lg-dkbg.gif" alt="3"></a>
<ul><li class="last"><a
href="http://www.curryauto.com/">
[slide callout: “I am noise”]
<span class="dName">Curry Ford</span>...</li></ul>
</li>
<li class="honda"><a href="http://www.curryauto.com/">
<img src="/common/logos/honda/logo-horiz-rgb-lg-dkbg.gif" alt="4"></a>
<ul><li><a href="http://www.curryhonda-ga.com/">
<span class="dName">Curry Honda Atlanta</span>...</li>
[slide callout: “Me too!”]
<li><a href="http://www.curryhondamass.com/">
<span class="dName">Curry Honda</span>...</li>
<li class="last"><a href="http://www.curryhondany.com/">
<span class="dName">Curry Honda Yorktown</span>...</li></ul>
</li>
<li class="acura"><a href="http://www.curryauto.com/">
<img src="/curryautogroup/images/logo-horiz-rgb-lg-dkbg.gif" alt="5"></a>
<ul><li class="last"><a href="http://www.curryacura.com/">
<span class="dName">Curry Acura</span>...</li></ul>
</li>
<li class="nissan"><a href="http://www.curryauto.com/">
<img src="/common/logos/nissan/logo-horiz-rgb-lg-dkbg.gif" alt="6"></a>
<ul><li class="last"><a href="http://www.geisauto.com/">
<span class="dName">Curry Nissan</span>...</li></ul>
</li>
<li class="toyota"><a href="http://www.curryauto.com/">
<img src="/common/logos/toyota/logo-horiz-rgb-lg-dkbg.gif" alt="7"></a>
<ul><li class="last"><a href="http://www.geisauto.com/toyota/">
<span class="dName">Curry Toyota</span>...</li></ul>
</li>
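The longest-common-context idea can be sketched at the character level as follows. This is a simplified illustration, not the SEAL extractor itself (see the ICDM 2007 paper for the real algorithm); the helper names and the toy document in the usage note are made up.

```python
import re

def common_suffix(strings):
    """Longest string that every input ends with."""
    s = strings[0]
    while s and not all(x.endswith(s) for x in strings):
        s = s[1:]
    return s

def common_prefix(strings):
    """Longest string that every input starts with."""
    s = strings[0]
    while s and not all(x.startswith(s) for x in strings):
        s = s[:-1]
    return s

def learn_wrapper(doc, seeds):
    """Learn a (left, right) character context shared by all seeds in doc."""
    lefts, rights = [], []
    for seed in seeds:
        i = doc.find(seed)
        if i < 0:
            return None  # every seed must occur in the document
        lefts.append(doc[:i])
        rights.append(doc[i + len(seed):])
    # Left context = common suffix of the text before each seed;
    # right context = common prefix of the text after each seed.
    return common_suffix(lefts), common_prefix(rights)

def apply_wrapper(doc, wrapper):
    """Extract every string bracketed by the learned contexts in doc."""
    left, right = wrapper
    return re.findall(re.escape(left) + '(.+?)' + re.escape(right), doc)
```

On a toy document like `"A: ford\nB: nissan\nC: honda\n"` with seeds `ford` and `honda`, the learned wrapper is `(': ', '\n')`, and applying it to the same document also pulls out `nissan`. Because everything operates on raw characters, no tokenization is needed, which is what makes the approach language-independent.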
The Ranker

• Rank candidate entity mentions based on “similarity” to seeds
  – Noisy mentions should be ranked lower
• Random Walk with Restart (GW)
  – As before…
  – What’s the graph?
Building a Graph

[Figure: the seeds “ford”, “nissan”, “toyota” find documents (northpointcars.com, curryauto.com); each document derives wrappers (#1–#4); the wrappers extract mentions with walk scores such as “acura” 34.6%, “honda” 26.1%, “chevrolet” 22.5%, “volvo chicago” 8.4%, “bmw pittsburgh” 8.4%]

• A graph consists of a fixed set of…
  – Node Types: {seeds, document, wrapper, mention}
  – Labeled Directed Edges: {find, derive, extract}
    • Each edge asserts that a binary relation r holds
    • Each edge has an inverse relation r-1 (graph is cyclic)
  – Intuition: good extractions are extracted by many good wrappers, and good wrappers extract many good extractions.
Evaluation Datasets: closed sets
Evaluation Method

• Mean Average Precision (MAP)
  – Commonly used for evaluating ranked lists in IR
  – Contains recall- and precision-oriented aspects
  – Sensitive to the entire ranking
  – Mean of the average precisions for each ranked list

  AvgPrec(L) = (1 / #TrueEntities) × Σ_r Prec(r) × NewEntity(r)

  where L = ranked list of extracted mentions, r = rank, Prec(r) = precision at rank r, #TrueEntities = total number of true entities in this dataset, and NewEntity(r) = 1 if (a) the extracted mention at r matches any true mention and (b) there exists no other extracted mention at rank less than r that is of the same entity as the one at r; 0 otherwise.

• Evaluation Procedure (per dataset):
  1. Randomly select three true entities and use their first listed mentions as seeds
  2. Expand the three seeds obtained from step 1
  3. Repeat steps 1 and 2 five times
  4. Compute MAP for the five ranked lists
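The average-precision computation above can be sketched as follows (illustrative code, not the authors’ evaluation script). One reading choice is made explicit in the comments: precision here counts the distinct true entities found so far, matching the NewEntity deduplication.

```python
def average_precision(ranked_mentions, mention_to_entity, true_entities):
    """mention_to_entity maps each true mention to its entity; noise is absent."""
    seen, hits, ap = set(), 0, 0.0
    for r, mention in enumerate(ranked_mentions, start=1):
        entity = mention_to_entity.get(mention)
        # NewEntity(r) = 1 iff (a) the mention at r matches a true mention
        # and (b) no earlier-ranked mention named the same entity.
        if entity is not None and entity not in seen:
            seen.add(entity)
            hits += 1            # distinct true entities found so far
            ap += hits / r       # Prec(r) * NewEntity(r)
    return ap / len(true_entities)
```

MAP is then the mean of this quantity over the five ranked lists produced by the evaluation procedure.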
Experimental Results: 3 seeds
Overall MAP vs. Various Methods

[Bar charts: overall MAP for Google Sets (G.Sets, including an English-only setting) and for [Extractor]+[Ranker]+[Top N URLs] combinations (E1+EF+100, E2+EF+100, E2+GW+100/200/300); reported values include 94.18%, 94.03%, 93.13%, 87.61%, 82.39%, 43.76%, and 14.59%]

Vary: [Extractor] + [Ranker] + [Top N URLs]
Extractor:
• E1: Baseline Extractor (longest common context for all seed occurrences)
• E2: Smarter Extractor (longest common context for 1 occurrence of each seed)
Ranker: { EF: Baseline (Most Frequent), GW: Graph Walk }
N URLs: { 100, 200, 300 }
Side by side comparisons
Talukdar, Brants, Liberman, Pereira, CoNLL 06

Side by side comparisons
EachMovie vs WWW – Ghahramani & Heller, NIPS 2005
NIPS vs WWW
A limitation of the original SEAL
Preliminary Study on Seed Sizes

[Figure: mean average precision (75%–85%) vs. number of seeds (2–6) for the ranking methods RW, PR, BS, and WL]
Proposed Solution: Iterative SEAL (iSEAL)
(Wang & Cohen, ICDM 2008)
• Makes several calls to SEAL, each call…
– Expands a couple of seeds
– Aggregates statistics
• Evaluate iSEAL using…
– Two iterative processes
• Supervised vs. Unsupervised (Bootstrapping)
– Two seeding strategies
• Fixed Seed Size vs. Increasing Seed Size
– Five ranking methods
iSEAL (Fixed Seed Size, Supervised)

[Figure: iterative expansion starting from the initial seeds]

• Finally rank nodes by proximity to seeds in the full graph
• Refinement (ISS): Increase the size of the seed set for each expansion over time: 2, 3, 4, 4, …
• Variant (Bootstrap): use high-confidence extractions when seeds run out
Ranking Methods

• Random Graph Walk with Restart
  – H. Tong, C. Faloutsos, and J.-Y. Pan. Fast random walk with restart and its application. In ICDM, 2006.
• PageRank
  – L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. 1998.
• Bayesian Sets (over flattened graph)
  – Z. Ghahramani and K. A. Heller. Bayesian sets. In NIPS, 2005.
• Wrapper Length
  – Weights each item based on the length of the common contextual string of that item and the seeds
• Wrapper Frequency
  – Weights each item based on the number of wrappers that extract the item
Fixed Seed Size (Supervised)

[Figure: mean average precision (89%–98%) vs. number of iterations / cumulative expansions (1–10) for RW, PR, BS, WL, and WF]
Fixed Seed Size (Bootstrap)

[Figure: mean average precision (86%–92%) vs. number of iterations / cumulative expansions (1–10) for RW, PR, BS, WL, and WF]
Increasing Seed Size (Supervised)

[Figure: mean average precision (90%–97%) vs. number of iterations / cumulative expansions (1–10) for RW, PR, BS, WL, and WF]
Increasing Seed Size (Bootstrapping)

[Figure: mean average precision (89%–94%) vs. number of iterations / cumulative expansions (1–10) for RW, PR, BS, WL, and WF]
[Figure: the four MAP-vs-iterations panels shown side by side – Fixed/Increasing Seed Size × Supervised/Bootstrap]

 Little difference between ranking methods for the supervised case (all seeds correct); large differences when bootstrapping.
 Increasing seed size {2, 3, 4, 4, …} makes all ranking methods improve steadily in the bootstrapping case.
Current work
• Start with name of concept (e.g., “NFL
teams”)
• Look for (language-dependent) patterns:
– “… for successful NFL teams (e.g.,
Pittsburgh Steelers, New York Giants, …)”
• Take most frequent answers as seeds
• Run bootstrapping iSEAL with seed sizes
2,3,4,4….
Datasets with concept names
Experimental results
Direct use of text patterns
Summary/Conclusions

• Open-domain IE as finding nodes “near” seeds on a graph

[Figure: the bootstrapping-as-graph-proximity picture from earlier – conference names linked to the sentence contexts that mention them]

shorter paths ~ earlier iterations
many paths ~ additional evidence
Summary/Conclusions

• Open-domain IE as finding nodes “near” seeds on a graph, approach 1:
  – Minkov & Cohen, EMNLP 08
  – Graph ~ dependency-parsed corpus
  – Off-the-shelf distance metrics not great
  – With learning:
    • Results significantly better than state-of-the-art on small corpora (e.g. a personal email corpus)
    • Results competitive on 2M+ word corpora
Summary/Conclusions

• Open-domain IE as finding nodes “near” seeds on a graph, approach 2:
  – Wang & Cohen, ICDM 07, 08
  – Graph built on-the-fly with web queries
    • A good graph matters!
  – Off-the-shelf distance metrics work
    • Differences are minimal for clean seeds
    • Modest improvements from learning with clean seeds – e.g., reranking (not described here)
    • Bigger differences between similarity measures with noisy seeds
Thanks to
• DARPA PAL program
Sponsored links:
– Minkov, Cohen, Wang
http://boowa.com
(Richard’s demo)
• Yahoo! Research Labs
– Minkov
• Google Research Grant program
– Wang
• The organizers for inviting me!