Extracting Lexical Reference Rules from Wikipedia
Eyal Shnarch, Libby Barak, Ido Dagan
Bar Ilan University, Israel
Motivation: lexical inference
• Question Answering:
“Which British luxury car was the best selling of 2008?”
• Information Retrieval:
[Figure: a query for "The Beatles" should also match documents mentioning
Abbey Road or George Harrison]
Lexical reference
• Lexical Reference (LR) – a term in a text implies a concrete
  reference to the meaning of a target term (Glickman et al., 2006)
  – Narrower than similarity / lexical association
  – Wider than the common lexical relations
    • synonymy, hyponymy, meronymy, etc.
[Figure: "Abbey Road", "Yellow Submarine", and "Sgt. Pepper" all lexically
refer to "The Beatles"]
• Classical LR resources:
  – WordNet: covers dictionary knowledge; costly to build; partial coverage
  – Distributional similarity: low precision
  – Indicative (Hearst) patterns: limited coverage, mostly IS-A
• Current status: existing resources do not cover the full scope of LR
  (Mirkin et al., 2009)
Goals
• Automatically learn LR rules
  – from a knowledge-base created for human consumption (Wikipedia)
  – focusing on generic information elements available in any Web resource
  – without utilizing Wikipedia-specific features:
    • info-boxes, category tags, list pages, disambiguation pages
  – resulting in a publicly available rule-base
• Improve inference applications by applying LR rules
  – application-oriented evaluation,
    as opposed to judging rule correctness in isolation from real utility
Extraction methods
• Be-complement – a noun in the position of a complement of the verb 'be'
  in the definition sentence
• All-nouns – all nouns in the definition sentence
• Redirect – various alternative terms redirecting to the canonical title
• Parenthesis – the disambiguation term in parentheses in the page title
• Link – hyperlinks in the entire page
(a toy sketch of these methods follows below)
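The sketch below illustrates the kind of rule each method produces; all
function names and inputs are illustrative stand-ins, not the paper's actual
code, and the Link rule direction is an assumption.

```python
# Toy sketch of the five extraction methods; each yields (LHS, RHS) rules.
import re

def redirect_rules(redirect_sources, title):
    """Redirect: alternative titles redirecting to the canonical title."""
    return [(src, title) for src in redirect_sources]

def parenthesis_rules(title):
    """Parenthesis: the disambiguation term in the title, e.g.
    'Graduation (album)' yields Graduation -> album."""
    m = re.match(r"(?P<lhs>.+?)\s*\((?P<rhs>[^)]+)\)\s*$", title)
    return [(m.group("lhs"), m.group("rhs"))] if m else []

def be_complement_rules(title, be_complement_nouns):
    """Be-complement: nouns parsed as complements of 'be' in the
    definition sentence ('Lyon is a city ...')."""
    return [(title, noun) for noun in be_complement_nouns]

def all_nouns_rules(title, definition_nouns):
    """All-nouns: every noun in the definition sentence (high coverage,
    lower precision; ranked later by path score)."""
    return [(title, noun) for noun in definition_nouns]

def link_rules(title, linked_titles):
    """Link: hyperlink targets in the page (rule direction assumed)."""
    return [(title, tgt) for tgt in linked_titles]

print(parenthesis_rules("Graduation (album)"))      # [('Graduation', 'album')]
print(be_complement_rules("Lyon", ["city"]))        # [('Lyon', 'city')]
print(all_nouns_rules("Lyon", ["city", "France"]))  # Lyon -> city, Lyon -> France
```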
Output analysis
• 8 million candidate rules learned
  – mostly Named Entities (as expected), but also many common words/terms
• Manual analysis:
  – 800 rules sampled and annotated for LR (Kappa 0.7)
Extraction method | Precision % | Est. # correct rules
Redirect          | 87          | 1.8M   (33%)
Be-complement     | 78          | 1.6M   (29%)
Parenthesis       | 71          | 0.09M   (2%)
Link              | 70          | 0.5M    (9%)
All-nouns         | 49          | 1.5M   (27%)
Total             |             | 5.5M  (100%)
Interesting relations in All-nouns
Relation   | Rule                                | Text
Location   | Lyon → France                       | Lyon, city in France
Occupation | Thomas H. Cormen → computer science | Thomas H. Cormen, professor of computer science
Creation   | The Da Vinci Code → Dan Brown       | The Da Vinci Code, novel by Dan Brown
Origin     | Willem van Aelst → Dutch            | Willem van Aelst, Dutch artist
Alias      | Dean Moriarty → Benjamin Linus      | Dean Moriarty, alias of Benjamin Linus on Lost
Spelling   | Egushawa → Agushaway                | Egushawa, also spelled Agushaway...
Need to better utilize this method
Ranking All-nouns rules
• Nouns in a definition vary in their likelihood to be referred to by
  the title
  – depends greatly on the syntactic path connecting the title and the noun
[Figure: dependency tree of a definition sentence, e.g. "<title> is a film
directed by <noun>", with paths such as subj and by-subj/pcomp-n connecting
the title to each noun]
• Unsupervised reference likelihood score for a path p:

  $\mathrm{score}(p) = \dfrac{\mathrm{Count}_{\mathrm{Title}\rightarrow\mathrm{Noun}}(p)}{\mathrm{Count}_{\mathrm{all}}(p)}$

  where the numerator counts instances of p connecting a title to a noun in
  its definition, and the denominator counts all instances of p
• Use score to rank All-nouns rules
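A minimal sketch of this ranking step; it assumes the two path counts have
already been collected over a parsed corpus, and all path names, counts, and
rules below are illustrative.

```python
# Rank All-nouns rules by the reference likelihood score of their paths.
from collections import Counter

count_title_to_noun = Counter({"subj": 700, "subj,directed,pcomp-n": 10})
count_all = Counter({"subj": 1000, "subj,directed,pcomp-n": 50})

def score(path: str) -> float:
    """score(p) = Count_{Title->Noun}(p) / Count_all(p)."""
    return count_title_to_noun[path] / count_all[path] if count_all[path] else 0.0

# Hypothetical All-nouns rules, each tagged with its connecting path
rules = [("Avatar", "film", "subj"),
         ("Avatar", "Cameron", "subj,directed,pcomp-n")]

# Higher-scoring paths yield more reliable rules (top/middle/bottom tiers)
ranked = sorted(rules, key=lambda rule: score(rule[2]), reverse=True)
print(ranked)  # ('Avatar', 'film', ...) ranks above ('Avatar', 'Cameron', ...)
```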
All-nouns analysis
                   |        Per method                  |      Accumulated
Extraction method  | Precision % | Est. # correct rules | Precision % | Correct rules %
Redirect           | 87          | 1,851,384            | 87          | 31
Be-complement      | 78          | 1,618,913            | 82          | 60
Parenthesis        | 71          | 94,155               | 82          | 60
Link               | 70          | 485,528              | 80          | 68
All-nouns (top)    | 60          | 684,238              | 76          | 83
All-nouns (middle) | 46          | 380,572              | 72          | 90
All-nouns (bottom) | 41          | 515,764              | 66          | 100
(Accumulated columns: each method unioned with all rows above it.)
Error analysis
Breakdown of incorrect rules by error type:
• Wrong NP part – 35%
• Related but not Referring – 16%
• All-N pattern errors – 13%
• Transparent head – 11%
• Technical errors – 10%
• Dates and Places – 5%
• Link errors – 5%
• Redirect errors – 5%
Improve precision – rule filtering
• Incorrect rules tend to relate terms that are unlikely to co-occur
• Filter rules by a threshold on the Dice coefficient:

  $\mathrm{Dice}(LHS,RHS) = \dfrac{2\cdot\mathrm{count}(LHS,RHS)}{\mathrm{count}(LHS)+\mathrm{count}(RHS)}$

  *Subset sum problem → cryptography        magic → cryptography
• Partially overcome the Wrong NP part error by adjusting Dice to discount
  co-occurrences in which the RHS appears only as part of a larger NP:

  $\mathrm{Dice}_{adj}(LHS,RHS) = \dfrac{2\cdot[\mathrm{count}(LHS,RHS)-\mathrm{count}(LHS,NP(RHS))]}{\mathrm{count}(LHS)+\mathrm{count}(RHS)}$

  *aerial tramway → car        aerial tramway → cable car
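A minimal sketch of this filter, assuming corpus co-occurrence counts are
available; the threshold and all counts below are illustrative, not the
paper's values.

```python
# Dice-based rule filtering with the Wrong-NP-part adjustment.

def dice(c_lhs, c_rhs, c_both):
    """Dice(LHS,RHS) = 2*count(LHS,RHS) / (count(LHS) + count(RHS))."""
    return 2.0 * c_both / (c_lhs + c_rhs)

def adjusted_dice(c_lhs, c_rhs, c_both, c_np_rhs=0):
    """Discount co-occurrences where the RHS appears only inside a larger
    NP, e.g. 'car' inside 'cable car'."""
    return 2.0 * (c_both - c_np_rhs) / (c_lhs + c_rhs)

THRESHOLD = 0.01  # illustrative; a real threshold would be tuned

def keep_rule(c_lhs, c_rhs, c_both, c_np_rhs=0):
    return adjusted_dice(c_lhs, c_rhs, c_both, c_np_rhs) >= THRESHOLD

# 'magic -> cryptography': the terms rarely co-occur, so the rule is dropped
print(keep_rule(c_lhs=5000, c_rhs=2000, c_both=3))                  # False
# 'aerial tramway -> car': co-occurrences happen mostly within 'cable car',
# so the adjusted score collapses and the rule is dropped
print(keep_rule(c_lhs=200, c_rhs=90000, c_both=150, c_np_rhs=140))  # False
# 'aerial tramway -> cable car': strong direct co-occurrence, rule kept
print(keep_rule(c_lhs=200, c_rhs=400, c_both=150))                  # True
```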
Two task-based evaluations
• Unsupervised Text Categorization:
  – 20 Newsgroups collection
  – Given a category name, expand it with LR rules whose right-hand side
    matches the category name:
    cryptology, cryptographic, cryptographer, decrypt, adversary,
    certificate, digital signature, cipher → Cryptography
  – Compare each document to the expanded category name
    • cosine similarity score
    • classify to the best-scoring category (single-class classification)
• Recognizing Textual Entailment (RTE)
  – usage within an inference engine (Bar-Haim et al., 2008)
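A minimal sketch of the unsupervised TC setup just described; the expansion
lists and the toy document below are illustrative, and the real system
builds vectors from the full 20 Newsgroups documents.

```python
# Expand each category name with LHS terms of rules whose RHS matches it,
# then assign each document to the most cosine-similar expanded category.
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

expansions = {
    "cryptography": ["cryptology", "cryptographer", "decrypt",
                     "cipher", "certificate"],
    "medicine": ["doctor", "physician", "treatment", "clinical"],
}

def classify(doc_tokens, categories):
    doc_vec = Counter(doc_tokens)
    return max(categories,
               key=lambda c: cosine(doc_vec, Counter([c] + expansions.get(c, []))))

doc = ["the", "cipher", "uses", "a", "certificate", "to", "decrypt", "messages"]
print(classify(doc, ["cryptography", "medicine"]))  # -> cryptography
```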
Wikipedia’s contribution
• Rule-base utility in the TC system – sample category expansions:
  – opposition, coalition, whip → Politics
  – key exchange, certificate, cryptosystem, digital signature → Cryptography
  – PowerBook, Radius, Grap → Mac
  – heaven, creation, belief, missionary → Religion
  – doctor, physician, treatment, clinical, MD → Medicine
• Rule-base utility in the RTE system – sample rules applied:
  – Michael Crichton → Jurassic Park
  – Gulf Cooperation Council → GCC
Results: text categorization
Rule base                         | Recall % | Precision % | F1
--- Baselines ---
No Expansions                     | 19       | 54          | 28
Kazama & Torisawa, 2007           | 19       | 53          | 28
Snow400K                          | 19       | 54          | 28
Lin dependency similarity         | 25       | 39          | 30
WordNet                           | 30       | 47          | 37
--- Extraction methods from Wikipedia ---
Redirect + Be-complement          | 22       | 55          | 31
All rules                         | 31       | 38          | 34
All rules + Dice                  | 31       | 49          | 38
--- Union ---
WordNet + Wiki (all rules + Dice) | 35       | 47          | 40
• Wikipedia's performance is comparable to WordNet
• The union works best (the resources are complementary)
Results: RTE
System configuration | Accuracy % | Accuracy drop %
WordNet + Wikipedia  | 60.0       | –
without Wikipedia    | 58.9       | 1.1
without WordNet      | 57.7       | 2.3
– For comparison, external knowledge resources typically contribute around
  0.5–2% accuracy in current RTE systems
  (Iftene and Balahur-Dobrescu, 2007; Dinu and Wang, 2009)
Conclusions
• Large-scale resource of lexical reference rules
  – proven beneficial within two application settings
• Automatically built resource is comparable to WordNet and
  provides complementary knowledge
  – a combination of the resources is much more effective than each alone
• Future work:
  – improve the rule ranking criteria
    (e.g. Adi Shamir → Cryptographer → Cryptography)
  – exploit the graph structure
• Use our resource (and cite us ☺) – it will soon be publicly available
  – check the textual entailment resource pool (in the ACL wiki)