Information Extraction from Wikipedia:
Moving Down the Long Tail
Fei Wu, Raphael Hoffmann, Daniel S. Weld
Department of Computer Science & Engineering
University of Washington
Seattle, WA, USA
Intelligence in Wikipedia:
Fei Wu, Eytan Adar, Saleema Amershi, Oren Etzioni, James
Fogarty, Raphael Hoffmann, Kayur Patel,
Stef Schoenmackers & Michael Skinner
Motivating Vision

Next-Generation Search =
Information Extraction + Ontology + Inference
Which performing artists were born in Chicago?
… Bob was born in Northwestern Memorial Hospital. …
… Bob Black is an active actor who was selected as this year’s …
… Northwestern Memorial Hospital is one of the country’s leading hospitals in Chicago …
Next-Generation Search

Information Extraction




<Bob, Born-In, NMH>
<Bob Black, ISA, actor>
<NMH, In, Chicago>
…
Ontology

Actor ISA Performing Artist
…
Inference

Born-In(X, A) ^ PartOf(A, B)
=> Born-In(X, B)
…
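A minimal sketch of how the three pieces could combine to answer the example question. This is an illustration only, not the system's implementation. It assumes that "Bob" and "Bob Black" have already been resolved to one entity, and that the extracted "in Chicago" fact is stored as a PartOf relation. The function names are hypothetical.

```python
# Toy fact base built from the slide's triples (entity resolution assumed done).
TRIPLES = {
    ("Bob Black", "Born-In", "NMH"),
    ("Bob Black", "ISA", "actor"),
    ("NMH", "PartOf", "Chicago"),
    ("actor", "ISA", "performing artist"),  # ontology fact
}


def infer(facts):
    """Naive forward chaining: apply two rules until no new fact appears."""
    facts = set(facts)
    while True:
        new = set()
        for (x, p, a) in facts:
            for (a2, q, b) in facts:
                if a != a2:
                    continue
                # Born-In(X, A) ^ PartOf(A, B) => Born-In(X, B)
                if p == "Born-In" and q == "PartOf":
                    new.add((x, "Born-In", b))
                # ISA(X, A) ^ ISA(A, B) => ISA(X, B)
                if p == "ISA" and q == "ISA":
                    new.add((x, "ISA", b))
        if new <= facts:
            return facts
        facts |= new


def performing_artists_born_in(place, facts):
    """Answer 'Which performing artists were born in <place>?'."""
    closed = infer(facts)
    return sorted(
        x
        for (x, p, o) in closed
        if p == "Born-In"
        and o == place
        and (x, "ISA", "performing artist") in closed
    )


print(performing_artists_born_in("Chicago", TRIPLES))
```

None of the extracted sentences states the answer directly; it only follows once extraction, the ontology and the inference rule are combined.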
Wikipedia – Bootstrap for the Web


Goal: search over the Web
Now: search over Wikipedia
• Comprehensive
• High-quality
• (Semi-)Structured data
Infoboxes

Infoboxes are designed to present summary
information about an article's subject, such that
similar subjects have a uniform look and a
common format

An infobox is a generalization of a taxobox
(from taxonomy) which summarizes information
for an organism or group of organisms.
Infobox examples
Basic infobox
Taxobox – Plant species
More examples
Infobox People – Actor
Infobox – Convention Center
Outline

Background: Kylin Extraction
Long-Tailed Challenges
  Sparse infobox classes
  Incomplete articles
Moving Down the Long Tails
  Shrinkage
  Retraining
  Extracting from the Web
Problem with Information Extraction
IWP (Intelligence in Wikipedia)
  CCC and IE
  Virtuous Cycle
  IWP (Shrinkage, Retraining and Extracting from Web)
  Multilingual Extraction
Summary
Kylin: Autonomously Semantifying
Wikipedia
• Totally autonomous, with no additional human effort
• Form training dataset based on infoboxes
• Extract semantic relations from Wikipedia articles
Kylin: a mythical hooved Chinese
chimerical creature that is said to appear
in conjunction with the arrival of a sage.
------Wikipedia
Kylin

It is a prototype self-supervised
machine-learning system

It looks for classes of pages with similar
infoboxes

It determines common attributes

It creates training examples
Infobox Generation
Preprocessor
Schema Refinement



Free editing -> schema drift
Duplicate templates: U.S.County (1428), US County (574), Counties (50), County (19)
Low usage of attributes
Duplicate attributes: “Census Yr”, “Census Estimate Yr”, “Census Est.”, “Census Year”
Kylin:
Strict name match
15% occurrences
Preprocessor
• Training Dataset Construction
Clearfield County was created on 1804 from parts
of Huntingdon and Lycoming Counties but was
administered as part of Centre County until 1812.
Its county seat is Clearfield.
2,972 km² (1,147 mi²) of it is land and
17 km² (7 mi²) of it (0.56%) is water.
As of 2005, the population density
was 28.2/km².
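A minimal sketch of the training-set idea, under stated assumptions: a sentence becomes a positive example for an attribute when it contains that attribute's infobox value, and all other sentences stay unlabeled. The attribute names (founded, land_area, water_area), the substring-matching heuristic and the function names are illustrative assumptions, not Kylin's actual matching rules.

```python
import re


def split_sentences(text):
    """Very rough sentence splitter: break after ., ! or ? followed by space."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def build_training_examples(infobox, article_text):
    """Label sentences that mention an infobox value; keep the rest unlabeled."""
    positives, unlabeled = [], []
    for sentence in split_sentences(article_text):
        matched = [
            attr
            for attr, value in infobox.items()
            if value.lower() in sentence.lower()
        ]
        if matched:
            positives.extend((sentence, attr) for attr in matched)
        else:
            unlabeled.append(sentence)
    return positives, unlabeled


# Hypothetical infobox for the article text shown on the slide.
infobox = {"founded": "1804", "land_area": "2,972 km²", "water_area": "17 km²"}
article = (
    "Clearfield County was created on 1804 from parts of Huntingdon and "
    "Lycoming Counties but was administered as part of Centre County until "
    "1812. Its county seat is Clearfield. 2,972 km² (1,147 mi²) of it is "
    "land and 17 km² (7 mi²) of it (0.56%) is water."
)

positives, unlabeled = build_training_examples(infobox, article)
for sentence, attr in positives:
    print(attr, "<-", sentence[:40])
```

Plain substring matching is noisy: a value such as "Clearfield" would also match the first sentence, which is one reason the resulting training set is described as noisy and incomplete.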
Classifier

Document Classifier
List and Category




Fast
Precision(98.5%)
Recall(68.8%)
Sentence Classifier
• Predicts which attribute values are contained in a given sentence.
• Uses a maximum entropy model.
• To cope with the noisy and incomplete training dataset, Kylin applies bagging.
CRF Extractor
Conditional Random Fields Model
Attribute value extraction: sequential data labeling


CRF model for each attribute independently
Relabel: filter false-negative training examples


2,972km²(1,147mi²) of it is land and 17km²(7mi²) of it (0.56%) is water.
Preprocessor: Water_area
Classifier: Water_area; Land_area
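A minimal sketch of what "sequential data labeling" means for one attribute: each token of a sentence receives a label, and the extractor learns to predict those labels. The B/I/O tagging scheme, whitespace tokenization and function name are assumptions made for illustration; the slides only state that one CRF is trained per attribute.

```python
def bio_labels(tokens, value_tokens, attribute):
    """Mark the first occurrence of the attribute value with B-/I- tags."""
    labels = ["O"] * len(tokens)
    n = len(value_tokens)
    if n == 0:
        return labels
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == value_tokens:
            labels[i] = f"B-{attribute}"
            for j in range(i + 1, i + n):
                labels[j] = f"I-{attribute}"
            break
    return labels


sentence = (
    "2,972 km² (1,147 mi²) of it is land and 17 km² (7 mi²) "
    "of it (0.56%) is water."
)
tokens = sentence.split()
labels = bio_labels(tokens, "17 km²".split(), "water_area")

for token, label in zip(tokens, labels):
    print(f"{token:10s} {label}")
```

A sequence model trained on such token/label pairs can then tag unseen sentences, and the tokens tagged B-/I- form the extracted attribute value.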

Though Kylin is successful on popular classes, its performance decreases
on sparse classes where there is insufficient training data.
Long-Tail 1: Sparse Infobox Class

Kylin Performs Well on Popular Classes:
Precision: mid 70% ~ high 90%
Recall: low 50% ~ mid 90%

Kylin Flounders on Sparse Classes – Little Training Data
e.g., for the “US County” class Kylin achieves 97.3% precision and 95.9%
recall, while many other classes, such as “Irish Newspaper”, contain only a
small number of articles with infoboxes
Long-Tail 2: Incomplete Articles

Desired Information Missing from Wikipedia
Among the 1.8 million pages of Wikipedia [July 2007], many are
short articles, and almost 800,000 (44.2%) are marked as
stub pages, indicating that much-needed information is
missing.
Shrinkage

Attempt to improve Kylin’s performance using
shrinkage.

We use shrinkage when training an extractor for an
instance-sparse infobox class by aggregating data
from its parent and children classes
Shrinkage
[McCallum et al., ICML98]
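[Figure: class hierarchy with instance counts: person (1201), performer (44), actor (8738), comedian (106). Attributes shown: person.birth_place, performer.location, and, on the actor and comedian classes, .birthplace, .birth_place, .cityofbirth, .origin]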
Shrinkage

KOG (Kylin Ontology Generator) [Wu & Weld, WWW08]
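[Figure: class hierarchy with instance counts: person (1201), performer (44), actor (8738), comedian (106). Attributes shown: person.birth_place, performer.location, and, on the actor and comedian classes, .birthplace, .birth_place, .cityofbirth, .origin]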
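A minimal sketch of the aggregation step, assuming the ontology already tells us which attribute of each related class corresponds to the target attribute (for example performer.location and person.birth_place). The geometric down-weighting by hierarchy distance, the decay value and all names are illustrative assumptions; the slides do not specify the weighting scheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    sentence: str
    source_class: str
    weight: float


def shrinkage_training_set(target, data, related, decay=0.5):
    """Pool the target's examples with down-weighted examples from relatives.

    data:    class name -> sentences labelled for the mapped attribute
    related: class name -> number of hierarchy hops from the target class
    """
    pooled = [Example(s, target, 1.0) for s in data.get(target, [])]
    for cls, hops in related.items():
        weight = decay ** hops
        pooled.extend(Example(s, cls, weight) for s in data.get(cls, []))
    return pooled


# Toy data: one made-up sentence per class stands in for the real examples.
data = {
    "performer": ["P was born in X."],
    "person": ["Q was born in Y."],
    "actor": ["R was born in Z."],
}
related = {"person": 1, "actor": 1, "comedian": 1}  # parent and children

pooled = shrinkage_training_set("performer", data, related)
for ex in pooled:
    print(ex.source_class, ex.weight)
```

With only 44 performer instances, the pooled set lets the sparse class borrow examples from the much larger person and actor classes while still trusting its own examples most.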
Retraining
Complementary to Shrinkage:
Harvest extra training data from broader Web
Key:
• How to identify relevant sentences in the sea of Web data?
Andrew Murray was born in
Scotland in 1828 ……
<Andrew Murray, was born in, Scotland>
<Andrew Murray, was born in, 1828>
Retraining
Kylin extraction:
t=<Ada Cambridge, location, “St Germans , Norfolk , England”>
Query TextRunner for relevant sentences
TextRunner extractions:
• r1=<Ada Cambridge, was born in, England>
Ada Cambridge was born in England in 1844 and moved to Australia with
her curate husband in 1870.
• r2=<Ada Cambridge, was born in, “Norfolk , England”>
Ada Cambridge was born in Norfolk , England , in 1844 .
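A minimal sketch of how a Kylin tuple could be matched against TextRunner tuples to harvest extra training sentences. The matching rule used here (same first argument, and second argument occurring as a run of tokens inside the infobox value) is an assumption chosen to reproduce the r1 and r2 matches in the example; the third candidate tuple is made up to show a non-match.

```python
def tokens(text):
    """Lower-case and split on whitespace, ignoring commas."""
    return text.replace(",", " ").lower().split()


def contains(haystack, needle):
    """True if `needle` occurs as a contiguous run of tokens in `haystack`."""
    n = len(needle)
    return n > 0 and any(
        haystack[i:i + n] == needle for i in range(len(haystack) - n + 1)
    )


def matching_tuples(kylin_tuple, textrunner_tuples):
    """Keep TextRunner tuples whose arguments agree with the Kylin tuple."""
    title, _attribute, value = kylin_tuple
    value_tokens = tokens(value)
    return [
        (arg1, pred, arg2)
        for (arg1, pred, arg2) in textrunner_tuples
        if tokens(arg1) == tokens(title)
        and contains(value_tokens, tokens(arg2))
    ]


t = ("Ada Cambridge", "location", "St Germans , Norfolk , England")
candidates = [
    ("Ada Cambridge", "was born in", "England"),
    ("Ada Cambridge", "was born in", "Norfolk , England"),
    ("Ada Cambridge", "moved to", "Australia"),  # hypothetical non-match
]

matched = matching_tuples(t, candidates)
predicates = sorted({pred for _, pred, _ in matched})
print(matched)
print(predicates)
```

The sentences behind the matched tuples become additional positive examples for the attribute, and the matched predicates indicate how the attribute is phrased on the Web.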
Effect of Shrinkage & Retraining
1755% improvement
for a sparse class
13.7% improvement
for a popular class
Extraction from the Web


Idea: apply Kylin extractors trained on
Wikipedia to general Web pages
Challenge: maintain high precision

General Web pages are noisy

Many Web pages describe multiple objects

Key: retrieve relevant sentences

Procedure



Generate a set of search engine queries
Retrieve top-k pages from Google
Weight extractions from these pages
Choosing Queries
Example: get birth date attribute for article titled
“Andrew Murray (minister)”
“andrew murray”
“andrew murray” birth date (attribute name)
“andrew murray” was born in (predicates from TextRunner)
“andrew murray” …
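A minimal sketch of the query set from the example: the quoted title alone, the title plus the attribute name, and the title plus each predicate obtained from TextRunner. Stripping the parenthetical disambiguator and lower-casing the title are inferred from the example queries; the function name is hypothetical.

```python
import re


def build_queries(article_title, attribute_name, predicates):
    """Build search-engine queries for one attribute of one article."""
    # Drop a trailing disambiguator such as "(minister)".
    name = re.sub(r"\s*\([^)]*\)\s*$", "", article_title).lower()
    quoted = f'"{name}"'
    queries = [quoted, f"{quoted} {attribute_name}"]
    queries.extend(f"{quoted} {predicate}" for predicate in predicates)
    return queries


for query in build_queries(
    "Andrew Murray (minister)", "birth date", ["was born in"]
):
    print(query)
```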
Weighting Extractions
Which extractions are more relevant?
Features
• Number of sentences between the sentence and the closest occurrence of the title (‘andrew murray’)
• Rank of the page in Google’s result list
• Kylin’s extractor confidence
Web Extraction Experiment


Extractor confidence alone performs poorly
Weighted combination performs best
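A minimal sketch of a weighted combination of the three features. The linear form, the way distance and rank are turned into scores, the default weights and the two toy candidates are all assumptions for illustration; the slides do not say how the combination is computed or how its weights are chosen.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    value: str
    sentence_distance: int  # sentences to the closest mention of the title
    page_rank: int          # 1-based position in the search result list
    confidence: float       # extractor confidence in [0, 1]


def score(c, w_dist=1.0, w_rank=1.0, w_conf=1.0):
    """Linear combination: closer, higher-ranked, more confident is better."""
    closeness = 1.0 / (1 + c.sentence_distance)
    rank_score = 1.0 / c.page_rank
    return w_dist * closeness + w_rank * rank_score + w_conf * c.confidence


def best(candidates, **weights):
    return max(candidates, key=lambda c: score(c, **weights))


# Made-up candidates: A is near the title on a top page, B is confident
# but far from any mention of the title on a low-ranked page.
candidates = [
    Candidate("candidate A", sentence_distance=0, page_rank=1, confidence=0.6),
    Candidate("candidate B", sentence_distance=5, page_rank=8, confidence=0.9),
]

print(best(candidates).value)
print(best(candidates, w_dist=0.0, w_rank=0.0).value)
```

In this toy case, ranking by confidence alone picks candidate B, while the combined score prefers candidate A, which shows how the extra features can change which extraction is chosen.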
Combining Wikipedia & Web
Recall Benefit from Shrinkage / Retraining…
Combining Wikipedia & Web
Benefit from Shrinkage + Retraining + Web
Problem
• Information Extraction is Imprecise
› Wikipedians Don’t Want 90% Precision
• How to Improve Precision?
› People!
Intelligence in Wikipedia

What is IWP?
› A project/system that aims to combine
• IE (Information Extraction)
• CCC (Communal Content Creation)
Information Extraction

Examples:
› Zoominfo.com
› Flipdog.com
› Citeseer
› Google
• Advantage: Autonomy
• Disadvantage: Expensive
IE system contributors

Contributors in this room?
› Wikipedia
IE systems
› Citeseer
› Rexa
› DBlife
Communal Content Creation

Examples
› Wikipedia
› eBay
› Netflix
› Advantage: more accurate than IE
› Disadvantage: bootstrapping, incentives,
and management
Virtuous Cycle
Contributing as a Non-Primary Task
• Encourage contributions
• Without annoying or abusing readers
› Compared 5 different interfaces
Results
• Contribution Rate
• 1.6% → 13%
• 90% of positive labels were correct
IWP and Shrinkage, Retraining, and
Extracting from the Web



Shrinkage – improves IWP’s precision and
recall
Retraining – improves the robustness of
IWP’s extractors
Extraction from the Web – further helps IWP’s performance
Multi-Lingual Extraction


Idea: Further leverage the virtuous feedback
cycle

Utilize IE methods to add or update missing
information by copying from one language to
another

Utilize CCC to validate and improve updates.
Example

Nombre = “Jerry Seinfeld” and Name = “Jerry
Seinfeld”

Cónyuge = “Jessica Sklar” and Spouse = “Jessica
Seinfeld”
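A minimal sketch of the cross-language comparison: values that agree need no action, values missing in one language can be proposed as additions, and conflicting values can be handed to human contributors for validation. The attribute mapping and function name are hypothetical, and the sketch assumes attribute names have already been aligned across languages.

```python
# Hypothetical alignment between Spanish and English attribute names.
ATTRIBUTE_MAP = {"Nombre": "Name", "Cónyuge": "Spouse"}


def compare_infoboxes(es, en, mapping):
    """Split aligned attributes into agreeing, conflicting and missing."""
    agree, conflict, missing = {}, {}, {}
    for es_attr, en_attr in mapping.items():
        if es_attr not in es:
            continue
        es_value = es[es_attr]
        en_value = en.get(en_attr)
        if en_value is None:
            missing[en_attr] = es_value       # candidate to copy over
        elif en_value == es_value:
            agree[en_attr] = en_value
        else:
            conflict[en_attr] = (es_value, en_value)  # ask a human
    return agree, conflict, missing


es_box = {"Nombre": "Jerry Seinfeld", "Cónyuge": "Jessica Sklar"}
en_box = {"Name": "Jerry Seinfeld", "Spouse": "Jessica Seinfeld"}

agree, conflict, missing = compare_infoboxes(es_box, en_box, ATTRIBUTE_MAP)
print(agree)
print(conflict)
print(missing)
```

In the Jerry Seinfeld example the names agree, while the spouse values differ between the two language editions, so that attribute would be routed to contributors for validation.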
Summary

Kylin’s initial performance is unacceptable
• Methods for increasing recall
• Shrinkage
• Retraining
• Extraction from the Web
Summary

IWP – developing AI methods to facilitate the
growth, operation and use of Wikipedia
• Initial goal – extraction of a giant knowledge base of semantic triples
• Faceted browsing
• Input to a reasoning-based question-answering system
• How
• IE
• CCC
Questions