Term Informativeness for
Named Entity Detection
Jason D. M. Rennie
MIT
Tommi Jaakkola
MIT
Information Extraction
President Bush signed the Central America
Free Trade Agreement into law Tuesday…
Who
What
When
Named Entity Detection
President Bush signed the Central America
Free Trade Agreement into law Tuesday,
hailing the seven-nation pact as an open-door policy that will benefit U.S. exporters
and seed prosperity and democracy in
Central America and the Dominican
Republic.
Informal Communication
• Other Sources of Information
– E-mail
– Web Bulletin Boards
– Mailing Lists
• More specialized, up-to-date information
• But, harder to extract
IE for Informal Comm.
SUBJECT: Two New Ipswich Seafood Joints
to Open Soon.
ALL HOUNDS ON DECK! #1 Across from the
new HS, at the old White Cap Seafood is a
renovated new joint and the sign says "Salt
Box". I suspect they are opening soon; they
look ready. Lets hope its great as there is too
much 'just average' around here. #2: In the…
NED for Informal Comm.
Subject: finale harvard square
has anyone been to the recently opened
finale in harvard square?
Restaurant Bulletin Board
• Gathered from a Restaurant BBoard
– 6 sets of ~100 posts
– 132 threads
– Applied Ratnaparkhi’s POS tagger
– Hand-labeled each token In/Out of restaurant
name
Detecting Named Entities
[Figure labels: Named Entity, Informative, Bursty]
Quantifying Informativeness
Document 1
the
Document 2
Brazil
Document 3
clandestine
A Little History…
Z-measure [Brookes,1968]
Inverse Doc. Freq. [Jones,1973]
xI [Bookstein & Swanson, 1974]
Residual IDF [Church & Gale, 1995]
Gain [Papineni, 2001]
Main Idea
• Informative words are:
– Rare (IDF)
– Modal (Mixture Score)
• Rarity and Modality are independent
qualities
• We quantify informativeness using a
product of IDF and Mixture Score
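A minimal sketch of the product score, assuming the common definition IDF = -log(df / N), where df is the number of documents containing the term and N is the total number of documents. The function names and the example numbers are illustrative and not taken from the slides; the mixture score is treated here as a given input.

```python
import math

def idf(doc_freq, num_docs):
    """Inverse document frequency, -log(df / N). Assumes 0 < doc_freq <= num_docs."""
    return -math.log(doc_freq / num_docs)

def informativeness(doc_freq, num_docs, mixture_score):
    """Product of rarity (IDF) and modality (mixture score)."""
    return idf(doc_freq, num_docs) * mixture_score

# Hypothetical term: appears in 10 of 1000 documents, with mixture score 2.0.
score = informativeness(10, 1000, 2.0)
```

A term found in every document gets IDF 0, so its product score is 0 regardless of its mixture score.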
Binomial Distribution
Term Frequency Distributions
“the”: 7, 4, 8, 5, 6
“Brazil”: 0, 0, 0, 5, 0
Mixture Models
Example counts: 5, 0
λ = 90%, θ1 = 0.1%
1 − λ = 10%, θ2 = 5%
Modality
• Modal words fit a mixture much better than
a single binomial
• We separately fit the binomial and mixture
models to each term frequency distribution
• We quantify modality by comparing the
fitness of the two models
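The two models being compared can be written out as follows. This is a sketch in assumed notation, not the slides' own: h_j is the term's count in document j, n_j is the length of document j, θ is the rate of the single binomial, and λ, θ1, θ2 are the mixture weight and the two component rates.

```latex
P_{\mathrm{bin}}(h \mid \theta)
  = \prod_{j} \binom{n_j}{h_j}\, \theta^{h_j} (1-\theta)^{n_j - h_j}

P_{\mathrm{mix}}(h \mid \lambda, \theta_1, \theta_2)
  = \prod_{j} \binom{n_j}{h_j}
    \left[ \lambda\, \theta_1^{h_j} (1-\theta_1)^{n_j - h_j}
         + (1-\lambda)\, \theta_2^{h_j} (1-\theta_2)^{n_j - h_j} \right]
```

A word such as “the”, with a steady rate in every document, gains little from the second component; a word that is absent from most documents and frequent in a few gains a lot.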
Learning Mixture Parameters
• Use gradient descent to learn λ, θ1, θ2
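A rough sketch of fitting a two-component binomial mixture by gradient steps. Everything specific here is an assumption for illustration: the logit parameterization, the finite-difference gradients, the step-halving rule, the starting point, and the document length of 100 tokens. Maximizing the log-likelihood, as done below, is the same as gradient descent on its negative. This is not the authors' implementation.

```python
import math

def to_prob(x):
    """Map an unconstrained value to (0, 1); clamped so the logs below stay finite."""
    x = max(-30.0, min(30.0, x))
    return 1.0 / (1.0 + math.exp(-x))

def mixture_loglik(params, counts, lengths):
    """Log-likelihood (up to a constant) of a two-component binomial mixture.
    params holds unconstrained versions of (mixing weight, rate 1, rate 2)."""
    lam, t1, t2 = (to_prob(p) for p in params)
    total = 0.0
    for h, n in zip(counts, lengths):
        a = math.log(lam) + h * math.log(t1) + (n - h) * math.log(1 - t1)
        b = math.log(1 - lam) + h * math.log(t2) + (n - h) * math.log(1 - t2)
        m = max(a, b)  # log-sum-exp for numerical stability
        total += m + math.log(math.exp(a - m) + math.exp(b - m))
    return total

def numeric_grad(f, x, eps=1e-5):
    """Central finite-difference gradient of f at x."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

def fit_mixture(counts, lengths, init=(0.0, -5.0, -3.0), steps=200):
    """Gradient ascent; a step is kept only if it improves the log-likelihood."""
    f = lambda p: mixture_loglik(p, counts, lengths)
    x, step = list(init), 0.1
    best = f(x)
    for _ in range(steps):
        g = numeric_grad(f, x)
        cand = [xi + step * gi for xi, gi in zip(x, g)]
        val = f(cand)
        if val > best:
            x, best = cand, val
        else:
            step *= 0.5  # overshoot: retry with a smaller step
    lam, t1, t2 = (to_prob(p) for p in x)
    return lam, t1, t2, best

# "Brazil"-style counts: absent from four documents, five occurrences in one.
counts = [0, 0, 0, 5, 0]
lengths = [100] * 5  # assumed document lengths
lam, theta1, theta2, loglik = fit_mixture(counts, lengths)
```

Because a step is accepted only when it raises the log-likelihood, the returned value is never worse than the starting point, although it may be a local optimum.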
Comparing Fitness
• Use log-odds to compare fitness of the two
models
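One way to write the log-odds comparison, as a sketch in assumed notation: h is the term's vector of per-document counts, θ is the single binomial's rate, and λ, θ1, θ2 are the mixture's weight and component rates, each model evaluated at its fitted parameters.

```latex
\text{score}(h)
  = \log \frac{\max_{\lambda, \theta_1, \theta_2} P_{\mathrm{mix}}(h \mid \lambda, \theta_1, \theta_2)}
              {\max_{\theta} P_{\mathrm{bin}}(h \mid \theta)}
  = \log P_{\mathrm{mix}}(h \mid \hat\lambda, \hat\theta_1, \hat\theta_2)
  - \log P_{\mathrm{bin}}(h \mid \hat\theta)
```

Setting θ1 = θ2 reduces the mixture to a single binomial, so with exact maximization the score is never negative; it grows as the mixture explains the counts better than one binomial can.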
Top Mixture Score Words
Token     Score    Rest. Occur.
sichaun   99.62    31/52
fish      50.59    7/73
was       48.79    0/483
speed     44.69    16/19
tacos     43.77    4/19
Independence
Rareness (IDF) ? Modality (Mixture Score)
Correlation Coefficient
Score Pair      Corr. Coefficient
IDF/Mixture     -.0139
IDF/RIDF        .4113
Mixture/RIDF    .7380
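A small sketch of how a correlation between two score lists could be computed, assuming the Pearson coefficient (the slides do not say which coefficient was used). The function name and the toy inputs are illustrative.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists.
    Assumes neither list is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy scores for three tokens: perfectly aligned, then perfectly opposed.
aligned = pearson([1, 2, 3], [2, 4, 6])
opposed = pearson([1, 2, 3], [3, 2, 1])
```

A value near 0 means the two scores carry largely unrelated information about the tokens.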
Top Words Overlap Plot
• Two sorted lists
– Sorted by IDF
– Sorted by Mixture Score
• Look at % overlap among top N in both
lists
• Plot % overlap as we vary N
• Independent scores would produce line
along diagonal
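The overlap computation described above can be sketched as follows. The function name and the toy scores are hypothetical; ties are broken by sort order. For two unrelated rankings of V words, the expected overlap among the top N is N/V of the list, which is why independent scores trace the diagonal.

```python
def top_n_overlap(scores_a, scores_b, n):
    """Percent of the top-n tokens under scores_a that are also top-n under scores_b.
    Both arguments map token -> score over the same vocabulary."""
    top_a = set(sorted(scores_a, key=scores_a.get, reverse=True)[:n])
    top_b = set(sorted(scores_b, key=scores_b.get, reverse=True)[:n])
    return 100.0 * len(top_a & top_b) / n

# Toy vocabulary of four tokens with two different scorings.
scores_a = {"a": 4, "b": 3, "c": 2, "d": 1}
scores_b = {"a": 1, "b": 4, "c": 3, "d": 2}

# One point per N gives the curve to plot.
curve = [top_n_overlap(scores_a, scores_b, n) for n in range(1, 5)]
```

At N equal to the vocabulary size the overlap is always 100%, so every curve ends at the same corner.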
Overlap Plot
[Plot: percent overlap (y-axis) against number of top words (x-axis); curves for IDF/RIDF and IDF/Mixture]
Top IDF*Mixture Words
Token     Score     Rest. Occur.
sichaun   379.97    31/52
villa     197.08    10/11
tokyo     191.72    7/11
ribs      181.57    0/13
speed     156.23    16/19
Intro to NED Experiments
• Task: Identify Restaurant Names
• Use standard NED features (capitalization,
punctuation, POS) as “Baseline”
• Add informativeness score as an
additional feature
• Use F1 Breakeven as performance metric
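A sketch of one common way to compute an F1 breakeven point, offered as an assumption since the slides do not spell out the procedure: rank tokens by classifier score and label the top k as positive, where k is the number of true positives in the data. At that cutoff precision and recall are equal, so F1 equals both. The function name and toy data are illustrative, and ties in score are ignored.

```python
def f1_breakeven(scores, labels):
    """F1 at the cutoff where precision equals recall.
    scores: classifier scores, higher means more likely a restaurant-name token.
    labels: 1 for tokens inside a name, 0 otherwise. Assumes at least one 1."""
    k = sum(labels)  # number of true positive tokens
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = sum(label for _, label in ranked[:k])
    # Predicting exactly k positives gives precision = recall = true_pos / k.
    return true_pos / k

# Toy example: two of four tokens are in names; one of them ranks in the top two.
example = f1_breakeven([0.9, 0.8, 0.7, 0.1], [1, 0, 1, 0])
```

A single number of this kind makes feature sets directly comparable without choosing a decision threshold for each one.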
NED Experiments
Feature Set     F1 Breakeven
Baseline        55.0%
IDF             56.0%
Mixture         56.0%
IDF,Mixture     56.9%
Residual IDF    57.4%
IDF*RIDF        58.5%
IDF*Mixture     59.3%
Summary
• Traditional syntax-based features are not
enough for IE in e-mail & bulletin boards
• We used term occurrence statistics to
construct an informativeness score
(IDF*Mixture)
• We found IDF*Mixture to be useful for
identifying topic-centric words and named
entities
Discussion
• Phrases
• Foreign languages, Speech
• Co-reference resolution, context tracking
• Collaborative filtering