SIMS 290-2:
Applied Natural Language Processing
Marti Hearst
October 11, 2004
1
Classifying at Different Granularies
Text Categorization:
Classify an entire document
Information Extraction (IE):
Identify and classify small units within documents
Named Entity Extraction (NE):
A subset of IE
Identify and classify proper names
– People, locations, organizations
2
Example: The Problem:
Looking for a Job
Martin Baker, a person
Genomics job
Employers job posting form
Adapted from slide by William Cohen
3
Example: A Solution: An
aggregator Web Site
Adapted from slide by William Cohen
4
Extracting Job Openings from the Web
foodscience.com-Job2
JobTitle: Ice Cream Guru
Employer: foodscience.com
JobCategory: Travel/Hospitality
JobFunction: Food Services
JobLocation: Upper Midwest
Contact Phone: 800-488-2611
DateExtracted: January 8, 2001
Source: www.foodscience.com/jobs_midwest.htm
OtherCompanyJobs: foodscience.com-Job1
Adapted from slide by William Cohen
5
Adapted from slide by William Cohen
6
Category = Food Services
Keyword = Baker
Location = Continental U.S.
Job Openings:
Data Mining the Extracted Job Information
Adapted from slide by William Cohen
7
IE from Research Papers: CiteSeer
Adapted from slide by William Cohen
8
What is Information Extraction
As a task:
Filling slots in a database from sub-segments of text.
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill
Gates railed against the economic philosophy
of open-source software with Orwellian fervor,
denouncing its communal licensing as a
"cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the opensource concept, by which software code is
made public to encourage improvement and
development by outside programmers. Gates
himself says Microsoft will gladly disclose its
crown jewels--the coveted code behind the
Windows operating system--to select
customers.
NAME
TITLE
ORGANIZATION
"We can be open source. We love the concept
of shared source," said Bill Veghte, a
Microsoft VP. "That's a super-important shift
for us in terms of code access.“
Richard Stallman, founder of the Free
Software Foundation, countered saying…
Adapted from slide by William Cohen
9
What is Information Extraction
As a task:
Filling slots in a database from sub-segments of text.
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill
Gates railed against the economic philosophy
of open-source software with Orwellian fervor,
denouncing its communal licensing as a
"cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the opensource concept, by which software code is
made public to encourage improvement and
development by outside programmers. Gates
himself says Microsoft will gladly disclose its
crown jewels--the coveted code behind the
Windows operating system--to select
customers.
IE
NAME
Bill Gates
Bill Veghte
Richard Stallman
TITLE
ORGANIZATION
CEO
Microsoft
VP
Microsoft
founder Free Soft..
"We can be open source. We love the concept
of shared source," said Bill Veghte, a
Microsoft VP. "That's a super-important shift
for us in terms of code access.“
Richard Stallman, founder of the Free
Software Foundation, countered saying…
Adapted from slide by William Cohen
10
What is Information Extraction
As a family
of techniques:
Information Extraction =
segmentation + classification + association
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill
Gates railed against the economic philosophy
of open-source software with Orwellian fervor,
denouncing its communal licensing as a
"cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the opensource concept, by which software code is
made public to encourage improvement and
development by outside programmers. Gates
himself says Microsoft will gladly disclose its
crown jewels--the coveted code behind the
Windows operating system--to select
customers.
"We can be open source. We love the concept
of shared source," said Bill Veghte, a
Microsoft VP. "That's a super-important shift
for us in terms of code access.“
Microsoft Corporation
CEO
Bill Gates
Microsoft
aka “named entity
Gates
extraction”
Microsoft
Bill Veghte
Microsoft
VP
Richard Stallman
founder
Free Software Foundation
Richard Stallman, founder of the Free
Software Foundation, countered saying…
Adapted from slide by William Cohen
11
What is Information Extraction
A family
of techniques:
Information Extraction =
segmentation + classification + association
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill
Gates railed against the economic philosophy
of open-source software with Orwellian fervor,
denouncing its communal licensing as a
"cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the opensource concept, by which software code is
made public to encourage improvement and
development by outside programmers. Gates
himself says Microsoft will gladly disclose its
crown jewels--the coveted code behind the
Windows operating system--to select
customers.
"We can be open source. We love the concept
of shared source," said Bill Veghte, a
Microsoft VP. "That's a super-important shift
for us in terms of code access.“
Microsoft Corporation
CEO
Bill Gates
Microsoft
Gates
Microsoft
Bill Veghte
Microsoft
VP
Richard Stallman
founder
Free Software Foundation
Richard Stallman, founder of the Free
Software Foundation, countered saying…
Adapted from slide by William Cohen
12
What is Information Extraction
A family
of techniques:
Information Extraction =
segmentation + classification + association
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill
Gates railed against the economic philosophy
of open-source software with Orwellian fervor,
denouncing its communal licensing as a
"cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the opensource concept, by which software code is
made public to encourage improvement and
development by outside programmers. Gates
himself says Microsoft will gladly disclose its
crown jewels--the coveted code behind the
Windows operating system--to select
customers.
"We can be open source. We love the concept
of shared source," said Bill Veghte, a
Microsoft VP. "That's a super-important shift
for us in terms of code access.“
Microsoft Corporation
CEO
Bill Gates
Microsoft
Gates
Microsoft
Bill Veghte
Microsoft
VP
Richard Stallman
founder
Free Software Foundation
Richard Stallman, founder of the Free
Software Foundation, countered saying…
Adapted from slide by William Cohen
13
IE in Context
Create ontology
Spider
Filter by relevance
IE
Segment
Classify
Associate
Cluster
Load DB
Document
collection
Train extraction models
Label training data
Adapted from slide by William Cohen
Database
Query,
Search
Data mine
14
Landscape of IE Tasks:
Degree of Formatting
Text paragraphs
without formatting
Grammatical sentences
and some formatting & links
Astro Teller is the CEO and co-founder of
BodyMedia. Astro holds a Ph.D. in Artificial
Intelligence from Carnegie Mellon University,
where he was inducted as a national Hertz fellow.
His M.S. in symbolic and heuristic computation
and B.S. in computer science are from Stanford
University. His work in science, literature and
business has appeared in international media from
the New York Times to CNN to NPR.
Non-grammatical snippets,
rich formatting & links
Adapted from slide by William Cohen
Tables
15
Landscape of IE Tasks:
Intended Breadth of Coverage
Web site specific
Formatting
Amazon.com Book Pages
Adapted from slide by William Cohen
Genre specific
Layout
Resumes
Wide, non-specific
Language
University Names
16
Landscape of IE Tasks”
Complexity
Closed set
Regular set
U.S. states
U.S. phone numbers
He was born in Alabama…
Phone: (413) 545-1323
The big Wyoming sky…
The CALD main office can be
reached at 412-268-1299
Complex pattern
U.S. postal addresses
University of Arkansas
P.O. Box 140
Hope, AR 71802
Headquarters:
1128 Main Street, 4th Floor
Cincinnati, Ohio 45210
Ambiguous patterns,
needing context and
many sources of evidence
Person names
…was among the six houses
sold by Hope Feldman that year.
Pawel Opalinski, Software
Engineer at WhizBang Labs.
Landscape of IE Tasks:
Single Field/Record
Jack Welch will retire as CEO of General Electric tomorrow. The top role
at the Connecticut company will be filled by Jeffrey Immelt.
Single entity
Binary relationship
Person: Jack Welch
Relation: Person-Title
Person: Jack Welch
Title:
CEO
Person: Jeffrey Immelt
Location: Connecticut
N-ary record
Relation:
Company:
Title:
Out:
In:
Succession
General Electric
CEO
Jack Welsh
Jeffrey Immelt
Relation: Company-Location
Company: General Electric
Location: Connecticut
“Named entity” extraction
Adapted from slide by William Cohen
18
State of the Art Performance: a
sample
Named entity recognition from newswire text
Person, Location, Organization, …
F1 in high 80’s or low- to mid-90’s
Binary relation extraction
Contained-in (Location1, Location2)
Member-of (Person1, Organization1)
F1 in 60’s or 70’s or 80’s
Web site structure recognition
Extremely accurate performance obtainable
Human effort (~10min?) required on each site
Adapted from slide by William Cohen
19
Three generations of IE systems
Hand-Built Systems – Knowledge Engineering [1980s– ]
Rules written by hand
Require experts who understand both the systems and the
domain
Iterative guess-test-tweak-repeat cycle
Automatic, Trainable Rule-Extraction Systems [1990s– ]
Rules discovered automatically using predefined templates,
using automated rule learners
Require huge, labeled corpora (effort is just moved!)
Statistical Models [1997 – ]
Use machine learning to learn which features indicate
boundaries and types of entities.
Learning usually supervised; may be partially unsupervised
Slide by Chris Manning, based on slides by several others
20
Landscape of IE Techniques
Classify Pre-segmented
Candidates
Lexicons
Abraham Lincoln was born in Kentucky.
member?
Alabama
Alaska
…
Wisconsin
Wyoming
Boundary Models
Abraham Lincoln was born in Kentucky.
Abraham Lincoln was born in Kentucky.
Sliding Window
Abraham Lincoln was born in Kentucky.
Classifier
Classifier
which class?
which class?
Try alternate
window sizes:
Finite State Machines
Abraham Lincoln was born in Kentucky.
Context Free Grammars
Abraham Lincoln was born in Kentucky.
BEGIN
Most likely state sequence?
NNP
NNP
V
V
P
Classifier
PP
which class?
VP
NP
BEGIN
END
BEGIN
NP
END
VP
S
Any
of these models can be used to capture words, formatting or both.
Adapted from slide by William Cohen
21
Trainable IE systems
Pros
Annotating text is simpler &
faster than writing rules.
Domain independent
Domain experts don’t need
to be linguists or
programers.
Learning algorithms ensure
full coverage of examples.
Slide by Chris Manning, based on slides by several others
Cons
Hand-crafted systems
perform better, especially at
hard tasks. (but this is
changing)
Training data might be
expensive to acquire
May need huge amount of
training data
Hand-writing rules isn’t that
hard!!
22
MUC: the genesis of IE
DARPA funded significant efforts in IE in the early to mid
1990’s.
Message Understanding Conference (MUC) was an
annual event/competition where results were presented.
Focused on extracting information from news articles:
Terrorist events
Industrial joint ventures
Company management changes
Information extraction of particular interest to the
intelligence community (CIA, NSA). (Note: early ’90’s)
Slide by Chris Manning, based on slides by several others
23
Message Understanding
Conference (MUC)
Named entity
– Person, Organization, Location
Co-reference
– Clinton  President Bill Clinton
Template element
– Perpetrator, Target
Template relation
– Incident
Multilingual
Adapted from slide by Lucian Vlad Lita
24
MUC Typical Text
Bridgestone Sports Co. said Friday it has set up a joint
venture in Taiwan with a local concern and a
Japanese trading house to produce golf clubs to be
shipped to Japan. The joint venture, Bridgestone
Sports Taiwan Co., capitalized at 20 million new
Taiwan dollars, will start production of 20,000 iron
and “metal wood” clubs a month
Adapted from slide by Lucian Vlad Lita
25
MUC Typical Text
Bridgestone Sports Co. said Friday it has set up a joint
venture in Taiwan with a local concern and a
Japanese trading house to produce golf clubs to be
shipped to Japan. The joint venture, Bridgestone
Sports Taiwan Co., capitalized at 20 million new
Taiwan dollars, will start production of 20,000 iron
and “metal wood” clubs a month
Adapted from slide by Lucian Vlad Lita
26
MUC Templates
Relationship
– tie-up
Entities:
– Bridgestone Sports Co, a local concern, a Japanese
trading house
Joint venture company
– Bridgestone Sports Taiwan Co
Activity
– ACTIVITY 1
Amount
– NT$2,000,000
Adapted from slide by Lucian Vlad Lita
27
MUC Templates
ATIVITY 1
Activity
– Production
Company
– Bridgestone Sports Taiwan Co
Product
– Iron and “metal wood” clubs
Start Date
– January 1990
Adapted from slide by Lucian Vlad Lita
28
Example of IE from FASTUS (1993)
Bridgestone Sports Co. said Friday it had set up a joint venture
in Taiwan with a local concern and a Japanese trading house to
produce golf clubs to be supplied to Japan.
The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20
million new Taiwan dollars, will start production in January 1990
with production of 20,000 iron and “metal wood” clubs a month.
TIE-UP-1
Relationship: TIE-UP
Entities: “Bridgestone Sport Co.”
“a local concern”
“a Japanese trading house”
Joint Venture Company:
“Bridgestone Sports Taiwan Co.”
Activity:
ACTIVITY-1
Amount:
NT$200000000
Example of IE: FASTUS(1993)
Bridgestone Sports Co. said Friday it had set up a joint venture
in Taiwan with a local concern and a Japanese trading house to
produce golf clubs to be supplied to Japan.
The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20
million new Taiwan dollars, will start production in January 1990
with production of 20,000 iron and “metal wood” clubs a month.
TIE-UP-1
Relationship: TIE-UP
Entities: “Bridgestone Sport Co.”
“a local concern”
“a Japanese trading house”
Joint Venture Company:
“Bridgestone Sports Taiwan Co.”
Activity:
ACTIVITY-1
Amount:
NT$200000000
ACTIVITY-1
Activity: PRODUCTION
Company:
“Bridgestone Sports Taiwan Co.”
Product:
“iron and ‘metal wood’ clubs”
Start Date:
DURING: January 1990
Example of IE: FASTUS(1993): Resolving anaphora
Bridgestone Sports Co. said Friday it had set up a joint venture
in Taiwan with a local concern and a Japanese trading house to
produce golf clubs to be supplied to Japan.
The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20
million new Taiwan dollars, will start production in January 1990
with production of 20,000 iron and “metal wood” clubs a month.
TIE-UP-1
Relationship: TIE-UP
Entities: “Bridgestone Sport Co.”
“a local concern”
“a Japanese trading house”
Joint Venture Company:
“Bridgestone Sports Taiwan Co.”
Activity:
ACTIVITY-1
Amount:
NT$200000000
ACTIVITY-1
Activity: PRODUCTION
Company:
“Bridgestone Sports Taiwan Co.”
Product:
“iron and ‘metal wood’ clubs”
Start Date:
DURING: January 1990
Evaluating IE Accuracy
Always evaluate performance on independent, manually-annotated
test data not used during system development.
Measure for each test document:
Total number of correct extractions in the solution template: N
Total number of slot/value pairs extracted by the system: E
Number of extracted slot/value pairs that are correct (i.e. in
the solution template): C
Compute average value of metrics adapted from IR:
Recall = C/N
Precision = C/E
F-Measure = Harmonic mean of recall and precision
Slide by Chris Manning, based on slides by others
32
MUC Information Extraction:
State of the Art c. 1997
NE – named entity recognition
CO – coreference resolution
TE – template element construction
TR – template relation construction
ST – scenario template production
Slide by Chris Manning, based on slides by others
33
Finite State Transducers for IE
Basic method for extracting relevant information
IE systems generally use a collection of specialized
FSTs
– Company Name detection
– Person Name detection
– Relationship detection
Adapted from slide by Lucian Vlad Lita
34
Three Equivalent Representations
Regular
expressions
Finite
automata
Each
can
describe
the others
Regular
languages
Theorem:
For every regular expression, there is a deterministic
finite-state automaton that defines the same
language, and vice versa.
Adapted from Jurafsky & Martin 2000
35
Finite-Automata State Graphs
• A state
• The start state
• An accepting state
• A transition
a
36
Finite Automata
A FA is similar to a compiler in that:
A compiler recognizes legal programs
in some (source) language.
A finite-state machine recognizes legal strings
in some language.
Example: Programming Language Identifiers
sequences of one or more letters or digits,
starting with a letter:
letter | digit
letter
S
A
37
Finite Automata
Transition
s1 a s2
Is read
In state s1 on input “a” go to state s2
If end of input
If in accepting state => accept
Otherwise => reject
If no transition possible (got stuck) => reject
FSA = Finite State Automata
38
Language defined by FSA
The language defined by a FSA is the set of strings
accepted by the FSA.
in the language of the FSM shown below:
– x, tmp2, XyZzy, position27.
not in the language of the FSM shown below:
– 123, a?, 13apples.
letter | digit
letter
S
A
39
Example: Integer Literals
FSA that accepts integer literals with an optional
+ or - sign:
Note the two different edges from S to A
(+|-)?[0-9]+
digit
B
digit
digit
+
S
A
40
Example:
FSA that accepts three letter English words that begin with p
and end with d or t.
Here I use the convenient notation of making the state name
match the input that has to be on the edge leading to that
state.
a
t
p
i
o
d
u
41
Formal Definition
A finite automaton is a 5-tuple (, Q, , q, F) where:
An input alphabet 
A set of states Q
A start state q
A set of accepting states F  Q
 is the state transition function: Q x   Q
(i.e., encodes transitions state input state)
42
How to Implement an FSA
A table-driven approach:
table:
one row for each state in the machine, and
one column for each possible character.
Table[j][k]
which state to go to from state j on character k,
an empty entry corresponds to the machine getting stuck.
Note: when you use the re package in python, it converts your
regex’s into efficient FSMs
43
Terminology Fine Points
FSA: Finite State Automaton
FSM: Finite State Machine
FSA, FSM used interchangibly
FST: Finite State Transducer
The same as FSA, except each transition produces
output
44
Finite State Transducers for IE
FSTs are often compiled from regular expressions
Probabilistic (weighted) FSTs
FSTs mean different things to different IE
approaches:
Based on lexical items (words)
Based on statistical language models
Based on deep syntactic/semantic analysis
Several FSTs or a more complex FST can be used to
find one type of information (e.g. company names)
Adapted from slide by Lucian Vlad Lita
45
Finite State Transducers for IE
Frodo Baggins works for Hobbit Factory, Inc.
Text Analyzer:
Frodo
– Proper Name
Baggins – Proper Name
works
– Verb
for
– Prep
Hobbit
– UnknownCap
Factory – NounCap
Inc
– CompAbbr
Adapted from slide by Lucian Vlad Lita
46
Finite State Transducers for IE
Frodo Baggins works for Hobbit Factory, Inc.
A regular expression for finding company names:
“some capitalized words, maybe a comma,
then a company abbreviation indicator”
CompanyName
= (ProperName | SomeCap)+
Comma?
CompAbbr
Adapted from slide by Lucian Vlad Lita
47
Finite State Transducers for IE
Frodo Baggins works for Hobbit Factory, Inc.
ProperName works for CompanyName.
Adapted from slide by Lucian Vlad Lita
48
Finite State Transducers for IE
Frodo Baggins works for Hobbit Factory, Inc.
Company Name Detection FSA
CAB
word
1
(CAP | PN)
2
comma
3
CAB
4
word
(CAP| PN)
CAP = SomeCap, CAB = CompAbbr, PN = ProperName,  = empty string
Adapted from slide by Lucian Vlad Lita
49
Finite State Transducers for IE
Frodo Baggins works for Hobbit Factory, Inc.
Company Name Detection FST
CAB  CN
word  word
1
(CAP | PN)  
2
comma  
(CAP| PN)  
3
CAB  CN
4
word  word
CAP = SomeCap, CAB = CompAbbr, PN = ProperName,  = empty string, CN = CompanyName
Adapted from slide by Lucian Vlad Lita
50
FSM’s for Pattern Recognition
Use regex’s at a higher level of description for pattern recognition
Determining which person holds what office in what organization
[person] , [office] of [org]
– Vuk Draskovic, leader of the Serbian Renewal Movement
[org] (named, appointed, etc.) [person] P [office]
– NATO appointed Wesley Clark as Commander in Chief
Determining where an organization is located
[org] in [loc]
– NATO headquarters in Brussels
[org] [loc] (division, branch, headquarters, etc.)
– KFOR Kosovo headquarters
Slide by Chris Manning, based on slides by others
51
FASTUS: A Successful Hand-Built System
Based on finite state automata (FSA) transductions
set up
new Taiwan dollars
1.Complex Words:
a Japanese trading house
had set up
2.Basic Phrases:
production of
20, 000 iron and
metal wood clubs
3.Complex phrases:
Recognition of multi-words and proper names
Simple noun groups, verb groups and particles
Complex noun groups and verb groups
4.Domain Events:
[company]
[set up]
[Joint-Venture]
with
[company]
Patterns for events of interest to the application
Basic templates are to be built.
5. Merging Structures:
Templates from different parts of the texts are
merged if they provide information about the
same entity or event.
Descargar

Information Visualization: Principles, Promise, and