Information Extraction
Technologies & Applications
Günter Neumann
LT-lab, DFKI
Source: G. Neumann
1999
The increased availability of electronic text data requires
new technologies for extracting relevant information
INFORMATION EXTRACTION (IE)
The goal of IE research is to build systems that find and link relevant information
from NL text while ignoring extraneous and irrelevant information
The core functionality of an IE system is quite simple:
Input:
1. Specification of the relevant information in the form of templates
(feature structures), e.g., company information, product information,
management succession, meetings of important people
2. A set of real-world text documents
Output:
A set of instantiated templates filled with relevant text fragments (possibly
normalized to some canonical form)
Example Information Extraction 1
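Lübeck (dpa) - Die Lübecker Possehl-Gruppe, ein im Produktions-,
Handel- und Dienstleistungsbereich tätiger Mischkonzern, hat 1994
den Umsatz kräftig um 17 Prozent auf rund 2,8 Milliarden DM
gesteigert. In das neue Geschäftsjahr sei man ebenfalls „mit Schwung“
gestartet. Im 1. Halbjahr 1995 hätten sich die Umsätze des Konzerns
im Vergleich zur Vorjahresperiode um fast 23 Prozent auf rund 1,3
Milliarden erhöht.
(Lübeck (dpa) - The Lübeck-based Possehl Group, a conglomerate active in
production, trade and services, strongly increased its turnover in 1994
by 17 percent to around 2.8 billion DM. It had also started the new
business year "with momentum". In the first half of 1995, the group's
turnover rose by almost 23 percent compared with the same period of the
previous year, to around 1.3 billion.)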
Template 1:
type     = turnover
c-name   = Possehl1
year     = 1994
amount   = 2.8e+9DM
tendency = +
diff     = +17%
Template 2:
type     = turnover
c-name   = Possehl1
year     = 1995/1
amount   = 1.3e+9DM
tendency = +
diff     = +23%
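The step from text to an instantiated template can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the class `TurnoverTemplate`, the two regular expressions and the helper `normalize_amount` are hypothetical, hand-written for this one sentence shape, and are not the method of the system described in these slides.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class TurnoverTemplate:
    type: str
    c_name: str
    year: str
    amount: str
    tendency: str
    diff: str


# Hypothetical hand-written patterns; they cover only this one sentence shape.
COMPANY = re.compile(r"Die (?:\w+ )?(?P<name>\w+)-Gruppe")
TURNOVER = re.compile(
    r"(?P<year>\d{4}) den Umsatz .*?um (?P<diff>\d+) Prozent "
    r"auf rund (?P<amount>\d+,\d+) Milliarden DM gesteigert"
)


def normalize_amount(billions: str) -> str:
    """Turn a German decimal such as '2,8' (in billions) into a canonical form."""
    value = float(billions.replace(",", ".")) * 1e9
    return f"{value:.1e}DM"


def extract_turnover(sentence: str) -> Optional[TurnoverTemplate]:
    company = COMPANY.search(sentence)
    turnover = TURNOVER.search(sentence)
    if company is None or turnover is None:
        return None  # nothing relevant found: the text is skipped
    return TurnoverTemplate(
        type="turnover",
        c_name=company.group("name"),
        year=turnover.group("year"),
        amount=normalize_amount(turnover.group("amount")),
        tendency="+",  # "gesteigert" (increased) signals a rise
        diff=f"+{turnover.group('diff')}%",
    )


SENTENCE = (
    "Die Lübecker Possehl-Gruppe, ein im Produktions-, Handel- und "
    "Dienstleistungsbereich tätiger Mischkonzern, hat 1994 den Umsatz kräftig "
    "um 17 Prozent auf rund 2,8 Milliarden DM gesteigert."
)
print(extract_turnover(SENTENCE))
```

The sketch fills the slots from the first sentence only; a real system would replace the regular expressions with the linguistic analysis steps discussed later and would link the company mention to a normalized entity.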
Example Information Extraction 2
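Parts from RWE's Annual Report (1998):
Eine Schwerpunktregion im Rahmen der Internationalisierung im Energiebereich ist
Osteuropa. Hier haben wir unser Engagement im abgelaufenen Geschäftsjahr weiter
ausbauen können. Nach dem Kauf weiterer Anteile halten wir inzwischen jeweils
knapp über 50% an den ungarischen Energieversorgungsunternehmen ELMÜ,
ÉMÁSZ und MÁTRA. Im Falle von MÁTRA hat RWE Energie im April 1998
Anteile an Rheinbraun abgegeben. Die Präsenz in Polen wurde durch
Kooperationsvereinbarungen mit den Regionalversorgern Zaklad Energetyczny
Krakow S.A. (ZEK) und Stoleczny Zaklad Energetyczny S.A. (STOEN) ....im
Frühjahr 1998 weiter ausgebaut.
(Eastern Europe is a focus region of our internationalization in the energy
sector. Here we were able to further expand our involvement in the past
business year. After purchasing further shares, we now hold just over 50%
each of the Hungarian energy supply companies ELMÜ, ÉMÁSZ and MÁTRA. In the
case of MÁTRA, RWE Energie transferred shares to Rheinbraun in April 1998.
The presence in Poland was further expanded in spring 1998 through
cooperation agreements with the regional suppliers Zaklad Energetyczny
Krakow S.A. (ZEK) and Stoleczny Zaklad Energetyczny S.A. (STOEN) ....)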
Group/Subs. | YEAR | KIND | FROM | TO | POT | AMOUNT
RWE | 1998 | + | ELMÜ | | | >50%
RWE | 1998 | + | ÉMÁSZ | | | >50%
RWE | 1998 | + | MÁTRA | | | >50%
RWE Energie | 1998 | - | MÁTRA | Rheinbraun | 4.1998 |
From the viewpoint of natural language processing (NLP),
IE is attractive for many reasons, including
• Extraction tasks are well defined
• IE uses real-world texts
• IE poses difficult and interesting NLP problems
• IE needs systematic interface specification between NL and domain knowledge
• IE performance can be compared to human performance on the same task
=> IE systems are a key factor in encouraging NLP researchers to move from small-scale
systems and artificial data to large-scale systems operating on human language (Cowie &
Lehnert, 1996)
IE has a high application impact
• IE and information retrieval:
construction of sensitive indices which are more
closely linked to the actual meaning of a particular text
• IE and text classification:
getting fine-grained decision rules
• IE and text mining:
improve quality of extracted structured information
• IE and data-base systems:
improve semi-structured DB approaches
• IE and knowledge-base systems: combine extracted information with KB
Advanced IE technologies improve intelligent indexing
and retrieval
improved indexing:
text files → IE core system → marked text & templates → index construction
(phrasal, concept indices)
improved retrieval:
[diagram components: query, search engine, IE core system]
Shallow text processing as a common pre-processing tool for TM & IE
[Diagram] Text documents → STP (tokenization, lex. processing, chunk parsing) → linguistic entities
– Information extraction: complex relations are known → instantiated templates
– Data mining: complex relations to be discovered
[Further diagram label: Domain]
IE as core component in incremental knowledge
engineering systems
The core idea:
Hand-crafted construction of a domain ontology on the basis of linguistic information
extracted from texts.
Simultaneously construct a domain lexicon,
which is used in the next acquisition cycle.
[Diagram components: IE core system, statistical eval. & visualization, construction of ontology, ontology, domain lexicon]
From a system development point of view, there exist
two approaches for IE
• Language technology approach („dem Ingeniör is nix zu schwör“: nothing is too hard for the engineer)
– linguistic knowledge specified manually by experts
– mapping between NL and domain knowledge hand-coded
– manual inspection of corpus to find out how specific domain knowledge is
expressed via NL
– still the best approach for building reasonably complex systems
– development of tools for supporting application building
• Learning approach
– apply statistical methods where possible
– learn template filler rules from annotated corpora using Machine Learning
– shows promising results for IE subtasks (proper name recognition, flat slot
filler rules)
– mapping between NL and domain knowledge automatically induced
– still needs large amounts of annotated corpora
Common to both approaches is the use of shallow re-usable NL core
components (of course, differing in granularity)
• tokenization, text scanning/information wrapping (e.g., analysis of tables, head lines)
– in most cases simple, but very important
• Morphological & lexical processing
– high-coverage and fast morphological analysis
– processing of unknowns & compounds
– part-of-speech tagging
• Recognition of named entities (proper names, expressions for dates, values,
measurements); important issues here: new name creation, multilingual terms (Demo)
• shallow parsing: very long sentences (> 30 words), relative clauses, coordination
– integration of specific subgrammars
– partial parsing
– very robust and efficient strategies needed (weighted finite state transducers)
• discourse analysis (NP-analysis, co-reference, relational links)
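The shallow parsing component listed above relies on finite-state techniques. The following Python sketch illustrates the idea of partial parsing with a finite-state pattern over part-of-speech tags. It rests on assumptions that are not from the slides: a tiny hypothetical tag set (DET, ADJ, NOUN, VERB), a single noun-phrase pattern, and a plain (unweighted) regular expression instead of weighted finite state transducers.

```python
import re

# Hypothetical one-letter codes for a tiny tag set; any other tag becomes "x".
TAG_CODES = {"DET": "D", "ADJ": "A", "NOUN": "N"}

# Noun phrase pattern: optional determiner, any adjectives, one or more nouns.
NP_PATTERN = re.compile(r"D?A*N+")


def chunk_noun_phrases(tagged):
    """Return the noun-phrase chunks found in a list of (word, tag) pairs.

    Each token is encoded as exactly one character, so match offsets in the
    tag string are also token indices. Tokens outside any match are skipped,
    which is what makes the parse partial and robust.
    """
    tag_string = "".join(TAG_CODES.get(tag, "x") for _, tag in tagged)
    chunks = []
    for match in NP_PATTERN.finditer(tag_string):
        chunks.append([word for word, _ in tagged[match.start():match.end()]])
    return chunks


sentence = [
    ("the", "DET"), ("new", "ADJ"), ("parser", "NOUN"),
    ("increases", "VERB"), ("robustness", "NOUN"),
]
print(chunk_noun_phrases(sentence))
# [['the', 'new', 'parser'], ['robustness']]
```

Encoding tags as single characters keeps the example short; a real component would compile subgrammars for NPs, PPs, verb groups and date or name expressions into transducers.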
IE poses difficult and interesting NLP problems, which have partially led
to a renaissance of „old“ methods and to their improvement
• A high degree of robustness and efficiency is required:
– finite state technology (weighted finite state transducers)
– text-skipping as a realistic approach for handling real-world text
• systematic specification of NL and domain knowledge
• short system application cycle because of „daily news“
• multilingual methods are required
• evaluation of the predicting power of IE systems
– recall
– precision
– f-measure
R = #correctly found answers / #total possible correct answers
P = #correctly found answers / #found answers
F = ((β² + 1) · P · R) / (β² · P + R), where β weights recall against precision (β = 1 weights them equally)
0.6 barrier
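The three evaluation measures can be computed directly from counts. The sketch below assumes the standard weighted F-measure with parameter β (with β = 1 recall and precision count equally); the function name and the example counts are made up for illustration.

```python
def precision_recall_f(correct, found, possible, beta=1.0):
    """Compute precision, recall and F-measure from answer counts.

    correct:  number of correctly found answers
    found:    number of answers the system produced
    possible: number of answers in the gold standard
    """
    precision = correct / found if found else 0.0
    recall = correct / possible if possible else 0.0
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return precision, recall, 0.0
    f_measure = (beta ** 2 + 1) * precision * recall / denominator
    return precision, recall, f_measure


# Hypothetical counts: 8 answers produced, 6 of them correct, 10 possible.
p, r, f = precision_recall_f(correct=6, found=8, possible=10)
print(f"P={p:.2f} R={r:.2f} F={f:.3f}")
# P=0.75 R=0.60 F=0.667
```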
At DFKI's LT-lab we have developed powerful domain-independent shallow text processing components in order
to support a fast IE system development cycle
Shallow Text Processor
Processing chain: Text → Text Tokenization → Lexical processor → Chunk Parser → Set of Underspecified Fct. Descr.
Lexical processor
• Morphology
• Compounds
• Tagging
Chunk Parser
• sentence topology
• phrase recognition
• sentence structure
• grammatical fct.
Lexical DB
> 120,000 main stems;
> 12,000 verb frames;
special name lexica;
tagging rules;
Grammars (FST)
general (NPs, PPs, VG);
special (lexicon-poor,
Time/Date/Names);
general sentence patterns;
We have identified the need for better chunk parsing
strategies in order to improve robustness and coverage on
the sentence level
Text (morph. analysed) → phrase recognition → stream of phrases → clause recognition → grammatical fct. recognition → stream of sentences
Current chunk parser
bottom-up:
first phrases and then sentence structure
main problem:
even recognition of simple sentence structure depends on
performance of phrase recognition
example:
- complex NP (nominalization style)
- relative pronouns
[Die vom Bundesgerichtshof und den Wettbewerbern als Verstoss
gegen das Kartellverbot gegeisselte zentrale TV-Vermarktung] ist
gängige Praxis.
([central television marketing censured by the German Federal
High Court and the guards against unfair competition as an act of
contempt against the cartel ban] is common practice)
A new chunk parser has been developed that increases
robustness and coverage on the sentence level
Text (morph. analysed) → sentence topology → stream of sentence structures → phrase recognition → grammatical fct. recognition → stream of sentences
New chunk parser
top-down/bottom-up:
first compute the topological structure of the sentence,
second apply phrase recognition to the fields
[coord [core Diese Angaben konnte der Bundesgrenzschutz aber
nicht bestätigen], [core Kinkel sprach von Horrorzahlen, [relcl
denen er keinen Glauben schenke]]].
(This information couldn't be verified by the Border Police, Kinkel
spoke of horrible figures that he didn't believe.)
Advantages
- simple (topology recognition based on keywords and verb
groups)
- resolution of critical ambiguities (coordination,
pronouns/determiners)
- no restriction on # of subclauses
- good coverage (first tests on 670 sentences: 85-90%, Urli)
The Shallow Text Processor has several important characteristics
Modularity:
each subcomponent can be used in isolation;
Declarativity:
lexicon and grammar specification tools;
High coverage:
more than 93 % lexical coverage of unseen text;
high coverage of subgrammars (70% general NPs, >90% sentence
topology);
Efficiency:
finite state technology in all components;
specialized constraint solvers
(e.g. agreement checks & grammatical functions);
Run-time example:
1-3 seconds per text page (real time)
We are using typed feature structures for modeling the
domain knowledge and its relationship to NL expressions
Type hierarchies:
Phrase: Np, Pp, LocPp, LocNp, DatePP
Fdesc [process, mods]: trans [subj, obj], intrans [subj]
Templ [action, date]: Move-T [from, to, unit], Loc-T [action], Fight-T [victim, attacker, attacked], Meeting [visitor, visitee]
Linked type:
Fight-L [subj=1, obj=2, temp=[attacker=1, attacked=2]]
DomainLex:
kill=Fight-L
Processing: STP → UFD → Type Merge → Fill Templ
Example result:
[process=1,
subj=2,
obj=3,
dateMod = < ... =4 ...>,
locMod = < ... =5...>,
templ = [ action kill=1,
attacker soldier=2,
attacked general=3,
date 3/8/93=4,
Loc Mostar=5]]
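How a linked type connects a functional description to a template can be sketched with plain Python dictionaries. This is a simplification under stated assumptions: real typed feature structures support unification and type inheritance, which are not modeled here, and the dictionary layout, the function `fill_template` and the handling of modifiers are hypothetical. Only the names `kill`, `Fight-L`, `Fight-T` and the slot names come from the slide.

```python
# Domain lexicon: maps a process (verb stem) to a linked type.
DOMAIN_LEXICON = {"kill": "Fight-L"}

# Linked types: map grammatical functions to template slots (assumed layout).
LINKED_TYPES = {
    "Fight-L": {
        "template": "Fight-T",
        "links": {"subj": "attacker", "obj": "attacked"},
    },
}

# Assumed mapping from modifier lists to template slots.
MODIFIER_SLOTS = {"dateMod": "date", "locMod": "loc"}


def fill_template(ufd):
    """Fill a template from an underspecified functional description (a dict)."""
    linked_type = DOMAIN_LEXICON.get(ufd.get("process"))
    if linked_type is None:
        return None  # process is not relevant for the domain
    spec = LINKED_TYPES[linked_type]
    template = {"type": spec["template"], "action": ufd["process"]}
    for function, slot in spec["links"].items():
        if function in ufd:
            template[slot] = ufd[function]
    for modifier, slot in MODIFIER_SLOTS.items():
        values = ufd.get(modifier)
        if values:
            template[slot] = values[0]  # take the first modifier of each kind
    return template


ufd = {
    "process": "kill",
    "subj": "soldier",
    "obj": "general",
    "dateMod": ["3/8/93"],
    "locMod": ["Mostar"],
}
print(fill_template(ufd))
```

Because the template hierarchy and the linked types are kept as separate tables, a new application would mainly need new entries in these tables, not changes to the linguistic components.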
The model has several advantages
• New applications are defined basically through
– template hierarchy (independently from linguistic knowledge)
– integration of domain and linguistic knowledge only through
linked types between abstract linguistic categories
• Support of faster adaptation to new applications
• The focus of the next research periods:
– systematic integration of lexical semantics (in particular nominal entities)
– template merging
– methods for automatic knowledge acquisition
– learning of domain lexicon
– learning of linked types
Recently, the problem of using machine learning methods
to induce IE routines has received more attention
• The majority of current approaches are variants of inductive supervised learning
• Goal: given a set of annotated text documents, automatically induce template filler rules by
successively generalizing the initial instantiated rules computed from the tagged
examples
• Usually, the documents are preprocessed by NL components
– tokenization (Freitag, 98)
– POS tagging (Califf & Mooney, 98)
– phrase recognition (Huffman, 96)
– shallow sentence parsing (Riloff, 96a; Soderland, 97)
• most approaches learn slot-filler rules, some newer ones learn relational structures (Califf & Mooney, 98)
• the current trend is towards minimally supervised strategies (Riloff, 96b)
The majority of current approaches are variants of
inductive supervised learning
• Example
<PNG> Sue Smith </PNG>, 39, of Menlo Park, was appointed <TNG> president </TNG> of <CNG>
Foo Inc. </CNG>
n_was_named_t_by_c:
noun-group(PNG, head(isa(person-name))),
noun-group(TNG, head(isa(title))),
noun-group(CNG, head(isa(company-name))),
verb-group(VG, type(passive), head(named or elected or appointed)),
prep(PREP, head(of or at or by)),
subject(PNG,VG), object(VG,TNG), post_nominal_prep(TNG,PREP), prep_obj(PREP,CNG)
=> management_appointment(M, person(PNG), title(TNG), company(CNG))
supervised and unsupervised methods
• Current approaches already show impressive results (for flat sentence-based templates):
– Huffman, 96: management changes task => 85.2% F (89.4%)
– Califf & Mooney, 98: computer-related job postings => 87.1% P & 58.8% R
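To make the effect of such a slot-filler rule concrete, the following Python sketch applies a hand-written approximation of the rule n_was_named_t_by_c to the annotated example sentence. It is an assumption-laden simplification: the rule on the slide operates on parsed noun groups and grammatical relations and is induced from examples, whereas the regular expression `RULE` below is written by hand and only looks at the surface string with its PNG/TNG/CNG markup.

```python
import re

# Hand-written surface approximation of the rule (hypothetical, not induced).
RULE = re.compile(
    r"<PNG>\s*(?P<person>.+?)\s*</PNG>.*?"
    r"\bwas\s+(?:named|elected|appointed)\s+"
    r"<TNG>\s*(?P<title>.+?)\s*</TNG>\s+"
    r"(?:of|at|by)\s+"
    r"<CNG>\s*(?P<company>.+?)\s*</CNG>"
)


def management_appointment(annotated_sentence):
    """Return the filled slots as a dict, or None if the rule does not apply."""
    match = RULE.search(annotated_sentence)
    if match is None:
        return None
    return match.groupdict()


EXAMPLE = (
    "<PNG> Sue Smith </PNG>, 39, of Menlo Park, was appointed "
    "<TNG> president </TNG> of <CNG> Foo Inc. </CNG>"
)
print(management_appointment(EXAMPLE))
# {'person': 'Sue Smith', 'title': 'president', 'company': 'Foo Inc.'}
```

A learning system would start from many such instantiated examples and generalize them, for instance by widening the set of admissible verbs and prepositions.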
In the case of multilingual name recognition, learning
approaches are very promising
• Nymble, a high performance learning name-finder (Bikel et al. 97, BBN)
– hidden Markov model
– word-features (e.g., allCaps, twoDigitNum)
– Results for F-measure
• English: 93% (best MUC-6: 96%)
• Spanish: 90% (93%)
• Gallippi, 96:
– data-driven knowledge acquisition strategy based on decision trees (ID3)
– features: POS, Abbrev., list of known names, word-features
– Results for F-measure (av. across companies, persons, locations, date)
• English: 94%
• Spanish: 89.2%
• Japanese: 83.1%
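Word-features of the kind mentioned above are cheap, largely language-independent cues computed from the surface form of a token. The Python sketch below is illustrative only: `allCaps` and `twoDigitNum` are the feature names given on the slide, while the remaining feature names, their order of precedence and the function `word_feature` are assumptions made for this example.

```python
def word_feature(token, is_first_word=False):
    """Map a token to a single coarse word-feature label."""
    if token.isdigit():
        if len(token) == 2:
            return "twoDigitNum"   # e.g. a two-digit year such as "97"
        if len(token) == 4:
            return "fourDigitNum"  # e.g. a year such as "1999"
        return "otherNum"
    if token.isalpha() and token.isupper():
        return "allCaps"           # e.g. an acronym such as "BBN"
    if is_first_word:
        return "firstWord"         # capitalization carries no information here
    if token[:1].isupper():
        return "initCap"
    if token.islower():
        return "lowerCase"
    return "other"


for tok in ["BBN", "97", "1999", "Nymble", "model", "@"]:
    print(tok, word_feature(tok))
```

In a hidden Markov model name-finder, such a label would be observed together with the word itself, so that unseen words still contribute evidence through their shape.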
In summary: IE is a very attractive research area for
building application systems
• IE is an interdisciplinary approach
– language technology
– statistical methods
– machine learning
– knowledge representation
– software engineering
– expert knowledge
• What mix of methods is best depends on the complexity of the target information
– hand-crafted vs. automatic or mixed strategies
• The next system generation will
– be more intelligent in adapting itself to new domains
– learn from experience while being minimally supervised
– support a shorter development cycle (dynamic environment)
– understand several languages