How feasible is the reuse of grammars for
Named Entity Recognition?
Katerina Pastra, Diana Maynard, Oana Hamza,
Hamish Cunningham and Yorick Wilks
Department of Computer Science,
Natural Language Processing Group,
University of Sheffield, U.K.
Pastra et al., LREC 2002
The paradox
NER results: close to human performance
Reuse of NER resources: minimal
We will focus on:
• Traditional rule-based NER systems
• NER in text
• Reuse of grammars for NER
• Manual adaptation of grammars
What is it that hinders grammar reuse?
1) Grammar Formalism
2) Application Domain
3) Natural Language
The use of Flexible System Architectures guarantees reusability of resources
>>> But is this a “sine qua non” solution?
Does the lack of such architectures render reusability simply “not feasible”?
Grammar Formalism (1)
• Translating formalisms: a time-effective solution?
• Time gained-information lost: is there a trade-off?
>> Current Practice: No standardised formalism
>> Traditional pattern-matching languages:
inappropriate for NER
>> Norm: Use of AV notations (allow for reference
to token attributes from multiple analysis levels).
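To make the attribute-value idea concrete, here is a minimal sketch of matching a rule against tokens that carry attributes from several analysis levels (orthography, part of speech, gazetteer lookup). It is an illustration only, not JAPE or NEA: the attribute names (`orth`, `pos`, `lookup`), the sample tokens and the `find_spans` helper are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

Token = Dict[str, str]            # one token = a set of attribute-value pairs
Pattern = List[Dict[str, str]]    # one constraint dict per token position


def matches(token: Token, constraint: Dict[str, str]) -> bool:
    """True if the token carries every attribute-value pair in the constraint."""
    return all(token.get(attr) == value for attr, value in constraint.items())


def find_spans(tokens: List[Token], pattern: Pattern) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs where the pattern matches consecutive tokens."""
    spans = []
    for start in range(len(tokens) - len(pattern) + 1):
        window = tokens[start:start + len(pattern)]
        if all(matches(tok, con) for tok, con in zip(window, pattern)):
            spans.append((start, start + len(pattern)))
    return spans


# Hypothetical tokens annotated by earlier modules (tokeniser, tagger, gazetteer).
tokens = [
    {"string": "Dr", "orth": "upperInitial", "pos": "NNP", "lookup": "title"},
    {"string": "Smith", "orth": "upperInitial", "pos": "NNP"},
    {"string": "arrived", "orth": "lowercase", "pos": "VBD"},
]

# Hypothetical Person rule: a gazetteer title followed by a capitalised proper noun.
person_pattern = [{"lookup": "title"}, {"orth": "upperInitial", "pos": "NNP"}]
person_spans = find_spans(tokens, person_pattern)
```

Each constraint can mix attributes from different analysis levels, which is what a plain string-level pattern language cannot express directly.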
Grammar Formalism (2)
The need: NER for SOCIS (not main task – limited time)
The problem: Existing grammar in another formalism
>> NEA – JAPE Similarities:
Declarative, context-sensitive, non-deterministic pattern matching…
>> NEA – JAPE Differences:
Bottom-up rule invocation – FST cascades
Appelt control mechanism – Appelt, First, Brill
Rules augmented with PROLOG – JAVA
Wildcards, “don’t care” sequences: not common
Iterations, (!=): different mechanisms
Grammar Formalism (3)
The experiment:
From the NEA notation to JAPE
NEA notation: A => B\C/D
JAPE: (B)(C) :label (D)  :label.EntityType = {attr}
• one formalism’s LHS is the other’s RHS
• same things handled in different ways
• differences in modules run before NER affect rules
STILL:
Original set in 2 months – SOCIS set in 1 week
Application Domain (1)
Is there a core set of grammar rules that are always
domain-independent?
General purpose NER grammars:
• Developed to serve grammar reuse, but originated
themselves from specific applications
• They separate specific from general information.
• MUSE: switches resources automatically according to text features
• HaSIE: company reports on health and safety issues
Application Domain (2)
From newswire text on Biotechnology
to … Crime Scene Police Reports
The experiment:
• The gazetteers were enriched with police and crime
related information
• All original domain-specific rules were deleted
• Original results with no modifications to the
grammar: close to 90%
• Only 1 change to the core set and addition of rules
Natural Language (1)
NER Grammar in language (A) + linguistic
knowledge of NE in (B) = NER grammar for (B)?
Parameters to consider:
• The relation of A and B (closely related or not)
 determines the extent of reuse
• Nature of NEs (formation, syntagmatic relations)
 unpredictable behaviour and structure
 finite set
Natural Language (2)
The experiment:
Run NER grammar for English on Romanian text
Romanian NE (compared to English):
• Rich inflection
• Flexible word order
• Different word order (e.g. modifier follows noun)
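The word-order point can be sketched as follows: a rule written for the English order (modifier before the organisation key word, as in "National Bank") fails on the Romanian order (key word first, as in "Banca Națională"), and a rule with the elements reversed recovers the match. The tag names `MODIFIER` and `ORG_KEY`, the rules and the `match_rule` helper are hypothetical and are not taken from the grammars used in the experiment.

```python
from typing import List, Tuple


def match_rule(tags: List[str], rule: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) spans where the tag sequence equals the rule."""
    n = len(rule)
    return [(i, i + n) for i in range(len(tags) - n + 1) if tags[i:i + n] == rule]


# Hypothetical rules: the same two elements, in language-specific order.
english_rule = ["MODIFIER", "ORG_KEY"]     # e.g. "National Bank"
romanian_rule = ["ORG_KEY", "MODIFIER"]    # e.g. "Banca Națională"

# Hypothetical tagging of the Romanian phrase "Banca Națională".
romanian_tags = ["ORG_KEY", "MODIFIER"]

missed = match_rule(romanian_tags, english_rule)    # English order finds nothing
found = match_rule(romanian_tags, romanian_rule)    # adapted order finds the span
```

An unmatched entity of this kind lowers recall without affecting precision, which is one way word-order differences can show up in the results.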
Natural Language (3)
Corpus: 1MB of Romanian newspaper texts
Manual marking of NEs – Romanian NER (3 weeks)
1st experiment: Romanian Gaz + English grammar
>> Overall Results: P = 0.82, R = 0.67
• Low recall even for entity types recognised with high precision
(e.g. Org 0.75P – 0.39R)
2nd experiment: Romanian Gaz + Adapted grammar
>> Overall Results: P = 0.95, R = 0.94
Natural Language (3)
1st experiment (Romanian Gaz + English grammar):
Entity Type     Precision   Recall
Address         0.81        0.81
Date            0.67        0.77
Location        0.88        0.96
Money           0.82        0.47
Organisation    0.75        0.39
Percent         1           0.82
Person          0.68        0.78
Identifier      0.94        0.38
Overall         0.82        0.67

2nd experiment (Romanian Gaz + Adapted grammar):
Entity Type     Precision   Recall
Address         0.96        0.93
Date            0.95        0.94
Location        0.92        0.97
Money           0.98        0.92
Organisation    0.95        0.89
Percent         1           0.99
Person          0.88        0.92
Identifier      0.99        0.96
Overall         0.95        0.94
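The slides report only precision and recall. As a rough single-figure comparison of the two experiments, the balanced F-measure (the harmonic mean of P and R) can be derived from the overall figures. This is a derived illustration, not a number reported by the authors, and the `f1` helper is an assumption of this sketch.

```python
def f1(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Overall (P, R) figures as reported for the two experiments.
overall = {
    "English grammar": (0.82, 0.67),
    "Adapted grammar": (0.95, 0.94),
}

# Derived values, rounded to three decimals.
f1_scores = {name: round(f1(p, r), 3) for name, (p, r) in overall.items()}
```

Under these figures the derived F-measure rises from about 0.737 with the unmodified English grammar to about 0.945 with the adapted grammar.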
Conclusions
Reuse of existing NER grammars is time-effective
and should be attempted even when the formalisms,
applications and languages involved are different
Further issues to be addressed:
• Reuse of NER grammars for spoken NEs
• Reuse in statistical/ML NER approaches
• Automating grammar reuse