Aspect-driven summarization
TAC’10 Guided Summarization Task
Background and motivation
• We have developed multilingual text-mining software for EMM (Europe Media Monitor; http://emm.jrc.it/overview.html), which automatically:
• gathers 100K news articles per day in 50 languages from about 2.5K news sources (cf. Fig. 1),
• clusters all these articles into major news stories and tracks them over time,
• detects events (NEXUS): event type, victims, perpetrators, etc.
⇒ Need for multilingual multi-document/update summarization
• Task: produce a 100-word summary for a set of 10 newswire articles on a given topic, where the topic falls into a predefined set of categories.
• Participants were given a list of important aspects for each category, and the summaries should cover them.
• Example – aspects for the category Attacks: 2.1 WHAT, 2.2 WHEN, 2.3 WHERE, 2.4 PERPETRATORS, 2.5 WHY, 2.6 WHO AFFECTED, 2.7 DAMAGES, 2.8 COUNTERMEASURES.
Overall results
• Run 25 – co-occurrence + aspect coverage
• Run 31 – only co-occurrence
Run ID | Overall Responsiveness | Linguistic quality | Pyramid score
16 (best run in Overall Resp.) | 3.17 (1) | 3.46 (2) | 0.40 (4)
22 (best run in Pyramid score) | 3.13 (2) | 3.11 (13) | 0.43 (1)
25 (co-occurrence + aspects) | 2.98 (10) | 3.35 (4) | 0.37 (18)
31 (co-occurrence only) | 2.89 (19) | 3.28 (6) | 0.38 (13)
2 (baseline – MEAD) | 2.50 (27) | 2.72 (29) | 0.30 (26)
1 (baseline – LEAD) | 2.17 (32) | 3.65 (1) | 0.23 (32)
Table 1. Results of initial summaries: score (rank, out of 43 runs).
Category-focused results
Figure 1. News clusters in EMM’s NewsExplorer and extracted events in NEXUS.
Information extraction for aspect capturing
• Entity recognition and disambiguation – used in the LSA input representation and in capturing person/organization-related aspects
• Event extraction system (NEXUS)
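As a toy illustration of the kind of slot filling NEXUS performs on the poster’s kidnapping example: the surface patterns below are invented for this sketch, not NEXUS’s actual cascaded extraction grammar.

```python
import re

# Invented surface patterns for illustration only; NEXUS uses a far
# richer grammar applied to whole documents, not single regexes.
PATTERNS = {
    "victims": re.compile(r"(\d+ people) taken hostage"),
    "perpetrators": re.compile(r"hostage by (?:armed )?(\w+)"),
}

def extract_slots(sentence):
    """Return event slots found in a sentence via the toy patterns above."""
    slots = {"event_type": "kidnapping" if "hostage" in sentence else None}
    for name, pat in PATTERNS.items():
        m = pat.search(sentence)
        slots[name] = m.group(1) if m else None
    return slots

print(extract_slots("All the 20 people taken hostage by armed pirates were safe."))
# {'event_type': 'kidnapping', 'victims': '20 people', 'perpetrators': 'pirates'}
```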
Category | Overall Responsiveness | Linguistic quality | Pyramid score
1. Disasters | 3.00 (23) – 3.57 (2) | 3.43 (3) – 3.29 (5) | 0.38 (23) – 0.43 (10)
2. Attacks | 3.71 (3) – 2.86 (22) | 3.29 (4) – 3.00 (16) | 0.56 (6) – 0.49 (18)
3. Health | 2.75 (6) – 2.42 (21) | 3.33 (6) – 3.25 (9) | 0.30 (9) – 0.31 (7)
4. Resources | 2.50 (25) – 2.60 (21) | 3.60 (3) – 3.40 (6) | 0.24 (29) – 0.27 (23)
5. Investigations | 3.20 (6) – 3.30 (2) | 3.10 (10) – 3.40 (2) | 0.45 (14) – 0.47 (5)
Table 2. Scores and ranks of our runs for each category (run 25 – run 31): positive = top 6, negative = average or worse.
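The Ontopopulis lexica used below are learnt weakly supervised from a few seed terms. A toy distributional sketch of that expansion idea follows; the corpus, seeds, and scoring are invented for illustration, and the real system uses much richer context features.

```python
from collections import defaultdict

def expand_lexicon(corpus, seeds, top_n=1):
    """Rank new terms by how many context words they share with the seeds."""
    seed_contexts = set()                 # words co-occurring with any seed
    cooc = defaultdict(set)               # word -> words it co-occurs with
    for sent in corpus:
        words = set(sent.lower().split())
        for w in words:
            cooc[w] |= words - {w}
        if words & seeds:
            seed_contexts |= words - seeds
    scored = {
        w: len(cooc[w] & seed_contexts)
        for w in cooc
        # skip seeds, direct context words, and very short (stop)words
        if w not in seeds and w not in seed_contexts and len(w) > 3
    }
    return sorted(scored, key=scored.get, reverse=True)[:top_n]

corpus = ["the rescue operation began",
          "the evacuation operation began",
          "a relief operation began",
          "pirates attacked the ship"]
print(expand_lexicon(corpus, {"rescue", "evacuation"}))  # -> ['relief']
```

Here “relief” is pulled in because it appears in the same contexts (“… operation began”) as the seeds, mirroring how it ends up in the COUNTERMEASURES lexicon.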
Example sentence: “All the 20 people taken hostage by armed pirates were safe.”
Extracted slots: event type (kidnapping), victims (20 people), perpetrator (pirates)
• Automatically learnt lexica (Ontopopulis). Sample from the lexicon for COUNTERMEASURES:
operation, rescue operation, rescue, evacuation, treatment, assistance, relief, military operation, police operation, security operation, aid

Our summarization approach within the new task
• Guided task: a summary must include the aspects defined for its category.
• Our idea: the summary should contain the topics frequently mentioned within the cluster, but it should also be rich in the category’s aspects.
• Summarizer based on co-occurrence (LSA) and aspect coverage.
• LSA: lexical features and named entities (as in TAC’09).
• 1st step: creation of the SVD input matrix A and the aspect matrix P.
• 2nd step: Singular Value Decomposition: A = U S V^T.
• 3rd step: sentence selection based on the values in F = S V^T and in P (Fig. 2); iteratively:
• select the sentence with the largest overall score,
• subtract its information from F and P (to select more diverse information and avoid redundancy).

Results focused on IE-based aspects

Aspect | IE tool | Run 25 | Run 31 | Best
1.1 WHAT | NEXUS | 0.60 (24) | 0.79 (3) | 0.89
1.5 WHO AFFECTED | NEXUS | 0.36 (25) | 0.41 (23) | 0.68
1.6 DAMAGES | ONTOPOPULIS | 0.13 (26) | 0.38 (10) | 1.25
1.7 COUNTERMEASURES | ONTOPOPULIS | 0.34 (7) | 0.19 (29) | 0.39
2.1 WHAT | NEXUS | 0.74 (21) | 0.79 (12) | 0.88
2.4 PERPETRATORS | NEXUS | 0.48 (18) | 0.34 (24) | 0.69
2.6 WHO AFFECTED | NEXUS | 0.65 (2) | 0.54 (11) | 0.66
2.7 DAMAGES | ONTOPOPULIS | 0.50 (4) | 0 (30) | 0.75
2.8 COUNTERMEASURES | ONTOPOPULIS | 0.34 (18) | 0.20 (32) | 0.65
3.1 WHAT | NEXUS | 0.33 (17) | 0.36 (14) | 0.58
3.2 WHO AFFECTED | NEXUS | 0.29 (6) | 0.31 (4) | 0.39
3.5 COUNTERMEASURES | ONTOPOPULIS | 0.31 (1) | 0.24 (7) | 0.31
4.1 WHAT | ONTOPOPULIS | 0.49 (19) | 0.46 (25) | 0.81
4.4 COUNTERMEASURES | ONTOPOPULIS | 0.36 (5) | 0.29 (12) | 0.50
5.1 WHO | NEXUS | 0.67 (17) | 0.65 (19) | 0.96
5.3 REASONS | NEXUS | 0.46 (19) | 0.59 (6) | 0.67
5.4 CHARGES | ONTOPOPULIS | 0.33 (27) | 0.47 (11) | 0.72
Table 3. Pyramid scores and ranks of our runs for each aspect: positive = positive score or influence of the IE tool, negative = negative score or influence of the IE tool.
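The three-step selection procedure can be sketched as follows. This is a minimal version with invented toy matrices and a simplified subtraction step; the real system’s matrix construction, weighting, and redundancy handling differ.

```python
import numpy as np

def select_sentences(A, P, k, aspect_weight=0.5):
    """LSA + aspect-coverage sentence selection (simplified sketch).

    A: term-by-sentence matrix, P: aspect-by-sentence indicator matrix.
    Steps 1-2: SVD of A; step 3: iteratively pick the highest-scoring
    sentence, then subtract its information from F and P.
    """
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    F = np.diag(S) @ Vt                    # latent-topic strength per sentence
    P = P.astype(float)
    chosen = []
    for _ in range(k):
        score = np.linalg.norm(F, axis=0) + aspect_weight * P.sum(axis=0)
        score[chosen] = -np.inf            # never pick a sentence twice
        j = int(np.argmax(score))
        chosen.append(j)
        f = F[:, j]
        if f @ f > 0:                      # remove j's latent content from F
            F = F - np.outer(f, (f @ F) / (f @ f))
        P[P[:, j] > 0, :] = 0              # aspects covered by j are satisfied
    return chosen

# Toy example: 3 terms x 3 sentences, 2 aspects.
A = np.array([[2., 0., 0.],
              [2., 1., 0.],
              [0., 1., 1.]])
P = np.array([[1, 0, 0],                   # aspect covered by sentence 0
              [0, 0, 1]])                  # aspect covered by sentence 2
print(select_sentences(A, P, k=2))         # -> [0, 2]
```

Note how the aspect term lets the weaker sentence 2 beat sentence 1 in the second round: its content is orthogonal to what was already picked and it covers an unmet aspect.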
Conclusions
• The approach can easily be applied to many languages (multilingual entity disambiguation and latent semantic analysis).
• Strong results of the IE-based run for the central topic of the event extraction system: criminal/terrorist attacks.
• NEXUS detects too many event aspects, including those of past events (background information); co-occurrence alone already works well there.
• We thus need to distinguish the main event from mentions of past events, through temporal analysis or by preferring the first event mention.
• Good results for the aspects treated by lexical learning with Ontopopulis.
• Event aspect information helps if it is of high quality.

Figure 2. Sentence score computation from co-occurrence (LSA) and aspect information.

© European Communities, 2011
Contributors: Josef Steinberger, Ralf Steinberger, Hristo Tanev, Mijail Kabadjov
Unit for Global Security and Crisis Management
European Commission • Joint Research Centre
Institute for the Protection and the Security of the Citizen
Tel. +39 0332 785648 • Fax +39 0332 785154
E-mail format: [email protected]
Related publications
Steinberger, Ralf, Bruno Pouliquen & Erik van der Goot (2009). An Introduction to the Europe Media Monitor Family of
Applications. In: Information Access in a Multilingual World - Proceedings of the SIGIR 2009 Workshop (SIGIR-CLIR'09), 2009.
Steinberger, Josef, Mijail Kabadjov, Bruno Pouliquen, Ralf Steinberger & Massimo Poesio. WB-JRC-UTs Participation in TAC
2009: Update Summarization and AESOP Tasks. In: Proceedings of the Text Analysis Conference (TAC’09), 2010.
Steinberger, Ralf & Bruno Pouliquen. Cross-lingual Named Entity Recognition. In: Named Entities - Recognition, Classification
and Use, Benjamins Current Topics, Volume 19, 2009.
Tanev, Hristo, Jakub Piskorski & Martin Atkinson. Real-time News Event Extraction for Global Crisis Monitoring. In:
Proceedings of 13th International Conference on Applications of Natural Language to Information Systems (NLDB’08), 2008.
Tanev, Hristo & Bernardo Magnini. Weakly Supervised Approaches for Ontology Population. In: Proceedings of the 11th
conference of the European Chapter of the Association for Computational Linguistics (EACL’06), 2006.