Main Mono and Bilingual Tasks:
Track Organisation and Results Analysis
Giorgio M. Di Nunzio, University of Padua, Italy
Nicola Ferro, University of Padua, Italy
Carol Peters, ISTI-CNR, Area di Ricerca Pisa, Italy
CLEF 2007 Workshop
Budapest, Hungary, 19–21 September 2007
Outline
① CLEF Infrastructure: DIRECT
② Track Overview
③ Monolingual Tasks
④ Bilingual Tasks
CLEF 2007
Budapest, Hungary, 19–21 September 2007
G.M. Di Nunzio, N. Ferro, and C. Peters
Information Hierarchy
[Figure: pyramid of four levels, from bottom to top: Data (experiments and experimental collections), Information (measures), Knowledge (statistics), Wisdom (papers)]
• experimental collections and the experiments are data, since they are the raw, basic elements needed for any further investigation
• performance measurements are information, since they are the result of computations and processing on the data
• descriptive statistics and the hypothesis tests are knowledge, since they are a further elaboration of the information carried by the performance measurements
• theories, models, algorithms, and techniques are wisdom, since they provide interpretation, explanation, and formalization of the content of the previous levels
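As a rough illustration of the first three levels, the sketch below turns data (a run and its relevance judgments) into information (a per-topic performance measurement) and then into knowledge (descriptive statistics). The slide does not name a specific measure; average precision is assumed here, and the topic identifiers, document identifiers, and the names `average_precision`, `run`, and `qrels` are hypothetical.

```python
from statistics import mean, median


def average_precision(ranking, relevant):
    """Average precision of one ranked list against a set of relevant documents."""
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document
    return precision_sum / len(relevant) if relevant else 0.0


# Data level: a hypothetical experiment (ranked documents per topic)
# and hypothetical relevance judgments.
run = {"T1": ["d1", "d2", "d3", "d4"], "T2": ["d7", "d5", "d6"]}
qrels = {"T1": {"d1", "d3"}, "T2": {"d5"}}

# Information level: one performance measurement per topic.
ap_by_topic = {topic: average_precision(run[topic], qrels[topic]) for topic in run}

# Knowledge level: descriptive statistics over the measurements.
summary = {
    "mean": mean(ap_by_topic.values()),
    "median": median(ap_by_topic.values()),
}
print(ap_by_topic, summary)
```

Hypothesis tests, the other ingredient of the knowledge level, would operate on the same per-topic measurements from two or more experiments.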
Approach to the Evaluation (1/2)
• Introduce a conceptual model
  - it makes clear which entities make up the information space of an evaluation campaign, their features, and their relationships
  - logical models can be derived from it to manage and preserve the experimental data
  - commonly agreed data formats for exchanging information can be derived from it
• Develop common metadata formats
  - they provide meaning to the data, and thereby enable their sharing and re-use
  - they make it possible to keep track of the lineage of the managed information
• Adopt a unique identification mechanism
  - it allows for explicit citation of and easy access to the scientific data, and it supports the enrichment of the scientific data
Approach to the Evaluation (2/2)
• Provide common tools for statistical analyses
  - they allow for judging whether measured differences between retrieval methods can be considered statistically significant
  - a uniform way of performing statistical analyses on experiments makes the analysis and assessment of the experiments comparable too
• Design and develop a Digital Library System (DLS) for IR scientific data
  - it is well suited for managing and making accessible the scientific data and the experiments produced during the course of an evaluation campaign
  - it also provides tools for analyzing, comparing, and citing the scientific data of an evaluation campaign, as well as curating, preserving, annotating, enriching, and promoting their re-use
• Give organizations responsible for evaluation initiatives an active role in this process
  - they should take a leadership role in developing a comprehensive strategy for long-lived digital data collections and guide the research community through this process in order to improve the way research is done
  - they should also define guiding principles, policies, and best practices for making use of the scientific data produced during the evaluation campaign itself
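The first point on this slide, judging whether a measured difference between two retrieval methods is statistically significant, can be sketched with a paired test over per-topic scores. The slides do not say which tests are used; an exact paired sign-flip (randomization) test is assumed here purely for illustration, and the function name and the scores are hypothetical.

```python
from itertools import product


def paired_sign_flip_test(scores_a, scores_b):
    """Exact two-sided paired randomization test on per-topic score differences.

    Under the null hypothesis the sign of each per-topic difference is
    arbitrary, so every sign assignment is enumerated (2**n of them, which is
    only practical for a small number of topics).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same topics")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        statistic = abs(sum(s * d for s, d in zip(signs, diffs)))
        if statistic >= observed - 1e-12:  # tolerance for float rounding
            extreme += 1
    return extreme / total


# Hypothetical per-topic average precision for two systems on eight topics.
system_a = [0.6, 0.7, 0.8, 0.5, 0.9, 0.65, 0.75, 0.85]
system_b = [0.5, 0.6, 0.7, 0.4, 0.8, 0.55, 0.65, 0.75]
p_value = paired_sign_flip_test(system_a, system_b)
print(f"p = {p_value:.4f}")
```

Applying one such procedure uniformly to all experiments is what makes the resulting significance statements comparable across tasks and participants.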
Internationalization of the User Interface
Bulgarian: Petya Osenova, Kiril Simov
Czech: Pavel Pecina
English: Marco Dussin
French: Jacques Savoy
German: Thomas Mandl
Indonesian: Mirna Adriani
Italian: Marco Dussin
Portuguese: Paulo Rocha, Diana Santos
Spanish: Julio Villena Román
Identification: Digital Object Identifiers (DOI)
10.2415/AH-BILI-X2BG-CLEF2007.JHU-APL.APLBIENBGTD4
• DOIs
  - allow us to uniquely identify a digital object
  - are persistent and actionable
  - are aimed especially at intellectual property
• We assign DOIs to:
  - collections − prefix 10.2453
  - topics − prefix 10.2452
  - experiments − prefix 10.2415
  - pools − prefix 10.2454
  - statistical tests − prefix 10.2455
http://www.medra.org
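The prefix scheme on this slide can be sketched as a small lookup: split a DOI at the first slash and map the prefix to the kind of object it identifies. The prefixes and the sample DOI are taken from the slide; the function name `classify_doi` and the `"unknown"` fallback are assumptions made for illustration.

```python
# Prefix-to-object mapping as listed on the slide.
DOI_PREFIXES = {
    "10.2453": "collection",
    "10.2452": "topic",
    "10.2415": "experiment",
    "10.2454": "pool",
    "10.2455": "statistical test",
}


def classify_doi(doi):
    """Return (object kind, suffix) for a DOI of the form 'prefix/suffix'."""
    prefix, separator, suffix = doi.partition("/")
    if not separator or not suffix:
        raise ValueError(f"not a DOI of the form prefix/suffix: {doi!r}")
    # Prefixes not listed on the slide are reported as "unknown" (assumption).
    return DOI_PREFIXES.get(prefix, "unknown"), suffix


kind, suffix = classify_doi("10.2415/AH-BILI-X2BG-CLEF2007.JHU-APL.APLBIENBGTD4")
print(kind, suffix)
```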
DOI Resolution
http://dx.doi.org
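A DOI is actionable because appending it to the resolver address gives a URL that redirects to the identified object. A minimal sketch of building that URL follows; the helper name `resolution_url` is hypothetical, and the sample DOI is the experiment identifier used elsewhere in these slides.

```python
from urllib.parse import quote

RESOLVER = "http://dx.doi.org"


def resolution_url(doi, resolver=RESOLVER):
    """Build the URL that resolves a DOI through the given resolver."""
    # Percent-encode unsafe characters, but keep the prefix/suffix slash.
    return f"{resolver.rstrip('/')}/{quote(doi, safe='/')}"


url = resolution_url("10.2415/AH-BILI-X2BG-CLEF2007.JHU-APL.APLBIENBGTD4")
print(url)
```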
Experiment Metrics
Experiment Statistics
Experiment Plots
Task Statistics
Task Plots
Appendices (1/2)
Appendices (2/2)
Participation
Participation by Country
Tasks and Collections
• Monolingual and bilingual tasks were offered principally for Central European languages: Bulgarian, Czech and Hungarian

Language | Task | Collection
Bulgarian | Monolingual BG, Bilingual X2BG | Sega 2002, Standart 2002, Novinar 2002*
Czech* | Monolingual CS, Bilingual X2CS | Mlada fronta DNES 2002, Lidové Noviny 2002
Hungarian | Monolingual HU, Bilingual X2HU | Magyar Hirlap 2002
English | Bilingual X2EN (Indian sub-task) | LA Times 2002*

• Topics in 16 languages
  - European languages: Bulgarian, Czech, English, French, Hungarian, Italian and Spanish
  - non-European languages (for X2EN): Amharic, Chinese, Indonesian, Oromo
  - Indian sub-task: Bengali, Hindi, Marathi, Tamil and Telugu
Participation by Task
172 submitted runs
Runs by Source Language
Monolingual Bulgarian
Monolingual Czech
Monolingual Hungarian
Monolingual English*
Approaches to Monolingual Retrieval
• Main emphasis:
  - NLP techniques
  - stemming
  - morphological analysis
  - blind relevance feedback
• Linguistic: stemmers (both light and aggressive), morphological lemmatizer, Named Entity Recognition, word decompounding
• Stemming vs 4-grams: impact on individual topics but not on average
• Relevance feed-back: probabilistic RF, mutual information RF; relevance feed-back can be detrimental
• Indexing: word-based or 4-grams
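For readers unfamiliar with the 4-gram indexing mentioned on this slide, the sketch below breaks text into overlapping character 4-grams as an alternative to word-based or stemmed index terms. The slide gives no implementation details, so the tokenisation rules (lowercasing, n-grams kept within word boundaries, short words kept whole) and the function name are assumptions.

```python
def char_ngrams(text, n=4):
    """Overlapping character n-grams of each word in the text."""
    terms = []
    for word in text.lower().split():
        if len(word) <= n:
            terms.append(word)  # a short word is kept as a single term
        else:
            terms.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return terms


print(char_ngrams("Information retrieval"))
```

Because inflected forms of a word share most of their 4-grams, this kind of indexing gives a language-independent approximation of stemming, which is the comparison the slide refers to.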
Bilingual X → English
Approaches to Bilingual X2EN
• Main emphasis:
  - bilingual dictionaries
  - machine translation
  - coverage of lexicons
  - translation ambiguity resolution
  - use of pivot languages
  - parallel corpora
• Best bilingual English system is about 88% of the best monolingual system
• Other approaches:
  - bilingual dictionaries and pivot languages
  - query expansion with RF
  - a graph based approach
• Afaan Oromo:
  - stemmer
  - stop list creation
  - bilingual Oromo-English dictionary creation
• Bilingual Hungarian to English:
  - bilingual dictionary
  - lexicon coverage with a pattern-based approach
  - exploiting Wikipedia to remove improbable translations
Bilingual X2EN: Indian Subtask
• limited linguistic resources
• Hindi-English and Telugu-English bilingual dictionaries
• statistical MT system trained on parallel aligned sentences
• transliteration of OOV terms: phoneme-based and rule-based transliterations, edit distances
• generation of equivalent English queries
• stop list creation
• stemming and n-gram approach
• stemmers and morphological analyzers, if available
• TFIDF
• language models
• boolean operators
• translation disambiguation via a page-rank style algorithm