Automatic Indexing with the
EuroVoc Thesaurus
Enabling Cross-lingual Search
Marie-Francine Moens
Katholieke Universiteit Leuven, Belgium
Frane Šarić
University of Zagreb, FER, Croatia
18-19 November 2010, Luxembourg – Kirchberg
CADIAL project
 Computer Aided Document Indexing for
Accessing Legislation
 A joint Flemish-Croatian project
 Partners:
 Katholieke Universiteit Leuven
(prof. Marie-Francine Moens)
 University of Zagreb & Hidra
(prof. Bojana Dalbelo Bašić, prof. Marko Tadić)
 Goal: publicly accessible service for automatic
indexing of the official documentation of the
Republic of Croatia
CADIAL project (cont.)
1. Manually index 10,000 documents
 eCADIS – semi-automatic document indexing
2. Use that data to train automatic indexers
 Trained automatic classifiers for every EuroVoc
descriptor
3. Provide indexed data to custom search engine
 CADIAL search engine
eCADIS
 Computer Aided Document Indexing System
 Provides useful information that helps indexers
index documents more quickly
 Counts n-grams
 Includes word normalization
 Extracts collocations
 Suggests appropriate descriptors
 Uses automatically trained classifiers
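The n-gram counting step can be sketched as follows. This is a minimal illustration, not the eCADIS implementation: the tokenizer, the function names and the sample text are assumptions made for the example, and no word normalization is applied.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase the text and keep runs of letters (Unicode-aware),
    # so punctuation and digits are dropped.
    return re.findall(r"[^\W\d_]+", text.lower())


def count_ngrams(tokens, n):
    # Slide a window of length n over the token list and count each tuple.
    # Note: windows are allowed to cross sentence boundaries here.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Illustrative sample text (assumed, not taken from the CADIAL collection).
tokens = tokenize("Zakon o radu. Izmjene zakona o radu.")
bigrams = count_ngrams(tokens, 2)
print(bigrams.most_common(3))
```

Without normalization, "zakon" and "zakona" are counted as different words, which is why the slide lists word normalization as a separate feature.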
eCADIS (cont.)
Morphological normalization
 Croatian = morphologically complex language
 Inflectional variation
 Derivational variation
Morphological normalization
 Lexicon-based normalization [Snajder et al.
2009]:
 Inflectional and derivational rules
 String transformation functions
 Higher order functional representation of Croatian
inflectional morphology:
 Inflectional rules
 Transformations: higher-order functions
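The idea of expressing inflectional rules as higher-order string transformation functions can be illustrated with a small sketch. The helper names (`sfx`, `rsfx`), the two toy paradigms and the three-word lexicon are assumptions made for this example; they are not the rule set or the API of the cited work.

```python
def sfx(suffix):
    # Higher-order function: returns a transformation that appends a suffix.
    return lambda stem: stem + suffix


def rsfx(old, new):
    # Returns a transformation that replaces a trailing `old` with `new`.
    # The transformation yields None when the word does not end in `old`.
    def transform(word):
        if not word.endswith(old):
            return None
        return word[:len(word) - len(old)] + new
    return transform


# Hypothetical toy paradigms (a few case forms only).
MASCULINE = [sfx(""), sfx("a"), sfx("u"), sfx("om"), sfx("i"), sfx("ima")]
FEMININE_KA = [sfx(""), rsfx("a", "e"), rsfx("ka", "ci"), rsfx("a", "u"), rsfx("a", "om")]

# Hypothetical mini-lexicon: lemma -> paradigm.
LEXICON = {"zakon": MASCULINE, "propis": MASCULINE, "odluka": FEMININE_KA}


def build_form_index(lexicon):
    # Generate every word form from every lemma and map it back to the lemma.
    index = {}
    for lemma, paradigm in lexicon.items():
        for transform in paradigm:
            form = transform(lemma)
            if form is not None:
                index.setdefault(form, set()).add(lemma)
    return index


INDEX = build_form_index(LEXICON)


def normalize(word):
    # Return the candidate lemmas, or the word itself if it is unknown.
    word = word.lower()
    return sorted(INDEX.get(word, {word}))


print(normalize("zakonima"), normalize("odluci"))
```

Because each rule is an ordinary function, paradigms are simply lists of functions, and new rules can be composed without changing the lookup code.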
Named entity recognition
 Named entity recognition = semantic classification of
entity name (usually proper name) [Bekavac & Tadic
2009]:
 Person, location, organization, date, ...
 Use of lists of names and use of finite state
automata
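A combination of name lists and finite-state patterns can be sketched as below. The gazetteer entries, the date pattern and the overlap policy are assumptions chosen for illustration; a regular expression plays the role of the finite-state pattern, and the sketch is not the recognizer described in the cited work.

```python
import re

# Hypothetical mini-gazetteers; a real system would use much larger lists.
GAZETTEER = {
    "LOCATION": ["Zagreb", "Luxembourg", "Leuven"],
    "ORGANIZATION": ["University of Zagreb", "Hidra"],
}

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
# Pattern for dates such as "18 November 2010".
DATE_RE = re.compile(r"\b\d{1,2} (?:" + MONTHS + r") \d{4}\b")


def find_entities(text):
    spans = [(m.start(), m.end(), "DATE") for m in DATE_RE.finditer(text)]
    for label, names in GAZETTEER.items():
        for name in names:
            for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
                spans.append((m.start(), m.end(), label))
    # Prefer longer matches and drop any span overlapping an accepted one,
    # so "Zagreb" inside "University of Zagreb" is not reported separately.
    spans.sort(key=lambda s: (-(s[1] - s[0]), s[0]))
    accepted = []
    for s in spans:
        if all(s[1] <= a[0] or s[0] >= a[1] for a in accepted):
            accepted.append(s)
    return [(text[start:end], label) for start, end, label in sorted(accepted)]


sample = "Hidra and the University of Zagreb met in Zagreb on 18 November 2010."
print(find_entities(sample))
```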
Lexical association metrics
 Collocation: meaning of a compound term cannot be inferred
from the meanings of its individual terms
 Collocations are valuable index terms
 Several methods were developed:
 Extraction of linked terms in Wikipedia, filtered by
acceptable part-of-speech patterns [Bekavac & Tadic 2009]
 Terme-X: use of lexical association measures (e.g., chi-square,
log-likelihood ratio for a binomial distribution, pointwise
mutual information) to build a dictionary of collocations
filtered by acceptable part-of-speech patterns [Delac et al. 2009]
 Using a genetic programming algorithm to learn a
language-adapted lexical association measure [Snajder et al.
2009]
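One of the association measures named above, pointwise mutual information, can be computed for adjacent word pairs as in the following sketch. The function name, the `min_count` threshold and the sample sentence are assumptions for illustration; part-of-speech filtering is omitted, and this is not the Terme-X implementation.

```python
import math
from collections import Counter


def pmi_scores(tokens, min_count=2):
    # PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ), estimated from counts.
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            # PMI is unreliable for rare pairs, so skip them.
            continue
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Illustrative sample text (assumed).
sample_tokens = ("narodne novine objavljuju zakone i narodne novine "
                 "objavljuju propise i odluke").split()
scores = pmi_scores(sample_tokens)
for pair, value in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(value, 2))
```

A positive score means the two words co-occur more often than independence would predict, which makes the pair a collocation candidate.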
Text categorization
 = Assignment of terms of the EUROVOC thesaurus
 Currently done at the statute level
 Problem
 Large number of features (terms) and often few
training examples
 => feature selection: chi square, frequent item
sets, linear classifier weights, ... [Boiy & Moens
2009]
 Use of common classification algorithms: support
vector machines, logistic regression, ... [Saric et al. in
preparation]
 EUROVOC = multilingual => terms can be used in cross-lingual retrieval
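Chi-square feature selection, one of the techniques listed above, can be sketched for a single descriptor as follows. Documents are represented as sets of terms and the descriptor as a boolean label per document; the function names and the toy data are assumptions for this example, not the project's code.

```python
def chi_square(n11, n10, n01, n00):
    # 2x2 contingency table for one term and one category:
    #   n11: in category, contains term     n10: not in category, contains term
    #   n01: in category, lacks term        n00: not in category, lacks term
    n = n11 + n10 + n01 + n00
    numerator = n * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return numerator / denominator if denominator else 0.0


def select_features(docs, labels, k):
    # docs: list of sets of terms; labels: list of booleans for one descriptor.
    vocabulary = set().union(*docs)
    scores = {}
    for term in vocabulary:
        n11 = sum(1 for d, y in zip(docs, labels) if term in d and y)
        n10 = sum(1 for d, y in zip(docs, labels) if term in d and not y)
        n01 = sum(1 for d, y in zip(docs, labels) if term not in d and y)
        n00 = len(docs) - n11 - n10 - n01
        scores[term] = chi_square(n11, n10, n01, n00)
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy data (assumed): the label marks documents about taxation.
docs = [
    {"tax", "income", "law"},
    {"tax", "rate", "law"},
    {"fishing", "quota", "law"},
    {"fishing", "vessel", "law"},
]
labels = [True, True, False, False]
print(select_features(docs, labels, 2))
```

The term "law" occurs in every document and scores zero, so it is dropped, while terms that separate the two groups are kept. The selected terms would then be passed to a classifier such as a support vector machine or logistic regression.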
Text categorization
 Core of the CADIAL project
 System suggests index terms to the human
indexers
 High categorization performance: e.g., F1 measure
in the 80% range
 As the number of categorized documents grows, we
hope to learn better classification models
 Possibility to exploit the hierarchical
organization of the thesaurus terms to improve
accuracy of the categorization
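One possible way to exploit the thesaurus hierarchy is to propagate each assigned descriptor to its broader terms, so that classifiers for broader terms see more training examples. The sketch below assumes a single broader term per descriptor and uses a made-up hierarchy fragment; it is an illustration of the idea, not the method used in CADIAL.

```python
# Hypothetical fragment of a thesaurus: descriptor -> broader term.
BROADER = {
    "income tax": "direct tax",
    "direct tax": "taxation",
    "VAT": "indirect tax",
    "indirect tax": "taxation",
}


def ancestors(term, broader):
    # Walk up the hierarchy, guarding against accidental cycles.
    chain = []
    seen = {term}
    while term in broader:
        term = broader[term]
        if term in seen:
            break
        seen.add(term)
        chain.append(term)
    return chain


def expand_with_ancestors(terms, broader):
    # Add every broader term of every assigned descriptor.
    expanded = set(terms)
    for term in terms:
        expanded.update(ancestors(term, broader))
    return expanded


print(sorted(expand_with_ancestors({"income tax", "VAT"}, BROADER)))
```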
[Bennett & Nguyen 2009]
TMT: Object-oriented text classification library
Comparing document classification schemes
 Problem: discrepancy between the classification scheme
(e.g. EUROVOC thesaurus) and the natural clusters
formed by the documents
 How to find this discrepancy so that the
classification scheme can be adapted? [Silic et
al. 2009]
 Finding an optimal clustering and comparing it with
the clusters formed by the ground-truth categories
of the documents
 Dimensionality reduction with principal component
analysis (PCA): visualization of the clusters
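A simple way to quantify how well a clustering agrees with ground-truth categories is cluster purity, sketched below. Purity is used here only as an illustrative stand-in; the slide does not state which comparison measure the cited work uses, and the sketch assumes one category per document, whereas EuroVoc indexing assigns several descriptors.

```python
from collections import Counter, defaultdict


def purity(cluster_ids, category_ids):
    # For each cluster, count how many documents fall into its most frequent
    # category; purity is the share of documents covered that way.
    by_cluster = defaultdict(Counter)
    for cluster, category in zip(cluster_ids, category_ids):
        by_cluster[cluster][category] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(cluster_ids)


# Toy example (assumed): six documents, two clusters, three categories.
clusters = [0, 0, 0, 1, 1, 1]
categories = ["a", "a", "b", "b", "b", "c"]
print(round(purity(clusters, categories), 3))
```

A purity of 1.0 means every cluster contains documents of a single category; lower values point to categories that the natural clusters split or merge.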
CADIAL Search Engine
 http://cadial.hidra.hr
 Full text search over a collection of 20,000 legal
documents
 Documents are automatically indexed using
EuroVoc descriptors
 Hidra ensures that additional metadata is
correct:
 Regulation status (valid / invalid)
 Area of activity
 EU accession chapter
The CADIAL search engine
 Possibility to search:
 Full text
 Titles
 EUROVOC thesaurus terms
 Historical versions
 ...
 Legislation: semi-structured documents: possibility to
take the structure into account when computing the
relevance of an article, section, etc.
 Successful participation at the INEX competition 2008
[Mijic et al. 2009]
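Taking document structure into account when computing relevance can be sketched as a weighted sum of term matches per structural element. The field names, the weights and the scoring function are hypothetical choices for illustration and do not describe the CADIAL search engine's actual ranking.

```python
# Hypothetical weights: a match in a title counts more than one in the body.
FIELD_WEIGHTS = {"title": 3.0, "article_heading": 2.0, "body": 1.0}


def score(document, query_terms, weights=FIELD_WEIGHTS):
    # document: mapping from structural element name to its text.
    total = 0.0
    for field, text in document.items():
        tokens = text.lower().split()
        weight = weights.get(field, 1.0)
        total += weight * sum(tokens.count(term.lower()) for term in query_terms)
    return total


# Toy documents (assumed).
doc_a = {"title": "Labour act", "body": "general provisions"}
doc_b = {"title": "Trade act", "body": "labour inspection and labour disputes"}
print(score(doc_a, ["labour"]), score(doc_b, ["labour"]))
```

Here the document with a single title match outranks the one with two body matches, which is the kind of effect structure-aware ranking is meant to produce.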
CADIAL Search – live demo
CADIAL Search – document metadata
Towards cross-lingual search
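The basic mechanism by which a multilingual thesaurus enables cross-lingual search can be sketched as follows: documents are indexed with language-independent descriptor identifiers, and a query label in any language is mapped to its identifier before lookup. The identifiers, the labels and the document names below are made up for illustration and are not real EuroVoc data.

```python
# Hypothetical descriptor identifiers with illustrative labels per language.
LABELS = {
    "D1": {"en": "taxation", "hr": "oporezivanje", "nl": "belasting"},
    "D2": {"en": "fisheries", "hr": "ribarstvo", "nl": "visserij"},
}

# Hypothetical index: document -> descriptor identifiers assigned to it.
DOCUMENT_INDEX = {
    "doc-hr-1": {"D1"},
    "doc-hr-2": {"D2"},
    "doc-hr-3": {"D1", "D2"},
}


def search(query_label, lang):
    # Map the label in the query language to descriptor identifiers.
    wanted = {
        descriptor
        for descriptor, labels in LABELS.items()
        if labels.get(lang, "").lower() == query_label.lower()
    }
    # Return every document indexed with at least one of those identifiers.
    return sorted(doc for doc, ids in DOCUMENT_INDEX.items() if ids & wanted)


print(search("taxation", "en"))
print(search("visserij", "nl"))
```

The documents themselves can be in Croatian; only the descriptor labels need to exist in the query language.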
Cross-lingual indexing
 Classification/indexing of documents = supervised
machine learning of the classification patterns based
on annotated training examples
 When multilingual documents are not linked:
manual annotation is needed in each language
 Can be a substantial manual effort:
 Changing collections, taxonomies
 Many official languages in the EU
 Transfer learning can be a solution
Cross-lingual indexing
 Potential of transfer learning techniques [Pan & Yang
IEEE TKDE 2010]
 Co-training and co-regularization techniques for
learning classification patterns from documents in
multiple languages [Amini et al. SIGIR 2010]
Conclusions
 CADIAL = valuable example of automatic indexing
enabling cross-lingual search
 EUROVOC thesaurus is a valuable resource
 Many different future tracks of research aiming at
more flexible and accurate indexing
http://www.cadial.org/
 The CADIAL project has received the 2009
Prime Minister Award for special achievements
in the field of e-Government in Croatia and the
2009 "Golden Tesla's Egg" Award of the VIDI
publishing house for the best innovative
solution in ICT for the category Academic
Institutions. The project was invited to
participate at CeBIT 2010, the world's foremost
tradeshow for the digital industry.