Managing Morphologically
Complex Languages in
Information Retrieval
Kal Järvelin & Many Others
University of Tampere
1. Introduction

Morphologically complex languages
unlike English, Chinese
 rich inflectional and derivational morphology
 rich compound formation


U. Tampere experiences 1998 - 2008
monolingual IR
 cross-language IR
 focus: Finnish, Germanic languages, English

Methods for Morphology
Variation
Management
Reductive
Methods
Stemming
Generative
Methods
Lemmatization
Infl stem
Generation
Word Form
Generation
Rules +
Dict
Rules +
Dict
Inflectional
Stems
FCG
Rulebased
Rulebased
Inflectional
Stems
Generating
All Forms
enhanced
Agenda







1. Introduction
2. Reductive Methods
3. Compounds
4. Generative Methods
5. Query Structures
6. OOV Words
7. Conclusion
2. Normalization

Reductive methods, conflation







stemming
lemmatization
+ conflation -> simpler searching
+ smaller index
+ provides query expansion
Stemming available for many languages (e.g. Porter
stemmer)
Lemmatizers less available and more demanding
(dictionary requirement)
Alkula 2001

Boolean environment, inflected index, Finnish:




Boolean environment, infl vs. lemma index, Finnish:




manual truncation vs. automatic stemming
stemming improves P and hurts R
many derivatives are lost
manual truncation vs. lemmatization
lemmatization improves P and hurts R
many derivatives are lost, others correctly avoided
Differences not great between automatic methods
Kettunen & al 2005


Ranked retrieval, Finnish:
Three problems




how lemmatization and inflectional stem generation compare
in a best-match environment?
is a stemmer realistic for the handling Finnish morphology?
feasibility of simulated truncation in a best-match system?
Lemmatized vs inflected form vs. stemmed index.
Kettunen & al. 2005







Method
FinTWOL
Inf Stem Gen
Porter
Raw
Index
lemmas
inflform
stemmed
inflform
MAP
35.0
34.2
27.7
18.9
Change %
-- 2.3
- 20.9
- 46.0
But very long queries for inflectional stem generation &
expansion (thousands of words); weaker generations
shorter but progressively deteriorating results.
(InQuery/TUTK/graded-35/regular; )
Kettunen & al. 2005
QuickTime™ and a
decompressor
are needed to see this picture.
MonoIR: Airio 2006
L an g u ag e
Index t ype
En g lish
In flected
L emm as
Ste mm ed
43 .4
45 .6
46 .3
2.2
2.9
In flected
L emm as
Ste mm ed
31 .0
47 .0
48 .5
16 .0
17 .5
In flected
L emm as
Ste mm ed
30 .2
31 .4
33 .5
1.2
3.3
In flected
L emm as
Ste mm ed
30 .2
31 .9
35 .7
1.7
5.5
Fin nish
S w edish
G er m an
Av era g e P %
InQuery/CLEF/TD/TWOL&Porter&Raw
Di ff Base %
CLIR: Inflectional Morphology




NL queries contain inflected form source keys
Dictionary headwords are in basic form (lemmas)
Problem significance varies by language
Stemming

stem both the dictionary and the query words


but may cause all too many translations
Stemming in dictionary translation best applied after
translation.
Lemmatization in CLIR

Lemmatization
easy to access dictionaries
 but tokens may be ambiguous
 dictionary translations not always in basic form
 lemmatizer’s dictionary coverage

insufficient -> non-lemmatized source keys, OOVs
 too broad coverage -> too many senses provided

CLIR Findings: Airio 2006
English -> X
T ar g et X
Index t ype
Fin nish
L emm as
29 .0
-6.5
Ste mm ed
20 .8
-14 .7
L emm as
17 .4
-9.7
Ste mm ed
19 .0
-8.1
L emm as
26 .4
-4.6
Ste mm ed
25 .7
-5.3
S w edish
G er m an
Av era g e P %
Di ff to S plit%
InQuery/UTAClir/CLEF/GlobalDix/TWOL&Porter
Agenda







1. Introduction
2. Reductive Methods
3. Compounds
4. Generative Methods
5. Query Structures
6. OOV Words
7. Conclusion
3. Compounds

Compounds, compound word types
determinative: Weinkeller, vinkällare, life-jacket
 copulative: schwartzweiss, svartvit, black-and-white
 compositional: Stadtverwaltung, stadsförvaltning
 non-compositional: Erdbeere, jordgubbe, strawberry


Note on spelling : compound word components
are spelled together (if not -> phrases)
Compound Word Translation

All compounds are not in dictionary
some languages are very productive
 small dictionaries: atomic words, old noncompositional compounds
 large dictionaries: many compositional compounds
added


Compounds remove phrase identification
problems, but cause translation and query
formulation problems
Joining Morphemes


Joining morphemes
complicate compound
analysis & translation
Joining morpheme types in
Swedish






<omission> flicknamn
-s rättsfall
-e flickebarn
-a gästabud
-u gatubelysning
-o människokärlek

Joining morpheme types in
German








-s Handelsvertrag
-n Affenhaus
-e Gästebett
-en Fotographenausbildung
-er Gespensterhaus
-es Freundeskreis
-ens Herzensbrecher
<omission> Sprachwissenschaft
Suggestive finding that the treatment of joining morphemes improves MAP by 2 %
- Hedlund 2002, SWE->ENG, 11 Qs
Compound Processing, 2

A Finnish natural language
query:





lääkkeet sydänvaivoihin
(medicines for heart problems)


The output of morphological
analysis


lääke
sydänvaiva, sydän, vaiva
Dictionary translation and the
output of component tagging:


lääke ---> medication drug
sydänvaiva - ”not in dict”
sydän ---> heart
vaiva ---> ailment, complaint,
discomfort, inconvenience, trouble,
vexation
Many ways to combine
components in query
Compound Processing, 3

Sample English CLIR query:



#sum( #syn( medication drug ) heart #syn( ailment,
complaint, discomfort, inconvenience, trouble, vexation ))
i.e. translating as if source compounds were phrases
Source compound handling may vary here:



#sum( #syn( medication drug ) #syn(#uw3( heart ailment )
#uw3( heart complaint ) #uw3( heart discomfort ) #uw3(
heart inconvenience ) #uw3( heart trouble ) #uw3( heart
vexation )))
#uw3 = proximity operator for three intervening words, free
word order
i.e. forming all proximity combinations as synonym sets.
Compound Processing, 4



No clear benefits seen from using proximity
combinations.
We did neither observe a great effect in
changing the proximity operator (OD vs. UW)
Some monolingual results follow (Airio 2006)
L an g u ag e
Index t ype
En g lish
In flected
L emm as
Ste mm ed
43 .4
45 .6
46 .3
2.2
2.9
In flected
L emm a & deco m p
L emm as
Ste mm ed
31 .0
50 .5
47 .0
48 .5
19 .5
16 .0
17 .5
In flected
L emm a & deco m p
L emm as
Ste mm ed
30 .2
38 .8
31 .4
33 .5
8.6
1.2
3.3
In flected
L emm a & deco m p
L emm as
Ste mm ed
30 .2
36 .2
31 .9
35 .7
6.0
1.7
5.5
Fin nish
S w edish
G er m an
Av era g e Di ff to
P%
Baseline
InQuery/CLEF/Raw&TWOL&Porter
English
QuickTime™ and a
decompressor
are needed to see this picture.
Swedish
QuickTime™ and a
decompressor
are needed to see this picture.
Finnish
QuickTime™ and a
decompressor
are needed to see this picture.
Hedlund 2002

Compound translation as compounds:
47 German CLEF 2001 topics, English docs
collection.
 comprehensive dictionary (many compounds) vs.
small dict (no compounds)
 mean AP 34.7% vs. 30.4%
 dictionary matters ...


Alternative approach: if not translatable, split
and translate components
CLEF Ger -> Eng
1. best manually translated
0,4465
2. large dict, no comp splitting
0,3520
3. limited dict, no comp splitting
0,3057
4. large dictionary & comp splitting
0,3830
5. limited dict & comp splitting
0,3547
InQuery/UTAClir/CLEF/Duden/TWOL/UW 5+n
CLIR Findings: Airio 2006
English ->
T ar g et
lan g u ag e
Index t ype
Fin nish
L emm a & deco m p
L emm as
Ste mm ed
L emm a & deco m p
L emm as
Ste mm ed
L emm a & deco m p
L emm as
Ste mm ed
S w edish
G er m an
Av era g e Di ff to
P%
Baseline%
35 .5
29 .0
20 .8
27 .1
17 .4
19 .0
31 .0
26 .4
25 .7
-6.5
-14 .7
-9.7
-8.1
-4.6
-5.3
InQuery/UTAClir/CLEF/GlobalDix/TWOL&Porter
Eng->Fin
QuickTime™ and a
decompressor
are needed to see this picture.
Eng->Ger
QuickTime™ and a
decompressor
are needed to see this picture.
Eng->Swe
QuickTime™ and a
decompressor
are needed to see this picture.
QuickTime™ and a
decompressor
are needed to see this picture.
Agenda







1. Introduction
2. Reductive Methods
3. Compounds
4. Generative Methods
5. Query Structures
6. OOV Words
7. Conclusion
4. Generative Methods
Variation
handling
Reductive
Methods
Stemming
Generative
Methods
Lemmatization
Infl stem
Generation
Word Form
Generation
Rules +
Dict
Rules +
Dict
Inflectional
Stems
FCG
Rulebased
Rulebased
Inflectional
Stems, ench
Generating
All Forms
Generative Methods: inf stems

Instead of normalization, generate inflectional
stems for an inflectional index.
then using stems harvest full forms from the index
 long queries ...

... OR ...

Instead of normalization, generate full
inflectional forms for an inflectional index.
Long queries? Sure!
 Sounds absolutely crazy ...

... BUT!



Are morphologically complex languages that
complex in IR in practice?
Instead of full form generation, only generate
sufficient forms -> FCG
In Finnish, 9-12 forms cover 85% of all
occurrences of nouns
Kettunen & al 2006: Finnish
IR
MAP for relevance level
Method Liberal Normal Stringent
TWOL
37.8
35.0
24.1
FCG12
32.7
30.0
21.4
FCG6
30.9
28.0
21.0
Snowball 29.8
27.7
20.0
Raw
19.6
18.9
12.4
... monolingual ...
Kettunen & al 2007: Other Langs
IR
Method
TWOL
FCG
FCG
Snowball
Raw
MAP for Language
Swe
Ger
Rus
32.6
39.7
..
30.6 /4
38.0 /4
32.7 /2
29.1 /2
36.8 /2
29.2 /6
28.5
39.1
34.7
24.0
35.9
29.8
Results for long queries ... monolingual ...
CLIR Findings: Airio 2008
Language
pairs
Raw
transl
Fi-FCG_9
Fi-FCG_12
Sv-FCG_4
Sv-FCG_7
Lemmatized
Fin -> Eng
11.2
32.4
32.5
39.6
Fin -> Swe
14.3
22.6
23.9
35.2
Eng -> Swe
18.1
25.1
27.3
34.1
Swe -> Fin
11.7
28.0
27.9
37.6
Agenda







1. Introduction
2. Reductive Methods
3. Compounds
4. Generative Methods
5. Query Structures
6. OOV Words
7. Conclusion
5. Query Structures

Translation ambiguity such as ...

Homonymy: homophony, homography


Examples: platform, bank, book
Inflectional homography
Examples: train, trains, training
 Examples: book, books, booking


Polysemy


Examples: back, train
... a problem in CLIR.
Ambiguity Resolution

Methods
Part-of-speech tagging (e.g. Ballesteros & Croft ‘98)
 Corpus-based methods Ballesteros & Croft ‘96; ‘97;
Chen & al. ‘99)

Query Expansion
 Collocations


Query structuring - the Pirkola Method (1998)
Query Structuring

Concepts?
From weak to strong
query structures by
recognition of ...
concepts
 expression weights
 phrases, compounds
no
Weighting ?


Queries may be
combined ... query
fusion
yes
no
Weighting ?
yes
no
~~
~~
yes
Phrases ?
no
Phrases ?
yes
no
~~
~~
#sum(a b c d
e)
yes
#wsum(1 3 #syn(a #3(b
c)) 1 #syn(d e))
Structured Queries in CLIR

CLIR performance (Pirkola 1998, 1999)
English baselines, manual Finnish translations
 Automatic dictionary translation FIN -> ENG

natural language queries (NL) vs. concept queries (BL)
 structured vs. unstructured translations
 single words (NL/S) vs. phrases marked (NL/WP)
 general and/or special dictionary translation


500.000 document TREC subcollection
probabilistic retrieval (InQuery)
 30 health-related requests

The Pirkola Method


All translations of all senses provided by the
dictionary are incorporated in the query
All translations of each source language word
are combined by the synonym operator,
synonym groups by #and or #sum

this effectively provides disambiguation
An Example

Consider the Finnish natural language query:
lääke sydänvaiva [= medicine heart_problem]
Sample English CLIR query:
 #sum( #syn( medication drug ) heart #syn( ailment,
complaint, discomfort, inconvenience, trouble,
vexation ) )



Each source word forming a synonym set
Query Translation Test Set-up
English Request
Translated Finnish Request
Finnish NL Query
General Dict
Baseline Queries
TREC
Finnish BL Query
Med. Dict. General Dict
Translated English Queries
InQuery
Unix-server
Med. Dict.
Unstructured NL/S Queries
40
P
r
e
c
i
s
i
o
n
35
30
Baseline
25
20
15
10
5
0
10
20
30
40
50
60
70
80
90
100
Recall
GD
SD -- > GD
SD and GD
BL
#sum(tw11, tw12, ... , tw21, tw22, ... twn1, ... , twnk)
Only 38%
of the
average
baseline
precision
(sd&gd)
Structured Queries w/ Special
Dictionary
77% of the
average
baseline
precision
(sd & gd)
40
P
r
e
c
i
s
i
o
n
35
Baseline
30
25
20
15
10
5
0
10
20
30
40
50
60
70
80
Recall
GD
SD --> GD
SD and GD
BL
90
100
Structure
doubles
precision in
all cases
#and(#syn(tw11, tw12, ... ), #syn(tw21, tw22, ...), #syn( twn1, ..., twnk))
Query Structuring, More Results
CLEF
Topic set 2000
(N = 33)
Unstructured
queries
no s-grams
Structured
queries
no s-grams
Change %
Finnish - Eng
0.1609
0.2150
33.6
German - Eng
0.2097
0.2639
25.8
Swedish - Eng
0.2015
0.2242
11.3
Topic set 2001
(N = 47)
Finnish - Eng
0.2407
0.3443
43.0
German - Eng
0.3242
0.3830
18.1
Swedish - Eng
0.3466
0.3465
0.0
Transit CLIR – Query Structures
TRA N S LA TIO N T Y PE
Averag e
D if fer e n c e
preci s ion
% u nit s
ENG m onoli ngu al
45.2
SWE -ENG s tru ctured
34.6
SWE -ENG u ns tru ctured
28.8
- 5.8
SWE -F IN -E N G s tru ctured
36.1
+ 1.5
SWE -F IN -E N G uns tru c tured
18.3
- 16.3
FI N -E N G
36.4
FI N -SW E -ENG
33.5
-2.9
Average precision for the transitive, bilingual and
monolingual runs of CLEF 2001 topics (N = 50)
Transitive CLIR Results, 2
Transitive CLIR Effectiveness
Regular
Relevance
MAP
N=35
% monolingual
performance
Swe-Eng-Fin
21.8
59**
68**
Eng-Swe-Fin
27.5
75
88
Ger-Eng-Fin
24.3
66**
83
Ger-Swe-Fin
29.3
79
100
Mean
25.7
70
85
Lehtokangas & al 2008
% direct
performance
TransCLIR + pRF effectiveness
Regular
Relevance
%
MAP
monolingual +
N=35
pRF
%
monolingual
%
direct
%
pRF exp direct
Swe-Eng-Fin 28.7
68**
78
90
78*
Eng-Swe-Fin 32.3
76*
88
103
87
Ger-Eng-Fin 33.6
79
91
115
104
Ger-Swe-Fin 34.5
81
94
118
107
76
88
107
94
Mean
32.3
Agenda







1. Introduction
2. Reductive Methods
3. Compounds
4. Generative Methods
5. Query Structures
6. OOV Words
7. Conclusion
6. OOV Words

Low coverage -- non-translated words

Domain-specific terms in general dictionaries






e.g. dystrophy
Covered in domain-specific dictionaries
Compound words
Proper names: persons, geographical names, …
Often central for query effectiveness
Large dictionaries solution? BUT:



Excessive number of senses and words for each
Increases ambiguity problems
and still many words remain OOVs
OOV Processing

Proper names are often spelled differently between
languages




Transliteration variation
Brussels, Bryssel; Chernobyl, Tshernobyl; Chechnya,
Tsetsenia
Non-translated keys are often used as such in CLIR
queries -- simplistic approach, not optimal
In some languages, proper names may inflect

In Finnish ”in Cairo” = Kairossa
Approximate String Matching

Means for
correcting spelling errors
 matching variant word forms (e.g. proper names
between languages)


Methods
edit-distance, n-grams, skip-grams
 soundex, phonix, based of phone similarity
 transliteration

Sample digram Matching

Sample words Bryssel, Brussels, Bruxelles
Bryssel --> N1= {br, ry, ys, ss, se, el}
 Brussels --> N2 = {br, ru, us, ss, se, el, ls}
 Bruxelles --> N3 = {br, ru, ux, xe, el, ll, le, es}




sim(N1, N2) = | N1  N2| / | N1  N2|
= 4 / 9 = 0.444
sim(N1, N3) = 1/6 = 0.167
Skip-grams






Generalizing n-gram matching
The strings to be compared are split into
substrings of length n
Skipping characters is allowed
Substrings produced using various skip lengths
n-grams remain a special case: no skips
(Pirkola & Keskustalo & al 2002)
An Example

katalyyttinen – catalytic

skip-0:



skip-1:



{_a, kt, aa, tl, ay, ly, yt, yt, ti, tn, ie, nn, e_}
{_a, ct, aa, tl, ay, lt, yi, tc, i_}
skip-2:



{_k, ka, at, ta, al, ly, yy, yt, tt, ti, in, ne, en, n_}
{_c, cs, at, ta, al, ly, yt, ti, ic, c_}
{_t, ka, al, ty, ay, lt, yt, yi, tn, te, in, n_}
{_t, ca, al, ty, at, lc, yc, t_}
calculate similarity over different skip-gram sets
Skip-gram effectiveness

Several n-gram methods tested
n-grams & skip-grams
 with & without padding


The relative improvement for some s-grams vs.
n-grams in X -> Finnish name matching:
18.2% (Eng medical)
49.7% (Eng geograph)
 20.7% (Ger geograph)
17.1% (Swe geograph)
 Statistically significant, p = 0.01-0.001

CLIR Findings: Järvelin & al 2008

Closely related languages


Norwegian and Swedish
Translation by string matching alone
no dictionary at all, no other vocabulary source
 encouraging results

CLIR Findings: Järvelin & al 2008
QuickTime™ and a
decompressor
are needed to see this picture.
CLIR Findings: Airio 2008
Language pairs
Raw
transl
*)
Fin -> Eng
Fi-sgram_9
Fi-sgram_12
Sv-sgram_4
Sv-sgram_7
*)
Lemmatized
*)
11.2
29.2
31.0
39.6
Eng -> Swe 18.1
25.7
25.3
34.1
Swe -> Fin- 11.7
26.1
26.7
37.6
Fin -> Swe
26.7
22.6
35.2
14.3
*) target index not normalized
Rule-based Transliteration

Between languages, the originally ‘same’ words have
different spellings:


Proper names, technical terms
Transliteration rules are based on regular variations in
the spelling of equivalent word forms between
languages





construction
somatology
universiti
- konstruktio,
- somatologia,
- university,
c -> k
y -> ia
i -> y
Transliteration rules mined in a bilingual word list
Their frequency and reliability are recorded
Sample Rules German-toEnglish






Source
string
Target
string
Location
of the rule
Confidence
factor
ekt
m
akt
ko
ect
ma
act
co
middle
end
middle
beginning
89.2
21.1
86.7
80.7
TRT Translation – 2 Steps
TRT Rule
Production
TRT
Rule
Base
Target
Index
OOV
Source
Word
TRT
Translation
Skipgram matching
Translation
Candidates
Identified
Target
Word(s)
Recall
Precision
1
2
Evaluation
TRT – rule collections

Rule collection
# Rules
Spanish-English
Spanish-German
Spanish-French
German-English
French-English
8800
5412
9724
8609
9873






# Rules
CF>4.0%, Fr. >2
1295
984
1430
1219
1170
TRT Effectiveness

Finnish-to-English translation (HCF)

Term type
Bio terms
Place names
Economics
Technology
Miscellaneous





Digrams TRT + digrams
61,4
72,0
30,0
35,9
32,2
38,0
31,6
53,7
33,8
40,6
% chg
+17,3
+19,7
+18,0
+69,9
+20,1
A Problem with TRT


TRT gives many target word forms for a source word
(even tens of thousands) but does not indicate the
correct one
For example, in Spanish-English translation TRT gives
the following forms of for a Spanish word biosintesis:


biosintesis, biosintessis, biosinthesis, biosinthessis,
biosyntesis, biosyntessis, biosynthesis, biosynthessis
To identify the correct equivalent, we use FITE

frequency-based identification of equivalents
Sample Frequency Pattern


The frequency pattern associated with the English
target word candidates for the Spanish word biosintesis:
Target candidate
Doc Freq







biosynthesis
biosintesis
biosyntesis
biosinthesis
biosynthessis
biosintessis
...
2 230 000
909
634
255
3
0
...
FITE-TRT Translation
TRT Rule
Production
OOV
Source
Word
TRT
Rule
Base
Frequency
Statistics
Identified
Target
Word
TRT
Translation
Translation
Candidates
1
FITE
Identification
2
Native
Source
Word
Evaluation
Frequency
pattern Ok
Relative
Frequency Ok
Length
Difference Ok
Recall
Precision
FITE-TRT effectiveness




Spanish-English biological and medical spelling
variants (n = 89)
Finnish-English biological and medical spelling
variants (n = 89)
Translation toward English by TRT
English equivalent identification by FITE
FITE-TRT Effectiveness

Translation Recall and Precision for bio-terms

Source
Language
Spanish
Web
Freq List
Finnish
Web
Freq List





Translation
Recall %
91.0
82.0
71.9
67.4
Translat
Prec %
98.8
98.8
97.0
97.3
FITE-TRT effectiveness

UTACLIR TREC Genomics Track 2004 topics
Spanish-English actual OOV words (n = 93+5)
 Finnish-English actual OOV words (n = 48+5)



Translation toward English by TRT
English equivalents identification by FITE
FITE-TRT Effectiveness

Translation Recall and Precision for actual OOV words
Source
Language
Spanish
Finnish
Web
Freq List
Web
Freq List
Translation
Recall %
89.2
87.1
72.9
79.2
Translation
Precision %
97.6
97.6
97.2
95.0
Genomics CLIR experiment

German - English CLIR in genomics

TREC Genomics Track data, 50 topics (20
training + 30 testing)
MedLine 4.6 M medical abstracts
Baseline: raw German as such + Dict transl
Test queries - FITE-TRT



Genomics CLIR experiment
QuickTime™ and a
decompressor
are needed to see this picture.
Agenda







1. Introduction
2. Reductive Methods
3. Compounds
4. Generative Methods
5. Query Structures
6. OOV Words
7. Conclusion
8. Conclusions

Monolingual retrieval

morphological complexity
who owns the index?
 reductive and generative approaches
 skewed distributions; surprisingly little may be enough


compound handling perhaps not critical in
monolingual retrieval
More Conclusions

Cross-language retrieval
Query structuring simple and effective
 Closely related languages / dialects



Observe what happens between compounding vs
isolating languages


compound splitting seems essential
Inflected form indices


simple language independent techniques maybe enough
skip-grams or FCG after translation may be a solution
Specific domains: Dom dictionaries or aligned corp
Thanks!
Descargar

Users, Interaction and Information Seeking/Retrieval