Cross-Language Information Retrieval
Applied Natural Language Processing
October 29, 2009
Douglas W. Oard
What Do People Search For?
• Searchers often don’t clearly understand
– The problem they are trying to solve
– What information is needed to solve the problem
– How to ask for that information
• The query results from a clarification process
Taylor's Model of Question Formation
• Dervin's "sense making": an information need is a gap, and the query is the bridge across it
• Four stages of question formation:
– Q1: Visceral need
– Q2: Conscious need
– Q3: Formalized need
– Q4: Compromised need (the query)
• Intermediated search can negotiate back toward the earlier stages; end-user search starts from the compromised query
Design Strategies
• Foster human-machine synergy
– Exploit complementary strengths
– Accommodate shared weaknesses
• Divide-and-conquer
– Divide task into stages with well-defined interfaces
– Continue dividing until problems are easily solved
• Co-design related components
– Iterative process of joint optimization
Human-Machine Synergy
• Machines are good at:
– Doing simple things accurately and quickly
– Scaling to larger collections in sublinear time
• People are better at:
– Accurately recognizing what they are looking for
– Evaluating intangibles such as “quality”
• Both are pretty bad at:
– Mapping consistently between words and concepts
Process/System Co-Design
Supporting the Search Process
[Figure: the interactive search process — source selection, query formulation (predict, nominate, choose), query, search by the IR system, ranked list, selection, document examination, and document delivery, with loops for query reformulation / relevance feedback and source reselection]
Supporting the Search Process
[Figure: the same search process from the system side — acquisition builds the document collection, indexing builds the index, and the IR system matches the query against the index to produce the ranked list]
Search Component Model
[Figure: query processing maps an information need through query formulation and a representation function into a query representation; document processing maps each document through a representation function into a document representation; a comparison function over the two representations yields a retrieval status value, a stand-in for the human judgment of utility]
Relevance
• Relevance relates a topic and a document
– Duplicates are equally relevant, by definition
– Constant over time and across users
• Pertinence relates a task and a document
– Accounts for quality, complexity, language, …
• Utility relates a user and a document
– Accounts for prior knowledge
“Okapi” Term Weights
$$ w_{i,j} \;=\; \underbrace{\frac{TF_{i,j}}{0.5 + 1.5\,\frac{L_i}{\bar{L}} + TF_{i,j}}}_{\text{TF component}} \;\cdot\; \underbrace{\log\!\left(\frac{N - DF_j + 0.5}{DF_j + 0.5}\right)}_{\text{IDF component}} $$
[Plots: left — the Okapi TF component vs. raw TF (0–25) for L/L̄ = 0.5, 1.0, 2.0, saturating below 1.0, contrasted with classic (linear) TF; right — the IDF component vs. raw DF (0–25), falling from about 6.0 to 4.4]
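To make the TF saturation concrete, here is a minimal Python sketch (the function name and sample values are illustrative, not from the slides):

    import math

    def okapi_weight(tf, df, n_docs, doc_len, avg_doc_len):
        """Okapi term weight: saturating TF component times IDF component."""
        tf_component = tf / (0.5 + 1.5 * (doc_len / avg_doc_len) + tf)
        idf_component = math.log((n_docs - df + 0.5) / (df + 0.5))
        return tf_component * idf_component

    # Doubling an already-large TF barely moves the weight (saturation).
    for tf in (1, 2, 5, 10, 20):
        print(tf, round(okapi_weight(tf, df=50, n_docs=100_000,
                                     doc_len=300, avg_doc_len=300), 3))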
A Ranking Function: Okapi BM25
$$ \sum_{e \in Q} \left[\, \log\frac{N - df(e) + 0.5}{df(e) + 0.5} \,\right] \cdot \left[\, \frac{2.2 \cdot tf(e, d_k)}{0.3 + 0.9\,\frac{dl(d_k)}{avdl} + tf(e, d_k)} \,\right] \cdot \frac{8 \cdot qtf(e)}{7 + qtf(e)} $$

where $e$ is a query term in query $Q$, $d_k$ is the document, $tf(e, d_k)$ is the term frequency in the document, $qtf(e)$ is the term frequency in the query, $df(e)$ is the document frequency, $N$ is the collection size, $dl(d_k)$ is the document length, and $avdl$ is the average document length.
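A minimal Python sketch of this ranking function (names are mine; the constants 2.2, 0.3, 0.9, 8, and 7 follow the formula above, i.e. BM25 with k1 = 1.2, b = 0.75, k3 = 7):

    import math

    def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, df, n_docs):
        """Score one document against a query with the BM25 formula above.

        query_terms: list of query terms (repeats give qtf > 1)
        doc_tf:      dict term -> frequency in this document
        df:          dict term -> document frequency in the collection
        """
        score = 0.0
        for e in set(query_terms):
            qtf = query_terms.count(e)
            tf = doc_tf.get(e, 0)
            idf = math.log((n_docs - df.get(e, 0) + 0.5) / (df.get(e, 0) + 0.5))
            tf_part = (2.2 * tf) / (0.3 + 0.9 * (doc_len / avg_doc_len) + tf)
            qtf_part = (8 * qtf) / (7 + qtf)
            score += idf * tf_part * qtf_part
        return score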
Estimating TF and DF for Query Terms
$$ tf(e_i, d_k) = \sum_{f_j} p(e_i \rightarrow f_j)\, tf(f_j, d_k) \qquad\qquad df(e_i) = \sum_{f_j} p(e_i \rightarrow f_j)\, df(f_j) $$

Example: English term e1 with four translations f1–f4:

  f_j            f1    f2    f3    f4
  p(e1 → f_j)    0.4   0.3   0.2   0.1
  tf(f_j, d_k)   20    5     2     50
  df(f_j)        50    40    30    200

tf(e1, d_k) = 0.4*20 + 0.3*5 + 0.2*2 + 0.1*50 = 14.9
df(e1) = 0.4*50 + 0.3*40 + 0.2*30 + 0.1*200 = 58
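The same worked example as a short Python sketch (values from the table above; variable names are mine):

    # Translation probabilities p(e1 -> f_j) and per-translation statistics.
    p  = {"f1": 0.4, "f2": 0.3, "f3": 0.2, "f4": 0.1}
    tf = {"f1": 20,  "f2": 5,   "f3": 2,   "f4": 50}   # tf(f_j, d_k)
    df = {"f1": 50,  "f2": 40,  "f3": 30,  "f4": 200}  # df(f_j)

    # Expected tf and df of the English term under the translation distribution.
    tf_e = sum(p[f] * tf[f] for f in p)   # 14.9
    df_e = sum(p[f] * df[f] for f in p)   # 58.0
    print(tf_e, df_e)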
Learning to Translate
• Lexicons
– Phrase books, bilingual dictionaries, …
• Large text collections
– Translations (“parallel”)
– Similar topics (“comparable”)
• Similarity
– Similar pronunciation, similar users
• People
[Image: the Rosetta Stone — the same text in Hieroglyphic, Demotic, and Greek]
Statistical Machine Translation
– Spanish: Señora Presidenta , había pedido a la administración del Parlamento que garantizase
– English: Madam President , I had asked the administration to ensure that
Bidirectional Translation
wonders of ancient world (CLEF Topic 151)
Unidirectional:
  se//0.31  demande//0.24  demander//0.08  peut//0.07  merveilles//0.04
  question//0.02  savoir//0.02  on//0.02  bien//0.01  merveille//0.01
  pourrait//0.01  si//0.01  sur//0.01  me//0.01  t//0.01  emerveille//0.01
  ambition//0.01  merveilleusement//0.01  veritablement//0.01  cinq//0.01  hier//0.01

Bidirectional:
  merveilles//0.92  merveille//0.03  emerveille//0.03  merveilleusement//0.02
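One way to realize bidirectional translation is to rescore each forward candidate by the probability of translating back to the source term, then renormalize. This is a sketch under my own assumptions; the slides do not spell out the exact combination rule used:

    def bidirectional(forward, backward, source_term):
        """Combine p(f | e) with p(e | f) and renormalize over candidates f.

        forward:  dict f -> p(f | source_term)
        backward: dict f -> dict e -> p(e | f)
        """
        scores = {f: p_f * backward.get(f, {}).get(source_term, 0.0)
                  for f, p_f in forward.items()}
        total = sum(scores.values())
        if total == 0.0:
            return {}
        return {f: s / total for f, s in scores.items() if s > 0.0}

Candidates that rarely translate back to the source term (like "se" or "demande" above) are driven toward zero, concentrating the distribution on the content-bearing translation.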
Experiment Setup
• Test collections

  Source               CLEF'01-03    TREC-5,6
  Query language       English       English
  Document language    French        Chinese
  # of topics          151           54
  # of documents       87,191        139,801
  Avg # of rel docs    23            95

• Document processing
– Stemming, accent removal (CLEF French)
– Word segmentation, encoding conversion (TREC Chinese)
– Stopword removal (all collections)
• Training statistical translation models (GIZA++)

  Parallel corpus       Europarl                 FBIS et al.
  Languages             English-French           English-Chinese
  # of sentence pairs   672,247                  1,583,807
  Models (iterations)   M1(10), HMM(5), M4(5)    M1(10)
Pruning Translations
Translation probabilities:
  f1 (0.32), f2 (0.21), f3 (0.11), f4 (0.09), f5 (0.08), f6 (0.05),
  f7 (0.04), f8 (0.03), f9 (0.03), f10 (0.02), f11 (0.01), f12 (0.01)

Translations kept at each cumulative probability threshold:
  0.0: f1
  0.1: f1
  0.2: f1
  0.3: f1
  0.4: f1 f2
  0.5: f1 f2
  0.6: f1 f2 f3
  0.7: f1 f2 f3 f4
  0.8: f1 f2 f3 f4 f5
  0.9: f1 f2 f3 f4 f5 f6 f7
  1.0: f1 f2 f3 f4 f5 f6 f7 f8 f9 f10 f11 f12
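A minimal Python sketch of this pruning rule (the function name is mine): sort translations by probability and keep the smallest prefix whose cumulative probability reaches the threshold.

    def prune_translations(probs, threshold):
        """Keep the most probable translations until their cumulative
        probability first reaches `threshold` (always keeps at least one)."""
        kept, cumulative = [], 0.0
        for term, p in sorted(probs.items(), key=lambda kv: -kv[1]):
            kept.append(term)
            cumulative += p
            if cumulative >= threshold:
                break
        return kept

    probs = {"f1": 0.32, "f2": 0.21, "f3": 0.11, "f4": 0.09, "f5": 0.08,
             "f6": 0.05, "f7": 0.04, "f8": 0.03, "f9": 0.03, "f10": 0.02,
             "f11": 0.01, "f12": 0.01}
    print(prune_translations(probs, 0.9))  # ['f1', 'f2', ..., 'f7']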
Unidirectional without Synonyms (PSQ)
[Plots: MAP as a percentage of monolingual MAP (40%–110%) vs. cumulative probability threshold (0–1), for CLEF French and TREC-5,6 Chinese]

Statistical significance vs. monolingual (Wilcoxon signed rank test):
• CLEF French: worse at peak
• TREC-5,6 Chinese: worse at peak
Bidirectional with Synonyms (DAMM)
[Plots: MAP as a percentage of monolingual MAP (40%–110%) vs. cumulative probability threshold (0–1) for DAMM, IMM, and PSQ, on CLEF French and TREC-5,6 Chinese]

• DAMM significantly outperformed PSQ
• DAMM is statistically indistinguishable from monolingual at peak
• IMM: nearly as good as DAMM for French, but not for Chinese
Indexing Time
[Plot: indexing time in seconds (0–500) vs. collection size in thousands of documents (0–45), for monolingual and cross-language indexing]

Dictionary-based vector translation, single Sun SPARC in 2001
The Problem Space
• Retrospective search
– Web search
– Specialized services (medicine, law, patents)
– Help desks
• Real-time filtering
– Email spam
– Web parental control
– News personalization
• Real-time interaction
– Instant messaging
– Chat rooms
– Teleconferences
Key Capabilities
• Map across languages
– For human understanding
– For automated processing
Making a Market
• Multitude of potential applications
– Retrospective search, email, IM, chat, …
– Natural consequence of language diversity
• Limiting factor is translation readability
– Searchability is mostly a solved problem
• Leveraging human translation has potential
– Translation routing, volunteers, caching