```Statistical Translation Language
Model
mkarimz2@illinois.edu
University of Illinois at Urbana-Champaign
1
Outline
• Motivation & Background
– Language model (LM) for IR
– Smoothing methods for IR
• Statistical Machine Translation – Cross-Lingual
– Motivation
– IBM Model 1
• Statistical Translation Language Model –
Monolingual
– Synthetic Queries
– Mutual Information-based approach
– Regularization of self-translation probabilities
• Smoothing in Statistical Translation Language
Model
2
The Basic LM Approach
([Ponte & Croft 98], [Hiemstra & Kraaij 98], [Miller et al. 99])
Document
Language Model
…
Text mining
paper
text ?
mining ?
assocation ?
clustering ?
…
food ?
…
Food nutrition
paper
…
food ?
nutrition ?
healthy ?
diet ?
…
Query =
“data mining algorithms”
?
Which model would most
likely have generated
this query?
Ranking Docs by Query Likelihood
Doc LM
Query likelihood
d1
 d1
p(q| d1)
d2
 d2
p(q| d2)
p(q| dN)
dN
dN
q
Retrieval as LM Estimation
• Document ranking based on query
likelihood m
|V |
log p(q | d )   log p(qi | d )   c( wi , q) log p( wi | d )
i 1
where, q  q1q2 ...qm
i 1
Document language model
• Retrieval problem  Estimation of p(wi|d)
• Smoothing is an important issue, and
distinguishes different approaches
How to Estimate p(w|d)?
• Simplest solution: Maximum Likelihood Estimator
– P(w|d) = relative frequency of word w in d
– What if a word doesn’t appear in the text? P(w|d)=0
• In general, what probability should we give a word
that has not been observed?
• If we want to assign non-zero probabilities to such
words, we’ll have to discount the probabilities of
observed words
• This is what “smoothing” is about …
6
Language Model Smoothing
P(w)
Max. Likelihood Estimate
p ML ( w ) 
count of w
count of all words
Smoothed LM
w
Smoothing Methods for IR
(Zhai & Lafferty 01)
• Method 1(Linear interpolation, JelinekMercer):
c( w, d )
p( w | d )  (1   )
  p( w | REF )
|d |
ML estimate
parameter
• Method 2 (Dirichlet Prior/Bayesian):
p (w | d ) 
c ( w;d )   p ( w| REF )
|d | 

|d |
|d | 
c( w, d )
 |d |  p( w | REF )
|d |
parameter
Outline
• Motivation & Background
– Language model (LM) for IR
– Smoothing methods for IR
• Statistical Machine Translation – Cross-Lingual
– Motivation
– IBM Model 1
• Statistical Translation Language Model –
Monolingual
– Synthetic Queries
– Mutual Information-based approach
– Regularization of self-translation probabilities
• Smoothing in Statistical Translation Language
Model
9
A Brief History
• Machine translation was one of the first
applications envisioned for computers
• Warren Weaver (1949): “I have a text in front of me
which is written in Russian but I am going to pretend that it is
really written in English and that it has been coded in some
strange symbols. All I need to do is strip off the code in order to
retrieve the information contained in the text.”
• First demonstrated by IBM in 1954 with a
basic word-for-word translation system
10
Interest in Machine Translation
• Commercial interest:
– U.S. has invested in MT for intelligence
purposes
– MT is popular on the web—it is the most used
– EU spends more than \$1 billion on translation
costs each year.
– (Semi-)automated translation could lead to
huge savings
11
Interest in Machine Translation
– One of the most challenging problems in NLP
research
– Requires knowledge from many NLP sub-areas, e.g.,
lexical semantics, parsing, morphological analysis,
statistical modeling,…
– Being able to establish links between two languages
allows for transferring resources from one language
to another
12
Word-Level Alignments
• Given a parallel sentence pair we can link
(align) words or phrases that are
translations of each other:
13
Machine Translation -- Concepts
• We are trying to model P(e|f)
– I give you a French sentence
– You give me back English
• How are we going to model this?
– The maximum likelihood estimation of P(e | f)
is: freq(e,f)/freq(f).
– Way too specific to get any reasonable
frequencies! Vast majority of unseen data will
have zero counts!
Machine Translation – Alternative
way
• We could use Bayes rule
P (e | f ) 
P ( f | e) P (e)
 P ( f | e) P (e)
P( f )
Given a French sentence f, we could do a search for
an e that maximizes p(e|f).
e  arg max P ( f | e ) P ( e )
e
• Why using Bayes rule and not directly estimating p(e|f) ?
It is important that our model for p(e|f) concentrates its probability as much as
possible on well-formed English sentences. But it is not important that our
model for P(f|e) concentrate its probability on well-formed French sentences.
Statistical Machine Translation
• The noisy channel model
Language Model
P e 
e
f
Translation Model
P f e
Decoder
eˆ  arg max
e
P e f

eˆ
e: English f: French
eak
fk
ea j
|e|=l
f
j
|f|=m
– Assumptions:
• An English word can be aligned with multiple French words
while each French word is aligned with at most one English
word
• Independence of the individual word-to-word translations
16
Estimation of Probabilities -- IBM
Model 1
• Simplest of the IBM models. (There are 5
models)
• Does not consider word order (bag-ofwords approach)
• Does not model one-to-many alignments
• Computationally inexpensive
• Useful for parameter estimations that are
passed on to more elaborate models
17
IBM Model 1
• Three important components involved
– Language model
• Give the probability p(e).
– Translation model
• Estimate the Translation Probability p(f|e).
– Decoder
eˆ  arg max P e f   arg max
e
e
p e  p  f e 
p f

 arg max p e  p  f e 
e
18
IBM Model 1- Translation Model
• Joint probability of P(F=f, E=e, A=a) where A is an
alignment between two sentences.
=
(, |)

•
Assume each French word has exactly one connection.
19
IBM Model 1- Translation Model
•
•
Assume, |e|=l and |f|=m, then the alignment can be
represented by a series a = a1a2…am
Each alignment is between 0 and l such that if the word
in position j of the French sentence is connected to the
word in position i of the English sentence, then aj=i and
if it not connected to any English word, then aj=0.
1
(+1)

=1 ( | )
•
,   =
•
The alignment is determined by specifying the values of
aj for j from 1 to m, each of which can take any value
from 0 to l.
20
IBM Model 1 – Translation Model
=
(, |)

1
,   =
( + 1)
–  =
1
(+1)

1 =0 …

( | )
=1

=0

=1 (
all possible alignments
(the English word that a French
word fj is aligned with)
.
translation
probability
EM algorithm is used to
estimate the translation
probabilities.
21
Outline
• Motivation & Background
– Language model (LM) for IR
– Smoothing methods for IR
• Statistical Machine Translation – Cross-Lingual
– Motivation
– IBM Model 1
• Statistical Translation Language Model –
Monolingual
– Synthetic Queries
– Mutual Information-based approach
– Regularization of self-translation probabilities
• Smoothing in Statistical Translation Language
Model
22
The Problem of Vocabulary Gap
Query = auto wash
P(“auto”) P(“wash”)
d1
d2
d3
auto
wash
…
auto
…
auto
car
wash
vehicle
How to support inexact matching?
{“car” , “vehicle”} == “auto”
P(“auto”) P(“wash”)
23
Translation Language Models for IR
[Berger & Lafferty 99]
Query = auto wash
“translate”
d1
d2
auto
wash
…
Query = car wash
“car”
p(w
| d ) |d3)=

p ml (ux p| (“auto”|
d ) p t “car”)
(w | u )
P(“auto”
p(“car”|d3)
auto
auto
t
u
+ p(“vehicle”|d3)
x pt(“auto”| “vehicle”)
P(“car”|d3)
d3
“auto”
“auto”
car
wash
vehicle
“car”
How to estimate?
Pt(“auto”| “car”)
P(“auto”) P(“wash”)
“auto”
“vehicle”
P(“vehicle”|d3)
P (“auto”| “vehicle”)
t
24
Estimation of Translation Model: pt(w|u)
• When relevance judgments are available, (q,d)
serves as data to train the translation model
• Without relevance judgments, we can use
synthetic data [Berger & Lafferty 99], <title,
body>[Jin et al. 02]
pt ( w | d ) 
Basic translation
model

p t ( w | u ) p (u | d )
w d
Translation model
Regular doc LM
Estimation of Translation Model –
Synthetic Queries
([Berger & Lafferty 99])
• Select words that are representative of a
document.
• Calculate a Mutual information for each word in
(,)
a document: I(w,d) = p(w,d)log
(|)
• Synthetic queries are sampled based on
normalized mutual information.
• The resulting (d,q) of documents and synthetic
queries are used to estimate the probabilities
using EM algorithm (IBM Model 1).
Estimation of Translation Model –
Synthetic Queries Algorithm
([Berger & Lafferty 99])
Limitations:
1.Can’t translate into words not seen
in the training queries
2.Computational complexity
Training
data
A simpler and more efficient method for
estimating pt(w|u) with higher coverage
was proposed in:
M. Karimzadehgan and C. Zhai. Estimation of Statistical Translation
Models Based on Mutual Information for Ad Hoc Information Retrieval.
ACM SIGIR, pages 323-330, 2010
28
Estimation of Translation Model Based
on Mutual Information
1. Calculate Mutual information for each pair of two
words in the collection (measuring co-occurrences)
I(X w; X u ) 


p ( X w , X u ) log
X w { 0 ,1} X u { 0 ,1}
p( X w , X u )
p( X w ) p( X u )
presence/absence of word w in a document
2. Normalize mutual information score to obtain a
translation probability:
p mi ( w | u ) 
I(X w; X u )
 I(X
w'
;Xu)
w'
29
Computation Detail
I(X w; X u ) 
 
p ( X w , X u ) log
X w  0 ,1 X u  0 , 1
p ( X w  1) 
Xw
D1
D2
D3
….
…
DN
Xu
0
1
1
0
1
0
0
0
p( X w , X u )
p( X w ) p( X u )
c ( X w  1)
N
p ( X u  1) 
c ( X u  1)
N
N
Xw=1
p ( X w  1, X u  1) 
Xu=1
c ( X w  1, X u  1)
N
Exploit index to speed up computation
30
Sample Translation Probabilities (AP90)
p(w| “everest”)
Mutual Information
Synthetic Query
q
p(q|w)
q
p(q|w)
everest
0.1051
everest
0.079
climber
0.0423
climber
0.042
mount
0.0339
climb
0.0365
028
0.0308
mountain
0.0359
expedit
0.0303
mount
0.033
peak
0.0155
reach
0.0312
himalaya
0.01532
expedit
0.0314
nepal
0.015
summit
0.0253
sherpa
0.01431
whittak
0.016
hillari
0.01431
peak
0.0149
31
Regularizing Self-Translation Probability
• Self-translation probability can be underestimated p t ( w | u )  p t ( w | w )
An exact match would be counted less than an
exact match
• Solution: Interpolation with “1.0 self-translation”
  (1   ) p ( u | u ) w = u
pt ( w | u )  
wu
 (1   ) p ( w | u )
= 1  basic query likelihood model
= 0  original MI estimate
32
Query Likelihood and Translation
Language Model
• Document ranking based on query
likelihood m
|V |
log p(q | d )   log p(qi | d )   c( wi , q) log p( wi | d )
i 1
i 1
where, q  q1q2 ...qm
Document language model
• Translation Language Model
pt ( w | d ) 

p t ( w | u ) p (u | d )
w d
Do you see any problem?
Further Smoothing of Translation
Model for Computing Query Likelihood
• Linear interpolation (Jelinek-Mercer):
p t ( w | d )  (1   )[  p t ( w | u ) p ml ( u | d )]   p ( w | C )
ud
pml(w|d)
• Bayesian interpolation (Dirichlet prior):
pt ( w | d ) 
|d |
| d | 
[  p t ( w | u ) p ml ( u | d )] 
ud

| d | 
p(w | C )
pml(w|d)
34
Experiment Design
• MI vs. Synthetic query estimation
– Data Sets: Associated Press (AP90) and San Jose
Mercury News (SJMN) + TREC topics 51-100
– Relatively small data sets in order to compare our results
with Synthetic queries in [Berger& Lafferty 99].
• MI Translation model vs. Basic query likelihood
– Larger Data Sets: TREC7, TREC8 (plus AP90, SJMN)
– TREC topics 351-400 for TREC7 and 401-450 for
TREC8
– Regularization of self-translation?
– Influence of smoothing on translation models?
– Translation model + pseudo feedback?
35
Mutual information outperforms synthetic
queries in both MAP and P@10
AP90 + queries 51-100, Dirichlet Prior Smoothing
MI
Syn. Query
MI
Syn. Query
36
Upper Bound Comparison of Mutual
Information and Synthetic Queries
Dirichlet Prior Smoothing
Data
Precision @10
MAP
Mutual Info
Syn. Query
Mutual Info.
Syn. Query
AP90
0.272*
0.251
0.423
0.404
SJMN
0.2*
0.195
0.28
0.266
JM Smoothing
Data
Precision @10
MAP
Mutual Info
Syn. Query
Mutual Info.
Syn. Query
AP90
0.264*
0.25
0.381
0.357
SJMN
0.197*
0.189
0.252
0.267
37
Mutual information translation model
outperforms basic query likelihood
Data
Dir. Prior
Smoothing
MAP
Basic QL
MI Trans.
Basic QL
MI Trans.
AP90
0.248
0.272*
0.398
0.423
SJMN
0.195
0.2*
0.266
0.28
TREC7
0.183
0.187*
0.412
0.404
TREC8
0.248
0.249
0.452
0.456
Data
JM
Smoothing
Precision @10
MAP
Precision @10
Basic QL
MI Trans.
Basic QL
MI Trans.
AP90
0.246
0.264*
0.357
0.381
SJMN
0.188
0.197*
0.252
0.267
TREC7
0.165
0.172
0.354
0.362
TREC8
0.236
0.244*
0.428
0.436
38
Translation model appears to need less
collection smoothing than basic QL
Translation model
Basic query likelihood
39
Translation model and pseudo feedback
exploit word co-occurrences differently
JM Smoothing
Data
MAP
Precision @10
BL
PFB
PFB+TM
BL
PFB
PFB+TM
AP90
0.246
0.271
0.298
0.357
0.383
0.411
SJMN
0.188
0.229
0.234
0.252
0.316
0.313
TREC7
0.165
0.209
0.222
0.354
0.38
0.384
TREC8
0.236
0.240
0.281
0.428
0.4
0.452

p ( w |  q ) log p t ( w | d )
p ( w | q )
Query model
from pseudo FB
Smoothed
Translation
Model 40
Regularization of self-translation is
beneficial
AP Data Set, Dirichlet Prior
41
Summary
• Statistical Translation language model are effective for
bridging the vocabulary gap.
• Mutual information is more effective and more efficient
than synthetic queries for estimating translation model
probabilities.
• Regularization of self-translation is beneficial
• Translation model outperforms basic query likelihood on
small and large collections and is more robust
• Translation model and pseudo feedback exploit word cooccurrences differently and can be combined to further
improve performance
42
References
• [1] A. Berger and J. Lafferty. Information Retrieval as Statistical
Translation. ACM SIGIR, pages 222–229, 1999.
• [2] P. Brown, S. A. D. Pietra, V. J. D. Pietra, and R. Mercer. The
mathematics of statistical machine translation: Parameter
estimation. Computational Linguistics, 19(2):263–311, 1993.
• [3] M. Karimzadehgan and C. Zhai. Estimation of Statistical
Translation Models Based on Mutual Information for Ad Hoc
Information Retrieval. ACM SIGIR, pages 323-330, 2010.
43
```