Language Models
Hongning Wang
CS@UVa
Notion of Relevance

Relevance
• (Rep(q), Rep(d)) similarity
  – Different rep & similarity
    • Vector space model (Salton et al., 75)
    • Prob. distr. model (Wong & Yao, 89)
    • Regression model (Fox 83)
    • …
• P(r=1|q,d), r ∈ {0,1}: probability of relevance
  – P(d→q) or P(q→d): probabilistic inference
    • Generative model
      – Doc generation: classical prob. model (Robertson & Sparck Jones, 76)
      – Query generation: LM approach (Ponte & Croft, 98; Lafferty & Zhai, 01a)
    • Different inference system
      – Prob. concept space model (Wong & Yao, 95)
      – Inference network model (Turtle & Croft, 91)

CS 6501: Information Retrieval
What is a statistical LM?
• A model specifying probability distribution
over word sequences
– p(“Today is Wednesday”) ≈ 0.001
– p(“Today Wednesday is”) ≈ 0.0000000000001
– p(“The eigenvalue is positive”) ≈ 0.00001
• It can be regarded as a probabilistic
mechanism for “generating” text, thus also
called a “generative” model
Why is a LM useful?
• Provides a principled way to quantify the
uncertainties associated with natural language
• Allows us to answer questions like:
– Given that we see “John” and “feels”, how likely will we see
“happy” as opposed to “habit” as the next word?
(speech recognition)
– Given that we observe “baseball” three times and “game” once
in a news article, how likely is it about “sports”?
(text categorization, information retrieval)
– Given that a user is interested in sports news, how likely would
the user use “baseball” in a query?
(information retrieval)
Source-Channel framework [Shannon 48]

Source → Transmitter (encoder) → Noisy Channel → Receiver (decoder) → Destination
  X          P(X)                   P(Y|X)            P(X|Y)=?              X’

  X̂ = argmax_X p(X|Y) = argmax_X p(Y|X)·p(X)   (Bayes Rule)

When X is text, p(X) is a language model.

Many examples:
• Speech recognition: X = word sequence, Y = speech signal
• Machine translation: X = English sentence, Y = Chinese sentence
• OCR error correction: X = correct word, Y = erroneous word
• Information retrieval: X = document, Y = query
• Summarization: X = summary, Y = document
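As a minimal sketch of the framework, the decoder's argmax can be written directly. The candidate strings and all probability values below are made-up toy numbers, not from any real recognizer:

```python
# Noisy-channel decoding sketch: pick the X maximizing p(Y|X) * p(X).
# All candidates and probabilities are hypothetical toy values.

def decode(candidates, prior, channel, y):
    """Return argmax_X p(Y=y|X) * p(X)."""
    return max(candidates, key=lambda x: channel(y, x) * prior(x))

# Toy speech-recognition case: two transcriptions the channel cannot
# distinguish, so the language model p(X) decides.
p_x = {"Today is Wednesday": 0.001, "Today Wednesday is": 1e-13}
p_y_given_x = {("y1", "Today is Wednesday"): 0.5,
               ("y1", "Today Wednesday is"): 0.5}

best = decode(p_x.keys(),
              prior=lambda x: p_x[x],
              channel=lambda y, x: p_y_given_x[(y, x)],
              y="y1")
print(best)  # the language model breaks the tie
```

Here both transcriptions explain the signal equally well, so the more fluent word order wins purely on p(X).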
Unigram language model
• Generate a piece of text by generating each
word independently
– p(w1 w2 … wn) = p(w1)·p(w2) ⋯ p(wn)
– s.t. Σ_{i=1}^{N} p(wi) = 1, p(wi) ≥ 0
• Essentially a multinomial distribution over the vocabulary
  (the simplest and most popular choice!)

[Figure: a unigram language model, a bar chart of p(w|θ) over vocabulary words w1 … w25]
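The independence assumption makes scoring a text trivial: multiply the word probabilities (in practice, sum their logs). A minimal sketch, with a made-up toy distribution rather than the figure's values:

```python
# Sketch: a unigram LM scores a text as the product of word probabilities.
# The distribution below is a hypothetical toy model.
import math

theta = {"today": 0.1, "is": 0.3, "wednesday": 0.05, "the": 0.4, "positive": 0.15}

def log_prob(words, theta):
    # p(w1 ... wn) = p(w1) * ... * p(wn); sum logs to avoid underflow
    return sum(math.log(theta[w]) for w in words)

assert abs(sum(theta.values()) - 1.0) < 1e-9  # multinomial: sums to 1
lp = log_prob(["today", "is", "wednesday"], theta)
print(math.exp(lp))  # 0.1 * 0.3 * 0.05 = 0.0015
```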
More sophisticated LMs
• N-gram language models
– In general, p(w1 w2 … wn) = p(w1)·p(w2|w1) ⋯ p(wn|w1 … wn−1)
– n-gram: conditioned only on the past n−1 words
– E.g., bigram: p(w1 … wn) = p(w1)·p(w2|w1)·p(w3|w2) ⋯ p(wn|wn−1)
• Remote-dependence language models (e.g., Maximum Entropy model)
• Structured language models (e.g., probabilistic context-free grammar)
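A bigram model can be sketched in a few lines: estimate each conditional probability by MLE from adjacent word pairs. The corpus below is a hypothetical toy example:

```python
# Bigram MLE sketch: p(w_i | w_{i-1}) = c(w_{i-1} w_i) / c(w_{i-1}).
# Toy corpus, purely illustrative.
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))  # adjacent word pairs

def p_bigram(prev, w):
    return bigrams[(prev, w)] / unigrams[prev]

print(p_bigram("the", "cat"))  # c("the cat") / c("the") = 2/3
```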
Why just unigram models?
• Difficulty in moving toward more complex models
– They involve more parameters, so need more data to
estimate
– They increase the computational complexity
significantly, both in time and space
• Capturing word order or structure may not add
so much value for “topical inference”
• But, using more sophisticated models can still be
expected to improve performance ...
Generative view of text documents

(Unigram) Language Model θ: p(w|θ)

Topic 1: Text mining
  text 0.2, mining 0.1, association 0.01, clustering 0.02, …, food 0.00001, …

Topic 2: Health
  text 0.01, food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, …

Sampling → Document
  Topic 1 generates a “text mining” document; Topic 2 generates a “food nutrition” document.
Estimation of language models

• Maximum likelihood estimation

Document: a “text mining” paper (total #words = 100)
  text 10, mining 5, association 3, database 3, algorithm 2, …, query 1, efficient 1, …

Estimation → Unigram Language Model θ: p(w|θ) = ?
  text 10/100, mining 5/100, association 3/100, database 3/100, …, query 1/100, …
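Using the counts from this slide, the MLE is just relative frequency, which can be sketched as:

```python
# MLE for the slide's "text mining" paper: p(w|theta) = c(w,d) / |d|.
counts = {"text": 10, "mining": 5, "association": 3, "database": 3,
          "algorithm": 2, "query": 1, "efficient": 1}  # remaining words elided
doc_length = 100  # total #words in the document

def p_mle(w):
    return counts.get(w, 0) / doc_length

print(p_mle("text"))  # 10/100 = 0.1
print(p_mle("food"))  # 0/100: unseen words get zero probability
```

The zero for unseen words is exactly the problem that motivates smoothing later in the lecture.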
Language models for IR [Ponte & Croft SIGIR’98]

Document → Language Model
  Text mining paper → text ?, mining ?, association ?, clustering ?, …, food ?, …
  Food nutrition paper → …, food ?, nutrition ?, healthy ?, diet ?, …

Query: “data mining algorithms”
Which model would most likely have generated this query?
Ranking docs by query likelihood

Doc      Doc LM     Query likelihood
d1   →   θd1    →   p(q|θd1)
d2   →   θd2    →   p(q|θd2)
……                  ……
dN   →   θdN    →   p(q|θdN)

Justification (PRP): O(R=1|Q,D) ≈ P(Q|D,R=1)
Justification from PRP

Query generation:

  O(R=1|Q,D) = P(Q,D|R=1) / P(Q,D|R=0)
             = [P(Q|D,R=1)·P(D|R=1)] / [P(Q|D,R=0)·P(D|R=0)]
             ≈ P(Q|D,R=1) · [P(D|R=1)/P(D|R=0)]    (assume P(Q|D,R=0) ≈ P(Q|R=0))

P(Q|D,R=1) is the query likelihood p(q|θd); P(D|R=1)/P(D|R=0) is the document prior.

Assuming a uniform document prior, we have O(R=1|Q,D) ≈ P(Q|D,R=1).
Retrieval as language model estimation

• Document ranking based on query likelihood

  log p(q|d) = Σ_i log p(wi|θd),  where q = w1 w2 … wn

  θd: document language model

– Retrieval problem → estimation of p(wi|θd)
– Common approach: maximum likelihood estimation (MLE)
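The ranking rule above can be sketched directly with MLE document models. The toy documents are hypothetical; note how any unseen query word drives log p(q|d) to −∞, which previews the smoothing discussion that follows:

```python
# Sketch: rank docs by log p(q|d) = sum_i log p(w_i | theta_d), MLE estimates.
# Toy documents; real systems smooth first, since an unseen query word
# makes log p(q|d) = -infinity.
import math
from collections import Counter

docs = {"d1": "text mining algorithms for text data".split(),
        "d2": "food nutrition and healthy diet".split()}

def log_query_likelihood(query, doc):
    c = Counter(doc)
    return sum(math.log(c[w] / len(doc)) if c[w] > 0 else float("-inf")
               for w in query)

query = "text mining".split()
ranked = sorted(docs, key=lambda d: log_query_likelihood(query, docs[d]),
                reverse=True)
print(ranked)  # d1 first: it can generate the query, d2 cannot
```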
Problem with MLE
• What probability should we give a word that
has not been observed in the document?
– log 0?
• If we want to assign non-zero probabilities to
such words, we’ll have to discount the
probabilities of observed words
• This is so-called “smoothing”
General idea of smoothing
• All smoothing methods try to
1. Discount the probability of words seen in a
document
2. Re-allocate the extra counts such that unseen
words will have a non-zero count
Illustration of language model smoothing

[Figure: P(w|d) vs. words w. The max. likelihood estimate p_ML(w) = (count of w) / (count of all words) is plotted against the smoothed LM: probability mass is discounted from the seen words and nonzero probabilities are assigned to the unseen words.]
Smoothing methods

• Method 1: Additive smoothing
  – Add a constant δ to the counts of each word (add-one: δ = 1)

  p(w|d) = (c(w,d) + 1) / (|d| + |V|)

  c(w,d): counts of w in d; |d|: length of d (total counts); |V|: vocabulary size

  – Problems?
    • Hint: are all words equally important?
Refine the idea of smoothing

• Should all unseen words get equal probabilities?
• We can use a reference model to discriminate unseen words

  p(w|d) = p_seen(w|d)       if w is seen in d    (discounted ML estimate)
         = α_d · p(w|REF)    otherwise            (reference language model)

  α_d = (1 − Σ_{w seen} p_seen(w|d)) / Σ_{w unseen} p(w|REF)
Smoothing methods (cont)

• Method 2: Absolute discounting
  – Subtract a constant δ from the counts of each word

  p(w|d) = (max(c(w,d) − δ, 0) + δ·|d|_u·p(w|REF)) / |d|

  |d|_u: # unique words in d

  – Problems?
    • Hint: varied document length?
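A sketch of absolute discounting with a toy document and a made-up reference model:

```python
# Absolute discounting sketch:
# p(w|d) = (max(c(w,d) - delta, 0) + delta * |d|_u * p(w|REF)) / |d|.
# Toy document and reference model, purely illustrative.
from collections import Counter

doc = "text mining text algorithms".split()
p_ref = {"text": 0.3, "mining": 0.2, "algorithms": 0.2, "food": 0.3}
delta = 0.5

c = Counter(doc)
n_uniq = len(c)  # |d|_u: number of unique words in d

def p_abs(w):
    return (max(c[w] - delta, 0.0) + delta * n_uniq * p_ref[w]) / len(doc)

assert abs(sum(p_abs(w) for w in p_ref) - 1.0) < 1e-9  # still a distribution
print(p_abs("food"))  # unseen: mass comes only from the reference model
```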
Smoothing methods (cont)

• Method 3: Linear interpolation, Jelinek-Mercer
  – “Shrink” uniformly toward p(w|REF)

  p(w|d) = (1 − λ)·c(w,d)/|d| + λ·p(w|REF)

  c(w,d)/|d|: MLE; λ: smoothing parameter

  – Problems?
    • Hint: what is missing?
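Jelinek-Mercer is a one-line interpolation; a sketch with the same toy document and reference model assumptions as before:

```python
# Jelinek-Mercer sketch: p(w|d) = (1 - lam) * c(w,d)/|d| + lam * p(w|REF).
# Toy document and reference model, purely illustrative.
from collections import Counter

doc = "text mining text algorithms".split()
p_ref = {"text": 0.3, "mining": 0.2, "algorithms": 0.2, "food": 0.3}
lam = 0.1

c = Counter(doc)
def p_jm(w):
    return (1 - lam) * c[w] / len(doc) + lam * p_ref[w]

print(p_jm("text"))  # 0.9 * 0.5 + 0.1 * 0.3 = 0.48
print(p_jm("food"))  # 0.9 * 0   + 0.1 * 0.3 = 0.03
```

Note that λ is fixed regardless of |d|, which is the slide's hint: document length plays no role in how much we trust the document's own counts.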
Smoothing methods (cont)

• Method 4: Dirichlet prior/Bayesian
  – Assume pseudo counts μ·p(w|REF)

  p(w|d) = (c(w,d) + μ·p(w|REF)) / (|d| + μ)
         = (|d|/(|d| + μ)) · c(w,d)/|d| + (μ/(|d| + μ)) · p(w|REF)

  μ: smoothing parameter

  – Problems?
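A sketch of Dirichlet prior smoothing on the same toy setup; the second comment points out how it differs from Jelinek-Mercer:

```python
# Dirichlet prior smoothing sketch:
# p(w|d) = (c(w,d) + mu * p(w|REF)) / (|d| + mu).
# Toy document and reference model, purely illustrative.
from collections import Counter

doc = "text mining text algorithms".split()
p_ref = {"text": 0.3, "mining": 0.2, "algorithms": 0.2, "food": 0.3}
mu = 6.0

c = Counter(doc)
def p_dir(w):
    return (c[w] + mu * p_ref[w]) / (len(doc) + mu)

print(p_dir("text"))  # (2 + 6*0.3) / (4 + 6) = 0.38
# Longer documents trust their own counts more: the interpolation weight
# |d|/(|d|+mu) grows with |d|, unlike Jelinek-Mercer's fixed lambda.
assert abs(sum(p_dir(w) for w in p_ref) - 1.0) < 1e-9
```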
Dirichlet prior smoothing

• Bayesian estimator
  – Posterior of LM: p(θ|d) ∝ p(d|θ)·p(θ)
    (p(d|θ): likelihood of doc given the model; p(θ): prior over models)
• Conjugate prior
  • Posterior will be in the same form as the prior
  • Prior can be interpreted as “extra”/“pseudo” data
• Dirichlet distribution is a conjugate prior for the multinomial distribution

  Dir(θ|α1, …, αN) = [Γ(α1 + … + αN) / (Γ(α1) ⋯ Γ(αN))] · Π_{i=1}^{N} θi^{αi−1}

For “extra”/“pseudo” word counts, we set αi = μ·p(wi|REF)
Some background knowledge

• Conjugate prior
  – Posterior distribution is in the same family as the prior
    Gaussian → Gaussian
    Beta → Binomial
    Dirichlet → Multinomial
• Dirichlet distribution
  – Continuous
  – Samples from it are the parameters of a multinomial distribution
Dirichlet prior smoothing (cont)

Posterior distribution of parameters:

  p(θ|d) = Dir(θ | c(w1) + α1, …, c(wN) + αN)

Property: if θ ~ Dir(θ|α), then E(θi) = αi / Σ_{j=1}^{N} αj

The predictive distribution is the same as the mean:

  p(wi|θ̂) = ∫ p(wi|θ)·Dir(θ|α′) dθ
           = (c(wi) + αi) / (|d| + Σ_{j=1}^{N} αj)
           = (c(wi) + μ·p(wi|REF)) / (|d| + μ)    ← Dirichlet prior smoothing

where α′ is the posterior parameter vector and αi = μ·p(wi|REF).
Estimating μ using leave-one-out [Zhai & Lafferty 02]

For each word occurrence, predict it from the document with that one occurrence held out: P(w1|d−w1), P(w2|d−w2), …, P(wn|d−wn).

Leave-one-out log-likelihood:

  L₋₁(μ|C) = Σ_{i=1}^{N} Σ_{w∈V} c(w, di) · log[ (c(w, di) − 1 + μ·p(w|C)) / (|di| − 1 + μ) ]

Maximum likelihood estimator:

  μ̂ = argmax_μ L₋₁(μ|C)
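Since L₋₁ is a one-dimensional objective, a simple grid search suffices in a sketch. The two-document corpus below is a hypothetical toy collection:

```python
# Leave-one-out estimation of mu (sketch): grid-search the mu maximizing
# L_{-1}(mu|C) = sum_i sum_w c(w,d_i) *
#                log( (c(w,d_i) - 1 + mu*p(w|C)) / (|d_i| - 1 + mu) ).
# Toy two-document collection, purely illustrative.
import math
from collections import Counter

docs = ["the cat sat on the mat".split(),
        "the dog sat on the log".split()]
all_words = [w for d in docs for w in d]
p_c = {w: n / len(all_words) for w, n in Counter(all_words).items()}  # collection LM

def loo_loglik(mu):
    total = 0.0
    for d in docs:
        c = Counter(d)
        for w, n in c.items():
            total += n * math.log((n - 1 + mu * p_c[w]) / (len(d) - 1 + mu))
    return total

mu_hat = max((0.1 * k for k in range(1, 200)), key=loo_loglik)  # grid over (0, 20)
print(mu_hat)
```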
Why would “leave-one-out” work?

20 words by author1:
  abc abc ab c d d abc cd d d abd ab ab ab ab cd d e cd e

20 words by author2:
  abc abc ab c d d abe cb e f acf fb ef aff abef cdc db ge f s

Now, suppose we leave “e” out…

  p_ml(“e”|author1) = 1/19   → μ doesn’t have to be big
  p_ml(“e”|author2) = 0/19   → μ must be big! (more smoothing)

  p_smooth(“e”|author1) = (1 + μ·p(“e”|REF)) / (20 − 1 + μ)
  p_smooth(“e”|author2) = (0 + μ·p(“e”|REF)) / (20 − 1 + μ)

The amount of smoothing is closely related to the underlying vocabulary size.
Understanding smoothing

Query = “the algorithms for data mining”   (content words: “algorithms”, “data”, “mining”)

                the     algorithms  for    data    mining    p(q|d)
  pML(w|d1):    0.04    0.001       0.02   0.002   0.003     4.8 × 10⁻¹²
  pML(w|d2):    0.02    0.001       0.01   0.003   0.004     2.4 × 10⁻¹²

  p(“algorithms”|d1) = p(“algorithms”|d2)
  p(“data”|d1) < p(“data”|d2)
  p(“mining”|d1) < p(“mining”|d2)

Intuitively, d2 should have a higher score, but p(q|d1) > p(q|d2)…

So we should make p(“the”) and p(“for”) less different for all docs, and smoothing helps to achieve this goal…

After smoothing with p(w|d) = 0.1·pDML(w|d) + 0.9·p(w|REF), p(q|d1) < p(q|d2)!

                      the     algorithms  for     data      mining     p(q|d)
  P(w|REF):           0.2     0.00001     0.2     0.00001   0.00001
  Smoothed p(w|d1):   0.184   0.000109    0.182   0.000209  0.000309   2.35 × 10⁻¹³
  Smoothed p(w|d2):   0.182   0.000109    0.181   0.000309  0.000409   4.53 × 10⁻¹³
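The flip in ranking can be checked mechanically from the numbers on this slide:

```python
# Reproduce the slide's example: with p(w|d) = 0.1*p_ML(w|d) + 0.9*p(w|REF),
# the ranking of d1 and d2 for "the algorithms for data mining" flips.
from math import prod

query = ["the", "algorithms", "for", "data", "mining"]
p_ml = {"d1": {"the": 0.04, "algorithms": 0.001, "for": 0.02,
               "data": 0.002, "mining": 0.003},
        "d2": {"the": 0.02, "algorithms": 0.001, "for": 0.01,
               "data": 0.003, "mining": 0.004}}
p_ref = {"the": 0.2, "algorithms": 0.00001, "for": 0.2,
         "data": 0.00001, "mining": 0.00001}

def p_q(d, smooth):
    return prod(0.1 * p_ml[d][w] + 0.9 * p_ref[w] if smooth else p_ml[d][w]
                for w in query)

assert p_q("d1", False) > p_q("d2", False)  # before: d1 wins (4.8e-12 vs 2.4e-12)
assert p_q("d1", True) < p_q("d2", True)    # after: d2 wins (2.35e-13 vs 4.53e-13)
```

Smoothing flattens the per-document differences on “the” and “for”, so the content words decide the ranking.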
Understanding smoothing (cont)

General smoothing scheme:

  p(w|d) = p_seen(w|d)       if w is seen
         = α_d · p(w|REF)    if w is unseen,
  with α_d = (1 − Σ_{w seen} p_seen(w|d)) / Σ_{w unseen} p(w|REF)

• Plugging the general smoothing scheme into the query likelihood retrieval formula, we obtain

  log p(q|d) = Σ_{wi ∈ d∩q} log[ p_seen(wi|d) / (α_d·p(wi|C)) ] + n·log α_d + Σ_{wi ∈ q} log p(wi|C)

  – First sum: TF weighting, with an IDF-like 1/p(wi|C) factor
  – n·log α_d: doc length normalization (a longer doc is expected to have a smaller α_d)
  – Last sum: independent of d, ignored for ranking

• Smoothing with p(w|C) → TF-IDF + doc-length normalization
Smoothing & TF-IDF weighting

Smoothed ML estimate:

  p(w|d) = p_Seen(w|d)     if w is seen in d
         = α_d · p(w|C)    otherwise    (p(w|C): reference language model)

  α_d = (1 − Σ_{w seen} p_Seen(w|d)) / Σ_{w unseen} p(w|C)

Retrieval formula using the general smoothing scheme:

  log p(q|d) = Σ_{w∈V: c(w,q)>0} c(w,q)·log p(w|d)
             = Σ_{w: c(w,d)>0, c(w,q)>0} c(w,q)·log p_Seen(w|d) + Σ_{w: c(w,q)>0, c(w,d)=0} c(w,q)·log[α_d·p(w|C)]
             = Σ_{w: c(w,d)>0, c(w,q)>0} c(w,q)·log p_Seen(w|d) + Σ_{w: c(w,q)>0} c(w,q)·log[α_d·p(w|C)] − Σ_{w: c(w,q)>0, c(w,d)>0} c(w,q)·log[α_d·p(w|C)]
             = Σ_{w: c(w,d)>0, c(w,q)>0} c(w,q)·log[ p_Seen(w|d) / (α_d·p(w|C)) ] + |q|·log α_d + Σ_{w: c(w,q)>0} c(w,q)·log p(w|C)

Key rewriting step (where did we see it before?): add and subtract the seen-word terms, valid since α_d, p(w|C) > 0.
Similar rewritings are very common when using probabilistic models for IR…
What you should know
• How to estimate a language model
• General idea and different ways of smoothing
• Effect of smoothing