Cumulative Progress in Language
Models for Information Retrieval
Antti Puurula
6/12/2013
Australasian Language Technology Workshop
University of Waikato
Ad-hoc Information Retrieval
• Ad-hoc retrieval forms the basic task in Information Retrieval (IR):
• Given a query, retrieve and rank documents in a collection
• Origins:
• Cranfield 1 (1958-1960), Cranfield 2 (1962-1966), SMART (1961-1999)
• Major evaluations:
• TREC Ad-hoc (1990-1999), TREC Robust (2003-2005), CLEF (2000-2009), INEX
(2009-2010), NTCIR (1999-2013), FIRE (2008-2013)
Illusionary Progress in Ad-hoc IR
• TREC ad-hoc evaluations stopped in 1999, as progress plateaued
• More diverse tasks became the foci of research
• “There is little evidence of improvement in ad-hoc retrieval
technology over the past decade” (Armstrong et al. 2009)
• Weak baselines, non-cumulative improvements
• ⟶ “no way of using LSI achieves a worthwhile improvement in retrieval
accuracy over BM25” (Atreya & Elkan, 2010)
• ⟶ “there remains very little room for improvement in ad hoc search”
(Trotman & Keeler, 2011)
Progress in Language Models for IR?
• Language Models (LM) form one of the main approaches to IR
• Many improvements to LMs not adopted generally or evaluated
systematically
• TF-IDF feature weighting
• Pitman-Yor Process smoothing
• Feedback models
• Are these improvements consistent across standard datasets,
cumulative, and do they improve on a strong baseline?
Query Likelihood Language Models
• Query Likelihood (QL) (Kalt 1996, Hiemstra 1998, Ponte & Croft
1998) is the basic application of LMs for IR
• Unigram case: using count vectors d and q to represent documents and
queries, rank documents d given a query q according to p(d|q)
• Assuming a generative model p(d|q) = p(q|d) p(d) / p(q), and uniform
priors over d: p(d|q) ∝ p(q|d)
Query Likelihood Language Models 2
• The unigram QL-score for each document d becomes:
p(q|d) = m(q) Π_w p_d(w)^{q_w}
• where m(q) is the Multinomial coefficient, and document models p_d(w)
are given by the Maximum Likelihood estimates:
p_d(w) = d_w / Σ_{w′} d_{w′}
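As a minimal sketch, the QL score can be computed in log space; since the multinomial coefficient m(q) is constant per query, it is dropped for ranking. Function and variable names are illustrative, not taken from the SGMWeka toolkit:

```python
import math
from collections import Counter

def log_ql_score(query_counts, doc_counts):
    """Rank-equivalent unigram query-likelihood score log p(q|d).

    The multinomial coefficient m(q) is constant across documents,
    so it can be dropped for ranking purposes."""
    doc_len = sum(doc_counts.values())
    score = 0.0
    for word, q_w in query_counts.items():
        p_w = doc_counts.get(word, 0) / doc_len  # ML estimate p_d(w)
        if p_w == 0.0:
            return float("-inf")  # unsmoothed model: zero probability
        score += q_w * math.log(p_w)
    return score

doc = Counter("the cat sat on the mat".split())
query = Counter("the cat".split())
score = log_ql_score(query, doc)
```

Note that without smoothing, any query word absent from the document drives the score to −∞, which is what the smoothing methods on the following slides address.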
Pitman-Yor Process Smoothing
• Standard methods for smoothing in IR LMs are Dirichlet Prior (DP)
and 2-Stage Smoothing (2SS) (Zhai & Lafferty 2004, Smucker &
Allan 2007)
• A recently suggested improvement is Pitman-Yor Process smoothing
(PYP), an approximation to inference on a Pitman-Yor Process
(Momtazi & Klakow 2010, Huang & Renals 2010)
• All methods interpolate unsmoothed parameters with a
background distribution. PYP additionally discounts the
unsmoothed counts
Pitman-Yor Process Smoothing 2
• All methods share the form:
p_d(w) = (1 − α_d) p′_d(w) + α_d p_C(w)
• DP: p′_d(w) = d_w / |d| and α_d = μ / (|d| + μ)
• 2SS: p′_d(w) = (d_w + μ p_C(w)) / (|d| + μ) and α_d = λ
• PYP: p′_d(w) = max(d_w − δ, 0) / (|d| − δ n_d) and α_d = (μ + δ n_d) / (|d| + μ),
where |d| = Σ_w d_w, n_d is the number of unique words in d, μ the
smoothing strength, λ the interpolation weight, and δ the discount
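A minimal sketch of the three smoothed estimates, assuming the common parameterization with strength μ, interpolation weight λ, and discount δ (the exact PYP approximation used in the slides may differ in detail):

```python
def dp(d_w, doc_len, p_c, mu):
    """Dirichlet Prior: (d_w + mu * p_C(w)) / (|d| + mu)."""
    return (d_w + mu * p_c) / (doc_len + mu)

def two_stage(d_w, doc_len, p_c, mu, lam):
    """2-Stage Smoothing: interpolate the DP estimate with p_C(w)."""
    return (1.0 - lam) * dp(d_w, doc_len, p_c, mu) + lam * p_c

def pyp(d_w, doc_len, n_uniq, p_c, mu, delta):
    """Approximate PYP predictive: discount each observed count by
    delta and return the freed mass (plus strength mu) via p_C(w)."""
    return (max(d_w - delta, 0.0) + (mu + delta * n_uniq) * p_c) / (doc_len + mu)
```

Each estimate sums to 1 over the vocabulary, since the mass removed by discounting and the prior strength are redistributed through the background model.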
Pitman-Yor Process Smoothing 3
• The background model p_C(w) is most commonly estimated by concatenating
all collection documents into a single document:
p_C(w) = Σ_d d_w / Σ_d Σ_{w′} d_{w′}
• Less commonly, a uniform background model is used:
p_U(w) = 1 / |V|, where |V| is the vocabulary size
TF-IDF Feature Weighting
• Multinomial modelling assumptions of text can be corrected with
TF-IDF weighting (Rennie et al. 2003, Frank & Bouckaert 2006)
• Traditional view: IDF-weighting unnecessary with IR LMs (Zhai &
Lafferty 2004)
• Recent view: combination is complementary (Smucker & Allan
2007, Momtazi et al. 2010)
TF-IDF Feature Weighting 2
• Dataset documents can be weighted by TF-IDF:
d̂_w = TF(d_w) · IDF(n_w)
• , where d is the unweighted count vector, N the number of documents,
and n_w the number of documents where word w occurs
• First factor is the TF log transform using unique length normalization
(Singhal et al. 1996)
• Second factor is the Robertson-Walker IDF (Robertson & Zaragoza 2009)
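A hedged sketch of TF-IDF reweighting for a single count; the log TF transform and the smoothed IDF constant used here are illustrative assumptions, not the slide's exact formula:

```python
import math

def tfidf_weight(d_w, n_w, N):
    """TF-IDF reweighting of one document count d_w.

    TF: log transform dampens repeated occurrences.
    IDF: words occurring in fewer documents are weighted up.
    The constants are illustrative, not the slide's exact formula."""
    tf = math.log(1.0 + d_w)
    idf = math.log((N + 1.0) / n_w)
    return tf * idf
```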
TF-IDF Feature Weighting 3
• IDF has an overlapping function with collection smoothing (Hiemstra &
Kraaij 1998)
• The interaction is taken into account by replacing the collection model
p_C(w) with a uniform model p_U(w) in smoothing
Model-based Feedback
• Pseudo-feedback is a traditional method in Ad-hoc IR:
• Using the documents retrieved for the original query q′, construct a new
query q and rank with it
• With LMs two different formalizations enable model-based
feedback:
• KL-Divergence Retrieval (Zhai & Lafferty 2001)
• Relevance Models (Lavrenko & Croft 2001)
• Both enable replacing the original query counts q′ with a model
Model-based Feedback 2
• Many modeling choices exist for the feedback models, such as:
• Using the top k retrieved documents (commonly k = 50)
• Truncating the word vector to words present in the original query
• Weighting the feedback documents using p(d|q′)
• Interpolating the feedback model with the original query
• These modeling choices are combined here
Model-based Feedback 3
• The interpolated query model q̂ is estimated for the query words
q′_w > 0 from the top k = 50 document models p_d(w):
q̂_w = (1 − β) q′_w + β Z Σ_{d ∈ D_k} p(d|q′) p_d(w)
• , where β is the interpolation weight and Z is a normalizer scaling the
feedback mass to the original query length:
Z = Σ_w q′_w / Σ_{w: q′_w > 0} Σ_{d ∈ D_k} p(d|q′) p_d(w)
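The combined modeling choices might be sketched as follows; the function names and the query-length scaling of the feedback mass are illustrative assumptions:

```python
def feedback_query(query_counts, topk_doc_models, doc_weights, beta):
    """Interpolate original query counts with a feedback model built from
    the top-k retrieved document models, truncated to words already in
    the query (one of the modeling choices above).

    doc_weights play the role of p(d|q') and should sum to 1."""
    query_len = sum(query_counts.values())
    new_query = {}
    for w, q_w in query_counts.items():  # truncate to original query words
        fb = sum(weight * model.get(w, 0.0)
                 for model, weight in zip(topk_doc_models, doc_weights))
        # scale the feedback mass by query length so both terms are comparable
        new_query[w] = (1.0 - beta) * q_w + beta * query_len * fb
    return new_query
```

With beta = 0 the original query is returned unchanged; larger beta shifts weight toward the feedback documents.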
Experimental Setup
• Ad-hoc IR experiments conducted on
13 standard datasets
• TREC1-5 split according to data source
• OHSU-TREC
• FIRE 2008-2011 English
• Preprocessing: stopword & short
word (< 3 characters) removal, Porter stemming
• Each dataset split into development
and evaluation subsets
Experimental Setup 2
• Software used for experiments was the SGMWeka 1.44 toolkit:
• http://sourceforge.net/projects/sgmweka/
• Smoothing parameters optimized on development sets using
Gaussian Random Searches (Luke 2009)
• Evaluation performed on evaluation sets, using Mean Average
Precision of the top 50 documents (MAP@50)
• Significance tested with paired one-tailed t-tests between the
datasets, with p < 0.05
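MAP@50 can be computed per query as sketched below; note that AP normalization conventions vary (here the divisor is the number of relevant documents, capped at k):

```python
def average_precision_at_k(ranked_ids, relevant_ids, k=50):
    """Average Precision over the top-k ranked documents; the mean of
    this value over all queries gives MAP@50 as used in the slides."""
    hits, ap = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            hits += 1
            ap += hits / rank  # precision at each relevant rank
    denom = min(len(relevant_ids), k)
    return ap / denom if denom else 0.0
```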
Results
• Significant differences:
• PYP > DP
• PYP+TI > 2SS
• PYP+TI+FB > PYP+TI
• PYP+TI+FB improves on 2SS by
4.07 MAP@50 absolute, a
17.1% relative improvement
Discussion
• The 3 evaluated improvements in language models for IR:
• require little additional computation
• can be implemented with small modifications to existing IR systems
• are substantial, significant and cumulative across 13 standard datasets,
compared to DP and 2SS baselines (4.07 MAP@50 absolute, 17.1% relative)
• Improvements requiring more computation possible
• document neighbourhood smoothing, word correlation models, passage-based LMs, bigram LMs, …
• More extensive evaluations needed for confirming progress