Evaluation
Information retrieval
Web

Purposes of Evaluation

System Performance Evaluation



Retrieval Evaluation


efficiency of data structures and methods
operational profile
how well can system “guess” what is needed from
range of queries (including ill-specified and vague)
Comparison – Who’s best

how does new algorithm/data structure compare
to its predecessors; what is its contribution?
Canonical Retrieval &
Comparison Experiment
1. For each method M being compared,
a.
For each parameter setting P of M,
1)
2)
Train M(P) on a training data set D’.
For each testing data set D,
a)
b)
c)
b.
Run M on D
Compare actual results to expected results
Compute performance metrics
Compare performance on set P for method M
2. Compare performance across set M using
statistical tests for significance
Key Questions

To what methods should you compare?
What parameters should be checked?

What data sets should be used?


How should you divide into training and
testing?

What metrics should be used?

What statistical tests should be run?
Choosing Methods and
Parameters


Depends on your hypothesis…
Experiments are based on hypothesis,:



My system has significantly better performance
than state of the art.
Heuristics significantly improve performance.
Negative feedback makes little difference to
performance.
IR Data Sets


TREC (Text Retrieval Conference)
http://trec.nist.gov/





yearly conference organized by NIST
groups compare IR systems on designated tasks and
data
data sets are large and include human relevance and
topic judgments
tasks: usually IR (e.g., retrieval, filtering), recently
Web
Largest,most widely used collections
TREC Data Collections
(>800,000 docs)
1.
2.
3.
4.
5.
Wall Street Journal (1987, 1988, 1989), the Federal Register
(1989), Associated Press (1989), Department of Energy
abstracts, and Information from the Computer Select disks
(1989, 1990)
Wall Street Journal (1990, 1991, 1992), the Federal Register
(1988), Associated Press (1988) and Information from the
Computer Select disks (1989, 1990) copyrighted by Ziff-Davis
San Jose Mercury News (1991), the Associated Press (1990),
U.S. Patents (1983-1991), and Information from the Computer
Select disks (1991, 1992) copyrighted by Ziff-Davis
Financial Times Limited (1991, 1992, 1993, 1994), the
Congressional Record of the 103rd Congress (1993), and the
Federal Register (1994).
Foreign Broadcast Information Service (1996) and the Los
Angeles Times (1989, 1990).
TREC Relevance Judgments
Definition of relevance: “If you were writing a
report on the subject of the topic and would
use the information contained in the
document in the report, then the document
is relevant.”
 binary judgments, relevant if any part is
 pooled (based on returns of actual systems)
subset of docs are judged by topic author
TREC Topics
<top>
<head> Tipster Topic Description
<num> Number: 066
<dom> Domain: Science and Technology
<tide> Topic: Natural Language Processing
<desc> Description:
Document will identify a type of natural language processing technology which
is being developed or marketed in the U.S.
<narr> Narrative:
A relevant document will identify a company or institution developing or
marketing a natural language processing technology, identify the technology,
and identify one or more features of the company's product.
<con> Concept(s): 1. natural language processing 2. translation, language, dictionary, font
applications
<fac> Factor(s):
<nat> Nationality: U.S.
<fac>
<del> Definition(s):
<top>
3. software
TREC Tasks









ad hoc retrieval
routing (query stream from profile)
confusion (docs with scanning errors)
Database merging (docs from separate collections)
spoken documents
filtering (construct profile to filter incoming stream)
question and answer
web ad hoc
web homepage finding
TREC 2003 Tasks
Cross-language: retrieve documents in multiple languages
Filtering: user’s information need is stable; check incoming
stream for which docs should be disseminated
Genome: gene sequences as well as supporting docs
High Accuracy Retrieval from Documents (HARD): leverage
knowledge about user and/or goals
Interactive
Novelty: retrieve new,non-redundant data
Question answering
Video: Content-based retrieval of digital video
Web: search on a snapshot of the web
Web Topics
<top>
<num> Number: 501
<title> deduction and induction in English?
<desc> Description:
What is the difference between deduction and induction in
the process of reasoning?
<narr> Narrative:
A relevant document will contrast inductive and deductive
reasoning. A document that discusses only one or the
other is not relevant.
</top>
Metrics: Precision and Recall
Precision: were all retrieved docs relevant?
hits/(hits+fp)
Recall: were all relevant docs retrieved?
hits/(hits+fn)
Retrieved
~Retrieved
Relevant
hit
false
negative
~Relevant
false
positive
miss
Precision/Recall Tradeoff
100
90
80
70
60
50
40
30
20
10
0
sys1
sys2
0
10
20
30
40
50
60
70
80
90 100
Performance for Each Query



Hold recall fixed (through parameters)
and allow precision to vary
Average precision at seen relevant
docs: running average
R-precision: precision by rth ranking
where r=|relevant docs|
Metrics: P&R Combination


Precision and recall tend to balance each
other. Improve one at cost of the other.
F is harmonic mean, comparing recall and
precision in single metric.
F (i, j ) 
2 * Recall ( i , j ) * Precision ( i , j )
Precision ( i , j )  Recall ( i , j )
(   1) Recall * Precision
2
F 
 * Precision
2
 Recall
, 1
Experiment Pitfalls








Ill specified hypothesis
Reliance on flaky and/or too few users
Bias in the user base
Poorly constructed user instructions
Inappropriate comparison methods
Too much varying simultaneously
Biased evaluation metric (confounding) or
data selection
Too many or too few statistical tests
Strategies for Avoiding Pitfalls




Fully specify experiment before running it
Run pilot experiment
Put serious thought into “right” hypothesis
Think about what you hope to say when
experiment concludes… will experiment
support your saying it?
Web Experiment Example
[O. Zamir & O. Etzioni, “Web Document Clustering”, Proc. of
21st International SIGIR Conference on Research and
Development in Information Retrieval, 1998.]
Developed incremental, linear time clustering algorithm which
clusters based on shared phrases in document snippets:
Suffix Tree Clustering
Hypotheses:
 STC has higher precision than standard clustering methods
for this domain.
 Clustering based on snippets does not significantly degrade
precision for this domain.
 STC is faster than standard clustering methods for this
domain.
Zamir & Etzioni Experiment

Data set:




Thought up 10 queries
Generated 10 collections of 200 snippets from
MetaCrawler query results (retrieved original docs as
well)
Manually assigned relevance (mean=40 relevant
docs/query)
Methods:

Single-Pass, K-Means, Buckshot, Fractionation, GAHC
Z&E Exp:
STC has higher precision.
Z&E Exp: Snippet Clustering
does not unduly affect precision.
Z&E Exp: STC is faster.
Descargar

Evaluation - Computer Science Building, Colorado State