Evaluation: Information Retrieval & the Web

Purposes of Evaluation
- System performance evaluation: efficiency of data structures and methods; operational profile.
- Retrieval evaluation: how well can the system "guess" what is needed, across a range of queries (including ill-specified and vague ones)?
- Comparison (who's best?): how does a new algorithm or data structure compare to its predecessors? What is its contribution?

Canonical Retrieval & Comparison Experiment
1. For each method M being compared:
   a. For each parameter setting P of M:
      1) Train M(P) on a training data set D'.
      2) For each testing data set D:
         a) Run M on D.
         b) Compare actual results to expected results.
         c) Compute performance metrics.
   b. Compare performance across the settings P for method M.
2. Compare performance across the methods M using statistical tests for significance.

Key Questions
- To what methods should you compare?
- What parameters should be checked?
- What data sets should be used?
- How should you divide data into training and testing sets?
- What metrics should be used?
- What statistical tests should be run?

Choosing Methods and Parameters
Depends on your hypothesis. Experiments are based on hypotheses such as:
- My system has significantly better performance than the state of the art.
- Heuristics significantly improve performance.
- Negative feedback makes little difference to performance.

IR Data Sets
TREC (Text REtrieval Conference), http://trec.nist.gov/
- Yearly conference organized by NIST.
- Groups compare IR systems on designated tasks and data.
- Data sets are large and include human relevance and topic judgments.
- Tasks: usually IR (e.g., retrieval, filtering), recently the Web.
- The largest, most widely used collections.

TREC Data Collections (>800,000 docs)
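The canonical experiment above can be sketched in code. This is a minimal illustration only: the function names (`run_experiment`, `grid`), the method/metric signatures, and the choice of the mean as the per-setting summary are my own assumptions, not an established API.

```python
from itertools import product
from statistics import mean

def grid(params):
    """Expand a dict of parameter-value lists into all settings
    (cartesian product), e.g. {"k": [1, 2]} -> [{"k": 1}, {"k": 2}]."""
    keys = list(params)
    return [dict(zip(keys, vals)) for vals in product(*params.values())]

def run_experiment(methods, param_grids, train_fn, test_sets, metric):
    """Return each method's best mean score over its parameter settings.
    methods[name](setting, trained, d) produces results for test set d;
    train_fn(name, setting) trains M(P) on D'; metric(results, d) scores
    actual results against expected ones. All names are placeholders."""
    best = {}
    for name, run in methods.items():           # each method M
        setting_scores = []
        for setting in grid(param_grids[name]): # each parameter setting P
            trained = train_fn(name, setting)   # train M(P) on D'
            scores = [metric(run(setting, trained, d), d)  # run M on each D
                      for d in test_sets]       # and compute the metric
            setting_scores.append(mean(scores))
        best[name] = max(setting_scores)        # compare settings within M
    return best  # feed these per-method scores to a significance test
```

The returned per-method scores would then be compared with a statistical significance test, as in step 2 of the protocol.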
1. Wall Street Journal (1987, 1988, 1989), the Federal Register (1989), Associated Press (1989), Department of Energy abstracts, and information from the Computer Select disks (1989, 1990).
2. Wall Street Journal (1990, 1991, 1992), the Federal Register (1988), Associated Press (1988), and information from the Computer Select disks (1989, 1990) copyrighted by Ziff-Davis.
3. San Jose Mercury News (1991), the Associated Press (1990), U.S. Patents (1983-1991), and information from the Computer Select disks (1991, 1992) copyrighted by Ziff-Davis.
4. Financial Times Limited (1991, 1992, 1993, 1994), the Congressional Record of the 103rd Congress (1993), and the Federal Register (1994).
5. Foreign Broadcast Information Service (1996) and the Los Angeles Times (1989, 1990).

TREC Relevance Judgments
- Definition of relevance: "If you were writing a report on the subject of the topic and would use the information contained in the document in the report, then the document is relevant."
- Binary judgments; a document is relevant if any part of it is relevant.
- Pooled (based on the returns of actual systems).
- A subset of the documents is judged by the topic author.

TREC Topics (example)

<top>
<head> Tipster Topic Description
<num> Number: 066
<dom> Domain: Science and Technology
<title> Topic: Natural Language Processing
<desc> Description:
Document will identify a type of natural language processing technology which is being developed or marketed in the U.S.
<narr> Narrative:
A relevant document will identify a company or institution developing or marketing a natural language processing technology, identify the technology, and identify one or more features of the company's product.
<con> Concept(s):
1. natural language processing
2. translation, language, dictionary, font applications
3. software
<fac> Factor(s):
<nat> Nationality: U.S.
</fac>
<def> Definition(s):
</top>

TREC Tasks
- Ad hoc retrieval
- Routing (query stream from a profile)
- Confusion (documents with scanning errors)
- Database merging (documents from separate collections)
- Spoken documents
- Filtering (construct a profile to filter an incoming stream)
- Question answering
- Web ad hoc
- Web homepage finding

TREC 2003 Tasks
- Cross-language: retrieve documents in multiple languages.
- Filtering: the user's information need is stable; check an incoming stream for which documents should be disseminated.
- Genome: gene sequences as well as supporting documents.
- High Accuracy Retrieval from Documents (HARD): leverage knowledge about the user and/or goals.
- Interactive
- Novelty: retrieve new, non-redundant data.
- Question answering
- Video: content-based retrieval of digital video.
- Web: search on a snapshot of the web.

Web Topics (example)

<top>
<num> Number: 501
<title> deduction and induction in English?
<desc> Description:
What is the difference between deduction and induction in the process of reasoning?
<narr> Narrative:
A relevant document will contrast inductive and deductive reasoning. A document that discusses only one or the other is not relevant.
</top>

Metrics: Precision and Recall
- Precision: were all retrieved docs relevant? precision = hits / (hits + false positives)
- Recall: were all relevant docs retrieved? recall = hits / (hits + false negatives)

                Retrieved        ~Retrieved
  Relevant      hit              false negative
  ~Relevant     false positive   miss

Precision/Recall Tradeoff
[Figure: precision vs. recall curves for two systems, sys1 and sys2, with both axes running 0-100%.]

Performance for Each Query
- Hold recall fixed (through parameters) and allow precision to vary.
- Average precision at seen relevant docs: a running average.
- R-precision: precision at rank r, where r = |relevant docs|.

Metrics: P&R Combination
Precision and recall tend to balance each other: improving one comes at the cost of the other. F is the harmonic mean, combining recall and precision in a single metric.
F(i, j) = 2 * Recall(i, j) * Precision(i, j) / (Precision(i, j) + Recall(i, j))

More generally,

F_beta = (beta^2 + 1) * Precision * Recall / (beta^2 * Precision + Recall),

where beta = 1 gives the harmonic mean F above.

Experiment Pitfalls
- Ill-specified hypothesis
- Reliance on flaky and/or too few users
- Bias in the user base
- Poorly constructed user instructions
- Inappropriate comparison methods
- Too much varying simultaneously
- Biased evaluation metric (confounding) or biased data selection
- Too many or too few statistical tests

Strategies for Avoiding Pitfalls
- Fully specify the experiment before running it.
- Run a pilot experiment.
- Put serious thought into the "right" hypothesis.
- Think about what you hope to say when the experiment concludes... will the experiment support your saying it?

Web Experiment Example
[O. Zamir & O. Etzioni, "Web Document Clustering", Proc. of the 21st International SIGIR Conference on Research and Development in Information Retrieval, 1998.]
Developed an incremental, linear-time clustering algorithm, Suffix Tree Clustering (STC), which clusters based on shared phrases in document snippets.
Hypotheses:
- STC has higher precision than standard clustering methods for this domain.
- Clustering based on snippets does not significantly degrade precision for this domain.
- STC is faster than standard clustering methods for this domain.

Zamir & Etzioni Experiment
Data set:
- Thought up 10 queries.
- Generated 10 collections of 200 snippets from MetaCrawler query results (retrieved the original documents as well).
- Manually assigned relevance (mean = 40 relevant docs/query).
Methods compared: Single-Pass, K-Means, Buckshot, Fractionation, GAHC.

Results (one per hypothesis):
- Z&E Exp: STC has higher precision.
- Z&E Exp: Snippet clustering does not unduly affect precision.
- Z&E Exp: STC is faster.
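The precision, recall, and F measures used throughout these experiments can be sketched as a small set-based computation. This is a minimal illustration; the function name and the set-based signature are my own, not from the lecture.

```python
def precision_recall_f(retrieved, relevant, beta=1.0):
    """Precision, recall, and F_beta for one query.
    `retrieved` and `relevant` are sets of document ids."""
    hits = len(retrieved & relevant)   # relevant and retrieved
    fp = len(retrieved - relevant)     # retrieved but not relevant
    fn = len(relevant - retrieved)     # relevant but not retrieved
    precision = hits / (hits + fp) if retrieved else 0.0
    recall = hits / (hits + fn) if relevant else 0.0
    b2 = beta ** 2
    f = ((b2 + 1) * precision * recall / (b2 * precision + recall)
         if precision + recall > 0 else 0.0)
    return precision, recall, f
```

For example, retrieving {d1, d2, d3, d4} when {d1, d2, d5} are relevant gives precision 1/2, recall 2/3, and F1 = 4/7. In a study like Zamir & Etzioni's, such per-query values would be averaged across all queries before comparing methods.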