Lecture 5: Boolean and Extended Boolean Principles of Information Retrieval Prof. Ray Larson University of California, Berkeley School of Information IS 240 – Spring 2010 2010.02.03 - SLIDE 1 Today • Review – IR Components – Inverted Files • IR Models • The Boolean Model • Fuzzy sets, Rubric, P-norm, etc. IS 240 – Spring 2010 2010.02.03 - SLIDE 2 Structure of an IR System Search Line Interest profiles & Queries Formulating query in terms of descriptors Information Storage and Retrieval System Rules of the game = Rules for subject indexing + Thesaurus (which consists of Lead-In Vocabulary and Indexing Language Documents & data Indexing (Descriptive and Subject) Storage of profiles Store1: Profiles/ Search requests Storage Line Storage of Documents Comparison/ Matching Store2: Document representations Potentially Relevant Documents Adapted from Soergel, p. 19 IS 240 – Spring 2010 2010.02.03 - SLIDE 3 Document Processing Steps 2010.02.03 - SLIDE 4 Boolean Implementation: Inverted Files • We will look at “Vector files” in detail later. But conceptually, an Inverted File is a vector file “inverted” so that rows become columns and columns become rows docs D1 D2 D3 D4 D5 D6 D7 D8 D9 D10 t1 1 1 0 1 1 1 0 0 0 0 IS 240 – Spring 2010 t2 0 0 1 0 1 1 1 1 0 1 t3 1 0 1 0 1 0 0 0 1 1 Terms D1 t1 t2 t3 D2 1 0 1 D3 1 0 0 D4 0 1 1 D5 1 0 0 D6 1 1 1 … D7 1 1 0 0 1 0 2010.02.03 - SLIDE 5 How Are Inverted Files Created • Documents are parsed to extract words (or stems) and these are saved with the Document ID. Doc 1 Doc 2 Now is the time for all good men to come to the aid of their country It was a dark and stormy night in the country manor. The time was past midnight IS 240 – Spring 2010 Text Proc Steps Term now is the time for all good men to come to the aid of their country it was a dark and stormy night in the country manor the time was past midnight Doc # 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2010.02.03 - SLIDE 6 How Inverted Files are Created • After all document have been parsed the inverted file is sorted IS 240 – Spring 2010 Term now is the time for all good men to come to the aid of their country it was a dark and stormy night in the country manor the time was past midnight Doc # 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 Term a aid all and come country country dark for good in is it manor men midnight night now of past stormy the the the the their time time to to was was Doc # 2010.02.03 - SLIDE 7 2 1 1 2 1 1 2 2 1 1 2 1 2 2 1 2 2 1 1 2 2 1 1 2 2 1 1 2 1 1 2 2 How Inverted Files are Created • Multiple term entries for a single document are merged and frequency information added IS 240 – Spring 2010 Term a aid all and come country country dark for good in is it manor men midnight night now of past stormy the the the the their time time to to was was Doc # 2 1 1 2 1 1 2 2 1 1 2 1 2 2 1 2 2 1 1 2 2 1 1 2 2 1 1 2 1 1 2 2 Term a aid all and come country country dark for good in is it manor men midnight night now of past stormy the the their time time to was Doc # Freq 2 1 1 2 1 1 2 2 1 1 2 1 2 2 1 2 2 1 1 2 2 1 2 1 1 2 1 2 2010.02.03 - SLIDE 8 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 2 2 Inverted Files • The file is commonly split into a Dictionary and a Postings file Term a aid all and come country country dark for good in is it manor men midnight night now of past stormy the the their time time to was Doc # IS 240 – Spring 2010 Freq 2 1 1 2 1 1 2 2 1 1 2 1 2 2 1 2 2 1 1 2 2 1 2 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 2 2 Term a aid all and come country dark for good in is it manor men midnight night now of past stormy the their time to was N docs Doc # Tot Freq 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 1 2 2 2 Freq 2 1 1 2 1 1 2 2 1 1 2 1 2 2 1 2 2 1 1 2 2 1 2 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 2 2 2010.02.03 - SLIDE 9 Inverted files • Permit fast search for individual terms • Search results for each term is a list of document IDs (and optionally, frequency and/or positional information) • These lists can be used to solve Boolean queries: – country: d1, d2 – manor: d2 – country and manor: d2 IS 240 – Spring 2010 2010.02.03 - SLIDE 10 Inverted Files • Lots of alternative implementations – E.g.: Cheshire builds within-document frequency using a hash table during document parsing. Then Document IDs and frequency info are stored in a BerkeleyDB Btree index keyed by the term. IS 240 – Spring 2010 2010.02.03 - SLIDE 11 Btree (conceptual) F || P || Z| B || D || F| H || L || P| R || S || Z| Devils Aces Boilers Cars IS 240 – Spring 2010 Flyers Hawkeyes Hoosiers Minors Panthers Seminoles 2010.02.03 - SLIDE 12 Btree with Postings F || P || Z| B || D || F| H || L || P| R || S || Z| Devils Aces Boilers Cars Flyers 2,4,8,12 2,4,8,12 IS 240 – Spring 20102,4,8,12 Hawkeyes Hoosiers Minors Panthers 2,4,8,12 2,4,8,12 5, 7, 200 Seminoles 2,4,8,12 2,4,8,12 2,4,8,12 8,120 2010.02.03 - SLIDE 13 Inverted files • Permit fast search for individual terms • Search results for each term is a list of document IDs (and optionally, frequency and/or positional information) • These lists can be used to solve Boolean queries: – country: d1, d2 – manor: d2 – country and manor: d2 IS 240 – Spring 2010 2010.02.03 - SLIDE 14 Today • Review – IR Components – Inverted Files • IR Models • The Boolean Model • Fuzzy sets, Rubric, P-norm, etc. IS 240 – Spring 2010 2010.02.03 - SLIDE 15 IR Models • Set Theoretic Models – Boolean – Fuzzy – Extended Boolean • Vector Models (Algebraic) • Probabilistic Models (probabilistic) • Others (e.g., neural networks, etc.) IS 240 – Spring 2010 2010.02.03 - SLIDE 16 Boolean Model for IR • Based on Boolean Logic (Algebra of Sets). • Fundamental principles established by George Boole in the 1850’s • Deals with set membership and operations on sets • Set membership in IR systems is usually based on whether (or not) a document contains a keyword (term) IS 240 – Spring 2010 2010.02.03 - SLIDE 17 Boolean Operations on Sets • Intersection – Boolean ‘AND’ -- -• Union – Boolean ‘OR’ -- -- • Negation – Boolean ‘NOT’ --- X – Usually means “AND NOT” in IR • Exclusive OR – ‘XOR’ – seldom used, – Instead AxorB ( A B) ( A B) IS 240 – Spring 2010 2010.02.03 - SLIDE 18 Boolean Logic CA CA C A B C A B DeMorgan's Law : A B A B A B A B A B IS 240 – Spring 2010 2010.02.03 - SLIDE 19 Query Languages • A way to express the query (formal expression of the information need) • Types: – Boolean – Natural Language – Stylized Natural Language – Form-Based (GUI) IS 240 – Spring 2010 2010.02.03 - SLIDE 20 Simple query language: Boolean • Terms + Connectors – terms • • • • words normalized (stemmed) words phrases thesaurus terms – connectors • AND • OR • NOT – parentheses (for grouping operations) IS 240 – Spring 2010 2010.02.03 - SLIDE 21 Boolean Queries • Cat • Cat OR Dog • Cat AND Dog • (Cat AND Dog) • (Cat AND Dog) OR Collar • (Cat AND Dog) OR (Collar AND Leash) • (Cat OR Dog) AND (Collar OR Leash) IS 240 – Spring 2010 2010.02.03 - SLIDE 22 Boolean Queries • (Cat OR Dog) AND (Collar OR Leash) – Each of the following combinations works: Doc # 1 2 CAT X X DOG X COLLAR X LEASH X IS 240 – Spring 2010 3 4 5 6 7 X X X X X X X X X X X X X X 2010.02.03 - SLIDE 23 Boolean Queries • (Cat OR Dog) AND (Collar OR Leash) – None of the following combinations works: Doc # 1 2 3 4 5 6 CAT X X DOG X X COLLAR X X LEASH X X IS 240 – Spring 2010 7 2010.02.03 - SLIDE 24 Boolean Queries • Usually expressed as INFIX operators in IR – ((a AND b) OR (c AND b)) • NOT is UNARY PREFIX operator – ((a AND b) OR (c AND (NOT b))) • AND and OR can be n-ary operators – (a AND b AND c AND d) • Some rules - (De Morgan revisited) – NOT(a) AND NOT(b) = NOT(a OR b) – NOT(a) OR NOT(b)= NOT(a AND b) – NOT(NOT(a)) = a IS 240 – Spring 2010 2010.02.03 - SLIDE 25 Boolean Searching Cracks Width measurement Beams Prestressed concrete IS 240 – Spring 2010 Formal Query: cracks AND beams AND Width_measurement AND Prestressed_concrete Relaxed Query: Relaxed Query: (C AND B AND P) OR (C AND B AND P) OR (C AND B AND W) OR (C AND B AND W) OR (C AND W AND P) OR (C AND W AND P) OR (B AND W AND P) (B AND W AND P) 2010.02.03 - SLIDE 26 Boolean Logic t1 t2 D9 D2 D1 m5 m3 D11 D10 m1 m2 m1 = t1 t2 t3 D4 D5 D3 m8 m6 m2 = t1 t2 t3 m3 = t1 t2 t3 D6 m4 m4 = t1 t2 t3 m5 = t1 t2 t3 m6 = t1 t2 t3 m7 m7 = t1 t2 t3 D8 D7 m8 = t1 t2 t3 t3 IS 240 – Spring 2010 2010.02.03 - SLIDE 27 Precedence Ordering • In what order do we evaluate the components of the Boolean expression? – Parenthesis get done first • (a or b) and (c or d) • (a or (b and c) or d) – Usually start from the left and work right (in case of ties) – Usually (if there are no parentheses) • NOT before AND • AND before OR IS 240 – Spring 2010 2010.02.03 - SLIDE 28 Faceted Boolean Query • Strategy: break query into facets (polysemous with earlier meaning of facets) – conjunction of disjunctions (a1 OR a2 OR a3) (b1 OR b2) (c1 OR c2 OR c3 OR c4) AND – each facet expresses a topic (“rain forest” OR jungle OR amazon) (medicine OR remedy OR cure) (Smith OR Zhou) IS 240 – Spring 2010 AND 2010.02.03 - SLIDE 29 Ordering of Retrieved Documents • Pure Boolean has no ordering • In practice: – order chronologically – order by total number of “hits” on query terms • What if one term has more hits than others? • Is it better to one of each term or many of one term? • Fancier methods have been investigated – p-norm is most famous • usually impractical to implement • usually hard for user to understand IS 240 – Spring 2010 2010.02.03 - SLIDE 30 Faceted Boolean Query • Query still fails if one facet missing • Alternative: – Coordination level ranking – Order results in terms of how many facets (disjuncts) are satisfied – Also called Quorum ranking, Overlap ranking, and Best Match • Problem: Facets still undifferentiated • Alternative: – Assign weights to facets IS 240 – Spring 2010 2010.02.03 - SLIDE 31 Boolean Processing • Boolean Processing (classic Boolean) – Data structures for Query representation and Boolean Operations • Boolean processing logic and algorithms • Extended Boolean Models – Fuzzy Logic – Others IS 240 – Spring 2010 2010.02.03 - SLIDE 32 Boolean Processing • All processing takes place on postings lists • Different methods can be used for sorted or unsorted postings lists IS 240 – Spring 2010 2010.02.03 - SLIDE 33 Boolean Query Processing • The query must be parsed to determine what the: – Search Words – Optional field or index qualifications – Boolean Operators • Are and how they relate to one-another • Typical parsing uses lexical analysers (like lex or flex) along with parser generators like YACC, BISON or Llgen – These produce code to be compiled into programs. – Example… IS 240 – Spring 2010 2010.02.03 - SLIDE 34 Z39.50 Query Structure (ASN-1 Notation) -- Query Definitions Query ::= CHOICE{ type-0 [0] ANY, type-1 type-2 [1] IMPLICIT RPNQuery, [2] OCTET STRING, type-100 [100] OCTET STRING, type-101 [101] IMPLICIT RPNQuery, type-102 [102] OCTET STRING} IS 240 – Spring 2010 2010.02.03 - SLIDE 35 Z39.50 RPN Query (ASN-1 Notation) -- Definitions for RPN query RPNQuery ::= SEQUENCE{ attributeSet rpn IS 240 – Spring 2010 AttributeSetId, RPNStructure} 2010.02.03 - SLIDE 36 RPN Structure RPNStructure ::= CHOICE{ op [0] Operand, rpnRpnOp [1] IMPLICIT SEQUENCE{ rpn1 RPNStructure, rpn2 RPNStructure, op Operator } } IS 240 – Spring 2010 2010.02.03 - SLIDE 37 Operand Operand ::= CHOICE{ attrTerm AttributesPlusTerm, resultSet ResultSetId, -- If version 2 is in force: -- - If query type is 1, one of the above two must be chosen; -- - resultAttr (below) may be used only if query type is 101. resultAttr ResultSetPlusAttributes} IS 240 – Spring 2010 2010.02.03 - SLIDE 38 Operator Operator ::= [46] CHOICE{ and [0] IMPLICIT NULL, or [1] IMPLICIT NULL, and-not [2] IMPLICIT NULL, -- If version 2 is in force: -- - For query type 1, one of the above three must be chosen; -- - prox (below) may be used only if query type is 101. prox IS 240 – Spring 2010 [3] IMPLICIT ProximityOperator} 2010.02.03 - SLIDE 39 Parse Result (Query Tree) • Z39.50 queries… Title XXX and Subject YYY Operator: AND left Operand: Index = Title Value = XXX IS 240 – Spring 2010 right Operand: Index = Subject Value = YYY 2010.02.03 - SLIDE 40 Parse Results • Subject XXX and (title yyy and author zzz) Op: AND Oper: Index: Subject Value: XXX Op: AND Oper: Index: Title Value: YYY IS 240 – Spring 2010 Oper: Index: Author Value: ZZZ 2010.02.03 - SLIDE 41 Boolean AND (Sorted) Algorithm • Choose the shortest list • Create new list the same length as the short list – For each item in the short list • Compare next item in longer list – If greater than – go to next item in longer list – If equal - add to new list and go to next item in both lists – If less than - go to next item in short list IS 240 – Spring 2010 2010.02.03 - SLIDE 42 Boolean AND Algorithm 2 5 7 8 15 29 35 100 135 140 155 189 190 195 198 IS 240 – Spring 2010 AND 2 8 9 12 15 22 28 50 68 77 84 100 120 128 135 138 141 150 155 188 189 195 = 2 8 15 100 135 155 189 195 2010.02.03 - SLIDE 43 Boolean OR (Sorted) Algorithm • Choose the longer list • Create new list the same length both lists combined – For each item in the longer list • If less than or equal to the first item in the short list – Add to new list • Otherwise – Add item from short list – Compare next items in short and long lists » If long item less then short item add long item and go to next long item » Otherwise – add from short list and go to next short item – Once the short list runs out, add the remaining items in the long list IS 240 – Spring 2010 2010.02.03 - SLIDE 44 Boolean OR Algorithm 2 8 9 12 15 22 28 50 68 77 84 100 120 128 135 138 141 150 155 188 189 195 IS 240 – Spring 2010 OR 2 5 7 8 15 29 35 100 135 140 155 189 190 195 198 = 2 5 7 8 9 12 15 22 28 29 35 50 68 77 84 100 120 128 135 138 141 150 155 188 189 190 195 198 2010.02.03 - SLIDE 45 Boolean AND NOT(Sorted) Algorithm Create new list the same length as the lefthand list – For each item in the left-hand list • Compare next item in not list – If greater than – add to new list and go to next item in not list – If equal - go to next item in both lists – If less than - go to next item in not list IS 240 – Spring 2010 2010.02.03 - SLIDE 46 Boolean AND NOTAlgorithm 2 5 7 8 15 29 35 100 135 140 155 189 190 195 198 IS 240 – Spring 2010 AND NOT 2 8 9 12 15 22 28 50 68 77 84 100 120 128 135 138 141 150 155 188 189 195 = 5 7 15 29 35 140 190 198 2010.02.03 - SLIDE 47 Hashed Boolean AND (unsorted) • Put each item in shortest list into hash table – For each item in other lists • If hash entry exists, set flag in hash table entry (or increment counter) • Scan hash table contents – If flag set (or counter == number of lists) add to new list IS 240 – Spring 2010 2010.02.03 - SLIDE 48 Hashed Boolean OR (unsorted) • Put each item in EACH list into hash table – If match increment counter (optional) • Scan hash table contents and add to new list IS 240 – Spring 2010 2010.02.03 - SLIDE 49 Hashed Boolean AND NOT (unsorted) • Put each item in left-hand list into hash table – For each item in NOT list • If hash entry exists, remove it • Scan hash table contents and add to new list IS 240 – Spring 2010 2010.02.03 - SLIDE 50 Boolean Summary • Advantages – simple queries are easy to understand – relatively easy to implement • Disadvantages – difficult to specify what is wanted, particularly in complex situations – too much returned, or too little – ordering not well determined • Dominant IR model in commercial systems until the WWW IS 240 – Spring 2010 2010.02.03 - SLIDE 51 Basic Concepts for Extended Boolean • Instead of binary values, terms in documents and queries have a weight (importance or some other statistical property) • Instead of binary set membership, sets are “fuzzy” and the weights are used to determine degree of membership. • Degree of set membership can be used to rank the results of a query IS 240 – Spring 2010 2010.02.03 - SLIDE 52 Fuzzy Sets • Introduced by Zadeh in 1965. • If set {A} has value v(A) and {B} has value v(B), where 0 v 1 • v(AB) = min(v(A), v(B)) • v(AB) = max(v(A), v(B)) • v(~A) = 1-v(A) IS 240 – Spring 2010 2010.02.03 - SLIDE 53 Fuzzy Sets • If we have three documents and three terms… – D1=(.4,.2,1), D2=(0,0,.8), D3=(.7, .4,0) For search: t1t2 t3 For search: t1t2 t3 v(D1) = max(.4, .2, 1) = 1 v(D2) = max(0, 0, .8) = .8 v(D3) = max(.7, .4, 0) = .7 v(D1) = min(.4, .2, 1) = .2 v(D2) = min(0, 0, .8) = 0 v(D3) = min(.7, .4, 0) = 0 IS 240 – Spring 2010 2010.02.03 - SLIDE 54 Fuzzy Sets • Fuzzy set membership of term to document is f(A)[0,1] • D1 = {(mesons, .8), (scattering, .4)} • D2 = {(mesons, .5), (scattering, .6)} • Query = MESONS AND SCATTERING • RSV(D1) = MIN(.8,.4) = .4 • RSV(D2) = MIN(.5,.6) = .5 • D2 is ranked before D1 in the result set. IS 240 – Spring 2010 2010.02.03 - SLIDE 55 Fuzzy Sets • The set membership function can be, for example, the relative term frequency within a document, the IDF or any other function providing weights to terms • This means that the fuzzy methods use sets of criteria for term weighting that are the same or similar to those used in other ranked retrieval methods (e.g., vector and probabilistic methods) IS 240 – Spring 2010 2010.02.03 - SLIDE 56 Robertson’s Critique of Fuzzy Sets • • • • • • D1 = {(mesons, .4), (scattering, .4)} D2 = {(mesons, .39), (scattering, .99)} Query = MESONS AND SCATTERING RSV(D1) = MIN(.4,.4) = .4 RSV(D2) = MIN(.39,.99) = .39 However, consistent with the Boolean model: – Query = t1t2t3…t100 – If D not indexed by t1 then it fails, even if D is indexed by t2,…,t100 IS 240 – Spring 2010 2010.02.03 - SLIDE 57 Robertson’s critique of Fuzzy • Fuzzy sets suffer from the same kind of lack of discrimination among the retrieval results almost to the same extent as standard Boolean • The rank of a document depends entirely on the lowest or highest weighted term in an AND or OR operation IS 240 – Spring 2010 2010.02.03 - SLIDE 58 Other Fuzzy Approaches • As described in the Modern Information Retrieval (optional) text, a keyword correlation matrix can be used to determine set membership values, and algebraic sums and products can be used in place of MAX and MIN • Not clear how this approach works in real applications (or in tests like TREC) because the testing has been on a small scale IS 240 – Spring 2010 2010.02.03 - SLIDE 59 Extended Boolean (P-Norm) • Ed Fox’s Dissertation work with Salton • Basic notion is that terms in a Boolean query, and the Boolean Operators themselves can have weights assigned to them • Binary weights means that queries behave like standard Boolean • 0 < Weights < 1 mean that queries behave like a ranking system • The system requires similarity measures IS 240 – Spring 2010 2010.02.03 - SLIDE 60 Probabilistic Inclusion of Boolean • Most probabilistic models attempt to predict the probability that given a particular query Q and document D, that the searcher would find D relevant • If we assume that Boolean criteria are to be ANDed with a probabilistic query… P(R | Q,D) P(R | Qbool ,D)P(R | Qprob ,D) 1: if Boolean eval successful for D P(R | Qbool ,D) 0 : Otherwise IS 240 – Spring 2010 2010.02.03 - SLIDE 61 Rubric – Extended Boolean • Scans full text of documents and stores them • User develops a hierarchy of concepts which becomes the query • Leaf nodes of the hierarchy are combinations of text patterns • A “fuzzy calculus” is used to propagate values obtained at leaves up through the hierarchy to obtain a single retrieval status value (or “relevance” value) • RUBRIC returns a ranked list of documents in descending order of “relevance” values. IS 240 – Spring 2010 2010.02.03 - SLIDE 62 RUBRIC Rules for Concepts & Weights • Team | event => World_Series • St._Louis_Cardinals | Milwaukee_Brewers => Team • “Cardinals” => St._Louis_Cardinals (0.7) • Cardinals_full_name => St._Louis_Cardinals (0.9) • Saint & “Louis” & “Cardinals” => Cardinals_full_name • “St.” => Saint (0.9) • “Saint” => Saint • “Brewers” => Milwaukee_Brewers (0.5) IS 240 – Spring 2010 2010.02.03 - SLIDE 63 RUBRIC Rules for Concepts & Weights • “Milwaukee Brewers” => Milwaukee_Brewers (0.9) • “World Series” => event • Baseball_championship => event (0.9) • Baseball & Championship => Baseball_championship • “ball” => Baseball (0.5) • “baseball” => Baseball • “championship” => Championship (0.7) IS 240 – Spring 2010 2010.02.03 - SLIDE 64 RUBRIC combination methods V(V1 or V2) = MAX(V1, V2) V(V1 and V2) = MIN(V1, V2) i.e., classic fuzzy matching, but with the addition… V(level n) = Cn*V(level n-1) IS 240 – Spring 2010 2010.02.03 - SLIDE 65 Rule Evaluation Tree World_Series (0) Event (0) Team (0) 0.9 St._Louis_Cardinals (0) 0.7 0.5 0.9 “Cardinals” (0) “World Series” Milwaukee_brewers (0) “Brewers” (0) Baseball_championship (0) 0.9 “Milwaukee Brewers” (0) Championship (0) Cardinals_full_name (0) Baseball (0) Saint (0) “Louis” (0) 0.9 “Cardinals” (0) 0.5 “ball” (0) “St.” (0) 0.7 “baseball” (0) “championship” (0) “Saint” (0) IS 240 – Spring 2010 2010.02.03 - SLIDE 66 Rule Evaluation Tree World_Series (0) Event (0) Team (0) 0.9 St._Louis_Cardinals (0) 0.7 0.5 0.9 “Cardinals” (0) “World Series” Milwaukee_brewers (0) “Brewers” (0) Baseball_championship (0) 0.9 “Milwaukee Brewers” (0) Championship (0) Cardinals_full_name (0) Baseball (0) Saint (0) “Louis” (0) 0.9 “Cardinals” (0) 0.5 “ball” (1.0) “St.” (0) “Saint” (0) IS 240 – Spring 2010 0.7 “baseball” (1.0)“championship” (1.0) Document containing “ball”, “baseball” & “championship” 2010.02.03 - SLIDE 67 Rule Evaluation Tree World_Series (0) Event (0) Team (0) 0.9 St._Louis_Cardinals (0) 0.7 0.5 0.9 “Cardinals” (0) “World Series” Milwaukee_brewers (0) “Brewers” (0) Baseball_championship (0) 0.9 “Milwaukee Brewers” (0) Championship (0.7) Cardinals_full_name (0) Baseball (1.0) Saint (0) “Louis” (0) 0.9 “Cardinals” (0) 0.5 “ball” (1.0) “St.” (0) 0.7 “baseball” (1.0)“championship” (1.0) “Saint” (0) IS 240 – Spring 2010 2010.02.03 - SLIDE 68 Rule Evaluation Tree World_Series (0) Event (0) Team (0) 0.9 St._Louis_Cardinals (0) 0.7 0.5 0.9 “Cardinals” (0) “World Series” Milwaukee_brewers (0) “Brewers” (0) Baseball_championship (0.7) 0.9 “Milwaukee Brewers” (0) Championship (0.7) Cardinals_full_name (0) Baseball (1.0) Saint (0) “Louis” (0) 0.9 “Cardinals” (0) 0.5 “ball” (1.0) “St.” (0) 0.7 “baseball” (1.0)“championship” (1.0) “Saint” (0) IS 240 – Spring 2010 2010.02.03 - SLIDE 69 Rule Evaluation Tree World_Series (0) Event (0.63) Team (0) 0.9 St._Louis_Cardinals (0) 0.7 0.5 0.9 “Cardinals” (0) “World Series” Milwaukee_brewers (0) “Brewers” (0) Baseball_championship (0.7) 0.9 “Milwaukee Brewers” (0) Championship (0.7) Cardinals_full_name (0) Baseball (1.0) Saint (0) “Louis” (0) 0.9 “Cardinals” (0) 0.5 “ball” (1.0) “St.” (0) 0.7 “baseball” (1.0)“championship” (1.0) “Saint” (0) IS 240 – Spring 2010 2010.02.03 - SLIDE 70 Rule Evaluation Tree World_Series (0.63) Event (0.63) Team (0) 0.9 St._Louis_Cardinals (0) 0.7 0.5 0.9 “Cardinals” (0) “World Series” Milwaukee_brewers (0) “Brewers” (0) Baseball_championship (0.7) 0.9 “Milwaukee Brewers” (0) Championship (0.7) Cardinals_full_name (0) Baseball (1.0) Saint (0) “Louis” (0) 0.9 “Cardinals” (0) 0.5 “ball” (1.0) “St.” (0) 0.7 “baseball” (1.0)“championship” (1.0) “Saint” (0) IS 240 – Spring 2010 2010.02.03 - SLIDE 71 RUBRIC Terrorism Query Terrorism 0.7 0.2 Event 0.1 0.3 Actor Reason effect Encounter Takeover Specific General actor actor Killing Slaying Shooting Bombing 0.8 Device 0.6 Explosion IS 240 – Spring 2010 Kidnapping 0.7 ransom 0.5 Kidnap event 2010.02.03 - SLIDE 72 Non-Boolean IR • Need to measure some similarity between the query and the document • The basic notion is that documents that are somehow similar to a query, are likely to be relevant responses for that query • We will revisit this notion again and see how the Language Modelling approach to IR has taken it to a new level IS 240 – Spring 2010 2010.02.03 - SLIDE 73 Similarity Measures (Set-based) Assuming that Q and D are the sets of terms associated with a Query and Document: |QD| |QD| 2 |Q|| D| Simple matching (coordination level match) Dice’s Coefficient |QD| |QD| |QD| 1 Jaccard’s Coefficient 1 |Q | | D | |QD| min(|Q |, | D |) 2 IS 240 – Spring 2010 2 Cosine Coefficient Overlap Coefficient 2010.02.03 - SLIDE 74 Next Week • Moving beyond Boole… • The vector space model IS 240 – Spring 2010 2010.02.03 - SLIDE 75

Descargar
# Document