Lecture 5: Boolean and Extended
Boolean
Principles of Information
Retrieval
Prof. Ray Larson
University of California, Berkeley
School of Information
IS 240 – Spring 2010
2010.02.03 - SLIDE 1
Today
• Review
– IR Components
– Inverted Files
• IR Models
• The Boolean Model
• Fuzzy sets, Rubric, P-norm, etc.
IS 240 – Spring 2010
2010.02.03 - SLIDE 2
Structure of an IR System
Search
Line
Interest profiles
& Queries
Formulating query in
terms of
descriptors
Information Storage and Retrieval System
Rules of the game =
Rules for subject indexing +
Thesaurus (which consists of
Lead-In
Vocabulary
and
Indexing
Language
Documents
& data
Indexing
(Descriptive and
Subject)
Storage of
profiles
Store1: Profiles/
Search requests
Storage
Line
Storage of
Documents
Comparison/
Matching
Store2: Document
representations
Potentially
Relevant
Documents
Adapted from Soergel, p. 19
IS 240 – Spring 2010
2010.02.03 - SLIDE 3
Document Processing Steps
2010.02.03 - SLIDE 4
Boolean Implementation: Inverted Files
• We will look at “Vector files” in detail later.
But conceptually, an Inverted File is a
vector file “inverted” so that rows become
columns and columns become rows
docs
D1
D2
D3
D4
D5
D6
D7
D8
D9
D10
t1
1
1
0
1
1
1
0
0
0
0
IS 240 – Spring 2010
t2
0
0
1
0
1
1
1
1
0
1
t3
1
0
1
0
1
0
0
0
1
1
Terms D1
t1
t2
t3
D2
1
0
1
D3
1
0
0
D4
0
1
1
D5
1
0
0
D6
1
1
1
…
D7
1
1
0
0
1
0
2010.02.03 - SLIDE 5
How Are Inverted Files Created
• Documents are parsed to extract
words (or stems) and these are
saved with the Document ID.
Doc 1
Doc 2
Now is the time
for all good men
to come to the aid
of their country
It was a dark and
stormy night in
the country
manor. The time
was past midnight
IS 240 – Spring 2010
Text
Proc
Steps
Term
now
is
the
time
for
all
good
men
to
come
to
the
aid
of
their
country
it
was
a
dark
and
stormy
night
in
the
country
manor
the
time
was
past
midnight
Doc #
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2010.02.03 - SLIDE 6
How Inverted Files are Created
• After all document have
been parsed the inverted
file is sorted
IS 240 – Spring 2010
Term
now
is
the
time
for
all
good
men
to
come
to
the
aid
of
their
country
it
was
a
dark
and
stormy
night
in
the
country
manor
the
time
was
past
midnight
Doc #
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
Term
a
aid
all
and
come
country
country
dark
for
good
in
is
it
manor
men
midnight
night
now
of
past
stormy
the
the
the
the
their
time
time
to
to
was
was
Doc #
2010.02.03 - SLIDE 7
2
1
1
2
1
1
2
2
1
1
2
1
2
2
1
2
2
1
1
2
2
1
1
2
2
1
1
2
1
1
2
2
How Inverted Files are Created
• Multiple term
entries for a single
document are
merged and
frequency
information added
IS 240 – Spring 2010
Term
a
aid
all
and
come
country
country
dark
for
good
in
is
it
manor
men
midnight
night
now
of
past
stormy
the
the
the
the
their
time
time
to
to
was
was
Doc #
2
1
1
2
1
1
2
2
1
1
2
1
2
2
1
2
2
1
1
2
2
1
1
2
2
1
1
2
1
1
2
2
Term
a
aid
all
and
come
country
country
dark
for
good
in
is
it
manor
men
midnight
night
now
of
past
stormy
the
the
their
time
time
to
was
Doc #
Freq
2
1
1
2
1
1
2
2
1
1
2
1
2
2
1
2
2
1
1
2
2
1
2
1
1
2
1
2
2010.02.03 - SLIDE 8
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
1
1
1
2
2
Inverted Files
• The file is commonly split into a Dictionary
and a Postings file
Term
a
aid
all
and
come
country
country
dark
for
good
in
is
it
manor
men
midnight
night
now
of
past
stormy
the
the
their
time
time
to
was
Doc #
IS 240 – Spring 2010
Freq
2
1
1
2
1
1
2
2
1
1
2
1
2
2
1
2
2
1
1
2
2
1
2
1
1
2
1
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
1
1
1
2
2
Term
a
aid
all
and
come
country
dark
for
good
in
is
it
manor
men
midnight
night
now
of
past
stormy
the
their
time
to
was
N docs
Doc #
Tot Freq
1
1
1
1
1
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
1
2
1
1
1
1
1
1
1
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
4
1
2
2
2
Freq
2
1
1
2
1
1
2
2
1
1
2
1
2
2
1
2
2
1
1
2
2
1
2
1
1
2
1
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
1
1
1
2
2
2010.02.03 - SLIDE 9
Inverted files
• Permit fast search for individual terms
• Search results for each term is a list of
document IDs (and optionally, frequency
and/or positional information)
• These lists can be used to solve Boolean
queries:
– country: d1, d2
– manor: d2
– country and manor: d2
IS 240 – Spring 2010
2010.02.03 - SLIDE 10
Inverted Files
• Lots of alternative implementations
– E.g.: Cheshire builds within-document
frequency using a hash table during
document parsing. Then Document IDs and
frequency info are stored in a BerkeleyDB Btree index keyed by the term.
IS 240 – Spring 2010
2010.02.03 - SLIDE 11
Btree (conceptual)
F || P || Z|
B
|| D || F|
H || L || P|
R || S || Z|
Devils
Aces
Boilers
Cars
IS 240 – Spring 2010
Flyers
Hawkeyes
Hoosiers
Minors
Panthers
Seminoles
2010.02.03 - SLIDE 12
Btree with Postings
F || P || Z|
B || D || F|
H || L || P|
R || S || Z|
Devils
Aces
Boilers
Cars
Flyers
2,4,8,12
2,4,8,12
IS 240 – Spring 20102,4,8,12
Hawkeyes
Hoosiers
Minors
Panthers
2,4,8,12
2,4,8,12
5, 7, 200
Seminoles
2,4,8,12
2,4,8,12
2,4,8,12
8,120
2010.02.03 - SLIDE 13
Inverted files
• Permit fast search for individual terms
• Search results for each term is a list of
document IDs (and optionally, frequency
and/or positional information)
• These lists can be used to solve Boolean
queries:
– country: d1, d2
– manor: d2
– country and manor: d2
IS 240 – Spring 2010
2010.02.03 - SLIDE 14
Today
• Review
– IR Components
– Inverted Files
• IR Models
• The Boolean Model
• Fuzzy sets, Rubric, P-norm, etc.
IS 240 – Spring 2010
2010.02.03 - SLIDE 15
IR Models
• Set Theoretic Models
– Boolean
– Fuzzy
– Extended Boolean
• Vector Models (Algebraic)
• Probabilistic Models (probabilistic)
• Others (e.g., neural networks, etc.)
IS 240 – Spring 2010
2010.02.03 - SLIDE 16
Boolean Model for IR
• Based on Boolean Logic (Algebra of Sets).
• Fundamental principles established by
George Boole in the 1850’s
• Deals with set membership and operations
on sets
• Set membership in IR systems is usually
based on whether (or not) a document
contains a keyword (term)
IS 240 – Spring 2010
2010.02.03 - SLIDE 17
Boolean Operations on Sets

• Intersection – Boolean ‘AND’ --  -• Union – Boolean ‘OR’ --  -- 
• Negation – Boolean ‘NOT’ --- X

– Usually means “AND NOT” in IR
• Exclusive OR – ‘XOR’ – seldom used,
– Instead
AxorB  ( A  B)  ( A  B)
IS 240 – Spring 2010
2010.02.03 - SLIDE 18
Boolean Logic
CA
CA
C  A B
C  A B
DeMorgan's Law :
A
B
A B  A B
A B  A B
IS 240 – Spring 2010
2010.02.03 - SLIDE 19
Query Languages
• A way to express the query (formal
expression of the information need)
• Types:
– Boolean
– Natural Language
– Stylized Natural Language
– Form-Based (GUI)
IS 240 – Spring 2010
2010.02.03 - SLIDE 20
Simple query language: Boolean
• Terms + Connectors
– terms
•
•
•
•
words
normalized (stemmed) words
phrases
thesaurus terms
– connectors
• AND
• OR
• NOT
– parentheses (for grouping operations)
IS 240 – Spring 2010
2010.02.03 - SLIDE 21
Boolean Queries
• Cat
• Cat OR Dog
• Cat AND Dog
• (Cat AND Dog)
• (Cat AND Dog) OR Collar
• (Cat AND Dog) OR (Collar AND Leash)
• (Cat OR Dog) AND (Collar OR Leash)
IS 240 – Spring 2010
2010.02.03 - SLIDE 22
Boolean Queries
• (Cat OR Dog) AND (Collar OR Leash)
– Each of the following combinations works:
Doc #
1 2
CAT
X X
DOG
X
COLLAR X
LEASH
X
IS 240 – Spring 2010
3 4 5 6 7
X
X X
X X
X X
X X
X X
X
X
X
2010.02.03 - SLIDE 23
Boolean Queries
• (Cat OR Dog) AND (Collar OR Leash)
– None of the following combinations works:
Doc #
1 2 3 4 5 6
CAT
X
X
DOG
X
X
COLLAR
X
X
LEASH
X
X
IS 240 – Spring 2010
7
2010.02.03 - SLIDE 24
Boolean Queries
• Usually expressed as INFIX operators in IR
– ((a AND b) OR (c AND b))
• NOT is UNARY PREFIX operator
– ((a AND b) OR (c AND (NOT b)))
• AND and OR can be n-ary operators
– (a AND b AND c AND d)
• Some rules - (De Morgan revisited)
– NOT(a) AND NOT(b) = NOT(a OR b)
– NOT(a) OR NOT(b)= NOT(a AND b)
– NOT(NOT(a)) = a
IS 240 – Spring 2010
2010.02.03 - SLIDE 25
Boolean Searching
Cracks
Width
measurement
Beams
Prestressed
concrete
IS 240 – Spring 2010
Formal Query:
cracks AND beams
AND Width_measurement
AND Prestressed_concrete
Relaxed Query:
Relaxed Query:
(C AND B AND P) OR
(C AND B AND P) OR
(C AND B AND W) OR
(C AND B AND W) OR
(C AND W AND P) OR
(C AND W AND P) OR
(B AND W AND P)
(B AND W AND P)
2010.02.03 - SLIDE 26
Boolean Logic
t1
t2
D9
D2
D1
m5
m3
D11
D10
m1
m2
m1 = t1 t2 t3
D4
D5
D3
m8
m6
m2 = t1 t2 t3
m3 = t1 t2 t3
D6
m4
m4 = t1 t2 t3
m5 = t1 t2 t3
m6 = t1 t2 t3
m7
m7 = t1 t2 t3
D8
D7
m8 = t1 t2 t3
t3
IS 240 – Spring 2010
2010.02.03 - SLIDE 27
Precedence Ordering
• In what order do we evaluate the
components of the Boolean expression?
– Parenthesis get done first
• (a or b) and (c or d)
• (a or (b and c) or d)
– Usually start from the left and work right (in
case of ties)
– Usually (if there are no parentheses)
• NOT before AND
• AND before OR
IS 240 – Spring 2010
2010.02.03 - SLIDE 28
Faceted Boolean Query
• Strategy: break query into facets
(polysemous with earlier meaning of facets)
– conjunction of disjunctions
(a1 OR a2 OR a3)
(b1 OR b2)
(c1 OR c2 OR c3 OR c4)
AND
– each facet expresses a topic
(“rain forest” OR jungle OR amazon)
(medicine OR remedy OR cure)
(Smith OR Zhou)
IS 240 – Spring 2010
AND
2010.02.03 - SLIDE 29
Ordering of Retrieved Documents
• Pure Boolean has no ordering
• In practice:
– order chronologically
– order by total number of “hits” on query terms
• What if one term has more hits than others?
• Is it better to one of each term or many of one term?
• Fancier methods have been investigated
– p-norm is most famous
• usually impractical to implement
• usually hard for user to understand
IS 240 – Spring 2010
2010.02.03 - SLIDE 30
Faceted Boolean Query
• Query still fails if one facet missing
• Alternative:
– Coordination level ranking
– Order results in terms of how many facets (disjuncts)
are satisfied
– Also called Quorum ranking, Overlap ranking, and
Best Match
• Problem: Facets still undifferentiated
• Alternative:
– Assign weights to facets
IS 240 – Spring 2010
2010.02.03 - SLIDE 31
Boolean Processing
• Boolean Processing (classic Boolean)
– Data structures for Query representation and
Boolean Operations
• Boolean processing logic and algorithms
• Extended Boolean Models
– Fuzzy Logic
– Others
IS 240 – Spring 2010
2010.02.03 - SLIDE 32
Boolean Processing
• All processing takes place on postings lists
• Different methods can be used for sorted
or unsorted postings lists
IS 240 – Spring 2010
2010.02.03 - SLIDE 33
Boolean Query Processing
• The query must be parsed to determine what
the:
– Search Words
– Optional field or index qualifications
– Boolean Operators
• Are and how they relate to one-another
• Typical parsing uses lexical analysers (like lex or
flex) along with parser generators like YACC,
BISON or Llgen
– These produce code to be compiled into programs.
– Example…
IS 240 – Spring 2010
2010.02.03 - SLIDE 34
Z39.50 Query Structure (ASN-1 Notation)
-- Query Definitions
Query ::= CHOICE{
type-0 [0] ANY,
type-1
type-2
[1] IMPLICIT RPNQuery,
[2] OCTET STRING,
type-100 [100] OCTET STRING,
type-101 [101] IMPLICIT RPNQuery,
type-102 [102] OCTET STRING}
IS 240 – Spring 2010
2010.02.03 - SLIDE 35
Z39.50 RPN Query
(ASN-1 Notation)
-- Definitions for RPN query
RPNQuery ::= SEQUENCE{
attributeSet
rpn
IS 240 – Spring 2010
AttributeSetId,
RPNStructure}
2010.02.03 - SLIDE 36
RPN Structure
RPNStructure ::= CHOICE{
op
[0] Operand,
rpnRpnOp [1] IMPLICIT SEQUENCE{
rpn1
RPNStructure,
rpn2
RPNStructure,
op
Operator }
}
IS 240 – Spring 2010
2010.02.03 - SLIDE 37
Operand
Operand ::= CHOICE{
attrTerm AttributesPlusTerm,
resultSet ResultSetId,
-- If version 2 is in force:
-- - If query type is 1, one of the above two must
be
chosen;
-- - resultAttr (below) may be used only if query
type is
101.
resultAttr ResultSetPlusAttributes}
IS 240 – Spring 2010
2010.02.03 - SLIDE 38
Operator
Operator ::= [46] CHOICE{
and
[0] IMPLICIT NULL,
or
[1] IMPLICIT NULL,
and-not [2] IMPLICIT NULL,
-- If version 2 is in force:
-- - For query type 1, one of the above three must be
chosen;
-- - prox (below) may be used only if query type is 101.
prox
IS 240 – Spring 2010
[3] IMPLICIT ProximityOperator}
2010.02.03 - SLIDE 39
Parse Result (Query Tree)
• Z39.50 queries…
Title XXX and Subject YYY
Operator: AND
left
Operand:
Index = Title
Value = XXX
IS 240 – Spring 2010
right
Operand:
Index = Subject
Value = YYY
2010.02.03 - SLIDE 40
Parse Results
• Subject XXX and (title yyy and author zzz)
Op: AND
Oper:
Index: Subject
Value: XXX
Op: AND
Oper:
Index: Title
Value: YYY
IS 240 – Spring 2010
Oper:
Index: Author
Value: ZZZ
2010.02.03 - SLIDE 41
Boolean AND (Sorted) Algorithm
• Choose the shortest list
• Create new list the same length as the
short list
– For each item in the short list
• Compare next item in longer list
– If greater than – go to next item in longer list
– If equal - add to new list and go to next item in both lists
– If less than - go to next item in short list
IS 240 – Spring 2010
2010.02.03 - SLIDE 42
Boolean AND Algorithm
2
5
7
8
15
29
35
100
135
140
155
189
190
195
198
IS 240 – Spring 2010
AND
2
8
9
12
15
22
28
50
68
77
84
100
120
128
135
138
141
150
155
188
189
195
=
2
8
15
100
135
155
189
195
2010.02.03 - SLIDE 43
Boolean OR (Sorted) Algorithm
• Choose the longer list
• Create new list the same length both lists
combined
– For each item in the longer list
• If less than or equal to the first item in the short list
– Add to new list
• Otherwise
– Add item from short list
– Compare next items in short and long lists
» If long item less then short item add long item and go to
next long item
» Otherwise – add from short list and go to next short item
– Once the short list runs out, add the remaining items
in the long list
IS 240 – Spring 2010
2010.02.03 - SLIDE 44
Boolean OR Algorithm
2
8
9
12
15
22
28
50
68
77
84
100
120
128
135
138
141
150
155
188
189
195
IS 240 – Spring 2010
OR
2
5
7
8
15
29
35
100
135
140
155
189
190
195
198
=
2
5
7
8
9
12
15
22
28
29
35
50
68
77
84
100
120
128
135
138
141
150
155
188
189
190
195
198
2010.02.03 - SLIDE 45
Boolean AND NOT(Sorted) Algorithm
Create new list the same length as the lefthand list
– For each item in the left-hand list
• Compare next item in not list
– If greater than – add to new list and go to next item in not
list
– If equal - go to next item in both lists
– If less than - go to next item in not list
IS 240 – Spring 2010
2010.02.03 - SLIDE 46
Boolean AND NOTAlgorithm
2
5
7
8
15
29
35
100
135
140
155
189
190
195
198
IS 240 – Spring 2010
AND NOT
2
8
9
12
15
22
28
50
68
77
84
100
120
128
135
138
141
150
155
188
189
195
=
5
7
15
29
35
140
190
198
2010.02.03 - SLIDE 47
Hashed Boolean AND (unsorted)
• Put each item in shortest list into hash
table
– For each item in other lists
• If hash entry exists, set flag in hash table entry (or
increment counter)
• Scan hash table contents
– If flag set (or counter == number of lists) add
to new list
IS 240 – Spring 2010
2010.02.03 - SLIDE 48
Hashed Boolean OR (unsorted)
• Put each item in EACH list into hash table
– If match increment counter (optional)
• Scan hash table contents and add to new
list
IS 240 – Spring 2010
2010.02.03 - SLIDE 49
Hashed Boolean AND NOT (unsorted)
• Put each item in left-hand list into hash
table
– For each item in NOT list
• If hash entry exists, remove it
• Scan hash table contents and add to new
list
IS 240 – Spring 2010
2010.02.03 - SLIDE 50
Boolean Summary
• Advantages
– simple queries are easy to understand
– relatively easy to implement
• Disadvantages
– difficult to specify what is wanted, particularly
in complex situations
– too much returned, or too little
– ordering not well determined
• Dominant IR model in commercial systems
until the WWW
IS 240 – Spring 2010
2010.02.03 - SLIDE 51
Basic Concepts for Extended Boolean
• Instead of binary values, terms in
documents and queries have a weight
(importance or some other statistical
property)
• Instead of binary set membership, sets are
“fuzzy” and the weights are used to
determine degree of membership.
• Degree of set membership can be used to
rank the results of a query
IS 240 – Spring 2010
2010.02.03 - SLIDE 52
Fuzzy Sets
• Introduced by Zadeh in 1965.
• If set {A} has value v(A) and {B} has value
v(B), where 0  v  1
• v(AB) = min(v(A), v(B))
• v(AB) = max(v(A), v(B))
• v(~A) = 1-v(A)
IS 240 – Spring 2010
2010.02.03 - SLIDE 53
Fuzzy Sets
• If we have three documents and three
terms…
– D1=(.4,.2,1), D2=(0,0,.8), D3=(.7, .4,0)
For search: t1t2 t3
For search: t1t2 t3
v(D1) = max(.4, .2, 1) = 1
v(D2) = max(0, 0, .8) = .8
v(D3) = max(.7, .4, 0) = .7
v(D1) = min(.4, .2, 1) = .2
v(D2) = min(0, 0, .8) = 0
v(D3) = min(.7, .4, 0) = 0
IS 240 – Spring 2010
2010.02.03 - SLIDE 54
Fuzzy Sets
• Fuzzy set membership of term to
document is f(A)[0,1]
• D1 = {(mesons, .8), (scattering, .4)}
• D2 = {(mesons, .5), (scattering, .6)}
• Query = MESONS AND SCATTERING
• RSV(D1) = MIN(.8,.4) = .4
• RSV(D2) = MIN(.5,.6) = .5
• D2 is ranked before D1 in the result set.
IS 240 – Spring 2010
2010.02.03 - SLIDE 55
Fuzzy Sets
• The set membership function can be, for
example, the relative term frequency
within a document, the IDF or any other
function providing weights to terms
• This means that the fuzzy methods use
sets of criteria for term weighting that are
the same or similar to those used in other
ranked retrieval methods (e.g., vector and
probabilistic methods)
IS 240 – Spring 2010
2010.02.03 - SLIDE 56
Robertson’s Critique of Fuzzy Sets
•
•
•
•
•
•
D1 = {(mesons, .4), (scattering, .4)}
D2 = {(mesons, .39), (scattering, .99)}
Query = MESONS AND SCATTERING
RSV(D1) = MIN(.4,.4) = .4
RSV(D2) = MIN(.39,.99) = .39
However, consistent with the Boolean
model:
– Query = t1t2t3…t100
– If D not indexed by t1 then it fails, even if D is
indexed by t2,…,t100
IS 240 – Spring 2010
2010.02.03 - SLIDE 57
Robertson’s critique of Fuzzy
• Fuzzy sets suffer from the same kind of
lack of discrimination among the retrieval
results almost to the same extent as
standard Boolean
• The rank of a document depends entirely
on the lowest or highest weighted term in
an AND or OR operation
IS 240 – Spring 2010
2010.02.03 - SLIDE 58
Other Fuzzy Approaches
• As described in the Modern Information
Retrieval (optional) text, a keyword
correlation matrix can be used to
determine set membership values, and
algebraic sums and products can be used
in place of MAX and MIN
• Not clear how this approach works in real
applications (or in tests like TREC)
because the testing has been on a small
scale
IS 240 – Spring 2010
2010.02.03 - SLIDE 59
Extended Boolean (P-Norm)
• Ed Fox’s Dissertation work with Salton
• Basic notion is that terms in a Boolean
query, and the Boolean Operators
themselves can have weights assigned to
them
• Binary weights means that queries behave
like standard Boolean
• 0 < Weights < 1 mean that queries behave
like a ranking system
• The system requires similarity measures
IS 240 – Spring 2010
2010.02.03 - SLIDE 60
Probabilistic Inclusion of Boolean
• Most probabilistic models attempt to
predict the probability that given a
particular query Q and document D, that
the searcher would find D relevant
• If we assume that Boolean criteria are to
be ANDed with a probabilistic query…
P(R | Q,D)  P(R | Qbool ,D)P(R | Qprob ,D)
1: if Boolean eval successful for D
P(R | Qbool ,D)  
0 : Otherwise
IS 240 – Spring 2010
2010.02.03 - SLIDE 61
Rubric – Extended Boolean
• Scans full text of documents and stores them
• User develops a hierarchy of concepts which
becomes the query
• Leaf nodes of the hierarchy are combinations of
text patterns
• A “fuzzy calculus” is used to propagate values
obtained at leaves up through the hierarchy to
obtain a single retrieval status value (or
“relevance” value)
• RUBRIC returns a ranked list of documents in
descending order of “relevance” values.
IS 240 – Spring 2010
2010.02.03 - SLIDE 62
RUBRIC Rules for Concepts & Weights
• Team | event => World_Series
• St._Louis_Cardinals | Milwaukee_Brewers =>
Team
• “Cardinals” => St._Louis_Cardinals (0.7)
• Cardinals_full_name => St._Louis_Cardinals
(0.9)
• Saint & “Louis” & “Cardinals” =>
Cardinals_full_name
• “St.” => Saint (0.9)
• “Saint” => Saint
• “Brewers” => Milwaukee_Brewers (0.5)
IS 240 – Spring 2010
2010.02.03 - SLIDE 63
RUBRIC Rules for Concepts & Weights
• “Milwaukee Brewers” =>
Milwaukee_Brewers (0.9)
• “World Series” => event
• Baseball_championship => event (0.9)
• Baseball & Championship =>
Baseball_championship
• “ball” => Baseball (0.5)
• “baseball” => Baseball
• “championship” => Championship (0.7)
IS 240 – Spring 2010
2010.02.03 - SLIDE 64
RUBRIC combination methods
V(V1 or V2) = MAX(V1, V2)
V(V1 and V2) = MIN(V1, V2)
i.e., classic fuzzy matching,
but with the addition…
V(level n) = Cn*V(level n-1)
IS 240 – Spring 2010
2010.02.03 - SLIDE 65
Rule Evaluation Tree
World_Series (0)
Event (0)
Team (0)
0.9
St._Louis_Cardinals (0)
0.7
0.5
0.9
“Cardinals” (0)
“World Series”
Milwaukee_brewers (0)
“Brewers” (0)
Baseball_championship (0)
0.9
“Milwaukee Brewers” (0)
Championship (0)
Cardinals_full_name (0)
Baseball (0)
Saint (0)
“Louis” (0)
0.9
“Cardinals” (0)
0.5
“ball” (0)
“St.” (0)
0.7
“baseball” (0) “championship” (0)
“Saint” (0)
IS 240 – Spring 2010
2010.02.03 - SLIDE 66
Rule Evaluation Tree
World_Series (0)
Event (0)
Team (0)
0.9
St._Louis_Cardinals (0)
0.7
0.5
0.9
“Cardinals” (0)
“World Series”
Milwaukee_brewers (0)
“Brewers” (0)
Baseball_championship (0)
0.9
“Milwaukee Brewers” (0)
Championship (0)
Cardinals_full_name (0)
Baseball (0)
Saint (0)
“Louis” (0)
0.9
“Cardinals” (0)
0.5
“ball” (1.0)
“St.” (0)
“Saint” (0)
IS 240 – Spring 2010
0.7
“baseball” (1.0)“championship” (1.0)
Document containing “ball”, “baseball” & “championship”
2010.02.03 - SLIDE 67
Rule Evaluation Tree
World_Series (0)
Event (0)
Team (0)
0.9
St._Louis_Cardinals (0)
0.7
0.5
0.9
“Cardinals” (0)
“World Series”
Milwaukee_brewers (0)
“Brewers” (0)
Baseball_championship (0)
0.9
“Milwaukee Brewers” (0)
Championship (0.7)
Cardinals_full_name (0)
Baseball (1.0)
Saint (0)
“Louis” (0)
0.9
“Cardinals” (0)
0.5
“ball” (1.0)
“St.” (0)
0.7
“baseball” (1.0)“championship” (1.0)
“Saint” (0)
IS 240 – Spring 2010
2010.02.03 - SLIDE 68
Rule Evaluation Tree
World_Series (0)
Event (0)
Team (0)
0.9
St._Louis_Cardinals (0)
0.7
0.5
0.9
“Cardinals” (0)
“World Series”
Milwaukee_brewers (0)
“Brewers” (0)
Baseball_championship (0.7)
0.9
“Milwaukee Brewers” (0)
Championship (0.7)
Cardinals_full_name (0)
Baseball (1.0)
Saint (0)
“Louis” (0)
0.9
“Cardinals” (0)
0.5
“ball” (1.0)
“St.” (0)
0.7
“baseball” (1.0)“championship” (1.0)
“Saint” (0)
IS 240 – Spring 2010
2010.02.03 - SLIDE 69
Rule Evaluation Tree
World_Series (0)
Event (0.63)
Team (0)
0.9
St._Louis_Cardinals (0)
0.7
0.5
0.9
“Cardinals” (0)
“World Series”
Milwaukee_brewers (0)
“Brewers” (0)
Baseball_championship (0.7)
0.9
“Milwaukee Brewers” (0)
Championship (0.7)
Cardinals_full_name (0)
Baseball (1.0)
Saint (0)
“Louis” (0)
0.9
“Cardinals” (0)
0.5
“ball” (1.0)
“St.” (0)
0.7
“baseball” (1.0)“championship” (1.0)
“Saint” (0)
IS 240 – Spring 2010
2010.02.03 - SLIDE 70
Rule Evaluation Tree
World_Series (0.63)
Event (0.63)
Team (0)
0.9
St._Louis_Cardinals (0)
0.7
0.5
0.9
“Cardinals” (0)
“World Series”
Milwaukee_brewers (0)
“Brewers” (0)
Baseball_championship (0.7)
0.9
“Milwaukee Brewers” (0)
Championship (0.7)
Cardinals_full_name (0)
Baseball (1.0)
Saint (0)
“Louis” (0)
0.9
“Cardinals” (0)
0.5
“ball” (1.0)
“St.” (0)
0.7
“baseball” (1.0)“championship” (1.0)
“Saint” (0)
IS 240 – Spring 2010
2010.02.03 - SLIDE 71
RUBRIC Terrorism Query
Terrorism
0.7
0.2
Event
0.1
0.3
Actor
Reason
effect
Encounter
Takeover
Specific General
actor
actor
Killing
Slaying Shooting
Bombing
0.8
Device 0.6
Explosion
IS 240 – Spring 2010
Kidnapping
0.7
ransom
0.5
Kidnap
event
2010.02.03 - SLIDE 72
Non-Boolean IR
• Need to measure some similarity between
the query and the document
• The basic notion is that documents that
are somehow similar to a query, are likely
to be relevant responses for that query
• We will revisit this notion again and see
how the Language Modelling approach to
IR has taken it to a new level
IS 240 – Spring 2010
2010.02.03 - SLIDE 73
Similarity Measures (Set-based)
Assuming that Q and D are the sets of terms associated with a
Query and Document:
|QD|
|QD|
2
|Q|| D|
Simple matching (coordination level match)
Dice’s Coefficient
|QD|
|QD|
|QD|
1
Jaccard’s Coefficient
1
|Q | | D |
|QD|
min(|Q |, | D |)
2
IS 240 – Spring 2010
2
Cosine Coefficient
Overlap Coefficient
2010.02.03 - SLIDE 74
Next Week
• Moving beyond Boole…
• The vector space model
IS 240 – Spring 2010
2010.02.03 - SLIDE 75
Descargar

Document