Query Models
• Use
• Types
• What do search engines do
What we have covered
•
•
•
•
•
•
What is IR
Evaluation
Tokenization and properties of text
Vector models of documents
Web crawling
This time
– Query models
Query Engine
Index
Interface
Indexer
Users
Crawler
Web
A Typical Web Search Engine
Users
Query Engine
Index
Interface
Indexer
Crawler
Off-line
Web
Online vs offline processing
Queries
Query Engine
Index
Interface
Indexer
Users
Crawler
Web
A Typical Web Search Engine
Why the interest in Queries?
• Queries are ways we interact with
IR systems
– Expression of an information need
• Nonquery methods?
• Types of queries?
Issues with Query Structures
Matching and ranking criteria
• Given a query, what documents are
retrieved?
• In what order (rank)?
Types of Query Structures
Query Models (languages) – most common
• Boolean Queries
• Extended-Boolean Queries
– Vector space Boolean
• Vector queries
• Natural Language Queries
• Others?
Simple query language: Boolean
– Earliest query model
– Terms + Connectors (or operators)
– terms
•
•
•
•
words
normalized (stemmed) words
phrases
thesaurus terms
– connectors
• AND
• OR
• NOT
– Ex: Beethoven AND sonata
Truth Tables – Boolean Logic
P Q
0 0
0 1
1 0
1 1
NOT P
TRUE
TRUE
FALSE
FALSE
Presence of P, P = 1
Absence of P, P = 0
True = 1
False = 0
P AND Q
FALSE
FALSE
FALSE
TRUE
P OR Q
FALSE
TRUE
TRUE
TRUE
Problems with Boolean Queries
• Ranking?
• Incorrect interpretation of Boolean connectives
AND and OR
• Example - Seeking Saturday entertainment
Queries:
• Dinner AND sports AND symphony
• Dinner OR sports OR symphony
• Dinner AND sports OR symphony
Order of precedence of operators
Example of query. Is
• A AND B
• the same as
• B AND A
• Why?
Sample Boolean Queries
• Cat
• Cat OR Dog
• Cat AND Dog
• (Cat AND Dog)
• (Cat AND Dog) OR Collar
• (Cat AND Dog) OR (Collar AND Leash)
• (Cat OR Dog) AND (Collar OR Leash)
Satisfaction of Boolean Query
• (Cat OR Dog) AND (Collar OR Leash)
– Each of the following column combinations
works:
•
•
•
•
Cat
Dog
Collar
Leash
x
x
x
x
x
x
x
Others?
x
x
x
x
x
x
x
x
x
x
Order of Preference
– Define order of preference
• EX: a OR b AND c
– Infix notation
• Parenthesis evaluated 1st with left to right
precedence of operators
• Next NOT’s are applied
• Then AND’s
• Then OR’s
– a OR b AND c becomes
– a OR (b AND c)
Infix Notation
– Usually expressed as INFIX operators in IR
• ((a AND b) OR (c AND b))
– NOT is UNARY PREFIX operator
• ((a AND b) OR (c AND (NOT b)))
– AND and OR can be n-ary operators
• (a AND b AND c AND d)
– Some rules - (De Morgan revisited)
• NOT(a) AND NOT(b) = NOT(a OR b)
• NOT(a) OR NOT(b)= NOT(a AND b)
• NOT(NOT(a)) = a
DNFs and CNFs
All queries can be rewritten as
– Disjunctive Normal Forms (DNFs)
– Conjunctive Normal Forms (CNFs)
• DNF Constituents:
–
–
–
–
Terms (words or phrases)
Conjuncts (terms joined by ANDs)
Disjuncts (conjuncts joined by ORs)
Ex: (A AND B) OR (A AND NOTC)
• CNF Constituents:
–
–
–
–
Terms (words or phrases)
Disjuncts (terms joined by ORs)
Conjuncts (disjuncts joined by ANDs)
Ex: (A OR B) AND (A OR NOTC)
Effect of CNFs
• All complex Boolean queries can be
simplified
• Why do reference librarians like CNFs?
• AND’s reduce the size of the set returned
and are easily expandable
– So do minus’s
Boolean Searching
“Measurement of the
width of cracks in
prestressed
concrete beams”
Cracks
Formal Query:
cracks AND beams
AND Width_measurement
AND Prestressed_concrete
Width
measurement
Beams
Prestressed
concrete
Relaxed Query:
(C AND B AND P) OR
(C AND B AND W) OR
(C AND W AND P) OR
(B AND W AND P)
Ordering (ranking) of Retrieved
Documents
• Pure Boolean has no ordering
• Term is there or it’s not
• In practice:
– order chronologically
– order by total number of “hits” on query terms
• What if one term has more hits than others?
• Is it better to have one of each term or many of one term?
Boolean Query - Summary
• Advantages
– simple queries are easy to understand
– relatively easy to implement
• Disadvantages
– difficult to specify what is wanted
– too much returned, or too little
– ordering not well determined
• Dominant language in commercial systems until
the WWW
Vector Space Model
• Queries treated as small documents
• Documents and queries are represented as vectors
in term space
– Terms are usually stems
– Documents represented by binary vectors of terms
• Query and Document weights are based on length
and direction of their vector
• A vector distance measure between the query and
documents is used to rank retrieved documents
Document Vectors
• Documents are represented as “bags of words”
– Words are terms with no order
• Represented as vectors when used
computationally
– A vector is like an array of floating point values
– Has direction and magnitude
– Each vector holds a place for every term in the
collection
– Therefore, most vectors are sparse
Queries
Vocabulary (dog, house, white)
Queries:
• dog
(1,0,0)
• house
(0,1,0)
• white
(0,0,1)
• house and dog
(1,1,0)
• dog and house
(1,1,0)
• Show 3-D space plot
Documents (queries) in Vector
Space
t3
D1
D9
D11
D5
D3
D10
D4 D2
t1
t2
D7
D8
D6
Documents in 3D Space
Assumption: Documents that are “close together”
in space are similar in meaning.
Vector Query Problems
• Significance of queries
– Can different values be placed on the different
terms – eg. 2dog 1house
• Scaling – size of vectors
• Number of words in the dictionary?
• 100,000
Proximity Searches
• Proximity: terms occur within K positions of one
another
– pen w/5 paper
• A “Near” function can be more vague
– near(pen, paper)
• Sometimes order can be specified
• Also, Phrases and Collocations
– “United Nations” “Bill Clinton”
• Phrase Variants
– “retrieval of information” “information retrieval”
Proximity - wikipedia
Filters/field limiters
• Filters: Reduce set of candidate docs
• Often specified simultaneous with query
• Usually restrictions on metadata
– restrict by:
•
•
•
•
•
date range
internet domain (.edu .com .berkeley.edu)
author
size
limit number of documents returned
Natural Language Queries
• The “Holy Grail” of information retrieval
• Issues in Natural Language Processing
–
–
–
–
–
syntax
semantics
pragmatics
speech understanding
speech generation
What do search engines do?
• Tags
– Title
– Meta
• Term frequency and location
• Popularity
• Others
What do search engines do?
• Collection of various methods, sometimes
called pseudo-Boolean
– quotes, minus, plus
– pseudo AND
• truth in vs in truth
– stop words?
What does Google do?
• Basic search
• Search operators
UC Berkeley Search Engine Guide
http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/SearchEngines.html
UC Berkeley Search Engine Guide
http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/SearchEngines.html
Old:Search Engine Query Differences
Older: Search engine query models
Search query string
•
The portion of a dynamic URL that contains the search parameters when a
dynamic Web site is searched. Query strings do not exist until a user plugs the
variables into a database search, at which point the search engine will create
the dynamic URL with the query string based on the results. Query strings
typically contain ? and % characters.
Lucene Basics
• Searches are supported through a wide
range of Query options
–
–
–
–
–
Keyword
Terms
Phrases
Wildcards
Many, many more
QueryParser syntax examples
Query expression
Document matches if…
java
Contains the term java in the default field
java junit
java OR junit
Contains the term java or junit or both in the default
field (the default operator can be changed to AND)
+java +junit
java AND junit
Contains both java and junit in the default field
title:ant
Contains the term ant in the title field
title:extreme –subject:sports
Contains extreme in the title and not sports in subject
(agile OR extreme) AND java Boolean expression matches
title:”junit in action”
Phrase matches in title
title:”junit action”~5
Proximity matches (within 5) in title
java*
Wildcard matches
java~
Fuzzy matches
lastmodified:[1/1/09 TO
12/31/09]
Range matches
Types of Query Structures
Query Models (languages) – most common
• Boolean Queries
– Old model
• Vector queries
– Very common - in all search engines to some extent
• Web queries
– Search engines
• Probabilistic models
– Mostly research (Indri)
• Holy grail of search
– Natural Language Queries
Descargar

query structures