How Search Engines Work:
A Technology Overview
Avi Rappoport
Search Tools Consulting
www.searchtools.com
[email protected]
UC Berkeley SIMS class 202
September 16, 2004
Purpose of Search Engines
Helping people find what they’re looking for
• Starts with an “information need”
• Convert to a query
• Gets results
In the materials available
• Web pages
• Other formats
• Deep Web
UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
2
Search is Not a Panacea
Search can’t find what’s not there
• The content is hugely important
Information Architecture is vital
Usable sites have good navigation and
structure
UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
3
Search Looks Simple
UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
4
But It's Not
Index ahead of time
• Find files or records
• Open each one and read it
• Store each word in a searchable index
Provide search forms
• Match the query terms with words in the index
• Sort documents by relevance
Display results
UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
5
Search Processing
UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
6
Search is Mostly Invisible
Like an iceberg,
2/3 below water
user
interface
content
UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
search
functionality
7
Text Search vs. Database Query
Text search works for structured content
Keyword search vs. SQL queries
Approximate vs. exact match
Multiple sources of content
Response time and database resources
Relevance ranking, very important
Works in the real world (e.g. EBay)
UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
8
Search is Only as Good as the Content
Users blame the search engine
• Even when the content is unavailable
Understand the scope of site or intranet
•
•
•
•
•
•
Kinds of information
Divided sites: products / corporate info
Dates
Languages
Sources and data silos: CMSs, databases...
Update processes
UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
9
Making a Searchable Index
Store text to search it later
Many ways to gather text
•
•
•
•
•
Crawl (spider) via HTTP
Read files on file servers
Access databases (HTTP or API)
Data silos via local APIs
Applications, CMSs, via Web Services
Security and Access Control
UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
10
Robot Indexing Diagram
Sour
UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
11
What the Index Needs
Basic information for document or record
• File name / URL / record ID
• Title or equivalent
• Size, date, MIME type
Full text of item
More metadata
• Product name, picture ID
• Category, topic, or subject
• Other attributes, for relevance ranking and display
UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
12
Simple Index Diagram
UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
13
More Complex Index Processing
UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
14
Index Issues
Stopwords
Stemming
Metadata
• Explicit (tags)
• Implicit (context)
Semantics
• CMS and Database fields
• XML tags and attributes
UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
15
Search Query Processing
What happens after you click the search
button, and before retrieval starts.
Usually in this order
•
•
•
•
•
Handle character set, maybe language
Look for operators and organize the query
Look for field names or metadata
Extract words (just like the indexer)
Deal with letter casing
UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
16
Search and Retrieval
Retrieval: find files with query terms
Not the same as relevance ranking
Recall: find all
relevant items
Precision: find only
relevant items
Increasing one
decreases the
other
UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
17
Retrieval = Matching
Single-word queries
• Find items containing that word
Multi-word queries: combine lists
• Any: every item with any query word
• All: only items with every word
• Phrases: find only items with all words in order
Boolean and complex queries
• Use algorithm to combine lists
UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
18
Why Searches Fail
Empty search
Nothing on the site on that topic (scope)
Misspelling or typing mistakes
Vocabulary differences
Restrictive search defaults
Restrictive search choices
Software failure
UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
19
LII.org No-Matches Page
UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
20
Relevance Ranking
Theory: sort the matching items, so the
most relevant ones appear first
Can't really know what the user wants
Relevance is hard to define and situational
Short queries tend to be deeply ambiguous
• What do people mean when they type “bank”?
First 10 results are the most important
The more transparent, the better
UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
21
Relevance Processing
Sorting documents on various criteria
Start with words matching query terms
Citation and link analysis
• Like old library Citation Indexes
• Ted Nelson - not only hypertext, but the links
• Google PageRank
• Incoming links
• Authority of linkers
Taxonomies and external metadata
UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
22
TF-IDF Ranking Algorithm
Term frequency in the item
Inverse document frequency of term
• Rare words are likely to be more important
wij = weight of Term Tj in Document Di
tfij = frequency of Term Tj in Document Dj
N = number of Documents in collection
n = number of Documents where term Tj
occurs at least once
From Salton 1989
UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
23
Other Algorithms
Vector space
Probabilistic (binary interdependence)
Fuzzy set theory
Bayesian statistical analysis
Latent semantic indexing
Neural networks
Machine learning
All require sophisticated queries
See MIR, chapter 2
UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
24
Relevance Heuristics
Heuristics are rules of thumb
• Not algorithms, not math
Search Relevance Ranking Heuristics
•
•
•
•
Documents containing all search words
Search words as a phrase
Matches in title tag
Matches in other metadata
Based on real-word user behavior
UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
25
Search Results Interface
What users see after they click the Search
button
The most visible part of search
Elements of the results page
•
•
•
•
Page layout and navigation
Results header
List of results items
Results footer
UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
26
Many Experiments in Interface
UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
27
Back to Simplicity
UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
28
Search Suggestions (aka Best Bets)
Human judgment beats algorithms
Great for frequent, ambiguous searches
• Use search log to identify best candidates
Recommend good starting pages
• Product information, FAQs, etc.
Requires human resources
• That means money and time
More static than algorithmic search
UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
29
MSU Keywords
UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
30
Siemens Results
UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
31
Cooks.com Results
UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
32
Salon.com Results
UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
33
Faceted Metadata Search & Browse
Leverage content structure
• database fields (i.e. cruise amenities)
• document metadata (news article bylines)
Provide both search and browse
•
•
•
•
Support information foraging
Integrate navigation with results
Not just subject taxonomies
Display only fruitful paths, no dead ends
Supported by academic research
• Marti Hearst, UCB SIMS, flamenco.berkeley.edu
UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
34
Faceted Search: Information
UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
35
Faceted Search: Online Catalog
UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
36
Search Metrics and Analytics
Metrics
•
•
•
•
Number of searches
Number of no-matches searches
Traffic from search to high-value pages
Relate search changes to other metrics
Search Log Analysis
• Top 5% searches: phrases and words
• Top no-matches searches
• Use as market research
UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
37
Search Will Never Be Perfect
Search engines can’t read minds
• User queries are short and ambiguous
Some things will help
•
•
•
•
•
•
Design a usable interface
Show match words in context
Keep index current and complete
Adjust heuristic weighting
Maintain suggestions and synonyms
Consider faceted metadata search
UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
38
Search Engines, sorta Rocket Science
Questions and discussion
Contact me
• [email protected]
• www.searchtools.com
This presentation:
• www.searchtools.com/slides/sims/202-04/
UCB SIMS 202, Sept. 2004
Avi Rappoport, Search Tools Consulting
39
Descargar

How Search Engines Work - UC Berkeley School of …