Web Search Engines and Web Data Mining
basic techniques, architectures, tools, and current trends
CATT Short Course - July 12, 2002
Instructor: Prof. Torsten Suel
Department of Computer and Information Science
Polytechnic University
[email protected]
http://cis.poly.edu/suel/
This material was developed for a one-day short course on
search engines and search technology taught at Polytechnic
University in Brooklyn on June 11, 2004. The course was
organized by the Center for Advanced Technology in
Telecommunication (CATT) at Polytechnic University; see
http://catt.poly.edu for more information.
An electronic version of these slides (except for handwritten
additions) is available at
http://cis.poly.edu/suel/webshort/
Which also contains a detailed list of additional pointers and
bibliographic references.
© 2004 Torsten Suel ([email protected])
Goal of this course:
• learn how search engines work
• learn about other types of web search tools
and applications
• learn how to build and deploy such tools
- basic information retrieval techniques
- what software tools to use
- system architectures and performance
Target Audience:
• technically oriented people interested in how it works
• developers who may need to build or deploy tools
Overview:
• Part I - Introduction:
- Motivation (why web search tools?)
- Examples of search tools
- How the web works
- Basic search engine structure
- Introduction to Information Retrieval
(coffee break: 11:00-11:15)
• Part II – Search Engine Basics
- Search engine architecture
- Web crawling basics: following links, robot exclusion, ..
- Storage
- Text indexing
- Querying and term-based ranking
- Basic link-based ranking
(lunch: 12:30-1:30)
Overview: (cont.)
• Part III – Applications, Systems, and Tools
- Types of search tools
- Available software tools
- Search and DBMSs
- Application scenarios:
* Major search engine
* Focused Data Collection and Analysis
* Browsing/Search Assistants
* Site and Enterprise Search
* Geographic Web Search
- Example: citeseer system
- Example: Internet Archive
- Using search engines
- Search engine optimization and manipulation
(break 2:45 -3:00)
Overview: (cont.)
• Part IV - Advanced Techniques
- High-performance crawling
- Recrawling and focused crawling
- Link-based ranking (Pagerank, HITS)
- Structural analysis of the web graph
- Optimizing query execution
- Parallel search engines and scaling
- Meta search engines
- Document clustering and duplicate detection
Not Covered:
• Semi-structured data and XML
• Web accessible databases
- crawling the hidden web
- efficient query processing on remote data sources
- wrapper construction
• Extracting relational data from the web
• Shopping bots
• Image and multimedia search
• Peer-to-peer search technologies
• advanced IR: categorization, clustering, ...
• natural language processing (NLP)
1 – Introduction and Motivation: What is the Web?
What is the web? (another view)
• pages containing (fairly unstructured) text
• images, audio, etc. embedded in pages
• structure defined using HTML
(Hypertext Markup Language)
• hyperlinks between pages!
• over 10 billion pages
• over 150 billion hyperlinks
a giant graph!
How is the web organized?
Web Server
(Host)
Web Server
(Host)
www.cnn.com
www.poly.edu
• pages reside in servers
• related pages in sites
• local versus global links
• logical vs. physical structure
Web Server
(Host)
www.irs.gov
How do we find pages on the web?
• more than 10 billion pages
• more than 150 billion hyperlinks
• plus images, movies, .. , database content
we need specialized tools for finding
pages and information
2 - Overview of web search tools
• Major search engines
(google, fast, altavista, teoma, wisenut)
• Web directories
(yahoo, open directory project)
• Specialized search engines (citeseer, achoo, findlaw)
• Local search engines
(for one site or domain)
• Meta search engines (dogpile, mamma, search.com, vivisimo)
• Personal search assistants
(alexa, google toolbar)
• Comparison shopping
(mysimon, pricewatch, dealtime)
• Image search
(ditto, visoo (gone), major engines)
• Natural language questions (askjeeves?)
• Deep Web/Database search (completeplanet/brightplanet)
Major search engines:
Basic structure of a search engine:
indexing
Crawler
Index
disks
Query: “computer”
Search.com
look up
Ranking:
• return best pages first
• term- vs. link-based approaches
Challenges for search engines:
• coverage
(need to cover large part of the web)
need to crawl and store massive data sets
• good ranking
(in the case of broad queries)
smart information retrieval techniques
• freshness
(need to update content)
frequent recrawling of content
• user load
(up to 10000 queries/sec - Google)
many queries on massive data
• manipulation
(sites want to be listed first)
naïve techniques will be exploited quickly
Web directories:
(Yahoo, Open Directory Project)
Topic hierarchy:
sports
baseball
everything
politics
foreign
hockey
domestic
soccer
business
health
....
....
....
....
….
Challenges:
• designing topic hierarchy
• automatic classification: “what is this page about?”
• Yahoo and Open Directory mostly human-based
....
Specialized search engines: (achoo, findlaw)
• be the best on one particular topic
• use domain-specific knowledge
• limited resources
do not crawl the entire web!
• focused crawling techniques or meta search
Meta search engines:
(dogpile, vivisimo, search.com, mamma)
• uses other search engines to answer questions
• ask the right specialized search engine, or
• combine results from several large engines
• may need to be “familiar” with thousands of engines
Personal Search Assistants: (Alexa, Google Toolbar)
• embedded into browser
• can suggest “related pages”
• search by “highlighting text”
can use context
• may exploit individual browsing behavior
• may collect and aggregate browsing information
privacy issues
• architectures:
- on top of crawler-based search engine (alexa, google), or
- based on meta search (MIT Powerscout)
- based on limited crawls by client or proxy
(MIT Letizia, Stanford PowerBrowser)
Perspective:
systems
machine learning
information
retrieval
AI
natural
language
processing
library &
information
science
algorithms
databases
Example #1: Link-based ranking techniques
• Ragerank (Brin&Page/Google)
“significance of a page depends on
significance of those referencing it”
• HITS (Kleinberg/IBM)
“Hubs and Authorities”
Example #2: Crawling 100 million pages
• crawler architecture
• networking requirements
• data structures: size and robustness
• crawling etiquette
• concerns for webmasters
Example #3: Analysis of the web graph
• What does the web look like?
(diameter, connectivity, in-degree)
• Why are there so many bipartite cliques?
(IBM)
(and why do we care?)
(2,3-clique)
• How do you compute with a 500 million node graph?
Example #4: Finding near duplicates on the web
• given 1 billion pages, find duplicates (15 TB)
• more difficult: find very similar pages (clustering)
• find mirror sites and replicated collections
• process collection to account for duplicates
3 - How the web works
(more details)
Fetching “www.cnn.com/world/index.html”
give me the file “/world/index.html”
Web Server
Desktop
(with browser)
www.cnn.com
here is the file: “...”
Three Main Ingredients:
• Naming: URL (uniform resource locators)
(used to identify and locate objects)
• Communication: HTTP (hypertext transfer protocol)
(used to request and transfer objects)
• Rendering: HTML (hypertext markup language)
(used to defined how object should be presented to user)
Client Server Paradigm:
• Client (browser) used HTTP to ask server (web
server) for object identified by URI, and renders this
object according to rules defined by HTML
Domain Name Service:
root DNS
server
where is www.poly.edu located?
answer: 123.238.24.10
where is www.cnn.com located?
2.
3.
desktop
(or crawler)
1.
4.
6.
local DNS
server
5.
DNS server
for cnn.com
Names, addresses, hosts, and sites
• one machine can have several host names and IP addresses
• one host name may correspond to several machines
• one host can have several “sites” (what is a site?)
• one “site” on several hosts
• issues: detecting duplicates, crawling, local vs. global links
HTTP:
GET /world/index.html HTTP/1.0
User-Agent: Mozilla/3.0 (Windows 95/NT)
Host: www.cnn.com
From: …
Referer: …
If-Modified-Since: ...
desktop or
crawler
web server
HTTP/1.0 200 OK
Server: Netscape-Communications/1.1
Date: Tuesday, 8-Feb-99 01:22:04 GMT
Last-modified: Thursday, 3-Feb-99 10:44:11 GMT
Content-length: 5462
Content-type: text/html
<the html file>
HTML:
HTTP & HTML issues:
• “dynamic” URLs:
http://www.google.com/search?q=brooklyn
http://www.amazon.com/exec/obidos/ASIN/1558605703/qid%3D9…
http:/cis.poly.edu/search/search.cgi
• result file can be computed by server in arbitrary manner!
• persistent connections in HTTP/1.1
• mime types and extensions
• frames
• redirects
• javascript/java/JEB/flash/activeX ????????
4 – Basic Search Engine Structure:
indexing
Crawler
Index
disks
Query: “computer”
Search.com
look up
Crawling
Crawler
disks
• fetches pages from the web
• starts at set of “seed pages”
• parses fetched pages for hyperlinks
• then follows those links (e.g., BFS)
• variations:
- recrawling
- focused crawling
- random walks
Indexing
indexing
disks
aardvark
.
.
.
.
.
arm
armada
armadillo
armani
.
.
.
.
.
zebra
3452, 11437, …..
4, 19, 29, 98, 143, ...
145, 457, 789, ...
678, 2134, 3970, ...
90, 256, 372, 511, ...
602, 1189, 3209, ...
“inverted index”
• parse & build lexicon & build index
• index very large
I/O-efficient techniques needed
Querying
Boolean queries:
(zebra AND armadillo) OR armani
compute unions/intersections of lists
Ranked queries: zebra, armadillo
give scores to all docs in union
aardvark
.
.
.
.
.
arm
armada
armadillo
armani
.
.
.
.
.
zebra
3452, 11437, …..
4, 19, 29, 98, 143, ...
145, 457, 789, ...
678, 2134, 3970, ...
90, 256, 372, 511, ...
602, 1189, 3209, ...
look up
5 – Introduction to Information Retrieval
(Information Retrieval from Memex to Google)
• (a) Introduction: What is IR?
• (b) Historical development of IR
(plus Information Science and Library Science)
• (c) Fundamentals: Indexing, search, data acquisition
• (d) Mathematical Techniques: vector space model,
ranking and classification
(a) Introduction:
Information Retrieval (IR):
“IR is concerned with the representation, storage,
organization of, and access to information items”
• focus on automatic processing (indexing, clustering,
search) of unstructured data (text, images, audio, ...)
• subfield of Computer Science, but with roots in
Library Science, Information Science, and Linguistics
• In this course, we focus on text data
• Applications:
- searching in a library catalog
- categorizing a collection of medical articles by area
- web search engines
(a) Introduction:
Historical Perspective:
• WW2 era computers built for number crunching:
ballistics computations, code breaking
• since earliest days, also used to “organize information”
- Memex (Vannevar Bush, 1945)
• today, this is the main application!
- store and organize data and documents
- model organizations and work processes
• Computer
Organizer
• … however, no clean separation
(a) Introduction:
Structured vs. Unstructured Data:
• IR: lesser known cousin of field of Databases
• Databases: focus on structured data
• IR: unstructured data: “documents”
- scientific articles, novels, poems, jokes, web pages
• Information retrieval vs. data retrieval
• IR focused on human user (?)
• Challenges: semistructured data, closing the gap
(a) Introduction:
Recall: the fraction of the relevant documents (R) that is
successfully retrieved:
|Ra|
Recall =
|R|
Precision: the fraction of the retrieved documents (A) that
are relevant:
|Ra|
Precision = |A|
Collection
Relevant Documents
in Answer Set |Ra|
• fundamental trade-off
• policy ramifications
Relevant Docs |R|
Answer Set |A|
(a) Introduction:
Text IR Operations:
• Indexing: create a full-text or keyword index
• Querying/ranking: find documents (most) relevant to query
• Clustering: group documents in sets of similar ones
• Categorization: assign documents to given set of categories
• Citation Analysis: find frequently cited or influential papers
• (Extraction/tagging: identify names of people, organizations)
• (Summarization: automatically create a summary)
• (Translation: translate text between languages)
Note: not the same as NLP, string processing
(a) Introduction:
Text IR Applications:
• Digital Libraries and Library Catalog Systems
- full-text indexing and search of electronic documents
- keyword search for other documents
- categorization of medical or news articles
- browsing by categories and relationships
• Web Search Engines
- very large amounts of data, data acquisition problem
- many short documents, very informal, manipulation
• National Security and Competitive Intelligence:
- web surveillance (extremist sites, competitors)
- filtering and querying intercepted communications
- searching internal documents (files from branch offices)
- analyzing phone call graphs, server and search engine logs
(a) Introduction:
Online Analytical Processing:
(OLAP)
• scenario: analyst posing questions (queries) to a system
• preferably based on powerful query language (e.g., SQL)
• goal: to discover properties of the data
• for text/IR: need query language (VQL?)
Data Mining:
• how to automatically find interesting patterns and
rules in data
• or how to find exceptions to rules (outliers)
• no clear boundary between data mining and OLAP
• relevant to structured and unstructured data
• example: association rules - products, words, links, books
(a) Introduction:
IR versus NLP:
(Natural Language Processing
or Computational Linguistics)
• IR does not analyze grammar, local structure
(document as a set or bag or words)
• NLP analyzes sentence structure, grammar (shallow/deep parsing)
• IR: simple statistical methods, good for search & categorization
• NLP: good for automatic translation, summarization, extraction
• IR is largely language independent
• NLP uses knowledge about the language
(WordNet, thesaurus)
• NLP: rooted in linguistics, grammar vs. statistical NLP
• web search: so far, NLP has not proven that critical yet
(a) Introduction:
IR versus databases:
• IR based on statistical techniques
• IR “fuzzy”, many different techniques
• databases: very precisely defined semantics
• databases: “pounding reality into table form”
Machine Learning:
(sometimes part of AI)
• symbolic/logic vs. statistical approaches
• set of general statistical techniques for learning from past data
(how to profile correctly)
• widely used in IR and increasingly in NLP
(b) Historical Development
Before 1945:
• Babylonians, Greeks, Romans, etc
• Indexing and creation of concordances
- “algorithms for full-text indexing” !
- e.g., Dobson 1940-65 Byron concordance: “last handmade one”
• Library of Congress and Dewey Classification
• Documentalism
• Bibliometric and Informetric distributions:
- Bradford, Lotka, Zipf, Pareto, Yule (1920s-40s)
• Citation Analysis and Social Network Analysis
• Microfilm rapid selectors:
- E. Goldberg 1931
- V. Bush 1939
(b) Historical Development
Zipf and Bibliometric/Informetric Distributions
• distributions observed in many fields of science
• some things occur much more frequently than others
“some words are used much more often than others
“some authors are much more often cited than others”
“some people have much more money than others”
“some animal species are more common than others”
• follows a particular class of distributions:
f(i) ~ i
-z
• large z means large skew
• heavy-tailed: “some have a lot more, but most wealth is held by the many”
(b) Historical Development
Citation and Social Network Analysis
• “who has written the most influential papers?”
• “who is the most influential person in a social network?”
• maybe the person who knows the largest number of people?
• or someone who knows a few influential people very well?
graph-theoretic approaches to analyzing
social networks and citation graphs
• national security applications:
- funding, communication, coordination
- telephone call graphs
(b) Historical Development
Memex:
Vannevar Bush (1890-1974)
• “As We May Think”, Atlantic Monthly, 1945
(mostly written 1939)
(b) Historical Development
Memex:
• device for storing and retrieving information
• storage based on microfilm
• users can store all their experiences and knowledge
and retrieve information later
• trails (hyperlinks): users can connect different pieces
of information
• surprising earlier history …
(b) Historical Development
Rapid Microfilm Selectors:
• microfilm for storage of large amounts of data
• storage density in MB per square inch (1925)
• but how can we search such massive data?
• idea:
- add index terms to microfilm boundaries
- build rapid selection machines that can scan index
• Rapid Microfilm Selectors:
- use light and photo cells to find matches in index terms
- hundreds of index cards per second
Source: M. Buckland, UC Berkeley
See http://www.sims.berkeley.edu/~buckland/goldbush.html
(b) Historical Development
Rapid Microfilm Selectors:
(b) Historical Development
(b) Historical Development
Emanuel Goldberg:
• * Moscow 1881 – Tel Aviv 1970
• director at Zeiss Ikon (Dresden)
• multiple invention in photo technologies
• inventor of rapid microfilm selector
Documentalism:
• predecessor of library and information sciences
• goal: “organize and share knowledge to make it widely available”
• The Bridge: organization to design a “world brain”
• Otlet, Ostfeld, Schuermeyer, ….
• Also related: H.G. Wells
Source: M. Buckland, UC Berkeley
See http://www.sims.berkeley.edu/~buckland/goldbush.html and links in page
(b) Historical Development
"We should have a complex of associated machines which
would achieve the following operations simultaneously or
sequentially: 1. Conversion of sound into text; 2. Copying
that text as many times as is useful; 3. Setting up
documents in such a way that each datum has its own
identity and its relationships with all the others in the
group and to which it can be re-united as needed; 4.
Assignment of a classification code to each datum;
[division of the document into parts, one for each datum,
and] rearrangement of the parts of the document to
correspond with the classification codes; 5. Automatic
classification and storage of these documents; 6.
Automatic retrieval of these documents for consultation
and for delivery either for inspection or to a machine for
making additional notes; 7. Mechanized manipulation at
will of all the recorded data in order to derive new
combinations of facts, new relationships between ideas,
new operations using symbols.” (Otlet, 1934)
"But what a revolution for information retrieval and especially for libraries television
can bring! Perhaps one day we shall see our reading rooms deserted and in their place a
room without people in which books requested by telephone are displayed, which the
users read in their homes using television equipment." (Schuermeyer 1936)
(b) Historical Development (after Bush)
• early work by H.-P. Luhn (KWIC index, SDI, abstraction)
• hypertext (Nelson, Engelbart, 1960s)
- links between different documents and sections
- Xanadu
- hypertext community (HT conferences)
• vector space model and ranking methods
- Salton et al (Cornell), 1960s
- cosine measure, SMART system
• automatic text classification
- manually generated classifiers (since 1960s)
- machine learning approach (1980s- now)
• WWW - “the Web”, Berners-Lee, ~1992
- earlier: gopher, archie, WAIS
- 1994: Mosaic browser, breakthrough in size
- 1994/1995: first crawler-based search engines
(c) Fundamentals of IR Systems
Text Index:
• a data structure that for supporting IR queries
• most popular form:
inverted index structure
• like index in a book
indexing
disks with documents
aalborg
.
.
.
.
.
arm
armada
armadillo
armani
.
.
.
.
.
zz
3452, 11437, …..
4, 19, 29, 98, 143, ...
145, 457, 789, ...
678, 2134, 3970, ...
90, 256, 372, 511, ...
602, 1189, 3209, ...
inverted index
(c) Fundamentals of IR Systems
Boolean querying:
(see Managing Gigabytes book)
Boolean queries:
(zebra AND armadillo) OR armani
unions/intersections of lists
aalborg
.
.
.
.
.
arm
armada
armadillo
armani
.
.
.
.
.
Zz
3452, 11437, …..
4, 19, 29, 98, 143, ...
145, 457, 789, ...
678, 2134, 3970, ...
90, 256, 372, 511, ...
602, 1189, 3209, ...
look up
(c) Fundamentals of IR Systems
Ranking:
“return best documents first”
(c) Fundamentals of IR Systems
Text classification:
sports
baseball
everything
politics
foreign
hockey
domestic
soccer
business
health
....
....
....
....
….
Challenges:
• designing topic hierarchy
• automatic classification: “what is this page about?”
• Yahoo and Open Directory mostly human-based
....
(c) Fundamentals of IR Systems
Basic structure of a very simple IR system:
store
indexing
Index
document collection
Query: “computer”
disks
user interface
look up
(c) Fundamentals of IR Systems
Simple IR system:
• constructs and maintain inverted index on documents
• supports Boolean and ranked queries
• may use automatic or manual classification
• may support other clustering techniques
• may support advanced browsing operations
• “searching vs. browsing”
• often well-structured collections (news, medical articles)
• queries with multiple keywords (up to hundreds)
(c) Fundamentals of IR Systems
Basic structure of a search engine:
indexing
Crawler
Index
web pages
disks
Query: “brooklyn”
look up
(c) Fundamentals of IR Systems
Differences to traditional IR systems:
• data acquisition important (crawling the web)
• collections are much larger (3 billion pages = 50 TB)
• documents are of very mixed quality and types
• queries are VERY short (less than 3 words on average)
• traditional stat. techniques do not work as well
• but additional sources of information:
- hyperlink structure
- usage data / logs
• search engine manipulation!!!
(d) Mathematical Techniques
Vector-Space Model
• each document D represented as set of words
• a query Q is also just a set of words
• let L be the set of all words in the collection, |L| = m
• D and Q correspond to m-dimensional vectors
- if word does not occur in D resp. Q, the corresponding
element is set to 0
- otherwise, element is positive
• score of D with respect to query Q is D * Q
• return documents with highest k scores
(d) Mathematical Techniques
Vector-Space Model
• Example: put a 1 into vector for each word
L = {a, alice, bob, book, like, reads}, m = 6
doc1: “Bob reads a book”
doc2: “Alice likes Bob”
D1 = ( 1, 0, 1, 1, 0, 1 )
D2 = ( 0, 1, 1, 0, 1, 0 )
doc3: “book”
D3 = ( 0, 0, 0, 1, 0, 0 )
query: “bob, book”
Q = ( 0, 0, 1, 1, 0, 0 )
D1*Q = 2,
D2 * Q = 1,
D3 * Q = 1
• very primitive ranking function: “how many words in common?”
• smarter functions: assign appropriate weights to doc vectors
• vector-matrix multiplication to score are documents
(d) Mathematical Techniques
Vector-Space Model for Ranking
• higher score for more occurrences of a word
• higher score for rare words
• lower score for long documents
• example: “cosine measure” (and many others)
• f_d,t = number of occurrences of term t in document d
• f_t = total number of occurrences of t in the collection
(d) Mathematical Techniques
Vector-Space Model
• vast amount of vector space work in IR
(see Witten/Moffat/Bell and Baeza-Yates/Ribeiro-Neto for intro & pointers)
• many different ranking functions
• additional factors in ranking (mainly for web):
- higher weight if word in title, in large font, in bold face
- search engines: higher score if word in URL, in anchortext
- distance between terms in text (near, or far away?)
- user feedback or browsing behavior?
- hyperlink structure
• execution: “compute score of all documents containing at least
one query term, by scanning the inverted lists”
(d) Mathematical Techniques
Text classification:
sports
baseball
everything
politics
foreign
hockey
domestic
soccer
business
health
....
....
....
....
....
….
• given a topic hierarchy, assign pages automatically to topics
• learning-based approaches
• document generation models
• naïve Bayes, Support Vector Machines (SVM), many more …
(d) Mathematical Techniques
Naïve Bayes:
• document generation model: e.g., baseball vs. football
- a person decides to creates a web page
- with probability 0.2 creates a page about football,
with probability 0.8 creates a page about baseball
- now fills the page with words according to some probability
distribution that depends on the chosen topic
- say, if topic is baseball, word “pitcher” more probably likely to occur
- now under this model, given a page, what is the likelihood
that this page was intended to be about baseball?
• compare to testing for rare disease:
- only one out of a million people has the disease
- but one out of 10000 tests results in a false positive
- given a person that tests positive, what is the likelihood of him being sick?
- in this example, a person that tests positive is most likely healthy
(d) Mathematical Techniques
Naïve Bayes:
(cont.)
• Pr[a & b] = Pr[a | b] * Pr[b] = Pr[b | a] * Pr[a]
Pr[a | b] = Pr[b | a] * Pr[a] / Pr[b]
• let b be the event that we get a particular page p
• let a be the event that the page is about
baseball
Pr[
• so given page p, we would like to know Pr[a | b]
• to do this, we need to know Pr[b | a], Pr[a], and Pr[b]
• suppose we are given 1000 labeled pages, that is, presentative pages that
have been marked as football or baseball (by a human expert)
• estimate Pr[a] by counting number of football and baseball pages
• estimate Pr[b | a] and Pr[b] by estimating topic-dependent word
distributions: if 1.2% of all words in baseball pages are “pitcher” then
assume word pitcher is generated with probability 0.012 in each step
Descargar

Programming Parallel N-Body Codes with the BSP Model