Web Scale Information Discovery
CS 431 – March 29, 2004
Carl Lagoze – Cornell University
Acknowledgements:
Luis Gravano
Andreas Paepcke
Bill Arms
Search Strategies
• Metadata harvesting
• Automated discovery
• Federated searching
• Meta-searching
Web Search Strategies – Metadata Harvesting
[Diagram: a central search service harvests metadata records (Author, Title, Abstract, Identifier) from distributed repositories and answers user queries against the aggregated metadata]
Searching via metadata harvesting – examples
• NSDL – http://www.nsdl.org
• OAIster – http://oaister.umdl.umich.edu/o/oaister/
Web Search Strategies – Crawling and Automated Indexing
[Diagram: a crawler visits distributed web pages and builds a "central" index that answers user queries]
Automated Information Discovery
Creating metadata records manually is labor-intensive and hence expensive.
The aim of automated information discovery
is for users to discover information without
using skilled human effort to build indexes.
Resources for Automated Information Discovery
• Computer power
  – brute force computing
  – ranking methods
  – automatic generation of metadata
• The intelligence of the user
  – browsing
  – relevance feedback
  – information visualization
Similarity Ranking
Ranking methods using similarity
• Measure the degree of similarity between a query and
a document (or between two documents).
• Basic technique is the vector space model with term
weighting.
[Diagram: requests compared against documents: how similar is a document to a request?]
Vector Space Methods: Concept
n-dimensional space, where n is the total number of different
terms (words) in a set of documents.
Each document is represented by a vector, with magnitude in
each dimension equal to the (weighted) number of times that the
corresponding term appears in the document.
Similarity between two documents is the angle between their
vectors.
Much of this work was carried out by Gerard Salton and
colleagues in Cornell's computer science department.
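A minimal sketch of the idea in Python (a toy two-document example; the weights here are raw term counts):

    import math
    from collections import Counter

    def cosine_similarity(doc_a, doc_b):
        """Cosine of the angle between two documents' term-count vectors."""
        a, b = Counter(doc_a.split()), Counter(doc_b.split())
        dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
        norm_a = math.sqrt(sum(v * v for v in a.values()))
        norm_b = math.sqrt(sum(v * v for v in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    print(cosine_similarity("ant ant bee", "bee hog ant dog"))  # ~0.67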
Example 1: Incidence Array
terms in d1 -> ant ant bee
terms in d2 -> bee hog ant dog
terms in d3 -> cat gnu dog eel fox

        ant  bee  cat  dog  eel  fox  gnu  hog
d1       1    1
d2       1    1         1                   1
d3                 1    1    1    1    1

Weights: tij = 1 if document i contains term j, and zero otherwise
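A short Python sketch that reproduces this incidence array from the three toy documents:

    docs = {"d1": "ant ant bee",
            "d2": "bee hog ant dog",
            "d3": "cat gnu dog eel fox"}
    terms = sorted({t for words in docs.values() for t in words.split()})
    # tij = 1 if document i contains term j, and zero otherwise
    for d, words in docs.items():
        print(d, [1 if t in words.split() else 0 for t in terms])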
Reasons for Term Weighting
Similarity using an incidence matrix measures the
occurrences of terms, but no other characteristics of
the documents.
Terms are more useful for information retrieval if they:
• appear several times in one document (weighting by term frequency)
• only appear in some documents (weighting by document frequency)
• appear in short documents (weighting by document length)
Inverse Document Frequency
Concept
A term that occurs in a few documents is likely to be a better discriminator than a term that appears in most or all documents.
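One standard way to combine these weightings is tf-idf; a minimal sketch (the textbook formula, not tied to any particular engine):

    import math

    def tf_idf(term_freq, doc_freq, num_docs):
        """Term frequency scaled by inverse document frequency:
        rare terms are boosted, ubiquitous terms approach zero."""
        return term_freq * math.log(num_docs / doc_freq)

    # "databases": 5 occurrences, appears in 100 of 10,000 documents
    print(tf_idf(5, 100, 10_000))      # ~23.0: a strong discriminator
    # "the": 50 occurrences, appears in all 10,000 documents
    print(tf_idf(50, 10_000, 10_000))  # 0.0: no discriminating power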
Issues in extending traditional IR for the Web
• Traditional TREC benchmarks are relatively small
scale (large is about 20GB)
• Web queries are very short
• Quality is an issue in ranking
• Sex, lies, and the hidden web
• Polysemy due to domain overlap
• Web has context and “hints”
– Structure of pages (e.g., the HTML title might be rated higher)
– Implicit metadata of link context
• Anchor text of citing pages
• Weighting influenced by citation structure (recall Garfield)
PageRank Algorithm (Google)
Concept:
The rank of a web page is higher if many pages link
to it.
Links from highly ranked pages are given greater
weight than links from less highly ranked pages.
Google Example
[Diagram: a small web of six pages (1–6) linked by hyperlinks]
1. Build the adjacency matrix
2. Normalize by the number of links from each page
3. Iterate until convergence
Motivating the Damping Factor
[Diagram: the same six-page web, modified so that rank can become trapped in a closed subset of pages (the sink that the damping factor corrects)]
PageRank with Damping Factor: Intuitive Model
A user:
1. Starts at a random page on the web
2a. With probability p, selects any random page and jumps to it
2b. With probability 1-p, selects a random hyperlink from the current page and jumps to the corresponding page
3. Repeats Steps 2a and 2b a very large number of times
Pages are ranked according to the relative frequency with which they are visited.
The PageRank Iteration
The basic method iterates using the normalized link matrix B:
    w_k = B w_(k-1)
The stationary w is the principal eigenvector of B.
Google iterates using a damping factor. The method iterates using a matrix B', where:
    B' = dN + (1 - d)B
N is the matrix with every element equal to 1/n.
d is a constant found by experiment.
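A minimal power-iteration sketch of the above, following the slide's convention (d = probability of a random jump); the four-page graph is hypothetical:

    import numpy as np

    def pagerank(adj, d=0.15, iters=100):
        """Iterate w_k = B' w_(k-1) with B' = d*N + (1-d)*B,
        where B normalizes each column of adj by that page's out-degree."""
        n = adj.shape[0]
        out = adj.sum(axis=0)
        out[out == 0] = 1              # guard against dangling pages
        B = adj / out                  # column j divided by out-degree of page j
        Bprime = d / n + (1 - d) * B   # dN has every element equal to d/n
        w = np.full(n, 1.0 / n)
        for _ in range(iters):
            w = Bprime @ w
            w /= w.sum()               # keep w a probability distribution
        return w

    # Hypothetical 4-page web: adj[i, j] = 1 if page j links to page i
    adj = np.array([[0., 1., 1., 0.],
                    [1., 0., 0., 1.],
                    [1., 0., 0., 1.],
                    [0., 1., 0., 0.]])
    print(pagerank(adj))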
Information Retrieval Using PageRank
Simple Method
Consider all hits (i.e., all document vectors that share at least
one term with the query vector) as equal.
Display the hits ranked by PageRank.
The disadvantage of this method is that it pays no attention to how closely a document matches a query.
With dynamic document sets, reference patterns are calculated for a set of documents that is selected based on each individual query.
Web Search Strategies – Federated Searching
[Diagram: a search client broadcasts a query to multiple database servers via Z39.50]
http://www.loc.gov/z3950/agency/
Aims of Z39.50
• Permits one computer, the client, to search and retrieve
information on another, the database server
• Important both technically and for its wide use in library
systems
• Most development has concentrated on bibliographic data
• Most implementations emphasize searches that use a
bibliographic set of attributes to search databases of
MARC records
• Built on notion of common protocol as interoperability
paradigm
Z39.50 – Technical History
• Developed for X.25 networks (connection-oriented); support for running over TCP was fitted later
• Original concept dates from a time when repeating a search was computationally expensive (about 1980)
Z39.50 principles
Abstract view of database searching.
• Server stores a set of databases with searchable
indexes
• Interactions are based on a session
• The client opens a connection with the server, carries
out a sequence of interactions and then closes the
connection.
• During the course of the session, both the server and
the client remember the state of their interaction.
Z39.50 – State
• The server carries out the search and builds a result set
• The server saves the result set
• Subsequent messages from the client can reference the result set
• Thus the client can narrow a large set with increasingly precise requests, or can request the presentation of any record in the set, without searching the entire database
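A toy sketch of the stateful result-set idea (hypothetical classes, not the real Z39.50 protocol API):

    class ToyZSession:
        """Stand-in for a Z39.50 server session: named result sets persist
        between requests, so refinements never rescan the whole database."""
        def __init__(self, records):
            self.records = records
            self.result_sets = {}

        def search(self, name, predicate):
            self.result_sets[name] = [r for r in self.records if predicate(r)]
            return len(self.result_sets[name])      # only a count travels back

        def refine(self, old, new, predicate):
            self.result_sets[new] = [r for r in self.result_sets[old] if predicate(r)]
            return len(self.result_sets[new])

        def present(self, name, i):
            return self.result_sets[name][i]        # fetch one record on demand

    s = ToyZSession([{"author": "Salton", "year": 1975},
                     {"author": "Salton", "year": 1989}])
    s.search("s1", lambda r: r["author"] == "Salton")   # large set, saved server-side
    s.refine("s1", "s2", lambda r: r["year"] > 1980)    # narrow by reference
    print(s.present("s2", 0))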
Problems with Z39.50
• Very difficult to implement
– There are freely available implementations, but they are
complex
• Outdated assumptions
– Searching is expensive computationally
– Bandwidth is limited (ASN.1 compression)
• Originally designed for bibliographic record
retrieval, and not full documents or other objects
• “Overspecified”
• Assumes questionable user model (stateful)
ZING – update of Z39.50 concepts
• http://www.loc.gov/z3950/agency/zing/zinghome.html
• SRW (Search and Retrieval for the Web)
– Retain basic Z39.50 concepts (state, explain, flexible
access points or metadata formats)
– Simplifications and modernizations (statefulness, use of
XML/SOAP)
• CQL (Common Query Language)
– Expressive common query language with XML definition
Web Search Strategies - Metasearching
[Diagram: a metasearch engine forwards a user's query to many underlying search services]
What is “Metasearching”?
• Given many document sources and a query, a metasearcher:
– Finds the good sources for the query
– Evaluates the query at these sources
– Merges the results from these sources
[Diagram: a metasearcher sits in front of heterogeneous sources: unindexed documents, a legacy database / WAIS / etc., and an existing web application]
Metasearching Issues
• How to query different types of sources?
• How to combine results and rankings from multiple data
sources?
[Diagram: the metasearcher translates one query into each source's native idiom: grep 'biomedical' *.txt for flat files, SELECT title FROM articles . . . for a database, http://…/getTitle?title='biomedical'&… for a web application]
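A minimal sketch of that translation step (hypothetical adapters; each renders one query in its source's native idiom):

    import urllib.parse

    def for_flat_files(term):
        return f"grep '{term}' *.txt"               # shell command for text files

    def for_sql_database(term):
        return f"SELECT title FROM articles WHERE body LIKE '%{term}%'"

    def for_web_application(term):
        # Hypothetical endpoint standing in for the slide's getTitle URL
        return "http://example.org/getTitle?" + urllib.parse.urlencode({"title": term})

    for translate in (for_flat_files, for_sql_database, for_web_application):
        print(translate("biomedical"))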
Metasearching Issues . . . Cont’d
• How to choose among multiple data sources?
• How to get metadata about multiple data sources?
[Diagram: the metasearcher must also ask each source what it holds, again in each source's native idiom (cat *.txt, SELECT SCHEMA …)]
Best case: http://….?getMetaData
Worst case: "Hi. What do you have?"
STARTS/SDARTS
http://sdarts.cs.columbia.edu/default.html
STARTS
• Stanford Protocol Proposal for Internet Retrieval
and Search
• Joint work of Stanford Digital Library Project and
Cornell Digital Library Research Group
• SDARTS – current work at Columbia to extend STARTS
  – Integrate with SDLIP and metadata harvesting (OAI-PMH)
  – Include "deep web" automated content summary concepts
Different text search engines are largely incompatible
• Different query languages (the query-language problem)
• Different ranking algorithms (the rank-merging problem)
• No exported information about sources (the metadata problem)
SDLIP – search middleware
Rank Merging
• Return information in the query result to allow rank merging:
  – unnormalized score of the document
  – statistics about each query term
We cannot merge document ranks from different sources directly
• Search engines use different ranking algorithms:
DB1: (doc1, 0.7), (doc2, 0.3)
DB2: (doc3, 1000), (doc4, 400)
Merged rank?
• Some algorithms depend on the source characteristics
Extra information helps merge document ranks
meaningfully
Sources return query results and statistics:
Query: "distributed databases"
DB1: (doc1, 0.7)
  "distributed" appears 3 times in doc1
  "databases" appears 5 times in doc1
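A minimal sketch of recomputing comparable scores from such statistics (a hypothetical log-scaled term-frequency scoring; real merging algorithms are more elaborate):

    import math

    def merged_score(term_stats):
        """Source-independent score from per-term frequencies returned
        with each hit, ignoring each engine's incomparable native score."""
        return sum(1 + math.log(tf) for tf in term_stats.values() if tf > 0)

    # Hits from two sources whose native scores (0.7 vs. 1000) cannot be compared:
    hits = [("doc1", {"distributed": 3, "databases": 5}),   # from DB1
            ("doc3", {"distributed": 1, "databases": 2})]   # from DB2
    for doc, stats in sorted(hits, key=lambda h: merged_score(h[1]), reverse=True):
        print(doc, round(merged_score(stats), 2))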
Motivating Source Metadata
Routing Problem – Disjoint Search Sources
Query: author=Hopcroft?

Routing index:
  Hopcroft  -> I1, I3
  Hartmanis -> I3
  Tarjan    -> I1, I2
  Wilensky  -> I2

The query is routed to I1 and I3, which return doc8 (from I1) and doc1, doc2 (from I3).

Content Summary
  I1: Hopcroft -> doc8;        Tarjan    -> doc9
  I2: Tarjan   -> doc6;        Wilensky  -> doc7
  I3: Hopcroft -> doc1, doc2;  Hartmanis -> doc3, doc4
Source Metadata
• Data to help select the right sources for a query
  – source metadata attributes – what the source engine can do
  – source content summary – what the source engine can search
• Simplified form of the Z39.50 "explain" service
Source Content Summary
For each source:
• Vocabulary
• Document frequency for each word
• Total number of postings for each word
• Number of documents
• Implements the GlOSS work:
  – GlOSS: Text-Source Discovery over the Internet, L. Gravano, H. Garcia-Molina, A. Tomasic, ACM Transactions on Database Systems, vol. 24, no. 2, June 1999
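A minimal sketch of building such a summary for one source (a toy corpus; the real GlOSS statistics are gathered at scale):

    from collections import Counter

    def content_summary(docs):
        """Per-source summary: vocabulary, document frequency per word,
        total postings per word, and number of documents."""
        df, postings = Counter(), Counter()
        for doc in docs:
            words = doc.lower().split()
            postings.update(words)       # every occurrence counts
            df.update(set(words))        # at most once per document
        return {"vocabulary": sorted(df),
                "doc_freq": dict(df),
                "postings": dict(postings),
                "num_docs": len(docs)}

    print(content_summary(["distributed databases",
                           "databases databases at web scale"]))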
Deep Web Content Summary Extraction via Focused Query Probing