Content-Based Retrieval in
Hierarchical Peer-to-Peer Networks
Jie Lu
Jamie Callan
School of Computer Science
Carnegie Mellon University
March 8, 2004
Carnegie Mellon
School of Computer Science
Copyright © 2004, Carnegie Mellon. All Rights Reserved.
Peer-to-Peer Networks
• Peer-to-peer computing
– The sharing of computer resources and services (such as contents,
processing cycles, and storage space for files) by direct exchange
between systems
• Peer-to-peer architectures
pure p2p
leaf node
structured p2p
hierarchical p2p
hub (supernode, superpeer, ultrapeer, directory node)
Carnegie Mellon
School of Computer Science
Copyright © 2004, Carnegie Mellon. All Rights Reserved.
Information Retrieval in Peer-to-Peer Networks
• What is it
– The search and retrieval of documents satisfying user’s
information requests in peer-to-peer networks
• What activities are involved
– Issue information requests
– Respond to the requests with documents (document retrieval)
– Route requests to other nodes (query routing, resource selection)
• What architecture to use
– Pure peer-to-peer architecture
– Structured peer-to-peer architecture (DHT-based)
– Hierarchical peer-to-peer architecture
• What search mechanism to use
– Name-based retrieval (simple keyword matching)
– Content-based retrieval
Carnegie Mellon
School of Computer Science
Copyright © 2004, Carnegie Mellon. All Rights Reserved.
Why Content-Based Retrieval
• Name-based retrieval only suffices known-item search
– Digital objects such as music, video, and software have relatively
obvious naming convections and descriptions
– The user is familiar with the object being requested
– Any copy is as good as any other
• Search across networks of digital libraries with more
varied content requires content-based retrieval
– Text documents usually don’t have certain naming conventions and
it is often difficult to describe a document in a few words
– The user usually doesn’t know whether there are any relevant
documents with respect to the information request
– There may be zero, one, or more relevant documents for an
information request
Carnegie Mellon
School of Computer Science
Copyright © 2004, Carnegie Mellon. All Rights Reserved.
Why Hierarchical Peer-to-Peer Networks
• Centralization
– Ensures relatively consistent coverage and speed
– Suffers from a single point of failure
– Has limited scalability
• Pure P2P (complete decentralization)
– Robust
– Less efficient, may overload the network
• Structured P2P
– Most systems only support name-based retrieval
– It is not straightforward to adopt more sophisticated retrieval models
• Hierarchical P2P
– Improves efficiency and scalability without sacrificing robustness
– The special dedicated hubs can provide more sophisticated services to
improve query routing efficiency as well as retrieval accuracy
– It is straightforward to adopt various retrieval algorithms
Carnegie Mellon
School of Computer Science
Copyright © 2004, Carnegie Mellon. All Rights Reserved.
Related Work
• Search in pure P2P networks
– Gnutella 0.4: Flooding
– PlanetP: Each node collects compact summaries of other nodes’
inverted indexes through gossiping (Cuenca-Acuna et al. 2002)
– Each node learns a profile for each neighboring node by recording
the past queries this node answered (Kalogeraki et al. 2002)
– Each node learns and maintains a list of shortcuts to other nodes
based on their previous behavior w.r.t. its requests (Sripanidkulchai
et al. 2003)
• Search in structured P2P networks
– CAN, Chord, Pastry, Tapestry: Use distributed hash table to locate
the hub that holds the key most similar to the request
– pSearch: Use the semantic vector (generated by Latent Semantic
Indexing) of each document as the key to distribute document
index
Carnegie Mellon
School of Computer Science
Copyright © 2004, Carnegie Mellon. All Rights Reserved.
Related Work (cont’d)
• Search in hierarchical P2P networks
– Gnutella 0.6, KaZaA, Limewire: Hubs use name-based retrieval to
search among the content hashes uploaded by connecting leaf
nodes; hubs flood information request to neighboring hubs
– JXTA: Hubs conduct XML retrieval against the descriptions
registered by connecting leaf nodes; how hubs cooperate to route
information request is yet to be defined
• Resource selection in federated search using a single
directory service
– CORI: Inference network model
– gGlOSS: Vector space model
– K-L divergence-based algorithms: Language modeling
Carnegie Mellon
School of Computer Science
Copyright © 2004, Carnegie Mellon. All Rights Reserved.
Our Contributions
• Applied content-based resource selection and document
retrieval algorithms to hierarchical peer-to-peer networks
and demonstrated their effectiveness in retrieval accuracy
and query routing efficiency
• Provided a method for a hub to automatically learn the
contents that other hubs cover without requiring their
cooperations
• Explored a simple pruning technique on resource
descriptions to reduce storage costs associated with
content-based retrieval
• Developed a large-scale peer-to-peer testbed from TREC
WT10g dataset with millions of documents, thousands of
nodes, and tens of thousands of queries
– Available at http://www.cs.cmu.edu/~callan/Data
Carnegie Mellon
School of Computer Science
Copyright © 2004, Carnegie Mellon. All Rights Reserved.
Hierarchical P2P Architecture Based on Gnutella 0.6
leaf node
Carnegie Mellon
School of Computer Science
hub
Copyright © 2004, Carnegie Mellon. All Rights Reserved.
Retrieval in Hierarchical Peer-to-Peer Networks
• Two levels of retrieval
– At hubs: resource selection
– At leaf nodes: document retrieval
• Three types of retrieval methods
– Name-based: check query terms against terms from document
names for resource selection and document retrieval
– Match-based: match query terms against collection vocabularies
for resource selection
– Content-based: use collection language models for resource
selection and document language models for document retrieval
• Two selection rules
– Query matching rule: how many query terms need to be matched
for a query
– Number of selected neighbors: query message is routed to how
many neighboring nodes
Carnegie Mellon
School of Computer Science
Copyright © 2004, Carnegie Mellon. All Rights Reserved.
Resource Description Acquisition
• Resource description describes the contents that a leaf
node contains or a hub covers, it is used by resource
selection algorithms to select nodes for query routing
• Cooperation is required between hubs and leaf nodes
– Leaf nodes provide full resource descriptions to hubs
– Leaf nodes provide pruned resource descriptions to hubs
• Resource descriptions of hubs can be acquired
cooperatively or uncooperatively
– A hub can provide aggregation of its leaf nodes’ resource
descriptions to neighboring hubs
– A hub can learn automatically a content language model for each
of its neighboring hubs
– The content model of a hub is learned by recording the query terms
of past queries that this node has responded to
Carnegie Mellon
School of Computer Science
Copyright © 2004, Carnegie Mellon. All Rights Reserved.
Testbed
• Contents
– TREC WT10g was divided into 11,485 collections based on
document URLs
– 2,500 collections were randomly selected, each of which defined a
leaf node in a hierarchical P2P network
• Topology
– Focus on hubs that cover specific types of content
– Leaf nodes were clustered into 25 clusters, each of which defined
the contents of a single hub (i.e. which leaf nodes were connected
to this hub)
– The connections between hubs were generated randomly
(minimum 1, maximum 7, average 4 neighbors)
• Queries
– Key terms were extracted from each document to serve as a query
– 15,000 queries were used in our experiments
Carnegie Mellon
School of Computer Science
Copyright © 2004, Carnegie Mellon. All Rights Reserved.
Sample Queries of Different Lengths
Length
Terms
1
sdtech
2
malignant hyperthermia
3
cardiac surgery; anesthesia
4
trade remedy; nafta law
5
drug drive collision police investigate
6
quarter company revenue increase sybase cash
Carnegie Mellon
School of Computer Science
Copyright © 2004, Carnegie Mellon. All Rights Reserved.
Evaluation Methodology
• Retrieval accuracy
– Expensive to obtain relevance judgments for large amounts of
automatically-generated queries
– The 50 top-ranked documents retrieved from a single large
collection (union of all leaf nodes’ contents) were used as baseline to
measure how well the P2P network could reproduce this baseline
– Measure the accuracy by set-based Recall, Precision and F-score
Recall 
|r|
Precision

| A|
|r|
|R|
F - Score  2 / (
1
Recall
1

Precision
• Query routing efficiency
– Measure the efficiency by the average number of query messages
routed for each query in the network
Carnegie Mellon
School of Computer Science
Copyright © 2004, Carnegie Mellon. All Rights Reserved.
)
Experimental Results
namebased
matchbased
contentbased
Resource Selection and
Document Retrieval Algorithms
Precision
Recall
F-Score
Avg Num
Qry Msgs
Per Qry
NBRS+NBDR (100% qry terms)
70.97%
4.98%
0.0931
71
NBRS+NBDR (≥50% qry terms)
21.88%
28.85%
0.2489
479
MBLNS+CBDR
62.12%
33.76%
0.4375
540
Random-MBLNS+CBDR
60.94%
19.59%
0.2965
122
CBLNS-F+CBDR
71.82%
29.65%
0.4197
116
CBLNS-P+CBDR
72.21%
29.42%
0.4181
114
CBLNS-F+HSM+CBDR
71.95%
28.46%
0.4079
85
CBLNS-P+HSM+CBDR
72.42%
28.48%
0.4088
83
CBLNS-F+HSR-F+CBDR
75.23%
28.35%
0.4118
46
CBLNS-F+HSR-P+CBDR
74.59%
28.13%
0.4085
46
CBLNS-P+HSR-P+CBDR
75.48%
28.18%
0.4104
46
Carnegie Mellon
School of Computer Science
retrieval accuracy
qry routing efficiency
Copyright © 2004, Carnegie Mellon. All Rights Reserved.
Conclusions
• Testbed created for use in our experiments is one of the
largest so far for research on peer-to-peer systems
– 2,500 leaf nodes, 25 hubs, 15,000 queries, total number of documents
1,421,088
• Content-based vs. name-based retrieval
– Content-based retrieval is far more accurate and efficient than namebased retrieval and flooding methods
• Content-based vs. match-based resource selection
– Content-based resource selection is more effective at restricting the
query routing only to those nodes that are very likely to satisfy the
user’s information need than match-based resource selection
Carnegie Mellon
School of Computer Science
Copyright © 2004, Carnegie Mellon. All Rights Reserved.
Conclusions (cont’d)
• Content-based hub selection vs. flooding
– Content-based hub selection based on learned content models helps
improving query routing efficiency without degrading retrieval
accuracies
– Content-based hub selection based on full/pruned aggregate
resource descriptions can greatly improve query routing efficiency
without hurting accuracies, but requires larger storage space
compared with hub selection based on learned content models
• Pruned vs. full resource descriptions
– Using pruned resource descriptions for content-based resource
selection has comparable retrieval accuracies compared with using
full resource descriptions, but requires around half the storage
space for resource selection index
Carnegie Mellon
School of Computer Science
Copyright © 2004, Carnegie Mellon. All Rights Reserved.
Questions?
Carnegie Mellon
School of Computer Science
Copyright © 2004, Carnegie Mellon. All Rights Reserved.
Descargar

No Slide Title