Content-Based Retrieval in Hierarchical Peer-to-Peer Networks Jie Lu Jamie Callan School of Computer Science Carnegie Mellon University March 8, 2004 Carnegie Mellon School of Computer Science Copyright © 2004, Carnegie Mellon. All Rights Reserved. Peer-to-Peer Networks • Peer-to-peer computing – The sharing of computer resources and services (such as contents, processing cycles, and storage space for files) by direct exchange between systems • Peer-to-peer architectures pure p2p leaf node structured p2p hierarchical p2p hub (supernode, superpeer, ultrapeer, directory node) Carnegie Mellon School of Computer Science Copyright © 2004, Carnegie Mellon. All Rights Reserved. Information Retrieval in Peer-to-Peer Networks • What is it – The search and retrieval of documents satisfying user’s information requests in peer-to-peer networks • What activities are involved – Issue information requests – Respond to the requests with documents (document retrieval) – Route requests to other nodes (query routing, resource selection) • What architecture to use – Pure peer-to-peer architecture – Structured peer-to-peer architecture (DHT-based) – Hierarchical peer-to-peer architecture • What search mechanism to use – Name-based retrieval (simple keyword matching) – Content-based retrieval Carnegie Mellon School of Computer Science Copyright © 2004, Carnegie Mellon. All Rights Reserved. Why Content-Based Retrieval • Name-based retrieval only suffices known-item search – Digital objects such as music, video, and software have relatively obvious naming convections and descriptions – The user is familiar with the object being requested – Any copy is as good as any other • Search across networks of digital libraries with more varied content requires content-based retrieval – Text documents usually don’t have certain naming conventions and it is often difficult to describe a document in a few words – The user usually doesn’t know whether there are any relevant documents with respect to the information request – There may be zero, one, or more relevant documents for an information request Carnegie Mellon School of Computer Science Copyright © 2004, Carnegie Mellon. All Rights Reserved. Why Hierarchical Peer-to-Peer Networks • Centralization – Ensures relatively consistent coverage and speed – Suffers from a single point of failure – Has limited scalability • Pure P2P (complete decentralization) – Robust – Less efficient, may overload the network • Structured P2P – Most systems only support name-based retrieval – It is not straightforward to adopt more sophisticated retrieval models • Hierarchical P2P – Improves efficiency and scalability without sacrificing robustness – The special dedicated hubs can provide more sophisticated services to improve query routing efficiency as well as retrieval accuracy – It is straightforward to adopt various retrieval algorithms Carnegie Mellon School of Computer Science Copyright © 2004, Carnegie Mellon. All Rights Reserved. Related Work • Search in pure P2P networks – Gnutella 0.4: Flooding – PlanetP: Each node collects compact summaries of other nodes’ inverted indexes through gossiping (Cuenca-Acuna et al. 2002) – Each node learns a profile for each neighboring node by recording the past queries this node answered (Kalogeraki et al. 2002) – Each node learns and maintains a list of shortcuts to other nodes based on their previous behavior w.r.t. its requests (Sripanidkulchai et al. 2003) • Search in structured P2P networks – CAN, Chord, Pastry, Tapestry: Use distributed hash table to locate the hub that holds the key most similar to the request – pSearch: Use the semantic vector (generated by Latent Semantic Indexing) of each document as the key to distribute document index Carnegie Mellon School of Computer Science Copyright © 2004, Carnegie Mellon. All Rights Reserved. Related Work (cont’d) • Search in hierarchical P2P networks – Gnutella 0.6, KaZaA, Limewire: Hubs use name-based retrieval to search among the content hashes uploaded by connecting leaf nodes; hubs flood information request to neighboring hubs – JXTA: Hubs conduct XML retrieval against the descriptions registered by connecting leaf nodes; how hubs cooperate to route information request is yet to be defined • Resource selection in federated search using a single directory service – CORI: Inference network model – gGlOSS: Vector space model – K-L divergence-based algorithms: Language modeling Carnegie Mellon School of Computer Science Copyright © 2004, Carnegie Mellon. All Rights Reserved. Our Contributions • Applied content-based resource selection and document retrieval algorithms to hierarchical peer-to-peer networks and demonstrated their effectiveness in retrieval accuracy and query routing efficiency • Provided a method for a hub to automatically learn the contents that other hubs cover without requiring their cooperations • Explored a simple pruning technique on resource descriptions to reduce storage costs associated with content-based retrieval • Developed a large-scale peer-to-peer testbed from TREC WT10g dataset with millions of documents, thousands of nodes, and tens of thousands of queries – Available at http://www.cs.cmu.edu/~callan/Data Carnegie Mellon School of Computer Science Copyright © 2004, Carnegie Mellon. All Rights Reserved. Hierarchical P2P Architecture Based on Gnutella 0.6 leaf node Carnegie Mellon School of Computer Science hub Copyright © 2004, Carnegie Mellon. All Rights Reserved. Retrieval in Hierarchical Peer-to-Peer Networks • Two levels of retrieval – At hubs: resource selection – At leaf nodes: document retrieval • Three types of retrieval methods – Name-based: check query terms against terms from document names for resource selection and document retrieval – Match-based: match query terms against collection vocabularies for resource selection – Content-based: use collection language models for resource selection and document language models for document retrieval • Two selection rules – Query matching rule: how many query terms need to be matched for a query – Number of selected neighbors: query message is routed to how many neighboring nodes Carnegie Mellon School of Computer Science Copyright © 2004, Carnegie Mellon. All Rights Reserved. Resource Description Acquisition • Resource description describes the contents that a leaf node contains or a hub covers, it is used by resource selection algorithms to select nodes for query routing • Cooperation is required between hubs and leaf nodes – Leaf nodes provide full resource descriptions to hubs – Leaf nodes provide pruned resource descriptions to hubs • Resource descriptions of hubs can be acquired cooperatively or uncooperatively – A hub can provide aggregation of its leaf nodes’ resource descriptions to neighboring hubs – A hub can learn automatically a content language model for each of its neighboring hubs – The content model of a hub is learned by recording the query terms of past queries that this node has responded to Carnegie Mellon School of Computer Science Copyright © 2004, Carnegie Mellon. All Rights Reserved. Testbed • Contents – TREC WT10g was divided into 11,485 collections based on document URLs – 2,500 collections were randomly selected, each of which defined a leaf node in a hierarchical P2P network • Topology – Focus on hubs that cover specific types of content – Leaf nodes were clustered into 25 clusters, each of which defined the contents of a single hub (i.e. which leaf nodes were connected to this hub) – The connections between hubs were generated randomly (minimum 1, maximum 7, average 4 neighbors) • Queries – Key terms were extracted from each document to serve as a query – 15,000 queries were used in our experiments Carnegie Mellon School of Computer Science Copyright © 2004, Carnegie Mellon. All Rights Reserved. Sample Queries of Different Lengths Length Terms 1 sdtech 2 malignant hyperthermia 3 cardiac surgery; anesthesia 4 trade remedy; nafta law 5 drug drive collision police investigate 6 quarter company revenue increase sybase cash Carnegie Mellon School of Computer Science Copyright © 2004, Carnegie Mellon. All Rights Reserved. Evaluation Methodology • Retrieval accuracy – Expensive to obtain relevance judgments for large amounts of automatically-generated queries – The 50 top-ranked documents retrieved from a single large collection (union of all leaf nodes’ contents) were used as baseline to measure how well the P2P network could reproduce this baseline – Measure the accuracy by set-based Recall, Precision and F-score Recall |r| Precision | A| |r| |R| F - Score 2 / ( 1 Recall 1 Precision • Query routing efficiency – Measure the efficiency by the average number of query messages routed for each query in the network Carnegie Mellon School of Computer Science Copyright © 2004, Carnegie Mellon. All Rights Reserved. ) Experimental Results namebased matchbased contentbased Resource Selection and Document Retrieval Algorithms Precision Recall F-Score Avg Num Qry Msgs Per Qry NBRS+NBDR (100% qry terms) 70.97% 4.98% 0.0931 71 NBRS+NBDR (≥50% qry terms) 21.88% 28.85% 0.2489 479 MBLNS+CBDR 62.12% 33.76% 0.4375 540 Random-MBLNS+CBDR 60.94% 19.59% 0.2965 122 CBLNS-F+CBDR 71.82% 29.65% 0.4197 116 CBLNS-P+CBDR 72.21% 29.42% 0.4181 114 CBLNS-F+HSM+CBDR 71.95% 28.46% 0.4079 85 CBLNS-P+HSM+CBDR 72.42% 28.48% 0.4088 83 CBLNS-F+HSR-F+CBDR 75.23% 28.35% 0.4118 46 CBLNS-F+HSR-P+CBDR 74.59% 28.13% 0.4085 46 CBLNS-P+HSR-P+CBDR 75.48% 28.18% 0.4104 46 Carnegie Mellon School of Computer Science retrieval accuracy qry routing efficiency Copyright © 2004, Carnegie Mellon. All Rights Reserved. Conclusions • Testbed created for use in our experiments is one of the largest so far for research on peer-to-peer systems – 2,500 leaf nodes, 25 hubs, 15,000 queries, total number of documents 1,421,088 • Content-based vs. name-based retrieval – Content-based retrieval is far more accurate and efficient than namebased retrieval and flooding methods • Content-based vs. match-based resource selection – Content-based resource selection is more effective at restricting the query routing only to those nodes that are very likely to satisfy the user’s information need than match-based resource selection Carnegie Mellon School of Computer Science Copyright © 2004, Carnegie Mellon. All Rights Reserved. Conclusions (cont’d) • Content-based hub selection vs. flooding – Content-based hub selection based on learned content models helps improving query routing efficiency without degrading retrieval accuracies – Content-based hub selection based on full/pruned aggregate resource descriptions can greatly improve query routing efficiency without hurting accuracies, but requires larger storage space compared with hub selection based on learned content models • Pruned vs. full resource descriptions – Using pruned resource descriptions for content-based resource selection has comparable retrieval accuracies compared with using full resource descriptions, but requires around half the storage space for resource selection index Carnegie Mellon School of Computer Science Copyright © 2004, Carnegie Mellon. All Rights Reserved. Questions? Carnegie Mellon School of Computer Science Copyright © 2004, Carnegie Mellon. All Rights Reserved.