The Million Book Project
The Mini-UL Digital Library Platform
Carnegie Mellon University School of Computer Science
Raj Reddy
Eric Burns

What is the Million Book Project?
  - Free-to-read, open-platform digital library
      - Worldwide distribution and mirroring
      - Public domain works
      - Out of print but in copyright
      - Rare materials
  - Collaborative content acquisition
      - India
          - 20 mini scanning centers, 3 mega scanning centers
          - Over 80,000 books to date
      - China
          - Over 30,000 books to date
      - USA / Carnegie Mellon (Hunt Library/SCS)
          - 1200 books, technology contributor
  - Truly multi-lingual corpus
      - Several Indian languages
      - Mandarin Chinese
      - Most European languages

MBP offers unique systems challenges
  - Multiple deployments
      - China
      - India
      - Partners in US
  - Human-intensive scanning process
      - Error prone
          - DC XML entered by hand
          - Operator error on scanning devices
      - Difficult to standardize
      - Multiple QA passes required
  - Everyone wants autonomy and customization
      - System-level solution must satisfy small and large data sets
      - CMU must provide a framework for remote sites to extend
  - Equipment budget is limited
  - Developing nations’ networks are limited
      - China, India output must be shipped to US

Core Problems
  - Multiple scanning centers, each with:
      - Distinct values and goals
      - Limited connectivity
      - Varying IT infrastructure
  - Common base requirements
      - Searching
      - Browsing
      - Viewing
      - File-system compatibility
  - Basic standard for acquiring and storing scanned books
      - Data preservation
      - Quality assurance
      - Flexibility
      - Openness
  - Fault-tolerant storage at all sites
  - Data movement via physical shipment
  - Standardized OS and base software

Our Solution: Mini-UL
  - Embedded Digital library on a CD
      - OS (Knoppix Linux), servers (Apache, PaperSight ImageServer), code (Perl) on single ISO
      - Boots single systems or whole clusters
      - Ensures standardization, eases upgrades
      - To use new software, admins burn CD and reboot
  - Commodity PC and disk hardware spec
      - Software RAID: Use low-end PC as network-attach storage
      - Sub-$1000 PC = 1 TB NASD
          - Barebones economy PC
          - 250 GB OEM disk x 4
      - Add storage PCs as needs grow
          - 1 processor per storage unit
  - CD + PC(s) = Embedded digital library
      - “Black box” approach
      - Dump MBP-format books into upload bucket
      - Easily search, browse, view, and download all books added

The MBP Book Format
  - Dir w/ five subdirectories:
      - OTIFF
          - “Original TIFF”: exactly as scanned
          - Eight-digit, zero-padded page numbers (00000123.tif)
          - 1-bit color at 600 DPI, lossless
      - PTIFF
          - “Processed TIFF”: current best batch image processing
          - Eight-digit zero-padded numbers match OTIFF
      - TXT
          - ASCII, UTF-8, or UTF-16 text
          - Numbers match OTIFF/PTIFF
      - HTML
          - UTF-8 HTML w/ low-res JPEG images
          - Numbers match OTIFF/PTIFF
      - [MARC|DC]
          - Binary MARC record
          - Dublin Core XML
  - Flexible: other format directories can be added
  - Internal storage format:
      - OTIFF/PTIFF -> multipage
      - TXT/HTML -> zip
      - 500 page book = 2001 files
      - Converted at addition time to 5 files
      - Speeds copying

High-level Cluster Architecture
  [Diagram: Web traffic -> Head Node (NASD 0) -> Internal network subnet -> NASD 1, NASD 2, NASD …, NASD n]
  - Network-Attach Storage Devices (SATA RAID PCs)

Adding a Book
  - Head node has SMB share “Upload”
  - User moves one or more MBP-format books into Upload share
  - System automatically checks each book for completeness/correctness:
      - All formats present
      - Contiguous page numbers
      - Metadata present and parseable
  - Errors presented to user for correction
  - Converts to internal storage format
  - Assigns serial number
  - Moves to NASD node with most free space
  - Incremental search index

Viewing a book
  - Users view original page images
      - HTML, raw TXT as option
  - Intra-book searching
      - Seeks to matching page
      - Highlights token match
      - Rapidly seek from one token match to the next
      - Boolean queries, phrase matching

PaperSight ImageServer
  - Convert 600 DPI 1-bit TIFF to ~96 DPI 8-bit GIF
  - Real-time conversion performance is faster than human response
  - Anti-aliased grey-scale image is ideal for monitor reading
  - Significant reduction in bandwidth
  - Conversion happens on hosting NASD node, not head

Browsing
  - Simple alphabetic browse
  - Keep list sizes small

The Missing Piece: Search
  - Searching the full text of tens of thousands of books is computationally intensive
  - Solution: parallelize
      - Each NASD node indexes and searches content it stores
      - Results are unified and sorted at head node
  - NASD cluster architecture maintains parity between processors and storage
      - Grow from n to 2n nodes? Search speed remains constant (assuming homogeneous corpus)
      - Search too slow? Increase machine count and redistribute data.

Search features
  - Fast! 0.1 sec per-token response in most cases (AMD 1400+).
  - Joint bibliographic and full-text search with single query
  - Phrase matching, boolean queries, cross-page phrases
  - Context display for full-text matches
  - Rich scoring system:
      - Metadata matches
      - Token proximity scoring (multi-token queries only)
  - Direct-to-page matching
      - Full text matches yield actual matching page, with highlighting
  - Full search API (Perl)

Customization
  - APIs provided for all major components:
      - Search
      - Book Reader
      - Metadata processing and conversion
  - All HTML lives in read-write space on head node
      - Development sites can create rich HTML hierarchies
  - Scripting is not limited to CD contents
      - cgi-bin and site_perl can be extended
      - CD/core upgrades leave extensions untouched

Future Directions
  - Search engine in wider distribution
      - GPL
      - Perl CPAN
  - “Phone Home” capability
      - Individual Mini-UL systems with slow but persistent links relay manifests
      - Metadata + text
      - Master site to search all sites
  - IIIT Hyderabad contributions
      - MySQL-based metadata search
      - Separate search and storage clusters
          - 9 TB hardware RAID servers
          - Multiple diskless search nodes

Embedded Digital Library Uses
  - Gives MBP sites foundation on which to build
  - Allows convergence on standards as sites contribute new extensions to main distribution
  - Gives basic search, browse, view, and audit capability to any site, regardless of development staff
  - Uses extend beyond MBP deployments
      - Any site with archives of multi-page text documents can benefit
      - Only requirements are a scanner and a PC
      - Virtually no administration required

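
Illustrative sketch: checking an MBP-format book

The completeness checks listed under "Adding a Book" (all formats present, contiguous page numbers, metadata present) could look roughly like the sketch below. This is not the Mini-UL code, which is written in Perl; it is a minimal Python illustration. The names check_book, page_numbers, PAGE_DIRS, and META_DIRS are hypothetical, the .txt and .html extensions are assumed (only 00000123.tif appears in the format description), and the metadata check only tests for presence, not parseability.

```python
import re
from pathlib import Path

# Directory names come from the MBP Book Format slide.
# The .txt and .html extensions are assumptions; only .tif is given in the deck.
PAGE_DIRS = {"OTIFF": ".tif", "PTIFF": ".tif", "TXT": ".txt", "HTML": ".html"}
META_DIRS = ("MARC", "DC")  # assumed: the metadata directory is named MARC or DC
PAGE_NAME = re.compile(r"^\d{8}$")  # eight-digit, zero-padded page number


def page_numbers(directory, suffix):
    """Return the sorted page numbers of files named like 00000123<suffix>."""
    numbers = []
    for entry in directory.iterdir():
        if (entry.is_file() and entry.suffix.lower() == suffix
                and PAGE_NAME.match(entry.stem)):
            numbers.append(int(entry.stem))
    return sorted(numbers)


def check_book(book):
    """Return a list of problems found; an empty list means the book passed."""
    book = Path(book)
    errors = []
    ref_name, ref_pages = None, None
    for name, suffix in PAGE_DIRS.items():
        directory = book / name
        if not directory.is_dir():
            errors.append(f"missing directory: {name}")
            continue
        pages = page_numbers(directory, suffix)
        if not pages:
            errors.append(f"no page files in {name}")
            continue
        # Contiguous means every number from the first page onward is present.
        if pages != list(range(pages[0], pages[0] + len(pages))):
            errors.append(f"page numbers not contiguous in {name}")
        # Every format must use the same page numbers as the first one seen.
        if ref_pages is None:
            ref_name, ref_pages = name, pages
        elif pages != ref_pages:
            errors.append(f"page numbers in {name} do not match {ref_name}")
    has_metadata = any(
        (book / m).is_dir() and any((book / m).iterdir()) for m in META_DIRS
    )
    if not has_metadata:
        errors.append("no metadata found (MARC or DC)")
    return errors
```

Returning a list of messages rather than stopping at the first failure matches the "errors presented to user for correction" step: the operator sees every problem with a book in one pass.
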
Questions?
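
Backup sketch: parallel search with merge at the head node

The search design described earlier (each NASD node searches only the content it stores, and the head node unifies and sorts the results) can be illustrated with the toy example below. It is a sketch under stated assumptions, not the Mini-UL search engine: StorageNode and head_node_search are hypothetical names, threads stand in for separate machines, and scoring is a plain token count rather than the metadata and proximity scoring the real system uses.

```python
from concurrent.futures import ThreadPoolExecutor


class StorageNode:
    """Stand-in for one NASD node: it holds and searches only its own pages."""

    def __init__(self, name, pages):
        self.name = name
        self.pages = pages  # {(book_id, page_number): page_text}

    def search(self, token):
        """Return (score, book_id, page_number) for each page containing token."""
        token = token.lower()
        hits = []
        for (book_id, page), text in self.pages.items():
            score = text.lower().split().count(token)  # toy score: occurrences
            if score:
                hits.append((score, book_id, page))
        return hits


def head_node_search(nodes, token, limit=10):
    """Query every node in parallel, then merge and sort at the head node."""
    with ThreadPoolExecutor(max_workers=max(1, len(nodes))) as pool:
        partial_results = list(pool.map(lambda node: node.search(token), nodes))
    merged = [hit for hits in partial_results for hit in hits]
    # Highest score first; ties broken by book id and page for stable output.
    merged.sort(key=lambda hit: (-hit[0], hit[1], hit[2]))
    return merged[:limit]


nodes = [
    StorageNode("nasd1", {("book1", 1): "digital library digital"}),
    StorageNode("nasd2", {("book2", 7): "a digital book"}),
]
print(head_node_search(nodes, "digital"))
# prints [(2, 'book1', 1), (1, 'book2', 7)]
```

Because each node scans only its own share of the corpus, doubling both the corpus and the node count leaves the per-node work roughly unchanged, which is the scaling argument made on the search slide.
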