The Million Book Project
The Mini-UL Digital Library Platform
Carnegie Mellon University
School of Computer Science
Raj Reddy
Eric Burns
What is the Million Book Project?
 Free-to-read, open-platform digital library
 Worldwide distribution and mirroring
 Public domain works
 Out of print but in copyright
 Rare materials
 Collaborative content acquisition
 India
 20 mini scanning centers, 3 mega scanning centers
 Over 80,000 books to date
 China
 Over 30,000 books to date
 USA / Carnegie Mellon (Hunt Library/SCS)
 1200 books, technology contributor
 Truly multi-lingual corpus
 Several Indian languages
 Mandarin Chinese
 Most European languages
MBP offers unique systems challenges
 Multiple deployments
 China
 India
 Partners in US
 Human-intensive scanning process
 Error prone
 DC XML entered by hand
 Operator error on scanning devices
 Difficult to standardize
 Multiple QA passes required
 Everyone wants autonomy and customization
 System-level solution must satisfy small and large data sets
 CMU must provide a framework for remote sites to extend
 Equipment budget is limited
 Developing nations’ networks are limited
 China, India output must be shipped to US
Core Problems
 Multiple scanning centers, each with:
 Distinct values and goals
 Limited connectivity
 Varying IT infrastructure
 Common base requirements
 Searching
 Browsing
 Viewing
 File-system compatibility
 Basic standard for acquiring and storing scanned books
 Data preservation
 Quality assurance
 Flexibility
 Openness
 Fault-tolerant storage at all sites
 Data movement via physical shipment
 Standardized OS and base software
Our Solution: Mini-UL Embedded
 Digital library on a CD
 OS (Knoppix Linux), servers (Apache, PaperSight ImageServer), code (Perl) on single ISO
 Boots single systems or whole clusters
 Ensures standardization, eases upgrades
 To use new software, admins burn CD and reboot
 Commodity PC and disk hardware spec
 Software RAID: Use low-end PC as network-attach storage
 Sub-$1000 PC = 1 TB NASD
 Barebones economy PC
 250 GB OEM disk x 4
 Add storage PCs as needs grow
 1 processor per storage unit
 CD + PC(s) = Embedded digital library
 “Black box” approach
 Dump MBP-format books into upload bucket
 Easily search, browse, view, and download all books added
The MBP Book Format
 Dir w/ five subdirectories:
 OTIFF
 “Original TIFF”: exactly as scanned
 Eight-digit, zero-padded page numbers (00000123.tif)
 1-bit color at 600 DPI, lossless
 PTIFF
 “Processed TIFF”: current best batch image processing
 Eight-digit zero-padded numbers match OTIFF
 TXT
 ASCII, UTF-8, or UTF-16 text
 Numbers match OTIFF/PTIFF
 HTML
 UTF-8 HTML w/ low-res JPEG images
 Numbers match OTIFF/PTIFF
 [MARC|DC]
 Binary MARC record
 Dublin Core XML
 Flexible: other format directories can be added
 Internal storage format:
 OTIFF/PTIFF -> multipage
 TXT/HTML -> zip
 500 page book = 2001 files
 Converted at addition time to 5 files
 Speeds copying
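The naming rule and the file-count arithmetic above can be sketched in a few lines of Python. This is an illustrative sketch only, not code from the Mini-UL distribution: the function names are hypothetical, and it assumes one file per page in each of the four page directories plus a single metadata file, which is the reading that yields 2001 files for a 500-page book.

```python
# Per-page format directories named on the slide.
PAGE_DIRS = ("OTIFF", "PTIFF", "TXT", "HTML")


def page_filename(page: int, ext: str) -> str:
    """Return an eight-digit, zero-padded page file name such as 00000123.tif."""
    if not 0 <= page <= 99_999_999:
        raise ValueError("page number must fit in eight digits")
    return f"{page:08d}.{ext}"


def raw_file_count(pages: int, metadata_files: int = 1) -> int:
    """Files in an unpacked book: one per page per directory, plus metadata (assumed)."""
    return pages * len(PAGE_DIRS) + metadata_files


def packed_file_count(metadata_files: int = 1) -> int:
    """Files after conversion: one container per page directory, plus metadata."""
    return len(PAGE_DIRS) + metadata_files


print(page_filename(123, "tif"))  # 00000123.tif
print(raw_file_count(500))        # 2001
print(packed_file_count())        # 5
```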
High-level Cluster Architecture
[Diagram: web traffic enters at the head node (NASD 0), which connects over an internal network subnet to NASD 1, NASD 2, …, NASD n. The NASDs are Network-Attach Storage Devices (SATA RAID PCs).]
Adding a Book
 Head node has SMB share “Upload”
 User moves one or more MBP-format books into Upload share
 System automatically checks each book for completeness/correctness:
 All formats present
 Contiguous page numbers
 Metadata present and parseable
 Errors presented to user for correction
 Converts to internal storage format
 Assigns serial number
 Moves to NASD node with most free space
 Incremental search index
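The intake checks and the placement rule above might look roughly like the following Python sketch. It is not the Mini-UL code (which is Perl): all function names are hypothetical, the metadata directory is assumed to be named either MARC or DC, and the sketch only tests that metadata is present, not that it parses.

```python
import re
from pathlib import Path

REQUIRED_DIRS = ("OTIFF", "PTIFF", "TXT", "HTML")
PAGE_NAME = re.compile(r"^(\d{8})\.[A-Za-z0-9]+$")  # e.g. 00000123.tif


def page_numbers(directory: Path) -> list[int]:
    """Collect the distinct page numbers found in one format directory."""
    found = set()
    for entry in directory.iterdir():
        match = PAGE_NAME.match(entry.name)
        if match:
            found.add(int(match.group(1)))
    return sorted(found)


def missing_pages(pages: list[int]) -> list[int]:
    """Return the gaps in a sorted page list; an empty result means contiguous."""
    if not pages:
        return []
    return sorted(set(range(pages[0], pages[-1] + 1)) - set(pages))


def check_book(book: Path) -> list[str]:
    """Return human-readable errors for one uploaded book directory."""
    errors = []
    reference = None
    for name in REQUIRED_DIRS:
        directory = book / name
        if not directory.is_dir():
            errors.append(f"missing directory: {name}")
            continue
        pages = page_numbers(directory)
        gaps = missing_pages(pages)
        if not pages:
            errors.append(f"{name}: no page files")
        elif gaps:
            errors.append(f"{name}: missing pages {gaps}")
        if reference is None:
            reference = pages
        elif pages != reference:
            errors.append(f"{name}: page numbers differ from earlier directories")
    # Assumption: metadata lives in a directory named MARC or DC.
    if not any((book / meta).is_dir() for meta in ("MARC", "DC")):
        errors.append("missing metadata directory (MARC or DC)")
    return errors


def pick_node(free_bytes: dict[str, int]) -> str:
    """Choose the storage node reporting the most free space."""
    return max(free_bytes, key=free_bytes.get)


print(missing_pages([1, 2, 4]))                            # [3]
print(pick_node({"nasd1": 10, "nasd2": 30, "nasd3": 20}))  # nasd2
```

Returning a list of errors instead of raising on the first one matches the slide's "errors presented to user for correction": the operator sees every problem with a book in one pass.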
Viewing a book
 Users view original page images
 HTML, raw TXT as option
 Intra-book searching
 Seeks to matching page
 Highlights token match
 Rapidly seek from one token match to the next
 Boolean queries, phrase matching
 PaperSight ImageServer
 Convert 600 DPI 1-bit TIFF to ~96DPI 8-bit GIF
 Real-time conversion performance is faster than human response
 Anti-aliased grey-scale image is ideal for monitor reading
 Significant reduction in bandwidth
 Conversion happens on hosting NASD node, not head
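To illustrate why a downsampled 1-bit scan becomes an anti-aliased grey-scale image, here is a minimal box-filter sketch in Python. It is not PaperSight ImageServer's algorithm, which the slides do not describe. It assumes an integer reduction factor for simplicity, whereas 600 DPI to 96 DPI is a factor of 6.25, so a real converter would need a proper resampler. The function name and the convention that 1 means ink are also assumptions.

```python
def downsample_1bit(bitmap: list[list[int]], factor: int) -> list[list[int]]:
    """Box-filter a 1-bit bitmap (1 = ink) to 8-bit grey (0 = black, 255 = white).

    Each output pixel is the average ink coverage of a factor x factor block,
    so partially covered blocks become intermediate greys (anti-aliasing).
    """
    height = len(bitmap) // factor
    width = len(bitmap[0]) // factor
    out = []
    for by in range(height):
        row = []
        for bx in range(width):
            ink = sum(
                bitmap[by * factor + dy][bx * factor + dx]
                for dy in range(factor)
                for dx in range(factor)
            )
            row.append(255 - round(255 * ink / (factor * factor)))
        out.append(row)
    return out


page = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
]
print(downsample_1bit(page, 2))  # [[0, 255], [191, 191]]
```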
Browsing
 Simple alphabetic browse
 Keep list sizes small
The Missing Piece: Search
 Searching the full text of tens of thousands of books is computationally intensive
 Solution: parallelize
 Each NASD node indexes and searches content it stores
 Results are unified and sorted at head node
 NASD cluster architecture maintains parity between processors and storage
 Grow from n to 2n nodes? Search speed remains constant (assuming homogeneous corpus)
 Search too slow? Increase machine count and redistribute data.
 Search features
 Fast! 0.1 sec per-token response in most cases (AMD 1400+).
 Joint bibliographic and full-text search with single query
 Phrase matching, boolean queries, cross-page phrases
 Context display for full-text matches
 Rich scoring system:
 Metadata matches
 Token proximity scoring (multi-token queries only)
 Direct-to-page matching
 Full text matches yield actual matching page, with highlighting
 Full search API (Perl)
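The parallel pattern described above, where each node searches its own content and the head node unifies and sorts the results, can be sketched as a scatter-gather in Python. This is an assumption-laden illustration, not the Mini-UL search API (which is Perl): `Hit`, `search_cluster`, and `make_node` are hypothetical names, the nodes are in-memory stand-ins for network calls, and the score is a plain token count in place of the richer scoring the slide lists.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, NamedTuple


class Hit(NamedTuple):
    score: float
    book_id: str
    page: int


NodeSearch = Callable[[str], list[Hit]]


def search_cluster(query: str, nodes: list[NodeSearch], limit: int = 10) -> list[Hit]:
    """Send the query to every node in parallel, then merge and sort at the head."""
    with ThreadPoolExecutor(max_workers=max(1, len(nodes))) as pool:
        partials = list(pool.map(lambda node: node(query), nodes))
    merged = [hit for part in partials for hit in part]
    # Highest score first; book and page break ties deterministically.
    merged.sort(key=lambda h: (-h.score, h.book_id, h.page))
    return merged[:limit]


def make_node(pages: dict[tuple[str, int], str]) -> NodeSearch:
    """Build a toy node that scores each stored page by token count."""
    def search(query: str) -> list[Hit]:
        token = query.lower()
        hits = []
        for (book_id, page), text in pages.items():
            count = text.lower().split().count(token)
            if count:
                hits.append(Hit(float(count), book_id, page))
        return hits
    return search


node1 = make_node({("book-0001", 12): "the whale and the sea",
                   ("book-0001", 13): "calm sea"})
node2 = make_node({("book-0002", 5): "sea sea sea"})
results = search_cluster("sea", [node1, node2])
print([(h.book_id, h.page, h.score) for h in results])
# [('book-0002', 5, 3.0), ('book-0001', 12, 1.0), ('book-0001', 13, 1.0)]
```

Because each hit carries its page number, the merged list supports the direct-to-page matching mentioned above without a second lookup.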
Customization
 APIs provided for all major components:
 Search
 Book Reader
 Metadata processing and conversion
 All HTML lives in read-write space on head node
 Development sites can create rich HTML hierarchies
 Scripting is not limited to CD contents
 cgi-bin and site_perl can be extended
 CD/core upgrades leave extensions untouched
Future Directions
 Search engine in wider distribution
 GPL
 Perl CPAN
 “Phone Home” capability
 Individual Mini-UL systems with slow but persistent links relay manifests
 Metadata + text
 Master site to search all sites
 IIIT Hyderabad contributions
 MySQL-based metadata search
 Separate search and storage clusters
 9 TB hardware RAID servers
 Multiple diskless search nodes
Embedded Digital Library Uses
 Gives MBP sites foundation on which to build
 Allows convergence on standards as sites contribute new extensions to main distribution
 Gives basic search, browse, view, and audit capability to any site, regardless of development staff
 Uses extend beyond MBP deployments
 Any site with archives of multi-page text documents can benefit
 Only requirements are a scanner and a PC
 Virtually no administration required
Questions?