Million Book Project:
Vision Becoming Reality
Gabrielle Michalek, Carnegie Mellon
Presentation to Carnegie Mellon Qatar Library November 9 & 10, 2005
Vision

“To attempt to understand and
solve the technical, economic,
and social policy issues of
providing online access to all
creative works of the human race.”
– Dr. Raj Reddy
What is the Million Book Project?

The Million Book Project (MBP) is a
worldwide endeavor to digitize and provide
full-text searching and free-to-read access to
a million books by 2007.
Why is this important?






To share knowledge and inform citizenry
Facilitate new knowledge
Enhance student learning and success of
faculty research
Address copyright absurdities
Support digital library research
Preserve rare and fragile cultural materials
Digital library research initiatives





Machine translation
Massive distributed database
Storage formats
Use of digital libraries
Distribution and sustainability
Security
Search engines
Image processing
Optical Character Recognition (OCR)
Language processing
Copyright laws
Who is involved?







Carnegie Mellon University Libraries and the
School of Computer Science
Other U.S. libraries
OCLC, Digital Library Federation, and
College & Research Libraries
Internet Archive
U.N. Food and Agriculture Organization
India
China
Partners

Indian Institute of Science
International Institute of Information Technology
Indian Institute of Information Technology
Anna University
Mysore University
University of Pune
Goa University
Tirumala Tirupati Devasthanams
Shanmugha Arts, Science, Technology & Research Academy
Arulmigu Kalasalingam College of Engineering
Maharashtra Industrial Development Corporation

Chinese Academy of Science
Chinese Ministry of Education
Fudan University
Nanjing University
Peking University
Tsinghua University
Zhejiang University
Partners

National Science Foundation
2001    $665,600
2002    $1,000,000
2003    $1,000,000
2004    $1,000,000
2005    $58,500
for equipment and travel
Content parameters

Balance users’ wants with legality

Opportunity-driven,
many sub-collections

Some content strategies:
 Books for College Libraries
 Public domain materials
 Cultural heritage materials
Almost 500,000 books scanned to date



230,000 books in Chinese
100,000 books in Indian languages
140,000 English or western language books
Incised palm leaves from the
Saraswathi Mahal Library
Scanning in India



Established 20 scanning centers
Have scanned 200,000 books to date
Provides above-average wages and desirable jobs
Scanning in China

Established 17 scanning centers, including one in the Shenzhen Free Trade Zone
Centers are scanning indigenous materials, public domain works shipped from the U.S., and U.S. copyrighted works already in Chinese libraries (with permission granted)
Provides above-average wages and desirable jobs
Shenzhen scanning center
Million Book Project in China


Centers scan 1,000 volumes / 200,000 pages daily
270,000 volumes have been scanned to date
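The figures on this slide imply a couple of derived numbers. The short sketch below works them out. It assumes the daily rate is constant, which the slide does not claim, so the day count is only a rough lower bound on the scanning time so far. The function names are illustrative.

```python
# Figures quoted on the slide.
VOLUMES_PER_DAY = 1_000
PAGES_PER_DAY = 200_000
VOLUMES_SCANNED = 270_000


def pages_per_volume(pages_per_day, volumes_per_day):
    """Average number of pages in one scanned volume."""
    return pages_per_day / volumes_per_day


def days_at_rate(volumes, volumes_per_day):
    """Scanning days needed for `volumes` at a constant daily rate (an assumption)."""
    return volumes / volumes_per_day


print(pages_per_volume(PAGES_PER_DAY, VOLUMES_PER_DAY))  # 200.0
print(days_at_rate(VOLUMES_SCANNED, VOLUMES_PER_DAY))    # 270.0
```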
Quality control improvements



Data corruption discovered in some test-case books was caused by compressing digital files for transfer
Files are no longer compressed for transfer; more disks are used instead
Other quality control improvements in the Shenzhen scanning center and the North Technical Center in Beijing
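One common way to catch corruption introduced during transfer is to record a checksum for every file before shipping and compare it on arrival. The slides do not say whether the project did this, so the following is only a minimal sketch of the general technique. The function names and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large scans need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its checksum before transfer."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def find_mismatches(manifest, root):
    """Return the relative paths that are missing or changed after transfer."""
    root = Path(root)
    bad = []
    for rel, expected in manifest.items():
        target = root / rel
        if not target.is_file() or sha256_of(target) != expected:
            bad.append(rel)
    return bad
```

The sending site would call `build_manifest` on the disk contents and ship the manifest alongside the data. The receiving site would call `find_mismatches` and request a rescan or resend for anything it returns.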
Value of digitization


Digitization preserves fragile old or ancient books and manuscripts
Digitization benefits the worldwide public as well as academic communities by sharing knowledge that is otherwise unavailable to citizens
Standards and workflow

National standards for digital preservation
www.imls.gov/pubs/forumframework.htm

National standards for cataloging

Documented workflow & training developed
and provided by Carnegie Mellon University
Libraries
Digitization workflow





Operators scan, postprocess, and OCR
600 dpi TIFFs
Scan-Fix
ABBYY FineReader
Technicians capture metadata
Sustaining the collection

Goal: Ten organizations host collection



Cost is ~$1M per host site
Collection is ~20 terabytes
Current host sites:





Digital Library of India
Universal Library, China
Universal Library, Carnegie Mellon
Internet Archive
UC Merced
Thank you

Gabrielle Michalek, Head of Archives & Digital
Library Initiatives, Carnegie Mellon University
Libraries, [email protected]