The Million Book Digital Library Project
Raj Reddy, Jaime Carbonell, Michael Shamos, Gloriana StClair
Carnegie Mellon University
Pittsburgh, Pa. USA
November 5, 2003
The Grand Challenge
Create Access to
• All published works online
• Instantly available
• In any language
• Anywhere in the world
• Searchable, browsable, navigable
• By humans and machines
The Challenge:
One Step at a Time…
• Million Book DL
– Only about 1% of all the world’s books
• Harvard University 12M
• Library of Congress 30M
• OCLC catalog 42M
• All Multilingual Books ~100M
• At the rate of digitization of the last decade it
would take 100 years!
Million Book Project: Issues
• Time
– At one page per second (20,000 pages per day
shift), it will take 100 years (200 working days per
year) to scan a million books of 400 pages each
• Cost
– 100M books at US$100 per book would cost $10B
– Even in India and China the cost will be $1B
– The annual cost is currently expected to be close to
$10M, with support from the US, India and
China.
• Selection
– Selection of appropriate books for scanning is time
consuming and expensive
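The time and cost figures above can be checked with a few lines of arithmetic. This is a minimal sketch using only the numbers on the slide; the function and variable names are illustrative, not part of the project.

```python
# Figures taken from the slide; all names here are illustrative.
PAGES_PER_DAY = 20_000          # roughly one page per second over one day shift
WORKING_DAYS_PER_YEAR = 200
BOOKS = 1_000_000
PAGES_PER_BOOK = 400


def years_to_scan(books, pages_per_book, pages_per_day, days_per_year):
    """Years needed for a single scanner working one shift per working day."""
    total_pages = books * pages_per_book
    pages_per_year = pages_per_day * days_per_year
    return total_pages / pages_per_year


def total_cost(books, cost_per_book):
    """Total cost in dollars for a given per-book cost."""
    return books * cost_per_book


scan_years = years_to_scan(BOOKS, PAGES_PER_BOOK, PAGES_PER_DAY, WORKING_DAYS_PER_YEAR)
cost_100m = total_cost(100_000_000, 100)          # 100M books at $100 each
implied_low_cost_per_book = 1_000_000_000 / 100_000_000  # $1B spread over 100M books

print(scan_years)                 # 100.0 years
print(cost_100m)                  # 10000000000 dollars, i.e. $10B
print(implied_low_cost_per_book)  # 10.0 dollars per book
```

The $1B estimate for India and China therefore corresponds to about $10 per book, a tenth of the US$100 figure.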
Million Book Project: Issues (cont)
• Logistics
– Each container holds 10,000 to 20,000 books.
Shipping and handling costs about $10,000
• Metadata
– Accessing and/or creating metadata requires
professionals trained in library science
• Optical Character Recognition Technology
– Essential for searching, translation and
summarization
– Many languages don’t have OCR
21st Century Computing
• Exponential advances in Information and
Communication Technologies will result in
– innovations that will transform the way we live,
learn and work.
– In retrospect, these transformations will be seen
as revolutionary by future generations
Exponential Growth Trends in Computer Performance
[Figure: performance in MIPS (100 to 1,638,400, each tick doubling) versus year (1994 to 2020). Curve annotations: "Doubling every 2 years" and "Doubling every 15 months". Milestones marked: Giga PC, 10G PC, 100G PC, Tera PC.]
Technology Trends
• A Giga-PC in 2002
– Billion operations per second
– Billion bits of memory
– Billion bits per second network bandwidth
– Less than $2k
• A Tera-PC by the year 2015
• A Peta-PC by the year 2030
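The milestones above imply a doubling period, which can be worked out directly. The sketch below assumes each step (Giga to Tera, Tera to Peta) is a clean factor of 1000; the function name is illustrative.

```python
import math


def doubling_time_months(factor, years):
    """Months per doubling if capability grows by `factor` over `years`."""
    doublings = math.log2(factor)
    return 12 * years / doublings


giga_to_tera = doubling_time_months(1000, 2015 - 2002)   # 13 years
tera_to_peta = doubling_time_months(1000, 2030 - 2015)   # 15 years

print(round(giga_to_tera, 1))  # about 15.7 months
print(round(tera_to_peta, 1))  # about 18.1 months
```

The Giga-to-Tera step is consistent with a doubling period of roughly 15 months; the Tera-to-Peta step assumes slightly slower growth.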
What do we do with all this power?
• Social systems not affected:
– Food we eat
– Clothes we wear
– Mating rituals
• The processing will transform the way we
– Live
– Learn
– Work, and
– Communicate
Trends in Magnetic Disk Memory
• Densities doubling every 12 months
• Thousand-fold improvement every 10 years
• 100 GB disk memory costs ~$100 (2003)
• 100 GB can be used to store
– 20 movies, 500 paintings, 5000 songs of MP3 music and
25000 books
– larger than most of our personal collections at home
• By 2013, 100 Tera Bytes cost ~ $100
– A personal Library of several million books, a lifetime
collection of music and videos– all on our home PC
• By 2025, 100 Peta Bytes ~ $100
– Infinite amount of memory for all practical purposes
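The link between "doubling every 12 months" and "thousand-fold every 10 years" is simply 2^10 = 1024. A minimal sketch of the projection, assuming the doubling rule holds exactly and the price stays at about $100 (names are illustrative):

```python
GB = 10**9


def capacity_after(start_bytes, years, doubling_years=1.0):
    """Capacity available at the same price after `years` of steady doubling."""
    return start_bytes * 2 ** (years / doubling_years)


start = 100 * GB                      # 100 GB for ~$100 in 2003 (from the slide)
ten_year_factor = 2 ** 10             # ten doublings
cap_2013 = capacity_after(start, 10)  # bytes per ~$100 in 2013
cap_2023 = capacity_after(start, 20)  # bytes per ~$100 in 2023

print(ten_year_factor)   # 1024, i.e. roughly thousand-fold
print(cap_2013 / 10**12) # about 102 TB
print(cap_2023 / 10**15) # about 105 PB
```

Under a strict 12-month doubling the 100 PB mark would arrive around 2023, so the 2025 date on the slide leaves a small margin for slower growth.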
What do we do with a Peta Byte?
• Capture everything you ever said
– From the moment of birth
– To the moment you die
– Takes less than 1% of a Peta Byte !!
• Everything you did or experienced
– can be captured in living color
– with only a few Peta bytes
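A rough estimate shows why a lifetime of speech fits in under 1% of a petabyte. Every input below is an assumption made for illustration, not a figure from the slides: an 80-year life, recording around the clock, and speech compressed to 8 kbit/s.

```python
# All three inputs are assumptions for illustration only.
LIFETIME_YEARS = 80
SPEECH_BITS_PER_SECOND = 8_000      # assumed low-bitrate speech codec
SECONDS_PER_YEAR = 365.25 * 24 * 3600

PETABYTE = 10**15

lifetime_seconds = LIFETIME_YEARS * SECONDS_PER_YEAR
speech_bytes = lifetime_seconds * SPEECH_BITS_PER_SECOND / 8
fraction_of_petabyte = speech_bytes / PETABYTE

print(speech_bytes / 10**12)   # about 2.5 TB
print(fraction_of_petabyte)    # about 0.0025, i.e. 0.25% of a petabyte
```

Even recording continuously rather than only while speaking, the total stays well below the 1% mark under these assumptions.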
Advances in Fiber Optic Technology
• 1.6 Tera bits per second on a single fiber
– 160 wavelengths each at 10 Gbps
– Dense Wavelength Division Multiplexing (DWDM)
• What can you do with 1.6 Tera bits per second ?
– 20 HDTV movies
– 200 regular full-length movies
– 30000 hours of MP3 music
– In one second on a single fiber!
– 20 minutes to transmit ALL books in the Library of Congress!
– ALL phone calls on a single fiber with room to spare!
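The fiber figures can be unpacked into the data sizes they imply. The sketch below uses only the numbers on the slide; the per-item sizes are derived values, not stated in the original, and the names are illustrative.

```python
WAVELENGTHS = 160
GBPS_PER_WAVELENGTH = 10

# Aggregate capacity of one fiber in bits per second.
total_bps = WAVELENGTHS * GBPS_PER_WAVELENGTH * 10**9


def implied_bytes_per_item(bits_per_second, seconds, items):
    """Size of each item if `items` of them fit in `seconds` of transmission."""
    return bits_per_second * seconds / 8 / items


hdtv_movie_bytes = implied_bytes_per_item(total_bps, 1, 20)
regular_movie_bytes = implied_bytes_per_item(total_bps, 1, 200)
loc_bytes = total_bps * 20 * 60 / 8   # everything sent in 20 minutes

print(total_bps / 10**12)            # 1.6 Tbit/s
print(hdtv_movie_bytes / 10**9)      # 10.0 GB per HDTV movie
print(regular_movie_bytes / 10**9)   # 1.0 GB per regular movie
print(loc_bytes / 10**12)            # 240.0 TB for the Library of Congress
```

Spread over the 30M books cited earlier for the Library of Congress, 240 TB works out to about 8 MB per book.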
Bottlenecks in Infinite Data Transmission
• Main bottleneck is not fiber bandwidth
• It is:
– Bus bandwidth
– Router capacity and speed
– Speed of light!
– Round-trip delay times in TCP/IP
– At terabit rates with round-trip times of about 30 ms
across the US, 30 billion bits would have been
transmitted before an acknowledgement is
received
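The 30-billion-bit figure is the bandwidth-delay product: link rate multiplied by round-trip time. A minimal sketch with the slide's numbers (function name is illustrative):

```python
def bits_in_flight(rate_bits_per_second, round_trip_ms):
    """Bits sent before the first acknowledgement can arrive."""
    return rate_bits_per_second * round_trip_ms // 1000


TERABIT_PER_SECOND = 10**12
in_flight = bits_in_flight(TERABIT_PER_SECOND, 30)   # 30 ms across the US

print(in_flight)            # 30000000000 bits
print(in_flight / 8 / 1e9)  # 3.75 GB that must be buffered for retransmission
```

A sender that waits for acknowledgements needs a window at least this large to keep such a link full.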
Technology Trends
• Exponential doubling of memory and
bandwidth will continue for 10 to 20 years
– Leading to the availability of
• Peta-byte disks
• Peta-bytes per second bandwidth
• At a cost of pennies per day
• Leading to Changes in Computer Science and
Theory of Algorithms, and
• Leading to new innovative uses of Information
Technology for the benefit of Society, such as
– eLearning: Universities without walls
– Ubiquitous Access to Knowledge: Digital Libraries
– Telemedicine
Access to Information in the 21st Century
• Maxim: Access to all human knowledge
anytime anywhere
• Access, query, and print any book, magazine,
newspaper, video, data item, or reference
document
– regardless of language
• Challenges in data access
– High bandwidth networking for multimedia access
– Intellectual property protection while facilitating
access
– Intelligent information retrieval
– Delivery and protection of critical information
Million Book Project: Status
• 15 centers in India
• 14 centers in China
• 1 center in Egypt
• Planned: Australia and Europe
• About 78,000 books scanned
– About 50,000+ accessible on the web
– Uses 4TB of storage
– 10 TB server at CMU Library planned for July 2004
– 100,000 books by the end of 2004
– Capacity to scan a million pages a day expected to be
operational by the end of 2004
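The status figures allow a quick estimate of storage per book and of how long a million books would take at the planned scanning capacity. This sketch assumes 400 pages per book, as on the Issues slide, and treats the 4 TB as holding all 78,000 scanned books; both are assumptions for the estimate.

```python
TB = 10**12

BOOKS_SCANNED = 78_000
STORAGE_BYTES = 4 * TB
PAGES_PER_DAY_PLANNED = 1_000_000
PAGES_PER_BOOK = 400            # assumed, as on the Issues slide

mb_per_book = STORAGE_BYTES / BOOKS_SCANNED / 10**6
books_per_day = PAGES_PER_DAY_PLANNED // PAGES_PER_BOOK
days_for_million_books = 1_000_000 // books_per_day

print(round(mb_per_book, 1))     # about 51.3 MB per book
print(books_per_day)             # 2500 books per day
print(days_for_million_books)    # 400 scanning days
```

At a million pages a day, the 100-year single-scanner estimate shrinks to roughly two years of working days.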
Title: Rig Veda
Author: Pandit Sriram Sharma Acharya
Language: Sanskrit
Subject: Philosophy
Publisher: Sanskriti Sansthan Bareli
Year: (not given)
Abstract: Rig Veda is the oldest of the Vedas. The Rig Veda is the
oldest book in Sanskrit or any Indo-European language. Many
great Yogis and scholars who have understood the astronomical
references in the hymns date the Rig Veda as before 4000 B.C.,
perhaps as early as 12,000 B.C. Modern western scholars date it
around 1500 B.C., though recent archaeological finds in India
(like Dwaraka) now appear to require a much earlier date.
Title: Elementary Treatise on the Wave-Theory of Light
Author: Humphrey Lloyd, D.D., D.C.L.
Language: English
Subject: Physics
Publisher: Longmans, Green & Co
Year: 1873
Abstract: This book deals with the various aspects of the wave
theory of light. It is a critical work which contains an
analytical discussion of the most recent researches in
Optics. It presents a clear and connected view of the subject.
Title: Beauties from Kalidas
Author: Keshav Appa Padhye
Language: Sanskrit
Subject: Poetry
Publisher: (not given)
Year: 1927
Abstract: A collection of some of the best works of Kalidas, ancient
India’s most famous Sanskrit poet. Abhignyana Sakuntalam, Kumara
Sambhavam, and Ritu Samhara are some of the renowned works of
Kalidas.
Title: Gems, Jewels, Coins and Medals Ancient & Modern
Author: Archibald Billing
Language: English
Subject: Fine Arts
Publisher: Daldy, Isbister & Co
Year: 1875
Abstract: This volume gives a detailed description of the
varied types of fine arts dealing with precious stones,
jewelry and sculpture.
Title: Mudalayiram Mulamum
Author: Periya Jeeyar
Language: Tamil
Subject: Religion
Publisher: Sri Vaishnava Sampirathaya Sanjeevikiri Sabayai
Year: 1909
Abstract: This volume is written in Tamil. It provides a detailed
account of the origin of Vaishnava and is written by Periya Jeeyar.
Title: Gulzar-A-Badesha
Author: Khader Badesha
Language: Urdu
Subject: Literature
Publisher: Namipress, Chennai
Year: 1919
Abstract: Literature
Title: Jawahar Ali Joyviyah
Author: Dr. Ilyas lomas
Language: Arabic
Subject: Metrology
Publisher: Bakri and Issa
Year: 1876
Abstract: It is a book on Metrology, the study of measurements.
Title: Panchatantramu
Author: Narayana Kavi
Language: Telugu
Subject: Moral Stories
Publisher: Vavilla Ramaswamy and Sons
Year: 1912
Abstract: It is a compilation of stories told by a guru to his royal
students, each story teaching a moral. Most of the characters
in the stories are animals. The book served as an excellent
guide to prospective kings in their everyday life, including
their behaviour and their choice of friends. It also is a
great asset to parents to teach ethics to their children.
Title: Bharateeya Smritigalu
Author: Vidwan Ragu Sutta
Language: Kannada
Subject: Biographical Notes
Publisher: Hemantha Sahitya
Year: (not given)
Abstract: Compilation of Ancient Memories
Title: The Fauna of British India including Ceylon and Burma
Author: Lt. Col. J. Stephenson
Language: English
Subject: Biology
Publisher: Taylor and Francis
Year: 1929
Abstract: Biological notes on fauna and insects compiled during
British India
Title: Harijan: A Journal of Applied Gandhism, 1933-1955
Author: Joan Bondurant (introduction)
Language: English
Subject: Philosophy
Publisher: Garland Publishing Inc.
Year: 1973
Abstract: A journal on the practical implementation of Gandhism
in everyday life
Title: Structure Des Molecules
Author: Victor Henri
Language: French
Subject: Chemistry
Publisher: Taylor and Francis
Year: 1925
Abstract: This is a unique book that explicates, in detail, the
structure of molecules and touches upon certain specific
characteristics of molecules with particular reference to
Benzene
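The sample records above all share the same seven fields, which suggests a simple record type. The sketch below is an illustration only, not the project's actual schema: the class name, the optional fields, and the filter helper are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BookRecord:
    """One catalog entry with the seven fields shown on the sample slides."""
    title: str
    author: str
    language: str
    subject: str
    publisher: Optional[str] = None   # some sample records omit it
    year: Optional[int] = None        # some sample records omit it
    abstract: str = ""


def filter_records(records, language=None, subject=None):
    """Return the records matching every criterion that is not None."""
    return [
        r for r in records
        if (language is None or r.language == language)
        and (subject is None or r.subject == subject)
    ]


# Three entries copied from the sample slides (abstracts left out for brevity).
catalog = [
    BookRecord("Panchatantramu", "Narayana Kavi", "Telugu",
               "Moral Stories", "Vavilla Ramaswamy and Sons", 1912),
    BookRecord("Structure Des Molecules", "Victor Henri", "French",
               "Chemistry", "Taylor and Francis", 1925),
    BookRecord("Beauties from Kalidas", "Keshav Appa Padhye", "Sanskrit",
               "Poetry", year=1927),
]

french = filter_records(catalog, language="French")
print([r.title for r in french])   # ['Structure Des Molecules']
```

Making publisher and year optional lets incomplete entries be stored as they are, without invented values.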
Million Book Project: Research Challenges
• Providing Access to Billions every day
– Distributed Cached Servers in every country and
region
• Easy to use interfaces for Billions
• Multilingual Information Retrieval
• Translation
• Summarization
Million Book Project: Policy Challenges
• Compensating for Creative Works
– 5% out of copyright
– 92% out-of-print and in-copyright
– 3% in-print and in-copyright
• Options
– Tax Credit
– Usage based Government funded compensation
• Analogous to Public Lending Right in UK and Australia
– Usage charges to the user
• Compulsory Licensing
• Digital Submissions to National Archives of all books
that are “born-digital”
Can we do it?
The Grand Challenge: Create Access to
• All published works online
• Instantly available
• In any language
• Anywhere in the world
• Searchable, browsable, navigable
• By humans and machines
URLs:
• http://www.ulib.org
• http://www.dli.ernet.in
• www.archive.org/texts/collection.php?collection=millionbooks

MBP May 9 2004