Re-envisioning (and Re-purposing) Collections:
Mass Digitization, Google, and the HathiTrust
Ivy Anderson
CDL
CDL Users Council Meeting
April 10, 2009
I have always imagined that paradise will be a kind of library
- Jorge Luis Borges
Diderot’s Encyclopédie, 1751 - 1772
UC Holdings By Format (Adjusted for Duplication)
• Computer Files: 0.1%
• Electronic Resources: 1.2%
• Pictorial Items: 19.9%
• Multimedia: 0.9%
• Government Documents: 1.3%
• Bound Volumes: 38.8%
• Pamphlets: 2.0%
• Archival Materials: 0.6%
• Microforms: 32.5%
• Maps: 2.4%
• Current Serials: 0.2%
Usage of Library Materials at UC (2007)
• E-Resource Usage, 2007 (partial data): 39,293,150
• Circulation Transactions: 3,624,662
• ILL Borrowing Transactions: 225,690
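To put the three usage figures above on a common scale, here is a minimal sketch that divides the e-resource count by each of the other two. The dictionary keys and the helper name `times_larger` are illustrative, not from the slides, and the comparison is rough: e-resource "usage" (partial data) and circulation or ILL "transactions" are not the same unit of measure.

```python
# Figures copied from the "Usage of Library Materials at UC (2007)" chart.
usage_2007 = {
    "e_resource_usage": 39_293_150,   # partial data, per the chart label
    "circulation": 3_624_662,
    "ill_borrowing": 225_690,
}


def times_larger(counts, baseline):
    """Return how many times larger the baseline count is than each other count."""
    base = counts[baseline]
    return {
        name: base / value
        for name, value in counts.items()
        if name != baseline
    }


ratios = times_larger(usage_2007, "e_resource_usage")
for name, ratio in ratios.items():
    print(f"e_resource_usage is {ratio:.1f}x {name}")
```

On these numbers, e-resource usage is roughly an order of magnitude above circulation and two orders of magnitude above ILL borrowing.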
…and Along Came Google
• Google Library Project
– 2005: The ‘Google Five’: Harvard, Oxford, New York Public Library,
Stanford, University of Michigan
– 2009: 22 library partners in 5 countries
• Google Publisher Partner Program
…and the Open Content Alliance
• October 2005
– Founders: Internet Archive, University of California, U of Toronto…
– Large-scale digitization of out-of-copyright works only
– A project of the Internet Archive
…and Microsoft
Out-of-Copyright Works Only
UC Mass Digitization Projects
• Founding Member of Open Content Alliance: October 2005
• UC Joins Google Library Project: August 2006
• Microsoft Digitization Agreement: March 2007, July 2008
So: Two Projects, One Goal
• Goal: Mass digitization of library book collections
• Google
– In-copyright and out-of-copyright works
– All languages
– Available via Google search engine and Google Book Search
• Internet Archive / Open Content Alliance
– Out-of-copyright works only
– Primarily English language, some Romance languages now
– Available via the Internet Archive and Open Library websites to any
and all search engines
– Library and grant-funded
Why Are They Doing It?
• Google’s vision:
– To put all the world’s information online
– To gain market share and competitive advantage for
their search (and online advertising) services
– It’s all about Search
• Internet Archive: To put the world’s information
online, for free, forever
– It’s all about the public good
Why are we doing it?
• Improve discovery
– Indexing the full text of every book and making that full text available via Google and other
search engines makes our books easier to find by placing them where the users are.
• Fulfill our public service mission
– Many books of enduring general interest that are in the public domain – including classic
works of literature but also more unique items such as early histories of the settlement of
California and the West – can now be read by anyone, anywhere, anytime.
• Enhance student and faculty research
– Scholars can trace the evolution of ideas and perform other sophisticated textual analysis
more easily when the full text is indexed and searchable by computer, opening scholarship in
new ways.
• Preserve and protect our collections
– In earthquake and fire-prone California, digitizing books in our collections may also help
protect the university from catastrophic loss should disaster someday strike our libraries.
• Support collection management
– By making our collections more available digitally, we can explore more efficient and effective
ways to manage our print collections.
Internet Archive: UC
Contributors
• Northern Regional Library Facility (NRLF)
• Southern Regional Library Facility (SRLF)
• UC Berkeley, Bancroft Library
• UCLA
• UC Davis
Google Project: UC Contributors
• Northern Regional Library Facility
(NRLF) + UC Berkeley Systems
• UC Santa Cruz
• UC San Diego
CDL’s role, on behalf of UC
• Liaison with partners
• Planning &
coordination
• Funding
• Stewardship of digital
content
• New services
Campuses Provide the Books
The Book Digitization Process
• A world of
barcodes, logistics,
loading docks,
packing materials,
and scanning
machines!
Digital files
• Images
• OCR - Text
• OCR - Page
coordinates
• Metadata
Reasons books might get rejected
(images)
What subjects are being
digitized?
• Cookbooks
• Children’s books
• American history
• Humanities
• Science
• East Asian & Pacific Rim collections
• Languages
Where can you access the
books?
• Google Book Search: http://books.google.com/
• Internet Archive:
http://www.archive.org/details/university_of_california_libraries
• Melvyl and WorldCat Local
– Via Google API
• And eventually…
– Google Institutional Subscription
– HathiTrust
What does the future hold?
• Additional access via the Google
Settlement
• Access and Preservation via HathiTrust
Google Settlement: Key Facts
• In October 2008, Google settled a class action lawsuit brought by
organizations representing authors and publishers, who claimed that
Google’s library scanning program violated their copyrights. Google
has always claimed that this was fair use and legitimate under
copyright law.
• The Settlement must be approved by the courts in order to become
effective. At this time we do not know if the court will approve the
Settlement, although we hope to know more after a court hearing in
June.
• At this time, everything we say about the effect of the Settlement
should be considered preliminary and provisional.
• UC will continue to digitize books from its collections with Google
regardless of whether the Settlement goes forward.
Benefits of the Settlement
• Public Access terminals in public libraries across the country that
will allow the general public to find and read books that are out of
print or in the public domain
• An Institutional Subscription that will allow UC students and
faculty and other academic libraries to access the full text of millions
of out of print books digitized from libraries around the world.
– Books in the institutional subscription will have persistent links for use in
electronic course reserves, course reading lists, etc.
• A Research Corpus that will support advanced computational
research on the full text of millions of books that Google has
digitized
• Services for visually-impaired users to read and access all of the
volumes Google has scanned
Existing Google services will also
remain available
• Google Book Search will continue to make the full text of all books
searchable
– “Find in a library” pointers will lead users to the copies in our libraries
– More books in GBS will be enabled for full text viewing, and still more
will have ‘preview’ mode enabled, for better browsability
• UC will receive copies of all of the books scanned from our
collections
– Use of the digital files will depend on the copyright status of the book
– At a minimum, we will be able to use the digital files to replace missing
or deteriorated copies in our collection when needed
– These copies will be stored in the HathiTrust shared repository
• Books in the public domain can be used and downloaded freely
by scholars and the general public
– Libraries can share their copies with other academic institutions for
scholarship and research
What the Settlement won’t allow
• There are a few things we won’t be able to do
with our own digital copies of the Google books
• We cannot:
– Use in-copyright books for interlibrary loan or
e-reserves
• Reserve links will be possible from the institutional
subscription
– Allow full text viewing of in-copyright works
• This will be possible through the institutional subscription
– Allow access via 3rd-party search engines and
automated crawlers
Is the Settlement a good thing or a
bad thing?
• The Google Settlement is not without controversy. Some people are
concerned that it will:
– Give Google a monopoly over book digitization and suppress
competition
– Allow Google to charge high prices for subscriptions
– Create an artificial market for orphan works, preventing orphan works
legislation from being passed that might lead to more open sharing of
those works
• Orphan works = works still under copyright whose copyright owners cannot
be identified or located
• On balance, UC supports the Settlement. While not perfect, UC
believes that the Google Library Partner Program and the Google
Settlement will result in greatly improved access to the millions of
books residing in research library collections, both for libraries and
the general public.
But we’re not just banking on
Google…
http://www.hathitrust.org
Currently Digitized
2,790,739 volumes
976,758,650 pages
104 terabytes
33 miles
2,267 tons
434,390 volumes (~16% of total)
in the public domain
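A quick arithmetic check of the figures on this slide, as a minimal sketch: it recomputes the public-domain share from the two volume counts and derives the average page count per volume. The variable and function names are illustrative, not part of any HathiTrust tooling.

```python
# Figures copied from the "Currently Digitized" slide.
volumes = 2_790_739
pages = 976_758_650
public_domain_volumes = 434_390


def public_domain_share(pd_volumes, total_volumes):
    """Return the public-domain fraction of all digitized volumes."""
    return pd_volumes / total_volumes


share = public_domain_share(public_domain_volumes, volumes)
pages_per_volume = pages / volumes

print(f"Public domain share: {share:.1%}")                  # 15.6%
print(f"Average pages per volume: {pages_per_volume:.0f}")  # 350
```

The share rounds to the "~16%" shown on the slide. The page total works out to exactly 350 pages per volume, which suggests (though the slide does not say so) that the page count is an estimate derived from the volume count rather than a measured figure.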
What is the HathiTrust?
• A shared digital repository for mass digitized books
formed in October 2008
• Members:
– University of Michigan
– Indiana University
– University of California
– CIC Libraries (Committee on Institutional Cooperation) – “Big
Ten+” schools
– University of Virginia
– More institutions may join in future
• Where are the digitized files stored?
– Servers at the University of Michigan and Indiana University
– Additional mirror sites may be developed in future
Why is UC participating?
• Economy of scale
– Storing mass digitized books is expensive – many terabytes of data
• Stewardship and preservation of UC resources
– Will bring our Google and Internet Archive books together in one preservation
repository under UC control - we can’t leave it all to Google and other third
parties
• Access to our own books
– UC will be able to link to full text in HathiTrust from Melvyl and WorldCat Local
and build its own access interfaces via the HathiTrust API
• Aggregate multiple library collections for greater research impact
– With UC, nearly 5 million books and counting
– ¾ million books in the public domain
– HathiTrust will support shared access and search mechanisms across all partner
content to the extent possible
• Experiment with large-scale search, text mining, and other specialized
services developed with academic users in mind
– Google and Internet Archive are building services for the general user
– Research libraries will build services optimized for academic users
When will all this be available?
• UC Google books will be ingested into HathiTrust over
the next several months
– UC Internet Archive books will follow
• CDL is beginning to investigate access mechanisms in
concert with Michigan and other HathiTrust partners
– Planning discussions are underway with OCLC for a HathiTrust
catalog based on WorldCat Local
– APIs will allow UC to add links to books via WorldCat Local
– “Collection builder” functionality will allow librarians and
individual end users to create and share specific themed
collections
– More advanced search and text mining to follow
• Building robust services will take time
What about our beloved print
collections??
Current Picture: UC Collections
[Diagram: UC Collections, with boxes labeled Digitized Materials, Data, Digital Special Collection, WAS, Licensed Content, Print (repeated for multiple collections), and RLF (twice)]
Future Picture: UC Collections
[Diagram: UC Collections, with boxes labeled Data, Digitized Materials, Digital Special Collection, WAS, Licensed Content, Print (repeated for multiple collections), RLF (twice), and External Repository (twice)]
Print is not going away!
• Some books will always have artifactual value
• But…Mass Digitization:
– Creates collection management opportunities
• Can we store print more economically and deliver it to users on demand?
• Can digital surrogates allow us to reduce duplication among our physical
collections?
• Can we develop shared print repositories with other research institutions to
mirror our shared electronic repositories?
• Can we optimize the use of our valuable library space through better print
collection management?
– Will allow us to better understand our users’ needs for print vs. digital
collections
– Will help us shape the library of the 21st Century for the 21st Century
user