Digital Libraries &
Document Image Analysis
Henry S. Baird
Statistical Pattern & Image Analysis research
Information Sciences & Technologies Lab
DLs as seen by a DIA Researcher
15 years in DIA R&D
Lucky to have known/collaborated with:
PARC DL enthusiasts: Masinter, Street, Bloomberg, et al
UC Berkeley Digital Library project: Wilensky, Fateman, et al
CMU Universal Library project: Thibadeau, Hauptmann, et al
Xerox Scanning Service Bureaus: Wallis, et al
… many others with an interest in DLs
What challenges do DLs pose to DIA R&D?
Digital Library Dreams
Electronic networked DLs promise to provide:
more books, journals, etc
to more people
at more places & times
than physical libraries can hope to….
The Ideal DL: an international, interoperable,
sustainable body of rich cultural
materials in digital form
Document Images’ Usefulness in DLs
raster image -> display, print
+ metadata (title, author, …) -> index, catalogue
+ OCRed text -> retrieve (more or less well)
+ correct text -> retrieve well, reuse, summarize, translate, …
+ layout format (e.g. RTF) -> reprinting
+ links (e.g. HTML) -> Web publishing
+ functional tags (e.g. XML) -> “semantic web”
Advantages of Digital Displays
versus Ink-on-Paper
networked -- potentially unbounded content
rapidly rewritable -- supports animation
radiant -- legible in the dark
sensitive -- markable, interactive
Generally thought to be overwhelming, but …
Advantages of Ink-on-Paper
versus Digital Displays
Ink-on-paper        Digital displays
cheap               expensive
large, many         small, few
high-resolution     low-resolution
lightweight         heavy
thin                thick
unpowered           powered
stable              requires
A. Dillon, “Reading from Paper versus Screens: a critical review
of the empirical literature,” Ergonomics 35(10): 1297-1326, 1992.
DISPLAYS in future
less expensive
larger, more
higher-resolution
lighter
thinner
lower power
eBooks, e-paper, notebooks, laptops, PDAs, …
The fact is, for many uses
Paper is Still Widely Preferred
“Paper [remains today] the medium of choice
for reading, even when most high-tech
[display] technologies are to hand”
— Sellen & Harper (2002)
Why is this? Paper allows:
flexible navigation through documents
cross-referencing of several documents
interweaving of reading and writing
A. J. Sellen & R. H. R. Harper, The Myth of the Paperless Office,
The MIT Press, Cambridge, MA, 2002.
Document Images are Doubly
Disadvantaged within DLs
They fail to support most uses that
symbolically encoded, tagged data do
They lose many key advantages they
enjoyed on paper
A Threat: ‘If it’s not in Google, I don’t need it!’
Can they be made as useful in DLs as encoded data?
Can they sometimes work better in DLs than encoded data?
…these are challenges to us, the DIA R&D community.
The British Library
‘The World’s Knowledge’
38.8M items catalogued
website: 18.4M page hits/year
Compare Google:
• >3B pages
• 150M searches/day
“[Reinforcing] the Library’s role as the pre-eminent
global document supplier, digital scanning from print
and microfilm originals will give researchers rapid,
high quality delivery from over 100 million research
articles, reports, and conference papers direct to
their desktop.”
-- Lynne Brindley, Chief Executive
2002-2003 Annual Report
Bibliothèque nationale de France
The Digital Library
– digitization of both printed books and graphic material
– primarily in image mode to begin with
– most out-of-copyright
Gallica 2000
– multimedia documents: Middle Ages -> early 20th century
– 35,000 printed volumes: images
– 1000 titles full text
– “one of the largest DLs free of charge on the web”
Million Book DL Project
1M books to be scanned by 2005
– bitonal, 600 dpi
Free-to-read, universally accessible
Searchable by full text (where OCR is possible)
– ABBYY FineReader OCR
Books in public domain or copyrighted but out of print
Fifteen partners:
– US, India, China; est. 4000 person-years of clerical labor
– Multinational, multilingual (mainly English)
20Tbyte trusted repository
Research testbed for summarization, OCR, automatic
extraction of metadata, machine translation
Reddy, Raj and Gloriana St. Clair, “The Million Book Project,”
CMU, Dec. 1, 2001.
Google Catalogs
“1000’s” of scanned mail-order catalogs
free for publishers, ‘few days’ turnaround
– for a fee: link products to web sites
free to users: download page images
indexed by: vendor, date, page numbers, etc
(not by full text content)
Amazon plan
‘Look Inside the Book II’
~500k books: in-copyright, non-fiction
Scan (full color), OCR cover-to-cover
Full-text search, download sample pages
Free but limited access to page images
Can Google be far behind…?
search document image files found on Web
David D. Kirkpatrick, “Amazon Plan Would Allow Searching
Text of Many Books,” The New York Times, July 21, 2003.
Capturing Document Images
To digitize a book: $4 - $1000 each!
cheaply: bitonal, low quality, mass scanning, …
expensively: color, quality control, individual handling, …
Breakdown of costs:
cataloging, description, indexing
scanning, OCR, correction, markup
quality control, file maintenance, admin
NOTE: DIA can help with all three
“The Price of Digitization,” Proc., NINCH Symposium
(National Initiative for a Networked Cultural Heritage), New
York, April 8, 2003.
Document Image Capture Operations
Usually, large-scale batch operations
Sometimes destructive:
– cut off spines, discard covers, wear & tear
– hot debate over ‘scan-and-discard’ policies
Image quality standards are often subjective
– usually: “completeness”; no missing pages, text
– seldom: checked for human, machine legibility
– rarely: guaranteed suitable for future uses
Scan once, for ever:
– seldom rescanned (Lesk: “not for 5-10y”)
M. Lesk, Practical Digital Libraries: Books, Bytes, & Bucks.
Morgan Kaufmann, San Francisco, CA, 1997.
The PARC Rare Book Scanner
• Bulk scanning w/out damaging books
• Zero force on binding
• Book is open 90 degrees
• Pages turned manually
• 280 dpi
• 9.25” x 11.75” field
• Throughput
• 8-bit grey: 450 pages/h
• 24-bit color: 120 pages/h
Bob Street & Steve Ready, PARC.
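The scanner figures above imply a raw data rate that is easy to work out. The following is a small arithmetic sketch, assuming uncompressed pixels and that every page fills the whole 9.25” x 11.75” field; neither assumption comes from the slide, and the function names are illustrative only.

```python
# Figures taken from the slide: 280 dpi, 9.25" x 11.75" field,
# 450 pages/h in 8-bit grey, 120 pages/h in 24-bit color.
DPI = 280
FIELD_INCHES = (9.25, 11.75)


def pixels_per_page(dpi=DPI, field=FIELD_INCHES):
    """Pixel count of one full-field scan (assumes the page fills the field)."""
    width_px = round(field[0] * dpi)
    height_px = round(field[1] * dpi)
    return width_px * height_px


def bytes_per_hour(pages_per_hour, bits_per_pixel):
    """Uncompressed bytes produced per hour at the given throughput."""
    bytes_per_page = pixels_per_page() * bits_per_pixel // 8
    return pages_per_hour * bytes_per_page


grey_rate = bytes_per_hour(450, 8)
color_rate = bytes_per_hour(120, 24)

print(f"{pixels_per_page():,} pixels per page")             # 8,521,100 pixels per page
print(f"grey:  {grey_rate / 1e9:.2f} GB/h uncompressed")   # grey:  3.83 GB/h uncompressed
print(f"color: {color_rate / 1e9:.2f} GB/h uncompressed")  # color: 3.07 GB/h uncompressed
```

Under these assumptions both modes produce a few gigabytes of raw pixels per hour, with grey slightly ahead of color because its higher page rate outweighs its smaller pages.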
GUI & IP for Image Capture
• Calibration
• color test targets
• per-pixel gain/offset map
• Image Processing
• performed on the fly:
• contrast, cleaning, etc
• crop, skew-correct
• processing templates
• Assuring Quality
• visual inspection
• Capturing Metadata
• automatic page numbering
1,2,3,.../ i,ii,iii,.../ I,II,III,…
• section labels
• comments (manual)
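The "per-pixel gain/offset map" under Calibration can be illustrated with a conventional flat-field correction. This is a minimal sketch, not the PARC implementation: it assumes two calibration frames (a dark frame and a frame of a uniform white target), and the names `build_gain_offset`, `correct`, and the target value 255 are hypothetical.

```python
def build_gain_offset(dark, white, target=255.0):
    """Derive per-pixel offset and gain maps from two calibration frames.

    dark:  frame captured with no light (assumed calibration input)
    white: frame of a uniform white target (assumed calibration input)
    Both are lists of rows of grey values with identical dimensions.
    """
    offset = [row[:] for row in dark]
    gain = [
        [target / (w - d) if w > d else 0.0 for d, w in zip(dark_row, white_row)]
        for dark_row, white_row in zip(dark, white)
    ]
    return gain, offset


def correct(raw, gain, offset, max_value=255):
    """Apply corrected = (raw - offset) * gain per pixel, clamped to 0..max_value."""
    corrected = []
    for raw_row, gain_row, offset_row in zip(raw, gain, offset):
        corrected.append([
            min(max_value, max(0, round((r - o) * g)))
            for r, g, o in zip(raw_row, gain_row, offset_row)
        ])
    return corrected


# Two pixels with different sensitivities, calibrated to the same output range.
dark = [[10, 20]]
white = [[210, 120]]
gain, offset = build_gain_offset(dark, white)
print(correct(white, gain, offset))  # the white target maps to full scale
print(correct(dark, gain, offset))   # the dark frame maps to zero
```

After correction, pixels that responded differently to the same white target report the same value, which is the point of keeping a per-pixel map rather than one global setting.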
DIA R&D for Image Quality Control
Measuring document image quality
– new test target designs
– image processing algorithms
– rigorous, quantitative standards
Assuring quality
– fast algorithms for on-the-fly image quality
Predicting human & machine legibility
What image quality features correlate
well with human and OCR legibility?
… and with other, later DIA tasks?
K. Summers, “Document Image Improvement for
OCR as a Classification Problem,” Proc., DR&R,
Santa Clara, CA, Jan 2003.
E. H. Barney Smith & X. Qiu, “Relating
Statistical Image Differences & Degradation
Features,” Proc., 5th DAS, Princeton, NJ, Aug 2002.
When Quality Control Goes Wrong
Front Page, 1852 Edition of the New York Times
Scanned from microfilm.
The Historical New York Times Project, CMU/NYT, 1999.
Extracting & Recognizing Content
These are central DIA R&D goals
But existing doc image understanding systems
cannot guarantee high accuracy
across the full range of documents:
– typefaces, h/w styles (in DLs: old-fashioned)
– image qualities (in DLs: poor & variable)
– layout geometries
– writing systems
– domains of discourse
DLs’ scholarly & historical docs are often harder
S. Rice, G. Nagy, T. Nartker, OCR: An Illustrated Guide to the Frontier,
Kluwer Academic Publishers: 1999.
Richly Meaningful
Typographical Book Designs
Rare Botanical Reference Book
• Jepson’s A Flora of California, 1943.
• Authoritative, still in demand by scholars
• Only a few copies are left
• Difficult to OCR well
• Scanned at PARC, all page images put
on the Univ. California, Berkeley Digital
Library website
Cut into Word-box Images:
layout analysis without OCR
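One common way to find word boxes without any character recognition is a column projection profile over a single bitonal text line. The sketch below assumes this technique purely for illustration; the slide does not say which layout analysis method was used, and `word_boxes` and the `min_gap` threshold are hypothetical.

```python
def word_boxes(line, min_gap=3):
    """Split one bitonal text line into word boxes using blank-column gaps.

    line: list of equal-length rows of 0/1 pixels (1 = ink).
    Runs of ink columns separated by fewer than min_gap blank columns are
    treated as the same word (inter-character gaps); wider gaps split words.
    Returns half-open boxes (x0, y0, x1, y1) in left-to-right order.
    """
    if not line:
        return []
    width = len(line[0])
    ink_cols = [any(row[x] for row in line) for x in range(width)]

    spans = []
    x = 0
    while x < width:
        if ink_cols[x]:
            start = x
            while x < width and ink_cols[x]:
                x += 1
            if spans and start - spans[-1][1] < min_gap:
                spans[-1] = (spans[-1][0], x)  # narrow gap: extend previous word
            else:
                spans.append((start, x))       # wide gap: start a new word
        else:
            x += 1

    boxes = []
    for x0, x1 in spans:
        ink_rows = [y for y, row in enumerate(line) if any(row[x0:x1])]
        boxes.append((x0, ink_rows[0], x1, ink_rows[-1] + 1))
    return boxes


rows = ["000000000110",
        "110110000110",
        "110110000000"]
line = [[int(c) for c in r] for r in rows]
print(word_boxes(line))  # [(0, 1, 5, 3), (9, 0, 11, 2)]
```

Real page images would also need text-line finding, skew handling, and a gap threshold adapted to the type size; those steps are left out here.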
ICDAR, Aug 4, 2003 - HSB
Reflow Word Boxes into Textlines
to Fit the Display Geometry
T. Breuel, W. Janssen, K. Popat, H. Baird, “Paper to PDA,”
Proc., ICPR, Quebec City, 2002.
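Reflowing word boxes to a new display width can be sketched as a greedy line-filling pass, much like word wrap in a text editor but over image boxes. This is an assumed simplification, not the algorithm of the cited paper: boxes are top-aligned within a line rather than baseline-aligned, and `reflow`, `word_space`, and `line_gap` are hypothetical names and parameters.

```python
def reflow(words, display_width, word_space=4, line_gap=2):
    """Greedily place word boxes into text lines that fit the display.

    words: list of (width, height) word-box sizes in reading order.
    Returns a list of (word_index, x, y) placements in display pixels.
    A word wider than the display is placed alone on its own line.
    """
    placements = []
    x = y = 0
    line_height = 0
    for index, (width, height) in enumerate(words):
        # Start a new text line when the word would overflow a non-empty line.
        if x > 0 and x + width > display_width:
            y += line_height + line_gap
            x = 0
            line_height = 0
        placements.append((index, x, y))
        x += width + word_space
        line_height = max(line_height, height)
    return placements


words = [(30, 10), (40, 12), (50, 10), (20, 10)]
print(reflow(words, display_width=80))
# [(0, 0, 0), (1, 34, 0), (2, 0, 14), (3, 54, 14)]
```

Because only the box positions change, the pixels inside each word image are untouched, which is why this approach introduces layout errors but no OCR errors.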
Make Doc-Images Highly Portable,
Legible Everywhere
No OCR errors!
(Only layout errors.)
Preserve meaningful
reading order
non-text
navigation
linking
Other ‘Pure-Image’ DIA for DLs
Not Dependent on Accurate Recognition
For Text: seems feasible
– Summarization of doc images w/out OCR
– Outlining, condensing, linking
– Reflowing tables
For Non-text: seems dauntingly hard
– Mathematics
– Chemical formulae
– Line-art drawings
– Graphics generally
Vitally important to try,
since recognition & encoding
are highly problematic
Personal Digital Libraries
People are beginning to
– collect
– manage
– share
their own small DLs
Scanned & encoded documents, mixed together
How to assist ‘productive reading’
These users lack specialized skills
DIA tools need to be deskilled to a clerical level
… and to work together far better
Thanks to: Jon Hull et al, Ricoh Innovations; Robert Wilensky et
al, UC Berkeley; Larry Spitz, DocRec; Kris Popat et al, PARC.
Interactive Digital Libraries
Today’s DIA tools leave many errors
in recognition, encoding, tagging etc
How can these mistakes affordably be fixed?
Invite volunteer help:
– e.g. Project Gutenberg, Open Mind Initiative
Challenge: provide interactive tools to
accept corrections on-line
enforce review, verification
efficiently make the most of every correction
DIA tools able to benefit from correction
Thanks to: George Nagy, David Stork, Dan Lopresti.
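One simple way to "enforce review, verification" on volunteer corrections is to accept a correction only once enough independent volunteers agree on it. The sketch below assumes that policy for illustration; it is not described in the slides, and `accepted_correction` and `min_votes` are hypothetical.

```python
from collections import Counter


def accepted_correction(proposals, min_votes=2):
    """Return the agreed correction for one OCR token, or None if unverified.

    proposals: list of (volunteer_id, corrected_text) pairs.
    Each volunteer counts once per distinct text, so repeated submissions by
    the same person do not verify themselves. A tie for first place is
    treated as unresolved.
    """
    distinct = {(volunteer, text) for volunteer, text in proposals}
    votes = Counter(text for _, text in distinct)
    if not votes:
        return None
    ranked = votes.most_common(2)
    best_text, best_count = ranked[0]
    tied = len(ranked) > 1 and ranked[1][1] == best_count
    if best_count >= min_votes and not tied:
        return best_text
    return None


print(accepted_correction([("a", "Flora"), ("b", "Flora"), ("c", "F1ora")]))  # Flora
print(accepted_correction([("a", "Flora"), ("a", "Flora")]))                  # None
```

A production tool would likely weight volunteers by track record and feed accepted corrections back into the recognizer, which is the harder part of the challenge stated above.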
Collaborative DLs:
DIA for the Masses
Enable non-professionals to collaborate
in improving, manually, on the best that
automatic DIA tools can do, e.g.
– one person may correct thresholding
– another corrects OCR errors
– yet another adds tags
Offer DIA tools downloadable from the web,
possibly under GPL-like licenses
Dimp? A document image processing toolkit
interoperable via common data structures & file formats
Thanks to: Tom Breuel, Kris Popat, Bill Janssen.
DIA R&D Opportunities for DLs
Making Document Images as Useful as
Symbolically Encoded Data
Image capture, quality control
Image improvement, rectification, etc
Content extraction, recognition, & analysis
Legibility, presentation, reflowing
Markup, indexing, retrieval, summarization
Personal & interactive DLs
Offering DIA tools to DL users
… many more, no doubt
An Urgent Responsibility?
Vast, irreplaceable, culturally vital legacy collections
of paper documents are competing ineffectively for
attention with billions of digital documents
Thus paper archives are threatened with neglect,
perceived irrelevance, …. & eventually, oblivion?
The DIA community is uniquely qualified
to help the DL community rescue them.
Principal DL Conferences
ECDL: European Conf. on Digital Libraries
July 2003: Trondheim, Norway (7th)
RCDL: All-Russian Scientific Congress
Oct 2003: St. Petersburg (5th)
ICADL: Int’l Conf. on Asian Digital Libraries
Dec 2003: Kuala Lumpur (6th)
JCDL: ACM/IEEE Joint Conf. on Digital Libraries
June 2004: Tucson, AZ (4th)
Henry S. Baird
Statistical Pattern & Image Analysis
