Digital Libraries &
Document Image Analysis
Henry S. Baird
Statistical Pattern & Image Analysis research
Information Sciences & Technologies Lab
1
DLs as seen by a DIA Researcher

15 years in DIA R&D

Lucky to have known/collaborated with:
– PARC DL enthusiasts: Masinter, Street, Bloomberg, et al
– UC Berkeley Digital Library project: Wilensky, Fateman, et al
– CMU Universal Library project: Thibadeau, Hauptmann, et al
– Xerox Scanning Service Bureaus: Wallis, et al
– … many others with an interest in DLs
What challenges do DLs pose to DIA R&D?
ICDAR, Aug 4, 2003 - HSB
2
Digital Library Dreams
Electronic networked DLs promise to provide:
– more books, journals, etc
– to more people
– faster
– at more places & times
than physical libraries can hope to….
The Ideal DL: an international, interoperable,
sustainable body of rich cultural
materials in digital form
3
Document Images’ Usefulness in DLs
raster image                     display, print
+ metadata (title, author, …)    + index, catalogue
+ OCRed text                     + retrieve (more or less well)
+ correct text                   + retrieve well, reuse, summarize, translate, …
+ layout format (e.g. RTF)       + reprinting
+ links (e.g. HTML)              + Web publishing
+ functional tags (e.g. XML)     + “semantic web”
4
Advantages of Digital Displays
versus Ink-on-Paper

Many…
– networked -- potentially unbounded content
– rapidly rewritable -- supports animation
– radiant -- legible in the dark
– sensitive -- markable, interactive
Generally thought to be overwhelming, but …
5
Advantages of Ink-on-Paper
versus Digital Displays
PAPER              DISPLAYS today          DISPLAYS in future
cheap              expensive               less expensive
large, many        small, few              larger, more
high-resolution    low-resolution          higher-resolution
lightweight        heavy                   lighter
thin               thick                   thinner
unpowered          powered                 lower power
stable             requires maintenance
(eBooks, e-paper, notebooks, laptops, PDAs, …)
A. Dillon, “Reading from Paper versus Screens: a critical review
of the empirical literature,” Ergonomics 35(10): 1297-1326, 1992.
6
The fact is, for many uses
Paper is Still Widely Preferred
“Paper [remains today] the medium of choice
for reading, even when most high-tech
[display] technologies are to hand”
— Sellen & Harper (2002)
Why is this? Paper allows:
– flexible navigation through documents
– cross-referencing of several documents
– annotations
– interweaving of reading and writing
A. J. Sellen & R. H. R. Harper, The Myth of the Paperless Office,
The MIT Press, Cambridge, MA, 2002.
7
Document Images are Doubly
Disadvantaged within DLs


They fail to support most uses that
symbolically encoded, tagged data do
They lose many key advantages they
enjoyed on paper
A Threat: ‘If it’s not in Google, I don’t need it!’
Can they be made as useful in DLs as encoded data?
Can they sometimes work better in DLs than encoded data?
…these are challenges to us, the DIA R&D community.
8
The British Library
‘The World’s Knowledge’
38.8M items catalogued
website: 18.4M page hits/year
Compare Google:
• >3B pages
• 150M searches/day
“[Reinforcing] the Library’s role as the pre-eminent
global document supplier, digital scanning from print
and microfilm originals will give researchers rapid,
high quality delivery from over 100 million research
articles, reports, and conference papers direct to
their desktop.”
-- Lynne Brindley, Chief Executive
2002-2003 Annual Report
9
Bibliothèque nationale de France

The Digital Library
– digitization of both printed books and graphic material
– primarily in image mode to begin with
– most out-of-copyright

Gallica 2000
– multimedia documents: Middle Ages -> early 20th century
– 35,000 printed volumes: images
– 1000 titles full text
– “one of the largest DLs free of charge on the web”
10
Million Book DL Project

1M books to be scanned by 2005
– bitonal, 600 dpi


Free-to-read, universally accessible
Searchable by full text (where OCR is possible)
– ABBYY FineReader OCR


Books in public domain or copyrighted but out of print
Fifteen partners:
– US, India, China; est. 4000 person-years of clerical labor
– Multinational, multilingual (mainly English)


20 Tbyte trusted repository
Research testbed for summarization, OCR, automatic
extraction of metadata, machine translation
Reddy, Raj and Gloriana St. Clair, “The Million Book Project,”
CMU, Dec. 1, 2001.
11
Google Catalogs


“1000’s” of scanned mail-order catalogs
free for publishers, ‘few days’ turnaround
– for a fee: link products to web sites


free to users: download page images
indexed by: vendor, date, page numbers, etc
(not by full text content)
12
Amazon.com plan
‘Look Inside the Book II’
~500k books: in-copyright, non-fiction
• Scan (full color), OCR cover-to-cover
• Full-text search, download sample pages
• Free but limited access to page images
———
Can Google be far behind…?
search document image files found on Web

David D. Kirkpatrick, “Amazon Plan Would Allow Searching
Text of Many Books,” The New York Times, July 21, 2003.
13
Capturing Document Images
To digitize a book: $4 - $1000 each!
cheaply: bitonal, low quality, mass scanning, …
expensively: color, quality control, individual handling, …
Breakdown of costs:
1/3 cataloging, description, indexing
1/3 scanning, OCR, correction, markup
1/3 quality control, file maintenance, admin
NOTE: DIA can help with all three
“The Price of Digitization,” Proc., NINCH Symposium
(National Initiative for a Networked Cultural Heritage), New
York, April 8, 2003.
14
Document Image Capture Operations


Usually, large-scale batch operations
Sometimes destructive:
– cut off spines, discard covers, wear & tear
– hot debate over ‘scan-and-discard’ policies

Image quality standards are often subjective
– usually: “completeness”; no missing pages, text
– seldom: checked for human, machine legibility
– rarely: guaranteed suitable for future uses

Scan once, for ever:
– seldom rescanned (Lesk: “not for 5-10y”)
M. Lesk, Practical Digital Libraries: Books, Bytes, & Bucks.
Morgan Kaufmann, San Francisco, CA, 1997.
15
The PARC Rare Book Scanner
• Bulk scanning w/out damaging books
• Zero force on binding
• Book is open 90 degrees
• Pages turned manually
• 280 dpi
• 9.25”x11.75” field
• Throughput
  • 8-bit grey: 450 pages/h
  • 24-bit color: 120 pages/h
Bob Street & Steve Ready, PARC.
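The throughput figures above translate directly into scanning time per volume. A small worked example follows; the 360-page book and the names `THROUGHPUT` and `scan_hours` are assumptions made for illustration, not part of the scanner's software.

```python
# Throughput figures quoted on the slide, in pages per hour.
THROUGHPUT = {"8-bit grey": 450, "24-bit color": 120}


def scan_hours(pages: int, mode: str) -> float:
    """Hours of manual page turning needed to scan `pages` pages in `mode`."""
    return pages / THROUGHPUT[mode]


# Hypothetical 360-page volume (the page count is an assumption).
for mode in THROUGHPUT:
    print(f"{mode}: {scan_hours(360, mode):.1f} h")
```

Under these assumptions the same volume takes under an hour in grey and about three hours in color, which is one reason capture mode matters for bulk operations.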
16
GUI & IP for Image Capture
• Calibration
  • color test targets
  • per-pixel gain/offset map
• Image Processing
  • performed on the fly:
    • contrast, cleaning, etc
    • crop, skew-correct
  • processing templates
• Assuring Quality
  • visual inspection
• Capturing Metadata
  • automatic page numbering: 1,2,3,... / i,ii,iii,... / I,II,III,…
  • section labels
  • comments (manual)
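The automatic page numbering listed above (arabic, lower-case roman, upper-case roman) can be sketched roughly as follows. This is only an illustration of the idea, not the PARC capture software; the function names `to_roman` and `page_labels` and the scheme names are assumptions.

```python
def to_roman(n: int) -> str:
    """Convert an integer in 1..3999 to an upper-case roman numeral."""
    if not 1 <= n <= 3999:
        raise ValueError("roman numerals here cover 1..3999")
    pairs = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
             (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
             (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
    out = []
    for value, symbol in pairs:
        count, n = divmod(n, value)   # how many times this symbol fits
        out.append(symbol * count)
    return "".join(out)


def page_labels(count: int, scheme: str = "arabic", start: int = 1) -> list[str]:
    """Label `count` consecutive page images, starting at number `start`."""
    numbers = range(start, start + count)
    if scheme == "arabic":
        return [str(n) for n in numbers]
    if scheme == "roman-lower":
        return [to_roman(n).lower() for n in numbers]
    if scheme == "roman-upper":
        return [to_roman(n) for n in numbers]
    raise ValueError(f"unknown scheme: {scheme}")


# Hypothetical book: 4 front-matter pages, then 3 body pages.
labels = page_labels(4, "roman-lower") + page_labels(3, "arabic")
print(labels)
```

An operator would then only need to mark where each numbering scheme starts, rather than key in every page label by hand.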
17
DIA R&D for Image Quality Control

Measuring document image quality
– new test target designs
– image processing algorithms
– rigorous, quantitative standards

Assuring quality
– fast algorithms for on-the-fly image quality
estimation

Predicting human & machine legibility
What image quality features correlate
well with human and OCR legibility?
… and with other, later DIA tasks?
K. Summers, “Document Image Improvement for OCR as a Classification Problem,” Proc., DR&R X, Santa Clara, CA, Jan 2003.
E. H. Barney Smith & X. Qiu, “Relating Statistical Image Differences & Degradation Features,” Proc., 5th DAS, Princeton, NJ, Aug 2002.
18
When Quality Control Goes Wrong
Front Page, 1852 Edition of the New York Times
Scanned from microfilm.
The Historical New York Times Project, CMU/NYT, 1999.
19
Extracting & Recognizing Content
These are central DIA R&D goals
But existing doc image understanding systems
cannot guarantee high accuracy
across the full range of documents:
- typefaces, h/w styles    old fashioned
- image qualities          poor & variable
- layout geometries        deformed
- writing systems          obsolete
- languages                rare
- domains of discourse     arcane
DLs’ scholarly & historical docs are often harder
S. Rice, G. Nagy, T. Nartker, OCR: An Illustrated Guide to the Frontier,
Kluwer Academic Publishers: 1999.
20
Richly Meaningful
Typographical Book Designs
Rare Botanical Reference Book
• Jepson’s A Flora of California, 1943.
• Authoritative, still in demand by scholars
• Only a few copies are left
• Difficult to OCR well
• Scanned at PARC, all page images put
on the Univ. California, Berkeley Digital
Library website
21
Cut into Word-box Images:
layout analysis without OCR
22
Reflow Word Boxes into Textlines
to Fit the Display Geometry
T. Breuel, W. Janssen, K. Popat, H. Baird, “Paper to PDA,” Proc., ICPR, Quebec City, 2002.
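Reflowing word boxes to a display width can be illustrated with a greedy line-filling sketch. This is not the method of the cited paper, only a minimal first-fit illustration; `WordBox`, `reflow`, the 6-pixel gap, and the sample widths are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class WordBox:
    """A cropped word image, described only by its pixel size (no OCR)."""
    width: int
    height: int


def reflow(words: list[WordBox], display_width: int, gap: int = 6) -> list[list[WordBox]]:
    """Greedily pack word boxes, in reading order, into lines that fit the display."""
    lines: list[list[WordBox]] = []
    current: list[WordBox] = []
    used = 0  # pixels consumed on the current line
    for word in words:
        needed = word.width if not current else used + gap + word.width
        if current and needed > display_width:
            lines.append(current)          # line is full: start a new one
            current, used = [word], word.width
        else:
            current.append(word)           # an over-wide word still gets its own line
            used = needed
    if current:
        lines.append(current)
    return lines


words = [WordBox(w, 20) for w in (40, 60, 30, 80, 50, 20)]
for line in reflow(words, display_width=120):
    print([w.width for w in line])
```

The sketch depends entirely on the word boxes arriving in correct reading order, which is why layout errors, rather than OCR errors, are the failure mode of this approach.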
23
Make Doc-Images Highly Portable,
Legible Everywhere
No OCR errors!
(Only layout errors.)
Preserve meaningful appearance
Challenges:
• reading order
• non-text
• navigation
• linking
24
Other ‘Pure-Image’ DIA for DLs
Not Dependent on Accurate Recognition

For Text: seems feasible
– Summarization of doc images w/out OCR
– Outlining, condensing, linking
– Reflowing tables

For Non-text: seems dauntingly hard
– Mathematics
– Chemical formulae
– Line-art drawings
– Graphics generally
Vitally important to try,
since recognition & encoding
are highly problematic
25
Personal Digital Libraries

People are beginning to
– collect
– manage
– share
their own small DLs
Scanned & encoded documents, mixed together
How to assist ‘productive reading’
These users lack specialized skills
DIA tools need to be deskilled to a clerical level
… and to work together far better
Thanks to: Jon Hull et al, Ricoh Innovations; Robert Wilensky et
al, UC Berkeley; Larry Spitz, DocRec; Kris Popat et al, PARC.
26
Interactive Digital Libraries



Today’s DIA tools leave many errors
in recognition, encoding, tagging etc
How can these mistakes affordably be fixed?
Invite volunteer help:
– e.g. Project Gutenberg, Open Mind Initiative

Challenge: provide interactive tools to
– accept corrections on-line
– enforce review, verification
– efficiently make the most of every correction
– DIA tools able to benefit from correction
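One possible reading of “enforce review, verification” is to accept a volunteer correction only once enough independent volunteers agree on it. The sketch below assumes that policy purely for illustration; the function name `accepted_correction`, the `min_votes` threshold, and the tie rule are hypothetical and do not describe any existing tool.

```python
from collections import Counter
from typing import Optional


def accepted_correction(proposals: list[str], min_votes: int = 2) -> Optional[str]:
    """Return the text that at least `min_votes` volunteers agree on, else None.

    A tie between two equally popular proposals is left unresolved (None),
    so that a human reviewer can decide.
    """
    if not proposals:
        return None
    ranked = Counter(proposals).most_common(2)
    top_text, top_votes = ranked[0]
    if top_votes < min_votes:
        return None                      # not enough independent agreement yet
    if len(ranked) > 1 and ranked[1][1] == top_votes:
        return None                      # tie: needs further review
    return top_text


# Hypothetical corrections submitted for one OCRed word.
print(accepted_correction(["Jepson", "Jepson", "Jepsen"]))
print(accepted_correction(["Jepson"]))
```

A single unverified submission is held back, while two matching submissions are accepted; a real system would also need to track who proposed what.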
Thanks to: George Nagy, David Stork, Dan Lopresti.
27
Collaborative DLs:
DIA for the Masses

Enable non-professionals to collaborate
in improving, manually, on the best that
automatic DIA tools can do, e.g.
– one person may correct thresholding
– another corrects OCR errors
– yet another adds tags
Offer DIA tools downloadable from the web,
possibly under GPL-like licenses
– Dimp (?): document image processing toolkit
– interoperable via common data structures & file formats
Thanks to: Tom Breuel, Kris Popat, Bill Janssen.
28
DIA R&D Opportunities for DLs
Making Document Images as Useful as
Symbolically Encoded Data
Image capture, quality control
Image improvement, rectification, etc
Content extraction, recognition, & analysis
Legibility, presentation, reflowing
Markup, indexing, retrieval, summarization
Personal & interactive DLs
Offering DIA tools to DL users
… many more, no doubt
29
An Urgent Responsibility?

Vast, irreplaceable, culturally vital legacy collections
of paper documents are competing ineffectively for
attention with billions of digital documents

Thus paper archives are threatened with neglect,
perceived irrelevance, …. & eventually, oblivion?
The DIA community is uniquely qualified
to help the DL community rescue them.
30
Principal DL Conferences
ECDL: European Conf. on Digital Libraries
July 2003: Trondheim, Norway (7th)
RCDL: All-Russian Scientific Congress
Oct 2003: St. Petersburg (5th)
ICADL: Int’l Conf. on Asian Digital Libraries
Dec 2003: Kuala Lumpur (6th)
JCDL: ACM/IEEE Joint Conf. on Digital Libraries
June 2004: Tucson, AZ (8th)
31
Contact
Henry S. Baird
Statistical Pattern & Image Analysis
[email protected]
www.parc.com/baird
+1-650-812-4481, FAX –4374
32