New Resources for Document Classification, Analysis and Translation Technologies
Stephanie Strassel, Lauren Friedman, Safa Ismael,
David Lee, Kazuaki Maeda, Linda Brandschain
{strassel, lf, safa, david4, maeda, …}@ldc.upenn.edu
Linguistic Data Consortium
http://projects.ldc.upenn.edu/MADCAT
LREC 2008, Marrakech Morocco - May 30 2008
Presentation Outline
MADCAT Program Overview
Technology Challenges
Roadmap
Data Creation
 Phase 1 Data Profile
 Processing
 Collection
 Annotation
Data Format
Evaluation
Conclusions and Future Work
MADCAT Overview
MADCAT: Multilingual Document Classification
Analysis and Translation
A 5-year DARPA program
MADCAT technologies will convert foreign
language document images into English text,
enabling English speakers to extract, assess,
and respond to information in a timely manner
Multiple input types and domains
 Hard-copy, PDF, camera-captured
 Newspapers, letters, signs, graffiti, how-to manuals,
memos, postcards, forms, diaries, ledgers, etc.
Technology Challenges
 Extract relevant metadata about the document structure
 Integrate and optimize page segmentation, metadata
extraction, OCR and translation technologies
 Create end-to-end system for deployment at program’s
end with over 90% accuracy
 Current baseline is ~2%
 Primary evaluation metric is edit distance: HTER
 Same protocols as used in the GALE program
 Limited focus in Phase 1
 Arabic > English
 High resolution (600 dpi) images of handwritten newspaper and
web text
 Topics primarily news, current events and commentary
 Manual segmentation provided
Roadmap
Data properties by program phase (rows: Genre, Topic, Medium, Source Data Quality):
 Pre-MADCAT (state of the art): Newswire; News; Printed; Controlled
 Phase 1 (add handwriting): Newswire, Weblogs; News, Commentary; Printed, Handwritten; Controlled
 Phases 2-3 (new data types) and Phases 4-5 (new genres, topics, quality conditions): genres expand to Letters, Broadcast Talk Shows, Forms, Maps, Instructions, Calendars, Ledgers, Poems, Books, Diaries, Verdicts, Training Manuals and Personal Identification documents; topics expand to Science, Engineering, Personal, Military, Religious and Other; Printed and Handwritten; Controlled and Uncontrolled
Phase 1 Data Profile
 In Phase 1, data drawn from DARPA GALE program
 New collection to acquire handwritten versions
 Genres: Formal text (newswire) and informal text (weblogs)
 Benefits
 Eliminates domain mismatch between GALE state-of-the-art MT
models and MADCAT test sets
 Allows developers to focus on primary challenge: handwriting
 Data characteristics well understood, cost and time factors are
reasonably well known
 Training data costs controlled since translations exist
 Production begins immediately, training data available sooner
 Provides controlled test sets for evaluation across programs
 Subsequent phases will add new data types, genres
and other challenge elements
Training and DevTest
 Training
 Minimum 2000 unique pages
• Half formal (newswire), half informal (web text)
• 100-250 words per page
 Minimum 100 unique scribes in training pool
 5 scribes per page
 At minimum 10,000 manuscripts (scribe-pages) in Phase 1
training set
 DevTest
 320 unique pages
• Half formal (newswire), half informal (web text)
• 125 words/page
 50 scribes in devtest pool
• 25 from training, 25 previously unseen
 2 scribes per page, ~7 pages per scribe
 Total of 640 manuscripts; 80,000 words
Evaluation Data
320 unique pages from GALE P3 Eval set
 Half formal (newswire), half informal (web
text)
 125 words/page
50 scribes in eval partition
 25 from training, 25 previously unseen
6 scribes per page, ~40 pages per scribe
Total of 1920 manuscripts, 240,000 words
Subset of eval set designated for pilot
evaluation in September 2008
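The manuscript and word totals above follow directly from the page and scribe counts. A quick arithmetic sketch (figures as stated on the preceding two slides; the training set specifies only a 100-250 words/page range, so only manuscripts are computed there):

```python
# Rough corpus-size arithmetic for the Phase 1 partitions.
# (unique pages, scribes per page, nominal words per page; None where only a range is given)
partitions = {
    "training": (2000, 5, None),   # 100-250 words per page
    "devtest":  (320,  2, 125),
    "eval":     (320,  6, 125),
}

for name, (pages, scribes_per_page, words_per_page) in partitions.items():
    manuscripts = pages * scribes_per_page            # scribe-pages
    note = f", ~{manuscripts * words_per_page:,} words" if words_per_page else ""
    print(f"{name}: {manuscripts:,} manuscripts{note}")

# training: 10,000 manuscripts
# devtest:  640 manuscripts, ~80,000 words
# eval:     1,920 manuscripts, ~240,000 words
```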
Data Preparation
 Start with electronic text from GALE
 Whole documents collected from newswire or web
 Segmented into SUs (semantic/sentence units)
 Each segment manually translated
 Pre-processing prior to handwriting
 Tokenization to words for later stages
 Segments reordered and formatting added to create optimal
pages for handwriting assignment
• Roughly 5 words/line to avoid line wrapping
• No more than 25 lines/page to avoid page breaks
 After handwriting, images scanned at high resolution
(600 dpi, greyscale)
 Images are ground truth annotated at line, word level
 Major challenge is logical storage of many layers of
information across multiple versions of the same data
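A minimal sketch of how the ~5 words/line and ≤25 lines/page constraints might be applied when laying out tokenized segments into pages; this is illustrative only, not the actual MADCAT formatting tool:

```python
def layout_pages(segments, words_per_line=5, max_lines_per_page=25):
    """Break tokenized segments into short lines and group lines into pages,
    so scribes never have to wrap a line or split a page. Illustrative only."""
    pages, current_page = [], []
    for tokens in segments:                       # each segment is a list of word tokens
        lines = [tokens[i:i + words_per_line]
                 for i in range(0, len(tokens), words_per_line)]
        if current_page and len(current_page) + len(lines) > max_lines_per_page:
            pages.append(current_page)            # close the page rather than break a segment
            current_page = []
        current_page.extend(lines)
    if current_page:
        pages.append(current_page)
    return pages

# Example: 12-word, 3-word and 7-word segments fit on a single 6-line page
demo = layout_pages([["w"] * 12, ["a", "b", "c"], ["x"] * 7])
print(len(demo), "page(s);", len(demo[0]), "lines on page 1")
```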
Collection
 New human subjects collection required to produce
handwritten versions of existing data
 Pilot collection currently underway at LDC in Philadelphia
• LDC Arabic staff and recent Iraqi immigrants in Philly
 Additional collections planned with partner sites in Lebanon,
Morocco and possibly Egypt
 Regional variety necessary to capture stylistic writing
differences
• E.g. use of Indic vs. Arabic numbers
 Assignment and tracking of data and scribes controlled
through centralized LDC database and assignment
protocol
• Scribe partition (train only, test only, both)
• Writing conditions
• Regional variation
• Genre, topic and source balance
Writing Conditions
Writing implement
 90% ballpoint pen (I)
 10% pencil (P)
Paper
 75% unlined white paper (U)
 25% lined paper (L)
Writing speed
 90% normal (N)
 5% fast (F)
 5% careful (C)
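A minimal sketch of how page-level writing conditions could be sampled to hit the target proportions above; the condition codes are from this slide, but the sampling code itself is hypothetical:

```python
import random

# Target proportions from the collection specification.
IMPLEMENT = [("I", 0.90), ("P", 0.10)]               # ballpoint pen / pencil
PAPER     = [("U", 0.75), ("L", 0.25)]               # unlined white / lined
SPEED     = [("N", 0.90), ("F", 0.05), ("C", 0.05)]  # normal / fast / careful

def sample_condition(spec, rng):
    codes, weights = zip(*spec)
    return rng.choices(codes, weights=weights, k=1)[0]

def assign_conditions(n_pages, seed=0):
    """Draw an (implement, paper, speed) triple for each page to be copied."""
    rng = random.Random(seed)
    return [(sample_condition(IMPLEMENT, rng),
             sample_condition(PAPER, rng),
             sample_condition(SPEED, rng)) for _ in range(n_pages)]

print(assign_conditions(5))   # e.g. [('I', 'U', 'N'), ('I', 'L', 'N'), ...]
```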
Collection Workflow
1. LDC selects source data
2. LDC generates kits (documents + writing conditions)
3. LDC delivers data kits to collection sites
4. Sites publicize study and recruit participants
5. Scribe visits public URL, contacts site coordinator
6. Site coordinator schedules appointment
7. Scribe comes in, takes writing sample test
8. Site coordinator verifies scribe eligibility
9. Site coordinator logs in to secure website via login page
10. Scribe completes registration via registration page
11. Scribe verifies info via confirmation page
12. Site coordinator prints out subject ID and instructions for subject via assignment page
13. Coordinator pulls kit for this subject ID
14. Scribe leaves with kit and instructions
15. Scribe returns completed kit to site
16. Coordinator verifies kit completeness and arranges payment
17. Coordinator files completed kit for scanning/delivery
18. Site scans completed kit(s) as safeguard
19. Site uploads image file to LDC
20. Site ships completed paper kit(s) to LDC for archiving
21. LDC processes completed kits for subsequent tasks
Scribe Demographics
Scribes register in person at collection site and
take writing test
 To assess literacy and ability to follow instructions
Enter demographic info on LDC's secure server
 Name, address (for payment purposes only)
 Age, gender, level of education, occupation
 Where born, where raised
 Primary language of educational instruction
 Handedness
After registration, scribes receive brief tutorial
 No line wrapping, no page breaks
 Copy text exactly: no omissions or insertions, no
corrections to source text
Scribe Assignments
Assignments are in the form of printed "kits"
 50 printed pages to be copied plus assignment table
• Assignment table specifies page order and writing conditions
 Multiple scribes/kit, so conditions and order vary
 Printed pages labeled with page and kit ID
 Scribes affix label with scribe, page and kit ID to back
of completed manuscript
• To facilitate data tracking during scanning and postprocessing
Scribes supply paper and writing instrument
 To sample natural variation
Payment per completed kit
 Exhaustive check on first assignment (completeness
and accuracy)
 Spot check on remainder of assignments
Ground Truthing
Zones created at word level only for Phase 1
 Lines can be extrapolated from annotation
 Other zone types possible in future phases
• Structural elements (e.g. signature block)
Explicit reading order preserved
Locations are polygons
 Restricted to upright rectangles in the first phase
Each zone contains a unique ID, the word contents, and
its location (coordinates)
Status tags to accommodate scribe mistakes
 extra, missing, typo
nextZoneID tag to indicate reading order
In Phase 1, ground truthing primarily by partner
site (Applied Media Analysis)
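A minimal sketch of the kind of word-zone record described above; field names are illustrative and do not reflect the official GEDI/MADCAT ground truth schema:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class WordZone:
    """One word-level ground truth zone (illustrative field names)."""
    zone_id: str                        # unique ID within the page
    contents: str                       # the transcribed word
    bbox: Tuple[int, int, int, int]     # upright rectangle: x, y, width, height
    status: Optional[str] = None        # "extra", "missing" or "typo" for scribe errors
    next_zone_id: Optional[str] = None  # explicit reading order

# Two zones in reading order; the second is flagged as a scribe typo
z1 = WordZone("z001", "السلام", (120, 80, 95, 40), next_zone_id="z002")
z2 = WordZone("z002", "عليكم", (230, 82, 88, 38), status="typo")
```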
GEDI Toolkit
GroundTruth Editor and Document Interface (GEDI),
created by Applied Media Analysis (AMA)
Data Format
The MADCATUnifier process takes multiple data
streams and generates a single XML output file
containing all required information:
1) Text layer
*Source Text
*Tokenization
*SU Segmentation
*Translation
2) Image layer
*zone bounding boxes
3) Scribe demographics
4) Document metadata
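A hedged sketch of how such a unifier might merge the four layers into one XML record; element and attribute names (and the document ID) are invented for illustration and do not reflect the official MADCAT format:

```python
import xml.etree.ElementTree as ET

def unify(source_segments, zones, scribe_info, doc_metadata):
    """Merge the text, image, scribe and document layers into one XML record.
    Illustrative only; the real MADCAT schema may differ."""
    doc = ET.Element("madcat_document", doc_metadata)
    ET.SubElement(doc, "scribe", scribe_info)
    text = ET.SubElement(doc, "text_layer")
    for seg_id, (tokens, translation) in source_segments.items():
        seg = ET.SubElement(text, "segment", id=seg_id, translation=translation)
        seg.text = " ".join(tokens)                       # tokenized source text
    image = ET.SubElement(doc, "image_layer")
    for zone in zones:                                    # word zone bounding boxes
        ET.SubElement(image, "zone", {k: str(v) for k, v in zone.items()})
    return ET.tostring(doc, encoding="unicode")

print(unify({"seg1": (["مرحبا"], "Hello")},
            [{"id": "z001", "word": "مرحبا", "x": 120, "y": 80, "w": 95, "h": 40}],
            {"id": "scribe42", "handedness": "right"},
            {"id": "DOC.0001", "genre": "newswire"}))     # hypothetical IDs
```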
Evaluation
Input: (segmented) Arabic handwritten image
Output: segmented English text
HTER is primary evaluation metric (edit
distance)
 Manual post-editing task corrects MT output one
segment at a time until it has the same meaning as
the reference translation, making as few edits as
possible
 NIST-developed MTPostEditor GUI
• Editors review segment-aligned MT and gold standard
translation
 No access to original Arabic text or handwritten image file
No official separate evaluation of OCR or
processing components
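HTER is the minimum number of word edits needed to turn the system output into its human post-edited reference, normalized by reference length. A minimal sketch of the edit-distance core (full TER/HTER also allows block shifts, which are omitted here):

```python
def word_edit_distance(hyp, ref):
    """Minimum insertions, deletions and substitutions between two token lists
    (word-level Levenshtein; full (H)TER additionally allows block shifts)."""
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[-1][-1]

def hter(system_output, post_edited_reference):
    hyp, ref = system_output.split(), post_edited_reference.split()
    return word_edit_distance(hyp, ref) / len(ref)

print(hter("the troops entered city", "the troops entered the city"))  # 0.2
```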
Conclusions; Future Work
LDC is creating a set of new linguistic resources
for image processing, document classification
and translation on a scale not previously
available
 Phase 1: Large collection of Arabic handwritten,
translated, segmented, ground truthed text
 Infrastructure for collection, annotation and data
management
• Including a unified, extensible data format
Will be extended to new data types, domains,
languages and annotations in future phases
Resources will be available through LDC
Acknowledgements
This work was supported in part by the Defense
Advanced Research Projects Agency, MADCAT
Program Grant No. HR0011-08-1-004. The
content of this paper does not necessarily
reflect the position or the policy of the
Government, and no official endorsement
should be inferred.
Thank you to Audrey Le and Mark Przybocki at
NIST for helping to define data and format
requirements