Document Engineering of Complex Software Specifications
Mehrdad Nojoumian
Supervisor: Professor T. C. Lethbridge
University of Ottawa
School of Information Technology and Engineering
June 4, 2007, MSc Thesis in Computer Science
Motivation and Goal
Problems triggering our motivation:
Software specifications:
• are dense and intricate (a large amount of material)
• have complicated structures (many tables, figures, lists, code listings, etc.)
• are difficult to browse and navigate
• are mostly available only as PDF or as a single hypertext page
Major goal:
• Re-engineer PDF-based documents (specifications, conference proceedings, e-books, etc.)
• Illustrate how to make more usable versions of documents
Data Analyses
UML Superstructure Specification
1. The most frequent words among headings
2. Frequency of those words in the entire document
3. The most frequent words in the document index
Headings and the document index carry the most important words in a document
Other OMG Specifications
• Sorted document tokens and heading tokens by frequency in two separate lists
• Defined the positions of heading tokens among document tokens: P1, P2, …, PN
MP: Mean of [P1…PN]
NDT: Total number of document tokens
Percentage = (MP * 100) / NDT
[Chart: Percentage (y-axis, 0 to 100) for each of the 10 OMG specifications (x-axis, 1 to 10); two series, Series1 and Series2]
The most frequent heading words (number of occurrences > 2) are among the most frequent words in the entire document
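The percentage metric above can be sketched in a few lines. This is only an illustration in Python, not the analysis code used in the thesis. The tokenizer, the function names, and the reading of NDT as the number of distinct document tokens are all assumptions made for the example. A low percentage means the frequent heading words sit near the top of the document's frequency ranking.

```python
from collections import Counter
import re


def tokenize(text):
    """Lower-case alphabetic tokens (a simplifying assumption, not the thesis tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def heading_position_percentage(headings, body, min_count=3):
    """Return (MP * 100) / NDT for heading tokens that occur more than 2 times.

    headings:  list of heading strings
    body:      full document text
    min_count: minimum number of occurrences among headings (3 means "> 2")
    """
    doc_counts = Counter(tokenize(body))
    head_counts = Counter(tok for h in headings for tok in tokenize(h))

    # Rank document tokens by descending frequency; ties are broken alphabetically.
    ranked = sorted(doc_counts, key=lambda tok: (-doc_counts[tok], tok))
    position = {tok: rank for rank, tok in enumerate(ranked, start=1)}
    ndt = len(ranked)  # NDT, assumed here to count distinct document tokens

    # P1..PN: positions of the frequent heading tokens in the document ranking.
    positions = [position[tok] for tok, count in head_counts.items()
                 if count >= min_count and tok in position]
    if not positions:
        return None
    mp = sum(positions) / len(positions)  # MP: mean of [P1..PN]
    return mp * 100 / ndt
```

Usage: call `heading_position_percentage(list_of_headings, full_text)` once per specification and compare the resulting percentages across documents.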
Document Transformation
I. Transforming the raw input into a format more amenable to analysis (XML)
II. Extracting and refining the structure
Conversion Experiments:
Tools:
• Adobe Acrobat Professional 7.8
• Microsoft Word 2003
• Stylus Studio XML Enterprise Suite
• ABBYY PDF Transformer 1.0
Criteria:
1. Generality
2. Low volume
3. Clean & understandable
4. Similarity to XML
5. Having good clues
Input Format (Size KB)          Tool for Conversion               Output Format (Size KB)
DOC (34.5)                      Microsoft Office Word 2003        TXT (2.81)
DOC (34.5)                      Microsoft Office Word 2003        RTF (55)
DOC (34.5)                      Microsoft Office Word 2003        HTML (40.7)
DOC (34.5)                      Microsoft Office Word 2003        XML (55)
DOC (34.5)                      Adobe Acrobat Professional 7.8    PDF (19) with Bookmarks
DOC (34.5)                      Adobe Acrobat Professional 7.8    PDF (15.9) without Bookmarks
PDF (19) with Bookmarks         Adobe Acrobat Professional 7.8    HTML (6.38)
PDF (15.9) without Bookmarks    Adobe Acrobat Professional 7.8    HTML (5.15)
PDF (19) with Bookmarks         Adobe Acrobat Professional 7.8    XML (9.92)
PDF (15.9) without Bookmarks    Adobe Acrobat Professional 7.8    XML (8.30)
PDF (19) with Bookmarks         ABBYY PDF Transformer 1.0         HTML (19.2)
PDF (19) with Bookmarks         ABBYY PDF Transformer 1.0         TXT (2.82)
Logical Structure Extraction
Java parsers
• Solved the mis-tagging problem created during the previous phase
• Extracted all headings existing in the document bookmarks
• Removed some information and XML tags
• Formed the document's logical structure in a clean XML format
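As a rough illustration of the last step, the sketch below nests a flat list of bookmark headings into `<Section>` elements. The thesis used Java parsers; this Python version is a stand-in for the idea only. The `(level, title)` input shape and the `<Document>` and `<Heading>` element names are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def build_structure(bookmarks):
    """Nest (level, title) bookmark pairs into <Section> elements.

    bookmarks: list of (level, title) pairs in document order, where level 1
               is a chapter and levels are >= 1 (hypothetical input shape).
    """
    root = ET.Element("Document")
    stack = [(0, root)]  # path of (level, element) from the root to the current section
    for level, title in bookmarks:
        # Pop until the top of the stack is a proper ancestor of this heading.
        while stack[-1][0] >= level:
            stack.pop()
        section = ET.SubElement(stack[-1][1], "Section")
        ET.SubElement(section, "Heading").text = title
        stack.append((level, section))
    return root
```

Usage: `ET.tostring(build_structure(bookmarks), encoding="unicode")` serializes the nested structure for inspection.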
Hypertext Pages & Text Extraction
• Produced multiple outputs for each chapter, section, subsection, etc. (1.html, 2.html, 2.1.html, etc.)
• Generated a table of contents for headings (used as a frame)
• Connected hypertext outputs sequentially
  • XPath expressions
  • Programming approach
• Formed major document elements
  1. Anchors in long pages
  2. Figures and their captions
  3. Simple & nested lists
  4. Dynamic tables
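The page-splitting and sequential linking described above can be sketched as follows. The thesis generated the pages with XSLT and Saxon; this Python version is only an illustration of the numbering and Previous/Next linking. The `<Heading>` and `<P>` element names and the function names are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from html import escape


def number_sections(parent, prefix=""):
    """Yield (number, section) pairs in document order, e.g. '1', '2', '2.1'."""
    for i, section in enumerate(parent.findall("Section"), start=1):
        number = f"{prefix}{i}"
        yield number, section
        yield from number_sections(section, prefix=number + ".")


def render_pages(root):
    """Return {filename: html} with one page per section, linked sequentially."""
    numbered = list(number_sections(root))
    pages = {}
    for idx, (number, section) in enumerate(numbered):
        title = section.findtext("Heading", default="")
        # Only direct <P> children belong to this page; subsections get their own page.
        body = "".join(f"<p>{escape(p.text or '')}</p>" for p in section.findall("P"))
        links = []
        if idx > 0:
            links.append(f'<a href="{numbered[idx - 1][0]}.html">Previous</a>')
        if idx + 1 < len(numbered):
            links.append(f'<a href="{numbered[idx + 1][0]}.html">Next</a>')
        pages[f"{number}.html"] = (
            f"<html><body><h1>{number} {escape(title)}</h1>"
            f"{body}<p>{' | '.join(links)}</p></body></html>"
        )
    return pages
```

Usage: write each value of the returned dictionary to a file named by its key; the key order can also drive a table-of-contents page.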
Concept Extraction
UML Superstructure Specification
• UML class & package hierarchy extraction
• If the first child of a <Section> element contains the 'Class Descriptions' string, then UML classes & packages can be detected in the grandchildren of that <Section> element
Other specifications:
• Common Warehouse Metamodel (CWM)
• UML Infrastructure (UML Inf.)
• Meta Object Facility (MOF)
Question:
• How can we detect such a logical relation among heading elements automatically?
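The 'Class Descriptions' rule stated for the UML Superstructure Specification can be sketched as below. This is one possible reading of the rule, not the extraction code from the thesis. It assumes a hypothetical layout in which each `<Section>` stores its heading in a `title` attribute, so that the first child of a chapter section is its first subsection.

```python
import xml.etree.ElementTree as ET

MARKER = "Class Descriptions"


def extract_concepts(root):
    """Collect titles of grandchildren of every <Section> whose first child matches MARKER."""
    concepts = []
    for section in root.iter("Section"):
        children = list(section)
        # Rule: the first child of the <Section> must contain the marker string.
        if not children or MARKER not in children[0].get("title", ""):
            continue
        # Grandchildren of the matching <Section> are taken as class/package names.
        for child in children:
            for grandchild in child.findall("Section"):
                concepts.append(grandchild.get("title"))
    return concepts
```

Because the rule depends on one fixed string, it does not carry over to specifications that organize their headings differently, which is the open question raised above.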
Cross Referencing
• Developed an XSLT program to extract heading phrases and their corresponding hyperlinks
• Filtered some phrases which had common substrings, such as Association & AssociationClass
• Removed phrases which had many independent hypertext pages (different entries in user interfaces)
• Also applied package names as anchors in cross referencing, for the UML Superstructure Specification only
• Finally, developed a Java program to replace hyperlinks in generated HTML pages
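The linking step can be illustrated with a short sketch. The thesis used XSLT to collect the phrases and a Java program to apply them; the Python below is a stand-in that works on plain text rather than on finished HTML pages. The function name and the example link targets are hypothetical.

```python
import re
from html import escape


def add_cross_references(text, links):
    """Wrap each known heading phrase found in plain text in a hyperlink.

    links: {phrase: href}, e.g. {"Association": "a.html"} (hypothetical values).
    """
    if not links:
        return escape(text)
    # Word boundaries keep 'Association' from matching inside 'AssociationClass';
    # longest-first ordering lets multi-word phrases win over their own prefixes.
    phrases = sorted(links, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, phrases)) + r")\b")
    out, last = [], 0
    for match in pattern.finditer(text):
        out.append(escape(text[last:match.start()]))
        phrase = match.group(1)
        out.append(f'<a href="{escape(links[phrase])}">{escape(phrase)}</a>')
        last = match.end()
    out.append(escape(text[last:]))
    return "".join(out)
```

Matching on escaped plain text avoids inserting links inside existing tags or attributes, which is a risk when substituting directly in HTML source.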
Usability of User Interfaces
Reasons for generating small hypertext pages:
• A better sense of location (navigating)
• Less chance of getting lost (scrolling)
• Less overwhelming sensation (learning)
• Statistical analysis (interesting topics)
• Faster downloading (entire document!)
• Easier printing, cross referencing among diverse specifications, etc.
User Interfaces Demo
Original OMG Spec.  # of PDF Pages  # of Headings  # of Headings Used in C-Ref  # of U-Tokens in Doc Body  # of U-Tokens in Headings  Data Analysis Results  # of HTML Pages
CORBA               1152            787            662                          13179                      702                        15.1%                  788
UML Sup.            771             418            202                          10204                      378                        12.2%                  421
CWM                 576             550            471                          6434                       463                        13.2%                  551
MOF                 292             61             52                           6065                       92                         8.0%                   62
UML Inf.            218             200            122                          4329                       176                        9.3%                   201
DAIS                188             135            102                          3051                       151                        12.6%                  136
XTCE                90              18             18                           3075                       26                         2.6%                   19
UMS                 78              69             59                           1937                       94                         22.7%                  70
HUTN                74              88             83                           2264                       144                        9.8%                   89
WSDL                38              17             17                           1106                       36                         16.3%                  18
Contributions
1. A generic approach to re-engineering complex documents
2. A data analysis showing that words in headings provide a sufficient basis for document re-engineering
3. Extraction of the document's logical structure in XML format
4. Various techniques for text & concept extraction using W3C technologies
5. Major software components for an “Integrated Document Engineering Tool”
Engineering Lessons & Challenges
Engineering Lessons:
1. Generating a clean XML file from PDF images requires complicated features to recognize each document element correctly and to deal with mis-tagging, page boundaries, etc.
2. The latest technologies play a remarkable role in engineering tasks: e.g. XPath 2.0, compared with parsing packages, offers high-level interaction close to human language
3. Comprehensive data analysis can facilitate the DocEng process, form a better understanding, and help construct robust rules & regulations for such processing
Low-Level Challenges:
1. Generating multiple hypertext pages with Saxon
2. Detecting errors in XSLT programming
3. Creating complicated XPath expressions, etc.
Future Work
1. Extracting the initial XML document independently of Adobe Acrobat
2. Automating the concept extraction procedure or creating some HCI features
3. Developing an automatic document analyzer for comprehensive data analyses
4. Investigating the usability of the current user interfaces to discover users’ demands
5. Generating interaction features in UIs: online query submission to XML files
Publications
Refereed Conference Paper:
M. Nojoumian & T. C. Lethbridge, “Extracting document structure to facilitate a KB creation for UML specifications”, in Proceedings of the 4th IEEE International Conference on Information Technology: New Generations (ITNG), pp. 393-400, Las Vegas, USA, 2007.
Invited to publish in the Journal of Computers (JOC):
M. Nojoumian & T. C. Lethbridge, “Document engineering of complex software specifications”, Academy Publisher.
Thank you very much
Questions?