Document Engineering of Complex Software Specifications

Mehrdad Nojoumian
Supervisor: Professor T. C. Lethbridge
University of Ottawa, School of Information Technology and Engineering
June 4, 2007, MSc Thesis in Computer Science

Motivation and Goal

Problems triggering our motivation:
Software specifications
- are dense and intricate (numerous materials)
- have complicated structures (lots of tables, figures, lists, code, etc.)
- are difficult to browse and navigate
- are mostly available in PDF format or as a single hypertext page

Major goal:
- Re-engineer PDF-based documents (specifications, conference proceedings, e-books, etc.)
- Illustrate how to make more usable versions of documents

Data Analyses

UML Superstructure Specification:
1. The most frequent words among headings
2. Frequency of the previous words as found in the entire document
3. The most frequent words in the document index
Headings and the document index carry the most important words in a document.

Other OMG Specifications:
- Sorted document and heading tokens based on their frequency in two separate lists
- Defined position of heading tokens among document tokens: P1, P2, …, PN
- MP: Mean of [P1…PN]
- NDT: Total number of document tokens
- Percentage = (MP * 100) / NDT
- Most frequent headings (# of occurrences > 2) are among the most frequent words in the entire document

[Chart: Percentage (0 to 100) for each of the 10 OMG specifications; two series, Series1 and Series2]

Document Transformation

I. Transforming the raw input into a format more amenable to analysis (XML)
II. Extracting and refining the structure

Conversion Experiments:

Tools:
- Adobe Acrobat Professional 7.8
- Microsoft Word 2003
- Stylus Studio XML Enterprise Suite
- ABBYY PDF Transformer 1.0

Criteria:
1. Generality
2. Low Volume
3. Clean & Understandable
4. Similarity to XML
5. Having Good Clues

Input Format (Size KB) | Tools for Conversions | Output Format (Size KB)
DOC (34.5) | Microsoft Office Word 2003 | TXT (2.81)
DOC (34.5) | Microsoft Office Word 2003 | RTF (55)
DOC (34.5) | Microsoft Office Word 2003 | HTML (40.7)
DOC (34.5) | Microsoft Office Word 2003 | XML (55)
DOC (34.5) | Adobe Acrobat Professional 7.8 | PDF (19) with Bookmarks
DOC (34.5) | Adobe Acrobat Professional 7.8 | PDF (15.9) without Bookmarks
PDF (19) with Bookmarks | Adobe Acrobat Professional 7.8 | HTML (6.38)
PDF (15.9) without Bookmarks | Adobe Acrobat Professional 7.8 | HTML (5.15)
PDF (19) with Bookmarks | Adobe Acrobat Professional 7.8 | XML (9.92)
PDF (15.9) without Bookmarks | Adobe Acrobat Professional 7.8 | XML (8.30)
PDF (19) with Bookmarks | ABBYY PDF Transformer 1.0 | HTML (19.2)
PDF (19) with Bookmarks | ABBYY PDF Transformer 1.0 | TXT (2.82)

Logical Structure Extraction

Java parsers:
- Solved the mis-tagging problems created during the previous phase
- Extracted all headings present in the document bookmarks
- Removed some information and XML tags
- Formed the document logical structure in a clean XML format

Hypertext pages & Text Extraction

- Produced multiple outputs for each Chapter, Section, Subsection, etc. (1.html, 2.html, 2.1.html, etc.)
- Generated a table of contents for headings (used as a frame)
- Connected hypertext outputs sequentially
- XPath expressions
- Programming approach
- Formed major document elements:
  1. Anchors in long pages
  2. Figures and their captions
  3. Simple & Nested Lists
  4. Dynamic Tables

Concept Extraction

UML Superstructure Specification:
- Extraction of UML class & package hierarchies
- If the first child of a <Section> element contains the ‘Class Descriptions’ string, then UML classes & packages can be detected in the grandchildren of that <Section> element

Other specifications:
- Common Warehouse Meta-model (CWM)
- UML Infrastructure (UML Inf.)
- Meta Object Facility (MOF)

Question?
How can we detect such a logical relation among heading elements automatically?

Cross Referencing

- Developed an XSLT program to extract heading phrases and their corresponding hyperlinks
- Filtered out some phrases that share common substrings, such as Association & AssociationClass
- Removed phrases that mapped to many independent hypertext pages (different entries in the user interfaces)
- Also used package names as cross-referencing anchors, for the UML Superstructure Specification only
- Finally, developed a Java program to replace hyperlinks in the generated HTML pages

Usability of User Interfaces

Reasons for generating small hypertext pages:
- A better sense of location (navigating)
- Less chance of getting lost (scrolling)
- Less overwhelming sensation (learning)
- Statistical analysis (interesting topics)
- Faster downloading (entire document!)
- Easier printing, cross referencing among diverse specifications, etc.

User Interfaces Demo

Data Analysis Results

Original OMG Spec. | # of PDF Pages | # of Headings | # of Headings Used in C-Ref | # of U-Tokens in Doc Body | # of U-Tokens in Headings | % (column label missing in source) | # of HTML Pages
CORBA | 1152 | 787 | 662 | 13179 | 702 | 15.1% | 788
UML Sup. | 771 | 418 | 202 | 10204 | 378 | 12.2% | 421
CWM | 576 | 550 | 471 | 6434 | 463 | 13.2% | 551
MOF | 292 | 61 | 52 | 6065 | 92 | 8.0% | 62
UML Inf. | 218 | 200 | 122 | 4329 | 176 | 9.3% | 201
DAIS | 188 | 135 | 102 | 3051 | 151 | 12.6% | 136
XTCE | 90 | 18 | 18 | 3075 | 26 | 2.6% | 19
UMS | 78 | 69 | 59 | 1937 | 94 | 22.7% | 70
HUTN | 74 | 88 | 83 | 2264 | 144 | 9.8% | 89
WSDL | 38 | 17 | 17 | 1106 | 36 | 16.3% | 18

Contributions

1. A generic approach to reengineer complex documents
2. A data analysis showing that words in headings provide a sufficient basis for document reengineering
3. Extraction of the document logical structure in XML format
4. Various techniques for text & concept extraction using W3C technologies
5. Major software components for an “Integrated Document Engineering Tool”

Engineering Lessons & Challenges

Engineering Lessons:
1. Generating a clean XML file from PDF images requires complicated features to recognize each document element correctly and to deal with mis-tagging, page boundaries, etc.
2. The latest technologies play a remarkable role in engineering tasks: e.g. XPath 2.0, compared with parsing packages, offers a high-level interaction that is close to human language
3. Comprehensive data analysis can facilitate the DocEng process, form a better understanding, and help construct robust rules & regulations for such processing

Low Level Challenges:
1. Generating multiple hypertext pages with Saxon
2. Detecting errors in XSLT programming
3. Creating complicated XPath expressions, etc.

Future Work

1. Extracting the initial XML document independently of Adobe Acrobat
2. Automating the concept extraction procedure or creating some HCI features
3. Developing an automatic document analyzer for comprehensive data analyses
4. Investigating the usability of the current user interfaces to discover users’ demands
5. Generating interaction features in UIs: online query submission to XML files

Publications

Refereed Conference Paper:
M. Nojoumian & T. C. Lethbridge, “Extracting document structure to facilitate a KB creation for UML specifications”, in Proceedings of the 4th IEEE International Conference on Information Technology: New Generations (ITNG), pp. 393-400, Las Vegas, USA, 2007.

Invited to publish in the Journal of Computers (JOC):
M. Nojoumian & T. C. Lethbridge, “Document engineering of complex software specifications”, Academy Publisher.

Thank you very much

Questions?
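
Appendix sketch: the Concept Extraction rule as an XPath query

The rule stated on the Concept Extraction slide (if the first child of a <Section> element contains the ‘Class Descriptions’ string, look for classes & packages among that element's grandchildren) can be illustrated with a short Java program using the standard javax.xml.xpath API. This is only a minimal sketch under stated assumptions, not the thesis implementation: the <Heading> element name, the class name ClassDescriptionRule, the method name classHeadings, and the sample headings are all hypothetical, and the query is written in XPath 1.0, which is what the JDK ships, whereas the slides mention XPath 2.0.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ClassDescriptionRule {
    // Assumed schema: every <Section> starts with a <Heading> child.
    // Select <Heading> grandchildren of any <Section> whose first child
    // element contains the string 'Class Descriptions'.
    static final String RULE =
        "//Section[contains(*[1], 'Class Descriptions')]/*/Heading";

    // Returns the trimmed text of every heading matched by RULE.
    static List<String> classHeadings(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate(RULE, doc, XPathConstants.NODESET);
            List<String> headings = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                headings.add(nodes.item(i).getTextContent().trim());
            }
            return headings;
        } catch (Exception e) {
            // Wrap parser/XPath failures so callers need no checked exceptions.
            throw new IllegalStateException("Could not evaluate rule", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample in the assumed schema (not taken from a real spec).
        String xml =
            "<Document>"
          + "<Section><Heading>1.1 Class Descriptions</Heading>"
          + "<Section><Heading>1.1.1 Association</Heading></Section>"
          + "<Section><Heading>1.1.2 AssociationClass</Heading></Section>"
          + "</Section>"
          + "<Section><Heading>1.2 Diagrams</Heading>"
          + "<Section><Heading>1.2.1 Overview</Heading></Section>"
          + "</Section>"
          + "</Document>";
        for (String heading : classHeadings(xml)) {
            System.out.println(heading);
        }
    }
}
```

On the made-up sample, only the two headings nested under the ‘Class Descriptions’ section are selected; the headings under the other section are ignored. Keeping the rule as a single XPath string means it could be adapted per specification (CWM, UML Inf., MOF) without touching the surrounding Java code, although whether the same rule fits those specifications is exactly the open question raised on the slide.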