Automating Content Analysis with Trang and
Simple XSLT Scripts
Bob DuCharme
XML 2008
December 9, 2008
www.innodata-isogen.com
What We Do
We help companies lower the cost of
creating and managing information.
About me
• Solutions Architect, Innodata Isogen
• weblog: http://www.snee.com/bobdc.blog
• other writing: see http://www.snee.com/bob
• URLs referenced today: http://www.snee.com/xml/xml2008
Single source publishing and “editorial” XML
[Diagram: Inputs 1 to 5 pass through Processes A, B and C into the Editorial Master (XML); Processes D, E and F turn the master into Outputs 1, 2 and 3.]
Content analysis: why?
• You’ve “inherited” some content
• Convert to your current editorial format
• Convert it to new output formats
• Efficient development of efficient conversion routines
Handy tool 1 before we get to the XML parts: sort
• colors.txt:
red
green
blue
green
blue
blue
red
$ sort colors.txt
blue
blue
blue
green
green
red
red
Handy tool 2 before we get to the XML parts: uniq
sort colors.txt | uniq -c
3 blue
2 green
2 red
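The same tally can be sketched in Python for anyone working without a Unix shell. This is only an illustration: the function name `count_lines` is made up here, and the list stands in for a colors.txt with three blue, two green and two red lines, matching the counts above.

```python
from collections import Counter


def count_lines(lines):
    """Mimic `sort | uniq -c`: return (count, line) pairs ordered by line."""
    tally = Counter(line.rstrip("\n") for line in lines)
    return [(count, value) for value, count in sorted(tally.items())]


# Assumed contents of colors.txt; only the totals matter, not the order.
colors = ["red", "green", "blue", "green", "blue", "blue", "red"]

for count, value in count_lines(colors):
    # Right-align the count, roughly the way uniq -c does.
    print(f"{count:7d} {value}")
```

With a real file, `count_lines(open("colors.txt"))` would feed the lines in directly.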
Sample data
trang
From http://www.thaiopensource.com/relaxng/trang.html:
Trang converts between different schema languages for XML. It supports
the following languages:
• RELAX NG (XML syntax)
• RELAX NG compact syntax
• XML 1.0 DTDs
• W3C XML Schema
A schema written in any of the supported schema languages can be
converted into any of the other supported schema languages, except that
W3C XML Schema is supported for output only, not for input.
Trang can also infer a schema from one or more example XML
documents.
trang
Trang can also infer a schema
from one or more example XML
documents!!!!!
Analyzing content with trang
<?xml version="1.0" encoding="UTF-8"?>
<whatever>
<somedoc>Here is one document</somedoc>
<somedoc>Here is another</somedoc>
<somedoc>Here is another</somedoc>
<somedoc>Here is another</somedoc>
</whatever>
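Building that combined file by hand gets tedious with many documents, so here is a minimal Python sketch of the wrapping step. The function name `combine_documents` and the sample strings are assumptions for illustration. It only strips each document's XML declaration; DOCTYPE declarations and byte-order marks are not handled.

```python
import re
import xml.etree.ElementTree as ET

# Matches an XML declaration at the very start of a document string.
XML_DECL = re.compile(r"^\s*<\?xml[^>]*\?>\s*")


def combine_documents(documents, wrapper="whatever"):
    """Wrap several XML document strings in one container element.

    Each document's own XML declaration is dropped, because a declaration
    is only allowed at the very start of a file.
    """
    bodies = [XML_DECL.sub("", doc, count=1).strip() for doc in documents]
    lines = ['<?xml version="1.0" encoding="UTF-8"?>', f"<{wrapper}>"]
    lines.extend(bodies)
    lines.append(f"</{wrapper}>")
    return "\n".join(lines) + "\n"


docs = [
    '<?xml version="1.0"?>\n<somedoc>Here is one document</somedoc>',
    "<somedoc>Here is another</somedoc>",
]
combined = combine_documents(docs)

# Parsing the result confirms that the combined text is well-formed.
root = ET.fromstring(combined.encode("utf-8"))
print(root.tag, len(root))
```

The combined text can then be saved to a file and handed to trang as its example document.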
Create RELAX NG versions of …
• Elsevier article DTD:
trang art510.dtd art510.rng
• Combined sample content:
trang issueContents.xml issueContents.rng
• Compare results:
saxon art510.rng compareElsRNG.xsl | sort > compareElsRNG.out
compareElsRNG.xsl (1 of 2)
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:r="http://relaxng.org/ns/structure/1.0">
<xsl:strip-space elements="*"/>
<xsl:output method="text"/>
<xsl:variable name="schema"
select="document('issueContents.rng')"/>
<xsl:template match="text()"/>
compareElsRNG.xsl (2 of 2)
<xsl:template match="r:element">
<xsl:variable name="name" select="@name"/>
<xsl:choose>
<xsl:when test="$schema/r:grammar//r:element/@name[. = $name]">
Yes: <xsl:value-of select="$name"/>
</xsl:when>
<xsl:otherwise>
No: <xsl:value-of select="$name"/>
</xsl:otherwise>
</xsl:choose>
<xsl:apply-templates/>
</xsl:template>
</xsl:stylesheet>
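For comparison, the same Yes/No check can be sketched in Python with the standard library. The function names and the two tiny inline schemas are invented for illustration. Unlike the stylesheet, this version reports each element name once, even if the schema declares it in several places.

```python
import xml.etree.ElementTree as ET

RNG = "{http://relaxng.org/ns/structure/1.0}"


def element_names(rng_root):
    """Collect the name attribute of every RELAX NG <element> pattern."""
    return {el.get("name") for el in rng_root.iter(RNG + "element") if el.get("name")}


def compare_schemas(dtd_rng, sample_rng):
    """Return sorted 'Yes: name' / 'No: name' lines, one per declared element."""
    used = element_names(sample_rng)
    lines = []
    for name in element_names(dtd_rng):
        flag = "Yes" if name in used else "No"
        lines.append(f"{flag}: {name}")
    return sorted(lines)


# Made-up stand-ins for art510.rng (from the DTD) and issueContents.rng (inferred).
DTD_RNG = ET.fromstring("""<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <define name="article"><element name="article"><ref name="body"/></element></define>
  <define name="body"><element name="body"><text/></element></define>
  <define name="colspec"><element name="tb:colspec"><empty/></element></define>
</grammar>""")
SAMPLE_RNG = ET.fromstring("""<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <define name="article"><element name="article"><ref name="body"/></element></define>
  <define name="body"><element name="body"><text/></element></define>
</grammar>""")

for line in compare_schemas(DTD_RNG, SAMPLE_RNG):
    print(line)
```

With real files, `ET.parse("art510.rng").getroot()` would replace the inline strings.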
compareElsRNG.xsl: some sample output
No: tb:colspec
No: tb:left-border
No: tb:right-border
No: tb:top-border
Yes: aid
Yes: article
Yes: body
Yes: ce:abstract
Yes: ce:abstract-sec
Yes: ce:acknowledgment
Yes: ce:affiliation
Analyzing the XML itself
• Or SGML, after using James Clark’s sx:
sx -f err.out -x lower myfile.sgm > myfile.xml
Counting elements: countElements.xsl
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:strip-space elements="*"/>
<xsl:output method="text"/>
<xsl:template match="text()"/>
<xsl:template match="*">
<xsl:value-of select="name()"/>
<xsl:text>
</xsl:text>
<xsl:apply-templates/>
</xsl:template>
</xsl:stylesheet>
Using countElements.xsl to count elements
saxon issueContents.xml countElements.xsl | sort | uniq -c | sort
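Where no XSLT processor is handy, the whole pipeline can be approximated in a few lines of Python. This is a sketch, not the method used in the talk: `count_elements` and the sample document are invented, and ElementTree reports names as `{namespace-uri}local-name` rather than with the prefixes that `name()` returns.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_elements(root):
    """Tally how often each element name occurs, including the root itself."""
    return Counter(el.tag for el in root.iter() if isinstance(el.tag, str))


# Made-up sample standing in for issueContents.xml.
SAMPLE = """<doc xmlns:ce="http://www.elsevier.com/xml/common/dtd">
  <ce:para>One <ce:italic>two</ce:italic> <ce:italic>three</ce:italic></ce:para>
  <ce:para>Four</ce:para>
</doc>"""

counts = count_elements(ET.fromstring(SAMPLE))

# Least frequent first, as the final `sort` in the pipeline arranges it.
for tag, n in sorted(counts.items(), key=lambda item: (item[1], item[0])):
    print(f"{n:7d} {tag}")
```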
Result of counting elements
Start of list:
1 ce:chem
1 ce:displayed-quote
1 ce:inline-figure
1 ce:nomenclature
1 ce:textbox
1 ce:textbox-body
1 ce:underline
1 ce:vsp
1 doc
1 sb:e-host
2 small-caps
3 display
3 formula
End of list:
5726 ce:cross-ref
6916 entry
7225 mml:mo
7760 sb:maintitle
7760 sb:title
7929 ce:label
8458 ce:hsp
9326 mml:mi
10331 mml:mrow
12438 ce:italic
16453 sb:author
17082 ce:given-name
17095 ce:surname
Count element/parent combinations
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:strip-space elements="*"/>
<xsl:output method="text"/>
<xsl:template match="text()"/>
<xsl:template match="*">
<xsl:value-of select="name(..)"/>/<xsl:value-of
select="name()"/>
<xsl:text>
</xsl:text>
<xsl:apply-templates/>
</xsl:template>
</xsl:stylesheet>
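A rough Python equivalent of this parent/child tally follows, with an invented function name and sample. ElementTree elements carry no parent pointer, so the sketch walks each parent and looks at its children. The root is recorded with an empty parent name, which is what `name(..)` yields for the document element.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_parent_child(root):
    """Tally parent/child element name combinations."""
    pairs = Counter()
    pairs[f"/{root.tag}"] += 1  # the root element has no parent element
    for parent in root.iter():
        for child in parent:
            pairs[f"{parent.tag}/{child.tag}"] += 1
    return pairs


# Made-up sample document.
SAMPLE = "<doc><sec><para/><para/></sec><para/></doc>"
pairs = count_parent_child(ET.fromstring(SAMPLE))

for pair, n in sorted(pairs.items(), key=lambda item: (item[1], item[0])):
    print(f"{n:7d} {pair}")
```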
Some parent/child counts
1 ce:displayed-quote/ce:simple-para
59 ce:biography/ce:simple-para
107 ce:legend/ce:simple-para
115 ce:abstract-sec/ce:simple-para
859 ce:caption/ce:simple-para
countAttributes.xsl
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:strip-space elements="*"/>
<xsl:output method="text"/>
<xsl:template match="text()"/>
<xsl:template match="@*">
<xsl:value-of select="name(..)"/>
<xsl:text>/@</xsl:text>
<xsl:value-of select="name()"/>
<xsl:text>
</xsl:text>
</xsl:template>
<xsl:template match="*">
<xsl:apply-templates select="*|@*"/>
</xsl:template>
</xsl:stylesheet>
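The attribute tally can be sketched the same way in Python. The names below (`count_attributes`, `fig`, `locator`) are made up for the example. Namespace declarations are not counted, since ElementTree, like XSLT, does not treat them as attributes.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_attributes(root):
    """Tally element/@attribute combinations across the whole tree."""
    counts = Counter()
    for el in root.iter():
        for attr in el.attrib:
            counts[f"{el.tag}/@{attr}"] += 1
    return counts


# Made-up sample document.
SAMPLE = '<doc><fig id="f1" locator="gr1"/><fig id="f2"/><para id="p1"/></doc>'
attr_counts = count_attributes(ET.fromstring(SAMPLE))

for key, n in sorted(attr_counts.items(), key=lambda item: (item[1], item[0])):
    print(f"{n:7d} {key}")
```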
Counting the attributes: an excerpt
1 ce:[email protected]
28 ce:[email protected]
44 ce:[email protected]
50 ce:[email protected]
79 ce:[email protected]
104 ce:[email protected]
142 ce:[email protected]
175 ce:[email protected]
180 ce:[email protected]
182 ce:[email protected]
713 ce:[email protected]
4224 ce:[email protected]
Count formula elements with/without ID values
<xsl:stylesheet version="1.0"
xmlns:ce="http://www.elsevier.com/xml/common/dtd"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="/">
Yes: <!-- finds 180 -->
<xsl:value-of select="count(//ce:formula[@id])"/>
No: <!-- finds 208 -->
<xsl:value-of select="count(//ce:formula[not(@id)])"/>
</xsl:template>
</xsl:stylesheet>
Find all values of a particular attribute
<xsl:stylesheet version="1.0"
xmlns:ce="http://www.elsevier.com/xml/common/dtd"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="*">
<xsl:apply-templates select="*|@*"/>
</xsl:template>
<xsl:template match="text()|@*"/>
<xsl:template match="ce:[email protected]">
<xsl:value-of select="."/><xsl:text>
</xsl:text>
</xsl:template>
</xsl:stylesheet>
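Here is a Python sketch of the same idea: collect every value of one attribute on one element type and count the repeats. The element and attribute names (`link`, `locator`) and the function name are placeholders chosen for illustration, not taken from the stylesheet above.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def attribute_values(root, tag, attr):
    """Count each distinct value of `attr` on elements named `tag`."""
    return Counter(
        el.get(attr) for el in root.iter(tag) if el.get(attr) is not None
    )


# Made-up sample; the last <link> has no locator and is ignored.
SAMPLE = (
    '<doc><link locator="gr1"/><link locator="gr2"/>'
    '<link locator="gr1"/><link/></doc>'
)
values = attribute_values(ET.fromstring(SAMPLE), "link", "locator")

for value, n in sorted(values.items(), key=lambda item: (item[1], item[0])):
    print(f"{n:7d} {value}")
```

For namespaced elements the tag would need the `{namespace-uri}local-name` form that ElementTree uses.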
Running OneAttValue.xsl
xsltproc OneAttValue.xsl issueContents.xml | sort | uniq -c | sort
• Output ending like this:
10 gr12
11 gr11
14 gr10
17 fx1
17 fx2
18 gr9
24 gr8
37 gr7
55 gr6
67 gr5
91 gr4
99 gr3
103 gr1
103 gr2
Output just the comments in a document
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="text()"/>
<xsl:template match="comment()">
<xsl:copy/>
</xsl:template>
</xsl:stylesheet>
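Comments and processing instructions can also be pulled out with Python's ElementTree, as long as the parser is told to keep them. This sketch assumes Python 3.8 or later, where `TreeBuilder` accepts `insert_comments` and `insert_pis`. The function name and sample are invented, and only nodes inside the root element are captured.

```python
import xml.etree.ElementTree as ET


def comments_and_pis(xml_text):
    """Return (comments, processing instructions) found inside the root element."""
    builder = ET.TreeBuilder(insert_comments=True, insert_pis=True)
    root = ET.fromstring(xml_text, parser=ET.XMLParser(target=builder))
    comments = [node.text for node in root.iter() if node.tag is ET.Comment]
    pis = [
        node.text for node in root.iter() if node.tag is ET.ProcessingInstruction
    ]
    return comments, pis


# Made-up sample document.
SAMPLE = "<doc><!--first--><p>text<?page break?></p><!--second--></doc>"
found_comments, found_pis = comments_and_pis(SAMPLE)
print(found_comments)
print(found_pis)
```

Each processing instruction comes back as its target and data joined by a space.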
Output just the processing instructions in a document
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml"/>
<xsl:template match="text()"/>
<xsl:template match="processing-instruction()">
<xsl:copy/>
</xsl:template>
</xsl:stylesheet>
elAttList.xsl goal
• Go through rng schema
• For each element, output
dtdname.dtd\telementName
• For each attribute, output
dtdname.dtd\telementName\tattributeName
elAttList.xsl part 1 of 2
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:r="http://relaxng.org/ns/structure/1.0"
version="1.0">
<xsl:param name="dtdname"
>no dtdname parameter supplied</xsl:param>
<xsl:strip-space elements="*"/>
<xsl:output method="text"/>
<xsl:template match="r:files|r:attribute| r:value "/>
elAttList.xsl part 2 of 2
<xsl:template match="r:element">
<xsl:variable name="elName" select="@name"/>
<xsl:value-of select="$dtdname"/>
<xsl:text>&#9;</xsl:text>
<xsl:value-of select="@name"/>
<xsl:text>&#10;</xsl:text>
<xsl:for-each
select="r:attribute | r:optional/r:attribute">
<xsl:value-of select="$dtdname"/>
<xsl:text>&#9;</xsl:text>
<xsl:value-of select="$elName"/>
<xsl:text>&#9;</xsl:text>
<xsl:value-of select="@name"/>
<xsl:text>&#10;</xsl:text>
</xsl:for-each>
<xsl:apply-templates/>
</xsl:template>
</xsl:stylesheet>
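The tab-separated listing can be approximated in Python as well. This sketch uses an invented function name and a tiny inline schema. It lists direct attributes before optional ones rather than in strict document order, which matters little once the rows are sorted or loaded into a spreadsheet.

```python
import xml.etree.ElementTree as ET

RNG = "{http://relaxng.org/ns/structure/1.0}"


def element_attribute_rows(grammar, dtdname):
    """Return one row per element and one row per element/attribute pair."""
    rows = []
    for el in grammar.iter(RNG + "element"):
        name = el.get("name")
        rows.append((dtdname, name))
        # Attributes declared directly, or wrapped in <optional>.
        attrs = el.findall(RNG + "attribute")
        attrs += el.findall(f"{RNG}optional/{RNG}attribute")
        for attr in attrs:
            rows.append((dtdname, name, attr.get("name")))
    return rows


# Made-up schema fragment standing in for a trang-generated .rng file.
SAMPLE = ET.fromstring("""<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <define name="para">
    <element name="para">
      <attribute name="id"/>
      <optional><attribute name="role"/></optional>
      <text/>
    </element>
  </define>
</grammar>""")

rows = element_attribute_rows(SAMPLE, "sample.dtd")
for row in rows:
    print("\t".join(row))
```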
normalizeRNG.xsl
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:r="http://relaxng.org/ns/structure/1.0" >
<xsl:output indent="yes"/>
<xsl:template match="r:element/r:ref | r:optional/r:ref">
<xsl:variable name="referent" select="@name"/>
<xsl:apply-templates select="//r:define[@name = $referent]"
mode="copying"/>
</xsl:template>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="r:define" mode="copying">
<xsl:apply-templates select="node()"/>
</xsl:template>
</xsl:stylesheet>
Analyzing an SGML DTD
• Why? When migrating away from it
• RNG and W3C XSD schemas are both XML; an SGML DTD is not
• Using Earl Hood’s perlSGML DTD analysis tools
XML-based analysis of SGML DTD
1. Run Earl Hood’s dtd2html utility
2. Run tagsoup or HTML Tidy on output files
3. Now you’ve got XML where you can pull out element
information with XSLT
XML-based analysis of SGML DTD (revised)
1. Tweak dtd2html to add <div class="whatever"></div> elements
2. Run Earl Hood’s dtd2html utility
3. Run tagsoup or HTML Tidy on output files
4. Now you’ve got XML where you can pull out element
information with XSLT
Summary
• This is not an integrated report generator. It’s
Legos.
• Pipelining data between existing tools, re-usable
scripts, and quick hacks.
• Document your command lines, e.g.
saxon temp1.xml temp3.xsl > temp1a.xml
• Clients like reports, especially in spreadsheets.
Thank you!
• Referenced resources:
http://www.snee.com/xml/xml2008