Performing impossible feats of
XML processing with pipelining
XML Open 2004
Sean McGrath
Propylon
http://www.propylon.com
http://seanmcgrath.blogspot.com
Sean McGrath http://www.propylon.com 1
Contents
• The pipelining philosophy
• Major functional elements of pipelines
• Some examples
• Pipelining and Grids
• Pipelining and Web Services/SOAs
• Some anticipated objections (and answers)
• Some musings
• Some technology pointers
What is XML pipelining?
• It is an architectural framework for
developing robust, scalable, manageable
XML processing systems.
• It is based on proven mechanical manufacturing
patterns. Specifically:
– Assembly Lines (divide and conquer)
– Component assembly and component re-use
What is XML pipelining and why
is it useful?
• A way of thinking about systems that focuses on
XML dataflows rather than object APIs. (This is
a critical and non-trivial focus shift for many
programmers!)
• Why? Because pipelining provides a mechanical,
inspiration-free, genius-free way of handling the
mind-boggling complexity of complex XML
transformation projects.
Pipelining Philosophy
XML is all about complex
hierarchical data structures…
Pipelining Philosophy
Cars are complex, hierarchical structures
Henry Ford’s Model T Ford Assembly Line – 1914
Pipelining Philosophy
Lunch is a complex, hierarchical structure
Lunch Assembly Line. NY, 2004
Pipelining Philosophy
We are complex, hierarchical structures
Pipelining philosophy
• What have these scenes got in common?
– Complex construction of cars, tuna melts and
tendons made possible and efficient through
• assembly line manufacturing pattern of divide and
conquer
• re-usable component processes and component
materials
• Why not apply this approach to XML
“manufacturing”?
Pipeline philosophy
• Why does the assembly line approach work?
– Transformation task decomposition
– Re-usable transformation components
• Transformation decomposition is the key to
complexity management. Just ask:
– Henry Ford
– Herbert Simon (The Two Watchmakers – “The Architecture of
Complexity”)
– George Miller (7+/-2)
– Adam Smith (An Inquiry into the Nature and Causes of the
Wealth of Nations, 1776)
– Any electrical or chemical engineer.
Pipeline philosophy
• Component re-use is the key to productivity
– Ask any form of engineer (electrical, chemical
etc.) apart from software engineers…
– Component re-use remains a holy grail in
software engineering
– Pipelining is yet another attempt based on data
transformation and data flow rather than
algorithms
Pipeline philosophy
• A lot of data processing for the foreseeable future
will consist of XML to XML transformation
• A lot of non-XML data processing can consist of
XML to XML transformations with the addition
of top and tail transformations to non-XML
formats
• An XML pipeliner’s mantra:
1. Get data into XML as quickly as possible
2. Keep it in XML until the last possible minute
3. Bring all your XML tools to bear on solving the data processing
problem
Pipeline philosophy
[Diagram: Non-XML Input → Top Transformation → XML → … → XML → Tail Transformation → Non-XML Output]
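The top and tail idea can be sketched in a few lines of Python. This is only an illustration, assuming CSV as the non-XML format; the function names (top_csv_to_xml, upper_names, tail_xml_to_csv) and the sample data are hypothetical and not part of the talk.

```python
import csv
import io
import xml.etree.ElementTree as ET


def top_csv_to_xml(text):
    """Top transformation: non-XML (CSV) in, XML out."""
    root = ET.Element("rows")
    for record in csv.DictReader(io.StringIO(text)):
        row = ET.SubElement(root, "row")
        for key, value in record.items():
            # Assumes the CSV column names are valid XML element names.
            ET.SubElement(row, key).text = value
    return root


def upper_names(root):
    """An XML to XML stage: upper-case the text of every <name> element."""
    for name in root.iter("name"):
        name.text = (name.text or "").upper()
    return root


def tail_xml_to_csv(root):
    """Tail transformation: XML in, non-XML (CSV) out.

    Assumes at least one <row>; the first row supplies the column names.
    """
    rows = [{child.tag: child.text for child in row} for row in root]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


csv_in = "name,qty\nwidget,2\ngadget,5\n"
csv_out = tail_xml_to_csv(upper_names(top_csv_to_xml(csv_in)))
print(csv_out)
```

Everything between the top and the tail sees only XML, so any XML tool can be dropped into the middle without knowing about CSV.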
Pipeline philosophy
• The philosophy hinges on the fact that every
complex XML transformation can be broken down
into a series of smaller ones that can be chained
together
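A minimal Python sketch of chaining, assuming each stage is a plain function that takes an ElementTree element and returns one. The helper name pipeline and the two example stages are hypothetical, chosen only to show the shape.

```python
from functools import reduce
import xml.etree.ElementTree as ET


def pipeline(*stages):
    """Chain XML-in, XML-out stages into a single XML-in, XML-out stage."""
    return lambda doc: reduce(lambda current, stage: stage(current), stages, doc)


def lowercase_tags(root):
    """Stage: lower-case every element name."""
    for elem in root.iter():
        elem.tag = elem.tag.lower()
    return root


def strip_attributes(root):
    """Stage: remove all attributes from every element."""
    for elem in root.iter():
        elem.attrib.clear()
    return root


transform = pipeline(lowercase_tags, strip_attributes)
doc = ET.fromstring('<Foo ID="1"><Bar>baz</Bar></Foo>')
result = ET.tostring(transform(doc), encoding="unicode")
print(result)
```

Because a pipeline built this way is itself a stage, pipelines can be nested inside other pipelines.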
Pipeline philosophy
• Only so many ways to
re-arrange an XML
tree structure
• A finite number of
fundamental
transformations, from
which all
transformations can be
derived
Pipeline philosophy
1. Starting point: data at time T conforming to “spec” A.
Data at time T2 conforming to “spec” B.
2. Transformation Analysis/Decomposition – decompose
the problem of getting from A to B into independent
XML in, XML out stages
3. Decide what transformation components you already
have.
4. Implement the ones you don’t – make them re-usable
for the next transformation project.
Pipeline philosophy
– Transformation analysis & decomposition leads to
• a series of small, manageable, “stand alone” problems with an
XML input “spec” and an XML output “spec”. “Spec” =
schemas + structure rules + narrative.
• Can build, test, use and then re-use these transformation
components
• Very team development friendly – parallel development of
loosely coupled components
• Very debugging friendly – log2(n) “chops” to find any given
problem.
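The log2(n) claim can be illustrated with a binary search over the intermediate documents. This is a sketch under two assumptions that are not in the talk: the output of every stage has been kept, and a document that has gone bad stays bad in all later stages. The names run_with_snapshots and first_bad_stage are hypothetical.

```python
def run_with_snapshots(stages, doc):
    """Run each stage in turn, keeping the output of every stage."""
    snapshots = []
    for stage in stages:
        doc = stage(doc)
        snapshots.append(doc)
    return snapshots


def first_bad_stage(snapshots, is_ok):
    """Binary search for the index of the first stage whose output fails is_ok.

    Returns None if every snapshot passes.
    """
    lo, hi = 0, len(snapshots)
    while lo < hi:
        mid = (lo + hi) // 2
        if is_ok(snapshots[mid]):
            lo = mid + 1   # fault is introduced after this stage
        else:
            hi = mid       # fault is introduced at or before this stage
    return lo if lo < len(snapshots) else None


# Eight stages working on serialized XML; stage 5 wrongly drops the <id> element.
stages = [lambda d: d] * 8
stages[5] = lambda d: d.replace("<id>7</id>", "")
snapshots = run_with_snapshots(stages, "<rec><id>7</id></rec>")

checks = []


def is_ok(doc):
    checks.append(doc)          # count how many "chops" were needed
    return "<id>" in doc


culprit = first_bad_stage(snapshots, is_ok)
print(culprit, len(checks))
```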
Pipeline debugging
[Diagram: Non-XML Input → Top Transformation → XML → Delta 1 → XML → … → Delta N → XML → Tail Transformation → Non-XML Output, with a schema at each stage boundary: Schema A, Schema Delta 1, …, Schema Delta N, Schema B]
Pipeline philosophy
• The answer to the SAX/DOM question is “mu”.
(More on this later)
• No such thing as “the” correct abstraction for
processing XML
• Pipeline approach means you can mix ‘n’ match
black-box components that internally use
whatever paradigm best suits the problem
– Lexical
– SAX, STAX, DOM, XOM
– COmega, XSLT, XQuery
– XDuce, Pyxie, Java, C#, Groovy, Ruby, Haskell, WebIt! etc.
Sample Pipeline
[Diagram: a sample pipeline starting from a DB/CMS. Stage labels: Character Set Mods; Add Doctype + validate + strip doctype; Re-arrange Elements; Validation; Stats + FTP; SQL Replace; XHTML Generate. Technology labels: Lexical, Lexical, DOM, Schematron/RelaxNG/Rhino, Jython, Java, XSLT]
Pipeline philosophy
• Many XML transformations end up monolithic
• Assertion : developers would use a more
component based approach to XML processing if
they did not have to write the plumbing
(orchestration, exception handling) themselves
– “Gee, this problem is complex. Maybe I’ll do it in
multiple stages! Gee, now I have to orchestrate the
stages somehow. Batch files/shell scripts/driver
program – all ugly and error prone. Maybe I’ll just
write a single program after all. Besides, it will run
faster...”
Pipeline philosophy
• “Professional developers spend 50
percent of their time writing plumbing” –
Adam Bosworth
• Pipelining promotes the creation of a
reusable plumbing “layer” letting
developers concentrate on the
application in hand.
Philosophy Summary
• Think flow - data processing == data
transformation w.r.t. time – Michael
Jackson
• XML is the current runaway winner in the
self-descriptive data stakes and a very good
IDDL (Intermediate Data Description
Language) for all types of data that are not
natively XML based
Philosophy Summary
• Inside every complex XML transformation
is a sequence of simpler XML
transformations trying to get out – a
pipeline
• Decomposed transformation:
– new transformations +
– already componentized transformations
– -> Component Reuse Nirvana
Pipeline Philosophy
[Diagram: three levels, each with In and Out connections]
Level 0 – transformation component
Level 1 – pipeline
Level 2 – Rudimentary orchestration
Simple pipeline transformation
component examples
• Fundamental Operation – Rename Element
– Rename
• Input : <foo>baz</foo>
• Output: <bar>baz</bar>
[Diagram: the tree foo → baz beside the tree bar → baz]
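As a rough Python sketch, Rename can be written with the standard library's ElementTree. The function name rename_element is hypothetical.

```python
import xml.etree.ElementTree as ET


def rename_element(xml_text, old, new):
    """Rename every <old> element to <new>, leaving content untouched."""
    root = ET.fromstring(xml_text)
    for elem in list(root.iter(old)):
        elem.tag = new
    return ET.tostring(root, encoding="unicode")


renamed = rename_element("<foo>baz</foo>", "foo", "bar")
print(renamed)
```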
Simple pipeline transformation
component examples
• Fundamental Operation - Peel
• Input : <foo><bar>baz</bar></foo>
• Output: <foo>baz</foo>
[Diagram: the tree foo → bar → baz beside the tree foo → baz]
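Peel is slightly fiddlier than it looks once mixed content is involved, because the peeled element's text and trailing text have to be re-attached in the right place. The following Python sketch is one way to do it with ElementTree; the names peel and _append_text are hypothetical, and the root element itself is never peeled.

```python
import xml.etree.ElementTree as ET


def _append_text(parent, after, text):
    """Attach text after sibling `after`, or to parent.text if there is none."""
    if not text:
        return
    if after is None:
        parent.text = (parent.text or "") + text
    else:
        after.tail = (after.tail or "") + text


def peel(xml_text, tag):
    """Remove every <tag> wrapper below the root, keeping its content in place."""
    root = ET.fromstring(xml_text)
    # Reverse document order, so nested wrappers are peeled innermost first.
    for parent in reversed(list(root.iter())):
        for child in list(parent):
            if child.tag != tag:
                continue
            index = list(parent).index(child)
            previous = parent[index - 1] if index > 0 else None
            grandchildren = list(child)
            _append_text(parent, previous, child.text)
            parent.remove(child)
            for offset, grandchild in enumerate(grandchildren):
                parent.insert(index + offset, grandchild)
            last = grandchildren[-1] if grandchildren else previous
            _append_text(parent, last, child.tail)
    return ET.tostring(root, encoding="unicode")


peeled = peel("<foo><bar>baz</bar></foo>", "bar")
print(peeled)
```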
Simple pipeline transformation
component examples
• Compound Operation - Matryoshka
• Input:
– <foo><bar>baz</bar></foo>
• Output:
– <foo></foo><bar></bar>baz
[Diagram: the input and output trees for foo, bar and baz]
Simple pipeline transformation
component examples
• KlingonCloak
– Input:
• <foo><bar>baz</bar></foo>
– Output:
• <tag name="foo"><tag name="bar">baz</tag></tag>
[Diagram: the tree foo → bar → baz beside the tree tag → tag → baz; the diagram labels the attributes type="foo" and type="bar"]
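A small Python sketch of KlingonCloak, using the name attribute from the input/output example. The function names cloak and uncloak are hypothetical, and uncloak (the inverse) is an addition for illustration only. The sketch assumes the input has no name attributes of its own, since they would be overwritten.

```python
import xml.etree.ElementTree as ET


def cloak(xml_text):
    """Replace every element name with <tag>, moving the name into an attribute."""
    root = ET.fromstring(xml_text)
    for elem in list(root.iter()):
        elem.set("name", elem.tag)
        elem.tag = "tag"
    return ET.tostring(root, encoding="unicode")


def uncloak(xml_text):
    """Hypothetical inverse: restore element names from the name attribute."""
    root = ET.fromstring(xml_text)
    for elem in list(root.iter("tag")):
        elem.tag = elem.attrib.pop("name")
    return ET.tostring(root, encoding="unicode")


cloaked = cloak("<foo><bar>baz</bar></foo>")
print(cloaked)
print(uncloak(cloaked))
```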
Simple pipeline transformation
component examples
• Reading a file is an XML to XML
transformation
– <file>lewisscarrol.xml</file>
– <poem><line>Twas brillig, and the slithy
toves did gyre and gimble in the
wabe</line>…</poem>
Simple pipeline transformation
component examples
• Arithmetic is an XML to XML
transformation
– <expr>1 + 2</expr>
– <res>3</res>
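A toy Python version of the arithmetic component, assuming expressions of the form "integer operator integer" separated by spaces. The function name evaluate and the operator table are hypothetical.

```python
import operator
import xml.etree.ElementTree as ET

# Only these binary integer operators are supported in this sketch.
OPERATORS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def evaluate(xml_text):
    """XML in (<expr>), XML out (<res>)."""
    expr = ET.fromstring(xml_text)
    left, symbol, right = expr.text.split()
    res = ET.Element("res")
    res.text = str(OPERATORS[symbol](int(left), int(right)))
    return ET.tostring(res, encoding="unicode")


result = evaluate("<expr>1 + 2</expr>")
print(result)
```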
Simple pipeline transformation
component examples
• Unix pipe utilities e.g. tr
– hello world
– HELLO WORLD
A little orchestration in a
transformation component
• Conditionals are XML to XML
transformation “tee junctions” triggered by
XPaths
[Diagram: In → if XPath → “if XPath TRUE” branch or “if XPath FALSE” branch]
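A tee junction can be sketched in Python as a stage that picks one of two downstream stages. The sketch relies on the limited XPath subset that ElementTree's find supports; the names tee and mark, the route attribute and the sample order documents are all hypothetical.

```python
import xml.etree.ElementTree as ET


def tee(xpath, if_true, if_false):
    """Build a stage that routes the document by whether the XPath matches."""
    def stage(root):
        branch = if_true if root.find(xpath) is not None else if_false
        return branch(root)
    return stage


def mark(value):
    """Build a trivial stage that records which branch was taken."""
    def stage(root):
        root.set("route", value)
        return root
    return stage


router = tee("./item[@urgent='yes']", mark("express"), mark("standard"))
fast = router(ET.fromstring("<order><item urgent='yes'/></order>"))
slow = router(ET.fromstring("<order><item/></order>"))
print(fast.get("route"), slow.get("route"))
```

The tee is itself XML in, XML out, so it can sit in a pipeline like any other component.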
Validation as a transformation
component
[Diagram: XML A (Input) → Validation XComponent (RelaxNG, Schematron, Jython/Java/JACL) → XML A’ (Output), with Log and Error outputs]
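A rough Python sketch of a validation component with the same shape: the document passes through unchanged, failures go to a log, and an invalid document raises an error. Plain Python predicates stand in for RelaxNG or Schematron here, since the standard library has neither; validator, ValidationError and the sample rules are hypothetical.

```python
import xml.etree.ElementTree as ET


class ValidationError(Exception):
    """Raised when a document fails one or more rules."""


def validator(rules, log):
    """Build a pass-through stage that checks (predicate, message) rules."""
    def stage(root):
        failures = [message for check, message in rules if not check(root)]
        log.extend(failures)
        if failures:
            raise ValidationError("; ".join(failures))
        return root   # the document itself is not modified
    return stage


rules = [
    (lambda r: r.tag == "poem", "root element must be <poem>"),
    (lambda r: len(r.findall("line")) > 0, "poem needs at least one <line>"),
]
log = []
check_poem = validator(rules, log)

good = check_poem(ET.fromstring("<poem><line>Twas brillig</line></poem>"))
try:
    check_poem(ET.fromstring("<poem/>"))
    rejected = False
except ValidationError:
    rejected = True
print(rejected, log)
```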
Sample Transformation
Component Examples
• Once you start thinking in terms of pipes –
components appear everywhere:
– Regular fragmentations
– Doctype changer
– Namespace normalizer
– Character set transcoder
– Hash generator
– Architectural form processing
– RelaxNG/Schematron etc
First objection
• “It will be dog slow” or (stronger form):
– “Re-usable tree transforming components won’t
work in my shop – my XML files are too big to
schlep around in strings, never mind DOMs!”
Document fulcra and the
scatter/gather pattern
• For any given transformation t to be
performed on documents conforming to
schema s, there is a fragment expression
that can be used to chop each document into
n pieces, on which t can be performed.
• I call these points fulcra; they are a function
of (t,s)
Identifying Fulcra
• For data-oriented XML, the fulcra often
coincide with the “record” iteration in the
XML schema and may be independent of t.
• For document-oriented XML, the fulcra are
much more dependent on t.
Document fulcra and
scatter/gather pattern
• Having identified the fulcra:
– Chop the input document into fragments –
scatter phase
– Perform t
– Join all the processed fragments together to
constitute the output document – gather phase
• Three stage pipeline – scatter & gather
either side of the core component
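The three stages can be sketched in Python as follows. The sketch assumes data-oriented XML where the fulcrum is a record element directly under the root and nothing else in the root needs to survive; scatter, gather, scatter_gather, add_total and the sample orders document are hypothetical.

```python
import copy
import xml.etree.ElementTree as ET


def scatter(root, fulcrum):
    """Scatter phase: chop the document into independent fragments."""
    return [copy.deepcopy(fragment) for fragment in root.findall(fulcrum)]


def gather(root_tag, fragments):
    """Gather phase: join processed fragments into the output document."""
    root = ET.Element(root_tag)
    root.extend(fragments)
    return root


def scatter_gather(root, fulcrum, t):
    """Three stage pipeline: scatter, perform t on each fragment, gather."""
    return gather(root.tag, [t(fragment) for fragment in scatter(root, fulcrum)])


def add_total(order):
    """Example t: works on one <order> at a time, never on the whole document."""
    total = int(order.findtext("qty")) * int(order.findtext("price"))
    ET.SubElement(order, "total").text = str(total)
    return order


doc = ET.fromstring(
    "<orders>"
    "<order><qty>2</qty><price>5</price></order>"
    "<order><qty>3</qty><price>4</price></order>"
    "</orders>"
)
out = scatter_gather(doc, "order", add_total)
totals = [order.findtext("total") for order in out]
print(totals)
```

Because t only ever sees one fragment, its memory use is bounded by the largest fragment rather than by the size of the document.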
Document Fulcra
[Diagram: Input Doc → Scatter → n fragments → Invoke t (several copies of t side by side) → n fragments → Gather → Output Doc, drawn against a TIME axis]
Document Fulcra
• Note the data domain de-composition –
[email protected] meets XML markup.
• Trivially parallelizable
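A hedged Python sketch of the parallel case: because fragments are independent, t can be mapped over them concurrently. Fragments are passed as serialized XML so that the same shape would work across processes or machines. The names parallel_map and word_count are hypothetical, and a thread pool is used only to keep the example short; in CPython, CPU-bound transformations would need processes or separate machines to see a real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor
import xml.etree.ElementTree as ET


def parallel_map(t, fragments, workers=4):
    """Apply t to every fragment concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(t, fragments))


def word_count(fragment_text):
    """Example t: annotate one fragment with its word count."""
    fragment = ET.fromstring(fragment_text)
    words = len("".join(fragment.itertext()).split())
    fragment.set("words", str(words))
    return ET.tostring(fragment, encoding="unicode")


fragments = ["<line>Twas brillig</line>", "<line>and the slithy toves</line>"]
processed = parallel_map(word_count, fragments)
print(processed)
```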
Document Fulcra
• A good fulcra-based scatter/gather will make
performance head north faster, cheaper and with a
higher upper limit than any amount of hand-crafted,
genius level XML coding of your transformations
in horrid SAX or lexical parse mode.
– Massive Parallelism will kill all von Neumann
throughput arguments
• Documents per second, not seconds per document –
throughput is the true measure of XML processing speed
• Document fulcra – Locality of reference (Denning) applies to
XML processing (more on this later)
More objections (with more
answers)
• It will be slow
– No it won’t. Premature optimization
is the root of all evil!
– Speed is a three headed
monster. I’m old
enough to have left the
X axis and am currently
heading for Y through
Z
[Figure: The 3 Axes to Speed. Axes: Speed of Execution, Speed of Development, Speed of modification. Points: Me at age 26, Me at age 39, Me at age 49 (Projected)]
Some objections (with some
answers)
• Component based software? Harumph! We
have heard that one before…
– Pipelines are data flow based not API based
(COM, VBX, CORBA)
– Two pin interfaces and minimal “verbs”
– The XML “payload” is what is important – not
the API - RESTian
Revisiting the XSLT/DOM ->
SAX non sequitur
• XSLT and DOM are memory bound – trade
off between ease of use and resource usage
– ease of use favoured
• SAX is not memory bound – trade off
between ease of use and resource usage –
low resource usage favoured
• On xml-dev, users are often advised to rewrite
their apps using SAX! Ugh!
XSLT/DOM -> pipeline
• Pipelines and scatter/gather allow you to keep the
ease of use of XSLT/DOM with the finite resource
utilization of SAX
• As long as you can identify a good fulcrum
function
– They exist more often than not
– If they exist, they are very easily found and “drop out”
of document analysis – eg: xpath expressions in XSLT
stylesheet templates
Pipelining and Grids
• Grid Technologies – computational power
“on tap” (http://www.gridforum.org)
• A match made in heaven (bandwidth
permitting)
An XML Processing Grid – on
demand
[Diagram: an XML processing grid with In and Out connections and a DMZ]
Grids - caveats
• For large data volumes it is simply not
feasible to shunt the data over the wire –
Jim Gray
• Organizations are sensitive about their data
going beyond firewalls
• Pay-per-use “racks” in your back-office are a
better bet. – Rent a grid the way you would
rent a chainsaw.
A Service Oriented Architecture
“service” = XML transformation with optional side effects
Pipelines and Service Oriented
Architecture
• Can usefully blur the distinction between a
message queue and a transformation pipe
• Services have the same XML-in, XML-out
interface
– All components can be services
– All pipes can be services
– All SOAs can be services…
Federated SOAs
Pipeline
transformation
Musings #1 - Debugging
• Pipelines are very debugging friendly
– log2(N) time required for fault diagnosis
– “Probes” in the form of loggers, RelaxNG validators,
easily plug-inable (as transformation components) to a
pipe to watch what is going on.
– Pre/Post condition on/off switch is a useful “design by
contract” debugger
– XML-aware browsers as “breakpoints”
Musings #2 – Validation –
grammars versus rules versus
FYIs
• Pipelines make it natural to segregate
“business rules” from “grammar rules” and
can dramatically simplify both
• Some of the most useful business “rules”
are non dyadic. “FYIs” are really, really
useful monitoring/QA tools.
Musings #3 – Inbetween-ing and
component development
• Transformation analysts spec the transformation
• Only need to code new components
• Spec == Documentation of what the transform
needs to do with pre/post etc. but no code
• Provides built in JIT-style acceptance test via the
pre/post conditions
• Outsource friendly, parallelisability friendly and
third-party market friendly
Musing #4 - Web Services
• First generation will be a total blind alley –
RPC
• Document Oriented Messaging – not Object
Oriented Messaging -> SOAs
• The next stage in encapsulation and loose
coupling – something like pipelining will be
a pre-requisite in a doc/literal world.
Musing #5 – naming and
parametric typing
• Naming components is a really hard problem
• Programmers don’t do metadata
• Finding components to re-use is a real problem –
the Google lesson
• Numerous components that do the same thing but
optimized on different axes:
– Space
– Time
– Infoset considerations
Musing #6 – Pre-validation
Transformation
• Killing ourselves seeking one-shot expressivity in
schema validation languages
• Many complex validations become a lot simpler if
you do some transformation(s) first
– Co-occurrence constraints
– Contextual constraints
• Clear analog with formatting (pre-flow
transformation(s) + flow = DSSSL/XSL)
Musing #7 – grids, scheduling
and compilers
• Scheduling transformations on a pipeline grid is
hard – manufacturing lore needs to be brought to
bear (e.g. Flow Shop Scheduling).
• Pipe -> Component via compiler is a powerful
idea
– Both for grids (IO optimisation) and for general
program distribution
– Pipe compilation can beat the IO problems while
retaining the simple, componentised development
approach.
– Back to the future with Jackson’s Program Inversion
Musing #8 – Higher order
transformations
• What if, instead of transforming an
instance, you transformed a grammar?
• Auto-generation of instance transformation
primitives
• Limited to non-PCDATA transforms and
side-effect free transforms but useful
nonetheless
Some pipeline-related open
source technologies
• | - Unix Pipes
• SAX Filters
• XBeans
• Cocoon
• Xpipe (sadly under resourced)
• axKit
• xvif
• DSDL
• Ant, W3C Pipeline Note
Thank you
(question,answer?)*