XML Data Management
Ning Zhang
University of Waterloo
What is XML?

XML documents have elements and
attributes

Elements (indicated by begin & end tags)
can be nested but cannot interleave each other

can have arbitrary number of sub-elements
end

can have free text as values
elemen
<chap title = “Introduction To XML”>
t
some free text
<sect title = “What is XML?”> … </sect>
<sect title = “Elements”> … </sect>
<sect title = “Why XML?”> … </sect>
Elements
… possibly more free text
w/ same
</chap>

attribute
begin
element
name can
be nested
Why XML?

Database Side: XML is a new
way to organize data



Relational databases organize
data in tables
XML documents organize data
in ordered trees
Document Side: XML is a
semantic markup language


HTML focuses on presentation
XML focuses on
semantics/structure in the data
chap
sect
sect sect
sect
sect sect
<html>
<h1> Chapter 1… </h1>
some free text
<h2> Section 1… </h2>
some more free text
<h3> Section 1.1 </h3>
</html>
Data Management:
-- Relational vs. XML

Relational data are well organized – fully structured
(more strict):



E-R modeling to model the data structures in the
application;
E-R diagram is converted to relational tables and integrity
constraints (relational schemas)
XML data are semi-structured (more flexible):


Schemas may be unfixed, or unknown (flexible – anyone
can author a document);
Suitable for data integration (data on the web, data
exchange between different enterprises).
More about Relational vs. XML

XML is not meant to replace relational
database systems


RDBMSs are well suited to OLTP
applications (e.g., electronic banking)
which has 1000+ small transactions per
minute.
XML is suitable data exchange over
heterogeneous data sources (e.g., Web
services) that allow them to “talk”.
When should we use XML? (1)

Document representation language

XML can be transformed to other format
(e.g., by XSLT)




XML  HTML
XML  LaTeX, bibTeX
XML  PDF
DocBook (standard schema for authoring
document/book)
When should we use XML? (2)

Data integration and exchange language

Web services (SOAP, WSDL, UDDI)


Domain specific data exchange schemas
(>1000)



Amazon.com, eBay, Microsoft MapPoint, …
legal document exchange language
business information exchange …
RSS XML news feed

CNN, slashdot, blogs, …
When should we use XML? (3)

Any data having hierarchical structure

Email



Network log file


Header – from, to, cc, bcc…
Body – my message, replied email …
IP address, time, request type, error code
Advances of translating to XML

Exploit high-level declarative XML query
languages
XML Databases

Advantages:




Manage large volume of XML data
Provide high-level declarative language
Efficiently evaluate complex queries
XML Data Management Issues:



XML Data Model
XML Query Languages
XML Query Processing and Optimization
XML Data Model

Hierarchical data model



An XML document is an ordered tree;
Nodes in the tree are labeled with element
names.
Element nesting relationship corresponds to
parent-child relationship;
chap
@title some free sect
…
text
Introduc
@title
sect
tion to
…
@title
XML
What is XML?
…
XML Schema Languages

Schema languages defines the
structure:

Document Type Definition (DTD)



Context-free grammar
Structurally richer than relational schema
definition language because of recursion.
XML Schema


Also context-free
Richer than DTD because of data types
definition (integer, date, sequence).
XML Query Languages

XPath

13 axes (navigation directions in the tree)



child (/), descendant (//), following-sibling, following…
NameTest, predicates
E.g,
doc(“bib.xml”)//book[title=“Harry Potter”]/ISBN

XQuery (superset of XPath)

FLWOR expression
for $x in doc(“bib.xml”)//book[title
= “Harry Potter”]/ISBN,
$y in doc(“imdb.xml”)//movie
where $y//novel/ISBN = $x
return $y//title
Important Problems in XML Data
Management
What follows is not covered in COSC 3480!!
How to store XML data?
How to efficiently evaluate
XPath/XQuery languages?
1.
2.
•
•
3.
4.
5.
Efficient physical operators
Query optimization
How to support XML update
languages?
How to support transaction
management?
Recovery management?
Agenda
1.
2.
3.
XML Storage
XML Path Query Processing
XML Optimization
XML Storage

Extended Relational Storage


Native Storage


Convert XML documents to relational tables
Treat XML elements as first-class citizens
Hybrid of Relational and Native Storage

XML documents can be stored in columns of
relational tables (XML typed column)
Extended Relational Storage

Edge-based Storage Scheme (Florescu and
Kossman ‘99)




Each node has an ID
Each tuple in the edge table consists of:
(parentID, childID, type of data, reference to
data)
Pro: easy to convert XML to relational tables
Con: impossible to answer path queries such as
//a//b using SQL (needs transitive closure
operator)
Extended Relational Storage

Path-based Storage Scheme XRel
(Yoshikawa et al. ‘01)




Each node corresponds to a tuple in the table
Each tuple keeps a rooted path to the node
(e.g., [email protected])
Pro: also easy to convert XML to tables
Con: answering path queries, such as //a//b,
needs expensive string pattern matching
Extended Relational Storage

Node-based Storage Scheme: Niagara,
TIMBER etc. (Zhang et al. ’01)




Each node is encoded with a “begin” and “end”
integers.
Begin corresponds to the order of in-order
traversal of tree; end corresponds to the order in
post-order traversal.
Pro: checking parent-child/ancestor-descendant
relationships is efficient (constant time using
begin and end)
Con: inefficient for updating XML
Native Storage

Subtree partition-based scheme: Natix
(Kanne and Moerkotte ’00)




A large XML tree is partitioned into small
subtrees, each of which can be fit into one disk
page
Introducing aproxy and aggregate nodes to
connect different subtrees
Pro: easy to update and traversal
Con: complex update algorithm; frequent
deletion/addition may deteriorate page usage
ratio
Native Storage

Binary tree-based scheme: Arb (Koch ’03)





Convert a tree with arbitrary number of children
to a binary tree (first child translates to left child;
next sibling translate to right child)
Tree nodes are stored in document order
Each node has 2 bits indicating whether it has a
left & right child
Pro: easy to do depth-first search (DFS)
traversal
Con: inefficient to do next_sibling navigation and
hard to update
Native Storage

String-based scheme: NoK (Zhang ’04)

Convert a tree to a parenthesized string






E.g., a having b and c as children is converted to ab)c)), by
DFS of the tree and ‘)’ representing “end-of-subtree”
Tree can be reconstructed by the string
A long string can be cut into substrings and fit them into
disk pages
Page header can contains simple statistics to expedite
next_sibling navigation
Pro: particularly optimized for DFS navigational evaluation
plan
Con: inefficient to do for breadth-first search (BFS)
Hybrid of Relational and
Native Storage

All major commercial RDBMS
vendaors (IBM, Oracle, Microsoft and
Sybase) support XML type in their
RDBMS



A table can have a column whose type is
“XML”
When inserting a tuple in the table, the
XML field could be an XML document
XML documents are stored natively
Hybrid of Relational and
Native Storage

IBM DB2 UDB


Microsoft SQL Server


System RX – XML storage is similar to Natix
Uses BLOB (binary large object) to represent
XML documents
Oracle

Can use multiple format:



CLOB (character large object)
Serialized object
Shredded relational table
Agenda
1.
2.
3.
XML Storage
XML Path Query Processing
XML Optimization
XML Path Processing

Extended Relational Approach


Translate XML queries to SQL
statements
Native Approach (may be based on
extended relational storage)



Join-based approach
Navigational approach
Hybrid approach
Extended Relational Query
Processing

Regular expression based approach: XRel
(Yoshikawa et al. ‘01)




Linear path expression (without branches) are
translated to regular expressions on strings
(rooted paths)
Use the “like” predicate in SQL to evaluate
regular expressions
Pro: easy to implement
Con: cannot answer branching path queries
Extended Relational Query
Processing

Dynamic Interval based approach: DI
(DeHaan et al. ‘03)




Use the node labeling (begin,end) interval
storage scheme
Dynamically calculate (begin,end) intervals for
resulting nodes give a path/FLWOR expression
Pro: can handle all types of queries including
FLWOR expression
Con: inefficient for answering complex path
queries
Native Path Query Processing

Merge-Join based approach: Multipredicate Merge Join (MPMGJN)
algorithm (Zhang et al. ’01)



Modify the merge join algorithm to reduce
unnecessary comparisons
Keep to position p of the last successful
comparisons in the right input stream
The next item from the left input stream
starts scanning from position p.
Native Path Query Processing

Stack-based Structural Join (Wu et al.
’02)



Improve the MPMGJN algorithm
Do not look back but keep all ancestors
in a stack
When comparing the new item, just
compare it with the top of the stack
Native Path Query Processing

Holistic Twig Join (Bruno et al. ’02)



Improve the stack-based structure
algorithm
Use one join algorithm for the whole path
expression instead of one join for one
step
Reduce the overhead to produce and
store intermediate results
Native Path Query Processing

Natix (Brantner et al. ’05)



Translate each step into a logical
navigational operator Unnest-Map
Each unnest-map operator is translated
into a physical operator that performs
tree traversal on the Natix storage
Physical optimization can be performed
on the physical navigational operators to
reduce cross-cluster I/O.
Native Path Query Processing

IBM DB2 XNav (Josifovski et al. ’04)



XML path expressions are translated into
automata
The automaton is constructed
dynamically while traversing the XML tree
in DFS
Physical I/O can be optimized by
navigating to next_sibling without
traversing the whole subtree
Native Path Query Processing

Tree automata (Koch ’03)



The tree automaton needs two passes of
tree
The first traversal is a bottom-up
deterministic tree automaton to determine
which states are reachable
The second traversal is a top-down tree
automaton to prune the reachable states
and compute predicates.
Hybrid Processing

BlossomTree (Zhang ’04, Zhang’05)




Navigational approach is efficient for parentchild navigation
Join-based approach is efficient for ancestordescendant
BlossomTree approach identifies subexpressions, Next-of-Kin (NoK), that are efficient
for navigational approach.
Use navigational approach for NoK
subexpressions and use structural joins to join
intermediate results
XML Indexing

Structural Index



Clustering tree nodes by their structural
similarity (e.g., bisimilarity and F&B
bisimilarity)
Index is a graph, in which each vertex is
an equivalence class of similar XML tree
nodes
Path query evaluation amounts to
navigational evaluation on the graph
Agenda
1.
2.
3.
XML Storage
XML Path Query Processing
XML Optimization
Overview of Cost-based
Optimization

Query Optimization depends on:
1.
2.

How much knowledge about the data we have?
How intelligent we can make use of the
knowledge (within a time constraint)?
The cost of a plan is heavily dependent on:


The cost model of each operator
The cardinality/selectivity of each operator
Cardinality Estimation

Full path summarization: DataGuide
(Goldman ’97) and PathTree
(Aboulnaga ’01)


Summarize all distinct paths in XML
documents in a graph
Cardinality information is annotated on
graph vertices
Cardinality Estimation

Partial path summarization: Markov
Table (Aboulnaga ’01)



Keep sub-paths and cardinality
information in a table
Cardinality for longer paths are calculated
using partial paths.
Can use additional compression methods
to accommodate Internet scale database
Cardinality Estimation

Structural clustered summarization:
XSketch (Neoklis ’02) and TreeSketch
(Neoklis ’04)



Similar idea as clustered-based index
XSketch uses forward and backward
stability, and TreeSketch uses count
stability as similarity measurement
Heuristics to reduce graph to fit memory
budget
Cardinality Estimation

Decompression-based approach:
XSEED (Zhang ’06)



XML documents are compressed into a
small kernel with edge cardinality labels
Kernel can be decompressed into XML
document with cardinality annotations
Navigational path operator can be reused
on the decompressed XML document for
cardinality estimation
Cost Modeling

Statistical Learning Cost Model: COMET
(Zhang ’05)



Relational operator cost modeling is performed
by analyzing the source code
XML operators are much more complex than
relational operators; therefore analytical
approach is too time-consuming
Statistical learning approach needs a training set
of queries and learn the cost model from the
input parameters and real cost.
Descargar

XML Data Management - University of Houston