BenchmarX 2010
Benchmarking Holistic Approaches to XML
TPQ Processing
Jiaheng Lu
Renmin University of China
BenchmarX 10 Keynote
A little bit of history

Database world




1970 relational databases
1990 object oriented database
1995 semi-structured databases
Document world



1974 SGML (Structured Generalized Markup
Language)
1990 HTML (Hypertext MarkupLanguage)
1992 URL (Universal Resource Locator)
1996 XML (eXtensible Markup Language)
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
2
BenchmarX 10 Keynote
What is XML


The eXtensible Markup Language (XML) is the
universal format for structured documents and
data on the Web.
Advantages of XML:
 Human- and machine-readable format
 More flexible than HTML, not so complicated
as SGML
 Unlike relational table, XML can describe tree
and graph structural data
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
3
BenchmarX 10 Keynote
What is XML

Basic Specification:

XML 1.0, W3C Recommendation Feb’98
<book year=“1967”>
<title>The politics of experience</title>
<author>
<firstname>Ronald</firstname>
<lastname>Laing</lastname>
</author>
</book>
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
4
BenchmarX 10 Keynote
XML Tree

An XML document is commonly modeled as a
rooted, ordered tree.
book
“year” is an
attribute
Jiaheng Lu
@yea
r
title
“1967”
“The politics…”
author
firstnam
e
lastname
“Ronald”
“Lazing”
Benchmarking Holistic
Approaches to TPQ Processing
5
BenchmarX 10 Keynote
XML query language



Major standards for querying XML data
 XPath and XQuery
“XPath is a language for addressing parts of an
XML document ” XPath 1.0 W3C, Nov 1999
 E.g. paper [title=“XML”]/author
“XQuery is an XML query language which
provide features for retrieving and interpreting
information from XML documents. ” XQuery 1.0
Nov 2005
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
6
BenchmarX 10 Keynote
An XQuery example
Create a flat list of all the title-author pairs for
every book in bibliography.
XQuery:
<results>
{
for $b in doc("bib.xml")/bib//book,
$t in $b/title,
$a in $b/author,
return
<result> { $t } { $a } </result>
}
</results>
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
7
BenchmarX 10 Keynote
XML Twig Pattern


XML Twig Pattern Query (TPQ) is a core
operation in XPath and XQuery
Definition of XML twig pattern : an XML twig
pattern is a small tree whose nodes are tags,
attributes or text values; and edges are either
parent-child (P-C) or ancestor-descendant (A-D)
relationships
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
8
BenchmarX 10 Keynote
An XML twig pattern example
Create a flat list of all the titleauthor pairs for every book in
bibliography.
XQuery:
<results>
{
for $b in doc("bib.xml")/bib//book,
$t in $b/title,
$a in $b/author,
return
<result> { $t } { $a } </result>
}
</results>
Jiaheng Lu
To answer the XQuery, we need to
first match the following XML twig
pattern:
bib
$t:
Benchmarking Holistic
Approaches to TPQ Processing
$b
book
title
$a:
author
9
BenchmarX 10 Keynote
Research Problem


Given an XML twig pattern Q, and an XML database D,
we need to find ALL the matches of Q on D efficiently.
E.g. Consider the following twig pattern and document:
Twig pattern:
An XML tree:
Query solutions:
s1
section
t1
t2
title
Jiaheng Lu
figure
s2
p1
(s1, t1, f1)
(s2, t2, f1)
(s1, t2, f1)
f1
Benchmarking Holistic
Approaches to TPQ Processing
10
BenchmarX 10 Keynote
Why research XML twig pattern match

An XML query includes two parts: value match and
twig match.
XPath: paper [title=“XML”]/author
paper
Value (content)
match
Jiaheng Lu
title
autho
r
Twig Match:
New challenge!
Benchmarking Holistic
Approaches to TPQ Processing
11
BenchmarX 10 Keynote
Approach Overview


(1) Labeling: Assign each element in the XML
document tree an integer label to capture the
structural information of documents
(2) Computing: Use labels to answer the twig
pattern without traversing the original document
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
12
BenchmarX 10 Keynote
Related work graph
XML TPQ
Algorithms
Labeling schemes
Dewey scheme
[ SIGMOD’02 ]
Computing algorithms
Stack-merge
[ICDE ’02]
Containment
scheme
[SIGMOD’01]
XPath-SQL
[SIGMOD ’02]
TwigStack [SIGMOD ’02]
Dynamic Dewey
scheme
[ SIGMOD’09 ]
TJFast [VLDB ’05]
Twig2Stack [VLDB’06]
TreeMatch[ TKDE’2010]
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
13
BenchmarX 10 Keynote
Approach Overview


(1) Labeling
Region encoding (or called containment) labeling
scheme (start,end,level)
An example XML tree with region encoding labels
(1,12,1)
s1
(2,3,2)
t1
(5,6,3)
t2
(4,11,2)
s2
p1
(7,10,3)
f1
(8,9,4)
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
14
BenchmarX 10 Keynote
Approach Overview


(1) Labeling
Dewey (or called prefix) labeling scheme: integer
sequence
An example XML tree with Dewey labels
ε
s1
0
t1
1.0
t2
1
s2
p1
f1
Jiaheng Lu
1.1
1.1.0
Benchmarking Holistic
Approaches to TPQ Processing
15
BenchmarX 10 Keynote
Approach Overview

(2) Computing

Inverted data list: each data list contains all labels of
elements with the same tag name
An XML tree:
Query:
s
(1,12,1)
s1
(2,3,2)
t1
t
f
Data lists:
t2
(5,6,3)
(4,11,2)
s2
p1
(7,10,3)
s
(1,12,1), (4,11,2)
t
(2,3,2), (5,6,3)
f
(8,9,4)
f1
(8,9,4)
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
16
BenchmarX 10 Keynote
Previous work: TwigStack [1]
(2) Computing
 TwigStack [1] is a holistic algorithm for XML twig
matching on containment labeling scheme.

Two steps in TwigStack :
 (1) intermediate path solutions are output to match
each query root-to-leaf path; and
 (2) these intermediate path solutions are merged to
get the final results.
[1] N. Bruno, D. Srivastava, and N. Koudas. Holistic twig joins: optimal
xml pattern matching. In Proceedings of ACM SIGMOD, 2002.
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
17
BenchmarX 10 Keynote
Running example: TwigStack algorithm
State of stacks:
Query:
Data streams:
s
(1,12,1) (4,11,2)
t
(2,3,2) (5,6,3)
s
t
f
Output path intermediate solutions:
s//t:
s//f:
Final results:
f
(8,9,4)
(1,12,1) (2,3,2)
(1,12,1) (8,9,4)
(1,12,1) (2,3,2) (8,9,4)
(1,12,1) (5,6,3)
(4,11,2) (8,9,4)
(1,12,1) (5,6,3) (8,9,4)
(4,11,2) (5,6,3)
Jiaheng Lu
(4,11,2) (5,6,3) (8,9,4)
Benchmarking Holistic
Approaches to TPQ Processing
18
BenchmarX 10 Keynote
Limitations of TwigStack



(1) TwigStack may output many useless
intermediate results for queries with parent-child
relationship
(2) TwigStack cannot process XML twig queries
with ordered predicates, like “Proceeding”,
“Following” in XPath
(3) TwigStack cannot answer queries with
wildcards in branching nodes.
E.g.
*
The parent of B should be an
B
Jiaheng Lu
C
ancestor of C
Benchmarking Holistic
Approaches to TPQ Processing
19
BenchmarX 10 Keynote
Outline


Introduction
Holistic algorithms:








TwigStackList (CIKM2005)
OrderedTJ (DEXA2006)
iTwigJoin (SIGMOD2005)
TJFast (VLDB2005)
Twig2Stack(VLDB2006)
TreeMatch (TKDE2010)
Benchmark experiments
Conclusions and future work
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
20

TwigStack is inefficient to answer twig query with
parent-child edges
# of intermediate path
BenchmarX 10 Keynote
Inefficiency of TwigStack
80000
70000
60000
50000
Useful
Useless
40000
30000
20000
10000
0
Q1
Q2
Q3
Q1=VP[/DT]//PRP DOLLAR, Q2=S[/JJ]/NP, Q3=S[//VP/IN]//NP in Tree Bank data
More than 99% intermediate results are useless, TwigStack wastes
too much time to output useless intermediate results!
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
21
BenchmarX 10 Keynote
Example to illustrate the inefficiency of TwigStack for
queries with P-C edge
Twig pattern:
An XML tree:
A1
A
B
TwigStack outputs the
useless root-to-leaf
intermediate path solutions:
B1
Bn-1
E1
B2
Bn
D1
D
(A1, B1, C1), (A1, B2, C1) ……
(A1, Bn, Cn)
……
C
C1
Cn-1
C2
Cn
The reason for the inefficiency of TwigStack :
TwigStack assumes that all edges are A-D relationships in
Holisticlevel information
the first step and does Benchmarking
not consider
Jiaheng Lu
Approaches to TPQ Processing
22
BenchmarX 10 Keynote
Naïve improvement is incorrect
Twig pattern:
Naïve improvement:
An XML tree:
A1
A
Bn-1
B1
E1
D
B
B2
Bn
……
C
C1
C2
D1
because A1 is not the parent
of D1 , we do not output the
following path solutions
(A1, B1, C1), (A1, B2,C1) ……
(A1, Bn, Cn) by considering
level information
Cn-1
Cn
But this
naïve
approach is
NOT
correct for
some
cases!
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
23
BenchmarX 10 Keynote
Problem of naïve approach


Naïve approach possibly make a wrong decision about whether the
current element contributes to final results
Example:
An XML tree:
Twig pattern:
A1
A
B
B1
C
C2
D2
Cn
D
D1
Jiaheng Lu
C1
When we read A1, B1, C1 and
D1, since C1 is not the parent
D1 , according to the naïve
approach, we decide that C1
and D1 do not belong to query
answers.
……
Dm
Benchmarking Holistic
Approaches to TPQ Processing
But it
is
wron
g!
24
BenchmarX 10 Keynote
Our solution: Look-ahead

New technique used in our new algorithm called
TwigStackList: Look-ahead
Twig pattern:
An XML tree:
A1
A
B1
B
C
C2
D
Cn
D1
Jiaheng Lu
C1
……
Dm+1
Dm
When we read A1, B1, C1 and D1,
we do not hurriedly decide
whether C1 or D1 belongs to final
solutions, but buffer C1 to Cn in
the a main-memory list structure.
Since Cn is the parent, we are
sure that (A1, B1, Cn , D1) is a
real match.
Why not buffer D1 to
Dm? Too many!
Benchmarking Holistic
Approaches to TPQ Processing
25
BenchmarX 10 Keynote
Running example: TwigStackList algorithm
XML tree:
Query:
(1,11,1)
(2,2,2)
A
B1
C
B
C1
(4,8,3)
C2
(5,7,4)
D
Data streams:
A1
C3
(3,10,2)
D2
(9,9,3)
(6,6,5)
A
(1,11,1)
B
(2,2,2)
C
(3,10,2) (4,8,3) (5,7,4)
D1
D (6,6,5) (9,9,3)
SA
SB
(3,10,2)
(5,7,4)
SC
List LC
Output path solutions:
A//B
(1,11,1) (2,2,2)
SD
Jiaheng Lu
A//C/D
(1,11,1) (5,7,4) (6,6,5)
(1,11,1) (3,10,2) (9,9,3)
Benchmarking Holistic
Approaches to TPQ Processing
26
BenchmarX 10 Keynote
Features of TwigStackList

Main memory efficient



Size of stack and list is no more than |Depth(Tree)|
TwigStackList can process very large documents with
small main memory cost
I/O efficient


Each element is scanned once
For a large query class, TwigStackList guarantees that
each output path solution is useful to final answers.
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
27
BenchmarX 10 Keynote
Optimal query classes

If an algorithm does not output any useless
intermediate path solution for a query Q for all
given documents, we call this algorithm is
optimal with respective to Q
If an algorithm has a
larger optimal query
class, this algorithm has
better ability to control
the size of intermediate
results
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
28
BenchmarX 10 Keynote
Optimal query classes
.
Optimal Class of
TwigStack
Only A-D in all
edges
Only A-D in
branching edges
A
Optimal Class of
TwigStackList
B
A
C
B
C
D
D
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
29
BenchmarX 10 Keynote
Outline


Introduction
Holistic algorithms:








TwigStackList (CIKM2005)
OrderedTJ (DEXA2006)
iTwigJoin (SIGMOD2005)
TJFast (VLDB2005)
Twig2Stack(VLDB2006)
TreeMatch (TKDE2010)
Benchmark experiments
Conclusions
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
30
BenchmarX 10 Keynote
Motivation


TwigStack and TwigStackList cannot handle order-based
twig query.
XPath and XQuery includes ordered axes such as
following, preceding, following-sibling and preceding-sibling.
This symbol shows
that B and C are
ordered.
XPath expression
A
A/B[following-sibling::C]
B
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
<
C
31
BenchmarX 10 Keynote
Ordered twig query pattern


Ordered XML twig pattern : sibling query nodes should be
matched according to their order in the twig query.
Example
A
<
A1
D1
B
C
Jiaheng Lu
B1
D2
C1
D3
D
Only D2 and D3 contribute to
final results.
Benchmarking Holistic
Approaches to TPQ Processing
32
BenchmarX 10 Keynote
OrderedTJ


OrderedTJ, a new algorithm proposed for evaluating
ordered twig query pattern.
OrderedTJ, which extends TwigStackList, also uses stack
and list data structure
What’s the main
modification of
OrderedTJ over
TwigStackList?
Jiaheng Lu
OrderedTJ
additionally
checks the
order
conditions of
elements
before
outputting
intermediate
paths.
Benchmarking Holistic
Approaches to TPQ Processing
33
BenchmarX 10 Keynote
OrderedTJ
Before any element is pushed to the stack, OrderedTJ
checks the order condition

Data streams:
Query
A
Data
<
(1,9,1)
A1
D
B
D1
(2,2,2)
B1
C
(1,9,1)
B
(3,5,2)
(7,7,3) C
(4,4,3)
(6,8,2)
D2
(3,5,2)
C1
A
D3
(4,4,3)
SA
A/B/C
SD
Jiaheng Lu
(2,2,2) (6,8,2) (7,7,3)
Output intermediate path solutions:
SB
SC
D
(1,9,1) (3,5,2) (4,4,3)
A//D
(1,9,1) (6,8,2)
(1,9,1) (7,7,3)
Benchmarking Holistic
Approaches to TPQ Processing
34
BenchmarX 10 Keynote
The optimal query classes of OrderedTJ


OrderedTJ can guarantee the optimality for ordered queries
with A-D relationships from the second branching edges.
In other words, OrderedTJ is optimal for queries with P-C
relationship in the first branching edges.
A
B
A
C
B
Q1
C
Q2
TwigStackList is
non-optimal for
Q1.
Jiaheng Lu
<
OrderedTJ is
Optimal for Q2
Benchmarking Holistic
Approaches to TPQ Processing
35
BenchmarX 10 Keynote
Outline


Introduction
Holistic algorithms:








TwigStackList (CIKM2005)
OrderedTJ (DEXA2006)
iTwigJoin (SIGMOD2005)
TJFast (VLDB2005)
Twig2Stack(VLDB2006)
TreeMatch (TKDE2010)
Benchmark experiments
Conclusions and future work
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
36
BenchmarX 10 Keynote
iTwigJoin algorithm


TwigStack and OrderedTJ partition data to
streams according to their tag names alone
We propose two new data partition schemes



(1) Tag+level scheme
(2) Prefix path scheme
Potential benefits:


Enlarge the optimal query classes
Reduce I/O cost
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
37
BenchmarX 10 Keynote
Data partition scheme
Refined
Tag
partition
By level
Tag partition
A1
B1
C1
C2
C3
Tag +level
partition
TA1
TB
TB2
TC C1, C2, C3
Prefix path
partition
By path
Tag+Level partition
Prefix Path partition
TA A1
B1
Refined
A1
B1
TA
TAB
B1
TC2
C2
TAC
C2
TC3
C1, C3
TABC
C1
TACC
Jiaheng Lu
A1
Benchmarking Holistic
Approaches to TPQ Processing
C3
38
BenchmarX 10 Keynote
Property of three schemes
Tag scheme
Refined
Tag +level
Refined
scheme
By level
Prefix path
scheme
By path
1. the number of inverted lists : increasing (CPU cost increase
correspondingly)
2. the optimal query classes : enlarging (output cost decrease
correspondingly)
3. the number of elements scan : decreasing (input cost
decrease correspondingly)
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
39
BenchmarX 10 Keynote
The number of inverted lists : increasing
Tag partition
A1
B1
C1
C2
C3
Tag+Level partition
TA A1
TA1
TB
TB2
B1
TC C1, C2, C3
A1
B1
Prefix Path partition
TA
TAB
B1
TC2
C2
TAC
C2
TC3
C1, C3
TABC
C1
TACC
Jiaheng Lu
A1
Benchmarking Holistic
Approaches to TPQ Processing
C3
40
BenchmarX 10 Keynote
The optimal query classes : enlarging
Optimal class of
tag scheme
Only A-D in branching edges
and only P-C in all edges and only 1-branching
Optimal Class of
tag+level scheme
Optimal Class of
prefix path scheme
Only A-D in
branching edges
A
B
D
Jiaheng Lu
Only A-D in
branching edges and
only P-C in all edges
A
A
C
B
C
D
E
B
C
D
E
E
Benchmarking Holistic
Approaches to TPQ Processing
41
BenchmarX 10 Keynote
The number of elements scan : decreasing
A
C
B
Tag scheme
Query
1:
2:
3:
D1
A1
B1
C1
C2
Tag+Level scheme
TA A1
TA2
TB
TB3
B1
TC C1, C2
A1
B1
Prefix Path scheme
TDA
TDAB
A1
B1
TC2
C1
TDC
C1
TC3
C2
TDCC
C2
Data
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
42
BenchmarX 10 Keynote
iTwigJoin algorithm



A general algorithm which can be applied on all three
schemes
For different schemes, iTwigJoin achieves different
performance.
The main technical difficult in designing iTwigJoin is to
handle many current nodes for one tag name.
We classify the current
visited elements to
three categories:
current-match, currentuseless and currentblocked
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
43
BenchmarX 10 Keynote
Three kinds of elements



Current-match : the element is guaranteed to contribute
to final answers with current elements.
Current-useless : the element is guaranteed not to
contribute to final answers with current and remaining
elements.
Current-blocked: the element is neither current-match nor
current-useless.
Current-blocked
Matching data
appears
Match
Jiaheng Lu
Cannot get any
matching data
Useless
Benchmarking Holistic
Approaches to TPQ Processing
44
BenchmarX 10 Keynote
Example on three kinds of elements
A
Tag+level scheme
C
B
Query
1:
2:
3:
A1
B1
A2
A3
B2
Document
Jiaheng Lu
C1
C2
TA1
A1
TA2
A2, A3
TB2
B1
TB3
B2
TC2
C2
TC3
C1
Benchmarking Holistic
Approaches to TPQ Processing
Current-match:
A1,B1,C2
Current-blocked :
B2,C1
Current-useless :
A2
45
BenchmarX 10 Keynote
Example on three kinds of elements
A
Tag+level scheme
C
B
Query
1:
2:
3:
A1
B1
A2
A3
B2
Document
Jiaheng Lu
C1
C2
TA1
A1
TA2
A2, A3
TB2
B1
TB3
B2
TC2
C2
TC3
C1
Benchmarking Holistic
Approaches to TPQ Processing
B2 ,C1 are converted
from current-blocked
to current-match due
to the appearance of
A3.
46
BenchmarX 10 Keynote
Main flowchart of iTwigJoin
Are all elements scanned?
Y
N
Is there any current-useless element?
Y
End of the algorithm
See whether it
contributes
to previous match,
and advance
to the next element
N
Is there any current-match element?
N
Y
Output intermediate
path solutions,
and advance
to the next element
Choose the smallest current-blocked element
and output intermediate path solutions, then
advance to the next element
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
47
BenchmarX 10 Keynote
Outline


Introduction
Holistic algorithms:








TwigStackList (CIKM2005)
OrderedTJ (DEXA2006)
iTwigJoin (SIGMOD2005)
TJFast (VLDB2005)
Twig2Stack(VLDB2006)
TreeMatch (TKDE2010)
Benchmark experiments
Conclusions and future work
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
48
BenchmarX 10 Keynote
Motivation: new labeling scheme

TwigStackList, OrderedTJ and iTwigJoin are all based on
the containment labeling scheme
Why not try Dewey
labeling scheme for
XML twig pattern
query ?
Oh, it is really a
novel idea!
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
49
BenchmarX 10 Keynote
Original Dewey Labeling Scheme

In Dewey labeling scheme, each element is presented by an
integer sequence:



(i) the root is labeled by a empty stringε
(ii) for a non-root element u, label(u)= label(s).x, where u is the x-th
child of s.
For example:
ε
s1
1
2
t1
3
s2
f2
2.1
t2
Jiaheng Lu
2.2
f1
Benchmarking Holistic
Approaches to TPQ Processing
50
BenchmarX 10 Keynote
Main problem of the original Dewey

If we use the original Dewey labeling scheme to answer the
twig query, we need to read labels for all query node. Thus,
this is not a better solution than pervious algorithms.
Extend the original Dewey
labeling scheme so that given
the label of any element e, we
can know the path of e from
this label alone
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
51
BenchmarX 10 Keynote
Modular function




We need to know some schema information: DTD
(Document Type Definitions ) or XML schema
Given DTD information: book → author, title, chapter*
Our solution: using modular function, we create a match
between an element tag and an integer number.
We define Xauthormod 3 = 0 Xtitlemod 3 = 1 Xchaptermod 3 =
2; where, Xt is the last integer of the label of tag t.
ε
book
The number of distinct
tags under book
Why not 3 as the
original Dewey ?
0
author
Jiaheng Lu
1
title
2
chapter
Benchmarking Holistic
Approaches to TPQ Processing
5
chapter
52
BenchmarX 10 Keynote
Derive element tag



From a label , we can derive its tag name.
book → author, title, chapter*
Recall that we define: Xauthor mod 3 = 0 Xtitle mod 3 = 1
Xchapter mod 3 = 2.
ε
book
0
author
?
Jiaheng Lu
1
title
?
2
chapter
?
Benchmarking Holistic
Approaches to TPQ Processing
5
chapter
?
53
BenchmarX 10 Keynote
More examples for assigning labels



Let us consider a more complicated DTD
a → (b | c )*, d?, c+
We define: Xbmod 3 = 0 Xcmod 3 = 1 Xd mod 3 = 2
(Why do we use mod 3 instead of 4?)
ε
a
0
b
Jiaheng Lu
2
d
4
c
Benchmarking Holistic
Approaches to TPQ Processing
7
c
54
BenchmarX 10 Keynote
Derive the path from a label


By following a finite state transducer (FST), we may recursively derive the whole
path from any extended Dewey label.
For example:
FST:
DTD:
Mod 3=0
book → author, title, chapter*
chapter → (paragraph | section)*
section → (paragraph | section)*
Mod 2=0
Mod 3=2
chapter
chapter
author
book Mod 3=1
title
book
Document:
author
Mod 2=1
chapter
paragraph
Mod 2=0
section
Mod 2=1
title
section
section
Question: Given a label 5.1.0, what is
the corresponding path ?
section
paragraph
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
55
BenchmarX 10 Keynote
Derive the path from a label


By following a finite state transducer (FST), we may recursively
derive the whole path from any extended Dewey label.
For example:
FST:
DTD:
Mod 3=0
book → author, title, chapter*
chapter → (paragraph | section)*
book Mod 3=1
title
section → (paragraph | section)*
Mod 2=0
Mod 3=2
book
chapter
Document:
chapter
author
author
Mod 2=0
section
Mod 2=1
Mod 2=1
chapter
Following the above red path, we get
title
section
paragraph
section
5.1.0 denotes :
section
book/ chapter/section/paragraph
paragraph
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
56
BenchmarX 10 Keynote
Two properties of extended Dewey

Find Ancestor Label


Find Ancestor Name


From a label of any element, we can derive the labels of
its all ancestors.
From a label of any element, we can derive the tag
names of its all ancestors.
Two properties enable us to design a new and
efficient algorithm for XML twig pattern matching.
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
57
BenchmarX 10 Keynote
A new algorithm: TJFast




For each node n in the query, there exists a corresponding
input stream Tn.
Tn contains the extended Dewey labels of elements of tag n.
Those labels are arranged by the document order.
For each branching node b of twig pattern, there is a
corresponding set Sb, which contains elements possibly
involving query answers. (Compared to TwigStackList,
what difference? )
During any point of computing, the size of set Sb is
bounded by the depth of the XML document.
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
58
BenchmarX 10 Keynote
An example for TJFast algorithm
Document:
Root
a1
0.0
a2
d1
0.0.1
0.3
{
Query:
0
…
0.5
a3
d2
0.3.2
b1
A
b2
d3
}
D
A set for the
branching node A
B
0.5.0
C
0.3.1
DTD:
c1
c2
0.3.2.1
0.5.0.0
a -> a*,d*, b*
b -> d*, c*
TD:
TC:
0.0.1 , 0.3.1, 0.5.0
0.3.2.1, 0.5.0.0
Jiaheng Lu
d -> c*
Why are there only two streams?
Benchmarking Holistic
Approaches to TPQ Processing
59
BenchmarX 10 Keynote
An example for TJFast algorithm
Document:
Root
0
a1
0.0
a2
d1
0.0.1
TD:
TC:
0.3
{
Query:
A
…
0.5
a3
d2
0.3.2
b1
D
b2
d3
}
B
C
0.5.0
0.3.1
derive
c1
c2
0.3.2.1
0.5.0.0
0.0.1
derive
0.3.2.1
a1/a2/d1
a1/a3/b1/c1
0.0.1 , 0.3.1, 0.5.0
0.3.2.1, 0.5.0.0
Jiaheng Lu
By finite state transducer of extended
Dewey labeling scheme
Benchmarking Holistic
Approaches to TPQ Processing
60
BenchmarX 10 Keynote
An example for TJFast algorithm
Document:
Root
0
a1
0.0
a2
d1
0.0.1
TD:
TC:
0.3
{
Query:
A
…
0.5
a3
d2
0.3.2
b1
D
b2
d3
0.5.0
}
B
C
0.3.1
c1
c2
0.3.2.1
0.5.0.0
0.0.1 , 0.3.1, 0.5.0
Both a1 and a3 possibly involve in
query answers. (Why not a2 ?)
0.3.2.1, 0.5.0.0
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
61
BenchmarX 10 Keynote
An example for TJFast algorithm
Document:
Root
Query:
0
a1
0.0
a2
d1
0.0.1
0.3
A
…
0.5
a3
d2
0.3.2
b1
D
b2
{a1,a3}
B
C
d30.5.0
0.3.1
c1
c2
0.3.2.1
0.5.0.0
Then we insert a1, a3 to the set,
Output Path solutions:
TD:
TC:
0.0.1 , 0.3.1, 0.5.0
A//D
(a1, d1)
A/B//C
(a3, b1, c1)
0.3.2.1, 0.5.0.0
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
62
BenchmarX 10 Keynote
An example for TJFast algorithm
Document:
Root
a1
0.0
a2
d1
0.0.1
TD:
TC:
0.3
A {a1,a3}
Query:
0
…
0.5
a3
d2
0.3.2
b1
D
b2
d3
B
C
0.5.0
0.3.1
c1
c2
0.3.2.1
0.5.0.0
0.0.1 , 0.3.1, 0.5.0
0.3.2.1, 0.5.0.0
Jiaheng Lu
Move the cursor of TD from d1 to d2
Output Path solutions:
A//D
(a1, d1)
(a1, d2)
(a3, d2)
Benchmarking Holistic
Approaches to TPQ Processing
A/B//C
(a3, b1, c1)
63
BenchmarX 10 Keynote
An example for TJFast algorithm
Document:
Root
0
a1
0.0
a2
d1
0.0.1
TD:
TC:
0.3
A {a1,a3}
Query:
…
0.5
a3
d2
0.3.2
b1
D
b2
d3
B
C
0.5.0
0.3.1
c1
c2
0.3.2.1
0.5.0.0
0.0.1 , 0.3.1, 0.5.0
0.3.2.1, 0.5.0.0
Jiaheng Lu
Move the cursor of stream TD
from d2 to d3
Output Path solutions:
A//D
(a1, d1)
(a1, d2)
(a3, d2)
(a1, d3)
Benchmarking Holistic
Approaches to TPQ Processing
A/B//C
(a3, b1, c1)
64
BenchmarX 10 Keynote
An example for TJFast algorithm
Document:
Root
0
a1
0.0
a2
d1
0.0.1
TD:
TC:
0.3
…
0.5
a3
d2
0.3.2
b1
D
b2
d3
c1
c2
0.3.2.1
0.5.0.0
0.0.1 , 0.3.1, 0.5.0
0.3.2.1, 0.5.0.0
B
C
0.5.0
0.3.1
Jiaheng Lu
A {a1,a3}
Query:
Move the cursor of stream TC
from c1 to c2
Output Path solutions:
A//D
(a1, d1)
(a1, d2)
(a3, d2)
(a1, d3)
Benchmarking Holistic
Approaches to TPQ Processing
A/B//C
(a3, b1, c1)
(a1, b2, c2)
65
BenchmarX 10 Keynote
Sort and merge-join in TJFast
Document:
a1
A
a3
a2
b2
Query:
D
d1
d2
b1
B
d3
C
c1
c2
Phase 1. Intermediate paths
Phase 2. Final solutions
A// D:
A/B//C:
<A, D, B,C>
<a1, d1>,
<a1,b2, c2>, Join <a1,d1,b2,c2>,<a1,d2, b2,c2>,
<a1, d2>,
<a3, b1,c1>
<a1,d3,b2,c2>,<a3,d2, b1,c1>,
<a1, d3>,
<a3, d2>
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
66
BenchmarX 10 Keynote
TJFast+L


Apply extended Dewey labeling scheme on
tag+level streaming scheme, we propose
TJFast+L algorithm by extending TJFast
Two benefits of TJFast+L over TJFast


reduce I/O cost by reading less elements
enlarge optimal query classes
Because by
Q: Why not
apply extended
Dewey on
Prefix-path
scheme ?
Jiaheng Lu
finite state
transducer, we
can know the
path
information…
Benchmarking Holistic
Approaches to TPQ Processing
67
BenchmarX 10 Keynote
Optimal query classes
.
Optimal Class of
TJFast
Only A-D in
branching edges
Only P-C in all
edges
A
Optimal Class of
TJFast+L
B
A
C
B
C
D
D
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
68
BenchmarX 10 Keynote
Outline


Introduction
Holistic algorithms:








TwigStackList (CIKM2005)
OrderedTJ (DEXA2006)
iTwigJoin (SIGMOD2005)
TJFast (VLDB2005)
Twig2Stack(VLDB2006)
TreeMatch (TKDE2010)
Benchmark experiments
Conclusions and future work
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
69
BenchmarX 10 Keynote
State-of-the-art: XML Query Processing
Path
Tree
(GTP)
Generalized Tree Pattern
Holistic Approach
PathStack [Bruno, et. al]
TwigStack [Bruno, et. al]
?
Twig2Stack
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
70
BenchmarX 10 Keynote
Processing Generalized Tree Pattern (GTP)
Queries
A
B
D
E
Mandatory Axis
C
Optional Axis
Return node
XQuery:
FOR $b in //A[E]//B,
$d in $b/$D
LET $c = $b/C
RETURN $b, $c, $d
Jiaheng Lu
Group return node
Benchmarking Holistic
Approaches to TPQ Processing
71
BenchmarX 10 Keynote
Motivation: PathStack [Bruno et.al]
a1

Query: //A//B; Data:
a2
b1
a2
a1
S[A]
b2
b1
S[B]
b2

Key observation: minimize intermediate results through compact representation
of path matches, by


Inter-node: record AD relationship between elements in different query nodes, e.g., b1→a2,
b2→a2
Intra-node: record AD relationship between elements within the same query nodes, e.g., b1,
b2


TwigStack [Bruno et.al] minimizes intermediate results through:




Output only those path matches that are in final twig results
However, such optimality cannot be guaranteed [Choi, et.al]
Not helpful for processing GTP queries
Question: can we minimize intermediate results for twig queries through
compact result encoding (similar to PathStack)?

Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
72
BenchmarX 10 Keynote
Hierarchical Stack Encoding
a1
a1
a2

Inter-node: //A//B


Can still use explicit edges
Intra-node: A


a2
a3
a4
HS[A]
a3
a4
Matching elements forms a tree structure as well
Associate each query node with a hierarchical stack

Push element e into hierarchical stack HS[E] iff e satisfies the sub-twig
query rooted at E
 Matching can be determined when entire sub-tree of e seen
 Require post-order document traversal
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
73
BenchmarX 10 Keynote
Twig2Stack: Running Example
[1,20], 1
a1
A
[2,15], 2
B
a2
D
HS[A]
[16,19], 2
a2
C
b3
[17,18], 3
[3,14], 3
d3
b1
[4,11], 4
d1
[12,13], 4
c2
[5,10], 5
b2
b1
b2
[6,7], 6
d2
d3
HS[D]
Jiaheng Lu
c1
Merging
Stacks TwigStack needs to enumerate
3 matches for //A/B//D and 2 for
//A/B//C then join them together.
HS[B]
d1
d2
[8, 9], 6
c1
c2
Twig2Stack requires neither
path joins nor path enumeration!
HS[C]Benchmarking Holistic
Approaches to TPQ Processing
74
BenchmarX 10 Keynote
Not yet done: Memory Usage

Hierarchical Stack Encoding could hold entire document in memory in
the worst case

Unlike DOM approach, only matches need to be stored




Tag match
(Partial) twig match
Predicate evaluation
Early result enumeration dramatically reduces the memory usage


Enumerate query results before the end of document and release
buffer
Main idea: hybrid of top-down (PathStack) and bottom-up (Twig2Stack)
approaches
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
75
BenchmarX 10 Keynote
Outline


Introduction
Holistic algorithms:








TwigStackList (CIKM2005)
OrderedTJ (DEXA2006)
iTwigJoin (SIGMOD2005)
TJFast (VLDB2005)
Twig2Stack(VLDB2006)
TreeMatch (TKDE2010)
Benchmark experiments
Conclusions and future work
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
76
BenchmarX 10 Keynote
TreeMatch (TKDE 2010)
Twig pattern:
An XML tree:
A1
A
B
C2
A2
B1
C
B2
C1
Matching cross:
B1
B2
It is the real
reason for
suboptimality!
C1
Jiaheng Lu
C2
Benchmarking Holistic
Approaches to TPQ Processing
77
BenchmarX 10 Keynote
Bounded and Unbounded Matching Cross
Twig pattern:
An XML tree:
A1
A
B1
B
…
Bn
Bn+1
C
C2n
A2
C 2n-1
An
B2n
Unbounded Matching cross:
Jiaheng Lu
B1 ……
B2n
C1 ……
C2n
C1
…
Cn
Bounded Matching cross:
A1 ……
An
C1 ……
C2n
Benchmarking Holistic
Approaches to TPQ Processing
78
BenchmarX 10 Keynote
BMC and UMC

Bounded Matching Cross (BMC): Optimal class
 Store limited number of nodes in main memory

Unbounded Matching Cross (UMC): Sub-optimal
class, but not all
 Cannot guarantee to store limited number of nodes in
main memory, but a sub-class of UMC is still optimal
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
79
BenchmarX 10 Keynote
Unbounded Matching Cross with Mediator
Twig pattern:
An XML tree:
(output: node C)
A1
A
B1
B
C
C1
Jiaheng Lu
……
Bn+1
Cn
…
A2
Bn
Bn+1
Unbounded Matching cross:
B1 ……
…
C1
Cn
An
B2n
C n-1
Node A is a
mediator
node and we
do not need
to store all Bi
in main
memory!
Benchmarking Holistic
Approaches to TPQ Processing
80
BenchmarX 10 Keynote
Optimal query classes
Optimal Class of
TwigStack
Only A-D in all
edges
A
Optimal Class of
TwigStackList
B
Optimal Class of
TreeMatch
Only A-D in
branching edges
A
C
B
C
D
D
Only A-D in
non-output
branching edges
Jiaheng Lu
A
B
Benchmarking Holistic
Approaches to TPQ Processing
C
81
BenchmarX 10 Keynote
Outline


Introduction
Holistic algorithms:








TwigStackList (CIKM2005)
OrderedTJ (DEXA2006)
iTwigJoin (SIGMOD2005)
TJFast (VLDB2005)
Twig2Stack(VLDB2006)
TreeMatch (TKDE2010)
Benchmark experiments
Conclusions and future work
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
82
BenchmarX 10 Keynote
Experiment Setup

Implementation (Seven algorithms)








Datasets


TwigStack (SIGMOD2002)
TwigStackList (CIKM2005)
OrderedTJ (DEXA2006)
iTwigJoin (SIGMOD2005)
TJFast (VLDB2005)
Twig2Stack(VLDB2006)
TreeMatch (TKDE2010)
XMark, DBLP, TreeBank
Metrics


Query processing time
IO time
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
83
BenchmarX 10 Keynote
Experiments

Benchmarks



XMark: Synthetic Data
DBLP: Real Data for DBLP database
Treebank: Real Data from Wall Street Journal
XMark
DBLP
Treebank
Data size(MB)
582
130
82
Nodes(million)
8
3.3
2.4
Max/Avg depth
12/5
6/2.9
36/7.8
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
84
BenchmarX 10 Keynote
Tested queries
Some tested queries
Source
Twig Queries
Q1
DBLP
//proceedings//title[.//i]//sup
Q2
DBLP
//article[.//sup]//title//sub
Q3
Treebank
/S[.//VP/IN]//NP
Q4
Treebank
/S/VP/PP[IN]/NP/VBN
Q5
Treebank
//VP[DT]//PRP_DOLLAR_
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
85
BenchmarX 10 Keynote
Tested queries (Cont.)
Q1,Q2,Q3 are based on XMark data and Q4,Q5 Q6 are on TreeBank data.
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
86

Experiment data: TreeBank
# of intermediate path
BenchmarX 10 Keynote
TwigStackList V.s.TwigStack
80000
70000
60000
50000
Useful
TwigStack
TwigStsackList
40000
30000
20000
10000
0
Q1
Q2
Q3
Q1=VP[/DT]//PRP DOLLAR, Q2=S[/JJ]/NP, Q3=S[//VP/IN]//NP
Compared to TwigStack, TwigStackList significantly reduces the size
of output useless elements.
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
87


STW: Straightforward-TwigStack
STWL: Straightforward-TwigStackList
STW
STWL
STW
OrderedTJ
14
12
10
8
6
4
2
0
Execution time
(seconds)
Execution time (s)
BenchmarX 10 Keynote
TwigStackList V.s. OrderedTJ
Q1
Q2
Queries on XMark
STWL
OrderedTJ
35
30
25
20
15
10
5
0
Q3
Q4
Q5
Q6
Queries on Tree data
OrderedTJ is significantly better than two straightforward
method on XMark and TreeBank data
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
88
The decrease of the number of elements scanned

tag
tag+level
tag
prefix path
tag+level
prefix path
14
5
Bytes scanned (M)
Bytes scanned (M)
BenchmarX 10 Keynote
iTwigJoin
4
3
2
1
0
12
10
8
6
4
2
0
Q1
Q2
XMark data query
Q3
Q4
Q5
Treebank data query
Q6
More refined schemes scan less elements to answer a query.
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
89
Performance of queries for three streaming schemes

Tag
Tag+level
Prefix path
Tag
Execution time(s)
10
Execution time
BenchmarX 10 Keynote
iTwigJoin
8
6
4
2
0
Q1
Q2
XMark queries
Q3
Tag+level
Prefix path
60
50
40
30
20
10
0
Q4
Q5
Quries on TreeBank
Q6
Prefix path scheme is suitable for large but shallow document, and
tag+level scheme generally works well even for complicated recursive
documents.
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
90
TwigStack
Number of elements read
BenchmarX 10 Keynote
TwigStackList V.S. iTwigJoin
TwigStackList
iTwigJoin
1200000
1000000
800000
600000
400000
200000
0
Q3
Q4
Q5
TreeBank data
Observation: iTwigJoin scans far less elements than TwigStack and
TwigStackList in two twig queries.
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
91
TwigStack
TwigStackList
iTwigJoin
TwigStack TwigStackList iTwigJoin
6
5
Execution time
(seconds)
Execution time (seconds)
BenchmarX 10 Keynote
TwigStackList V.S. iTwigJoin
4
3
2
1
0
Q1
Q2
20
15
10
5
0
Q3
DBLP data
Q4
Q5
TreeBank data
Observation: iTwigJoin has much better performance than that
of TwigStack/TwigStackList.
Explanation: iTwigJoin reduces I/O cost by reading less elements
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
92
iTwigJoin
TJFast
iTwigJoin
Twig2Stack
TJFast
Twig2Stack
18
5
4.5
4
3.5
3
2.5
2
1.5
1
0.5
0
16
Execution time (s)
Execution time (s)
BenchmarX 10 Keynote
iTwigJoin, TJFast, Twig2Stack,
14
12
10
8
6
4
2
0
1
2
1
DBLP data
2
3
TreeBank data
Observation: iTwigJoin/TJFast has better performance than that
of Twig2Stack
Reason: iTwigJoin/TJFast reduces I/O cost by reading less
elements
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
93
TJFastL
iTwigJoin
Execution time (seconds)
iTwigJoin
Execution time (seconds)
BenchmarX 10 Keynote
Experiments: TJFastL and iTwigJoin
1.2
1
0.8
0.6
0.4
0.2
0
Q1
DBLP data
Q2
TJFastL
9
8
7
6
5
4
3
2
1
0
Q3
Q4
Q5
TreeBank data
Observation: Both algorithms are based on tag+level scheme.
TJFastL has much better performance than iTwigJoin on tag+level
scheme.
Explanation: TJFast reduces I/O cost by reading less elements.
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
94
TreeMatch
TJFast
Execution time (seconds)
TJFast
Execution time(seconds)
BenchmarX 10 Keynote
TJFast and TreeMatch
0.45
0.4
0.35
0.3
0.25
0.2
0.15
0.1
0.05
0
Q1
Q2
DBLP data
TreeMatch
6
5
4
3
2
1
0
Q3
Q4
Q5
TreeBank data
Observation: TreeMatch has much better performance than that of
TJFast.
Explanation: TreeMatch reduces I/O cost over TJFast.
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
95
BenchmarX 10 Keynote
Conclusions



Efficient processing of twig queries is a core operation in
XPath and XQuery
We reviewed and compared seven holistic algorithms
 TwigStack(SIGMOD 2002)
 TwigStackList (CIKM2005)
 OrderedTJ (DEXA2006)
 iTwigJoin (SIGMOD2005)
 TJFast (VLDB2005)
 Twig2Stack(VLDB2006)
 TreeMatch (TKDE2010)
Comprehensive benchmark experiments show the
correctness and efficiency of holistic algorithms
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
96
BenchmarX 10 Keynote
Conclusions (Cont.)

Holistic TPQ processing, I/O cost takes most of time

TJFast reduces input data size

Twig2Stack reduces output size

TreeMatch reduces both input and output data size
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
97
BenchmarX 10 Keynote








Reference works
[1] J. Lu, T. W. Ling,Z. Bao and C. Wang Extended XML Tree Pattern Matching:
Theories and Algorithms IEEE TKDE Journal 2010 (to appear)
Propose TreeMatch algorithm
[2] J. Lu, T. Chen, and T. W. Ling. Efficient processing of xml twig patterns with
parent child edges: a look-ahead approach. In CIKM, pages 533-542, 2004.
Propose TwigStackList algorithm
[3] J. Lu and T. W. Ling, Labeling and querying dynamic XML trees, In
Proceedings of the Sixth Asia Pacific Web Conference, 2004, 180–189
Propose a new labeling scheme for dynamic XML documents
[4] T. Chen, J. Lu, and T. Ling. On boosting holism in xml twig pattern matching
using structural indexingtechniques. In SIGMOD, 2005.
Propose two new data streaming techniques
[5] J. Lu, T. W. Ling, C. Chan, and T. Chen, From region encoding to extended
dewey: On efficient processing of XML twig pattern matching, In Proceedings of
VLDB, 2005, pp. 193–204.
Propose TJFast algorithm
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
98
BenchmarX 10 Keynote
Reference works (Cont.)








[6] J. Lu, T. W. Ling, T. Yu, C. Li, and W. Ni, Efficient processing of
ordered XML twig pattern matching, Proceedings of DEXA, 2005, pp.
300–309
Propose OrderedTJ algorithm
[7] J. Lu, T. W. Ling, and T. Chen, TJFast: Effective processing of XML
twigpattern matching, Proceedings of WWW, 2005, pp. 1118–1119.
Propose extended Dewey labeling scheme
[8] T. Yu, T. W. Ling, J. Lu: TwigStackListNot: A Holistic Twig Join
Algorithm for Twig Query with Not-Predicates on XML Data. DASFAA
249-263
Propose an algorithm for twig queries with NOT predicate
[9] J, Lu, R Yang, W. Ling, A. K.H Tung: Efficient XML tree pattern
matching: theory and algorithm Submit to IEEE TKDE Journal
Propose a theory and algorithm for extended XML tree pattern
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
99
BenchmarX 10 Keynote






Reference works (Cont.)
[10] S. Al-Khalifa , H.V. Jagadish, J. Patel, Y. Wu N. Koudas,
D. Srivastava : Structural Joins: A Primitive for Efficient XML
Query Pattern Matching. ICDE 2002 141- 152
Propose StackTree algorithm
[11] N. Bruno, D. Srivastava, and N. Koudas. Holistic twig
joins: optimal xml pattern matching. In Proceedings of ACM
SIGMOD, 2002.
Propose TwigStack algorithm
[12] C. Zhang, J. F. Naughton, D. J. DeWitt, Q. Luo, and G.
M. Lohman, On supporting containment queries in relational
database management systems, In Proceedings of the ACM
SIGMOD International Conference on Management of Data,
2001, pp. 425–436.
Propose containment labeling scheme
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
100
BenchmarX 10 Keynote





Reference works (Cont.)
[13] H. Jiang, W Wang and H. Lu Holistic twig joins on indexed
XML documents VLDB 2003
Propose TSGeneric algorithm
[14] I. Tatarinov, S. Viglas, K. S. Beyer, J. Shanmugasundaram, E. J.
Shekita, and C. Zhang, Storing and querying ordered XML using a
relational database system, In Proceedings of the ACM SIGMOD
International Conference on Management of Data, 2002, pp. 204–215.
Propose Dewey labeling scheme
[15] H. Wang, S. park, W Fan and P.S. Yu ViST: A dynamic index
method for querying XML data by tree structures In SIGMOD
2003
Propose ViST system
[16] B. Yang M. Fontoura, E.J. Shekita, S. Rajagopalan and K.S.
Beyer
Virtual Corsors for XML joins CIKM pages 523-532
2004
Propose Virtual cursor algorithm
Jiaheng Lu
Benchmarking Holistic
Approaches to TPQ Processing
101
Descargar

Benchmarking Holistic Approaches to TPQ Processing