KapQuilt: Semantic Caching for Quilt Queries
--- A New Quilt Query Answerable By Cached Ones?
Li Chen
Outline
•
•
•
•
•
•
•
•
•
Background
Motivation
Goals: semantic cache for quilt queries
Overall task list
Immediate task
Approaches for containment and rewriting
Case studies
Module design
Timetable
Background
query
efficiency
query performance
dynamically decide
mat views
query
quality
control concurrency
of queries & updates
Argos! :-))
ECA,
semantic
sweep
cache
mat view
answer queries
maintenance
using views
data mining
database
containment
web site
design
data warehousing
theory
&
algo
web site
integration
management
views
query
optimization
independence of
physical & logical data
Dimensions of Semantic Caching
languages
SQL - simple select-project-join SPJ
- group, aggregation, query blocks
- datalogs
OQL
TSL
Quilt
fully contained
outcomes
max-contained
containment
relationships
Motivations
• What’s new about semantic caching?
– Web proxies just cache web page hits, not real
computed queries
– Web information integration needs expressive
XML queries
– Semantic caching for XML queries is new
– Quilt is a full capability XML query language,
promising for the integration of web info, and
kweelt is a quilt query engine implemented!
Goals
• Goal! build a SC system for quilt queries
– to better answer populate queries
– quicker, less expansive and more up-to-date
• KapQuilt comes to rescue
limitation
we start from the core subset of quilt queries while ignoring
nesting queries and regular expression queris for now
remote query requests
KapQuilt System Architecture
KSP
Client
Kweelt Engine
Kweelt API
Parser
Query Decomposer
Query Matcher
KapQuilt
q1 q2
CIS
q3
Cached Views
Query Rewriter
Cost Estimator
Query plans
PQ
Evaluator
DOM
…
DOM
XML
Source 2
…...
RQ
Other Node Factories
XML Parser
XML
Source 1
DTDM
XML
Source n
Parser
Wrapper
Doc
RDB
Task List
• Answer whether a query is computable by cached ones
• If answerable, compute PQ (probe query) and RQ
(remainder query)
• If many PQ candidates, pick the one benefits most
• Decide whether a query is worth to cache, when to cache
• In case of cache space limitation, apply replace policy
• Decompose and coalesce the query segments in cache
• Concurrency control of queries and updates
• Analyze costs in various web query archs
• ?Keep cached view always fresh
– integrate Argos with KapQuilt
Immediate Task!
-- MQP project goal as well
• Design and impl core functions of KapQuilt
– input
• a set of cached queries S={s1,s2...}
• a new query q
– output
• a probe query (PQ)
– might be null if not answerable at all
– if not null, PQ 
Ac (s1  s2  …  sn)
• a reminder query (RQ)
– might be null if q fully contained in S
– if not null, RQ go down to query against data sources
Approaches
• Analyze quilt query process and its variable binding
mechanism
• Set up cache index structure (CIS) to represent elements of a
quilt query
• Warm up cache by initializing CIS with decomposed queries
• Implement the query containment and rewriting algorithm
for quilt
• Conduct experimental studies for cost analysis
• Integrate with Argos system for cached view maintenance
A Taxonomy for XML Query
SQL
XQL
XPointer
XSL Patterns
XPath
OQL
XML-QL
XQL-99
Quilt
Briefs on Quilt
• Quilt is a functional language
• A query is an expression, composed of
– FLWR Expressions
FOR ... LET ... WHERE ... RETURN
– Filters
– XPath expressions
document("bids.xml")//bid[itemno="47"]/bid_amount
– Operators and functions
– Element Constructors
<bid>
<userid> $u </userid> ,
<bid_amount> $a </bid_amount>
</bid>
Data Flow in a FLWR Expression
XML
FOR/LET
List of tuples of
bound variables
($x = value, $y = value, $z = value),
($x = value, $y = value, $z = value),
($x = value, $y = value, $z = value)
WHERE
List of tuples of
bound variables
RETURN
XML
Quilt Compared to XQL
• A superset of XQL
• Overcome shortcomings of XQL
– no variable bindings, joins, transformations,
ordering, aggregate functions, etc
– no data integration from multiple XML sources
– do semi-join, but in pretty non-intuitive syntax
book[author=//book[title='Moby Dick']/author]
• Cover queries on structured document
(including SGML), relational data even!
Quilt Query Process
• Variable binding is an important means
– a query can define multiple variables, in order
– dependency relationships exist among variables
– a tuple list is bound to each variable, condition
evaluation and return invocation are tuple-based
– tuple lists are handles to data tree components, of
which answer tree is composed
Cache Index Structure (CIS)
• A structure to capture the essential elements
of a quilt query
• What’s essential elements of a quilt query?
– variable bindings, conditions and returning nodes
all refer to some element nodes in dtd.
– a query can be identified by variable nodes V,
return nodes T, condition nodes F and their
dependency relationships
– each element node in a dtd tree can be assigned a
unique number (with unique absolute xpath)
Example DTD
<?xml version="1.0"?>
<!DOCTYPE bib [
<!ELEMENT bib (book* )>
<!ELEMENT book (title, (author+ | editor+ ), publisher, price )>
<!ATTLIST book year CDATA #REQUIRED >
<!ELEMENT author (last, first )>
<!ELEMENT editor (last, first, affiliation )>
<!ELEMENT title (#PCDATA )>
<!ELEMENT last (#PCDATA )>
<!ELEMENT first (#PCDATA )>
<!ELEMENT affiliation (#PCDATA )>
<!ELEMENT publisher (#PCDATA )>
<!ELEMENT price (#PCDATA )>
]>
1
2
year
title
3
4
CDATA
6
PCDATA
5
last
9
PCDATA
8
7
10
bib
book
author
first
11
PCDATA
editor
12
last
13
14
PCDATA
19
15
first
publisher
17
16
PCDATA
affiliation
18
PCDATA
21
price
20
PCDATA
22
PCDATA
//book
Quilt Query Sampler I
v1
v1/autho
Q1
<bib>
FOR $book IN document("bib.xml")//book[@year.>=.1991 AND publisher="Addison-Wesley"]
RETURN <book [email protected]>$book/title</book>
</bib>
b1
b2
v2
b3
variable nodes
v1 /bib/book
2
2
y1
v1
condition nodes
3
19
f1 v1 @ year.>.=1991
f2 v1 / publisher=“Addison-Wesley”
3
19
b1
b2
b3
p1
f1 (
2
p2
y2
3
)
y1[1993]
y2[1995]
1
1
y3[1990]
0
y3
p3
f2 ( 19 )
p1[…] 1
p2[…] 0
p3[…] 1
return nodes
3
5
t1 v1 @ year
t2 v1 / title
3
5
t1 t31
t2
t1[…]
y1[1993]
/bib/book
/book
2
r1
year
3
19
publisher
3
5
y1
5
t1
Quilt Query Sampler II
Q2
FOR $author IN DISTINCT document("bib.xml")//author,
$book IN document("bib.xml")//book[author = $author]
RETURN <result> $book/title, $author</result>
b1
b2
a1
variable nodes
7
2
7
2
a2
v1
v1 /bib/book/author
v2 /bib/book[author= v1]
a1
b3
a3
5
a1
a2
a3
7
t1 v2 / title
t2 v1
t2
b1
b2
b1
b3
b2
t1
7
a1
a2
a3
7
/result
author
5
v2 / title
/bib/book[author= v1]/title
r1
7
5
t1
t2
t1
t3
t2
/bib/book
2
v2 [author= v1]
2
7
return nodes
5
7
a2
r2
r3
r4
r5
v1
/bib/book/author
t1
a1
t2
a1
t1
a2
t3
a2
t2
a3
Quilt Query Sampler III
Q3
<results>
FOR $author IN DISTINCT document("bib.xml")//author
RETURN <result>
$author,
document("bib.xml")//book[author = $author]/title
</result>
</results>
b1
a1
a2
7
7
5
b3
a3
a2
7
a1
a2
a3
v1 /bib/book/author
t1
return nodes
7
5
a1
v1
variable nodes
7
b2
t2
7
a1
a2
a3
t1 v1
t2 /bib/book[author= v1]/title
5
t1
t2
t1
t3
t2
/result
/bib/book/author
r1
r2
r3
7
7
5
a1
t1
t2
a2
t1
t3
a3
t2
Quilt Query Sampler IV
Q4
<books-with-prices>
FOR $a_book IN document("prices.xml")//book[source = "www.amazon.com"],
$b_book IN document("prices.xml")//book[source = "www.bn.com"][title = $a_book/title]
RETURN <book-with-prices>
$b_book/title,
<price-amazon>$a_book/price/text()</price-amazon>,
<price-bn>$b_book/price/text()</price-bn>
</book-with-prices>
</books-with-prices>
Quilt Query Sampler IV
variable nodes
2
2’
2
2’
v1 bib/book[source = "www.amazon.com"]
v2 bib/book[source = "www.bn.com"][title = v1 /title]
b1
b2
b3
b1’
b2’
b3’
t1
t2
t3
t1 ’
t2’
t3’
return nodes
5’
22
22’
5’
22
22’
t1 v2 / title
t2 v1 /price/text()
t3 v2 /price/text()
v1
/bib/book
@source = "www.amazon.com"
/bib/book
@source = "www.bn.com"
2’
price-amazon
22
PCDATA
v2 [title = v1 /title]
b1’
b2’
b3’
22
5
$12.5
$23
$54
/ book-with-prices
5’
2’
b1
b2
b3
t2
2
2
t1
t
22’ 3
t1
t2
t3
$21
$22
$47
price-bn
22’
PCDATA
t1
t2
$12.5
$21
t3
$23
$22
$54
$47
More Quilt Query Sampler
variable nodes
Q5
v1 /bib/book
2
2
<bib>
FOR $book IN document("bib.xml")//book[price.<=.$50]
condition nodes
RETURN <book [email protected]><editors>$book/editor</editors></book>
f1 v1 / price.<.=$50
21 21
</bib>
return nodes
3
12
variable nodes
2
2
7
7
3
12
t1 v1 @ year
t2 v1 / editor
v1 /bib/book
v2 /bib[book= v1]//author
condition nodes
21
21
f1 v1 / price.<.=$50
return nodes
5
21
7
5
21
7
t1 v1 / title
t2 v1 / price
t3 v2
Q6
<bib>
FOR $book IN document("bib.xml")//book[price.<=.$50],
$author IN /bib[book=$book]//author[last=“Abiteboul”]
RETURN <book>$book/title, $book/price, $author</book>
</bib>
Query Containment for
Relational Queries
q
s
s
q
Fq
Fq
Fs
Fs
Tq
Tq
Ts
Ts
s
q
Fs
Fq
Tq
Ts
Our Containment Theorem
Given a set of cached queries S={s1,s2...}, and a new query q,
q can be fully answerable by S if
'
'
'
1  f  F q   f  F s , f  f or  t  T s ( s i , s j  S )
i
j
'
'
'
'
'
2  f i  Fq ,  t  T s  (  f i  F s , f i  f i ,  f k  F s   f k  Fq , f k  f k , )( s i , s j  S )
j
i
i
3  t  Tq   t  Ts ( si  S )
'
i
'
'
'
'
'
'
4  ti , t j  Tq   ti , t k  Ts ,  t j , t h  Ts , t k  t h (i  j , si , s j  S )
i
j
5  v  V q , v  t  Tq   v '  V s , v '  t '  Ts ( si  S )
i
6
i
 v  V q , v  t  T q  (  v  V s , v  t  T s , v  f  F s   v  f  Fq )( s i  S )
'
'
i
'
'
i
'
i
Explanations
 f  F q   f  F s , f  f or  t  T s ( s i , s j  S )
'
'
'
i
j
for every condition node f of q, it must either also be one condition node, with
loose predicates, of some si in the cache, or be one return node of some sj
 f i  F q ,  t  T s  (  f i  F s , f i  f i ,  f k  F s   f k  F q , f k  f k , )( s i , s j  S )
'
'
j
'
'
'
i
i
for every condition node fi of q, if it is not one of any return node of sj, then it must
be one condition node, with loose predicates, of some si in the cache, and any other
condition node fk of si should be one condition node fk of q
or
 Fs   Fs
'
, Ts  Ts
'
i , j ... k
, F q  F s  T s , F s  F q ( s i , s j ,... s k  S )
'
i , j ... k
'
'
there is a subset of S, whose condition nodes is a subset of those of q, but whose
condition nodes and return nodes are a superset of the condition nodes of q.
Explanations
 t  Tq   t  Ts ( si  S )
'
i
for every return node t of q, it must also be one return node of some si in the cache.
 ti , t j  Tq   ti , t k  Ts ,  t j , t h  Ts , t k  t h (i  j , si , s j  S )
'
'
'
'
'
i
'
j
for every pair of return nodes ti and tj of q, if their counterparts are in different
segments si and sj, then there must be a common return node in si and sj.
 v  V q , v  t  T q  (  v  V s , v  t  T s , v  f  F s   v  f  Fq )( s i  S )
'
'
i
'
'
i
'
i
for every return node t of q, if it is derived from a variable node v, then its
counterpart in the cache should be also derived from the same variable node,
and all the condition nodes derived from this v should also have their
counterparts derived from v in q.
Query Rewriting Rules
If a query is judged to be computable by cached views, the
following rules can be followed to figure out the rewritten q’
1. Decide which filters to keep(not evaluated by any cached query yet),
which filters to remove (evaluated by some cached query) and
remember those cached queries S with established F mappings.
• keep all those f that has t’ matches, and those f with a looser f ’ matches,
they would be still appearing as condition nodes in the probe query
• remove those f with exact f ’ matches
• for each non-exact f ’ match, remember its s so to know which s to associated
with those left over filters
Query Rewriting Rules (cont.)
2. We need to figure out the semantic meanings of newly constructed nodes
in the returning structure of each cached queries, they are associated
with new xpaths as the replacement of their old ones
….
A newly constructed node can be seen as the renaming of some old
element node. Return nodes usually appear under each newly constructed
node, hence a mapping of this new node to the old one can be inferred
from those return nodes
• replace in the new query q those old xpaths, with the new xpath to a
newly constructed node in cached views
3. In case of a query rewriting using joins of more than one s with common
t pair, be sure to add such joins as new conditions
• if there is no variable binding in the new q, a new binding should be produced
for one of the common t pair so that there is a way to join with its pair
Query Containment I
Suppose that we have queries of q1,q2,q3,q4,q5,q6 cached in
C={s1,s2,s3,s4, s5,s6}, a new query q comes in,
case q of
f1
<bib>
FOR $book IN document("bib.xml")//book[editor/affiliation=“WPI”]
RETURN <book [email protected]>$book/title</book>
</bib>
it does not even satisfy the first condition. s.t. not answerable
f1 refers to the element node of 17, which has no match in s1 to s6
f1
<bib>
FOR $book IN document("bib.xml")//book[publisher="Addison-Wesley"]
RETURN <book [email protected]>$book/title</book>
</bib>
it satisfies the first condition, but not the second one. s.t. not answerable
f1 < -- > f1’ in s1, but another condition node f2’ of s1 is not any condition node of q
Query Rewriting I
case q of
f1
f2
f3
<bib>
FOR $book IN …/bib/book [@year=1997 AND title like “JAVA*” AND publisher="Addison-Wesley"]
RETURN <book [email protected]>$book/title</book>
</bib>
it satisfies all those conditions, s.t. is answerable
1) f1 < -- > f1’ in s1, f2< -- > t2’ of s1, f3 < -- > f2’ in s1,
2) there is no other f ’ in s1,
3) t1 < -- > t1’ in s1, t2< -- >t2’ in s1,
4) t1’ and t2’ are both from the same s1,
5) t1 and t2 are derived from v1, so do t1’ and t2’ from v1’, v1’--> f1’, f2’, and v1--> f1, f2
s1
$book IN …/bib/book [@year=1997 AND title like “JAVA*”
AND publisher="Addison-Wesley"]
/book
v1 /bib/book
t1 v1 @ year
3
5
t2 v1 / title
rewritten as
/book [source = "s1”]
Rewrite the query as
left over filters
<bib>
FOR $book IN /book [source = "s1"][@year=1997 AND title like “JAVA*”]
RETURN <book [email protected]>$book/title</book>
</bib>
Query Rewriting II
case q of
f1
f2
f3
<bib>
FOR $book IN … /bib/book [@year.>=.1991 AND publisher="Addison-Wesley” AND price.<=.$50]
RETURN <book>$book/title,<editors>$book/editor</editors></book>
</bib>
it satisfies all the conditions, s.t. is answerable
1) f1 < -- > f1’ in s1, f2 < -- > f2’ in s1, f3< -- > f1’ of s5,
2) there is no other f ’ in s1 and s5
3) t1< -- >t2 in s1, t2 < -- > t2 in s5,
4) t1 in s1 = t1 in s5,
5) t1 and t2 are derived from v1, so do t1’ and t2’ from v1’, v1’--> f1’, f2’, and v1--> f1, f2, f3
s1
v1 /bib/book
/book
t1 v1 @ year
s5
v1 /bib/book
t1 v1 @ year
5
3
t2 v1 / title
/book
3
5
t2
v1 / editor
$book IN …/bib/book [@year .>=.1991 AND
publisher="Addison-Wesley” AND price.<=.$50]
rewritten as
/book1[source = "s1”] and /book2[source = "s5”]
Rewrite the query as
<bib>
FOR $book1 IN /book [source = "s1"],
$book2 IN /book [source = "s5"][title =$book1/title]
RETURN <book>$book1/title,<editors>$book2/editor</editors></book>
</bib>
Query Rewriting III
Suppose that we have q2 cached in S, but q3 is not cached,
instead, it is a new query,
case q of
<bib>
FOR $book IN document("bib.xml")//book[publisher="Addison-Wesley"]
RETURN <book [email protected]>$book/title</book>
</bib>
<bib>
FOR $book IN … //book [@year=1997 AND title like “JAVA*” AND publisher="Addison-Wesley"]
RETURN <book [email protected]>$book/title</book>
</bib>
Query Rewriting IV
Suppose that we have q2 cached in S, but q3 is not cached in,
instead, it is a new query,
Q4
<books-with-prices>
FOR $a_book IN document("prices.xml")//book[source = "www.amazon.com"],
$b_book IN document("prices.xml")//book[source = "www.bn.com"][title = $a_book/title]
RETURN <book-with-prices>
$b_book/title,
<price-amazon>$a_book/price/text()</price-amazon>,
<price-bn>$b_book/price/text()</price-bn>
</book-with-prices>
</books-with-prices>
Input: Query q, Semantic Cache C
Output: Result of q
AnsweringQuery Procedure:
answerable, fullAns <--- False;
T <--- current timestamp;
C={s1,s2,..} <--- set up CIS for si segment;
s <--- set up CIS for q;
M <--- matched node set, set as null at the beginning ;
R = RENs <--- not matched node set;
RC <--- CENs of s;
si <--- look for the first q related si in C;
S <--- put si into a candidate set;
While (si can be found) {
answerable <--- True;
STs <--- T;
(MS, RM) <--- query_trimming(si, s);
MS <--- matching nodes of s and si;
RM <--- remaining nodes of s not covered by si;
R = R-RM; M = M+MS; RC = RC+RemainingCENS;
If (R=null) {
fullyAns <--- True;
break;
} si <--- next q related segment in C;
}
PQ <--- query_rewriting(S, JoinCISs, M, RC);
MatV <--- materialized view sets referred by S;
ResPQ, Result <--- process PQ against MatV;
If (fullyAns = False) {
RQ <--- query_rewriting(S, JoinCISs, M, RC);
ResRQ <--- process RQ at the server;
Result <--- coalesce(ResPQ, ResRQ);
}
create a new segment Snew contains the result of q;
Input: CIS structure for q s and a segment si;
current candidate set S and JoinCISs;
Output: judge whether si is related to q; if yes, add into S,
matching nodes MS and remaining nodes RM;
Query_trimming Procedure:
IsRealted <--- False;
MS <--- null;
RM <--- boundRENs of s;
RS <--- RENs in si;
RNS <--- RENs in all s of S;
RemainingCENS <--- boundCENs of s;
While (RM=\null) {
matchingRENS <-- match nodes in RM and RS by applying theorems;
if (matchingRENS =\null) {
commonSet <--- RNS ^ RS;
if (commonSet =\null) {
IsRelated <--- True;
<si, sj, commonSet> <--- sj is the segment in S has commonSet with si;
JoinCISs <--- add <si, sj, commonSet> into JoinCISs;
(MS, RM) <--- (matchingRENS, RM-M);
RemainingCENS <--- left-over ones except exact-match CENS;
}
}
}
return (S, JoinCISs, MS, RM, RemainingCENS).
DTD
DTDWalker
DTDTree
ENIndex (ElementNode)
to-be-cached queries
set up
QueryDecomposer
QueryIndex VENIndex CENIndex RENIndex
(VariableEleNode ReturnEleNode ConditionEleNode)
CacheIndexStructs
MatView
(ViewDTDTree)
new query
QueryDecomposer
AnsweringQuery
NewQuery
QueryTrimmer
contained?
NewQuery QueryRewritter
(MatchingCENPair
ProbeQuery
MatchingRENPair
MatchingENPair)
remainingRENs
fully contained?
Y
Result
N
QueryRewritter
RemainderQuery
ResultCoalescer
1
ElementNode
dtdElements:Vector
enId:int
dtdRef:DTDTree
enName:String
absXpath:String
parentEN: ElementNode
childrenEN:Vector
DTDTree
dtdName:String
dtdLoc:String
rootEN: ElementNode
elementNodes:Vector
equalTo(EN):boolean
relativeXpathForm(EN):String
VariableEleNode
venName:String
cisRef:CIS
parentVEN:VEN
childrenVEN:Vector
childrenCEN:Vector
childrenREN:Vector
ConditionEleNode
cenName:String
cisRef:CIS
condition:String
parentVEN:VEN
stricterThan(CEN):boolean
ENIndex
ReturnEleNode
ViewDTDTree
renName:String
cisRef:CIS
parentVEN:VEN
matchingENPairs:Vector
1
QueryIndex
ProbeQuery
cachedQueries:Hashtable
1
boundVENs:Vector
boundCENs:Vector
boundRENs:Vector
newDTD:ViewDTDTree
VENIndex
cachedVENinCISs:Hashtable
NewQuery
candidateCISSet:Vector
matchingCENPairs:Vector
remainingCENs:Vector
matchingRENPairs:Vector
remainingRENs:Vector
1
CENIndex
cachedCENinCISs:Hashtable
1
CacheIndexStructs
initialize cisId:String
RENIndex
cachedRENinCISs:Hashtable
hasMoreCENsThan(CIS):boolean
hasOverlapTENs(MatchingRENPair, MatchingRENPair):boolean
MatchingCENPair
newCEN: ConditionEleNode
oldCEN: ConditionEleNode
oldREN: ReturnEleNode
remainingCond:string
stricterThan(CEN):boolean
MatchingRENPair
newTEN: ReturnEleNode
oldTEN: ReturnEleNode
otherOldTENs:Vector
diffCISFrom(MatchingRENPair):boolean
overlapTENWith(MatchingRENPair):boolean
sameParentVEN():boolean
newWithPVENhasMoreCENs():boolean
qstring:String
viewDTD:ViewDTDTree
matRef: MatView
boundVENs:Vector
boundCENs:Vector
boundRENs:Vector
MatView
matId:String
cisRef:CacheIndexStructs
MatchingENPair
newConstructEN:ElementNode
originalDtdEN:ElementNode
source: MatView
Timetable
•
•
•
•
•
•
•
•
•
By 11/15: design due, implement starts
By 11/30: half finish coding
By 12/10: fully finish coding
By 12/20: finish integration
By 1/15: test designed cases
By 1/30: design and do experiments
By 2/15: collect experiment results
By 2/28: document code, writing
By 3/15: summarize
Task Assignment
• Lily:
– design classes, containment, rewriting and candidate picking
algorithms, design experiments
– ideas for query decomposition, result combination, cache
decomposing /coalesce, replacement policy, data updates handling
• Jake:
– implement containment algo, ...
• Ian:
– implement classes of EN, VEN, CEN, TEN, ...
• Amar:
– module of rewriting algo
Implementation Toolsuites
• JDK1.2, Servlet
• XML Parser, DTD Parser
• Quilt Parser, Kweelt(Quilt) Query Engine
Descargar

KapQuilt: Semantic Caching for Quilt Queries