PODS 2011
2011 ACM SIGMOD/PODS Conference
Athens, Greece
Finding a Minimal Tree Pattern
Under Neighborhood Constraints
Benny Kimelfeld
Yehoshua Sagiv
IBM Research – Almaden
The Hebrew University
of Jerusalem
carrey williams
DB (Graph) Search
widom ullman
search
search
XKeyword [Balmin et al.] (XML)
BANKS [Bhalotia & al.] (rel.)
eu brussels
search
[Achiezra, K, S] (XML)
DBXploer
[Agrawal et al.]
DISCOVER
[Hristidis et al.]
SPARK
[Lou et al.]
Blinks
[He et al.]
RDF search
[Tran et al.]
…
2
DB Search: 2 Approaches
graph
algorithm
DB
data graph
e.g., top-k
Steiner-trees
search
results
query1
query2
schema
query
gen.
query3
DB
query
engine
SQL,
XQuery, …
search
results3
Aided Query Formulation
ExQueX [K et al.]
QUICK [Zenz et al.]
4
Building Integration Forms
Q System
[Talukdar et al.]
concepts
(as keywords)
5
The Problem (Informal)
employee
schema:
eid name
type
dept
dept
id
name head
(Mary’s department
SELECT *
FROM
employee e, dept d
WHERE e.name=‘Mary’ and e.dept=d.id
{ Mary, dept }
{ Mary, manager }
(managers named Mary)
SELECT *
FROM
employee e
WHERE e.name=‘Mary’ AND e.type=‘manager’
(Mary’s manager)
SELECT *
FROM
employee e1, employee e2, dept d
WHERE e1.name=‘Mary’ AND e1.dept=d.id AND
d.head=e2.eid AND e2.type=‘manager’
{ Jacob, Mary, manager }
{ dept, dept }
{ Mary, dept, dept }
bags of labels
…
6
And in XML
DTD
<!ELEMENT department (facility|(branch*,headq))>
<!ELEMENT branch (facility)>
...
{ facility, headq }
department[branch/facility][headq]
department
{ facility, facility }
branch
headq
facility
department
branch
branch
facility
facility
department[branch/facility][branch/facility]
7
Techniques for Query Extraction
incremental query
construction (candidate networks)
subtree enumeration
on the schema graph
• Discover/XKeyword [Balmin et al.]
• DBXploer [Agrawal et al.]
• SPARK [Lou et al.]
• ExQueX [K et al.]
• QUICK [Zenz et al.]
• Q System [Talukdar et al.]
• Acyclic queries ─ tree patterns
• Ranking dominated by size: smaller = better
8
Example of Query Extraction
employee
eid name
dept
type
dept
id
e22 Marie manager
a
a
db
e22
e68 Jacob
b
b
sales
e78
regular
name head
regular
employee
regular
employee
regular
employee
regular
dept
manager
dept
employee
dept
regular
employee
employee
manager
regular
manager
manager
employee
manager
dept
employee
manager
9
Neighborhood Constraints
employee
dept
dept
regular
manager
dept
#≤1
#≤1
#=1
, ( regular ? | ( manager , dept* ) )
possible neighbors of employee
dept
employee
employee
#=1
possible neighbors of dept
10
Efficiency Matters!
Existing solutions usually experiment on tiny schemas; an exception:
Q System [Talukdar et al, VLDB’08]
• 408 relations
• 1366 table references
w/o taking constraints into
account, larger K needed!
Speed is important
since patterns are typically extracted following a user query…
Schemas can be much larger, e.g., SAP:
…company data model with over 13,000 database tables. To
manage the meta data (e.g., types and interrelationships) of
these tables…
Kemper et al., “Performance tuning for
SAP R/3”, IEEE Data Eng. Bull., 1999.
11
Techniques for Query Extraction
incremental query
construction (candidate networks)
subtree enumeration
on the schema graph
• Discover/XKeyword [Balmin et al.]
• DBXploer [Agrawal et al.]
• SPARK [Lou et al.]
• ExQueX [K et al.]
• QUICK [Zenz et al.]
• Q System [Talukdar et al.]
No guarantee on the
running time
Can construct exponentially many
intermediate, partial patterns, before
finding even 1 tree pattern
Repeated labels
are not allowed
employee
regular
dept
employee
manager
No neighborhood constraints
12
In the Paper
Finding a Minimal Tree Pattern
Under Neighborhood Constraints
• 1st provably efficient algorithms allowing
expressive constraints (and repeated labels)
• Future work: extending to top-k
– In the paper: subtle issues in defining top-k
• Existing techniques for top-k repeatedly solve
top-1, w/ additional constraints
–
–
–
–
Shortest simple paths [Yen, 1971]
Smallest Steiner trees [K & S, 2006]
Most probable answers over Markov sequences [K & Ré, 2010]
…
13
Abstraction: Schema (undirected)
finite set of labels
label1
label2
bag-of-labels1
bag-of-labels2
bag-of-labels3
bag-of-labels17
bag-of-labels18
bag-of-labels19
..
.
..
.
...
labeln
bag-of-labels54
bag-of-labels55
bag-of-labels56
..
.
neighborhood constrains
each constraint is a (possibly infinite) set of bags, in some rep. language
14
Problem Definition (undirected model)
label1
label2
bag-of-labels1
bag-of-labels2
bag-of-labels3
bag-of-labels17
bag-of-labels18
bag-of-labels19
..
.
...
..
.
schema
S
...
labeln
input
weights
allowed
bag Λ of labels
bag-of-labels54
bag-of-labels55
bag-of-labels56
e.g., { Mary, manager},
{country, country, border}
..
.
goal: minimal tree T, s.t.
• T contains Λ
• T is consistent with S
∀ nodes v∈T:
labels( nbrs(v) ) ⊆ some bag-of-labelsi
of those of label(v)
v
label2
⊆
bag-of-labels17
bag-of-labels18
bag-of-labels19
..
.
15
label1
label2
labeln
bag-of-labels
bag-of-labels11
bag-of-labels
bag-of-labels22
bag-of-labels
bag-of-labels3
bag-of-labels
bag-of-labels1717
bag-of-labels
bag-of-labels1818
bag-of-labels
bag-of-labels19
bag-of-labels
bag-of-labels5454
bag-of-labels
bag-of-labels5555
bag-of-labels
bag-of-labels56
Simple Example
publication
#=1 title
...
...
#≤1 journal
...
...
...
19
conference#≤1
VLDB
publication
...
...
3
PODS
56
author
author
journal
conference
publication
publication
VLDB
PODS
conference
conference
VLDB
PODS
publication
conference
conference
VLDB
PODS
16
Complexity Measures
input:
 Λ (bag of labels)
 S (schema)
output:
min T
s.t. Λ ⊆ T & T is consistent w/ S
(constraints in rep. language )
NP-hard, already under trivial rep. languages…
(even to approx.)
But Λ is typically tiny!
derived from a user-phrased (search) query
fixed |Λ|
“efficient” = e.g.,
O(|S||Λ|)
possible under a very general
condition on the rep. language
(not discussed, see paper)
even better: FPT, i.e.
Fixed-Parameter Tractable
“efficient” = e.g., O(2|Λ||S|2)
next slides
17
Rep: Mutual-Exclusion Graphs
#≤1 journal
publication
author
#≤5
conference#≤1
interval graph
#≤1 conf-date
title
#=
1
edition #≤1
Theorem: not FPT
(under standard assumptions; FPT reduction from parameterized
Independent-Set, which is W[1]-hard [Abrahamson et al., 93])
Theorem: FPT for classes of mux graphs:
disjoint cliques
more generally,
interval graphs
more generally,
circular-arc graphs
18
Rep: Regular Expressions
e := label | ε | e* | e? | e,e | e|e
publication
title, author*, (journal | (conference, conf-date) | edition)
Theorem: FPT
19
Proof Strategy (Algorithm)
input:
 Λ (bag of labels)
 S (schema)
output:
min T
s.t. Λ ⊆ T & T is consistent w/ S
(nontrivial) adaptation of Dreyfus &
Wagner’s algorithm for Steiner trees
reduction
labeled bag cover
generalized minimum set cover: bags instead of sets,
cover needs to satisfy a neighborhood constraint
dynamic prog. for disjoint-cliques mux graphs
works for interval graphs
regular expressions
Thanks: Ryan &
Virginia Williams
circular-arc graphs
20
Summary
Questions?
• Plethora of tools extract minimal tree patterns
from the schema (relational / XML /…)
– Existing solutions: exptime and/or no repeated labels;
not allowing even basic neighborhood constraints
• Simple & general abstraction of the problem
– allowing neighborhood constraints
• Algorithms
– In particular, our FPT alg.s find a min pattern under 2
constraint languages: regexp & mutual exclusion
• Next step: based on this work, top-k
21
Descargar

Document