TopX
Efficient & Versatile
Top-k Query Processing for
Semistructured Data
Martin Theobald
Max Planck Institute for Computer Science
Stanford University
Joint work with
Ralf Schenkel, Gerhard Weikum
Example query (NEXI / XPath-like full-text):

  //article[.//bib[about(.//item, “W3C”)]]
    //sec[about(.//, “XML retrieval”)]
    //par[about(.//, “native XML databases”)]

[Figure: the query matched against two sample article trees with titles (“Current Approaches to XML Data Management”, “The Ontology Game”, “The Dirty Little Secret”), sections, paragraphs (“Native XML data base systems can store schemaless data …”, “What does XML add for retrieval? It adds formal ways …”), and bibliographies (e.g. the item “XML-QL: A Query Language for XML.” in “Proc. Query Languages Workshop, W3C, 1998.” with url “w3c.org/xml”), illustrating RANKING of result elements, VAGUENESS of the about() conditions, and PRUNING of low-scoring candidates.]
Goal: Efficiently retrieve the best (top-k) results of a similarity query

• Extend existing threshold algorithms over inverted lists [Güntzer, Balke & Kießling, VLDB ’00; Fagin, PODS ’01] to XML data and XPath-like full-text search
• Non-schematic, heterogeneous data sources
• Efficiently support IR-style vague search
• Combined inverted index for content & structure
• Avoid full index scans; postpone expensive random accesses to large disk-resident data structures
• Exploit cheap disk space for redundant index structures
XML-IR: History and Related Work

1995
• Web query languages: W3QS (Technion Haifa), Araneus (U Roma), Lorel (Stanford U), WebSQL (U Toronto)
• IR on structured docs (SGML): OED etc. (U Waterloo), HySpirit (U Dortmund), HyperStorM (GMD Darmstadt), WHIRL (CMU)
• XML query languages: XML-QL (AT&T Labs)

2000
• XPath 1.0 (W3C); later XPath 2.0 & XQuery 1.0 (W3C)
• IR on XML: XIRQL & HyRex (U Dortmund), XXL & TopX (U Saarland / MPII), ApproXQL (U Berlin / U Munich), ELIXIR (U Dublin), JuruXML (IBM Haifa), XSearch (Hebrew U), Timber (U Michigan), XRank & Quark (Cornell U), FleXPath (AT&T Labs), XKeyword (UCSD)

2005
• TeXQuery (AT&T Labs), NEXI (INEX Benchmark)
• XPath 2.0 & XQuery 1.0 Full-Text (W3C)
• Commercial software: MarkLogic, Verity?, IBM?, Oracle?, ...
TopX System Architecture

[Figure: system architecture.]
• Frontends: Web interface, Web service, API
• Query processing time: (1) TopX query processor with candidate queue & candidate cache, top-k queue, SA scan threads (sequential accesses), and auxiliary predicates (random accesses); (2) probabilistic index access scheduling; (3) probabilistic candidate pruning; (4) dynamic query expansion against an ontology / large thesaurus (WordNet, OpenCyc, etc.)
• Index metadata: selectivities, histograms, correlations
• Indexing time: indexer/crawler feeding a DBMS with inverted lists over a unified text & XML schema
Outline
1. Top-k XPath Processing
2. Probabilistic Index Access Scheduling
3. Probabilistic Candidate Pruning
4. Dynamic Query Expansion
5. Experiments: TREC & INEX Benchmarks
Data Model
“xml data manage xml manage system
“xml data manage xml manage system
vary
wide
expressive
power
native
xml
vary
wide
expressive
power
native
xml
native
base
store
dataxml
basedata
native
xmlsystem
data base
system
store schemaless data“
ftf (“xml”,
<article>
<title>XML Data Management
article1 ) = 4
article
1
6
“native xml
database
base
xml data
native xml
xml data
database
base
system
system store
store
schemaless data“
data“
sec schemaless
</title>
<abs>XML management systems vary
widely in their expressive power.
title
abs
2
2 1 3
4
5
</abs>
<sec>
“xml
“xml manage
<title>Native XML Data Bases.
data
system vary
title
</title>
manage” wide expressive
5 3
<par>Native XML data base systems
power“
4
can store schemaless data.
“native xml
</par>
data base”
</sec>
</article>
ftf (“xml”,
sec ) = 2
 XML trees (no XLinks or ID/IDref attributes)
 Pre-/postorder node labels
 Redundant full-content text nodes (w/stemming, no stopwords)
par
6
4
“native xml data
base system
store
schemaless
data“
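The labeling scheme fits in a few lines; the following is a toy sketch (not TopX code) that assigns pre-/postorder labels by DFS and derives full-content term frequencies, skipping the stemming and stopword removal the slide mentions:

```python
# Toy sketch (not TopX code): pre-/postorder labels via DFS plus
# redundant full-content term frequencies (ftf) per element.
import re
import xml.etree.ElementTree as ET
from collections import Counter

DOC = """<article><title>XML Data Management</title>
<abs>XML management systems vary widely in their expressive power.</abs>
<sec><title>Native XML Data Bases.</title>
<par>Native XML data base systems can store schemaless data.</par></sec>
</article>"""

def label(root):
    """Return {pre: (tag, post, ftf_counter)} for every element."""
    pre, post, out = [0], [0], {}
    def visit(elem):
        pre[0] += 1
        my_pre = pre[0]
        # Redundant full content: all text below this element.
        terms = re.findall(r"\w+", " ".join(elem.itertext()).lower())
        for child in elem:
            visit(child)
        post[0] += 1
        out[my_pre] = (elem.tag, post[0], Counter(terms))
    visit(root)
    return out

labels = label(ET.fromstring(DOC))
tag, post, ftf = labels[1]            # the article root
print(tag, (1, post), ftf["xml"])     # article (1, 6) 4
tag, post, ftf = labels[4]            # the sec element
print(tag, (4, post), ftf["xml"])     # sec (4, 5) 2
```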
Scoring Model [INEX ’06/’07]

• XML-specific extension to Okapi BM25 (originating from probabilistic IR on unstructured text)
• ftf (full-content term frequency) instead of tf
• ef (element frequency) instead of df
• Element-type-specific length normalization
• Tunable parameters k1 and b

Example: the same term is scored against different statistics for bib[“transactions”] vs. par[“transactions”].
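A plausible instantiation in standard BM25 notation (a sketch assuming the usual Okapi form with the substitutions above; the per-element-type statistics N_A, ef_A, and avglen_A are my notation, not taken from the slide):

```latex
s(t, e) = \frac{(k_1 + 1)\,\mathrm{ftf}(t, e)}{K + \mathrm{ftf}(t, e)}
          \cdot \log \frac{N_A - \mathrm{ef}_A(t) + 0.5}{\mathrm{ef}_A(t) + 0.5},
\qquad
K = k_1 \left( (1 - b) + b \cdot \frac{\mathrm{len}(e)}{\mathrm{avglen}_A} \right)
```

Here e is an element of type (tag) A, so bib[“transactions”] and par[“transactions”] use different element frequencies and length normalizations, which is exactly the point of the example above.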
Fagin’s NRA at a Glance [PODS ’01]

Corpus: d1, …, dn; query: q = (t1, t2, t3); e.g. s(t1, d10) = 0.8, s(t2, d10) = 0.6, s(t3, d10) = 0.7

Find the top-k documents that maximize s(t1, d) + s(t2, d) + … + s(tm, d):

  NRA(q, L):
    scan all lists Li (i = 1..m) in parallel; consider doc d at position pos_i:
      E(d) := E(d) ∪ {i}
      high_i := s(t_i, d)
      worstscore(d) := ∑ s(t_j, d) for j ∈ E(d)
      bestscore(d)  := worstscore(d) + ∑ high_j for j ∉ E(d)
      if worstscore(d) > min-k then
        add d to the top-k
        min-k := min { worstscore(d’) | d’ ∈ top-k }
      else if bestscore(d) > min-k then
        candidates := candidates ∪ {d}
      if max { bestscore(d’) | d’ ∈ candidates } ≤ min-k then
        return top-k

  → non-conjunctive (“andish”) evaluation

Inverted index (each list sorted by descending score):

  t1: d78:0.9, d23:0.8, d10:0.8, d1:0.7, d88:0.2, …
  t2: d64:0.8, d23:0.6, d10:0.6, d13:0.2, d78:0.1, …
  t3: d10:0.7, d78:0.5, d64:0.4, d99:0.2, d34:0.1, …

Example run for k = 1 (worstscore / bestscore per candidate):

  scan depth 1: d78 0.9/2.4, d64 0.8/2.4, d10 0.7/2.4
  scan depth 2: d78 1.4/2.0, d23 1.4/1.9, d64 0.8/2.1, d10 0.7/2.1
  scan depth 3: d10 2.1/2.1, d78 1.4/2.0, d23 1.4/1.8, d64 1.2/2.0 → STOP!

At depth 3, min-k = worstscore(d10) = 2.1 and no candidate’s bestscore exceeds it, so NRA terminates early, whereas the naive “merge-then-sort” baseline scans all m lists completely, at O(m·n) runtime and access cost.
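The pseudocode above translates almost directly into executable form. Here is a toy sketch (assumed simplifications, not TopX code: synchronized round-robin scan depths, complete in-memory lists) that reproduces the example run, stopping at scan depth 3 for k = 1:

```python
# Minimal NRA sketch over the score-sorted lists from the slide.
import heapq

LISTS = [
    [("d78", 0.9), ("d23", 0.8), ("d10", 0.8), ("d1", 0.7), ("d88", 0.2)],
    [("d64", 0.8), ("d23", 0.6), ("d10", 0.6), ("d13", 0.2), ("d78", 0.1)],
    [("d10", 0.7), ("d78", 0.5), ("d64", 0.4), ("d99", 0.2), ("d34", 0.1)],
]

def nra(lists, k):
    m = len(lists)
    high = [l[0][1] for l in lists]       # last score seen per list
    seen = {}                             # doc -> {list index: score}
    for depth in range(max(len(l) for l in lists)):
        for i, l in enumerate(lists):     # one round-robin sorted access each
            if depth < len(l):
                doc, s = l[depth]
                high[i] = s
                seen.setdefault(doc, {})[i] = s
        worst = {d: sum(sc.values()) for d, sc in seen.items()}
        best = {d: worst[d] + sum(high[i] for i in range(m) if i not in sc)
                for d, sc in seen.items()}
        topk = heapq.nlargest(k, worst, key=worst.get)
        mink = worst[topk[-1]] if len(topk) == k else 0.0
        # Threshold test: stop once neither a seen candidate nor any
        # still-unseen document can overtake the current top-k.
        if all(best[d] <= mink for d in seen if d not in topk) \
           and sum(high) <= mink:
            break
    return [(d, worst[d], best[d]) for d in topk], depth + 1

result, depth = nra(LISTS, k=1)
print(result, "after scan depth", depth)   # d10 (2.1/2.1) after depth 3
```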
Inverted Block-Index for Content & Structure

Query:
  //sec[about(.//, “XML”) and about(.//title, “native”)]
  //par[about(.//, “retrieval”)]

One inverted list per (tag, term) key; each entry carries (eid, docid, score, pre, post) plus the per-document block maxscore used for sorted access (RA markers denote candidate entries probed by random access):

  sec[“xml”]:
    eid  docid  score  pre  post  max-score
    46   2      0.9    2    15    0.9
    9    2      0.5    10   8     0.9
    171  5      0.85   1    20    0.85
    84   3      0.1    1    12    0.1

  title[“native”]:
    216  17     0.9    2    15    0.9
    72   3      0.8    14   10    0.8
    51   2      0.5    4    12    0.5
    671  31     0.4    12   23    0.4

  par[“retrieval”]:
    3    1      1.0    1    21    1.0
    28   2      0.8    8    14    0.8
    182  5      0.75   3    7     0.75
    96   4      0.75   6    4     0.75

• Sorted (= sequential) access to large element blocks on disk
• Group elements in descending order of (maxscore, docid)
• Block-scan all elements per doc for a given (tag, term) key
• Stored as inverted files or database tables
• Two B+-tree indexes over the full range of attributes (IOTs in Oracle)
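The block grouping can be made concrete with a toy sketch (assumed layout, not the actual TopX storage code): postings for one (tag, term) key are grouped per document, and the resulting blocks are ordered by descending maxscore so that sequential scans see the most promising documents first.

```python
# Toy sketch: block-order an inverted list for one (tag, term) key.
# (eid, docid, score, pre, post) postings for sec["xml"] from the slide.
postings = [
    (46, 2, 0.9, 2, 15), (9, 2, 0.5, 10, 8),
    (171, 5, 0.85, 1, 20), (84, 3, 0.1, 1, 12),
]

def block_order(postings):
    """Group postings per docid; sort blocks by descending maxscore."""
    bydoc = {}
    for p in postings:
        bydoc.setdefault(p[1], []).append(p)
    blocks = []
    for docid, elems in bydoc.items():
        maxscore = max(e[2] for e in elems)
        elems.sort(key=lambda e: -e[2])   # best element of the doc first
        blocks.append((maxscore, docid, elems))
    blocks.sort(key=lambda b: (-b[0], b[1]))
    return blocks

for maxscore, docid, elems in block_order(postings):
    print(docid, maxscore, elems)   # doc 2 first (0.9), then doc 5, doc 3
```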
Navigational Element Index

Query:
  //sec[about(.//title, “native”)]
  //par[about(.//, “retrieval”)]

The unconditioned tag sec is resolved through a navigational index with plain (eid, docid, pre, post) entries, probed via random accesses (RA), while title[“native”] and par[“retrieval”] are scanned as score-ordered content lists as before:

  sec:
    eid  docid  pre  post
    46   2      2    15
    9    2      10   8
    171  5      1    20
    84   3      1    12

• Additional index for tag paths
• RAs on a B+-tree index using (docid, tag) as key
• Few & judiciously scheduled “expensive predicate” probes
• Schema-oblivious indexing & querying
• Non-schematic XML data (no DTD required)
• Supports full NEXI syntax & all 13 XPath axes (+ level)
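With pre-/postorder labels, the structural joins behind these probes reduce to integer comparisons. A standard sketch (the well-known interval containment test, not TopX-specific code):

```python
# Pre-/postorder containment: u is an ancestor of v iff
# pre(u) < pre(v) and post(u) > post(v).
def is_descendant(anc, desc):
    """anc and desc are (pre, post) labels; True if desc is inside anc."""
    return anc[0] < desc[0] and anc[1] > desc[1]

sec   = (2, 15)   # sec element 46 of doc 2 from the index above
title = (4, 12)   # title["native"] element 51 of doc 2
print(is_descendant(sec, title))   # True: the title satisfies //sec//title
```

This is why the index stores (pre, post) alongside the scores: every XPath axis test during query processing is an O(1) comparison rather than a tree traversal.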
TopX Query Processing Example

Query (top-2 results):
  //sec[about(.//, “XML”) and about(.//title, “native”)]
  //par[about(.//, “retrieval”)]

Index lists, block-ordered, entries (eid, docid, score, pre, post):
  sec[“xml”]:       (46, 2, 0.9, 2, 15), (9, 2, 0.5, 10, 8), (171, 5, 0.85, 1, 20), (84, 3, 0.1, 1, 12)
  title[“native”]:  (216, 17, 0.9, 2, 15), (72, 3, 0.8, 14, 10), (51, 2, 0.5, 4, 12), (671, 31, 0.4, 12, 23)
  par[“retrieval”]: (3, 1, 1.0, 1, 21), (28, 2, 0.8, 8, 14), (182, 5, 0.75, 3, 7), (96, 4, 0.75, 6, 4)

[Figure: scan trace. Documents doc1, doc2, doc3, doc5, doc17, plus a “pseudodoc” bounding all unseen documents, enter the candidate queue with [worstscore, bestscore] intervals that tighten as more entries are seen, e.g. doc2 grows worst = 0.9 → 1.7 → 2.2 while bestscores shrink (2.9 → 2.8 → … → 1.35 for doc1). The min-2 threshold rises 0.0 → 0.5 → 0.9 → 1.0 → 1.6; candidates whose bestscore drops below min-2 are pruned. Final top-2: doc2 (worstscore 2.2) and doc5 (worstscore 1.6).]
Outline
1. Top-k XPath Processing
2. Probabilistic Index Access Scheduling
3. Probabilistic Candidate Pruning
4. Dynamic Query Expansion
5. Experiments: TREC & INEX Benchmarks
Index Access Scheduling [VLDB ’06]

• SA scheduling:
  • Look-ahead Δi through precomputed score histograms
  • Knapsack-based optimization of the aggregated score reduction (e.g., Δ1,3 = 0.8 vs. Δ3,3 = 0.2 for scanning three more blocks of lists 1 and 3)
• RA scheduling:
  • 2-phase probing: schedule RAs “late & last”, cleaning up the remaining candidates in the queue
  • Extended probabilistic cost model for integrating SA & RA scheduling

[Figure: inverted block index with per-block maxscores (1.0, 0.9, 0.8, 0.7, 0.6, 0.2, …) and the look-ahead score reductions Δ1,3 = 0.8 and Δ3,3 = 0.2.]
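The knapsack view can be made concrete with a toy dynamic program (all batch sizes and Δ values invented for illustration; the actual probabilistic cost model is in the VLDB ’06 paper): choose how many blocks to scan per list so that the total expected score reduction is maximized under a fixed batch budget.

```python
# Toy knapsack-style SA scheduling sketch (invented numbers).
from functools import lru_cache

# delta[i][b] = expected reduction of high_i after scanning b more
# blocks of list i, estimated from precomputed score histograms.
delta = [
    [0.0, 0.5, 0.8],    # list 1: Δ1,2 = 0.8
    [0.0, 0.1, 0.15],   # list 2
    [0.0, 0.1, 0.2],    # list 3: Δ3,2 = 0.2
]
B = 3  # total blocks we may scan in this batch

@lru_cache(maxsize=None)
def best(i, budget):
    """Max total score reduction using lists i.. with 'budget' blocks."""
    if i == len(delta):
        return 0.0, ()
    options = []
    for b in range(min(budget, len(delta[i]) - 1) + 1):
        value, plan = best(i + 1, budget - b)
        options.append((delta[i][b] + value, (b,) + plan))
    return max(options)

value, plan = best(0, B)
print(plan, value)   # (2, 1, 0) -> 0.9: spend most of the batch on list 1
```

The point of the sketch is the shape of the decision, not the numbers: lists whose high_i would drop fastest get the sequential accesses, which tightens all bestscore bounds the most per unit of scan cost.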
Outline
1. Top-k XPath Processing
2. Probabilistic Index Access Scheduling
3. Probabilistic Candidate Pruning
4. Dynamic Query Expansion
5. Experiments: TREC & INEX Benchmarks
Probabilistic Candidate Pruning [VLDB ’04]

• Convolutions of per-list score distributions (assuming independence) yield P[d gets into the final top-k]
• Per-list score histograms f1, f2 over [0, high_i] (e.g., for title[“native”] and par[“retrieval”]) are convolved, or sampled, to estimate whether candidate d can still close its score gap δ(d) to min-k
• Probabilistic candidate pruning: drop d from the candidate queue if
    P[d gets into the final top-k] < ε
• With probabilistic guarantees for precision & recall

[Figure: histograms f1 and f2 on [0, 1], their convolution, and the gap δ(d).]
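A sketch of the pruning test under the stated independence assumption (the discretization, bin counts, and toy histograms below are invented for illustration):

```python
# Sketch: convolve score histograms of the lists where candidate d is
# still unseen; prune d if P[remaining score > delta(d)] < eps.
import numpy as np

BINS, HIGH = 100, 1.0           # scores discretized on [0, HIGH]

def convolve(hists):
    """Distribution of the sum of independent per-list scores."""
    dist = np.array([1.0])      # point mass at 0
    for h in hists:
        dist = np.convolve(dist, h / h.sum())
    return dist

def prune(d_worst, min_k, hists, eps):
    dist = convolve(hists)
    delta = min_k - d_worst     # score mass d still needs: delta(d)
    cut = int(np.ceil(delta / (HIGH / BINS)))
    p_top = dist[cut:].sum()    # P[sum of unseen scores > delta(d)]
    return p_top < eps

# Toy uniform histograms standing in for title["native"], par["retrieval"].
h1 = np.ones(BINS); h2 = np.ones(BINS)
print(prune(d_worst=0.9, min_k=1.6, hists=[h1, h2], eps=0.1))
# False here: P ~ 0.75 > eps, so d keeps its place in the queue.
```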
Outline
1. Top-k XPath Processing
2. Probabilistic Index Access Scheduling
3. Probabilistic Candidate Pruning
4. Dynamic Query Expansion
5. Experiments: TREC & INEX Benchmarks
Dynamic Query Expansion [SIGIR ’05]

Example (TREC Robust topic #363): Top-k(transport, tunnel, ~disaster) with the expansion ~disaster → {disaster, accident, fire}:

  SA  transport: d66, d93, d95, d101, …
  SA  tunnel:    d95, d17, d11, d99, …
  SA  ~disaster: disaster: d42, d11, d92, …
                 accident: d37, d42, d32, d87, …
                 fire:     d78, d10, d11, d1, …

• Incremental merging of the inverted lists for expansion terms ti,1 … ti,m in descending order of s(ti,j, d)
• Incremental Merge operator
• Nested Top-k operator (efficient phrase matching)
• Best-match score aggregation
• Boolean (but ranked) retrieval mode
• Specialized expansion operators
• Supports any sorted inverted index for text, structured records & XML
Incremental Merge Operator

Expansion ~t = { t1, t2, t3 } with similarities sim(t, t1) = 1.0, sim(t, t2) = 0.9, sim(t, t3) = 0.5, obtained from thesaurus lookups / relevance feedback and large-corpus term correlations, with index-list metadata (e.g., histograms, initial high-scores):

  t1: d78:0.9, d23:0.8, d10:0.8, d1:0.4, d88:0.3, …
  t2: d64:0.8, d23:0.8, d10:0.7, d12:0.2, d78:0.1, …   (× 0.9 → 0.72, 0.72, 0.63, 0.18, …)
  t3: d11:0.9, d78:0.9, d64:0.7, d99:0.7, d34:0.6, …   (× 0.5 → 0.45, 0.45, 0.35, 0.35, …)

Merged list ~t, scanned via SAs in descending weighted score order:

  d78:0.9, d23:0.8, d10:0.8, d64:0.72, d23:0.72, d10:0.63, d11:0.45, d78:0.45, d1:0.4, d88:0.3, …

Meta histograms seamlessly integrate Incremental Merge operators into probabilistic scheduling and candidate pruning.
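A sketch of the lazy merge itself (assumed interface, not TopX code: score-descending lists plus similarity weights), reproducing the merged prefix above:

```python
# Lazily merge expansion lists, weighting each by sim(t, t_i), so the
# output comes out in globally descending weighted score order.
import heapq

def incremental_merge(lists, sims):
    """lists[i] is a score-descending [(doc, s)]; sims[i] = sim(t, t_i)."""
    heap = []
    for i, lst in enumerate(lists):
        if lst:
            doc, s = lst[0]
            # Negate for a max-heap; remember list index and position.
            heapq.heappush(heap, (-s * sims[i], i, 0, doc))
    while heap:
        neg, i, pos, doc = heapq.heappop(heap)
        yield doc, -neg
        if pos + 1 < len(lists[i]):         # pull the next entry lazily
            doc2, s2 = lists[i][pos + 1]
            heapq.heappush(heap, (-s2 * sims[i], i, pos + 1, doc2))

t1 = [("d78", 0.9), ("d23", 0.8), ("d10", 0.8), ("d1", 0.4), ("d88", 0.3)]
t2 = [("d64", 0.8), ("d23", 0.8), ("d10", 0.7), ("d12", 0.2), ("d78", 0.1)]
t3 = [("d11", 0.9), ("d78", 0.9), ("d64", 0.7), ("d99", 0.7), ("d34", 0.6)]
for doc, s in incremental_merge([t1, t2, t3], sims=[1.0, 0.9, 0.5]):
    print(doc, round(s, 2))   # d78 0.9, d23 0.8, d10 0.8, d64 0.72, ...
```

Because each source list is consumed one entry at a time, the merged list behaves exactly like any other score-sorted inverted list, which is what lets the threshold algorithm, scheduling, and pruning treat expansions transparently.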
Outline
1. Top-k XPath Processing
2. Probabilistic Index Access Scheduling
3. Probabilistic Candidate Pruning
4. Dynamic Query Expansion
5. Experiments: TREC & INEX Benchmarks
TREC Terabyte Benchmark ’05/’06

• Extensive crawl of the .gov domain (2004)
• 25 million documents, 426 GB of text data
• 50 ad-hoc-style keyword queries, e.g.:
  • reintroduction of gray wolves
  • Massachusetts textile mills
• Primary cost metrics:
  • Cost = #SA + (cR/cS) · #RA
  • Wall-clock runtime
TREC Terabyte: Cost Comparison of Scheduling Strategies [VLDB ’06]
[Figure: access costs of the scheduling strategies.]

TREC Terabyte: Wall-Clock Runtimes [VLDB ’06 / TREC ’06]
[Figure: wall-clock runtimes.]
INEX Benchmark ’06/’07

• New XMLified Wikipedia corpus
• 660,000 documents with 130,000,000 elements (6.6 GB of XML data)
• 125 NEXI queries, each in a content-only (CO) and a content-and-structure (CAS) formulation
  • CO: +“state machine” figure Mealy Moore
  • CAS: //article[about(., “state machine”)] //figure[about(., Mealy) or about(., Moore)]
• Primary cost metric: Cost = #SA + (cR/cS) · #RA
TopX vs. Full-Merge

[Figure: cost in millions of accesses (0 to 40) vs. k ∈ {10, 20, 50, 100, 500, 1000}, for full-merge and for TopX with ε = 0.0 and ε = 0.1, each in CO and CAS formulations.]

• Significant cost savings for large ranges of k
• CAS is cheaper than CO!
Static vs. Dynamic Expansions

[Figure: #SA and #RA in millions (0 to 120) for CAS full-merge vs. TopX with static and with dynamic expansions.]

• Query expansions with up to m = 292 keywords & phrases
• Balanced amount of sorted vs. random disk accesses
• Adaptive scheduling wrt. the cR/cS cost ratio
• Dynamic expansions outperform static expansions & full-merge in both efficiency & effectiveness
Efficiency vs. Effectiveness

[Figure: relative precision and relative cost (0.0 to 1.0) as functions of the pruning threshold ε ∈ [0.0, 1.0], for CO and CAS runs.]

• Very good precision/runtime ratio for probabilistic pruning
Official INEX ’06 Results

[Figure: retrieval effectiveness, ranks 3 to 5 out of ~60 submitted runs.]
Conclusions & Outlook

• Scalable XML-IR and vague search
• Mature system; reference engine for INEX topic development & interactive tracks
• Efficient and versatile Java prototype for text, XML, and structured data (Oracle backend)
• Very efficient prototype reimplementation for text data in C++ (over its own file structures)
• C++ version for XML currently in production at MPI
• More features: graph top-k, proximity search, XQuery subset, …