From Semistructured Data to XML
Dan Suciu
AT&T Labs
http://www.research.att.com/~suciu/vldb99-tutorial.pdf
How the Web is Today
• HTML documents
• all intended for human consumption
• many generated automatically by
applications
Easy to fetch any Web page, from any server, any platform
Limits of the Web Today
• application cannot consume HTML
• HTML wrapper technology is brittle
– screen scraping
• OO technology (Corba) requires
controlled environment
• companies merge, form partnerships;
need interoperability fast
people are inventive: send data by fax !
Paradigm Shift on the Web
• new Web standard XML:
– XML generated by applications
– XML consumed by applications
• data exchange
– across platforms: enterprise interoperability
– across enterprises
Web: from collection of documents to data and documents
Database Community Can Help
•
•
•
•
•
query optimization, processing
views, transformations
data warehouses, data integration
mediators, query rewriting
secondary storage, indexes
But Needs a Paradigm Shift Too
• Web data differs from database data:
– self-describing, schema-less
– structure changes without notice
– heterogeneous, deeply nested, irregular
– documents and data mixed together
• designed by document, not db experts
• need Web data management
What This Tutorial is About
• what the database community has done
– semistructured data model
– query languages, schemas
• what the Web community has done:
– data formats/models: XML, RDF
– transformation language (XSL), schemas
• where they meet and where they differ
Outline
•
•
•
•
•
Semistructured data and XML
Query languages
Schemas
Systems issues
Conclusions
Part 1
Semistructured Data and XML
Semistructured Data
Origins:
• integration of heterogeneous sources
• data sources with non-rigid structure
• biological data
• Web data
The Semistructured Data Model
Bib
&o1
complex object
paper
paper
book
references
&o12
&o24
references
author
title
year
&o29
references
author
http
page
author
title publisher
title
author
author
author
&o43
&25
&96
1997
last
firstname
lastname
atomic object
firstname
lastname
&243
“Serge”
“Abiteboul”
“Victor”
Object Exchange Model (OEM)
first
&206
“Vianu”
122
133
Syntax for Semistructured Data
Bib: &o1 { paper: &o12 { … },
book: &o24 { … },
paper: &o29
{ author: &o52 “Abiteboul”,
author: &o96 { firstname: &243 “Victor”,
lastname: &o206 “Vianu”},
title: &o93 “Regular path queries with constraints”,
references: &o12,
references: &o24,
pages: &o25 { first: &o64 122, last: &o92 133}
}
}
Syntax for Semistructured Data
May omit oid’s:
{ paper: { author: “Abiteboul”,
author: { firstname: “Victor”,
lastname: “Vianu”},
title: “Regular path queries …”,
page: { first: 122, last: 133 }
}
}
Characteristics of
Semistructured Data
•
•
•
•
missing or additional attributes
multiple attributes
different types in different objects
heterogeneous collections
self-describing, irregular data, no a priori structure
Comparison with Relational Data
row
nam e
phone
John
3634
Sue
6343
D ic k
6363
row
row
name phone name phone name phone
“John” 3634 “Sue” 6343 “Dick”
6363
{ row: { name: “John”, phone: 3634 },
row: { name: “Sue”, phone: 6343 },
row: { name: “Dick”, phone: 6363 }
}
XML
• a W3C standard to complement HTML
• origins: structured text SGML
• motivation:
– HTML describes presentation
– XML describes content
•
HTML4.0
 XML  SGML
• http://www.w3.org/TR/REC-xml (2/98)
From HTML to XML
HTML describes the presentation
HTML
<h1> Bibliography </h1>
<p> <i> Foundations of Databases </i>
Abiteboul, Hull, Vianu
<br> Addison Wesley, 1995
<p> <i> Data on the Web </i>
Abiteoul, Buneman, Suciu
<br> Morgan Kaufmann, 1999
XML
<bibliography>
<book> <title> Foundations… </title>
<author> Abiteboul </author>
<author> Hull </author>
<author> Vianu </author>
<publisher> Addison Wesley </publisher>
<year> 1995 </year>
</book>
…
</bibliography>
XML describes the content
XML Terminology
•
•
•
•
•
•
tags: book, title, author, …
start tag: <book>, end tag: </book>
elements: <book>…<book>,<author>…</author>
elements are nested
empty element: <red></red> abbrv. <red/>
an XML document: single root element
well formed XML document: if it has matching tags
More XML: Attributes
<book price = “55” currency = “USD”>
<title> Foundations of Databases </title>
<author> Abiteboul </author>
…
<year> 1995 </year>
</book>
attributes are alternative ways to represent data
More XML: Oids and References
<person id=“o555”> <name> Jane </name> </person>
<person id=“o456”> <name> Mary </name>
<children idref=“o123 o555”/>
</person>
<person id=“o123” mother=“o456”><name>John</name>
</person>
oids and references in XML are just syntax
XML Data Model
• does not exists
• Document Object Model (DOM):
–
–
–
–
http://www.w3.org/TR/REC-DOM-Level-1 (10/98)
class hierarchy (node, element, attribute,…)
objects have behavior
defines API to inspect/modify the document
XML Parsers
• traditional: return data structure (DOM?)
• event based: SAX (Simple API for XML)
– http://www.megginson.com/SAX
– write handler for start tag and for end tag
XML Namespaces
• http://www.w3.org/TR/REC-xml-names (1/99)
• name ::= [prefix:]localpart
<book xmlns:isbn=“www.isbn-org.org/def”>
<title> … </title>
<number> 15 </number>
<isbn:number> …. </isbn:number>
</book>
XML Namespaces
• syntactic: <number> , <isbn:number>
• semantic: provide URL for schema
<tag xmlns:mystyle = “http://…”>
defined here
…
<mystyle:title> … </mystyle:title>
<mystyle:number> …
</tag>
XML v.s. Semistructured Data
• both described best by a graph
• both are schema-less, self-describing
Similarities and Differences
<person id=“o123”>
{ person: &o123
<name> Alan </name>
{ name: “Alan”,
<age> 42 </age>
age: 42,
<email> [email protected] </email>
email: [email protected] }
</person>
}
<person father=“o123”> …
</person>
father
person
{ person: { father: &o123 …}
}
person
father
name age email
name
Alan
age
42
email
[email protected]
Alan
similar on trees, different on graphs
42 [email protected]
More Differences
• XML is ordered, ssd is not
• XML can mix text and elements:
<talk> Making Java easier to type and easier to type
<speaker> Phil Wadler </speaker>
</talk>
• XML has lots of other stuff: entities,
processing instructions, comments
RDF
• http://www.w3.org/TR/REC-rdf-syntax
(2/99)
• purpose: metadata for Web
– help search engines
• syntax in XML
• semantics: edge-labeled graphs
RDF Syntax
<rdf:Description about=“www.mypage.com”>
<about> birds, butterflies, snakes </about>
<author> <rdf:Description>
<firstname> John </firstname>
<lastname> Smith </lastname>
</rdf:Description>
</author>
</rdf:Description>
RDF Data Model
www.mypage.com
about
author
birds, butterflies, snakes
firstname
John
lastname
Smith
the RDF Data Model is very close to semistructured data
More RDF Examples
related
www.mypage.com
about
www.anotherpage.com
author
author
birds, butterflies, snakes
author
Joe Doe
firstname
John
lastname
Smith
<rdf:Description about=“www.mypage.com”>
<about> birds, butterflies, snakes </about>
<author> <rdf:Description ID=“&o55”>
<firstname> John </firstname>
<lastname> Smith </lastname>
</rdf:Description> </author>
</rdf:Description>
<rdf:Description about=“www.anotherpage.com”>
<related> <rdf:Description about=“www.mypage.com”/>
</related>
<author rdf:resource=“&o55”/>
<author> Joe Doe </author>
</rdf:Description>
RDF Terminology
subject
predicate
object
statement
O E M
node
la b e l
s o u rc e /la b e l/d e s t
edge
R D F
re so u rc e
p ro p e rty
s u b je c t/p re d ic a te /o b je c t
s ta te m e n t
More RDF: Containers
• bag, sequence, alternative
<rdf:Description> <a> <rdf:Bag>
<rdf:li> s1 </rdf:li>
<rdf:li> s2 </rdf:li>
</rdf:Bag>
</a>
</rdf:Description>
RDF Containers (cont’d)
a
rdf:type
rdf_1
Bag
s1
rdf_2
s2
More RDF: Higher Order
Statements
“the author of www.thispage.com says: ‘the
topic of www.thatpage.com is environment’ “
www.thispage.com
www.thatpage.com
topic
author
says
environment
RDF uses reification
Summary of Data Models
• semistructured data, XML, RDF
• data is self-describing, irregular
• schema embedded in the data
Part 2
Query Languages
•
•
•
•
•
Semistructured data and XML
Query languages
Schemas
Systems issues
Conclusions
Query Languages: Motivation
• granularity of the HTML Web: one file
• granularity of Web data varies:
– single data item: “get John’s salary”
– entire database: “get all salaries”
– aggregates: “get average salary”
• need query language to define granularity
Query Languages: Outline
• for semistructured data:
– Lorel
– UnQL
– StruQL
• for XML: XML-QL
• a different paradigm
– structural recursion
– XSL
Lorel
• part of the Lore system (Stanford)
• adapts OQL to semistructured data
example:
select X.title
from Bib.paper X
where X.year > 1995
select Bib.paper.title
abbreviated to:
from Bib.paper
where Bib.paper.year > 1995
Lorel v.s. OQL
• implicit coercions: 1995 to “1995”
• missing attributes
– empty answer v.s. type error
• set-valued attributes
– in X.year>1995, X may have several years
• regular path expressions (next)
Regular Path Expressions
select X.title
from Bib.paper X, Bib.(paper|book) Y
where Y.author.lastname? = “Ullman”
and Y.reference+ X
Useful for:
• syntactic substitute for inheritance: paper|book
• navigating partially known structures: lastname?
• transitive closure: reference+
UnQL
• Unstructured Query Language
• patterns, templates, structural recursion
• patterns:
select T
where Bib.paper: { title: T, year: Y, journal: “TODS”}
and Y > 1995
UnQL: Templates
select result: { fn: F, ln: L, pub: { title: T, year: Y }}
where Bib.paper: { title: T, year: Y, journal: “TODS”}
and Y > 1995
Result looks like:
{ result: { fn: “John”, ln: “Smith”,
pub: { title: “P equals NP”, year: 2005}},
result: { fn: “Joe”, ln: “Doe”,
pub: { title: “Errata to P=NP”, year: 2006}}
…}
Skolem Functions
• Maier, 1986
– in OO systems
• Kifer et al, 1989
– F-logic
• Hull and Yoshikawa, 1990
– deductive db (ILOG)
• Papakonstantinou et al., 1996
– semistructured db (MSL)
• illustrate with Strudel (next)
Skolem Functions in StruQL
• Strudel: a Web Site Management System
• StruQL: its query language
Example: Bibliography Data
{Bib: { paper: { author: “Jones”,
author: “Smith”,
title: “The Comma”,
year: 1994
}
},
{ paper: ….. }
}
Example: A Complex Web Site
person
Root()
person
person
HomePage(“Smith”)
yearentry
HomePage(“Jones”)
yearentry yearentry
author
YearPage(“Smith”,
1994)
yearentry
publication
PubPage(“The Comma”)
title
yearentry
YearPage(“Jones”,
1994)
YearPage(“Smith”,
1996)
author
HomePage(“Mark”)
YearPage(“Mark”,
1996)
YearPage(“Jones”,
1998)
publication
publication
PubPage(“The Dot”)
publication
title
publication
author
Example:
Skolem Functions in StruQL
where Root -> “Bib” -> X, X -> “paper” -> P,
P -> “author” -> A, P -> “title” -> T, P -> “year” -> Y
create Root(), HomePage(A), YearPage(A,Y), PubPage(P)
link
Root() -> “person” -> HomePage(A),
HomePage(A) -> “yearentry” -> YearPage(A,Y),
YearPage(A,Y) -> “publication” -> PubPage(P),
PubPage(P) -> “author” -> HomePage(A),
PubPage(P) -> “title” -> T
XML-QL:
A Query Language for XML
• http://www.w3.org/TR/NOTE-xml-ql
(8/98)
• features:
– regular path expressions
– patterns, templates
– Skolem Functions
• based on OEM data model
Pattern Matching in XML-QL
where <book language=“french”>
<publisher>
<name> Morgan Kaufmann </name>
</publisher>
<author> $a </author>
</book> in “www.a.b.c/bib.xml”
construct $a
Simple Constructors in XML-QL
where <book language = $l>
<author> $a </>
</> in “www.a.b.c/bib.xml”
construct <result> <author> $a </> <lang> $l </> </>
Note: </> abbreviates </book> or </result> or ...
<result> <author>Smith</author><lang>English</lang></result>
<result> <author>Smith</author><lang>Mandarin</lang></result>
<result> <author>Doe</author><lang>English</lang></result>
Skolem Functions in XML-QL
where <book language = $l>
<author> $a </>
</> in “www.a.b.c/bib.xml”
construct <result> <author id=F($a)> $a</>
<lang> $l </>
</>
<result> <author>Smith</author>
<lang>English</lang> <lang>Mandarin</lang>
</result>
<result> <author>Doe</author> <lang>English</lang> </result>
A Different Paradigm:
Structural Recursion
Data as sets with a union operator:
{a:3, a:{b:”one”, c:5}, b:4} =
{a:3} U {a:{b:”one”,c:5}} U {b:4}
Structural Recursion
Example: retrieve all integers in the data
f(T1 U T2) =
f({L: T})
=
f({})
=
f(V)
=
a
3
f(T1) U f(T2)
f(T)
{}
if isInt(V) then {result: V} else {}
b
a
result
b
c
“one”
5
4
3
result
5
result
4
standard textbook programming on trees
Structural Recursion
Example: increase all engine prices by 10%
f(T1 U T2) = f(T1) U f(T2)
f({L: T}) = if L= engine
then {L: g(T)}
else {L: f(T)}
f({})
= {}
f(V)
= V
engine
part
price
100
body
price part
1000
g(T1 U T2) = g(T1) U g(T2)
g({L: T}) = if L= price
then {L:1.1*T}
else {L: g(T)}
g({})
= {}
g(V)
= V
engine
price
price 1000
100
part
price
110
body
price part
1100
price
price 1000
100
XSL
• two W3C drafts: XSLT and XPATH
– http://www.w3.org/TR/xpath, 7/99
– http://www.w3.org/TR/WD-xslt, 7/99
• in commercial products (e.g. IE5.0)
• purpose: stylesheet specification language:
– stylesheet: XML -> HTML
– in general: XML -> XML
XSL Templates and Rules
• query = collection of template rules
• template rule = match pattern + template
Retrieve all book titles:
<xsl:template> <xsl:apply-templates/> </xsl:template>
<xsl:template match = “/bib/*/title”>
<result> <xsl:value-of/> </result>
</xsl:template>
XPath Expressions in Match
Patterns
bib
*
/
/bib
bib/paper
bib//paper
//paper
paper|book
@price
[email protected]
matches a bib element
matches any element
matches the root element
matches a bib element under root
matches a paper in bib
matches a paper in bib, at any depth
matches a paper at any depth
matches a paper or a book
matches a price attribute
matches price attribute in book, in bib
Flow Control in XSL
<xsl:template> <xsl:apply-templates/> </xsl:template>
<xsl:template match=“a”> <A><xsl:apply-templates/></A>
</xsl:template>
<xsl:template match=“b”> <B><xsl:apply-templates/></B>
</xsl:template>
<xsl:template match=“c”> <C><xsl:value-of/></C>
</xsl:template>
<a> <e> <b> <c> 1 </c>
<c> 2 </c>
</b>
<a> <c> 3 </c>
</a>
</e>
<c> 4 </c>
</a>
<A> <B> <C> 1 </C>
<C> 2 </C>
</B>
<A> <C> 3 </C>
</A>
<C> 4 </C>
</A>
XSL is Structural Recursion
Equivalent to:
f(T1 U T2) = f(T1) U f(T2)
f({L: T}) = if L= c then {C: t}
else L= b then {B: f(t)}
else L= a then {A: f(t)}
else f(t)
f({})
= {}
f(V)
= V
XSL query = single function
XSL query with modes = multiple function
XSL and Structural Recursion
XSL:
• trees only
• may loop
Structural Recursion:
• arbitrary graphs
• always terminates
add the following rule:
<xsl:template match = “e”>
<xsl:apply-patterns select=“/”/>
</xsl:template>
stack overflow on IE 5.0
Summary of Query Languages
•
•
•
•
•
studied extensively in semistructured data
some quite powerful features
no standard for XML QL yet (WG soon)
XSL available today (for stylesheets)
XSL = structural recursion
Part 3
Schemas
•
•
•
•
•
Semistructured data and XML
Query languages
Schemas
Systems issues
Conclusions
Schemas
• why ?
here lies our interest
– XML: to describe semantics
– semistructured data: to improve processing
• what ?
– semistructured data: foundational
– XML: several concrete proposals
Schemas
• when ?
– semistructured data, XML: a posteriori
– RDBMS: a priori, to interpret binary data
• how ?
– semistructured data: schema is independent
– XML: schema is hardwired with the data
Outline
• schemas for semistructured data:
– foundations
– schema extraction
• schemas for XML:
– DTD
– XML-Schema
– RDF-Schema
Schemas: An Example
Some database:
&r1
person
person
company person
manages
company
works-for
employee
&p1
&c1
&p2
&c2 c.e.o.&p3
c.e.o.
works-for
position
phone name address
namepositionworks-for
name
name
nameaddress
&s0 &s1
&s2
&s3
&s4
&s5
&s6 url &s7 &s8
&s9 description
“Smith” “Manager”
“Widget”
“Trenton”
“Jones”
description
“Sales”
&s10
eval &a5 1998
&a4
1997
&a7
&a3 task
&a6 “below target”
“www.gp.fr”
&a1
salesrep
procurement
“Paris” “Dupont”
“5552121” “Gadget”
&a2
contact
“on target”
Lower-Bound Schemas
person
Root
company
works-for
managed-by
Company
c.e.o.
name
address
name
Employee
string
Upper Bound Schemas
person
Root
company
works-for
managed-by
Company
Employee
c.e.o. | employee
name | address | url name | phone | position
description
string
-
Any
The Two Questions to Ask
Conformance: does that data conform to
this schema ?
Classification: if so, then which objects
belong to what classes ?
Graph Simulation
Definition Two edge-labeled graphs G1, G2
A simulation is a relation R between nodes:
• if (x1, x2) in R, and (x1,a,y1) in G1,
then exists (x2,a,y2) in G2 (same label)
s.t. (y1,y2) in R
x1
G1
R
x2
a
a
y1
R
G2
y2
Note: a simulation can be efficiently computed [Henzinger, et a. 1995]
Using Simulation
Data graph D, schema S
• upper bound schema:
– conformance: find simulation R from D to S
– classification: check if (x,c) in R
• lower bound schema
– conformance: find simulation R from S to D
– classification: check if (c,x) in R
[Buneman et al 1997]
Example
person
Root
company
works-for
managed-by
Company
c.e.o.
“Smith” “Manager”
name
address
name
Employee
&r1
person
person
companyperson
manages
company
works-for
&p1 c.e.o. &c1employee&p2
&c2 c.e.o.&p3
works-for
works-for
phone name address position
name
position
name
nameaddress name
&s0 &s1
&s2
&s3
&s4
&s5 &s6 url &s7 &s8 &s9description
string
“Widget”
“Trenton”
“Jones”
&a1
procurement
&a2
“Paris”“Dupont” “Sales”
“5552121”“Gadget”
description
salesrep
contact
works-for
managed-by
Company
Employee
c.e.o. | employee
name | address | url
name | phone | position
&s10
eval &a5 1998
&a4
1997
&a7
&a3 task
&a6 “below target”
“www.gp.fr”
person
Root
company
description
string
-
“on target”
Lower Bound
Database
Upper Bound
simulation: efficient technique for checking conformance to schema
Any
Application 1:
Improve Secondary Storage
Company
person
Root
company
works-for
managed-by
Company
nam e
…
…
a d d re ss
…
…
c .e .o .
…
…
name
address
name
Employee
c.e.o.
o id
…
…
string
Lower-bound schema
Employee
o id
…
…
nam e
…
…
m a n a g e d -b y
…
…
w o rk s-fo r
…
…
Store rest in overflow graph
Application 2:
Query Optimization
Bib
paper
year
int
journal
select X.title
from Bib._ X
where X.*.zip = “12345”
book
address
title
string
string
title
author
string
string
last first
zip city streetname name
string
string
string
string
Upper-bound schema
select X.title
from Bib.book X
where X.address.zip = “12345”
[Fernandez, Suciu 1998]
Schema Extraction
(From Data)
Problem statement
• given data instance D
• find the “most specific” schema S for D
In practice: S too large, need to relax
[Nestorov et al. 1998]
Schema Extraction: Sample
Data
&r
employee
employee employee
employee employee
manages
employee
employee
manages
manages
manages
manages
&p1
&p2
managedby
&p3
company worksfor worksfor
&p4
&p5
&p6
&p7
managedby
managedby
managedby
worksfor
employee
managedby
worksfor worksfor
worksfor
worksfor
worksfor
&c
&p8
Lower Bound Schema Extraction
Root
&r
employee
company
employee
Bosses
&p1,&p4,&p6
worksfor
Company
&c
manages
managedby
worksfor
Regulars
&p2,&p3,&p5,&p7,&p8
Upper Bound Schema Extraction:
Data Guides
Root
&r
company
managedby
employee
Employees
&p1,&p1,&p3,P4
&p5,&p6,&p7,&p8
manages
worksfor
Bosses
&p1,&p4,&p6
worksfor
Company
&c
manages
managedby
worksfor
Regulars
&p2,&p3,&p5,&p7,&p8
Schemas in XML
• Document Type Definition (DTD)
• XML Schema
• RDF Schema
Document Type Definition:
DTD
• part of the original XML specification
• an XML document may have a DTD
• terminology for XML:
– well-formed: if tags are correctly closed
– valid: if it has a DTD and conforms to it
• validation is useful in data exchange
DTDs as Grammars
<!DOCTYPE paper [
<!ELEMENT paper (section*)>
<!ELEMENT section ((title,section*) | text)>
<!ELEMENT title
(#PCDATA)>
<!ELEMENT text
(#PCDATA)>
]>
<paper> <section> <text> </text> </section>
<section> <title> </title> <section> … </section>
<section> … </section>
</section>
</paper>
DTDs as Schemas
Not so well suited:
• impose unwanted constraints on order
<!ELEMENT person (name,phone)>
• references cannot be constraint
• can be to vague:
<!ELEMENT person ((name|phone|email)*)>
XML Schemas
•
•
•
•
•
very recent proposal
unifies previous schema proposals
generalizes DTDs
uses XML syntax
two documents: structure and datatypes
– http://www.w3.org/TR/xmlschema-1
– http://www.w3.org/TR/xmlschema-2
XML Schemas
<elementType name=“paper”>
<sequence>
<elementTypeRef name=“title”/>
<elementTypeRef name=“author” minOccurs=“0”/>
<elementTypeRef name=“year”/>
<choice> <elementTypeRef name=“journal”/>
<elementTypeRef name=“conference”/>
</choice>
</sequence>
</elementType>
DTD: <!ELEMENT paper (title,author*,year, (journal|conference))>
RDF Schemas
• http://www.w3.org/TR/PR-rdf-schema
(3/99)
• object-oriented flavor
RDF Schemas
• recall RDF data:
subject
predicate
object
statement
– resources
– properties
• RDF schema:
– classes
– properties
RDF Schemas
Data:
<rdf:Description ID=“car001”>
<name> My Honda </name>
<miles> 50000 </miles>
<rdf:type resource=“#MotorVehicle”/>
</rdf:Description>
RDF Schemas
Schema:
<rdf:Description ID=“MotorVehicle”>
<rdf:type resource=“#Class”/>
<rdf:subClassOf resource=“#Resource”/>
</rdf:Description>
<rdf:Description ID=“Truck”>
<rdf:type resource=“#Class”/>
<rdf:subClassOf resource=“#MotorVehicle”/>
</rdf:Description>
RDF Schemas
car001
name
type
Truck
subClassOf
type
MotorVehicle
type
Class
miles
My Honda 50000
RDF Schemas
• different from object-oriented systems:
– OO: define a class by set of properties
– RDF: define a property in terms of its classes
• metadata in RDF:
– an RDF schema described as an RDF data
Summary of Schemas
• in SS data:
– graph theoretic
– data and schema are decoupled
– used in data processing
• in XML
– from grammar to object-oriented
– schema wired with the data
– emphasis on semantics for exchange
Part 4
Systems Issues
•
•
•
•
•
Semistructured data and XML
Query languages
Schemas
Systems issues
Conclusions
Systems Issues
• servers
• mediators
Servers for
Semistructured Data / XML
• storage
• index
• query evaluation [McHugh, Widom 1999]
XML Storage
•
•
•
•
•
text file (XML)
store in ternary relation
use DTD to derive schema
mine data to derive schema
build special purpose repository (Lore)
XML Storage: Text File
• advantages
– simple
– less space than one thinks
– reasonable clustering
• disadvantage
– no updates
– require special purpose query processor
Store XML in Ternary Relation
Ref
S o u rc e
&o1
&
&
&
&
&
paper
&o2
title
&o3
author
author
&o4
“The Calculus” “…”
year
o
o
o
o
o
1
2
2
2
2
L abel
D est
paper
title
a u th o r
a u th o r
year
&
&
&
&
&
o
o
o
o
o
2
3
4
5
6
Val
&o5
“…”
[Florescu, Kossman 1999]
&o6
“1986”
N ode
&
&
&
&
o
o
o
o
3
4
5
6
V a lu e
T h e C a lc u lu s
…
…
1986
Use DTD to derive Schema
• DTD:
<!ELEMENT employee (name, address, project*)>
<!ELEMENT address (street, city, state, zip)>
• ODMG classes:
class Employee public type tuple
(name:string, address:Address, project:List(Project))
class Address public type tuple (street:string, …)
• [Christophides et al. 1994 , Shanmugasundaram et al. 1999]
Mine Data to Derive Schema
paper
paper paper
Paper1
paper
year
author
title
author
authortitle authortitleauthor title
fn
ln fn
ln
fn
ln
fn
fn 1
ln 1
fn 2
ln 2
title
year
X
X
X
X
X
X
X
-
X
-
X
X
X
X
-
ln
Paper2
a u th o r
X
[Deutsch et al. 1999]
title
X
Indexing Semistructured Data
• coercions: 1995 v.s. “1995”
• regular path expressions
– data guides [Goldman, Widom, 1997]
– T-indexes [Milo, Suciu, 1999]
Indexing All Paths in the Data
1
t
t
2
a
t
3
b
7
a
8
t
t
4
c
a
d
5
a
9
10
11
12
6
a
b
13
Semistructured Data
t
1
t
23456
a
7 8 10 12 13
d
c
b
7 13
Data Guide
a
9
11
7 13
1
23456
b
c
b
8 10 12
T-Index
d
9
11
Mediators for
Semistructured Data / XML
• XML = virtual view of Relational/OO/OR sources
• mediator = translation, integration
• issues:
– query composition and rewriting [Papakonstatinou, et al. 1996]
– limited source capabilities [Yerneni, et al. 1999]
Example: An XML Mediator
Store
• relational database:
• virtual XML view:
s id
…
…
SB
n a m e
…
…
s id
…
…
Book
b id
…
…
<store> <name> n1 </name>
<book> ... </book>
<book> ... </book>
...
</store>
<store> <name>n2 </name>
<book> ... </book>
<book> ... </book>
…
</store>
b id
…
…
title
…
…
Example: An XML Mediator
• specify mediator declaratively (a view):
from
where
Store, SB, Book
Store.sid=SB.sid and
SB.bid=Book.bid
construct <store ID=f(Store.sid)>
<name> Store.name </name>
<book> Book.title </book>
</store>
Example: An XML Mediator
• users ask XML-QL queries:
– find stores who sell “The Calculus”
where
<store> <name> $n </name>
<book> The Calculus </book>
<store>
construct <result> $n </result>
Example: An XML Mediator
• system composes query with view:
from Store, SB, Book
where Store.sid=SB.sid and
SB.bid=Book.bid and
Book.title=“The Calculus”
construct <result> Store.name </result>
Summary of Systems
• unclear today how XML will be used
– materialized ? Need servers
– virtual ? Need mediators
• most work is still ahead
Part 5
Conclusions
•
•
•
•
•
Semistructured data and XML
Query languages
Schemas
Systems issues
Conclusions
Summary
•
•
•
•
XML = what is out there
semistructured data = what we can process
paradigm shift, for both Web and db
covered in tutorial:
– data models, queries, schemas
Current and Future Technologies
• Web applications possible today:
– export relational data to XML (e.g. Oracle)
– import XML directly into applications
• Web applications in the future:
– mediator technology (XML view)
– store/process native XML data
– compress XML
– mine/analyze XML
Why This Is Cool for Database
Researchers
• put to work what you teach in CS101 !
– tree traversals (structural recursion, XSL)
– automata theory (DTD’s, path expressions)
– graph theory (simulation)
• adapt old DB tricks to new kind of data
• save the trees: from fax to XML
The End
Further Readings
www. w3.org/XML
www-db.stanford.edu/~widom
www-rocq.inria.fr/~abiteboul
db.cis.upenn.edu
www.research.att.com/~suciu
Abiteboul, Buneman, Suciu
Data on the Web: From Relational to Semistructured to XML
Morgan Kaufmann, 1999 (appears in October)
Descargar

From Semistructured Data to XML