Models and languages for semistructured data
Bridging documents and databases
Lectures
1. Introduction to data models
2. Query languages for relational databases
3. Models and query languages for object databases
4. Models and query languages for semistructured data, XML
5. Embedded query languages
6. Guest lecture on Object Role Modelling
Why do we like types?
Types facilitate understanding
Types enable compact representations
Types enable query optimisation
Types facilitate consistency enforcement
Background assumptions for typed data
Data stable over time
Organisational body to control data
Exercise: Give an example of a context where these assumptions do not hold
Semistructured data
Semistructured data is schemaless and self-describing
The data and the description of the data are integrated
An example
{name: {first: “John”, last: “Smith”},
tel: 112233,
email: [email protected]}
[Figure: the same data drawn as an edge-labelled tree. The root has edges name, tel, and email; the name node has edges first (“John”) and last (“Smith”).]
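The ssd-expression above can be mirrored in an ordinary programming language. The following is a minimal sketch, assuming a complex value is stored as a list of (label, value) pairs; the names `john` and `to_ssd` are illustrative and not part of the lecture, and the email field is left out.

```python
# A complex value is a list of (label, value) pairs; an atomic value is a str or int.
# A list (not a dict) is used because the same label may occur more than once.
john = [
    ("name", [("first", "John"), ("last", "Smith")]),
    ("tel", 112233),
]

def to_ssd(value):
    """Render a nested value as an ssd-expression string."""
    if isinstance(value, list):
        inner = ", ".join(f"{label}: {to_ssd(child)}" for label, child in value)
        return "{" + inner + "}"
    if isinstance(value, str):
        return f'"{value}"'
    return str(value)

print(to_ssd(john))
```

The labels travel with the data, which is what makes the representation self-describing.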
Another example
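{person:
&o1{name: “Eva”, age: 40, child: &o2},
person:
&o2{name: “Abel”, age: 20}}
[Figure: the same data drawn as a graph. Two person edges lead from the root to the nodes &o1 and &o2; &o1 has edges name (“Eva”), age (40), and child (pointing to &o2); &o2 has edges name (“Abel”) and age (20).]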
An object identifier, such as &o1, before a structure binds the object identifier to the identity of that structure. The object identifier can then be used to refer to the structure.
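One way to picture object identifiers is as keys in a lookup table. The sketch below is an assumption about representation, not the lecture's definition: `nodes`, `root`, and `follow` are hypothetical names, and an edge target is either an atomic value or the identifier of another node.

```python
# Nodes are keyed by object identifier; each node holds (label, target) edges.
nodes = {
    "&o1": [("name", "Eva"), ("age", 40), ("child", "&o2")],
    "&o2": [("name", "Abel"), ("age", 20)],
}
root = [("person", "&o1"), ("person", "&o2")]

def follow(edges, label):
    """Return the targets of all edges with the given label, resolving identifiers."""
    return [nodes.get(target, target) if isinstance(target, str) else target
            for edge_label, target in edges if edge_label == label]

# The child edge of &o1 refers to the structure bound to &o2.
children_of_eva = follow(nodes["&o1"], "child")
print(follow(children_of_eva[0], "name"))
```

Because &o2 is reached both from the root and from &o1, the data forms a graph rather than a tree.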
Terminology
The following is an ssd-expression:
&o1{name: “Eva”, age: 40, child: &o2}
Object identifier: &o1
Labels: name, age, child
Values: “Eva”, 40
A database
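[Figure: the example database as an edge-labelled graph. The root db has an edge biblio; below biblio are a paper edge to node n1 and book edges to nodes n2, n3, …]
n1 (paper): author Crick, author Wallace, title DNA spiral, date 1956
n2 (book): author Darwin, title Origin, date 1848
n3 (book): author Marx, title Kapital, date 1860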
Path expressions
A path expression is a sequence of labels:
l1.l2…ln
A path expression results in a set of nodes
Path properties are specified by regular
expressions on two levels: on the alphabet of
labels and on the alphabet of characters that
comprise labels
A path expression
biblio.book.author
[Figure: the example database. The path expression reaches the author nodes below the book edges: Darwin and Marx.]
A path expression
biblio.(book | paper).author
[Figure: the example database. The path expression reaches the author nodes below both the paper edge and the book edges: Crick, Wallace, Darwin, and Marx.]
Examples of path expressions
biblio.book.author - authors of books
biblio.paper.author - authors of papers
biblio.(book | paper).author - authors of books or papers
biblio._.author - authors of anything
biblio._*.author - nodes at the ends of paths starting with biblio, ending with author, and having an arbitrary sequence of labels between
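The path expressions listed above can be evaluated with a short recursive function. This is a sketch under stated assumptions: the database is a tree of (label, value) pairs, a path is a list of steps, each step is a regular expression over one label, `_` stands for any label, and `_*` for any sequence of labels. The names `biblio_db` and `evaluate` are illustrative, and the paper title is taken from the query result shown later in the slides.

```python
import re

biblio_db = [("biblio", [
    ("paper", [("author", "Crick"), ("author", "Wallace"),
               ("title", "The spiral DNA"), ("date", 1956)]),
    ("book", [("author", "Darwin"), ("title", "Origin"), ("date", 1848)]),
    ("book", [("author", "Marx"), ("title", "Kapital"), ("date", 1860)]),
])]

def children(node):
    """Edges of a complex value; atomic values have none."""
    return node if isinstance(node, list) else []

def evaluate(node, steps):
    """Return the nodes reached from `node` by following `steps`."""
    if not steps:
        return [node]
    step, rest = steps[0], steps[1:]
    results = []
    if step == "_*":
        # Zero labels: continue from here. One or more: descend, keeping the star.
        results += evaluate(node, rest)
        for _, child in children(node):
            results += evaluate(child, steps)
    else:
        pattern = ".*" if step == "_" else step
        for label, child in children(node):
            if re.fullmatch(pattern, label):
                results += evaluate(child, rest)
    return results

print(evaluate(biblio_db, ["biblio", "book", "author"]))
print(evaluate(biblio_db, ["biblio", "book|paper", "author"]))
print(evaluate(biblio_db, ["biblio", "_*", "author"]))
```

On a graph with cycles the `_*` step would need a visited set; the sketch assumes a tree.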
Example of a label pattern
((b|B)ook|(a|A)uthor)(s)? - book, Book, author, Author, books, Books, authors, Authors
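The label pattern can be tried out directly with Python's `re` module, whose syntax for alternation and optional groups coincides with the one used here. A small sketch; `matches_label` is a hypothetical helper name.

```python
import re

# The label pattern, written with | for alternation.
LABEL_PATTERN = re.compile(r"((b|B)ook|(a|A)uthor)(s)?")

def matches_label(label):
    """True if the whole label matches the pattern."""
    return LABEL_PATTERN.fullmatch(label) is not None

for label in ["book", "Books", "Author", "authors", "BOOK", "title"]:
    print(label, matches_label(label))
```

`fullmatch` is used so that the pattern must cover the entire label, not just a prefix.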
An exercise
biblio._*.author.(“[s|S]ection”)
Which ones of the following paths match the path expression above?
1. Biblio.author.Section
2. Biblio.cat.rat.hat.author.section
3. Biblio.author
4. Biblio.cat.author.section.Section
A simple query
select author: X
from biblio.book.author X
Result:
{author: “Darwin”, author: “Marx”}
A query with a condition
select row: X
from biblio._ X
where “Crick” in X.author
Result:
{row:
{author: “Crick”,
author: “Wallace”,
date: 1956,
title: “The spiral DNA”}, …}
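The two queries above can be read as nested loops over path bindings. The following sketch assumes the tree-of-pairs representation and a hypothetical helper `step`; it is meant to illustrate the semantics of select-from-where, not any particular query engine.

```python
paper = [("author", "Crick"), ("author", "Wallace"),
         ("title", "The spiral DNA"), ("date", 1956)]
biblio = [
    ("paper", paper),
    ("book", [("author", "Darwin"), ("title", "Origin"), ("date", 1848)]),
    ("book", [("author", "Marx"), ("title", "Kapital"), ("date", 1860)]),
]

def step(node, label=None):
    """Targets of the edges of `node` carrying `label` (any label if None)."""
    return [child for edge_label, child in node
            if label is None or edge_label == label]

# select author: X from biblio.book.author X
authors = [("author", x)
           for b in step(biblio, "book")
           for x in step(b, "author")]

# select row: X from biblio._ X where "Crick" in X.author
rows = [("row", x) for x in step(biblio) if "Crick" in step(x, "author")]

print(authors)
print(rows)
```

Each variable in the from clause becomes one loop, and the where clause becomes the filter.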
Two exercises
select row: {title: Y, date: Z}
from biblio.paper X, X.title Y, X.date Z
select row: {author: Y, date: Z}
from biblio.book X, X.author Y, X.date Z
A database
select row: {title: Y, date: Z}
from biblio.paper X, X.title Y, X.date Z
[Figure: the example database, repeated for reference.]
Nested queries
select row: (select author: Y
from X.author Y)
from biblio.book X
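A nested query corresponds to a comprehension inside a comprehension. A minimal sketch, assuming the tree-of-pairs representation; `step` is a hypothetical helper.

```python
biblio = [
    ("paper", [("author", "Crick"), ("author", "Wallace")]),
    ("book", [("author", "Darwin"), ("title", "Origin"), ("date", 1848)]),
    ("book", [("author", "Marx"), ("title", "Kapital"), ("date", 1860)]),
]

def step(node, label):
    """Targets of the edges of `node` carrying `label`."""
    return [child for edge_label, child in node if edge_label == label]

# select row: (select author: Y from X.author Y) from biblio.book X
result = [("row", [("author", y) for y in step(x, "author")])
          for x in step(biblio, "book")]
print(result)
```

The inner query is evaluated once per binding of X, so the authors stay grouped by book.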
Three exercises
Which authors have written a book or a paper in 1992?
Which authors have written a book together with Jones?
Which authors have written both a book and a paper?
Expressing relations
r1:
a  b  c
1  2  3
3  2  2
4  3  1
r2:
b  d  e
1  1  3
3  4  2
2  3  1
{ r1: { row: {a: 1, b: 2, c: 3},
row: {a: 3, b: 2, c: 2},
row: {a: 4, b: 3, c: 1} },
r2: { row: {b: 1, d: 1, e: 3},
row: {b: 3, d: 4, e: 2},
row: {b: 2, d: 3, e: 1} } }
Expressing relational joins
select a: A, d: D
from r1.row X, r2.row Y, X.a A, X.b B, Y.b B’, Y.d D
where B = B’
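The join can be mimicked with two nested loops. The sketch below assumes the rows of r1 and r2 as read column by column from the two tables on the previous slide, and stores each row as a dict since labels within a row are unique.

```python
# Rows of r1 (columns a, b, c) and r2 (columns b, d, e).
r1 = [{"a": 1, "b": 2, "c": 3}, {"a": 3, "b": 2, "c": 2}, {"a": 4, "b": 3, "c": 1}]
r2 = [{"b": 1, "d": 1, "e": 3}, {"b": 3, "d": 4, "e": 2}, {"b": 2, "d": 3, "e": 1}]

# select a: A, d: D from r1.row X, r2.row Y, ... where B = B'
joined = [{"a": x["a"], "d": y["d"]}
          for x in r1
          for y in r2
          if x["b"] == y["b"]]
print(joined)
```

The where clause plays the role of the join condition on the shared attribute b.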
Label variables
select L: X
from biblio._*.L X
where matches(“.*Shakespeare.*”, X)
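L is a label variable
[Figure: a database with root db and edge biblio, below which are two book edges to nodes n2, n3, …]
n2 (book): author Shakespeare, title Macbeth, date 1622
n3 (book): author Smith, title Best of Shakespeare, date 1992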
Label variables
select L: X
from biblio._*.L X
where matches(“.*Shakespeare.*”, X)
Result:
{author: “Shakespeare”,
title: “Best of Shakespeare”}
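A label variable binds to the label of an edge in the same way an ordinary variable binds to its target. The sketch below assumes the tree-of-pairs representation and a hypothetical generator `edges_below` that plays the role of biblio._*.L X.

```python
import re

biblio = [
    ("book", [("author", "Shakespeare"), ("title", "Macbeth"), ("date", 1622)]),
    ("book", [("author", "Smith"), ("title", "Best of Shakespeare"), ("date", 1992)]),
]

def edges_below(node):
    """Yield every (label, target) edge reachable from `node`, at any depth."""
    if isinstance(node, list):
        for label, child in node:
            yield label, child
            yield from edges_below(child)

# select L: X from biblio._*.L X where matches(".*Shakespeare.*", X)
result = [(label, x) for label, x in edges_below(biblio)
          if isinstance(x, str) and re.fullmatch(".*Shakespeare.*", x)]
print(result)
```

The query finds the string wherever it occurs, without knowing in advance under which label it is stored.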
Turning labels into data
select publ: {type: L, author: A}
from biblio.L X, X.author A
Result:
{publ: {type: “paper”, author: “Crick”},
publ: {type: “paper”, author: “Wallace”},
publ: {type: “book”, author: “Darwin”}}
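[Figure: the database used for this query. Below biblio are a paper edge to node n1 and a book edge to node n2.]
n1 (paper): author Crick, author Wallace, title DNA spiral, date 1956
n2 (book): author Darwin, title Origin, date 1848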
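Turning a label into data amounts to copying the value bound to the label variable into the result. A short sketch under the same tree-of-pairs assumption; only the author edges are included in the sample data.

```python
biblio = [
    ("paper", [("author", "Crick"), ("author", "Wallace")]),
    ("book", [("author", "Darwin")]),
]

# select publ: {type: L, author: A} from biblio.L X, X.author A
publications = [
    ("publ", [("type", label), ("author", author)])
    for label, x in biblio
    for edge_label, author in x
    if edge_label == "author"
]
print(publications)
```

The label of each publication edge ends up as the atomic value of a type edge in the result.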
An exercise
List all publications in 1992, their types, and titles.
Basic XML syntax
XML is a textual representation of data
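An element is text bounded by tags
<name> John </name>
Here <name> is the start-tag, John is the content, and </name> is the end-tag; together they form an element
An element with no content, <name></name>, can be abbreviated as <name/>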
Basic XML syntax
Elements may contain subelements
<person>
<name> John </name>
<tel> 112233 </tel>
<email> [email protected] </email>
</person>
XML attributes
An attribute is defined by a name-value pair within a tag
<price currency = "dollar"> 500 </price>
<length unit = "cm"> 25 </length>
XML attributes and elements
<product>
<name> widget </name>
<price> 10 </price>
</product>
<product price = "10">
<name> widget </name>
</product>
<product name = "widget" price = "10"/>
XML and ssd-expressions
<person>
<name> John </name>
<tel> 112233 </tel>
<email> [email protected] </email>
</person>
{person: {name: “John”, tel: 112233, email: [email protected]}}
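The correspondence between XML elements and ssd-expressions can be sketched with Python's standard `xml.etree.ElementTree` module. The function name `element_to_ssd` is illustrative, and the email element is left out of the sample document.

```python
import xml.etree.ElementTree as ET

def element_to_ssd(elem):
    """Map an element to a (label, value) pair; leaf text becomes an atomic value."""
    subelements = list(elem)
    if subelements:
        return (elem.tag, [element_to_ssd(child) for child in subelements])
    return (elem.tag, (elem.text or "").strip())

doc = ET.fromstring("<person><name> John </name><tel> 112233 </tel></person>")
print(element_to_ssd(doc))
```

Note that XML content is always text, so the telephone number comes back as the string "112233" rather than as a number; this mapping also ignores attributes.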
XML references
<person id = "p1">
<name> John </name>
<tel> 112233 </tel>
</person>
<person id = "p2">
<name> Peter </name>
<tel> 998877 </tel>
<boss idref = "p1"/>
</person>
Here id is an element identifier and idref is a reference attribute
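A reference is followed by looking up the element whose identifier equals the reference value. The sketch below wraps the two person elements in a root element people, which is an assumption made so the text parses as one document; `by_id` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<people>
  <person id="p1"><name> John </name><tel> 112233 </tel></person>
  <person id="p2"><name> Peter </name><tel> 998877 </tel><boss idref="p1"/></person>
</people>
""")

# Index every person by its identifier, then follow Peter's boss reference.
by_id = {person.get("id"): person for person in doc.iter("person")}
peter = by_id["p2"]
boss = by_id[peter.find("boss").get("idref")]
print(boss.findtext("name").strip())
```

The identifier and reference play the same role as &o1 and &o2 in ssd-expressions.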
Document Type Definitions
<!DOCTYPE db [
<!ELEMENT db (person*)>
<!ELEMENT person (name, age, email)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT age (#PCDATA)>
<!ELEMENT email (#PCDATA)>
]>
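Python's standard library does not validate documents against a DTD, so the following is only a hand-written sketch of what the element declarations above require. The content models are rewritten manually as regular expressions over the sequence of child tags; `CONTENT_MODELS` and `conforms` are hypothetical names, and the sample values are made up.

```python
import re
import xml.etree.ElementTree as ET

# Each content model as a regular expression over comma-terminated child tags.
# An empty pattern means "no subelements", which stands in for (#PCDATA).
CONTENT_MODELS = {
    "db": r"(person,)*",
    "person": r"name,age,email,",
    "name": r"",
    "age": r"",
    "email": r"",
}

def conforms(elem):
    """True if elem and all its descendants satisfy CONTENT_MODELS."""
    model = CONTENT_MODELS.get(elem.tag)
    if model is None:
        return False  # undeclared element
    child_tags = "".join(child.tag + "," for child in elem)
    if re.fullmatch(model, child_tags) is None:
        return False
    return all(conforms(child) for child in elem)

good = ET.fromstring(
    "<db><person><name>Eva</name><age>40</age><email>eva@example.org</email></person></db>")
bad = ET.fromstring(
    "<db><person><age>40</age><name>Eva</name><email>eva@example.org</email></person></db>")
print(conforms(good), conforms(bad))
```

The second document fails only because name and age are swapped, which shows that the declaration (name, age, email) fixes the order of the subelements.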
An exercise on DTDs as schemas
<db> <r1> <a> a1 </a> <b> b1 </b> </r1>
<r1> <a> a2 </a> <b> b2 </b> </r1>
<r2> <c> a1 </c> <d> b1 </d> </r2>
<r2> <c> c2 </c> <d> d2 </d> </r2>
<r3> <a> a1 </a> <c> b1 </c> </r3>
</db>
Write down a DTD for the data above!
Attributes in DTDs
<product>
<name language = "Swedish" department = "music">
trumpet </name>
<price currency = "dollar"> 500 </price>
<length unit = "cm"> 25 </length>
</product>
<!ATTLIST name language CDATA #REQUIRED
department CDATA #IMPLIED>
<!ATTLIST price currency CDATA #REQUIRED>
<!ATTLIST length unit CDATA #REQUIRED>
Reference attributes in DTDs
<!DOCTYPE people [
<!ELEMENT people (person*)>
<!ELEMENT person (name)>
<!ELEMENT name (#PCDATA)>
<!ATTLIST person id ID #REQUIRED
boss IDREF #REQUIRED
friends IDREFS #IMPLIED>
]>
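The attribute declarations above say that id must be present and unique, boss must be present and name exactly one existing id, and friends may list several existing ids. A hand-written check of these rules might look as follows; `check_references` and the sample document are hypothetical, and this is not a general DTD validator.

```python
import xml.etree.ElementTree as ET

def check_references(xml_text):
    """Return a list of violations of the id / boss / friends declarations."""
    people = ET.fromstring(xml_text).findall("person")
    ids = [person.get("id") for person in people]
    problems = []
    for person in people:
        pid = person.get("id")
        if pid is None:
            problems.append("person without required id")
        elif ids.count(pid) > 1:
            problems.append(f"duplicate id {pid!r}")
        boss = person.get("boss")
        if boss is None:
            problems.append(f"{pid}: required boss is missing")
        elif boss not in ids:
            problems.append(f"{pid}: boss {boss!r} is not a declared id")
        for friend in (person.get("friends") or "").split():
            if friend not in ids:
                problems.append(f"{pid}: friend {friend!r} is not a declared id")
    return problems

sample = """<people>
  <person id="a" boss="b"><name> Anna </name></person>
  <person id="b" boss="b" friends="a c"><name> Bo </name></person>
</people>"""
print(check_references(sample))
```

An IDREF holds a single identifier, so a boss value containing a space is reported as a whole; an IDREFS value is split on whitespace and each part is checked.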
An exercise
<people>
<person id = "sven" boss = "olle">
<name> Sven Svensson </name>
</person>
<person id = "olle" friends = "nils eva">
<name> Olle Olsson </name>
</person>
<person id = "pelle" boss = "nils eva">
<name> Per Persson </name>
</person>
</people>
Does this XML element conform to the previous DTD?
Limitations of DTDs as schemas
DTDs impose order
No base types
The types of IDREFs cannot be constrained
XSL - extensible stylesheet language
<bib> <book> <title> t1 </title>
<author> a1 </author>
<author> a2 </author>
</book>
<paper>
<title> t2 </title>
<author> a3 </author>
<author> a4 </author>
</paper>
<book> <title> t3 </title>
<author> a5 </author>
<author> a6 </author>
</book>
</bib>
Template rules and XSL patterns
<xsl:template>
<xsl:apply-templates/>
</xsl:template>
The element above is a template rule
<xsl:template match = "bib/*/title">
<result>
<xsl:value-of/>
</result>
</xsl:template>
The value of the match attribute, bib/*/title, is an XSL pattern
Result:
<result> t1 </result>
<result> t2 </result>
<result> t3 </result>
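The effect of the pattern bib/*/title can be imitated with the limited path syntax of Python's `xml.etree.ElementTree`. This is not an XSL processor, only a sketch of which elements the pattern selects in the sample document; the author elements are omitted for brevity.

```python
import xml.etree.ElementTree as ET

bib = ET.fromstring("""
<bib>
  <book> <title> t1 </title> </book>
  <paper> <title> t2 </title> </paper>
  <book> <title> t3 </title> </book>
</bib>
""")

# "./*/title" selects title elements that are grandchildren of the bib root,
# whatever the tag of the element in between.
results = [f"<result> {title.text.strip()} </result>"
           for title in bib.findall("./*/title")]
print("\n".join(results))
```

The wildcard step makes the pattern match titles of both books and papers, in document order.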
Two exercises
select row: {title: Y, date: Z}
from biblio.paper X, X.title Y, X.date Z
{row:
{title: “The spiral DNA”,
date: 1956}}
select row: {author: Y, date: Z}
from biblio.book X, X.author Y, X.date Z
Which authors have written a book or a paper in 1992?
select author: X
from biblio.(book | paper) Y, Y.author X
where Y.date = 1992
Which authors have written a book together with Jones?
select author: X
from biblio.book Y, Y.author X
where “Jones” in Y.author
Which authors have written both a book and a paper?
select author: A
from biblio.book B, biblio.paper P, B.author A
where B.author = P.author
select author: A1
from biblio.book B, biblio.paper P, B.author A1, P.author A2
where A1 = A2
List all publications in 1992, their types, and titles.
select publ: {type: L, title: T}
from biblio.L X, X.title T
where X.date = 1992
<!DOCTYPE db [
<!ELEMENT db (r1*, r2*, r3*)>
<!ELEMENT r1 (a, b)>
<!ELEMENT r2 (c, d)>
<!ELEMENT r3 (a, c)>
<!ELEMENT a (#PCDATA)>
<!ELEMENT b (#PCDATA)>
<!ELEMENT c (#PCDATA)>
<!ELEMENT d (#PCDATA)>
]>
<db> <r1> <a> a1 </a> <b> b1 </b> </r1>
<r1> <a> a2 </a> <b> b2 </b> </r1>
<r2> <c> a1 </c> <d> b1 </d> </r2>
<r2> <c> c2 </c> <d> d2 </d> </r2>
<r3> <a> a1 </a> <c> b1 </c> </r3>
</db>