Web Data Management
Sanjay Kumar Madria
Department of Computer Science
University of Missouri-Rolla
[email protected]
[email protected], UMR
WWW
• Huge, widely distributed,
heterogeneous collection of semistructured multimedia documents in
the form of web pages connected via
hyperlinks.
[email protected], UMR
World Wide Web
• Web is fast growing
• More business organizations putting
information in the Web
• Business on the highway
• Myriad of raw data to be processed for
information
[email protected], UMR
As WWW grows, more chaotic it
becomes
• Web is fast growing, distributed, nonadministered global information resource
• WWW allows access to text, image, video,
sound and graphic data
• More business organizations creating web
servers
• More chaotic environment to locate
information of interest
• Lost in hyperspace syndrome
[email protected], UMR
Characteristics of WWW
• WWW is a set of directed graphs
• Data in the WWW has a heterogeneous
nature, self-describing and schema less
• Unstructured information , deeply nested
• No central authority to manage information
• Dynamic verses static information
• Web information discoveries - search
engines
[email protected], UMR
Web is Growing!
•
•
•
•
•
•
•
In 1994, WWW grew by 1758 % !!
June 1993 - 130
June 1994 - 1265
Dec. 1994 - 11,576
April 1995 - 15,768
July 1995 - 23,000+
2000 - !!!!!
[email protected], UMR
‘COM’ domains are increasing!
• As of July 1995, 6.64 million host
computers on the Internet:
–
–
–
–
–
–
1.74 million are ‘com’ domains
1.41 million are ‘edu’ domains
0.30 million are ‘net’
0.27 million are ‘gov’
0.22 million are ‘mil’
0.20 million are ‘org’
[email protected], UMR
Top web countries
1. Canada (1) 80% 9. New Zealand(7)101
2. US (4) 140%
10. Sweden (9) 101%
3. Ireland (3) 110% 11. Israel (12) 112%
4. Iceland (2) 68%
12. Cyprus (8) 72%
5. UK (14) 336 %
13. Hong Kong (15)148%
6. Malta (5) 155%
14. Norway (10) 64%
7. Australia (6) 133% 15. Switzerland (13) 75%
8. Singapore (11) 207% 16. Denmark (16) 105%
[email protected], UMR
How users find web sites
•
•
•
•
•
•
•
•
Indexes and search engines
UseNet newsgroups
Cool lists
New lists
Listservers
Print ads
Word-of-mouth and e-mail
Linked web advertisement
[email protected], UMR
75
44
27
24
23
21
17
4
Limitations of Search Engines
• Do not exploit hyperlinks
• Search is limited to string matching
• Queries are evaluated on archived data
rather than up-to-date data; no indexing on
current data
• Low accuracy
• Replicated results
• No further manipulation possible
[email protected], UMR
Limitations of Search Engines
•
•
•
•
ERROR 404!
No efficient document management
Query results cannot be further manipulated
No efficient means for knowledge discovery
[email protected], UMR
More PROBLEMS
• specifying/understanding what information
is wanted
• the high degree of variability of accessible
information
• the variability in conceptual vocabulary or
“ontology” used to describe information
• complexity of querying unstructured data
[email protected], UMR
• complexity of querying structured data
• uncontrolled nature of web-based
information content
• determining which information sources to
search/query
[email protected], UMR
Search Engine Capabilities
– Selection of language
– Keywords with disjunction, adjacency, presence, absence, ...
– Word stemming (Hotbot)
– Similarity search (Excite)
– Natural language (LycosPro)
– Restrict by modification date (Hotbot) or range of dates
(AltaVista)
– Restrict result types (e.g., must include images) (Hotbot)
– Restrict by geographical source (content or domain)
(Hotbot)
– Restrict within various structured regions of a document
(titles or URLs) (LycosPro); (summary, first heading, title,
URL) (Opentext)
[email protected], UMR
SEARCH & RETRIEVAL
Search Engines
Search engine
Hotbot
AltaVista
Northern Light
Excite
Infoseek
Lycos
% web covered
34
28
20
14
10
3
 using several search engines is better
than using only one
 Source: Lawrence, S., and Giles, C.L., “Searching the World Wide
Web,” Science 280, pp. 98-100, 1998.
[email protected], UMR
Key Objectives
• Design a suitable data model to represent
web information
• Development of web algebra and query
language, query optimization
• Maintenance of Web data - view
maintenance
• Development of knowledge discovery and
web mining tools
• Web warehouse
• Data integration , secondary storages,
indexes
[email protected], UMR
Web Data Representation
• HTML - Hypertext Markup Language
–
–
–
–
fixed grammar, no regular expressions
Simple representation of data
good for simple data
difficult to extract information
• SGML - Standard Generalized Markup
Language - good for publishing deeply
structured document
• XML - Extended Markup Language -a subset
of SGML
[email protected], UMR
Terminology
•
•
•
•
HTML - Hypertext Mark-up Language
HTTP - Hypertext Transmission Protocol
URL - Uniform Resource Locator
example <URL>:=<protocol>://<Host>/<path>/filen
ame>[<#location>] where
– <protocol> is http, ftp, gopher
– host is internet address …
– #location is a textual label in the file.
[email protected], UMR
• Links are specified as
<A HREF=“Destination URL”>Anhor Text</A>
• “destination URL is the URL of the destination
document and Anchor Text is the text that appears as an
anchor when displayed.
• Example:
• <A HREF=http://www.ntu.edu.sg/ >Nanyang
Technological University</A>
• Absolute and relative
• URL <A HREF="AtlanticStates/NYStats.html">New
York</A> is relative
• <A
HREF="http://www.ncsa.uiuc.edu/General/Internet/
WWW/HTMLPrimer.html"> NCSA's Beginner's
Guide to HTML</A>
absoluteUMR
address
[email protected],
World Wide Web
• Prevalent, persistent and informative
• HTML documents (soon, XML) created by
humans or applications.
• Accessed day in and day out by humans and
applications.
•
Persistent HTML documents!!!
Can database technology help?
[email protected], UMR
Current Research Projects
• Web Query System
– W3QS, WebSQL, AKIRA, NetQL, RAW,
WebLog, Araneus
• Semistructured Data Management
– LOREL, UnQL, WebOQL, Florid
• Website Management System
– STRUDEL, Araneus
• Web Warehouse
– WHOWEDA
[email protected], UMR
Main Tasks
• Modeling and Querying the Web
– view web as directed graph
– content and link based queries
– example - find the page that contain the word
“clinton” which has a link from a page
containing word “monica”.
[email protected], UMR
• Information Extraction and integration
– wrapper - program to extract a structured
representation of the data; a set of tuples from
HTML pages.
– Mediator - integration of data-softwares that
access multiple source from a uniform interface
• Web Site Construction and Restructuring
– creating sites
– modeling the structure of web sites
– restructuring data
[email protected], UMR
MEDIATOR ARCHITECTURE
User Interface
Mediator
(Query/Search/
Retrieval/Result)
Wrapper
Wrapper
...
[email protected], UMR
What to Model
• Structure of Web sites
• Internal structure of web pages
• Contents of web sites in finer granularities
[email protected], UMR
Data Representation of Web Data
• Graph Data Models
• Semistructured Data Models (also graph
based)
[email protected], UMR
Graph Data Model
• Labeled graph data model where node
represents web pages and arcs represent
links between pages.
• Labels on arcs can be viewed as attribute
names.
• Regular path expression queries
[email protected], UMR
Semistructured Data Models
• Irregular data structure, no fixed schema
known and may be implicit in the data
• Schema may be large and may change
frequently
• Schema is descriptive rather than
perspective; describes the current state of
data, but violations of schema is still
tolerated
[email protected], UMR
• Data is not strongly typed; for different
objects the values of the same attributes
may be of differing types. (heterogenious
sources)
• No restriction on the set of arcs that
emanate from a given node in a graph or on
the types of the values of attributes
• Ability to query the schemas; acr variables
which get bound to labels on arcs, rather
than nodes in the graph
[email protected], UMR
Graph based Query Languages
• Use graph to model databases
• Support regular path expressions and graph
construction in queries.
• Examples
Graph Log for hypertext queries
graph query language for OO
[email protected], UMR
Query Languages for SemiStructured data
• Use labeled graphs
• Query the schema of data
• Ability to accommodate irregularities in the
data, such as missing links etc.
• Examples : Lorel (Stanford) , UnQL
(AT&T), STRUQL (AT&T)
[email protected], UMR
Comparison of Query Systems
S ystem
D ata m odel L ang. style
P ath exp.
w ebsql
R elational
SQL
yes
No
W 3Q S
LM G
SQL
Y es
NO
W ebL O G R elational
D atalog
No
No
L orel
LG
OQL
Y es
No
w eboql
hypertrees
OQL
Y es
Y es
U nQ L
LG
R ecursion
Y es
Y es
F lorid
S trudel
A raneus
W how eda
F -logic
D atalog
Y es
LG
D atalog
Y es
pag e sch em es S Q L
Y es
relational [email protected],
SQL
Y es
UMR
G raph
No
Y es
Y es
Y es
Types of Query Languages
• First Generation
• Second generation
[email protected], UMR
First Generation Query
Languages
• Combine the content-based queries of
search engines with structure-based queries
• Combine conditions on text pattern in
documents with graph pattern describing
link structures
• Examples - W3QL (TECHNION, Israel)
WebSQL (Toronto), WebLOG (Concordia)
[email protected], UMR
Second generation languages
• Called web data manipulation languages
• Web pages as atomic objects with properties
that they contain or do not contain certain
text patterns and they point to other objects
• Useful for data wrapping, transformation,
and restructuring
• Useful for web site transformation and
restructuring
[email protected], UMR
How they Differ?
• Provide access to the structure of web
objects they manipulate - return structure
• Model internal structures of web documents
as well as the external links that connect
them
• Support references to model hyperlinks and
some support to ordered collections of
records for more natural data representation
• Ability to create new complex structures as
a result of a query
[email protected], UMR
Examples
• Web OQL
• STRUQL
• Florid
[email protected], UMR
W3QS (WWW Query System) at
Technion - Israel
• Content queries
• Structural Queries
• Interfacing with user written programs and
standard UNIX utilities
• Uses existing WWW indexes and search
Services
• Provides view update facility
[email protected], UMR
W3QS
• Accessible via any WWW browsers
• API can be used by programs running
anywhere in the Internet
• Support queries on the web structure by
specifying starting page, a search domain
and depth of links.
• File content analysis tools and filling up of
forms automatically
[email protected], UMR
File Types
• Strict Inner Structure files such as Unix
environment files - Semantics of the data is
clearly linked to the syntax
• Semi-structured files - text files containing
formatting codes such as Latex or HTML
files- possible to use formatting codes to
analyze their semantic content
• Raw Files - no relation between meaning of
file and its inner structure
[email protected], UMR
Content Queries
• Queries based on the content of a single
node of hypertext
• SQLCOND is used to evaluate boolean
expressions
• Example - node-format = Latex and
Node.author =“Sanjay”
[email protected], UMR
Structure Queries
• Information conveyed in the hypertext
organization itself is conveyed.
• The result is a set of nodes and links from
the hypertext structure that satisfy a given
graph pattern; graph with nodes and edges
are annotated with conditions.
• Components are pattern definition, search
engines and form completion
[email protected], UMR
Structure
Query
Node1.title
=“Good article”
Node2.author
= “Sanjay”
Link1.
rev=“doc”
Answer
URL http://../myarticles.html
URL http:///.tex
<Title> Good articles</Title>
\author {sanjay}
A HREF=//..rev=“doc”>
[email protected], UMR
Search for an article
• Select cp n2/* result
from n1, l2, n2
where n1 in importantindexs.url
Fill n1.form as IN importantindexes.fil with
Keyword = “sanjay” SQLCOND (n2.format
=Laytex) and (n2.author=“sanjay”)
[email protected], UMR
Query to search hypertext pattern
• Return all the articles cited in the first chapter
of the book. Each chapter includes several
pointers to the bibliography, for example
• <A HREF=“http//cs…/refrences.html#ref2”>
[Relativity]</> means link [Relativity] leads to
the label ref2 in the references.html file.
In the references.html file the labeled link looks
like <A HREF=“./relative.tex”name=“ref2”>
[relativity, sanjay]</A> this link points to
relative.tex
[email protected], UMR
• Select cp art/* result from Ind,
l1,chap,l2,ref,l3 art where SQLCOND
(ind.url = “http://) And (chap.url =/.chapter1.html/) AND l2.HREF = /.\#13.Name/)
USING BFS.
[email protected], UMR
Url
http://cs.tech/bookindex.html
Url
http://…/Chapter-1.html
INDEX
Chapter 1
ref 1
l1
ref 2
Chapter 2
ref 3
…
References
ref 1
l3
ref 2
ref 3
[email protected], UMR
http://relati
ve.tex
article
WebSQL-University of Toronto
• Model web as relational database
• Use two relations Document and Anchor
• Document relation has one tuple for each
document in the web and the anchor relation
has one tuple for each anchor in each
document
[email protected], UMR
WebSQL
• SQL-like query language for extracting
information from the web.
• Capable of systematic processing of
either all the links in a page, all the
pages that can be reached from a given
URL through paths that match a pattern,
or a combination of both.
• Provides transparent access to index
servers
[email protected], UMR
Document
U rl
title
te x t
L e n g th T y p e
h ttp ://
T itle 1
T ext 1 1234
te x t
1 -1 -9 6
h ttp ://
T itle 2 T e x t 2 2 3 4 5
te x t
2 -3 -9 7
[email protected], UMR
M o d if
Anchor
B ase
lab el
h ref
h ttp ://
L ab el 1
h ttp ://
h ttp ://
L ab el 1
L ab el 2
h ttp ://
h ttp ://
[email protected], UMR
• Give documents’s URLs which contain
same title and keyword(s)
• Select d1.url, d2.url from
document d1 such that d1 MENTIONS
“keyword1” and document d2 such that d2
MENTIONS “keyword1”
where d1.title = d2.title
and NOT (d1.url = d2.url)
[email protected], UMR
Find Labels of all Hyperlinks to
Postscript Files
SELECT a.label
FROM Anchor a SUCH THAT base =
"http://www.SomeDoc.html"
WHERE a.href CONTAINS ".ps.Z";
[email protected], UMR
Documents about Databases
SELECT Document d.url, d.title
FROM d SUCH THAT
"http://www.OtherDoc.html" ->|=> d
WHERE d.title CONTAINS "databases";
Note :
-> path of length one within same server
=> path of length of one but different server
[email protected], UMR
Retrieve all the documents in the
same server that are pointed to
from the document
Whose URL is given
• Select d.url, d.title from
Document d SUCH THAT
“http//www. Cs.in -> d
[email protected], UMR
Find all broken links in a page
• SELECT a.href
FROM Anchor a SUCH THAT base =
"http://the.document.to.test"
WHERE protocol(a.href) = "http" AND
doc(a.href) = null;
[email protected], UMR
Web OQL (University of Toronto)
• Provides a framework that supports a large
class of data
• Restructuring operations.
• Simple semistructured data model for
documents and record-based data
• OQL-like syntax and regular expressions
• Serves as a two-way bridge between
databases and the Web.
[email protected], UMR
DATA MODEL
• Hypertrees are Ordered arc labeled trees
with two types of arcs ; internal and
external
• Internal arcs represent structured objects
• External arcs to represent refrences
(huperlinks) among objects.
• Records as labels in the arcs
• Sets of related hypertrees as Web
[email protected], UMR
ARCHITECTURE
•
•
Wrappers map all data sources to trees
The mapping can be done all at once or
on demand
[email protected], UMR
Example
• Extract from cspapers (paper database) title
and URL of the full version of papers of
“Smith”
• select [y.title,y’.URL]
from x in cs papers, y in x’
where y.authors ~”smith”
[email protected], UMR
Web Creation
• Create a new page for each research Group
(using the group name as URL). Each page
contains the publications of the
corresponding group.
• Select x’ as x.group from x in cspapers
• Select q1 as s1, q2 as s2, ...qm as sm
• where q’s are queries and each S’s is either
a string query or keyword schema. “as”
clause create a URL’s s1 , ..sm assigned to
each new page resulting from each query.
[email protected], UMR
ARANEUS
• Data Model called ADM for Web
Documents - nested web objects, page
schemas
• Several languages for wrapping, querying,
creating and updating web sites - object
algebra
• Methods and Techniques for Web Site
Design and Implementation
• Presentation in SIGMOD’99
• Software is available at their home site
[email protected], UMR
• Wrappers - map logical access to attribute
values in a page at the ADM level tp physical
access to text in the HTML source using
EDITOR
• ULIXES - SQL-like query languages
• PENELOPE - manipulation language
• Site integration, semantic heterogeneities
• Materialized views
• http://poincare.dia.uniroma3.it:8080/Araneous
[email protected], UMR
Lore - motivation
• The data may be irregular and thus not
conform to a rigid schema.
• Relational data model has null values, and
OO models have inheritance and complex
objects. Both have difficulties in designing
schemas to incorporate irregular data.
• It may be difficult to decide in advance on a
single, correct schema, The structure of the
data may evolve rapidly, data elements may
change types, or data not conforming to
previous structure may be added.
[email protected], UMR
• Thus, there is a need for management of
semi-structured data!
• Lore system manages semi-structured data. The
data managed by Lore is not confined to a
schema and it may be irregular or incomplete.
• OEM is the Lore’s data model. OEM - object
Exchange Model - graph based self-describing
object instance model where nodes are objects
and edges are labeled with attribute names and
leaf nodes have atomic values
• Lore is light weight object repository and Lorel
is Lore’s query [email protected],
language. UMR
Object Exchange Model - OEM
• Motivation - information exchange and
extraction
• Why a new data model? … it not a new
model.
•Each value exchanged is given an explicit
label.
Object temp-in-Fahrenheit, integer, 80 - “tempin-Fahrenheit” is the label. Each object is selfdescribing, with a label, type and value.
set-of-temps, set, {cmpnt1, cmpnt2} 
cmpnt1 is temp-in-Fahrenheit, integer, 80
cmpnt2 is temp-in-Celsius, integer, 20
[email protected], UMR
Labels
• Plays two roles
– identifying an object (component)
– identifying the meaning of an object (component)
person-record, set, {cmpnt1, cmpnt2, cmpnt3} 
cmpnt1 is person-name, string, ``Fred’’
cmpnt2 is office-num-in-bldg-5, integer, 333
cmpnt3 is department, string, ``toy’’
• Person-name both identifies cmpnt1 and
coveys its meaning.
• In relational data this corresponds to ….
[email protected], UMR
Labels - Issues
• What does the label mean?
– Database of labels
– Ontology of labels - within each source
•Labels are relative (more specific) to the
source of the data object.
•Similar labels from different sources need to
be resolved.
• Labels provide the flexibility in representing
object structure
[email protected], UMR
Self-describing data models
• Have been in existence for a long time? Why
additional interest now?
•Use the ``nature’’ of self-describing data model
for information exchange, and to extend the
model to include object nesting.
•To provide an appropriate object request
language (query facility)
[email protected], UMR
OEM - Specification
• Each object in OEM has the following
structure:
Label
Type
Value Object-ID
– Label: A variable character string describing
what the object represents.
– Type: The data type of the object’s value. Each is
either an atom type, or type set.
– Value: A variable-length value of the object.
– Object-ID: A unique variable-length identifier for
the object or null.
[email protected], UMR
OEM - Summary
• OEM is an information exchange model. It does
not specify how objects are stored at source.
• OEM does specify how objects are received at
a client, but after objects are received they can
be stored in any way the client likes.
• Each source has a distinguished object with
lexical identifier ``root’’.
• Note the schema-less nature of OEM is
particularly useful when a client does not
know in advance the labels or structure of
OEM objects. [email protected], UMR
• <biblio,set,{doc1,doc2,…,docn}>
• doc1 is <doc, set, {auths1, topic1, call• topic2 is <topic,
no1}>
string,``Algorithms’’>
•
auths1 is <auth-set,set {auth11}>
• call-no1 is <dewey•
auth11 is <auth-ln, string,
decimal, string,
``Ullman’’>
``BR273’’>
•
topic1 is <topic, string,``Databases’’>
• docn is <doc, set,
•
call-no1 is <internal-call-no, integer,
{authsn, topicn, call25>
non}>
• doc2 is <doc, set, {auths2, topic2, call•
authsn is
no2}>
<auth,string,
•
auths2 is <auth-set,set {auth21, auth22,
``Crichton’’>
auth23}>
• topic1 is <topic,
•
auth21 is <auth-ln, string, ``Aho’’>
string,``Dinosaurs’’>
•
auth22 is <auth-ln, string,
• call-no1 is <fictional``Hopcroft’’>
call-no, integer, 95>
•
auth23 is <auth-ln, [email protected],
string,
UMR
• biblio is the root object.
Example
OEM - QL
SELECT Fetch-expression
FROM Object
WHERE Condition
• The result of this query is itself an object, with
special label ``answer’’:
answer, set, {obj1, obj2, …, objn} 
• Each returned obji is a component of object
specified in the From clause of the query,
where the component is located by the Fetchexpression and satisfies the Condition.
[email protected], UMR
Path
• The notion of path is used in both FetchExpression in the Select clause and the condition in
the Where clause.
• Path describes traversals through an object using
subobject structure and labels.
• Example: ``biblio.doc.auth’’
• Paths are used in Fetch-Expression to specify
which components are are returned in the answer
object.
• Paths are used in the condition to qualify the
fetched objects or other (related) components in the
[email protected], UMR
same object structure.
Queries - Simple
• Retrieve the topic of each document for which
``Ullman’’ is one of the authors:
SELECT biblio.doc.topic
FROM root
WHERE biblio.doc.auth-set.auth-ln = ``Ullman’’
• Intuitively, the query’s where clause finds all paths
through subobject structure with the sequence of
labels [biblio,doc,auth-set,auth-ln] such that the
object at the end of the path has value ``Ullman.’’
<answer, set, {obj1, obj2}>
obj1 is <topic, string, ``Databases’’>
obj2 is <topic, string, “Algorithms”>
[email protected], UMR
Queries - ``wild-cards’’
• Retrieve all documents with internal call number:
SELECT biblio.?.topic
FROM root
WHERE biblio.?.internal-call-no
• ``?’’ label matches any label. For this query, the doc
labels can be replaced by any other strings and query
would produce the same result. By convention, two
occurrences of ? In the same query must match the
same label unless variables are used.
<answer, set, {obj1}>
obj1 is <topic, string, ``Databases’’>
[email protected], UMR
Queries - ``wild-paths’’
• Retrieve all documents with internal call number:
SELECT *.topic
FROM root
WHERE *.internal-call-no
• Symbol ``*’’ matches any path of length one or more.
The use of * followed by a single label is a
convenient and common way to locate objects with a
certain label in complex structure. Similar to ?, two
occurrences of * in the same query must match the
same sequence of labels, unless variables are used.
<answer, set, {obj1}>
obj1 is <topic, string, ``Databases’’>
[email protected], UMR
Queries - variables
• Retrieve each document for which both ``Hopcroft’’
and ``Aho’’ are co-authors:
SELECT biblio.doc
FROM root
WHERE biblio.doc.auth-set.auth-ln(a1)=``Aho’’ and
biblio.doc.auth-set.auth-ln(a1)=``Hopcroft’’
• Here, the query finds all the paths with structure
[biblio, doc, auth-set], and with two distinct path
completions with label auth with values ``Aho’’ and
``Hopcroft’’
<answer, set, {obj1}>
obj1 is the complete doc2
[email protected], UMR
An OEM Database
DBGroup
&1
Member
Member
Project
Member
Project
Member
&2
Name
Name
&7
“Clark”
&3
Age
&8
“Smith
”
&9
Name
Office
&4
&6
&15
&16
“Lore”
“Tsimmis
”
Project
Age
Office
&5
Project
Office
&10
&11
46 “Gates 252”
Building
&12
&13
“Jones”
28
Room
&14
Building
Room
&17
&18
&19
&20
“CIS”
“411”
“CIS”
252
[email protected], UMR
Lorel Queries - Simple Path
Expression
•Retrieve the offices of members with age greater than
30 years:
Query
SELECT DBGroup.Member.Office
WHERE DBGroup.Member.Age > 30
Result
Office “Gates 252”
Office
Building “CIS”
Room “411”
[email protected], UMR
Query
Queries - General Path
Expression
SELECT DBGroup.Member.Name
WHERE
DBGroup.Member.Office(.Room%|.Cubicle)?
Like “%252”
Result
Name “Jones”
Name “Smith”
•Room% matches all labels starting from Room,
like Room68. “|” stands for disjunction. “?”
indicates that the label pattern is optional. “like
%252” specifies that the data value should end
with string “252”.
[email protected], UMR
Queries - SubQueries
Retrieve Lore project members who work on other
projects
Query
SELECT M.Name,
( SELECT M.Project.Title
WHERE M.Project.Title != “Lore”)
FROM DBGroup.Member M
WHERE M.Project.Title = “Lore”
Result
Member
Name “Jones”
Title “Tsimmis”
[email protected], UMR
Lore - Summary
• Lore does facilitate query and updates on semistructural databases
• There has been more work done on
optimization using: data guides (vldb97).
• The system is up and running: http://WWWDB.Stanford.EDU/lore/demo/
• How is this related to WWW?
• XML-QL and related work provides the answer.
[email protected], UMR
Extraction and Integration
• OEM and subsequent LORE(L) can be used for
extracting information from multiple
information sources.
• OEM helps navigate through unknown objects
by
SELECT ?
FROM root
Thus help browsing and schema discovery
• Efficient implementations are possible using
partial fetch mechanism.
• Push and Pull information delivery systems are
[email protected], UMR
possible.
STRUDEL
• Web Site Management System
• web Site from multiple sources
• STruQL - based on OEM, graphs, regular
expressions, result as graph
• Example - return all the postscript papers
from homepages:
• Where homepages(p), p “paper”
q
• ispostscript(q) collect postscriptpages(p)
• Where C1,...Ck Create N1,...Nn link
L1,...Lp, Collect G1, …Gq
[email protected], UMR
Complex Constructors
Supported by Strudel: a Website Management System with
StruQL as query language
where Biblio(X), X -> “paper” -> P, P -> “author” ->A,
P -> “title” -> T, P -> “year” -> Y
create Root(), HomePage(A), YearPage(A,Y), PubPage(P)
link Root() -> “person” -> HomePage (A),
HomePage(A) ->”yearentry” -> YearPage(A,Y),
YearPage(A,Y) -> “publication” -> PubPage(P),
PubPage(P) -> “author” -> HomePage(A),
PubPage(P) -> “title” ->T
[email protected], UMR
WebDB
• View WWW as multimedia documents in
the form of web pages
• WQL supports selection, aggregation,
sorting, summary, grouping
• projection on title , URL, keywords, tables,
forms, images etc.
[email protected], UMR
Some More Results
• UnQL - AT&T
• AKIRA- Pennstate
• NoDose - SIGMOD’98
[email protected], UMR
HTML to XML
• HTML documents
• Emerging Web Standards - XML
• XML good for data interchange across
platforms enterprise wide
• conversion HTML to XML - IBM,
Microsoft
[email protected], UMR
XML - Motivation
• In HTML, both the tag semantics and tags are
fixed. There is limited and strict interpretation
of tags.
• HTML is widely successful in disseminating
documents across internet.
• Though data can be disseminated through
HTML, its extraction is painful, and laborious.
• EDI has been a predominate mode of
exchanging data among businesses. But it has
very rigid format that requires highly
customized applications.
[email protected], UMR
XML - Introduction
• XML aims to provide ease of authoring HTML
documents with ease of data exchange that is
possible with EDI.
• Tags are used to markup documents.
• XML is a meta-language for describing markup
languages.
• XML provides a facility to define tags and
structural relationships between them.
• No pre-defined tag set implied no preconceived
semantics, semantics of XML document is will
be defined by applications that process them or
[email protected], UMR
XML - Goals
• Straightforward to use over internet
• Support wide variety of applications, authoring,
browsing, content analysis, etc.
• Easy to write programs that process XML
documents and validate them.
• XML documents must be human-legible and
reasonably clear.
• Design of XML shall be formal and concise expressed as EBNF (extended Backus Naur
Form) - amenable to modern compiler tools and
techniques.
[email protected], UMR
XML-features
•
•
•
•
Some structure - not rigid
Extensibility - User defined tags
nested elements
validation - documents may specify their
own grammar
• DTP (Document Type Descriptor) - schema
exists with data as tag names
• Application -EDI - extraction, conversion, ,
transformation, integration
• can be modeled
using DOM
[email protected], UMR
More terminology
• RDF - Resource Description Framework - a
method to describe metdata for XML
documents
• XSL - Extensible Stylesheet Language language for transforming and formatting
XML.
• Transformation Language - XSLT, XPath,
XPointer
[email protected], UMR
Example-HTML
• Print - Sanjay Madria
Web Warehouse Tutorial, ADBIS’99
HTML
<H2> Sanjay Madria </H2>
<I> Web Warehouse Tutorial, ADBIS’99</I>
Very difficult to understand, structure is
hidden, describes only appearance
[email protected], UMR
XML
• <Ref>
<Speaker> <Firstname> Sanjay</firstname>
<Lastname> Madria</lastnaame>
</Speaker>
<Title > Web Warehouse Tutorial</Title>
<Conference> ADBIS’99</Conference>
</empty>
</Ref>
another format:
<Firstname Value “Sanjay”/>
[email protected], UMR
XML Data
• <book>
<title> database systems</title>
<author> John <lastname>
Korth</lastname></author>
<price currency = “USD”> 5.87</price>
</book>
• DTD
• <!ELEMENT book (title, author+, price)>
• <!ELEMENT title (#PCDATA)>
• <!ELEMENT author(#PCDATA)|lastname)*
[email protected], UMR
•
•
•
•
•
•
•
•
•
•
•
•
<tr> <td width="20%" valign="top"> Firma KarlHeinz Rosowski </td>
<td width="20%" valign="top"> Maikstraße 14 </td>
<td width="20%" valign="top"> 22041 Hamburg
HTML
</td>
Version
<td width="20%" valign="top"> 721 99 64 </td>
<td width="20%" valign="top"> 21110111 </td>
<?xml version="1.0"?>
</tr>
<Addresses>
<Address id="12359">
<Name>Firma Karl-Heinz
Rosowski</Name>
XML
<Street>Maikstraße 14</Street>
Version
<ZIP>22041</ZIP>
<City>Hamburg</City>
<Tel>721 99 64</Tel>
<Fax>21110111</Fax> <Email/>
</Address> …
[email protected], UMR
</Addresses>
XML - Document - Continued
•<?xml version="1.0"?> is the XML declaration.
•Elements:Most common form of markup.
<element> … </element>. For example
<name>Jack Lemon </name>
•Attributes: are name-value pairs that occur inside
start-tags after the element name. For example:
<Address id="12359"> attaches value 12359 to
attribute id of Address element.
•Entity References: to handle special characters of
XML like “<“ in the XML documents.
[email protected], UMR
• Comments: <!-- this is a comment --!>
• CDATA Sections: a CDATA (string of
characters) section instructs the parser to
ignore most markup characters. For
example source code, <![CDATA[ *p = &q;
b = (I <= 3);]]>, between [CDATA[ and ]]
all character data is passed to an
application, with out interpretation.
copy-right@sanjaymadria, UMR
XML - DTD - Element Type
Declarations
•Element type declarations: identify the names of
elements and the nature of their content. A typical
element type declaration looks like:
•<!Element Address (Name, Street, ZIP?, City,
Tel+, Fax*, Email?)>
•Address is the element name, and (Name, Street,
ZIP?, City, Tel+, Fax*, Email?) is the content
model. Every address must contain, Name, Street,
City and Tel. ZIP and Email are optional, whereas
there can be zero or more Fax numbers.
copy-right@sanjaymadria, UMR
• The declarations for Name, Street, ZIP …,
must also be given. For example
• <!Element Name (#PCDATA)>
• Attribute List Declarations: identify which
elements may have attributes, what values
the attributes may hold, and what value is
default. Attribute values appear only within
start-tags and empty-element tags.
• <Address id="12359">
copy-right@sanjaymadria, UMR
XML - Summary
• HTML describes presentation
• XML describes content
• XML vs. HTML
– users define new tags
– arbitrary nesting
– validation is possible
copy-right@sanjaymadria, UMR
XML and Semi Structural Data
Model
• XML data is fundamentally different than
relational and object oriented data.
• XML is not rigidly structured.
• In relational and OO data model every data
instance has a schema which is separate and
independent of the data.
• XML data is self describing and can naturally
model irregularities that cannot be modeled by
relational or OO data model.
copy-right@sanjaymadria, UMR
• For example, data items may have missing
elements or multiple occurrences of the
same element; elements may have atomic
values in some data items and structured
values in others; and collections of elements
can have heterogeneous structure.
• Even XML data that has an associated DTD
is self-describing (the schema is always
stored with the data) and, except for very
restricted forms of DTDs, may have all the
irregularities described above.
• XML is an instance of semistructured data.
copy-right@sanjaymadria, UMR
XML-QL
•
•
•
•
Regular path expression
pattern matching
used edge labeled graphs
extract data from existing XML documents
and construct new XML documents
• support for ordered and unordered views on
XML document
• simple and declarative
copy-right@sanjaymadria, UMR
XML-QL
• The simplest XML-QL queries extract data from
an XML document. Consider the following
DTD:
<!ELEMENT book (author+,title,publisher)>
<!ATTLIST Book year CDATA>
<!ELEMENT article (author+ title year?,
(shortversion |longversion))>
<!ATTLIST article type CDATA>
<!ELEMENT publisher (name, address)>
<!ELEMENT author (firstname?, lastname)>
copy-right@sanjaymadria, UMR
XML-QL Example Data
<bib>
<book year=“1995>
<title> An Introduction to DB Systems </title>
<author> <lastname> Date </lastname></author>
<publisher><name> Addison-Wesley</name>
</publisher>
</book>
<book year=“1995>
<title> Foundations for OR Databases </title>
<author> <lastname> Date </lastname></author>
<author> <lastname> Darwen
</lastname></author>
<publisher><name> Addison-Wesley</name>
</publisher>
</book>
</bib>
copy-right@sanjaymadria, UMR
Matching Data Using Patterns
• XML uses element patterns to match data in an XML document.
• Find all authors of books whose publisher is Addison-Wesley in
XML document www.a.b.c/bib.xml
WHERE <book>
<publisher><name>AddisonWesley</name></publisher>
<title> $t </title>
<author> $a </author>
</book> IN “www.a.b.c/bib.xml”
CONSTRUCT $a
matches every <book> element in the XML document that has
at least one <title> element, one <author> element , and one
publisher element whose <name> is Addison-Wesley. For
each such match it binds $t and $a to every title and author
copy-right@sanjaymadria, UMR
pair.
XML-QL Constructing XML Data
• Often we would like format the result.
• Find all authors and titles of books whose publisher is
Addison-Wesley in XML document www.a.b.c/bib.xml
WHERE <book>
<publisher><name>Addison-Wesley</></>
<title> $t </title>
<author> $a </author>
</book> IN “www.a.b.c/bib.xml”
CONSTRUCT
<result>
<author> $a </>
<title> $t </>
</>
copy-right@sanjaymadria, UMR
Constructing XML Data -cont.
Result of the query:
<result>
<author><lastname> Date </lastname></author>
<title> Introduction to Database Systems </title>
</result>
<result>
<author><lastname> Date </lastname></author>
<title> Foundations for OR Databases </title>
</result>
<result>
<author><lastname> Darwen </lastname></author>
<title> Foundations for OR Databases </title>
</result>
One result for each author, duplicating title information.
copy-right@sanjaymadria, UMR
XML-QL Nested Queries.
WHERE <book>
<title> $t </>
<publisher><name>Addison-Wesley</></>
</> CONTENT_AS $p IN “www.a.b.c/bib.xml”
CONSTRUCT <result>
<title> $t </>
WHERE <author> $a </> in $p
CONSTRUCT <author> $a </>
</>
<result>
<author><lastname> Date </lastname></author>
<title> Introduction to Database Systems </title>
</result>
<result>
<author><lastname> Date </lastname></author>
<author><lastname> Darwen </lastname></author>
<title> Foundations for OR Databases </title>
</result>
copy-right@sanjaymadria, UMR
XML-QL Join Queries
XML queries cab express “joins” by matching two or more
elements that contain same value. Find all articles that have at
least one author who has written a book since 1995.
WHERE <article>
<author>
<firstname> $f </> // firstname $f
<lastname> $l </> // lastname $l
</>
</> CONTENT_AS $a IN "www.a.b.c/bib.xml"
<book year=$y>
<author>
<firstname> $f </> // join on same firstname $f
<lastname> $l </> // join on same lastname $l
</>
</> IN "www.a.b.c/bib.xml",
y > 1995
CONSTRUCT <article> $a </>
copy-right@sanjaymadria, UMR
XML-QL Data Model for XML
•XML graph G in which each node is represented by a
unique string called object identifier (OID), G’s edges are
labelled with element tags, G’s nodes are labeled with sets
of attribute value pairs, G’s leaves are labeled with one
string value, and G has a distinguished node called root.
•
copy-right@sanjaymadria, UMR
XML-QL Data Model for XML
•The model allows several edges between the same two
nodes with the following restriction:
between any two nodes there can be at most one edge
with a given label
a node cannot have two leaf children with the same
label and same string value
•XML graphs are not only derived from XML
documents, but are also generated by queries.
copy-right@sanjaymadria, UMR
XML- Element Identity, Ids, and
IDREFS
•For element sharing XML reserves an attribute of
type ID which allows a unique key to be
associated with an element.
•An attribute of type IDREF allows an element to
refer to another element with the designated key,
and one of the type IDREFS may refer to multiple
elements.
copy-right@sanjaymadria, UMR
•
•
•
•
•
•
•
•
•
•
•
•
•
<!ATTLIST person ID #REQUIRED>
<!ATTLIST article author IDREFS #IMPLIED>
<person ID="o123">
<firstname>John</firstname>
<lastname>Smith<lastname>
</person>
<person ID="o234">
...
</person>
<article author="o123 o234">
<title> ... </title>
<year> 1995 </year>
</article>
copy-right@sanjaymadria, UMR
XML- Element Identity, Ids, and
IDREFS
copy-right@sanjaymadria, UMR
The following query produces all lastname, title pairs by joining the author
element's IDREF attribute value with the person element's ID attribute
value.
WHERE
<article author=$i>
<title> </> ELEMENT_AS $t
</>,
<person ID=$i>
<lastname> </> ELEMENT_AS $l
</>
CONSTRUCT <result> $t $l</>
The idiom <title></> ELEMENT_AS $t binds $t to a <title> element with
arbitrary contents. The element expression <title/> matches a <title>
element with empty contents.
copy-right@sanjaymadria, UMR
XML-QL- Advanced Examples
Tag Variables
Regular Path Expressions
Transforming XML Data (from one DTD to another)
Integrating Data from different XML sources
Embedding queries in data XML-QL
check http://www3.org/TR/NOTE-xml-ql
copy-right@sanjaymadria, UMR
Summary
•Even before you blink your eye…. Lot of work has gone in web
data models and query languages
•Some problems are addressed:
Semi-structural
semi-structural data model based query languages
schema inference from semi-structural data model
efficient processing of queries on semi-structural data
efficient indexing and storage structures
integration with XML
•Traditional
WebSQL/WebOQL
Web Warehousing
•Which way will you go?
copy-right@sanjaymadria, UMR
Further issues
•Distributed query processing
•Continuous result processing with push/pull result
replenishment
•Labels, labels every where, with XML more labels every where
… how are semantics of queries across multiple information
sources handled
•IR gives too many relevant/irrelevant results
•Query Processing requires some schema knowledge that is
difficult to handle across multiple sources
•Can these two be bridged? Cooperative solutions.
•Next Agents, Agents everywhere, What are they doing? Will it
work or Will it be a fad?
copy-right@sanjaymadria, UMR
Descargar

Web Warehousing : Design and Issues