Web Data Management
Sanjay Kumar Madria
Department of Computer Science
University of Missouri-Rolla
copy-right@sanjaymadria, UMR
• Huge, widely distributed,
heterogeneous collection of semistructured multimedia documents in
the form of web pages connected via hyperlinks
World Wide Web
• Web is fast growing
• More business organizations putting
information in the Web
• Business on the highway
• Myriad of raw data to be processed for
As WWW grows, more chaotic it becomes
• Web is fast growing, distributed, nonadministered global information resource
• WWW allows access to text, image, video,
sound and graphic data
• More business organizations creating web
• More chaotic environment to locate
information of interest
• Lost in hyperspace syndrome
Characteristics of WWW
• WWW is a set of directed graphs
• Data in the WWW has a heterogeneous
nature, self-describing and schema less
• Unstructured information, deeply nested
• No central authority to manage information
• Dynamic versus static information
• Web information discoveries - search
Web is Growing!
In 1994, WWW grew by 1758 % !!
June 1993 - 130
June 1994 - 1265
Dec. 1994 - 11,576
April 1995 - 15,768
July 1995 - 23,000+
2000 - !!!!!
‘COM’ domains are increasing!
• As of July 1995, 6.64 million host
computers on the Internet:
1.74 million are ‘com’ domains
1.41 million are ‘edu’ domains
0.30 million are ‘net’
0.27 million are ‘gov’
0.22 million are ‘mil’
0.20 million are ‘org’
Top web countries
1. Canada (1) 80%
2. US (4) 140%
3. Ireland (3) 110%
4. Iceland (2) 68%
5. UK (14) 336%
6. Malta (5) 155%
7. Australia (6) 133%
8. Singapore (11) 207%
9. New Zealand (7) 101%
10. Sweden (9) 101%
11. Israel (12) 112%
12. Cyprus (8) 72%
13. Hong Kong (15) 148%
14. Norway (10) 64%
15. Switzerland (13) 75%
16. Denmark (16) 105%
How users find web sites
Indexes and search engines
UseNet newsgroups
Cool lists
New lists
Print ads
Word-of-mouth and e-mail
Linked web advertisement
Limitations of Search Engines
• Do not exploit hyperlinks
• Search is limited to string matching
• Queries are evaluated on archived data
rather than up-to-date data; no indexing on
current data
• Low accuracy
• Replicated results
• No further manipulation possible
Limitations of Search Engines
ERROR 404!
No efficient document management
Query results cannot be further manipulated
No efficient means for knowledge discovery
• specifying/understanding what information
is wanted
• the high degree of variability of accessible
• the variability in conceptual vocabulary or
“ontology” used to describe information
• complexity of querying unstructured data
• complexity of querying structured data
• uncontrolled nature of web-based
information content
• determining which information sources to
Search Engine Capabilities
– Selection of language
– Keywords with disjunction, adjacency, presence, absence, ...
– Word stemming (Hotbot)
– Similarity search (Excite)
– Natural language (LycosPro)
– Restrict by modification date (Hotbot) or range of dates
– Restrict result types (e.g., must include images) (Hotbot)
– Restrict by geographical source (content or domain)
– Restrict within various structured regions of a document
(titles or URLs) (LycosPro); (summary, first heading, title,
URL) (Opentext)
Search Engines
[Table: search engine (e.g., Northern Light) vs. % web covered]
• Using several search engines is better
than using only one
• Source: Lawrence, S., and Giles, C.L., “Searching the World Wide
Web,” Science 280, pp. 98-100, 1998.
Key Objectives
• Design a suitable data model to represent
web information
• Development of web algebra and query
language, query optimization
• Maintenance of Web data - view
• Development of knowledge discovery and
web mining tools
• Web warehouse
• Data integration, secondary storage,
Web Data Representation
• HTML - Hypertext Markup Language
fixed grammar, no regular expressions
Simple representation of data
good for simple data
difficult to extract information
• SGML - Standard Generalized Markup
Language - good for publishing deeply
structured documents
• XML - Extensible Markup Language - a subset of SGML
HTML - Hypertext Mark-up Language
HTTP - Hypertext Transfer Protocol
URL - Uniform Resource Locator
example <URL> := <protocol>://<host>/<path>/<filename>[<#location>] where
– <protocol> is http, ftp, gopher
– <host> is the internet address …
– <#location> is a textual label in the file.
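The URL structure above can be sketched with Python's standard urllib.parse module. The URL below is hypothetical and chosen only so that every component is present.

```python
from urllib.parse import urlsplit

# Hypothetical URL, used only to illustrate the components.
url = "http://www.example.org/docs/guide/index.html#section2"

parts = urlsplit(url)
protocol = parts.scheme                          # <protocol>
host = parts.netloc                              # <host>
path, _, filename = parts.path.rpartition("/")   # <path> and <filename>
location = parts.fragment                        # <#location>, without the '#'

print(protocol, host, path, filename, location)
```

Splitting the path at its last "/" is a simplification: it assumes the final segment is a file name.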
• Links are specified as
<A HREF="Destination URL">Anchor Text</A>
• "Destination URL" is the URL of the destination
document and "Anchor Text" is the text that appears as an
anchor when displayed.
• Example:
• <A HREF="http://www.ntu.edu.sg/">Nanyang
Technological University</A>
• Absolute and relative
• URL <A HREF="AtlanticStates/NYStats.html">New
York</A> is relative
• <A
WWW/HTMLPrimer.html"> NCSA's Beginner's
Guide to HTML</A>
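A minimal sketch of how a program might collect links and resolve relative URLs against the page they appear on, using only the Python standard library. The base URL is a hypothetical assumption; the two anchors are taken from the examples above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class AnchorCollector(HTMLParser):
    """Collects (absolute URL, anchor text) pairs from <A HREF=...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

base = "http://www.example.org/usa/index.html"  # hypothetical base page
page = ('<A HREF="http://www.ntu.edu.sg/">Nanyang Technological University</A> '
        '<A HREF="AtlanticStates/NYStats.html">New York</A>')

collector = AnchorCollector(base)
collector.feed(page)
links = collector.links
```

The absolute URL is kept as written, while the relative one is resolved against the directory of the base page.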
World Wide Web
• Prevalent, persistent and informative
• HTML documents (soon, XML) created by
humans or applications.
• Accessed day in and day out by humans and
Persistent HTML documents!!!
Can database technology help?
Current Research Projects
• Web Query System
WebLog, Araneus
• Semistructured Data Management
– LOREL, UnQL, WebOQL, Florid
• Website Management System
– STRUDEL, Araneus
• Web Warehouse
Main Tasks
• Modeling and Querying the Web
– view web as directed graph
– content and link based queries
– example - find the pages that contain the word
“clinton” and are linked from a page
containing the word “monica”.
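The combined content and link query in the example can be sketched over a toy directed graph. The page identifiers, texts and links below are made up for illustration.

```python
# Toy web graph: page id -> (page text, outgoing links). All data is made up.
web = {
    "p1": ("monica interview transcript", ["p2", "p3"]),
    "p2": ("statement by clinton", []),
    "p3": ("weather report", ["p2"]),
    "p4": ("clinton visits europe", []),
}

def pages_linked_from(web, source_word, target_word):
    """Pages containing target_word that are linked from a page containing source_word."""
    result = set()
    for url, (text, links) in web.items():
        if source_word in text.split():
            for target in links:
                target_text = web.get(target, ("", []))[0]
                if target_word in target_text.split():
                    result.add(target)
    return sorted(result)

answer = pages_linked_from(web, "monica", "clinton")
print(answer)
```

Page p4 mentions the target word but is not returned, because no page containing the source word links to it. That link condition is what a plain keyword search cannot express.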
• Information Extraction and integration
– wrapper - program to extract a structured
representation of the data; a set of tuples from
HTML pages.
– mediator - software that integrates data by
accessing multiple sources through a uniform interface
• Web Site Construction and Restructuring
– creating sites
– modeling the structure of web sites
– restructuring data
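A wrapper in the sense above can be sketched as a small parser that turns the rows of an HTML table into tuples. This is an illustrative assumption about page layout, not a general-purpose wrapper; the table content is made up.

```python
from html.parser import HTMLParser

class TableWrapper(HTMLParser):
    """Turns each <tr> of an HTML table into a tuple of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(tuple(self._row))
            self._row = None

# Hypothetical page fragment.
html = """
<table>
<tr><td>Web Warehouse Tutorial</td><td>ADBIS'99</td></tr>
<tr><td>Another Talk</td><td>SomeConf'98</td></tr>
</table>
"""

wrapper = TableWrapper()
wrapper.feed(html)
tuples = wrapper.rows
```

Real wrappers have to cope with irregular markup; the sketch assumes well-nested tr and td tags.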
User Interface
What to Model
• Structure of Web sites
• Internal structure of web pages
• Contents of web sites in finer granularities
Data Representation of Web Data
• Graph Data Models
• Semistructured Data Models (also graph based)
Graph Data Model
• Labeled graph data model where nodes
represent web pages and arcs represent
links between pages.
• Labels on arcs can be viewed as attribute names
• Regular path expression queries
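A regular path expression query over a labeled graph can be sketched by matching the sequence of arc labels along each path against a regular expression. The graph, the dotted label notation and the depth bound are all assumptions made for this illustration.

```python
import re

# Labeled graph: node -> list of (arc label, target node). Made-up data.
graph = {
    "home": [("dept", "cs"), ("news", "n1")],
    "cs": [("course", "db"), ("course", "os")],
    "db": [("paper", "p1")],
    "os": [],
    "n1": [],
    "p1": [],
}

def match_paths(graph, start, pattern, max_depth=5):
    """Nodes reachable from start along paths whose label sequence matches pattern.

    The label sequence is written as 'label1.label2...' and matched as a
    whole against the regular expression `pattern`. max_depth bounds the
    search so that cyclic graphs terminate.
    """
    regex = re.compile(pattern)
    found = set()
    stack = [(start, [])]
    while stack:
        node, labels = stack.pop()
        if labels and regex.fullmatch(".".join(labels)):
            found.add(node)
        if len(labels) < max_depth:
            for label, target in graph.get(node, []):
                stack.append((target, labels + [label]))
    return sorted(found)

courses = match_paths(graph, "home", r"dept\.course")
papers = match_paths(graph, "home", r"(\w+\.)*paper")
```

The second pattern reaches any node at the end of a path that finishes with a "paper" arc, whatever labels precede it.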
Semistructured Data Models
• Irregular data structure, no fixed schema
known and may be implicit in the data
• Schema may be large and may change
• Schema is descriptive rather than
prescriptive; it describes the current state of
the data, but violations of the schema are still
• Data is not strongly typed; for different
objects the values of the same attributes
may be of differing types. (heterogeneous
• No restriction on the set of arcs that
emanate from a given node in a graph or on
the types of the values of attributes
• Ability to query the schemas; arc variables
which get bound to labels on arcs, rather
than nodes in the graph
Graph based Query Languages
• Use graph to model databases
• Support regular path expressions and graph
construction in queries.
• Examples
Graph Log for hypertext queries
graph query language for OO
Query Languages for SemiStructured data
• Use labeled graphs
• Query the schema of data
• Ability to accommodate irregularities in the
data, such as missing links etc.
• Examples : Lorel (Stanford) , UnQL
Comparison of Query Systems
System   Data model   Lang. style   Path exp.
WebSQL
Relational
W3QS
Yes
WebLOG   Relational
Datalog
Lorel
Yes
WebOQL
Yes
Yes
UnQL
Recursion
Yes
Yes
Florid
Strudel
Araneus
Whoweda
F-logic
Datalog
Yes
Datalog
Yes
page schemes   SQL
Yes
relational
Yes
Graph
Yes
Yes
Yes
Types of Query Languages
• First Generation
• Second generation
First Generation Query
• Combine the content-based queries of
search engines with structure-based queries
• Combine conditions on text pattern in
documents with graph pattern describing
link structures
• Examples - W3QL (TECHNION, Israel)
WebSQL (Toronto), WebLOG (Concordia)
Second generation languages
• Called web data manipulation languages
• Web pages as atomic objects with properties
that they contain or do not contain certain
text patterns and they point to other objects
• Useful for data wrapping, transformation,
and restructuring
• Useful for web site transformation and
How they Differ?
• Provide access to the structure of web
objects they manipulate - return structure
• Model internal structures of web documents
as well as the external links that connect
• Support references to model hyperlinks and
some support to ordered collections of
records for more natural data representation
• Ability to create new complex structures as
a result of a query
• Web OQL
• Florid
W3QS (WWW Query System) at
Technion - Israel
• Content queries
• Structural Queries
• Interfacing with user written programs and
standard UNIX utilities
• Uses existing WWW indexes and search
• Provides view update facility
• Accessible via any WWW browsers
• API can be used by programs running
anywhere in the Internet
• Support queries on the web structure by
specifying starting page, a search domain
and depth of links.
• File content analysis tools and filling up of
forms automatically
File Types
• Strict Inner Structure files such as Unix
environment files - Semantics of the data is
clearly linked to the syntax
• Semi-structured files - text files containing
formatting codes such as Latex or HTML
files- possible to use formatting codes to
analyze their semantic content
• Raw Files - no relation between meaning of
file and its inner structure
Content Queries
• Queries based on the content of a single
node of hypertext
• SQLCOND is used to evaluate boolean
• Example - node.format = Latex and
node.author = “Sanjay”
Structure Queries
• Queries on the information conveyed by the hypertext
organization itself.
• The result is a set of nodes and links from
the hypertext structure that satisfy a given
graph pattern: a graph whose nodes and edges
are annotated with conditions.
• Components are pattern definition, search
engines and form completion
=“Good article”
= “Sanjay”
URL http://../myarticles.html
URL http:///.tex
<Title> Good articles</Title>
\author {sanjay}
A HREF=//..rev=“doc”>
Search for an article
• Select cp n2/* result
from n1, l2, n2
where n1 in importantindexes.url
Fill n1.form as IN importantindexes.fil with
Keyword = “sanjay” SQLCOND (n2.format
= Latex) and (n2.author = “sanjay”)
Query to search hypertext pattern
• Return all the articles cited in the first chapter
of the book. Each chapter includes several
pointers to the bibliography, for example
• <A HREF="http://cs…/references.html#ref2">
[Relativity]</A> means link [Relativity] leads to
the label ref2 in the references.html file.
In the references.html file the labeled link looks
like <A HREF="./relative.tex" NAME="ref2">
[relativity, sanjay]</A> this link points to
• Select cp art/* result from Ind,
l1,chap,l2,ref,l3 art where SQLCOND
(ind.url = “http://) And (chap.url =/.chapter1.html/) AND l2.HREF = /.\#13.Name/)
Chapter 1
ref 1
ref 2
Chapter 2
ref 3
ref 1
ref 2
ref 3
WebSQL-University of Toronto
• Model web as relational database
• Use two relations Document and Anchor
• Document relation has one tuple for each
document in the web and the Anchor relation
has one tuple for each anchor in each document
• SQL-like query language for extracting
information from the web.
• Capable of systematic processing of
either all the links in a page, all the
pages that can be reached from a given
URL through paths that match a pattern,
or a combination of both.
• Provides transparent access to index
Url       Title     Text     Length   Type   Modif
http://   Title 1   Text 1   1234     text   1-1-96
http://   Title 2   Text 2   2345     text   2-3-97
Base   Label   Href
http://
Label 1
http://
http://
Label 1
Label 2
http://
http://
• Give the URLs of documents that have the
same title and mention the same keyword(s)
• SELECT d1.url, d2.url
FROM Document d1 SUCH THAT d1 MENTIONS
“keyword1”, Document d2 SUCH THAT d2
MENTIONS “keyword1”
WHERE d1.title = d2.title
AND NOT (d1.url = d2.url)
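The relational view behind this query can be emulated with Python's sqlite3 module. This is only a sketch of the idea, not WebSQL itself: the Document rows are made up, and MENTIONS is approximated with a substring test.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Document (url TEXT, title TEXT, text TEXT)")

# Made-up sample pages.
conn.executemany("INSERT INTO Document VALUES (?, ?, ?)", [
    ("http://a.example/1", "Web Data", "notes on keyword1 and graphs"),
    ("http://b.example/2", "Web Data", "keyword1 appears here too"),
    ("http://c.example/3", "Web Data", "no match in this text"),
])

# MENTIONS is approximated with LIKE; the self-join pairs up documents
# with equal titles and different URLs.
rows = conn.execute("""
    SELECT d1.url, d2.url
    FROM Document d1, Document d2
    WHERE d1.text LIKE '%keyword1%'
      AND d2.text LIKE '%keyword1%'
      AND d1.title = d2.title
      AND NOT (d1.url = d2.url)
    ORDER BY d1.url
""").fetchall()
print(rows)
```

Each matching pair appears twice, once in each order, because the query is symmetric in d1 and d2.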
Find Labels of all Hyperlinks to
Postscript Files
SELECT a.label
FROM Anchor a SUCH THAT base =
WHERE a.href CONTAINS ".ps.Z";
Documents about Databases
SELECT d.url, d.title
FROM Document d SUCH THAT
"http://www.OtherDoc.html" ->|=> d
WHERE d.title CONTAINS "databases";
Note :
-> path of length one within the same server
=> path of length one to a different server
Retrieve all the documents in the
same server that are pointed to
from the document
Whose URL is given
• Select d.url, d.title from
Document d SUCH THAT
“http//www. Cs.in -> d
Find all broken links in a page
• SELECT a.href
FROM Anchor a SUCH THAT base =
WHERE protocol(a.href) = "http" AND
doc(a.href) = null;
Web OQL (University of Toronto)
• Provides a framework that supports a large
class of data restructuring operations.
• Simple semistructured data model for
documents and record-based data
• OQL-like syntax and regular expressions
• Serves as a two-way bridge between
databases and the Web.
• Hypertrees are ordered arc-labeled trees
with two types of arcs: internal and external
• Internal arcs represent structured objects
• External arcs represent references
(hyperlinks) among objects.
• Records as labels in the arcs
• Sets of related hypertrees as Web
Wrappers map all data sources to trees
The mapping can be done all at once or
on demand
• Extract from cspapers (paper database) title
and URL of the full version of papers of
• select [y.title, y’.URL]
from x in cspapers, y in x’
where y.authors ~ “smith”
Web Creation
• Create a new page for each research Group
(using the group name as URL). Each page
contains the publications of the
corresponding group.
• Select x’ as x.group from x in cspapers
• Select q1 as s1, q2 as s2, ...qm as sm
• where the q’s are queries and each s is either
a string query or keyword schema. The “as”
clause creates the URLs s1, ..., sm assigned to
the new pages resulting from each query.
• Data Model called ADM for Web
Documents - nested web objects, page
• Several languages for wrapping, querying,
creating and updating web sites - object
• Methods and Techniques for Web Site
Design and Implementation
• Presentation in SIGMOD’99
• Software is available at their home site
• Wrappers - map logical access to attribute
values in a page at the ADM level to physical
access to text in the HTML source using
• ULIXES - SQL-like query language
• PENELOPE - manipulation language
• Site integration, semantic heterogeneities
• Materialized views
• http://poincare.dia.uniroma3.it:8080/Araneous
Lore - motivation
• The data may be irregular and thus not
conform to a rigid schema.
• Relational data model has null values, and
OO models have inheritance and complex
objects. Both have difficulties in designing
schemas to incorporate irregular data.
• It may be difficult to decide in advance on a
single, correct schema. The structure of the
data may evolve rapidly, data elements may
change types, or data not conforming to
previous structure may be added.
• Thus, there is a need for management of
semi-structured data!
• Lore system manages semi-structured data. The
data managed by Lore is not confined to a
schema and it may be irregular or incomplete.
• OEM is Lore’s data model. OEM - Object
Exchange Model - is a graph-based, self-describing
object instance model where nodes are objects,
edges are labeled with attribute names, and
leaf nodes have atomic values
• Lore is a lightweight object repository and Lorel
is Lore’s query language.
Object Exchange Model - OEM
• Motivation - information exchange and
• Why a new data model? … it not a new
•Each value exchanged is given an explicit
Object <temp-in-Fahrenheit, integer, 80> - “temp-in-Fahrenheit” is the label. Each object is self-describing, with a label, type and value.
<set-of-temps, set, {cmpnt1, cmpnt2}>
cmpnt1 is <temp-in-Fahrenheit, integer, 80>
cmpnt2 is <temp-in-Celsius, integer, 20>
• Plays two roles
– identifying an object (component)
– identifying the meaning of an object (component)
<person-record, set, {cmpnt1, cmpnt2, cmpnt3}>
cmpnt1 is <person-name, string, ``Fred’’>
cmpnt2 is <office-num-in-bldg-5, integer, 333>
cmpnt3 is <department, string, ``toy’’>
• Person-name both identifies cmpnt1 and
conveys its meaning.
• In relational data this corresponds to ….
Labels - Issues
• What does the label mean?
– Database of labels
– Ontology of labels - within each source
•Labels are relative (more specific) to the
source of the data object.
•Similar labels from different sources need to
be resolved.
• Labels provide the flexibility in representing
object structure
Self-describing data models
• Have been in existence for a long time. Why the
additional interest now?
•Use the ``nature’’ of self-describing data model
for information exchange, and to extend the
model to include object nesting.
•To provide an appropriate object request
language (query facility)
OEM - Specification
• Each object in OEM has the following
structure: Label, Type, Value, Object-ID
– Label: A variable character string describing
what the object represents.
– Type: The data type of the object’s value. Each is
either an atom type, or type set.
– Value: A variable-length value of the object.
– Object-ID: A unique variable-length identifier for
the object or null.
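The four-part OEM object can be sketched as a small Python data class. The class name, field names and the describe helper are assumptions made for this illustration; the sample values are the temperature objects used in the slides.

```python
from dataclasses import dataclass
from typing import List, Optional, Union

@dataclass
class OEMObject:
    label: str                                    # what the object represents
    type: str                                     # an atomic type name, or "set"
    value: Union[int, str, List["OEMObject"]]     # atom, or list of sub-objects
    oid: Optional[str] = None                     # object-ID, may be null

cmpnt1 = OEMObject("temp-in-Fahrenheit", "integer", 80, oid="cmpnt1")
cmpnt2 = OEMObject("temp-in-Celsius", "integer", 20, oid="cmpnt2")
temps = OEMObject("set-of-temps", "set", [cmpnt1, cmpnt2])

def describe(obj, indent=0):
    """Render an object as nested <label, type, value> lines."""
    pad = "  " * indent
    if obj.type == "set":
        lines = [f"{pad}<{obj.label}, set>"]
        for child in obj.value:
            lines.extend(describe(child, indent + 1))
        return lines
    return [f"{pad}<{obj.label}, {obj.type}, {obj.value}>"]

print("\n".join(describe(temps)))
```

Because every object carries its own label and type, a client can walk the structure without knowing a schema in advance.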
OEM - Summary
• OEM is an information exchange model. It does
not specify how objects are stored at source.
• OEM does specify how objects are received at
a client, but after objects are received they can
be stored in any way the client likes.
• Each source has a distinguished object with
lexical identifier ``root’’.
• Note the schema-less nature of OEM is
particularly useful when a client does not
know in advance the labels or structure of
OEM objects.
• <biblio, set, {doc1, doc2, …, docn}>
doc1 is <doc, set, {auths1, topic1, call-no1}>
auths1 is <auth-set, set, {auth11}>
auth11 is <auth-ln, string, ``Ullman’’>
topic1 is <topic, string, ``Databases’’>
call-no1 is <internal-call-no, integer, 25>
doc2 is <doc, set, {auths2, topic2, call-no2}>
auths2 is <auth-set, set, {auth21, auth22, auth23}>
auth21 is <auth-ln, string, ``Aho’’>
auth22 is <auth-ln, string, ``Hopcroft’’>
auth23 is <auth-ln, string, ``Ullman’’>
topic2 is <topic, string, ``Algorithms’’>
call-no2 is <dewey-decimal, string, …>
…
docn is <doc, set, {authsn, topicn, call-non}>
authsn is …
topicn is <topic, …>
call-non is <fictional-call-no, integer, 95>
• biblio is the root object.
SELECT Fetch-expression
FROM Object
WHERE Condition
• The result of this query is itself an object, with
special label ``answer’’:
<answer, set, {obj1, obj2, …, objn}>
• Each returned obji is a component of the object
specified in the From clause of the query,
where the component is located by the Fetch-expression and satisfies the Condition.
• The notion of path is used in both the Fetch-expression in the Select clause and the condition in
the Where clause.
• Path describes traversals through an object using
subobject structure and labels.
• Example: ``biblio.doc.auth’’
• Paths are used in the Fetch-expression to specify
which components are returned in the answer
• Paths are used in the condition to qualify the
fetched objects or other (related) components in the
same object structure.
Queries - Simple
• Retrieve the topic of each document for which
``Ullman’’ is one of the authors:
SELECT biblio.doc.topic
FROM root
WHERE biblio.doc.auth-set.auth-ln = ``Ullman’’
• Intuitively, the query’s where clause finds all paths
through subobject structure with the sequence of
labels [biblio,doc,auth-set,auth-ln] such that the
object at the end of the path has value ``Ullman.’’
<answer, set, {obj1, obj2}>
obj1 is <topic, string, ``Databases’’>
obj2 is <topic, string, “Algorithms”>
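A minimal sketch of how this query could be evaluated over a nested structure. The tuple encoding, the function names, and the reading that the shared prefix biblio.doc refers to the same document in both clauses are assumptions; the data is a simplified stand-in for the biblio example.

```python
# Each object is (label, value); a set value is a list of sub-objects.
biblio = ("biblio", [
    ("doc", [
        ("auth-set", [("auth-ln", "Ullman")]),
        ("topic", "Databases"),
        ("internal-call-no", 25),
    ]),
    ("doc", [
        ("auth-set", [("auth-ln", "Aho"), ("auth-ln", "Hopcroft"),
                      ("auth-ln", "Ullman")]),
        ("topic", "Algorithms"),
    ]),
])

def follow(obj, labels):
    """Yield every object reached from obj by the given label sequence."""
    if not labels:
        yield obj
        return
    _, value = obj
    if isinstance(value, list):
        for child in value:
            if child[0] == labels[0]:
                yield from follow(child, labels[1:])

def topics_by_author(root, name):
    """Emulates SELECT biblio.doc.topic WHERE biblio.doc.auth-set.auth-ln = name."""
    answer = []
    for doc in follow(root, ["doc"]):
        authors = [value for _, value in follow(doc, ["auth-set", "auth-ln"])]
        if name in authors:
            answer.extend(follow(doc, ["topic"]))
    return answer

answer = topics_by_author(biblio, "Ullman")
print(answer)
```

The where path is evaluated per document, so a topic is returned only when the matching author sits under the same doc object.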
Queries - ``wild-cards’’
• Retrieve all documents with internal call number:
SELECT biblio.?.topic
FROM root
WHERE biblio.?.internal-call-no
• ``?’’ label matches any label. For this query, the doc
labels can be replaced by any other strings and query
would produce the same result. By convention, two
occurrences of ? in the same query must match the
same label unless variables are used.
<answer, set, {obj1}>
obj1 is <topic, string, ``Databases’’>
Queries - ``wild-paths’’
• Retrieve all documents with internal call number:
SELECT *.topic
FROM root
WHERE *.internal-call-no
• Symbol ``*’’ matches any path of length one or more.
The use of * followed by a single label is a
convenient and common way to locate objects with a
certain label in complex structure. Similar to ?, two
occurrences of * in the same query must match the
same sequence of labels, unless variables are used.
<answer, set, {obj1}>
obj1 is <topic, string, ``Databases’’>
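The two wildcards can be sketched as a recursive matcher over a nested structure. The tuple encoding and the function name are assumptions, the data is a reduced stand-in for the biblio example, and the sketch covers only path matching, not the rule that repeated wildcards must bind to the same labels.

```python
# Each object is (label, value); a set value is a list of sub-objects.
root = ("root", [
    ("biblio", [
        ("doc", [("topic", "Databases"), ("internal-call-no", 25)]),
        ("doc", [("topic", "Algorithms"), ("fictional-call-no", 95)]),
    ]),
])

def match(obj, pattern):
    """Yield objects reached from obj by a pattern of labels, '?' and '*'.

    '?' matches any single label; '*' matches any path of one or more labels.
    """
    if not pattern:
        yield obj
        return
    head, rest = pattern[0], pattern[1:]
    _, value = obj
    if not isinstance(value, list):
        return
    for child in value:
        if head == "*":
            yield from match(child, rest)       # '*' ends at this child
            yield from match(child, pattern)    # '*' continues deeper
        elif head == "?" or head == child[0]:
            yield from match(child, rest)

with_question = [v for _, v in match(root, ["biblio", "?", "topic"])]
with_star = [v for _, v in match(root, ["*", "topic"])]
call_numbers = [v for _, v in match(root, ["*", "internal-call-no"])]
```

Both patterns reach the same topic objects here; the "*" form does so without naming any intermediate label.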
Queries - variables
• Retrieve each document for which both ``Hopcroft’’
and ``Aho’’ are co-authors:
SELECT biblio.doc
FROM root
WHERE biblio.doc.auth-set.auth-ln(a1)=``Aho’’ and
biblio.doc.auth-set.auth-ln(a2)=``Hopcroft’’
• Here, the query finds all the paths with structure
[biblio, doc, auth-set], and with two distinct path
completions with label auth-ln with values ``Aho’’ and
``Hopcroft’’
<answer, set, {obj1}>
obj1 is the complete doc2
An OEM Database
46 “Gates 252”
Lorel Queries - Simple Path
•Retrieve the offices of members with age greater than
30 years:
SELECT DBGroup.Member.Office
WHERE DBGroup.Member.Age > 30
Office “Gates 252”
Building “CIS”
Room “411”
Queries - General Path
SELECT DBGroup.Member.Name
Like “%252”
Name “Jones”
Name “Smith”
•Room% matches all labels starting with Room,
like Room68. “|” stands for disjunction. “?”
indicates that the label pattern is optional. “like
%252” specifies that the data value should end
with the string “252”.
Queries - SubQueries
Retrieve Lore project members who work on other
( SELECT M.Project.Title
WHERE M.Project.Title != “Lore”)
FROM DBGroup.Member M
WHERE M.Project.Title = “Lore”
Name “Jones”
Title “Tsimmis”
Lore - Summary
• Lore facilitates queries and updates on semistructured databases
• More work has been done on
optimization using DataGuides (VLDB’97).
• The system is up and running: http://WWWDB.Stanford.EDU/lore/demo/
• How is this related to WWW?
• XML-QL and related work provides the answer.
Extraction and Integration
• OEM and subsequent LORE(L) can be used for
extracting information from multiple
information sources.
• OEM helps navigate through unknown objects
FROM root
Thus help browsing and schema discovery
• Efficient implementations are possible using
partial fetch mechanism.
• Push and Pull information delivery systems are
• Web Site Management System
• web Site from multiple sources
• StruQL - based on OEM, graphs, regular
expressions, result as graph
• Example - return all the postscript papers
from homepages:
• Where homepages(p), p -> “paper” -> q,
ispostscript(q) collect postscriptpages(q)
• Where C1,...Ck Create N1,...Nn link
L1,...Lp, Collect G1, …Gq
Complex Constructors
Supported by Strudel: a Website Management System with
StruQL as query language
where Biblio(X), X -> “paper” -> P, P -> “author” ->A,
P -> “title” -> T, P -> “year” -> Y
create Root(), HomePage(A), YearPage(A,Y), PubPage(P)
link Root() -> “person” -> HomePage (A),
HomePage(A) ->”yearentry” -> YearPage(A,Y),
YearPage(A,Y) -> “publication” -> PubPage(P),
PubPage(P) -> “author” -> HomePage(A),
PubPage(P) -> “title” ->T
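The create and link clauses above can be emulated in a few lines of Python, with each created node represented as a tuple that plays the role of a Skolem term. This is a sketch of the idea rather than of Strudel itself, and the input facts are made up.

```python
# Input facts (paper id, author, title, year), standing in for the bindings
# of P, A, T, Y produced by the where clause. The rows are made up.
papers = [
    ("p1", "Date", "An Introduction to DB Systems", 1995),
    ("p2", "Date", "Foundations for OR Databases", 1995),
    ("p2", "Darwen", "Foundations for OR Databases", 1995),
]

nodes = set()   # created nodes, e.g. ("HomePage", "Date")
links = set()   # (source node, label, target) triples

root = ("Root",)
nodes.add(root)
for p, a, t, y in papers:
    home, year_page, pub = ("HomePage", a), ("YearPage", a, y), ("PubPage", p)
    nodes.update([home, year_page, pub])
    links.add((root, "person", home))
    links.add((home, "yearentry", year_page))
    links.add((year_page, "publication", pub))
    links.add((pub, "author", home))
    links.add((pub, "title", t))

print(len(nodes), len(links))
```

Using sets means that the same page is created only once even when several bindings mention it, which is the effect of applying a constructor such as HomePage(A) to equal arguments.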
• View WWW as multimedia documents in
the form of web pages
• WQL supports selection, aggregation,
sorting, summary, grouping
• projection on title , URL, keywords, tables,
forms, images etc.
Some More Results
• UnQL - AT&T
• AKIRA- Pennstate
• NoDose - SIGMOD’98
• HTML documents
• Emerging Web Standards - XML
• XML good for data interchange across
platforms enterprise wide
• conversion HTML to XML - IBM,
XML - Motivation
• In HTML, both the tag semantics and tags are
fixed. There is limited and strict interpretation
of tags.
• HTML is widely successful in disseminating
documents across internet.
• Though data can be disseminated through
HTML, its extraction is painful, and laborious.
• EDI has been a predominant mode of
exchanging data among businesses. But it has
very rigid format that requires highly
customized applications.
XML - Introduction
• XML aims to combine the ease of authoring HTML
documents with the ease of data exchange that is
possible with EDI.
• Tags are used to markup documents.
• XML is a meta-language for describing markup
• XML provides a facility to define tags and
structural relationships between them.
• No predefined tag set implies no preconceived
semantics; the semantics of an XML document will
be defined by the applications that process it or
XML - Goals
• Straightforward to use over internet
• Support wide variety of applications, authoring,
browsing, content analysis, etc.
• Easy to write programs that process XML
documents and validate them.
• XML documents must be human-legible and
reasonably clear.
• Design of XML shall be formal and concise - expressed as EBNF (Extended Backus-Naur
Form) - amenable to modern compiler tools and
Some structure - not rigid
Extensibility - User defined tags
nested elements
validation - documents may specify their
own grammar
• DTD (Document Type Definition) - schema
exists with data as tag names
• Application - EDI - extraction, conversion,
transformation, integration
• can be modeled
using DOM
More terminology
• RDF - Resource Description Framework - a
method to describe metadata for XML
• XSL - Extensible Stylesheet Language - language for transforming and formatting
• Transformation Language - XSLT, XPath,
• Print - Sanjay Madria
Web Warehouse Tutorial, ADBIS’99
<H2> Sanjay Madria </H2>
<I> Web Warehouse Tutorial, ADBIS’99</I>
Very difficult to understand, structure is
hidden, describes only appearance
• <Ref>
<Speaker> <Firstname> Sanjay</Firstname>
<Lastname> Madria</Lastname> </Speaker>
<Title> Web Warehouse Tutorial</Title>
<Conference> ADBIS’99</Conference>
</Ref>
another format:
<Firstname Value="Sanjay"/>
XML Data
• <book>
<title> database systems</title>
<author> John <lastname>
<price currency = “USD”> 5.87</price>
• <!ELEMENT book (title, author+, price)>
• <!ELEMENT title (#PCDATA)>
• <!ELEMENT author (#PCDATA | lastname)*>
<tr> <td width="20%" valign="top"> Firma Karl-Heinz Rosowski </td>
<td width="20%" valign="top"> Maikstraße 14 </td>
<td width="20%" valign="top"> 22041 Hamburg </td>
<td width="20%" valign="top"> 721 99 64 </td>
<td width="20%" valign="top"> 21110111 </td> </tr>
<?xml version="1.0"?>
<Address id="12359">
<Name>Firma Karl-Heinz Rosowski</Name>
<Street>Maikstraße 14</Street>
<Tel>721 99 64</Tel>
<Fax>21110111</Fax> <Email/>
</Address> …
XML - Document - Continued
•<?xml version="1.0"?> is the XML declaration.
•Elements:Most common form of markup.
<element> … </element>. For example
<name>Jack Lemon </name>
•Attributes: are name-value pairs that occur inside
start-tags after the element name. For example:
<Address id="12359"> attaches value 12359 to
attribute id of Address element.
•Entity References: to handle special characters of
XML like “<“ in the XML documents.
• Comments: <!-- this is a comment -->
• CDATA Sections: a CDATA (string of
characters) section instructs the parser to
ignore most markup characters. For
example, in the source code <![CDATA[ *p = &q;
b = (i <= 3);]]>, all character data between
<![CDATA[ and ]]> is passed to the
application without interpretation.
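How a parser treats comments, CDATA sections and entity references can be checked with Python's standard xml.etree.ElementTree. The element names snippet, code and expr are hypothetical.

```python
import xml.etree.ElementTree as ET

doc = """<?xml version="1.0"?>
<snippet>
  <!-- this is a comment -->
  <code><![CDATA[*p = &q; b = (i <= 3);]]></code>
  <expr>i &lt;= 3</expr>
</snippet>"""

root = ET.fromstring(doc)

# The CDATA content reaches the application verbatim, '&' and '<' included.
code_text = root.find("code").text

# Outside CDATA, '<' must be written as the entity reference &lt;.
expr_text = root.find("expr").text

print(code_text)
print(expr_text)
```

The comment does not appear among the child elements of the root; the default parser simply drops it.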
XML - DTD - Element Type
•Element type declarations: identify the names of
elements and the nature of their content. A typical
element type declaration looks like:
•<!ELEMENT Address (Name, Street, ZIP?, City,
Tel+, Fax*, Email?)>
•Address is the element name, and (Name, Street,
ZIP?, City, Tel+, Fax*, Email?) is the content
model. Every address must contain Name, Street,
City and at least one Tel. ZIP and Email are optional, whereas
there can be zero or more Fax numbers.
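The occurrence indicators ?, + and * behave like their regular-expression counterparts, which the following sketch exploits. It is not a DTD validator (the Python standard library does not provide one); it handles only flat, comma-separated content models, and the function names are assumptions.

```python
import re
import xml.etree.ElementTree as ET

def model_to_regex(model):
    """Translate a flat sequence model like '(Name, ZIP?, Tel+)' to a regex.

    Only comma-separated names with an optional ?, + or * suffix are
    handled; choices and nested groups are out of scope for this sketch.
    """
    parts = []
    for item in model.strip("() ").split(","):
        item = item.strip()
        suffix = item[-1] if item[-1] in "?+*" else ""
        name = item.rstrip("?+*")
        parts.append(f"(?:{re.escape(name)} ){suffix}")
    return re.compile("".join(parts))

ADDRESS = model_to_regex("(Name, Street, ZIP?, City, Tel+, Fax*, Email?)")

def conforms(xml_text):
    """True if the child elements of the root follow the Address model."""
    children = "".join(child.tag + " " for child in ET.fromstring(xml_text))
    return ADDRESS.fullmatch(children) is not None

ok = conforms("<Address><Name/><Street/><City/><Tel/><Tel/></Address>")
bad = conforms("<Address><Name/><Street/><City/></Address>")  # Tel is missing
print(ok, bad)
```

The first document passes because ZIP, Fax and Email may be absent and Tel may repeat; the second fails because Tel+ requires at least one Tel.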
• The declarations for Name, Street, ZIP …,
must also be given. For example
• <!ELEMENT Name (#PCDATA)>
• Attribute List Declarations: identify which
elements may have attributes, what values
the attributes may hold, and what value is
default. Attribute values appear only within
start-tags and empty-element tags.
• <Address id="12359">
XML - Summary
• HTML describes presentation
• XML describes content
• XML vs. HTML
– users define new tags
– arbitrary nesting
– validation is possible
XML and Semistructured Data
• XML data is fundamentally different from
relational and object oriented data.
• XML is not rigidly structured.
• In relational and OO data model every data
instance has a schema which is separate and
independent of the data.
• XML data is self describing and can naturally
model irregularities that cannot be modeled by
relational or OO data model.
• For example, data items may have missing
elements or multiple occurrences of the
same element; elements may have atomic
values in some data items and structured
values in others; and collections of elements
can have heterogeneous structure.
• Even XML data that has an associated DTD
is self-describing (the schema is always
stored with the data) and, except for very
restricted forms of DTDs, may have all the
irregularities described above.
• XML is an instance of semistructured data.
Regular path expression
pattern matching
uses edge-labeled graphs
extract data from existing XML documents
and construct new XML documents
• support for ordered and unordered views on
XML document
• simple and declarative
• The simplest XML-QL queries extract data from
an XML document. Consider the following DTD:
<!ELEMENT book (author+, title, publisher)>
<!ATTLIST book year CDATA>
<!ELEMENT article (author+, title, year?,
(shortversion | longversion))>
<!ATTLIST article type CDATA>
<!ELEMENT publisher (name, address)>
<!ELEMENT author (firstname?, lastname)>
XML-QL Example Data
<book year="1995">
<title> An Introduction to DB Systems </title>
<author> <lastname> Date </lastname> </author>
<publisher> <name> Addison-Wesley </name> </publisher>
</book>
<book year="1995">
<title> Foundations for OR Databases </title>
<author> <lastname> Date </lastname> </author>
<author> <lastname> Darwen </lastname> </author>
<publisher> <name> Addison-Wesley </name> </publisher>
</book>
Matching Data Using Patterns
• XML-QL uses element patterns to match data in an XML document.
• Find all authors of books whose publisher is Addison-Wesley in
the XML document www.a.b.c/bib.xml:
WHERE <book>
<publisher> <name> Addison-Wesley </name> </publisher>
<title> $t </title>
<author> $a </author>
</book> IN "www.a.b.c/bib.xml"
CONSTRUCT $a
• This query matches every <book> element in the XML document that has
at least one <title> element, one <author> element, and one
<publisher> element whose <name> is Addison-Wesley. For
each such match it binds $t and $a to every title and author pair.
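The matching behaviour described above can be approximated outside XML-QL. The following is a sketch in Python, not an XML-QL implementation; the <bib> wrapper element and the function name match_books are assumptions made for the example. It yields one (title, author) binding per matching pair, in the way $t and $a are bound.

```python
import xml.etree.ElementTree as ET

# Example data wrapped in a hypothetical <bib> root element.
BIB = """
<bib>
  <book year="1995">
    <title>An Introduction to DB Systems</title>
    <author><lastname>Date</lastname></author>
    <publisher><name>Addison-Wesley</name></publisher>
  </book>
  <book year="1995">
    <title>Foundations for OR Databases</title>
    <author><lastname>Date</lastname></author>
    <author><lastname>Darwen</lastname></author>
    <publisher><name>Addison-Wesley</name></publisher>
  </book>
</bib>
"""

def match_books(root, publisher):
    """Yield one (title, author lastname) binding per matching pair."""
    for book in root.iter("book"):
        names = [n.text.strip() for n in book.findall("publisher/name")]
        if publisher not in names:
            continue                      # the publisher pattern does not match
        for title in book.findall("title"):
            for author in book.findall("author"):
                yield title.text.strip(), author.findtext("lastname").strip()

bindings = list(match_books(ET.fromstring(BIB), "Addison-Wesley"))
for t, a in bindings:
    print(t, "|", a)
```

A book with two authors produces two bindings, which is why the title appears once per author.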
XML-QL Constructing XML Data
• Often we would like to format the result.
• Find all authors and titles of books whose publisher is
Addison-Wesley in the XML document www.a.b.c/bib.xml:
WHERE <book>
<publisher> <name> Addison-Wesley </name> </publisher>
<title> $t </title>
<author> $a </author>
</book> IN "www.a.b.c/bib.xml"
CONSTRUCT <result>
<author> $a </>
<title> $t </>
</>
Constructing XML Data (cont.)
Result of the query:
<result>
<author> <lastname> Date </lastname> </author>
<title> An Introduction to DB Systems </title>
</result>
<result>
<author> <lastname> Date </lastname> </author>
<title> Foundations for OR Databases </title>
</result>
<result>
<author> <lastname> Darwen </lastname> </author>
<title> Foundations for OR Databases </title>
</result>
One result for each author, duplicating title information.
XML-QL Nested Queries.
WHERE <book>
<title> $t </>
</> CONTENT_AS $p IN "www.a.b.c/bib.xml"
CONSTRUCT <result>
<title> $t </>
WHERE <author> $a </> IN $p
CONSTRUCT <author> $a </>
</>
Result of the query (one result per book, with the authors grouped under the title):
<result>
<title> An Introduction to DB Systems </title>
<author> <lastname> Date </lastname> </author>
</result>
<result>
<title> Foundations for OR Databases </title>
<author> <lastname> Date </lastname> </author>
<author> <lastname> Darwen </lastname> </author>
</result>
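The grouping effect of a nested query can be sketched in Python as two nested loops: the outer loop plays the role of the outer WHERE (one <result> per book), and the inner loop plays the role of the inner WHERE over the book's content. The <bib> wrapper and the function name nested_results are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

BIB = ET.fromstring(
    "<bib>"
    "<book><title>An Introduction to DB Systems</title>"
    "<author><lastname>Date</lastname></author></book>"
    "<book><title>Foundations for OR Databases</title>"
    "<author><lastname>Date</lastname></author>"
    "<author><lastname>Darwen</lastname></author></book>"
    "</bib>"
)

def nested_results(root):
    """Build one <result> per book: its title, then all of its authors."""
    results = []
    for book in root.iter("book"):               # outer query: one result per book
        result = ET.Element("result")
        ET.SubElement(result, "title").text = book.findtext("title")
        for author in book.findall("author"):    # inner query over the book content
            result.append(author)
        results.append(result)
    return results

results = nested_results(BIB)
for r in results:
    print(ET.tostring(r, encoding="unicode"))
```

Compared with the flat query, each title now appears once, followed by all of its authors.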
XML-QL Join Queries
XML-QL queries can express "joins" by matching two or more
elements that contain the same value. Find all articles that have at
least one author who has also written a book since 1995.
WHERE <article>
<author>
<firstname> $f </> // firstname $f
<lastname> $l </> // lastname $l
</>
</> CONTENT_AS $a IN "www.a.b.c/bib.xml",
<book year=$y>
<author>
<firstname> $f </> // join on same firstname $f
<lastname> $l </> // join on same lastname $l
</>
</> IN "www.a.b.c/bib.xml",
$y > 1995
CONSTRUCT <article> $a </>
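The join on author name can be sketched in Python as a build-and-probe over the two element sets. The sample data, the <bib> wrapper, and the function names are hypothetical. Unlike XML-QL, which would produce one result per variable binding, this sketch returns each qualifying article once.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: two books and two articles.
BIB = ET.fromstring("""
<bib>
  <book year="1998">
    <author><firstname>Ann</firstname><lastname>Lee</lastname></author>
  </book>
  <book year="1990">
    <author><firstname>Bob</firstname><lastname>Ray</lastname></author>
  </book>
  <article><title>A1</title>
    <author><firstname>Ann</firstname><lastname>Lee</lastname></author>
  </article>
  <article><title>A2</title>
    <author><firstname>Bob</firstname><lastname>Ray</lastname></author>
  </article>
</bib>
""")

def author_key(author):
    """The join key: (firstname, lastname), like the variables $f and $l."""
    return (author.findtext("firstname"), author.findtext("lastname"))

def articles_by_recent_book_authors(root, after_year):
    # Build side: everyone who wrote a book with year > after_year.
    recent = {
        author_key(a)
        for book in root.iter("book")
        if int(book.get("year")) > after_year
        for a in book.findall("author")
    }
    # Probe side: keep an article if any of its authors is in the set.
    return [
        art for art in root.iter("article")
        if any(author_key(a) in recent for a in art.findall("author"))
    ]

titles = [a.findtext("title") for a in articles_by_recent_book_authors(BIB, 1995)]
print(titles)
```

Only the article whose author also wrote the 1998 book qualifies; the author of the 1990 book fails the year condition.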
XML-QL Data Model for XML
• An XML graph is a graph G in which each node is represented by a
unique string called an object identifier (OID), G’s edges are
labeled with element tags, G’s nodes are labeled with sets
of attribute-value pairs, G’s leaves are labeled with one
string value, and G has a distinguished node called the root.
XML-QL Data Model for XML (cont.)
• The model allows several edges between the same two
nodes, with the following restrictions:
– between any two nodes there can be at most one edge
with a given label
– a node cannot have two leaf children with the same
label and the same string value
• XML graphs are not only derived from XML
documents, but are also generated by queries.
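A rough sketch of deriving such a graph from an XML document, under several assumptions: OIDs are generated as o1, o2, ... in document order, a separate node named "root" serves as the distinguished root, and mixed content is ignored. The function name xml_to_graph and the dictionary layout are hypothetical.

```python
import itertools
import xml.etree.ElementTree as ET

def xml_to_graph(xml_text):
    """Return (root_oid, nodes, edges) for an XML document.

    nodes: oid -> {"attrs": attribute-value pairs, "value": leaf string or None}
    edges: list of (parent_oid, element_tag, child_oid)
    """
    counter = itertools.count(1)
    nodes, edges = {}, []

    def visit(elem):
        oid = f"o{next(counter)}"                  # generated object identifier
        is_leaf = len(elem) == 0
        text = (elem.text or "").strip()
        nodes[oid] = {"attrs": dict(elem.attrib),
                      "value": text if is_leaf else None}
        for child in elem:
            # The edge to a child is labeled with the child's element tag.
            edges.append((oid, child.tag, visit(child)))
        return oid

    doc = ET.fromstring(xml_text)
    root_oid = "root"                              # distinguished root node
    nodes[root_oid] = {"attrs": {}, "value": None}
    edges.append((root_oid, doc.tag, visit(doc)))
    return root_oid, nodes, edges

root_oid, nodes, edges = xml_to_graph(
    '<book year="1995"><title>T</title>'
    '<author><lastname>Date</lastname></author></book>'
)
for edge in edges:
    print(edge)
```

Element tags end up on edges, attributes on the node the edge points to, and string values only on leaf nodes, matching the description above.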
XML- Element Identity, Ids, and
• For element sharing, XML reserves an attribute of
type ID, which allows a unique key to be
associated with an element.
• An attribute of type IDREF allows an element to
refer to another element with the designated key,
and one of type IDREFS may refer to multiple
elements.
<!ATTLIST article author IDREFS #IMPLIED>
<person ID="o123"> ... </person>
<person ID="o234"> ... </person>
<article author="o123 o234">
<title> ... </title>
<year> 1995 </year>
</article>
XML- Element Identity, Ids, and
The following query produces all lastname, title pairs by joining the article
element's author (IDREFS) attribute value with the person element's ID
attribute value:
WHERE <article author=$i>
<title> </> ELEMENT_AS $t
</>,
<person ID=$i>
<lastname> </> ELEMENT_AS $l
</>
CONSTRUCT <result> $t $l </>
The idiom <title></> ELEMENT_AS $t binds $t to a <title> element with
arbitrary contents. The element expression <title/> matches a <title>
element with empty contents.
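The ID/IDREFS join can be sketched in Python by indexing <person> elements by their ID attribute and resolving each reference in the article's author attribute. The <db> wrapper, the lastnames, and the title text are invented placeholders for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; names and title are placeholders.
DOC = ET.fromstring("""
<db>
  <person ID="o123"><lastname>Smith</lastname></person>
  <person ID="o234"><lastname>Jones</lastname></person>
  <article author="o123 o234"><title>Sample</title><year>1995</year></article>
</db>
""")

def title_lastname_pairs(root):
    """Resolve each IDREFS entry of an article to the referenced person."""
    by_id = {p.get("ID"): p for p in root.iter("person")}
    pairs = []
    for article in root.iter("article"):
        # An IDREFS value is a whitespace-separated list of IDs.
        for ref in article.get("author", "").split():
            person = by_id.get(ref)
            if person is not None:
                pairs.append((article.findtext("title"),
                              person.findtext("lastname")))
    return pairs

pairs = title_lastname_pairs(DOC)
print(pairs)
```

Each reference yields one (title, lastname) pair, so an article with two referenced authors contributes two pairs.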
XML-QL- Advanced Examples
• Tag variables
• Regular path expressions
• Transforming XML data (from one DTD to another)
• Integrating data from different XML sources
• Embedding queries in data
• For details, see the XML-QL note: http://www.w3.org/TR/NOTE-xml-ql
• In the blink of an eye, a lot of work has gone into web
data models and query languages
• Some problems that have been addressed:
– query languages based on semistructured data models
– schema inference from semistructured data
– efficient processing of queries on semistructured data
– efficient indexing and storage structures
– integration with XML
– web warehousing
• Which way will you go?
Further issues
• Distributed query processing
• Continuous result processing with push/pull of results
• Labels, labels everywhere, and with XML even more labels:
how are the semantics of queries across multiple information
sources handled?
• IR gives too many results, relevant and irrelevant
• Query processing requires some schema knowledge, which is
difficult to handle across multiple sources
• Can these two be bridged? Cooperative solutions.
• Next: agents, agents everywhere. What are they doing? Will it
work, or will it be a fad?

Web Warehousing : Design and Issues