Introduction to XML and RSS
Data Management Issues
Types of data
Structured Data
data is organized in entities (tables)
entities have attributes
Current Database World
– Structure
Relational Database Management System (DBMS):
everything is a table
– Query languages: SQL
– Software: MS Access, Oracle….
Example of a table (patients)
Example of a group of tables (MS Access Table Links)
World of Web Data
– Easy document exchange
– Unstructured (or poorly structured) data
Everything is a document
– No standard for query languages
World of Web Data
– An organization A publishes financial data
on its web pages (HTML), generated from
– A second organization B wants some
financial analyses; can access only web
Semi-structured Data
data can be of any type
not necessarily following any format
does not follow any rules
is not predictable
examples include
Characteristics of SemiStructured Data
structure is irregular: missing or
additional attributes
parts of data lack structure, e.g.,
some may yield little structure, e.g.,
plain text
Semi-structured Data Definition
Data that is inherently self-describing
and does not conform to an explicit and
fixed schema is known as
Semistructured Data
Data Structure is contained within data
Example of Semi-Structured Data
name: Peter Wood
email: [email protected], [email protected]
name:
• first name: Mark
• last name: Levene
email: [email protected]
name: Alex Smith
affiliation: StFX
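The same three records can be sketched as Python dictionaries to show what "self-describing" means: each record carries its own field names, and the fields differ from record to record. This is only an illustration. The e-mail values are invented placeholders, and the helper name describe is an assumption, not part of the slides.

```python
# Three records about people. Each one names its own fields,
# and no fixed schema forces them to have the same shape.
# The e-mail values are hypothetical placeholders.
people = [
    {"name": "Peter Wood", "email": ["first@example.org", "second@example.org"]},
    {"name": {"first name": "Mark", "last name": "Levene"}, "email": "third@example.org"},
    {"name": "Alex Smith", "affiliation": "StFX"},
]


def describe(value, indent=0):
    """Return text lines showing the structure found inside the data itself."""
    pad = " " * indent
    lines = []
    if isinstance(value, dict):
        for key, child in value.items():
            if isinstance(child, (dict, list)):
                lines.append(f"{pad}{key}:")
                lines.extend(describe(child, indent + 2))
            else:
                lines.append(f"{pad}{key}: {child}")
    elif isinstance(value, list):
        for item in value:
            lines.append(f"{pad}- {item}")
    return lines


# The set of attributes is irregular: it changes from record to record.
field_sets = [sorted(person.keys()) for person in people]

for person in people:
    print("\n".join(describe(person)))
```

Note that the second record nests a structure under name while the others store plain text there, which is the kind of irregularity a fixed table could not hold directly.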
IMDB – A Motivating Example
The Internet Movie Database is a
classical example of a collection of
semi-structured data
Although the information pertaining to
different movies may be essentially
similar, their structure may differ.
Let us consider an example movie database.
An Example Movie Database
IMDB-Irregularity In Structure
Different layout for movies and TV series
Movie entries show Director, Writers and Stars
TV entries show just Stars
Captain Phillips (Movie)
Lost (TV Series)
Traditional Data Management
Universe of Discourse
Model of the UoD
Post-Internet Data Management
Universe of Discourse
XML – An Embodiment of Semi-structured Data
XML can be used to represent
semistructured data
What is XML?
XML stands for EXtensible Markup Language
XML is a markup language much
like HTML (tags)
XML was designed to describe data
XML tags are not predefined. You
must define your own tags
The main difference
between XML and HTML
XML and HTML were designed with different goals:
XML was designed to describe data and to
focus on what data is.
HTML was designed to display data and to
focus on how data looks.
It is important to understand that XML is
not a replacement for HTML.
XML does not DO anything
Maybe it is a little hard to understand, but XML DOES NOT DO
ANYTHING. XML is created to structure, store and send information.
<body>Don't forget me this weekend!</body>
The note has a header and a message body. It also has sender and
receiver information. But still, this XML document does not DO
anything. It is just pure information wrapped in XML tags. Someone
must write a piece of software to send, receive or display it.
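As a rough sketch of such "a piece of software", the Python below reads a note and turns it into a displayable message. Only the body text comes from the slide. The values inside to, from and heading are invented placeholders, and the function name display_note is an assumption.

```python
import xml.etree.ElementTree as ET

# A note in the spirit of the slide. The <to>, <from> and <heading>
# values are hypothetical; only the <body> text is from the slide.
NOTE_XML = """<note>
  <to>Receiver</to>
  <from>Sender</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>"""


def display_note(xml_text):
    """Read the pure data in a note and build a one-line message from it."""
    note = ET.fromstring(xml_text)
    return (f"To {note.findtext('to')} from {note.findtext('from')}: "
            f"{note.findtext('body')}")


message = display_note(NOTE_XML)
print(message)
```

The XML itself stays passive: all behaviour (reading, formatting, printing) lives in the program.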
XML is free and extensible
XML tags are not predefined. You must
"invent" your own tags.
The tags used to mark up HTML documents and
the structure of HTML documents are predefined
(like <b>, <i>, <h1>, etc.).
XML allows authors to define their own tags and
their own document structure.
The tags in the example above (like <to> and
<from>) are not defined in any XML standard.
These tags are "invented" by the author of the XML document.
XML is used to Exchange Data
With XML, data can be exchanged between
incompatible systems.
In the real world, computer systems and databases
contain data in incompatible formats. One of the
most time-consuming challenges for developers has
been to exchange data between such systems over
the Internet.
Since XML data is stored in plain text format, XML
provides a software- and hardware-independent
way of sharing data.
XML can be used to Create new Languages
XML is the mother of WML (the Wireless
Markup Language), which is part of WAP (the
Wireless Application Protocol).
WML is used to mark up Internet applications for
handheld devices like mobile phones.
MathML (for mathematical formulas), CML
(Chemical Markup Language), comicML (for
describing comic characters) and MusicXML (for
musical notes) are all written in XML.
XML and Microsoft Office
Starting with Office 2007, Microsoft changed the
format of all Office documents.
They are all saved in XML format.
So a Word file is a ZIP archive holding a number
of files, including the text in XML format.
– Small file size
– Compatibility with other software
– Older Word files have the extension DOC,
new ones use DOCX
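Because a DOCX file is a ZIP archive, ordinary ZIP tools can look inside it. The sketch below builds a tiny stand-in archive in memory and reads it back. A real Word file contains more parts, and the XML content shown here is a simplified placeholder, not the actual Word format.

```python
import io
import zipfile

# Build a small stand-in archive in memory. The XML bodies are
# simplified placeholders, not real Word markup.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", "<document><p>Hello</p></document>")
    archive.writestr("docProps/core.xml", "<properties/>")

# Reading it back works like reading any other ZIP file.
with zipfile.ZipFile(buffer) as archive:
    part_names = archive.namelist()
    body_xml = archive.read("word/document.xml").decode("utf-8")

print(part_names)
print(body_xml)
```

To try this on a real document, pass the path of a .docx file to zipfile.ZipFile instead of the in-memory buffer.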
XML Syntax
The syntax rules of XML are very
simple and very strict. The rules are
very easy to learn, and very easy to use.
Because of this, creating software that
can read and manipulate XML is very
easy to do.
All XML elements must have a
closing tag
Elements or tags are basic blocks of any
XML document
With XML, it is illegal to omit the closing tag.
In HTML some elements do not have to have
a closing tag. The following code is legal in HTML:
<p>This is a paragraph
In XML all elements must have a closing tag,
like this:
<p>This is a paragraph</p>
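The closing-tag rule can be checked with any XML parser. Here is a minimal sketch using Python's standard xml.etree.ElementTree module; the helper name is_well_formed is an assumption chosen for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False on a syntax error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# The HTML-style paragraph without a closing tag is rejected.
without_close = is_well_formed("<p>This is a paragraph")
# The same paragraph with its closing tag is accepted.
with_close = is_well_formed("<p>This is a paragraph</p>")

print(without_close, with_close)
```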
XML tags are case sensitive
Unlike HTML, XML tags are case sensitive.
With XML, the tag <Letter> is
different from the tag <letter>.
Opening and closing tags must
therefore be written with the same case:
<Message>This is incorrect</message>
<message>This is correct</message>
All XML elements must be properly nested
Improper nesting of tags makes no sense to XML.
In HTML some elements can be improperly nested within
each other like this:
<b><i>This text is bold and italic</b></i>
In XML all elements must be properly nested within each
other like this:
<b><i>This text is bold and italic</i></b>
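Both the case rule and the nesting rule show up as parse errors. The sketch below feeds the slide's own snippets to Python's standard parser; the helper name first_error is an assumption, and the exact wording of the error text depends on the parser.

```python
import xml.etree.ElementTree as ET


def first_error(xml_text):
    """Return the parser's error message, or None if the text is accepted."""
    try:
        ET.fromstring(xml_text)
        return None
    except ET.ParseError as error:
        return str(error)


# Case: <Message> and </message> are different names, so they do not match.
wrong_case = first_error("<Message>This is incorrect</message>")
right_case = first_error("<message>This is correct</message>")

# Nesting: <b> is closed while <i> is still open.
bad_nesting = first_error("<b><i>This text is bold and italic</b></i>")
good_nesting = first_error("<b><i>This text is bold and italic</i></b>")

print(wrong_case)
print(bad_nesting)
```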
All XML documents must
have a root element (tag)
All XML documents must contain a single tag
pair to define a root element.
All other elements must be within this root element.
All elements can have sub elements (child elements).
Sub elements must be correctly nested within their
parent element:
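A small sketch of the root and child relationship, using Python's standard parser. The element names family, parent and child are made up for the illustration, as are the helper names outline and has_single_root.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: <family> is the root, <parent> is its child,
# and the <child> elements are nested inside <parent>.
DOC = "<family><parent><child>A</child><child>B</child></parent></family>"


def outline(element, depth=0):
    """List every element name, indented by its nesting depth."""
    lines = ["  " * depth + element.tag]
    for sub_element in element:
        lines.extend(outline(sub_element, depth + 1))
    return lines


def has_single_root(xml_text):
    """Return False when the parser rejects the text, e.g. for two roots."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


tree_lines = outline(ET.fromstring(DOC))
two_roots_ok = has_single_root("<a>1</a><b>2</b>")

print("\n".join(tree_lines))
print(two_roots_ok)
```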
With XML, white space is preserved
With XML, the white space in your
document is not truncated.
This is unlike HTML. With HTML, a sentence
like this:
Hello              my name is John,
will be displayed like this:
Hello my name is John,
because HTML collapses runs of white space into a single space.
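The difference can be seen with a short sketch. The element name greeting is invented for the example; the last step imitates what an HTML browser does when it displays the text.

```python
import xml.etree.ElementTree as ET

# A sentence with ten spaces after "Hello".
SENTENCE = "Hello" + " " * 10 + "my name is John,"

element = ET.fromstring(f"<greeting>{SENTENCE}</greeting>")

# The XML parser hands back the text exactly as written.
raw_text = element.text

# Imitation of HTML display: runs of white space become one space.
collapsed = " ".join(raw_text.split())

print(repr(raw_text))
print(repr(collapsed))
```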
Element Naming
XML elements must follow these naming rules:
Names can contain letters, numbers, and
other characters
Names must not start with a number or
punctuation character
Names must not start with the letters xml
(or XML or Xml ..)
Names cannot contain spaces
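The rules above can be approximated with a short check. This is a deliberately simplified sketch: it only accepts ASCII letters, digits, underscore, hyphen and period, whereas the XML specification allows many more characters. The function name follows_slide_rules is an assumption.

```python
import re

# Simplified pattern: first character is a letter or underscore,
# the rest may also be digits, hyphens or periods. ASCII only.
NAME_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")


def follows_slide_rules(name):
    """Check a candidate element name against the rules listed above."""
    if name.lower().startswith("xml"):
        return False  # names must not start with xml, XML, Xml, ...
    return NAME_PATTERN.fullmatch(name) is not None


for candidate in ["first_name", "2nd", "xmlNote", "first name", "-tag"]:
    print(candidate, follows_slide_rules(candidate))
```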
Element Naming
Any name can be used, no words are
reserved, but the idea is to make names descriptive.
XML documents often have a corresponding
database, in which fields exist
corresponding to elements in the XML
document. A good practice is to use the
naming rules of your database for the
elements in the XML documents.
Comments in XML
The syntax for writing comments in
XML is similar to that of HTML.
<!-- This is a comment -->
XML Attributes
XML elements can have attributes in the
start tag, just like HTML.
Attributes are used to provide additional
information about elements.
In HTML (and also in XML) attributes
provide additional information about elements:
<img src="computer.gif">
<a href="demo.asp">
XML Attributes
Attribute values must always be
enclosed in quotes
<person sex="female">
XML Attributes Cont.
<?xml version="1.0" encoding="UTF-8"?>
<note date=12/11/2002>
---------------------------------------------------------------------
<?xml version="1.0" encoding="UTF-8"?>
<note date="12/11/2002">
The error in the first document is that the date attribute in the note
element is not quoted.
The first line in the document is the XML declaration
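The effect of the missing quotes can be reproduced with a parser. In this sketch the note elements are shortened to a single made-up body, and the helper name read_date is an assumption.

```python
import xml.etree.ElementTree as ET

# Shortened versions of the two documents; the <body> text is a placeholder.
UNQUOTED = "<note date=12/11/2002><body>text</body></note>"
QUOTED = '<note date="12/11/2002"><body>text</body></note>'


def read_date(xml_text):
    """Return the date attribute, or None if the document cannot be parsed."""
    try:
        return ET.fromstring(xml_text).get("date")
    except ET.ParseError:
        return None


unquoted_result = read_date(UNQUOTED)
quoted_result = read_date(QUOTED)

print(unquoted_result, quoted_result)
```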
Use of Elements vs. Attributes
Data can be stored in child elements or in attributes.
Take a look at these examples:
<person sex="female">
In the first example sex is an attribute. In the last, sex is a child element. Both
examples provide the same information.
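A sketch of the two styles and how a program reads each one. The name element and its value are invented to make the documents complete; only the sex attribute comes from the slide.

```python
import xml.etree.ElementTree as ET

# Style 1: sex is an attribute of <person>.
AS_ATTRIBUTE = '<person sex="female"><name>Anna</name></person>'
# Style 2: sex is a child element of <person>.
AS_ELEMENT = "<person><sex>female</sex><name>Anna</name></person>"

# Attributes are read with get(), child elements with findtext().
sex_from_attribute = ET.fromstring(AS_ATTRIBUTE).get("sex")
sex_from_element = ET.fromstring(AS_ELEMENT).findtext("sex")

print(sex_from_attribute, sex_from_element)
```

Both documents carry the same information; the program just has to know where to look.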
Errors in XML will stop the XML program
The World Wide Web Consortium (W3C) XML specification
states that a program should not continue to process an XML
document if it finds a syntax (well-formedness) error. The reason is that XML
software should be easy to write, and that all XML documents
should be compatible.
With HTML it was possible to create documents with lots of
errors (like when you forget an end tag). One of the main
reasons that HTML browsers are so big and incompatible, is
that they have their own ways to figure out what a document
should look like when they encounter an HTML error.
With XML this should not be possible.
XML and Web Browsers
Internet Explorer 5.0+, Google
Chrome & Firefox support XML
Viewing XML Files
If you open an XML document in IE (or
other browsers), it will display the document
with color coded root and child elements. A
plus (+) or minus sign (-) to the left of the
elements can be clicked to expand or
collapse the element structure.
If you want to view the raw XML source,
you must select "View Source" from the
browser menu.
If an erroneous XML file is opened, the
browser will report the error.
Other Examples
Viewing some XML documents will
help you get the XML feeling.
An XML CD catalog: a CD collection, stored as XML data.
An XML plant catalog: a plant catalog from a plant shop, stored as XML data.
A Simple Food Menu: a breakfast food menu from a restaurant, stored as XML data.
Why does XML display like this?
XML documents do not carry information
about how to display the data.
Since XML tags are "invented" by the author of
the XML document, browsers do not know if a
tag like <table> describes an HTML table or a
dining table.
Without any information about how to display
the data, most browsers will just display the
XML document as it is.
The XML Rules (Summary)
Single, unique root
Matching open/close
Correctly nested
Attribute values
enclosed in quotes
<?xml version="1.0"?>
<company id="4859">
<type>Web Development</type>
<street>Wakefield st</street>
<country>New Zealand</country>
Authoring XML Documents
A basic XML document is an XML element that
can, but might not, include nested XML elements.
<book ISBN="123456789">
<title> Second Chance </title>
<author> Matthew Dunn </author>
</book>
Use of XML and HTML together
This is pure data in an XML file
This is a pure format file to display the same data
View the result with Google Chrome or IE 6+
Converting Relational Database to XML
Example: Export the following data into XML and group
books by store
 Relational Database:
Store (sid, name, phone)
Book (bid, title, authors)
StoreBook (sid , bid, price, stock)
Converting Relational
Database to XML (Cont’d)
<sid> 123 </sid>
<name> Chapter </name>
<phone> 429-8976</phone>
<title> The Da Vinci Code</title>
<authors> Dan Brown</authors>
<bid> 987</bid>
example of database
Example of database converted to XML
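One way to sketch the export in Python: join the three tables in memory and emit one store element per store, with its books nested inside. The wrapper element names (stores, store, book) and the price and stock values are assumptions for the illustration; the slide does not show them.

```python
import xml.etree.ElementTree as ET

# Rows for Store(sid, name, phone), Book(bid, title, authors)
# and StoreBook(sid, bid, price, stock).
stores = [(123, "Chapter", "429-8976")]
books = [(987, "The Da Vinci Code", "Dan Brown")]
store_books = [(123, 987, 9.99, 5)]  # price and stock are made-up values


def export_grouped_by_store(stores, books, store_books):
    """Build an XML tree in which each store element contains its books."""
    book_by_id = {bid: (title, authors) for bid, title, authors in books}
    root = ET.Element("stores")
    for sid, name, phone in stores:
        store_el = ET.SubElement(root, "store")
        ET.SubElement(store_el, "sid").text = str(sid)
        ET.SubElement(store_el, "name").text = name
        ET.SubElement(store_el, "phone").text = phone
        for link_sid, bid, price, stock in store_books:
            if link_sid != sid:
                continue  # this row belongs to another store
            title, authors = book_by_id[bid]
            book_el = ET.SubElement(store_el, "book")
            ET.SubElement(book_el, "bid").text = str(bid)
            ET.SubElement(book_el, "title").text = title
            ET.SubElement(book_el, "authors").text = authors
            ET.SubElement(book_el, "price").text = str(price)
            ET.SubElement(book_el, "stock").text = str(stock)
    return root


root = export_grouped_by_store(stores, books, store_books)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

The join table StoreBook disappears in the output: its sid and bid columns are expressed by nesting, and its price and stock columns move into the book element.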
XML representation of a sample
Movie Database
<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
<Title> The Notebook</Title>
<Actor> Ryan Gosling</Actor>
<Actor> Rachel McAdams</Actor>
<Director> Nick Cassavetes</Director>
<Title> 300 </Title>
<Actor> Gerard Butler</Actor>
<Actor> Lena Headey </Actor>
<Director> Zack Snyder</Director>
<TVShow> FRIENDS </TVShow>
<TVShow> Seinfeld </TVShow>
Brief Introduction to RSS
RSS (Really Simple Syndication)
RSS is a family of web feed formats used to publish
frequently updated digital content, such as blogs, news feeds
or podcasts.
Users of RSS content use programs called feed "readers" or
"aggregators": the user "subscribes" to a feed by supplying to
their reader a link to the feed; the reader can then check the
user's subscribed feeds to see if any of those feeds have new
content since the last time it checked, and if so, retrieve that
content and present it to the user.
RSS formats are specified in XML (a generic specification for
data formats). RSS delivers its information as an XML file
called an "RSS feed," "webfeed," "RSS stream," or "RSS channel."
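The checking step a feed reader performs can be sketched as follows. The feed content is invented, the feed is held in a string rather than downloaded, and the function name new_items is an assumption. The element names rss, channel, item, title and guid are the ones used by RSS 2.0.

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 feed with two items.
FEED_XML = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>http://example.org/</link>
    <description>A made-up feed</description>
    <item><title>First post</title><guid>post-1</guid></item>
    <item><title>Second post</title><guid>post-2</guid></item>
  </channel>
</rss>"""


def new_items(feed_xml, seen_guids):
    """Return titles of items not seen before, and remember them as seen."""
    channel = ET.fromstring(feed_xml).find("channel")
    fresh = []
    for item in channel.findall("item"):
        guid = item.findtext("guid")
        if guid not in seen_guids:
            fresh.append(item.findtext("title"))
            seen_guids.add(guid)
    return fresh


seen = {"post-1"}                        # the reader already showed post-1
first_check = new_items(FEED_XML, seen)   # only the unseen item is returned
second_check = new_items(FEED_XML, seen)  # nothing new since the last check

print(first_check, second_check)
```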
RSS Feed representation
On Web pages, web feeds (RSS) are
typically linked with the word
"Subscribe", an orange square,
or a rectangle with the letters
Many news aggregators publish subscription buttons
for use on Web pages to simplify the
process of adding news feeds.
A podcast is a media file that is distributed over
the Internet using syndication feeds, for playback
on portable media players and personal computers.
The term "podcast" is derived from Apple's
portable music player, the iPod.
Though podcasters' web sites may also offer direct
download or streaming of their content, a podcast
is distinguished from other digital audio formats by
its ability to be downloaded automatically, using
software capable of reading feed formats such as RSS.
Podcasting is an automatic mechanism
whereby multimedia computer files are
transferred from a server to a client, which
pulls down XML files containing the Internet
addresses of the media files. In general,
these files contain audio or video, but also
could be images, text, PDF, or any file type.
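A podcast client finds those addresses in the feed's enclosure elements. Below is a minimal sketch under the assumption of an RSS 2.0 feed; the feed content and URLs are invented, and the function name media_addresses is hypothetical.

```python
import xml.etree.ElementTree as ET

# A made-up podcast feed. Each item points at its media file
# through an <enclosure> element, as in RSS 2.0.
PODCAST_XML = """<rss version="2.0">
  <channel>
    <title>Example podcast</title>
    <item>
      <title>Episode 1</title>
      <enclosure url="http://example.org/episode1.mp3"
                 length="12345" type="audio/mpeg"/>
    </item>
    <item>
      <title>Episode 2</title>
      <enclosure url="http://example.org/episode2.mp3"
                 length="23456" type="audio/mpeg"/>
    </item>
  </channel>
</rss>"""


def media_addresses(feed_xml):
    """Collect the Internet address of every media file named in the feed."""
    root = ET.fromstring(feed_xml)
    return [enclosure.get("url") for enclosure in root.iter("enclosure")]


addresses = media_addresses(PODCAST_XML)
print(addresses)
```

A real client would then download each address; that step is left out here.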
Example: StFX Podcast
XML Joke
Question: When should I use XML?
Answer: When you need a buzzword
in your resume.
