COMS E6125 Web-enHanced
Information Management
(WHIM)
Prof. Gail Kaiser
Spring 2008
5 February 2008
Kaiser: COMS E6125
1
Today’s Topic:
Markup Languages
• History of markup languages
• SGML = Standard Generalized
Markup Language
• HTML = HyperText Markup Language
• XML = eXtensible Markup
Language
What is Markup?
• Special text (“mark”) that is added to the
regular text of a document in order to
convey some information about it
• A markup language is a formalized way of
providing markup, and specifies:
– what markup is allowed (the lexicon)
– what markup is required
– how markup is distinguished from content text
– what the markup “means”
Specific Coding
• Historically, electronic manuscripts
contained procedural control codes
(markup) that caused the text to be
formatted in a particular way
– tj6
– troff
– TeX
Procedural Markup
• Advantages:
– Instructs agent how to process text
– Generally concerned with formatting and
presentation
– Is “efficient” because requires little further
interpretation
• Disadvantages
– Often specific to one proprietary processing
system
– Usually ties a document to a single purpose
• printing on paper
• viewing on a screen
– Provides no information on “meaning”
Markup Steps
1. Author first analyzes the information
structure and other attributes of the
document; that is, s/he identifies each
meaningful separate element, and
characterizes it as a paragraph, heading,
ordered list, footnote, or some other
element type
2. Author then determines, from memory or
a style book, the processing instructions
(“marks”) that will produce the format
desired for that type of element
3. Finally, s/he inserts the chosen marks
into the text
Example Specific Coding
.SK 1
Text processing and word processing systems typically
require additional information to be interspersed among the
natural text of the document being processed. This added
information, called "markup", serves two purposes:
.TB 4
.OF 4
.SK 1
1.#Separating the logical elements of the document; and
.OF 4
.SK 1
2.#Specifying the processing functions to be performed on those
elements.
.OF 0
.SK 1
(Control codes: .SK = SKip vertical space, .TB = TaB stop, .OF = OFfset)
Generic Coding
• In contrast, generic (or generalized,
or descriptive) coding uses
descriptive tags (e.g., “heading”)
– Scribe
– LaTeX
– HTML
Descriptive Markup
• Advantages:
– Identifies the logical components of a
document
– Generally concerned with what text is
– Does not specify what procedures are to
be applied to text
– Therefore requires that other
process(es) supply formatting and
presentation
Descriptive Markup
• Advantages (continued)
– Is (usually) human and machine readable
– Identifies information content
– Is not directed towards a particular
purpose or rendition of the document
– Therefore can be non-proprietary
Markup Steps
1. Author first analyzes the information
structure and other attributes of the
document; that is, s/he identifies each
meaningful separate element, and
characterizes it as a paragraph, heading,
ordered list, footnote, or some other
element type (same as above)
2. Author then associates each significant
element with the mnemonic tag (“mark”)
that s/he feels best characterizes it
Example
Generic Coding
<p> Text processing and word processing systems
typically require additional information to be
interspersed among the natural text of the
document being processed. This added
information, called <em>markup</em>, serves
two purposes:
<ol>
<li>Separating the logical elements of the
document; and
<li>Specifying the processing functions to be
performed on those elements.
</ol>
The Case for
Generalized Markup
• Markup should describe a document's
structure and other attributes rather
than specify processing to be performed
on it, so markup need be done only once
and will suffice for all future processing
• Markup should be rigorous so that the
techniques available for rigorously-defined
objects like programs and data
bases can be used for processing
documents as well
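As a rough illustration of both points, the sketch below treats a descriptively marked-up fragment as data: Python's standard `html.parser` module extracts the list items once, and two different renditions are then produced from the same markup. The fragment is a shortened version of the generic coding example; the names `ListItemCollector`, `collect_items`, `render_numbered` and `render_bulleted` are assumptions made for this sketch, not part of any standard.

```python
from html.parser import HTMLParser

FRAGMENT = """
<p>Markup serves two purposes:
<ol>
<li>Separating the logical elements of the document; and
<li>Specifying the processing functions to be performed on those elements.
</ol>
"""


class ListItemCollector(HTMLParser):
    """Collects the text of each <li>, even when end tags are omitted."""

    def __init__(self):
        super().__init__()
        self.items = []
        self.in_item = False

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            # A new <li> implicitly ends the previous one.
            self.items.append("")
            self.in_item = True

    def handle_endtag(self, tag):
        if tag in ("li", "ol"):
            self.in_item = False

    def handle_data(self, data):
        if self.in_item:
            self.items[-1] += data


def collect_items(html_text):
    parser = ListItemCollector()
    parser.feed(html_text)
    parser.close()
    # Normalize whitespace left over from line breaks in the source.
    return [" ".join(item.split()) for item in parser.items]


def render_numbered(items):
    return "\n".join(f"{n}. {text}" for n, text in enumerate(items, start=1))


def render_bulleted(items):
    return "\n".join(f"* {text}" for text in items)


items = collect_items(FRAGMENT)
print(render_numbered(items))
print(render_bulleted(items))
```

The markup was written once; the choice between numbers and bullets is made entirely by the processing code.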
Who Invented Markup?
• Specialized markup: ???
• Generalized markup:
– Many credit William Tunnicliffe, chairman of
the Graphic Communications Association
Composition Committee, who presented a talk
on the separation of information content of
documents from their format during a meeting
at the Canadian Government Printing Office,
September 1967
– Others credit Stanley Rice, a New York book
designer, who proposed the idea of a universal
catalog of parameterized editorial structure
macros in several articles, e.g., "Editorial Text
Structures," Memorandum to Standards
Planning and Requirements Committee, ANSI,
March 17, 1970
An Early Implementation
• At IBM in 1969, Charles Goldfarb, Ed Mosher and
Ray Lorie invented Generalized Markup Language
(GML) as part of a law office project integrating
text editing with information retrieval and page
composition
• Instead of a simple tagging scheme, GML
introduced the concept of a formally-defined
document type (DTD = Document Type Definition)
with an explicit nested element structure
• By 1971 developed first DTD, for the manuals for
IBM's “Telecommunications Access Method”,
which enabled all the headings of a given head-level
to be automatically formatted identically
• Productized in 1973 in IBM’s Document
Composition Facility (DCF)
Example GML
:h1.Chapter 1: Introduction
:p.GML supported hierarchical containers, such as
:ol.
:li.Ordered lists (like this one),
:li.Unordered lists, and
:li.Definition lists
:eol.
as well as simple structures.
:p.Markup minimization (later generalized and
formalized in SGML), allowed the end-tags to be
omitted for the "h1" and "p" elements.
SGML = Standard GML
• Standardization effort started in 1978, when
ANSI (American National Standards Institute)
created the Computer Languages for the
Processing of Text Committee
• Series of draft standards 1980-1986 (1983
version adopted by IRS and DoD); ISO
(International Organization for Standardization)
joined the ANSI effort in 1984
• Final international standard in 1986 based in part
on an SGML system developed by Anders
Berglund, then of the European Particle Physics
Laboratory (CERN)
• Hmm… isn’t CERN where Tim Berners-Lee
invented the “World Wide Web” in 1989?
SGML
• A metalanguage (grammar)
• How to write tags, how to define the document
structure
• Structural paradigm is that of
– an inverted tree structure, a root component
branching out into leaves
– or a series of nested containers
• Defines three kinds of objects
– Elements are the basic structural components
– Attributes are qualities of elements
– Entities are a short representation of special
characters
SGML Pro and Con
• Advantages:
– Documents held in a standards-based, non-proprietary,
platform-independent storage format
– Scope for document re-use and re-presentation,
enhancement of retrieval possibilities
– Easy to process
– Can (optionally) validate against DTDs
• Disadvantages:
– Remained a niche market in the 1980s, unknown to the
masses
– Not well supported by the major document processing
vendors, tools expensive
Then Came the Web…
• HyperText Markup Language (HTML)
is derived from SGML
• As an SGML-compliant language, it
has a DTD with a fixed set of tags
• Initially, the number of tags was
very limited (~10), and they were very
easy to remember and to use
HTML Example
<html>
<head> <title> My title </title> </head>
<body>
<h1> A huge heading </h1>
<h2> A smaller one </h2>
<ul>
<li> a list item in <b>bold</b> </li>
<li> a list item in <i>italics</i> </li>
</ul>
<p> A paragraph </p>
</body> </html>
Another HTML Example
• From original IETF Internet Draft for
HTML
See <A HREF="http://info.cern.ch/">CERN</A>'s
information for more details.
A <A NAME=serious>serious</A> crime is one which is
associated with imprisonment.
The Organization may refuse employment to anyone
convicted of a <a href="#serious">serious</A> crime.
Warning: <IMG SRC="triangle.gif" ALT="Warning:"> This
must be done by a qualified technician.
<A HREF="Go"><IMG SRC="Button"> Press to start</A>
HTML Pro and Con
• Advantages
– Simple to learn and to use
– Easy to create from scratch or by
converting legacy text files
– Easy to parse and render
• Drawbacks
– Syntaxless
– Much more a presentation language than
a structural language
– Too limited, not a good substitute for a
word processor
HTML History
• 1990: First implementation by TBL on a
NeXT computer at CERN
– Used SGML tools to create original HTML
language (DTD, parser)
– Scalability and simplicity of HTML (and HTTP),
compared to OHS or Gopher, were part of the basis
for WWW success
• 1991-1992: Various text-only and graphical
browsers developed, the latter usually platform-specific
HTML History
• 1993: NCSA Mosaic
– First widely available graphical WWW browser (Unix XWindows and Mac)
– Developed primarily by UIUC undergraduate Marc
Andreessen
– The killer application of the Internet is born and the
number of Web servers explodes
• 1994: Competition
– Mosaic team leaves NCSA to found Netscape
– Microsoft adopts the Web (Internet Explorer bundled
with Windows 95)
– Divergence of supported HTML tags between Internet
Explorer and Netscape –> browser wars
– HTTP traffic becomes more common than telnet and ftp
HTML History
• 1994-1995: HTML 2.0 adds image maps,
forms
• 1995 and beyond: Commercial websites
– Java development started (as “Oak”) in 1991 for
programming set-top boxes: BIG FAILURE.
Launched on the Web in March 1995
(in HotJava) and May 1995 (in Netscape): BIG
SUCCESS
– Amazon.com opens in July 1995
– “dot com” era begins (and soon ends)
HTML History
• Jan 1997: HTML 3.2 adds tables,
applets, text flow around images,
superscripts and subscripts
• Dec 1997: HTML 4.0 adds frames,
cascading style sheets, more
multimedia options, scripting
languages, web accessibility
conventions, internationalization
XHTML = eXtensible
HyperText Markup Language
• XHTML 1.0 W3C Recommendation January 2000,
revised August 2002 (XHTML 1.1 became a W3C
Recommendation in May 2001)
• Made element and attribute names case-sensitive
(in particular, use lowercase)
• Include end tags, e.g., <p> … </p>
• Add a “/” to empty elements, e.g., <br/> and <hr/>
• Quote all attribute values, e.g.,
<img src="duck.jpg" alt="A Duck"/>
• Most browsers still work fine with older HTML
Where did the “X” come from?
• XML = eXtensible Markup Language
• XHTML is a reformulation of HTML 4.x in
XML
• XHTML can be used in conjunction with
other XML vocabularies
– SMIL (Synchronized Multimedia Integration
Language)
– SVG (Scalable Vector Graphics)
– MathML (Mathematical Markup Language)
– Plus hundreds dedicated to specific
applications (the extensible part)
What is XML for?
• The universal markup format for
structured documents and data on the
Web
• For data exchange (messages) and
persistent data
• Syntax
• Data Modeling
• Data Processing
XML History
• XML 1.0 became a W3C Recommendation in
February 1998; revised several times, most recently September 2006
• XML 1.1 draft released Nov 2003,
recommendation last revised September
2006 (addresses various issues wrt
Unicode and mainframe compatibility)
• Conceptually an SGML descendant
• Unlike SGML, it quickly became widespread
SGML->XML
• Like SGML, XML is a grammar (or a
metalanguage), NOT a specific language
• Specification simplified
– SGML spec ~600 pages
– XML spec 36 pages (initial 1.0) ->
54 pages (1.1 2nd edition)
• Parsing made simpler through two-level
mechanism
– Well-formed
– Valid
Well-Formed
• (Optionally) starts with XML declaration
<?xml version="1.0"?>
• Rest of document inside the root element
<myroot>…</myroot>
• All text contained in some element
<someelement>text text text</someelement>
• Explicit empty elements
<anotherelement></anotherelement>
<anotherelement/>
Well-Formed
• Element tags must be properly nested (no
crossing tags)
NO <i><b>blah blah blah</i></b>
• Start and end tags must match exactly (same
case)
• Quotes placed around all attribute values
<a href="stuff.html">stuff</a>
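The well-formedness rules can be checked mechanically by any XML parser. The following is a minimal sketch using Python's standard `xml.etree.ElementTree` (a non-validating parser); the helper name `is_well_formed` and the sample strings are assumptions chosen to mirror the rules listed above.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


samples = {
    "properly nested": "<p><i><b>blah blah blah</b></i></p>",
    "crossing tags": "<p><i><b>blah blah blah</i></b></p>",
    "case mismatch": "<P>text</p>",
    "quoted attribute": '<a href="stuff.html">stuff</a>',
    "unquoted attribute": "<a href=stuff.html>stuff</a>",
    "explicit empty element": "<anotherelement/>",
    "text outside the root": "<myroot/>stray text",
}

results = {label: is_well_formed(text) for label, text in samples.items()}
for label, ok in results.items():
    print(f"{label}: {'well-formed' if ok else 'NOT well-formed'}")
```

Note that this only tests well-formedness; checking validity against a DTD or Schema needs a validating parser, which the Python standard library does not provide.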
Valid
• Well-formed, plus
• Conforms to a DTD or Schema
– tags and attributes are all declared
– tags and attributes are used correctly
• XML browsers and editors usually require
validity
• Other tools might not (e.g., search
engines)
XML Goes Beyond
Document Processing
• XML more oriented to distributed computing
than to document markup
• Thus complements rather than replaces HTML
(or XHTML)
• DOM = Document Object Model
• SAX = Simple API for XML
• SOAP = Simple Object Access Protocol
• Web Services
Let’s Reinvent XML
• Someone in the far future sends a message
in a virtual bottle, containing parts of the
universal library of human and post-human
literature, back into the 1970s when ...
• … the Web, XML, P2P, Java were unheard of
• ... computer manufacturers talked about mips
and kilobytes
• … music was played by rotating vinyl discs
under a diamond-tip stylus or on cassette
tapes
… and Microsoft looked like
[image omitted]
The Message in the Bottle, 1st try
ÐÏ^Qࡱ^Zá^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@>^@^C^@þÿ^@^F^@^@^
@^@^@^@^@^@^@^@^@^A^@^@^@#^@^@^@^@^@^@^@^@^P^@^@%^@^@
^@^A^@^@^@þÿÿÿ^@^@^@^@"^@^@^@ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿì¥Á[email protected]^@^D^@^@^@^R¿^@^@^@^@^@^@^P^@^@^@^@
^@^D^@^@Ç^G^@^@[email protected]+t+^@^@^@
[email protected] Quotations from the Universal Library^M1 Famous Quotes^M1.1 By William
I^M[2, Sonnet XVIII]^MShall I compare thee to a summer's day?^MThou art more
lovely and more temperate.^MRough winds do shake the darling buds of May,^MAnd
summer's lease hath all too short a date.^MSometime too hot the eye of heaven
shines,^MAnd often is his gold complexion dimmed.^MAnd every fair from fair some
declines,^MBy chance or nature's changing course untrimmed.^MBut thy eternal
summer shall not fade,^MNor lose possession of that fair thou owest,^MNor shall
Death brag thou wander'st in his shade^MWhile in eternal lines to time thou
growest.^MSo long as men can breathe, or eyes can see,^MSo long live this, and this
gives life to thee.^M1.2 ^M[2] W. Shakespeare. The Sonnets of
Shakespeare.609.^M^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
The Message in the Bottle, 2nd try
\documentclass{article}
\begin{document}
\title{Some Quotations from the Universal Library}
...
\section{Famous Quotes}
\subsection{By William I}
\textbf{\cite[Sonnet XVIII]{shakespeare-sonnets-1609}}
\begin{verse}
Shall I compare thee to a summer's day?\\
Thou art more lovely and more temperate. \\
Rough winds do shake the darling buds of May, \\
…
\end{verse}
\bibliographystyle{abbrv}
\bibliography{msg}
\end{document}
The Message in the Bottle, finally
<?xml version="1234.56"?>
<universal_library>
<books>
<book> <title>Some Quotations from the Universal Library</title>
<section> <title>Famous Quotes</title>
<subsection> <title>By William I</title>
<quote bibref="shakespeare-sonnets-1609">
<title>Sonnet XVIII</title>
<verse>
<line>Shall I compare thee to a summer's day?</line>
<line>Thou art more lovely and more temperate. </line>
<line>Rough winds do shake the darling buds of May, </line>
…
</verse>
</quote>
</subsection>
</section>
</book>
…
</books>
</universal_library>
XML as a Self-Describing
Data Exchange Format
• Someone from the 1970s receives the message
in the virtual bottle, and it …
• … can be easily “understood” (even using CP/M
& edlin)
• … can be parsed easily
• … allows the application programmer to
rediscover schema and semantics (sort of…)
• … may include an explicit schema description
• … allows separation of marked-up content from
presentation
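To make "rediscover schema" concrete, here is a small sketch in which a recipient with no prior knowledge of the vocabulary walks the message and records which child elements and attributes occur under each element name. It uses Python's standard `xml.etree.ElementTree`; the function name `discover_structure` is an assumption, and the message is a shortened, well-formed variant of the bottle message with the version set to "1.0" so that an ordinary parser accepts it.

```python
import xml.etree.ElementTree as ET

MESSAGE = """<?xml version="1.0"?>
<universal_library>
  <books>
    <book>
      <title>Some Quotations from the Universal Library</title>
      <section>
        <title>Famous Quotes</title>
        <quote bibref="shakespeare-sonnets-1609">
          <title>Sonnet XVIII</title>
          <verse>
            <line>Shall I compare thee to a summer's day?</line>
            <line>Thou art more lovely and more temperate.</line>
          </verse>
        </quote>
      </section>
    </book>
  </books>
</universal_library>
"""


def discover_structure(root):
    """Map each element name to the child element and attribute names seen under it."""
    structure = {}
    for elem in root.iter():
        entry = structure.setdefault(elem.tag, {"children": set(), "attributes": set()})
        entry["children"].update(child.tag for child in elem)
        entry["attributes"].update(elem.attrib)  # adds the attribute names
    return structure


root = ET.fromstring(MESSAGE)
structure = discover_structure(root)
for name, info in structure.items():
    print(name, sorted(info["children"]), sorted(info["attributes"]))
```

This recovers the nesting and the names only. What a "verse" or a "bibref" means still has to be guessed from the names, which is the "sort of…" in the bullet above.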
XML Anatomy
<bibliography>
<paper ID="goto">
<authors>
<author>Edsger W. Dijkstra </author>
</authors>
<title>Go To Statement Considered Harmful</title>
<booktitle>Communications of the ACM</booktitle>
<year>1968</year>
<fullPaper source="harmful"/>
</paper>
</bibliography>
Callouts on the figure: element; element name; attribute name;
attribute value (attributes cannot contain elements); element content;
character content; number content; empty element
Perspectives on XML
• Document (SGML) Community
– data = linear text documents
– markup (annotate) text to describe context,
structure, semantics
• Database Community
– XML as a prominent example of the semistructured data model
– captures the whole spectrum from highly
structured, regular data to unstructured data
→ “XML is the cure for your data exchange,
information integration, e-commerce, …
problems” (also cures baldness, lose 28
pounds in 14 days, get rich quick, …)
Pure XML - Instance Model
• XML 1.0 implicit data model (infoset):
– nested containers ("boxes within boxes")
– labeled ordered trees (= semistructured data
model)
– relational and object-oriented data are easy to encode
<A>
<B>foo</B>
<C>bar</C>
<C>psl</C>
</A>
[Figure: the document above drawn two ways: as nested boxes (A containing B: "foo", C: "bar", C: "psl") and as a labeled tree with root A and children B, C, C holding "foo", "bar", "psl". Children are ordered.]
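A short sketch of the instance model, using Python's standard `xml.etree.ElementTree`: the first part reads the A/B/C document as a labeled ordered tree, and the second part shows one possible way to encode relational rows as such a tree. The helper `rows_to_xml`, the `row` element name and the sample table are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Part 1: the document is a labeled ordered tree; children keep document order.
doc = ET.fromstring("<A><B>foo</B><C>bar</C><C>psl</C></A>")
children = [(child.tag, child.text) for child in doc]
print(children)


# Part 2: one (hypothetical) encoding of relational rows as nested containers.
def rows_to_xml(table_name, rows):
    """Encode a list of dict rows as <table><row><column>value</column>...</row></table>."""
    table = ET.Element(table_name)
    for row in rows:
        row_elem = ET.SubElement(table, "row")
        for column, value in row.items():
            ET.SubElement(row_elem, column).text = str(value)
    return table


table = rows_to_xml(
    "papers",
    [{"title": "Go To Statement Considered Harmful", "year": 1968}],
)
encoded = ET.tostring(table, encoding="unicode")
print(encoded)
```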
Identifying Vocabularies
• My element may not be your element:
– geometry context: <element>line</element>
– chemistry context:
<element>oxygen</element>
Identifying Vocabularies
• An XML Schema (W3C Recommendation, 2001) defines a
vocabulary of names of type definitions,
element and attribute declarations [Schema ~=
new improved DTD]
• Use XML Namespaces (W3C Recommendation, 1999) to
identify which vocabulary
– Simple method for qualifying element and attribute
names used in XML documents
– Useful when a single XML document contains
elements and attributes that are defined for and
used by multiple software modules
Namespace Scoping
• XML namespaces are declared with an xmlns
attribute, which can associate a prefix with the
namespace
• The declaration is in scope for the element
containing the attribute and all its descendants
<html:html xmlns:html='http://www.w3.org/1999/xhtml'>
<html:head>
<html:title>Frobnostication</html:title>
</html:head>
<html:body>
<html:p>Moved to
<html:a href='http://frob.example.com'>here.</html:a>
</html:p>
</html:body>
</html:html>
Namespace Defaulting
<?xml version="1.1"?>
<!-- elements are in the HTML namespace, in this
case by default -->
<html xmlns='http://www.w3.org/1999/xhtml'>
<head>
<title>Frobnostication</title>
</head>
<body>
<p>Moved to
<a href='http://frob.example.com'>here</a>.</p>
</body>
</html>
Multiple Namespaces
All element types are prefixed
<bk:book xmlns:bk='urn:loc.gov:books'
xmlns:isbn='urn:ISBN:0-395-36341-6'
xmlns:money='urn:Finance:AllAboutMoney'>
<bk:title>Cheaper by the Dozen</bk:title>
<isbn:number>1568491379</isbn:number>
<bk:price
money:currencySymbol="$">99.99</bk:price>
</bk:book>
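A namespace-aware parser resolves each prefix to its URI, so what the application sees is the (URI, local name) pair rather than the prefix. The sketch below uses Python's standard `xml.etree.ElementTree`, which writes such names as `{uri}local`. It parses a shortened version of the book example; the lookup prefixes `b` and `i` are deliberately different from the document's `bk` and `isbn` to show that prefixes are only local shorthand.

```python
import xml.etree.ElementTree as ET

BOOK = """<bk:book xmlns:bk='urn:loc.gov:books'
         xmlns:isbn='urn:ISBN:0-395-36341-6'>
  <bk:title>Cheaper by the Dozen</bk:title>
  <isbn:number>1568491379</isbn:number>
</bk:book>"""

root = ET.fromstring(BOOK)

# Each tag is reported as '{namespace-uri}localname'; the prefix is gone.
tags = [elem.tag for elem in root.iter()]
print(tags)

# Queries can use any prefixes they like, as long as the URIs match.
ns = {"b": "urn:loc.gov:books", "i": "urn:ISBN:0-395-36341-6"}
title = root.find("b:title", ns).text
number = root.find("i:number", ns).text
print(title, number)
```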
Namespace Defaulting with
Multiple Namespaces
Unprefixed element types are from books
<book xmlns='urn:loc.gov:books'
xmlns:isbn='urn:ISBN:0-395-36341-6'>
<title>Cheaper by the Dozen</title>
<isbn:number>1568491379</isbn:number>
</book>
Nested Scoping
<?xml version="1.1"?>
<!-- initially, the default namespace is "books" -->
<book xmlns='urn:loc.gov:books'
xmlns:isbn='urn:ISBN:0-395-36341-6'>
<title>Cheaper by the Dozen</title>
<isbn:number>1568491379</isbn:number>
<notes>
<!-- make HTML the default namespace for
some commentary -->
<p xmlns='urn:w3-org-ns:HTML'>
This is a <i>funny</i> book!
</p>
</notes>
</book>
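The effect of the nested default declaration can be observed by listing the namespace of every element. This is a sketch using Python's standard `xml.etree.ElementTree`; the helper `split_tag` is an assumption that simply splits the parser's `{uri}local` notation.

```python
import xml.etree.ElementTree as ET

NESTED = """<book xmlns='urn:loc.gov:books'
      xmlns:isbn='urn:ISBN:0-395-36341-6'>
  <title>Cheaper by the Dozen</title>
  <isbn:number>1568491379</isbn:number>
  <notes>
    <p xmlns='urn:w3-org-ns:HTML'>
      This is a <i>funny</i> book!
    </p>
  </notes>
</book>"""


def split_tag(tag):
    """Split ElementTree's '{uri}local' form into (uri, local)."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return None, tag


root = ET.fromstring(NESTED)

# Local names are unique in this document, so they can serve as dictionary keys.
namespace_of = {}
for elem in root.iter():
    uri, local = split_tag(elem.tag)
    namespace_of[local] = uri

for local, uri in namespace_of.items():
    print(f"{local}: {uri}")
```

The inner default applies to `p` and its descendant `i`, while `notes`, which sits outside that scope, stays in the books namespace.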
How to Define the Actual
Namespace
• W3C namespace specification doesn’t say (!)
• A namespace doesn’t actually have to exist as a
physical or conceptual entity
• All that is needed is a qualifier—the XML
namespace URI — that, in combination with an
element type or attribute name, creates a universal
(and universally unique) name
• In other words, there doesn’t actually have to be a
definition or anything else at that URI
XML Namespaces
• Allows mixing of different tag
vocabularies
• Only identifies the vocabulary
(lexicon)
• Additional mechanisms required for
structure and meaning of tags
Processing XML
• Non-validating parser:
– checks that XML doc is syntactically well-formed
• Validating parser:
– checks that XML doc is also valid wrt a
given XML Schema (or, historically, DTD)
Processing XML
• Tree representation:
– Document Object Model (DOM) API
– Cursor APIs, e.g., .NET’s XPathNavigator,
Java StAX
• Stream of events representation:
– Push Model, e.g., Simple API for XML
(SAX)
– Pull Model, e.g., Common API for XML Pull
Parsing (XmlPull)
• Others
Document Object Model
• Object-oriented approach to
traversing the XML document as a
tree
• Typically loads the entire XML
document into memory (random
access but memory intensive)
• Provides mechanisms for loading,
saving, accessing, querying,
modifying, and deleting nodes from
an XML document
DOM API
• Hierarchy of Node objects mapping to XML
concepts: document, element, attribute,
processing instruction, comment, …
• Language-independent API:
– get first/last child, previous/next sibling, set
of nodes
– insert before/after, replace
– getElementsByTagName
• W3C DOM offers fairly limited functionality, so
implementations often add helper method
extensions
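The DOM operations listed above look roughly as follows in Python's standard `xml.dom.minidom`, one implementation of the W3C DOM. The document is a shortened version of the bibliography example, written without whitespace between elements so that every child node is an element; the added `author` element is for illustration only.

```python
from xml.dom import minidom

doc = minidom.parseString(
    "<bibliography><paper ID='goto'>"
    "<title>Go To Statement Considered Harmful</title>"
    "<year>1968</year>"
    "</paper></bibliography>"
)

# Query: find elements by tag name, then navigate children and siblings.
paper = doc.getElementsByTagName("paper")[0]
paper_id = paper.getAttribute("ID")
first = paper.firstChild            # the <title> element
last = paper.lastChild              # the <year> element
after_first = first.nextSibling     # also the <year> element

# Modify: build a new element and insert it before <year>.
author = doc.createElement("author")
author.appendChild(doc.createTextNode("Edsger W. Dijkstra"))
paper.insertBefore(author, last)

names = [node.tagName for node in paper.childNodes]
print(paper_id, names)
print(doc.documentElement.toxml())
```

The whole document is held in memory as node objects, which is what makes this kind of random access and in-place modification possible.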
Push Model
• XML producer (typically an XML parser)
controls the pace of the application and
informs the XML consumer when certain
events occur (e.g., reports events when
encountering begin/end tags)
• XML consumer registers callbacks with the
producer, which invokes the callbacks as
various parts of the XML document are seen
(as events are reported)
• Does not necessarily build a parse tree
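A minimal push-model sketch using Python's standard `xml.sax`: the consumer is a handler class whose callbacks the parser invokes as it encounters begin and end tags. The class name `DepthHandler` is an assumption; it tracks nesting depth by hand because the parser builds no tree and keeps no such state for the application.

```python
import xml.sax


class DepthHandler(xml.sax.ContentHandler):
    """Records (depth, element name) for every start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.seen = []

    def startElement(self, name, attrs):
        # Called by the parser (the producer) when a begin tag is encountered.
        self.seen.append((self.depth, name))
        self.depth += 1

    def endElement(self, name):
        # Called by the parser when an end tag is encountered.
        self.depth -= 1


handler = DepthHandler()
xml.sax.parseString(b"<A><B>foo</B><C>bar</C><C>psl</C></A>", handler)
print(handler.seen)
```

The parser drives the program: the application code only runs when one of its callbacks is invoked.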
Push Model Pro
• The entire XML document does not need to be
stored in memory, only the information about the
node currently being processed is needed
• This makes it possible to process large XML
documents without incurring massive memory costs
• Can also process XML streams whose contents arrive
over time
• Allows consumer to ignore less interesting data
Push Model Con
• Certain context and state information such as the
parents of the current node or its depth in the
XML tree must be tracked by the programmer
• Limited expressive power (query/update) when
working on streams
• To register callbacks one needs to create a class
devoted to handling events from the producer
• Many developers find callbacks to be an unintuitive
way to control program flow
Pull Model
• XML Consumer controls the program flow by
requesting events from the XML producer as
needed
• Operates in a forward-only, streaming
fashion while only showing information about
a single node at any given time
• Programmer creates a loop that continually
reads from the XML document until the end
of the document is reached, but acts solely
on items of interest as they are seen
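The loop described above can be sketched with Python's standard `xml.dom.pulldom`, used here as a stand-in for pull APIs such as XmlPull. The consumer asks for one event at a time and acts only on the `C` elements; the variable names are assumptions for this sketch.

```python
from xml.dom import pulldom

events = pulldom.parseString("<A><B>foo</B><C>bar</C><C>psl</C></A>")

c_values = []
for event, node in events:
    # The consumer controls the flow: it pulls the next event when ready.
    if event == pulldom.START_ELEMENT and node.tagName == "C":
        # Pull the rest of this one element into a small DOM subtree.
        events.expandNode(node)
        c_values.append(node.firstChild.data)

print(c_values)
```

No handler class or callback registration is needed; the control flow is an ordinary loop, and uninteresting events (here, everything about `A` and `B`) are simply skipped.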
Pull Model Comparison
• As memory efficient as push model
processing but with a more familiar
programming model
• Does not require a specialized class for
handling XML processing to implement
specific interfaces or subclass certain
classes to register callbacks
• The need to explicitly track application
states using boolean flags and similar
variables is significantly reduced
XML Cursors
• Cursor acts like a lens that focuses on one XML
node at a time, but, unlike pull-based or push-based
APIs, the cursor can be positioned
anywhere along the XML document at any given
time
• Allows one to navigate, query, and manipulate an
XML document loaded in memory
• Does not require the heavyweight interface of a
traditional tree model API, where every
significant token in the underlying XML must map
to an object
• Can create XML views of non-XML data
Other Alternatives
• Object to XML Mapping APIs
– Represent nodes and text as classes and
programming language primitives
– Cannot represent all XML information
with full fidelity, e.g., lose processing
instructions and comments, element
ordering
– Impedance mismatches between XML
Schema and object-oriented concepts
• XML-specific languages – XPath, XQuery,
XSLT, …
Summary
• Webpages intended for human audience
usually written in HTML, where descriptive
markup is interpreted by browser
• Webpages intended for machine
processing (other than browser) usually
written in some XML vocabulary
understood by both the producer and the
consumer
Second Assignment:
Revised Paper Proposal
• Due Monday February 18th at 5pm
• Maximum three pages (not including
figures, if any), plus references (required)
• Plan and outline your paper (which will be
~15 pages)
• See
http://york.cs.columbia.edu/classes/cs6125/revised_paper_proposal.htm
Revised Paper Proposal
• Each full paper should have title, author,
abstract (~200 words), introduction, body
sections, conclusions, bibliography (cited
references)
• The point of this assignment is to
determine what will be in those sections
• Assume a reader who is taking the class
but may not know anything at all about
your specific topic
Revised Paper Proposal:
Introduction and Conclusion
• What is your topic?
• What is the problem being addressed?
• What is the solution, or design space of
solutions, proposed or actualized?
• What is your argument?
• What is your point of view?
• What is the opposing point of view?
Revised Paper Proposal:
Body Sections
• What sections? (usually 3-5)
• What subsections? (perhaps down to
subsubsections)
• Motivate your literature reading to fill
those sections
• Full paper will be due March 14th
A Note about Citations and
Bibliographic References
• References should be cited in the text like
this “Kaiser said blah blah [1]” or this
“[Kai07] describes mumble”
• Bibliography entry should appear
something like this
[Kai07] Gail Kaiser, COMS E6125 Web-enHanced Information Management,
Columbia University Department of
Computer Science, 2007,
http://york.cs.columbia.edu/classes/cs6125/.
Second Assignment:
Logistics
• Due Monday February 18th by 5pm
• Maximum three pages when printed (not
including optional figures and required
reference list)
• Submit by posting in Revised Paper
Proposal folder on CourseWorks
• Must be in a format I can read, which
means pdf, word, powerpoint, html, plain
ascii text (with all figures embedded or
viewable in an ordinary browser)
Heads Up on Project
• Preliminary Proposal due Monday March 10th (note
this is before the full paper)
• Optionally work in teams (see
http://york.cs.columbia.edu/classes/cs6125/team_advice)
• Build a new system or extend an existing system –
submit code, demo system
• OR evaluate/compare one or more existing
system(s) – submit procedures and findings, show
system(s)
• You may "continue" your paper topic towards the
project, or do something entirely different
Heads Up on Presentation
• Individual ~10-minute talk in class during one of
last few class sessions
• No proposal, just do it
• May be based on paper, project, or some
other topic (in the case of team members
all presenting on the same project, please
coordinate to avoid redundancy and
discuss your plans with the instructor in
advance)
Reminders
• Class participation is important! (10%
corresponds to a whole letter grade)
• Revised paper proposal due February 18th
• Preliminary project proposal due March
10th
• Paper must be individual, projects may
optionally be done in teams