XML and Linguistic Annotation
Chris Brew, Ohio State University
(Credit to Marc Moens, Henry Thompson and David McKelvie, all of the Language Technology Group, University of Edinburgh)
XML topics
• What is XML?
• HTML, XML and SGML
• Wider context of XML
• Data Description
  • DTDs, Schemas
• Query Languages
  • XML Query, XQL, Quilt, LORE, LT QUERY
• Style Languages
  • CSS, XSL
XML and Linguistic Annotation
Summer School, July 2000
What is XML?
• It is a markup language used for annotating text
• It is concerned with logical structure
  • to identify sections, titles, section headers, chapters, paragraphs, …
• It is not concerned with appearance
  • you say 'this is a subtitle', not 'this is in bold, 14pt, centered'
  • you say 'this is an example', not 'this is in verbatim, indented by 5pts, ragged right'
• It is derived from SGML.
Why is XML a big deal?
• It is a W3C standard
• It is vendor-independent, platform-independent, application-independent, …
  • unlike Word documents, RTF documents, PDF documents, PostScript documents, …
• It is human readable
  • ditto (for most values of 'human')
• The Web interchange format
Who is in charge of XML?
• XML is a W3C Recommendation
• The W3C is the World Wide Web Consortium, a voluntary association of companies and non-profit organizations. Membership costs serious money and confers voting rights. The procedures are complex, with the Director (Tim Berners-Lee) holding all the high cards, but the big vendors (e.g. Microsoft, Adobe, Netscape) have a lot of power.
• The Recommendation was written by the W3C's XML Working Group.
XML as a career move?
• Most of the big computer and entertainment companies believe XML is the solution.
• Exactly what was the problem?
  • Presenting a parts database over the Internet
  • Running an on-line job market (flipdog.com)
  • Usually not corpus creation.
• Scholars win and lose
  • SGML was a minority interest where we had serious influence on what facilities were used
  • XML is mainstream. We're the minority now.
  • This year's .coms are busily hiring people who understand ontologies, NLP and web technology.
Does it live up to the hype?
• Of course not, but…
  • The basic idea is simple labeled brackets. Lisp showed the power of this idea in knowledge representation.
  • Knowledge representation is inherently hard. Lisp made it easier to state the problem, but it wasn't itself the solution. XML won't solve your knowledge representation problems either, but it will let you state them.
• Labeled brackets++
  • Labeled brackets, but designed for information exchange, with sophisticated input (and political pressures) from many interest groups.
Does it live up to the hype?
• Yes. XML and allied standards (XSLT, XML Query) give us a framework for data interchange.
[Diagram: Weather Reports (XML data) → XML Transformation (XSL) → End Users: Browser, Day Planner, Weather Model]
Transformation
• End users will differ in which parts of the weather reports they need, so the middle stage is the crux.
  • One XML format defines the available data
  • Transformations map this format into what is needed by the different applications, leaving out bits that they don't need.
  • One common transformation is to HTML, for browsers. (easy)
  • Another is to printed paper, for efficient random access. (difficult, because our quality expectations are so high)
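The idea of one XML format feeding several application-specific views can be sketched in a few lines of Python. This is only an illustration: the weather element and attribute names below are invented, and plain Python functions stand in for the XSL transformations the slides discuss.

```python
import xml.etree.ElementTree as ET

# Hypothetical weather report; element and attribute names are invented.
WEATHER_XML = """<reports>
  <report city="Columbus"><temp unit="C">24</temp><wind>12</wind><sky>clear</sky></report>
  <report city="Edinburgh"><temp unit="C">15</temp><wind>30</wind><sky>rain</sky></report>
</reports>"""

def to_html(root):
    """Browser view: one table row per city, keeping only temperature and sky."""
    rows = "".join(
        "<tr><td>{}</td><td>{}</td><td>{}</td></tr>".format(
            r.get("city"), r.findtext("temp"), r.findtext("sky"))
        for r in root.findall("report"))
    return "<table>" + rows + "</table>"

def to_planner(root):
    """Day-planner view: only the cities where rain is reported."""
    return [r.get("city") for r in root.findall("report")
            if r.findtext("sky") == "rain"]

root = ET.fromstring(WEATHER_XML)
html_view = to_html(root)
planner_view = to_planner(root)
print(html_view)
print(planner_view)
```

Each view ignores the parts of the data it does not need (here, the wind speed), which is the point made on the slide.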
Representing knowledge in text
• Unformatted text
• Formatted text
• Structured Markup
Unformatted text
United Kingdom
Geography
Location: Western Europe, bordering on the North Atlantic Ocean
and the North Sea, between Ireland and France
Map references: Europe, Standard Time Zones of the World
Area:
total area: 244,820 km2
land area: 241,590 km2
comparative area: slightly smaller than Oregon
note: includes Rockall and Shetland Islands
Land boundaries: total 360 km, Ireland 360 km
Coastline: 12,429 km
Formatted text
United Kingdom
Geography
Location: Western Europe, bordering on the North Atlantic Ocean
and the North Sea, between Ireland and France
Map references: Europe, Standard Time Zones of the World
Area: total area: 244,820 km2
land area: 241,590 km2
comparative area: slightly smaller than Oregon
>> note: includes Rockall and Shetland Islands
Land boundaries: total 360 km, Ireland 360 km
Coastline: 12,429 km
XML marked up text
<chapter><title>United Kingdom</title>
<section><title>Geography</title>
<featlist>
<feat name='Location'>Western Europe, bordering on the North
Atlantic Ocean and the North Sea, between Ireland and France</feat>
<feat name='Map references'>Europe, Standard Time
Zones of the World</feat>
<feat name='Area'><featlist>
<feat name='total area'>244,820 km2</feat>
<feat name='land area'>241,590 km2</feat>
<feat name='comparative area'>slightly smaller than Oregon
<addendum>note: includes Rockall and Shetland Islands</addendum>
</feat></featlist></feat>
<feat name='Land boundaries'>total 360 km, Ireland 360 km
</feat></featlist>
</section>
</chapter>
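Once the text carries structured markup, a program can address its parts directly. The sketch below uses Python's standard xml.etree.ElementTree on a shortened version of the feature list; the shortening and the variable names are my own choices for illustration.

```python
import xml.etree.ElementTree as ET

DOC = """<chapter><title>United Kingdom</title>
<section><title>Geography</title>
<featlist>
<feat name='Location'>Western Europe</feat>
<feat name='Area'><featlist>
<feat name='total area'>244,820 km2</feat>
<feat name='land area'>241,590 km2</feat>
</featlist></feat>
<feat name='Coastline'>12,429 km</feat>
</featlist>
</section></chapter>"""

chapter = ET.fromstring(DOC)

# Top-level features of the section (direct children of the outer featlist only).
top_level = [f.get("name") for f in chapter.findall("section/featlist/feat")]

# A nested feature, addressed by name regardless of depth.
land_area = chapter.find(".//feat[@name='land area']").text

print(top_level)
print(land_area)
```

Neither query is practical against the unformatted or the RTF version of the same text.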
The syntax...
But aren't all those angle brackets still terribly cumbersome and complicated?
• Yes. XML is simple only relative to SGML. But...
  • There are tools that allow you to add XML annotation without the need to know XML
  • There are tools that allow you to search XML annotation without the need to know XML
  • XML is no more complex than other annotation schemes
  • If you roll your own scheme, you'll have to write (and maintain) the tools.
  • If you use XML, part or all of your tool set will be provided by the mainstream computer industry.
RTF Format
{\rtf1\ansi \pard\plain\s1\fs36\ppscheme-3\lang2057 {\f1\lang1033 Formatted
text\par
}\pard\plain\s2\li270\fi-270\fs28\ppscheme-1\lang2057\li0\fi0 {\b\f1\fs32\ppscheme6\lang1033 United Kingdom}{\f1\fs20\lang1033 }{\f1\fs16\lang1033 \par
}\pard\s2\li270\fi-270\fs28\ppscheme-1\lang2057\li0\fi0 {\b\f1\fs24\lang1033
Geography}{\f1\fs12\lang1033 \par
}\pard\s2\li270\fi-270\fs28\ppscheme-1\lang2057\li0\fi0 {\f1\fs20\lang1033
Location: Western Europe, bordering on the North Atlantic Ocean \par
}\pard\s2\li270\fi-270\fs28\ppscheme-1\lang2057\li0\fi0 {\f1\fs20\lang1033
and the North Sea, between Ireland and France\par
}\pard\s2\li270\fi-270\fs28\ppscheme-1\lang2057\li0\fi0 {\f1\fs20\lang1033
Map references: Europe, Standard Time Zones of the World \par
}\pard\s2\li270\fi-270\fs28\ppscheme-1\lang2057\li0\fi0 {\f1\fs20\lang1033
Area: total area: 244,820 km2 \par
}\pard\s2\li270\fi-270\fs28\ppscheme-1\lang2057\li0\fi0 {\f1\fs20\lang1033
land area: 241,590 km2 \par
}\pard\s2\li270\fi-270\fs28\ppscheme-1\lang2057\li0\fi0 {\f1\fs20\lang1033
comparative area: slightly smaller than Oregon\par
}\pard\s2\li270\fi-270\fs28\ppscheme-1\lang2057\li0\fi0 {\f1\fs20\lang1033
>>
note: includes}{\f1\fs20\lang1033 Rockall}{\f1\fs20\lang1033 and Shetland
Islands\par
}\pard\s2\li270\fi-270\fs28\ppscheme-1\lang2057\li0\fi0 {\f1\fs20\lang1033
Land
boundaries: total 360 km, Ireland 360 km \par
}\pard\s2\li270\fi-270\fs28\ppscheme-1\lang2057\li0\fi0 {\f1\fs20\lang1033
Coastline: 12,429 km\par}}
XHTML is a use of XML
• HTML is derived from SGML, but it is an application, not a subset
  • SGML/XML let you define new types of document
  • HTML only gives you a language to write document instances
  • Hard-wired to a particular tag set (often with proprietary extensions, e.g. frames)
  • Hard-wired to a particular typographic format, with limited style sheets
• XHTML is to XML as HTML is to SGML
What is XML?
• SGML Lite
  • Simpler to write
  • Simpler to parse
• HTML Heavy
  • New user-definable tags
  • Not (just) about browsing
  • Data interchange
  • Heavily legislated syntax
What is XML?
• XML is just labeled brackets. You get elements with a start tag, some content, and an end tag.
<memo>
<sender>Marc Moens</sender>
<recipient>Henry, David</recipient>
<status>confidential</status>
<subject>GGP Contract</subject>
<message>The GGP contract
is ready for signature.
Please sign the contract
as well as the NDA.</message>
</memo>
XML is SGML made simple
<memo>
<sender>Marc Moens
<recipient>Henry, David
<status>confidential</status>
<subject>GGP Contract
<message>The GGP contract
is ready for signature. </memo>
• SGML is labeled brackets too. You get elements with an optional start tag, some content, and an optional end tag.
XML Basics
Document Type Definition (DTD):
• Describes what can (and can't) be in a particular type of document
• E.g. a memo DTD might specify that every memo has:
  • sender (name),
  • recipients (list of names),
  • date (default: today),
  • subject,
  • message,
  • status (confidential or unrestricted)
Document Instance:
• Identifies the document type and contains the marked-up text
• E.g. a memo document instance:
  • refers to the memo DTD
  • contains text marked up in conformance with that DTD
XML and document structure
XML is used to make the structure of documents
• explicit
• machine readable
Document content
SGML Tags
Marc Moens
This is the first paragraph. It
has some text.
This is the second paragraph
with some more text.
XML markup
<article status='draft'>
<header>
<title>XML tags
</title>
<author>Marc Moens
</author>
</header>
<body>
<para>This is the
first paragraph. It has
some text. </para>
<para> This is the
second paragraph with
some more
text <emph>and</emph> an
embedded element.
</para></body>
</article>
Elements:
• start tags, e.g. <author>
• content, e.g. Marc Moens
• end tags, e.g. </author>
Elements mark up text to indicate the structure and function of text (as opposed to appearance).
tag name = element type
Elements can have attributes.
Elements and attributes are defined in the Document Type Definition.
XML markup: for structure and function
Encodes structure information to support rendering as well as data handling
He shouted: 'Come here now, Mr Banks.'
<sentence>He
<verb>shouted</verb>:
<quote>
<verb mood='imperative'>Come</verb> here
<emphasis>now</emphasis>,
<person><title>Mr</title>
<name>Banks</name></person></quote></sentence>
Data handling, e.g.
• search for all quotes inside sentences but not in footnotes;
• search for every mention of someone called Banks without finding the Banks of Scotland
[Use an XML-aware query tool]
Rendering, e.g.
• emphasis should be bold underline;
• quotes should be in italics
[Use a stylesheet]
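The two data-handling searches can be sketched with the path queries in Python's standard xml.etree.ElementTree, standing in for "an XML-aware query tool". The second sentence, the footnote and the <org> element are invented here to give the queries something to exclude.

```python
import xml.etree.ElementTree as ET

DOC = """<text>
<sentence>He <verb>shouted</verb>: <quote><verb mood='imperative'>Come</verb> here
<emphasis>now</emphasis>, <person><title>Mr</title> <name>Banks</name></person></quote></sentence>
<sentence>The <org><name>Banks</name> of Scotland</org> issue their own notes.</sentence>
<footnote><quote>Not this one.</quote></footnote>
</text>"""

root = ET.fromstring(DOC)

# Quotes inside sentences; the quote inside the footnote is not reached.
quotes_in_sentences = root.findall(".//sentence//quote")

# People called Banks: only <name> elements whose parent is a <person>.
people = [n.text for n in root.findall(".//person/name") if n.text == "Banks"]

# For contrast, a plain-text search finds both mentions of "Banks".
plain_hits = "".join(root.itertext()).count("Banks")

print(len(quotes_in_sentences), people, plain_hits)
```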
XML: Relevance for Linguists
• Simplify and standardize appeal to context
  • E.g. build a tokenizer which specifically works for headlines of newspaper articles: we need to be able to tell the tokenizer where the headline starts and ends
• Annotate text with interesting linguistic information
  • E.g. use XML tags to record the results of a tokenizer or part of speech tagger, or of a human annotator
• Allow sharing of results between research efforts
  • without having to write a new parser every time you get new material from somewhere
XML: Relevance for Linguists (example)
cat text | lttok -q '.*/P' -m W | ltpos -q '.*/W' -m C
Use the tokeniser lttok on all paragraphs <P> in the text and mark the resulting words as <W> elements.
Then run the part of speech tagger ltpos over the text and POS-tag all the <W> elements, putting the result in attribute C.
<W C=VBD>said</W><W C=DET>the</W>
<W C=NN>director</W><W C=IN>of</W>
<W C=NNP>Russian</W><W C=NNP>Bear</W>
<W C=NNP>Ltd. </W><W C='.'>.</W>
Associated Standards
• XSLT
  • Transforming documents
• XML Query
  • Find bits of documents
• XML Schema
  • Use element syntax for DTDs
• Namespaces
  • Ensure that <art:draw><cube/><cube/></art:draw> and <soccer:draw><team name="crew"/><team name="burn"/></soccer:draw> both get processed correctly.
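A short Python sketch of what namespaces buy you, using the standard xml.etree.ElementTree. The wrapper element <page> and the two namespace URIs are placeholders invented for the example; only the two draw fragments come from the slide.

```python
import xml.etree.ElementTree as ET

# The namespace URIs and the <page> wrapper are placeholders for this sketch.
DOC = """<page xmlns:art="http://example.org/art" xmlns:soccer="http://example.org/soccer">
<art:draw><cube/><cube/></art:draw>
<soccer:draw><team name="crew"/><team name="burn"/></soccer:draw>
</page>"""

NS = {"art": "http://example.org/art", "soccer": "http://example.org/soccer"}
root = ET.fromstring(DOC)

art_draw = root.find("art:draw", NS)
soccer_draw = root.find("soccer:draw", NS)

# The parser stores the expanded name {URI}local, so the two 'draw' elements differ.
print(art_draw.tag)
print(soccer_draw.tag)

teams = [t.get("name") for t in soccer_draw.findall("team")]
print(teams)
```

Because each draw is identified by its namespace URI rather than by its prefix, an application can dispatch the two elements to different handlers.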
Infrastructure standards
• XPath
  • Referring to parts of documents
• XPointer
  • Pointing at documents and parts of documents
• DOM
  • Uniform programmer's interface to document trees (abstracts away from some details)
• SAX
  • Stream-based document interface (essential for big documents)
• Information Set
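The difference between the tree-based and the stream-based interface can be seen by reading the same fragment both ways. This sketch uses Python's standard xml.dom.minidom and xml.sax modules; the class name TagCollector is my own.

```python
import xml.dom.minidom
import xml.sax

DOC = "<P><W C='VBD'>said</W><W C='DET'>the</W><W C='NN'>director</W></P>"

# DOM: the whole document becomes a tree in memory that can be navigated freely.
dom = xml.dom.minidom.parseString(DOC)
dom_tags = [w.getAttribute("C") for w in dom.getElementsByTagName("W")]

# SAX: the parser calls back once per event; nothing is kept unless the handler keeps it.
class TagCollector(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.tags = []

    def startElement(self, name, attrs):
        if name == "W":
            self.tags.append(attrs.get("C"))

handler = TagCollector()
xml.sax.parseString(DOC.encode("utf-8"), handler)
sax_tags = handler.tags

print(dom_tags)
print(sax_tags)
```

Both give the same answer here. The SAX version never holds more than one event at a time, which is why the stream-based style matters for big documents.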
XML in detail
• Well-formedness and validity
• DTDs
• XML tools
• XSLT
• XML Query
Well-formed and Valid documents
• Well-formed XML
  • Each start tag has an end tag
  • XML content is rooted in a single "document element"
  • Valid encoding declaration
• Valid
  • Well-formed
  • All elements mentioned in DTD
  • All entities defined
  • All parent-child relations as described in DTD
  • All attributes used as described in DTD
  • All element IDs unique
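Well-formedness can be checked without any DTD. A minimal sketch in Python: the standard xml.etree.ElementTree parser is non-validating, so it reports well-formedness errors only. The helper name is_well_formed and the sample strings are invented for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """True if the text parses as XML; no DTD is consulted."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

samples = {
    "ok": "<memo><sender>Marc Moens</sender></memo>",
    "missing end tag": "<memo><sender>Marc Moens</memo>",
    "two document elements": "<memo/><memo/>",
}
results = {label: is_well_formed(text) for label, text in samples.items()}
print(results)
```

Checking validity would need a validating parser and a DTD, which this sketch does not attempt.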
Why well-formedness?
• A simpler standard for documents to meet
  • Can be determined without reference to a DTD
  • Simplifies the parser
  • Retains the "standalone" property of HTML, which was a big win.
• Non-validating XML systems can thus still be conformant, provided they check well-formedness
• If you have a DTD (or a Schema) you can do more refined processing.
DTDs
• Document Type Definitions: the grammar of a document family
  • Elements
  • Attributes & values
  • Entities & parameter entities
  • Comments
DTD: Elements
• Elements are used to structure a document. Element types are declared in the DTD:
<!DOCTYPE article [
<!ELEMENT article (title, section+) >
<!ELEMENT section (title, para+) >
<!ELEMENT para (#PCDATA) >
<!ELEMENT title (#PCDATA) >
]>
DTD: Attribute declarations
• Attributes specify properties of elements. The attributes which may appear on elements of a given type are also declared in the DTD.
<!DOCTYPE article [
<!ELEMENT article (title, section+) >
<!ATTLIST article artno CDATA #IMPLIED >
<!ELEMENT section (title, para+) >
<!ATTLIST section secid ID #REQUIRED >
<!ELEMENT para (#PCDATA) >
<!ELEMENT title (#PCDATA) >
]>
DTD: Entity declarations
• Entities provide short names for commonly used strings, and are also declared in the DTD.
<!DOCTYPE article [
<!ELEMENT article (title, section+) >
<!ATTLIST article artno CDATA #IMPLIED >
<!ELEMENT section (title, para+) >
<!ATTLIST section secid ID #REQUIRED >
<!ENTITY ltg "Language Technology Group">
]>
DTD: IDs
• IDs are rigid designators for particular elements in the document. They are declared using type ID
<!DOCTYPE article [
<!ELEMENT article (title, section+) >
<!ATTLIST article artno CDATA #IMPLIED >
<!ELEMENT section (title, para+) >
<!ATTLIST section secid ID #REQUIRED >
<!ENTITY ltg "Language Technology Group">
]>
• Potentially, IDs allow processors to provide fast random access to parts of documents.
• IDs must be unique. Checking might be onerous
XML tools
• XML Parser
• LT XML Toolkit
• XSLT: xt and Saxon
XML Parser
• Probably the most important single bit of XML software
• Uses the DTD to check if a document instance is valid
Example: >> cat memo.xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE article [
<!ELEMENT article (para+)>
<!ELEMENT para (#PCDATA)>
<!ENTITY ltg "Language Technology Group">
]>
<article>
<para>
This is the text of a very short article,
with very little internal structure.
Here is a reference to the &ltg; entity.
</para>
</article>
Example: >> xmlnorm -V memo.xml
Add correct output
Entity reference has been replaced with entity text by the parser
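The slide does not show xmlnorm's output, and nothing here tries to reproduce it. As a substitute, the same entity replacement can be observed with Python's standard xml.etree.ElementTree, which expands entities declared in the internal DTD subset while parsing. The memo is shortened and the encoding declaration is left out for this sketch.

```python
import xml.etree.ElementTree as ET

MEMO = """<?xml version="1.0"?>
<!DOCTYPE article [
<!ELEMENT article (para+)>
<!ELEMENT para (#PCDATA)>
<!ENTITY ltg "Language Technology Group">
]>
<article>
<para>Here is a reference to the &ltg; entity.</para>
</article>"""

article = ET.fromstring(MEMO)

# By the time the application sees the text, &ltg; has already been replaced.
para_text = article.find("para").text
print(para_text)
```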
Exercise
• Practice using xmlnorm to check your documents
• Add some new entities to the memo.
• Experience some of xmlnorm's error messages
• Begin to think about DTD design
• Practice using Web browsers to look at XML files
• Get a glimpse of what XSL is about
DTD: Comments
<!DOCTYPE article [
<!-- Just a simple example DTD -->
<!ELEMENT article (title, section+) >
<!ATTLIST article artno CDATA #IMPLIED >
<!ELEMENT section (title, para+) >
<!ATTLIST section secid ID #REQUIRED >
<!ELEMENT para (#PCDATA) >
<!ELEMENT title (#PCDATA) >
<!ENTITY ltg 'Language Technology Group'>
]>
Element type declaration details
<!ELEMENT chapter (title, section+) >
• <!ELEMENT: keyword
• chapter: element type
  • start with a-z
  • may contain hyphen, number, stops
  • case sensitive in XML (SGML names are not)
  • one per declaration in XML (in SGML there can be more than one)
• (title, section+): content model
  • an unambiguous regular expression
Element types: Content model
<!ELEMENT article (title, section+) >
+   at least one, possibly more
?   optional
*   zero or more
,   all occur, in that order
|   exclusive or
<!ELEMENT header
( ( (title, subtitle?), (author, affil)+ ),
(date | status)? ) >
XML eradicated SGML's neat &: all occur, any order
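Since a content model is a regular expression over child element names, it can be translated by hand into an ordinary regular expression and tested. The sketch below does this for the header model above; the translation and the helper name matches_model are my own, and this is not how a real validating parser is implemented.

```python
import re

# Hand translation of the header content model:
#   ( ( (title, subtitle?), (author, affil)+ ), (date | status)? )
# Each child element name is written as "name " so that names cannot run together.
HEADER_MODEL = re.compile(r"title (subtitle )?(author affil )+(date |status )?")

def matches_model(children):
    """True if the sequence of child element names satisfies the content model."""
    sequence = "".join(name + " " for name in children)
    return HEADER_MODEL.fullmatch(sequence) is not None

minimal = matches_model(["title", "author", "affil"])
full = matches_model(["title", "subtitle", "author", "affil", "author", "affil", "date"])
wrong_order = matches_model(["author", "affil", "title"])
both_options = matches_model(["title", "author", "affil", "date", "status"])

print(minimal, full, wrong_order, both_options)
```

The last case fails because | is an exclusive choice: a header may end with a date or a status, but not both.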
Element types: Content model options
<!ELEMENT graphic EMPTY >
• EMPTY
  • no content
  • no separate end tag: written <graphic/> in XML
  • point semantics: attributes may specialise
• (#PCDATA)
  • text only
• ANY
  • no constraint: sub-elements and/or text
• (#PCDATA|emph)*
  • 'mixed content'
Element grammar
• Since each content model is a regular expression, the markup grammar is context-free
• Except for one thing
  • the ANY keyword
• Note that any realistic application interprets the markup tree. The interpretation could be anything. All bets are off…
Example: >> nsgmls exa2a.sgm
nsgmls:exa2a.sgm:7:42:E: element "PI" undefined
nsgmls:exa2a.sgm:8:24:E: general entity "T." not defined and no default entity
(ARTICLE
(PARA
-Here is some text with an inequality: a
(PI
-2
and an abbreviation: AT
)PI
)PARA
)ARTICLE
• <pi/ interpreted as start tag
• &T. interpreted as entity reference, not defined so gone from output
• No C to confirm validity.
Escaping special characters
• There are several ways around the problem of introducing XML's meta-syntax characters into documents
  • Use numeric character references
    AT&#38;T
  • Use CDATA marked sections
    <![CDATA[<this> is data &not markup]]>
  • Use the built-in entities: XML provides built-in definitions for amp, lt, gt, quot and apos
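The three escaping routes can be compared directly. This sketch uses Python's standard library (xml.sax.saxutils.escape and xml.etree.ElementTree); the sample string is adapted from the inequality and abbreviation used in the nsgmls example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "a<pi/2 and AT&T"

# Built-in entities: escape() replaces &, < and > with &amp;, &lt; and &gt;.
escaped = escape(raw)
print(escaped)

# Three spellings of the same character data.
via_entities = ET.fromstring("<para>" + escaped + "</para>").text
via_char_ref = ET.fromstring("<para>a&#60;pi/2 and AT&#38;T</para>").text
via_cdata = ET.fromstring("<para><![CDATA[a<pi/2 and AT&T]]></para>").text

print(via_entities == via_char_ref == via_cdata == raw)
```

After parsing, the application sees the same text whichever route the author chose.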
Example: >> nsgmls exa2b.sgm
(ARTICLE
(PARA
-Here is some text with an inequality: a<pi/2\n
and an abbreviation: AT&T.
)PARA
)ARTICLE
C
DTD: Comments
<!-- Comments added here -->
double hyphens act as comment delimiters
<!ELEMENT article (title, section+)>
DTD: Attributes
<!DOCTYPE article [
<!ELEMENT article (title, section+) >
<!ATTLIST article artno CDATA #IMPLIED >
<!ELEMENT section (title, para+) >
<!ATTLIST section secid ID #REQUIRED >
<!ELEMENT para (#PCDATA) >
<!ELEMENT title (#PCDATA) >
]>
DTD Attribute declarations: syntax
<!ATTLIST article artno CDATA #IMPLIED >
• <!ATTLIST: keyword
• article: element type
• artno: attribute name
• CDATA: attribute type
• #IMPLIED: default type, one of
  • #REQUIRED
  • #IMPLIED (= optional)
  • #FIXED
Attribute Value types (contd)
<!ATTLIST article artno CDATA #IMPLIED >
CDATA    valid SGML characters
ENTITY   declared entity name
ID       unique name
IDREF    reference to a unique name
Cross-references
<!DOCTYPE article [
<!ELEMENT article (section+)>
<!ATTLIST section secid ID #IMPLIED>
<!ELEMENT section (#PCDATA | xref)*>
<!ELEMENT xref EMPTY>
<!ATTLIST xref xrefid IDREF #REQUIRED>
]>
<article>
<section secid='s1'>Here is some text.</section>
<section>In section <xref xrefid='s1'/> we showed
you how to create crossreferences.</section>
</article>
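A validating parser checks that IDs are unique and that every IDREF points at an existing ID. Python's standard xml.etree.ElementTree does not validate, so the sketch below performs the two checks by hand. The function check_ids is hypothetical, and the third section with a dangling reference is added only to show a failure.

```python
import xml.etree.ElementTree as ET

DOC = """<article>
<section secid='s1'>Here is some text.</section>
<section>In section <xref xrefid='s1'/> we showed
you how to create crossreferences.</section>
<section>A dangling reference: <xref xrefid='s9'/>.</section>
</article>"""

def check_ids(root, id_attr="secid", ref_attr="xrefid"):
    """Return (duplicate IDs, IDREF values with no matching ID)."""
    seen, duplicates = set(), set()
    for elem in root.iter():
        value = elem.get(id_attr)
        if value is not None:
            if value in seen:
                duplicates.add(value)
            seen.add(value)
    dangling = [e.get(ref_attr) for e in root.iter()
                if e.get(ref_attr) is not None and e.get(ref_attr) not in seen]
    return sorted(duplicates), dangling

duplicates, dangling = check_ids(ET.fromstring(DOC))
print(duplicates, dangling)
```

Unlike a real validator, this sketch has to be told which attributes are IDs and IDREFs, because it never reads the ATTLIST declarations.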
IDs and IDREFs
• In a valid SGML/XML document
  • IDs are unique
  • IDREFs are discharged (each one matches the ID of some element)
• Applications may interpret IDREF/ID connections
• Links from elsewhere may target IDs
  • cf. HTML 'name' attribute as the target for #....
Attribute value types: list
CDATA        valid SGML characters                            author='Robin Hood'
ENTITY/IES   declared entity name(s)                          figs='pict2 pict7'
ID           unique name                                      id='foo37'
IDREF(S)     reference(s) to an ID                            refid='foo2 foo37'
NMTOKEN(S)   name(s) without initial-character restriction    code='96-mm01 98-a'
NOTATION     data content notation                            encoding='eps'
Enumerated attribute values
Attribute values can also be constrained to be one of a finite set of allowed values
<!ATTLIST section
status (draft|alpha|beta|final) 'draft' >
<section status='alpha'>
<section status='final'>
<section>
<section status='gamma'>   Not valid
Elements vs Attributes
<!ELEMENT date (day, month, year)>
<!ELEMENT day (#PCDATA)>
Order will be enforced
Content is unconstrained
vs
<!ELEMENT date EMPTY>
<!ATTLIST date day NUMBER #REQUIRED
month NUMBER #REQUIRED
year NUMBER #REQUIRED>
Content is constrained
Order is unconstrained
DTD: Entities
<!DOCTYPE article [
<!ELEMENT article (#PCDATA)>
<!ENTITY ltg 'Language Technology Group'>
]>
<article>
The &ltg; carries out application-oriented research in
language engineering. The &ltg; is based within
the HCRC.
</article>
Each occurrence of &ltg; in the text is replaced by Language Technology Group during parsing.
Entities can be nested:
<!ENTITY hcl 'HCRC &ltg;'>
DTD: Parameter Entities
Like entities, except within the DTD
<!ENTITY % section '(title?, para+)'>
Each time the parser finds %section; in the DTD, it will replace it with (title?, para+)
<!ELEMENT article (title, %section;+)>
<!ELEMENT subsect (%section;+)>
DTD
• That's almost all there is to it
  • For more detail, see the XML standard
  • Which, as Michael Kay puts it, is like tax legislation
• DTD syntax differs from element syntax
  • Harder to learn/use XML Schema
• Also, DTDs were designed to be used by document designers, not for distributed data interchange
  • XML can use a DTD, but doesn't assume one.
  • Composite documents entail composite DTDs, but these don't exist.
  • Namespace prefixes add extra complexity
XSL Transformations
Content from one
document.
Style from another
Structure
barts_stylish_memo.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="memo.xsl"?>
<!DOCTYPE article [
<!ELEMENT article (title,(para|credit)+)>
<!ELEMENT para (#PCDATA)>
<!ENTITY ltg "Language Technology Group">
<!ENTITY author "Bart Simpson">
<!ENTITY techie "Lisa Simpson">
<!ENTITY parents "Marge and Homer">
<!ENTITY school "M&amp;M University">
]>
<article>
<title>Bart's Ph.D Thesis</title>
<para> by &author;: &school;</para>
<para>
This is the text of a very short article,
with very little internal structure.
Here is a reference to the &ltg; entity.
Please may I stop now?
</para>
<credit>
&techie; of &school; for slick XML authoring.
</credit>
<credit>
&parents; for unfailing support.
</credit>
</article>
memo.xsl
IE5 attempts to display the style in visual form, without any content. Germ of a good idea here.
Source of memo.xsl
<?xml version="1.0" encoding="ISO-8859-1" ?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/TR/WD-xsl">
<xsl:template match="/">
<html>
<head><title><xsl:value-of select="//title"/></title>
</head>
<body BGCOLOR='#FFFFCC'>
<h1><xsl:value-of select="//title"/></h1>
<xsl:for-each select="//para">
<p><xsl:value-of/></p>
</xsl:for-each>
<hr/><p>
<i> Thanks to: </i><br/>
<xsl:for-each select="//credit">
&#160; <xsl:value-of/><br/>
</xsl:for-each><hr/>
</p>
</body>
</html>
</xsl:template>
</xsl:stylesheet>
Fill in the blanks
<?xml version="1.0" encoding="ISO-8859-1" ?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/TR/WD-xsl">
<xsl:template match="/">
<html>
<head><title>•••</title>
</head>
<body BGCOLOR='#FFFFCC'>
<h1>•••</h1>
<xsl:for-each select="//para">
<p>•••</p>
</xsl:for-each>
<hr/><p>
<i> Thanks to: </i><br/>
<xsl:for-each select="//credit">
&#160; ••• <br/>
</xsl:for-each> <hr/>
</p>
</body>
</html>
</xsl:template>
</xsl:stylesheet>
XSLT gives you tools for sending part of a document to one place, part to another.
The simplest use is pure fill-in-the-blanks. Anybody who uses HTML, PHP and so on will be comfortable with this use of XSLT.
If necessary, it is a Turing-complete programming language. It gives you the rope if you need it.
Fill in the blanks
<?xml version="1.0" encoding="ISO-8859-1" ?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/TR/WD-xsl">
<xsl:template match="/">
<html>
<head><title> <xsl:value-of select="//title"/> </title>
</head>
<body BGCOLOR='#FFFFCC'>
<h1> <xsl:value-of select="//title"/> </h1>
<xsl:for-each select="//para">
<p> <xsl:value-of/> </p>
</xsl:for-each>
<hr/><p>
<i> Thanks to: </i><br/>
<xsl:for-each select="//credit">
&#160; <xsl:value-of/> <br/>
</xsl:for-each> <hr/>
</p>
</body>
</html>
</xsl:template>
</xsl:stylesheet>
XSLT standards
• Microsoft's implementation in IE5 is non-standard (they put it out well before the standard existed). They are moving to conformance.
• James Clark's xt and Michael Kay's Saxon are much more complete and conformant
• W3C eats its own lunch. The HTML versions of the XML standard are generated with XSL
• In practice, current best options are
  • Static data: pre-generate HTML from XML at publication time
  • Dynamic data: use Saxon or xt as Java Servlets
Generating HTML
HTML is generated by running Saxon on poem.xml and poem.xsl:
saxon poem.xml poem.xsl > poem.html
Using IE5 to view poem.xml
<poem>
<author>Rupert Brooke</author>
<date>1912</date>
<title>Song</title>
<stanza>
<line>And suddenly the wind comes soft,</line>
<line>And Spring is here again;</line>
<line>And the hawthorn quickens with buds of
green</line>
<line>And my heart with buds of pain.</line>
</stanza>
<stanza>
<line>My heart all Winter lay so numb,</line>
<line>The earth so dead and frore,</line>
<line>That I never thought the Spring would come
again</line>
<line>Or my heart wake any more.</line>
</stanza>
<stanza>
<line>But Winter's broken and earth has woken,</line>
<line>And the small birds cry again;</line>
<line>And the hawthorn hedge puts forth its buds,</line>
<line>And my heart puts forth its pain.</line>
</stanza>
</poem>
poem.xsl
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">
<xsl:template match="poem">
<html>
<head>
<title><xsl:value-of select="title"/></title>
</head>
<body>
<xsl:apply-templates select="title"/>
<xsl:apply-templates select="author"/>
<xsl:apply-templates select="stanza"/>
<xsl:apply-templates select="date"/>
</body>
</html>
</xsl:template>
<xsl:template match="title">
<div align="center"><h1><xsl:value-of select="."/></h1></div>
</xsl:template>
<xsl:template match="author">
<div align="center"><h2>By <xsl:value-of select="."/></h2></div>
</xsl:template>
<xsl:template match="stanza">
<p><xsl:apply-templates select="line"/></p>
</xsl:template>
<xsl:template match="line">
<xsl:if test="position() mod 2 = 0">&#160;&#160;</xsl:if>
<xsl:value-of select="."/><br/>
</xsl:template>
<xsl:template match="date">
<p><i><xsl:value-of select="."/></i></p>
</xsl:template>
</xsl:stylesheet>
Namespace declaration is different (standard conforming) for Saxon.
+ XSLT language is different.
+ Saxon and XT are really easy to install.
- IE5 has millions of current users
“Problems” with XML
• Uses complex and weird terminology
  • Yes. But so does the ANSI C standard. So do most fields…
• Not convenient for specifying graphs (as opposed to trees)
  • This is a point about graphs, not XML. Unification grammar notations get unwieldy too.
• Not as convenient as plain text
  • True for some tasks, but the extra structure of XML lets you do things that you wouldn't even try with plain text.
XML tools for Unix
• Simple equivalents of UN*X tools are available (for free) to do simple SGML processing
• We'll introduce them using examples, and give details at the end
sggrep
• LT XML program for searching for structure and text in XML files
• sggrep -q query -s subquery -t regexp in.xml
• Options
  • -d DTD : Specify a DTD explicitly. File is an XML file
  • -r : Attribute values in queries are regular expressions.
  • -v : Invert sense of sub-query+regexp.
  • Other options
LT XML query language
• Two-dimensional regular expressions
  • First dimension is over tree paths
    • Based on file path analogy: DIV/PARA/W matches Ws inside PARAs inside (toplevel) DIVs
  • Second dimension is regular expressions over text content of leaf nodes
    • Select Ss containing Ws whose text is it's or its: -q S -s './W' -t "^(it's|its)$"
    • Full UTZOO (Henry Spencer) regular expression support
• Influential, slightly dated now.
sggrep: examples of use
• sggrep -q ".*/P/S" -s "./W[TAG=NN]"
  → find all S elements occurring inside a P element at any depth which immediately contain a W element with attribute TAG="NN".
• sggrep -q ".*/P/S/W[TAG=NN]"
  → find those W elements themselves
• sggrep -q ".*/S/W[0]" -t "^[a-z]"
  → find all sentence-initial words starting with a lower case letter.
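For readers without LT XML to hand, the three queries can be approximated with the path expressions in Python's standard xml.etree.ElementTree. This is an analogy, not a reimplementation of sggrep: the sample document is invented, and ElementTree counts positions from 1 where the sggrep query above uses W[0] for the first word.

```python
import re
import xml.etree.ElementTree as ET

# Invented sample text; the TAG values are illustrative only.
DOC = """<TEXT><P>
<S><W TAG='DT'>the </W><W TAG='NN'>cat </W><W TAG='VBD'>sat</W></S>
<S><W TAG='PRP'>It </W><W TAG='VBD'>slept</W></S>
</P></TEXT>"""

root = ET.fromstring(DOC)

# Like -q ".*/P/S" -s "./W[TAG=NN]": S elements inside a P with a W child tagged NN.
sentences_with_noun = [s for s in root.findall(".//P/S")
                       if s.find("W[@TAG='NN']") is not None]

# Like -q ".*/P/S/W[TAG=NN]": the W elements themselves.
nouns = [w.text.strip() for w in root.findall(".//P/S/W[@TAG='NN']")]

# Like -q ".*/S/W[0]" -t "^[a-z]": sentence-initial words starting in lower case.
lower_initial = [w.text.strip() for w in root.findall(".//S/W[1]")
                 if re.match(r"[a-z]", w.text)]

print(len(sentences_with_noun), nouns, lower_initial)
```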
sgmltrans
converts XML into different formats.
sgmltrans -r rulefile file.nsg > file.txt
 sample rule file:
.*/W          matches W
""            what to print at start tag
"/$TAG\n"     what to print at end tag: value of TAG attribute
.*/W/#        matches text inside W
" " --> ""    text replacement: eliminate space if any
.*/S          matches S
""            start tag: nothing
"\n"          end tag: make each S on separate line
.*            matches other markup
sgmltrans: example of use
The previous rule file would do this:
<?xml version='1.0'?>
<TEST><P><S>
<W TAG='A'>The </W>
<W TAG='B'>cat </W>
<W>sat </W>
<C>.</C></S>
<S>
<W TAG='A'>on </W>
<W TAG='B'>the </W>
<W>mat </W>
<C>.</C>
</S></P></TEST>
Output:
The/A
cat/B
sat/
on/A
the/B
mat/
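For comparison, the same word/TAG listing can be sketched in Python with the standard library. This is not sgmltrans and does not read rule files: it hard-codes only the W rules (strip the space, print "/" plus the TAG attribute) and ignores the S and C elements. The function name `words_with_tags` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The input document from the slide (XML declaration omitted).
DOC = """<TEST><P><S>
<W TAG='A'>The </W>
<W TAG='B'>cat </W>
<W>sat </W>
<C>.</C></S>
<S>
<W TAG='A'>on </W>
<W TAG='B'>the </W>
<W>mat </W>
<C>.</C>
</S></P></TEST>"""

def words_with_tags(xml_text):
    """Return one 'word/TAG' string per W element, in document order."""
    root = ET.fromstring(xml_text)
    lines = []
    for w in root.iter("W"):
        word = (w.text or "").replace(" ", "")       # rule: " " --> ""
        lines.append(word + "/" + w.get("TAG", ""))  # missing TAG prints as empty
    return lines

lines = words_with_tags(DOC)
print("\n".join(lines))
```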
sgrpg: SGML report generator
 Program for making more complex queries of normalised SGML and
for transforming SGML.
 Provides nested subqueries and sequencing
 Usage:
 sgrpg query sub-query regexp out-fmt oargs < file.nsg > file.txt
 sgrpg -f pat-file < file.nsg > file.txt
 This now looks like a design study for XSLT and XML Query.
 Has one advantage: it was designed from the outset for big documents
The British National Corpus
 2 gigabytes of contemporary English
 Marked up to word level with part of speech tags
 Extract data:
 zcat medium.xml.gz | sggrep -q ".*/W[TYPE=NN1]"
 gives all singular nouns in a part of the corpus, e.g.
<W TYPE=NN1>part </W>
<W TYPE=NN1>meeting </W>
<W TYPE=NN1>while </W>
<W TYPE=NN1>funeral</W>
<W TYPE=NN1>loss</W>
<W TYPE=NN1>meeting</W>
<W TYPE=NN1>time </W>
The BNC: an example (2)
zcat medium.xml.gz | \
sggrep -q ".*/S" -s "./W[TYPE!=AJ0]" \
-t "^[Rr]ight$"
gives sentences containing non-adjectival uses of the word 'right', e.g.
<S N=092>
<W TYPE=ITJ>Yes </W>
<W TYPE=DT0>that </W>
<W TYPE=VBD>was</W>
<C TYPE=PUN>, </C>
<W TYPE=DT0>that </W>
<W TYPE=VBD>was </W>
<W TYPE=AV0>right</W>
. . .
</S>
The BNC: an example (3)
Format the output into a more readable form:
zcat medium.xml.gz | \
sggrep -q ".*/S" -s "./W[TYPE!=AJ0]" -t "^[Rr]ight$" | \
sgmltrans -r test.rule
Yes/ITJ that/DT0 was/VBD , that/DT0 was/VBD right/AV0 erm/UNC
there/EX0 was/VBD a/AT0 limit/NN1 to/PRP how/AVQ much/AV0
you/PNP could/VM0 spend/VVI aswell/AV0 was/VBD n't/XX0
there/EX0 ?
He/PNP goes/VVZ into/PRP a/AT0 restaurant/NN1 and/CJC he/PNP
says/VVZ oh/ITJ the/AT0 waiter/NN1 erm/UNC let/VVB me/PNP
see/VVI the/AT0 menu/NN1 and/CJC he/PNP looks/VVZ at/PRP
the/AT0 menu/NN1 and/CJC said/VVD right/AV0 , he/PNP said/VVD .
An extended example: Noun Compounds
 Noun compounds in British National Corpus
 What is a noun compound?
Too hard.
 Simple approximation? Sequence of tags matching NN. . .
BNC uses the CLAWS C5 tagset (in the tradition of the Brown tags), where NN0, NN1, . . . are
all variants of Noun
 A pipeline of SGML-aware tools will do the job
 sgrpg | sggrep [ | . . .]
 Use sgrpg to wrap such tag sequences in <G> ... </G>.
 Use sggrep to filter the output.
 Use further tools to tabulate, format, etc.
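The grouping step can be sketched in Python, assuming the input has already been reduced to (word, tag) pairs. This illustrates the "sequence of tags matching NN" approximation only, not the sgrpg pattern file. The sample data, the function names and the minimum length of 2 are assumptions made for the illustration.

```python
import re
from itertools import groupby
from xml.sax.saxutils import escape

# Hypothetical tagged input; the tags follow the BNC style used on the slides.
TAGGED = [("Local", "AJ0-NN1"), ("government", "NN0"), ("districts", "NN2"),
          ("are", "VBB"), ("large", "AJ0"), ("the", "AT0"), ("report", "NN1")]

NOUNISH = re.compile(r"NN")  # unanchored, so ambiguous tags like AJ0-NN1 also match

def noun_groups(tagged, min_len=2):
    """Collect maximal runs of noun-like tags; min_len is an assumed cut-off."""
    groups = []
    for is_noun, run in groupby(tagged, key=lambda wt: bool(NOUNISH.search(wt[1]))):
        run = list(run)
        if is_noun and len(run) >= min_len:
            groups.append(run)
    return groups

def to_xml(group):
    """Wrap one run in <G>...</G>, in the style of the slides' output."""
    inner = "".join("<W TYPE='%s'>%s</W>" % (tag, escape(word)) for word, tag in group)
    return "<G>" + inner + "</G>"

groups = noun_groups(TAGGED)
print([to_xml(g) for g in groups])
```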
An extended example: The pipe
 Step by step through the pipe
 sgrpg -r -f np-pat.xml | ...
Group the sequences
-r use regexp matching
-f script file
 ... sggrep -d groups.xml -q '.*/G'
extract the sequences
-d DTD
-q query (selects groups)
 Result:
<G><W TYPE='AJ0-NN1'>Local</W>
<W TYPE='NN0'>government</W>
<W TYPE='NN2'>districts</W></G>
...
An extended example: filtering
 Find all words with unresolved tags, e.g. AJ0-NN1
 use regexp matching, which is unanchored by default
 ...| sggrep -r -q './W[TYPE="-"]' | ...
 Find all words in second position
 ...| sggrep -q './W[1]' | ...
 Find all words with unresolved tags in second position
 ...| sggrep -r -q './W[1 TYPE="-"]' | ...
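The positional and attribute filters can be mimicked in Python as follows. This is a sketch over a hypothetical two-group sample, not sggrep itself; it assumes, as the queries above suggest, that positions are counted from zero, so W[1] is the second word. The function name `second_words` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical output of the grouping step.
GROUPS = """<GROUPS>
<G><W TYPE='AJ0-NN1'>Local</W><W TYPE='NN0'>government</W><W TYPE='NN2'>districts</W></G>
<G><W TYPE='NN1'>car</W><W TYPE='NN1-VVB'>park</W></G>
</GROUPS>"""

def second_words(root, unresolved_only=False):
    """Return the text of the second W in each G, optionally keeping only
    words whose TYPE contains '-' (an unresolved, ambiguous tag)."""
    out = []
    for g in root.iter("G"):
        ws = g.findall("W")
        if len(ws) < 2:
            continue
        w = ws[1]  # zero-based index 1, i.e. second position
        if unresolved_only and "-" not in w.get("TYPE", ""):
            continue
        out.append(w.text)
    return out

root = ET.fromstring(GROUPS)
all_second = second_words(root)
unresolved_second = second_words(root, unresolved_only=True)
print(all_second, unresolved_second)
```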
An extended example: counting
 Count all words in second position
 ...| sggrep -q './W[1]' | sgcount
 Count all words with unresolved tags in second position
 ...| sggrep -r -q './W[1 TYPE="-"]' | sgcount
 Results:
 all 2nd place W: 23283
 2nd place W with unresolved tag: 5066
An extended example: long compounds
 Long compounds including 'government'
 Use subquery to select <G>...</G>s with 'government':
 sggrep -q G -s './W' -t government
 Next step, discard short ones:
 sggrep -q G -s './W[2]'
 Then sgmltrans for neater format
 Results:
official/AJ0-NN1 government/NN0 report/NN1-VB
Local/AJ0-NN1 government/NN0 districts/NN2
...
An extended example: left context
 select for 'government' in 2nd place
 . . . | sggrep -q G -s './W[1]' -t government |
 pull words from first place
 sggrep -q './W[0]' |
 remove markup
 textonly |
 use UN*X for the rest
 sort | uniq -c | sort -nr | head -4
6 French
5 German
4 interim
4 Chinese
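The whole left-context pipe can be approximated in a few lines of Python. This is a sketch over hypothetical sample groups, so the counts do not reproduce the BNC figures above. It tests for an exact word match, which is stricter than the unanchored regexp given to sggrep. The function name `left_context` is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical groups standing in for the output of the grouping step.
GROUPS = """<GROUPS>
<G><W>French</W><W>government</W><W>policy</W></G>
<G><W>German</W><W>government</W></G>
<G><W>French</W><W>government</W></G>
<G><W>local</W><W>council</W></G>
</GROUPS>"""

def left_context(root, target):
    """Count first-place words in groups whose second word is `target`."""
    counts = Counter()
    for g in root.iter("G"):
        ws = g.findall("W")
        if len(ws) > 1 and (ws[1].text or "").strip() == target:
            counts[(ws[0].text or "").strip()] += 1
    return counts

counts = left_context(ET.fromstring(GROUPS), "government")
for word, n in counts.most_common(4):  # plays the role of sort | uniq -c | sort -nr | head -4
    print(n, word)
```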
British International Corpus?
 We are more francophone than we think!
 Longest 'noun-phrase' in 10% of BNC is:
serai/NN1 mentionn&eacute;/NN1 dans/NN2 le/NN1 rapport/NN1-VB
qui/NN1 te/NN1 sera/NN1 remis/NN1
 No disgrace that the part-of-speech tagger gave up here.
 Tools can't be better than their input allows
XML Conclusions
 XML is the wave of the future
 Both Microsoft and Netscape have endorsed it
 Both Mozilla and IE5 have XML support built-in
 Very good free software is available
 Microsoft seem to be serious about standards compliance
 The W3C have made it clear that all subsequent W3C standards for
web distribution of information will be based on XML (cf. SMIL,
SVG and RDF)
 Issues
 XSLT efficiency - space and time.
To read
 Robin Cover’s SGML/XML Web Page
http://www.sil.org/sgml/sgml.html
 includes many pointers to SGML tutorials, overviews, publications
 The Whirlwind Guide to SGML & XML Tools and Vendors
http://www.infotek.no/sgmltool/guide.htm
 The XML FAQ
http://www.ucc.ie/xml/
 An excellent introduction to XML with pointers to useful resources for
newcomers to the standard
SGML/XML for Linguistics
 2.1 Programs for querying/modifying SGML
 an example
 what is needed
 available tools
 2.2 SGML marked-up corpora
 some existing resources
 2.3 Related developments
 SSTML
 SGML for X-waves
An example
 You want to build a system that performs a particular language engineering (LE) task
 You have a corpus of texts for
analysis (detecting textual regularities)
system training
system testing
 Use XML
Why?
How?
Why use XML?
 Use structure of text to fine-tune certain tools
 e.g. build tokeniser which specifically works for headlines of newspaper articles
 Annotate text with linguistic information
 e.g. use SGML tags to record the results of a tokeniser or part of speech tagger,
so that other tools can make use of this information
 Ensure that others (and you, two years from now :-) will have easy
access to your results
 No special-purpose parser required
 Simple retrieval and tabulation with existing free tools
 DTD provides some self-documentation
What is needed to use XML?
 XML is text
 Therefore:
 you can use any UNIX text manipulation program
e.g. grep, sed, awk, perl, etc
 XML is annotated text
 Therefore:
 Needed: versions of these tools that are XML-aware
What is needed to use XML?
 SGML reflects the hierarchical structure of a text
 You want to be able to tell tools to operate on a particular part of the SGML-annotated text, for example:
all WORD elements with attribute POS set to JJ
(i.e. all adjectives)
occurring within the first PARAGRAPH of the main BODY of an
ARTICLE; or
occurring within the HEADLINE of an ARTICLE
 Needed: a query language over XML structures
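As an illustration of what such a query looks like in practice, here is a sketch using the limited XPath subset in Python's standard `xml.etree.ElementTree`. It is not the LT XML query language. The sample article is hypothetical and assumes WORD elements are direct children of PARAGRAPH and HEADLINE.

```python
import xml.etree.ElementTree as ET

# Hypothetical article using the element and attribute names from the slide.
ARTICLE = """<ARTICLE>
<HEADLINE><WORD POS='JJ'>Big</WORD><WORD POS='NN'>storm</WORD></HEADLINE>
<BODY>
<PARAGRAPH><WORD POS='DT'>A</WORD><WORD POS='JJ'>severe</WORD><WORD POS='NN'>storm</WORD></PARAGRAPH>
<PARAGRAPH><WORD POS='JJ'>later</WORD><WORD POS='NN'>reports</WORD></PARAGRAPH>
</BODY>
</ARTICLE>"""

root = ET.fromstring(ARTICLE)  # root is the ARTICLE element

# Adjectives in the first PARAGRAPH of the BODY (positions are 1-based here).
in_first_para = [w.text for w in root.findall("./BODY/PARAGRAPH[1]/WORD[@POS='JJ']")]

# Adjectives in the HEADLINE.
in_headline = [w.text for w in root.findall("./HEADLINE/WORD[@POS='JJ']")]

print(in_first_para, in_headline)
```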
What is needed to use XML?
 XML-aware versions of text processing tools
 Query language
 In fact sggrep is just a simple wrapper round our query language.
Our query language and interface are designed to work with big files,
so they don’t read the whole document into memory unless
absolutely necessary. Most competitors do read the whole document in.
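The idea of not holding the whole document in memory can be sketched with Python's incremental parser. This shows the general streaming technique, not LT XML's implementation. The function name `stream_words` and the sample data are hypothetical, and the sketch assumes well-formed XML with quoted attribute values.

```python
import io
import xml.etree.ElementTree as ET

def stream_words(source, wanted_type):
    """Yield the text of W elements whose TYPE equals wanted_type,
    handling one S element at a time."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "S":
            for w in elem.iter("W"):
                if w.get("TYPE") == wanted_type:
                    yield (w.text or "").strip()
            elem.clear()  # discard the finished sentence's children to limit memory use

# Hypothetical in-memory sample; a real run would pass an open (e.g. gzip) file object.
SAMPLE = (b"<DOC><S><W TYPE='NN1'>part </W><W TYPE='AT0'>the </W></S>"
          b"<S><W TYPE='NN1'>meeting </W></S></DOC>")

nouns = list(stream_words(io.BytesIO(SAMPLE), "NN1"))
print(nouns)
```

The emptied S elements stay attached to their parent, so memory use still grows slowly with the number of sentences. A production tool would also detach them.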
XML tools: the LT XML library
 sggrep is part of an SGML toolset, called LT XML
 Developed by the Language Technology Group (Edinburgh)
 see: http://www.ltg.ed.ac.uk/software
 XML Library with
 Command-line tools
 Application Programming Interface (API)
 Available for WIN32, UN*X (and Mac)
 LT XML processes XML or nSGML
 nSGML now looks like a design study for XML
LT XML: Command-line tools
 sggrep - retrieving context sensitive data
 sgmltrans - transforming information
 sgrpg - more complex queries/reformatting
 textonly - strips out SGML markup
 sgcount - counts SGML tags
 knit - resolves XML-link links
 others
LT NSL: APIs
 LT NSL Application Program Interfaces:
procedure calls to help you write your own programs to process
nSGML
 C language API
 Python language API
C API for specialised access
 Write your own programs to read/write SGML/XML
 LT XML provides a rich API
 Both event and tree views of the document stream
 The distribution includes two heavily commented example programs.
Python language API for LT XML
Experimental integration of the LT XML API into Python (free
portable object-oriented scripting language)
Uses TK portable widget library for graphical UI
Reflects document stream as Python objects
Specialised XML editors
 Using the Python API we have written a number of specialised
processors:
 A WYSIWYG XML instance editor (XED)
 Several specialised annotation tools, e.g. PoS correctors, span coders
 Limited set of operations
 Preserve validity
 Hide structure from the user
Dataflow in LT NSL programs
[Figure: dataflow diagram. Labels, in reading order: file1.sgm, file2.sgm ...; mknsg; nSGML; parser; NSL stream; API; C(++) program; DDB file; nSGML; parser; NSL stream; API; C(++) program; unknit; file1.sgm ...]
The Edinburgh MapTask Corpus
 Contents
 128 task oriented spontaneous Scottish dialogues
 small corpus, but very dense and detailed SGML markup.
 Availability:
 Transcripts and digitized speech on 8 CD-ROMS:
http://www.elsnet.org/resources.html or from the LDC
 What is its markup like?
 (early) TEI-compliant
 Turns, pointers into the speech, identification of non-words.
 Word-level transcripts with timing markup available soon via the Internet
HCRC Maptask: an example
mknsg q1ec1.turns.sgm | sggrep -q ".*/W[TAG=at]"
<W START=2.9644 DUR=0.0725 UTT=1 TAG=at>a</W>
<W START=17.1410 DUR=0.1779 UTT=3 TAG=at>an</W>
<W START=18.6693 DUR=0.0791 UTT=3 TAG=at>the</W>
Parsed HCRC Maptask : an example
mknsg q1ec1.g.syn.sgm | sggrep -q ".*/NP" | sgmltrans -r mt.rule
<NP>we </NP>
<NP>a caravan park </NP>
<NP>we </NP>
<NP>we </NP>
<NP><NP><NP>an old mill </NP></NP><PP>on <NP>the right
hand side </NP></PP></NP>
<NP><NP>an old mill </NP><PP>on <NP>the right
</NP></PP></NP>
<NP>you </NP>
...
The MLCC corpus
 Contents
 Financial Newspaper texts: Dutch, English, French, German, Italian, Spanish
 Parallel texts:
The Journal of the European Commission, Written Questions (1993).
Corpus of European Parliamentary debates (1993-1994). (languages: Danish,
Dutch, English, French, German, Greek, Italian, Portuguese and Spanish ).
Markup
 Available
 from ELRA: http://www.inpg.fr/ELRA/catalog.html
The MLCC Corpus: an example
zcat exp.joc006.93.en.01.tei.gz |\
mknsg | \
sggrep -q ".*/DIV4[TYPE=Q]/HEAD"
<HEAD>Subject: The staffing in the Commission of the
European Communities</HEAD>
<HEAD>Subject: Supplies of military equipment to
Iraq</HEAD>
<HEAD>Subject: Commission plans to liberalize the postal
sector and to abolish the State monopoly</HEAD>
<HEAD>Subject: New industries in Attika</HEAD>
...
The same example for French
zcat exp.joc006.93.fr.01.tei.gz |\
mknsg | \
sggrep -q ".*/DIV4[TYPE=Q]/HEAD"
<HEAD>Objet: Organigramme de la Commission</HEAD>
<HEAD>Objet: Livraisons de matériel militaire à l'Irak</HEAD>
<HEAD>Objet: Projets de la Commission visant à libéraliser et à
abolir le monopole d'État dans le secteur des postes</HEAD>
<HEAD>Objet: Nouvelles industries en Attique</HEAD>
Corresponds to the English data: Suitable input for multilingual alignment experiments.
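A first step towards an alignment experiment could be to pair the extracted headings by position. The sketch below assumes, purely for illustration, that both language versions contain the same questions in the same order; real alignment would need to check this. The sample fragments and the function name `heads` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragments in the style of the MLCC examples above.
EN = """<DIV3>
<DIV4 TYPE='Q'><HEAD>Subject: Supplies of military equipment to Iraq</HEAD></DIV4>
<DIV4 TYPE='Q'><HEAD>Subject: New industries in Attika</HEAD></DIV4>
</DIV3>"""

FR = """<DIV3>
<DIV4 TYPE='Q'><HEAD>Objet: Livraisons de matériel militaire à l'Irak</HEAD></DIV4>
<DIV4 TYPE='Q'><HEAD>Objet: Nouvelles industries en Attique</HEAD></DIV4>
</DIV3>"""

def heads(xml_text):
    """Return the HEAD text of every DIV4 with TYPE='Q', in document order."""
    root = ET.fromstring(xml_text)
    return [h.text for h in root.findall(".//DIV4[@TYPE='Q']/HEAD")]

# Naive alignment: the n-th English heading is paired with the n-th French one.
pairs = list(zip(heads(EN), heads(FR)))
for en, fr in pairs:
    print(en, "|", fr)
```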
The Text Encoding Initiative (TEI)
 The TEI is a large and well documented DTD for textual markup.
 Use it if you can
 Now has an XML version
 Large and comprehensive hardcopy documentation available
 http://www.uic.edu/orgs/tei/
 DTDs available there as well
The Linguistic Data Consortium
 LDC - based in Pennsylvania USA
 Distributes text corpora
 See: http://www.ldc.upenn.edu/
 SGML Corpora include:
 The European Language Newspaper Text corpus
French (100 million words), German (90 million words) and Portuguese (15
million words). SGML.
 TIPSTER Information Retrieval Text Research Collection
3 gigabytes. SGML-like. Various English texts.
 United Nations Parallel Text Corpus (English, French, Spanish)
Fully-compliant SGML, 2.5 gigabytes
Tutorials
 XML: far too many to mention
 XSL:
 XSL specification
http://www.w3.org/Style/XSL
 Robin Cover's guide
http://www.oasis-open.org/cover/xsl.html
Resources
 LT-XML
http://www.ltg.ed.ac.uk/software/xml/index.html
 Full-text search
Witten, Moffat and Bell's Managing Gigabytes
http://www.cs.mu.OZ.AU/mg/
Corpus Tools
 Stuttgart Corpus Workbench
http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench
 Birmingham Qwick
http://www-clg.bham.ac.uk/QWICK/
 The MATE Workbench
http://www.cogsci.ed.ac.uk/~dmck/MateCode
NB. Prototype
Bibliography
 McKelvie, Brew, Thompson: Using SGML as a Basis for Data-Intensive Natural Language
Processing, Computers and the Humanities, 31(5): 367-388, 1997
 Sinclair, Mason, Ball, Barnbrook: Language Independent Statistical Software for Corpus
Exploration, Computers and the Humanities, 31(3): 229-255, 1998
 References on McKelvie's MATE workbench page
http://www.cogsci.ed.ac.uk/~dmck/MateCode
 Welty and Ide. Using the right tools: enhancing retrieval from marked-up documents.
Computers and the Humanities. 33(10):59-84. 1999
 Alignment graphs (and much else) Steven Bird's Linguistic Annotation Page
http://www.ldc.upenn.edu/annotation/.
Annotation topics
 Item annotations
 Words, Parts-of-speech, lemmas
 Simple annotations (one data stream)
 Boundaries,Spans,Partitions
 Complex annotations (multiple data streams)
 Sequences,Graphs,Overlaps
 Data models for annotation access
 Streams, Trees, Graphs, Databases
 Human factors in annotation
 Writing instructions, Measuring and improving reliability
XML topics
 Data formats
 HTML,XML and SGML
 Data Description Formalisms
 DTDs, XML Schema
 Style Languages
 XSLT
 Query Languages
 Annotation Graphs, XML Query, XQL, Quilt, LORE
Exercises
On average, these exercises should take about one hour to complete.
Try not to spend longer.
 Create an XML document
 Create a very simple memo
 Simple annotation
 Disambiguate parts-of-speech
 Compare results with those made by a partner.
 Style
 Create an XML DTD and an XSL style sheet for displaying POS-tagged text in
a browser.
Exercises
 More complex annotation
 syntactic annotation in Penn tree bank style.
 As before, compare results
 Search
 Exercise XML search tools on the newly annotated texts
Projects
These are open-ended projects hard enough to merit write-up in a
research paper. I’d willingly supervise these.
 Design a DTD and an XSL stylesheet for tree bank style syntactic
annotations. Implement a convenient interface allowing these
annotations to be edited over the Web.
 Investigate the corpus search tools provided at the LDC web-site.
What do they do? Could they and should they use XML/XSL
technology for the same purpose? (Easiest if your institution has an
LDC membership).
Projects (contd)
 Critical review of the Talkbank tools (www.talkbank.org)
 Design an XML query language that works well with very big
documents
 What sort of annotation structure for dialog? (cf. MATE)
 Design an optimizing compiler for XSLT (cf. Sun’s very recent XSL
compiler)
 Does XSLT support language modeling and statistical computation?
(If you put XSLT and Splus into a closed box and shake vigorously,
what emerges?)
In Summary
 Phew!
</xmlstuff>