Generic XML Programs
Chris Brew, Ohio State University
http://www.ling.ohio-state.edu/~cbrew
Questions
• What is needed for linguistic annotation?
• Tools and techniques
• Patterns of collaboration
• Things to avoid doing
• (How) can XML help?
• What is easy/hard to represent?
• Which standards are useful right now?
• (How) can we write reliable XML software?
• Patterns for typical annotation tasks
• Simple ways of exploiting DTDs
Generic XML Programs
Ohio Jan 2003
2
What is XML?
• It is a markup language used for annotating text
• It is concerned with logical structure
• to identify sections, titles, section headers, chapters, paragraphs, …
• It is not concerned with appearance
• you say 'this is a subtitle', not 'this is in bold, 14pt, centered'
• you say 'this is an example', not 'this is in verbatim, indented by 5pts, ragged right'
• Derived from SGML
• Designed (in some measure) by linguists and scholars.
What is XML?
• It is a means for describing and presenting data (usually on the
web).
• Most of the big computer and entertainment companies
believe XML is the solution.
• But exactly what was the problem?
• Presenting a parts database over the Internet
• Running an on-line job market
• Usually not corpus creation
• XML is mainstream. We’re the minority now.
Does XML live up to the hype?
• Of course not, but…
• The basic idea is simple labeled brackets. Lisp showed the power of this
idea in knowledge representation.
• Knowledge representation is inherently hard. Lisp made it easier to state
the problem, but it wasn’t itself the solution. XML won’t solve your
knowledge representation problems either, but it will let you state them
and explain them to your friends.
• Labeled brackets++
• Labeled brackets – but designed for information exchange, with
sophisticated input (and political pressures) from many interest groups.
Does XML live up to the hype?
• Yes. XML and allied standards (XSLT, XML Query) give us a
framework for data interchange.
[Diagram: data-interchange pipeline. Labels: Weather Reports, Weather Model, XML Data, XML Transformation, XSL, Browser, Day Planner, End Users.]
Linguistic annotation
• The real task of linguistic annotation is to add information to
(recordings of) naturally occurring communicative behaviour.
• Annotation is hard enough without having to worry about
buggy tools. Many annotators are data-centric and intolerant of
‘interface junk’.
• Many users are working linguists who simply want to find
examples and/or numbers to put into the next paper. Limited
tolerance for complex and inaccessible data formats.
Multi-level Annotation
• Corpora are often annotated at different levels of detail, by
heterogeneous teams, over a period of decades.
• Our view/understanding of the data will change.
• A useful strategy is a layered approach, defining a core set of
distinctions which is augmented by optional levels of detail
• EAGLES defines cross-language annotation standards this way.
• Optional levels may be for particular languages or particular
applications
Encoding for multiple levels
• XML has good capabilities.
• Links between documents
• Transformation tools
• Display facilities
• Well-defined properties
• We need agreed standards for
• Levels of annotation
• Some automatic (POS tagging)
• Some require humans (co-reference etc.)
• Checking procedures
• Annotation Criteria
Architectures for multiple levels
• Must support
• Range of annotation types
• Multiple versions, alternatives
• Different human languages
• Different media and modalities (video, audio, diagrams)
• Complex links between documents, parts of documents, external data
sources.
• XML has some support for all of these, especially given the use
of stand-off annotation.
Data models
• TIPSTER model (annotated spans)
• GATE (focus on re-use of processes rather than data)
• ATLAS Annotation graph formalism
• Networks of nodes, many of which are ordered by the time relation
before/after. Nodes may overlap, not overlap, or the annotation may be
indeterminate. Gives a lot of flexibility for doing e.g. phonology.
• Granularity is critical -- must be able to refer to smallest objects of
interest.
• Verbose, but annotation graph queries can be optimized, and/or indexed.
Tool architecture
• MULTEXT/LT XML Pioneered use of standoff annotation.
Adapted Unix tool architecture for ?ML
• GATE implements the Tipster architecture. Focus on tool
composition
• ATLAS similar, but based on annotation graphs
Tool architecture: consensus
• High-level (computer) language independent API with a three
layer architecture
• Low-level physical details (database, text-files, proprietary system)
• Logical view of the data (XML has several good candidates)
• High-level, SQL-like query interface, designed to be usable by
programmers and non-programmers alike.
• Visual interface for annotation
• (Optional) shortcuts for efficiency. Provide an event-based view
on hierarchical structure, until such time as query language is
efficient
Tool support
• Still need to embed tools into a system
• May wish to compress our data files.
• XML compresses very well (Liefke and Suciu)
• Searching compressed data is feasible
• May wish to index data files, as has been done with Stuttgart’s CQP
• Relies on data having fairly simple structure
Non-traditional data
• Documents with diagrams, including engineering drawings.
• Books which overlay text and illustration.
• Manuscripts where the physical details of calligraphy matter.
• Interlinked texts.
• Phonetic databases, word-lists
• Personal mailboxes and the like
No obvious time line in some of these. Challenges to indexing.
Software infrastructure
• XML+XSLT
One way of providing views on corpora
• Very good XML tools exist
• But they aren’t specialised for language
• Annotation graphs
• Advantage: customizable to our concerns
• Disadvantage: someone has to do the customization.
• Challenge: efficient, intuitive, expressive query languages
(Bird,Buneman and Tan)
Transcriber
Transcriber (Barras et al, ‘00)
• Designed for manual annotation of large speech files.
• Now has annotation graph engine at its heart.
• Uses XML to define communication between modules
• Motivation is graceful handling of multi-channel audio or video
• Make the end-product more customizable, by building in less of
the data description, and putting more into data files
• Retain simple, crisp user interface
Clan
Talkbank project
• Very diverse user group (ethologists, child language, CA)
• Serious effort to standardize tools
• Strongly committed to open source products
• http://www.talkbank.org
Semi-structured data
• The standard assumption in the database community is that
when we have a body of data we know its structure.
• This is simply not true on the Web. We typically have data which
have some structure but some irregularities.
Person: (name: "Chris Brew")
Person: (name: (first: "Jamie", last: "Brew"))
Person: (name: (first: "Matthew", last: "Brew", initial: "R"))
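A minimal sketch of how a program might cope with records like these, where the name field is sometimes a plain string and sometimes structured. The Python dictionaries and the helper full_name are assumptions made for illustration, not part of any tool discussed here.

```python
# Hypothetical illustration: the three Person records as nested Python data.
people = [
    {"name": "Chris Brew"},
    {"name": {"first": "Jamie", "last": "Brew"}},
    {"name": {"first": "Matthew", "last": "Brew", "initial": "R"}},
]


def full_name(person):
    """Return a display name whatever shape the 'name' field has."""
    name = person["name"]
    if isinstance(name, str):
        # Unstructured case: the name is a single string.
        return name
    # Structured case: some fields may be absent, so skip the missing ones.
    parts = [name.get("first"), name.get("initial"), name.get("last")]
    return " ".join(p for p in parts if p)


names = [full_name(p) for p in people]
print(names)
```

The point of the sketch is that code over semi-structured data has to test the shape of each record rather than rely on a fixed schema.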
XML is semi-structured data
• XML is allowed not to have detailed document type information
(though it may have).
• Some XML applications need to be generic, in the sense that
they are not limited to any particular DTD: browsers, editors,
tree diff, …
• Others make assumptions about the class of documents they
will process, but do not fully specify a DTD
• Others are tied to many details of specific DTDs
Types of annotation
• A taxonomy of different sorts of annotation which are
needed for various forms of linguistic data.
Item annotations
• Words, Parts-of-speech, lemmas
Each item receives one annotation on each of several levels, and
is related to others primarily by contiguity.
Sample tool: Stuttgart CQP
Sample query: word = "right" & pos != "j.*"
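The sample query can be mimicked over a toy in-memory token list. This is only a sketch of the idea (one value per annotation level per item, selection by conditions on those values); it is not how CQP is implemented, and the tokens, tags and the function select are made up for illustration.

```python
import re

# Each item carries one value per annotation level (here: word and pos).
# The tokens and tags are invented sample data, not taken from a corpus.
items = [
    {"word": "the", "pos": "at"},
    {"word": "right", "pos": "jj"},
    {"word": "answer", "pos": "nn"},
    {"word": "turn", "pos": "vb"},
    {"word": "right", "pos": "rb"},
]


def select(items, word, pos_not):
    """Positions where the word matches and the pos tag does not match pos_not."""
    return [
        i
        for i, item in enumerate(items)
        if item["word"] == word and not re.fullmatch(pos_not, item["pos"])
    ]


hits = select(items, "right", r"j.*")
print(hits)
```

Because items are related mainly by contiguity, a hit is just a position in the sequence; context can be recovered by looking at neighbouring positions.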
Simple annotations
• Boundaries,Spans,Partitions
• Boundaries
• Correspond to EMPTY XML elements
• Single click inserts boundary
• Resulting span partitions the time line of the input file.
• Discontiguous spans.
• Click and drag selects a span of the document
• Inserts start tag and end tag
• Attributes
• Subsequent click on start tag (or empty tag) brings up menu of
attributes
DTD relations
• Such annotation tools rely on relations between DTDs to give
meaningful user actions
• Ensure syntactic consistency of input, allowing annotator to
focus on meaning of annotation.
Stylesheets mediate relations
• Editors and their operations relate to the data which they
process.
• Visual presentations relate to the structure of the data which
they process
• Frequency counted wordlists are related to the original form of
the corpus.
• Many processes of analysis and summarisation are well
expressed as relations between document types.
• Stylesheets, in some form, are a natural choice to mediate such
relations, making them customizable.
XSL Transformations
[Diagram: content comes from one document, style from another. Label: Structure.]
barts_stylish_memo.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="memo.xsl"?>
<!DOCTYPE article [
<!ELEMENT article (title,(para|credit)+)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT para (#PCDATA)>
<!ELEMENT credit (#PCDATA)>
<!ENTITY ltg "Language Technology Group">
<!ENTITY author "Bart Simpson">
<!ENTITY techie "Lisa Simpson">
<!ENTITY parents "Marge and Homer">
<!ENTITY school "M&amp;M University">
]>
<article>
<title>Bart's Ph.D Thesis</title>
<para> by &author;: &school;</para>
<para>
This is the text of a very short article,
with very little internal structure.
Here is a reference to the &ltg; entity.
Please may I stop now?
</para>
<credit>
&techie; of &school; for slick XML authoring.
</credit>
<credit>
&parents; for unfailing support.
</credit>
</article>
memo.xsl
IE5 attempts to display the style in visual form, without any content.
Not standard, but very reasonable.
Source of memo.xsl
<?xml version="1.0" encoding="ISO-8859-1" ?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/TR/WD-xsl">
<xsl:template match="/">
<html>
<head><title><xsl:value-of select="//title"/></title>
</head>
<body BGCOLOR='#FFFFCC'>
<h1><xsl:value-of select="//title"/></h1>
<xsl:for-each select="//para">
<p><xsl:value-of/></p>
</xsl:for-each>
<hr/><p>
<i> Thanks to: </i><br/>
<xsl:for-each select="//credit">
&#160; <xsl:value-of/><br/>
</xsl:for-each><hr/>
</p>
</body>
</html></xsl:template>
</xsl:stylesheet>
Fill in the blanks
<?xml version="1.0" encoding="ISO-8859-1" ?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/TR/WD-xsl">
<xsl:template match="/">
<html>
<head><title>•••</title>
</head>
<body BGCOLOR='#FFFFCC'>
<h1>•••</h1>
<xsl:for-each select="//para">
<p>•••</p>
</xsl:for-each>
<hr/><p>
<i> Thanks to: </i><br/>
<xsl:for-each select="//credit">
&#160; ••• <br/>
</xsl:for-each> <hr/>
</p>
</body>
</html>
</xsl:template>
</xsl:stylesheet>
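For readers more at home in a general-purpose language, the same fill-in-the-blanks idea can be sketched in Python with the standard library. This is an analogy under stated assumptions, not an XSLT processor: the cut-down memo text and the function fill_in are invented for the example.

```python
import xml.etree.ElementTree as ET
from html import escape

# Cut-down memo in the shape of the article element (sample data).
MEMO = """<article>
<title>Bart's Ph.D Thesis</title>
<para>by Bart Simpson</para>
<para>This is the text of a very short article.</para>
<credit>Lisa Simpson for slick XML authoring.</credit>
</article>"""


def fill_in(xml_text):
    """Pull pieces out of the document and drop them into an HTML template."""
    doc = ET.fromstring(xml_text)
    title = escape(doc.findtext("title"), quote=False)
    paras = "".join(
        f"<p>{escape(p.text, quote=False)}</p>" for p in doc.iter("para")
    )
    credits = "".join(
        f"{escape(c.text, quote=False)}<br/>" for c in doc.iter("credit")
    )
    return (
        f"<html><head><title>{title}</title></head><body>"
        f"<h1>{title}</h1>{paras}<hr/><p><i>Thanks to:</i><br/>"
        f"{credits}</p><hr/></body></html>"
    )


page = fill_in(MEMO)
print(page)
```

Each blank in the template corresponds to one query against the source document, which is the same division of labour as in the stylesheet.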
XSLT gives you tools for sending part of a document to one place and
part to another. The simplest use is pure fill in the blanks. Anybody
who uses HTML, PHP and so on will be comfortable with this use of XSLT.
If necessary, it is a Turing-complete programming language. It gives
you the rope if you need it.
XSLT standards
• Microsoft’s implementation in IE5 is now standard (used not to
be, they put it out well before the standard existed).
• James Clark’s xt and Michael Kay’s Saxon are complete and
highly conformant. xt is highly optimized, Saxon simpler and
easier to work with
• W3C eats its own lunch. The HTML versions of the XML
standard are generated with XSL
• In practice, current best options are
• Static data: pre-generate HTML from XML at publication time
• Dynamic data: use Saxon or xt as Java Servlets
Generating HTML
HTML is generated by running Saxon on poem.xml and poem.xsl:
saxon poem.xml poem.xsl > poem.html
Using IE5 to view poem.xml
<poem>
<author>Rupert Brooke</author>
<date>1912</date>
<title>Song</title>
<stanza>
<line>And suddenly the wind comes soft,</line>
<line>And Spring is here again;</line>
<line>And the hawthorn quickens with buds of green</line>
<line>And my heart with buds of pain.</line>
</stanza>
<stanza>
<line>My heart all Winter lay so numb,</line>
<line>The earth so dead and frore,</line>
<line>That I never thought the Spring would come again</line>
<line>Or my heart wake any more.</line>
</stanza>
<stanza>
<line>But Winter's broken and earth has woken,</line>
<line>And the small birds cry again;</line>
<line>And the hawthorn hedge puts forth its buds,</line>
<line>And my heart puts forth its pain.</line>
</stanza>
</poem>
poem.xsl
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">
<xsl:template match="poem">
<html>
<head>
<title><xsl:value-of select="title"/></title>
</head>
<body>
<xsl:apply-templates select="title"/>
<xsl:apply-templates select="author"/>
<xsl:apply-templates select="stanza"/>
<xsl:apply-templates select="date"/>
</body>
</html>
</xsl:template>
<xsl:template match="title">
<div align="center"><h1><xsl:value-of select="."/></h1></div>
</xsl:template>
<xsl:template match="author">
<div align="center"><h2>By <xsl:value-of select="."/></h2></div>
</xsl:template>
<xsl:template match="stanza">
<p><xsl:apply-templates select="line"/></p>
</xsl:template>
<xsl:template match="line">
<xsl:if test="position() mod 2 = 0">&#160;&#160;</xsl:if>
<xsl:value-of select="."/><br/>
</xsl:template>
<xsl:template match="date">
<p><i><xsl:value-of select="."/></i></p>
</xsl:template>
</xsl:stylesheet>
Computation model of XSL is structural recursion, allowing
considerable flexibility in transforming documents.
Implemented via queries.
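Structural recursion can be illustrated outside XSLT as well. The sketch below is a Python analogy, not a description of how any XSLT processor works: one case per element type, each case recursing into the children it selects. The shortened poem and the function render are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Shortened version of the poem, used as sample input.
POEM = """<poem><author>Rupert Brooke</author><date>1912</date>
<title>Song</title>
<stanza><line>And suddenly the wind comes soft,</line>
<line>And Spring is here again;</line></stanza></poem>"""


def render(node, position=1):
    """One case per element type, recursing into selected children."""

    def kids(tag):
        # Apply render to each selected child, numbering them from 1.
        return "".join(
            render(k, i) for i, k in enumerate(node.findall(tag), start=1)
        )

    if node.tag == "poem":
        return kids("title") + kids("author") + kids("stanza") + kids("date")
    if node.tag == "title":
        return f"<h1>{node.text}</h1>"
    if node.tag == "author":
        return f"<h2>By {node.text}</h2>"
    if node.tag == "stanza":
        return f"<p>{kids('line')}</p>"
    if node.tag == "line":
        # Indent every second line, as the stylesheet does.
        indent = "&#160;&#160;" if position % 2 == 0 else ""
        return f"{indent}{node.text}<br/>"
    if node.tag == "date":
        return f"<p><i>{node.text}</i></p>"
    return ""


html = render(ET.fromstring(POEM))
print(html)
```

The order of the output is fixed by the order of the recursive calls in the poem case, not by the order of elements in the source, which is why the date comes out last.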
XML tools for Unix
• Simple equivalents of UN*X tools are available (for free) to do
simple SGML processing
• We'll introduce them using examples, and give details at the end
sggrep
• LT XML program for searching for structure and text in XML
files
• sggrep -q query -s subquery -t regexp in.xml
• Options
• -d DTD: Specify a DTD explicitly. File is an XML file
• -r : Attribute values in queries are regular expressions.
• -v : Invert sense of sub-query+regexp.
• Other options
LT XML query language
• Two-dimensional regular expressions
• First dimension is over tree paths
• Based on file path analogy:
DIV/PARA/W matches Ws inside PARAs inside (toplevel) DIVs
• Second dimension is regular expressions over text content of leaf nodes
• Select Ss containing Ws whose text is it's or its
-q S -s './W' -t "^(it's|its)$"
• Full UTZOO (Henry Spencer) regular expression support
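The two dimensions can be sketched in Python: one regular expression over the path from the root, another over the text of a sub-element. This is a rough analogy only. It does not reproduce LT XML's query syntax or semantics, and the sample document and the functions walk and grep are invented for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Tiny sample document; the element names follow the examples in the text.
DOC = ET.fromstring(
    "<DIV><PARA><S><W>its</W><W>tail</W></S>"
    "<S><W>the</W><W>exits</W></S></PARA></DIV>"
)


def walk(node, path=""):
    """Yield (path, element) pairs, with paths written like file paths."""
    here = f"{path}/{node.tag}"
    yield here, node
    for child in node:
        yield from walk(child, here)


def grep(root, path_re, sub_tag, text_re):
    """Elements whose path matches path_re and which have a direct
    sub_tag child whose text matches text_re."""
    return [
        el
        for path, el in walk(root)
        if re.fullmatch(path_re, path)
        and any(re.search(text_re, c.text or "") for c in el.findall(sub_tag))
    ]


hits = grep(DOC, r".*/S", "W", r"^(it's|its)$")
print([[w.text for w in s] for s in hits])
```

The anchors in the text pattern matter: without them, "exits" would also match.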
sggrep: examples of use
• sggrep -q ".*/P/S" -s "./W[TAG=NN]"
• find all S elements occurring inside a P element at any depth which
immediately contain a W element with attribute TAG="NN".
• sggrep -q ".*/P/S/W[TAG=NN]"
• find those W elements themselves
• sggrep -q ".*/S/W[0]" -t "^[a-z]"
• find all sentence initial words starting with a lower case letter.
LT XML and annotation
• Based on the Unix pipeline idea
• Reads from standard input (usually)
• Writes to standard output (usually)
• Each tool does a single, fairly simple task
• Controlled by command line flags
• In standard Unix filters, unit of transaction is character or line
• In LT XML, unit of transaction is
• Either: An XML Bit (start tag, end tag, content fragment)
• Or: An XML Item (balanced start tag, end tag, contents)
• Or: mixed bits and items
xmlnorm
• xmlnorm is very simple to write using the LT XML API. Read
all bits from file and print them back out in order.
• Many options for character set conversion and such.
// open input and output files
// optionally set output encoding
while ((bit = GetNextBit(in)) != NULL) {
    PrintBit(bit, out);
}
xmlchange
– xmlchange is just like xmlnorm, but with an additional line
which changes the content of some of the bits before printing
// open input and output files
// optionally set output encoding
while ((bit = GetNextBit(in)) != NULL) {
    PerhapsChangeBit(&bit);
    PrintBit(bit, out);
}
xmlitems
– xmlitems reads bits until an interesting start bit turns up, then
reads the whole Item into memory before printing it.
// open input and output files
// optionally set output encoding
while ((bit = GetNextBit(in)) != NULL) {
    if (bit.type == NSL_start && interesting(bit)) {
        item = ItemParse(bit, in);
        ProcessItem(&item);
        PrintItem(item, out);
    }
}
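The item pattern has a close analogue in Python's standard library, sketched here for comparison. It uses xml.etree.ElementTree.iterparse rather than the LT XML API, so the parser events only stand in for bits; the sample data and the function items are assumptions for the example.

```python
import io
import xml.etree.ElementTree as ET

# Sample input; in practice this would be a file or standard input.
DATA = b"<TEXT><S><W>Yes</W><W>that</W></S><S><W>was</W></S></TEXT>"


def items(stream, interesting):
    """Yield each complete element (an 'Item') whose tag is interesting."""
    for event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == interesting:
            yield elem
            # Drop the finished item's contents to keep memory use low.
            elem.clear()


sentences = [
    [w.text for w in s.findall("W")] for s in items(io.BytesIO(DATA), "S")
]
print(sentences)
```

Only one sentence is held in full at any time, which is the property that makes the item pattern usable on large corpora.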
The Query Interface
• Query language packages up ItemParse mechanism returning
only the items selected by the query
• Programs can receive the query as an argument or read it from
some other file
// open input and output files
// optionally set output encoding
qu = ParseQuery(quString);
while ((item = GetNextQueryItem(in, qu, out)) != NULL) {
    PerhapsChangeItem(&item);
    PrintItem(item, out);
}
Queries or Bits?
• If you use the query interface, well-formed output is nearly
automatic; with bits you have to be sure to balance every start
tag with an end tag.
• If the items are huge, reading them into memory may be
unacceptable. In that case, use the ItemParse pattern above
Other LT XML facilities
• C structures representing document types, attribute
information, element types
• Unique element names, so that (in C) you can say
if(element.type == target) {…}
rather than
if(strcmp(element.type,target) == 0) {…}
• Support for memory management.
• Example programs
• 100+ pages of detailed documentation
Python interface
• Closely mirrors C interface, but safer and much easier to work
with. Trades ultimate efficiency for convenience.
• Exposes DTD information, making possible applications like
Thompson’s xed and xsv
• Meshes well with Python’s new Unicode support and
excellent GUI facilities.
When not to use LT XML
• If input and output DTDs are very different, the LT XML filter
model is unnatural. For that you want XSLT or similar.
• LT XML is stream oriented, so not especially natural for random
access (if you need that, consider using a database).
• If someone else is providing your tools and they already do a
good job.
What is XML Link
• Just as XML itself simplified SGML while extending HTML
• XML-link simplifies HyTime while extending HTML
• XML-link provides mechanisms for
Describing links with link elements
Identifying links and link ends by type and role
Locating link ends with a powerful locator syntax
Incorporating link elements in-line or out-of-line
Specifying default behaviours
Simple XML-link example
• This is a simple reconstruction of HTML's A element, specifying a
two-ended link in-line with one implicit and one explicit locator
<refr XML-LINK="SIMPLE" HREF="http://www.w3.org/">The W3C</refr>
• On the next slide is a richer example, specifying a two-ended
link out-of-line with two explicit locators
More complex link example
<connect XML-LINK='EXTENDED'>
<dutch XML-LINK='LOCATOR'
HREF='http://www.klm.nl/About/Nederlands/default.htm'/>
<english XML-LINK='LOCATOR'
HREF='http://www.klm.nl/About/default.htm'/>
This is a good example of hand-crafted home-page
translation pairing.
</connect>
Using XML/XSL
• Ian Hughson and Henry Thompson built a prototype MT
system using XSLT as the transformation language.
• MUC systems often benefit from being able to use queries to
plug together subtly different processing pipelines for different
parts of the document.
• Annotation pipelines can still benefit from this kind of process
description.
• Links can be richer than in Web browsers, allowing structures
more complex than trees.
Conclusions
• The LT XML strategy is to write tools which are parameterisable
by user queries. These are similar in spirit to Unix sed and
grep
• When transformations get complex, this gets unwieldy, but
essentially the same idea is present in XSLT, which packages up
collections of queries into programs.
• Long term, we might want something cleaner than XSLT.
In Summary
• Phew!
</xmlstuff>
TreeStyle (Brew, 1999)
• Corpus Access - after Cutting et al.
• Tree analysis - analysis policy
• Node styling - style sheet
• Visualization - non-portable
Generic function dramatises trade off between demands
of data and capabilities of medium.
(DISPLAY-OBJECT <tree-node> <visual>)
Corpus access
• Interface: (item <corpus> <tree-index>)
• Just like Common Lisp elt. Also map-corpus and similar.
• Implementation
• In-memory-corpus simple adapter for Common Lisp sequence
types.
• wsj-corpus simple disk-based indices to the files of the Wall Street
Journal part of Penn Treebank. The indexing is crucial to usability.
Analysis
TreeStyle transforms corpus trees into
CLOS objects which make it easy to
traverse the parent/child hierarchy. Also
give access to node labels.
This makes it easy to state simple
heuristics over local trees, allowing
encoding of (e.g.) Collins’ heuristics for
finding heads and complements in the
Penn Treebank.
This is labelled data, the information is
present in the treebank from the outset.
We just make it vivid.
1 ("s"
2   ("np"
3     "John")
4   ("vp"
5     ("v"
6       "detests")
7     ("np"
8       "anchovies")))
1 -> (:head (:headword "detests"))
2 -> (:complement (:headword "John"))
3 -> (:terminal)
4 -> (:head (:headword "detests"))
5 -> (:head (:headword "detests"))
6 -> (:terminal)
7 -> (:complement (:headword "anchovies"))
8 -> (:terminal)
Information is added to an eq hash table
indexed by tree nodes.
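The flavour of such an analysis policy can be sketched in Python. This is a toy under loud assumptions: the head table below is invented for this one example and is far simpler than Collins' heuristics, trees are plain nested tuples rather than CLOS objects, and results are collected in preorder (matching the node numbers above) instead of in an eq hash table.

```python
# Example tree as nested tuples: (label, child, child, ...); words are strings.
TREE = ("s", ("np", "John"), ("vp", ("v", "detests"), ("np", "anchovies")))

# Toy head table (an assumption for this sketch, not Collins' rules):
# which child label counts as the head of each parent label.
HEAD_CHILD = {"s": "vp", "vp": "v"}


def head_child(node):
    """Pick the head child of a non-terminal, defaulting to the first child."""
    label, *kids = node
    for kid in kids:
        if not isinstance(kid, str) and kid[0] == HEAD_CHILD.get(label):
            return kid
    return kids[0]


def headword(node):
    """Follow head children down to a word."""
    return node if isinstance(node, str) else headword(head_child(node))


def analyse(node, role="head", out=None):
    """Annotate every node, in preorder, without modifying the tree."""
    out = [] if out is None else out
    if isinstance(node, str):
        out.append(("terminal",))
        return out
    out.append((role, headword(node)))
    head = head_child(node)
    for kid in node[1:]:
        analyse(kid, "head" if kid is head else "complement", out)
    return out


analysis = analyse(TREE)
for number, info in enumerate(analysis, start=1):
    print(number, "->", info)
```

The tree itself is never changed; the annotations live beside it, which is what keeps the analysis optional and non-destructive.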
Styling
Style policy is stated in the same
framework as analysis policy.
One defines an analysis policy by
specializing the generic function
policy-for-node to a given corpus
type.
One defines a style policy by specializing
the generic function style-for-node
to a given corpus type.
Analysis and style are non-destructive,
optional and separate from visualization
proper. The analysis policy and the style
policy together play a role similar to that
of the stylesheet in XSL or CSS.
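Specializing a generic function on the corpus type has a rough Python counterpart in functools.singledispatch. The sketch below is an analogy, not TreeStyle's code: the class names are hypothetical stand-ins for the corpus types, and the colour choices simply echo the example styling.

```python
from functools import singledispatch


class Corpus:
    """Hypothetical base class for corpus types."""


class InMemoryCorpus(Corpus):
    """Stand-in for a corpus held in memory."""


class WSJCorpus(Corpus):
    """Stand-in for a disk-based Wall Street Journal corpus."""


@singledispatch
def style_for_node(corpus, analysis):
    """Default style policy: no styling at all."""
    return ()


@style_for_node.register(WSJCorpus)
def _(corpus, analysis):
    """Style policy specialized to one corpus type."""
    if analysis[0] == "terminal":
        return ("italic", "grey")
    colours = {"head": "red", "complement": "purple", "modifier": "blue"}
    return (colours[analysis[0]],)


print(style_for_node(WSJCorpus(), ("head", "detests")))
print(style_for_node(InMemoryCorpus(), ("head", "detests")))
```

Dispatch here is on the first argument only, which is narrower than CLOS multiple dispatch but enough to show a policy being swapped per corpus type.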
1 ("s"
2   ("np"
3     "John")
4   ("vp"
5     ("v"
6       "detests")
7     ("np"
8       "anchovies")))
1 -> (:red :head (:headword "detests"))
2 -> (:purple :complement (:headword "John"))
3 -> (:italic :grey :terminal)
4 -> (:red :head (:headword "detests"))
5 -> (:red :head (:headword "detests"))
6 -> (:italic :grey :terminal)
7 -> (:purple :complement (:headword "anchovies"))
8 -> (:italic :grey :terminal)
Case study
Apply Collins’ heuristics
to the Penn Treebank.
Colour really helps bring
out the information
encoded by Penn’s
annotators.
•Red heads
•Purple complements
•Blue modifiers
Co-ordination and
gapping are the tricky
problems, as always.
Case study
Collins gives a generative
statistical model for Penn
trees. His parser uses the
treebank as grammar,
and performs very well.
We’d like the same thing
in a categorial framework.
What is involved?
Thought: effectively
ignore modifiers (in
categorial terms, make
them X/X or X\X). How
will this pan out?
Case study
Not categorial enough,
needs to be binary
branching.
Left-factor, as in
Charniak, Goldwater and
Johnson (leaves
probabilities unchanged).
Is this reasonable? Not
always. Don’t like “But
funds” as a constituent.
Uncategorial.
Want to try “head
factoring”. Work in
progress.
The British National Corpus
• 2 gigabytes of contemporary English
• Marked up to word level with part of speech tags
• Extract data:
• zcat medium.xml.gz | sggrep -q ".*/W[TYPE=NN1]"
• gives all singular nouns in a part of the corpus, e.g.
<W TYPE=NN1>part </W>
<W TYPE=NN1>meeting </W>
<W TYPE=NN1>while </W>
<W TYPE=NN1>funeral</W>
<W TYPE=NN1>loss</W>
<W TYPE=NN1>meeting</W>
<W TYPE=NN1>time </W>
The BNC: an example (2)
zcat medium.xml.gz | \
sggrep -q ".*/S" -s "./W[TYPE!=AJ0]" \
-t "^[Rr]ight$"
gives sentences containing non-adjectival uses of the word 'right', e.g.
<S N=092>
<W TYPE=ITJ>Yes </W>
<W TYPE=DT0>that </W>
<W TYPE=VBD>was</W>
<C TYPE=PUN>, </C>
<W TYPE=DT0>that </W>
<W TYPE=VBD>was </W>
<W TYPE=AV0>right</W>
. . .
</S>
The BNC: an example (3)
Format the output into a more readable form:
zcat medium.xml.gz | \
sggrep -q ".*/S" -s "./W[TYPE!=AJ0]" -t "^[Rr]ight$" | \
sgmltrans -r test.rule
Yes/ITJ that/DT0 was/VBD , that/DT0 was/VBD right/AV0
erm/UNC there/EX0 was/VBD a/AT0 limit/NN1 to/PRP how/AVQ
much/AV0 you/PNP could/VM0 spend/VVI aswell/AV0 was/VBD
n't/XX0 there/EX0 ?
He/PNP goes/VVZ into/PRP a/AT0 restaurant/NN1 and/CJC
he/PNP says/VVZ oh/ITJ the/AT0 waiter/NN1 erm/UNC let/VVB
me/PNP see/VVI the/AT0 menu/NN1 and/CJC he/PNP looks/VVZ
at/PRP the/AT0 menu/NN1 and/CJC said/VVD right/AV0 ,
he/PNP said/VVD .
An extended example: Noun Compounds
• Noun compounds in British National Corpus
• What is a noun compound?
• Too hard.
• Simple approximation? Sequence of tags matching NN. . .
• BNC uses a version of the Brown tags, where NN0, NN1, . . . are all
variants of Noun
• A pipeline of SGML-aware tools will do the job
• sgrpg | sggrep [ | . . .]
• Use sgrpg to wrap such tag sequences in <G> ... </G>.
• Use sggrep to filter the output.
• Use further tools to tabulate, format, etc.
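The grouping step can be sketched in Python to make the approximation concrete. This is not sgrpg: the token list is sample data in the style of the BNC examples, the function noun_groups is hypothetical, and requiring at least two tokens per group is an assumption of this sketch.

```python
import re
from itertools import groupby

# (word, tag) pairs; sample data in the style of the BNC examples.
TOKENS = [
    ("Local", "AJ0-NN1"),
    ("government", "NN0"),
    ("districts", "NN2"),
    ("are", "VBB"),
    ("the", "AT0"),
    ("meeting", "NN1"),
    ("time", "NN1"),
]


def noun_groups(tokens, pattern=r"NN\d"):
    """Maximal runs of two or more tokens whose tag contains a noun tag.

    The pattern is unanchored, so ambiguous tags such as AJ0-NN1 also match.
    """
    groups = []
    for is_noun, run in groupby(
        tokens, key=lambda tok: re.search(pattern, tok[1]) is not None
    ):
        run = list(run)
        if is_noun and len(run) > 1:
            groups.append(run)
    return groups


compounds = [[word for word, _ in group] for group in noun_groups(TOKENS)]
print(compounds)
```

Each run corresponds to what the pipeline would wrap in a G element, ready for later filtering and counting.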
An extended example: The pipe
• Step by step through the pipe
• sgrpg -r -f np-pat.xml | ...
• Group the sequences
• -r use regexp matching
• -f script file
• ... sggrep -d groups.xml -q '.*/G'
• extract the sequences
• -d DTD
• -q query (selects groups)
• Result:
• <G><W TYPE='AJ0-NN1'>Local</W>
<W TYPE='NN0'>government</W>
<W TYPE='NN2'>districts</W></G>
...
An extended example: filtering
• Find all words with unresolved tags, e.g. AJ0-NN1
• use regexp matching, which is unanchored by default
• ...| sggrep -r -q './W[TYPE="-"]' | ...
• Find all words in second position
• ...| sggrep -q './W[1]' | ...
• Find all words with unresolved tags in second position
• ...| sggrep -r -q './W[1 TYPE="-"]' | ...
An extended example: counting
• Count all words in second position
• ...| sggrep -q './W[1]' | sgcount
• Count all words with unresolved tags in second position
• ...| sggrep -r -q './W[1 TYPE="-"]' | sgcount
• Results:
• all 2nd place W: 23283
• 2nd place W with unresolved tag: 5066
An extended example: long compounds
• Long compounds including 'government'
• Use subquery to select <G>...</G>s with 'government':
• sggrep -q G -s './W' -t government
• Next step, discard short ones:
• sggrep -q G -s './W[2]'
• Then sgmltrans for neater format
• Results:
• official/AJ0-NN1 government/NN0 report/NN1-VB
Local/AJ0-NN1 government/NN0 districts/NN2
...
An extended example: left context
• select for 'government' in 2nd place
• . . . | sggrep -q G -s './W[1]' -t government |
• pull words from first place
• sggrep -q './W[0]' |
• remove markup
• textonly |
• use UN*X for the rest
• sort | uniq -c | sort -nr | head -4
6 French
5 German
4 interim
4 Chinese
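The final counting stage has a short Python equivalent, shown here as a sketch with made-up input. The function top is hypothetical; ties may come out in a different order than sort -nr would give.

```python
from collections import Counter


def top(words, n=4):
    """Most frequent words first, like sort | uniq -c | sort -nr | head."""
    return Counter(words).most_common(n)


# Invented sample input, one word per extracted first-position token.
sample = ["French", "German", "French", "interim", "French", "German"]
ranking = top(sample, 2)
print(ranking)
```

Counter does the work of sort and uniq -c in one pass, and most_common replaces the numeric sort and head.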
British International Corpus?
• We are more francophone than we think!
• Longest 'noun-phrase' in 10% of BNC is:
• serai/NN1 mentionn&eacute;/NN1 dans/NN2 le/NN1
rapport/NN1-VB qui/NN1 te/NN1 sera/NN1 remis/NN1
• No disgrace that the part-of-speech tagger gave up here.
• Tools can't be better than their input allows
Human factors in annotation
• Writing instructions
• Start by writing a DTD.
• Define an explicit step-by-step process for making decisions.
• You’ll still need a body of case-law to ensure reliability. So you’ll need
to revise as you go.
• If you have a group of five people, use a star model in which each of
four communicate with a central coordinator, revising the draft
instructions as you go.
• Measuring reliability
• Use standard measures of inter-rater reliability from social psychology,
including kappa statistic
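As a concrete instance, Cohen's kappa for two annotators can be computed in a few lines. This sketch assumes the two-annotator form of the statistic and uses invented labels; other variants exist for more annotators, and the function is undefined when chance agreement is exactly 1.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(a)
    # Observed agreement: proportion of items given the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: from each annotator's own label distribution.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    return (observed - expected) / (1 - expected)


ann1 = ["NN", "NN", "VB", "VB"]
ann2 = ["NN", "VB", "VB", "VB"]
kappa = cohen_kappa(ann1, ann2)
print(kappa)
```

Here the annotators agree on three items out of four, but half of that agreement is expected by chance, so kappa is well below the raw agreement rate.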
Can we predict queries?
• Corpora don’t change much.
• Easy searches can be handled with flat indices into tagged data.
• Complicated searches are rare enough that it might be OK to do them by
linear search of the corpora
• Queries do change
• Things like the historian’s application in Welty and Ide warrant expressive
search, but won’t cause a revolution.
• A single paper correlating gaze, gesture and audio might spawn many
imitators. Expectations from corpus tools might shift radically.
Tutorials
• XML: far too many to mention
• The XML revolution: technologies for the future Web
• http://www.brics.dk/~amoeller/XML
• XSL:
• XSL specification
• http://www.w3.org/Style/XSL
• Robin Cover's guide
• http://www.oasis-open.org/cover/xsl.html
Resources
• LT-XML
• http://www.ltg.ed.ac.uk/software/xml/index.html
• Full-text search
• Witten, Moffat and Bell's Managing Gigabytes
• http://www.cs.mu.OZ.AU/mg/
Corpus Tools
• Stuttgart Corpus Workbench
• http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench
• Birmingham Qwick
• http://www-clg.bham.ac.uk/QWICK/
• The MATE Workbench
• http://www.cogsci.ed.ac.uk/~dmck/MateCode
• NB. Prototype
The Linguistic Data Consortium
• LDC - based in Pennsylvania USA
• Distributes text corpora
• See: http://www.ldc.upenn.edu/
• SGML Corpora include:
• The European Language Newspaper Text corpus
• French (100 million words), German (90 million words) and
Portuguese (15 million words). SGML.
• TIPSTER Information Retrieval Text Research Collection
• 3 gigabytes. SGML-like. Various English texts.
• United Nations Parallel Text Corpus (English, French, Spanish)
• Fully-compliant SGML, 2.5 gigabytes
Bibliography
• McKelvie, Brew, Thompson: Using SGML as a Basis for Data-Intensive Natural
Language Processing. Computers and the Humanities, 31(5): 367-388, 1997
• Sinclair, Mason, Ball, Barnbrook: Language Independent Statistical Software for
Corpus Exploration. Computers and the Humanities, 31(3): 229-255, 1998
• References on McKelvie's MATE workbench page:
http://www.cogsci.ed.ac.uk/~dmck/MateCode
• Welty and Ide: Using the right tools: enhancing retrieval from marked-up documents.
Computers and the Humanities, 33(10): 59-84, 1999
• Alignment graphs (and much else): Steven Bird's Linguistic Annotation Page,
http://www.ldc.upenn.edu/annotation/
“Problems” with XML
• Uses complex and weird terminology
• Yes. But so does the ANSI C standard. So do most fields…
• Not convenient for specifying graphs (as opposed to trees)
• This is a point about graphs, not XML. Unification grammar notations
get unwieldy too.
• Not as convenient as plain text
• True for some tasks, but the extra structure of XML lets you do things that
you wouldn’t even try with plain text.
XML Conclusions
• XML is the wave of the future
• Both Microsoft and Netscape have endorsed it
• Both Mozilla and IE5 have XML support built-in
• Very good free software is available
• Microsoft seem to be serious about standard compliance
• The W3C have made it clear that all subsequent W3C standards
for web distribution of information will be based on XML (c.f.
SMIL, SVG and RDF)
• Issues
• XSLT efficiency - space and time.