Advanced Course: Treebank-Based Acquisition
of LFG, HPSG and CCG Resources
Josef van Genabith, Dublin City University
Yusuke Miyao, University of Tokyo
Julia Hockenmaier, University of Pennsylvania
and University of Edinburgh
ESSLLI 2006
18th European Summer School in Logic, Language and
Information, University of Málaga, July – August 2006
ESSLLI 2006
Treebank-Based Acquisition of
LFG, HPSG and CCG Resources
1
Lecturer Contact Information
• Josef van Genabith, National Centre for Language Technology (NCLT), School of Computing, Dublin City University, Dublin 9, Ireland, [email protected]
• Julia Hockenmaier, [email protected]
• Yusuke Miyao, Department of Computer Science, The University of Tokyo, Hongo 7-3-1, Bunkyo-ku, Tokyo 113-0033, Japan, [email protected]
Motivation
• What do grammars do?
  – Grammars define languages as sets of strings
  – Grammars define which strings are grammatical and which are not
  – Grammars tell us about the syntactic structure of (associated with) strings
• "Shallow" vs. "deep" grammars:
  – Shallow grammars do all of the above
  – Deep grammars (in addition) relate text to information/meaning representations
  – Information: predicate-argument-adjunct structure, deep dependency relations, logical forms, …
• In natural languages, linguistic material is not always interpreted locally where you encounter it: long-distance dependencies (LDDs)
• Resolution of LDDs is crucial for constructing accurate and complete information/meaning representations
• Deep grammars := (text <-> meaning) + (LDD resolution)
Motivation
• Unification (constraint-based) grammar formalisms (FUG, GPSG, PATR-II, …):
  – Lexical-Functional Grammar (LFG)
  – Head-Driven Phrase Structure Grammar (HPSG)
  – Combinatory Categorial Grammar (CCG)
  – Tree-Adjoining Grammar (TAG)
• Traditionally, deep constraint-based grammars are hand-crafted: LFG ParGram, HPSG LinGO ERG, Core Language Engine (CLE), Alvey Tools, RASP, ALPINO, …
• Wide-coverage, deep unification (constraint-based) grammar development is knowledge-intensive and expensive!
• It is very hard to scale hand-crafted grammars to unrestricted text!
• English XLE (Riezler et al. 2002); German XLE (Forst and Rohrer 2006); Japanese XLE (Masuichi and Okuma 2003); RASP (Carroll and Briscoe 2002); ALPINO (Bouma, van Noord and Malouf 2000)
Motivation
• Instance of the "knowledge acquisition bottleneck" familiar from classical "rationalist" rule/knowledge-based AI/NLP
• Alternative to classical "rationalist" rule/knowledge-based AI/NLP
• The "empiricist" research paradigm (AI/NLP):
  – Corpora, treebanks, …, machine-learning-based and statistical approaches, …
  – Treebank-based grammar acquisition, probabilistic parsing
  – Advantage: grammars can be induced (learned) automatically
  – Very low development cost, wide coverage, robust, but …
• Most treebank-based grammar induction/parsing technology produces "shallow" grammars
• Shallow grammars don't resolve LDDs (but see (Johnson 2002); …) and do not map strings to information/meaning representations …
Motivation
• Poses a research question: can we address the knowledge acquisition bottleneck for deep grammar development by combining insights from the rationalist and empiricist research paradigms?
• Specifically:
  – Can we automatically acquire wide-coverage, "deep", probabilistic, constraint-based grammars from treebanks?
  – How do we use them in parsing?
  – Can we use them for generation?
  – Can we acquire resources for different languages and treebank encodings?
  – How do these resources compare with hand-crafted resources?
  – …
Course Overview
Monday:
Motivation, Course Overview, Introductions to TAG, LFG,
CCG, HPSG and Penn-II TreeBank, TAG Resources
Tuesday:
Penn-II-Based Acquisition of LFG Resources
Wednesday:
Penn-II-Based Acquisition of CCG Resources
Thursday:
Penn-II-Based Acquisition of HPSG Resources
Friday:
Multilingual Resources, Formal Semantics, Comparing LFG,
CCG, HPSG and TAG-Based Approaches, Demos, Current
and Future Work, Discussion
Course Overview
Tuesday/Wednesday/Thursday
Penn-II-Based Acquisition of XXG Resources:
• Treebank Preprocessing/Clean-Up
• Treebank Annotation/Conversion
• Grammar and Lexicon Extraction
• Parsing (Architectures, Probability Models, Evaluation)
• Generation (Architectures, Probability Models, Evaluation)
• Other (Semantics, Domain Variation, …)
Grammar Formalisms
Grammar formalisms and linguistic theories
• Linguistics aims to explain natural language:
– What is universal grammar?
– What are language-specific constraints?
• Formalisms are mathematical theories:
– They provide a language in which linguistic theories can
be expressed (like calculus for physics)
– They define elementary objects (trees, strings, feature
structures) and recursive operations which generate
complex objects from simple objects.
– They do impose linguistic constraints (e.g. on the kinds
of dependencies they can capture)
Lexicalised Grammar Formalisms: TAG, CCG, LFG and HPSG
Lexicalised formalisms (TAG, CCG, LFG and HPSG)
• The lexicon:
– pairs words with elementary objects
– specifies all language-specific information
(number and location of arguments,
control and binding theory)
• The grammatical operations:
– are universal
– define (and impose constraints on) recursion
TAG, CCG, LFG and HPSG
• They describe different kinds of linguistic objects:
– TAG is a theory of trees
– CCG is a theory of (syntactic and semantic) types
– LFG is a multi-level theory based on a projection
architecture relating different types of linguistic objects
(trees, AVMs, linear logic–based semantics)
– HPSG uses a single, uniform formalism (typed feature
structures) to describe phonological, morphological,
syntactic and semantic representations (signs)
• They differ in details:
– treatment of wh-movement, coordination, etc.
TAG, CCG, LFG and HPSG
• TAG and CCG are weakly equivalent.
• Both are mildly context-sensitive:
– can capture Dutch crossing dependencies
– but are still efficiently parseable (in polynomial time)
• LFG is context-sensitive
Tree-Adjoining Grammar (TAG)
(Lexicalized) Tree-Adjoining Grammar
• TAG is a tree-rewriting formalism:
  – TAG defines operations (substitution and adjunction) on trees.
  – The elementary objects in TAG are trees (not strings).
• TAG is lexicalized:
  – Each elementary tree is anchored to a lexical item (word).
  – "Extended domain of locality": the elementary tree contains all arguments of the anchor.
  – TAG requires a linguistic theory which specifies the shape of these elementary trees.
• TAG is mildly context-sensitive:
  – can capture Dutch crossing dependencies
  – but is still efficiently parseable
A.K. Joshi and Y. Schabes (1996) Tree Adjoining Grammars. In G. Rozenberg and A. Salomaa, eds., Handbook of Formal Languages
TAG substitution (arguments)
[Figure: substitution – an initial tree rooted in Y is substituted at a Y substitution node of tree α1, yielding the derived tree; the derivation tree records which elementary trees were substituted into which]
TAG adjunction (modifiers)
[Figure: adjunction – an auxiliary tree β1, rooted in X with foot node X*, is adjoined at an X node of α1; the derived tree splices β1 in at that node, and the derivation tree records β1 as adjoined into α1]
A small TAG lexicon
α1: 'eats'   – S → NP↓ (VP → VBZ(eats) NP↓)
α2: 'John'   – NP → John
α3: 'tapas'  – NP → tapas
β1: 'always' – auxiliary tree VP → RB(always) VP*
A TAG derivation
[Figure: the derivation tree α1(α2, β1, α3) and the corresponding elementary trees – α2 (John) and α3 (tapas) substitute into the NP slots of α1 (eats); β1 (always) adjoins at VP]
A TAG derivation
[Figure: after substitution – S(NP John, VP(VBZ eats, NP tapas)); the auxiliary tree β1 (VP → RB always VP*) is about to adjoin at the VP node]
A TAG derivation
[Figure: the final derived tree – S(NP John, VP(RB always, VP(VBZ eats, NP tapas)))]
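The substitution and adjunction operations can be sketched in code. This is a minimal illustrative toy, not any existing TAG tool: the `Node` class and the `substitute`/`adjoin` functions are my own names. It derives "John always eats tapas" from the four elementary trees above.

```python
# Toy TAG trees: leaves carry words; subst marks NP-down substitution sites,
# foot marks the foot node (X*) of an auxiliary tree.
class Node:
    def __init__(self, label, children=None, subst=False, foot=False):
        self.label = label
        self.children = children or []
        self.subst = subst
        self.foot = foot

    def copy(self):
        return Node(self.label, [c.copy() for c in self.children],
                    self.subst, self.foot)

    def yield_(self):
        """Collect the terminal string (substitution/foot nodes yield nothing)."""
        if not self.children:
            return [] if self.subst or self.foot else [self.label]
        return [w for c in self.children for w in c.yield_()]

def substitute(tree, initial):
    """Replace the first substitution node matching initial's root label."""
    for i, c in enumerate(tree.children):
        if c.subst and c.label == initial.label:
            tree.children[i] = initial.copy()
            return True
        if substitute(c, initial):
            return True
    return False

def adjoin(tree, aux):
    """Adjoin auxiliary tree aux at the first matching internal node."""
    for i, c in enumerate(tree.children):
        if c.label == aux.label and not c.subst:
            new = aux.copy()
            _replace_foot(new, c)      # the old subtree moves under the foot
            tree.children[i] = new
            return True
        if adjoin(c, aux):
            return True
    return False

def _replace_foot(aux, subtree):
    for i, c in enumerate(aux.children):
        if c.foot:
            aux.children[i] = subtree
            return True
        if _replace_foot(c, subtree):
            return True
    return False

# Elementary trees: a1 'eats', a2 'John', a3 'tapas', b1 'always'
a1 = Node("S", [Node("NP", subst=True),
                Node("VP", [Node("eats"), Node("NP", subst=True)])])
a2 = Node("NP", [Node("John")])
a3 = Node("NP", [Node("tapas")])
b1 = Node("VP", [Node("always"), Node("VP", foot=True)])

substitute(a1, a2)   # John -> subject NP slot
substitute(a1, a3)   # tapas -> object NP slot
adjoin(a1, b1)       # always adjoins at the VP node
print(" ".join(a1.yield_()))   # -> John always eats tapas
```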
Combinatory Categorial Grammar (CCG)
Combinatory Categorial Grammar
• CCG is a lexicalized grammar formalism
(the “rules” of the grammar are completely general,
all language-specific information is given in the lexicon)
• CCG is nearly context-free
(can capture Dutch crossing dependencies, but is still efficiently parseable)
• CCG has a flexible constituent structure
• CCG has a simple, unified treatment of
extraction and coordination
• CCG has a transparent syntax-semantics interface
(every syntactic category and operation has a semantic counterpart)
• CCG rules are monotonic
(there is no movement, and there are no traces)
• CCG rules are type-driven, not structure-driven
(this means e.g. that intransitive verbs and VPs are indistinguishable)
CCG: the machinery
• Categories:
specify subcat lists of words/constituents.
• Combinatory rules:
specify how constituents can combine.
• The lexicon:
specifies which categories a word can have.
• Derivations:
spell out process of combining constituents.
CCG categories
• Simple categories: NP, S, PP
• Complex categories: functions which return a result when combined with an argument:
  – VP or intransitive verb: S\NP
  – Transitive verb: (S\NP)/NP
  – Adverb: (S\NP)\(S\NP)
  – PPs: ((S\NP)\(S\NP))/NP, (NP\NP)/NP
• Every category has a semantic interpretation
Function application
• Combines a function with its argument
to yield a result:
  eats := (S\NP)/NP   tapas := NP        →   eats tapas := S\NP
  John := NP          eats tapas := S\NP →   John eats tapas := S
• Used in all variants of categorial grammar
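Function application over category strings can be sketched directly. This is an illustrative toy; the representation (categories as strings, the rightmost depth-0 slash as the outermost one, following the left-associative notation above) and the function names are my own assumptions.

```python
def split_cat(cat):
    """Split a complex category at its outermost slash into
    (result, slash, argument); return None for atomic categories."""
    depth = 0
    for i in range(len(cat) - 1, -1, -1):   # outermost slash: rightmost at depth 0
        if cat[i] == ")":
            depth += 1
        elif cat[i] == "(":
            depth -= 1
        elif cat[i] in "/\\" and depth == 0:
            return strip(cat[:i]), cat[i], strip(cat[i+1:])
    return None

def strip(cat):
    """Drop one pair of outer parentheses, if present."""
    return cat[1:-1] if cat.startswith("(") and cat.endswith(")") else cat

def apply_(left, right):
    """Forward application X/Y Y -> X, backward application Y X\\Y -> X."""
    f = split_cat(left)
    if f and f[1] == "/" and f[2] == strip(right):
        return f[0]
    b = split_cat(right)
    if b and b[1] == "\\" and b[2] == strip(left):
        return b[0]
    return None

# "John eats tapas": eats := (S\NP)/NP
vp = apply_("(S\\NP)/NP", "NP")   # eats + tapas -> S\NP
s = apply_("NP", vp)              # John + eats tapas -> S
print(vp, s)                      # -> S\NP S
```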
A (C)CG derivation
ESSLLI 2006
Treebank-Based Acquisition of
LFG, HPSG and CCG Resources
28
Type-raising and function composition
• Type-raising turns an argument into a function. Corresponds to case:
  NP → S/(S\NP)            (nominative)
  NP → (S\NP)/((S\NP)/NP)  (accusative)
• Function composition composes two functions (complex categories):
  (S\NP)/PP  PP/NP       →  (S\NP)/NP
  S/(S\NP)   (S\NP)/NP   →  S/NP
Type-raising and Composition
• Wh-movement: [derivation shown on slide]
• Right-node raising: [derivation shown on slide]
Another CCG derivation
• We will only be concerned with canonical
“normal-form” derivations, which only use
function composition and type-raising when
syntactically necessary.
CCG: semantics
• Every syntactic category and rule has a
semantic counterpart:
The CCG lexicon
• Pairs words with their syntactic categories
(and semantic interpretation):
  eats := (S\NP)/NP : λx.λy.eats′ x y
  eats := S\NP : λx.eats′ x
• The lexicon is the main bottleneck for wide-coverage CCG parsing
Why use CCG for statistical parsing?
• CCG derivations are binary trees: we can use
standard chart parsing techniques.
• CCG derivations represent long-range dependencies
and complement-adjunct distinctions directly:
A comparison with Penn Treebank parsers
• Standard Treebank parsers do not recover the null
elements and function tags that are necessary for
semantic interpretation:
Lexical-Functional Grammar (LFG)
Lexical-Functional Grammar LFG
Lexical-Functional Grammar (LFG) (Bresnan & Kaplan 1981, Bresnan 2001,
Dalrymple 2001) is a unification- (or constraint-) based theory of
grammar.
Two (basic) levels of representation:
• C-structure: represents surface grammatical configurations such as word order; annotated CFG data structures
• F-structure: represents abstract syntactic functions such as SUBJ(ect), OBJ(ect), OBL(ique), PRED(icate), COMP(lement), ADJ(unct), …; attribute-value matrices/structures (AVMs)
F-structure approximates basic predicate-argument structure, dependency representation and logical form (van Genabith and Crouch, 1996; 1997)
Lexical-Functional Grammar LFG
LFG Grammar Rules and Lexical Entries
LFG Parse Tree (with Equations/Constraints)
LFG Constraint Resolution (1/3)
LFG Constraint Resolution (2/3)
LFG Constraint Resolution (3/3)
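Constraint resolution can be sketched as building shared attribute-value dicts from equations. This is an illustrative toy, not XLE: the equation and variable formats are my own; nodes annotated ↑=↓ are assumed to share their mother's variable, so only SUBJ/OBJ/ADJUNCT links and atomic values appear, and ADJUNCT is a single value rather than a set.

```python
class Var:
    """Reference to another f-structure variable."""
    def __init__(self, name):
        self.name = name

def solve(equations):
    """equations: (variable, attribute path, value) triples; a value is
    either a Var (link to another f-structure) or an atomic symbol."""
    fs = {}
    for var, path, value in equations:
        f = fs.setdefault(var, {})
        for attr in path[:-1]:
            f = f.setdefault(attr, {})
        last = path[-1]
        if isinstance(value, Var):
            # share one dict, so later equations on value.name show up here
            f[last] = fs.setdefault(value.name, {})
        elif last in f and f[last] != value:
            raise ValueError(f"clash: {last} = {f[last]} vs {value}")
        else:
            f[last] = value
    return fs

# "Factory payrolls fell in September" (simplified equation set)
equations = [
    ("f1", ["PRED"], "fall"),       ("f1", ["TENSE"], "past"),
    ("f1", ["SUBJ"], Var("f2")),    ("f2", ["PRED"], "payroll"),
    ("f2", ["NUM"], "pl"),          ("f1", ["ADJUNCT"], Var("f3")),
    ("f3", ["PRED"], "in"),         ("f3", ["OBJ"], Var("f4")),
    ("f4", ["PRED"], "september"),
]
fstr = solve(equations)["f1"]
print(fstr["SUBJ"]["PRED"], fstr["ADJUNCT"]["OBJ"]["PRED"])  # -> payroll september
```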
LFG Subcategorisation & Long Distance Dependencies
• Subcategorisation:
  – Semantic forms (subcat frames): sign<SUBJ, OBJ>
  – Completeness: all GFs in the semantic form must be present at the local f-structure
  – Coherence: only the GFs in the semantic form may be present at the local f-structure
• Long-Distance Dependencies (LDDs): resolved at f-structure with functional uncertainty equations (regular expressions specifying paths in f-structure)
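The Completeness and Coherence conditions can be sketched as two checks against a subcat frame. This is an illustrative toy; the frame representation, the set of governable functions, and all names are my own assumptions.

```python
# Governable grammatical functions (a common LFG inventory, used here
# as an assumption for the sketch).
GOVERNABLE = {"SUBJ", "OBJ", "OBJ2", "OBL", "COMP", "XCOMP"}

def complete(fstr, frame):
    """Completeness: every GF in the semantic form is present locally."""
    return all(gf in fstr for gf in frame)

def coherent(fstr, frame):
    """Coherence: every governable GF present locally is in the semantic form."""
    return all(gf in frame for gf in fstr if gf in GOVERNABLE)

sign = {"SUBJ": {"PRED": "john"}, "OBJ": {"PRED": "tapas"}, "PRED": "eat"}
frame = ["SUBJ", "OBJ"]          # eat<SUBJ, OBJ>
print(complete(sign, frame), coherent(sign, frame))   # -> True True
```

An f-structure with an extra OBL would fail coherence; one missing its OBJ would fail completeness.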
LFG LDDs: Complement Relative Clause
Head-Driven Phrase Structure Grammar (HPSG)
Head-Driven Phrase Structure Grammar HPSG
• HPSG (Pollard and Sag 1994, Sag et al. 2003) is a
unification-/constraint-based theory of grammar
• HPSG is a lexicalized grammar formalism
• HPSG aims to explain generic regularities that underlie
phrase structures, lexicons, and semantics, as well as
language-specific/-independent constraints
• Syntactic/semantic constraints are uniformly denoted by
signs, which are represented with feature structures
• Two components of HPSG
– Lexical entries represent word-specific constraints (corresponding
to elementary objects)
– Principles express generic grammatical regularities (corresponding
to grammatical operations)
Sign
• A sign is a formal representation of the combination of phonological form and syntactic/semantic constraints:

  sign
    PHON   string                  (phonological form)
    SYNSEM synsem
      LOCAL local                  (local constraints)
        CAT category               (syntactic category)
          HEAD head                (syntactic head)
            MOD synsem             (modifying constraints)
          VAL valence              (subcategorization frames)
            SPR   list
            SUBJ  list
            COMPS list
        CONT content               (semantic representations)
      NONLOCAL nonlocal            (non-local dependencies)
        QUE   list
        REL   list
        SLASH list
    DTRS dtrs                      (daughter structures)
Lexical entries
• Lexical entries express word-specific constraints (we use simplified notations in this lecture):

  PHON  "loves"
  HEAD  verb
  SUBJ  <HEAD noun>
  COMPS <HEAD noun>
Principles
• Principles describe generic regularities of grammar
– Not corresponding to construction rules
• Head Feature Principle
  – The value of HEAD must be percolated from the head daughter:
    [HEAD 1]  →  …  [HEAD 1] (head daughter)  …
• Valence Principle
– Subcats not consumed are percolated to the mother
• Immediate Dominance (ID) Principle
– A mother and her immediate daughters must satisfy one of ID
schemas
• Many other principles: percolation of NONLOCAL features,
semantics construction, etc.
ID schemas
• ID schemas correspond to construction rules in
CFGs and other grammar formalisms
– For subject-head constructions (e.g. "John runs"):
    [SUBJ <>]  →  1  [SUBJ <1>]
– For head-complement constructions (e.g. "loves Mary"):
    [COMPS 2]  →  [COMPS <1|2>]  1
– For filler-head constructions (e.g. "what he bought"):
    [SLASH 2]  →  1  [SLASH <1|2>]
Example: HPSG parsing
• Lexical entries determine syntactic/semantic
constraints of words
Lexical entries:
  John: [HEAD noun, SUBJ <>, COMPS <>]
  saw:  [HEAD verb, SUBJ <HEAD noun>, COMPS <HEAD noun>]
  Mary: [HEAD noun, SUBJ <>, COMPS <>]
Example: HPSG parsing
• Principles determine generic constraints of
grammar
[Figure: the head-complement schema [HEAD 1, SUBJ 2, COMPS 4] → [HEAD 1, SUBJ 2, COMPS <3|4>] 3 is unified with the lexical signs of "saw" ([HEAD verb, SUBJ <HEAD noun>, COMPS <HEAD noun>]) and "Mary" ([HEAD noun, SUBJ <>, COMPS <>])]
Example: HPSG parsing
• Principle application produces phrasal signs
  saw Mary: [HEAD verb, SUBJ <HEAD noun>, COMPS <>]
    saw:  [HEAD verb, SUBJ <HEAD noun>, COMPS <HEAD noun>]
    Mary: [HEAD noun, SUBJ <>, COMPS <>]
  (John: [HEAD noun, SUBJ <>, COMPS <>] awaits the subject-head schema)
Example: HPSG parsing
• Recursive applications of principles produce
syntactic/semantic structures of sentences
  John saw Mary: [HEAD verb, SUBJ <>, COMPS <>]
    John: [HEAD noun, SUBJ <>, COMPS <>]
    saw Mary: [HEAD verb, SUBJ <HEAD noun>, COMPS <>]
      saw:  [HEAD verb, SUBJ <HEAD noun>, COMPS <HEAD noun>]
      Mary: [HEAD noun, SUBJ <>, COMPS <>]
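The schema applications above can be sketched with signs as plain dicts. This is an illustrative toy, not a real HPSG engine: only HEAD, SUBJ and COMPS are modelled, the Head Feature and Valence Principles are folded into two schema functions, and all names are my own.

```python
john = {"HEAD": "noun", "SUBJ": [], "COMPS": []}
mary = {"HEAD": "noun", "SUBJ": [], "COMPS": []}
saw  = {"HEAD": "verb",
        "SUBJ":  [{"HEAD": "noun"}],
        "COMPS": [{"HEAD": "noun"}]}

def matches(constraint, sign):
    """Crude stand-in for unification: the sign satisfies the constraint."""
    return all(sign.get(k) == v for k, v in constraint.items())

def head_complement(head, comp):
    """COMPS <1|2> + 1 -> COMPS 2; HEAD percolates from the head daughter."""
    first, *rest = head["COMPS"]
    assert matches(first, comp), "complement does not satisfy subcat"
    return {"HEAD": head["HEAD"], "SUBJ": head["SUBJ"], "COMPS": rest}

def subject_head(subj, head):
    """SUBJ <1> + 1 -> SUBJ <>; HEAD percolates from the head daughter."""
    [first] = head["SUBJ"]
    assert matches(first, subj), "subject does not satisfy subcat"
    return {"HEAD": head["HEAD"], "SUBJ": [], "COMPS": head["COMPS"]}

vp = head_complement(saw, mary)   # saw Mary: HEAD verb, SUBJ <noun>, COMPS <>
s  = subject_head(john, vp)       # John saw Mary: HEAD verb, SUBJ <>, COMPS <>
print(s)   # -> {'HEAD': 'verb', 'SUBJ': [], 'COMPS': []}
```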
Example: LDDs
• NONLOCAL features (SLASH, REL, etc.) account for long-distance dependencies:
  – WH-movement
  – topicalization
  – relative clauses, etc.

[Figure: parse of "the prices we were charged" – the object gap after "charged" contributes SLASH <2>; SLASH is percolated up through the verbal projections and discharged at the top of the relative clause, where REL <2> binds it to the head noun "prices"]
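SLASH percolation can be sketched in the same dict representation: a trace introduces a SLASH value, the schemas pass it up, and a filler-head step discharges it. This is an illustrative toy (the auxiliary "were" is omitted, the filler-gap unification check is elided, and all names are my own).

```python
def head_complement(head, comp):
    """Consume the first COMPS element; SLASH values are collected upward."""
    first, *rest = head["COMPS"]
    return {"HEAD": head["HEAD"], "SUBJ": head["SUBJ"], "COMPS": rest,
            "SLASH": head.get("SLASH", []) + comp.get("SLASH", [])}

def subject_head(subj, head):
    """Consume the SUBJ element; SLASH values are collected upward."""
    return {"HEAD": head["HEAD"], "SUBJ": [], "COMPS": head["COMPS"],
            "SLASH": subj.get("SLASH", []) + head.get("SLASH", [])}

def filler_head(filler, head):
    """Discharge the first SLASH element against the filler."""
    first, *rest = head["SLASH"]
    # a real grammar would unify the filler with the gap description here
    return {"HEAD": head["HEAD"], "SUBJ": head["SUBJ"],
            "COMPS": head["COMPS"], "SLASH": rest}

noun = lambda: {"HEAD": "noun", "SUBJ": [], "COMPS": [], "SLASH": []}
trace = {"HEAD": "noun", "SUBJ": [], "COMPS": [],
         "SLASH": [{"HEAD": "noun"}]}          # the gap after "charged"
charged = {"HEAD": "verb", "SUBJ": [{"HEAD": "noun"}],
           "COMPS": [{"HEAD": "noun"}], "SLASH": []}

vp = head_complement(charged, trace)     # "charged _": SLASH <noun>
clause = subject_head(noun(), vp)        # "we charged _": SLASH <noun>
rel = filler_head(noun(), clause)        # filler "prices" discharges SLASH
print(clause["SLASH"], rel["SLASH"])     # -> [{'HEAD': 'noun'}] []
```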
Brief Intro to Penn Treebank
The Penn Treebank
• The first large syntactically annotated corpus
• Contains text from different domains:
  – Wall Street Journal (50,000 sentences, 1 million words)
  – Switchboard
  – Brown corpus
  – ATIS
• The annotation:
– POS-tagged (Ratnaparkhi’s MXPOST)
– Manually annotated with phrase-structure trees
– Traces and other null elements used to represent non-local
dependencies (movement, PRO, etc.)
– Designed to facilitate extraction of predicate-argument structure
A Treebank tree
• Relatively flat structures:
– There is no noun level
– VP arguments and adjuncts appear at the same level
• Co-indexed null elements indicate long-range dependencies
• Function tags indicate complement-adjunct distinction (?)
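Reading a Treebank bracketing into a tree makes these points concrete. This is a minimal illustrative reader (not the official tgrep/NLTK tooling); it keeps the function tags, traces and null elements that acquisition relies on. The POS tags in the example string are added by me for illustration.

```python
import re

def parse_ptb(s):
    """Parse a Penn Treebank bracketing into nested (label, children) tuples;
    children are tuples (constituents) or strings (words)."""
    tokens = re.findall(r"\(|\)|[^\s()]+", s)
    pos = 0
    def node():
        nonlocal pos
        pos += 1                       # consume "("
        label = tokens[pos]; pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos]); pos += 1
        pos += 1                       # consume ")"
        return (label, children)
    return node()

def null_elements(tree):
    """Collect the terminals dominated by -NONE- (traces, zero elements)."""
    label, children = tree
    out = list(children) if label == "-NONE-" else []
    for c in children:
        if isinstance(c, tuple):
            out += null_elements(c)
    return out

ptb = """(NP (NP (DT a) (NN charge))
  (SBAR (WHNP-2 (-NONE- 0))
    (S (NP-SBJ (NNP Mr.) (NNP Coleman))
      (VP (VBZ denies)
        (NP (-NONE- *T*-2))))))"""
tree = parse_ptb(ptb)
print(null_elements(tree))   # -> ['0', '*T*-2']
```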
Penn-II Treebank
Penn-II Treebank
• Until Congress acts, the government hasn't any authority to issue new debt obligations of any kind, the Treasury said.
Penn-II Treebank
The Penn-II phrasal categories with functional tags:

ADJP, ADJP-ADV, ADJP-CLR, ADJP-HLN, ADJP-LOC, ADJP-MNR, ADJP-PRD, ADJP-SBJ, ADJP-TMP, ADJP-TPC, ADJP-TTL
ADVP, ADVP-CLR, ADVP-DIR, ADVP-EXT, ADVP-HLN, ADVP-LOC, ADVP-MNR, ADVP-PRD, ADVP-PRP, ADVP-PUT, ADVP-TMP, ADVP-TPC, ADVP|PRT
CONJP
FRAG, FRAG-ADV, FRAG-HLN, FRAG-PRD, FRAG-TPC, FRAG-TTL
INTJ, INTJ-CLR, INTJ-HLN
LST
NAC, NAC-LOC, NAC-TMP, NAC-TTL
NP, NP-ADV, NP-BNF, NP-CLR, NP-DIR, NP-EXT, NP-HLN, NP-LGS, NP-LOC, NP-MNR, NP-PRD, NP-SBJ, NP-TMP, NP-TPC, NP-TTL, NP-VOC
NX, NX-TTL
PP, PP-BNF, PP-CLR, PP-DIR, PP-DTV, PP-EXT, PP-HLN, PP-LGS, PP-LOC, PP-MNR, PP-NOM, PP-PRD, PP-PRP, PP-PUT, PP-SBJ, PP-TMP, PP-TPC, PP-TTL
PRN
PRT, PRT|ADVP
QP
RRC
S, S-ADV, S-CLF, S-CLR, S-HLN, S-LOC, S-MNR, S-NOM, S-PRD, S-PRP, S-SBJ, S-TMP, S-TPC, S-TTL
SBAR, SBAR-ADV, SBAR-CLR, SBAR-DIR, SBAR-HLN, SBAR-LOC, SBAR-MNR, SBAR-NOM, SBAR-PRD, SBAR-PRP, SBAR-PUT, SBAR-SBJ, SBAR-TMP, SBAR-TPC, SBAR-TTL
SBARQ, SBARQ-HLN, SBARQ-NOM, SBARQ-PRD, SBARQ-TPC, SBARQ-TTL
SINV, SINV-ADV, SINV-HLN, SINV-TPC, SINV-TTL
SQ, SQ-PRD, SQ-TPC, SQ-TTL
UCP, UCP-ADV, UCP-CLR, UCP-DIR, UCP-EXT, UCP-HLN, UCP-LOC, UCP-MNR, UCP-PRD, UCP-PRP, UCP-TMP, UCP-TPC
VP, VP-TPC, VP-TTL
WHADJP
WHADVP, WHADVP-TMP
WHNP
WHPP
X, X-ADV, X-CLF, X-DIR, X-EXT, X-HLN, X-PUT, X-TMP, X-TTL
Penn-II Treebank (Simple Transitive Verb)
Penn-II Treebank (Simple Coordination)
Penn-II Treebank (Passive)
Penn-II Treebank (Subject WH-Relative Clause)
Penn-II Treebank (WH-Less Complement Relative Cl.)
Penn-II Treebank (Control and WH-Compl. Rel. Cl.)
Penn-II Treebank (Adv. Relative Clause)
Penn-II Treebank (Coord. and Right Node Raising)
The Parseval measure
• Standard evaluation metric for Treebank parsers.
Two components:
– Precision: how many of the proposed NTs are correct?
– Recall: how many of the correct NTs are proposed?
• Measures recovery of nonterminals
(span + syntactic category)
• Ignores function tags and null elements; this has biased research towards parsers that produce linguistically shallow output (Collins, Charniak)
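The metric can be sketched as labelled precision/recall over (category, start, end) spans. This is an illustrative toy, not EVALB: stripping function tags before comparison mirrors the bias described above, and counting POS-level spans is a simplification.

```python
def spans(tree, start=0):
    """Return ([(label, start, end), ...], end) for a (label, children)
    tree whose leaves are words (strings)."""
    label, children = tree
    out, pos = [], start
    for c in children:
        if isinstance(c, tuple):
            sub, pos = spans(c, pos)
            out += sub
        else:
            pos += 1                                 # leaf word
    out.append((label.split("-")[0], start, pos))    # strip function tags
    return out, pos

def parseval(gold, test):
    """Labelled precision and recall of test bracketing against gold."""
    g, t = set(spans(gold)[0]), set(spans(test)[0])
    correct = len(g & t)
    return correct / len(t), correct / len(g)        # precision, recall

gold = ("S", [("NP-SBJ", ["John"]),
              ("VP", [("VBZ", ["eats"]), ("NP", ["tapas"])])])
flat = ("S", [("NP", ["John"]), ("VBZ", ["eats"]), ("NP", ["tapas"])])
print(parseval(gold, flat))   # -> (1.0, 0.8): missing VP costs recall only
```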
Treebank-Based Acquisition of TAG Resources
Extracting a TAG from the Treebank
• Two different approaches:
– F. Xia. Automatic Grammar Generation From Two
Different Perspectives. PhD thesis, University of
Pennsylvania, 2001.
– J. Chen, S. Bangalore, K. Vijay-Shanker. Automated Extraction of Tree-Adjoining Grammars from Treebanks, Natural Language Engineering (forthcoming)
• This lecture: just the basic ideas!
Extracting a TAG from the Penn Treebank
• Input: a Treebank tree (= the TAG derived tree)
• Output: a set of elementary trees (= the TAG lexicon)
Extracting a TAG: the head
– Identify the head path (requires a head percolation table)
– Find the arguments of the head (requires an argument table)
– Ignore modifiers (requires an adjunct table)
– Merge unary productions (VP → VP)

[Figure: tree S → NP-SBJ VP, VP → VP, VP → VBG(making) NP, with the head path S–VP–VP–VBG highlighted]
Extracting a TAG: the head
• This is the elementary tree for the head:
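The head-path step can be sketched with a toy head percolation table. This is illustrative only: the tables are tiny stand-ins for Xia's actual head/argument tables, and all names are my own.

```python
HEAD_TABLE = {          # parent label -> child labels to search, by priority
    "S":  ["VP"],
    "VP": ["VBG", "VBZ", "VBD", "VP"],
}

def head_child(label, children):
    """Pick the head daughter using the percolation table."""
    for cand in HEAD_TABLE.get(label, []):
        for c in children:
            if c[0] == cand:
                return c
    return None

def head_path(tree):
    """Labels on the path from the root down to the lexical anchor."""
    label, children = tree
    path = [label]
    h = head_child(label, [c for c in children if isinstance(c, tuple)])
    if h:
        path += head_path(h)
    return path

tree = ("S", [("NP-SBJ", ["Investors"]),
              ("VP", [("VBG", ["making"]), ("NP", ["money"])])])
print(head_path(tree))   # -> ['S', 'VP', 'VBG']
```

Everything on this path stays in the head's elementary tree; arguments off the path become substitution nodes and adjuncts become auxiliary trees.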
Extracting a TAG: arguments
• Arguments are combined via substitution
• Recurse on the arguments:
Extracting a TAG: adjuncts
• Adjuncts require auxiliary trees
(use adjunction to be combined with the head)
• Auxiliary trees require a foot node
(with the same label as the root)
[Figure: auxiliary trees extracted for the adjuncts – 'is': VP → VBZ(is) VP*; 'officially': VP → ADVP-MNR(officially) VP*; 'the': NP → DT(the) NP*]
Special cases
• Coordination
• Null elements (e.g. traces for wh-movement):
The trace has to be part of the elementary tree
of the main verb
• Punctuation marks
Wh-movement: relative clauses
(NP (NP a charge)
    (SBAR (WHNP-2 (-NONE- 0))
          (S (NP-SBJ Mr. Coleman)
             (VP (VBZ denies)
                 (NP (-NONE- *T*-2))))))

The trace *T*-2 in object position of "denies" is co-indexed with the empty relative pronoun WHNP-2.
Evaluating an extracted grammar/lexicon
• Grammar/lexicon size?
– Depends on head table, argument/adjunct distinction, treatment of
null elements, mapping of Treebank labels/POS tags to categories in
extracted grammar etc.
– For TAGs, between 3,000 and 8,500 elementary tree types, and 100,000–130,000 lexical entries.
• Lexical coverage?
– For TAGs, around 92-93%
• Distribution of tree types?
• Convergence?
• Quality?
– Inspection, comparison with manual grammar
References: TAG extraction
TAG:
A.K. Joshi and Y. Schabes (1996). Tree Adjoining Grammars. In G. Rozenberg and A. Salomaa, eds., Handbook of Formal Languages.
TAG extraction:
F. Xia (2001). Automatic Grammar Generation From Two Different Perspectives. PhD thesis, University of Pennsylvania.
J. Chen, S. Bangalore and K. Vijay-Shanker. Automated Extraction of Tree-Adjoining Grammars from Treebanks. Natural Language Engineering (forthcoming).
Also: L. Shen and A.K. Joshi (2005). Building an LTAG Treebank. Technical Report MS-CIS-05-15, CIS Department, University of Pennsylvania.
Parsing with extracted TAGs:
D. Chiang. Statistical Parsing with an Automatically Extracted Tree Adjoining Grammar. In Data-Oriented Parsing, CSLI Publications, pages 299–316.
L. Shen and A.K. Joshi (2005). Incremental LTAG Parsing. HLT/EMNLP 2005.
Penn-II-Based Acquisition of LFG Resources
Lexical-Functional Grammar
Penn-II-Based Acquisition of LFG Resources
• Introduction
• Treebank Preprocessing/Clean-Up
• Treebank Annotation/Conversion
• Grammar and Lexicon Extraction
• Parsing (Architectures, Probability Models, Evaluation)
• Generation (Architectures, Probability Models, Evaluation)
• Other (Semantics, Domain Variation, …)
Introduction: Penn-II & LFG
• If we had an f-structure annotated version of Penn-II, we could use (standard) machine learning methods to extract probabilistic, wide-coverage LFG resources
• How do we get an f-structure annotated Penn-II?
• Manually? No: 50,000 trees …!
• Automatically! Yes: f-structure annotation algorithm …!
• Penn-II is a 2nd-generation treebank: it contains lots of annotations to support the derivation of deep meaning representations (trees, Penn-II "functional" tags, traces and coindexation), which the f-structure annotation algorithm can exploit.
Introduction: Penn-II & LFG
• What is the task?
• Given a Penn-II tree, the f-structure annotation algorithm has to
traverse the tree and associate all tree nodes with f-structure equations
(including lexical equations at the leaves of the tree).
• A simple example
Introduction: Penn-II & LFG
Factory payrolls fell in September.

S
  NP-SBJ: ↑SUBJ=↓
    NN Factory: ↓∈↑ADJUNCT
    NNS payrolls: ↑=↓
  VP: ↑=↓
    VBD fell: ↑=↓
    PP-TMP: ↓∈↑ADJUNCT
      IN in: ↑=↓
      NP: ↑OBJ=↓
        NNP September: ↑=↓
Introduction: Penn-II & LFG
subj :    pred : payroll
          num : pl
          pers : 3
          adjunct : 2 : pred : factory
                        num : sg
                        pers : 3
adjunct : 1 : pred : in
              obj : pred : september
                    num : sg
                    pers : 3
pred : fall
tense : past
Treebank Preprocessing/Clean-Up: Penn-II & LFG
• Penn-II treebank: often flat analyses (coordination, NPs, …) and a certain amount of noise: inconsistent annotations, errors, …
• No treebank preprocessing or clean-up in the LFG approach
• Take the Penn-II treebank as is, but
• remove all trees with FRAG- or X-labelled constituents
• (FRAG = fragments, X = not known how to annotate)
• Total of 48,424 trees as they are.
Treebank Annotation: Penn-II & LFG
• Annotation-based (rather than conversion-based)
• Automatic annotation of nodes in Penn-II treebank trees with f-structure equations
• F-structure annotation algorithm
• The annotation algorithm exploits:
  – head information
  – categorial information
  – configurational information
  – Penn-II functional tags
  – trace information
Treebank Annotation: Penn-II & LFG
• Architecture of a modular algorithm to assign LFG f-structure equations
to trees in the Penn-II treebank:
[Pipeline: Head-Lexicalisation [Magerman, 1994] → Left-Right Context Annotation Principles → Coordination Annotation Principles → Catch-All and Clean-Up, yielding proto f-structures; Traces, yielding proper f-structures]
Treebank Annotation: Penn-II & LFG
•
Head Lexicalisation: modified rules based on (Magerman, 1994)
Treebank Annotation: Penn-II & LFG
Left-Right Context Annotation Principles:
Left
Context
•
•
Head
Right
Context
Head of NP likely to be rightmost noun …
Mother → Left Context Head Right Context
Treebank Annotation: Penn-II & LFG
Left-Right Annotation Matrix for NP:

  Left Context:   DT: ↑spec:det=↓   QP: ↑spec:quant=↓   JJ, ADJP: ↓∈↑adjunct
  Head:           NN, NNS: ↑=↓
  Right Context:  NP: ↓∈↑app   PP: ↓∈↑adjunct   S, SBAR: ↓∈↑relmod
Example ("a very politicized deal"):

  NP → DT            ADJP                        NN
       ↑spec:det=↓   ↓∈↑adjunct                  ↑=↓
       a             (RB very) (JJ politicized)  deal
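Applying the left-right annotation matrix can be sketched as follows; the entries mirror the NP matrix above in an ASCII rendering (up/down for the LFG arrows), and the dictionary-based representation is an assumption of this sketch:

```python
# Sketch of applying the NP annotation matrix: daughters left of the
# head, the head itself, and daughters right of the head are looked up
# in separate sub-matrices. Equations use an ASCII rendering of the
# LFG notation (up/down for the up- and down-arrows).
LEFT  = {"DT": "up-spec:det=down", "QP": "up-spec:quant=down",
         "JJ": "down in up-adjunct", "ADJP": "down in up-adjunct"}
HEAD  = {"NN": "up=down", "NNS": "up=down"}
RIGHT = {"NP": "down in up-app", "PP": "down in up-adjunct",
         "S": "down in up-relmod", "SBAR": "down in up-relmod"}

def annotate_np(daughters, head_index):
    """Pair each daughter category with its f-structure equation."""
    annotated = []
    for i, cat in enumerate(daughters):
        matrix = LEFT if i < head_index else HEAD if i == head_index else RIGHT
        annotated.append((cat, matrix.get(cat)))
    return annotated

# NP -> DT ADJP NN  ("a very politicized deal"), head = NN at index 2
for cat, eq in annotate_np(["DT", "ADJP", "NN"], 2):
    print(cat, eq)
```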
Treebank Annotation: Penn-II & LFG
Monadic categories (CFG categories with Penn-II functional tags) occurring in Penn-II:

ADJP, ADJP-ADV, ADJP-CLR, ADJP-HLN, ADJP-LOC, ADJP-MNR, ADJP-PRD, ADJP-SBJ, ADJP-TMP, ADJP-TPC, ADJP-TTL,
ADVP, ADVP-CLR, ADVP-DIR, ADVP-EXT, ADVP-HLN, ADVP-LOC, ADVP-MNR, ADVP-PRD, ADVP-PRP, ADVP-PUT, ADVP-TMP, ADVP-TPC, ADVP|PRT,
CONJP,
FRAG, FRAG-ADV, FRAG-HLN, FRAG-PRD, FRAG-TPC, FRAG-TTL,
INTJ, INTJ-CLR, INTJ-HLN,
LST,
NAC, NAC-LOC, NAC-TMP, NAC-TTL,
NP, NP-ADV, NP-BNF, NP-CLR, NP-DIR, NP-EXT, NP-HLN, NP-LGS, NP-LOC, NP-MNR, NP-PRD, NP-SBJ, NP-TMP, NP-TPC, NP-TTL, NP-VOC,
NX, NX-TTL,
PP, PP-BNF, PP-CLR, PP-DIR, PP-DTV, PP-EXT, PP-HLN, PP-LGS, PP-LOC, PP-MNR, PP-NOM, PP-PRD, PP-PRP, PP-PUT, PP-SBJ, PP-TMP, PP-TPC, PP-TTL,
PRN, PRT, PRT|ADVP, QP, RRC,
S, S-ADV, S-CLF, S-CLR, S-HLN, S-LOC, S-MNR, S-NOM, S-PRD, S-PRP, S-SBJ, S-TMP, S-TPC, S-TTL,
SBAR, SBAR-ADV, SBAR-CLR, SBAR-DIR, SBAR-HLN, SBAR-LOC, SBAR-MNR, SBAR-NOM, SBAR-PRD, SBAR-PRP, SBAR-PUT, SBAR-SBJ, SBAR-TMP, SBAR-TPC, SBAR-TTL,
SBARQ, SBARQ-HLN, SBARQ-NOM, SBARQ-PRD, SBARQ-TPC, SBARQ-TTL,
SINV, SINV-ADV, SINV-HLN, SINV-TPC, SINV-TTL,
SQ, SQ-PRD, SQ-TPC, SQ-TTL,
UCP, UCP-ADV, UCP-CLR, UCP-DIR, UCP-EXT, UCP-HLN, UCP-LOC, UCP-MNR, UCP-PRD, UCP-PRP, UCP-TMP, UCP-TPC,
VP, VP-TPC, VP-TTL,
WHADJP, WHADVP, WHADVP-TMP, WHNP, WHPP,
X, X-ADV, X-CLF, X-DIR, X-EXT, X-HLN, X-PUT, X-TMP, X-TTL
Treebank Annotation: Penn-II & LFG
• Construct an annotation matrix for each of the monadic categories (without -FUN tags) in Penn-II
• Based on analysing the most frequent rule types for each category, such that the sum total of token frequencies of these rule types is greater than 85% of the total number of rule tokens for that category:

  Category   Rule types (100%)   Rule types (85%)
  NP         6595                102
  S          2602                20
  VP         10239               307
  ADVP       234                 6

• Apply the annotation matrix to all (i.e. also unseen) rules/sub-trees, i.e. also those NP-LOC, NP-TMP etc.
Treebank Annotation: Penn-II & LFG
• Co-ordination Annotation Principles
• Often flat Penn-II analysis of coordination (figure distinguishes co-ordinated elements from object and modifier daughters)
Treebank Annotation: Penn-II & LFG
• Coordination of unlike constituents (figure highlights the co-ordinated elements)
Treebank Annotation: Penn-II & LFG
Traces Module:
• Long Distance Dependencies:
  – Topicalisation
  – Wh- and wh-less questions
  – Relative clauses
  – Passivisation
  – Control constructions
  – ICH (interpret constituent here)
  – RNR (right node raising)
  – …
• Translate Penn-II traces and coindexation into corresponding reentrancies in f-structure
Treebank Annotation example figures: WH-relative clauses; wh-less relative clauses; control & wh-relative LDDs; adverbial relative clauses; right node raising.
Treebank Annotation: Penn-II & LFG
Catch-All and Clean-Up Module:
• Penn-II Functional Tags are used to identify potential errors
– e.g. Nodes with the tag -SBJ should be annotated as the subject …
• Correction of Overgeneralisations
  – e.g. Change a second OBJ annotation to OBJ2 …
  – e.g. Change arguments of head nouns erroneously annotated as relative clauses to COMP arguments:
    • … signs [that managers expect declines]_RELCL …
    → … signs [that managers expect declines]_COMP …
• Unannotated Nodes
– Defaults …
Treebank Annotation: Penn-II & LFG
Head-Lexicalisation [Magerman, 1995]
→ Left-Right Context Annotation Principles (Proto F-Structures)
→ Coordination Annotation Principles (Proper F-Structures)
→ Catch-All and Clean-Up
→ Traces
→ Constraint Solver
Treebank Annotation: Penn-II & LFG
• Collect f-structure equations
• Send to constraint solver
• Generates f-structures
• F-structure annotation algorithm implemented in Java, constraint solver
in Prolog
• ~3 min annotating approx. 50,000 Penn-II trees
• ~5 min producing approx. 50,000 f-structures
Treebank Annotation: Penn-II & LFG
Evaluation (Quantitative):
• Burke (2006)
• Coverage:
Over 99.8% of Penn-II sentences (without X and FRAG constituents) receive
a single covering and connected f-structure:
  0 f-structures   45      (0.093%)
  1 f-structure    48329   (99.804%)
  2 f-structures   50      (0.103%)
Treebank Annotation: Penn-II & LFG
Evaluation (Qualitative):
• Burke (2006)
• F-structure quality evaluation against DCU 105, a manually annotated
dependency gold standard of 105 sentences randomly extracted from
WSJ section 23.
• Triples are extracted from the gold standard and the automatically
produced f-structures using the evaluation software from (Crouch et al.
2002) and (Riezler et al. 2002)
relation(predicate~0, argument~1)
• Results calculated in terms of Precision and Recall
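The triple-based evaluation amounts to set intersection over gold and test triples; a minimal sketch (the triples below are toy data, not the DCU 105 gold standard):

```python
# Evaluation over dependency triples relation(predicate, argument):
# precision = |gold & test| / |test|, recall = |gold & test| / |gold|.
def prf(gold, test):
    gold, test = set(gold), set(test)
    correct = len(gold & test)
    p = correct / len(test) if test else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Toy triples, not the DCU 105 data:
gold = {("subj", "fall", "payroll"), ("adjunct", "fall", "in"), ("obj", "in", "september")}
test = {("subj", "fall", "payroll"), ("obj", "in", "september"), ("adjunct", "fall", "on")}
p, r, f = prf(gold, test)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.667 0.667 0.667
```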
Treebank Annotation: Penn-II & LFG
• Precision and Recall for DCU 105 Dependency Bank results are
calculated for All Annotations and for Preds-Only
  DCU 105           Precision   Recall
  All Annotations   97.06%      96.80%
  Preds-Only        94.28%      94.28%
Treebank Annotation: Penn-II & LFG
DCU 105 per-feature results:

  Feature    Precision       Recall          F-Score
  adjunct    892/968 = 92    892/950 = 94    93
  app        16/16   = 100   16/19   = 84    91
  comp       88/92   = 96    88/102  = 86    91
  coord      153/184 = 83    153/167 = 92    87
  obj        442/459 = 96    442/461 = 96    96
  obl        50/52   = 96    50/61   = 82    88
  oblag      12/12   = 100   12/12   = 100   100
  passive    76/79   = 96    76/80   = 95    96
  poss       74/79   = 94    74/81   = 91    92
  quant      40/64   = 62    40/52   = 77    69
  relmod     46/48   = 96    46/50   = 92    94
  subj       396/412 = 96    396/414 = 96    96
  topic      13/13   = 100   13/13   = 100   100
  topicrel   46/49   = 94    46/52   = 88    91
  xcomp      145/153 = 95    145/146 = 99    97
Treebank Annotation: Penn-II & LFG
• Following (Kaplan et al. 2004), Precision and Recall for the PARC 700 Dependency Bank are calculated for: all annotations, PARC features, preds-only
• Mapping required (Burke 2006)

  PARC 700, PARC features:   Precision 88.31%   Recall 86.38%
Grammar and Lexicon Extraction : Penn-II & LFG
Lexical Resources:
• Lexical information extremely important in modern lexicalised grammar formalisms: LFG, HPSG, CCG, TAG, …
• Lexicon development is time-consuming and extremely expensive
• Rarely if ever complete
• Familiar knowledge acquisition bottleneck …
• Subcategorisation frame induction (LFG semantic forms) from f-structure-annotated versions of Penn-II and -III
• Evaluation against COMLEX
Grammar and Lexicon Extraction: Penn-II & LFG
• Lexicon Construction: Manual vs. Automated
• Our Approach:
  – F-Structure Annotation of Penn-II and Penn-III
  – Frames not Predefined
  – Functional and Categorial Information
  – Parameterised for Prepositions and Particles
  – Active and Passive
  – Long Distance Dependencies
  – Conditional Probabilities
Grammar and Lexicon Extraction: Penn-II & LFG
• Extraction Methodology
– Automatic F-Structure Annotation of Penn-II & III
– Lexical Extraction Algorithm
– Examples
• Evaluation
– Gold Standards (COMLEX, OALD)
– Experimental Architecture
– Results
Grammar and Lexicon Extraction: Penn-II & LFG
sign<subj,obj>
Grammar and Lexicon Extraction: Penn-II & LFG
• Semantic Forms:
PRED<GF1, GF2, …, GFn>
• Governable Grammatical Functions (Arguments)
– SUBJ, OBJ, OBJθ, OBL, OBLθ, COMP, XCOMP, PART…
• Non-Governable Grammatical Functions (Adjuncts)
– ADJ, XADJ, APP, RELMOD, …
Grammar and Lexicon Extraction: Penn-II & LFG
Penn-II Treebank → Automatic F-Structure Annotation Algorithm → LFG F-Structures → Extraction Algorithm → Semantic Forms
Grammar and Lexicon Extraction: Penn-II & LFG
Extraction Algorithm:
For each f-structure F
For each level of embedding in F
Determine the local predicate PRED
Collect all subcategorisable grammatical functions GF1, …, GFn
Return: PRED<GF1, GF2, …, GFn>
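The extraction algorithm can be sketched over f-structures represented as nested dictionaries (the dict representation and the attribute inventory are assumptions of this sketch):

```python
# Extraction algorithm sketch over f-structures as nested dicts:
# at every level with a PRED, collect the subcategorisable grammatical
# functions present at that level. Note that in this sketch nominal
# PREDs simply yield empty frames.
GOVERNABLE = {"subj", "obj", "obj2", "obl", "comp", "xcomp", "part"}

def extract_semantic_forms(fstr, forms=None):
    if forms is None:
        forms = []
    if isinstance(fstr, dict):
        if "pred" in fstr:
            # "obl:on" counts as an obl parameterised for its preposition
            gfs = sorted(a for a in fstr if a.split(":")[0] in GOVERNABLE)
            forms.append((fstr["pred"], gfs))
        for value in fstr.values():          # recurse into every level
            extract_semantic_forms(value, forms)
    return forms

fs = {"pred": "focus", "tense": "past",
      "subj": {"pred": "inquiry", "num": "sg"},
      "obl:on": {"pred": "on", "obj": {"pred": "judge"}}}
for pred, gfs in extract_semantic_forms(fs):
    print(f"{pred}([{','.join(gfs)}])")
```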
Grammar and Lexicon Extraction: Penn-II & LFG
subj : spec : det : pred : the
pred : inquiry
num : sg
pers : 3
adjunct : 1 : pred : soon
pred : focus
tense : past
obl : pform : on
obj : spec : det : pred : the
pred : judge
num : sg
pers : 3
Prepositions and OBLs:
focus([subj,obl:on])
on([obj])
“The inquiry soon focused on the judge” (wsj_0267_72)
Grammar and Lexicon Extraction: Penn-II & LFG
topic : index : [1]
subj : spec : det : pred : the
num : sing
pred : government
pers : 3
…
…
pred : have
tense : pres
subj : spec : det : pred : the
pers : 3
pred : treasury
num : sing
comp : index : [1]
subj : spec : det : pred : the
num : sing
pred : government
pers : 3
…
…
pred : have
tense : pres
pred : say
tense : past
LDDs:
say([subj,comp])
“Until Congress acts , the government hasn't any
authority to issue new debt obligations of any kind, the
Treasury said.” (wsj_0008_2)
Grammar and Lexicon Extraction: Penn-II & LFG
subj : pred : pro
pron_form : it
passive : +
to_inf : +
pred : be
xcomp : subj : pred : pro
pron_form : it
passive : +
pred : consider
tense : past
obl : pform : as
obj : spec : det : pred : a
………
………
pred : risk
num : sg
pers : 3
Passive:
consider([subj,obl:as],p)
“… to be considered as an additional risk for the investor…”(wsj_0018_14)
Grammar and Lexicon Extraction: Penn-II & LFG
CFG categories in semantic forms: focus(v,[subj,obl:on]) or, with argument categories, focus(v,[subj(n),obl:on])

subj : spec : det : pred : the
                    cat : dt
       pred : inquiry
       num : sg
       pers : 3
       cat : nn
adjunct : 1 : pred : soon
              cat : rb
pred : focus
tense : past
cat : vbd
obl : pform : on
      obj : spec : det : pred : the
                         cat : dt
            pred : judge
            num : sg
            pers : 3
            cat : nn

"The inquiry soon focused on the judge." (wsj_0267_72)
Grammar and Lexicon Extraction: Penn-II & LFG
Lexicon extracted from Penn-II (O'Donovan et al. 2005):

  Semantic Form                 Conditional Probability
  accept([subj,obj])            0.813
  accept([subj],p)              0.060
  accept([subj,comp])           0.033
  accept([subj,obl:as],p)       0.020
  accept([subj,obj,obl:as])     0.020
  accept([subj,obj,obl:from])   0.020
  accept([subj])                0.013
  Others                        0.021

                    Without Prep/Part   With Prep/Part
  Lemmas            3586                3586
  Semantic Forms    10969               14348
  Frame Types       38                  577
Grammar and Lexicon Extraction: Penn-II & LFG
• Evaluation for all active verbs (2992) extracted from Penn-II against COMLEX
• Largest evaluation for an English subcat frame extraction system:
  – Carroll and Rooth (1998) – 200 verbs
  – Schulte im Walde (2000) – over 3000 German verbs
• COMLEX entry and frame definition:

  (VERB :ORTH "reimburse"
        :SUBC ((NP-NP)
               (NP-PP :PVAL ("for"))
               (NP)))

  (vp-frame np-np
        :cs ((np 2) (np 3))
        :gs (:subject 1 :obj 2 :obj2 3)
        :ex "she asked him his name")
Grammar and Lexicon Extraction: Penn-II & LFG
• Following Schulte im Walde (2000):
• Experiment 1: Exclude prepositional phrases entirely (e.g.
[subj,obl:on] is [subj])
• Experiment 2: Include prepositional phrase but not specific preposition
(e.g. [subj,obl]).
– 2a (+ Part value)
• Experiment 3: Include details of specific preposition (e.g.
[subj,obl:on])
– 3a (+ Part value)
• Relative Thresholds of 1% and 5%
Grammar and Lexicon Extraction: Penn-II & LFG
           Threshold of 1%           Threshold of 5%
           P       R       F         P       R       F
  Exp. 1   79.0%   59.6%   68.0%     83.5%   54.7%   66.1%
  Exp. 2   77.1%   50.4%   61.0%     81.4%   44.8%   57.8%
  Exp. 2a  76.4%   44.5%   56.3%     80.9%   39.0%   52.6%
  Exp. 3   73.7%   22.1%   34.0%     78.0%   18.3%   29.6%
  Exp. 3a  73.3%   19.9%   31.3%     77.6%   16.2%   26.8%
Grammar and Lexicon Extraction: Penn-II & LFG
• Directional Prepositions (about, across, along, around, behind, below, beneath, between, beyond, by, down, from, …) included in COMLEX by "default" for verbs that have at least one p-dir …

  (VERB :ORTH "cycle"
        :SUBC ((PP :PVAL ("p-dir"))))

                     Exp. 3   Exp. 3a
  Recall             40.8%    35.4%
  Increase           18.7%    15.5%
  F-Score            54.4%    49.7%
  Increase           20.4%    18.4%
Grammar and Lexicon Extraction: Penn-II & LFG
• Penn-III = Penn-II + the parsed section of the Brown Corpus
  – About 300,000 words of the 1-million-word Brown Corpus
  – Balanced corpus (8 genres), e.g. Humour, Science Fiction, etc.
• Subcategorisation variation across domains
• More data, more verbs
• -CLR tag (closely related)
Grammar and Lexicon Extraction: Penn-II & LFG
• Applications:
• Porting to other languages
– German (TIGER)
– Spanish (CAST3LB)
– Chinese (CTB-I and II)
• LDD resolution in parsing new text (Cahill et al., 2004)
Grammar and Lexicon Extraction: Penn-II & LFG
Parsing-Based Subcat Frame Extraction (O’Donovan 2006):
• Treebank-based vs. parsing-based subcat frame extraction
• We parsed British National Corpus BNC (100 million words) with our
automatically induced LFGs
• 19 days on single machine: ~5 million words per day
• Subcat frame extraction for ~10,000 verb lemmas
• Evaluation against COMLEX and OALD
• Evaluation against Korhonen (2002) gold standard
• Our method is statistically significantly better
Parsing: Penn-II and LFG
• Overview of Parsing Architectures: Pipeline & Integrated
• Long-Distance Dependency Resolution at F-Structure
• Evaluation
Parsing: Penn-II and LFG
• A PCFG consists of CFG rules with associated probabilities
• An A-PCFG (annotated PCFG) treats strings consisting of a CFG category followed by one or more functional annotations as monadic categories (e.g. NP[up-obj=down])
• Probabilistic parsing technology (PCFGs, history-based and lexicalised parsers) produces trees without LDDs
• Exceptions: (Collins 1999): wh-relative clauses; (Johnson 2002): post-processing; …
• In our (standard) architecture, new text is parsed into proto f-structures
• LDD resolution at f-structure
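Treating annotated categories as monadic symbols means standard maximum-likelihood PCFG estimation applies unchanged; a toy sketch (the rule instances and counts below are invented):

```python
# An A-PCFG treats "CFG category + functional annotations" strings as
# atomic nonterminals, so rule probabilities are estimated exactly as
# for a plain PCFG: count(rule) / count(LHS). Toy rule instances below.
from collections import Counter

rules = [
    ("S", ("NP[up-subj=down]", "VP[up=down]")),
    ("S", ("NP[up-subj=down]", "VP[up=down]")),
    ("S", ("VP[up=down]",)),
]
counts = Counter(rules)
lhs_totals = Counter(lhs for lhs, _ in rules)
probs = {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}
print(probs[("S", ("NP[up-subj=down]", "VP[up=down]"))])  # prints 0.6666666666666666
```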
Parsing: Penn-II and LFG
• Penn-II tree with traces and co-indexation for LDDs
“U.N. signs treaty, the paper said”
  (S (S-1 (NP (NNP U.N.))
          (VP (VBZ signs) (NP (NN treaty))))
     (NP (DT the) (NN paper))
     (VP (VBD said)
         (S (-NONE- *T*-1))))
Parsing: Penn-II and LFG
• Trace and coindexation in the tree are translated into a reentrancy at f-structure by the annotation algorithm:
“U.N. signs treaty, the headline said”
Parsing: Penn-II and LFG
• Parse tree from PCFG and History-Based Parsers without traces:
“U.N. signs treaty, the paper said”
  (S (S (NP (NNP U.N.))
        (VP (VBZ signs) (NP (NN treaty))))
     (NP (DT the) (NN paper))
     (VP (VBD said)))
Parsing: Penn-II and LFG
• Basic, but possibly incomplete, predicate-argument structures (proto-f-structures):
"U.N. signs treaty, the headline said"
Parsing: Penn-II and LFG
• Require:
– subcategorisation frames (O’Donovan et al., 2004, 2005; O’Donovan
2006)
– functional uncertainty equations
• Previous Example:
  – say([subj,comp])
  – topic = comp*:comp (search along a path of 0 or more comps)
Parsing: Penn-II and LFG
Subcat Frames:
• Automatically acquired from automatically f-structure-annotated Penn-II
Treebank following (O’Donovan et al. 2004, 2005; O’Donovan 2006)
• Distinction between active and passive frames
• Associated with probabilities
• O’Donovan et al. evaluate against COMLEX resource
• Extracted from sections 02-21
• 10960 active lemma-frame types (semantic forms/subcat frames), 2241
passive types
Parsing: Penn-II and LFG
Functional Uncertainty equations:
• Automatically acquire finite approximations of FU-equations
• Extract paths between co-indexed material in automatically generated f-structures from sections 02-21 of Penn-II
• 26 TOPIC, 60 TOPICREL, 13 FOCUS path types
• 99.69% coverage of paths in section 23
• Each path type associated with a probability
Parsing: Penn-II and LFG
Sample TOPIC paths with probabilities:

  up-topic=up-comp         0.940
  up-topic=up-xcomp:comp   0.006
  up-topic=up-comp:comp    0.001

Sample TOPICREL paths with frequencies:

  up-subj        7894      up-xcomp:xcomp       161
  up-obj         1167      up-xcomp:xcomp:obj   135
  up-xcomp       956       up-comp:subj         119
  up-xcomp:obj   793       up-xcomp:subj        92
Parsing: Penn-II and LFG
LDD Resolution Algorithm: recursively traverse an f-structure and
  – find a TOPIC:T attribute-value pair
  – retrieve the TOPIC paths
  – for each path p of the form GF1:…:GFn:GF, traverse the f-structure along the path GF1:…:GFn to the local sub-f-structure g
    • at g, retrieve the local PRED:P
    • add GF:T to g iff
      – GF is not present at g
      – g together with GF is locally complete and coherent with respect to a semantic form s for P
  – multiply the path and semantic form probabilities involved to rank resolutions
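A minimal sketch of this resolution step; the path and frame probabilities are those shown on the slides, while the dict representation and traversal details are assumptions of the sketch:

```python
# LDD resolution sketch: try each acquired TOPIC path, check that the
# target sub-f-structure plus the new GF matches a subcat frame for its
# PRED, and rank candidates by path probability x frame probability.
PATHS = [((), "comp", 0.940), (("xcomp",), "comp", 0.006)]   # prefix, GF, P(path)
FRAMES = {"say": {frozenset({"subj", "comp"}): 0.87,
                  frozenset({"subj"}): 0.06}}                # P(frame | lemma)
GOVERNABLE = {"subj", "obj", "comp", "xcomp", "obl"}         # TOPIC itself excluded

def resolve_topic(fstr):
    """Re-enter fstr['topic'] as a governable GF along the best path."""
    topic = fstr["topic"]
    candidates = []
    for prefix, gf, p_path in PATHS:
        g = fstr
        for step in prefix:                    # walk down the path prefix
            g = g.get(step) if isinstance(g, dict) else None
        if not isinstance(g, dict) or "pred" not in g or gf in g:
            continue                           # GF must not already be present
        local = frozenset(a for a in g if a in GOVERNABLE) | {gf}
        p_frame = FRAMES.get(g["pred"], {}).get(local, 0.0)
        if p_frame:                            # locally complete and coherent
            candidates.append((p_path * p_frame, g, gf))
    if candidates:
        _, g, gf = max(candidates, key=lambda c: c[0])
        g[gf] = topic                          # add GF:T -> re-entrancy
    return fstr

proto = {"pred": "say", "tense": "past",
         "topic": {"pred": "sign",
                   "subj": {"pred": "U.N."},
                   "obj": {"pred": "treaty"}},
         "subj": {"pred": "paper", "spec": "the"}}
resolved = resolve_topic(proto)
print(resolved["comp"]["pred"])  # sign
```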
Parsing: Penn-II and LFG
FU-path approximations:
  up-topic=up-comp         0.940
  up-topic=up-xcomp:comp   0.006
  up-topic=up-comp:comp    0.001
  …

Subcategorisation Frames:
  say([subj])         0.06
  say([comp,subj])    0.87
  say([subj,xcomp])   0.02
  …

Resolved f-structure ("U.N. signs treaty, the paper said"):

  topic : [1] pred : sign
              subj : pred : U.N.
              obj : pred : treaty
  pred : say
  subj : spec : the
         pred : paper
  comp : [1] pred : sign
             subj : pred : U.N.
             obj : pred : treaty
Parsing: Penn-II and LFG
• How do treebank-based constraint grammars compare to deep hand-crafted grammars like XLE and RASP?
• XLE (Riezler et al. 2002, Kaplan et al. 2004)
– hand-crafted, wide-coverage, deep, state-of-the-art English LFG and XLE
parsing system with log-linear-based probability models for disambiguation
– PARC 700 Dependency Bank gold standard (King et al. 2003), Penn-II
Section 23-based
• RASP (Carroll and Briscoe 2002)
– hand-crafted, wide-coverage, deep, state-of-the-art English probabilistic
unification grammar and parsing system (RASP Rapid Accurate Statistical
Parsing)
– CBS 500 Dependency Bank gold standard (Carroll, Briscoe and Sanfilippo 1999), Susanne-based
Parsing: Penn-II and LFG
• Choose best treebank-based LFG system to compare with XLE/RASP:
• C-structure engines (state-of-the-art history based, lexicalised parsers):
– (Collins 1999)
– (Charniak 2000)
– (Bikel 2002)
• (Bikel 2002) retrained to retain Penn-II functional tags (-SBJ, -LOC, -TMP, -CLR, etc.)
• Pipeline architecture: tagged text → Bikel retrained + f-structure annotation algorithm + LDD resolution → f-structures → automatic conversion → evaluation against XLE/RASP gold standards (PARC-700/CBS-500 dependency banks)
Parsing: Penn-II and LFG
• Systematic differences between our f-structures and PARC 700 and CBS
500 dependency representations
• Automatic conversion of our f-structures to PARC 700 / CBS 500 -like
structures (Burke et al. 2004, Burke 2006, Cahill et al. under review)
• Best XLE and RASP resources, with better results than those reported in the literature to date
• (Crouch et al. 2002) and (Carroll and Briscoe 2002) evaluation software
• (Noreen 1989) Approximate Randomisation Test to test for statistical
significance of results
Parsing: Penn-II and LFG
• Result dependency f-scores:
PARC 700 XLE vs. BKR-LFG:
– 80.55% XLE
– 83.08% BKR-LFG (+2.53%)
CBS 500 RASP vs. BKR-LFG:
– 76.57% RASP
– 80.23% BKR-LFG (+3.66%)
• Results statistically significant at the 95% level ((Noreen 1989) Approximate Randomisation Test)
• BKR-LFG = treebank-induced Lexical-Functional Grammar resources with Bikel retrained (BKR) as the c-structure engine in the pipeline architecture
Parsing: Penn-II and LFG
PARC 700 Evaluation:
Probability Models: Penn-II & LFG
Probability Models:
• Our approach does not constitute a proper probability model (Abney, 1996)
• Why? The probability model leaks:
  – The highest-ranking parse tree may feature f-structure equations that cannot be resolved into an f-structure
  – The probability associated with that parse tree is lost
• This doesn't happen often in practice (coverage >99.5% on unseen data)
• Research on appropriate discriminative, log-linear or maximum entropy models is important (Miyao and Tsujii, 2002; Riezler et al. 2002)
Generation: Penn-II & LFG
Cahill and van Genabith, 2006
Generation: the Good, the Bad and the Ugly
• Orig: Supporters of the legislation view the bill as an effort to add stability and certainty to the airline-acquisition process , and to preserve the safety and fitness of the industry .
  Gen: Supporters of the legislation view the bill as an effort to add stability and certainty to the airline-acquisition process , and to preserve the safety and fitness of the industry .
• Orig: The upshot of the downshoot is that the A 's go into San Francisco 's Candlestick Park tonight up two games to none in the best-of-seven fest .
  Gen: The upshot of the downshoot is that the A 's tonight go into San Francisco 's Candlestick Park up two games to none in the best-of-seven fest .
• Orig: By this time , it was 4:30 a.m. in New York , and Mr. Smith fielded a call from a New York customer wanting an opinion on the British stock market , which had been having troubles of its own even before Friday 's New York market break .
  Gen: Mr. Smith fielded a call from New a customer York wanting an opinion on the market British stock which had been having troubles of its own even before Friday 's New York market break by this time and in New York , it was 4:30 a.m. .
• Orig: Only half the usual lunchtime crowd gathered at the tony Corney & Barrow wine bar on Old Broad Street nearby .
  Gen: At wine tony Corney & Barrow the bar on Old Broad Street nearby gathered usual , lunchtime only half the crowd , .
Domain Variation, Multilingual LFG Resources, etc.
• Domain variation: ATIS (Judge et al 2005) and QuestionBank (Judge et
al 2006)
• F-Str -> (Q)LF Quasi-Logical Forms (Cahill et al. 2003)
• Multilingual treebank-based LFG acquisition:
– German: TIGER treebank (Cahill et al 2003), (Cahill et al 2005)
– Chinese: Chinese Penn Treebank (Burke et al 2004)
– Spanish: Cast3LB (O’Donovan et al 2005), (Chrupala and van
Genabith 2006)
• GramLab Project at DCU (2005-2008): Chinese, Japanese, Arabic,
Spanish, French and German
Demo System
• http://lfg-demo.computing.dcu.ie/lfgparser.html
Publications
A. Cahill and J. van Genabith, Robust PCFG-Based Generation using Automatically Acquired LFG Approximations, COLING/ACL 2006, Sydney, Australia
J. Judge, A. Cahill and J. van Genabith, QuestionBank: Creating a Corpus of Parse-Annotated Questions,
COLING/ACL 2006, Sydney, Australia
G. Chrupala and J. van Genabith, Using Machine-Learning to Assign Function Labels to Parser Output for
Spanish, COLING/ACL 2006, Sydney, Australia
M. Burke, Automatic Treebank Annotation for the Acquisition of LFG Resources, Ph.D. Thesis, School of
Computing, Dublin City University, Dublin 9, Ireland. 2005
R. O’Donovan, Automatic Extraction of Large-Scale Multilingual Lexical Resources, Ph.D. Thesis, School of
Computing, Dublin City University, Dublin 9, Ireland. 2005
R. O'Donovan, M. Burke, A. Cahill, J. van Genabith and A. Way. Large-Scale Induction and Evaluation of
Lexical Resources from the Penn-II and Penn-III Treebanks, Computational Linguistics, 2005
A. Cahill, M. Forst, M. Burke, M. McCarthy, R. O'Donovan, C. Rohrer, J. van Genabith and A. Way. Treebank-Based Acquisition of Multilingual Unification Grammar Resources; Journal of Research on Language
and Computation; Special Issue on "Shared Representations in Multilingual Grammar Engineering",
(eds.) E. Bender, D. Flickinger, F. Fouvry and M. Siegel, Kluwer Academic Press, 2005
Publications
R. O'Donovan, A. Cahill, J. van Genabith, and A. Way. Automatic Acquisition of Spanish LFG Resources from
the CAST3LB Treebank; In Proceedings of the Tenth International Conference on LFG, Bergen, Norway,
2005
J. Judge, M. Burke, A. Cahill, R. O'Donovan, J. van Genabith, and A. Way. Strong Domain Variation and
Treebank-Induced LFG Resources; In Proceedings of the Tenth International Conference on LFG,
Bergen, Norway,2005
M. Burke, A. Cahill, J. van Genabith, and A. Way. Evaluating Automatically Acquired F-Structures against
PropBank; In Proceedings of the Tenth International Conference on LFG, Bergen, Norway, 2005
M. Burke, A. Cahill, M. McCarthy, R. O'Donovan, J. van Genabith and A. Way. Evaluating Automatic F-Structure Annotation for the Penn-II Treebank; Journal of Language and Computation; Special Issue on
"Treebanks and Linguistic Theories", (eds.) E. Hinrichs and K. Simov, Kluwer Academic Press. 2005.
pages 523-547
A. Cahill. Parsing with Automatically Acquired, Wide-Coverage, Robust, Probabilistic LFG Approximations.
Ph.D. Thesis. School of Computing, Dublin City University, Dublin 9, Ireland. 2004
M. Burke, O. Lam, A. Cahill, R. Chan, R. O'Donovan, A. Bodomo, J. van Genabith and A. Way; Treebank-Based Acquisition of a Chinese Lexical-Functional Grammar; Proceedings of the PACLIC-18 Conference,
Waseda University, Tokyo, Japan, pages 161-172, 2004
Publications
M. Burke, A. Cahill, R. O'Donovan, J. van Genabith, and A. Way. The Evaluation of an Automatic Annotation
Algorithm against the PARC 700 Dependency Bank, In Proceedings of the Ninth International
Conference on LFG, Christchurch, New Zealand, pages 101-121, 2004
A. Cahill, M. Burke, R. O'Donovan, J. van Genabith, and A. Way. Long-Distance Dependency Resolution in
Automatically Acquired Wide-Coverage PCFG-Based LFG Approximations, In Proceedings of the 42nd
Annual Meeting of the Association for Computational Linguistics (ACL-04), July 21-26 2004, pages 320-327, Barcelona, Spain, 2004
R. O'Donovan, M. Burke, A. Cahill, J. van Genabith, and A. Way. Large-Scale Induction and Evaluation of
Lexical Resources from the Penn-II Treebank, In Proceedings of the 42nd Annual Meeting of the
Association for Computational Linguistics (ACL-04), July 21-26 2004, pages 368-375, Barcelona, Spain,
2004
M. Burke, A. Cahill, R. O'Donovan, J. van Genabith and A. Way. Treebank-Based Acquisition of Wide-Coverage, Probabilistic LFG Resources: Project Overview, Results and Evaluation, The First International
Joint Conference on Natural Language Processing (IJCNLP-04), Workshop "Beyond shallow analyses - Formalisms and statistical modeling for deep analyses"; March 22-24, 2004, Sanya City, Hainan Island,
China, 2004
Cahill A., M. Forst, M. McCarthy, R. O' Donovan, C. Rohrer, J. van Genabith and A. Way. Treebank-Based
Multilingual Unification-Grammar Development. In the Proceedings of the Workshop on Ideas and
Strategies for Multilingual Grammar Development, at the 15th European Summer School in Logic
Language and Information, Vienna, Austria, 18th - 29th August 2003
ESSLLI 2006
Treebank-Based Acquisition of
LFG, HPSG and CCG Resources
186
Publications
Cahill A, M. McCarthy, J. van Genabith and A. Way. Quasi-Logical Forms for the Penn Treebank; In (eds.)
Harry Bunt, Ielka van der Sluis and Roser Morante; Proceedings of the Fifth International Workshop on
Computational Semantics, IWCS-05, January 15-17, 2003, Tilburg, The Netherlands, ISBN: 90-7402924-8, pp.55-71, 2003
Cahill A, M. McCarthy, J. van Genabith and A. Way. Evaluating Automatic F-Structure Annotation for the
Penn-II Treebank. TLT 2002, Treebanks and Linguistic Theories 2002, 20th and 21st September 2002,
Sozopol, Bulgaria, (eds.) E. Hinrichs and K. Simov, Proceedings of the First Workshop on Treebanks and
Linguistic Theories (TLT 2002), pp. 42-60, 2002
Cahill A, M. McCarthy, J. van Genabith and A. Way. Parsing with PCFGs and Automatic F-Structure
Annotation, In M. Butt and T. Holloway-King (eds.): Proceedings of the Seventh International
Conference on LFG CSLI Publications, Stanford, CA., pp.76--95. 2002
Cahill A, and J. van Genabith. TTS - A Treebank Tool; in LREC 2002, The Third International Conference on
Language Resources and Evaluation, Las Palmas de Grand Canaria, Spain, May 27th--June 2nd, 2002,
Proceedings of the Conference, Volume V, (eds.) M.G.Rodriguez and C.P. Suarez Arnajo, ISBN 29517408-0-8, pp. 1712-1717, 2002
Cahill A, M. McCarthy, J. van Genabith and A. Way. Automatic Annotation of the Penn-Treebank with LFG FStructure Information; LREC 2002 workshop on Linguistic Knowledge Acquisition and Representation Bootstrapping Annotated Language Data, LREC 2002, Third International Conference on Language
Resources and Evaluation, post-conference workshop, June 1st, 2002, proceedings of the workshop,
(eds.) A. Lenci, S. Montemagni and V. Pirelli, ELRA - European Language Resources Association, Paris
France, pp. 8-15, 2002
ESSLLI 2006
Treebank-Based Acquisition of
LFG, HPSG and CCG Resources
187
Penn-II-Based Acquisition of CCG Resources
Combinatory Categorial Grammar
This lecture
• Recap: CCG
• Translating the Penn Treebank to CCG
– The translation algorithm
– CCGbank: the acquired grammar and lexicon
• Wide-coverage parsing with CCG
CCG: the machinery
• Categories:
specify subcat lists of words/constituents.
• Combinatory rules:
specify how constituents can combine.
• The lexicon:
specifies which categories a word can have.
• Derivations:
spell out process of combining constituents.
CCG categories
• Simple categories: NP, S, PP
• Complex categories: functions which return a result when
  combined with an argument:
  VP or intransitive verb:   S\NP
  Transitive verb:           (S\NP)/NP
  Adverb:                    (S\NP)\(S\NP)
  PPs:                       ((S\NP)\(S\NP))/NP
                             (NP\NP)/NP
The combinatory rules
• Function application:
  X/Y  Y  ⇒  X        (>)
  Y  X\Y  ⇒  X        (<)
  λx.f(x) a ⇒ f(a)
• Function composition:
  X/Y  Y/Z  ⇒  X/Z    (>B)
  Y\Z  X\Y  ⇒  X\Z    (<B)
  X/Y  Y\Z  ⇒  X\Z    (>Bx)
  Y/Z  X\Y  ⇒  X/Z    (<Bx)
  λx.f(x) λy.g(y) ⇒ λx.f(g(x))
• Type-raising:
  X  ⇒  T/(T\X)       (>T)
  X  ⇒  T\(T/X)       (<T)
  a ⇒ λf.f(a)
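The application, composition and type-raising rules can be sketched in a few lines of Python (an illustration for this course, not part of any CCGbank tooling); complex categories are represented as (result, slash, argument) triples and simple categories as plain strings:

```python
# Complex categories are (result, slash, argument) triples;
# simple categories are plain strings such as "NP" or "S".

def fapp(left, right):
    """Forward application (>): X/Y  Y  =>  X."""
    if isinstance(left, tuple) and left[1] == "/" and left[2] == right:
        return left[0]
    return None

def bapp(left, right):
    """Backward application (<): Y  X\\Y  =>  X."""
    if isinstance(right, tuple) and right[1] == "\\" and right[2] == left:
        return right[0]
    return None

def fcomp(left, right):
    """Forward composition (>B): X/Y  Y/Z  =>  X/Z."""
    if (isinstance(left, tuple) and left[1] == "/"
            and isinstance(right, tuple) and right[1] == "/"
            and left[2] == right[0]):
        return (left[0], "/", right[2])
    return None

def traise(x, t):
    """Forward type-raising (>T): X  =>  T/(T\\X)."""
    return (t, "/", (t, "\\", x))

# "opened doors": the transitive verb combines with NP by (>).
tv = (("S", "\\", "NP"), "/", "NP")
vp = fapp(tv, "NP")            # S\NP

# Type-raising the subject and composing gives S/NP, the category
# an object relative clause or right-node raising needs.
subj = traise("NP", "S")       # S/(S\NP)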
CCG derivations
• Canonical “normal-form” derivations (mostly function application):
• Alternative derivations:
Type-raising and Composition
• Wh-movement:
• Right-node raising:
CCG: semantics
• Every syntactic category and rule has a
semantic counterpart:
From the Penn Treebank to CCG
• The basic translation algorithm
• Dealing with null elements
• Type-changing rules in the grammar
• Preprocessing
• CCGbank: The extracted lexicon/grammar
Input: Penn Treebank tree
• Flat phrase-structure tree
• Traces/null elements and indices
represent underlying dependencies
• Function tags
Output: CCG derivation
• Binary derivation tree
with explicit “deep”
dependency structures
and subcategorization
information.
• No null elements
I. Identify heads, arguments, adjuncts
II. Binarise the tree
III. Assign CCG categories
Morphosyntactic Features
• Features on verbal categories:
declarative, infinitival, past participle,
present participle, passive
• Sentential features:
wh-questions, yes-no questions, embedded
questions, embedded declaratives, fragments, etc.
• CCGbank has no case or number distinction!
III. Assign CCG categories: adjuncts
III. Assign CCG categories: arguments
IV. Assign predicate-argument structure
• We approximate predicate-argument structure by
word-word dependencies
• These are defined by the argument slots of functor
categories:
  just     (S\NP)/(S\NP)   — its (S\NP) argument slot is filled by opened
  opened   (S[dcl]\NP)/NP  — its NP argument slot is filled by doors
IV. Assign predicate-argument structure
• Non-local dependencies arise through:
– Binding and control: “He may want you to listen”
– Extraction: “the tapas that he told us she ate”
• Both are mediated by lexical categories:
– Control verbs, auxiliaries/modals
– Relative pronouns
• We represent this via coindexation:
(NP\NPi)/(S[dcl]/NPi)
In CCGbank: added automatically to certain category types
Lexical categories that mediate dependencies
• Auxiliaries/modals, raising verbs: will, might, seem
(S[dcl]\NPi)/(S[b]\NPi)
• Control verbs: persuade you to go
((S[dcl]\NP)/(S[to]\NPi))/NPi
• Relative pronouns: which, who, that
(NP\NPi)/(S[dcl]/NPi)
• Many more (listed in CCGbank manual)
Summary: The basic algorithm
1. Identify heads, complements and adjuncts.
2. Binarize the tree.
3. Assign CCG categories.
4. Add co-indexation to lexical categories.
5. Create predicate-argument structure.
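Step 2 can be sketched as head-outward binarization (a toy illustration; the actual algorithm also threads CCG categories through the new nodes):

```python
def binarize(head, left_sisters, right_sisters):
    """Combine the head with its sisters one at a time, right sisters
    first, producing the binary tree a CCG derivation requires."""
    node = head
    for sib in right_sisters:          # attach right sisters innermost-first
        node = (node, sib)
    for sib in reversed(left_sisters):  # then wrap left sisters outward
        node = (sib, node)
    return node

# Flat VP "opened the doors wide" with head "opened":
flat_vp = binarize("opened", [], ["the doors", "wide"])
# → (("opened", "the doors"), "wide")
```

The nesting order mirrors CCGbank's convention that complements attach closer to the head than adjuncts.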
Problems with basic algorithm
• Depends on Treebank markup:
– Complement/adjunct distinction
– The analyses don’t always correspond to CCG analysis
– Errors in Treebank annotation
• Proliferation of categories:
The need for preprocessing
• Eliminating (some of) the noise:
– POS-tagging errors
– Bracketing errors (coordination!)
• Changing the Treebank analyses:
– Small clauses
• Adding more structure:
– Insert a noun level into NPs
– Analyze QPs, fragments, parentheticals, multiword expressions
Compacting the grammar: Type-changing rules
• Type-changing rules for adjuncts
capture syntactic regularities:
Null elements, traces, and coindexation
• *-null elements: passive, PRO
• *T*-traces: wh-movement, tough movement
• *RNR*-traces: right-node raising
• Other null elements:
  – *EXP*: expletive
  – *ICH* (“insert constituent here”): extraposition
  – *U* (units): $ 500 *U*
  – *PPA* (permanent predictable ambiguity)
• =-coindexation: argument cluster coordination
and gapping
* null elements
• Used for passive or PRO (arbitrary or controlled):
• Only the passive * matters for translation:
(S with null subject = VP = S\NP)
Unbounded long-range dependencies
• … arising through extraction (*T*):
  – Wh-movement (relative clauses and wh-questions):
    the articles that (you believed he saw that…) I filed
  – Tough-movement:
    Peter is easy to please
  – Parasitic gaps:
    the articles that I filed without reading
• … arising through coordination (*RNR* and =):
  – Right-node raising:
    [[Mary ordered] and [John ate]] the tapas.
  – Argument cluster coordination:
    Mary ordered [[tapas for herself] and [wine for John]].
  – Sentential gapping:
    [[Mary ordered tapas] and [John beer]].
Dealing with extraction
• Penn Treebank: *T* traces indicate extraction
Dealing with extraction
• Pass the extracted NP up to relative clause.
• The relative pronoun subcategorizes for an
‘incomplete’ sentence:
(NP\NP)/(S[dcl]\NP) for subject relatives
(NP\NP)/(S[dcl]/NP) for object relatives
• The derivation uses type-raising and composition
Right node raising in the Penn Treebank
Right node raising in CCGbank
Argument-cluster coordination
• “Template gapping” annotation:
Co-indexation between constituents in conjuncts
• The first conjunct contains the head
Argument-cluster coordination in CCGbank
• The shared constituents are coordinated
(via type-raising and composition):
  X   ⇒  T\(T/X)               (<T)
  NP  ⇒  (S\NP)\((S\NP)/NP)    (<T)
Sentential Gapping
• In the Treebank:
• CCG uses decomposition to obtain the types
(interpretation is given extragrammatically)
Remaining problems: NP level
• Lists and appositives are indistinguishable:
• Compound nouns have no internal structure:
Remaining problems: other constructions
• Complement-adjunct distinction:
Putting it all together….
Funds that are or soon will be listed in New York or London
The CCG derivation
The relative clause:
that: (NPi\NPi)/(S[dcl]\NPi) — coindexation links funds to the subject
slot of are and will
The right-node-raising VP
CCGbank
• Coverage of the translation algorithm:
99.44% of all sentences in the Treebank
(main problem: sentential gapping)
• The lexicon (sec.02-21):
– 74,669 entries for 44,210 word types
– 1286 lexical category types
(439 appear once, 556 appear 5 times or more)
• The grammar (sec. 02-21):
– 3262 rule instantiations (1146 appear once)
The most ambiguous words
Frequency distribution of categories
Lexical coverage
• How well does our lexicon cover unseen data?
  “Training” data:  sections 02-21
  Test data:        section 00
• The lexicon contains the correct entries for
94.0% of the tokens in section 00.
• 3.8% of the tokens in section 00 do not appear
in sections 02-21.
35% of the unknown tokens are N
29% of the unknown tokens are N/N
Statistical Parsing with CCG
• The data: CCGbank
• The algorithms: standard CKY chart parsing
(and a supertagger)
• The models:
– Generative: Hockenmaier and Steedman (2002)
– Conditional: Clark and Curran (2004)
Parsing algorithms for CCG
• CCG derivations are binary trees.
• Standard chart parsing algorithms (e.g. CKY)
  can be used.
• Complexity: O(n^6)
  (or O(n^3) if the category set is fixed)
• Recovery of “deep” dependencies requires
  feature structures.
• Supertagging: assign most likely categories to
words before parsing. Significantly speeds up
parsing!
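A toy CKY loop for CCG (an illustrative sketch, not the Hockenmaier/Steedman or Clark/Curran parsers; function application only, no supertagger, categories as (result, slash, argument) triples or strings) looks like:

```python
def combine(l, r):
    """Return all categories derivable from adjacent constituents
    by forward (>) or backward (<) application."""
    out = []
    if isinstance(l, tuple) and l[1] == "/" and l[2] == r:
        out.append(l[0])                      # forward application (>)
    if isinstance(r, tuple) and r[1] == "\\" and r[2] == l:
        out.append(r[0])                      # backward application (<)
    return out

def cky(words, lexicon):
    """Fill a CKY chart bottom-up; cell (i, j) holds all categories
    spanning words[i:j]."""
    n = len(words)
    chart = {(i, i + 1): set(lexicon[w]) for i, w in enumerate(words)}
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            cell = set()
            for k in range(i + 1, j):
                for l in chart[(i, k)]:
                    for r in chart[(k, j)]:
                        cell.update(combine(l, r))
            chart[(i, j)] = cell
    return chart[(0, n)]

# Hypothetical three-word lexicon:
lexicon = {"John": ["NP"],
           "ate": [(("S", "\\", "NP"), "/", "NP")],
           "tapas": ["NP"]}
# cky(["John", "ate", "tapas"], lexicon) contains "S"
```

A supertagger would simply prune each lexical cell to the few most probable categories before the loop runs, which is where the reported speed-up comes from.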
Parsing models
• Generative models: P(s,)
Model the process which generates the derivation 
– Advantage: easy to guarantee consistency
– Disadvantage: requires good smoothing techniques,
difficult to include complex features
Good baseline
• Conditional models: P( |s)
Given a sentence s, predict most likely derivation 
– Advantage: more natural for parsing
– Disadvantage: large model size, difficult to estimate
Evaluation: recovery of dependency structures
                 Labelled   Unlabelled
  Generative     83.3       90.3        (Hockenmaier and Steedman, 2002)
  Conditional    84.6       91.2        (Clark and Curran, 2004)

This includes long-range dependencies
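Labelled vs. unlabelled dependency recovery can be scored as in this sketch (illustrative F-score code, not the evaluation software used in the cited papers); dependencies are (head, label, dependent) triples:

```python
def f_score(gold, test, labelled=True):
    """F1 over sets of (head, label, dependent) triples; the
    unlabelled score simply drops the label before comparing."""
    if not labelled:
        gold = {(h, d) for h, _, d in gold}
        test = {(h, d) for h, _, d in test}
    correct = len(set(gold) & set(test))
    if correct == 0:
        return 0.0
    p = correct / len(test)     # precision
    r = correct / len(gold)     # recall
    return 2 * p * r / (p + r)

# Hypothetical gold/test dependency sets for "just opened doors":
gold = {("opened", "ARG1", "doors"), ("just", "ARG1", "opened")}
test = {("opened", "ARG2", "doors"), ("just", "ARG1", "opened")}
# labelled F1 = 0.5, unlabelled F1 = 1.0
```

The gap between the two rows in the table above is exactly this label penalty, averaged over the test set.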
ccg2sem: from CCG to DRT
• A Prolog package which translates CCGbank
derivations into Discourse Representation Theory
structures (Bos, 2005)
CCGbanks for other languages
• German (Hockenmaier, 2006):
– Translation of German TIGER corpus into CCG.
– Many crossing dependencies, etc.:
context-free approximations are inappropriate
– Current coverage: 92.4% of all graphs
(excluding headlines, fragments etc.)
• Turkish (Cakici, 2005):
– Extracts a CCG lexicon from the METU Sabanci Treebank.
A few references
General CCG references:
M. Steedman (2000). The Syntactic Process, MIT Press.
M. Steedman (1996). Surface Structure and Interpretation, MIT Press.
CCGbank(s) and wide-coverage CCG parsing:
J. Hockenmaier and M. Steedman (2005). CCGbank: User’s Manual, MS-CIS-05-09, Dept. of
Computer and Information Science, University of Pennsylvania.
J. Hockenmaier and M. Steedman (2002). Acquiring Compact Lexicalized Grammars from a
Cleaner Treebank, LREC, Las Palmas, Spain.
J. Hockenmaier (2003). Data and Models for Statistical Parsing with Combinatory
Categorial Grammar. PhD thesis, Informatics, University of Edinburgh.
J. Hockenmaier and M. Steedman (2002). Generative Models for Statistical Parsing with
Combinatory Categorial Grammar, ACL ‘02, Philadelphia, PA, USA.
S. Clark and J. R. Curran (2004). Parsing the WSJ using CCG and Log-Linear Models, ACL '04,
Barcelona, Spain.
S. Clark and J. R. Curran (2004). The Importance of Supertagging for Wide-Coverage CCG
Parsing. Coling ’04, Geneva, Switzerland.
J. Bos (2005). Towards Wide-Coverage Semantic Interpretation. IWCS-6.
R. Cakici (2005). Automatic Induction of a CCG Grammar for Turkish.
ACL Student Research Workshop, Ann Arbor, MI, USA.
J. Hockenmaier (2006). Creating a CCGbank and a wide-coverage CCG lexicon for German.
ACL/COLING ‘06, Sydney, Australia.
More references
• The CCG website:
http://groups.inf.ed.ac.uk/ccg
with lots of general references about CCG
(as well as CCGbank, CCG parsing, etc.)
• CCGbank is available from the Linguistic Data
Consortium (LDC) at the University of Pennsylvania.
Penn-II-Based Acquisition of HPSG Resources
Head-Driven Phrase Structure
Grammar
Penn-II-Based Acquisition of HPSG Resources
• Introduction
• Treebank conversion and HPSG annotation
• Lexicon extraction
• Probabilistic models
  – Feature forest model
  – Design of features
• Parsing
• Evaluation
• Advanced topics
Introduction
• If we had an HPSG version of Penn-II, we could
obtain lexical entries and probabilistic models
• How do we get an HPSG-annotated Penn-II?
• Converting Penn-II into an HPSG-conformant
treebank
• How do we verify the conformity with the HPSG
theory?
• Principles are exploited for the verification
– Implementation of principles is relatively easy, while
construction of the lexicon is extremely difficult
– Principles are hand-coded, while lexical entries are
acquired from a converted treebank
Introduction
• We develop a treebank rather than a lexicon
• A treebank provides more information than a
lexicon
– Verification of the consistency of the grammar
– Statistics
(Diagram: Principles + Treebank → Lexicon)
Methodology
(Diagram: a grammar writer supplies the Principles; Treebank conversion
turns the treebank into an HPSG treebank; Lexicon extraction then yields
lexical entries such as pretty/JJ and database/NN.)
Comparison with conventional grammar development
(Diagram: in manual development, the grammar writer edits Principles and
Lexicon and verifies them by parsing a corpus; in treebank-based
development, the grammar writer writes only the Principles, and the
Lexicon is produced by a lexicon extractor from the treebank.)
Treebank conversion and HPSG annotation
• Convert Penn-style parse trees into HPSG-style
parse trees
– Correcting frequent errors in Penn Treebank
• Ex. Confusion of VBD/VBN
– Converting tree structures
• Small clauses, passives, NP structures, auxiliary/control verbs,
LDDs, etc.
– Mapping into HPSG-style representations
• Head/argument/modifier distinction, schema name assignment
• Mapping into HPSG categories
– Applying HPSG principles/schemas
• Undetermined features are filled
• Violations of feature constraints are detected
Overview
(Figure: the sentence “NL is officially making the offer”. Error
correction and tree conversion mark heads, arguments and modifiers in
the Penn tree; mapping into HPSG-style representations assigns
categories and schema names (subject-head, head-comp, head-mod);
principle application then fills in the SUBJ/COMPS feature values at
every node.)
Tree conversion
• Coordination, quotation, insertion, and apposition
• Small clauses, “than” phrases, quantifier phrases,
complementizers, etc.
• Disambiguation of non-/pre-terminal symbols (TO,
etc.)
• HEAD features (CASE, INV, VFORM, etc.)
• Noun phrase structures
• Auxiliary/control verbs
• Subject extraction
• Long distance dependencies
• Relative clauses, reduced relatives
Pattern-based tree conversion
tree_transform_rule("predicative", $Input, $Output) :-
    tree_match(TREE_NODE\$Node &
               TREE_DTRS\[tree_any & ANY_TREES\$LeftTrees,
                          (TREE_NODE\SYM\"S" &
                           TREE_DTRS\($PRDTrees &
                                      [tree_any,
                                       tree & TREE_NODE\FUNC\"PRD",
                                       tree_any])),
                          tree_any & ANY_TREES\$RightTrees],
               $Input),
    append_list([$LeftTrees, $PRDTrees, $RightTrees], $Dtrs),
    $Output = TREE_NODE\$Node & TREE_DTRS\$Dtrs.
(Tree pattern: in “He considered himself superior”, the S small clause
under the VP is spliced into the VP, so that considered takes the NP and
ADJP-PRD daughters as complements directly.)
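The same transformation can be mimicked in Python (a toy rendering of the pattern-based rule above, with trees as (label, children) tuples and leaves as one-element tuples; names are illustrative):

```python
def splice_predicative(tree):
    """Lift the daughters of an S small clause containing a -PRD
    constituent into the parent node, as the "predicative" rule does."""
    label, kids = tree
    out = []
    for k in kids:
        if (isinstance(k, tuple) and len(k) == 2 and k[0] == "S"
                and any(isinstance(c, tuple) and len(c) == 2
                        and "-PRD" in c[0] for c in k[1])):
            out.extend(k[1])          # splice the S daughters in
        else:
            out.append(k)
    return (label, out)

vp = ("VP", [("considered",),
             ("S", [("NP", [("himself",)]),
                    ("ADJP-PRD", [("superior",)])])])
# splice_predicative(vp) flattens the small clause under "considered"
```

The real implementation works over typed feature structures, but the control flow — match a pattern, rebuild the daughter list — is the same.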
Passive
• “be + VBN” constructions are assigned
“VFORM passive”
(Tree: “the details have n’t been worked out” — the VP headed by
worked/VBN is assigned VFORM passive, and the null object *-2 is linked
to the surface subject NP-SBJ-2.)
Noun phrase structures
• Determiners are raised
• Possessive structures are explicitly represented
(Trees: “Monsanto ’s director of plant sciences” — the flat NP is
rebuilt so that the possessive DP “Monsanto ’s” specifies an N’
“director of plant sciences”.)
Auxiliary/control verbs
• Auxiliary/control verbs are annotated as taking
unsaturated constituents
(Trees: in “they did n’t choose to …”, the auxiliary did, infinitival
to, and have each take an unsaturated VP complement whose SUBJ value is
structure-shared with the matrix subject, replacing the *-1 null
subject.)
Subject extraction
• HPSG does not allow subject extraction
• Relativizers are treated as ordinary subjects in
relative clauses
(Trees: “The company which *T*-1 has reported net losses” — the subject
trace *T*-1 is removed, and the relativizer which becomes the ordinary
subject of the relative clause.)
Subject relative
• Relativizers have a non-empty list in REL
• The element of REL is consumed in a head-relative
construction and represents the relative-antecedent
relation
(Tree: “The company which has reported net losses” — which carries
REL < 2 >; the REL element is consumed where the relative clause
combines with its antecedent NP.)
LDDs: Object relative
• SLASH represents moved arguments
• REL represents relative-antecedent relations
(Tree: “the energy and ambitions that reformers wanted *-2 to reward
*T*-3” — SLASH propagates the extracted NP down to reward, while REL
links the relativizer that to its antecedent.)
Mapping into HPSG-style representations
• Convert nonterminal symbols into HPSG-style
categories
  NN   →  HEAD: noun,  AGR: 3sg
  VBD  →  HEAD: verb,  VFORM: finite,  TENSE: past
• Assign schema names to internal nodes
Category mapping & schema name assignment
• Example: “NL is officially making the offer”
(Figure: the Penn tree of “NL is officially making the offer” is
annotated with head/arg/mod marks, each node is mapped to an HPSG
category (HEAD verb, SUBJ < 1 >, …), and schema names — subject-head,
head-comp, head-mod — are assigned to the internal nodes.)
Principle application
inverse_schema_binary(subj_head_schema, $Mother, $Left, $Right) :-
    $Left = (SYNSEM\($LeftSynsem &
                     LOCAL\CAT\(HEAD\MOD\[] &
                                VAL\(SUBJ\[] & COMPS\[] & SPR\[])))),
    $Right = (SYNSEM\LOCAL\CAT\(HEAD\$Head &
                                VAL\(SUBJ\[$LeftSynsem] & COMPS\[] & SPR\[]))),
    $Mother = (SYNSEM\LOCAL\CAT\(HEAD\$Head &
                                 VAL\(SUBJ\[] & COMPS\[] & SPR\[]))).
(Figure: applying the subject-head schema to “He considered …” — the
subject’s synsem is structure-shared with the element of the head
daughter’s SUBJ < HEAD: noun >, and the mother’s SUBJ list is emptied.)
Principle application
(Figure: principle application on “NL is officially making the offer” —
undetermined SUBJ/COMPS/MOD values at each node are filled in by the
schemas, yielding a fully specified HPSG parse tree.)
Complicated example
(Figure: “the prices we were charged” — a relative clause combining
passive (*-2) and extraction (*T*-1); the determiner is raised via SPR,
and the SLASH and REL values are threaded through the S and VP nodes.)
Lexicon extraction
• Collecting leaf nodes of HPSG parse trees
• Generalizing leaf nodes into lexical entry templates
• Applying inverse lexical rules
• Assigning predicate argument structures
Overview
(Figure: from the HPSG parse tree of “NL is officially making the
offer”, leaf nodes are collected and generalized — e.g. making: HEAD
verb, SUBJ < 1 >, COMPS < 2 > — inverse lexical rules then yield the
lexeme entry make: HEAD verb, SUBJ < HEAD noun >, COMPS < HEAD noun >,
and the predicate argument structure make’(ARG1 1, ARG2 2) is
assigned.)
Collecting leaf nodes
• Leaf nodes of HPSG parse trees are instances of
lexical entries
(Figure: the leaf nodes of the HPSG parse tree of “NL is officially
making the offer”, each a fully specified sign, are collected as
instances of lexical entries.)
Generalization into lexical entry templates
• Unnecessary constraints are removed (restriction)
lexical_entry_template($WordInfo, $Sign, $Template) :-
    copy($Sign, $Template),
    $Template = (SYNSEM\LOCAL\(CAT\HEAD\$Head &
                               VAL\(SUBJ\$Subj & COMPS\$Comps & SPR\$SPR))),
    ...
    restriction($SubjSynsem, [NONLOCAL\]),
    restriction($SubjSynsem, [LOCAL\, CAT\, HEAD\, POSTHEAD\]),
    restriction($SubjSynsem, [LOCAL\, CAT\, HEAD\, AUX\]),
    restriction($SubjSynsem, [LOCAL\, CAT\, HEAD\, TENSE\]),
    ...
  Leaf node of the HPSG treebank:       Lexical entry template:
  HEAD: verb                            HEAD: verb
  SUBJ: < HEAD: noun,                   SUBJ: < HEAD: noun >
          POSTHEAD: minus >
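The restriction step above amounts to deleting irrelevant features from a copied sign. A toy sketch (assuming AVMs as nested Python dicts; the real grammar uses typed feature structures):

```python
import copy

def restrict(avm, path):
    """Delete the feature at `path` (a list of feature names) from a
    nested-dict AVM, if present; other features are left untouched."""
    node = avm
    for key in path[:-1]:
        node = node.get(key, {})
    node.pop(path[-1], None)

# A leaf-node sign and its generalization into a template:
leaf = {"HEAD": "verb",
        "SUBJ": [{"HEAD": "noun", "POSTHEAD": "minus"}]}
template = copy.deepcopy(leaf)
restrict(template["SUBJ"][0], ["POSTHEAD"])
# template == {"HEAD": "verb", "SUBJ": [{"HEAD": "noun"}]}
```

Copying first matters: the treebank sign must keep its full constraints while the extracted template generalizes over them.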
Application of inverse lexical rules
• Converting lexical entries of inflected words into
lexical entries of lexemes using inverse lexical rules
• Derivational rules: Ex. passive rule
  passive entry:                       base (lexeme) entry:
  HEAD: verb                           HEAD: verb
  SUBJ: < HEAD: noun >           ⇒    SUBJ: < HEAD: noun >
  COMPS: < HEAD: prep_by >             COMPS: < HEAD: noun >
• Inflectional rules: Ex. past-tense rule
  HEAD: verb, VFORM: finite, TENSE: past  ⇒  HEAD: verb, VFORM: base
Predicate argument structures
• Create mappings from syntactic arguments into
semantic arguments
Ex. lexical entry for “make”
  make:
    CAT  [ HEAD verb
           VAL [ SUBJ  < [ CAT|HEAD noun, CONT [1] ] >
                 COMPS < [ CAT|HEAD noun, CONT [2] ] > ] ]
    CONT  make’ [ ARG1 [1], ARG2 [2] ]
Probabilistic models
• Feature forest model
– A solution to the problem of the probabilistic modeling of
feature structures
• Design of features
– How to represent preferences of HPSG parse trees
Example: PCFG
Training data:      She dances   I dance   She danced   I danced
Observed freq.      0.3          0.3       0.2          0.2
Estimated prob.     0.15         0.15      0.2          0.2

CFG rule probabilities:
  S → NP VP     1.0
  NP → She      0.5       VP → dances   0.3
  NP → I        0.5       VP → dance    0.3
                          VP → danced   0.4
What is the problem?
Training data:      She dances   I dance   She danced   I danced
Observed freq.      0.3          0.3       0.2          0.2
Estimated prob.     0.15         0.15      0.2          0.2
• PCFG assigns probabilities to ungrammatical
structures
– “She dance” (0.15), “I dances” (0.15)
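The leaked probability mass can be reproduced in a few lines (a toy PCFG scorer over the rule table above):

```python
# The CFG rule probabilities estimated from the training data.
rules = {("S", ("NP", "VP")): 1.0,
         ("NP", ("She",)): 0.5, ("NP", ("I",)): 0.5,
         ("VP", ("dances",)): 0.3, ("VP", ("dance",)): 0.3,
         ("VP", ("danced",)): 0.4}

def tree_prob(tree):
    """tree = (label, [subtrees]); terminals are plain strings.
    A PCFG multiplies the probabilities of the rules used."""
    label, kids = tree
    rhs = tuple(k if isinstance(k, str) else k[0] for k in kids)
    p = rules[(label, rhs)]
    for k in kids:
        if not isinstance(k, str):
            p *= tree_prob(k)
    return p

she_dance = ("S", [("NP", ["She"]), ("VP", ["dance"])])
# tree_prob(she_dance) == 1.0 * 0.5 * 0.3 == 0.15:
# the ungrammatical "She dance" gets probability mass.
```

Because NP and VP are expanded independently, agreement between them cannot be enforced, which is exactly the problem the feature constraints on the next slide address.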
Feature structure constraints
• In HPSG, feature structures explain grammatical
constraints
  S → NP[AGR 1] VP[AGR 1]
  NP[AGR 3sg]   → She        VP[AGR 3sg]   → dances
  NP[AGR no3sg] → I          VP[AGR no3sg] → dance
                             VP            → danced
• “She dance” and “I dances” are never generated
• However, constraints of feature structures violate
“independence assumption” of probabilistic models
(Abney 1997)
How can we estimate probabilities
in this situation?
Solution: ME model
• Probabilities of parse trees are estimated by
maximum entropy models (Berger et al. 1996)
• Probability p(T) of parse tree T:

    p(T) = (1/Z) exp( Σ_i λ_i f_i(T) )

  where f_i is a feature function, λ_i its parameter (feature weight),
  and Z the normalization factor.
• Optimal parameters are computed so as to
maximize the likelihood of training data
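The formula can be computed directly over a finite candidate set (a toy sketch with hypothetical feature functions and weights, not a trained model):

```python
import math

def me_probs(trees, weights, feats):
    """p(T) = exp(sum_i w_i * f_i(T)) / Z over a finite set of trees."""
    def score(t):
        return math.exp(sum(w * f(t) for w, f in zip(weights, feats)))
    z = sum(score(t) for t in trees)        # normalization factor Z
    return {t: score(t) / z for t in trees}

# Candidate "trees" stand in as strings for brevity.
trees = ["She dances", "I dance", "She danced", "I danced"]
feats = [lambda t: t.endswith("danced"),    # f1: past tense
         lambda t: t.startswith("She")]     # f2: 3sg subject
probs = me_probs(trees, [0.0, 0.0], feats)  # all-zero weights: uniform
```

Training then amounts to adjusting the weights so that these probabilities match the observed frequencies as closely as possible.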
ME model of parse trees
• If feature functions correspond to CFG rules, this
model is an extension of PCFG model
• Probabilities of parse trees are estimated without
independence assumption
With feature functions f1 (on S), f2 (on NP) and f3 (on VP) for
“She dances”:

    p(T) = (1/Z) exp( λ1 f1(T) + λ2 f2(T) + λ3 f3(T) )
         = (1/Z) a1^f1(T) a2^f2(T) a3^f3(T)        (a_i = exp(λ_i))
Estimation by a ME model
Training data:        She dances   I dance   She danced   I danced
Observed freq.        0.3          0.3       0.2          0.2
exp(Σ_i λ_i f_i)      1.145        1.145     0.763        0.763
Estimated prob.       0.3          0.3       0.2          0.2

ME parameters exp(λ_i):
  S → NP[AGR 1] VP[AGR 1]   1.0
  NP[AGR 3sg] → She         1.0     VP[AGR 3sg] → dances   1.145
  NP[AGR no3sg] → I         1.0     VP[AGR no3sg] → dance  1.145
                                    VP → danced            0.763
Combinatorial explosion of parse trees
• Exponentially many parse trees are assigned to
sentences (i.e., a set of T is exponential)
(Figure: by expanding a packed node S over the daughter choices
{NP1, NP2} × {VP1, VP2}, one obtains n·m distinct trees.)
Problems by combinatorial explosion
• Parameter estimation is intractable
  – Computation of Z = Σ_T exp( Σ_i λ_i f_i(T) )
  – Computation of E(f_i) = Σ_T f_i(T) p(T)
• Searching for the most probable parse is intractable
  – Computation of T̂ = argmax_T p(T)
Solutions in HMM and PCFG
• Probabilistic models are divided into independent probabilities, and dynamic programming is applied
  – Forward-backward probability
  – Baum-Welch algorithm
  – Inside-outside probability
  – Viterbi search
• Inside/outside probabilities can be computed at a cost proportional to the number of nodes, assuming a forest structure of parse trees
279
Feature forest model
• Dynamic programming can also be applied to maximum entropy estimation
• Feature forest:
  – A forest structure isomorphic to a CFG parse forest
  – Feature functions are assigned to nodes rather than symbols: f(S), f(NP_1), f(NP_2), f(VP_1), f(VP_2), … (size: n+m)
• A ME model is estimated without unpacking feature forests
Feature forest representation of a parse tree
• A feature forest represents exponentially many trees of features: a packed forest with n NP nodes and m VP nodes has size n+m, but stands for the nm feature trees obtained by unpacking it
Inside/outside trees of a feature forest
• Focus on the set of trees below/above the targeted node
• Inside trees T_I(n): trees below n
• Outside trees T_O(n): trees above n
• Example: in the feature forest, the inside trees T_I(NP_1) lie below the node f(NP_1), and the outside trees T_O(NP_1) lie above it
Estimation algorithms for ME models
• Estimation of parameters requires computation of model expectations (Malouf 2002)

  Objective function:  G(λ) = (1/|D|) Σ_{x_k ∈ D} ( Σ_i λ_i f_i(x_k) − log Z )

  Gradient:  ∂G(λ)/∂λ_i = Ẽ(f_i) − E(f_i)

  where  Ẽ(f_i) = (1/|D|) Σ_{x_k ∈ D} f_i(x_k)   is computed from the training data, and
         E(f_i) = Σ_x f_i(x) p(x)                is recomputed at each iteration
Inside/outside products
• Unnormalized product

    q(x) = exp( Σ_i λ_i f_i(x) )

• Inside product

    φ(NP_1) = Σ_{T_I(NP_1)} exp( Σ_i λ_i f_i(T_I(NP_1)) )

• Outside product

    ψ(NP_1) = Σ_{T_O(NP_1)} exp( Σ_i λ_i f_i(T_O(NP_1)) )
Computation of inside products
• The inside product of NP_1 is a product of the inside products of its daughters:

    φ(NP_1) = ( φ(N_1)φ(N'_1) + φ(N_1)φ(N'_2) + φ(N_2)φ(N'_1) + φ(N_2)φ(N'_2) + … ) · exp( Σ_i λ_i f_i(NP_1) )
            = ( Σ_{n ∈ {N_1, N_2, …}} φ(n) ) ( Σ_{n' ∈ {N'_1, N'_2, …}} φ(n') ) · exp( Σ_i λ_i f_i(NP_1) )
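A minimal sketch of this recursion in Python, assuming a toy packed forest in which each node's score exp(Σ_i λ_i f_i(node)) has been precomputed (all names and numbers are illustrative, not from the course materials):

```python
# Each conjunctive node has a precomputed weight exp(sum_i lambda_i f_i(node))
# ("score") and a list of daughter disjunctive nodes; a disjunctive node is a
# list of alternative conjunctive nodes.
forest = {
    "NP1": {"score": 1.0, "dtrs": ["D_N", "D_N'"]},
    "N1":  {"score": 2.0, "dtrs": []},
    "N2":  {"score": 3.0, "dtrs": []},
    "N'1": {"score": 0.5, "dtrs": []},
    "N'2": {"score": 1.5, "dtrs": []},
}
disj = {"D_N": ["N1", "N2"], "D_N'": ["N'1", "N'2"]}

def inside(node):
    """phi(node) = score(node) * prod over daughters of (sum of alternatives)."""
    phi = forest[node]["score"]
    for d in forest[node]["dtrs"]:
        phi *= sum(inside(c) for c in disj[d])
    return phi

# inside("NP1") = 1.0 * (2.0 + 3.0) * (0.5 + 1.5) -> 10.0
```

Each node is visited once, so the cost is linear in the forest size rather than in the (exponential) number of unpacked trees.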
Computation of outside products
• The outside product of NP_1 is a product of the mother's outside product and the sisters' inside products:

    ψ(NP_1) = ( ψ(S)φ(VP_1) + ψ(S)φ(VP_2) + … ) · exp( Σ_i λ_i f_i(S) )
            = ψ(S) ( Σ_{n ∈ {VP_1, VP_2, …}} φ(n) ) · exp( Σ_i λ_i f_i(S) )
Computation of model expectations
• The sum of the unnormalized products of all trees including NP_1:

    Σ_{T : T includes NP_1} q(T) = ψ(NP_1) φ(NP_1)

• Expectation of f_i at NP_1:

    E_{NP_1}(f_i) = f_i(NP_1) · (1/Z) · ψ(NP_1) φ(NP_1)
Viterbi search
• Almost the same as the computation of inside products
  – "max" rather than "sum"

    max q(NP_1) = ( max_{n ∈ {N_1, N_2, …}} q(n) ) ( max_{n' ∈ {N'_1, N'_2, …}} q(n') ) · exp( Σ_i λ_i f_i(NP_1) )
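The same recursion with max in place of sum can be sketched on the same kind of toy forest (illustrative data, not from the course materials):

```python
# Toy packed forest with precomputed node scores exp(sum_i lambda_i f_i(node)).
forest = {
    "NP1": {"score": 1.0, "dtrs": ["D_N", "D_N'"]},
    "N1":  {"score": 2.0, "dtrs": []},
    "N2":  {"score": 3.0, "dtrs": []},
    "N'1": {"score": 0.5, "dtrs": []},
    "N'2": {"score": 1.5, "dtrs": []},
}
disj = {"D_N": ["N1", "N2"], "D_N'": ["N'1", "N'2"]}

def viterbi(node):
    """max q(node): like the inside product, but taking the max over the
    alternatives of each disjunctive daughter instead of the sum."""
    best = forest[node]["score"]
    for d in forest[node]["dtrs"]:
        best *= max(viterbi(c) for c in disj[d])
    return best

# best = 1.0 * max(2.0, 3.0) * max(0.5, 1.5) -> 4.5
```

Keeping back-pointers at each max would recover the most probable tree itself, not just its score.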
Design of features
• Feature engineering is important for higher
accuracy
• Feature functions are designed for capturing
syntactic/semantic preferences of HPSG parse trees
A chart for HPSG parsing
• Equivalent signs are packed
[Figure: parse chart for "he saw a girl with a telescope"; edges are labeled with HPSG signs, e.g. HEAD verb, SUBCAT <NP> for "saw a girl", HEAD prep, MOD VP, SUBCAT <> and HEAD prep, MOD NP, SUBCAT <> for the two attachments of "with a telescope", and HEAD verb, SUBCAT <> for the whole sentence]
Feature forest representation of a chart
• Node = each rule application
[Figure: the feature forest derived from the chart for "he saw a girl with a telescope"; each node corresponds to one rule application over packed HPSG signs, so the two attachments of "with a telescope" become alternative nodes within a shared structure]
Feature forest representation of predicate argument structures
• Node = already-determined predicate argument relations
[Figure: feature forest for "She ignored the fact that I wanted to dispute"; nodes carry partial predicate argument structures such as want(ARG1: I, ARG2: dispute) and dispute(ARG1: I), packing the readings in which "dispute" does or does not take "the fact" as its ARG2]
Extraction of probabilistic events
extract_binary_event("hpsg-forest", "bin", $Rule, $LSign, $RSign, _, _, $Event) :-
    $Event = [$RuleName, $Dist, $Depth|$HDtrFeatures],
    find_head($Rule, $LSign, $RSign, $Head, $NonHead),
    rule_name_mapping($Rule, $Head, $NonHead, $RuleName),
    encode_distance($LSign, $RSign, $Dist),
    encode_depth($LSign, $RSign, $Depth),
    encode_sign($Head, $HDtrFeatures, $NDtrFeatures),
    encode_sign($NonHead, $NDtrFeatures, []).

Example: for the tree [S [NP Cool boys] [VP [ADVP never] ran]], the subject-head schema application yields the event

<subj-head, 2, 1, VP, ran, VBD, V_intrans-past, 2, NP, boys, NNS, N_plural, 2>

(schema name; distance; depth; then, for each daughter: nonterminal symbol (NTS), head word, POS, lexical entry, span)
Atomic features
• RULE: name of applied rule
• DIST: distance between head words
• COMMA: whether the phrase includes commas
• SPAN: number of words the phrase dominates
• SYM: nonterminal symbol (e.g. S, VP, …)
• WORD: head word
• POS: part-of-speech
• LE: lexical entry
• ARG: argument label (ARG1, ARG2, ...)
294
Example: syntactic features
• Feature for the Head-Modifier construction combining "saw a girl" and "with a telescope":

    f = <RULE, DIST, COMMA, SPAN_l, SYM_l, WORD_l, POS_l, LE_l, SPAN_r, SYM_r, WORD_r, POS_r, LE_r>
      = <head-modifier, 3, 0, 3, VP, saw, VBD, transitive, 3, PP, with, IN, vp-mod-prep>

[Figure: the chart edges for "he saw a girl with a telescope" involved in this rule application, with signs such as HEAD verb, SUBCAT <NP> and HEAD prep, MOD VP, SUBCAT <>]
Example: semantic features
• Feature for the predicate argument relation between "he" and "saw" (saw: ARG1 = he, ARG2 = girl):

    f_pa = <ARG, DIST, WORD_h, POS_h, LE_h, WORD_n, POS_n, LE_n>
         = <ARG1, 1, saw, VBD, transitive, he, PRP, pronoun>
Feature generation
• Features are generated by abstracting descriptions
of probabilistic events
feature_mask("hpsg-forest", "bin",
[1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0]).
feature_mask("hpsg-forest", "bin",
[1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0]).
feature_mask("hpsg-forest", "bin",
[1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0]).
For example, applying the first mask to the event

<subj-head, 2, 1, VP, ran, VBD, V_intrans-past, 2, NP, boys, NNS, N_plural, 2>

yields the generalized feature

<subj-head, 2, _, _, ran, VBD, V_intrans-past, _, _, boys, NNS, N_plural, _>
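The masking step can be sketched as follows (a hypothetical helper, not the actual LiLFeS implementation):

```python
def apply_mask(event, mask):
    """Generalize an event tuple: keep fields where mask == 1, blank the rest."""
    return tuple(v if m else "_" for v, m in zip(event, mask))

event = ("subj-head", "2", "1", "VP", "ran", "VBD", "V_intrans-past",
         "2", "NP", "boys", "NNS", "N_plural", "2")
mask = (1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0)
feat = apply_mask(event, mask)
# -> ("subj-head", "2", "_", "_", "ran", "VBD", "V_intrans-past",
#     "_", "_", "boys", "NNS", "N_plural", "_")
```

Each mask thus defines one feature template; the same event contributes one feature instance per mask.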
Parsing
• Efficient processing of feature structures (details
omitted)
– Abstract machines, quick check, CFG filtering, etc.
• Efficient search with probabilistic HPSG
– Beam thresholding
– Iterative beam thresholding
Beam thresholding
• Thresholding out edges in each cell of the chart
– Thresholding by number: for each cell, keep only the
best n edges
– Thresholding by width: for each cell, keep only the edges whose figure of merit (FOM) is within w of the best FOM in the same cell
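Both pruning criteria can be sketched together (a hypothetical helper; the parameter values n and w are arbitrary):

```python
def prune_cell(edges, n, w):
    """Beam-threshold one chart cell.

    edges: list of (edge, fom) pairs.  Keep at most the n best edges
    (thresholding by number), and drop any edge whose FOM is more than
    w below the best FOM in the cell (thresholding by width).
    """
    ranked = sorted(edges, key=lambda e: e[1], reverse=True)
    if not ranked:
        return []
    best_fom = ranked[0][1]
    return [(e, f) for e, f in ranked[:n] if best_fom - f <= w]

cell = [("a", 0.0), ("b", -2.0), ("c", -7.5), ("d", -1.0)]
kept = prune_cell(cell, n=10, w=5.0)
# keeps "a", "d", "b"; "c" falls more than 5.0 below the best FOM
```

In a real parser this pruning runs on every cell as the chart is filled, so edges discarded early never spawn larger constituents.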
Effect of beam thresholding
• Precision and recall by changing parameters of
beam search
• As the beam narrows, recall drops while precision is retained
Iterative beam thresholding
• Start with a narrow beam width
• Continue widening a beam width until parsing
succeeds
Iterative_parse(sentence) {
  w := beam_width_start;
  while (w < beam_width_end) {
    parse(sentence, w);
    if (parse succeeds) return;
    w := w + beam_width_step;
  }
}
Efficacy of iterative beam thresholding
• Evaluated on Penn Treebank Section 24 (< 15 words)

             Precision  Recall  F-score  Avg. time (ms)
  Viterbi    88.2%      87.9%   88.1%    103,923
  Beam       89.0%      82.4%   85.5%    88
  Iterative  87.6%      87.2%   87.4%    99
Distribution of parsing time
• Black: Viterbi, Red: iterative beam thresholding
[Figure: scatter plot of parsing time (ms, log scale up to 10^8) against sentence length (0-15 words); iterative beam thresholding is orders of magnitude faster than Viterbi]
Evaluation
• Evaluation of the lexical entries extracted from
Penn Treebank
– Investigation of obtained lexical entries
– Coverage
• Evaluation of the disambiguation model
– Parsing accuracy
Experimental settings
• Training data: Sections 2-21 of Penn Treebank II
(39,832 sentences)
• Test data:
– Development set: Section 22 (1,700 sentences)
– Final test set: Section 23 (2,416 sentences)
Number of tree conversion rules
Target of conversion                 Number
Penn-II errors                       102
Category mapping                     85
Head annotation and binarization     63
Difference of phrase structures      15
Predicate argument structures        13
Long distance dependencies           13
Others                               52
Total                                343
Result of treebank conversion & lexicon extraction
• Treebank conversion and HPSG annotation succeeded for 37,886 sentences
• Extracted lexicon:

  # words                          34,765
  # lexical entries                1,942
  Average # lexical entries/word   1.43
Sources of treebank conversion failures
• Classification of failures of treebank conversion in Section 02 (67 failures / 1,989 sentences)

  Shortcomings of tree conversion rules   18
  Errors in Penn Treebank                 16
  Constructions currently unsupported     20
  Constructions unsupported by HPSG       13
Breakdown of extracted lexical entries
               # words   # lexical entries   Avg. # lex. entries
  noun         21,925    186                 1.14
  verb         4,094     945                 1.94
  adjective    8,078     62                  1.28
  adverb       1,295     72                  2.75
  preposition  159       193                 9.17
  particle     58        10                  1.69
  determiner   36        33                  3.86
  conjunction  94        321                 9.46
  punctuation  15        120                 22.00
  Total        34,765    1,942               1.43
Example lexical entries
Common noun (e.g. review/NN), appeared 140,805 times:
  HEAD noun [MOD <>]
  VAL [SPR <HEAD det>, SUBJ <>, COMPS <>]

Transitive verb (base form), appeared 12,244 times:
  HEAD verb [MOD <>, VFORM base]
  VAL [SPR <>, SUBJ <HEAD noun>, COMPS <HEAD noun>]

Pre-head adjective, appeared 55,049 times:
  HEAD adj [MOD <HEAD noun>, POSTHEAD −]
  VAL [SPR <>, SUBJ <>, COMPS <>]
Evaluation of coverage
• The ratio of lexical entries in the test data covered by the grammar is measured
• A sentence is covered when all of the lexical entries in the sentence are covered (strong coverage)

                              Lexical entry   Sentence
  w/o unknown word handling   96.52%          54.7%
  w/ unknown word handling    99.15%          84.8%
Treebank size vs. coverage
[Figure: coverage (%) against number of training sentences (0-40,000); lexical-entry coverage approaches 100% quickly, while sentence coverage grows more slowly]
Sentence length vs. coverage
[Figure: sentence coverage (%) against sentence length (0-60 words); coverage decreases for longer sentences]
Error analysis
• Classification of randomly selected uncovered lexical entries

  Errors of Penn Treebank                  10
  Errors of treebank conversion            48
  Lack of lexical entries                  23
  Constructions currently unsupported      9
  Idioms                                   6
  Non-linguistic expressions (e.g. lists)  4
Examples of uncovered lexical entries
• Lack of mappings from words into lexical entries
because of data sparseness
– Post-noun adjectives (younger, crucial)
– Coordination conjunctions of NP and S’
– Verbs taking present-participle as a complement
• Unsupported constructions
– Free relatives, extrapositions
• Incorrect lexical entries obtained because of
idiomatic expressions
– (ADVP in part) because …
Evaluation of parsing accuracy
• Empirical evaluation of the probabilistic models
  – Overall accuracy
  – Treebank size vs. accuracy
  – Sentence length vs. accuracy
  – Contribution of features
  – Coverage and accuracy
  – Error analysis
• Measure: precision/recall of <predicate word, argument position, argument word, predicate type> tuples
  – e.g. <saw, ARG1, he, transitive> where saw: ARG1 = he, ARG2 = girl
Effect of feature forest models
• Accuracy for Section 23 (< 40 words)

                            Precision  Recall
  baseline                  78.10      77.39
  with syntactic features   86.92      86.28
  with semantic features    84.29      83.74
  with all features         86.54      86.02
Treebank size vs. accuracy
[Figure: precision/recall (%) against number of training sentences (0-40,000); accuracy improves steadily as the treebank grows]
Sentence length vs. accuracy
[Figure: accuracy against sentence length (0-60 words); accuracy decreases for longer sentences]
Contribution of features (1/2)
           precision  recall  # features
  All      87.12      85.45   623,173
  −RULE    86.98      85.37   620,511
  −DIST    86.74      85.09   603,748
  −COMMA   86.55      84.77   608,117
  −SPAN    86.53      84.98   583,638
  −SYM     86.90      85.47   614,975
  −WORD    86.67      84.98   116,044
  −POS     86.36      84.71   430,876
  −LE      87.03      85.37   412,290
  None     78.22      76.46   24,847
Contribution of features (2/2)
                          precision  recall  # features
  All                     87.12      85.45   623,173
  −DIST,SPAN              85.54      84.02   294,971
  −DIST,SPAN,COMMA        83.94      82.44   286,489
  −RULE,DIST,SPAN,COMMA   83.61      81.98   283,897
  −WORD,LE                86.48      84.91   50,258
  −WORD,POS               85.56      83.94   64,915
  −WORD,POS,LE            84.89      83.43   33,740
  −SYM,WORD,POS,LE        82.81      81.48   26,761
  None                    78.22      76.46   24,847
Coverage and accuracy
• Accuracies for strongly covered/uncovered sentences

                        Precision  Recall  # sentences
  Covered sentences     89.36      88.96   1,825
  Uncovered sentences   75.57      74.04   319

• We can expect accuracy improvements by improving grammar coverage
322
Error analysis
• Classification of errors in 100 randomly selected sentences

  PP-attachment ambiguity              76
  Distinction of arguments/modifiers   49
  Ambiguity of lexical entries         44
  Errors in test data                  22
  Ambiguity of commas                  32
  Others                               75
323
Examples of errors (1/2)
• Antecedent of a relative clause
– It's made only in years when the grapes ripen perfectly (the last
was 1979) and comes from a single acre of [NP grapes [S' that
yielded a mere 75 cases in 1987]].
• Argument/modifier distinction of to-phrases
– More than a few CEOs say the red-carpet treatment tempts
them [VP-modifier to return to a heartland city for future
meetings].
Examples of errors (2/2)
• Preposition or verb phrase?
– Mitsui Mining & Smelting Co. posted a 62 % rise in pretax
profit to 5.276 billion yen ($ 36.9 million) in its fiscal first half
ended Sept. 30 [VP compared with 3.253 billion yen a year
earlier].
• Selection of subcategorization frames
– [NP-subject ``Nasty innuendoes,''] [VP says [NP-object John
Siegal, Mr. Dinkins's issues director, ``designed to prosecute a
case of political corruption that simply doesn't exist.'']]
Advanced topics
• Domain adaptation
– Adapting the grammar and/or the disambiguation model
to a new domain using a small amount of training data
• Generation
– Using the grammar for sentence generation
• Semantics construction
– Obtaining representations of formal semantics from
HPSG parsing
• Applications
Domain adaptation (1/2)
• Disambiguation models are adapted to a biomedical domain using a small amount of training data
  – The original probabilistic model is incorporated into the new model as a reference distribution
  – Parameters of the new model are estimated so as to maximize the likelihood of the new training data

    p_new(x) = (1/Z) p_orig(x) exp( Σ_i λ_i g_i(x) )

  where p_orig(x) is the reference distribution
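A sketch of the reference-distribution model on a toy two-parse example (all names and numbers are hypothetical, for illustration only):

```python
import math

def adapt(p_orig, feats, lam):
    """p_new(x) = (1/Z) p_orig(x) exp(sum_i lambda_i g_i(x)).

    p_orig: reference probabilities from the original (news-domain) model;
    feats:  x -> {feature: g_i(x)}, the new domain-specific features;
    lam:    weights estimated on the new domain's training data.
    """
    q = {x: p * math.exp(sum(lam.get(g, 0.0) * v
                             for g, v in feats[x].items()))
         for x, p in p_orig.items()}
    z = sum(q.values())                 # normalization over the new model
    return {x: s / z for x, s in q.items()}

# A bio-domain feature shifts probability mass toward the second parse.
p_orig = {"T1": 0.7, "T2": 0.3}
feats = {"T1": {}, "T2": {"bio_term": 1}}
p_new = adapt(p_orig, feats, {"bio_term": 1.5})
```

With all λ at zero the new model reduces to the reference distribution, so only the small in-domain corpus needs to supply corrections.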
Domain adaptation (2/2)
• Evaluation with a bio-domain corpus
• Training data:
  – Penn Treebank (News): 39,832 sentences
  – GENIA Treebank (Bio): 3,524 sentences

                                Precision  Recall
  News domain                   87.69%     87.16%
  Bio domain (w/o adaptation)   85.50%     83.91%
  Bio domain                    87.19%     85.58%
328
Generation (1/2)
• The methods for HPSG parsing are applied to a chart generator for HPSG
  – Feature forest model
  – Iterative beam thresholding
[Figure: chart parsing fills cells indexed by string spans (0-1, 0-2, …, 0-3) of "He bought the book."; chart generation fills cells indexed by subsets of the input semantics {he(x), buy(e) past(e), the(y), book(z)}, from the singletons {0}, {1}, {2}, {3} up to {0,1,2,3}]
Generation (2/2)
• Evaluation on Penn Treebank Section 23

                                 Beam width  Coverage (%)  Avg. generation time (ms)  BLEU
  Beam thresholding              4           44.76         621                        0.8196
                                 8           67.70         1,776                      0.8294
                                 12          73.12         3,074                      0.8327
                                 16          72.90         4,287                      0.8341
                                 20          71.81         5,273                      0.8333
  Iterative beam thresholding    8-20        82.47         1,668                      0.7982
330
Semantics construction (1/2)
• Mapping from HPSG parse trees into semantic
representations of typed dynamic logic (TDL)
– Typed dynamic logic: a variant of dynamic semantics that
includes plural semantics, event semantics, and situation
semantics (Bekki, 2005)
– Completely compositional semantics: lambda calculus
composes semantic representations of phrases from
lexical representations
Few boys fell. They died.
few(x)[boy'x][fall'x] ∧ ref(x)[die'x]
Semantics construction (2/2)
• Approach:
  – Mapping HPSG lexical entries into lexical representations of TDL
  – Semantic representations of phrases are composed along HPSG parse trees
• Example: the lexical entry for "loves" (PHON "loves", HEAD verb, SUBJ <HEAD noun>, COMPS <HEAD noun>) is paired with the TDL term

    λobj.λsbj. sbj(λx. obj(λy. love'(e) ∧ agent'(e, x) ∧ theme'(e, y)))

• Coverage: around 90% of Penn Treebank Section 23 is assigned well-formed semantic representations
Applications: information extraction
• Extraction of protein-protein interactions from biomedical paper abstracts (Yakushiji 2005)
  – Patterns on predicate argument structures are learned from a small amount of annotated data
  – Precision/recall: 71.8%/48.4%
[Figure: precision-recall plot comparing this system (Yakushiji 2005) with Ramani et al. (2005)]
Applications: text retrieval
• Retrieval of relational concepts
– All sentences in MEDLINE are parsed into predicate
argument structures
– Relational concepts, such as “what causes cancer”, are
retrieved by matching with predicate argument structures
– Precision/recall: 60-96%/30-50%
Summary
• Conversion of Penn Treebank II into an HPSG treebank
– Pattern-based tree conversion and principle application
• Extraction of lexical entries from the HPSG treebank
– Generalization, application of inverse lexical rules, and assignment of
predicate argument structures
• Probabilistic modeling of feature structures
– Feature forest model
• Techniques for efficient parsing with probabilistic HPSG
– Iterative beam thresholding
• Evaluation
– Coverage and parsing accuracy
• Advanced topics
– Domain adaptation, sentence generation, semantics construction,
and practical applications
Publications
• Corpus-oriented development of HPSG
– Y. Miyao, T. Ninomiya, and J. Tsujii. (2003). Lexicalized Grammar
Acquisition. In Proc. 10th EACL Companion Volume.
– Y. Miyao, T. Ninomiya, and J. Tsujii. (2004) Corpus-oriented
grammar development for acquiring a Head-Driven Phrase Structure
Grammar from the Penn Treebank. In Proc. IJCNLP 2004.
– H. Nakanishi, Y. Miyao, and J. Tsujii. (2004). Using Inverse Lexical
Rules to Acquire a Wide-coverage Lexicalized Grammar. In the
IJCNLP 2004 Workshop on “Beyond Shallow Analyses.”
– H. Nakanishi, Y. Miyao and J. Tsujii. (2004). An Empirical
Investigation of the Effect of Lexical Rules on Parsing with a
Treebank Grammar. In Proc. TLT 2004.
– K. Yoshida. (2005). Corpus-Oriented Development of Japanese HPSG
Parsers. In 43rd ACL Student Research Workshop.
Publications
• Feature forest model
– Y. Miyao and J. Tsujii. (2002) Maximum entropy estimation for
feature forests. In Proc. HLT 2002.
• Probabilistic models for HPSG
– Y. Miyao and J. Tsujii. (2003). A model of syntactic disambiguation
based on lexicalized grammars. In Proc. 7th CoNLL.
– Y. Miyao, T. Ninomiya and J. Tsujii. (2003). Probabilistic modeling of
argument structures including non-local dependencies. In Proc.
RANLP 2003.
– Y. Miyao, and J. Tsujii. (2005). Probabilistic disambiguation models
for wide-coverage HPSG parsing. In Proc. ACL 2005.
– T. Ninomiya, T. Matsuzaki, Y. Tsuruoka, Y. Miyao, and J. Tsujii. (2006).
Extremely Lexicalized Models for Accurate and Fast HPSG Parsing. In
Proc. EMNLP 2006.
Publications
• Parsing strategies for probabilistic HPSG
– Y. Tsuruoka, Y. Miyao and J. Tsujii. (2004). Towards efficient
probabilistic HPSG parsing: integrating semantic and syntactic
preference to guide the parsing. In the IJCNLP-04 Workshop on
“Beyond shallow analyses.”
– T. Ninomiya, Y. Tsuruoka, Y. Miyao, and J. Tsujii. (2005). Efficacy of
Beam Thresholding, Unification Filtering and Hybrid Parsing in
Probabilistic HPSG Parsing. In Proc. IWPT 2005.
– T. Ninomiya, Y. Tsuruoka, Y. Miyao, K. Taura, and J. Tsujii. (2006).
Fast and Scalable HPSG Parsing. Traitement automatique des
langues (TAL). 46(2).
• Domain adaptation
– T. Hara, Y. Miyao, and J. Tsujii. (2005). Adapting a probabilistic
disambiguation model of an HPSG parser to a new domain. In Proc.
IJCNLP 2005.
Publications
• Generation
– H. Nakanishi, Y. Miyao, and J. Tsujii. (2005). Probabilistic models for
disambiguation of an HPSG-based chart generator. In Proc. IWPT
2005.
• Semantics construction
– M. Sato, D. Bekki, Y. Miyao, and J. Tsujii. (2006). Translating HPSG-style Outputs of a Robust Parser into Typed Dynamic Logic. In Proc. COLING-ACL 2006 Poster Session.
• Applications
– Y. Miyao, T. Ohta, K. Masuda, Y. Tsuruoka, K. Yoshida, T. Ninomiya,
and J. Tsujii. (2006). Semantic Retrieval for the Accurate
Identification of Relational Concepts. In Proc. COLING-ACL 2006.
– A. Yakushiji, Y. Miyao, T. Ohta, Y. Tateisi, and J. Tsujii. (2006).
Automatic Construction of Predicate-Argument Structure Patterns for
Biomedical Information Extraction. In EMNLP 2006 Poster Session.
Comparing LFG, CCG, HPSG and TAG Acquisition
Demos
Future Work & Discussion
Rapid Multilingual LFG Acquisition