January 7, 2008
Dependency Parsing:
Machine Learning Approaches
Yuji Matsumoto
Graduate School of Information Science
Nara Institute of Science and Technology
(NAIST, Japan)
Basic Language Analyses (POStagging, phrase chunking, parsing)
Raw sentence
He reckons the current account deficit will narrow to only 1.8 billion in September.
Part-of-speech tagging
POS-tagged sentence
He reckons the current account deficit will narrow to only 1.8 billion in September .
PRP
VBZ
DT
JJ
NN
Base phrase-chunked sentence
NN
MD
VB
TO
RB
CD
CD
IN
NNP
.
Base phrase chunking
He reckons the current account deficit will narrow to only 1.8 billion in September .
NP VP
NP
VP
PP
NP
PP
NP
Dependency parsed sentence
Dependency parsing
He reckons the current account deficit will narrow to only 1.8 billion in September .
2
Word Dependency Parsing (unlabeled)
Raw sentence
He reckons the current account deficit will narrow to only 1.8 billion in September.
Part-of-speech tagging
POS-tagged sentence
He reckons the current account deficit will narrow to only 1.8 billion in September.
PRP
VBZ
DT
JJ
NN
NN
MD
VB
TO
RB
CD
CD
IN
NNP
.
Word dependency parsing
Word dependency parsed sentence
He reckons the current account deficit will narrow to only 1.8 billion in September .
3
Word Dependency Parsing (labeled)
Raw sentence
He reckons the current account deficit will narrow to only 1.8 billion in September.
Part-of-speech tagging
POS-tagged sentence
He reckons the current account deficit will narrow to only 1.8 billion in September.
PRP
VBZ
DT
JJ
NN
NN
MD
VB
TO
RB
CD
CD
IN
NNP
.
Word dependency parsing
Word dependency parsed sentence
He reckons the current account deficit will narrow to only 1.8 billion in September .
SUBJ
MOD
MOD
MOD
SUBJ
COMP
COMP
SPEC
S-COMP
ROOT
4
A phrase structure tree and
a dependency tree
ounces
5
Flattened representation of
a dependency tree
ounces
6
Dependency structure —
terminology
Label
SUBJ
This
• Child
• Dependent
• Modifier
is
• Parent
• Governor
• Head
• The direction of arrows may be drawn from head to child
• When there is an arrow from w to v, we write w→v.
• When there is a path (a series of arrows) from w to v,
we write w→*v.
7
Definition of Dependency Trees
Single head: Except for the root (EOS), all
words have a single parent
Connected: It should be a connected tree
Acyclic: If wi →wj , then it will never be
wj →*wi
Projective: If wi →wj , then for all k between
i and j, either wk →* wi or wk →* wj holds
(non-crossing between dependencies).
8
Projective dependency tree
ounces
Projectiveness: all the words between here finally
depend on either on “was” or “.”
(e.g., light →* was)
9
Non-projective dependency tree
root
NNP VBD DT NN
NN
WP
VBD
IN
DT
NN
John saw a man yesterday who walked along the river
Direction of edges: from a child to the parent
10
Non-projective dependency tree
root
*taken from: R. McDonald and F. Pereira, “Online Learning of Approximate
Dependency Parsing Algorithms,” European Chapter of Association for
Computational Linguistics, 2006.
Direction of edges: from a parent to the children
11
Two Different Strategies for
Structured Language Analysis
Sentences have structures


Linear sequences: POS tagging, Phrase/Named Entity
chunking
Tree structure: Phrase structure trees, dependency trees
Two statistical approaches to structure analysis
1.
Global optimization



2.
Eg., Hidden Markov Models, Conditional Ramdom Fields for
Sequential tagging problems
Probabilistic Context-free parsing
Maximum Spanning Tree Parsing (graph-based)
Repetition of local optimization


Chunking with Support Vector Machine
Deterministic parsing (transition-based)
12
Statistical dependency parsers
Eisner (COLING 96, Penn Technical Report 96)
Kudo & Matsumoto (VLC 00, CoNLL 02):
Yamada & Matsumoto (IWPT 03)
Nivre (IWPT 03, COLING 04, ACL 05)
Cheng, Asahara, Matsumoto (IJCNLP 04)
McDonald-Crammer-Pereira (ACL 05a, EMNLP 05b,
EACL 06)
Global optimization
Repetition of local optimization
13
Dependency Parsing used as
the CoNLL Shared Task
CoNLL (Conference on Natural Language Learning)
Multi-lingual Dependency Parsing Track

10 languages: Arabic, Basque, Catalan, Chinese, Czech,
English, Greek, Hungarian, Italian, Turkish
Domain Adaptation Track


Dependency annotated data in one domain and a large
unannotated data in other domains (biomedical/chemical
abstracts, parent-child dialogue) are available.
Objective: To use large scale unannotated target data to
enhance the performance of dependency parser learned in
the original domain so as to work well in the new domain.
Nivre, J., Hall, J., Kubler, S, McDonald, R., Nilsson, J., Riedel, S., Yuret,
D., “The CoNLL 2007 Shared Task on Dependency Parsing,”
Proceedings of EMNLP-CoNLL 2007, pp.915-932, June 2007.
14
Statistical dependency parsers
(to be introduced in this lecture)
Kudo & Matsumoto (VLC 00, CoNLL 02):
Japanese
Yamada & Matsumoto (IWPT 03)
Nivre (IWPT 03, COLING 04, ACL 05)
McDonald-Crammer-Pereira (EMNLP 05a,
ACL 05b, EACL 06)
Most of them (except for [Nivre 05] and [McDonald 05a])
Assume projective dependency parsing
15
Japanese Syntactic Dependency Analysis
Analysis of relationship between phrasal
units (“bunsetsu” segments)
Two Constraints:


Each segment modifies one of right-hand
side segments (Japanese is head final
language)
Dependencies do not cross one another
(projectiveness)
16
An Example of Japanese Syntactic
Dependency Analysis
Raw text
私は彼女と京都に行きます
(I go to Kyoto with her.)
Morphological analysis and
bunsetsu chunking
私は / 彼女と / 京都に / 行きます
I
with her
to Kyoto
go
Dependency Analysis
私は / 彼女と / 京都に / 行きます
17
Model 1: Probabilistic Model
[Kudo & Matsumoto 00]
Input
私は 1 / 彼女と 2 / 京都に 3 / 行きます
I-top / with her / to Kyoto-loc / go
1. Build a Dependency
Matrix using ME, DT or
SVMs (How probable one
segment modifies another)
4
Modifiee
Modifier
2. Search the optimal dependencies
which maximize the sentence
probabilities, using CYK or Chart
Output
1
2
3
2
3
4
0.1
0.2
0.7
0.2
0.8
1.0
Dependency Matrix
私は
1
/ 彼女と
2
/ 京都に
3
/ 行きます
4
18
Problems of Probabilistic model(1)
Selection of training examples:
All pairs of segments in a sentence


Depending pairs → positive examples
Non-depending pairs → negative examples
This produces a total of n(n-1)/2 training
examples per sentence (n is the number of segments in a
sentence)
In Model 1:


All positive and negative examples are used to learn
an SVM
Test example is given to the SVM, its distance from
the separating hyperplane is transformed into a
pseud-probability using the sigmoid function
19
Problems of Probabilistic Model
Size of training examples is large
O(n3) time is necessary for complete
parsing
The classification cost of SVM is much
more expensive than other ML algorithms
such as Maximum Entropy model and
Decision Trees
20
Model 2: Cascaded Chunking Model
[Kudo & Matsumoto 02]
Parse a sentence deterministically only
deciding whether the current segment
modifies the segment on its immediate
right-hand side
Training examples are extracted using
the same parsing algorithm
21
Example: Training Phase
Annotated sentence
彼は1
He
彼女の2
her
温かい3
warm
真心に4 感動した。5
heart was moved
(He was moved by her warm heart.)
??
彼は
彼は111
D
O
O
彼女の2
O
D
Pairs of tag (D or O)
and context(features)
are stored as training
data for SVMs
?
??
?
?
???
温かい3 真心に44 感動した。
感動した。55
D
D
Training
Data
SVM-learning
after accumulation
SVMs
22
Example: Test Phase
Test sentence
彼は1
He
彼女の2
her
温かい3
warm
真心に4 感動した。5
heart was moved
(He was moved by her warm heart.)
??
彼は
彼は11111
彼は
D
O
O
彼女の22
O
D
Tag is decided by
SVMs built in
training phase
?
??
?
?
???
感動した。555
温かい3 真心に44 感動した。
感動した。
D
D
SVMs
23
Advantages of Cascaded Chunking
model
Efficiency



O(n3) (Probability model) v.s. O(n2)
(Cascaded chunking model)
Lower than O(n2) since most segments
modify the segment on their immediate righthand-side
The size of training examples is much smaller
Independence from ML methods


Can be combined with any ML algorithm which
works as a binary classifier
Probabilities of dependency are not necessary
24
Features used in implementation
Modify or not?
B
彼の1 友人は2
His
friend-top
A
この本を3
C
持っている4 女性を5 探している6
this book-acc
have
modifier
lady-acc
be looking for
head
His friend is looking for a lady who has this book.
 Static Features
 modifier/modifiee
 Head/Functional Word:
surface, POS, POS-subcategory,
inflection-type, inflection-form, brackets, quotations, punctuations,…
 Between segments:
distance, case-particles, brackets,
quotations, punctuations
 Dynamic Features
 A,B: Static features of Functional word
 C:
Static features of Head word
25
Settings of Experiments
Kyoto University Corpus 2.0/3.0

Standard Data Set
 Training: 7,958 sentences / Test: 1,246 sentences
 Same data as [Uchimoto et al. 98], [Kudo, Matsumoto 00]

Large Data Set
 2-fold Cross-Validation using all 38,383 sentences
Kernel Function: 3rd polynomial
Evaluation method


Dependency accuracy
Sentence accuracy
26
Results
Data Set
Standard (8,000 sentences)
Large (20,000 sentences)
Model
Cascaded
Chunking
Cascaded
Chunking
Probabilistic
Probabilistic
Dependency
Acc. (%)
89.29
89.09
90.45
N/A
Sentence Acc.
(%)
47.53
46.17
53.16
N/A
# of training
sentences
7,956
7,956
19,191
19,191
# of training
examples
110,355
459,105
251,254
1,074,316
Training time
(hours)
8
336
48
N/A
Parsing time
(sec./sent.)
0.5
2.1
0.7
N/A
27
Probabilistic v.s. Cascaded Chunking
Models
Probabilistic
Cascaded Chunking
Strategy
Maximize sentence
probability
Shift-Reduce
Deterministic
Merit
Can see all
candidates of
dependency
Simple, efficient and
scalable,
Accurate as Prob. model
Demerit
Inefficient,
Cannot see the all
(posterior) candidates of
Commit to
unnecessary training dependency
examples
28
Smoothing Effect (in cascade model)
No need to cut off low frequent words
Cut-off by
frequency
Dependency
accuracy
Sentence
accuracy
1
89.3%
47.8%
2
88.7%
46.3%
4
87.8%
44.6%
6
87.6%
44.8%
8
87.4%
42.5%
29
Combination of features
Polynomial Kernels for taking into
combination of features (tested with a
small corpus (2000 sentences))
Degree of
polynomial kernel
Dependency
accuracy
Sentence
accuracy
1
N/A
N/A
2
86.9%
40.6%
3
87.7%
42.9%
4
87.7%
42.8%
30
Deterministic Dependency Parser
based on SVM
Three possible actions:



[Yamada & Matsumoto 03]
Right: For the two adjacent words, modification goes from
left word to the right word
Left: For the two adjacent words, modification goes from
right word to the left word
Shift: no action should be taken for the pair, and move the
focus to the right
 There are two possibilities in this situation:
 There is really no modification relation between the pair
 There is actually a modification relation between them, but
need to wait until the surrounding analysis has been finished
 The second situation can be categorized into a different class
(called Wait)
Do this process on the input sentence from the
beginning to the end, and repeat it until a single
word remains
31
Right action
32
Left action
33
Shift action
34
The features used in learning
SVM is used to make classification
either in 3 class model (right, left, shift)
or in 4 class model (right, left, shift, wait)
35
SVM Learning of Actions
The best action for each configuration is
learned by SVMs
Since the problem is 3-class or 4-class
classification problem, either pair-wise or
one-vs-rest method is employed


pair-wise method: For each pair of classes, learn
an SVM. The best class is decide by voting of all
SVMs
One-vs-rest method: For each class, an SVM is
learned to discriminate that class from others.
The best class is decided by the SVM that gives
the highest value
36
An Example of Deterministic Dependency
Parsing (Yamada & Matsumoto Algorithm)
right
the
boy
hits
the
dog
with
a
rod
word pair being considered
Referred context
37
An Example of Deterministic Dependency
Parsing (Yamada & Matsumoto Algorithm)
right
boy
hits
the
dog
with
a
rod
the
38
An Example of Deterministic Dependency
Parsing (Yamada & Matsumoto Algorithm)
shift
hits
the
dog
with
a
rod
boy
the
39
An Example of Deterministic Dependency
Parsing (Yamada & Matsumoto Algorithm)
right
hits
the
dog
with
a
rod
boy
the
40
An Example of Deterministic Dependency
Parsing (Yamada & Matsumoto Algorithm)
shift
hits
dog
boy
the
with
a
rod
the
41
An Example of Deterministic Dependency
Parsing (Yamada & Matsumoto Algorithm)
shift
hits
dog
boy
the
with
a
rod
the
42
An Example of Deterministic Dependency
Parsing (Yamada & Matsumoto Algorithm)
right
hits
dog
boy
the
with
a
rod
the
43
An Example of Deterministic Dependency
Parsing (Yamada & Matsumoto Algorithm)
left
hits
boy
dog
the
with
rod
a
the
44
An Example of Deterministic Dependency
Parsing (Yamada & Matsumoto Algorithm)
shift
hits
boy
dog
the
the
with
rod
a
45
An Example of Deterministic Dependency
Parsing (Yamada & Matsumoto Algorithm)
left
hits
boy
dog
the
the
with
rod
a
46
An Example of Deterministic Dependency
Parsing (Yamada & Matsumoto Algorithm)
left
hits
with
boy
dog
rod
the
the
a
47
An Example of Deterministic Dependency
Parsing (Yamada & Matsumoto Algorithm)
End of parsing
hits
boy
dog
with
the
the
rod
a
48
The Accuracy of Parsing
Accuracies for:
Dependency relation
Rood identification
Complete analysis
Learned with 30000 English sentences
• no children: no child info is considered
• word, POS: only word/POS info is used
• all: all information is used
49
Deterministic linear time dependency
parser based on Shift-Reduce parsing
[Nivre 03,04]
There are two stacks S and Q



Initialization: S[w1] [w2, w3, …, wn]Q
Termination: S[…] []Q
Parsing actions:
 SHIFT: S[…] [wi,…]Q → S[…, wi] […]Q
 Left-Arc: S[…, wi] [wj, …]Q → S[…] [wj, …]Q
wi
 Right-Arc: S[…, wi] [wj,…]Q → S[…, wi, wj] […]Q
 Reduce: S[…, wi, wj] […]Q → S[…, wi] […]Q
Though the original parser uses memory-based learning,
recent implementation uses SVMs to select actions
wj
50
An Example of Deterministic Dependency
Parsing (Nivre’s Algorithm)
left
S
the
Q
boy
hits
the dog with a rod
51
An Example of Deterministic Dependency
Parsing (Nivre’s Algorithm)
S
shift
Q
boy
hits
the dog with a rod
the
52
An Example of Deterministic Dependency
Parsing (Nivre’s Algorithm)
left
S
boy
Q
hits
the dog with a rod
the
53
An Example of Deterministic Dependency
Parsing (Nivre’s Algorithm)
S
shift
Q
hits
the dog with a rod
boy
the
54
An Example of Deterministic Dependency
Parsing (Nivre’s Algorithm)
shift
S
hits
Q
the dog with a rod
boy
the
55
An Example of Deterministic Dependency
Parsing (Nivre’s Algorithm)
left
S
hits
the
Q
dog with a rod
boy
the
56
An Example of Deterministic Dependency
Parsing (Nivre’s Algorithm)
right
S
hits
boy
Q
dog with a rod
the
the
57
An Example of Deterministic Dependency
Parsing (Nivre’s Algorithm)
reduce
S
hits
boy
dog
Q
with a rod
the
the
58
An Example of Deterministic Dependency
Parsing (Nivre’s Algorithm)
right
S
hits
boy
dog
the
the
Q
with a rod
59
An Example of Deterministic Dependency
Parsing (Nivre’s Algorithm)
shift
S
hits with
boy
dog
the
the
Q
a rod
60
An Example of Deterministic Dependency
Parsing (Nivre’s Algorithm)
left
S
hits with
boy
dog
the
the
a
Q
rod
61
An Example of Deterministic Dependency
Parsing (Nivre’s Algorithm)
right
S
hits with
boy
dog
the
the
Q
rod
a
62
An Example of Deterministic Dependency
Parsing (Nivre’s Algorithm)
terminate
S
hits with
boy
dog
the
the
Q
rod
a
63
Computational Costs
Cascade Chunking (Kudo2, Yamada)

O(n2)
Deterministic Shift-Reduce Parsing (Nivre)

O(n)
CKY-based parsing (Eisner, Kudo1)

O(n3)
Maximum Spanning Tree Parsing (McDonald)

O(n2)
64
McDonald’s MST Parsing
R. McDonald, F. Pereira, K. Ribarov, and J.
Hajiˇc, “Non-projective dependency parsing
using spanning tree algorithms,”
In Proceedings of the Joint Conference on
Human Language Technology and Empirical
Methods in Natural Language Processing
(HLT/EMNLP), 2005.
MST: Maximum Spanning Tree
65
Definition of Non-projective
Dependency Trees
Single head: Except for the root, all words
have a single parent
Connected: It should be a connected tree
Acyclic: If wi →wj , then it will never be
wj →*wi
*taken from: R. McDonald and F. Pereira, “Online Learning of Approximate
Dependency Parsing Algorithms,” European Chapter of Association for
Computational Linguistics, 2006.
66
Notation and definition
  {( x t , y t )} t 1 training data
T
t-th sentence in T
xt
y t  {( i , j )} i 1
n
Dependency structure for t-th sentence
(defined by a set of edges)
f (i, j )
The vector of feature functions for edges
dt ( x )
The set of all dependency trees
producible from x
67
The Basic Idea
The score function of dependency trees is defined by
the sum of the scores of edges, which are defined by
the weighted sum of feature functions
s( x, y) 
 s (i , j )   w  f (i , j )
( i , j ) y
saw
( i , j ) y
The target optimization problem:
I
min w
s.t.
s ( x , y )  s ( x , y' )  L ( y , y' )
( x, y)  T ,
y '  dt ( x )
girl
a
L(y,y’) is defined as the number of words in y’ that
have different parent compared with y.
68
Explanation of formulas
s( x, y) 
 s (i , j )   w  f (i , j )
( i , j ) y
s (i, j )
( i , j ) y
Score of an edge connecting nodes i and j
f ( i , j ) Feature vector representing information of
words i and j, and relation between them
w
Weight vector for the feature vectors
for maximizing w  f ( i , j ) (learned from
training examples)
69
Online learning algorithm (MIRA)
1. w0=0; v=0; i=0
2. for n : 1..N
3. for t : 1..T
4.
w(i+1)=update w(i) with (xt,yt)
5.
v=v+w(i+1)
N: predefined number of
6.
i=i+1
iteration
T: number of traning examples
7. w=v/(N*T)
70
Update of the weight vector
min w
s.t.
( i 1)
w
(i )
s ( x , y )  s ( x , y' )  L ( y , y' )
 y '  dt ( x t )
Since this formulation requires all the possible trees,
this is modified so that only the k-best trees are taken
into consideration
min w
s.t.
( i 1)
w
(i )
s ( x , y )  s ( x , y' )  L ( y , y' )
 y '  best k ( x t ; w
(i )
)
(k=5~20)
71
Features used in training
Unigram features: word/POS tag of the parent,
word/POS tag of the child (when the length of a word
is more than 5, only 5-letter prefix is used, this
applies all other features)
Bigram features: pair of word/POS tags of the parent
and child
Trigram features: POS tags of the parent and child,
and the POS tag or a word appearing in-between
Context features: POS tags of the parent and child,
and two POS tags before and after them (backed-off
by trigrams)
72
Settings of Experiments
Convert Penn Treebank into
dependency trees based on Yamada’s
head rules data(chapters 02-21),
development data(22), test data(23)
POS tagging is done by Ratnaparkhi’s
MaxEnt tagger
K-best tree are constructed based of
Eisner’s parsing algorithm
73
Dependency Parsing by
Maximum Spanning Tree Algorithm
1. Learn the optimal w from training data
2. For a given sentence:

Calculate the scores to all word pairs in the
sentence
3. Get the spanning tree with the largest score

There is an algorithm (Chu-Liu-Edmonds
algorithm) that requires just O(n2) time to obtain
the maximum spanning tree (n: the number of
nodes(words))
74
The Image of Getting
the Maximum Spanning Tree
“John saw Mary”
Scores to all the pairs or words
(root is the dummy root node)
The maximum
spanning tree
75
Results of experiments
Model
accuracy
root
comp.
Yamad & Matsumoto
90.3
91.6
38.4
Nivre
87.3
84.3
30.4
Avg. Perceptron
90.6
94.0
36.5
MIRA
90.9
94.2
37.5
accuracy: proportion of correct edges
root: accuracy of root words
comp.: proportion of sentences completely analyzed
76
Available Dependency Parsers
Maximum Spanning Tree Parser (McDonald)

http://ryanmcd.googlepages.com/MSTParser.html
Date-driven Dependency Parser (Nivre)

http://w3.msi.vxu.se/~nivre/research/MaltParser.html
CaboCha: SVM-based Japanese Dependency
Parser (Kudo)

http://chasen.org/~taku/software/cabocha/
77
References
Yuchang Cheng, Masayuki Asahara, Yuji Matsumoto, "Deterministic dependency analyzer for Chinese,"
IJCNLP-04: The First International Joint Conference on Natural Language Processing,P.135-140,
2004.
Yuchang Cheng, Masayuki Asahara, Yuji Matsumoto, "Chinese Deterministic Dependency Analyzer:
Examining Effects of Global Features and Root Node Finder," Proc. Forth SIGHAN Workshop on
Chinese Language Processing, Proceedings of the Workshop, pp.17-24, October 2005.
Jason M. Eisner, "Three New Probabilistic Models for Dependency Parsing: An Exploration," COLING-96:
The 16th International Conference on Computational Linguistics, pp.340-345, 1996.
Taku Kudo and Yuji Matsumoto, “Japanese Dependency Structure Analysis Based on Support Vector
Machines,”Proceedings of the 2000 Joint SIGDAT Conference on Empirical Methods in Natural
Language Processing and Very Large Corpora, pp.18-25, October 2000.
Taku Kudo, Yuji Matsumoto, "Japanese dependency analysis using cascaded chunking," Proc. 6th
Conference on Natural Language Learning (CoNLL-02), pp.63-69, 2002.
Ryan McDonald, Koby Crammer, Fernando Pereira, "Online Large-Margin Training of Dependency
Parsers," Proc. Annual Meeting of the Association for Computational Linguistics, pp.91-98, 2005.
Ryan McDonald, Fernando Pereira, Jan Hajic "Non-Projective Dependency Parsing using Spanning Tree
Algorithms,"HLT-EMNLP, 2005.
Ryan McDonald, Fernando Pereira, "Online Learning of Approximate Dependency Parsing Algorithms,"
Proc. European Chapter of Association for Computational Linguistics, 2006.
Joakim Nivre, "An Efficient Algorithm for Projective Dependency Parsing," Proc. 8th International
Workshop on Parsing Technologies (IWPT), pp.149-160, 2003.
Joakim Nivre, Mario Scholz, "Deterministic Dependency Parsing of English Text," COLING 2004: 20th
International Conference on Computational Linguistics, pp.64-70, 2004.
Joakim Nivre, Jens Nilsson, "Pseudo-Projective Dependency Parsing," ACL-05: 43rd Annual Meeting of
the Association for Computational Linguistics, pp.99-106, 2005.
Joakim Nivre,: Inductive Dependency Parsing, Text, Speech & Language Technology Vol.34, Springer,
2006.
Hiroyasu Yamada and Yuji Matsumoto, "Statistical dependency analysis with Support Vector Machines,"
Proc. 8th International Workshop on Parsing Technologies (IWPT), pp.195-206, 2003.
78
Descargar

Part-of-speech Tagging, Chinking and Sallow Parsing