Maximum Entropy Language Modeling with
Syntactic, Semantic and Collocational
Dependencies
Jun Wu and Sanjeev Khudanpur
Center for Language and Speech Processing
Johns Hopkins University
Baltimore, MD 21218
August, 2000
NSF STIMULATE Grant No. IRI-9618874
STIMULATE Team in CLSP
- Faculty:
  - Frederick Jelinek: syntactic language modeling
  - Eric Brill: consensus lattice rescoring
  - Sanjeev Khudanpur: maximum entropy language modeling
  - David Yarowsky: topic/genre dependent language modeling
- Students:
  - Ciprian Chelba: syntactic language modeling
  - Radu Florian: topic/genre dependent language modeling
  - Lidia Mangu: consensus lattice rescoring
  - Jun Wu: maximum entropy language modeling
  - Peng Xu: syntactic language modeling
Outline
- The maximum entropy principle
- Semantic (topic) dependencies
- Syntactic dependencies
- ME models with topic and syntactic dependencies
- Conclusion and future work
The Maximum Entropy Principle
- The maximum entropy (MAXENT) principle:
  When we make inferences based on incomplete information, we should draw from that probability distribution that has the maximum entropy permitted by the information we do have.
- Example (dice):
  Let p_i, i = 1, 2, \ldots, 6 be the probability that the facet with i dots faces up. Seek the model P = (p_1, p_2, \ldots, p_6) that maximizes

      H(P) = - \sum_i p_i \log p_i .

  From the Lagrangian

      L(P, \alpha) = - \sum_i p_i \log p_i + \alpha \left( \sum_i p_i - 1 \right),

      \frac{\partial L}{\partial p_j} = -1 - \log p_j + \alpha = 0 .

  So p_1, p_2, \ldots, p_6 = e^{\alpha - 1}. Choosing \alpha to normalize, p_1 = p_2 = \ldots = p_6 = 1/6.
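As a quick numerical check of this example (an addition for the reader, not part of the original slides), the short Python sketch below maximizes the entropy of a six-outcome distribution subject only to normalization, using scipy; it recovers the uniform distribution p_i = 1/6.

```python
# Hypothetical check of the dice example: maximize H(P) subject only to
# normalization; the optimum should be the uniform distribution (1/6 each).
import numpy as np
from scipy.optimize import minimize

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)            # guard against log(0)
    return float(np.sum(p * np.log(p)))   # negative entropy

constraints = [{"type": "eq", "fun": lambda p: np.sum(p) - 1.0}]
bounds = [(0.0, 1.0)] * 6
p0 = np.array([0.5, 0.1, 0.1, 0.1, 0.1, 0.1])  # arbitrary starting point

result = minimize(neg_entropy, p0, bounds=bounds, constraints=constraints)
print(np.round(result.x, 4))  # approximately [0.1667 0.1667 0.1667 0.1667 0.1667 0.1667]
```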
The Maximum Entropy Principle (Cont.)
- Example 2: Seek a probability distribution that satisfies the constraint \hat{p}_2 = 1/4, where \hat{p} is the empirical distribution.

  The feature:

      f(i) = 1 if i = 2, and 0 otherwise.

  Empirical expectation:

      E_{\hat{P}}(f) = \sum_i \hat{p}_i f(i) = 1/4 .

  Maximize H(P) = - \sum_i p_i \log p_i subject to E_P(f) = E_{\hat{P}}(f), using the Lagrangian

      L(P, \alpha) = - \sum_i p_i \log p_i + \alpha_1 \left( \sum_i p_i - 1 \right) + \alpha_2 \left( \sum_i p_i f(i) - \frac{1}{4} \right).

  So p_2 = 1/4 and p_1, p_3, \ldots, p_6 = 3/20.
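The constrained example can also be solved by iterative scaling, the family of algorithms used to fit ME models. Below is a minimal sketch (simplified to the single binary feature of this toy problem; full generalized iterative scaling would add a slack feature), which converges to p_2 = 1/4 and p_i = 3/20 for the other faces.

```python
# Simplified iterative-scaling sketch for the constrained dice example:
# one feature f(i) = 1 if i == 2 else 0, with target expectation 1/4.
import math

f = [1.0 if i == 2 else 0.0 for i in range(1, 7)]  # feature values for faces 1..6
target = 0.25                                       # desired E_P[f]
lam = 0.0                                           # feature weight lambda

for _ in range(100):
    weights = [math.exp(lam * fi) for fi in f]
    z = sum(weights)
    p = [w / z for w in weights]
    expected = sum(pi * fi for pi, fi in zip(p, f))
    lam += math.log(target / expected)              # scaling update

print([round(pi, 4) for pi in p])  # ~[0.15, 0.25, 0.15, 0.15, 0.15, 0.15]
```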
Maximum Entropy Language Modeling
- \hat{P}: the empirical distribution; f_1, f_2, \ldots, f_k: feature functions; E_{\hat{P}}(f_1), E_{\hat{P}}(f_2), \ldots, E_{\hat{P}}(f_k): their empirical expectations.
- A maximum entropy (ME) language model is the maximum-likelihood model in the exponential family

      P(w | h) = \frac{ \alpha_1^{f_1} \cdot \alpha_2^{f_2} \cdots \alpha_k^{f_k} }{ Z(h) }

  which satisfies each constraint E_P(f_j) = E_{\hat{P}}(f_j) while maximizing H(P). Here h is the history and w is the future word.
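To make the exponential form concrete, here is a small illustrative sketch, with invented feature names and weights rather than the features of the actual model, that evaluates P(w | h) as a normalized product of exponentiated feature weights over a toy vocabulary.

```python
# Illustrative-only sketch of the conditional exponential (maximum entropy)
# form P(w | h) = exp(sum of active feature weights) / Z(h).
# The feature names and weights below are invented for the example.
import math

VOCAB = ["the", "contract", "ended", "futures", "exchange"]

WEIGHTS = {
    ("uni", "contract"): 0.7,
    ("uni", "the"): 1.2,
    ("bi", "the", "contract"): 0.9,
    ("topic", "FINANCE", "futures"): 1.1,
}

def active_features(history, topic, word):
    """Return the features that fire for this (history, topic, word) triple."""
    feats = [("uni", word), ("bi", history[-1], word), ("topic", topic, word)]
    return [f for f in feats if f in WEIGHTS]

def p_word_given_history(history, topic, word):
    """Conditional ME probability: product of exp(weights), normalized over the vocabulary."""
    def score(w):
        return math.exp(sum(WEIGHTS[f] for f in active_features(history, topic, w)))
    z = sum(score(w) for w in VOCAB)   # Z(h): depends on the history, not on the predicted word
    return score(word) / z

print(p_word_given_history(["of", "the"], "FINANCE", "contract"))
```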
Advantages and Disadvantage of Maximum Entropy Language Modeling
- Advantages:
  - Creating a “smooth” model that satisfies all empirical constraints.
  - Incorporating various sources of information in a unified language model.
- Disadvantage:
  - Training is time- and space-consuming.
Motivation for Exploiting Semantic and Syntactic Dependencies

  Example: "Analysts and financial officials in the former British colony consider the contract essential to the revival of the Hong Kong futures exchange."

- N-gram models take only local correlations between words into account.
- Several dependencies in natural language have longer, sentence-structure-dependent spans and may compensate for this deficiency.
- We need a model that exploits both topic and syntax.
Training a Topic-Sensitive Model
- Cluster the training data by topic:
  - TF-IDF vectors (excluding stop words).
  - Cosine similarity.
  - K-means clustering.
- Select topic-dependent words:

      f_t(w) \cdot \log \frac{f_t(w)}{f(w)} > \text{threshold}

- Estimate an ME model with topic unigram constraints:

      P(w_i | w_{i-2}, w_{i-1}, topic) = \frac{ e^{\lambda(w_i)} \cdot e^{\lambda(w_{i-1}, w_i)} \cdot e^{\lambda(w_{i-2}, w_{i-1}, w_i)} \cdot e^{\lambda(topic, w_i)} }{ Z(w_{i-2}, w_{i-1}, topic) }

  where

      \sum_{w_{i-2}, w_{i-1}} P(w_{i-2}, w_{i-1}, w_i | topic) = \frac{ \#[topic, w_i] }{ \#[topic] } .
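The clustering and word-selection steps could be sketched as follows. This is a rough illustration using scikit-learn; the number of topics, the threshold, and the exact form of the selection score are assumptions, not the settings used in the reported experiments.

```python
# Rough sketch of topic clustering (TF-IDF + cosine + K-means) and
# topic-dependent word selection. Parameter values are placeholders.
import math
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def cluster_by_topic(documents, n_topics=70):
    """Cluster training conversations into topics via TF-IDF vectors and K-means."""
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(documents)            # TF-IDF vectors, stop words excluded
    # K-means on L2-normalized TF-IDF vectors approximates cosine-similarity clustering.
    labels = KMeans(n_clusters=n_topics, n_init=10).fit_predict(X)
    return labels

def topic_dependent_words(topic_tokens, all_tokens, threshold=1e-4):
    """Select words scoring f_t(w) * log(f_t(w) / f(w)) above a threshold.
    f_t and f are taken here as relative frequencies; the slide leaves this open."""
    f_t, f = Counter(topic_tokens), Counter(all_tokens)
    n_t, n = len(topic_tokens), len(all_tokens)
    selected = []
    for w, c in f_t.items():
        rel_t, rel = c / n_t, f[w] / n
        if rel_t * math.log(rel_t / rel) > threshold:
            selected.append(w)
    return selected
```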
Recognition Using a Topic-Sensitive Model
- Detect the current topic from the recognizer's N-best hypotheses vs. from the reference transcriptions:
  - Using the N-best hypotheses causes little degradation (in perplexity and WER).
- Assign a new topic for each conversation vs. for each utterance:
  - Topic assignment for each utterance is better than topic assignment for the whole conversation.
- See Khudanpur and Wu (ICASSP '99) and Florian and Yarowsky (ACL '99) for details.
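A hedged sketch of test-time topic assignment (an illustration, not the authors' implementation): build a TF-IDF vector from the utterance's N-best hypotheses, pick the closest topic centroid by cosine similarity, and fall back to the topic-independent model when no centroid is close. The similarity threshold below is a placeholder.

```python
# Illustrative sketch of test-time topic assignment from N-best hypotheses.
# `centroids` are assumed to come from the K-means step above.
import numpy as np

def assign_topic(nbest_hypotheses, vectorizer, centroids, min_similarity=0.1):
    """Pick the topic whose centroid is most cosine-similar to the N-best text.

    Returns None to indicate "use the topic-independent model"."""
    text = " ".join(nbest_hypotheses)                 # pool the N-best hypotheses
    v = vectorizer.transform([text]).toarray()[0]     # TF-IDF vector of the utterance
    denom = np.linalg.norm(centroids, axis=1) * (np.linalg.norm(v) + 1e-12)
    sims = centroids @ v / (denom + 1e-12)            # cosine similarity to each topic
    best = int(np.argmax(sims))
    return best if sims[best] >= min_similarity else None
```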
Experimental Setup
- The experiments are based on the WS97 dev test set:
  - Vocabulary: 22K (closed).
  - LM training set: 1,100 conversations, 2.1M words.
  - AM training set: 60 hours of speech data.
  - Acoustic model: state-clustered cross-word triphone models (6,700 states, 12 Gaussians/state).
  - Front end: 13 MF-PLP + Δ + ΔΔ, per-conversation-side CMS.
  - Test set: 19 conversations (2 hours), 18K words.
  - No speaker adaptation.
- The evaluation is based on rescoring 100-best lists from the first-pass speech recognition.
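For readers unfamiliar with N-best rescoring, a generic sketch of the procedure follows; the score names, LM scale, and insertion penalty are illustrative assumptions rather than values from this setup.

```python
# Generic N-best rescoring sketch: combine each hypothesis's acoustic score
# with the new LM score and a word insertion penalty, and keep the best one.
def rescore_nbest(nbest, lm_logprob, lm_scale=12.0, word_penalty=0.0):
    """nbest: list of dicts with 'words' (list of str) and 'acoustic' (log score).
    lm_logprob: function mapping a word sequence to its LM log-probability."""
    def combined(hyp):
        return (hyp["acoustic"]
                + lm_scale * lm_logprob(hyp["words"])
                + word_penalty * len(hyp["words"]))
    return max(nbest, key=combined)
```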
Topic Assignment During Testing: Reference Transcriptions vs. Hypotheses

  [Bar charts: test-set perplexity (baseline trigram 79.0; topic-sensitive models roughly 72.5 to 74.4) and WER (baseline 38.5%; topic-sensitive models roughly 37.7% to 37.9%) when the topic is assigned manually, from the reference transcriptions, from the 10-best hypotheses, or by an oracle.]

- Even with a WER of over 38%, there is only a small loss in perplexity and a negligible loss in WER when the topic assignment is based on recognizer hypotheses instead of the correct transcriptions.
- Comparisons with the oracle indicate that there is little room for further improvement.
Topic Assignment During Testing: Conversation Level vs. Utterance Level

  [Bar charts: perplexity (baseline 79.0; topic models roughly 73.3 to 74.4) and WER (baseline 38.5%; topic models roughly 37.8% to 37.9%) for topic assignment from the reference transcriptions and from the 10-best lists, at the conversation (C) and utterance (U) level. A pie chart compares utterance-level and conversation-level assignments (agree / disagree / no topic).]

- Topic assignment based on utterances gives a slightly better result than assignment based on whole conversations.
- Most of the utterances prefer the topic-independent model.
- Less than one half of the remaining utterances prefer a topic other than the one assigned at the conversation level.
ME Method vs Interpolation

  Model                            PPL    WER     Size
  3-gram (baseline)                79.0   38.5%   499K
  + topic 1-gram (interpolated)    78.4   38.5%   +70*11K
  + topic 2-gram (interpolated)    77.3   38.3%   +70*26K
  + topic 3-gram (interpolated)    76.1   38.1%   +70*55K
  ME (topic unigram constraints)   73.5   37.8%   +16K

- The ME model with only topic-dependent unigram constraints outperforms the interpolated topic-dependent trigram model.
- The ME method is an effective means of integrating topic-dependent and topic-independent constraints.
ME vs Cache-Based Models

  The cache-based model interpolates a cache unigram P_c with the trigram:

      P(w_i | w_{i-2}, w_{i-1}) = \lambda \cdot P_3(w_i | w_{i-2}, w_{i-1}) + (1 - \lambda) \cdot P_c(w_i)

  Model    PPL    WER
  3-gram   79.0   38.5%
  Cache    75.2   38.9%
  ME       73.5   37.8%

- The cache-based model reduces the perplexity but increases the WER.
- The cache-based model causes (0.6%) more repeated errors than the trigram model does.
- The cache model may not be practical when the baseline WER is high.
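A minimal sketch of a cache-interpolated LM matching the formula above (an illustration; the cache size, decay, and lambda used in the experiments are not specified on the slide):

```python
# Minimal sketch of a cache-interpolated LM: a unigram cache P_c built from
# recently recognized words, interpolated with a base trigram P_3.
from collections import Counter, deque

class CacheLM:
    def __init__(self, trigram_prob, lam=0.9, max_cache=200):
        self.trigram_prob = trigram_prob      # function: (w2, w1, w) -> P_3(w | w2, w1)
        self.lam = lam
        self.cache = deque(maxlen=max_cache)  # recent words in the conversation

    def prob(self, w2, w1, w):
        counts = Counter(self.cache)
        p_cache = counts[w] / len(self.cache) if self.cache else 0.0
        return self.lam * self.trigram_prob(w2, w1, w) + (1 - self.lam) * p_cache

    def observe(self, word):
        self.cache.append(word)               # update the cache as words are recognized
```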
Summary of Topic-Dependent Language Modeling
- We significantly reduce both the perplexity (7%) and the WER (0.7% absolute) by incorporating a small number of topic constraints with N-grams using the ME method.
- Using N-best hypotheses instead of reference transcriptions causes little degradation (in perplexity and WER).
- Topic assignment at the utterance level is better than at the conversation level.
- The ME method is more efficient than linear interpolation in combining topic dependencies with N-grams.
- The topic-dependent model is better than the cache-based model at reducing WER when the baseline is poor.
Exploiting Syntactic Dependencies

  [Figure: partial parse of "The contract ended with a loss of 7 cents after ...". When predicting w_i = "after", the two preceding words are w_{i-2} = "7" and w_{i-1} = "cents", while the two preceding head words exposed by the partial parse are h_{i-2} = "contract" (NP) and h_{i-1} = "ended" (VP).]

- All sentences in the training set are parsed by a left-to-right parser.
- A stack S_i of parse trees T_i for each sentence prefix is generated.
Exploiting Syntactic Dependencies (Cont.)
- A probability is assigned to each word as:

      P(w_i | W_1^{i-1}) = \sum_{T_i \in S_i} P(w_i | W_1^{i-1}, T_i) \cdot \rho(T_i | W_1^{i-1})
                         = \sum_{T_i \in S_i} P(w_i | w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}) \cdot \rho(T_i | W_1^{i-1})

- It is assumed that most of the useful information is embedded in the two preceding words and the two preceding heads.
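As an illustration of the sum over the parse stack (with assumed data structures, not the authors' code), each partial parse T_i supplies the exposed words, head words and non-terminals together with its weight rho(T_i | W), and the word probability is the weighted mixture of the conditional ME predictions:

```python
# Illustrative sketch of P(w_i | W) as a mixture over the stack of partial parses.
# `parse_stack` holds (context, weight) pairs: `context` carries the exposed
# (w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}) and `weight` is rho(T_i | W).
# `cond_prob` stands in for the conditional ME model.
def word_probability(word, parse_stack, cond_prob):
    total = 0.0
    norm = sum(weight for _, weight in parse_stack)   # the rho weights should sum to 1
    for context, weight in parse_stack:
        total += weight * cond_prob(word, context)
    return total / norm if norm > 0 else 0.0
```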
Training a Syntactic ME Model
- Estimate an ME model with syntactic constraints:

      P(w_i | w_{i-1}, w_{i-2}, h_{i-1}, h_{i-2}, nt_{i-1}, nt_{i-2}) = \frac{ e^{\lambda(w_i)} \cdot e^{\lambda(w_{i-1}, w_i)} \cdot e^{\lambda(w_{i-2}, w_{i-1}, w_i)} \cdot e^{\lambda(h_{i-1}, w_i)} \cdot e^{\lambda(h_{i-2}, h_{i-1}, w_i)} \cdot e^{\lambda(nt_{i-1}, w_i)} \cdot e^{\lambda(nt_{i-2}, nt_{i-1}, w_i)} }{ Z(w_{i-1}, w_{i-2}, h_{i-1}, h_{i-2}, nt_{i-1}, nt_{i-2}) }

  where

      \sum_{h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}} P(h_{i-1}, h_{i-2}, nt_{i-1}, nt_{i-2}, w_i | w_{i-2}, w_{i-1}) = \frac{ \#[w_{i-2}, w_{i-1}, w_i] }{ \#[w_{i-2}, w_{i-1}] }

      \sum_{w_{i-2}, w_{i-1}, nt_{i-2}, nt_{i-1}} P(w_{i-1}, w_{i-2}, nt_{i-1}, nt_{i-2}, w_i | h_{i-2}, h_{i-1}) = \frac{ \#[h_{i-2}, h_{i-1}, w_i] }{ \#[h_{i-2}, h_{i-1}] }

      \sum_{w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}} P(w_{i-1}, w_{i-2}, h_{i-1}, h_{i-2}, w_i | nt_{i-2}, nt_{i-1}) = \frac{ \#[nt_{i-2}, nt_{i-1}, w_i] }{ \#[nt_{i-2}, nt_{i-1}] }

- See Chelba and Jelinek (ACL '98) and Wu and Khudanpur (ICASSP '00) for details.
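The constraint targets on the right-hand sides are relative frequencies gathered from the parsed training data. A sketch of how one such set of targets might be accumulated (field names are assumptions):

```python
# Sketch of accumulating the empirical marginals that serve as ME constraint
# targets, e.g. #[h_{i-2}, h_{i-1}, w_i] / #[h_{i-2}, h_{i-1}].
from collections import Counter

def head_bigram_targets(events):
    """events: iterable of dicts with keys 'h2', 'h1' (two previous heads) and 'w'."""
    joint = Counter((e["h2"], e["h1"], e["w"]) for e in events)
    marginal = Counter((e["h2"], e["h1"]) for e in events)
    # target expectation for each head-bigram feature (h2, h1, w)
    return {key: count / marginal[key[:2]] for key, count in joint.items()}
```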
Experimental Results of Syntactic LMs

  Model                        PPL    WER
  3-gram (baseline)            79.0   38.5%
  Non-terminal (NT) N-grams    75.1   37.8%
  Head word (HW) N-grams       74.5   37.7%
  Both                         74.0   37.5%

- Non-terminal (NT) N-gram constraints alone reduce perplexity by 5% and WER by 0.7% absolute.
- Head word N-gram constraints result in a 6% reduction in perplexity and 0.8% absolute in WER.
- Non-terminal and head word constraints together reduce the perplexity by 6.3% and the WER by 1.0% absolute.
ME vs Interpolation

  Linear interpolation of the trigram with the syntactic model:

      P(w_i | w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}) = \lambda \cdot P_3(w_i | w_{i-2}, w_{i-1}) + (1 - \lambda) \cdot P_{slm}(w_i | h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1})

  Model           PPL    WER
  3-gram          79.0   38.5%
  Interpolation   75.5   37.9%
  ME              74.0   37.5%

- The ME model is more effective in using syntactic dependencies than the interpolated model.
Head Words inside vs. outside Trigram Range

  [Figure: two partial parses of "The contract ended with a loss of 7 cents after ...". In the first, the exposed head words h_{i-2}, h_{i-1} ("contract", "ended") fall within the trigram range of the word being predicted; in the second, the same head words lie well outside the trigram range, with w_{i-2} = "7" and w_{i-1} = "cents".]
Syntactic Heads inside vs. outside Trigram Range

  WER broken down by whether the syntactic heads are inside or outside trigram range
  (heads are inside trigram range for about 73% of word positions and outside for about 27%):

  Model     Heads inside   Heads outside
  Trigram   37.8%          40.3%
  NT        37.2%          39.4%
  HW        37.4%          38.8%
  Both      36.9%          38.9%

- The WER of the baseline trigram model is relatively high when the syntactic heads are beyond trigram range.
- Lexical head words are much more helpful in reducing WER when they are outside trigram range (1.5%) than when they are within trigram range.
- However, non-terminal N-gram constraints help almost evenly in both cases. Can this gain be obtained from a POS class model too?
- The WER reduction for the model with both head word and non-terminal constraints (1.4%) is more than the overall reduction (1.0%) when head words are beyond trigram range.
Contrasting the Smoothing Effect of NT Class LM vs POS Class LM
- An ME model with part-of-speech (POS) N-gram constraints is built as:

      P(w_i | w_{i-1}, w_{i-2}, pos_{i-1}, pos_{i-2}) = \frac{ e^{\lambda(w_i)} \cdot e^{\lambda(w_{i-1}, w_i)} \cdot e^{\lambda(w_{i-2}, w_{i-1}, w_i)} \cdot e^{\lambda(pos_{i-1}, w_i)} \cdot e^{\lambda(pos_{i-2}, pos_{i-1}, w_i)} }{ Z(w_{i-1}, w_{i-2}, pos_{i-1}, pos_{i-2}) }

  Model    PPL    WER
  3-gram   79.0   38.5%
  POS      75.9   38.0%
  NT       75.1   37.8%

- The POS model reduces perplexity by 4% and WER by 0.5%.
- The overall gains from POS N-gram constraints are smaller than those from NT N-gram constraints.
- Syntactic analysis seems to perform better than just using the two previous word positions.
POS Class LM vs NT Class LM

  WER broken down by whether the syntactic heads are inside or outside trigram range:

  Model     Heads inside   Heads outside
  Trigram   37.8%          40.3%
  POS       37.6%          39.2%
  NT        37.2%          39.4%

  [Pie charts: trigram coverage of test events, which is lower when the syntactic heads are beyond trigram range.]

- When the syntactic heads are beyond trigram range, the trigram coverage in the test set is relatively low.
- The back-off effect provided by the POS N-gram constraints is effective in reducing WER in this case.
- NT N-gram constraints work in a similar manner; overall they are more effective, perhaps because they are linguistically more meaningful.
- Performance improves further when lexical head words are applied on top of the non-terminals.
Summary of Syntactic Language Modeling
- Syntactic heads in the language model are complementary to N-grams: the model improves significantly when the syntactic heads are beyond N-gram range.
- Head word constraints provide syntactic information; non-terminals mainly provide a smoothing effect.
- Non-terminals are linguistically more meaningful predictors than POS tags, and are therefore more effective in supplementing N-grams.
- The syntactic model reduces perplexity by 6.3% and WER by 1.0% (absolute).
Combining Topic, Syntactic and N-gram Dependencies in an ME Framework
- Probabilities are assigned as:

      P(w_i | W_1^{i-1}) = \sum_{T_i \in S_i} P(w_i | w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}, topic) \cdot \rho(T_i | W_1^{i-1})

- The composite ME model is trained as:

      P(w_i | w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}, topic) = \frac{ e^{\lambda(w_i)} \cdot e^{\lambda(w_{i-1}, w_i)} \cdot e^{\lambda(w_{i-2}, w_{i-1}, w_i)} \cdot e^{\lambda(h_{i-1}, w_i)} \cdot e^{\lambda(h_{i-2}, h_{i-1}, w_i)} \cdot e^{\lambda(nt_{i-1}, w_i)} \cdot e^{\lambda(nt_{i-2}, nt_{i-1}, w_i)} \cdot e^{\lambda(topic, w_i)} }{ Z(w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}, topic) }

- Only marginal constraints are necessary.
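The composite model simply takes the union of the feature sets introduced earlier. As an illustration with assumed helper structures (not the authors' code), the conditional probability is still a single normalized product over whatever features fire for the event:

```python
# Sketch of the composite conditional ME model: the score of a candidate word
# is the exponentiated sum of the weights of all active features (N-gram,
# head-word, non-terminal, and topic), normalized over the vocabulary.
import math

def composite_prob(word, ctx, weights, vocab):
    """ctx: dict with w1, w2, h1, h2, nt1, nt2, topic; weights: feature -> lambda."""
    def score(w):
        feats = [
            ("uni", w), ("bi", ctx["w1"], w), ("tri", ctx["w2"], ctx["w1"], w),
            ("hw1", ctx["h1"], w), ("hw2", ctx["h2"], ctx["h1"], w),
            ("nt1", ctx["nt1"], w), ("nt2", ctx["nt2"], ctx["nt1"], w),
            ("topic", ctx["topic"], w),
        ]
        return math.exp(sum(weights.get(f, 0.0) for f in feats))
    z = sum(score(w) for w in vocab)     # Z depends on the full conditioning context
    return score(word) / z
```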
Overall Experimental Results

  Model       PPL    WER
  3-gram      79.0   38.5%
  Topic       73.5   37.8%
  Syntax      74.0   37.5%
  Composite   67.9   37.0%

- The baseline trigram WER is 38.5%.
- Topic-dependent constraints alone reduce perplexity by 7% and WER by 0.7% absolute.
- Syntactic heads result in a 6% reduction in perplexity and 1.0% absolute in WER.
- Topic-dependent constraints and syntactic constraints together reduce the perplexity by 13% and the WER by 1.5% absolute.
- The gains from topic and syntactic dependencies are nearly additive.
Content Words vs. Stop Words

  [Bar chart: WER on stop words (roughly 36% to 38%) and on content-bearing words (roughly 40% to 42%) for the trigram, topic, syntactic, and composite models. A pie chart shows that content-bearing words make up about one fifth of the test tokens.]

- About 1/5 of the test tokens are content-bearing words.
- The topic-sensitive model reduces WER by 1.4% on content words, which is twice as much as the overall improvement (0.7%).
- The syntactic model improves WER on both content words and stop words evenly.
- The composite model has the advantages of both models and reduces WER on content words more significantly (2.1%).
Head Words inside vs. outside Trigram Range

  WER broken down by whether the head words are inside or outside trigram range
  (head words are inside trigram range for about 73% of word positions and outside for about 27%):

  Model       Heads inside   Heads outside
  Trigram     37.8%          40.3%
  Topic       37.3%          39.1%
  Syntactic   36.9%          38.9%
  Composite   36.5%          38.1%

- The WER of the baseline trigram model is relatively high when head words are beyond trigram range.
- The topic model helps when the trigram is inappropriate.
- The WER reduction for the syntactic model (1.4%) is more than the overall reduction (1.0%) when head words are outside trigram range.
- The WER reduction for the composite model (2.2%) is more than the overall reduction (1.5%) when head words are outside trigram range.
Further Insight Into the Performance

  [Bar chart: WER broken down jointly by stop vs. content words and by whether the syntactic heads are inside or outside trigram range, for the trigram, topic, syntactic, and composite models. Content words whose syntactic heads are outside trigram range are the hardest case (trigram 48.6%, composite 46.0%). A pie chart shows the four-way breakdown of the test tokens.]

- The composite model reduces the WER of content words by 2.6% absolute when the syntactic predicting information is beyond trigram range.
Concluding Remarks
- A language model incorporating two diverse sources of long-range dependence together with N-grams has been built.
- The WER on content words is reduced by 2.1%, most of it due to topic dependence.
- The WER on head words beyond trigram range is reduced by 2.2%, most of it due to syntactic dependence.
- These two sources of non-local dependencies are complementary, and their gains are almost additive.
- An overall perplexity reduction of 13% and a WER reduction of 1.5% (absolute) are achieved on Switchboard.
Ongoing and Future Work
- Improve the training algorithm.
- Apply this method to other tasks (Broadcast News).
Acknowledgement
- We thank Radu Florian and David Yarowsky for their help with topic detection and data clustering, and Ciprian Chelba and Frederick Jelinek for providing the syntactic model (parser) used for the experimental results reported here.
- This work was supported by the National Science Foundation through STIMULATE grant No. IRI-9618874.