Towards unsupervised induction
of morphophonological rules
Erwin Chan
University of Pennsylvania
Morphochallenge workshop
19 Sept 2007
Goals of unsupervised morphology induction
1. Provide analysis of input data
2. Analyzer for unseen data
Key task: generalize analysis of input data
by inducing phonological characteristics
Example: inducing phonology (English plural nouns)
1. Input corpus:
   processes, witnesses, matches, hatches, maids, ferns, mates
2. Induce segmentation:
   process.es, witness.es, match.es, hatch.es, maid.s, fern.s, mate.s
3. Induce phonology:
   es: ends in ch or sh
   s: other characters
4. Apply to novel words:
   bench.es, fate.s, foe.s, wish.es
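A toy sketch of steps 3 and 4, assuming the segmentation from step 2 is already given. It records only the final character of each stem, which is coarser than the "ch or sh" contexts on the slide, and it tries longer suffixes first; both choices, and the function names, are illustrative assumptions rather than the system's method.

```python
from collections import defaultdict

def induce_contexts(segmented):
    """Map each suffix to the set of stem-final characters it was seen after."""
    contexts = defaultdict(set)
    for item in segmented:
        stem, suffix = item.rsplit(".", 1)
        contexts[suffix].add(stem[-1])
    return dict(contexts)

def segment(word, contexts):
    """Segment an unseen word using the induced contexts, or return None."""
    # Longer suffixes are tried first (an assumed tie-break, not from the slides).
    for suffix in sorted(contexts, key=len, reverse=True):
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and stem and stem[-1] in contexts[suffix]:
            return stem + "." + suffix
    return None

corpus = ["process.es", "witness.es", "match.es", "hatch.es",
          "maid.s", "fern.s", "mate.s"]
contexts = induce_contexts(corpus)
for novel in ["benches", "fates", "foes", "wishes"]:
    print(segment(novel, contexts))
# bench.es
# fate.s
# foe.s
# wish.es
```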
Base-and-transforms model of
morphological paradigms
Apply transforms to base forms to generate inflections
[Figure: three lexemes (Lexeme 1, Lexeme 2, Lexeme 3), each with a base form (base 1, base 2, base 3) linked to its inflections by transforms drawn from t1 to t5]
Base forms
• Base form serves as lexical entry for all
inflections of a lexeme
e.g. base of {help, helps, helping, helped} is help
• Same fine-grained POS type for all lexemes
e.g. “nominative singular” for all nouns
Transforms
• Generates inflected form from base
• Format: ( A, B )
A, B: simple regular expressions
A: characters in the base to be replaced
B: characters that replace them in the inflected form
Transform examples
Base form   Inflected   Transform
eat         eating      ( $, ing )
time        times       ( $, s )
time        timing      ( e, ing )
hang        hung        ( *a*, *u* )  (non-concatenative)
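A minimal sketch of applying a suffix transform ( A, B ) to a base form. It assumes that "$" stands for the empty string at the end of the word and treats A and B as literal strings; the non-concatenative pattern ( *a*, *u* ) is not handled. The function name is hypothetical.

```python
from typing import Optional

def apply_transform(base: str, a: str, b: str) -> Optional[str]:
    """Replace word-final `a` with `b`; return None if `base` does not end in `a`."""
    old = "" if a == "$" else a   # "$" is read as "nothing to remove"
    new = "" if b == "$" else b
    if not base.endswith(old):
        return None
    stem = base[: len(base) - len(old)]
    return stem + new

print(apply_transform("eat", "$", "ing"))   # eating
print(apply_transform("time", "$", "s"))    # times
print(apply_transform("time", "e", "ing"))  # timing
print(apply_transform("eat", "e", "ing"))   # None
```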
Comparison to phonological rules
• Standard rewrite rule:
A → B / C _ D
1. A → B: rewrite operation
2. C _ D: phonological context of application
• A transform is an ungeneralized rule
A → B / { set of base forms }
• Future work: induce phonological rules
Learn generalized phonological properties of base forms
Compare with stem-suffix model
• Stem-suffix
– saves = save + s
– saving = sav + ing
Drawback: multiple lexical representations
• Base-transform
– saves = save + ( $,s )
– saving = save + ( e,ing )
Limitations of model
• Simple morphotactic structure:
– assumes one suffix
– a word is either a base form,
or inflected from a base form
• Does not account for:
– agglutination
– compounds
– prefixing
– irregulars, suppletion
Distribution of morphological forms
• What information is available in corpora for
learning?
• Is there structure within the distribution of
morphological forms that a learner can
exploit?
• Examine annotated corpora for several
languages
Spanish newswire verbs: sparse data
[Figure: log(freq) plotted by lemma and inflection]
Distribution of inflectional categories
[Figure: chart with y-axis from 0 to 10000]
• # word types per inflection (Slovene 2.5 M)
• roughly Zipfian
High frequency of base form
Most frequent inflection (in types) often
matches intuitions of what inflection a base
form should be
Slovene:   A.Pos.Nom.Sg.Indef   N.Nom.Sg         V.Main.Ind.Pres.3.Sg
Swedish:   A.Pos.Sg.Indef.Nom   N.Sg.Indef.Nom   V.Inf.Act
Spanish:   A.Sg                 N.Sg             V.Inf
Goals of induction algorithm
1. Select words from corpus to be base forms
2. Formulate transforms
Technique: take advantage of high type
frequency of base inflectional category
Start state
Transforms = {}
Words: all unmodeled
End state
Transforms = {($,s), ($,’s), …}
Words: base forms, inflected forms, unmodeled
Greedy algorithm
At each iteration,
1. construct potential transforms
2. add the transform(s) that accounts for most
data
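The two steps can be sketched as a loop over a toy vocabulary. This is a simplified illustration under several assumptions not stated on the slides: only suffix transforms are considered, suffixes are capped at three characters, base candidates come from the current base forms plus unmodeled words, inflected candidates come from unmodeled words only, and ties between a transform and its reverse are broken by tuple order as a placeholder for the frequency-based direction choice. All names are hypothetical.

```python
from collections import defaultdict

def candidates(base_pool, infl_pool, max_suffix=3):
    """Map each suffix transform (A, B) to the (base, inflected) pairs it relates."""
    by_stem = defaultdict(set)
    for w in base_pool | infl_pool:
        for k in range(0, min(max_suffix, len(w) - 1) + 1):
            by_stem[w[: len(w) - k]].add(w[len(w) - k:])
    out = defaultdict(set)
    for stem, sufs in by_stem.items():
        for a in sufs:
            for b in sufs:
                if a == b or (a and b and a[0] == b[0]):
                    continue  # skip identity and non-minimal splits
                base, infl = stem + a, stem + b
                if base in base_pool and infl in infl_pool:
                    out[(a or "$", b or "$")].add((base, infl))
    return out

def greedy_induce(words, iterations=2):
    unmodeled, bases, inflected, transforms = set(words), set(), set(), []
    for _ in range(iterations):
        cands = candidates(bases | unmodeled, unmodeled)
        if not cands:
            break
        # Most pairs first; ties fall back to tuple order (placeholder tie-break).
        best = min(cands, key=lambda t: (-len(cands[t]), t))
        transforms.append(best)
        for base, infl in sorted(cands[best]):
            if base in inflected or infl in bases:
                continue  # a word is either a base form or inflected, not both
            bases.add(base)
            inflected.add(infl)
        unmodeled -= bases | inflected
    return transforms, bases, inflected, unmodeled

words = ["walk", "walks", "walked", "walking", "talk", "talks", "talked",
         "jump", "jumps", "time", "times", "timing"]
transforms, bases, inflected, unmodeled = greedy_induce(words)
print(transforms)         # [('$', 's'), ('$', 'ed')]
print(sorted(unmodeled))  # ['timing', 'walking']
```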
Sources of words for transform
[Figure: the current grammar partitions words into base, inflected, and unmodeled; a new transform has its own base and inflected words]
WSJ: Most freq. suffixes (1st iteration)
Rank  Suffix  # types    Rank  Suffix  # types
1.    $       42596      41.   les     237
2.    s       10730      42.   ses     230
3.    e       4967       43.   et      224
4.    d       4800       44.   ck      223
5.    ed      3868       45.   ding    220
6.    y       3648       46.   ning    219
7.    n       3226       47.   ded     219
8.    g       3107       48.   ment    217
9.    ng      2951       49.   ngs     216
10.   ing     2869       50.   rd      211
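A table of this kind could be produced by counting, for each word-final string, the number of word types that end in it. The sketch below assumes that "$" is the empty suffix (so its count is the total number of types), caps suffix length at four characters, and requires a non-empty stem; these are assumptions for illustration.

```python
from collections import Counter

def suffix_type_counts(word_types, max_len=4):
    """Count how many distinct word types end in each suffix ("$" = empty suffix)."""
    counts = Counter()
    for w in set(word_types):
        counts["$"] += 1
        # Leave at least one character for the stem.
        for k in range(1, min(max_len, len(w) - 1) + 1):
            counts[w[-k:]] += 1
    return counts

counts = suffix_type_counts(["walks", "talks", "walked", "walking"])
print(counts["$"], counts["s"], counts["ing"])  # 4 2 1
```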
WSJ: potential transforms (1st iteration)
Rank  Transform     # base forms    Rank  Transform     # base forms
1.    ( $, s )      5257            41.   ( $, ally )   154
2.    ( s, $ )      5257            42.   ( ally, $ )   154
3.    ( ed, ing )   1922            43.   ( on, ve )    147
4.    ( ing, ed )   1922            44.   ( ve, on )    147
5.    ( $, 's )     1609            45.   ( $, ity )    144
6.    ( 's, $ )     1609            46.   ( ity, $ )    144
7.    ( $, ed )     1481            47.   ( s, al )     143
8.    ( ed, $ )     1481            48.   ( al, s )     143
9.    ( $, ing )    1335            49.   ( $, ion )    132
10.   ( ing, $ )    1335            50.   ( ion, $ )    132
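Each transform and its reverse have the same count in the table, which is what one would expect if the count is the number of stems attested with both endings. A small sketch of that count, assuming "$" denotes the empty ending and using a hypothetical function name:

```python
def shared_stem_count(vocab, a, b):
    """Number of stems that occur in `vocab` with both ending `a` and ending `b`."""
    a = "" if a == "$" else a
    b = "" if b == "$" else b

    def stems(ending):
        # Stems of all words that end in `ending`, keeping the stem non-empty.
        return {w[: len(w) - len(ending)] for w in vocab
                if w.endswith(ending) and len(w) > len(ending)}

    return len(stems(a) & stems(b))

vocab = {"walk", "walks", "talk", "talks", "walked", "walking", "jump"}
print(shared_stem_count(vocab, "$", "s"))     # 2
print(shared_stem_count(vocab, "s", "$"))     # 2
print(shared_stem_count(vocab, "ed", "ing"))  # 1
```

Because the count is symmetric, it cannot by itself say which member of a pair is the base form.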
Choose direction of transform
base        freq   inflected     freq   b>d   d>b
alcoholic   3      alcoholics    4            x
alibi       8      alibis        1      x
alter       15     alters        1      x
alteration  7      alterations   7
amateur     25     amateurs      2      x
ambition    19     ambitions     15     x
• Table for ( $, s )
• Base greater: 3750, Inflected greater: 817
• Choose ( $, s ) instead of ( s, $ )
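The direction choice can be sketched as a vote over word pairs: count the pairs in which the proposed base is more frequent than the proposed inflected form, count the opposite case, and keep the direction with more votes. The frequencies below are the six rows from the table; ignoring equal frequencies and keeping the proposed direction on a tied vote are assumptions.

```python
def direction_votes(pairs, freq):
    """Return (# pairs where base is more frequent, # pairs where inflected is)."""
    base_greater = sum(1 for b, d in pairs if freq[b] > freq[d])
    infl_greater = sum(1 for b, d in pairs if freq[d] > freq[b])
    return base_greater, infl_greater

def choose_direction(transform, pairs, freq):
    a, b = transform
    base_greater, infl_greater = direction_votes(pairs, freq)
    # A tied vote keeps the proposed direction (assumed tie-break).
    return (a, b) if base_greater >= infl_greater else (b, a)

freq = {"alcoholic": 3, "alcoholics": 4, "alibi": 8, "alibis": 1,
        "alter": 15, "alters": 1, "alteration": 7, "alterations": 7,
        "amateur": 25, "amateurs": 2, "ambition": 19, "ambitions": 15}
pairs = [(w, w + "s") for w in
         ["alcoholic", "alibi", "alter", "alteration", "amateur", "ambition"]]
print(direction_votes(pairs, freq))               # (4, 1)
print(choose_direction(("$", "s"), pairs, freq))  # ('$', 's')
```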
WSJ: sequence of transforms added
1    ( $, s )      11   ( $, e )
2    ( $, ed )     12   ( $, y )
3    ( $, ing )    13   ( $, ness )
4    ( $, 's )     14   ( $, man )
5    ( $, d )      15   ( $, ers )
6    ( $, ly )     16   ( er, ing )
7    ( e, ing )    17   ( r, st )
8    ( y, ies )    18   ( er, ed )
9    ( $, s' )     19   ( e, ion )
10   ( $, es )     20   ( $, able )
Morphochallenge English data
• High number of word types ( ~250,000 )
leads to spurious transforms
• ( $, a )
(music, musica) (naam,naama)
(nucci,nuccia) (retin,retina)
(mash,masha) (gab,gaba)
• ( $, o )
(rutili,rutilio)
(vern,verno)
(rikky,rikkyo)
(lazar,lazaro)
(berk,berko)
(economic,economico)
Summary
• Base-and-transforms model of morphological
paradigms
– First step towards learning morphophonological rules
– More linguistically satisfying than stem-and-suffix
• Algorithm:
– learn inventory of base forms
– learn transforms (base-specific rules)
• Exploits high freq. of base inflectional category
More slides available…
• Longer version of this presentation
– base forms simplify POS induction
• Different system: transforms in parallel
– Slovene, Spanish