Finite-State Methods in Natural
Language Processing
Lauri Karttunen
LSA 2005 Summer Institute
July 18, 2005
Course Outline
July 18:
Intro to computational morphology
XFST
Readings
Lauri Karttunen, “Finite-State Constraints”, The Last
Phonological Rule. J. Goldsmith (ed.), pages 173-194,
University of Chicago Press, 1993.
Karttunen and Beesley, “25 Years of Finite-State Morphology”
Chapter 1: “Gentle Introduction” (B&K)
July 20:
Regular expressions
More on XFST
Readings
Chapter 2: “Systematic Introduction”
Chapter 3: “The XFST interface”
July 25
Concatenative morphotactics
Constraining non-local dependencies
Readings
Chapter 4. “The LEXC Language”
Chapter 5. “Flag Diacritics”
July 27
Non-concatenative morphotactics
Reduplication, interdigitation
Readings
Chapter 8. “Non-Concatenative Morphotactics”
August 1
Realizational morphology
Readings
Gregory T. Stump. Inflectional Morphology. A Theory of Paradigm
Structure. Cambridge U. Press. 2001. (An excerpt)
Lauri Karttunen, “Computing with Realizational Morphology”,
Lecture Notes in Computer Science, Volume 2588, Alexander
Gelbukh (ed.), 205-216, Springer Verlag. 2003.
August 3
Optimality theory
Readings
Paul Kiparsky “Finnish Noun Inflection” Generative Approaches to
Finnic and Saami Linguistics, Diane Nelson and Satu Manninen
(eds.), pp.109-161, CSLI Publications, 2003.
Nine Elenbaas and René Kager. "Ternary rhythm and the lapse
constraint". Phonology 16. 273-329.
Getting credit for LSA 207
There will be three assignments, given on each
Wednesday. The first two are to be turned in by
the following Monday, the last one by the following
Friday.
You will get credit for the course if you solve at least
two of the three assignments. The solutions will
involve programming in the xfst scripting
language. The problems will be easy to solve if you
have attended the class.
If you have any problems in doing the assignments,
Michael Wagner and I will be happy to help you.
Textbook
Copies will arrive in the
Linguistics Department
tomorrow afternoon.
You can purchase a copy there
tomorrow as soon as the books
have arrived.
Starting Wednesday, books can
Be purchased from our TA,
Michael Wagner.
The price is $35.
With the book comes a
software CD for Solaris,
Linux, MacOSX and Windows
operating systems.
LSA 207 Web site
http://lsa.dlp.mit.edu/Class/207
You can use this username and password to access
materials:
Username: LSA207
Password: seunsehi207
Your are free to copy, modify and use the slides for
whatever purpose provided that you give
appropriate credit to the original source.
The readings for Wednesday’s class (“Finite-State
Constraints”, “25 Years of Finite-State
Morphology” and “Gentle Introduction” (Chapter 1
of B&K book) are posted on the web site).
Software
The software on the Book CD dates back to the
Spring of 2003. For an update, point your browser
to
http://www.stanford.edu/~laurik/.lsa207/
Please read the README file and the License
Agreement before downloading the software.
The updated software supports UTF-8 encoded
Unicode input/output. The Book version supports
only Latin-1 (ISO-8859-1).
The XFST application will be available locally on some
computers (ask Michael).
Check out the web site for the Book:
http://www.fsmbook.com/
Finite-State Methods in NLP
Domains of Application
Tokenization
Sentence breaking
Spelling correction
Morphology (analysis/generation)
Phonological disambiguation (Speech Recognition)
Morphological disambiguation (“Tagging”)
Pattern matching (“Named Entity Recognition”)
Shallow Parsing
Types of Finite-State Systems
Classical (non-weighted) automata
Weighted (associated with weights in a semi-ring)
Binary relations (simple transducers)
N-ary relations (multi-tape transducers)
Computational morphology
Analysis
leaf N Pl
leave N Pl
leaves
Generation
leave V Sg3
hang V Past
hanged
hung
Two challenges
Morphotactics
Words are composed of smaller elements that
must be combined in a certain order:
piti-less-ness is English
piti-ness-less is not English
Phonological alternations
The shape of an element may vary depending on
the context
pity is realized as piti in pitilessness
die becomes dy in dying
Morphology is regular (=rational)
The relation between the surface forms of a language
and the corresponding lexical forms can be described
as a regular relation.
A regular relation consists of ordered pairs of strings.
leaf+N+Pl : leaves
hang+V+Past : hung
Any finite collection of such pairs is a regular relation.
Regular relations are closed under operations such as
concatenation, iteration, union, and composition.
Complex regular relations can be derived from simple relations.
Morphology is finite-state
A regular relation can be defined using the
metalanguage of regular expressions.
[{talk} | {walk} | {work}]
[%+Base:0 | %+SgGen3:s | %+Progr:{ing} | %+Past:{ed}];
A regular expression can be compiled into a finitestate transducer that implements the relation
computationally.
Compilation
Regular expression
[{talk} | {walk} | {work}]
[%+Base:0 | %+SgGen3:s | %+Progr:{ing} | %+Past:{ed}];
Finite-state transducer
+Base:
+3rdSg:s
a
t
+Progr:i
a
w
o
initial
state
final
state
l
k
r
+Past:e
:n
:g
:d
Generation
work+3rdSg --> works
+Base:
+3rdSg:s
a:a
t:t
+Progr:i
w:w
a:a
o:o
l:l
r:r
k:k
+Past:e
:n
:g
:d
Analysis
+Base:
+3rdSg:s
a:a
t:t
+Progr:i
w:w
a:a
o:o
l:l
r:r
k:k
+Past:e
talked --> talk+Past
:n
:g
:d
XFST Demo 1
start xfst
% xfst
xfst[0]:
compile a regular expression
xfst[0]: regex
[{talk} | {walk} | {work}]
[% +Base:0 | %+SgGen3:s | %+Progr:{ing} | %+Past:{ed}];
xfst[1]: apply up walked
walk+Past
xfst[1]: apply down talk+SgGen3
talks
apply the result
Lexical transducer
Bidirectional: generation or analysis
Compact and fast
Comprehensive systems have been
built for over 40 languages:
vouloir +IndP +SG + P3
Finite-state
transducer
English, German, Dutch, French,
Italian, Spanish, Portuguese,
Finnish, Russian, Turkish,
Japanese, Korean, Basque, Greek,
Arabic, Hebrew, Bulgarian, …
veut
citation form
v
o
u
v
e
u
l
o
inflection codes
i
r
+IndP
+SG
+P3
t
inflected form
How lexical transducers are made
Morphotactics
Lexicon
Regular Expression
Lexicon
FST
Compiler
Lexical Transducer
(a single FST)
composition
Rules
Regular Expressions
Rule
FSTs
Alternations
f
a
t
f
a
t
+Adj
t
+Comp
e
r
Sequential Model
Lexical form
fst 1
Intermediate form
fst 2
...
fst n
Surface form
Ordered sequence
of rewrite rules
(Chomsky & Halle ‘68)
can be modeled
by a cascade of
finite-state transducers
Johnson ‘72
Kaplan & Kay ‘81
Discovery and Rediscovery
C. Douglas Johnson (1972) showed that
– phonological rewrite rules are interpreted in a way that
makes them less powerful than they appear
– rewrite rules can be modeled by finite transducers
– for any two finite transducers applied in a sequence there
exists an equivalent single transducer (Schützenberger
1961).
Johnson’s result was ignored and forgotten,
rediscovered by Ronald M. Kaplan and Martin Kay
at Xerox around 1980.
Application constraint
Phonological rewrite rules are not as powerful as they appear
because of the constraint that a rule does not apply to its own
output. (Johnson 1972, Kaplan&Kay 1980).
Sequential application
k a N p a n
N -> m / _ p
k a m p a n
p -> m / m _
k a m m a n
Sequential application in detail
k a N p a n
N:m
p
m
0 0 0 2 0 0 0
?
N:m
m
0
?
p
k a m p a n
1
N
0 0 0 1 0 0 0
N
m
p
k a m m a n
2
?
0
?
p:m
1
m
Composition
N:m
3
p:m
k a N p a n
0 0 0 3 0 0 0
k a m m a n
N:m
m
?
0
2
?
p:m
p
N
?
m
N:m
N
m
1
N
Parallel Model
Lexical form
fst 1
fst 2
...
Surface form
Set of parallel
of two-level rules (constraints)
compiled into finite-state automata
interpreted as transducers
Koskenniemi ‘83
fst n
Sequential vs. parallel rules
Chomsky&Halle 1968
Koskenniemi 1983
Lexical form
Lexical form
rule 1
rule 1
rule 2
...
Intermediate form
rule 1
Surface form
...
rule n
Surface form
FST
rule n
Rewrite rules
? u: t
y ? A s
Epenthesis
? u: t I y ? A s
Harmony
? u: t u y ? a s
Lowering
? o: t u y ? a s
Yawelmani Vowel
Harmony
Kisseberth 1969
Two-level constraints
? u: t 0 y ? A s
? o: t u y ? a s
Epenthesis: Insert u or i (underspecification)
Harmony: Rounding next to a round V of the same height.
Lowering: Long u always realized as long o.
Underlying representation controls all three alternations.
Rewrite Rules vs. Constraints
•
•
Two different ways of decomposing the
complex relation between lexical and
surface forms into a set of simpler relations
that can be more easily understood and
manipulated.
One approach may be more convenient than
the other for particular applications.
The Big Picture
{a}
Language
or
Relation
describes
Regular
Expression
a
encodes
compiles
into
Finite-State
Network
a
XFST Demo 2
xfst[0]: define Cat {cat} | {tiger} | {lion};
defined Cat: 640 bytes. 11 states, 12 arcs, 3 paths. ...
xfst[0]:
xfst[0]: set verbose off
xfst[0]: define Dog {dog} | {spaniel} | {poodle};
xfst[0]: regex Cat | Dog ;
xfst[1]: apply up
apply up> dog
dog
apply up> panther
apply up>
apply up> END;
xfst[1]: define Animal
xfst[0]:
xfst[0]: regex Cat & Dog;
xfst[1]: print net
Sigma: a c d e g i l n o p r s t
Size: 13, Label Map: Default
Net:
Flags: deterministic, pruned, minimized, epsilon_free, ...
s0:
(no arcs)
xfst[1]:
xfst[1]: pop
xfst[0]:
xfst[0]: regex Animal - Dog;
xfst[1]: push Cat
xfst[2]: test equivalent
1, (0=NO,1=YES)
xfst[2]: clear
xfst[0]:
Compiling networks from words
Network
xfst[0]: read text
c
l
e
clear
clever
e
ear
ever
f
fat
a
t
father
^D
432 bytes. 10 states, 12 arcs, 6 paths.
a
v
r
e
h
read text < file
read regex {clear}|{clever}|{ear}|{ever}|{fat}|{father} ;
Regular Expression Calculus
Symbols
Simple symbols vs. symbol pairs
Special symbols: ANY, EPSILON
Common regular expression operators
concatenation, union, intersection, negation,
composition
Xerox operators
contains, restriction, replacement
Symbols and Labels
Single and multicharacter symbols
a, b, c, … , +Adj, +SG, ^Fin
Special symbols
0 EPSILON
? ANY
Symbols vs. symbol pairs
In general, no distinction is made between
a
the language {“a”}
a:a
the identity relation {<“a”, “a”>}
a
Common RE Operators
concatenation
* +
iteration
|
union
&
intersection*
~ \ - complementation*, minus*
.x. : crossproduct
.o.
composition
* = not applicable to regular relations because the
result may not be encodable by a finite-state
network.
Iteration
A*
A+
zero or more contatenations of A
one or more concatenations of A
?*
the universal language/
the universal identity relation
?
b:B
a:A
c:C
[a:A | b:B | c:C | d:D | …
d:D
]*
Negation
\A
~A
any single symbol that is not in A
\? the null language
any string that is not in A
a
a
\a
~a
Sigma: a, ?
a
?
a
?
?
a
?
Crossproduct
A .x. B
The relation that maps every string in A to every
string in B, and vice versa
A:B
Same as [A .x. B].
a:x
a b c .x. x y
b:y
c:0
[a b c] : [x y]
{abc}:{xy}
Composition
A .o. B
The relation C such that if A maps
x to y and B maps y to z, C maps x to z.
a
b
b:B
a:A
c:C
d:D
c
a:A
b:B
c:C
{abc} .o. [a:A | b:B | c:C | d:D]*
Xerox RE Operators
$
=>
-> @->
containment
restriction
replacement
Make it easier to describe complex
languages and relations without
extending the formal power of finite-state
systems.
Containment
$a
?
[?* a ?*]
a
?
a
Restriction
b
a => b _ c
b
?
“Any a must be preceded by b
and followed by c.”
c
c
?
c
~[~[?* b] a ?*] & ~[?* a ~[c ?*]]
Equivalent expression
a
Descargar

Document