Search and Decoding in
Speech Recognition
Words and Transducers
Introduction


From Ch. 1 on regular expressions we saw how easy it is to search for the plural of
woodchuck (woodchucks).
However, searching for the plural of fox, fish, peccary or wild goose is not as trivial as just
tacking on an s.




Main Entry: fox
Pronunciation: 'fäks
Function: noun
Inflected Form(s): plural fox·es also fox
Usage: often attributive
Etymology: Middle English, from Old English; akin to Old High German fuhs fox and perhaps to
Sanskrit puccha tail
Main Entry: fish
Pronunciation: 'fish
Function: noun
Inflected Form(s): plural fish or fish·es
Usage: often attributive
Etymology: Middle English, from Old English fisc; akin to Old High German fisc fish, Latin piscis
Main Entry: pec·ca·ry
Pronunciation: 'pe-k&-rE
Function: noun
Inflected Form(s): plural -ries
Etymology: of Cariban origin; akin to Suriname Carib paki:ra peccary
: any of several largely nocturnal gregarious American mammals resembling the related pigs: as
a : a grizzled animal (Tayassu tajacu) with an indistinct white collar b : a blackish animal
(Tayassu pecari) with a whitish mouth region
Main Entry: goose
Pronunciation: 'güs
Function: noun
Inflected Form(s): plural geese /'gEs/
Etymology: Middle English gos, from Old English gOs; akin to Old High German gans goose, Latin
anser, Greek chEn
Introduction

Required knowledge to correctly search for singulars and plurals in English:
1. Orthographic rules: words ending in –y are pluralized by changing the –y to –i and adding –es.
2. Morphological rules: tell us that fish has a null plural and that the plural of goose is formed by changing the vowel.
3. Morphological parsing: recognizing that a word (like foxes) breaks down into component morphemes (fox and –es) and building a structured representation of it.

Parsing means taking an input and producing some sort of linguistic structure for it.

Parsing can be thought of in broad terms as producing structures based on:
 Morphology,
 Syntax,
 Semantics, or
 Discourse,
producing a:
 String,
 Tree, or
 Network.
Introduction
 Morphological parsing (or stemming) applies to many affixes other than plurals.
 Example: parsing any English verb ending in –ing (e.g., going, talking, congratulating) into its verbal stem plus the –ing morpheme:
 going ⇨ VERB-go + GERUND-ing
 Morphological parsing is important for speech and language processing:
 Part-of-speech tagging
 Dictionaries (spell-checking)
 Machine translation
Introduction
 To solve the morphological parsing problem one could just store all the plural forms of English nouns and –ing forms of English verbs in a dictionary, as in English speech recognition tasks.
 For many Natural Language Processing applications this is not possible, because –ing is a productive suffix: it applies to virtually every verb.
 Similarly, –s applies to almost every noun.
 Productive suffixes also apply to new words:
 Example: fax and faxing.
 New words (e.g., acronyms and proper nouns) are created constantly, and we need to add the plural morpheme –s to each.
 The plural form of a new noun depends on the spelling/pronunciation of the singular form (e.g., for nouns ending in –z the plural is formed by adding –es).
 In other languages (e.g., Turkish) one cannot list all the morphological variants of every word:
 Turkish verbs have 40,000 possible forms, not counting derivational suffixes.
Outline
1. Survey of morphological knowledge for English and some other languages.
2. Introduction of the finite-state transducer as the key algorithm for morphological parsing.
   i. Finite-state transducers are key algorithms for speech and language processing.
3. Related algorithms:
   i. Stemming: mapping from a word to its root or stem. Important for Information Retrieval tasks.
   ii. Need to know if two words have a similar root despite their surface differences.
      a. Example: sang and sung. The word sing is called the common lemma of these words, and mapping from all of them to sing is called lemmatization.
Outline
   iii. Tokenization or word segmentation – a related algorithm to morphological parsing, defined as the task of separating out (tokenizing) words from running text.
      a. English text separates words by white space, but:
          New York, rock ‘n’ roll – are considered single words
          I’m – is considered two words, “I” and “am”
4. For many applications we need to know how similar two words are orthographically.
   i. Morphological parsing is one method for computing similarity;
   ii. another is comparison of strings of letters via the minimum edit distance algorithm.
Survey of English Morphology
 Morphology is the study of the way words
are built up from smaller meaning-bearing
units - morphemes.
 Morpheme is often defined as the minimal
meaning-bearing unit in a language.
 Main Entry: mor·pheme
Pronunciation: 'mor-"fEm
Function: noun
Etymology: French morphème, from Greek
morphE form
: a distinctive collocation of phonemes (as the
free form pin or the bound form -s of pins)
having no smaller meaningful parts
Survey of English Morphology
 Example:
 fox consists of a single morpheme: fox.
 cats consists of two morphemes: cat and –s.
 Two broad classes of morphemes:
1. Stems – the main morpheme of a word, and
2. Affixes – add additional meaning to the word.
   i. Prefixes – precede the stem: unbuckle
   ii. Suffixes – follow the stem: eats
   iii. Infixes – inserted in the stem: humingi (Philippine language Tagalog)
   iv. Circumfixes – precede and follow the stem: gesagt (German past participle of sagen)
Survey of English Morphology
 A word can have more than one affix:
 rewrites:
 Prefix: re-
 Stem: write
 Suffix: -s
 unbelievably:
 Prefix: un-
 Stem: believe
 Suffixes: -able, -ly
 English does not tend to stack more than four or five affixes.
 Turkish can have words with nine or ten affixes; languages like Turkish are called agglutinative languages.
Survey of English Morphology
 There are many ways to combine morphemes to create a word. Four methods are common and play an important role in speech and language processing:
 Inflection
 Combination of a word stem with a grammatical morpheme, usually resulting in a word of the same class as the original stem, and usually filling some syntactic function like agreement.
 Example:
 -s: plural of nouns
 -ed: past tense of verbs
Survey of English Morphology
 Derivation
 Combination of a word stem with a grammatical morpheme, usually resulting in a word of a different class, often with a meaning hard to predict.
 Example:
 computerize – verb
 computerization – noun
 Compounding
 Combination of multiple word stems together.
 Example:
 doghouse – dog + house
 Cliticization
 Combination of a word stem with a clitic. A clitic is a morpheme that acts syntactically like a word, but is reduced in form and attached (phonologically and sometimes orthographically) to another word.
 Example:
 I’ve = I + ’ve = I + have
Inflectional Morphology
 English has a relatively simple inflectional system; only
 nouns,
 verbs, and
 adjectives (sometimes)
can be inflected.
 Nouns have only two kinds of inflection:
 plural
 possessive
 Many (but not all) nouns can either appear in
 the bare stem or singular form, or
 take a plural suffix.
 The number of possible inflectional affixes is quite small.
 Nouns (English):

            Regular Nouns        Irregular Nouns
Singular    cat      thrush      mouse    ox
Plural      cats     thrushes    mice     oxen
Inflectional Morphology
 Regular plurals are spelled:
 -s, or
 -es after words ending in
 –s (ibis/ibises)
 -z (waltz/waltzes)
 -sh (thrush/thrushes)
 -ch (finch/finches)
 and sometimes -x (box/boxes)
 Nouns ending in –y preceded by a consonant change the –y to –i (butterfly/butterflies).
 The possessive suffix is realized by apostrophe + -s for
 regular singular nouns (llama’s), and
 plural nouns not ending in –s (children’s),
 and by a lone apostrophe after
 regular plural nouns (llamas’), and some
 names ending in –s or –z (Euripides’ Comedies).
Inflectional Morphology
 English inflection of verbs is more complicated than nominal inflection.
1. English has three kinds of verbs:
   i. Main verbs (eat, sleep, impeach)
   ii. Modal verbs (can, will, should)
   iii. Primary verbs (be, have, do)
 We are concerned with main and primary verbs because these have inflectional endings.
 Of these verbs a large class are regular (all verbs in this class have the same endings marking the same functions).
Regular Verbs
 Regular verbs have four morphological forms.
 For regular verbs we can predict the other forms from the stem by adding one of three predictable endings and making some regular spelling changes.

Morphological Form Class      Regularly Inflected Verbs
Stem                          walk     merge     try      map
-s form                       walks    merges    tries    maps
-ing participle               walking  merging   trying   mapping
Past form or –ed participle   walked   merged    tried    mapped
Regular Verbs
 Since regular verbs
1. cover the majority of verbs and forms, and
2. the regular class is productive,
they are significant in the morphology of English.
 A productive class is one that automatically includes any new words that enter the language.
Irregular Verbs
 Irregular verbs are those that have some more or less idiosyncratic forms of inflection.
 English irregular verbs
 often have five different forms, but can have
 as many as eight (e.g., the verb be), or
 as few as three (e.g., cut or hit).
 They constitute a smaller class of verbs, estimated to be about 250.

Morphological Form Class   Irregularly Inflected Verbs
Stem                       eat      catch      cut
-s form                    eats     catches    cuts
-ing participle            eating   catching   cutting
Past form                  ate      caught     cut
–ed participle             eaten    caught     cut
Usage of Morphological Forms for
Irregular Verbs
 The –s form:
 Used in the “habitual present” to distinguish the third-person singular ending (“She jogs every Tuesday”) from the other choices of person and number (“I/you/we/they jog every Tuesday”).
 The stem form:
 Used in the infinitive form, and also after certain other verbs: “I’d rather walk home, I want to walk home.”
 The –ing participle is used in the progressive construction to mark a present or ongoing activity (“It is raining”), or when the verb is treated as a noun; this particular kind of nominal use of a verb is called the gerund use: “Fishing is fine if you live near water.”
 The –ed participle is used in the perfect construction (“He’s eaten lunch already”) or the passive construction (“The verdict was overturned yesterday”).
Spelling Changes
 A number of regular spelling changes occur at morpheme boundaries.
 Example:
 A single consonant letter is doubled before adding the –ing and –ed suffixes: beg/begging/begged
 If the final letter is “c”, the doubling is spelled “ck”: picnic/picnicking/picnicked
 If the base ends in a silent –e, it is deleted before adding –ing and –ed: merge/merging/merged
 Just as for nouns, the –s ending is spelled
 –es after verb stems ending in –s (toss/tosses)
 -z (waltz/waltzes)
 -sh (wash/washes)
 -ch (catch/catches)
 and sometimes -x (tax/taxes).
 Also like nouns, verbs ending in –y preceded by a consonant change the –y to –i (try/tries).
Derivational Morphology
 Derivation is the combination of a word stem with a grammatical morpheme,
 usually resulting in a word of a different class,
 often with a meaning hard to predict exactly.
 English inflection is relatively simple compared to other languages.
 Derivation in English is quite complex.
Derivational Morphology
 A common kind of derivation in English is the formation of new nouns
 from verbs, or
 from adjectives; this is called nominalization.
 Example:
 The suffix –ation produces nouns from verbs, often verbs ending in the suffix –ize (computerize → computerization).

Suffix   Base Verb/Adjective   Derived Noun
-ation   computerize (V)       computerization
-ee      appoint (V)           appointee
-er      kill (V)              killer
-ness    fuzzy (A)             fuzziness
Derivational Morphology
 Adjectives can also be derived from nouns and verbs:

Suffix   Base Noun/Verb      Derived Adjective
-al      computation (N)     computational
-able    embrace (V)         embraceable
-less    clue (N)            clueless
Complexity of Derivation in English
Language
 There are a number of reasons for the complexity of derivation in English:
1. It is generally less productive:
 A nominalizing suffix like –ation, which can be added to almost any verb ending in –ize, cannot be added to absolutely every verb.
 Example: we can’t say *eatation or *spellation (the asterisk * marks forms that are not legal English words).
2. There are subtle and complex meaning differences among nominalizing suffixes.
 Example: sincerity vs. sincereness
Cliticization
 A clitic is a unit whose status lies in between that of an affix and a word.
 Phonological behavior:
 short
 unaccented
 Syntactic behavior:
 acts as a word, e.g., as a:
 pronoun,
 article,
 conjunction, or
 verb
Cliticization
Full Form   Clitic      Full Form   Clitic
am          ’m          have        ’ve
are         ’re         has         ’s
is          ’s          had         ’d
will        ’ll         would       ’d

 Ambiguity
 She’s → she is or she has
Agreement
 In English, plural is marked on both nouns and verbs.
 Consequently the subject noun and the main verb have to agree in number:
 both must be either singular or plural.
Finite-State Morphological Parsing
 The goal of morphological parsing is to take the input forms in the first column of the following table and produce the output forms in the second column.

Input     Morphologically Parsed Output     Input      Morphologically Parsed Output
cats      cat +N +PL                        goose      goose +V
cat       cat +N +SG                        gooses     goose +V +1P +SG
cities    city +N +PL                       merging    merge +V +PresPart
geese     goose +N +PL                      caught     catch +V +PastPart
goose     goose +N +SG                      caught     catch +V +Past
Finite-State Morphological Parsing
 The second and fourth columns of the table on the previous slide contain the stem of each word as well as assorted morphological features. These features provide additional information about the stem.
 Example:
 +N – the word is a noun
 +SG – the word is singular
 Some of the input forms may be ambiguous:
 caught
 goose
 For now we will consider the goal of morphological parsing to be merely listing all possible parses. The task of disambiguating among morphological parses will be discussed in Chapter 5.
Requirements of a Morphological Parser
1. Lexicon: the list of stems and affixes, together with basic information about them.
 Example: whether a stem is a Noun or a Verb, etc.
2. Morphotactics: the model of morpheme ordering that explains which classes of morphemes can follow other classes of morphemes inside a word.
 Example: the English plural morpheme follows the noun and does not precede it.
3. Orthographic rules: spelling rules used to model the changes that occur in a word, usually when two morphemes combine.
 Example: the y → ie spelling rule, as in city + -s → cities.
Requirements of a Morphological Parser
 In the next section we will present:
 a representation of a simple lexicon for the sub-problem of morphological recognition,
 FSAs built to model morphotactic knowledge, and
 the Finite-State Transducer (FST), introduced as a way of modeling morphological features in the lexicon.
Building a Finite-State
Lexicon
Lexicon
 A lexicon is a repository for words.
 The simplest possible lexicon would consist of an explicit list of every word of the language,
 where every word means including
 abbreviations: AAA
 proper names: Jane or Beijing, etc.
 Example: a, AAA, AA, Aachen, aardvark, aardwolf, aba, abaca, aback, …
 For the various reasons discussed above, in general it will be inconvenient or even impossible to list every word in a language.
 Computational lexicons are usually structured with
 a list of each of the stems and affixes of the language, and
 a representation of the morphotactics.
 One of the most common ways to model morphotactics is with the finite-state automaton.
Example of FSA for English nominal
inflection
[Figure: FSA for English nominal inflection, with start state q0, states q1 and q2, and arcs labeled reg-noun, plural -s, irreg-sg-noun, and irreg-pl-noun.]

 This FSA assumes that the lexicon includes
 regular nouns (reg-noun) that take the regular –s plural: cat, dog, aardvark
 (ignoring for now that the plural of words like fox has an inserted e: foxes), and
 irregular noun forms that don’t take –s, both:
 singular (irreg-sg-noun): goose, mouse, sheep
 plural (irreg-pl-noun): geese, mice, sheep
Example for English verbal
inflection
[Figure: FSA for English verbal inflection, with start state q0, states q1–q3, stem arcs labeled reg-verb-stem, irreg-verb-stem, and irreg-past-verb-form, and affix arcs labeled past (-ed), past participle (-ed), present participle (-ing), and 3sg (-s).]

 This lexicon has three stem classes:
 reg-verb-stem
 irreg-verb-stem
 irreg-past-verb-form
 Four affix classes:
 -ed: past
 -ed: participle
 -ing: participle
 -s: third person singular
Examples of Verb Inflections

reg-verb-stem   irreg-verb-stem   irreg-past-verb   past   past-part   pres-part   3sg
walk            cut               caught            -ed    -ed         -ing        -s
fry             speak             ate
talk            sing              eaten
impeach                           sang
English Derivational Morphology
 As discussed earlier in this chapter, English derivational morphology is significantly more complex than English inflectional morphology.
 FSAs that model derivational morphology thus tend to be quite complex.
 Some models of English derivation are in fact based on the more complex context-free grammars.
Morphotactics of English Adjectives
 Example of a simple case of derivation, from Antworth (1990):
 big, bigger, biggest
 happy, happier, happiest, happily
 unhappy, unhappier, unhappiest, unhappily
 clear, clearer, clearest, clearly, unclear, unclearly
 cool, cooler, coolest, coolly
 red, redder, reddest
 real, unreal, really

[Figure: FSA for this adjective fragment: q0 --un- (or ε)--> q1 --adj-root--> q2 --(-er, -est, -ly, or ε)--> q3.]
Problem Issues
 While the FSA on the previous slide will recognize all the adjectives in the table presented earlier,
 it will also recognize ungrammatical forms like:
 unbig, unfast, oranger, or smally
 To fix this, adj-root would have to be divided into classes of adjectives that:
1. can occur with un- and –ly: clear, happy, and real
2. cannot: big, small, etc.
 This simple example gives an idea of the complexity to be expected from English derivation.
Derivational Morphology Example 2
 This FSA models a number of derivational facts:
 the generalization that any verb ending in –ize can be followed by the nominalizing suffix –ation;
 -al or –able → -ity or -ness.
 See Exercise 3.1 to discover some of the individual exceptions to many of these constructs.
 Example:
 fossil → fossilize → fossilization
 equal → equalize → equalization
 formal → formalize → formalization
 realize → realizable → realization
 natural → naturalness
 casual → casualness

[Figure: FSA for another fragment of English derivational morphology, with states q0–q11 and arcs labeled with stem classes (noun_i, noun_l, adj-al, adj-ous, verb_j, verb_k) and suffixes (-ize/V, -ation/N, -er/N, -able/A, -ity/N, -ness/N, -ive/A, -ly/Adv, -ative/A, -ful/A).]
Solving the Problem of
Morphological Recognition
 Using the FSAs above one could solve the problem of morphological recognition:
 given an input string of letters, does it constitute a legitimate English word or not?
 Take the morphotactic FSAs and plug each “sub-lexicon” into the FSA,
 expanding each arc (e.g., the reg-noun arc) with all the morphemes in that sub-lexicon.
 The resulting FSA can then be defined at the level of the individual letter.
Solving the Problem of
Morphological Recognition
 The noun-recognition FSA below is produced by expanding the nominal inflection FSA of the previous slide with sample regular and irregular nouns for each class.
 We can use this FSA to recognize strings like aardvarks by simply starting at the initial state and comparing the input letter by letter with each word on each outgoing arc, and so on, just as we saw in Ch. 2.

[Figure: expanded FSA for a few English nouns with their inflection, spelling out fox, cat, and aardvark along the reg-noun arc, goose along the irreg-sg-noun arc, and geese along the irreg-pl-noun arc, followed by the plural -s. Note that this automaton will incorrectly accept the input foxs. We will see beginning on page 19 (ref book) how to correctly deal with the inserted e in foxes.]
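To make the expansion idea concrete, here is a minimal sketch in Perl (the language used for the tokenizer later in these slides). It expands the three sub-lexicons into one letter-level pattern; the sample word lists are assumptions taken from the slides, and, like the automaton in the figure, the sketch wrongly accepts foxs:

#!/usr/bin/perl
use strict; use warnings;

# Sub-lexicons plugged into the nominal-inflection FSA arcs.
my @reg_noun      = qw(fox cat aardvark);     # take the regular -s plural
my @irreg_sg_noun = qw(goose sheep mouse);
my @irreg_pl_noun = qw(geese sheep mice);

# Expand the morphotactic FSA into a single letter-level regex:
# q0 --reg-noun--> q1 --(-s)--> accept, or q0 --irreg-noun--> accept.
my $reg   = join '|', @reg_noun;
my $irreg = join '|', @irreg_sg_noun, @irreg_pl_noun;
my $noun_fsa = qr/^(?:(?:$reg)s?|$irreg)$/;

# Note: like the figure's automaton, this accepts "foxs"; the
# e-insertion spelling rule is handled later in the chapter.
for my $w (qw(aardvarks geese foxs walked)) {
    printf "%-10s %s\n", $w, ($w =~ $noun_fsa ? "accepted" : "rejected");
}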
Finite-State Transducers
Finite-State Transducers (FST)
 An FSA can represent the morphotactic structure of a lexicon and thus can be used for word recognition.
 In this section we introduce the finite-state transducer and show how it can be applied to morphological parsing.
 A transducer maps between one representation and another:
 a finite-state transducer, or FST, is a type of finite automaton which maps between two sets of symbols.
 An FST can be visualized as a two-tape automaton which recognizes or generates pairs of strings.
 Intuitively, we can do this by labeling each arc in the FSM (finite-state machine) with two symbol strings, one from each tape.
 In the figure on the next slide an FST is depicted where each arc is labeled by an input and output string, separated by a colon.
A Finite-State Transducer (FST)
[Figure: an FST with states q0 and q1 and arcs labeled with input:output pairs such as aa:b, b:a, b:∅, b:b, and a:ba.]

 An FST has a more general function than an FSA:
 where an FSA defines a formal language by defining a set of strings, an FST defines a relation between sets of strings.
 An FST can be thought of as a machine that reads one string and generates another.
A Finite-State Transducer (FST)
 FST as recognizer:
 a transducer that takes a pair of strings as input and outputs accept if the string pair is in the string-pair language, and reject if it is not.
 FST as generator:
 a machine that outputs pairs of strings of the language; the output is a yes or no, plus a pair of output strings.
 FST as translator:
 a machine that reads a string and outputs another string.
 FST as set relater:
 a machine that computes relations between sets.
A Finite-State Transducer (FST)
 All four categories of FST on the previous slide have applications in speech and language processing.
 For morphological parsing (and for many other NLP applications) we apply the FST-as-translator metaphor:
 input: a string of letters
 output: a string of morphemes
Formal Definition of FST
Q: a finite set of N states q0, q1, …, qN-1
Σ: a finite set corresponding to the input alphabet
Δ: a finite set corresponding to the output alphabet
q0 ∈ Q: the start state
F ⊆ Q: the set of final states
δ(q, w): the transition function or transition matrix between states. Given a state q ∈ Q and a string w ∈ Σ*, δ(q, w) returns a set of new states Q′ ⊆ Q. δ is thus a function from Q × Σ* to 2^Q (the power set of Q). δ returns a set of states rather than a single state because a given input may be ambiguous as to which state it maps to.
σ(q, w): the output function giving the set of possible output strings for each state and input. Given a state q ∈ Q and a string w ∈ Σ*, σ(q, w) gives a set of output strings, each a string o ∈ Δ*. σ is thus a function from Q × Σ* to 2^(Δ*).
Properties of FST
 FSAs are isomorphic to regular languages ⇔ FSTs are isomorphic to regular relations.
 FSTs and regular relations are closed under union.
 Generally FSTs are not closed under difference, complementation, and intersection.
 In addition to union, FSTs have two closure properties that turn out to be extremely useful:
 Inversion: the inversion of a transducer T, written T⁻¹, simply switches the input and output labels. Thus if T maps from the input alphabet I to the output alphabet O, T⁻¹ maps from O to I.
 Composition: if T1 is a transducer from I1 to O1 and T2 a transducer from O1 to O2, then T1∘T2 maps from I1 to O2.
Properties of FST
 Inversion is useful because it makes it easy to convert an FST-as-parser into an FST-as-generator.
 Composition is useful because it allows us to take two transducers that run in series and replace them with one more complex transducer.
 Composition works as in algebra: applying T1∘T2 to an input sequence S is identical to applying T1 to S and then T2 to the resulting sequence; thus T1∘T2(S) = T2(T1(S)).
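As a small illustration of inversion, here is a sketch only, assuming an FST stored as a list of arcs of the form [from-state, input, output, to-state]:

#!/usr/bin/perl
use strict; use warnings;

# An FST as a list of arcs: [from-state, input, output, to-state].
my @T = ( [0, 'a', 'b', 1], [1, 'b', 'a', 0] );

# Inversion T^-1: swap the input and output label on every arc,
# turning an FST-as-parser into an FST-as-generator.
my @T_inv = map { [ $_->[0], $_->[2], $_->[1], $_->[3] ] } @T;

printf "%d --%s:%s--> %d\n", @$_ for @T_inv;

Composition, by contrast, pairs up states of the two machines and matches the output labels of the first against the input labels of the second, much like the product construction sketched for intersection later in this chapter.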
FST Composition Example
[Figure: the composition of [a:b]+ (a two-state transducer looping on a:b) with [b:c]+ (looping on b:c) yields [a:c]+ (looping on a:c).]

 The composition of [a:b]+ with [b:c]+ produces [a:c]+.
Projection
 The Projection of an FST is the FSA
that is produced by extracting only
one side of the relation.
 We refer to the projection to the left or
upper side of the relation as the upper
or first projection and the projection to
the lower or right side of the relation as
the lower or second projection.
Sequential Transducers and
Determinism
 Transducers as described so far may be nondeterministic: given an input, there may be many possible output symbols.
 Thus using general FSTs requires the kinds of search algorithms discussed in Chapter 1, which makes FSTs quite slow in the general case.
 This implies that it would be nice to have an algorithm to convert a nondeterministic FST to a deterministic one. But while
 every non-deterministic FSA is equivalent to some deterministic FSA,
 not all FSTs can be determinized.
Sequential Transducers
 Sequential transducers are a subtype of transducers that are deterministic on their input.
 At any state of a sequential transducer, each given symbol of the input alphabet Σ can label at most one transition out of that state.

[Figure: a sequential transducer with states q0 and q1 and arcs labeled a:b, b:∅, b:b, and a:ba.]
Sequential Transducers
 Sequential transducers are not necessarily sequential on their output. In the FST of the previous slide, two distinct transitions leaving state q0 have the same output (b).
 The inverse of a sequential transducer may thus not be sequential, so we always need to specify the direction of the transduction when discussing sequentiality.
 Formally, the definition of sequential transducers modifies the δ and σ functions slightly:
 δ becomes a function from Q × Σ* to Q (rather than to 2^Q), and
 σ becomes a function from Q × Σ* to Δ* (rather than to 2^(Δ*)).
 The subsequential transducer is a generalization of the sequential transducer which generates an additional output string at the final states, concatenating it onto the output produced so far.
Importance of Sequential and
Subsequential Transducers
 Sequential and subsequential FSTs are:
 efficient, because their inputs are deterministic → they can be processed in time proportional to the number of symbols in the input (linear in the input length), rather than proportional to some much larger number which is a function of the number of states.
 Efficient algorithms for determinization exist for subsequential transducers (Mohri, 1997), as do algorithms for their minimization (Mohri, 2000).
 While sequential and subsequential FSTs are deterministic and efficient, neither of them is able to handle ambiguity: they transduce each input string to exactly one possible output string.
 Since ambiguity is a crucial property of natural language, it is useful to have an extension of subsequential transducers that can deal with ambiguity but still retain the efficiency and other useful properties of sequential transducers. →
 One such generalization of subsequential transducers is the p-subsequential transducer.
 A p-subsequential transducer allows for p (p ≥ 1) final output strings to be associated with each final state (Mohri, 1996).
 It can handle a finite amount of ambiguity, which is useful for many NLP tasks.
2-subsequential FSA (Mohri 1997)
[Figure: a 2-subsequential FSA (Mohri, 1997) with states q0–q3, arcs labeled a:a, b:a, and b:b, and two final output strings, a and b, associated with the final state q3.]

 Mohri (1996, 1997) shows a number of tasks whose ambiguity can be limited in this way, including the representation of dictionaries, the compilation of morphological and phonological rules, and local syntactic constraints. For each of these kinds of problems, he and others have shown that they are p-subsequentializable, and thus can be determinized and minimized. This class of transducers includes many, although not necessarily all, morphological rules.
FSTs for Morphological Parsing
Task of Morphological Parsing
 Example: cats → cat +N +PL
 In the finite-state morphology paradigm, we represent a word as a correspondence between:
 a lexical level, which represents a concatenation of the morphemes that make up a word, and
 the surface level, which represents the concatenation of letters which make up the actual spelling of the word.

Lexical:   c  a  t  +N  +PL
Surface:   c  a  t  s

Schematic examples of the lexical and surface tapes; the actual transducers will involve intermediate tapes as well.
Lexical Tape
 For finite-state morphology it is convenient to view an FST as having two tapes.
 The upper or lexical tape is composed of characters from one alphabet Σ.
 The lower or surface tape is composed of characters from another alphabet Δ.
 In the two-level morphology of Koskenniemi (1983), each arc is allowed to have only a single symbol from each alphabet.
 A two-symbol alphabet can be obtained by combining the alphabets Σ and Δ to make a new alphabet Σ′.
 The new alphabet Σ′ makes the relationship to FSAs clear:
 Σ′ is a finite alphabet of complex symbols.
 Each complex symbol is composed of an input-output pair i:o:
 i – one symbol from the input alphabet Σ
 o – one symbol from the output alphabet Δ
 Σ′ ⊆ Σ × Δ
 Σ and Δ may each also include the epsilon symbol ε.
Lexical Tape
 Comparing an FSA to an FST for modeling the morphology of a language, the following can be observed:
 an FSA accepts a language stated over a finite alphabet of single symbols, e.g. the sheep language:
 Σ = {b, a, !}
 an FST as defined in the previous slides accepts a language stated over pairs of symbols, as in:
 Σ′ = {a:a, b:b, !:!, a:!, a:ε, ε:!}
Feasible Pairs
 In two-level morphology, the pairs of symbols in Σ′ are also called feasible pairs.
 Each feasible pair symbol a:b in the transducer alphabet Σ′ expresses how the symbol a from one tape is mapped to the symbol b on the other tape.
 Example: a:ε means that an a on the upper tape will correspond to nothing on the lower tape.
 We can write regular expressions in the complex alphabet Σ′ just as in the case of FSAs.
Default Pairs
 Since it is most common for symbols to map to themselves, in two-level morphology we call pairs like a:a default pairs and refer to them by the single letter a.
FST Morphological Parser
 From the morphotactic FSAs covered earlier, and by adding a lexical tape and the appropriate morphological features, we can build an FST morphological parser.
 The next slide presents a figure that augments the FSA for English nominal inflection with nominal morphological features (+Sg and +Pl) that correspond to each morpheme.
 The symbol ^ indicates a morpheme boundary, while
 the symbol # indicates a word boundary.
A Schematic Transducer for English
Nominal Number Inflection Tnum
[Figure: schematic transducer Tnum. From q0, arcs labeled reg-noun, irreg-sg-noun, and irreg-pl-noun lead to q1, q2, and q3; from there, arcs labeled +N:ε lead to q4, q5, and q6; finally, arcs labeled +Sg:# and +Pl:^s# (from q4), +Sg:# (from q5), and +Pl:# (from q6) lead to the final state q7.]

 The symbols above each arc represent elements of the morphological parse on the lexical tape.
 The symbols below each arc represent the surface tape (or the intermediate tape, to be described later), using the morpheme-boundary symbol ^ and the word-boundary marker #.
 The arcs need to be expanded by the individual words in the lexicon.
Transducer & Lexicon
 In order to use the transducer on the previous slide as a morphological noun parser, it needs to be expanded with all the individual regular and irregular noun stems, replacing the labels reg-noun etc.
 This expansion can be done by updating the lexicon for this transducer, so that irregular plurals like geese will parse into the correct stem goose +N +Pl.
 This is achieved by allowing the lexicon to also have two levels:
 surface geese maps to lexical goose → new lexical entry: “g:g o:e o:e s:s e:e”.
 Regular forms are simpler: the two-level entry for fox will now be “f:f o:o x:x”.
 Relying on the orthographic convention that f stands for f:f and so on, we can simply refer to it as fox and to the form for geese as “g o:e o:e s e”.
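Such two-level entries can be read off mechanically. A minimal Perl sketch, assuming an entry is stored as a list of lexical:surface symbol pairs with the default pairs written out in full:

#!/usr/bin/perl
use strict; use warnings;

# A two-level lexicon entry as a list of lexical:surface symbol pairs.
# "g o:e o:e s e" abbreviates "g:g o:e o:e s:s e:e" (default pairs).
my @goose = map { [ split /:/, $_ ] } qw(g:g o:e o:e s:s e:e);

# Read off either tape of the entry.
my $lexical = join '', map { $_->[0] } @goose;   # "goose"
my $surface = join '', map { $_->[1] } @goose;   # "geese"
print "lexical $lexical <-> surface $surface\n";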
Lexicon

reg-noun    irreg-pl-noun      irreg-sg-noun
fox         g o:e o:e s e      goose
cat         sheep              sheep
aardvark    m o:i u:e s:c e    mouse

[Figure: Tlex, the expanded FST from Tnum for a few English nouns with their inflection, spelling out f o x, c a t, and g o:e o:e s e along the stem arcs, followed by +N:ε and then +Sg:# or +Pl:^s# (or +Pl:# for irregular plurals). Note that this automaton will incorrectly accept the input foxs. We will see later how to correctly deal with the inserted e in foxes.]
FST
 The resulting transducer, shown in the previous slide, will map:
 plural nouns into the stem plus the morphological marker +Pl, and
 singular nouns into the stem plus the morphological marker +Sg.
 Example:
 cats → cat +N +Pl
 c:c a:a t:t +N:ε +Pl:^s#
 The output symbols include the morpheme and word boundary markers ^ and # respectively. Thus the lower labels do not correspond exactly to the surface level.
 Hence the tape with these morpheme boundary markers is referred to as the intermediate tape, as shown in the next figure.
Lexical and Intermediate Tapes

Lexical:        f  o  x  +N  +PL
Intermediate:   f  o  x  ^   s   #

A schematic view of the lexical and intermediate tapes.
Transducers and Orthographic Rules
 The method described in the previous section will successfully recognize words like aardvarks and mice.
 However, simply concatenating morphemes won’t work for cases where there is a spelling change, leading to:
 incorrect rejection of inputs like foxes, and
 incorrect acceptance of inputs like foxs.
 This is because English often requires spelling changes at morpheme boundaries:
 hence the introduction of spelling (or orthographic) rules.
Notations for Writing Orthographic
Rules
 Implementing a rule in a transducer is important in general for speech and language processing.
 The table below introduces a number of spelling rules.

Name                 Description of Rule                                 Example
Consonant doubling   1-letter consonant doubled before –ing/–ed          beg/begging
E deletion           Silent –e dropped before –ing and –ed               make/making
E insertion          –e added after –s, –z, –x, –ch, –sh before –s       watch/watches
Y replacement        –y changes to –ie before –s, to –i before –ed       try/tries
K insertion          Verbs ending with vowel + –c add –k                 panic/panicked
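In the generation direction these rules are easy to approximate with ordinary string operations. The Perl sketch below is an assumption-laden approximation (the stem and suffix are given explicitly, and the consonant-doubling test is cruder than the real stress-sensitive rule), not the transducer formulation developed next:

#!/usr/bin/perl
use strict; use warnings;

# The spelling rules of the table, generation direction: given an
# explicit stem and suffix, produce the surface form.
sub attach {
    my ($stem, $suf) = @_;
    # K insertion: panic + ed -> panicked
    return $stem . 'k' . $suf if $stem =~ /[aeiou]c$/ && $suf =~ /^(ing|ed)$/;
    # Consonant doubling (crude test): beg + ing -> begging
    return $stem . substr($stem, -1) . $suf
        if $suf =~ /^(ing|ed)$/ && $stem =~ /[^aeiou][aeiou][bdglmnprt]$/;
    # E deletion: make + ing -> making
    return substr($stem, 0, -1) . $suf if $stem =~ /e$/ && $suf =~ /^(ing|ed)$/;
    # E insertion: watch + s -> watches
    return $stem . 'e' . $suf if $suf eq 's' && $stem =~ /(s|z|x|ch|sh)$/;
    # Y replacement: try + s -> tries, try + ed -> tried
    return substr($stem, 0, -1) . 'ie' . $suf if $suf eq 's'  && $stem =~ /[^aeiou]y$/;
    return substr($stem, 0, -1) . 'i'  . $suf if $suf eq 'ed' && $stem =~ /[^aeiou]y$/;
    return $stem . $suf;
}

print attach(@$_), "\n"
    for ([beg => 'ing'], [make => 'ing'], [watch => 's'], [try => 's'], [panic => 'ed']);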
Lexical → Intermediate → Surface
 An example of the lexical, intermediate, and surface tapes. Between each pair of tapes is a two-level transducer: the lexical transducer between the lexical and intermediate levels, and the E-insertion spelling rule between the intermediate and surface levels. The E-insertion spelling rule inserts an e on the surface tape when the intermediate tape has a morpheme boundary ^ followed by the morpheme -s.

Lexical:        f  o  x  +N  +PL
Intermediate:   f  o  x  ^   s   #
Surface:        f  o  x  e   s
Orthographic Rule Example
 The E-insertion rule from the table might be formulated as:
 “Insert an e on the surface tape just when the lexical tape has a morpheme ending in x (or z, etc.) and the next morpheme is –s.” Below is a formalization of this rule:

ε → e / { x, s, z } ^ ___ s #

 This rule notation is due to Chomsky and Halle (1968) and is of the form
 a → b / c ___ d: “rewrite a as b when it occurs between c and d”.
Orthographic Rule Example
 The symbol ε means the empty string; rewriting ε as e means inserting an e.
 Morpheme boundaries ^ are deleted by default on the surface level (^:ε).
 Since the # symbol marks a word boundary, the rule on the previous slide means:
 insert an e after a morpheme-final x, s, or z, and before the morpheme s.
Transducer for E-insertion rule
[Figure: the transducer for the E-insertion rule, with states q0–q5 and arc labels z/s/x, ^:ε, ε:e, s, #, and other. q0 models having seen only default symbols, q1 having seen a z, s, or x, q2 having seen a morpheme boundary after z, s, or x, q3 the inserted e, and q4 the s following the inserted e; q5 covers an s that follows the morpheme boundary without an inserted e.]

State/Input   s:s   x:x   z:z   ^:ε   ε:e   #   other
q0:           1     1     1     0     -     0   0
q1:           1     1     1     2     -     0   0
q2:           5     1     1     0     3     0   0
q3            4     -     -     -     -     -   -
q4            -     -     -     -     -     0   -
q5            1     1     1     2     -     -   0
Combining FST Lexicon and Rules
 We are now ready to combine the lexicon and rule transducers for parsing and generating.
 The figure below depicts the architecture of a two-level cascade of FSTs with lexicon and rules:

Lexical:        f o x +N +Pl
                    |  LEXICON-FST
Intermediate:   f o x ^ s #
                    |  orthographic rules FST1 … FSTn (in parallel)
Surface:        f o x e s
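A toy end-to-end run of this cascade, with the lexicon FST replaced by a hand-written hash (an assumption for illustration only) and the E-insertion rule plus default boundary cleanup applied as rewrites:

#!/usr/bin/perl
use strict; use warnings;

# Stage 1, a toy stand-in for the LEXICON-FST: lexical -> intermediate tape.
my %lex = (
    'fox +N +Pl'   => 'fox^s#',
    'fox +N +Sg'   => 'fox#',
    'goose +N +Pl' => 'geese#',
);

# Stage 2, the E-insertion rule as a rewrite: insert e between a
# morpheme-final x, s, or z and a following morpheme s.
sub e_insert { my ($t) = @_; $t =~ s/([xsz])\^s#/$1es#/g; return $t }

# Default pairs: remaining ^ map to nothing; # marks the word end.
sub cleanup  { my ($t) = @_; $t =~ s/\^//g; $t =~ s/#$//; return $t }

for my $lexical ('fox +N +Pl', 'goose +N +Pl') {
    my $surface = cleanup(e_insert($lex{$lexical}));
    print "$lexical -> $surface\n";   # fox +N +Pl -> foxes
}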
Tlex FST Combined with Te-insert FST
[Figure: the expanded lexicon FST Tlex (which, as noted, still incorrectly accepts foxs) run over the lexical tape f o x +N +PL, producing the intermediate tape f o x ^ s # while passing through states 0 1 2 5 6 7; below it, the E-insertion transducer Te-insert maps the intermediate tape to the surface tape f o x e s, passing through states 0 0 0 1 2 3 4 0.]
Finite-State Transducers
 The exact same cascade, with the same state sequences, is used whether the machine is
 generating the surface tape from the lexical tape, or
 parsing the lexical tape from the surface tape.
 Parsing is slightly more complicated than generation, because of the problem of ambiguity.
 Example: foxes can also be a verb (meaning “to baffle or confuse”), and hence the lexical parse for foxes could be
 fox +V +3Sg: That trickster foxes me every time, as well as
 fox +N +Pl: I saw two foxes yesterday.
 For ambiguous cases of this sort the transducer is not capable of deciding.
 Disambiguation will require some external evidence, such as the surrounding words. Disambiguation algorithms are discussed in Ch. 5 and Ch. 20.
 In the absence of such external evidence, the best an FST can do is enumerate all possible choices: transducing fox^s# into both fox +V +3Sg and fox +N +Pl.
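Enumerating the choices is then just a matter of the relation being one-to-many; sketched in Perl with an assumed hash from intermediate forms to their lexical parses:

#!/usr/bin/perl
use strict; use warnings;

# Without context an FST can only enumerate parses; the intermediate
# form fox^s# is related to more than one lexical string.
my %parses = ( 'fox^s#' => [ 'fox +N +Pl', 'fox +V +3Sg' ] );

print "fox^s# -> $_\n" for @{ $parses{'fox^s#'} };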
FSTs and Ambiguity
 There is another kind of ambiguity that FSTs need to handle:
 local ambiguity that occurs during the process of parsing.
 Example: parsing the input verb assess.
 After seeing ass, the E-insertion transducer may propose that the e that follows is inserted by the spelling rule; as far as the FST is concerned, we might have been parsing the word asses.
 Thus it is not until we run into another s, rather than the # after asses, that we realize we have gone down an incorrect path.
 Because of this non-determinism, FST parsing algorithms need to incorporate some sort of search algorithm. Homework: Exercise 3.7 asks you to modify the algorithm for non-deterministic FSA recognition to do FST parsing.
 Note that many possible spurious segmentations of the input, such as parsing assess as ^a^s^ses^s, will be ruled out, since no entry in the lexicon will match these strings.
FSTs and Cascading
 Running a cascade of a number of FSTs can be unwieldy. Using the composition property of FSTs we can compose a single more complex transducer from a cascade of transducers in series.
 Transducers in parallel can be combined by automaton intersection.
 The automaton intersection algorithm takes the Cartesian product of the states: for each state qi in machine 1 and state qj in machine 2, we create a new state qij.
 For any input symbol a, if machine 1 would transition to state qn and machine 2 would transition to state qm, the new transducer transitions to state qnm.
 The figure on the next slide sketches how the intersection (∧) and composition (∘) process might be carried out.
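A sketch of the Cartesian-product construction in Perl, using two tiny hand-made machines (the example machines and the hash representation are assumptions; final states, which would be the pairs whose components are both final, are omitted):

#!/usr/bin/perl
use strict; use warnings;

# Each DFA is a hash: state -> { input symbol -> next state }.
my %m1 = ( 0 => { a => 1 },            1 => { b => 1 } );
my %m2 = ( 0 => { a => 0, b => 1 },    1 => { b => 1 } );

# Intersection by the Cartesian-product construction: state "i,j"
# pairs state i of machine 1 with state j of machine 2.
my %prod;
for my $i (keys %m1) {
    for my $j (keys %m2) {
        for my $sym (keys %{ $m1{$i} }) {
            next unless exists $m2{$j}{$sym};
            # m1: i --sym--> n, m2: j --sym--> m  =>  product: i,j --sym--> n,m
            $prod{"$i,$j"}{$sym} = "$m1{$i}{$sym},$m2{$j}{$sym}";
        }
    }
}

for my $q (sort keys %prod) {
    print "$q --$_--> $prod{$q}{$_}\n" for sort keys %{ $prod{$q} };
}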
Intersection and Composition of
Transducers
[Figure: the rule transducers FST1 … FSTN running in parallel are intersected into a single FSTA = (FST1 ∧ FST2 ∧ … ∧ FSTN), which is then composed with the lexicon transducer to give LEXICON-FST ∘ FSTA.]

 Since there are a number of rule → FST compilers, it is almost never necessary in practice to write an FST by hand.
 Kaplan and Kay (1994) give the mathematics that define the mapping from rules to two-level relations.
 Antworth (1990) gives details of the algorithms for rule compilation.
 Mohri (1997) gives algorithms for transducer minimization and determinization.
Lexicon-Free FSTs: The Porter Stemmer
 Building FSTs from a lexicon plus rules is the standard algorithm for morphological parsing.
 However, there are simpler algorithms that do not require the large on-line lexicon demanded by this standard algorithm.
 These algorithms are especially used in Information Retrieval (IR) tasks like web search:
 Boolean combinations of relevant keywords or phrases: marsupial OR kangaroo OR koala
 Since a document with the word marsupials might not match the keyword marsupial, some IR systems first run a stemmer on the query and document words.
 Morphological information in IR is thus only used to determine that two words have the same stem; the suffixes are thrown away.
Porter (1980) Stemming Algorithm
 The Porter stemming algorithm is based on a series of simple cascaded rewrite rules.
 Since cascaded rewrite rules can easily be implemented as an FST, the Porter algorithm can be thought of as a lexicon-free FST stemmer (Homework: Exercise 3.6).
 The algorithm contains rules like these:
 ATIONAL → ATE (e.g., relational → relate)
 ING → ε if the stem contains a vowel (e.g., motoring → motor)
 More details at:
http://tartarus.org/~martin/PorterStemmer/index.html
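A tiny Perl sketch of two such cascaded rewrite rules (a fragment only; the real Porter algorithm has many more rules plus "measure" conditions on the stem):

#!/usr/bin/perl
use strict; use warnings;

# Two Porter-style cascaded rewrite rules.
sub stem_step {
    my ($w) = @_;
    $w =~ s/ational$/ate/;               # relational -> relate
    $w =~ s/(\w*[aeiou]\w*)ing$/$1/;     # motoring -> motor (stem must contain a vowel)
    return $w;
}

print "$_ -> ", stem_step($_), "\n" for qw(relational motoring sing);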
Porter (1980) Stemming Algorithm
 Not all IR engines use stemming, partly because of stemmer errors such as these noted by Krovetz:

Errors of Commission           Errors of Omission
organization → organ           European → Europe
doing → doe                    analysis → analyzes
generalization → generic       matrices → matrix
numerical → numerous           noise → noisy
policy → police                sparse → sparsity
Word and Sentence Tokenization
 So far we have focused on the problem of segmentation:
 how words can be segmented into morphemes.
 A related problem is that of segmenting running text into words and sentences: tokenization.
 Word tokenization may seem very simple in a language like English that separates words via a special ‘space’ character. However, not every language does this (e.g., Chinese, Japanese, and Thai).
Segmentation Problem
1. The whitespace character is not sufficient by itself:
 Mr. Sherwood said reaction to Sea Containers’ proposal has been “very positive.” In New York Stock Exchange composite trading yesterday, Sea Containers closed at $62.625, up 62.5 cents. “I said, ‘what’re you? Crazy?’ “ said Sadowsky. “I can’t afford to do that.”
 Segmenting purely on white space would produce words like these:
 cents.   said,   positive,”   Crazy?
2. In speech recognition systems:
 1. the whitespace character is not produced by the recognizer;
 2. tokens are phonetic labels (not orthographic): AE
 3. an incomplete number of phones (recognizer deletion)
 4. insertion of a non-existing phone (recognizer insertion)
 5. an erroneous phone label (recognizer substitution)
Segmentation Errors
 Segmentation errors produced by using only white space could be addressed by treating punctuation, in addition to white space, as a word boundary.
 Punctuation, however, often occurs word-internally:
 m.p.g, Ph.D, AT&T, cap’n, 01/02/06, google.com, etc.
 Also, if we want 62.5 in the example on the previous slide to be a word, segmenting on every period “.” must be avoided.
 Number expressions introduce additional problems, whether expressed in numeric form or as words:
 777,777.77
 seven hundred and seventy seven dollars and seventy seven cents.
 9/15/2007
 European continental languages use a comma to mark the decimal “point” and spaces (or sometimes periods) where English puts commas:
 777 777,77 or 777.777,77
 15.09.2007
 Languages differ on punctuation styles for numbers and dates.
Other Tokenizer Tasks
 Expansion of clitic contractions that are marked by apostrophes:
 what’re → what are
 we’re → we are
 Complications:
 apostrophes are ambiguous, since they are also used as
 genitive markers: “the book’s cover” or “Containers’ ”, and
 quotative markers: “ ‘what’re you? Crazy?’ ”
Other Tokenizer Tasks
 Multi-word expressions:
 New York
 rock ‘n’ roll
 These require a multiword expression dictionary; we must also detect names, dates, and organization names – named entity detection (Ch. 22).
 Sentence segmentation is a crucial first step in text processing.
 Segmentation of a text into sentences is generally based on punctuation:
 sentence boundaries are usually marked with “.”, “?”, or “!”;
 “?” and “!” are relatively unambiguous compared to “.”.
 Example: sentence-boundary “.” vs. “Mr.” or “Inc.”
 The previous example used “Inc.” to mark an abbreviation as well as the sentence boundary.
 Consequently, word tokenization and sentence tokenization tend to be addressed jointly.
Sentence Tokenization
 Typically, sentence tokenization methods work by building a binary classifier:
 based on a sequence of rules, or
 based on some machine learning algorithm (introduced in subsequent chapters),
that decides whether a period is part of the word or is a sentence-boundary marker.
 An abbreviation dictionary is useful in determining whether a period is attached to a commonly used abbreviation.
 A useful first step in sentence tokenization can be taken via a sequence of regular expressions. The first part of this algorithm, a word tokenizer, is presented in the next slide, implemented as a Perl script.
 A free Perl distribution can be downloaded from:
http://aspn.activestate.com/ASPN/Downloads/ActivePerl/
Word Tokenization with Perl
#!/usr/bin/perl
$letternumber = "[A-Za-z0-9]";
$notletter = "[^A-Za-z0-9]";
$alwayssep = "[\\?!()\";/\\|‘]";
$clitic = "(’|:|-|’S|’D|’M|’LL|’RE|’VE|N’T|’s|’d|’m|’ll|’re|’ve|n’t)";

$abbr{"Co."} = 1; $abbr{"Dr."} = 1;
$abbr{"Jan."} = 1; $abbr{"Feb."} = 1;

while (<>){
    # put whitespace around unambiguous separators
    s/$alwayssep/ $& /g;
    # put whitespace around commas that aren’t inside numbers
    s/([^0-9]),/$1 , /g;
    s/,([^0-9])/ , $1/g;
    # distinguish single quotes from apostrophes by
    # segmenting off single quotes not preceded by letter
    s/^’/$& /g;
    s/($notletter)’/$1 ’/g;
    # segment off unambiguous word-final clitics and punctuation
    s/$clitic$/ $&/g;
    s/$clitic($notletter)/ $1 $2/g;
    # now deal with periods. For each possible word
    @possiblewords = split(/\s+/, $_);
    foreach $word (@possiblewords) {
        # if it ends in a period,
        if (($word =~ /$letternumber\./)
            # and isn’t on the abbreviation list
            && !($abbr{$word})
            # and isn’t a sequence of letters and periods (U.S.)
            # and doesn’t resemble an abbreviation (no vowels: Inc.)
            && !($word =~ /^([A-Za-z]\.([A-Za-z]\.)+|[A-Z][bcdfghj-np-tv-xz]+\.)$/)) {
            # then segment off the period
            $word =~ s/\.$/ \./;
        }
        # expand clitics
        $word =~ s/’ve/have/;
        $word =~ s/’m/am/;
        print $word, " ";
    }
    print "\n";
}
Sentence Tokenization
 The fact that a simple tokenizer can be built with such simple regular expression patterns, like the ones in the previous Perl example, suggests that tokenizers can easily be implemented in FSTs.
 Examples of FSTs that have been built:
 Karttunen et al. (1996) and
 Beesley and Karttunen (2003) give descriptions of such FST-based tokenizers.
Detecting and Correcting Spelling Errors
 The problem of detecting and correcting spelling errors is introduced here. Since the standard algorithm for spelling error correction is probabilistic, we will continue our spell-checking discussion in Ch. 5 after the probabilistic noisy channel model is introduced.
 The detection and correction of spelling errors is an integral part of
 modern word processors and search engines,
 optical character recognition (OCR), the automatic recognition of machine-printed or hand-printed characters, and
 on-line handwriting recognition, of printed or cursive handwriting.
Broader Problems for Spell-Checking, in Increasing Order of Difficulty
 Non-word error detection: detecting spelling errors that result in non-words (e.g., graffe for giraffe).
 Isolated-word error correction: correcting spelling errors that result in non-words (e.g., correcting graffe to giraffe), but looking only at the word in isolation.
 Context-dependent error detection and correction: using the context to help detect and correct spelling errors even if they accidentally result in an actual word of English (real-word errors). This can happen with typographical errors (insertion, deletion, transposition) that accidentally produce a real word (e.g., there for three), or because the writer substituted the wrong spelling of a homophone or near-homophone (e.g., dessert for desert, or piece for peace).
Non-Word Error Detection
 Mark any word that is not found in a dictionary.
 A dictionary would not have a word entry graffe.
 Early research (Peterson, 1986) suggested that such spelling dictionaries would need to be kept small:
 large dictionaries contain rare words that resemble misspellings of other words.
 Example: wont and veery are rare words that could hide misspellings of won’t and very.
 In fact, a larger dictionary proved more help than harm: avoiding marking rare words as errors outweighed the fact that some misspellings were hidden by real words (Damerau and Mays, 1989).
 Modern spell-checking systems therefore tend to be based on large dictionaries.
Dictionary Implementation
 The finite-state morphological parsers described throughout this chapter provide a technology for implementing such large dictionaries.
 By giving a morphological parse for a word, an FST parser is inherently a word recognizer.
 An FST morphological parser can be turned into an even more efficient FSA word recognizer by using the projection operation to extract the lower-side language graph.
 Such FST dictionaries also have the advantage of representing productive morphology:
 the English –s and –ed inflections described previously.
 This is important for dealing with new legitimate combinations of stems and inflections:
 a new stem can be added to the dictionary, and then all its inflected forms are easily recognized.
 This makes FST dictionaries especially powerful for spell-checking in morphologically rich languages, where a single stem can have tens or hundreds of possible surface forms.
Dictionary Implementation
 FST dictionaries can help with non-word error detection. But what about error correction?
 Algorithms for isolated-word error correction operate by finding words which are the likely source of the erroneous form:
 correcting the spelling error graffe requires
 searching through all possible candidates, like:
 giraffe, graf, craft, grail, etc., and
 selecting the best choice from all possible ones via
 computation of a distance metric between the source and the surface error.
 Intuitively, giraffe is a more likely source than grail for graffe.
 The most powerful way to capture this similarity intuition requires the use of probability theory and will be discussed in Ch. 4. However, that algorithm is based on the non-probabilistic minimum edit distance algorithm, which is introduced next.
Minimum Edit Distance
 Deciding which of two words is closer to some third word in spelling is a special case of the general problem of string distance.
 The distance between two strings is a measure of how alike the two strings are.
 The class of algorithms that solves this problem is known as minimum edit distance.
 The minimum edit distance between two strings is the minimum number of editing operations:
 insertions,
 deletions,
 substitutions
needed to transform one string into the other.
Minimum Edit Distance
I N T E * N T I O N
* E X E C U T I O N
d s s   i s

 Representing the minimum edit distance between two strings as an alignment. The alignment is achieved with the operations of
 deletion (d),
 insertion (i), and
 substitution (s).
 To get a numeric representation of the minimum edit distance we can assign a particular cost or weight to each of these operations.
 The Levenshtein distance between two sequences uses the simplest weighting, in which each of the three operations has a cost of 1 (Levenshtein, 1966).
 In the example above, the distance between intention and execution is 5.
 Another version weights D and I with a cost of 1 and S with a cost of 2, making the distance above equal to 8.
Minimum Edit Distance
 The minimum edit distance is computed by dynamic programming. Dynamic programming is the name for a class of algorithms first introduced by Bellman (1957).
 It applies a table-driven method to solve problems by combining solutions to sub-problems.
 This class includes the most commonly used algorithms in speech and language processing, such as:
 Viterbi and Forward (Ch. 6)
 CYK and Earley (Ch. 13)
Distance Matrix

[Figure: the distance matrix for aligning intention (spelled up the vertical axis) against execution (spelled along the horizontal axis); each cell is reached by an insertion (I), deletion (D), or substitution (S) step.]
DP Algorithm Details
 A character-by-character matrix illustrates the alignment process.

[Figure: a char-char matrix aligning two strings; the cell (i, j) is reached from (i-1, j), (i-1, j-1), or (i, j-1).]
DP Algorithm Details
 One way to find the path that yields the best match (i.e., the lowest global distance) between the test string and a given reference string is by evaluating all possible paths in the matrix.
 That approach is time consuming, because the number of possible paths is exponential in the length of the strings.
 The matching process can be shortened by requiring that
1. a path cannot go backwards (i.e., to the left or down in the figure);
2. a path must include an association for every character in the test string; and
3. local distance scores are combined by adding, to give a global distance.
DTW - DP Algorithm Details
 For the moment, assume that a path must include an association for every frame in the template and every frame in the utterance.
 Then, for a point (i, j) in the time-time matrix (where i indexes the utterance frame and j the template frame), the previous point must have been (i-1, j-1), (i-1, j), or (i, j-1) (because of the prohibition against going backward in time).

[Figure: the three predecessors of cell (i, j): (i-1, j), (i-1, j-1), and (i, j-1).]
DTW - DP Algorithm Details
 The principle of dynamic programming (DP) is that the point (i, j), the next selected point on the path, comes from whichever of (i-1, j-1), (i-1, j), or (i, j-1) has the lowest distance.
 DP finds the lowest-distance path through the matrix while minimizing the amount of computation.
 The DP algorithm operates in a time-synchronous manner, considering each column of the time-time matrix in succession (which is equivalent to processing the utterance frame by frame).
 For a template of length N (corresponding to an N-row matrix), the maximum number of paths being considered at any time is N.
 A test utterance feature vector j is compared to all reference template features 1…N, generating a vector of corresponding local distances d(1, j), d(2, j), …, d(N, j).
DP Algorithm Details
 If D(i, j) is the global distance up to, but not including
matrix point (i, j) and the local distance of (i, j) matching
character i of the test string with character j of reference
string is given by d(i, j), then
D(i, j) = min [D(i-1, j-1), D(i-1, j), D(i, j-1)] + d(i, j)
 Given that D(1, 1) = d(1, 1) (this is the initial condition),
we have the basis for an efficient recursive algorithm for
computing D(i, j).
 The final global distance D(M, N) at the end of the path
gives the overall lowest matching score of the template with
the utterance, where M is the number of vectors of the
utterance.
 The test string is then matched to the dictionary word
with the lowest matching score, which yields the minimum edit
distance.
3 October 2015
Veton Këpuska
111
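Illustration (an added sketch, not part of the original slides): the recurrence above
maps almost line-for-line onto a recursive Python function. The names
global_distance and d are hypothetical, and d(i, j) stands in for whatever
local-distance measure is in use. The memoization (lru_cache) keeps the direct
recursion from recomputing subproblems, which is the slowness the next slide warns
about.

from functools import lru_cache

def global_distance(M, N, d):
    # Recursive form of D(i, j) = min[D(i-1, j-1), D(i-1, j), D(i, j-1)] + d(i, j)
    # with the initial condition D(1, 1) = d(1, 1).
    # M, N: lengths of the test and reference sequences (1-indexed).
    # d(i, j): local distance between test element i and reference element j.
    @lru_cache(maxsize=None)
    def D(i, j):
        if i == 1 and j == 1:
            return d(1, 1)
        predecessors = []
        if i > 1 and j > 1:
            predecessors.append(D(i - 1, j - 1))
        if i > 1:
            predecessors.append(D(i - 1, j))
        if j > 1:
            predecessors.append(D(i, j - 1))
        return min(predecessors) + d(i, j)

    return D(M, N)

# Example: 0/1 local distance between the characters of two strings.
test, ref = "spehc", "speech"
print(global_distance(len(test), len(ref),
                      lambda i, j: 0 if test[i - 1] == ref[j - 1] else 1))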
DP Algorithm Details
 The equation presented on the previous slide enforces the
rule that the only directions in which a path can move
from (i, j) in the time-time matrix are:
 up,
 right, or
 diagonally up and right.
 Computationally, that equation is in a form that could
be recursively programmed. However, unless the
language is optimized for recursion, this method can
be slow even for relatively small pattern sizes.
 Another method that is both quicker and requires less
memory storage uses two nested "for" loops. This
method only needs two arrays that hold adjacent
columns of the time-time matrix.
3 October 2015
Veton Këpuska
112
DP Algorithm Details
 Referring to the next figure, the algorithm to find the least global
distance path is as follows. (Note from the figure, which
shows a representative set of rows and columns, that the cells at (i,
j) and (i, 0) have different possible originator cells: the path to
(i, 0) can originate only from (i-1, 0), but the path to any other
(i, j) can originate from the three standard locations.)
[Figure: cell (i, j) with its three possible originator cells (i-1, j), (i-1, j-1), and (i, j-1); column i-1 is the previous column (predCol) and column i the current column (curCol).]
3 October 2015
Veton Këpuska
113
DP Algorithm Details
1. Calculate the global distance for the bottom-most cell of the left-most
column, column 0.
 The global distance up to this cell is just its local distance.
 The calculation then proceeds upward in column 0.
 The global distance at each successive cell is the local distance for that
cell plus the global distance to the cell below it.
 Column 0 is then designated the predCol (predecessor column).
2. Calculate the global distance to the bottom-most cell of the next
column, column 1 (which is designated the curCol, for current
column).
 The global distance to that bottom-most cell is the local distance for
that cell plus the global distance to the bottom-most cell of the
predecessor column.
3. Calculate the global distance of the rest of the cells of curCol.
 For example, at cell (i, j) this is the local distance at (i, j) plus the
minimum global distance at either (i-1, j), (i-1, j-1) or (i, j-1).
4. curCol becomes predCol and step 2 is repeated until all columns
have been calculated.
5. The minimum global distance is the value stored in the top-most cell
of the last column.
3 October 2015
Veton Këpuska
114
DP Algorithm Details
 The pseudocode for this process is:

Calculate First Column Distances (Prev. Column)
for i = 1 to Number of Input Feature Vectors
    CurCol[0] = local distance at (i,0)
                + global cost at (i-1,0)
    for j = 1 to Number of Template Feature Vectors
        CurCol[j] = local distance at (i,j)
                    + minimal global cost at (i-1,j), (i-1,j-1), (i,j-1)
    end
end
3 October 2015
Veton Këpuska
115
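The pseudocode translates directly into the following Python sketch (added for
illustration; dtw_two_column and local_dist are hypothetical names, and the
feature vectors are placeholders for whatever acoustic representation is in use).
Only two arrays are kept, holding adjacent columns of the time-time matrix.

def dtw_two_column(test, template, local_dist):
    # Two-column dynamic programming over the time-time matrix:
    # prev_col holds column i-1, cur_col holds column i.
    N = len(template)
    # Column 0: a path can arrive at each cell only from the cell below it.
    prev_col = [local_dist(test[0], template[0])]
    for j in range(1, N):
        prev_col.append(prev_col[j - 1] + local_dist(test[0], template[j]))
    for i in range(1, len(test)):
        # Bottom-most cell of the current column: its only predecessor is (i-1, 0).
        cur_col = [prev_col[0] + local_dist(test[i], template[0])]
        for j in range(1, N):
            best = min(prev_col[j],        # from (i-1, j)
                       prev_col[j - 1],    # from (i-1, j-1)
                       cur_col[j - 1])     # from (i, j-1)
            cur_col.append(best + local_dist(test[i], template[j]))
        prev_col = cur_col                 # curCol becomes predCol
    return prev_col[-1]                    # top-most cell of the last column

# Example with scalar "features" and absolute difference as the local distance:
score = dtw_two_column([1.0, 2.0, 3.0], [1.0, 1.5, 2.5, 3.0],
                       lambda x, y: abs(x - y))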
DP Algorithm Details
 To perform recognition on an utterance, the
algorithm is repeated for each template.
 The template file that gives the lowest
global matching score is picked as the most
likely word.
 Note that the minimum global matching
score for the test word must be compared
against all dictionary candidates.
 The candidate words can then be sorted by
minimum edit distance (DP score).
3 October 2015
Veton Këpuska
116
DP Algorithm Details
 With per-symbol insertion, deletion, and substitution costs, the recurrence becomes:

D(i, j) = min [ D(i-1, j) + insertion_cost(target[i-1]),
                D(i-1, j-1) + substitution_cost(source[j-1], target[i-1]),
                D(i, j-1) + deletion_cost(source[j-1]) ]
3 October 2015
Veton Këpuska
117
Distance Computation
function MIN-EDIT-DISTANCE(target, source) returns min-distance
    n ← LENGTH(target)
    m ← LENGTH(source)
    Create a distance matrix distance[n+1, m+1]
    Initialize the zeroth row and column to be the distance from the
    empty string:
    distance[0,0] = 0
    for each column i from 1 to n do
        distance[i,0] ← distance[i-1,0] + ins-cost(target[i])
    for each row j from 1 to m do
        distance[0,j] ← distance[0,j-1] + del-cost(source[j])
    for each column i from 1 to n do
        for each row j from 1 to m do
            distance[i,j] ← MIN(distance[i-1,j] + ins-cost(target[i-1]),
                                distance[i-1,j-1] + subst-cost(source[j-1], target[i-1]),
                                distance[i,j-1] + del-cost(source[j-1]))
    return distance[n,m]
3 October 2015
Veton Këpuska
118
Minimum Edit Distance Algorithm
 The minimum edit distance algorithm, an
example of the class of dynamic
programming algorithms, is shown on the
previous slide.
 The various costs can either be
 fixed (e.g., ∀x, ins-cost(x) = 1), or
 specific to the letter (to model the fact
that some letters are more likely to be inserted
than others).
 We assume that there is no cost for
substituting a letter for itself (i.e., subst-cost(x, x) = 0).
3 October 2015
Veton Këpuska
119
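A direct Python transcription of the MIN-EDIT-DISTANCE pseudocode (an added
sketch, not the textbook's own code), using the fixed costs of the worked example
on the next slide: insertions and deletions cost 1, and substitutions cost 2
unless the letters already match.

def min_edit_distance(target, source):
    # distance[i][j]: cost of editing source[:j] into target[:i].
    n, m = len(target), len(source)
    distance = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):            # zeroth column: i insertions
        distance[i][0] = distance[i - 1][0] + 1
    for j in range(1, m + 1):            # zeroth row: j deletions
        distance[0][j] = distance[0][j - 1] + 1
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[j - 1] == target[i - 1] else 2
            distance[i][j] = min(distance[i - 1][j] + 1,        # insertion
                                 distance[i - 1][j - 1] + sub,  # substitution
                                 distance[i][j - 1] + 1)        # deletion
    return distance[n][m]

print(min_edit_distance("intention", "execution"))  # 8, matching the matrix on the next slide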
Example of Distance Matrix Computation
(Cost = 1 for INS & DEL, Cost = 2 for SUB)
N   9   8   9  10  11  12  11  10   9   8
O   8   7   8   9  10  11  10   9   8   9
I   7   6   7   8   9  10   9   8   9  10
T   6   5   6   7   8   9   8   9  10  11
N   5   4   5   6   7   8   9  10  11  10
E   4   3   4   5   6   7   8   9  10   9
T   3   4   5   6   7   8   7   8   9   8
N   2   3   4   5   6   7   8   7   8   7
I   1   2   3   4   5   6   7   6   7   8
#   0   1   2   3   4   5   6   7   8   9
    #   E   X   E   C   U   T   I   O   N
3 October 2015
Veton Këpuska
120
Alignment
 The DP algorithm can be used to find the
minimum-cost alignment between two
strings. This is useful in:
 Speech and language processing,
 Speech recognition,
 Machine translation.
3 October 2015
Veton Këpuska
121
Alignment Path
[Figure: the minimum-cost alignment path traced through the distance matrix.]
3 October 2015
Veton Këpuska
122
Back-tracking/Backtrace
 Requires only a minimal augmentation of the
original algorithm: store in each cell of
the matrix the information about the cell
from which its minimal score was
derived.
 Homework: Exercise 3.12. Write a
complete DP algorithm to match two
strings by finding minimal edit
distance and the best matching path.
 Matlab, C++, C, perl, php
3 October 2015
Veton Këpuska
123
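One possible shape for that augmentation (an added sketch under the same cost
assumptions as before, not a model solution to the exercise): store in each cell
the predecessor that produced its minimum, then walk the stored pointers back
from the final cell to recover the best matching path.

def min_edit_with_backtrace(target, source):
    # Returns (distance, path): path is the list of (i, j) cells on a
    # lowest-cost alignment, recovered from stored back-pointers.
    n, m = len(target), len(source)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0], back[i][0] = D[i - 1][0] + 1, (i - 1, 0)
    for j in range(1, m + 1):
        D[0][j], back[0][j] = D[0][j - 1] + 1, (0, j - 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[j - 1] == target[i - 1] else 2
            moves = [(D[i - 1][j] + 1, (i - 1, j)),            # insertion
                     (D[i - 1][j - 1] + sub, (i - 1, j - 1)),  # substitution
                     (D[i][j - 1] + 1, (i, j - 1))]            # deletion
            D[i][j], back[i][j] = min(moves)
    path, cell = [], (n, m)                # walk pointers back from (n, m)
    while cell is not None:
        path.append(cell)
        cell = back[cell[0]][cell[1]]
    return D[n][m], list(reversed(path))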
Publicly Available Packages
 UNIX diff (Win32 cygwin package):
 http://cygwin.com/
 sclite and other software tools for
speech & language processing
from NIST:
 http://www.nist.gov/speech/tools/index.htm
3 October 2015
Veton Këpuska
124
Human Morphological
Processing
Human Morphological Processing
 Brief survey of psycholinguistic studies on how multimorphemic words are represented in the minds of speakers
of English.
 Example:
 walk → walks and walked.
 Are all three in the human lexicon? Or just
walk + –ed and –s?
 happy → happily and happiness.
 Two extreme possibilities:

Full listing – proposes that all words of a language are listed in
the mental lexicon without any internal morphological
structure.
 This hypothesis is certainly untenable for morphologically
complex languages like Turkish.

Minimum redundancy hypothesis suggests that only the
constituent morphemes are represented in the lexicon and
when processing walks (whether for reading, listening, or
talking) we must access both morphemes (walk and –s) and
combine them.
3 October 2015
Veton Këpuska
126
Evidence of Human Lexicon
 Earliest evidence comes from speech errors – also
called slips of the tongue.
 In conversational speech, speakers often mix up the
order of the words or sounds:
 If you break it it’ll drop.
 In the slips of the tongue collected by Fromkin and
Ratner (1998) and Garrett (1975), inflectional and
derivational affixes can appear separately from their
stems:
 It's not only us who have screw looses (for "screws
loose")
 Words of rule formation (for "rules of word formation")
 Easy enoughly (for "easily enough") ⇒
 Suggests that the mental lexicon contains some
representation of morphological structure.
3 October 2015
Veton Këpuska
127
Evidence of Human Lexicon
 More recent experimental evidence
suggests neither the full listing nor the
minimum redundancy hypotheses may be
completely true.
 Stanners et al. present evidence that:
 Derived words like happily and happiness are
stored separately from their stem happy;
 Regularly inflected forms (pouring) are
not distinct in the lexicon from their
stems (pour).
3 October 2015
Veton Këpuska
128
Evidence of Human Lexicon
[Figure: fragment of the mental lexicon in which the stems depart and govern are linked to affixes such as -ure, -al, -s, and -ing, while department is stored as its own entry.]
 Marslen-Wilson et al. (1994) result: derived words are
linked to their stems only if semantically related.
3 October 2015
Veton Këpuska
129
Summary
Summary
 Morphology: an area of language
processing dealing with the subparts
of words.
 Finite-State Transducer: a
computational device that is
important for morphology.
 Stemming: Morphological Parsing
(stem + affixes),
 Word and Sentence Tokenization, and
 Spelling Error Detection.
3 October 2015
Veton Këpuska
131
Summary
 Morphological parsing is the process of finding the constituent
morphemes in a word (e.g., cat +N +PL for cats).
 English mainly uses prefixes and suffixes to express inflectional
and derivational morphology.
 English inflectional morphology is relatively simple and includes
person and number agreement (-s) and tense markings (-ed and -ing).
 English derivational morphology is more complex and includes
suffixes like -ation, -ness, -able as well as prefixes like co- and re-.
 Many constraints on the English morphotactics (allowable morpheme
sequences) can be represented by finite automata.
 Finite-state transducers are an extension of finite-state automata
that can generate output symbols.
 Important operations for FSTs include composition, projection, and
intersection.
 Finite-state morphology and two-level morphology are
applications of finite-state transducers to morphological representation
and parsing.
 Spelling rules can be implemented as transducers.
 There are automatic transducer-compilers that can produce a
transducer for any simple rewrite rule.
3 October 2015
Veton Këpuska
132
Summary
 The lexicon and spelling rules can be combined by
composing and intersecting various transducers.
 The Porter algorithm is a simple and efficient way to do
stemming, stripping off affixes. It is not as accurate as a
transducer model that includes a lexicon, but may be
preferable for applications like information retrieval in
which exact morphological structure is not needed.
 Word tokenization can be done by simple regular-expression
substitutions or by transducers.
 Spelling error detection is normally done by finding
words which are not in a dictionary; an FST dictionary can
be useful for this.
 The minimum edit distance between two strings is the
minimum number of operations it takes to edit one into the
other. Minimum edit distance can be computed by dynamic
programming, which also results in an alignment of the
two strings.
3 October 2015
Veton Këpuska
133