Chapter 14: Models and Theories of Speech Production and Perception
Perry C. Hanavan, Au.D.
Theory
A proposed description, explanation, or model of the manner of
interaction of a set of natural phenomena, capable of predicting
future occurrences or observations of the same kind, and capable
of being tested through experiment or otherwise falsified through
empirical observation.
Model
A model is a conceptual representation of some phenomenon.
Speech Perception
• Puzzles and Speech Perception by
Hawkins
Speech Production
• Serial-Order Issue
– Order of phonemes determines how the word
will be perceived or recognized
• e.g., /kat/ vs. /tak/
• Degrees of Freedom
– Muscle movement for each sound can vary,
yet understanding of sound occurs
• Context-Sensitivity Problem
– The context in which a sound is made can
have significant implications for meaning
Theories of Speech Production
• Target
• Feedback and Feedforward
• Dynamic Systems
• Connectionist
Target Models
• “process in which a speaker attempts to
attain a sequence of targets corresponding
to the speech sounds he is attempting to
produce.”
– (Borden et al., 1994)
Target Models
• Spatial?
– Internal map of the vocal tract in the CNS
– Coarticulation
• Variant: movements of the articulator for a specific
phoneme must change depending on starting point
– ki
– ku
– Feedback information to brain to regulate fine
movements and correct errors
Target Models
• Acoustic-auditory?
– Goal is acoustic output (targets)
• Articulatory movements used to achieve goal
– Variations in articulatory movements used when:
» Adjacent phonemes vary
» Speaker rate varies
» Stress variations in utterance
Target Models
• Perkell, Matthies, & Svirsky (1995)
framework
– Ultimate goal:
• Articulatory movements result in understandable
acoustic events (speech)
Feedback
• Feedback was introduced in "The Basics of
Cybernetics" and is especially important to
speech production theory.
• Gracco and Abbs (1987) are among many to
point out that continuous speech involves
continuous feedback, that is to say, that the
continuous execution of a motor program
requires an equally continuous stream of sensory
information from muscle and cutaneous senses
throughout the respiratory, laryngeal, and
orofacial regions.
Feedback
• Levelt (1989) types of feedback .....
• Am I saying what I meant to say?
• Is this the way I meant to say it?
• Is what I am saying socially appropriate?
• Am I selecting the right words?
• Am I using the right syntax and morphology?
• Am I making any phonological errors?
• Is my articulation at the right speed and pitch?
Feedback
• Successful speech production is a constant battle
against error, and those errors can pop up
anywhere.
– The phrases we then use to interrupt and correct
ourselves (phrases such as "sorry", "I mean", "let me put
that another way", etc.) are known generically as "editing
expressions" (Hockett, 1967). Levelt (1989) summarised
the issue thus .....
Feedback
• "The major feature of editor theories [of monitoring] is that
production results are fed back through a device that is
external to the production system.
– Such a device is called an editor or a monitor.
– This device can be distributed in the sense that it can check
in-between results at different levels of processing.
– The editor may, for instance, monitor the construction of the
preverbal message, the appropriateness of lexical access, the
well-formedness of syntax, or the flawlessness of
phonological-form access.
– There is, so to speak, a watchful little homunculus connected to
each processor." (Levelt, 1989, pp. 467-468; italics original; bold
emphasis added)
Feedback
• Lee interpreted these findings as evidence of a
multiple loop control hierarchy, with four levels
of feedback, as follows:
– The "Thought Loop": The top control level releases individual
thoughts for action, and then monitors that action for successful
progress and completion. The highest level feedback loop then
monitors the output for what would nowadays be termed its
pragmatic appropriacy
– The "Word Loop": The second highest loop monitors speech
production for word selection accuracy.
– The "Voice Loop": The third highest loop monitors speech
production at whole-syllable level for morphological accuracy.
– The "Articulating Loop": Finally, the lowest loop monitors speech
production checking that the right phonemes have been used
within each syllable.
Feedforward
• Feedback is used to detect and correct
errors in speech output
• Feedforward signals are used to make
articulatory adjustment online
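The contrast between a planned (feedforward) command and feedback-based correction can be illustrated with a very small control loop. This is only a sketch under stated assumptions, not a model from the chapter: the function name `produce`, the scalar "articulatory target", and the `plant_gain` factor (standing in for any reason the planned command misses its target) are all hypothetical.

```python
# Toy control loop: a planned command plus optional feedback correction.
# All names and numbers are illustrative assumptions, not published values.

def produce(target, steps, feedback_gain, plant_gain=0.8):
    """Return the output reached on each step of a repeated production."""
    command = target              # feedforward: command planned in advance
    outputs = []
    for _ in range(steps):
        output = plant_gain * command      # what the articulators actually achieve
        outputs.append(output)
        error = target - output            # error detected through sensory feedback
        command += feedback_gain * error   # feedback: corrective adjustment
    return outputs

with_feedback = produce(1.0, steps=5, feedback_gain=0.5)
without_feedback = produce(1.0, steps=5, feedback_gain=0.0)
print([round(x, 3) for x in with_feedback])
print([round(x, 3) for x in without_feedback])
```

With a nonzero `feedback_gain` the output moves toward the target over successive steps; with the gain set to zero the same undershoot repeats on every step.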
Dynamic Systems Models
• Speech as a dynamic pattern of trajectories
through articulatory
– Groups of muscles link up together to perform a
particular task
• The lip and jaw muscles function as a coordinative unit in
bilabial closure
Connectionist Models
• Parallel-distributed processing models
• Spreading activation models
– Non-hierarchical models
Speech Perception Issues
• Linearity
• Segmentation
• Speaker Normalization
• Basic Unit of Perception
• Specialization of speech perception
Categories of Speech Perception
Theories
• Active vs. Passive
• Bottom-up vs. Top-down
• Autonomous vs. Interactive
Theories of Speech Perception
• Motor
• Acoustic Invariance
• Direct Realism
• TRACE
• Cohort
• Fuzzy Logical
• Logogen
• Native Language Magnet
Speech Perception Issues
• Linearity
• Segmentation
• Speaker normalization
• Basic unit of perception
• Specialization of speech perception
Linearity & Segmentation
• Linearity Principle:
– A specific sound in a word corresponds
to specific phoneme
• Segmentation
– the ability to break the spoken language
signal into the parts that make up words
• Thus, these two principles suggest
speech perception is based on a
linear correspondence between the
acoustic signal and the phoneme
units
• Although we perceive speech as a
series of separate and distinct
phonemes and words, the acoustic
boundaries between phonemes are
blurred
– e.g., /ki/ vs. /ku/ (speech is not invariant)
Speaker Normalization
• How are listeners able to recognize speech
sounds and words despite wide variations in
speaker production?
– Speaker variations
– Gender differences
– Age differences
• Normalization is the ability to perceive words
spoken by different speakers, at different rates,
and in different phonetic contexts as the same.
Basic Unit of Perception
• What is the basic unit of speech perception?
– Acoustic-phoneme features
– Allophones
– Phonemes
– Syllables
– Words
• Listening in noise (focus on smaller units)
• Young children focus on syllables and formant
transitions
Specialization of Speech Perception
• Is speech perception a
specialized function/process in
humans?
– However, animals have been able
to demonstrate categorical
perception
Specialization of Speech Perception
• Perceptual magnet effect not
demonstrated in animals (e.g.,
whereby 'good' variants in F1/F2
coordinate space are poorly
discriminated from typical vowel
prototypes)
(A) Formant frequencies of vowels surrounding an
American /i/ prototype (red) and a Swedish /y/ prototype
(blue). (B) Results of tests on American and Swedish
infants indicating an effect of linguistic experience.
Infants showed greater generalization when tested
with the native-language prototype. PME, perceptual
magnet effect. [American Association for the
Advancement of Science]
Categories of Speech Perception
Theories
• Active vs. Passive
• Bottom-up vs. Top-down
• Autonomous vs. Interactive
Active vs. Passive
• Active theories suggest that speech
perception and production are closely related
– Listener knowledge of how sounds are produced
facilitates recognition of sounds
• Passive theories emphasize the sensory
aspects of speech perception
– Listeners utilize internal filtering mechanisms
– Knowledge of vocal tract characteristics plays a minor
role, for example when listening in noise conditions
Bottom-up vs. Top-down
• Top-down processing works with
knowledge a listener has about a
language, context, experience,
etc.
– Listeners use stored information
about language and the world to
make sense of the speech
• Bottom-up processing works in
the absence of a knowledge base
providing top-down information
– listeners receive auditory information,
convert it into a neural signal and
process the phonetic feature
information
Autonomous vs. Interactive
• Autonomous theories posit
feed-forward processing with
lexical influence restricted to
post-perceptual decision
processes (uni-directional)
• Interactive theories posit that
information and knowledge
from many sources available
to the listener are involved at
any or all stages of the
processing of the signal (bidirectional)
Speech Perception Theories
• Motor Theory
• Acoustic Invariance Theory
• Direct Realism
• Trace Model
• Logogen Theory
• Cohort Theory
• Fuzzy Logic Model of Perception
• Native Language Magnet Theory
Question
This theory postulates speech is perceived
by reference to how it is produced:
A. Motor Theory
B. Acoustic Invariance Theory
C. Direct Realism
D. Trace Model
E. Logogen Theory
F. Cohort Theory
G. Fuzzy Logic Model of Perception
H. Native Language Magnet Theory
Motor Theory
• Postulates speech is perceived by
reference to how it is produced
– when perceiving speech, listeners access
their own knowledge of how phonemes
are articulated
– Articulatory gestures (such as rounding or
pressing the lips together) are units of
perception that directly provide the
listener with phonetic information
Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967
Motor Theory
Three main claims:
(1) speech processing is special,
(2) perceiving speech is perceiving gestures, and
(3) the motor system is recruited for perceiving
speech.
Today, the theory is more closely connected with
research and theorizing in the broad context of
cognitive science than with research and
theorizing in the field of speech.
Question
The acoustic properties of the landmarks
constitute the basis for establishing the
distinctive features:
A. Motor Theory
B. Acoustic Invariance Theory
C. Direct Realism
D. Trace Model
E. Logogen Theory
F. Cohort Theory
G. Fuzzy Logic Model of Perception
H. Native Language Magnet Theory
Acoustic Invariance Theory
• Listeners inspect the incoming signal for so-called
acoustic landmarks, which are particular
events in the spectrum carrying information
about the gestures that produced them.
• Because gestures are limited by the capacities of
humans’ articulators and listeners are sensitive
to their auditory correlates, the lack-of-invariance
problem simply does not arise in this model.
• The acoustic properties of the landmarks
constitute the basis for establishing the
distinctive features.
• Bundles of the distinctive features uniquely
specify phonetic segments (phonemes,
syllables, words).
Stevens, K. N. (2002). "Toward a model of lexical access based on
acoustic landmarks and distinctive features". Journal of the
Acoustical Society of America 111 (4): 1872–1891.
Acoustic Invariance Theory
Two principal claims:
1. There are invariant acoustic patterns in the
speech signal which correspond to phonetic
features and which remain invariant across
speakers, phonetic contexts, and languages.
2. Human perceivers use these properties
(invariant acoustic patterns) to provide the
phonetic framework for natural language and to
process the sounds of speech in ongoing
perception
Question
Hypothesizes that perception allows listeners to
have direct awareness of the world because it
involves direct recovery of the distal source of
the event that is perceived.
A. Motor Theory
B. Acoustic Invariance Theory
C. Direct Realism
D. Trace Model
E. Logogen Theory
F. Cohort Theory
G. Fuzzy Logic Model of Perception
H. Native Language Magnet Theory
Direct Realism
• Hypothesizes that perception allows listeners to have
direct awareness of the world because it involves
direct recovery of the distal source of the event that is
perceived.
• Asserts that the objects of perception are actual vocal
tract movements, or gestures, and not abstract
phonemes or (as in the Motor Theory) events that are
causally antecedent to these movements, i.e.
intended gestures.
• Listeners perceive gestures not by means of a
specialized decoder (as in the motor theory) but
because information in the acoustic signal specifies
the gestures that form it.
• Suggests that the actual articulatory gestures that
produce different speech sounds are themselves the
units of speech perception.
Fowler, C. A. (1986). "An event approach to the study of speech perception from a
direct-realist perspective". Journal of Phonetics 14: 3–28.
Direct Realism
• In essence, the object of our perceptions (i.e., a
sound, a word, etc.), known as a ‘percept’, is the
distal (causal) stimulus, not the proximal
(sensory) stimulus
• When listening to a question, you don’t necessarily
explicitly perceive the rise in the F0 of the
auditory signal; instead, you perceive that a
question has been asked
• Claims there is a strong, top-down effect on
perception
• Supports a connectionist view of cognition
TRACE Model
• Assumes there is a cognitive unit for each feature (for
example, nasality) at the feature level, for each phoneme
at the phoneme level, and for each word at the word
level.
• At any given time, all of these units are activated to a
greater or lesser extent, as opposed to all or none.
• When units are activated above a certain threshold, they
may influence other units at the same or different levels.
– These effects may be either excitatory or inhibitory;
that is, they may increase or decrease the activation
of other units.
• The entire network of units is referred to as the trace,
because “the pattern of activation left by a spoken input
is a trace of the analysis of the input at each of the three
processing levels”
• The network is active and changes with subsequent
input.
McClelland, J.L., & Elman, J.L. (1986). The TRACE model of speech
perception. Cognitive Psychology, 18, 1-86
TRACE Model
• For example, a listener hears the
beginning of bald, and the words bald,
ball, bad, bill become active in memory.
Then, soon after, only bald and ball
remain in competition (bad, bill have
been eliminated because the vowel
sound doesn't match the input).
– Soon after, bald is recognized.
• TRACE simulates this process by
representing the temporal dimension of
speech, allowing words in the lexicon to
vary in activation strength, and by having
words compete during processing.
– Figure 1 shows a line graph of word
activation in a simple TRACE simulation.
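The bald/ball/bad/bill example can be mimicked with a toy word-level competition. This is a minimal sketch, not the McClelland and Elman implementation: it has only a word layer (no feature or phoneme layers), the broad phoneme transcriptions are assumed for illustration, and the parameter names `excite`, `inhibit`, and `decay` and their values are arbitrary.

```python
# Toy word-level competition in the spirit of TRACE (not the published model).
# Transcriptions and parameter values are illustrative assumptions.
LEXICON = {
    "bald": ("b", "ɔ", "l", "d"),
    "ball": ("b", "ɔ", "l"),
    "bad": ("b", "æ", "d"),
    "bill": ("b", "ɪ", "l"),
}

def simulate_competition(heard, lexicon, excite=0.3, inhibit=0.1, decay=0.1):
    """Return one {word: activation} snapshot per input phoneme."""
    activation = {word: 0.0 for word in lexicon}
    history = []
    for t, sound in enumerate(heard):
        total = sum(activation.values())
        updated = {}
        for word, phonemes in lexicon.items():
            # Simplification: input is compared position by position, and a
            # mismatch subtracts activation instead of merely failing to add it.
            matches = t < len(phonemes) and phonemes[t] == sound
            bottom_up = excite if matches else -excite
            rivals = total - activation[word]   # summed activation of competitors
            value = (activation[word] + bottom_up
                     - inhibit * rivals - decay * activation[word])
            updated[word] = min(1.0, max(0.0, value))  # keep within [0, 1]
        activation = updated
        history.append(dict(activation))
    return history

history = simulate_competition(LEXICON["bald"], LEXICON)
for sound, snapshot in zip(LEXICON["bald"], history):
    print(sound, {w: round(a, 2) for w, a in snapshot.items()})
```

In this sketch all four words are equally active after the first sound, "bad" and "bill" drop out at the vowel, and "bald" pulls ahead of "ball" only once the final consonant arrives.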
TRACE Model
• Neural net model
– Aims to identify single words
– Accounts for categorical perception, the Ganong
effect, and other traditional phonetic findings
that were considered important in the 1970s-1980s
– Connectionist model of speech perception
(McClelland and Elman, 1986)
Logogen Theory
• Model designed to explain word recognition
using a new type of unit known as a “logogen"
• A critical element is the lexicon, a specialized
aspect of memory that includes semantic and
phonemic information about each item that is
contained in memory.
• A given lexicon consists of many smaller,
abstract items known as logogens.
• Logogens contain a variety of properties of a
given word, such as its appearance, sound,
and meaning.
• Logogens do not store words within themselves;
they store the information that is specifically
necessary for retrieval of whatever word is being
searched for.
Morton, J. (1969). Interaction of information in word recognition.
Psychological Review, 76, 165-178
Logogen Theory
• A given logogen will become activated
by stimuli or contextual information
(words) consistent with the properties
of that specific logogen and when the
logogen's activation level rises to or
above its threshold level, the
pronunciation of the given word is sent
to output system.
• Certain stimuli can affect the activation
levels of more than one word at a time,
usually involving words similar to one
another.
• When this occurs, whichever word's
activation level first reaches the threshold
is the word sent to the output
system, with the listener remaining
unaware of any partially excited
logogens.
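The threshold idea can be sketched as a set of counters that accumulate evidence until one of them fires. This is a hedged illustration only: the `Logogen` class, the `recognize` function, the thresholds, and the evidence values are all hypothetical and are not taken from Morton (1969).

```python
# Toy logogen units: each accumulates evidence and fires at its threshold.
# Class name, thresholds, and evidence values are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Logogen:
    word: str
    threshold: float
    activation: float = 0.0

def recognize(logogens, evidence_stream):
    """Feed evidence step by step; return the first word to reach threshold."""
    for evidence in evidence_stream:
        for unit in logogens:
            # One stimulus may raise the activation of several similar words.
            unit.activation += evidence.get(unit.word, 0.0)
        fired = [u for u in logogens if u.activation >= u.threshold]
        if fired:
            # If several fire on the same step, pick the one furthest past threshold.
            return max(fired, key=lambda u: u.activation - u.threshold).word
    return None  # no logogen reached threshold

units = [Logogen("bald", 1.0), Logogen("ball", 1.0)]
stream = [{"bald": 0.4, "ball": 0.4}, {"bald": 0.4, "ball": 0.3}, {"bald": 0.4}]
winner = recognize(units, stream)
print(winner, [(u.word, round(u.activation, 2)) for u in units])
```

Here "ball" ends up partially excited but below threshold, which corresponds to the partially activated logogens the listener is said to remain unaware of.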
Cohort Theory
• Designed specifically to account for
auditory word recognition.
• Works by breaking a word down as it is heard.
• Model posits that when a word is heard, all
words beginning with the first sound of the target
word are activated.
• This set of words is considered the cohort.
• Once the first cohort has been activated, other
information, or later sounds in the word, narrow
down the choices.
• Listener recognizes the word when left with a
single choice; this is considered the "recognition
point."
Marslen-Wilson, W. (1987). "Functional parallelism in
spoken word recognition." Cognition, 25, 71-102.
Cohort Model
• Designed specifically to account for auditory
word recognition
• Works by breaking a word down as it is
heard: all words that begin with the first sound of
the target word are activated
– This set of words is considered the “first” cohort
• Once the first cohort is activated, other sounds in
the word narrow down the choices.
• Listener recognizes the word when left with a
single choice
– this is considered the "recognition point".
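The narrowing process described above can be sketched as repeated prefix filtering over a small lexicon. This is a minimal illustration, assuming a hypothetical five-word lexicon with rough phoneme transcriptions; the function name `cohort_trace` is not from the cohort literature.

```python
# Toy cohort narrowing: keep only words consistent with the sounds heard so far.
# The lexicon and its broad transcriptions are illustrative assumptions.
LEXICON = {
    "bald": ("b", "ɔ", "l", "d"),
    "ball": ("b", "ɔ", "l"),
    "bad": ("b", "æ", "d"),
    "bill": ("b", "ɪ", "l"),
    "cat": ("k", "æ", "t"),
}

def cohort_trace(heard, lexicon):
    """Return the cohort after each sound and the recognition point (or None)."""
    cohort = sorted(lexicon)
    steps = []
    recognition_point = None
    for n in range(1, len(heard) + 1):
        # A word stays in the cohort only if its first n sounds match the input.
        cohort = [w for w in cohort if lexicon[w][:n] == heard[:n]]
        steps.append(list(cohort))
        if recognition_point is None and len(cohort) == 1:
            recognition_point = n   # number of sounds needed for a unique match
    return steps, recognition_point

steps, point = cohort_trace(LEXICON["bald"], LEXICON)
for n, cohort in enumerate(steps, start=1):
    print(n, cohort)
print("recognition point:", point)
```

For the input "bald", the first sound activates every word beginning with /b/, the vowel removes "bad" and "bill", and only the final /d/ separates "bald" from "ball", so the recognition point in this toy lexicon is the fourth sound.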
Fuzzy Logic Model of Perception
• Proposes that people remember speech sounds
in a probabilistic, or graded, way.
• Suggests people remember descriptions of the
perceptual units of language, called prototypes.
• Within each prototype various features may
combine.
• Features are not binary (true or false) -- there is
a fuzzy value corresponding to how likely it is
that a sound belongs to a particular speech
category.
Massaro, D. (1989). The logic of the fuzzy logical model of perception. Behavioral and Brain
Sciences, 12, 778-794. Cambridge University Press.
Fuzzy Logic Model of Perception
• When perceiving a speech signal, decision about
what is actually heard is based on the relative
goodness of the match between the stimulus
information and values of particular prototypes.
• The final decision is based on multiple features
or sources of information, even visual information
(this explains the McGurk effect).
• Computer models of the fuzzy logical theory
demonstrate that the theory's predictions of how
speech sounds are categorized correspond to
the behavior of human listeners
• iBaldi
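The "relative goodness of the match" decision is commonly written as a multiplicative combination of graded support values, normalized across the candidate categories. The sketch below assumes that form; the function name `flmp_probabilities` and the support values for an auditory /ba/ paired with a visual /ga/ are made-up numbers for illustration, not data from Massaro.

```python
# Toy fuzzy-logical integration of auditory and visual support.
# Support values (between 0 and 1) are illustrative assumptions, not measured data.

def flmp_probabilities(auditory, visual):
    """Multiply the two sources' support per category, then normalize."""
    support = {c: auditory[c] * visual[c] for c in auditory}
    total = sum(support.values())
    return {c: s / total for c, s in support.items()}

# Audio mostly supports /ba/; the talker's face mostly supports /ga/.
auditory = {"ba": 0.60, "da": 0.35, "ga": 0.05}
visual = {"ba": 0.05, "da": 0.45, "ga": 0.50}

probabilities = flmp_probabilities(auditory, visual)
print({c: round(p, 3) for c, p in probabilities.items()})
```

Because /da/ is moderately supported by both sources while /ba/ and /ga/ are each strongly contradicted by one of them, the combined response favors /da/, which is the pattern reported for the McGurk effect.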
Native Language Magnet Theory
• According to Kuhl’s (1994) Native Language Magnet
(NLM) theory the phonetic perceptual space is
organized in terms of prototypes.
• Prototypes are defined as particularly good category
exemplars that function as perceptual references for
linguistic phonetic units (mental representations or
perceptual maps of the speech).
• Prototypes function as “perceptual-magnets” and
exert an attracting force on neighboring auditory
representations which tend to be assimilated by
(attracted towards) the prototype that conforms to
phonetic categories in the language that is heard
• Thus, the perceptual space appears to be warped in
the neighborhood of a prototype because a prototype
attracts exemplars that fall within its zone of
influence.
Kuhl, P. K., Conboy, B. T., Coffey-Corina, S., Padden, D., Rivera-Gaxiola, M., & Nelson, T. (2008).
Phonetic learning as a pathway to language: new data and native language magnet
theory expanded (NLM-e). Philosophical Transactions of the Royal Society B, 363, 979-1000.
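The warping of perceptual space around a prototype can be pictured with a one-dimensional toy mapping in which stimuli are pulled toward the prototype, with the pull fading with distance. This is purely an illustrative assumption: the Gaussian form, the function name `perceive`, and the `pull` and `width` parameters are not part of Kuhl's theory as stated on these slides.

```python
# Toy "perceptual magnet": stimuli near a prototype are drawn toward it.
# The Gaussian pull and all parameter values are illustrative assumptions.
import math

def perceive(x, prototype, pull=0.6, width=1.0):
    """Map an acoustic position x to a perceived position."""
    weight = pull * math.exp(-((x - prototype) ** 2) / (2 * width ** 2))
    return x + weight * (prototype - x)

prototype = 0.0
# Two pairs of stimuli with the same acoustic separation (0.5 units).
near = abs(perceive(0.5, prototype) - perceive(1.0, prototype))
far = abs(perceive(4.0, prototype) - perceive(4.5, prototype))
print(round(near, 3), round(far, 3))
```

The pair close to the prototype ends up closer together perceptually than the equally spaced pair far from it, which is one way to picture why good exemplars near a prototype are harder to discriminate.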
Native Language Magnet Theory
Data show infants perceptually “map” critical aspects
of ambient language in the first year of life before
they can speak.
Statistical properties of speech are picked up through
exposure to ambient language.
Linguistic experience alters infants' perception of
speech, warping perception in the service of
language.
Infants are neither the tabula rasas that Skinner described
nor the innate grammarians that Chomsky envisioned.
Infants have inherent perceptual biases that segment
phonetic units without providing innate descriptions
of them.
Native Language Magnet Theory
Infants use inherent learning strategies that were not
expected, ones thought to be too complex and
difficult for infants to use.
Adults addressing infants unconsciously modify
speech in ways that assist the brain mapping of
language.
In combination, these factors provide a powerful
discovery procedure for language.
Native Language Magnet Theory
Six tenets of a new view of language acquisition are offered:
(i) infants initially parse the basic units of speech, allowing them
to acquire higher-order units created by their combinations;
(ii) the developmental process is not a selectionist one in which
innately specified options are selected on the basis of
experience;
(iii) a perceptual learning process, unrelated to Skinnerian
learning, commences with exposure to language, during
which infants detect patterns, exploit statistical properties,
and are perceptually altered by that experience;
(iv) vocal imitation links speech perception and production early,
and auditory, visual, and motor information are coregistered
for speech categories;
(v) adults addressing infants unconsciously alter speech to
match infants' learning strategies, and this is instrumental in
supporting infants' initial mapping of speech; and
(vi) the critical period for language is influenced not only by time, but
by the neural commitment that results from experience.