An Introduction to
Speech Perception
Ph.D. student: Li Yujia, Rain
Supervisor: Prof. Tan Lee
Jan. 28, 2005
Contents
• Basic Knowledge
• Speech Perception
• Perception Theories
• Speech Perception versus Music Perception
• Applications
Basic Knowledge
• Three levels of speech
• Segments vs. supra-segments
• Basic acoustic features
• Auditory components of human speech perception
• Basic methodology of perception research
Three levels of speech
Speaker → Linguistic level (defines the rules) → Acoustic level (speech realization) → Perceptual level (interpretation) → Listener
Segments vs. Supra-segments
Segments (phonemes): vowels and consonants → intelligibility
Supra-segments (prosody): F0, duration, energy (stress, rhythm, intonation, emotion) → naturalness
Basic acoustic features
[Figure: speech waveform with the pitch period (1/F0) marked, and the corresponding spectrogram with formants labeled]
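To make these features concrete, here is a minimal Python sketch (an illustration, not part of the original slides): it loads a recording, estimates F0 on a single frame by autocorrelation, and computes a spectrogram in which the formants appear as dark bands. The file name "speech.wav" and the analysis settings are assumptions.

```python
# Illustrative sketch only: waveform, F0 (autocorrelation), and spectrogram.
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

fs, x = wavfile.read("speech.wav")              # hypothetical mono recording
x = x.astype(np.float64) / np.max(np.abs(x))    # normalized waveform

# F0 estimate for one 40 ms frame via the autocorrelation peak in 50-400 Hz.
frame = x[:int(0.04 * fs)]
ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
lo, hi = int(fs / 400), int(fs / 50)
f0 = fs / (lo + np.argmax(ac[lo:hi]))
print(f"F0 ~ {f0:.1f} Hz, pitch period 1/F0 ~ {1000.0 / f0:.1f} ms")

# Wide-band spectrogram (5 ms windows); formants show up as dark bands.
freqs, times, S = spectrogram(x, fs, nperseg=int(0.005 * fs))
```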
Auditory components of human speech perception
• The peripheral auditory organs (the ear): signal processing
• The auditory nervous system (the brain): interpretation of semantics and prosody
Basic methodology of perception research
• Stimuli: synthesized speech
• Testing: human listening tests
• Results are affected by
– Intrinsic factors: attributes of the speech sounds themselves
– Extrinsic factors: resulting from the experimental conditions
Speech Perception
• Perception of vowels
• Perception of consonants
• Perception of prosody
Perception of vowels (1)
• Vowel sounds are perceptually
specified by their formant frequencies.
[Figure: spectrogram of an /i/ vowel with the first and second formants labeled]
Perception of vowels (2)
• Evidence
– From production: vowel → tongue position → vocal-tract shape → formant frequencies.
– From perception: synthesized speech containing only the first two formants yields different vowel percepts (a toy F1/F2 classifier is sketched below).
– From physiology: "There is some evidence that the human auditory nerve already reacts directly to formant frequencies." (Delgutte, 1980)
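Since vowel identity is largely carried by the first two formants, a toy classifier can be written with nothing more than reference (F1, F2) pairs. This is an illustrative sketch; the reference values below are rough textbook averages for an adult male voice, not data from the slides.

```python
# Illustrative sketch: nearest-neighbour vowel classification in (F1, F2) space.
VOWEL_FORMANTS = {          # vowel: (F1 Hz, F2 Hz), rough adult-male averages
    "i": (270, 2290),
    "a": (730, 1090),
    "u": (300, 870),
}

def classify_vowel(f1: float, f2: float) -> str:
    """Return the reference vowel whose (F1, F2) pair is closest to the input."""
    return min(VOWEL_FORMANTS,
               key=lambda v: (VOWEL_FORMANTS[v][0] - f1) ** 2
                           + (VOWEL_FORMANTS[v][1] - f2) ** 2)

print(classify_vowel(280, 2250))   # -> "i"
```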
Perception of consonants (1)
• In perception, many consonants depend on the neighboring vowels; the perception of stop consonants, in particular, depends largely on the rapidly changing formant transitions.
[Figure: schematic of the first two formant frequencies for a /di/ syllable, showing the transition and the steady state]
Perception of consonants (2)
[Figure: schematic first two formant patterns for /d/ before different vowels]
• Lack of acoustic invariance: there is nothing constant in the spectrographic representation (the visual representation of speech) that explains the perception of a particular consonant.
• Locus theory: the second-formant transitions all seem to point toward the same frequency, which is called the locus (a toy sketch follows below).
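The locus idea can be visualized with a few lines of code. In this illustrative sketch the /d/ locus of about 1800 Hz and the vowel F2 targets are hypothetical round numbers; the point is only that every transition starts from (or points back to) the same frequency.

```python
# Illustrative sketch: F2 transitions from a common /d/ locus to vowel targets.
import numpy as np

D_LOCUS_HZ = 1800.0                                    # hypothetical /d/ locus
VOWEL_F2_HZ = {"i": 2300.0, "a": 1100.0, "u": 900.0}   # hypothetical steady states

def f2_transition(vowel: str, n_points: int = 10) -> np.ndarray:
    """Linear F2 trajectory from the /d/ locus to the vowel steady state."""
    return np.linspace(D_LOCUS_HZ, VOWEL_F2_HZ[vowel], n_points)

for v in VOWEL_F2_HZ:
    print(v, np.round(f2_transition(v, 5)))
```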
Perception of consonants (3)
• What is the basic unit of speech perception?
– Because stop consonants cannot be isolated from vowels in perception, researchers began to think of speech as encoded (vowels and consonants squeezed together), perhaps in syllable-sized units.
• Speech can be presented at a faster rate (up to about 30 phonemes per second) than other sounds and still remain intelligible.
Perception of prosody (1)
• The perception of prosody has been described as dependent on the "melody of speech": the fluctuations in pitch, rhythm, and stress (Monrad-Krohn, 1947).
• The related acoustic features are F0, duration, and intensity.
Perception of prosody (2)
• Perception of prosody is more complex because:
– The definition of prosody is relatively vague.
– The perception of prosody is nonlinear in the acoustic features (doubling F0 does not double the pitch; doubling the duration does not double the stress; see the sketch after this list).
– Prosody is perceived over a long time span in a relative sense (as the degree of contrast between the values of the acoustic variables over a number of syllables).
– A perceived attribute of prosody may be related to several acoustic features (F0 is the most powerful cue to stress, followed by duration and intensity).
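The nonlinearity between F0 and pitch is usually handled in engineering practice by working on a logarithmic (semitone) scale, as in the small sketch below; the 100 Hz reference is an arbitrary choice, and the conversion itself is a standard convention rather than something taken from the slides.

```python
# Illustrative sketch: pitch is roughly logarithmic in F0, so doubling F0
# adds one octave (12 semitones) rather than "doubling" the pitch.
import math

def hz_to_semitones(f0_hz: float, ref_hz: float = 100.0) -> float:
    """Semitones above an arbitrary 100 Hz reference."""
    return 12.0 * math.log2(f0_hz / ref_hz)

print(hz_to_semitones(100.0))   # 0.0
print(hz_to_semitones(200.0))   # 12.0 -> one octave up
```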
Perception of prosody (3)
• Research on prosody perception is relatively sparse.
• The targets of our research will be:
– To determine, going from acoustics to perception, how one or several acoustic features contribute to perceived naturalness.
– To improve the naturalness of synthesized speech in an effective way.
Perception Theories
• Masking
• Categorical perception
• Motor theory
• Analysis-by-synthesis
• Bottom-up versus top-down
Masking
• Frequency masking
– One sound cannot be perceived if another
sound close in frequency has a high enough
level.
• Temporal masking
– A sound cannot be perceived if it is too close in
time to another sound.
– Pre-masking lasts only about 5 ms; post-masking can last from 50 to 300 ms (a toy check is sketched after the figure below).
[Figure: masker A and target sound B on a time axis, illustrating pre-masking (about 5 ms) and post-masking (50-300 ms)]
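The temporal-masking windows quoted above can be turned into a toy masking check. This is an illustrative sketch, not a psychoacoustic model: it only tests whether a weak sound falls inside the pre-masking (about 5 ms) or post-masking (assumed here to be 200 ms, within the 50-300 ms range) window of a loud masker.

```python
# Illustrative sketch: is a weak sound temporally masked by a loud masker?
PRE_MASK_S = 0.005    # pre-masking window (~5 ms before the masker starts)
POST_MASK_S = 0.200   # assumed post-masking window (within 50-300 ms)

def is_temporally_masked(t_sound: float, masker_start: float,
                         masker_end: float) -> bool:
    """True if t_sound (seconds) falls inside the masker's masking window."""
    return masker_start - PRE_MASK_S <= t_sound <= masker_end + POST_MASK_S

print(is_temporally_masked(1.05, masker_start=0.9, masker_end=1.0))  # True (post-masking)
print(is_temporally_masked(1.50, masker_start=0.9, masker_end=1.0))  # False
```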
Categorical perception (1)
• Voice onset time (VOT) (Lisker and Abramson, 1964)
– Voiced versus voiceless: whether the vocal folds vibrate (e.g., /z/ vs. /s/).
– The difference between voiced and voiceless stop consonants (e.g., /b/ and /p/, /d/ and /t/, /g/ and /k/) is actually one of the relative timing of the onset of vocal-fold vibration.
– This timing difference is referred to as the voice onset time (VOT).
Categorical perception (2)
• Voice onset time (VOT)
– Voiced stop consonants have a relatively short VOT, whereas voiceless stop consonants have a longer VOT (a toy classifier is sketched after the figure below).
[Figure: waveforms showing the VOT measurement for a /b/ (short VOT) and for a /p/ (long VOT)]
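Measuring VOT amounts to taking the time difference between the release burst and the onset of voicing, which can then be compared with a category boundary. In the sketch below the 30 ms boundary is a hypothetical round number (real boundaries vary with place of articulation and language).

```python
# Illustrative sketch: VOT measurement and a toy voiced/voiceless decision.
VOT_BOUNDARY_S = 0.030    # hypothetical category boundary

def voice_onset_time(burst_time: float, voicing_time: float) -> float:
    """VOT in seconds: onset of vocal-fold vibration minus the burst release."""
    return voicing_time - burst_time

def classify_stop(vot: float) -> str:
    return "voiced (e.g. /b/, /d/, /g/)" if vot < VOT_BOUNDARY_S \
        else "voiceless (e.g. /p/, /t/, /k/)"

vot = voice_onset_time(burst_time=0.100, voicing_time=0.110)   # 10 ms
print(f"VOT = {vot * 1000:.0f} ms ->", classify_stop(vot))
```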
Categorical perception (3)
• VOT categories
– From production: [Figure: VOT productions of a single normal adult speaker of American English for words beginning with /d/ and /t/]
– From perception: [Figure: identification functions of a single listener for a VOT continuum from /d/ to /t/ in approximately 11 ms steps; each stimulus was presented 10 times in random order] (a toy tabulation is sketched below)
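An identification function of this kind is easy to tabulate from the raw listening responses: for each VOT step, compute the proportion of /t/ answers and locate the boundary where it crosses 50%. The response counts in this sketch are made up for illustration only.

```python
# Illustrative sketch: building an identification function and finding the
# category boundary from (made-up) listening-test responses.
vot_steps_ms = [0, 11, 22, 33, 44, 55]
t_responses  = [0, 1, 2, 9, 10, 10]        # "/t/" answers out of 10 presentations

id_function = [n / 10 for n in t_responses]
boundary_ms = next(v for v, p in zip(vot_steps_ms, id_function) if p >= 0.5)
print(id_function)                 # steep jump between 22 and 33 ms -> categorical
print("category boundary ~", boundary_ms, "ms VOT")
```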
Categorical perception (4)
• Categorical perception
– The insensitivity to differences within a category, combined with keen sensitivity to cross-category differences, is referred to as categorical perception.
– It is characteristic of certain speech-sound distinctions and is generally not found for nonspeech sounds (Cutting, 1972).
– It represents one of the perceptual mechanisms by which humans cope rapidly with a tremendous amount of variation (nonessential variation within a category is ignored).
Motor theory (1)
• Motor commands:
– The neural messages that the brain sends to set the articulators in motion to produce speech.
• Motivation:
– When a stop consonant is produced in various vowel contexts, because of the lack of acoustic invariance there must be constant motor commands to the articulators to produce the same consonant.
Motor theory (2)
• Original theory:
– “Though we cannot exclude the possibility
that a purely auditory decoder exists, we
find it more plausible to assume that
speech is perceived by processes that are
also involved in its production” (Liberman,
Cooper, Shankweiler, & Studdert-Kennedy,
1967).
Motor theory (3)
• Weak version: ()
– Speech production offers important cues
about speech perception which can be
used by listeners.
• Strong version:
– Speech production forms the basis for
speech perception.
Analysis-by-synthesis
• Listeners are hypothesized to decode the acoustic signal by internally generating matching signals.
• The signal that provides the best match is the one "perceived" by the listener.
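The analysis-by-synthesis idea can be caricatured as template matching against internally generated candidates. In the sketch below, synthesize() is a hypothetical stand-in for a real articulatory or acoustic production model, and the vowel set and frequencies are illustrative only.

```python
# Illustrative sketch: "perceive" the candidate whose synthesized signal best
# matches the incoming signal (a toy analysis-by-synthesis loop).
import numpy as np

def synthesize(phoneme: str, n: int = 200, fs: float = 16000.0) -> np.ndarray:
    """Hypothetical internal model: a sinusoid at a phoneme-specific frequency."""
    freqs = {"i": 2290.0, "a": 1090.0, "u": 870.0}    # rough F2 values
    t = np.arange(n) / fs
    return np.sin(2 * np.pi * freqs[phoneme] * t)

def perceive(signal: np.ndarray) -> str:
    """Return the candidate with the smallest synthesis-to-signal mismatch."""
    return min("iau", key=lambda p: np.sum((synthesize(p, len(signal)) - signal) ** 2))

incoming = synthesize("a") + 0.1 * np.random.randn(200)   # noisy /a/-like input
print(perceive(incoming))                                  # -> "a"
```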
Bottom-up versus top-down (1)
• Bottom-up:
– Use the acoustic information to discover
what is being uttered.
• Top-down:
– Use linguistic knowledge (context) to predict or constrain what is being uttered.
Bottom-up versus top-down (2)
• Bottom-up information is important at the beginning of an utterance, while top-down information becomes primary as more syllables of the sentence are uttered.
[Figure: relative contribution of bottom-up and top-down information over the course of an utterance]
• The role of top-down information is supported by the observation that good organization and prosody speed up the understanding of speech.
Speech Perception versus
Music Perception
• Physical differences in perception
[Figure: example acoustic representations for speech and for music]
• Categorical perception in speech; continuous perception in music
– We can discriminate about 1200 different pitches in music, but we can only absolutely identify about 7 (Liberman, 1967).
– For certain sound differences relevant to speech, listeners can only discriminate accurately about as many sounds as they can identify.
Applications
• Speech recognition
• Speech synthesis
• Speaker recognition
• Hearing aids
Summary
• Speech perception
– vowel, consonant, prosody
• Perception theories
– Masking, categorical perception, motor theory, analysis-by-synthesis, bottom-up and top-down
• Speech vs. music perception
Conclusions
• What we know about speech perception is still very limited, especially about the perception of prosody.
• A better understanding of speech perception will greatly benefit speech technology.
References
1. Jack Ryalls, 1996. A Basic Introduction to Speech Perception. San Diego, Calif.: Singular Pub. Group.
2. Gloria J. Borden, Katherine S. Harris, Lawrence J. Raphael, 2003. "Speech perception", chapter 6 in Speech Science Primer: Physiology, Acoustics, and Perception of Speech. Philadelphia: Lippincott Williams & Wilkins.
3. Raymond D. Kent, 1997. "Speech perception", chapter 10 in The Speech Sciences. San Diego: Singular Pub. Group.
4. Richard B. Ivry and Lynn C. Robertson, 1998. "Speech perception and language", chapter 6 in The Two Sides of Perception. Cambridge, Mass.: MIT Press.
5. J. M. Pickett, 1999. The Acoustics of Speech Communication: Fundamentals, Speech Perception Theory, and Technology. Boston: Allyn and Bacon.
6. Xuedong Huang, Alex Acero, Hsiao-Wuen Hon, 2001. "Spoken language structure", chapter 2 in Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Upper Saddle River, N.J.: Prentice Hall PTR.
7. J. Liu, 2001. Tonal Behavior in Some Tone Languages. Ph.D. Dissertation, City University of Hong Kong.
8. Chu Min, Lu Shinan, Si Hongyan, He Lin, Guan Dinghua, 1996. "The control of juncture and prosody in Chinese TTS system", in Proceedings of ICSLP 1996, Volume 1, pp. 725-728.
9. Pagel, V., Carbonell, N., Laprie, Y., 1996. "A new method for speech delexicalization, and its application to the perception of French prosody", in Proceedings of ICSLP 1996, Volume 2, pp. 821-824.
10. Heuft, B., Portele, T., 1996. "Synthesizing prosody: a prominence-based approach", in Proceedings of ICSLP 1996, Volume 3, pp. 1361-1364.
11. Vainio, M., Jarvikivi, J., Werner, S., Volk, N., Valikangas, J., 2002. "Effect of prosodic naturalness on segmental acceptability in synthetic speech", in Proceedings of the 2002 IEEE Workshop on Speech Synthesis, pp. 143-146.
12. Yong-Ju Lee, Sook-Hyang Lee, 1996. "On phonetic characteristics of pause in the Korean read speech", in Proceedings of ICSLP 1996, Volume 1, pp. 118-120.
13. House, D., 1996. "Differential perception of tonal contours through the syllable", in Proceedings of ICSLP 1996, Volume 4, pp. 2048-2051.