Landmark-Based Speech
Recognition:
Spectrogram Reading,
Support Vector Machines,
Dynamic Bayesian Networks,
and Phonology
Mark Hasegawa-Johnson
[email protected]
University of Illinois at Urbana-Champaign, USA
Lecture 3: Spectral Dynamics and the
Production of Consonants
• International Phonetic Alphabet
• Events in the Closure of a Nasal Consonant
– Formant transitions: a perturbation model
– Nasalized vowel
– Nasal murmur
• Events in the Release of a Stop Consonant
– Pre-voicing (voiced stops in carefully read English)
– Transient (stops and affricates)
– Frication (stops, affricates, and fricatives)
– Aspiration (aspirated stops and /h/)
– Formant Transitions (any consonant-vowel transition)
• Formant Tracking
– Does it help Speech Recognition?
– Methods for Vowels, and for Aspiration & Nasals
• Reminder – lab 1 due Monday!
International Phonetic Alphabet:
Purpose and Brief History
• Purpose of the alphabet: to provide a universal notation for
the sounds of the world’s languages
– “Universal” = If any language on Earth distinguishes two
phonemes, IPA must also distinguish them
– “Distinguish” = Meaning of a word changes when the phoneme
changes, e.g. “cat” vs. “bat.”
• Very Brief History:
– 1867: Alexander Melville Bell publishes a distinctive-feature-based
phonetic notation in “Visible Speech: The Science of Universal
Alphabetics.” His notation is rejected as being too expensive to print
– 1886: International Phonetic Association founded in Paris by
phoneticians from across Europe
– 1991: Unicode provides a standard method for including IPA
notation in computer documents
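As a small illustration of IPA notation in a computer document, the sketch below uses Python's standard `unicodedata` module to list the code point and Unicode name of each character in an IPA string. The helper name `describe_ipa` and the sample string (the affricate tʃ) are chosen here for illustration only.

```python
import unicodedata


def describe_ipa(text):
    """Return (character, code point, Unicode name) for each character in text."""
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


# Illustrative input: the post-alveolar affricate written as two IPA characters.
for ch, code_point, name in describe_ipa("tʃ"):
    print(ch, code_point, name)
```

Because each IPA symbol is an ordinary Unicode character, the same string can be stored, searched, and compared like any other text.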
International Phonetic Alphabet:
Vowels
[Figure: IPA vowel chart annotated with approximate Pinyin and ARPABET equivalents. Pinyin labels: i, u (xu), u (zhu), e, o, a (ma), a (zhang). ARPABET labels: IY, UX, UW, EY, EH, UH, OW, AH, AO, AA, AX, AE.]
IPA: Regular Consonants
[Figure: IPA consonant chart with articulator labels (Tongue Body, Tongue Blade) and ARPABET labels NG, Q, DX, HH/HV, R, Y.]
ARPABET: F/V (labiodental), TH/DH (dental), S/Z (alveolar), SH/ZH (postalveolar or palatal)
Pinyin: s (alveolar), x (postalveolar), sh/r (retroflex)
Affricates and Doubly-Articulated
Consonants
ARPABET labels on the chart: WH, W
Affricates in English and Chinese:
Place            Pinyin   ARPABET   IPA
Alveolar         c/z      (none)    ts/dz
Post-alveolar    q/j      CH/JH     tʃ/dʒ
Retroflex        ch/zh    (none)    ʈʂ/ɖʐ
Non-Pulmonic Consonants
Events in the Closure of a
Syllable-Final Nasal
Consonant
Events in the Closure of a Nasal
Consonant
Formant Transitions
Vowel Nasalization
Nasal Murmur
Formant Transitions: A Perturbation
Theory Model
“the mom”
Formant
Transitions:
Labial
Consonants
“the bug”
“the supper”
Formant
Transitions:
Alveolar
Consonants
“the tug”
“the shoe”
Formant
Transitions:
Post-alveolar
Consonants
“the zsazsa”
Formant
Transitions:
Velar
Consonants
“the gut”
“sing a song”
Formant Transitions: A Perceptual
Study
The study: (1) Synthesize speech with different formant patterns, (2) record
subject responses. Delattre, Liberman and Cooper, J. Acoust. Soc. Am. 1955.
Perception of Formant Transitions:
Conclusions
Vowel Nasalization
Additive Terms in the Log Spectrum
Transfer Function of a Nasalized
Vowel
Nasal Murmur
“the mug”
“the nut”
“sing a song”
Observations:
Low-frequency resonance (about 300Hz) always present
Low-frequency resonance has wide bandwidth (about 150Hz)
Energy of low-frequency resonance is very constant
Most high-frequency resonances cancelled by zeros
Different places of articulation have different high frequency spectra
High-frequency spectrum is talker-dependent and variable
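The first two observations can be pictured with a single resonance. The sketch below is a minimal model, not the full nasal transfer function: it represents only the low-frequency resonance (about 300 Hz, bandwidth about 150 Hz) as a two-pole digital resonator and ignores the zeros and high-frequency resonances. The 8 kHz sampling rate and the function name `resonator_gain_db` are assumptions made for illustration.

```python
import cmath
import math


def resonator_gain_db(freq_hz, center_hz=300.0, bandwidth_hz=150.0, fs=8000.0):
    """Gain in dB of a two-pole digital resonator, evaluated at freq_hz."""
    r = math.exp(-math.pi * bandwidth_hz / fs)        # pole radius set by the bandwidth
    theta = 2.0 * math.pi * center_hz / fs            # pole angle set by the centre frequency
    z_inv = cmath.exp(-2j * math.pi * freq_hz / fs)   # z^-1 on the unit circle
    denom = 1.0 - 2.0 * r * math.cos(theta) * z_inv + (r * z_inv) ** 2
    return -20.0 * math.log10(abs(denom))


# Scan a frequency grid and locate the spectral peak of the modelled murmur resonance.
freqs_hz = range(50, 2001, 10)
peak_hz = max(freqs_hz, key=resonator_gain_db)
print("peak near", peak_hz, "Hz")
print("gain at 1 kHz relative to peak:",
      round(resonator_gain_db(1000.0) - resonator_gain_db(peak_hz), 1), "dB")
```

With a bandwidth this wide relative to the centre frequency, the spectral peak of a two-pole resonator sits slightly below the nominal pole frequency.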
Resonances of a Nasal Consonant
Reference: Fujimura, JASA 1962
Anti-Resonances of a Nasal
Consonant
Events in the Release of a
Stop (Plosive) Consonant
Events in the Release of a Stop
“Burst” = transient + frication (the part of the spectrogram whose transfer
function has poles only at the front cavity resonance frequencies, not at the
back cavity resonances).
Events in the Release of a Stop
Unaspirated (/b/)
Transient
Frication Aspiration Voicing
Aspirated (/t/)
Pre-voicing during Closure
To make a voiced stop in most European languages: the tongue root is
relaxed, allowing it to expand, so that the vocal folds can continue
vibrating for a little while after oral closure.
The result is a low-frequency “voice bar” that may continue well
into closure.
In English, closure voicing is typical of read speech, but not
casual speech.
“the bug”
Transient: The Release of Pressure
Transfer Function During Transient
and Frication: Poles
Turbulence striking an
obstacle makes noise
Front cavity
resonance
frequency:
$F_R = c/(4L_f)$ (c = speed of sound, $L_f$ = length of the front cavity)
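A worked example of the quarter-wavelength formula, as a minimal sketch. The speed of sound (35400 cm/s) and the front-cavity lengths are assumed values chosen for illustration, not measurements from the lecture; the function name is likewise hypothetical.

```python
SPEED_OF_SOUND_CM_PER_S = 35400.0  # assumed value for warm, moist air


def front_cavity_resonance_hz(front_cavity_length_cm, c=SPEED_OF_SOUND_CM_PER_S):
    """Quarter-wavelength resonance F_R = c / (4 * L_f) of the front cavity."""
    if front_cavity_length_cm <= 0:
        raise ValueError("front cavity length must be positive")
    return c / (4.0 * front_cavity_length_cm)


# Hypothetical front-cavity lengths: a shorter cavity gives a higher resonance.
for length_cm in (1.0, 2.0, 4.0):
    print(f"L_f = {length_cm} cm -> F_R = {front_cavity_resonance_hz(length_cm):.1f} Hz")
```

Doubling the front-cavity length halves the resonance frequency, which is why constrictions further back in the mouth produce lower-frequency bursts.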
Transfer Function During Frication:
An Important Zero
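The Summary describes the frication transfer function as a zero at f = 0 over the front-cavity resonances. The sketch below is one possible discrete-time reading of that description, under stated assumptions: a single zero at z = 1, a single front-cavity resonance at a hypothetical 4425 Hz with a hypothetical 300 Hz bandwidth, and a 16 kHz sampling rate. None of these numbers come from the lecture.

```python
import cmath
import math


def frication_gain(freq_hz, front_resonance_hz=4425.0, bandwidth_hz=300.0, fs=16000.0):
    """Magnitude of (zero at f = 0) / (one front-cavity pole pair) at freq_hz."""
    z_inv = cmath.exp(-2j * math.pi * freq_hz / fs)
    zero = 1.0 - z_inv                                  # zero at z = 1, i.e. at f = 0
    r = math.exp(-math.pi * bandwidth_hz / fs)          # pole radius from bandwidth
    theta = 2.0 * math.pi * front_resonance_hz / fs     # pole angle from resonance frequency
    poles = 1.0 - 2.0 * r * math.cos(theta) * z_inv + (r * z_inv) ** 2
    return abs(zero / poles)


# The gain vanishes at f = 0 and peaks close to the front-cavity resonance.
grid_hz = range(0, 8000, 25)
frication_peak_hz = max(grid_hz, key=frication_gain)
print("gain at 0 Hz:", frication_gain(0.0))
print("peak near", frication_peak_hz, "Hz")
```

The zero removes low-frequency energy, so the frication spectrum in this model is dominated by the front-cavity resonance.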
Transfer Function During Aspiration
Are Formant Frequencies Useful for
Speech Recognition?
• Kopec and Bush (1992): WER(formants alone) >
WER(cepstrum alone) > WER(formants and cepstrum
together)
• How should we track formants?
– In vowels: Autoregressive (AR) modeling (also
known as LPC)
– In aspiration, nasals: Autoregressive Moving
Average (ARMA) modeling. Problem: no closed-form solution
– In aspiration, nasals: Exponentially Weighted
Autoregressive (EWAR; Zheng and Hasegawa-Johnson, ICASSP 2004)
Formant Tracking for Vowels:
Autoregressive Model (LPC)
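A minimal sketch of AR (LPC) formant estimation, under several assumptions that are not in the slides: the input is a synthetic signal (the impulse response of a hypothetical two-formant all-pole filter at 500 Hz and 1500 Hz), the AR coefficients are found with the autocorrelation method and the Levinson-Durbin recursion, and formants are read off by peak-picking the LPC spectrum rather than by solving for polynomial roots. All function names are illustrative.

```python
import cmath
import math


def resonator_coeffs(freq_hz, bw_hz, fs):
    """Denominator [1, a1, a2] of a two-pole resonator."""
    r = math.exp(-math.pi * bw_hz / fs)
    return [1.0, -2.0 * r * math.cos(2.0 * math.pi * freq_hz / fs), r * r]


def all_pole_filter(a, x):
    """Apply y[n] = x[n] - sum_{k>=1} a[k] * y[n-k]."""
    y = []
    for n, xn in enumerate(x):
        acc = xn
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * y[n - k]
        y.append(acc)
    return y


def autocorrelation(x, max_lag):
    """Autocorrelation r[0..max_lag] of the sequence x."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations; returns [1, a1, ..., ap]."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k                  # updated prediction error
    return a


def lpc_spectrum_peaks(a, fs, step_hz=10.0):
    """Local maxima of 1/|A(e^{jw})| on a frequency grid below fs/2."""
    freqs = [i * step_hz for i in range(1, int(fs / 2 / step_hz))]
    gains = []
    for f in freqs:
        z_inv = cmath.exp(-2j * math.pi * f / fs)
        gains.append(1.0 / abs(sum(ak * z_inv ** k for k, ak in enumerate(a))))
    return [freqs[i] for i in range(1, len(freqs) - 1)
            if gains[i - 1] < gains[i] > gains[i + 1]]


fs = 8000.0
impulse = [1.0] + [0.0] * 511
# Synthetic "vowel": cascade of two hypothetical formant resonators.
signal = all_pole_filter(resonator_coeffs(1500.0, 100.0, fs),
                         all_pole_filter(resonator_coeffs(500.0, 80.0, fs), impulse))
lpc = levinson_durbin(autocorrelation(signal, 4), 4)
formants = lpc_spectrum_peaks(lpc, fs)
print("estimated formants (Hz):", formants)
```

An order-4 model suffices here only because the synthetic signal has exactly two resonances; real speech typically needs a higher order, plus windowing and pre-emphasis, none of which are shown.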
Formant Tracking for Aspiration:
“Auto-Regressive Moving Average”
Model (ARMA)
Formant Tracking for Aspiration:
“Exponentially Weighted Auto-Regressive” Model (EWAR)
(Zheng and Hasegawa-Johnson, ICSLP 2004)
Solving the EWAR Model
Results: Stop Classification, MFCC
alone vs. MFCC+formants
Summary
• International Phonetic Alphabet:
– Useful on any computer with Unicode support
– International encoding for all sounds of the world’s languages
• Events in a nasal closure:
– Formant transitions (perturbation model)
– Vowel nasalization (sum of TFs)
– Nasal murmur (impedance match at juncture)
• Events in release of a stop:
– Pre-voicing in English voiced stops (read speech)
– Transient (dp/dt ~ dA/dt)
– Frication ((zero at f=0)/(front cavity resonances))
– Aspiration ((zero at f=0)/(same poles as the vowel))
• Formant tracking
– In a vowel: use LPC
– In aspiration, frication, or nasal murmur: ARMA is theoretically
optimal, but computationally expensive
– In aspiration, etc.: EWAR can be a good approximation to ARMA