Cairo University
Faculty of Computers and Information
HMM Based Speech Synthesis
Presented by
Ossama Abdel-Hamid Mohamed
December 2006
Agenda
• Speech Synthesis
• HMM Based Speech Synthesis
• Proposed System
• Challenges
Speech Synthesis
• What is speech synthesis?
– Generating human-like speech using computers.
• Applications
– Text to speech.
– Conversation systems.
– Speech-to-speech translation.
– Concept to speech.
• Systems have been built since the late 1970s.
– MITalk (1979)
– Klattalk (1980)
Speech Synthesis, Cont.
• Challenges:
– Intelligibility.
– Naturalness.
– Pleasantness.
– Emotions.
Speech Synthesis, Techniques
• Formant Based
– Rule based.
– Difficult to make.
– Machine like.
• Concatenative
– Instance based.
– Based on corpus.
– Better quality.
– Not flexible.
• HMM Based
– Statistical based.
– Based on corpus.
– Newest technique.
– More flexible.
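HMM Based Speech Synthesis Overview
• HMMs have been used successfully in speech recognition.
• In recognition, the model that best explains the observations is chosen:
$\lambda^{*} = \arg\max_{\lambda} P(O \mid \lambda)$
• In speech synthesis, the observation sequence that is most likely under the given model is generated:
$O^{*} = \arg\max_{O} P(O \mid \lambda)$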
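The two criteria above differ only in which variable is searched over. The following toy sketch illustrates this; the likelihood table, the model names, and the observation names are made up for illustration and are not taken from the presented system.

```python
# Hypothetical likelihood table P(O | model) for two models and three
# candidate observation sequences. All values are illustrative only.
LIKELIHOOD = {
    "model_a": {"obs_1": 0.7, "obs_2": 0.2, "obs_3": 0.1},
    "model_b": {"obs_1": 0.1, "obs_2": 0.3, "obs_3": 0.6},
}


def recognize(observation):
    """Recognition: fix O and search over models for the largest P(O | model)."""
    return max(LIKELIHOOD, key=lambda model: LIKELIHOOD[model][observation])


def synthesize(model):
    """Synthesis: fix the model and search over O for the largest P(O | model)."""
    table = LIKELIHOOD[model]
    return max(table, key=table.get)
```

In a real system the search over O is continuous, so it is solved analytically rather than by enumeration.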
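HMM Based Speech Synthesis Overview, Cont.
• Include delta and acceleration features to get smooth output.
– Without them, the most likely output is just the sequence of state means, which jumps at every state boundary.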
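A minimal sketch of how the dynamic features lead to a smooth trajectory, under several assumptions that are not stated on the slide: a single scalar parameter stream, a fixed state sequence, diagonal variances, and the commonly used windows (-0.5, 0, 0.5) for delta and (1, -2, 1) for acceleration. With the observation written as o = W c, the most likely static sequence c solves (Wᵀ Σ⁻¹ W) c = Wᵀ Σ⁻¹ μ. The function names are hypothetical.

```python
def window_matrix(T):
    """Build the (3T x T) matrix W that maps static values to
    [static, delta, acceleration] rows for each frame."""
    windows = [(0.0, 1.0, 0.0), (-0.5, 0.0, 0.5), (1.0, -2.0, 1.0)]
    W = [[0.0] * T for _ in range(3 * T)]
    for t in range(T):
        for k, win in enumerate(windows):
            for offset, coeff in zip((-1, 0, 1), win):
                tau = min(max(t + offset, 0), T - 1)  # repeat edge frames
                W[3 * t + k][tau] += coeff
    return W


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def generate_trajectory(means, variances):
    """means, variances: one [static, delta, acceleration] triple per frame.
    Returns the static sequence that maximizes the Gaussian likelihood."""
    T = len(means)
    W = window_matrix(T)
    mu = [m for frame in means for m in frame]
    precision = [1.0 / v for frame in variances for v in frame]
    rows = range(3 * T)
    A = [[sum(W[r][i] * precision[r] * W[r][j] for r in rows)
          for j in range(T)] for i in range(T)]
    b = [sum(W[r][i] * precision[r] * mu[r] for r in rows) for i in range(T)]
    return solve(A, b)
```

For example, with static means that step from 0 to 1 and zero delta and acceleration means, the generated sequence no longer equals the piecewise-constant static means, because the dynamic-feature terms penalize the jump.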
The Overall System
[Figure: block diagram of the overall system.]
• Training Part: Speech Database → F0 Extraction (f0) and Mel-Cepstral Analysis (mel-cepstrum); Text Analysis → labels and context features; all of these feed HMM Training → Models.
• Synthesis Part: Text → Text Analysis → labels and context features → Parameters Generation (using the Models) → f0 and mel-cepstrum; f0 → Pulse or Noise Excitation → excitation; excitation and mel-cepstrum → MLSA filter → Speech.
The Overall System, Cont.
• 25 mel-cepstral coefficients.
• f0 modeled using MSD-HMM.
The Overall System, Cont.
• Context-dependent models.
• Each model has 5 states.
The Overall System, Cont.
• Each frame is either voiced or unvoiced.
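The voiced/unvoiced decision drives the "Pulse or Noise Excitation" block of the diagram. Below is a minimal sketch of that block, assuming a 16 kHz sample rate, an 80-sample frame shift, and f0 = 0 as the marker for unvoiced frames. None of these values, nor the function name, come from the slides.

```python
import math
import random


def pulse_or_noise_excitation(f0_per_frame, sample_rate=16000,
                              frame_shift=80, seed=0):
    """Return an excitation signal: a pulse train in voiced frames (f0 > 0)
    and white Gaussian noise in unvoiced frames (f0 == 0)."""
    rng = random.Random(seed)
    excitation = []
    next_pulse = 0.0  # sample position of the next pitch pulse
    for frame_index, f0 in enumerate(f0_per_frame):
        start = frame_index * frame_shift
        if f0 > 0:
            period = sample_rate / f0  # pitch period in samples
            next_pulse = max(next_pulse, float(start))
            for n in range(start, start + frame_shift):
                if n >= next_pulse:
                    # sqrt(period) keeps the average power close to 1
                    excitation.append(math.sqrt(period))
                    next_pulse += period
                else:
                    excitation.append(0.0)
        else:
            excitation.extend(rng.gauss(0.0, 1.0) for _ in range(frame_shift))
    return excitation
```

Because every frame is forced to be purely a pulse train or purely noise, mixed-excitation sounds cannot be represented by this scheme.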
Advantages
1. Its voice characteristics can be easily modified.
2. It can be applied to various languages with little modification.
3. A variety of speaking styles or emotional speech can be synthesized using a small amount of speech data.
4. Techniques developed in ASR can be easily applied.
5. Its footprint is relatively small.
• An HMM-based TTS system produced the best results in the Blizzard Challenge.
Problems we tried to solve
1. Marking each frame as either voiced or unvoiced degrades quality, because most voiced speech contains some unvoiced components, and some phonemes have mixed excitation.
2. The speech signal analysis/synthesis techniques and parameters used also degrade quality.
Multi-Band Excitation
• In MBE (Multi-Band Excitation), speech is divided into a number of frequency bands, and voicing is estimated in each band (17 bands were used).
Mixed Excitation
• In synthesis, periodic and noise excitations are mixed according to the voicing parameters.
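A minimal sketch of per-band mixing, assuming the periodic and noise excitations have already been split into band-limited signals and that each band has a voicing strength between 0 and 1. The square-root gains, which keep the mixed power constant, are an assumption of this sketch rather than the presented system's rule, and the function name is hypothetical.

```python
import math


def mix_bands(periodic_bands, noise_bands, band_voicing):
    """Mix band-limited periodic and noise excitations.

    periodic_bands[b] and noise_bands[b] are equal-length sample lists for
    band b; band_voicing[b] is the voicing strength of band b in [0, 1]
    (1 = fully voiced, 0 = fully unvoiced).
    """
    num_samples = len(periodic_bands[0])
    mixed = [0.0] * num_samples
    for periodic, noise, voicing in zip(periodic_bands, noise_bands,
                                        band_voicing):
        periodic_gain = math.sqrt(voicing)
        noise_gain = math.sqrt(1.0 - voicing)
        for n in range(num_samples):
            mixed[n] += periodic_gain * periodic[n] + noise_gain * noise[n]
    return mixed
```

With 17 bands, `band_voicing` would hold 17 values per frame, so a frame can be voiced in its low bands and noisy in its high bands at the same time.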
Spectral Envelope Estimation
• Estimate the envelope value at a fixed number of sample points.
• Use a sinusoidal model for synthesis.
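A minimal sketch of sinusoidal synthesis from envelope samples, assuming the samples are linear amplitudes spaced uniformly from 0 Hz to the Nyquist frequency, that harmonic amplitudes are read off by linear interpolation, and that all phases are zero. These choices and the function names are illustrative assumptions, not details given on the slide.

```python
import math


def envelope_at(freq_hz, envelope_samples, sample_rate):
    """Linearly interpolate the envelope at freq_hz, assuming the samples
    are uniformly spaced between 0 Hz and the Nyquist frequency."""
    nyquist = sample_rate / 2.0
    position = freq_hz / nyquist * (len(envelope_samples) - 1)
    low = int(position)
    high = min(low + 1, len(envelope_samples) - 1)
    fraction = position - low
    return ((1.0 - fraction) * envelope_samples[low]
            + fraction * envelope_samples[high])


def synthesize_frame(f0, envelope_samples, num_samples, sample_rate=16000):
    """Sum the harmonics of f0 below Nyquist, each weighted by the envelope."""
    nyquist = sample_rate / 2.0
    harmonics = []
    k = 1
    while k * f0 < nyquist:
        amplitude = envelope_at(k * f0, envelope_samples, sample_rate)
        harmonics.append((k * f0, amplitude))
        k += 1
    frame = []
    for n in range(num_samples):
        t = n / sample_rate
        frame.append(sum(a * math.cos(2.0 * math.pi * f * t)
                         for f, a in harmonics))
    return frame
```

A fixed number of envelope samples gives every frame the same parameter dimension regardless of f0, which is what HMM training needs.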
Modified System
[Figure: block diagram of the modified system.]
• Training Part: Speech Database → F0 Extraction (f0), Bands Voicing Detection (bands voicing), and Spectral Envelope Analysis (spectral envelope samples); Text Analysis → labels and context features; all of these feed HMM Training → Models.
• Synthesis Part: Text → Text Analysis → labels and context features → Parameters Generation (using the Models) → bands voicing, spectral envelope samples, and f0; Noise + STFT filter → unvoiced speech; Harmonics Synthesis → voiced speech; Bands Mixing → Speech.
Result
• MOS scores
[Figure: bar chart of MOS scores (axis "Score", 1 to 5) for the Baseline System, Baseline + MBE, and the Proposed System.]
Other Challenges
• Speech is overly smoothed.
– Use global variance.
• Modeling accuracy: the system uses the same modeling as recognition.
– Hidden semi-Markov models (duration).
– Trajectory HMMs.
– Minimum generation error training.
– More state clusters and use of acoustic context (under research).
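To make the over-smoothing point concrete: the global variance of a trajectory is its variance over the whole utterance, and over-smoothed output has a smaller value than natural speech. The sketch below is a plain variance-scaling post-process, which is a simpler stand-in and not the global-variance-based parameter generation itself (that method adds a variance likelihood term to the generation objective). The function names and the target value are hypothetical.

```python
import math


def global_variance(trajectory):
    """Variance of one parameter trajectory over the whole utterance."""
    mean = sum(trajectory) / len(trajectory)
    return sum((x - mean) ** 2 for x in trajectory) / len(trajectory)


def scale_to_variance(trajectory, target_variance):
    """Stretch the trajectory around its mean so that its global variance
    equals target_variance (for example, one measured on natural speech)."""
    mean = sum(trajectory) / len(trajectory)
    current = global_variance(trajectory)
    if current == 0.0:
        return list(trajectory)  # a constant trajectory cannot be rescaled
    gain = math.sqrt(target_variance / current)
    return [mean + gain * (x - mean) for x in trajectory]
```

The scaling keeps the utterance mean unchanged and only widens the excursions around it.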
More State Clusters
• Instead of computing one Gaussian per state, we store all occurrences and record the context of each occurrence (previous, current, and next).
• At synthesis, we get the best sequence using dynamic programming.
…
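The dynamic-programming search can be sketched as a Viterbi-style selection of one stored occurrence per state. The slides do not specify the cost functions, so `target_cost` (how well an occurrence's recorded context matches the wanted one) and `join_cost` (how well two consecutive occurrences fit together) are hypothetical placeholders supplied by the caller.

```python
def best_sequence(candidates, target_cost, join_cost):
    """Pick one occurrence per state so that the summed cost is minimal.

    candidates[i] is the list of stored occurrences for state i.
    target_cost(i, occurrence) and join_cost(previous, current) return
    non-negative numbers.
    """
    # best[i][j] = (cost of the cheapest path ending in candidates[i][j],
    #               index of the predecessor on that path)
    best = [[(target_cost(0, c), None) for c in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for c in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + join_cost(p, c), k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(i, c), back))
        best.append(row)

    # Trace back from the cheapest final occurrence.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return path[::-1]
```

As a usage example, with scalar occurrences, a zero target cost, and the absolute difference as join cost, `best_sequence([[1.0, 5.0], [1.2, 5.1], [0.9, 9.0]], lambda i, c: 0.0, lambda a, b: abs(a - b))` selects the sequence with the smallest jumps between neighbours.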
Thank You