@ Cambridge University, UK
Dialectal Chinese Speech Recognition
Thomas Fang Zheng
Aug. 24, 2007
Center for Speech and Language Technologies, Tsinghua University
Outline
• Motivation
• Dialectal Chinese database collection
  – Wu
  – Min
  – Chuan
• Approaches
  – Chinese syllable mapping
  – Lexicon adaptation
  – State-dependent phoneme-based model merging (SDPBMM)
  – Integration of SDPBMM with adaptation
• Remarks
Motivation
Goal
Knowledge
Data Collection
Workshop
Conclusion I
SDPBMM
Conclusion II
Motivation
• Chinese ASR faces a dialect problem larger than that of any other language.
• There are 8 major dialectal regions in addition to Mandarin (Northern China):
  – Wu (Southern Jiangsu, Zhejiang, and Shanghai);
  – Yue (Guangdong, Hong Kong, Nanning Guangxi);
  – Min (Fujian, Shantou Guangdong, Haikou Hainan, Taipei Taiwan);
  – Hakka (Meixian Guangdong, Hsin-chu Taiwan);
  – Gan (Jiangxi);
  – Xiang (Hunan);
  – Hui (Anhui);
  – Jin (Shanxi, Hohhot Inner Mongolia).
• These can be further divided into over 40 sub-categories.
• Chinese dialects share the same written language:
  – The same Chinese pinyin set (canonically),
  – The same Chinese character set (canonically), and
  – The same vocabulary (canonically).
• Standard Chinese (known as Putonghua, or PTH) is widely spoken in most regions of China.
• However, speech is strongly influenced by native dialects: most Chinese people speak both standard Chinese and their own dialect, resulting in dialectal Chinese, that is, Putonghua influenced by the native dialect.
• In dialectal Chinese, word usage, pronunciation, syntax, and grammar vary depending on the speaker's dialect.
• ASR relies to a great extent on consistent pronunciation and usage of words within a language.
• ASR systems built to process PTH therefore perform poorly for the great majority of the population.
Research Goal
• To develop a general framework for modeling, in dialectal Chinese ASR tasks:
  – Phonetic variability,
  – Lexical variability, and
  – Pronunciation variability
• To find suitable methods to modify the baseline PTH recognizer into a dialectal Chinese recognizer for the specific dialect of interest, employing:
  – dialect-related knowledge (syllable mapping, cross-dialect synonyms, …), and
  – training/adaptation data (in relatively small quantities)
• Expectation: the resulting recognizer should also work for PTH; in other words, it should be good for a mixture of PTH and dialectal Chinese.
• This proposal was selected as one of three projects for the 2003 Johns Hopkins University Summer Workshop from tens of proposals collected from universities and companies around the world; the workshop was postponed to 2004 due to SARS.
[Figure: a Standard Chinese speech recognizer plus dialectal-Chinese-related knowledge and resources feed the dialectal Chinese speech recognition framework, which yields a dialectal Chinese speech recognizer.]
• For practical reasons, during the summer we focused on only one specific dialect, the Wu dialect (Shanghai area); the target language was Wu dialectal Chinese (WDC for short).
• Why the Wu dialect?
  – Population: more than 70 million people use the Wu dialect, the 2nd most widely spoken dialect in China;
  – Economy: Shanghai is one of the most advanced cities in China;
  – The Wu dialect is a fully developed language:
    – Its syntax is very complex;
    – Its vocabulary is even larger than that of Mandarin;
    – Many literary masterpieces were historically influenced by the Wu dialect.
Phoneme count by language: Wu 50; Mandarin 37; Cantonese <33.
Useful Dialect-Related Knowledge
• Chinese Syllable Mapping (CSM)
  – The CSM is dialect-related.
  – Two types:
    – Word-independent CSM: e.g. in Southern Chinese, Initial mappings include zh→z, ch→c, sh→s, n→l, and so on, and Final mappings include eng→en, ing→in, and so on;
    – Word-dependent CSM: e.g. in dialectal Chuan Chinese, the pinyin 'guo2' changes to 'gui0' in the word '中国 (China)', but only the tone changes in the word '过去 (past)'.
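The two CSM types can be sketched as a lookup that checks word-dependent rules first and falls back to word-independent Initial/Final rules. This is only an illustration: the rule tables hold just the examples quoted on the slide, and the names `INITIAL_MAP`, `FINAL_MAP`, `WORD_DEPENDENT`, `split_syllable`, and `map_syllable` are hypothetical, not part of the original system. Treating y and w as Initials is a simplification of this sketch.

```python
# Hypothetical rule tables; entries are the examples quoted on the slide.
INITIAL_MAP = {"zh": "z", "ch": "c", "sh": "s", "n": "l"}
FINAL_MAP = {"eng": "en", "ing": "in"}
# Word-dependent rules: (word, canonical syllable) -> dialectal syllable.
WORD_DEPENDENT = {("中国", "guo2"): "gui0"}

# Pinyin Initials, longest first so that "zh" is tried before "z".
INITIALS = sorted(
    ["b", "p", "m", "f", "d", "t", "n", "l", "g", "k", "h", "j", "q", "x",
     "zh", "ch", "sh", "r", "z", "c", "s", "y", "w"],
    key=len, reverse=True)


def split_syllable(syl):
    """Split a toned pinyin syllable into (initial, final, tone)."""
    tone = syl[-1] if syl[-1].isdigit() else ""
    body = syl[:-1] if tone else syl
    for ini in INITIALS:
        if body.startswith(ini):
            return ini, body[len(ini):], tone
    return "", body, tone  # zero-Initial syllable such as "an1"


def map_syllable(syl, word=None):
    """Apply a word-dependent rule if one exists, else word-independent rules."""
    if word is not None and (word, syl) in WORD_DEPENDENT:
        return WORD_DEPENDENT[(word, syl)]
    ini, fin, tone = split_syllable(syl)
    return INITIAL_MAP.get(ini, ini) + FINAL_MAP.get(fin, fin) + tone
```

For example, `map_syllable("sheng1")` applies both sh→s and eng→en, while `map_syllable("guo2", word="中国")` takes the word-dependent path. Tone changes such as the one mentioned for '过去' are not modeled here.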
• The CSM could be N→1, 1→N, or crossed.
[Figure: mapping between the Chuan dialect and the Standard Chinese syllable set, illustrated with the syllables ke, kei, kuo, and kui and the example words [克]服, 上[课], [扩]大, and [魁]梧.]
• The CSM is not exact. For any mapping A→B, the resulting pronunciation is usually not exactly B, but something quite similar to B, more similar to B than to any other syllable.
[Figure: A maps to a cluster of variants B1, B2, B3, B4 around B.]
• Bi is a variation of B, such as: nasalization, centralization, voiced, voiceless, rounding, syllabic, pharyngealization, aspiration.
• Lexicon
  – Linguists say the vocabulary similarity rate between PTH and the Wu dialect is about 60~70%.
  – A dialect-related lexicon contains two parts:
    – a common part shared by standard Chinese and most dialectal Chinese languages (over 50k words), and
    – a dialect-related part (several hundred words).
  – In this lexicon:
    – each word has one pinyin string for its standard Chinese pronunciation and a representation of its dialectal Chinese pronunciation, and
    – each dialect-related word corresponds to a word in the common part with the same meaning.
Language
• Though it is difficult to collect dialect texts, dialect-related lexical entry replacement rules can be learned in advance, and therefore
• language post-processing or language model adaptation techniques can be adopted.
1. Dialectal words substitute for some words:
   我 做饭 给 你 吃 (PTH)
   我 烧饭 给 你 吃 (Wu)
2. Word-order changes:
   你 先 走 (PTH)
   你 走 先 (Wu)
[Figure: word-graph diagrams over the nodes w1, w2, w3 accompany both cases.]
[Figure: recognizer configurations arranged by acoustic model (AM0, AM1, AM2) and language model (LM0, LM1, LM2), ranging from Standard Chinese (AM0, LM0) to Dialect (AM2). One region is labeled "Our focus" and another "Seldom-seen in dialectal Chinese".]
AM0 = AM for standard Chinese
AM1 = AM with accent
AM2 = AM with dialect
LM0 = LM for standard Chinese
LM1 = LM with dialectal lexicon
LM2 = LM with dialectal lexicon/syntax
Database Collection
[Figure: Data creation for WDC. An e-dictionary database (PTH words and Chinese characters, Wu dialect words, PTH and Wu dialect pronunciations, PTH synonyms, IF and syllable set definition, miscellaneous information) and a speech database collection (read speech with PTH-only and PTH + Wu word prompts; spontaneous speech on topics), with transcription into words, syllables, and IFs/GIFs.]
IF: a Chinese Initial or Final; GIF: generalized IF; PTH: Putonghua (standard Chinese); WDC: Wu Dialectal Chinese
• Wu Dialectal Chinese (WDC) Database Collection (1)
  – Collection:
    – 11 hours in total, half read (R) and half spontaneous (S):
      – 100 Shanghai speakers × (3 R + 3 S) minutes per speaker
      – 10 Beijing speakers × 6 S minutes per speaker
    – Read speech with well-balanced prompting sentences:
      – Type I: each sentence contains PTH words only (5-6k)
      – Type II: each sentence contains one or two of the most commonly used Wu dialectal words, while the others are PTH words
    – Spontaneous speech with pre-defined talking topics:
      – Conversations with a PTH speaker on a self-selected topic from: sports, policy/economy, entertainment, lifestyles, technology
    – Balanced speakers (gender, age, education, PTH level, …)
Actual WDC Data Diversity
Goal:
  Gender: male 50%, female 50%
  Age: 26-40 50%, 41-50 50%
  Education: ordinary 20%, well 80%
Actual number of speakers:
  Group | Male | Female | Total
  Age 26-40 | 27 | 25 | 52
  Age 41-50 | 23 | 25 | 48
  Education: well | 41 | 41 | 82
  Education: ordinary | 9 | 9 | 18
[Figure: Accent assessment by experts, a chart over the accent levels 1A, 1B, 2A, 2B, 3A, 3B.]
Accent levels: 1A. CCTV-level radio broadcaster; 1B. Province-level radio broadcaster; 2A. Quite good; 2B. Less accented; 3A. More accented; 3B. Hard to understand, but recognizably PTH.
[Figure: Accent assessment according to age, a chart over the accent levels 1A to 3B for the age groups 26-40 and 41-50.]
[Figure: Accent assessment according to education level, a chart over the accent levels 1A to 3B for the education groups Ordinary and Well.]
[Figure: Accent assessment according to gender, a chart over the accent levels 1A to 3B for male and female speakers.]
• Wu Dialectal Chinese (WDC) Database Collection (2)
  – Transcriptions include:
    – For the 100 Wu dialectal Chinese speakers:
      – Canonical Chinese Initial/Final labels, and
      – Generalized IF (GIF) labels.
    – For the 10 Beijing speakers:
      – Chinese character and pinyin transcriptions only
• Dialectal Lexicon Construction
  – Establish a 50k-word electronic dialect dictionary, with each word having:
    – its PTH pronunciation as a PTH IF string
    – its Wu dialect pronunciation as a Wu IF string
• Purpose: summarizing dialect-related knowledge
  – Figure out Chinese syllable mappings:
    – Same written form (character), different pronunciations;
    – Both word-independent and word-dependent;
  – Find dialect-related word variations:
    – Same meanings in the Chinese language;
    – Different written forms (character);
    – Uttered in the standard Chinese manner;
    – For LM adaptation/modification
e-Dictionary Word Examples
Word No. | Word | Pronunciation in PTH | Pronunciation in Wu Dialect | Note
1644 | 本金 | ben3 jin1 | b en3 j in1 | unchanged
1646 | 本科 | ben3 ke1 | b en3 k u1 | Final changed only
1652 | 本领 | ben3 ling3 | b en3 l in2 | Final & tone
1656 | 本末倒置 | ben3 mo4 dao4 zhi4 | b en3 m ek5 d o^3 z ii3 | Entering sound, Final change, CI Initial change, CD Final change
1659 | 本票 | ben3 piao4 | b en3 p voe3 | Final & tone changes
1660 | 本期 | ben3 qi1 | b en3 jj i2 | Voiced Initial, tone change
1661 | 本钱 | ben3 qian2 | b en3 jj i2 | 1660 and 1661: different in PTH, same in Wu
1662 | 本人 | ben3 ren2 | b en3 n in2 | CD Initial & Final change
Post-workshop Database Collection -- Min and Chuan
[Figure: recording setup. In the studio, behind a screen, the speaker is recorded with a Logitech USB headset (LPAC-50000), a Sennheiser e835s (left channel), and a Sony C-38B condenser microphone (right channel). In the operating room, beyond the door, the operator uses a mixer (Spirit 4, with phantom power for the Sony C-38B), an 8-port sound card (Wamirack192X), and the recording tool.]
* With aid of Chinese Academy of Social Sciences (CASS)
Name: Min-dialectal Chinese database
Dialectal accent: Xiamen city, Fujian province
Sampling rate: 22,050 Hz
Channels: 3 (two conventional microphones, one USB microphone)
Speakers: 36
Age: 18~30
Gender: 18 females, 18 males
Constituent: 200 long sentences, 10 digits, 26 English letters per speaker
Transcription: Chinese character/syllable/Initial-Final
Name: Chuan-dialectal Chinese database
Dialectal accent: Chengdu city, Sichuan province
Sampling rate: 22,050 Hz
Channels: 3 (two conventional microphones, one USB microphone)
Speakers: 36
Age: 20~30
Gender: 18 females, 18 males
Constituent: 200 long sentences, 10 digits, 26 English letters per speaker
Transcription: Chinese character/syllable/Initial-Final
[Figure: Accent distribution for the Min- and Chuan-dialectal Chinese corpora, a chart over the accent levels light, medium, and heavy.]
Workshop Experiments
• Experiment conditions:
  – Using HTK 3.2.1;
• Data set division:
  – Using spontaneous speech data only
  – Data were split according to age (younger, older), education (higher, lower), and PTH level into:
    – Training set: 80 speakers
    – devTest set: 20 speakers (a part of devTrain)
    – Test set: 20 speakers
• Acoustic model:
  – Trained from Mandarin Broadcast News (MBN);
  – 39-dimensional MFCC_E_D_A_Z;
  – Diagonal covariance matrix;
  – 4 states per unit;
  – 103,041 units (triIF), 10,641 real units (triIF);
  – 3,063 different states (after state tying);
  – 16 mixtures per state, 28 mixtures per state for the silence unit;
• Language model:
  – Built on HKUST 100-hour CTS data, plus Hub5, plus Wu-dialectal training data transcriptions
Observation on WDC Data
• IF-mapping / syllable-mapping:
  – Influenced by the Wu dialect, a Wu dialectal Chinese (WDC) speaker often pronounces any of a certain set of IFs as another IF, and there are rules to follow, such as zh -> z, ch -> c, sh -> s, and so on.
• Observations on three sets, Train (80 speakers), devTest (20), and Test (20):
  – Mapping pairs are almost the same among all three sets;
  – Mapping pairs are almost identical to experts' knowledge;
  – Mapping probabilities are also almost equal.
• Remarks:
  – Experts' knowledge can be useful;
  – Mapping rules can be learned from less data.
• Using only the devTest set + dialect-based knowledge
  – Step 1: Apply PTH-IF mapping rules;
  – Step 2: Apply WDC-IF mapping rules;
  – Step 3: Apply syllable-dependent mapping rules;
  – Step 4: Perform multi-pronunciation expansion (MPE) based on unigram probability.
• Why try this method?
  – "IF-mapping" in dialectal Chinese is a fact (humans use it);
  – "In-domain data training" will surely give a good result, but collecting data is a huge task, especially for 40 sub-dialects of Chinese;
  – "Mere adaptation" is easier and better, but may make it hard to distinguish the members of a mapping pair, since each pair tends to become a single IF;
  – This is not practical in applications where no further information about the speakers is available and a mixture of WDC and PTH is used, such as call centers;
  – A knowledge-based method is expected to give good overall performance for both WDC and PTH.
• Step 1: Applying PTH-IF mapping rules
  – Rules are based on experts' knowledge (with the AM unchanged):
    – (zh, z), (z, zh)
    – (ch, c), (c, ch)
    – (sh, s), (s, sh)
    – (eng, en), (en, eng)
    – (ing, in), (in, ing)
    – (r, l)
  – The gain is not very significant: a 0.5% reduction in Chinese Character Error Rate (CER)
  – Pronunciation entry probabilities do not help improve performance
• Step 2: Applying WDC-IF mapping rules
  – There are indeed some IFs specific to Wu dialectal Chinese, such as iao -> io^;
  – Rules are learned from devTest
  – Newly introduced WDC-specific IFs are trained from devTest using an adaptation method
  – 8.66% absolute CER reduction
  – MLLR adaptation outperforms MLLR+MAP
    – About 10% difference
    – Possibly due to the small amount of data
  – We refer to this as surface-form (WDC) MLLR adaptation; for comparison, base-form (PTH) MLLR adaptation, where only canonical IFs are used, is also evaluated.
• Step 3: Apply syllable-dependent mapping rules
  – Assumption: most IF-mappings are context-independent, but some are syllable-dependent (such as iii|(sh iii) -> ii|(s ii)), and we believe there are others
  – Rules are learned from devTest
  – We did not succeed in improving accuracy; on the contrary, character accuracy dropped by about 6%
  – We do not have a clear explanation yet
  – So we keep using context-free mapping rules
• Step 4: Multi-pronunciation expansion (MPE) based on unigram probability
  – Motivation: more pronunciations help model pronunciation variation but lead to more confusion, so there should be a tradeoff;
  – Accumulated unigram probability (AccProb) is used as the criterion:
    – Only words with higher unigram probabilities have multiple pronunciations each;
    – Words with lower unigram probabilities have a single standard pronunciation each.
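A minimal sketch of the AccProb criterion, assuming the lexicon maps each word to a canonical IF sequence and that variants come from the bidirectional PTH-IF pairs of Step 1. The function names, the `IF_RULES` table layout, and the toy IF strings are illustrative assumptions, not the workshop implementation. Words are ranked by unigram probability, and expansion stops once the accumulated probability reaches the chosen AccProb.

```python
from itertools import product

# Bidirectional IF mapping pairs, copied from the Step 1 rule list.
IF_RULES = {"zh": ["z"], "z": ["zh"], "ch": ["c"], "c": ["ch"],
            "sh": ["s"], "s": ["sh"], "eng": ["en"], "en": ["eng"],
            "ing": ["in"], "in": ["ing"], "r": ["l"]}


def select_multi_pron_words(unigram, acc_prob):
    """Return the most frequent words whose accumulated unigram
    probability is needed to reach acc_prob (0.0 selects nothing)."""
    ranked = sorted(unigram.items(), key=lambda kv: kv[1], reverse=True)
    selected, acc = [], 0.0
    for word, p in ranked:
        if acc >= acc_prob:
            break
        selected.append(word)
        acc += p
    return selected


def pronunciation_variants(ifs):
    """All IF sequences obtained by optionally applying a rule to each IF;
    the canonical sequence always comes first."""
    options = [[unit] + IF_RULES.get(unit, []) for unit in ifs]
    return [list(v) for v in product(*options)]


def build_lexicon(canonical, unigram, acc_prob):
    """Expand only the selected words; all others keep one pronunciation."""
    chosen = set(select_multi_pron_words(unigram, acc_prob))
    return {word: pronunciation_variants(pron) if word in chosen else [pron]
            for word, pron in canonical.items()}
```

With `acc_prob=0.0` no word is expanded, and with `acc_prob=1.0` every word is, which matches the 0% and 100% ends of the AccProb range. When many words share the same unigram probability near the cut, this sketch simply cuts at the first word that crosses the threshold.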
The Multi-Pronunciation Expansion Criterion
Word | Prob. (descending) | Acc. Prob. | Note
(start) | | 0.000 |
</s> | 0.10782136 | 0.108 |
的 | 0.03608752 | 0.144 |
你 | 0.02161165 | 0.194 |
是 | 0.01907339 | 0.213 |
… | … | … |
标准 | 0.00005742 | 0.899 | actual minimum
… | 0.00005742 | |
团 | 0.00005742 | 0.900 | desired point
… | 0.00005742 | |
最多 | 0.00005742 | 0.901 | actual maximum
… | … | … |
鲫鱼 | 0.00000124 | 1.000- |
黛 | 0.00000124 | 1.000- |
Words above the chosen point receive multi-pronunciation expansion; words below it keep a single (standard) pronunciation.
Base-form MLLR + PTH-IF mapping + MPE (CER)
AccProb | 0% | 80% | 90% | 92% | 94% | 96% | 100%
VocSizeRatio | 1.00 | 1.01 | 1.05 | 1.07 | 1.10 | 1.15 | 1.87
CER-B (%) | 63.9 | 62.98 | 62.95 | 62.97 | 63.07 | 63.15 | 63.55
• Best result achieved at a suitable AccProb value, say 94%, with VocSizeRatio = 1.10.
• AccProb: 0% means no multiple-pronunciation expansion, while 100% means full expansion.
Surface-form MLLR + WDC-IF mapping + MPE (CER)
AccProb | 0% | 80% | 90% | 92% | 94% | 96% | 100%
VocSizeRatio | 1.00 | 1.04 | 1.12 | 1.17 | 1.24 | 1.35 | 3.03
CER-S (%) | 65.47 | 62.32 | 62.23 | 62.29 | 62.15 | 62.38 | 63.77
• Best result achieved at a suitable AccProb value, say 94%, with VocSizeRatio = 1.24.
• AccProb: 0% means no multiple-pronunciation expansion, while 100% means full expansion.
[Figure: CER (%) versus AccProb (0% to 100%), comparing base-form MLLR + PTH-IF mapping + MPE with surface-form MLLR + WDC-IF mapping + MPE.]
Performance improvement comparison: overall, and in terms of speaker clusters
[Figure: CER% for the methods Baseline, PTH-Mapping, WDC-Mapping, and MPE, plotted for the speaker clusters AO, AY, GM, GF, EL, EH, MA, MS, and Total.]
Q: How well does the resulting WDC recognizer recognize PTH?
• We obtained the WDC recognizer from the PTH recognizer;
• On average, we get a CER reduction of over 10% when recognizing WDC;
• How about using it to recognize PTH?
[Figure: Adaptation (conventional method) versus MPE + rule (our method), illustrated with the IF pair sh / s.]
• We can expect performance to degrade when the WDC recognizer is used to recognize PTH;
• but we would expect it not to degrade too much.
• Results: using the WDC recognizer, you get
  – over 10% CER reduction when recognizing WDC;
  – 0.62% CER increase when recognizing PTH.
• Conclusions:
  – The use of knowledge is useful and effective
  – In this project, there are several problems to solve: channel, speaking-style, dialect background, and domain problems.
    – It is easier to solve all these problems by simply using the adaptation method;
    – Our method focuses only on the dialect problem;
    – The results of our method could be better if we integrated methods that address channel and speaking style.
State-Dependent Phoneme-Based Model Merging (SDPBMM)
• At the acoustic level, approaches include:
  – Retraining the AM on standard speech plus a certain amount of dialectal speech
  – Interpolation between standard speech-based HMMs and their corresponding dialectal speech-based HMMs
  – Combination of the AM with state-level pronunciation modeling
  – Adaptation of the standard speech-based AM with a certain amount of dialectal speech
• Existing problems:
  – A large amount of dialectal speech is needed to build dialect-specific acoustic models
  – The acoustic model cannot perform well in both standard and dialectal speech recognition
  – Some acoustic modeling methods are too complicated to be deployed readily
What we proposed:
• Taking a precise context-dependent HMM from the standard speech and its corresponding less precise context-independent HMM from dialectal speech into consideration simultaneously
• Merging HMMs on a state-level basis according to certain criteria
[Figure: Illustration for SDPBMM. State 2 of the dialectal Chinese mono-XIFs an[2] / ang[2] is shown alongside the Standard Chinese tri-XIF decision tree for *-an+*[2], whose questions (R_Nasal?, L_Labial?, L_Stop?, L_Bilabial?) lead to leaves such as b-an+d[2], l-an+d[2], l-an+m[2], f-an+m[2], and b-an+m[2].]
pdf for merged state:
\[ p(x \mid s_i) = \sum_{k=1}^{K} w_{ik}\, N(x; \mu_{ik}, \Sigma_{ik}) \]
\[ p'(x \mid s_i^{(sc)}) = \lambda\, p(x \mid s_i^{(sc)}) + (1-\lambda) \sum_{m=1}^{M} p(x \mid s_i^{(sc)}, s_{im}^{(dc)})\, P(s_{im}^{(dc)} \mid s_i^{(sc)}) \]
\[ = \lambda \sum_{k=1}^{K} w_{ik}^{(sc)} N_{ik}^{(sc)}(\cdot) + (1-\lambda) \sum_{m=1}^{M} \sum_{n=1}^{N} P(s_{im}^{(dc)} \mid s_i^{(sc)})\, w_{imn}^{(dc)} N_{imn}^{(dc)}(\cdot) \]
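The merged-state pdf is again a Gaussian mixture whose weights are rescaled copies of the original weights. The sketch below shows only that bookkeeping, assuming each mixture is a list of (weight, mean, variance) tuples, an interpolation weight `lam`, and a per-state probability P(dialectal state | standard state). The function name and data layout are illustrative assumptions, not the authors' implementation.

```python
def merge_state(sc_mix, dc_states, lam):
    """Merge a standard-Chinese state with its dialectal counterparts.

    sc_mix:    list of (weight, mean, var) for the standard-Chinese state.
    dc_states: list of (prior, mixture) pairs, where prior is
               P(dialectal state | standard state) and mixture is a list of
               (weight, mean, var) for that dialectal state.
    lam:       interpolation weight given to the standard-Chinese mixture.

    The Gaussians themselves are reused unchanged; only weights are rescaled.
    """
    merged = [(lam * w, mu, var) for w, mu, var in sc_mix]
    for prior, mix in dc_states:
        merged += [((1.0 - lam) * prior * w, mu, var) for w, mu, var in mix]
    return merged
```

If the input mixture weights and the priors each sum to one, the merged weights also sum to one. The merged state holds every Gaussian from both sources, so its mixture count grows with each dialectal state that is merged in.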
The disadvantage seen so far
• The number of Gaussian mixtures in the merged state is expanded
• Is it possible to downsize it?
  – A straightforward criterion is a distance measure
  – The larger the distance, the more acoustic coverage
    – merge if distance(d, s) ≥ threshold
    – do not merge if distance(d, s) < threshold
• The pseudo-divergence (PD) based distance measure between two states is defined as follows:
\[ \mathrm{distance}(\theta_A, \theta_B) = \frac{1}{2} \left[ PD(\theta_A, \theta_B) + PD(\theta_B, \theta_A) \right] \]
where
\[ PD(\theta_P, \theta_Q) = \frac{\mathrm{Dispersion}(P, Q)}{\mathrm{Dispersion}(P, P)}, \qquad \mathrm{Dispersion}(P, Q) = \sum_{i=1}^{M} \sum_{j=1}^{N} w_{Pi}\, w_{Qj}\, d_{P,Q}(i, j) \]
and
\[ d(P, Q) = \frac{1}{8} (\mu_P - \mu_Q)^T \left[ \frac{\Sigma_P + \Sigma_Q}{2} \right]^{-1} (\mu_P - \mu_Q) + \frac{1}{2} \ln \frac{\left| (\Sigma_P + \Sigma_Q)/2 \right|}{|\Sigma_P|^{1/2}\, |\Sigma_Q|^{1/2}} \]
is the Bhattacharyya distance measure.
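The distance can be sketched for diagonal-covariance mixtures, where the matrix inverse and determinants reduce to per-dimension sums. This is a hedged illustration: the (weight, mean, variance) tuple layout, the function names, and the handling of a zero self-dispersion are assumptions of the sketch, not taken from the slides.

```python
import math


def bhattacharyya(mu_p, var_p, mu_q, var_q):
    """Bhattacharyya distance between two diagonal-covariance Gaussians."""
    term1 = 0.0
    term2 = 0.0
    for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q):
        avg = 0.5 * (vp + vq)
        term1 += (mp - mq) ** 2 / avg
        term2 += math.log(avg / math.sqrt(vp * vq))
    return 0.125 * term1 + 0.5 * term2


def dispersion(mix_p, mix_q):
    """Weighted sum of pairwise Bhattacharyya distances between two mixtures,
    each given as a list of (weight, mean, variance) tuples."""
    return sum(wp * wq * bhattacharyya(mp, vp, mq, vq)
               for wp, mp, vp in mix_p
               for wq, mq, vq in mix_q)


def pseudo_divergence(mix_p, mix_q):
    denom = dispersion(mix_p, mix_p)
    if denom == 0.0:
        # Assumption of this sketch: a mixture with no internal spread
        # (e.g. a single Gaussian) is treated as infinitely far away.
        return math.inf
    return dispersion(mix_p, mix_q) / denom


def state_distance(mix_a, mix_b):
    """Symmetrised pseudo-divergence between two states."""
    return 0.5 * (pseudo_divergence(mix_a, mix_b)
                  + pseudo_divergence(mix_b, mix_a))


def should_merge(mix_d, mix_s, threshold):
    """Merge only when the dialectal state is far enough from the standard one."""
    return state_distance(mix_d, mix_s) >= threshold
```

By construction a state compared with itself has distance 1 whenever its self-dispersion is non-zero, so a useful threshold would be above 1. The threshold value itself is not given on the slides.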
[Figure: Distinguishable states. The same illustration of the dialectal Chinese mono-XIF state an[2] / ang[2] against the Standard Chinese tri-XIF decision tree for *-an+*[2] (questions R_Nasal?, L_Labial?, L_Stop?, L_Bilabial?; leaves such as b-an+d[2], l-an+d[2], l-an+m[2], f-an+m[2], b-an+m[2]).]
Data division
Data set | Database | Details | Usage
PTH_Train | Standard Chinese | 120 speakers, 20 hours, 24,000 long sentences | To build the Putonghua AM
PTH_Test | Standard Chinese | 12 speakers, 2.5 hours, 2,400 long sentences | Putonghua test set
Min_Dev | Min-dialectal Chinese | 20 speakers, 1.0 hour, 1,000 long sentences | Adaptation/SDPBMM/pronunciation modeling, etc.
Min_Test | Min-dialectal Chinese | 16 speakers, 50 minutes, 800 long sentences | Dialectal Chinese test set
Wu_Dev | Wu-dialectal Chinese | 10 speakers, 40 minutes, 510 long sentences | Adaptation/SDPBMM/pronunciation modeling, etc.
Wu_Test | Wu-dialectal Chinese | 20 speakers, 1.0 hour, 910 long sentences | Dialectal Chinese test set
Standard Chinese-based HMMs (Baseline)
Training set: approximately 30 hours from MBN (HUB-4); 34,493 utterances in total
Modeling method: HMM-based, decision-tree-based, state-clustered cross-word tri-XIF
Topology: 3 left-to-right states per tri-XIF, 14 mixtures per state
Number of tri-XIFs: 7,411
Number of states: 3,230
Number of mixtures: 45,220
Features: 39-dimensional MFCC-based features / CMN
Lexicon: 406 toneless Chinese syllables
Evaluations on Putonghua and Wu-dialectal Chinese
Acoustic model | Putonghua | SDPBMM+PDBDM
Gaussians | 45,220 | 58,786
SER on Wu_Test | 49.8% | 43.9% (-5.9%)
SER on PTH_Test | 30.5% | 31.1% (+0.6%)
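The evaluation figures are reported as absolute changes. A short calculation, using only the numbers in the evaluation table, puts them in relative terms; the helper name `relative_change` is illustrative.

```python
def relative_change(before, after):
    """Relative change from `before` to `after`, as a fraction of `before`."""
    return (after - before) / before


# Figures taken from the evaluation table.
wu_ser = relative_change(49.8, 43.9)        # SER on Wu_Test
pth_ser = relative_change(30.5, 31.1)       # SER on PTH_Test
gaussians = relative_change(45220, 58786)   # model size in Gaussians

print(f"Wu_Test SER:  {wu_ser:+.1%}")
print(f"PTH_Test SER: {pth_ser:+.1%}")
print(f"Gaussians:    {gaussians:+.1%}")
```

In relative terms, SER on Wu_Test falls by roughly 12%, SER on PTH_Test rises by roughly 2%, and the merged model has 30% more Gaussians than the Putonghua baseline.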
Integration of SDPBMM with adaptation
Conclusions:
• A simple but effective acoustic modeling approach that uses only a small amount of dialectal speech data
• Significantly effective for dialectal Chinese speech recognition
• Good performance for both standard and dialectal speech recognition
• Comparable to adaptation methods
• Additive and complementary to adaptation methods
Thanks !
http://cslt.riit.tsinghua.edu.cn/~fzheng
Center for Speech and Language Technologies, Tsinghua University