DYNAMIC ADAPTATION FOR LANGUAGE AND
DIALECT IN A SPEECH SYNTHESIS SYSTEM
Craig Olinsky
Media Lab Europe / University College Dublin
OVERVIEW: ADAPTATION IN SPEECH
SYNTHESIS


Many of the areas that could most benefit from community-focused IT
resource development have very high illiteracy rates among their
populace. For such users, speech-based systems provide the most
obvious and natural mechanism for interfacing with computers.
Without the widespread availability of high-quality speech databases,
computer-readable lexicons, and other pre-processed linguistic
information of the kind available for, for instance, the standard dialects
of French or German, building such systems is expensive and difficult.
(“learning from sample” case in other presentation)
OVERVIEW: ADAPTATION IN SPEECH
SYNTHESIS

Even within a particular language (including the major ones),
personalizing a speech synthesis system for a particular use,
market, and especially accent can bring considerable benefit to a deployed
system. Recent articles have suggested, in fact, that listeners connect
better with a speaker and voice that sound like them: they not
only find it easier to listen to and understand what is said, but also
find it more natural to assign emotional state and to judge such factors
as authority, honesty, and even intelligibility.
OVERVIEW: ADAPTATION IN SPEECH
SYNTHESIS
Perhaps the system can LISTEN to the user, and then
CHANGE ITS OUTPUT to sound more like what it
hears?
• Instead of creating a dedicated system for every purpose, set up a
number of “baseline” systems (along different languages, language
families, etc.) and set them learning.
• We benefit from the work put into developing the baseline system, while
requiring a (minimum?) of additional focused training data.
• Assumption:
Learning “Accent”, “Dialect”, “Language” – not a
distinct process, but all a matter of degree?
OVERVIEW: ADAPTATION IN SPEECH
SYNTHESIS


HUMAN ANALOGUE: People who live for a period of time in an area
where a different accent or dialect of their language is spoken often
(involuntarily) start to pick up the local manners of speech.
SPEECH RECOGNITION ANALOGUE: “Speaker Adaptation” -- a
procedure in which the acoustic model of the recognition system (or, in
limited cases, the language model as well), after being fully trained, is
provided with additional speech data. Based upon this data, the values,
parameters, nodes, weights, or other coefficients representing the
acoustic model are shifted “towards” the new information, such that the
system should exhibit improved performance on data resembling the
new training data, even though such data may not have been included
in its initial training procedure.
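The parameter-shifting idea behind speaker adaptation can be illustrated with a minimal MAP-style mean update, a standard technique in recognition systems. This is an illustrative sketch, not the procedure used in any specific system discussed here; the function name and the relevance factor `tau` are assumptions:

```python
import numpy as np

def map_adapt_means(prior_means, frames, assignments, tau=16.0):
    """Shift Gaussian mean vectors toward new speaker data (MAP-style).

    prior_means : (n_states, dim) means from the fully trained model
    frames      : (n_frames, dim) acoustic feature vectors from the new speaker
    assignments : (n_frames,) index of the state each frame aligns to
    tau         : relevance factor; larger values trust the prior more
    """
    adapted = prior_means.copy()
    for s in range(prior_means.shape[0]):
        data = frames[assignments == s]
        n = len(data)
        if n == 0:
            continue  # no new evidence for this state; keep the prior mean
        # Interpolate between the prior mean and the new-data mean:
        # with little data the prior dominates; with much data, the data does.
        adapted[s] = (tau * prior_means[s] + data.sum(axis=0)) / (tau + n)
    return adapted
```

States never observed in the adaptation data keep their original parameters, which is exactly why performance improves on speech resembling the new data without discarding the original training.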
BACKGROUND: SPEAKER ADAPTATION FOR
SPEECH RECOGNITION SYSTEMS
QUICK PROCEDURE OVERVIEW:
Given a set of recorded target utterances and associated transcripts:
1. Generate a synthesized utterance from the transcript using the current
synthesizer (letter-to-sound rules, phones, speech database, etc.).
2. Compare the target recording to the generated source form to determine
how the two pronunciations differ.
3. Re-organize the phone units and the speech unit selection process to
incorporate the differences and information from the target recording's units.
4. Modify the lexical entries and letter-to-sound rules of the existing
synthesizer to produce output that more closely resembles the target
utterance.
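The comparison step in the procedure above (determining how the target and synthesized pronunciations differ) can be sketched as a standard edit-distance alignment over phone strings. The function name and phone labels are illustrative assumptions:

```python
def pronunciation_diffs(source_phones, target_phones):
    """Align two phone sequences with edit distance and report differences.

    Returns a list of (op, source_phone, target_phone) tuples for every
    insertion, deletion, or substitution, e.g. ('sub', 'ah', 'oh').
    """
    n, m = len(source_phones), len(target_phones)
    # Standard dynamic-programming edit-distance table.
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        cost[i][0] = i
    for j in range(m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i-1][j-1] + (source_phones[i-1] != target_phones[j-1])
            cost[i][j] = min(sub, cost[i-1][j] + 1, cost[i][j-1] + 1)
    # Trace back through the table to recover the differing alignments.
    diffs, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                cost[i][j] == cost[i-1][j-1]
                + (source_phones[i-1] != target_phones[j-1])):
            if source_phones[i-1] != target_phones[j-1]:
                diffs.append(('sub', source_phones[i-1], target_phones[j-1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i-1][j] + 1:
            diffs.append(('del', source_phones[i-1], None))
            i -= 1
        else:
            diffs.append(('ins', None, target_phones[j-1]))
            j -= 1
    return list(reversed(diffs))
```

The resulting substitution pairs are exactly the evidence the later steps consume when re-organizing phone units and amending lexical entries.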
VARIATION AND ADAPTATION
Ignoring for a moment issues such as vocabulary choice and other
semantic issues of usage, it is possible to consider variation
across accents, “dialects”, and even languages as a
difference in degree of variation in a few key areas:
• the phonetic inventory which comprises the basic building blocks from
which things are pronounced;
• a set of pronunciation rules or examples which dictate how the
phonetic units are put together to assign a pronunciation to an
orthographic form, and subsequently speak the desired text; and
• a collection of conventionalized stress and intonational patterns
which help provide structure and syntactic/semantic context to the
overall produced utterances.
VARIATION AND ADAPTATION

• Cross-Speaker Adaptation. In this mode, a generalized speech
synthesizer is adapted towards the voice of a single user of the
system. Assuming that the original “voice” of the synthesizer is that
of a professional speaker, this can be done in one of two ways:
qualities of the user's voice can be applied to the default voice,
while the database of sound samples from the original speaker is
retained as the concatenative synthetic voice; or, conversely, the
database can be expanded (or replaced) with samples of the user's
voice, while some abstract “quality” of the original professional voice
is nonetheless retained, ideally preserving some measure of the
clarity and understandability for which the original speaker was
initially retained. The ability to create natural-sounding speech by
concatenating samples drawn from a speech database comprising
recordings from multiple users, and/or of varying quality, would
also help encourage an open-source “bazaar” of decentralized users
attempting to amass the large number of recorded forms necessary
for a multi-purpose unit-selection synthesizer.
VARIATION AND ADAPTATION


• Cross-Dialect Adaptation. This is almost exactly the case
expressed above, except that the “default” voice and the
specific user's voice differ in dialect, or to some greater degree
than the average set of native speakers from a given area. That is,
we would expect not only variation in voice quality, but also limited
differences in vocabulary, phonetic inventory, distinguishable minimal
pairs, accent, and the like. The result is that not only the unit-selection
database, but also those components which assign
phonetic realizations to the given text (the letter-to-sound rules and
the pronunciation dictionary or lexicon) may need alteration.
• Cross-Language Adaptation. In this case, we retain some degree
of phonetic inventory similarity between the source and destination
language, but our letter-to-sound rules and lexicon need gross
modification, or may even be unusable (even language pairs
that are very similar in pronunciation, such as Japanese and
Korean, may nonetheless use unrelated orthographic forms, or vice
versa).
VARIATION AND ADAPTATION


• Cross-Language Adaptation, Single Speaker Variant. In this case,
we have recordings from a single speaker (i.e., the user), whom we
want to be able to speak naturally in languages in which the user is
not a native speaker. We thus want to use information about these
other languages to adapt the synthesizer of the user's voice to speak
multilingually. (This is especially significant in our global community,
where many proper nouns, such as personal names and locations, cannot
be properly pronounced simply by following the phonological rules of
a single language.)
• Language “Acquisition”. In the extreme case, we wish to
bootstrap an “empty” synthesizer (with no lexicon or knowledge of
pronunciation rules whatsoever) to speak like us simply by speaking
to it, without hard-coding direct linguistic or phonetic knowledge.
This is a task that a non-technical, non-expert native-speaker user
should be able to perform.
OVERVIEW: ADAPTATION IN SPEECH
SYNTHESIS

Synthesis adds a problem beyond recognition adaptation: the
database of recorded segments is itself used for
concatenation. This means that we cannot just merge the entire set of
recorded data together – there would be noticeable discrepancies
between concatenative units taken from each individual speaker. On
the other hand, if we just use the new set of segments, we aren't
adapting; we're just building a new synthesizer. For this study, we take
the new target data to be a small data set, not enough to serve as a good
set of units for synthesis on its own.
OVERVIEW: ADAPTATION IN SPEECH
SYNTHESIS

We are thus required to use existing (source) units for synthesis.
However, these source recordings and their associated existing
synthetic voice have a specific accent/dialect, with a pre-defined phone
set. Even with a proper dictionary and proper letter-to-sound rules
providing us with a “proper” pronunciation that takes into account
pronunciation variation for our target accent, stringing the “best match”
units together likely won't sound like a native speaker of that accent.
The vowel quality might be vastly different, or phones might be missing
in the source language (e.g., a French /r/). We want to adapt for this.
Overall, we want to sound native in the target accent/dialect/language,
using units recorded from a speaker of a different one.
PHONE UNIT ADAPTATION





• If the variation between source and target speech is large enough, it
is likely that we will describe the target speech with a different phone set
than that of the source speech.
• We may still find that the pronunciation of a particular phone in the
target corresponds more closely with that of a different phone than our
source pronunciation lexicon would suggest (for instance, schwa
reduction).
• Or we might have an existing target pronunciation lexicon or
pronunciation rules with a predefined phone set we wish to use.
• To utilize data from our source synthesizer in such cases, we need
to assign appropriate mappings between source and target phones.
This can be seen as a matter of degree: how much effort or
knowledge is incorporated into creating the mapping, how closely
such a mapping corresponds to the observed data, and thus our
(assumed) rating of the quality of such a mapping.
PHONE UNIT ADAPTATION
Figure 1: Degrees of Phoneme Mapping (from the source phoneme set to the
target phoneme set), ordered from (alleged) WORST to (alleged) BEST:
Naïve Mapping → Linguistically-Motivated Phoneme Mapping → Data-Driven Mapping
PHONE UNIT ADAPTATION



• na(t)ive approach: this approach follows the principle a non-native
speaker would follow when speaking a second language: he basically has the
phonetic inventory of the first language and partially uses that
inventory when speaking the second language….
• phonetic approach: this strategy follows principles in the
production of sounds in the human vocal tract … the sound that
agrees in the most phonetic features with the untrained one is taken
instead of the unknown one of the goal language….
• data-driven approach: this approach determines the similarity
among phones with the data given by the trained recognizer…
according to a distance measure, the most similar units may be joined.
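A minimal sketch of the data-driven approach, assuming each phone is summarized by a mean acoustic feature vector (e.g., an MFCC centroid estimated by the trained recognizer). The function name, data layout, and the choice of Euclidean distance are illustrative assumptions, not the measure used in any cited work:

```python
import numpy as np

def data_driven_phone_map(source_stats, target_stats):
    """Map each target phone to its acoustically closest source phone.

    source_stats / target_stats: dicts of phone label -> mean acoustic
    feature vector. Any comparable distance measure (Mahalanobis, KL
    divergence between phone Gaussians) could be substituted for the
    plain Euclidean norm used here.
    """
    mapping = {}
    for t_phone, t_vec in target_stats.items():
        # Choose the source phone whose centroid lies closest to the target's.
        best = min(source_stats,
                   key=lambda s: np.linalg.norm(source_stats[s] - t_vec))
        mapping[t_phone] = best
    return mapping
```

Because the mapping is derived from the recognizer's own statistics rather than from linguistic knowledge, it tracks the observed data directly, which is why Figure 1 rates it above the naïve and phonetic mappings.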
PRONUNCIATION ADAPTATION


Typically taken for granted in multilingual speech adaptation
studies is the presence of a pronunciation dictionary and/or
rules for the target language.
At the far extremes, we assume the existence of well-targeted
pronunciation rules: in the worst case, one set designed for the
source speech, and in the best case, one specifically designed
for the target. In between, we use a number of methods to
derive or create a pronunciation module, based either upon
the existing source-language methods, the target speech data
itself, or some combination.
PRONUNCIATION ADAPTATION
Figure 2: Letter-to-sound rules / lexicon, ordered from (alleged) WORST
to (alleged) BEST:
Principled Source-Only → “Foreign Approximation” → Language-Neutral →
Trained from Target Data → Principled Target-Only
PRONUNCIATION ADAPTATION


• Principled Source-Only: this approach merely uses pronunciation
methods specifically designed for the source speech to generate a
pronunciation form for the target. This approach can result in
extremely inaccurate pronunciation approximations, such as one
might expect from a native English speaker's attempt at a native
pronunciation of an unusual foreign word.
• “Foreign Approximation”: this approach can be seen as akin to
the na(t)ive approach to phone mapping discussed above. In this
case, the speaker recognizes that the word being pronounced is not
a native one, and relaxes some of the language-specific rules or
attempts to move the pronunciation closer to that of the “assumed”
language of the word in question. The result is closer, but still
inaccurate and strongly accented.
PRONUNCIATION ADAPTATION



• Language-Neutral: this approach simply ignores all language-specific
information, assuming a set of very generic or regular
pronunciation rules that propose a (relatively) direct relation between
orthographic form and pronunciation. Such rules would closely
resemble those used for a language with artificially few
pronunciation exceptions, such as Esperanto, rather than those of
English.
• Trained from Target Data: in this method, an aligned text and
speech signal are provided to a recognizer, along with (possibly) a
limited set of pronunciation transcriptions as training data. In some
automatic way, the system learns a set of pronunciation rules and/or
a lexicon of pronunciations which closely matches the training data.
• Principled Target-Only: this approach assumes a provided
pronunciation module specifically designed to generate correct
pronunciations for our target language/dialect/accent.
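The trained-from-target-data method might be sketched, under the strong simplifying assumption of a one-to-one letter/phone alignment, as a frequency-voting table. Real systems learn context-dependent rules instead (e.g., decision-tree letter-to-sound training); the function names and fallback symbol here are illustrative:

```python
from collections import Counter, defaultdict

def train_letter_to_sound(examples):
    """Learn a trivial letter-to-sound table from aligned training pairs.

    examples: list of (word, phones) where, for this sketch only, the
    spelling and the phone sequence are assumed to align one-to-one.
    """
    votes = defaultdict(Counter)
    for word, phones in examples:
        assert len(word) == len(phones), "sketch assumes 1:1 alignment"
        for letter, phone in zip(word, phones):
            votes[letter][phone] += 1
    # Pick the most frequent phone observed for each letter.
    return {letter: c.most_common(1)[0][0] for letter, c in votes.items()}

def predict(rules, word):
    """Apply the learned table; unseen letters fall back to '_' here."""
    return [rules.get(letter, '_') for letter in word]
```

Even this toy version shows the defining property of the approach: the pronunciation module is induced from the target data itself rather than supplied by an expert.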
UNIT DATABASE COMPOSITION
Figure 3: Methods of Composing the Unit Database, ordered from (alleged)
WORST to (alleged) BEST:
Source Speaker Only → Union of all Recordings (unprincipled) →
Source Speaker + uncovered phones from target only →
Set of Digitally Altered Segments → Target Speaker Only
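The “Source Speaker + uncovered phones from target only” strategy from Figure 3 can be sketched as a simple merge. The dictionary layout and function name are illustrative assumptions:

```python
def compose_unit_database(source_units, target_units):
    """Build a concatenation inventory from source units, filling gaps
    with target-speaker recordings.

    source_units / target_units: dicts of phone label -> list of recorded
    unit identifiers. Phones absent from the source database (e.g. a
    French /r/ missing from an English voice) are covered with units
    from the target speaker; everything else stays source-only.
    """
    database = {phone: list(units) for phone, units in source_units.items()}
    for phone, units in target_units.items():
        if phone not in database or not database[phone]:
            database[phone] = list(units)  # uncovered phone: take target units
    return database
```

Keeping the source units wherever possible preserves the higher recording and labelling quality of the source database, while target units appear only where concatenation would otherwise be impossible.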
ADAPTATION FROM MIMICRY


• We know from the beginning that our source unit database is of the
best quality (in terms of recording, segmentation, labelling, etc.).
• But we can't directly synthesize from the source database, because
we will get accented, non-native-sounding speech.
Is there a way to generate non-accented or differently-accented
speech from a single speech database?
• Try to find a “neutrally” accented speaker? (What does this mean –
someone heavily polylingual? Someone geographically in between
the two languages or accents?)
• Look at mimicry studies – how someone (intentionally) modifies their
voice to sound like a different speaker.
ADAPTATION FROM MIMICRY

Anders Eriksson and Pär Wretling – “How Flexible is the Human
Voice? – A Case Study of Mimicry”:
• Close mimicry of global speech rate
• No change in timing at the segmental level
• Mean fundamental frequency and its variation closely matched the target
• Formant frequencies attained with varying success:
vowel imitation intermediate between own voice and target
• “Fundamental frequency changes were more successful than changes in
timing”
STAGES OF THE EXPERIMENT
• Our development efforts and systems will follow the four modes listed
in the research overview, in order of ascribed complexity. For the
Cross-Speaker Adaptation case, we will utilize a base voice and a
training speaker of native American English.
• For the Cross-Dialect Adaptation study, we will retain the use of English
for the basic case, adapting over a selection of American, British, and
Irish English dialects.
• We will then finish with two data sets for Cross-Language Adaptation,
proceeding in order of linguistic variation: variation over the set of
Celtic languages still in current use (Irish, Scottish Gaelic, and, slightly
more distantly, Welsh) and a selection of Asian Indian languages,
including (at least) Bengali and Hindi.