Text-to-Speech
Text-to-Speech
Part I
Intelligent Robot Lecture Note
1
Text-to-Speech
Introduction
Intelligent Robot Lecture Note
2
Text-to-Speech
Introduction
• History
►
Long before electronic signal processing was invented, speech
researchers tried to build machine to create human speech.
◦ Gerbert of Aurillac (d. 1003 AD)
◦ Albertus Magnus (1198-1280)
◦ Roger Bacon (1214-1294)
►
In 1779, the Danish scientist Christian Kratzenstein built models of the
human vocal tract that could produce the vowel sounds.
◦ This was followed by the bellows-operated “acoustic-mechanical speech
machine” by Wolfgang von Kempelen.
◦ This machine added models of the tongue and lips, enabling it to produce
consonants as well as vowels.
◦ In 1837, Charles Wheatstone produced a “speaking machine” based on
von Kempelen’s design.
◦ In 1857, M. Faber built the “Euphonia”.
Intelligent Robot Lecture Note
3
Text-to-Speech
Introduction
• History
►
►
In the 1930s, Bell Labs developed the VOCODER, a keyboardoperated electronic speech analyzer and synthesizer that was said to
be clearly intelligible.
The pattern playback was build by Franklin S. Cooper and his
colleagues in the late 1940s and completed in 1950.
◦ There were several different versions of this hardware device but only one
currently survives.
◦ The machine converts pictures of the acoustic patterns of speech in the
form of a spectrogram back into sound.
◦ Using this device, Alvin Liberman and colleagues were able to discover
acoustic cues for the perception of phonetic segments (consonants and
vowels).
Intelligent Robot Lecture Note
4
Text-to-Speech
Text and Phonetic Analysis
Intelligent Robot Lecture Note
5
Text-to-Speech
Text and Phonetic Analysis
•
•
•
•
•
•
•
Modules and Data Flow
Text Normalization
Linguistic Analysis
Homograph Disambiguation
Morphological Analysis
Letter-to-Sound Conversion
Evaluation
Intelligent Robot Lecture Note
6
Text-to-Speech
Modules and Data Flow
raw text or tagged text
Document Structure Detection
Text Normalization
Text Analysis
Linguistic Analysis
Lexicon
tagged text
Homograph Disambiguation
Morphological Analysis
Phonetic Analysis
Letter-to-Sound Conversion
tagged text & phones
Modularized functional blocks for text and phonetic analysis components [Huang et al., 2001]
Intelligent Robot Lecture Note
7
Text-to-Speech
Modules and Data Flow
• Modules
►
Document structure detection
◦ Document structure is important to provide a context for all later
processes. In addition, some elements of document structure, such as
sentence breaking and paragraph segmentation, may have direct
implications for prosody. (not be covered here)
►
Text normalization
◦ Text normalization is the conversion from the variety of symbols, numbers,
and other non-orthographic entities of text into a common orthographic
transcription suitable for subsequent phonetic conversion.
►
Linguistic analysis
◦ Linguistic analysis recovers the syntactic constituency and semantic
features of words, phrases, clauses, and sentences, which is important for
both pronunciation and prosodic choices in the successive processes.
Intelligent Robot Lecture Note
8
Text-to-Speech
Modules and Data Flow
• Modules
►
Homograph disambiguation
◦ It is important to disambiguate words with different senses to determine
proper phonetic pronunciations, such as object (/ah b jh eh k t/) as a verb
or as a noun (/aa b jh eh k t/).
►
Morphological analysis
◦ Analyzing the component morphemes provides important cues to attain
the pronunciations for inflectional and derivational words.
►
Letter-to-sound conversion
◦ The last stage of the phonetic analysis generally includes general letter-tosound rules (or modules) and a dictionary lookup to produce accurate
pronunciations for any arbitrary word.
Intelligent Robot Lecture Note
9
Text-to-Speech
Modules and Data Flow
• Data flow
S [f1, f2, …, fn]
S
NP [f1, f2, …, fn]
VP [f1, f2, …, fn]
W
W1
W2
W3
Σ
A
skilled
e
lec
tri
cian
re
por
ted
ax
s
k
ih
l
d
ih
l
eh
k
t
r
ih
sh
ax
n
r
iy
p
ao
r
t
ax
d
F1
F2
F4
F5
C
P
IP1 [f1, f2, …, fn]
W4
F3
IP2 [f1, f2, …, fn]
U [f1, f2, …, fn]
Annotation tiers indicating incremental analysis based on an input (text) sentence “A skilled electrician reported.”
Flow of incremental annotation is indicated by arrows on the left side. [Huang et al., 2001]
Intelligent Robot Lecture Note
10
Text-to-Speech
Modules and Data Flow
• Data flow
►
W(ords)  Σ, C(ontrols)
◦ The syllabic structure (Σ) and the basic phonemic form of a word are
derived from lexical lookup and/or the application of rules. The Σ tier
shows the syllable divisions (written in text form for convenience). The C
tier, at this stage, shows the basic phonemic symbols for each word’s
syllables.
►
W(ords)  S(yntax/semantics)
◦ The word stream from text is used to infer a syntactic and possibly
semantic structure (S tier) for an input sentence. Syntactic and semantic
structure above the word would include syntactic constituents such as
Noun Phrase (NP), Verb Phrase (VP), etc. and any semantic features that
can be recovered from the current sentence or analysis of other contexts
that may be available (such as an entire paragraph or document). The
lower-level phrases such as NP and VP may be grouped into broader
constituents such as Sentence (S), depending on the parsing architecture.
Intelligent Robot Lecture Note
11
Text-to-Speech
Modules and Data Flow
• Data flow
►
S(yntex/semantics)  P(rosody)
◦ The P(rosodic) tier is also called the symbolic prosodic module. If a word is
semantically important in a sentence, that importance can be reflected in
speech with a little extra phonetic prominence, called an accent. Some
synthesizers begin building a prosodic structure by placing metrical foot
boundaries to the left of every accented syllable. The resulting metrical
foot structure is shown as F1, F2, etc. Over the metrical foot structure,
higher-order prosodic constituents, with their own characteristic relative
pitch ranges, boundary pitch movements, etc. can be constructed, shown
as intonational phrases IP1, IP2.
Intelligent Robot Lecture Note
12
Text-to-Speech
Text Normalization
• Abbreviations
►
►
An abbreviation is a shortened form of a word or phrase.
Since any abbreviation is potentially ambiguous, and there are several
distinct types of ambiguity, a system should combine knowledge from
a variety of contextual sources such as document structure and origin,
in order to resolve abbreviations.
◦ Dr. (doctor or drive)
◦ MD (medicinae doctor or Maryland)
►
Types of abbreviations
◦
◦
◦
◦
◦
◦
◦
Acronyms*
Apocope
Clipping
Elision
Syncope
Syllabic abbreviation
Portmanteau
Intelligent Robot Lecture Note
13
Text-to-Speech
Text Normalization
• Abbreviations
►
►
Acronyms are words created from the first letters or other words.
Examples of acronyms
◦ Pronounced as a word
– NATO: North Atlantic Treaty Organization
– scuba: self-contained underwater breathing apparatus
◦ Pronounced as the names of letters
– DNA: deoxyribonucleic acid
– LED: light-emitting diode
◦ Pronounced as the names of letters but with a shortcut
– IEEE: Institute of Electrical and Electronics Engineers
– W3C: World Wide Web Consortium
◦ Pseudo-acronyms
– IOU: “I owe you”
– CQR: “secure”, a brand of boat anchor
Intelligent Robot Lecture Note
14
Text-to-Speech
Text Normalization
• Number formats
►
Phone numbers
02-1234-5678
(02) 1234-5678
+82-2-1234-5678
►
Dates
12/19/94
December nineteenth ninety four
04/27/1992
April twenty seventh nineteen ninety two
May 27, 1995
May twenty seventh nineteen ninety five
1,994
one thousand nine hundred and ninety four
1994
nineteen ninety four
Intelligent Robot Lecture Note
15
Text-to-Speech
Text Normalization
• Number formats
►
►
Times
11:15
eleven fifteen
5:20 am
five twenty am
12:15:20
twelve hours fifteen minutes and twenty seconds
07:55:46
seven hours fifty-five minutes and forty-six seconds
Money and currency
$40
forty dollars
£200
two hundred pounds
5¥
five yen
300 €
three hundred euros
Intelligent Robot Lecture Note
16
Text-to-Speech
Text Normalization
• Number formats
►
►
Ordinal numbers
1/2
one half
1/3
one third
1/4
one quarter or one fourth
3/10
three tenths
Cardinal numbers
123
one two three
1,234
one thousand two hundred (and) thirty four
two four two six
one hundred (and) twenty three
twenty four twenty six
2426
two thousand four hundred (and) twenty six
Intelligent Robot Lecture Note
17
Text-to-Speech
Text Normalization
• Domain-specific tags
►
Mathematical expressions (MathML)
◦ (x + 2)2
<EXPR>
<EXPR>
x
<PLUS/>
2
</EXPR>
<POWER/>
2
</EXPR>
►
Chemical formulae (CML)
◦ C2OCHO4
<FORMULA>
<XVAR BUILTIN=“STOICH”>
C C O C O H H H H
</XVAR>
</FORMULA>
Intelligent Robot Lecture Note
18
Text-to-Speech
Text Normalization
• Miscellaneous formats
►
Approximately or tilde
◦ The symbol ~ is spoken as approximately before numeral or currency
amount, otherwise it is the character named tilde.
►
Accented Roman characters
◦ A table of folding characters can be provided so that for a term such as
Über-mensch, rather than spell out the word Über, or ignore it, the system
can convert it to its nearest English-orthography equivalent: Uber.
►
High ASCII characters
◦ Ⓒ (copyright), ™ (trademark), @ (at), ® (registered mark)
►
Asterisk
◦ “Larry has *never* been here.”
◦ This may be suppressed for asterisks spanning two or more words.
►
Emoticons
◦ :-) (smiley face), :-( (frowning face), ;-) (winking smiley face)
Intelligent Robot Lecture Note
19
Text-to-Speech
Linguistic Analysis
• A minimal set of modular functions or services
►
Sentence breaking
Mr. Smith came by. He knows that it costs $1.99, but I don’t
know when he’ll be back (he didn’t ask, “when should I
return?”)... His Web site is www.mrsmithhhhhh.com. The car
is 72.5 in. long (we don’t know which parking space he’ll
put his car in.) but he said “...and the truth shall set you
free,” an interesting quote.
◦ Abbreviation processing
◦ Rules or CART built upon features based on: document structure,
whitespace, case conventions, etc.
◦ Statistical frequencies on sentence-initial word likelihood
◦ Statistical frequencies of typical lengths of sentences for various genres
◦ Streaming syntactic/semantic (linguistic) analysis
Intelligent Robot Lecture Note
20
Text-to-Speech
Linguistic Analysis
• A minimal set of modular functions or services
►
POS tagging [Manning et al., 1999]
Influential/JJ members/NNS of/IN the/DT House/NNP Ways/NNP
and/CC Means/NNP Committee/NNP introduced/VBD legislation/NN
that/WDT would/MD restrict/VB how/WRB the/DT new/JJ savingsand-loan/NN bailout/NN agency/NN can/MD raise/VB
capital/NN ,/, creating/VBG another/DT potential/JJ
obstacle/NN to/TO the/DT government/NN ’s/POS sale/NN of/IN
sick/JJ thrifts/NNS ./.
◦ A system uses the POS labels to decide alternative pronunciations and to
assign differing degrees of prosodic prominence.
◦ In addition, the bracketing might assist in deciding where to place pauses
for great intelligibility.
Intelligent Robot Lecture Note
21
Text-to-Speech
Linguistic Analysis
• A minimal set of modular functions or services
►
Homograph disambiguation
◦ Homograph disambiguation in general refers to the case of words with the
same orthographic representation (written form) but having different
semantic meanings and sometimes even different pronunciations.
►
Noun phrase (NP) and clause detection
◦ Basic NP and clause information could be critical for a prosodic generation
module to generate segmental durations. It also provides useful clues to
introduce necessary pauses for intelligibility and naturalness. Phrase and
clause structure are well covered in any parsing techniques.
►
Sentence type identification
◦ Sentence types (declarative, yes-no question, etc.) are critical for macrolevel prosody for the sentence.
Intelligent Robot Lecture Note
22
Text-to-Speech
Homograph Disambiguation
• Homograph variation
►
►
Homograph variation can often be resolved on POS (grammatical)
category.
Deep semantic and/or discourse analysis would be required to resolve
the tense ambiguity.
◦ “If you read the book, he’ll be angry.”
• Two special sources of pronunciation ambiguity
►
A variation of dialects
◦ Tomato: tom[ey]to, tom[aa]to
◦ Boston natives tend to reduce the /r/ sound in sentences like “Park your
car in Harvard yard.”
►
Speech rate and formality level
◦ The /g/ sound in recognize may be omitted in faster speech.
Intelligent Robot Lecture Note
23
Text-to-Speech
Homograph Disambiguation
• Examples of homographs
►
Stress homographs: noun with front-stress vowel, verb with end-stress
vowel
◦ “an absent boy” vs. “Do you choose to absent yourself?”
►
Voicing: noun/verb or adjective/verb distinction made by voice final
consonant
◦ “They will abuse him.” vs. “They won’t take abuse.”
►
-ate words: noun/adjective sense uses schwa, verb sense uses a full
vowel
◦ “He will graduate.” vs. “He is a graduate.”
►
Double stress: front-stressed before noun, end-stressed when final in
phrase
◦ “an overnight bag” vs. “Are you staying overnight?”
►
-ed adjectives with matching verb past tenses
◦ “He is a learned man.” vs. “He learned to play piano.”
Intelligent Robot Lecture Note
24
Text-to-Speech
Homograph Disambiguation
• Examples of homographs
►
Ambiguous abbreviations
◦ in, am, SAT (Saturday vs. Standard Aptitude Test)
►
Borrowed words from other languages
◦ They could sometimes be distinguishable based on capitalization.
◦ “El Camino Real road in California” vs. “real world”
◦ “polish shoes” vs. “Polish accent”
►
Miscellaneous
◦ “The sewer overflowed.” vs. “a sewer is not a tailor.”
◦ “He moped since his parents refused to buy a moped.”
◦ “Agape is a Greek word.” vs. “His mouth was agape.”
Intelligent Robot Lecture Note
25
Text-to-Speech
Morphological Analysis
• Morphological analysis
►
Decomposition process
re nation al ize
prefix
►
stem
suffix
When a dictionary does not list a given orthographic form explicitly, it
is sometimes possible to analyze the new word in terms of shorter
forms already present.
Intelligent Robot Lecture Note
26
Text-to-Speech
Morphological Analysis
• Prefixes and suffixes
►
The prefixes and suffixes are generally considered bound, in the
sense that they cannot stand alone but must combine with a stem.
◦ A word such as establishment may be decomposed into a “stem” establish
and a suffix -ment.
◦ Should a system attempt to further decompose the stem establish into
establ and -ish?
– Since a difference that makes no difference is no difference, it is best to be
conservative and list only obvious and highly productive affixes.
►
Prefix and suffix stripping gives an analysis for many common
inflected and some derived words.
◦ It helps in saving system storage, but it also make mistakes (from a
synchronic point of view: basement is not base + -ment).
◦ Adding irregular morphological formation into a system dictionary is always
a desirable solution.
Intelligent Robot Lecture Note
27
Text-to-Speech
Letter-to-Sound Conversion
• Letter-to-sound conversion [Demper et al., 1998]
►
►
►
Also known as grapheme-to-phoneme conversion
For most words encountered in the input of a TTS system, a canonical
(or ‘baseform’) pronunciation is easily obtained by dictionary look-up.
The traditional default strategy uses a set of context-dependent
phonological rules written by an expert.
◦ However, the task of manually writing such a set of rules, deciding the rule
order so as to resolve conflicts appropriately, maintaining the rules as
mispronunciations are discovered etc., is very considerable and requires
an expert with depth of knowledge of the specific language.
►
More recent attention has focused on the application of automatic
techniques based on machine-learning from large corpora.
Intelligent Robot Lecture Note
28
Text-to-Speech
Letter-to-Sound Conversion
• Phonological rules
►
Assumption
◦ The pronunciation of a letter or letter substring can be found if sufficient is
known of its context, i.e. the surrounding letters.
►
►
The form of the rules is A[B]C  D, which states that the letter
substring B with left-context A and right-context C receives the
pronunciation (i.e. phoneme substring) D.
Because of the complexities of letter-to-sound correspondence, more
than one rule generally applies at each stage of transcription.
◦ The conflicts which arise are resolved by maintaining the rules in a set of
sublists, grouped by (initial) letter and with each sublist ordered by
specificity.
◦ The most specific rule is usually at the top and most general at the bottom.
◦ Transcription is usually a one-pass, left-to-right process.
Intelligent Robot Lecture Note
29
Text-to-Speech
Letter-to-Sound Conversion
• Pronunciation by analogy
►
►
Pronunciation by analogy exploits the phonological knowledge
implicitly contained in a dictionary of words and their corresponding
pronunciations.
The underlying idea is that a pronunciation for an unknown word is
assembled by matching substrings of the input to substrings of known,
lexical words, hypothesizing a partial pronunciation for each matched
substring from the phonological knowledge, and concatenating the
partial pronunciations.
Intelligent Robot Lecture Note
30
Text-to-Speech
Letter-to-Sound Conversion
• Decision trees [Black et al., 1998]
►
►
►
A decision tree has as the input grapheme sliding window with three
letters to the left and three to the right accordingly.
This method is appropriate for discreet characteristics and produces
rather compact models, whose size is defined by the total number of
questions and leaf nodes in the output tree.
Disadvantages
◦ It assumes that the decisions are independent one from another so is that
it cannot use the prediction of the previous phone as the reference to
predict the next one.
◦ Another limitations introduced by the binary decision trees that every time
a question is asked the training corpus is divided into two parts and further
questions are asked only over the remaining parts of the corpus.
Intelligent Robot Lecture Note
31
Text-to-Speech
Letter-to-Sound Conversion
• Finite state transducers
►
This approach chooses the pronunciation φ that maximizes the
probability of a phoneme sequence given the letter sequence g.
ˆ  arg max { p ( | g )}

►
This estimation can be done using standard n-gram methods.
N
p( g , ) 

i 1
i 1
p ( g i , i | g1 ,1 )
i 1
◦ where N is the number of letters in the word.
Intelligent Robot Lecture Note
32
Text-to-Speech
Letter-to-Sound Conversion
• Finite state transducers
►
►
►
n-grams can be represented by a finite-state automaton, where a new
state is defined for each history h and arc a is created for each new
grapheme-phoneme pair (g,φ).
These arcs are labeled with a grapheme-phoneme pair and weighted
with the probability of (g,φ) given the history h.
To derive the finite state transducer the labels attached to the
automaton edges are split in a way that letters become input and
phonemes become output.
Intelligent Robot Lecture Note
33
Text-to-Speech
Letter-to-Sound Conversion
• Hidden Markov models [Taylor, 2005]
►
►
►
Each phoneme is represented by one HMM and letters are the emitted
observations.
The probability of transitions between models is equal to the
probability of the phoneme given the previous phoneme.
The objective of this method is to find the most probable sequence of
hidden models (phonemes) given the observations (letters), using the
probability distributions found during the model training.
ˆ  arg max p ( g |  ) p ( )

◦ where p(φ) is the prior probability of a sequence of phonemes occurring,
and p(g,φ) is the likelihood of grapheme sequence g given phoneme
sequence φ.
Intelligent Robot Lecture Note
34
Text-to-Speech
Evaluation
• Text and phonetic analysis
►
►
The evaluation is usually feasible, because the input and output of
such module is relatively well defined.
The evaluation focuses mainly on symbolic and linguistic level in
contrast to the acoustic level.
• Automatic detection of document structures
►
►
The evaluation typically focuses on sentence breaking and sentence
type detection.
Since the definitions of these two types of document structures are
straightforward, a standard evaluation database can be easily
established.
• LTS conversion analysis
►
An automated test framework for the LTS conversion analysis
minimally includes a set of test words and their phonetic transcriptions
for automated lookup and comparison tests.
Intelligent Robot Lecture Note
35
Text-to-Speech
Reading List
• Black et al., 1998. “Issues in building general letter to sound rules”.
3rd ESCA workshop on speech synthesis, Jenolah Caves,
Australia, pp. 77-80.
• Demper et al., 1998. “A comparison of letter-to-sound conversion
techniques for English text-to-speech synthesis”. Institute of
Acoustics 20(6), pp. 245-254.
• Huang et al., 2001. Spoken Language Processing: A Guide to
Theory, Algorithm, and System Development. New Jersey,
Prentice Hall.
• Manning and Schütze, 1999. Foundations of Statistical Natural
Language Processing. Cambridge, MIT Press.
• Taylor, 2005. “Hidden Markov Models for Grapheme to Phoneme
Conversion”. Interspeech-2005, Lisbon, pp. 1973-1976.
Intelligent Robot Lecture Note
36
Text-to-Speech
Prosody
Intelligent Robot Lecture Note
37
Text-to-Speech
Prosody
•
•
•
•
•
•
•
General Prosody
Speaking Style
Symbolic Prosody
Duration Assignment
Pitch Generation
Prosody Markup Languages
Prosody Evaluation
Intelligent Robot Lecture Note
38
Text-to-Speech
General Prosody
• It is not what you said; it is how you said it!
►
►
►
An important supporting role in guiding a listener’s recovery of the
basic messages
A starring role in signaling connotation
The speaker’s attitude toward the message, toward the listener
• Prosody is often defined on two different levels
►
An abstract, phonological level
◦ (phrase, accent and tone structure)
►
A physical phonetic level
◦ (fundamental frequency, intensity or amplitude, and duration)
Intelligent Robot Lecture Note
39
Text-to-Speech
General Prosody
• From the listener’s point of view, prosody consists of systematic
perception and recovery of a speaker’s intentions based on:
►
Pauses
◦ To indicate phrases and to avoid running out of air
►
Pitch
◦ Rate of vocal-fold cycling (fundamental frequency or F0) as a function of
time
►
Rate/relative duration
◦ Phoneme durations, timing, and rhythm
►
Loudness
◦ Relative amplitude/volume
Intelligent Robot Lecture Note
40
Text-to-Speech
General Prosody
• The analysis of prosody
►
Two approaches in the prosody literature
◦ Create an abstract descriptive system witch characterizes observations of
the behavior of the parameters of prosody within the acoustic signal
(fundamental frequency movement, intensity changes and durational
movement), and promote the system to a symbolic phonological role.
◦ Create a symbolic phonological system which can be used to input to
processes which eventually result in an acoustic signal judged by listeners
to have a proper prosody.
Intelligent Robot Lecture Note
41
Text-to-Speech
General Prosody
Phonetic analysis
sound wave
synthesized
sound wave
strip random
element
add random
element
Phonological analysis
strip non-random
device-constrained
phonetic variations
strip non-random,
but non-essential
language-specific
phonological
variations
preserve generalized
stripped information
in a static phonetic
model
preserve generalized
stripped information
in a static
phonological model
reconstruct
phonetic
variations
reconstruct
phonological
variations
Low level synthesis
abstract
underlying
representation
abstract
markup
High level synthesis
Intelligent Robot Lecture
The Note
diagram of how linguists take a sound wave [Monaghan, 2002]
42
Text-to-Speech
General Prosody
Parsed text and phone string
Pause insertion and prosodic phrasing
Speaking style
F0 contour
Duration
Loudness
Enriched prosodic representation
Block diagram of a prosody generation system
Intelligent Robot Lecture Note
43
Text-to-Speech
Speaking Style
• Prosody depends not only on the linguistic content of a sentence.
• Different people generate different prosody for the same sentence.
• Even the same person generates a different prosody depending on
his or her mood.
Intelligent Robot Lecture Note
44
Text-to-Speech
Speaking Style
• Character
►
As a determining element in prosody, refers primarily to long-term,
stable, extra-linguistic properties of a speaker.
◦ Speaker’s region and economic status
◦ Gender, age, speech defects
◦ Fatigue, inebriation, talking with mouth full
►
Since many of these elements have implications for both the prosodic
and voice quality of speech output.
Intelligent Robot Lecture Note
45
Text-to-Speech
Speaking Style
• Emotion
►
►
Temporary emotional conditions such as amusement, anger,
contempt, grief, sympathy, suspicion, etc. have an effect on prosody.
A few preliminary conclusion from existing research on emotion in
speech [Murray et al., 1993].
◦ Speakers vary in their ability to express emotive meaning vocally in
controlled situations.
◦ Listeners vary in their ability to recognize and interpret emotions from
recorded speech.
◦ Some emotions are more readily expressed and identified than others.
◦ Similar intensity of two emotions can lead to confusing one with the other.
Intelligent Robot Lecture Note
46
Text-to-Speech
Speaking Style
• Some basic emotions that have been studied in speech include:
►
►
►
►
Anger, though well studied in the literature, may be too broad a
category for coherent analysis. One could imagine a threatening kind
of anger with a more overtly expressive type of tantrum could be
correlated with a wide, raised pitch range.
Joy is generally correlated with increase in pitch and pitch range, with
in crease in speech rate, Smiling generally raises F0 and formant
frequencies and can be well identified by untrained listeners.
Sadness generally has normal or lower than normal pitch realized in a
narrow range, with a slow rate and tempo, It may also be
characterized by slurred pronunciation and irregular rhythm.
Fear is characterized by high pitch in a wide range, variable rate,
precise pronunciation, and irregular voicing (perhaps due to disturbed
respiratory pattern).
Intelligent Robot Lecture Note
47
Text-to-Speech
Symbolic Prosody
• Abstract or symbolic prosodic structure is the link between the
infinite multiplicity of pragmatic, semantic, and syntactic features of
an utterance and the relatively limited F0, phone durations, energy,
and voice quality.
• Symbolic prosody deals with:
►
►
Braking the sentence into prosodic phrases, possibly separated by
pauses
Assigning labels, such as emphasis, to different syllables or words
within each prosodic phrase.
Intelligent Robot Lecture Note
48
Text-to-Speech
Symbolic Prosody
• Words are normally spoken continuously, unless there are specific
linguistic reasons to signal a discontinuity.
• The term juncture refers to prosodic phrasing that is, where do
words cohere, and where do prosodic breaks (pauses and/or
special pitch movements) occur.
• Juncture effects, expressing the degree of cohesion or
discontinuity between adjacent words, are determined by
physiology (running out of breath), phonetics, syntax, semantics,
and pragmatics.
Intelligent Robot Lecture Note
49
Text-to-Speech
Symbolic Prosody
• The primary phonetic means of signaling juncture are:
►
►
►
►
Silence insertion
Characteristic pitch movements in the phrase-final syllable.
Lengthening of a few phones in the phrase-final syllable.
Irregular voice quality such as vocal fry.
Intelligent Robot Lecture Note
50
Text-to-Speech
Symbolic Prosody
Parsed text and phone string
Symbolic Prosody
Pauses
Prosodic Phrases
Accent
Tone
Tune
Prosody Attributes
Pitch Range
Prominence
Declination
Speaking Style
F0 Contour Generation
F0 contour
Pitch generation decomposed in symbolic and phonetic prosody
Intelligent Robot Lecture Note
51
Text-to-Speech
Symbolic Prosody
• Pauses
►
►
►
►
In a long sentence, speakers normally and naturally pause a number
of times.
These pauses have traditionally been thought to correlate with
syntactic but might more properly be thought of as markers of
information structure [Steedman, 2000].
They may also be motivated by poorly understood stylistic
idiosyncrasies of the speaker, or physical constraints.
There are many reasonable places to pauses in a long sentence, but
a few where it is critical not to pause. The goal of a TTS system
should be to avoid placing pauses anywhere that might lead to
ambiguity, misinterpretation, or complete breakdown of understanding.
Intelligent Robot Lecture Note
52
Text-to-Speech
Symbolic Prosody
• Pauses
►
The CART can be used for pause assignment [Ostendorf et al., 1998].
◦ Use POS categories of words, punctuation, and a few structural measures,
such as overall length of a phrase, and length relative to neighboring
phrases to construct the classification tree.
◦ Is this a sentence boundary (marked by punctuation)?
◦ Is the left word a content word and the right word a function word?
◦ What is the function word type of word to the right?
◦ Is either adjacent word a proper name?
◦ What is the current location in the sentence?
◦ …
Intelligent Robot Lecture Note
53
Text-to-Speech
Symbolic Prosody
• Prosodic phrases
►
►
►
An end-of-sentence period may trigger an extreme lowering of pitch, a
comma-terminated prosodic phrase may exhibit a small continuation
rise at its end, signaling more to come, etc.
Prosodic junctures that are clearly signaled by silence (and usually by
characteristic pitch movement as well), also called intonational
phrases, are required between utterances and usually at punctuation
boundaries.
Prosodic junctures that are not signaled by silence but rather by
characteristic pitch movement only, also called phonological phrases,
may be harder to place with certainty and to evaluate.
Intelligent Robot Lecture Note
54
Text-to-Speech
Symbolic Prosody
• Prosodic phrases
►
►
To discuss linguistically significant juncture types and pitch movement,
it helps to have a simple standard vocabulary.
ToBI (for Tones and Break Indices) [Silverman, 1992] is a proposed
standard for transcribing symbolic intonation of American English
utterances, though it can be adapted to other languages as well.
Intelligent Robot Lecture Note
55
Text-to-Speech
Symbolic Prosody
• Prosodic phrases
►
►
The Break Indices part of ToBI specifies an inventory of numbers
expressing the strength of a prosodic juncture.
The Break Indices are marked for any utterance on their own discrete
break index tier (or layer of information), with the BI notations aligned
in time with a representation of the speech phonetics and pitch track.
Intelligent Robot Lecture Note
56
Text-to-Speech
Symbolic Prosody
• Prosodic phrases
►
On the break index tier, the prosodic association of words in an
utterance is shown by labeling the end of each word for the subjective
strength of its association with the next word, on a scale from 0
(strongest perceived conjoining) to 4 (most disjoint).
◦ 0 for cases of clear phonetic marks of clitic groups (pronunced as part of
another word, as in ve in I’ve)
◦ 1 most phrase-medial word boundaries.
◦ 2 a strong disjuncture marked by a pause or virtual pause, but with no
tonal marks.
◦ 3 intermediate intonation phrase boundrary.
◦ 4 full intonation phrase boundary.
◦ “Did/0 you/1 want/0 an/1 example?/4”
Intelligent Robot Lecture Note
57
Text-to-Speech
Symbolic Prosody
• Accents
►
►
We should briefly clarify use of terms such as stress and accent.
Stress generally refers to an idealized location in an English word that
is a potential site for phonetic prominence effects, such as extruded
pitch and/or lengthened duration.
◦ “Acme Industries is the biggest employer in the area.”
►
Accent is the signaling of semantic salience by phonetic means.
◦ “I didn’t say employer, I said employee.”
Intelligent Robot Lecture Note
58
Text-to-Speech
Symbolic Prosody
• Tones
►
►
Tones can be understood as labels for perceptually salient levels or
movements of F0 on syllables.
Pitch levels and movements on accented and phrase-boundary
syllables can exhibit a bewildering diversity, based on the speaker’s
characteristics, the nature of the speech event.
Intelligent Robot Lecture Note
59
Text-to-Speech
Symbolic Prosody
• Tones
►
Chinese, a lexical tone language, is said to have an inventory of 4
lexical tones (5 if neutral tone is included).
The four Chinese tones
Intelligent Robot Lecture Note
60
Text-to-Speech
Symbolic Prosody
• Tones
►
►
A basic set of tonal contrasts has been codified for American English
as part of the Tones and Break Indices (ToBI) system [Silverman,
1992].
ToBI can be used for annotation of prosodic training data for machine
learning, and also for internal modular control of F0 generation in a
TTS system
◦ The set specifies 2 abstract levels, H (High) and L (Low), indicating a
relatively higher or lower point in a speaker’s range.
Intelligent Robot Lecture Note
61
Text-to-Speech
Symbolic Prosody
ToBI Pitch accent tones
ToBI Tone
Description
H*
Peak accent – a tone target on an accented syllables which is
in the upper part of the speaker’s pitch range
L*
Low accent – a tone target on an accented syllable which is
in the lowest part of the speaker’s pitch range
L*+H
Scooped accent – a low tone target on an accented syllable
which is immediately followed by a relatively sharp rise to a
peak in the upper part of the speaker’s pitch range
L*+!H
Scooped downstep accent – a low tone target on an
accented syllable which is immediately followed by a relatively
flat rise to a downstep peak
L+H*
Rising peak accent – a high peak target on an accented
syllable which is immediately preceded by a relatively sharp
rise from a valley in the lowest part of the speaker’s pitch
range
!H*
Graph
Downstep high tone –a clear step down onto an accented
syllable from a high pitch which itself cannot be accounted for
by an H phrasal tone ending the preceding phrase or by a
preceding H pitch accent in the same phrase
Intelligent Robot Lecture Note
62
Text-to-Speech
Symbolic Prosody
ToBI intermediate phrasal tones
ToBI Tone
Description
L-
Phrase accent, which occurs at an intermediate phrase boundary (level 3 and above).
H-
Phrase accent, which occurs at an intermediate phrase boundary (level 3 and above).
ToBI boundary tones
ToBI Tone
Description
L-L%
For a full intonation phrase with an L phrase accent ending its final intermediate phrase and an L%
boundary tone falling to a point low in the speaker’s range, as in the standard ‘declarative’ contour of
American English.
L-H%
For a full intonation phrase with an L phrase accent closing the last intermediate phrase, followed by
an H boundary tone, as in ‘continuation rise’.
H-H%
For an intonation phrase with a final intermediate phrase ending in an H phrase accent and a
subsequent H boundary tone, as in the canonical ‘yes-no question’ contour. Note that the H- phrase
accent causes ‘upstep’ on the following boundary tone, so that the H% after H- rises to a very high
value.
H-L%
For an intonation phrase in which the H phrase accent of the final intermediate phrase upsteps the L%
to a value in the middle of the speaker’s range, producing a final level plateau.
%H
High initial boundary tones; marks a phrase that begins relatively high in the speaker
‘s pitch range when not explained by an initial H* or preceding H%.
Intelligent Robot Lecture Note
63
Text-to-Speech
Symbolic Prosody
“Marianna made the marmalade”, with an H* accent on Marianna and marmalade,
and a final L-L% marking the characteristic sentence-final pitch drop.
Intelligent Robot Lecture Note
64
Text-to-Speech
Symbolic Prosody
• Prosodic transcription system
►
INTSINT
◦ INTSINT is a coding system of intonation [Hirst, 1994].
►
TILT
◦ TILT is one of the most interesting models of prosodic annotation.
◦ It can represent a curve in both its qualitative (ToBI-like) and quantitative
(parameterized) aspects.
►
K-ToBI
◦ Prosodic transcription convention for standard Korean [Jun, 2000].
Intelligent Robot Lecture Note
65
Text-to-Speech
Symbolic Prosody
• INTSINT
►
INTSINT is a coding system of intonation [Hirst, 1994].
◦ It provides a formal encoding of the symbolic or phonologically significant
events on a pitch curve.
◦ Each such target point of the stylized curve is coded by a symbol, either as
an absolute tone, scaled globally with respect to the speakers pitch range,
or as a relative tone, defined locally in conjunction with the neighboring
target points.
The definition of absolute tones
T
Top of the speaker’s pitch range
M
Initial, mid value
B
Bottom of the speaker’s pitch range
H
Target higher than both immediate neighbors
S
Target not different from preceding target
L
Target lower than both immediate neighbors
U
Target in a rising sequence
D
Target in a falling sequence
Intelligent Robot Lecture Note
66
Text-to-Speech
Symbolic Prosody
• INTSINT
Top
T
Higher
H
D
Upstepped
Downstepped
U
B
Bottom
Intelligent Robot Lecture Note
Mid
Same
M
S
L
Lower
Absolute tone
Relative tone
67
Text-to-Speech
Symbolic Prosody
• TILT
►
The automatic parameterization of a pitch event on a syllable is in
terms of:
◦
◦
◦
◦
◦
►
Starting F0 value (Hz)
Duration
Amplitude of rise (Arise, in Hz)
Amplitude of fall (Afall, in Hz)
Starting point, time aligned with the signal and with the vowel onset
The tone shape, mathematically represented by its tilt, is a value
computed directly from the F0 curve by the following formula
tilt 
Intelligent Robot Lecture Note
| A rise |  | A fall |
| A rise |  | A fall |
68
Text-to-Speech
Symbolic Prosody
• TILT
Label scheme for syllables
sil
Silence
c
Connection
a
Major pitch accent
fb
Falling boundary
rb
Rising boundary
afb
Accent + falling boundary
arb
Accent + rising boundary
m
Minor accent
mfb
Minor accent + falling boundary
mrb
Minor accent + rising boundary
l
Level accent
lrb
Level accent + rising boundary
lfb
Level accent + falling boundary
Intelligent Robot Lecture Note
69
Text-to-Speech
Symbolic Prosody
• K-ToBI
►
►
►
Prosodic transcription convention for standard Korean [Jun, 2000].
Based on the design principles of the original English ToBI and the
Japanese ToBI system.
Assumes intonational phonology with a close relationship to a
hierarchical model of prosodic constituents.
Intelligent Robot Lecture Note
70
Text-to-Speech
Symbolic Prosody
• K-ToBI
IP
AP
IP: intonation phrase
AP: accentual phrase
W: phonological word
s: syllable
T=H, when the syllable initial segment is aspirated/tense
otherwise, T=L
%: intonation phrase boundary tone
(AP)
W
(W)
s
s…s
T
H
s
L H
%
Intonational structure of Korean
Intelligent Robot Lecture Note
71
Text-to-Speech
Symbolic Prosody
• K-ToBI
►
►
Two prosodic units defined by intonation
Intonation Phrase (IP)
◦ Marked by a boundary tone (%) and phrase final lengthening
►
Accentual Phrase (AP)
◦ Smaller than an IP but larger than word and demarcated by phrasal tones
◦ AP phrasal tones
– LHLH when the phrase has more than 3 syllables
– LH, LLH, LHH when the phrase has fewer than 4 syllables
◦ AP initial tone
– Changes depending on the laryngeal feature of the phrase initial segment
– H when the AP begins with an aspirated or a tense obstruent (e.g., HHLH)
– Otherwise L
Intelligent Robot Lecture Note
72
Text-to-Speech
Symbolic Prosody
“I hate Younga”
Intelligent Robot Lecture Note
73
Text-to-Speech
Symbolic Prosody
• Structure of K-ToBI
ToBI (4 tiers)
Word tier
Tone tier
K-ToBI (5 tiers)
Word tier
Phonological tone tier
Phonetic tone tier
Break index tier
Break index tier
Miscellaneous tier
Miscellaneous tier
Intelligent Robot Lecture Note
74
Text-to-Speech
Symbolic Prosody
• Why does K-ToBI need 2 tone tier?
►
►
Surface tonal variation are not distinctive or predictable in Korean
Prosody
Rather, what is distinctive is the phrasing and IP boundary tones.
◦ The presence or absence of an AP and IP boundary can change the
meaning of an utterance, as with the distinction between wh-questions and
yes/no-question and the disambiguation of syntactically ambiguous
structures.
◦ Also, the IP boundary tone delivers semantic s sell as pragmatic meaning
for an utterance.
Intelligent Robot Lecture Note
75
Text-to-Speech
Symbolic Prosody
• Phonological tone tier
►
Labels the boundary of two prosodic units
◦ A boundary tone (X%) at the end of an IP
– X% can be one of the 9 IP boundary tones (L%, H%, HL%, LH%, LHL%,
HLH%, HLHL%, LHLH%, LHLHL%)
– We note that it is possible that not all of the IP boundary tones are distinctive
(e.g., LHLH% vs. LHLHL%), but until we find further evidence of distinctive
meaning, or lack thereof, we will use all of these tones
◦ LHa at the end of an AP
– Aligned with the end of AP final segment determined from the waveform.
◦ T%
– Marks the end of an IP
– Aligned with the end of IP final segment determined form the waveform
Intelligent Robot Lecture Note
76
Text-to-Speech
Symbolic Prosody
• Phonetic tone tier
►
►
IP boundary tones: the same as those in the phonological tone tier.
AP tones: 14 types of surface tonal patterns
◦ (LHa, LHHa, LLHa, HLHa, HHa, HLa, LHLa, HHLa, HLLa, LLa, HHLHa,
LHLHa, HHLLa, LHLLa)
◦ Labeled by 3 AP initial tones (L, H, +H) and 3 AP final tones (La, Ha, L+)
◦ These 6 tones are not always realized, but the combination of these 6
tones can represent all 14 tonal types
Intelligent Robot Lecture Note
77
Text-to-Speech
Symbolic Prosody
Schematic F0 contours of 8 boundary tones of IP
Schematic F0 contours of 14 tonal patterns of AP
Intelligent Robot Lecture Note
78
Text-to-Speech
Symbolic Prosody
• The break index tier
►
Represents the degree of juncture perceived between each pair of words
►
4 break indices
◦
◦
◦
◦
►
3: a strong phrasal disjuncture such as an IP
2: minimal phrasal disjuncture such as an AP
1: phrase-internal word boundaries
0: a juncture smaller than a word boundary
3 more break index’s for a mismatch between the perceived degree of
juncture and the tonal pattern
◦ 3m: used when the juncture is 3 but has an AP tonal pattern
◦ 2m: used when the juncture is 2 but has either no AP tonal pattern or the
tonal pattern of an IP
◦ 1m: used when the juncture is 1 but there is an AP tonal pattern
►
“-” diacritic affixed directly to the right of the higher break index
◦ 1-: uncertainty between 0 and 1
◦ 2-: uncertainty between 1 and 2
Intelligent Robot Lecture Note
79
Text-to-Speech
Symbolic Prosody
• Miscellaneous tier
►
Contains labeler comments concerning events such as silence,
audible breathing, laughter, or other disfluencies.
“These days, that kind of church, eh, Year 2000, millennium…”
Intelligent Robot Lecture Note
80
Text-to-Speech
Reading List
• Allen, J., M.S. Hunnicutt, and D.H. Klatt, From Text to Speech: the
MITalk System, 1987, Cambridge, UK, University Press.
• Black, A. and A. Hunt, “Generating F0 Contours from toBI labels
using Linear Regression,” Proc, of the Int. Conf, on Spoken
Language Processing, 1996, pp. 1385-1388.
• Fujisaki, H. and H, Sudo, “A generative Model of the Prosody of
Connected Speech in Japanese,” annual Report of Eng. Research
Institute, 1971, 30, pp. 75-80.
• Hirst, D.H., “The Symbolic Coding of Fundamental Frequency
Curves: from Acoustics to Phonology,” Proc. Of Int, Symposium on
Prosody, 1994, Yokohama, Japan.
• Huang, X., et al., “Whistler: A Trainable Text-to-Speech System,”
Int, Conf. on Spoken Language Processing, 1996, Philadelphia,
PA, pp. 2387-2390.
Intelligent Robot Lecture Note
81
Text-to-Speech
Reading List
• Jun, S., K-ToBI (Korean ToBI) labeling conventions (version 3.1),
http://www.linguistics.ucla.edu/people/jun/ktobi/K-tobi.html, 2000.
• Monaghan, A. “State-of-the-art summary of European synthetic
prosody R&D,” Improvements in Speech Synthesis. Chichester:
Wiley, 1993, 93-103.
• Murray, I. and J. Arnott, “Toward the Simulation of Emotion in
Synthetic Speech: A Review of the Literature on Human Vocal
Emotion,” Journal Acoustical Society of Ameriac, 1993, 93(2), pp.
1097-1108.
• Ostendorf, M., and N. Veilleux, “A Hierarchical Stochastic Model
for Automatic Prediction of Prosodic Boundary Location,”
Computational Linguistics, 1994, 20(1), pp. 27-54.
Intelligent Robot Lecture Note
82
Text-to-Speech
Reading List
• Plumpe, M. and S. Meredith, “which is More Important in a
Concatenative Text-to-Speech System: Pitch, Duration, or Spectral
Discontinuity,” Third ESCA/COCOSDA Int. Workshop on Speech
Synthesis, 1998, Jenolan Caves, Australia, pp. 231-235.
• Silverman, K., The Structure and processing of fundamental
Frequency Contours, Ph.
D. Thesis, 1987, University of Cambridge, Cambridge, UK.
• Steedman, M., “Information Structure and the Syntax-Phonology
Interface,” Linguistic Inquiry, 2000.
• van Santen, J., “Assignment of Segmental Duration in Text-toSpeech Synthesis,” computer Speech and Language, 1994, 8, pp.
95-128.
• W3C, Speech Synthesis Markup Requirements for Voice Markup
Languages, 2000, http://www.w3.org/TR/voice-tts-reqs/.
Intelligent Robot Lecture Note
83
Descargar

PowerPoint Template - Intelligent Software Lab.