Whither Phonetic Science?
Why are we doing what we are doing,
and what should we be doing?
Klaus J. Kohler
University of Kiel, Germany
Welcoming address to Sound-to-Sense, Kiel
14 December, 2012
• Welcome
– to Germany
– to Kiel
– to Phonetics and Digital Speech Processing
– the Institute was closed on 1 April 2011
– due to the inscrutable wisdom of our Alma Mater
– but its spirit is still very much alive and kicking
– and, like Phoenix, it is rising from the ashes
– thanks to Oliver Niebuhr's enthusiasm and drive in
speech science research and teaching
• You have come to this discussion meeting
because, in some way or other, you are affiliated
to the EC Marie Curie Research Training Network
Sound to Sense
– either because you actively worked on it
– or because you want to be part of the
interdisciplinary network paradigm which the
funding program developed for the advance of
speech science
• So, this is a good opportunity to reflect on where
phonetic science has got to and where it should be heading
• These questions have been asked at various stages
in the history of speech science.
• The most famous case was JR Pierce in two
papers in JASA (1969, 1970), “Whither speech
recognition?” in connection with ASR
“…before embarking upon such work, the worker
should candidly ask and answer the following
questions: Why am I working in this field? What
particular thing do I hope to accomplish? Why is it
worthwhile? Am I likely to succeed? How will I
know whether or not I have succeeded? Where
will success take or leave me?"
• One and a half decades later, Manfred Schroeder in
the Preface to the Bibliotheca Phonetica volume
Speech and Speaker Recognition [1985] says about
the state-of-the-art of automatic recognition of
speech at the time:
"… one of the main impacts of the computer has been
to demonstrate the manifest inadequacy of superficial
algorithms that take no account of context and
meaning. The simple-minded computer per se was not
the hoped-for cure-all, and speech recognition was in
acute danger of withering in the laboratory rather than
blooming in the field…"
• So, what IS the phonetic scientist’s ultimate goal?
• To find answers to the question “How do humans
communicate with speech in all types of speech
interactions in the languages of the world?”
• This question has always been asked and partial
answers have been proposed
– by creating categories of phonetic description
– but they have always ended up as concepts
abstracted from their original life contexts and
reified in metalinguistic pursuits in their own right
• Let’s have a look at some cornerstones in the history
of phonetic science.
From Sound to Phoneme
• For thousands of years, homo sapiens loquens has
invented ways of capturing the fleeting sound of
spoken words in timeless symbols on durable material.
• The aim of all the systematic writing systems that have
resulted is to represent lexical items in graphic form
– either ideographically, or with reference to sound
units in syllabic or alphabetic scripts
– An alphabetic writing system has been invented
only once, in the Semitic language family.
– All other alphabetic systems are derivatives from it.
• Why should that be so?
• 3-consonant roots for semantic fields of the lexicon, e.g. Arabic k-t-b
kataba "he wrote"
yaktubu "he writes, will write"
maktab "office, desk"
• This was the birth of the “phonemic” principle in
tight association of lexical meaning and form.
• No other language had this, so no other language
developed an indigenous alphabetic script.
• When the phoneticians of the newly-founded IPA
at the end of the 19th c. devised a phonetic alphabet
to indicate pronunciation in languages like English
or French, whose Latin orthographies had become
deficient in the representation of sounds, they
reinvented the phonemic principle
– broad and narrow transcription
• The linguists of the Prague Circle turned this into a
phonological theory with the distinctive phoneme
for the differentiation of the intellectual meaning of
words, and allophonic variation in context.
– They kept the function-form link
– but dissociated it from graphic representation
– and turned it into a principle of sound structures
– every language having its own phonemic system
• The American Structuralists, in their behaviouristic
philosophy, went one step further and removed the
link to meaning, being unable to formalize it.
• Grouping of sounds into phonemes now governed by
– complementary distribution
– phonetic similarity
• But Pike still recognised the original “phonemic
principle” because he gave his book Phonemics the
subtitle “A technique for reducing languages to writing”.
• After that “phonology” became a separate discipline
and had a metalinguistic purpose in itself, practised by
desk phonologists.
• Generative Phonology, Optimality Theory,
Markedness, Feature Hierarchy
• Phonological categories were moved again from
behaviouristic groupings to entities in the ideal
speaker/listener’s mind.
• At this point, psycholinguists got hold of them and
started taking them into the lab for experiments
on “the phoneme as a perceptual unit”.
– This has been the MPI Nijmegen paradigm for
the past 20 years, e.g. in phoneme spotting.
– But is this extrapolation justified?
From Phoneme to Fine Phonetic Detail
• Pronunciation of “white please” vs. “black please”
when ordering coffee
– :z] by a Londoner
– mistaken for pli:z] by a Scottish listener
– expecting pli:z].
• In this situational context, the listener's task was to
understand one of two possible meanings
– wrong understanding triggered by “graveness”
instead of “acuteness” of the sound
– not by wrong phoneme perception.
• Listeners process speech signals with perceptual
categories shaped by attention and memory, not by
abstraction from sound to phoneme
– they aim at understanding messages in all their
facets of meaning, even from incomplete
“segmental” signal information
– stable multidimensional fine phonetic detail plays
an important role
– based on episodic memory, exemplar recognition
and contextual information
• This is mandatory in the processing of reduced
speech, especially of function word form variability.
• Here is an example from the Kiel Corpus of
Spontaneous Speech: OLV g122a009
• I shall first play a stretch of speech that even native
speakers of German will not be able to understand,
which phoneticians find very difficult to represent
as a string of segments, and German phoneticians as
a sequence of phonemes.
• Then I shall add the next stretch which will most
likely trigger understanding of both stretches.
• A third stretch will complete understanding.
• The fine phonetic detail in the stretches will be
nun wollen wir mal kucken, ob Mittwoch frei ist
/u()U()n /
• HUN]is
identified as the verb <kucken>.
• The sound stretch that immediately precedes must
be the modal particle <mal>, which commonly
occurs in verbal context as [ma].
• But then an inflected auxiliary verb must precede.
• The dark vocalic stretch ending in a labiodentalized
nasal, which is in turn followed by [], can be
associated with <wollen wir>, because it commonly
reduces in the direction of VV]. <werden, sollen,
müssen> do not fit.
• The initial stretch of [n] + dark vowel with strong
nasalization across the long vocalic section can be
associated with <nun>.
• The result is an understanding of what in English is
<“Now let’s see if Wednesday is free.”>.
• This theoretical account of how the highly reduced
utterance may be recognised puts sound perception
into an integrated framework of cognitive processing
for the understanding of meaning.
– Phonemes and canonical forms play no role in it.
– Phonetic traces that need not be segmental but may
be spread over indefinite stretches (articulatory
prosodies) trigger the recognition process, in
conjunction with
– morphological, syntactic and situational constraints
– memory of multiple phonetic forms of lexical items
is essential
– complete phonetic identification of acoustic
sequences is not required
– These components of the recognition process must
work in parallel to allow for real-time processing.
– How they are implemented in real situations is an
interesting and pressing question for future
research in cooperation with neuroscientists
(Event-Related Potentials)
• Important suprasegmental articulatory prosodies are
– nasalization
– glottalization
– labialization, labiodentalization
– palatalization, velarization, pharyngealization
[Figure: labelled signal examples; surviving word labels: er><das, <sollen wir><das, wir><das><machen]
• The fact that no role is attributed to phonemes and
canonical forms in speech recognition does not mean
that they are useless concepts.
– The relevance of the phoneme concept in devising
economical alphabetic writing systems has already
been referred to.
– The concept of canonical forms is useful in
compiling pronunciation dictionaries listing
variants under a lexical heading.
– It is also useful for training automatic speech
recognition systems.
• But neither concept should be extrapolated beyond these
specific domains of application without special
justification.
– They are both inappropriate in (semi)automatic
segmentation of acoustic databases for phonetic
research, because they cannot capture articulatory
prosodies, which are essential in speech production
and perception.
° The Munich Automatic Segmentation System
(MAUS) fails to provide annotation files that are
usable for such a research goal.
° At present there is no adequate shortcut to manual
phonetic annotation by competent phoneticians.
• The concept of articulatory prosodies was integrated
into the annotation of the Kiel Corpus of Read and
Spontaneous Speech
n u: -MA n-+ &0 v- O- l- @- n+ &0 -MA v- i:6-6+
&0 m a: l-+ &1^ g-k -h 'U k @- n-N , &0 Q- -q O
-MA p-m+ &2. &2^ m 'I t v O x &1. &2^ f r 'aI
&0 Q- I s t-+ .
• Several publications:
K.J. Kohler, Articulatory prosodies in German
reduced speech, ICPhS 1999
Complementary Phonology – A theoretical frame for
labelling an acoustic database of dialogues,
O. Niebuhr, K.J. Kohler, Perception of phonetic detail
in the identification of highly reduced words, JP 2011
K.J. Kohler, O. Niebuhr, On the role of articulatory
prosodies in German message decoding, Phonetica
• Phonemes and canonical forms are also inappropriate
for gaining insight into speech and language
acquisition, be it L1 or L2
– although they have provided the standard paradigm
– e.g. the Contrastive Structures Series, ed. by
Charles Ferguson
– but MacNeilage, P. The Origin of Speech. 2008;
Frame and Content theory.
Piske, T. Artikulatorische Muster im frühen Laut- und Lexikonerwerb. Tübingen: Gunter Narr (2001)
From Auditory Observation to Signal Analysis
• The technological advance in speech signal
analysis, the spectrograph to start with, and latterly
computer programs,
– inevitably led to taking the phoneme concept
into the lab
– in order to substantiate phonological entities and
structures by objective measurement
– thus to supplement auditory impressions by
testable physical properties
– finally to replace auditory observation altogether
• This development has culminated in Laboratory
Phonology and has publication platforms in Journal
of Phonetics, Laboratory Phonology
– useless questions are asked and badly answered
– e.g. Incomplete Neutralization of voicing in
German final obstruents: rund(e) vs. bunt(e)
– the latest analysis is Röttger, Winter, Grawunder,
The robustness of incomplete neutralization in
German, ICPhS 2011
– in production a difference was found of 8 ms in
vowel duration before voiced/voiceless plosives
– below JND, thus has no communicative value
– in the subsequent perception experiment 8
subjects classified 54% of the /ptkbdg/ stimuli
as voiceless, 46% as voiced
– logistic regression and t-tests gave significant
differences between voiceless and voiced
classification across all stimuli
– however, the distribution of voiceless and voiced
judgements across /ptk/ and /bdg/ separately, i.e.
hits, misses and false alarms, was not tested, and
the frequencies are not given
– but they can be estimated from other indices as
° 56% voiceless and 44% voiced for /ptk/
° 52% voiced and 48% voiceless for /bdg/
– chi-square testing gives no significance for an
association of /ptk/ or /bdg/ stimuli with
voiceless or voiced judgements, nor significant
deviation from equal distribution for /bdg/
– So, the judgements are random
– and therefore neither the results of production nor
of perception have any communicative value
– and the robustness in the title is a phantom.
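The chi-square reasoning above can be retraced with a minimal sketch. Since the judgement frequencies are not given, it assumes a hypothetical 100 judgements per stimulus class, so that the estimated percentages can be read directly as counts. `N_PER_CLASS` and the function names are illustrative assumptions, not taken from the paper under discussion. The statistic grows in proportion to the number of judgements, so the outcome depends on that assumption.

```python
# Pearson chi-square on the ESTIMATED judgement distribution.
# ASSUMPTION: 100 judgements per stimulus class (hypothetical; the real
# frequencies are not reported), so percentages double as counts.
N_PER_CLASS = 100
CRITICAL_DF1_P05 = 3.841  # chi-square critical value for df = 1, alpha = 0.05


def chi2_independence(table):
    """Pearson chi-square (no continuity correction) for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


def chi2_equal_split(counts):
    """Goodness of fit against an equal distribution over the categories."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)


# Rows: stimulus class; columns: judged voiceless, judged voiced.
ptk = [56 * N_PER_CLASS // 100, 44 * N_PER_CLASS // 100]
bdg = [48 * N_PER_CLASS // 100, 52 * N_PER_CLASS // 100]

association = chi2_independence([ptk, bdg])
bdg_vs_chance = chi2_equal_split(bdg)

print(f"association: chi2 = {association:.3f}")
print(f"/bdg/ vs equal split: chi2 = {bdg_vs_chance:.3f}")
print("significant at 0.05:", association > CRITICAL_DF1_P05)
```

Under this assumed sample size both statistics stay well below the critical value, which is the pattern described above.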
• We can well do without such l’art pour l’art
experimentation, which abounds in Laboratory Phonology.
– This is time, effort and public money badly spent.
– It does not advance our knowledge of how people
communicate one bit.
– Sense has to be reintroduced into measurement
From Sound to Sense
• The origin of speech technology after World War II
had of course the communicative component
– communications engineering, technological
development to improve communication
– Speech Communications Conference at MIT 1950
– Menzerath and Meyer-Eppler invited
– >Institut für Phonetik u. Kommunikationsforschung
– Research Laboratory of Electronics, Speech
Communication Group, MIT
– Speech Communication Seminar, Stockholm 1974
– From Sound to Sense: 50+ years of discoveries in
speech communication, MIT 2004
– invited paper by Sarah Hawkins:
Puzzles and patterns in 50 years of research on
speech perception
“It seems reasonable to hope that new theories will
aim to include the following attributes. They should
be biologically plausible; include roles for attention,
memory, and learning; focus on understanding
meaning rather than identifying phonological form;
allow for multiple potential ‘units of perception’,
possibly with no obligatory units; and they should
allow meaning and linguistic structure to be
understood from incomplete information.”
“A … key issue is to re-evaluate the distinction between
bottom-up and top-down information. On the one hand,
fine phonetic information that systematically indicates
linguistic structure should make many model ‘top-down
processes’ unnecessary. For example, fine allophonic
detail can provide segmentation information that makes
top-down use of abstract knowledge about possible
word constraints redundant.
On the other hand, such fine phonetic detail cannot
be used in the absence of top-down knowledge about
how it should be used —for this language, this
accent, this speaker. The traditional distinction
between signal and knowledge is thus likely to be
blurred in future models. This seems entirely
consistent with current understanding of brain
• This is the theoretical background, including the
name, for the EC Marie Curie RTN.
• There is a strong influence from Firthian linguistics.
• This embedding of sound into sense in speech
communication was, and is again, the research and
teaching strategy of Phonetics in Kiel
– and it naturally led to the integration of prosody in
the study of sounds and their phrasal variability
– thus looking at the exchange of meaning between
speakers and listeners with the full array of
phonetic form and substance.
From Sense to Sound
• But we also need to include the complement
– Jakobson, Fant, Halle, Preliminaries to speech
analysis, 1952
“given the evident fact that we speak to be heard
to be understood”
– Speakers transmit meaning
– by coding it in words and syntactic structures
with fine phonetic detail of segments and prosodies
– generating acoustic signals for listeners to decode
• We need to answer two questions:
– How is the phonetic form of words represented
mentally to trigger physiological and articulatory
processes for acoustic sound production?
– What are the rules for producing reduced or
elaborated phonetic forms?
• A global answer to the first question is that the
representation can certainly not be canonical
phonemic form
• essential phonetic elements that define the whole
formal set of a lexical item will need to be specified
(Niebuhr’s phonetic essence)
– this specification must include segmental units as
well as articulatory prosodies
– both are related to lexical, morphological and
speech style categories
– which allow for phonetic under-specification
• e.g. the ending of infinitives and 1st, 3rd persons plural
of the German verb can be specified as [nasal]
– the presence of a preceding vowel depends on a
reduction-elaboration coefficient related to
speaking style and speaking situation, >  > E
– the realization of the nasal as m n N depends on
the preceding vocalic or consonantal stretch
– as in the spontaneous-speech example discussed
earlier, the nasality feature may be realised as an
articulatory prosody on the preceding vocalic
stretch instead of a nasal consonant, when the
reduction coefficient increases in more casual style
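The verb-ending example can be sketched as a small rule function. This is only an illustration under stated assumptions: the reduction coefficient is taken to lie between 0 and 1, the two thresholds are hypothetical, and the output uses SAMPA-like symbols (`@` for schwa, `N` for the velar nasal, `~` for nasalization carried by the preceding vocalic stretch).

```python
# Sketch: realization of a verb ending specified only as [nasal].
# ASSUMPTIONS: reduction coefficient in [0, 1]; both thresholds are
# hypothetical; symbols are SAMPA-like.
LOW_REDUCTION = 0.33   # below this: elaborated form with vowel
HIGH_REDUCTION = 0.66  # at or above this: nasality may become a prosody

# Place of the nasal follows the preceding consonantal stretch.
PLACE_TO_NASAL = {"labial": "m", "alveolar": "n", "velar": "N"}


def realise_nasal_ending(preceding, reduction):
    """Return the ending for a given preceding stretch and reduction level.

    preceding: "labial", "alveolar", "velar" or "vocalic"
    reduction: 0.0 (most elaborated) to 1.0 (most reduced)
    """
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must lie between 0 and 1")
    if reduction < LOW_REDUCTION:
        return "@n"  # vowel present, nasal consonant follows
    if reduction >= HIGH_REDUCTION and preceding == "vocalic":
        return "~"   # no nasal segment, nasalized vocalic stretch instead
    # No vowel: nasal consonant takes its place from the context.
    return PLACE_TO_NASAL.get(preceding, "n")


for context in ("labial", "alveolar", "velar", "vocalic"):
    print(context, [realise_nasal_ending(context, r) for r in (0.1, 0.5, 0.9)])
```

The point of the sketch is that a single under-specified entry plus context and style parameters yields the whole set of surface forms, without a canonical phonemic form being stored.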
• The answer to the second question goes well
beyond descriptive accounts of large databases
(e.g. Kohler, Articulatory dynamics of vowels and
consonants in speech communication, JIPA 2001)
• it needs to include the coupling of reduction/
elaboration with lexical class, morphology, syntax
and speaking style, closely linked to the answer to
the first question
• e.g. the German sequence of preposition +
definite article masc. mit dem has two sets of
phonetic forms:
I. containing the deictic marker [d], as in local and
temporal pointers da, dort, dann and
demonstrative pronouns dieser, der (da):
mI(t) de()m mI dm
II. not containing [d]: mIpm mI(b)m mImm
• II. is appropriate in phrases with generic reference,
e.g. means of transport: mit dem Auto, mit dem Bus,
mit dem Zug, mit dem Flugzeug “by car, bus, train, plane”
• I. has a specific reference, e.g. ich fahr mit dem
Auto, und zwar mit dem BMW meiner Frau “I go
by car, and I take my wife’s BMW”
• These two sets need to have separate mental
representations, because they have different
functions in the transmission of meaning
– both representations must contain mI __ m
– for I. the deictic marker is inserted with
variable vocalic release according to the
situationally determined reduction coefficient
– for II. bilabial plosive interruption of sonority is
possible with any phonation feature.
• Thus mental lexical representation is multivalued.
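A multivalued representation of this kind can be sketched as a shared frame with two function-specific completions. The sketch is hypothetical throughout: the function name, the mapping from reduction coefficient to vocalic release, and the release values themselves are illustrative assumptions, and the strings are SAMPA-like. It does not cover every variant listed above.

```python
# Sketch: two mental representations of "mit dem" sharing the frame mI __ m.
# ASSUMPTIONS: the release inventory and its ordering are hypothetical;
# reduction is a coefficient in [0, 1].
FRAME = ("mI", "m")

# Vocalic releases after the deictic marker, from elaborated to reduced.
RELEASES = ("e:", "e", "")


def realise_mit_dem(reference, reduction=0.0, plosive=""):
    """Return one surface form of "mit dem".

    reference: "specific" (set I, with deictic [d]) or "generic" (set II)
    reduction: 0.0 to 1.0, used only for set I
    plosive:   "", "p" or "b", used only for set II
    """
    onset, coda = FRAME
    if reference == "specific":
        # Deictic marker inserted; vocalic release shrinks with reduction.
        index = min(int(reduction * len(RELEASES)), len(RELEASES) - 1)
        return onset + "d" + RELEASES[index] + coda
    if reference == "generic":
        # Optional bilabial plosive interrupts sonority, any phonation.
        if plosive not in ("", "p", "b"):
            raise ValueError("plosive must be '', 'p' or 'b'")
        return onset + plosive + coda
    raise ValueError("reference must be 'specific' or 'generic'")


print(realise_mit_dem("specific", reduction=0.0))
print(realise_mit_dem("specific", reduction=1.0))
print(realise_mit_dem("generic", plosive="p"))
```

The design choice worth noting is that the reference type selects the representation before any reduction applies, so the forms of set II are never derived from those of set I.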
• You might call this proposition speculative
– but it is no more speculative than the assumption of
underlying canonical forms in the mental lexicon as
a basis for 20 years of MPI Nijmegen perception
experiments
– we simply need to develop the adequate new
experimentation to find answers for it
– which means for researchers to give up cherished
postulates and procedures to move in new directions
– the Sense-to-Sound approach will make it possible.
From Sense to Sound to Sense
• Finally, we have to combine the Speaker’s Sense-to-Sound with the Listener’s Sound-to-Sense in
dialogue interaction.
• At this point, the Propositional, Expressive and
Appeal functions of speech communication and
their prosodic coding come to the fore.
• There is a substantial amount of solid results in this
field resulting from the development of the Kiel
Intonation Model (KIM) over 25 years and its more
recent refinements and additions.
• What needs to be developed in the investigation of
dialogue interaction is a new methodology of data
acquisition that is adaptable to the specific
research questions asked by speech scientists
across the whole field of phonetic science, as
sketched in this paper
– isolated sentences will no longer do in prosodic
research
– we need to work with stylized systematic
dialogue interaction as well as non-systematic
Conversation Analysis data
– a lot has already been done along these lines.
• On the other hand, dialogues cannot be the basis for
analysing articulatory control and coordination
– in spite of the Edinburgh phoneticians’ decision to
buy two EMA machines to allow subjects to
communicate under their helmets.
• It is particularly demanding to devise data acquisition
procedures for systematic natural speech reduction
– controlling speech rate is inadequate, though
commonly used, as reduction reflects reduced effort
– we will have to rely, in the first instance, on
introspection of competent native speakers with
good phonetic awareness, and on large corpus data.
• Whatever data acquisition procedure we use for a
particular phonetic investigation
– we should always ask whether and how we can
extrapolate from Lab to Real Situations
– and we should be careful with generalizing
statements for a whole language or dialect,
– particularly when the data are obtained with
highly invasive techniques.
• If we take the steps I have outlined we will be
progressively providing answers to the question I
raised at the beginning of this talk and developing a
Communicative Phonetic Science
• for which Phonetica provides a publication
platform under the motto
Sounds and Prosodies in Speech Communication
Why are we doing what we are doing?
Never has there been more activity in phonetic science
than today
never have there been more conference meetings and
proceedings as outlets for phonetic research than today
never has more money been poured into short-term
projects on restricted topics than today
never have research institutions complained more
about lack of funding than today
as if good ideas could be generated by money
Isaac Newton was supposedly lying under an apple
tree when he had an idea that revolutionized physics
• never have there been more PhD programs than today
• never has the rat-race among young researchers been
fiercer than today
• never have there been more tread-mill experimental
analyses on phoneme and AM/ToBI bases than today
• but never has there been so little progress on general
theory and modelling of speech communication as
today
• This is the situation James Le Fanu described in an
article in NZZ of 19 June 2011 as “The End of
• So, you need to reflect on why you are doing what
you are doing.
– You need to free yourselves from the downtrodden paradigms that provide you with short-term jobs and with subjects for your dissertations
and theses, and your hurriedly compiled 4-page
congress papers.
– You are in the lucky situation of being affiliated
to a research programme that defined its goal as
developing new models of speech perception,
speech production and speech communication.
• Take the bait and grab the opportunity to let your
work become a contribution to this goal
– contribute to advancing theoretical discussion on
Communicative Phonetic Science
– derive your specific experiments from such a
global theoretical orientation
– rather than accumulating isolated experiments
– and let your thoughts mature, do not rush into yet
another symposium, workshop, conference, etc.
• I wish you a successful start with the discussions at
this Symposium and a fruitful continuation as a
potential Sense-to-Sound-to-Sense Working Group.
