Whither Phonetic Science?
Why are we doing what we are doing, and what should we be doing?
Klaus J. Kohler
University of Kiel, Germany
Welcoming address to Sound-to-Sense, Kiel, 14 December 2012

1 Introduction
• Welcome
  – to Germany
  – to Kiel
  – to Phonetics and Digital Speech Processing
  – the Institute was closed on 1 April 2011
  – due to the inscrutable wisdom of our Alma Mater
  – but its spirit is still very much alive and kicking
  – and, like Phoenix, it is rising from the ashes
  – thanks to Oliver Niebuhr's enthusiasm and drive in speech science research and teaching
• You have come to this discussion meeting because, in some way or other, you are affiliated to the EC Marie Curie Research Training Network Sound to Sense
  – either because you actively worked on it
  – or because you want to be part of the interdisciplinary network paradigm which the funding program developed for the advance of speech science
• So, this is a good opportunity to reflect on where phonetic science has got to and where it should be going.
• These questions have been asked at various stages in the history of speech science.
• The most famous case was J.R. Pierce in two papers in JASA (1969, 1970), “Whither speech recognition?”, in connection with ASR:
  “…before embarking upon such work, the worker should candidly ask and answer the following questions: Why am I working in this field? What particular thing do I hope to accomplish? Why is it worthwhile? Am I likely to succeed? How will I know whether or not I have succeeded? Where will success take or leave me?”
• One and a half decades later, Manfred Schroeder, in the Preface to the Bibliotheca Phonetica volume Speech and Speaker Recognition, says about the state of the art of automatic speech recognition at the time:
  "… one of the main impacts of the computer has been to demonstrate the manifest inadequacy of superficial algorithms that take no account of context and meaning.
  The simple-minded computer per se was not the hoped-for cure-all, and speech recognition was in acute danger of withering in the laboratory rather than blooming in the field…"
• So, what IS the phonetic scientist’s ultimate goal?
• To find answers to the question “How do humans communicate with speech in all types of speech interactions in the languages of the world?”
• This question has always been asked and partial answers have been proposed
  – by creating categories of phonetic description
  – but they have always ended up as concepts abstracted from their original life contexts and reified in metalinguistic pursuits in their own right
• Let’s have a look at some cornerstones in the history of phonetic science.

2 From Sound to Phoneme
• For thousands of years, homo sapiens loquens has invented ways of capturing the fleeting sound of spoken words in timeless symbols on durable material.
• The aim of all the systematic writing systems that have resulted is to represent lexical items in graphic form
  – either ideographically, or with reference to sound units in syllabic or alphabetic scripts
  – An alphabetic writing system has been invented only once, in the Semitic language family.
  – All other alphabetic systems are derivatives from it.
• Why should that be so?
• 3-consonant roots for semantic fields of the lexicon
    k'atab      he wrote
    y'iktib     he writes, will write
    k'aatib     clerk
    k'ataba     clerks
    kit'aab     book
    k'utub      books
    makt'uub    written
    m'aktab     office, desk
    makt'aba    library
• This was the birth of the “phonemic” principle in tight association of lexical meaning and form.
• No other language had this, so no other language developed an indigenous alphabetic script.
• When the phoneticians of the newly-founded IPA at the end of the 19th c.
devised a phonetic alphabet to indicate pronunciation in languages like English or French, whose Latin orthographies had become deficient in the representation of sounds, they reinvented the phonemic principle
  – broad and narrow transcription
• The linguists of the Prague Circle turned this into a phonological theory with the distinctive phoneme for the differentiation of the intellectual meaning of words, and allophonic variation in context.
  – They kept the function-form link
  – but dissociated it from graphic representation
  – and turned it into a principle of sound structures
  – every language having its own phonemic system
• The American Structuralists, in their behaviouristic philosophy, went one step further and removed the link to meaning, being unable to formalize it.
• Grouping of sounds into phonemes was now governed by
  – complementary distribution
  – phonetic similarity
• But Pike still recognised the original “phonemic principle”, because he gave his book Phonemics the subtitle “A technique for reducing languages to writing”.
• After that, “phonology” became a separate discipline with a metalinguistic purpose in itself, practised by desk phonologists.
• Generative Phonology, Optimality Theory, Markedness, Feature Hierarchy
• Phonological categories were moved again, from behaviouristic groupings to entities in the ideal speaker/listener’s mind.
• At this point, psycholinguists got hold of them and started taking them into the lab for experiments on “the phoneme as a perceptual unit”.
  – This has been the MPI Nijmegen paradigm for the past 20 years, e.g. in phoneme spotting.
  – But is this extrapolation justified?

3 From Phoneme to Fine Phonetic Detail
• Pronunciation of “white please” vs. “black please” when ordering coffee
  – :z] by a Londoner
  – mistaken for pli:z] by a Scottish listener
  – expecting pli:z].
• In this situational context, the listener's task was to understand one of two possible meanings
  – wrong understanding was triggered by “graveness” instead of “acuteness” of the sound
  – not by wrong phoneme perception.
• Listeners process speech signals with perceptual categories shaped by attention and memory, not by abstraction from sound to phoneme
  – they aim at understanding messages in all their facets of meaning, even from incomplete “segmental” signal information
  – stable multidimensional fine phonetic detail plays an important role
  – based on episodic memory, exemplar recognition and contextual information
• This is mandatory in the processing of reduced speech, especially of function word form variability.
• Here is an example from the Kiel Corpus of Spontaneous Speech: OLV g122a009
• I shall first play a stretch of speech that even native speakers of German will not be able to understand, which phoneticians find very difficult to represent as a string of segments, and German phoneticians as a sequence of phonemes.
• Then I shall add the next stretch, which will most likely trigger understanding of both stretches.
• A third stretch will complete understanding.
• The fine phonetic detail in the stretches will be discussed.
    V HUN0 00H nun wollen wir mal kucken, ob Mittwoch frei ist /u()U()n / uHUN
• HUN] is identified as the verb <kucken>.
• The sound stretch that immediately precedes must be the modal particle <mal>, which commonly occurs in verbal context as [ma].
• But then an inflected auxiliary verb must precede.
• The dark vocalic stretch ending in a labiodentalized nasal, which is in turn followed by , can be associated with <wollen wir>, because it commonly reduces in the direction of VV]. <werden, sollen, müssen> do not fit.
• The initial stretch of [n] + dark vowel with strong nasalization across the long vocalic section can be associated with <nun>.
• The result is an understanding of what in English is <“Now let’s see if Wednesday is free.”>.
• This theoretical account of how the highly reduced utterance may be recognised puts sound perception into an integrated framework of cognitive processing for the understanding of meaning.
  – Phonemes and canonical forms play no role in it.
  – Phonetic traces that need not be segmental but may be spread over indefinite stretches (articulatory prosodies) trigger the recognition process, in conjunction with
  – morphological, syntactic and situational constraints
  – memory of multiple phonetic forms of lexical items is essential
  – complete phonetic identification of acoustic sequences is not required
  – These components of the recognition process must work in parallel to allow for real-time processing.
  – How they are implemented in real situations is an interesting and pressing question for future research in cooperation with neuroscientists (Event-Related Potentials).
• Important suprasegmental articulatory prosodies are
  – nasalization
  – glottalization
  – labialization, labiodentalization
  – palatalization, velarization, pharyngealization
    <soll er> <das> <machen>
    <sollen wir> <das> <machen>
    <sollten wir> <das> <machen>
• The fact that no role is attributed to phonemes and canonical forms in speech recognition does not mean that they are useless concepts.
  – The relevance of the phoneme concept in devising economical alphabetic writing systems has already been referred to.
  – The concept of canonical forms is useful in compiling pronunciation dictionaries listing variants under a lexical heading.
  – It is also useful for training automatic speech recognisers.
• But neither concept should be extrapolated beyond these specific domains of application without special justification.
  – They are both inappropriate in (semi)automatic segmentation of acoustic databases for phonetic research, because they cannot capture articulatory prosodies, which are essential in speech production and perception.
    ° The Munich Automatic Segmentation System (MAUS) fails to provide annotation files that are usable for such a research goal.
    ° At present there is no adequate shortcut to manual phonetic annotation by competent phoneticians.
• The concept of articulatory prosodies was integrated into the annotation of the Kiel Corpus of Read and Spontaneous Speech
    n u: -MA n-+ &0 v- O- l- @- n+ &0 -MA v- i:6-6+ &0 m a: l-+ &1^ g-k -h 'U k @- n-N , &0 Q- -q O -MA p-m+ &2. &2^ m 'I t v O x &1. &2^ f r 'aI &0 Q- I s t-+ .
• Several publications:
  – K.J. Kohler, Articulatory prosodies in German reduced speech, ICPhS 1999
  – K.J. Kohler, Complementary Phonology – A theoretical frame for labelling an acoustic database of dialogues, ICSLP 1994
  – O. Niebuhr, K.J. Kohler, Perception of phonetic detail in the identification of highly reduced words, JP 2011
  – K.J. Kohler, O. Niebuhr, On the role of articulatory prosodies in German message decoding, Phonetica 2011
• Phonemes and canonical forms are also inappropriate for gaining insight into speech and language acquisition, be it L1 or L2
  – although they have provided the standard paradigm
  – e.g. the Contrastive Structure Series, ed. by Charles Ferguson
  – but see MacNeilage, P. The Origin of Speech. 2008; Frame and Content theory. Piske, T. Artikulatorische Muster im frühen Laut- und Lexikonerwerb. Tübingen: Gunter Narr (2001)

4 From Auditory Observation to Signal Analysis
• The technological advance in speech signal analysis, the spectrograph to start with, and latterly computer programs,
  – inevitably led to taking the phoneme concept into the lab
  – in order to substantiate phonological entities and structures by objective measurement
  – thus to supplement auditory impressions by testable physical properties
  – finally to replace auditory observation altogether
• This development has culminated in Laboratory Phonology and has publication platforms in Journal of Phonetics, Laboratory Phonology
  – useless questions are asked and badly answered
  – e.g.
Incomplete Neutralization of voicing in German final obstruents: rund(e) vs. bunt(e)
  – the latest analysis is Röttger, Winter, Grawunder, The robustness of incomplete neutralization in German, ICPhS 2011
  – in production a difference of 8 ms was found in vowel duration before voiced/voiceless plosives
  – this is below the JND, and thus has no communicative value
  – in the subsequent perception experiment 8 subjects classified 54% of the /ptkbdg/ stimuli as voiceless, 46% as voiced
  – logistic regression and t tests gave significant differences between voiceless and voiced classification across all stimuli
  – however, the distribution of voiceless and voiced judgements across /ptk/ and /bdg/ separately, i.e. hits, misses and false alarms, was not tested, and the frequencies are not given
  – but they can be estimated from other indices as
    ° 56% voiceless and 44% voiced for /ptk/
    ° 52% voiced and 48% voiceless for /bdg/
  – chi-square testing gives no significance for an association of /ptk/ or /bdg/ stimuli with voiceless or voiced judgements, nor a significant deviation from equal distribution for /bdg/
  – So, the judgements are random
  – and therefore neither the results of production nor of perception have any communicative value
  – and the robustness in the title is a phantom.
• We can well do without such l’art pour l’art experimentation, which abounds in Laboratory Phonology.
  – This is time, effort and public money badly spent.
  – It does not advance our knowledge of how people communicate one bit.
  – Sense has to be reintroduced into measurement.

5 From Sound to Sense
• The origin of speech technology after World War II had of course the communicative component incorporated
  – communications engineering, technological development to improve communication
  – Speech Communications Conference at MIT 1950
  – Menzerath and Meyer-Eppler invited
  – > Institut für Phonetik u.
Kommunikationsforschung
  – Research Laboratory of Electronics, Speech Communication Group, MIT
  – Speech Communication Seminar, Stockholm 1974
  – From Sound to Sense: 50+ years of discoveries in speech communication, MIT 2004
  – invited paper by Sarah Hawkins: Puzzles and patterns in 50 years of research on speech perception
    “It seems reasonable to hope that new theories will aim to include the following attributes. They should be biologically plausible; include roles for attention, memory, and learning; focus on understanding meaning rather than identifying phonological form; allow for multiple potential ‘units of perception’, possibly with no obligatory units; and they should allow meaning and linguistic structure to be understood from incomplete information.”
    “A … key issue is to re-evaluate the distinction between bottom-up and top-down information. On the one hand, fine phonetic information that systematically indicates linguistic structure should make many model ‘top-down processes’ unnecessary. For example, fine allophonic detail can provide segmentation information that makes top-down use of abstract knowledge about possible word constraints redundant. On the other hand, such fine phonetic detail cannot be used in the absence of top-down knowledge about how it should be used – for this language, this accent, this speaker. The traditional distinction between signal and knowledge is thus likely to be blurred in future models. This seems entirely consistent with current understanding of brain functioning.”
• This is the theoretical background, including the name, for the EC Marie Curie RTN.
• There is a strong influence from Firthian linguistics.
• This embedding of sound into sense in speech communication was, and is again, the research and teaching strategy of Phonetics in Kiel
  – and it naturally led to the integration of prosody in the study of sounds and their phrasal variability
  – thus looking at the exchange of meaning between speakers and listeners with the full array of phonetic form and substance.

6 From Sense to Sound
• But we also need to include the complement
  – Jakobson, Fant, Halle, Preliminaries to speech analysis, 1952: “given the evident fact that we speak to be heard to be understood”
  – Speakers transmit meaning
  – by coding it in words and syntactic structures with fine phonetic detail of segments and prosodies
  – generating acoustic signals for listeners to decode
• We need to answer two questions:
  – How is the phonetic form of words represented mentally to trigger physiological and articulatory processes for acoustic sound production?
  – What are the rules for producing reduced or elaborated phonetic forms?
• A global answer to the first question is that the representation can certainly not be canonical phonemic form.
• Essential phonetic elements that define the whole formal set of a lexical item will need to be specified (Niebuhr’s phonetic essence)
  – this specification must include segmental units as well as articulatory prosodies
  – both are related to lexical, morphological and speech style categories
  – which allow for phonetic under-specification
• e.g.
the ending of infinitives and 1st, 3rd persons plural of the German verb can be specified as [nasal]
  – the presence of a preceding vowel depends on a reduction-elaboration coefficient related to speaking style and speaking situation, > > E
  – the realization of the nasal as m, n or N depends on the preceding vocalic or consonantal stretch
  – as in the spontaneous-speech example discussed earlier, the nasality feature may be realised as an articulatory prosody on the preceding vocalic stretch instead of a nasal consonant, when the reduction coefficient increases in more casual style
• The answer to the second question goes well beyond descriptive accounts of large databases (e.g. Kohler, Articulatory dynamics of vowels and consonants in speech communication, JIPA 2001).
• It needs to include the coupling of reduction/elaboration with lexical class, morphology, syntax and speaking style, closely linked to the answer to the first question.
• e.g. the German sequence of preposition + definite article masc. mit dem has two sets of realizations
  I. containing the deictic marker [d], as in local and temporal pointers da, dort, dann and demonstrative pronouns dieser, der (da): mI(t) de()m, mI dm
  II. not containing [d]: mIpm, mI(b)m, mImm
• II. is appropriate in phrases with generic reference, e.g. means of transport: mit dem Auto, mit dem Bus, mit dem Zug, mit dem Flugzeug “by car, bus, train, plane”
• I. has a specific reference, e.g. ich fahr mit dem Auto, und zwar mit dem BMW meiner Frau “I go by car, and I take my wife’s BMW”
• These two sets need to have separate mental representations, because they have different functions in the transmission of meaning
  – both representations must contain mI __ m
  – for I. the deictic marker is inserted with variable vocalic release according to the situationally determined reduction coefficient
  – for II. bilabial plosive interruption of sonority is possible with any phonation feature.
• Thus mental lexical representation is multivalued.
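As a purely illustrative sketch, not a model proposed in this talk, the under-specified [nasal] verb ending described above could be written as a small Python function. The function name, the numeric reduction coefficient between 0 and 1, the threshold of 0.5 and the restriction to plosive and nasal stem-final consonants are all assumptions made for the example. Symbols are SAMPA. The further stage in which nasality is realised as an articulatory prosody on the preceding vocalic stretch is not modelled.

```python
# Hypothetical sketch: realisation of the under-specified [nasal] ending
# of German infinitives and 1st/3rd person plural verb forms (SAMPA symbols).

# Simplified place classes for the stem-final consonant (assumption:
# only plosives and nasals are listed; everything else counts as "other").
LABIAL = {"p", "b", "m"}
VELAR = {"k", "g", "N"}


def realise_nasal_ending(stem_final: str, reduction: float,
                         threshold: float = 0.5) -> str:
    """Return a SAMPA string for the [nasal] ending.

    stem_final: last segment of the verb stem, e.g. "b" in hab-en.
    reduction:  hypothetical reduction coefficient, 0 (elaborated) to 1 (reduced).
    threshold:  hypothetical point above which the vowel is dropped.
    """
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must lie between 0 and 1")
    if reduction < threshold:
        # Elaborated style: vowel present, nasal stays alveolar.
        return "@n"
    # Reduced style: no vowel; the nasal takes its place of
    # articulation from the preceding consonantal stretch.
    if stem_final in LABIAL:
        return "m"
    if stem_final in VELAR:
        return "N"
    return "n"


if __name__ == "__main__":
    for stem, final in [("ha:b", "b"), ("za:g", "g"), ("bi:t", "t")]:
        for r in (0.2, 0.8):
            print(stem, r, stem + realise_nasal_ending(final, r))
```

Under these assumptions the same lexical specification yields several surface forms: the single entry [nasal] maps to "@n" in elaborated style and to "m", "n" or "N" in reduced style, depending on the stem-final consonant.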
• You might call this proposition speculative
  – but it is no more speculative than the assumption of underlying canonical forms in the mental lexicon as a basis for 20 years of MPI Nijmegen perception research
  – we simply need to develop adequate new experimentation to find answers for it
  – which means that researchers must give up cherished postulates and procedures to move in new directions
  – the Sense-to-Sound approach will make it possible.

7 From Sense to Sound to Sense
• Finally, we have to combine the Speaker’s Sense-to-Sound with the Listener’s Sound-to-Sense in dialogue interaction.
• At this point, the Propositional, Expressive and Appeal functions of speech communication and their prosodic coding come to the fore.
• There is a substantial amount of solid results in this field resulting from the development of the Kiel Intonation Model (KIM) over 25 years and its more recent refinements and additions.
• What needs to be developed in the investigation of dialogue interaction is a new methodology of data acquisition that is adaptable to the specific research questions asked by speech scientists across the whole field of phonetic science, as sketched in this paper
  – isolated sentences will no longer do in prosodic research
  – we need to work with stylized systematic dialogue interaction as well as non-systematic Conversation Analysis data
  – a lot has already been done along these lines.
• On the other hand, dialogues cannot be the basis for analysing articulatory control and coordination
  – in spite of the Edinburgh phoneticians’ decision to buy two EMA machines to allow subjects to communicate under their helmets.
• It is particularly demanding to devise data acquisition procedures for systematic natural speech reduction
  – controlling speech rate is inadequate, though commonly used, as reduction reflects reduced effort
  – we will have to rely, in the first instance, on introspection of competent native speakers with good phonetic awareness, and on large corpus data.
• Whatever data acquisition procedure we use for a particular phonetic investigation
  – we should always ask whether and how we can extrapolate from Lab to Real Situations
  – and we should be careful with generalizing statements for a whole language or dialect,
  – particularly when the data are obtained with highly invasive techniques.
• If we take the steps I have outlined, we will be progressively providing answers to the question I raised at the beginning of this talk and develop a Communicative Phonetic Science
• for which Phonetica provides a publication platform under the motto Sounds and Prosodies in Speech Communication

8 Why are we doing what we are doing?
• Never has there been more activity in phonetic science than today
• never have there been more conference meetings and proceedings as outlets for phonetic research than today
• never has more money been poured into short-term projects on restricted topics than today
• never have research institutions complained more about lack of funding than today
  – as if good ideas could be generated by money
  – Isaac Newton was supposedly lying under an apple tree when he had an idea that revolutionized physics
• never have there been more PhD programs than today
• never has the rat-race among young researchers been fiercer than today
• never have there been more treadmill experimental analyses on phoneme and AM/ToBI bases than today
• but never has there been so little progress on general theory and modelling of speech communication as today.
• This is the situation James Le Fanu described in an article in NZZ of 19 June 2011 as “The End of Science”.
• So, you need to reflect on why you are doing what you are doing.
  – You need to free yourselves from the well-trodden paradigms that provide you with short-term jobs and with subjects for your dissertations and theses, and your hurriedly compiled 4-page congress papers.
  – You are in the lucky situation of being affiliated to a research programme that defined its goal as developing new models of speech perception, speech production and speech communication.
• Take the bait and grab the opportunity to let your work become a contribution to this goal
  – contribute to advancing theoretical discussion on Communicative Phonetic Science
  – derive your specific experiments from such a global theoretical orientation
  – rather than accumulating isolated experiments
  – and let your thoughts mature; do not rush into yet another symposium, workshop, conference, etc.
• I wish you a successful start with the discussions at this Symposium and a fruitful continuation as a potential Sense-to-Sound-to-Sense Working Group.