Collecting and interpreting acceptability judgments using Magnitude Estimation
Caroline Heycock with Zakaris Svabo Hansen and Antonella Sorace
University of Edinburgh
NLVN-course/NORMS-seminar, Tórshavn, Faroe Islands, 8–16 August 2008

Outline
• Why do we need acceptability judgments?
• What are the problems with acceptability judgments?
• How can Magnitude Estimation help with any of these problems?
• Exemplification from ongoing studies on Faroese (and related languages)

Why do we need judgment data?
• There is no direct way to access I-language (the speaker’s knowledge of their language); we need to triangulate from all available sources of data.
• Corpus data typically
  – aggregate across speakers
  – include performance errors
  – allow no straightforward distinction between nonoccurring and ungrammatical
  – may not exist

Validity
Judgments are also a type of behaviour, known to be affected by
  – processing constraints
  – personality and mental state
  – presentation (order, context, mode)
  – absolute vs relative task
  – linguistic training

Reliability
• Interspeaker variation
  – This may or may not be considered a problem of reliability, depending on assumptions about individuals’ grammars, but it is at least a methodological problem
• Intraspeaker inconsistency

Conventional measurements of acceptability
• Judgments of linguistic acceptability usually form category scales (ok/*) or limited ordinal scales (ok/?/?*/*), (1,2,3,4,5)
• These scales require absolute rating judgments, rather than relative ranking judgments
• Ordinal scales provide no information about the relative distance between adjacent points on the scale

Problems arising with conventional scales for acceptability judgments
• Limited in their range of values
• Lack of statistical power
  – These scales cannot be analysed using parametric statistics, because this type of analysis requires the data to be on at least an interval scale.
• Inconsistency
  – Even trained linguists use diacritics in different ways. Comparison between different studies is extremely difficult.
• Uninterpretability
  – What do the middle points on a rating scale actually mean?
  – How can we distinguish between lack of certainty and intermediate acceptability?

Judgment data: interpreting midpoints
Thráinsson 2003, Petersen 2000
Clause types: +bridge compl, -bridge compl, Relative, Indirect question, Adverbial clause
The nine value rows below are given in source order; their pairing with the five clause types is not marked.

        V-Adv                  Adv-V
   √      ?      *        √      ?      *
  34%    33%    33%      75%    21%     4%
  66%     7%    26%      92%     0%     8%
  14%    41%    45%      82%    14%     4%
  25%     6%    69%      98%     0%     2%
   5%    31%    64%      81%    17%     2%
   3%     0%    97%     100%     0%     0%
   5%    32%    63%      74%    21%     5%
   0%     0%   100%     100%     0%     0%
  39%    37%    24%      81%    17%     2%

M[agnitude] E[stimation] in psychophysics
• ME is an experimental technique used to determine quickly and easily how much of a given sensation a person is having.
• In an ME experiment subjects are presented with a standard stimulus (a modulus) and are asked to express the magnitude by a number.
• They are then presented with a series of stimuli that vary in intensity and are asked to assign each of the stimuli a number relative to the modulus.

ME in psychophysics
• Subjects assign a number:
  – to the modulus to reflect magnitude of pertinent characteristics (length, loudness, brightness)
  – to each successive stimulus to indicate apparent magnitude relative to the first (or to a previous stimulus)

ME in psychophysics: Scaling
• Scaling in ME is not about absolute accuracy of judgments;
• Scaling is about the relative relationships between judgments of stimuli of different intensities.

ME in psychophysics: modalities
• The numerical modality is the most common but other modalities are possible (e.g. line length).
• Other modalities can be more user-friendly, particularly if you are testing people who (think they) are numerically-challenged.

ME in psychophysics: can people do it?
• Many magnitude estimation experiments use a control condition in which subjects are asked to perform magnitude estimations of the length of a line.
• Magnitude estimations of line length have been shown to be proportional to the actual length of the lines.

ME in Linguistics
• Unlike other dimensions, linguistic acceptability has no obvious “physical” continuum to plot against subjects’ impressions.
• However, Bard, Robertson & Sorace 1996 have applied standard cross-modality matching techniques and were able to show that the technique is reliable.

Typical instructions
• Here’s an example of what the instructions look like...

Instructions
The purpose of this exercise is to get you to judge the acceptability of some English sentences. You will see a series of sentences on the screen. These sentences are all different. Some will seem perfectly okay to you, but others will not. What we're after is not what you think of the meaning of the sentence, but what you think of the way it's constructed.
• Your task is to judge how good or bad each sentence is by assigning a number to it.
• You can use any number that seems appropriate to you. For each sentence after the first, assign a number to show how good or bad that sentence is in proportion to the reference sentence.
For example, if the first sentence was:
  (1) cat the mat on sat the.
and you gave it a 1, and if the next example:
  (2) the dog the bone ate.
seemed 20 times better, you'd give it twenty. If it seems half as good as the reference sentence, give it the number 0.5.
• You can use any range of positive numbers you like including, if necessary, fractions or decimals.
• You should not restrict your responses to, say, an academic marking scale.
• You may not use minus numbers or zero, of course, because they aren't proper multiples or fractions of positive numbers.
• If you forget the reference sentence don't worry; if each of your judgments is in proportion to the first, you can judge the new sentence relative to any of them that you do remember.
• There are no 'correct' answers, so whatever seems right to you is a valid response. Nor is there a 'correct' range of answers or a 'correct' place to start.
• Any convenient positive number will do for the reference.
• We are interested in your first impressions, so don't spend too long thinking about your judgment.
Remember:
• Use any number you like for the first sentence.
• Judge each sentence in proportion to the reference sentence.
• Use any positive numbers you think appropriate.

Choices about the modulus: face validity
• The experimenter has the option of assigning a fixed number to the modulus.
• Another option is to leave the modulus in sight throughout the experiment.
• This option has good face validity, but it isn’t clear to what extent it affects the ultimate reliability of the estimates.
• People don’t need to remember the modulus; if they are making judgments proportionally, the reference point shifts as they move on.

Advantages of quasi-randomization
• The experimenter can impose constraints on the randomization to prevent certain experimental items from occurring consecutively.
• The modulus can be chosen to represent an intermediate degree of acceptability.
• A number (or a line) of intermediate size can be assigned to the modulus by the experimenter.

Timed vs untimed ME
• Timing the intervals between sentences may reduce the likelihood that people consult metalinguistic or prescriptive knowledge.
• Intervals have to be different for non-native speakers: they have to be piloted carefully.

Varying the instructions
• There is a tendency in some people to use a fixed (usually 10-point) scale. This is possibly because of familiarity with school marking systems.
• If the instructions contain an explicit warning against using a restricted range of numbers, the tendency is much reduced.
• People are very sensitive to instructions: these have to be as explicit and clear as possible.
• A detailed practice session is essential!

Advantages
• ME yields interval scales, which allow the use of parametric statistics
• Mathematical operations can be applied to the estimates, allowing:
  – a direct indication of the speaker’s ability to discriminate between more or less acceptable sentences
  – a direct measure of the strength of speakers’ preferences

Advantages
• Informants are enabled to express their intuitions without any restrictions of the judgment scale.
• They are asked to provide purely comparative judgments: these are relative both to a reference item and the individual subject’s own previous judgments.
• At no point is an absolute criterion of grammaticality applied.
• The subjects themselves fix the value of the reference item relative to which subsequent judgments are made.

Advantages
• The scale used by informants is open-ended and has no minimum division: subjects can always add a further highest score or produce an additional intermediate rating.
• The result is that subjects are able to produce judgments which distinguish all and only the differences they perceive.

Data analysis: normalisation
ME data need to be normalized because people use different ranges of estimates.
• Raw magnitude values are often transformed into logs in order to yield a normal distribution.
• Each number is divided by the modulus that the subject had assigned to the reference sentence, or alternatively the z-scores are used.
• Any statistical package can easily do these transformations.

Faroese
Some questions:
1. Do current speakers of Faroese have V-to-I as part of their competence grammar(s)? That is, do they allow the order Finite Verb > Negation in all types of subordinate clause?
2. Do current speakers of Faroese allow “generalised embedded Verb Second” (V2)? That is, do they allow a wide range of subordinate clauses to begin with something other than the subject?
3. With respect to these phenomena, how is Faroese situated with respect to Icelandic and Danish?

How acceptable is V-I in Faroese?
We looked at the effect of two variables and their interaction (2 within-subjects variables, 2 and 3 levels):
• Order
  – Verb-Adverb
  – Adverb-Verb
• Type of “adverb”
  – Negation (ikki)
  – “High” adverb (kanska)
  – “Low” adverb (ofta)
These orders were all contained in relative clauses.

Examples
• Adverb: Negation   Order: V-Adv
  Hatta er filmurin, sum Hanus hevur ikki sæð
  That is film-def that Hanus has neg seen
• Adverb: Negation   Order: Adv-V
  Hetta er brævið, sum Elin ikki hevur lisið
  That is letter-def that Elin neg has read
• Adverb: Low Adv   Order: V-Adv
  Hetta er lagið, sum Teitur hevur ofta spælt
  That is piece-the that Teitur has often played
• Adverb: Low Adv   Order: Adv-V
  Hatta er sangurin, sum Eivør ofta hevur sungið
  That is song-def that Eivør often has sung

How “generalized” is V2 in Faroese?
We looked at the effect of two variables and their interaction (2 within-subjects variables, 2 and 5 levels):
• Order
  – Subject-Initial
  – Adjunct-Initial
• Clause type
  – Main clause
  – “Bridge verb” complement
  – Nonbridge verb A complement (regret, admit)
  – Nonbridge verb B complement (deny, doubt, be proud)
  – Indirect question

Examples
• Clause Type: Bridge   Order: Subject-Initial
  Lív segði, at hon kom seint til arbeiðis í gjár
  Lív said that she came late to work yesterday
• Clause Type: Bridge   Order: Adjunct-Initial
  Beinir segði, at í morgin kemur hann seint til arbeiðis
  Beinir said that tomorrow comes he late to work
• Clause Type: NonBridge B   Order: Subject-Initial
  Sámal noktaði, at hann hevði verið alla náttina á barrini í fleiri førum
  Sámal denied that he had been all night in bar-def frequently
• Clause Type: NonBridge B   Order: Adjunct-Initial
  Einar noktaði, at í fleiri forum hevði hann drukkið alla náttina á barrini
  Einar denied that frequently had he drunk all night in bar-def

Faroese 1 vs Faroese 2: geographic?
• In Jonas 1996 it is argued that there are two distinct “dialects” in Faroese:
  – Faroese 1, which optionally allows V-to-I
  – Faroese 2, which does not allow V-to-I
• Jonas suggests that these two dialects may correlate both with age and with dialect area: Faroese 1 more common in the southern islands, and among older speakers.
• We investigated the geographic dialect suggestion by collecting data from 25 subjects from Tórshavn (North) and 22 subjects from Suðuroy (South). Subjects were, as much as possible, matched for age.
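
As a rough illustration of how per-group mean z-scores for the North and South subjects might be computed from raw estimates, the sketch below log-transforms each subject's estimates and converts them to within-subject z-scores, following the normalisation options listed earlier. The field names, the choice of base-10 logs and of the population standard deviation, and the toy numbers are all assumptions made for illustration; none of them are taken from the project's own analysis.

```python
import math
from collections import defaultdict
from statistics import mean, pstdev


def zscore_log_estimates(rows):
    """Add a within-subject z-score of log10(estimate) to each row.

    rows: list of dicts with keys 'subject', 'group', 'order', 'estimate'
    (estimates must be positive, as the ME instructions require).
    """
    logs_by_subject = defaultdict(list)
    for row in rows:
        logs_by_subject[row["subject"]].append(math.log10(row["estimate"]))

    # Per-subject mean and (population) standard deviation of the logs.
    stats = {s: (mean(v), pstdev(v)) for s, v in logs_by_subject.items()}

    out = []
    for row in rows:
        m, sd = stats[row["subject"]]
        # A subject who gave the same number everywhere gets z = 0.
        z = (math.log10(row["estimate"]) - m) / sd if sd > 0 else 0.0
        out.append({**row, "z": z})
    return out


def mean_z_by_cell(rows):
    """Average the z-scores within each (group, order) cell."""
    cells = defaultdict(list)
    for row in rows:
        cells[(row["group"], row["order"])].append(row["z"])
    return {cell: mean(zs) for cell, zs in cells.items()}


# Hypothetical raw estimates (illustration only, not project data).
# s1 uses small numbers and s2 large ones, but both rate Adv-V
# four times better than V-Adv.
raw = [
    {"subject": "s1", "group": "North", "order": "V-Adv", "estimate": 5},
    {"subject": "s1", "group": "North", "order": "Adv-V", "estimate": 20},
    {"subject": "s2", "group": "South", "order": "V-Adv", "estimate": 50},
    {"subject": "s2", "group": "South", "order": "Adv-V", "estimate": 200},
]

cell_means = mean_z_by_cell(zscore_log_estimates(raw))
for cell in sorted(cell_means):
    print(cell, round(cell_means[cell], 3))
```

In this toy data the two subjects use very different number ranges but the same proportions, so after normalisation their cell means coincide; that is the property that makes estimates comparable across subjects.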
[Figure: Verb position: North vs South. Mean z-scores by position of verb (Verb-Adverb, Adverb-Verb) for the North and South groups.]

No geographic dialect difference
• The main effect of dialect group was not significant
• There was no significant interaction between language group and position of verb, or between language group and type of adverb
• We did not find any evidence for a geographic dialect difference with respect to V-to-I in our subjects

Comparison with Danish, Icelandic
• There is a significant interaction between language and order of the verb with respect to Negation/Adverb.
• I.e. the effect of the different orders is different, depending on the language...

[Figure: Position of the verb. Estimated marginal means by position (Verb-Adverb, Adverb-Verb) for Icelandic, Danish and Faroese.]

Comparing Verb/Adverb orders
• To see whether there is any difference between the different adverbs in terms of whether or not the verb can move past them, we can look at the difference between the Verb-Adverb and Adverb-Verb orders with respect to each of the three adverbs
• We’d expect no difference between verb movement over the three adverbs in Icelandic (all should be good) and in Danish (all should be bad)
• If Faroese is just intermediate between Icelandic and Danish, we’d also expect no effect of the different adverb types here.

[Figure: The effect of verb movement past different adverbs. z-scores by type of adverb (Negation, High Adverb, Low Adverb) for Icelandic, Danish and Faroese.]

Comparing Verb/Adverb orders
• Our Faroese subjects dispreferred the order Finite Verb - Negation in an unambiguously non-V2 context to the same extent that the Danish subjects did.
• However, our Faroese subjects found Verb-Adverb orders better than Verb-Negation orders (this effect was found neither in Danish nor in Icelandic).
• It is possible that, to the extent that IP-internal verb movement is still grammatical in Faroese, for some speakers it is to an intermediate position.

Looking at the effect of V2
The best measure of the effect of V2 is to look at the difference between the Subject-Initial and Adjunct-Initial order, for each clause type. That is, what is the difference between the scores for sentences of type (a) and type (b) for each clause type?
(a) Order: Subject-Initial
  Lív segði, at hon kom seint til arbeiðis í gjár
  Lív said that she came late to work yesterday
(b) Order: Adjunct-Initial
  Beinir segði, at í morgin kemur hann seint til arbeiðis
  Beinir said that tomorrow comes he late to work

[Figure: The effect of V2 in different clause types. Difference in z-scores by clause type (Main, Bridge, NonBridge A, NonBridge B, Ind qu) for Icelandic, Danish and Faroese.]

The effect of V2: Danish
• In Danish there was a significant difference between the effect of V2 in a main clause and after the second category of “nonbridge” verbs (deny, doubt, be proud).
• There was however no significant difference between the effect of V2 in a main clause and after the first category of “nonbridge” verbs (regret, admit).
• Taken together, this suggests that for this language Vikner’s original categorisation of “bridge” verbs for V2 is not correct; instead these results are more consistent with the proposals in Bentzen et al (2007) or Julien (2007).

The effect of V2: Faroese and Icelandic
• In Faroese and Icelandic, however, there is no significant difference between the effect of V2 in a main clause and after the second category of “nonbridge” verbs.
• This suggests that V2 in these languages targets a different projection than in Danish (and the other mainland Scandinavian languages?)

Is apparent V-to-I really V2?
V2:
• Clause Type: Nonasserted   Order: Subject-Initial
  Ronaldo noktar, at hann hevur skrivað undir sáttmála við Liverpool næsta ár
  Ronaldo denies that he has signed contract with Liverpool next year
• Clause Type: Nonasserted   Order: Adjunct-Initial
  Næmingarnir noktaðu, at í fríkorterinum høvdu teir roykt á vesinum
  Students-def denied that in breaks had they smoked in toilets-def
“V-to-I”
• Clause Type: Nonasserted   Order: Negation-Verb
  Handilskvinnan noktaði, at hon ikki hevði læst handilin í gjárkvøldið
  Shopkeeper denied that she not had locked shop-def yesterday evening
• Clause Type: Nonasserted   Order: Verb-Negation
  Sámal noktaði, at hann hevði ikki latið sjálvuppgávuna inn til tíðina
  Sámal denied that he had not handed assignment in on time

[Figure: Effect of “verb movement”. z-scores (log) by clause type ("said...", "denied...", Ind Qu) for V2 and V-to-I.]

Conclusion
• Judgment data are important for linguistic analysis, especially where corpora are not available, but even where they are.
• In investigating language we are always dealing with behaviour, when we want to learn about knowledge. Investigating different types of behaviour may help us to narrow down the range of possibilities.
• Magnitude Estimation is a method for gathering judgment data that allows for a wider range of analytical tools than many other techniques.

All data collected by Zakaris Svabo Hansen for the project Verb movement in contemporary Faroese
http://www.ling.ed.ac.uk/~heycock/faroese-project.shtml
Project funded by the Arts and Humanities Research Council

Some References
• Bard, E.G., Robertson, D. and Sorace, A. 1996. Magnitude estimation of linguistic acceptability. Language 72: 32-68.
• Featherston, S. 2005. Magnitude estimation and what it can do for your syntax: Some wh-constraints in German. Lingua 115: 1525–1550.
• Featherston, S. 2007. Data in generative grammar: the stick and the carrot. Theoretical Linguistics 33(3): 269–318.
• Keller, F. 2003. A psychophysical law for linguistic judgments. Proceedings of the 25th Annual Conference of the Cognitive Science Society. Mahwah: Lawrence Erlbaum.
• Sorace, A. 1996. The use of acceptability judgments in second language research. In V. T. Bhatia and W. Ritchie (eds.) Handbook of Second Language Acquisition. New York: Academic Press, pp. 375-409.
• Sorace, A. & Keller, F. in press. Gradience in linguistic data. To appear in Lingua.
• Sprouse, J. 2007. A program for experimental syntax: Finding the relationship between acceptability and grammatical knowledge. PhD thesis, University of Maryland, College Park.