Experimental Design for Linguists
Charles Clifton, Jr.
University of Massachusetts Amherst
Slides available at http://people.umass.edu/cec/teaching.html

Goals of Course
► Why should linguists do experiments?
► How should linguists do experiments? Part 1: General principles of experimental design
► How should linguists do experiments? Part 2: Specific techniques for (psycho)linguistic experiments

Schütze, C. (1996). The empirical basis of linguistics. Chicago: University of Chicago Press.
Cowart, W. (1997). Experimental syntax: Applying objective methods to sentence judgments. Thousand Oaks, CA: Sage Publications Inc.
Myers, J. L., & Well, A. D. (in preparation). Research design and statistical analysis (3rd ed.). Mahwah, NJ: Erlbaum.

1. Acceptability judgments
► Check theorists’ intuitions about acceptability of sentences
  Acceptability, grammaticality, naturalness, comprehensibility, felicity, appropriateness…
► Aren’t theorists’ intuitions solid?

Example of acceptability judgment: Cowart, 1997
► Subject extraction: I wonder who you think (that) likes John.
► Object extraction: I wonder who you think (that) John likes.
[Figure: mean judged acceptability (z-score) for No-That vs. That, under Subject Extraction and Object Extraction]

Stability of ratings (Cowart, 1997)

2. Sometimes linguists are wrong…
► Superiority effects
  I’d like to know who hid it where.
  *I’d like to know where who hid it.
► Ameliorated by a third wh-phrase?
  ?I’d like to know where who hid it when.
  …maybe.

Paired-comparison preference judgments
  a. I’d like to know who hid it where.
  b. (*)I’d like to know where who hid it.
  c. (*)I’d like to know where who hid it when.
  d. I’d like to know who hid it where when.

  Comparison   First preferred   Second preferred
  a vs. b      86%               14%
  b vs. c      76%               24%
  c vs. d      49%               51%

  a-b: basic superiority violation
  b-c: head-on comparison; extra wh “when” hurts, doesn’t help
  c-d: the “ameliorated” superiority violation, c, seems good when compared to its non-superiority-violation counterpart
Clifton, C. Jr., Fanselow, G., & Frazier, L. (2006). Amnestying superiority violations: Processing multiple questions. Linguistic Inquiry, 37, 51-68.

Another instance…
Question: is the antecedent of an ellipsis a syntactic or a semantic object? Why is (a) good and (b) bad?
  (a) The problem was to have been looked into, but obviously nobody did.
  (b) #The problem was looked into by John, and Bob did too.
Andrew Kehler’s suggestion: semantic objects for cause-effect discourse relations, syntactic objects for resemblance relations. Corpus data bear his suggestion out.

But an experimental approach…
  Kim looked into the problem even though Lee did. (causal, syntactic parallel)
  Kim looked into the problem just like Lee did. (resemblance)
  The problem was looked into by Kim even though Lee did. (causal, nonparallel)
  The problem was looked into by Kim just like Lee did. (resemblance)
[Figure: mean acceptability rating (5 = good) for Causal vs. Resemblance relations, Parallel vs. NonParallel antecedents]
Frazier, L., & Clifton, C. J. (2006). Ellipsis and discourse coherence. Linguistics and Philosophy, 29, 315-346.

Context effects
► Linguists: think of minimal pairs
► The contrast between a pair may affect judgments
► Hirotani: Production of Japanese sentences
  The experimental context in which sentences are produced affects their prosody

Hirotani experiment (# = Major phrase boundary)
a. Embedded wh-question (ka associated to na’ni-o)
  Mi’nako-san-wa   Ya’tabe-kun-ga
  Minako-Ms.-TOP   Yatabe-Mr.-NOM
  na’ni-o    moyasita’ka  (#)  gumon-sita’-nokai?
  what-ACC   burned-Q          stupid question-did-Q (-wh)
  ‘Did Minako ask stupidly what Yatabe burned?’ (‘Yes, it seems (she) asked such a question.’)
b. Matrix wh-question (ndai associated to na’ni-o)
  Mi’nako-san-wa   Ya’tabe-kun-ga
  Minako-Ms.-TOP   Yatabe-Mr.-NOM
  na’ni-o    moyasita’ka  (#)  gumon-sita’-ndai?
  what-ACC   burned-Q          stupid question-did-Q (+wh)
  ‘What did Minako ask stupidly whether Yatabe burned?’ (‘The letters (he) received from (his) ex-girlfriend.’)

Hirotani results
Percentage of insertion of MaP before phrase with question particle
                      Initial Block (pure)   Final Block (pair contrast)
  Embedded question   100%                   100%
  Matrix question     57%                    15%
Hirotani, Mako. (submitted). Prosodic phrasing of wh-questions in Japanese.

3. Unacceptable grammaticality
► Old multiple self-embedding sentence experiments
  Miller & Isard 1964: sentence recall, right-branching vs. self-embedded (1-4)
  ► She liked the man that visited the jeweler that made the ring that won the prize that was given at the fair.
  ► The prize that the ring that the jeweler that the man that she liked visited made won was given at the fair.
  ► Median trial of first perfect recall: 2.25 vs. never
  Stolz 1967, clausal paraphrases: subjects never understood the self-embedded sentences anyway
Miller, G. A., & Isard, S. (1964). Free recall of self-embedded English sentences. Information and Control, 7, 292-303.
Stolz, W. (1967). A study of the ability to decode grammatically novel sentences. Journal of Verbal Learning and Verbal Behavior, 6, 867-873.

3’. Acceptable ungrammaticality
Speeded acceptability judgment and acceptability rating
  a. OK: None of the astronomers saw the comet, but John did. (83% OK; rating 4.36)
  b. Embedded VP: Seeing the comet was nearly impossible, but John did. (66% OK; rating 3.71)
  c. VP w/ trace: The comet was nearly impossible to see, / but John did. (44% OK; rating 3.27)
  d. Neg adj: The comet was nearly unseeable, / but John did. (17% OK; rating 2.21)
Arregui, A., Clifton, C. J., Frazier, L., & Moulton, K. (2006). Processing elided verb phrases with flawed antecedents: The recycling hypothesis. Journal of Memory and Language, 55, 232-246.

4. Provide additional evidence about linguistic structure
► A direct experimental reflex of structure would be nice
  But we don’t have one
► Are traces real?
  Filled gap effect: reading slowed at us in
    My brother wanted to know who Ruth will bring (t) us home to at Christmas.
  compared to
    My brother wanted to know if Ruth will bring us home to at Christmas.
Stowe, L. (1986). Parsing wh-constructions: Evidence for on-line gap location. Language and Cognitive Processes, 1, 227-246.

Are traces real, cont.
► Pickering and Barry: “no.”
► Possible evidence
  That’s the pistol with which the heartless killer shot the hapless man yesterday afternoon t.
  That’s the garage with which the heartless killer shot the hapless man yesterday afternoon t.
► Reading disrupted at shot in the second example, far before the trace position
  But who’s to say that the parser has to wait to project the trace?
Pickering, M., & Barry, G. (1991). Sentence processing without empty categories. Language and Cognitive Processes, 6, 229-259.
Traxler, M. J., & Pickering, M. J. (1996). Plausibility and the processing of unbounded dependencies: An eye-tracking study. Journal of Memory and Language, 35, 454-475.

5. Is grammatical knowledge used?
► Serious question early on
  “psychological reality” experiments
► Direct experimental attack did not succeed
  Derivational theory of complexity
► Indirect experimental attack has succeeded
  Build experimentally-based theory of processing

6. Test theories of how grammatical knowledge is used
► Moving beyond modularity debate: more articulated questions about real-time use of grammar
► Phillips: parasitic gaps, self-paced reading
  The superintendent learned which schools/students the plan to expand _ … overburdened _.
  (slowed at expand after students: plausibility effect)
  The superintendent learned which schools/students the plan that expanded _ … overburdened _.
  (no differential slowing at expand: no plausibility effect)
Phillips, C. (2006). The real-time status of island phenomena. Language, 82, 795-823.

II: How to do experiments. Part 1, General design principles
► Dictum 1: Formulate your question clearly
► Dictum 2: Keep everything constant that you don’t want to vary
► Dictum 3: Know how to deal with unavoidable extraneous variability
► Dictum 4: Have enough power in your experiment
► Dictum 5: Pay attention to your data, not just your statistical tests

Dictum 1: Formulate your question clearly
► Independent variable: variation controlled by the experimenter, not by what the subject does
► Dependent variable: variation observed in the subject’s behavior, perhaps dependent on the IV
► Operationalization of variables

Formulate your question
► Question: Do you identify a focused word faster than a non-focused word?
  Must clarify: Syntactic focus? Prosodic focus? Semantic focus?
  Must operationalize
  ► Syntactic focus: Clefting? Fronting? Other device?
  ► Prosodic focus: Natural speech? Manipulated speech? Synthetic speech? Target word or context?

Formulate your question
► Question: does discourse context guide or filter parsing decisions?
  Clarify question: does discourse satisfy reference? establish plausibility? set up pragmatic implications? create syntactic structure biases?
  Operationalize IV: Lots of choices here
  ► But also have to worry about dependent variable…

Choose appropriate task, DV
► Question about focus: need measure of speed of word identification
  Conventional possibilities: lexical decision, naming, phoneme detection, reading time
► Question about “guide vs. filter”: probably need explicit theory of your task
  Tanenhaus: linking hypothesis
  E.g., eye movements in reading: tempting to think that “guide” implicates “early measures,” “filter” implicates “late measures.”
  ► But what’s early, what’s late? Need model of eye movement control in parsing.

Subdictum A: Never leave your subjects to their own devices
► It may not matter a lot
  Cowart example: 5-point acceptability rating
  ► A. “…base your responses solely on your gut reaction”
  ► B. “…would you expect the professor to accept this sentence [for a term paper in an advanced English course]?”
► But sometimes it does matter…
Cowart 1997

Dictum 2: Try to keep everything constant except what you want to vary
► Try to hold extraneous variables constant through norms, pretests, corpora…
► When you can’t hold them constant, make sure they are not associated (confounded) with your IV

An example: Staub, in press
Eyetracking: does the reader honor intransitivity? Compare unaccusative (a), unergative (b), and optionally transitive (c)
  a. When the dog arrived the vet_1 and his new assistant took off the muzzle_2.
  b. When the dog struggled the vet_1 and his new assistant took off the muzzle_2.
  c. When the dog scratched the vet_1 and his new assistant took off the muzzle_2.
Critical regions: held constant (the vet…; took off the muzzle).
Manipulated variable (verb): conditions equated on average length and average word frequency of occurrence.
Better: match on additional factors (number of stressed syllables, concreteness, plausibility as intransitive, …)
Better: don’t just have overall match, but match the items in each triple.
Staub, A. (in press). The parser doesn't ignore intransitivity, after all. Journal of Experimental Psychology: Learning, Memory and Cognition.

Another example: NP vs. S-comp bias
Kennison (2001), eyetracking during reading of sentences like:
  a. The athlete admitted/revealed (that) his problem worried his parents…
  b. The athlete admitted/revealed his problem because his parents worried…
Conflicting results from previous research (Ferreira & Henderson, 1990; Trueswell, Tanenhaus, & Kello, 1993): does a bias toward use as S-complement (admit) reduce the disruption at the disambiguating word worried?
Problems in previous research: plausibility of direct object analysis not controlled (e.g., Trueswell et al., ambiguous NP (his problem) rated as implausible as direct object of S-biased verb)
Kennison: normed material, equated plausibility of subject-verb-object fragment for NP- and S-comp biased verbs; found reading disrupted equally at disambiguating verb worried for both types of verbs.
Kennison, S. M. (2001). Limitations on the use of verb information during sentence comprehension. Psychonomic Bulletin & Review, 8, 132-137.

What happens when there is unavoidable variation?
► Subdictum B: When in doubt, randomize
  Random assignment of subjects to conditions
  Questionnaire: order of presentation of items?
  ► Single randomization: problems
  ► Different randomization for each subject
  ► Constrained randomizations
► Equate confounds by balancing and counterbalancing
  Alternative to random assignment of subjects to conditions: match squads of subjects
  Counterbalancing of materials

► Counterbalancing
  Ensure that each item is tested equally often in each condition.
  Ensure that each subject receives an equal number of items in each condition.
► Why is it necessary?
  Since items and subjects may differ in ways that affect your DV, you can’t have some items (or subjects) contribute more to one level of your IV than another level.

Sometimes you don’t have to counterbalance
► If you can test each subject on each item in each condition, life is sweet
► E.g., Ganong effect (identification of consonant in context)
  Vary VOT in 8 5-ms steps
  ► /dais/ - /tais/
  ► /daip/ - /taip/
  Classify initial segment as /d/ or /t/
  ► Present each of the 80 items to each subject 10 times
  ► Ganong effect: biased toward /t/ in “type,” /d/ in “dice”
Connine, C. M., & Clifton, C., Jr. (1987). Interactive use of information in speech perception. Journal of Experimental Psychology: Human Perception and Performance, 13, 291-299.
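
The two counterbalancing requirements stated above (each item tested equally often in each condition, each subject receiving an equal number of items per condition) can be met by rotating items through conditions across questionnaire versions. The following is a minimal sketch of that rotation; the function name `counterbalanced_versions` and the 0-based condition codes are illustrative assumptions, not part of the original slides or of any particular experiment package.

```python
def counterbalanced_versions(n_items, n_conditions):
    """Build one item-to-condition assignment per questionnaire version.

    Hypothetical helper: in version v, item i is shown in condition
    (i + v) % n_conditions, a simple rotation (Latin-square) scheme.
    """
    if n_items % n_conditions != 0:
        # Without a multiple, some conditions would get more items than others.
        raise ValueError("n_items must be a multiple of n_conditions")
    return [
        [(item + version) % n_conditions for item in range(n_items)]
        for version in range(n_conditions)
    ]


if __name__ == "__main__":
    # 8 items, 4 conditions: 4 versions, 2 items per condition in each version.
    for v, assignment in enumerate(counterbalanced_versions(8, 4), start=1):
        print(f"version {v}: {assignment}")
```

Assigning equal numbers of subjects to each version then gives every item the same number of observations in every condition.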
If you have to counterbalance…
► Simple example
  Questionnaire, 2 conditions, N items
  Need 2 versions, each with N items, N/2 in condition 1, remaining half in condition 2
  ► Versions 1 and 2, opposite assignment of items to conditions
► More general version
  M conditions, need some multiple of M items, and need M different versions
  ► Embarrassing if you have 15 items, 4 conditions…
  ► That means that some subjects contributed more to some conditions than others did; bad, if there are true differences among subjects

Counterbalancing things besides items
► Order of testing
  Don’t test all Ss in one condition, then the next condition…
  At least, cycle through the conditions once before testing a second subject in any condition
  Fancier, Latin square
  ► Avoid minor confound if always test cond 1 before cond 2 etc.
  ► N x N square, sequence x squad, containing condition numbers, such that each condition occurs once in each column, each order
► Location of testing
  E.g., 2 experiment stations

So how do you randomize?
► E-mail me ([email protected]) and I’ll send you a powerful program
► But for most purposes, check out http://www-users.york.ac.uk/~mb55/guide/randsery.htm or http://www.randomizer.org/index.htm

Factor out confounds
► Factorial design
  An example, discussed earlier: Arregui et al., 2006
  Initial experiment contained a confound; corrected in second experiment by adding a second factor

Arregui et al., rating study
Acceptability rating
                                                                           Rating   Rating clause 1
  a. OK            None of the astronomers saw the comet, but John did.    4.36     4.53
  b. Embedded VP   Seeing the comet was nearly impossible, but John did.   3.71     4.41
  c. VP w/ trace   The comet was nearly impossible to see, but John did.   3.27     4.81
  d. Neg adj       The comet was nearly unseeable, but John did.           2.21     4.39
Arregui, A., Clifton, C. J., Frazier, L., & Moulton, K. (2006). Processing elided verb phrases with flawed antecedents: The recycling hypothesis. Journal of Memory and Language, 55, 232-246.

Factorial Design
  First clause       Ellipsis absent                            Ellipsis present
  Syntactically OK   None of the astronomers saw the comet.     ..but John did.
  Embedded VP        Seeing the comet was nearly impossible.    ..but John did.
  VP with trace      The comet was nearly impossible to see.    ..but John did.
  Nominalization     The comet was nearly unseeable.            ..but John did.
Factor 1: syntactic form of initial clause (4 levels)
Factor 2: presence or absence of ellipsis (2 levels)

An interaction
Interaction: The size of the effect of one factor differs among the different levels of the other factor.
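
The definition of an interaction can be made concrete with the condition means reported in the Arregui et al. rating table above. This is only a descriptive sketch: it assumes that the "Rating clause 1" column is the ellipsis-absent rating and the "Rating" column is the ellipsis-present rating, and the dictionary layout and the function name `ellipsis_effects` are illustrative. It says nothing about statistical reliability.

```python
# Condition means copied from the Arregui et al. rating table.
# Assumption: "Rating clause 1" = ellipsis absent, "Rating" = ellipsis present.
CELL_MEANS = {
    "OK":          {"absent": 4.53, "present": 4.36},
    "Embedded VP": {"absent": 4.41, "present": 3.71},
    "VP w/ trace": {"absent": 4.81, "present": 3.27},
    "Neg adj":     {"absent": 4.39, "present": 2.21},
}


def ellipsis_effects(cell_means):
    """Effect of Factor 2 (ellipsis) at each level of Factor 1 (syntactic form)."""
    return {
        form: round(cell["absent"] - cell["present"], 2)
        for form, cell in cell_means.items()
    }


if __name__ == "__main__":
    effects = ellipsis_effects(CELL_MEANS)
    for form, effect in effects.items():
        print(f"{form:12s} cost of ellipsis: {effect:.2f}")
    # If there were no interaction, these four differences would be about equal.
    spread = max(effects.values()) - min(effects.values())
    print(f"range of the ellipsis effect across forms: {spread:.2f}")
```

The cost of ellipsis grows from 0.17 rating points for the syntactically OK antecedent to 2.18 for the negative-adjective antecedent, which is the pattern the definition describes: the effect of one factor differs across the levels of the other.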
[Figure: mean acceptability rating (0-5) for OK, Embed VP, VP trace, and Nominal first clauses, Ellipsis absent vs. Ellipsis present]

Factorial Designs in Hypothesis Testing
► Cowart (1997), that-trace effect
  Question: is it bad to extract a subject over that?
  ► ?I wonder who you think (that) t likes John.
  Acceptability judgment: worse with that
► But: underlying theory talks just about extracting a subject. Does acceptability suffer with extraction of object over that?
  ► I wonder who you think (that) John likes t.
  Need to do factorial experiment
  ► Factor 1: presence vs. absence of that
  ► Factor 2: subject vs. object extraction

The results (from before)
[Figure: mean judged acceptability (z-score) for No-That vs. That, under Subject Extraction and Object Extraction]
A clear interaction.

A worry about scales
► Interactions of the form “the effect of Factor A is bigger at Level 1 than at Level 2 of Factor B”
  Cowart: effect of that bigger at subject than object extraction
► Types of scales
  Ratio: true zero, equal intervals, can talk about ratios (time, distance, weight)
  Interval: equal intervals, but no true zero (temperature, dates on a calendar)
  Ordinal: only more or less (ratings on rating scale, measures of acceptability, measures of difficulty)

[Figure: the same means for Factor 1-1 and Factor 1-2 across Factor 2-1 and Factor 2-2, plotted on the original scale (0-1000) and on a log scale (0-3)]
Is there really an interaction?

Disordinal and crossover interactions
[Figure: two panels plotting Factor 1-1 and Factor 1-2 across Factor 2-1 and Factor 2-2]

An example of an important but problematic experiment: Frazier & Rayner, 1982
Closure:
  LC: Since Jay always jogs a mile and a half this seems like a short distance to him. (40, 40 ms/ch)
  EC: Since Jay always jogs a mile and a half seems like a very short distance to him. (35, 54 ms/ch)
Attachment:
  MA: The lawyers think his second wife will claim the entire family inheritance. (36 ms/ch)
  NMA: The second wife will claim the entire family inheritance belongs to her. (37, 51 ms/ch)
Data shown: ms/character first pass times for the colored regions.
Problems???
Frazier, L., & Rayner, K. (1982). Making and correcting errors during sentence comprehension: Eye movements in the analysis of structurally ambiguous sentences. Cognitive Psychology, 14, 178-210.

Dictum 3: Know how to deal with unavoidable extraneous variability
► i.e., know some statistics
► Measures of central tendency (“typical”)
  Mean (average, sum/N)
  Median (middle value)
  Mode (most frequent value)
► Measures of variability
  Variance (Average squared deviation from mean)
  Average deviation (Average absolute deviation from median)

Computation of Variance
  Distr 1                          Distr 2
  X      Mean-X   Sq’d             X      Mean-X   Sq’d
  7      9        81               12     4        16
  10     6        36               14     2        4
  30     -14      196              21     -5       25
  17     -1       1                17     -1       1
  Sum 64          314              Sum 64          46
  Mean 16                          Mean 16
  Variance 78.5                    Variance 11.5

Variance in an experiment
► Systematic variance: variability due to manipulation of IV and other variables you can identify
► Random variance: variability whose origin you’re ignorant of
► Point of inferential statistics: is there really variability associated with IV, on top of other variability? Is there a signal in the noise?

Best way to deal with extraneous variability: Minimize it!
► Keep everything constant
  Reduce experimental noise
  ► See the signal more easily
  Keep environment, instructions, distractions, experimenter, response manipulanda, etc. constant
  Pretest subjects and select homogeneous ones, if that suits your purposes

One way to minimize extraneous variance: Within-subject designs
► Subjects differ
  …a lot, in some measures, e.g., reading speed, reaction time
► Present all levels of your IV to each subject
  Assume the subject effect is a constant across all the levels.
  Differences among conditions thus abstracted from subject differences
► Counterbalancing necessary
  Test each item in each condition for an equal number of subjects.
► Worry about experience changing what your subject does
  E.g., will reading an unreduced relative clause (The horse that was raced past the barn fell) affect reading of a reduced relative clause sentence?

Statistical tests/statistical inference
► Never expect observed condition means to be exactly the same
  Just noise? Or signal + noise?
► Statistical inference: is there really a signal?
  p value: the probability you’d obtain a difference among the means that is as large as what you observed, if the true signal is zero
  “null hypothesis” test

Basic logic of statistical tests (t, F, etc.)
► Get one estimate of the variability due to noise + any signal
  Estimate from the variation among the observed mean values in the different conditions
► Get another estimate of the variability due to noise alone
  Estimate from how much variation there is among subjects, within a condition
► If signal = 0, the ratio of the first estimate to the second is expected to be 1
  If it’s enough bigger than 1, then the signal is likely to be non-zero

Underlying model
► Subjects are a random sample from some population
► You can make inferences about variability in the population from the observed variability in the sample
► Logical inference: “if the size of the signal in the population is zero, the probability of getting a difference among the means that is as big as we observed is p” where p is the level of significance
  If p is small enough, reject the proposal that the population signal is zero

Between-subject design
► Estimate of signal + noise: variability among the condition means
► Estimate of noise alone: variability among the subject means in each condition
► F = MS(between conditions) / MS(within condition)
  MS, not exactly variance; must divide sum of squares by df, not by N

Within-subject design
► Estimate of signal + noise: variability between the condition means
► Estimate of noise alone
  Get a measure of the variability among condition means for each subject
  Calculate the variability among these measures
  Subjects x treatment interaction
  ► How much the size of the treatment effect differs among subjects is an estimate of error variability.
► F = MS(between conditions) / MS(subjects x treatments)

Advanced topics
► Multi-factor designs, tests for interactions
► Treat counterbalancing factors as factors in ANOVA
  E.g., if you have 4 conditions and 4 counterbalancing groups, differing in assignment of items to conditions, you can treat groups as a between-subject factor and pull out variability due to items from the subjects x treatment error term
► Statistical accommodation of extraneous variation
  Analysis of covariance
  Multi-level, hierarchical designs
Pollatsek, A., & Well, A. D. (1995). On the use of counterbalanced designs in cognitive research: A suggestion for a better and more powerful analysis. Journal of Experimental Psychology: Learning, Memory, and Cognition, 21, 785-794.
Forthcoming special issue of the Journal of Memory and Language on new and alternative data analyses.

Dictum 4: Have enough power to overcome extraneous variability
► Add more data!
  Minimizes noise component of differences among condition means
► Law of large numbers
  The larger the sample size, the more probable it is that the sample mean comes arbitrarily close to the population mean
  If you’re (almost) looking at population means, any differences have to be real, not sampling error

Law of large numbers
► Imagine a population with a variance v².
► Imagine you take a bunch of independent samples from this population, each sample of size N.
► Each sample will have a mean value.
► These mean values will have a variance, which turns out to be v²/N.
► This variance will be smaller as N gets larger.
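
The claim that sample means have variance v²/N can be checked with a small simulation. The sketch below is an illustration only: it assumes a normally distributed population with standard deviation 10 (so v² = 100), and the function name, seed, and number of simulated samples are arbitrary choices rather than anything taken from the slides.

```python
import random
import statistics


def variance_of_sample_means(pop_sd, n, n_samples, rng):
    """Draw n_samples independent samples of size n from a normal population
    with standard deviation pop_sd and return the variance of their means."""
    means = [
        statistics.fmean(rng.gauss(0.0, pop_sd) for _ in range(n))
        for _ in range(n_samples)
    ]
    return statistics.pvariance(means)


if __name__ == "__main__":
    rng = random.Random(2024)  # fixed seed so the run is repeatable
    pop_sd = 10.0              # assumed population SD, so v^2 = 100
    for n in (5, 25):
        observed = variance_of_sample_means(pop_sd, n, 4000, rng)
        print(f"N = {n:2d}: observed {observed:.2f}, expected v^2/N = {pop_sd ** 2 / n:.2f}")
```

With these settings the observed variance of the means should land near 20 for N = 5 and near 4 for N = 25, although the exact figures depend on the random draws.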
A sampling simulation
► The effect of sample size on the variability of sample means
  Bigger samples, smaller variability
  Standard deviation = square root of variance
[Figure: simulated sampling distributions of the mean for N = 5 and N = 25]
http://onlinestatbook.com/stat_sim/index.html

Means from larger Ns have less noise
► Holds for subject means
  More subjects, means reflect vagaries of sample less; means have less noise
► Holds for item means too
  More items, means less affected by peculiarity of individual items
► OK, you can have too many items and burn out your subjects

Have enough power…
► Back to holding everything constant
  First reason: don’t want variables confounded with our independent variable
  Second reason: minimize noise. Less noise, more power.

Dictum 5: Pay attention to your data, not just your statistical tests
► Look at your data, graph them, try to make sense out of them
  Don’t just look for p < .05!
► Examine confidence intervals
► Look at your data distributions
  Stem and leaf graphs
  By subjects…

Confidence intervals
► Confidence intervals (of means over items and subjects)
  If you have a sample mean and you know the true standard deviation of the sample mean, σM, you can say that there is a 95% chance that the true population mean is within +/- 1.96 * σM of your sample mean.
  But of course you don’t know σM, so you have to estimate it from your data and use the t distribution.
  But then you can present your means as X +/- CI
► A simulation: http://onlinestatbook.com/stat_sim/index.html

Confidence Intervals
► Do you want to look at individual item data?
  Don’t make too much of the tea leaves
  Consider getting a confidence interval on the individual item means
Example: Self-paced reading time
► Cond 1: This table is slightly dirty and the manager wants it removed. (minimum standard adjective)
► Cond 2: This table is slightly clean and the manager wants it removed.
(maximum standard adjective)
► Reading time, clause 2, slower for maximum than minimum standard adjective
Are some items more effective than others?

Confidence intervals, individual items
► Each item has 12 different observations (different subjects) in each condition.
► Can measure the variability among these subject data points for max std and min std adjective
  And from that, estimate the variability of the difference, and from that, the confidence interval of the difference

  Max std adj   Min std adj   Mean Max   Mean Min   Diff    95% CI Diff
  Clean         Dirty         1960       1486       474     +/- 662
  Safe          Dangerous     1572       1258       314     +/- 303
  Healthy       Sick          1635       1164       471     +/- 485
  Dry           Wet           1130       1229       -99     +/- 427
  Complete      Incomplete    1196       1617       -421    +/- 621

Dictum 5: Pay attention to your data, not just your statistical tests
► Graph your data
► Examine confidence intervals
► Look at the distributions of your means
  Stem and leaf graphs
  By subjects…and by items

Maria asked Bob to invite Fred or Sam to the barbecue. She didn't have enough room to invite both.
  Frequency   Stem & Leaf
   3.00       0 . 778
  17.00       1 . 01111222333334444
  15.00       1 . 556677788889999
  11.00       2 . 00011122333
   1.00       2 . 7
   1.00       Extremes (>=3286)
  Stem width: 1000.00
  Each leaf: 1 case(s)

Maria asked Bob not to invite Fred or Sam to the barbecue. She didn't have enough room to invite both.
  Frequency   Stem & Leaf
   1.00       0 . 9
  13.00       1 . 1112223334444
  22.00       1 . 5555555666677777777788
   9.00       2 . 000111344
   1.00       2 . 7
   2.00       Extremes (>=3120)
  Stem width: 1000.00
  Each leaf: 1 case(s)

Maria asked Bob to invite Fred or Sam to the barbecue. She didn't have enough room to invite both.
Maria asked Bob not to invite Fred or Sam to the barbecue. She didn't have enough room to invite both.
By items
  Frequency   Stem & Leaf
   7.00       1 . 1122234
  12.00       1 . 556666667788
   4.00       2 . 1123
   1.00       Extremes (>=2629)
  Stem width: 1000
  Each leaf: 1 case(s)

  Frequency   Stem & Leaf
   1.00       Extremes (=<950)
   4.00       1 . 3334
  14.00       1 . 56666777777888
   4.00       2 . 1222
   1.00       Extremes (>=2562)
  Stem width: 1000.00
  Each leaf: 1 case(s)

Variation among items
► Treat items as a random sample from some population.
  Just like we treat subjects as a random sample
► Then do statistical tests to generalize to this population of items.
  “F1” and “F2”
► Criticisms
  Should generalize simultaneously to subjects and items, using F’.
  ► But must estimate F’ unless you have data from every subject on every condition of every item (min F’; Clark, 1973)
  We’re fooling ourselves when we view items as anything like a random sample from a population.

Alternatives to F1 and F2
► Some conventional ANOVA designs do permit generalization to subjects and items without full data
  But generally lack power
► Coming trend: multilevel, hierarchical designs
  Complex regression-based analyses of individual data points, not subject- or item-means.

But what if you recognize that random sampling from a population of items is nutty?
► What you really want is to show that your effects hold for most or all of your items and aren’t due to a couple of oddballs
  F2 tests are a crude attempt to do this.
► People are struggling to find a better way.
  One possibility, from Ken Forster: plot effect size vs. effect rank, see if it is pleasingly regular.

Forster, “What is F2 good for”
► Plot effect size (difference between two conditions) against rank of effect size (suggested by Peter Killeen)
  Both cases: a 5 msec mean effect size
  Left panel: a limited effect (add 100 ms to 5 items)
  Right panel: a general effect (add 5 ms to 100 items)
[Figure: two panels plotting EFFECT SIZE against RANK; fits reported as R2 = 0.9502 and R2 = 0.4491]
Forster, K. (2007). What is F2 good for? Round 2. Unpublished ms, University of Arizona.

Bogartz, 2007
► Effect size vs. rank effect size, Clifton et al. JML 2003
  Effect of ambiguity (absence of relative pronoun) on sentences with relative clauses (The man [who was] paid by the parents was unreasonable)
  Contrasted with Monte Carlo data based on same mean and variance as experimental data
Bogartz, R. (2007). Fixed vs. random effects, extrastatistical inference, and multilevel modeling. Unpublished manuscript, University of Massachusetts.

III. How to do experiments, Part 2: Experimental procedures
► Acceptability judgment
► Interpretive choices
► Stops making sense
► Self-paced reading
► Eyetracking during reading
► ERP
► Secondary tasks
► Speed-accuracy tradeoff tasks
► Eyetracking during listening (“visual world”)

Choose task that is appropriate for your question
► Is this really a sentence of English?
► Does some variable affect how a sentence is understood?
► Is there some difficulty in understanding this sentence?
► Just where in the sentence does the difficulty appear?
► Where in processing does the difficulty appear?
► Can we observe consequences of processing other than difficulty?
► and more…

Acceptability judgment
► Simple written questionnaire
  See Schütze, Cowart for lots of examples
  Worry about instructions
  Rating scales
  ► Is seven the magical number?
► Magnitude estimation
  Basis in psychophysics: attempt to build an interval scale

Magnitude estimation: an example
Which man did you wonder when to meet?
Assign an arbitrary number to that item, greater than zero.
Now, for each of the following items, assign a number. If the item is better than the first one, use a larger number; if it’s worse, smaller. Make the number proportional to how much better or worse the item is than the original; if twice as good, make the number 2x the start; if 1/3 as good, make the number 1/3 as big as the start.

Magnitude estimation: an example
► Which man did you wonder when to meet?
  Assign an arbitrary number, greater than 0, to this first item.
  Now, for each successive item, assign a number: bigger if the item is better, smaller if worse, and proportional; if the item is 2x as good, make the number 2x the original; if ¼ as good, make the number ¼ as big as the original.
► Which book would you recommend reading?
► When do you know the man whom Mary invited?
► This is a paper that we need someone who understands.
► With which pen do you wonder when to write.
► Who did Bill buy the car to please?
Bard, E. G., Robertson, D., & Sorace, A. (1996). Magnitude estimation of linguistic acceptability. Language, 72, 32-68.

On-line and web-based questionnaires
► WebExp: http://www.webexp.info
► Subject scheduling systems option
► Advantages: Big N, easy, broader population
► Disadvantages: you have to worry about control

Speeded acceptability judgment
► Time pressure; discourage navel-examining
► Measure reaction time and acceptability
► Example: is given-new order more acceptable than new-given?
  Maybe so. Maybe not always.

Given-New: DefNP-IndefNP
  a. All the players were watching an umpire. The pitcher threw the umpire a ball.
New-Given: IndefNP-DefNP
  b. The catcher tossed a ball to the mound. The pitcher threw an umpire the ball.
Given-New: DefNP-IndefPP
  c. The catcher tossed a ball to the mound. The pitcher threw the ball to an umpire.
New-Given: IndefNP-DefPP
  d. All the players were watching an umpire. The pitcher threw a ball to the umpire.
[Figure: two panels, reaction time (ms) and percent accepted, for Given-New vs. New-Given orders in NP-NP and NP-PP structures]
Clifton, C. J., & Frazier, L. (2004). Should given information come before new? Yes and no. Memory & Cognition, 32, 886-895.
Choice of interpretation
► Paper and pencil or speeded
► Multiple-choice or paraphrase
► Example: interpretation of ellipsis
  Full stop effect
► Auditory questionnaire
  Relative size of intonational phrase boundary
► Strengths: does indicate whether a variable has an effect or not
► Weaknesses: don’t know when the effect operates
  Worst case: subject says sentence to self, mulls it over, reacts to the prosody s/he happened to impose

Example of interpretation questionnaire: VPE
John said Fred went to Europe and Mary did too. What did Mary do?
  …went to Europe 60%
  …said Fred went to Europe 40%
John said Fred went to Europe. Mary did too. What did Mary do?
  …went to Europe 45%
  …said Fred went to Europe 55%

Frazier, L., & Clifton, C. Jr. (2005). The syntax-discourse divide: Processing ellipsis. Syntax, 8, 154-207.

Who arrived? Johnny and Sharon’s [ip] inlaws. (0 ip)
Who arrived? Johnny [ip] and Sharon’s [ip] inlaws. (ip ip)
Who arrived? Johnny [IPh] and Sharon’s [ip] inlaws. (IPh ip)
Alternative answers: Sharon’s inlaws and Johnny; Sharon’s and Johnny’s inlaws

Clifton, C. J., Carlson, K., & Frazier, L. (2002). Informative prosodic boundaries. Language and Speech, 45, 87-114.

Stops-making-sense task
► Word-by-word, self-paced, but for each word make one of two responses: OK, BAD
► Get cumulative proportion of BAD responses and OK RT
► Sensitive to point of difficulty in a sentence

Example of SMS
Which client/prize did the salesman visit while in the city? (transitive)
Which child/movie did your brother remind to watch the show? (object control)

Boland, J., Tanenhaus, M., Garnsey, S., & Carlson, G. (1995). Verb argument structure in parsing and interpretation: Evidence from wh-questions. Journal of Memory and Language, 34, 774-806.
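A sketch of the "cumulative proportion of BAD responses" measure from the stops-making-sense task. It assumes that a trial ends at the first BAD response, so each trial can be summarized by the word position where BAD was pressed (or None if the whole sentence was accepted). The function name and the trial data are hypothetical.

```python
def cumulative_bad_proportion(stop_positions, n_words):
    """Proportion of trials judged BAD at or before each word position.

    stop_positions: one entry per trial; the 1-based word position at which
        BAD was pressed, or None if every word was judged OK.
    n_words: number of word positions in the sentence frame.
    """
    n_trials = len(stop_positions)
    if n_trials == 0:
        raise ValueError("need at least one trial")
    proportions = []
    for position in range(1, n_words + 1):
        stopped = sum(
            1 for stop in stop_positions if stop is not None and stop <= position
        )
        proportions.append(stopped / n_trials)
    return proportions

# Hypothetical data: 8 trials of a 7-word sentence in one condition.
trials = [None, 4, None, 5, 4, None, None, 6]
curve = cumulative_bad_proportion(trials, 7)
```

Plotting one such curve per condition shows the word position at which the conditions start to diverge.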
Stops-making-sense task
► Strengths
  Begins to address processing dynamics questions
  Can get both time and choice as relevant data
► Weaknesses
  Very slow reading time: 500 to 900 ms/word typically
  Permits more analysis than is done in normal reading

Self-paced reading
► Word by word self-paced reading
  Generally noncumulative
  Sometimes in place (“RSVP”), sometimes moving across screen
  Time strongly affected by length of word, frequency of word
  ► Can statistically adjust
► Variant: phrase by phrase self-paced reading

SPR methods
► Computer programs
  E-prime (www.pstnet.com)
  Dmastr/DMDX (http://www.u.arizona.edu/~kforster/dmastr/dmastr.htm)
  Others (PsyScope, Superlab, various homemade systems)

SPR Evaluation
► Cheap and effective
  Don Mitchell, trailblazing technique
► Slower than normal reading
  Perhaps 180 words per minute reading
  Unless reader clicks fast and buffers…
► Often get effect on word following critical word
  Spillover
► Phrase-by-phrase: overcomes these difficulties, but you lose precision

Mitchell, D. C. (2004). On-line methods in language processing: Introduction and historical review. In M. Carreiras & C. J. Clifton (Eds.), The on-line study of sentence comprehension: Eyetracking, ERPs, and beyond. Brighton, UK: Psychology Press.

More SPR evaluation
► Does SPR hide subtle details?
  Maybe: Clifton, Speer, & Abney 1991 JML; Schütze & Gibson 1999 JML
  ► Verb attachment: The man expressed his interest in a hurry during the storewide sale… (VP adjunct)
  ► NP attachment: The man expressed his interest in a wallet during the storewide sale… (NP argument)
  Clifton et al.: eyetracking, slow first-pass time in NP-attached PP (followed by faster reading for argument than adjunct)
  Schütze & Gibson: word by word SPR, only the argument advantage
  ► Better materials
  ► Worse technique

Even more SPR evaluation
► Does SPR introduce unnatural effects?
  Maybe: Tabor, Galantucci, Richardson, 2004, local coherence effects
  The coach smiled at the player tossed/thrown the frisbee by the…
► Result: slowed reading at tossed, as if reader considering grammatically illegal main clause interpretation of “the player tossed the…”
► But: scuttlebutt, may not show up in eyetracking
► Global SPR reading speed, 412 ms/word, 145 wpm

Eyetracking during reading
► Eye movement measurement
  Fixations and saccades
  Reading time affected by word length, frequency, other lexical factors

Word-based measures of eye movements (ms)
(SFD = single fixation duration, FFD = first fixation duration, GAZE = gaze duration, Go-P = go-past time)
[Figure: fixations on “Most cowboys hate to live in houses so they”; fixation numbers, left to right: 1, 2, 3, 4, 6, 5, 7; durations, left to right: 223, 235, 178, 301, 179, 267, 199 ms]
cowboys: SFD –; FFD 235 ms; GAZE 413 ms; Go-P 413 ms
hate: SFD 301 ms; FFD 301 ms; GAZE 301 ms; Go-P 301 ms
houses: SFD 267 ms; FFD 267 ms; GAZE 267 ms; Go-P 436 ms

Region-based measures
[Figure: fixations on “While Mary/ was mending/ the sock/ fell off/”; fixation numbers, left to right: 1, 2, 3, 6, 4, 7, 5, 8, 9; durations as listed: 277, 213, 233, 277, 445, 289, 401, 233, 314 ms]
Values shown for two regions:
First pass: 510 ms; 445 ms
Second pass: 401 ms; 547 ms
Go-Past: 510 ms; 1393 ms
Total Time: 911 ms; 992 ms

Interpretation of the measures
► “Early” vs. “late” measures
  Debates about modularity
  Some measures clearly late: second pass time
  But early: need explicit model of eye movement control
► Rayner, Pollatsek, Reichle, colleagues: E-Z Reader
  Good model of lexical effects
  Says little or nothing about parsing & interpretation

ERP (event-related potentials)
► Measure electrical activity on scalp
  Reflect electrical activity of bundles of cortical neurons
  Good time resolution, questionable spatial resolution
► Standard effects: LAN, N400, P600
  Typical peak time, polarity

“Standard” ERP effects (Osterhout, 2004)
N400: The cat will EAT / The cat will BAKE
P600: The cat will EAT / *The cat will EATING

Osterhout, L. et al. (2004). Sentences in the brain…. In M. Carreiras & C. Clifton, Jr. (Eds.), The on-line study of sentence comprehension (pp. 271-308). New York: Psychology Press.
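A sketch of how the region-based eyetracking measures listed earlier (first pass, second pass, go-past, total time) can be computed from a fixation sequence. The definitions are the conventional ones and are stated as assumptions in the comments. Second pass is taken as total time minus first-pass time, which agrees with the slide's numbers (510 + 401 = 911). The function name and the fixation data are hypothetical, not the data from the figure.

```python
def region_measures(fixations, region):
    """Compute reading-time measures (ms) for one region.

    fixations: list of (region_index, duration_ms) in temporal order,
        with regions numbered left to right.
    Assumed definitions:
      first pass  = fixations from first entering the region until leaving it
      go-past     = fixations from first entering the region until the eyes
                    move to the right of it (includes regressions)
      total       = all fixations in the region
      second pass = total - first pass
    A region first fixated only after a later region has been fixated is
    treated as skipped (first pass and go-past are 0).
    """
    first_pass = go_past = total = 0
    state = "before"  # before -> first_pass -> regressed -> done
    for reg, dur in fixations:
        if reg == region:
            total += dur
        if state == "before":
            if reg == region:
                state = "first_pass"
                first_pass += dur
                go_past += dur
            elif reg > region:
                state = "done"  # region was skipped on the first pass
        elif state == "first_pass":
            if reg == region:
                first_pass += dur
                go_past += dur
            elif reg < region:
                state = "regressed"
                go_past += dur
            else:
                state = "done"
        elif state == "regressed":
            if reg > region:
                state = "done"
            else:
                go_past += dur
    return {
        "first_pass": first_pass,
        "second_pass": total - first_pass,
        "go_past": go_past,
        "total": total,
    }

# Hypothetical trial: four regions (0 to 3), one regression from region 2 to 1.
fixations = [(0, 250), (1, 200), (1, 220), (2, 300), (1, 180), (2, 240), (3, 260)]
region_2 = region_measures(fixations, 2)
region_1 = region_measures(fixations, 1)
```

In this made-up trial the regression out of region 2 adds to go-past time for region 2 and to second-pass time for region 1, which is why the same event can look "late" in one measure and "early" in another.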
Secondary tasks: Load effects
► Limited capacity models
► Desire: measure of auditory processing difficulty
► Phoneme monitoring
  E.g., Cutler & Fodor, 1979
  ► Which man was wearing the hat? The man on the corner was wearing the blue hat.
  ► Which hat was the man wearing? The man on the corner was wearing the blue hat.
  ► Target: /k/ or /b/; when target started focused word, 360 ms; when started non-focused word, 403 ms.
  Interpretive difficulties

Secondary tasks: Load effects II
► Lexical decision (or naming, or semantic decision, or…)
  Word unrelated to sentence; measure of available capacity
  Piñango et al., auditory presentation, visual probe
  ► The man examined the little bundle of fur for a long time aspect to see if it was… 743 ms
  ► The man kicked the little bundle of fur for a long time aspect to see if it was… 782 ms

Piñango, M. M., Zurif, E., & Jackendoff, R. (1999). Real-time processing implications of enriched composition at the syntax-semantics interface. Journal of Psycholinguistic Research, 28, 395-414.

Secondary tasks: Probe for activation
► Auditory (or visual) presentation
  Probe semantically related to word in sentence whose activation you want to measure
► E.g., activation of “filler” at “gap” in long-distance dependency
  The policeman saw the boy who the crowd at the party [1] accused [2] of the crime.
  ► Present probe girl or matched unrelated word at point 1 or 2; girl faster at 2.
  Worries, criticisms…

Nicol, J., Swinney, D., Love, T., & Hald, L. (2006). The on-line study of sentence comprehension: An examination of dual task paradigms. Journal of Psycholinguistic Research, 35, 215-231.

Speed-accuracy tradeoff
► Present sentence (usually RSVP), subject to make judgment (grammaticality, etc.)
► But judgment is made in response to a signal that is presented some time after a critical point.
► Accuracy increases with time after the critical point.
► Note, current procedure, multiple signals and multiple responses, e.g., every 350 ms
  Early procedure: just one signal, one response, per trial

McElree, B., Pylkkänen, L., Pickering, M., & Traxler, M. (2006). A time course analysis of enriched composition. Psychonomic Bulletin & Review, 13, 53-59.

McElree et al. data
Best fit: coercion lowered asymptote and lowered rate of approach to asymptote.

Visual World (“head-mounted eyetracking”)
► Measure where you look when you are listening to speech.
  Cooper, 1974. About 40% probability of fixating on referent, 30% fixating on related picture
  ► About 10% in control group.
► Permits on-line measure of processing during listening.
  Not just difficulty: actual content
  Both incidental looks and controlled reaching

Cooper, R. M. (1974). The control of eye fixation by the meaning of spoken language: A new methodology for the real-time investigation of speech perception, memory, and language processing. Cognitive Psychology, 6, 84-107.

Cooper, 1974
While on a photographic safari in Africa, I managed to get a number of breathtaking shots of the wild terrain. These included pictures of rugged mountains and forests as well as muddy streams winding their way through big game country. One of my best shots though was ruined by my scatterbrained dog Scotty. Just as I had slowly wormed my way on my stomach to within range of a flock….

Allopenna, Magnuson & Tanenhaus (1998)
[Figure labels: Eye camera; Scene camera. Spoken instruction: “Pick up the beaker”]

Allopenna, P. D., Magnuson, J. S., & Tanenhaus, M. K. (1998). Tracking the time course of spoken word recognition using eye movements: Evidence for continuous mapping models. Journal of Memory and Language, 38, 419-439.

Allopenna et al. Results
200 ms after coarticulatory information in vowel

Thanks! Enjoy the Institute!