PALA Summer School, Maribor, 2014
Corpus Stylistics
Brian Walker and Dan McIntyre
University of Huddersfield

Summer School Schedule Day 1
09:00 – Lectures: Introduction to corpus linguistic terminology and methodology; corpus linguistics + stylistics
11:00 – Practical session: Introduction to Wmatrix
12:30 – LUNCH
14:00 – Practical sessions: Wmatrix
17:30 – FINISH

Summer School Schedule Day 2
09:00 – Practical: Introduction to AntConc
10:00 – Practical: AntConc – advanced features
11:30 – Lecture: Round up: Corpus stylistics – more than the sum of its parts?
12:30 – LUNCH
14:00 – Over to Willie

Introduction to Corpus Linguistics

What is a corpus?
• Latin corpus: ‘body’ (plural corpora)
• Put simply: a corpus is a ‘body’ of text

What is Corpus Linguistics?
• Corpus linguistics is the study of language using a corpus or corpora

Early Corpus Linguistics
Franz Boas
Charles Hockett
Leonard Bloomfield
Zellig Harris
• ‘Corpus Linguistics’ as an anachronism
• Field Linguistics
• Boas’s studies of Native American languages
• Bloomfield’s description of Tagalog
• Hockett’s work on Potawatomi
• Harris’s emphasis on the importance of results being derived from data

While until about 1880 investigators confined themselves to the collection of vocabularies and brief grammatical notes, it has become more and more evident that large masses of texts are needed in order to elucidate the structure of languages. (Boas 1917: 1)

Principles of Chomskyan linguistics
• Homogeneous underlying system of language
• Describe the language of the ideal speaker/hearer
• Focus on linguistic competence as opposed to linguistic performance

Corpus linguistics doesn’t mean anything. It’s like saying suppose a physicist decides, suppose physics and chemistry decide that instead of relying on experiments, what they’re going to do is take videotapes of things happening in the world and they’ll collect huge videotapes of everything that’s happening and from that maybe they’ll come up with some generalizations or insights. (Chomsky, quoted in Andor 2004: 97)

Problems with intuition
Issue of acceptability
• I was 19 when I started university
• I were 19 when I started university
Impossibility of studying certain aspects of language without recourse to corpus data
• Historical linguistics
• Language change/variation
• Language acquisition

…this [intuition] is a very strange notion of data. Normally one expects a scientist to develop theories to describe and explain some phenomena which already exist, independently of the scientist. One does not expect a scientist to make up the data at the same time as the theory, or even to make up the data afterwards, in order to illustrate the theory. (Stubbs 1996: 29)

The Survey of English Usage
• Instigated 1959 by Randolph Quirk at University College London
• One million words of written and spoken British English, made up of 200 text samples of 5000 words each
• Electronic version of the spoken data produced in collaboration with Lund University: the London-Lund Corpus
• Manually annotated for prosodic and paralinguistic features
• Grammatical structures for each text sample recorded on file cards
• Searching the corpus meant a trip to the Survey offices to search through filing cabinets of data!
Building the Brown corpus
• The Brown Corpus
• Built by Nelson Francis and Henry Kučera at Brown University, USA
• One million words of written American English (1961), made up of 500 text samples of 2000 words each
• Enabled frequency measures of words
• Confirmed Zipf’s law
• The most frequent word in a corpus is approximately twice as frequent as the second most frequent, and three times as frequent as the third most frequent, etc.
• Frequency is inversely proportional to rank

Extending the Brown family
• 1970-78: LOB
• Built by Geoffrey Leech and colleagues at Lancaster University
• One million words of written British English (1961), made up of 500 text samples of 2000 words each
• FROWN: Written American English from 1992
• FLOB: Written British English from 1991
• BE06: Written British English from early years of 21st century
• LOBalike: Written British English from 2011

Making sense of meaning
• COBUILD project initiated at Birmingham in 1980 - resulted in the Bank of English
• English Lexical Studies 1963: Sinclair, Susan Jones and
Robert Daley analysed a small corpus of spoken and written English to investigate the relationship between words and meaning
• Meaning is best seen as a property of words in combination
• Builds on J. R. Firth’s concept of collocation

Bart: You're up to something, aren't ya?
Homer: No! I'm just going out to commit certain deeds.

s. <p/> A39 57 A39 58 <h_><p_>The write way to  commit  murder<p/> A39 59 <p_><quote_>"Advice and inform
of God is manifested. <tf_>Kill, D03 78 rob and  commit  adultery<tf/> are all deeds forbidden in the D03
0 of a religious sect who orders his followers to  commit  suicide.<p/> D11 131 <p_>"God, permitting the mir
bility of episcopal ordination"<quote/> would not  commit  the D17 47 Methodist Church to the view that th
7 article. Take care though: don't let your words  commit  an editor to E10 98 using a specific picture, w
theory and deconstruction is such as to G67 189  commit  the reasoner to defending certain values.<p/> G67
ithin the Service about offenders who continue to  commit  H09 191 crime while on bail<p/> H09 192 <p_>Whil
4m ($45m).<p/> H27 148 <p_>However, it would only  commit  itself to a forecast of H27 149 maintained sales
democracy from collapse, but this was to J41 142  commit  <quote_>"a common fallacy in social thought which
45 163 the effort levels that they are willing to  commit.  Let contracts J45 164 with regard to effort be
2 <p_>Her cheeks flushed crimson and he strove to  commit  to memory P08 53 the lovely colour as the blood
222 never took the slot, although he did briefly  commit  to an ROTC A10 223 program before putting his na
1894.<p/> A26 13 <p_>Commissioners hesitated to  commit  themselves after one of the A26 14 monument's c
ote/> <quote_>"Cold Feet A32 243 - Why Men Won't  Commit"<quote/>  and <quote_>"Letting Go and Moving A32
B13 92 addressed men who use drugs or those who  commit  adultery, and who B13 93 get AIDS and other ven
ceeds rational basis. Since urban blacks B17 61  commit  more crime proportionately (although not numerica
us consequences. Mr. C12 185 Deng was hounded to  commit  suicide in 1966 and his criticism is now C12 186
nd Jodie squabble C13 199 because he's afraid to  commit  to marriage.<p/> C13 200 <p_>Social issues, too,
ue <quote_>"is to do something about it, i.e., to  commit  oneself D03 187 to a way of life ..."<quote/><p/
n objective theistic D03 192 statement; it is to  commit  oneself to living life and to D03 193 understand
to be silly and trivial, because I don't want to  commit  D06 180 an overt, nonrational act and I don't wa
WN:E28\><h_><p_>SANITATION<p/> E28 2 <p_>HOW TO  COMMIT  BIOCIDE<p/> E28 3 <p_>In the strictest sense, s
an <tf|>offensive F04 52 position. That is, to  commit  to an aggressive daily-action plan F04 53 desig
form drives 15 F11 31 percent of its victims to  commit  suicide. (For a list of symptoms, F11 32 see 'A
, artificial persons make decisions that F37 23  commit  other people. At the same time, the power to spea
act G22 13 open to us now would be unjust is to  commit  ourselves to avoiding G22 14 it. But what of pa
H08 57 exploiting the Gulf war as a pretext to  commit  terrorism.<p/> H08 58 <p_>While we can be proud
p/> H09 52 <p_>First, we must get the people who  commit  crimes out of the H09 53 community, and we must
<p_>And, it increases penalties for criminals who  commit  gun H09 69 offenses.<p/> H09 70 <p_>We have no
rease the penalties on those who use such guns to  commit  H09 120 crimes.<p/> H09 121 <p_>Mr. President, I
y requiring grantees H26 155 in most programs to  commit  their own funds for a portion of the H26 156 cos
to do with its value; to think so is to J30 27  commit  a genetic fallacy. After I wrote this, I came acr
ereas J43 34 disengaged delinquents are free to  commit  a variety of illegal J43 35 activities, such fr
on a particular illegal J43 38 possibility. Why  commit  anti-gay violence versus rape or armed J43 39 r
_>Hitler understandably regarded people who could  commit  such J56 150 acts against Britain as his natural
rt with this J58 131 their so natural Right, but  commit  onely<&|>sic! the Administration J58 132 of such
lives K23 172 were before us. Rarely did anyone  commit  suicide. Here, hundreds of K23 173 people sit, w
asked Michael. <quote_>"Did you want P17 102 to  commit  suicide?"<quote/><p/> P17 103 <p_><quote_>"Oh, no

Advances in annotation: the British National Corpus (BNC)
• Currently, one of the best contemporary UK English corpora
• 100 million words from the early 1990s
• Represents a wide range of both spoken and written modern British English
• Written data – 90 million words
  – Includes extracts from newspapers, academic books, popular fiction, letters and university essays
• Spoken data – 10 million words
  – Includes demographic data and context-governed data
• The demographic part
  – Transcripts of about 900 everyday unscripted spoken conversations
• The context-governed part
  – Spoken language collected in public contexts
  – e.g. radio phone-ins, government meetings, classroom interactions

Advances in annotation: Wmatrix

Looking ahead
• Development of tools and technologies
• Corpus techniques increasingly used in other disciplines
• Interdisciplinarity
• Multimodal corpora (e.g. Headtalk, Knight et al. 2008)
• Corpus Linguistics and Geographical Information Systems. This involves extracting place-names from a corpus, searching for their semantic collocates and creating maps to allow users to visualise how concepts such as war and money are distributed geographically (Gregory and Hardie 2011)

References
• Andor, J. (2004) ‘The master and his performance: an interview with Noam Chomsky’, Intercultural Pragmatics 1(1): 93-111.
• Boas, F. (1917) ‘Introduction’, International Journal of American Linguistics 1(1): 1-8. [Reprinted in Boas, F. (1940) Race, Language and Culture, pp. 199-210. New York: The Free Press.]
• Gregory, I. and Hardie, A. (2011) ‘Visual GISting: bringing together corpus linguistics and Geographical Information Systems’, Literary and Linguistic Computing 26(3): 297-314.
• Knight, D., Adolphs, S., Tennent, P. and Carter, R. (2008) ‘The Nottingham Multi-Modal Corpus: a demonstration’, Proceedings of the 6th Language Resources and Evaluation Conference, Palais des Congrès Mansour Eddahbi, Marrakech, Morocco, 28-30th May.
• Stubbs, M. (1996) Text and Corpus Analysis. Oxford: Blackwell.
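
Two ideas from the slides can be tried out in a few lines of Python: the Zipf's law claim on the Brown corpus slide (frequency is inversely proportional to rank) and the keyword-in-context display used for the concordance of "commit". What follows is only a minimal sketch, not the method used by Wmatrix or AntConc. The function names (tokenize, rank_frequency, kwic), the crude regular-expression tokenizer and the default context span are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out runs of letters/apostrophes.

    This is a deliberately crude, hypothetical tokenizer; real corpus
    tools handle punctuation, hyphens and markup far more carefully.
    """
    return re.findall(r"[a-z']+", text.lower())


def rank_frequency(tokens):
    """Return (rank, word, frequency, rank * frequency) rows.

    Rows are ordered from most to least frequent. If frequency is
    inversely proportional to rank, the last value in each row should
    stay roughly constant down the list (on a reasonably large corpus).
    """
    counts = Counter(tokens)
    return [
        (rank, word, freq, rank * freq)
        for rank, (word, freq) in enumerate(counts.most_common(), start=1)
    ]


def kwic(tokens, node, span=4):
    """Return (left context, node, right context) for each hit of `node`.

    `span` is the number of tokens kept on each side of the node word.
    """
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + 1:i + 1 + span])
            lines.append((left, tok, right))
    return lines


if __name__ == "__main__":
    # Invented toy sentence, far too small to say anything about Zipf's law.
    sample = "The cat sat on the mat and the dog sat on the rug."
    toks = tokenize(sample)
    for row in rank_frequency(toks)[:5]:
        print(row)
    for left, node, right in kwic(toks, "sat", span=2):
        print(f"{left:>12}  {node}  {right}")
```

To use it on real data, read a plain-text corpus file into a string and pass it to tokenize. The rank × frequency column only becomes informative with a corpus of realistic size, and the kwic output can then be sorted on the right context to bring repeated collocates together.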