PALA Summer School,
Maribor, 2014
Corpus Stylistics
Brian Walker and Dan McIntyre
University of Huddersfield
Summer School Schedule
Day 1
09:00 – Lectures:
Introduction to corpus linguistic terminology and methodology;
corpus linguistics + stylistics
11:00 – Practical session:
Introduction to Wmatrix
12:30 – LUNCH
14:00 – Practical sessions:
Wmatrix
17:30 – FINISH
Summer School Schedule
Day 2
09:00 – Practical:
Introduction to AntConc
10:00 – Practical:
AntConc – advanced features
11:30 – Lecture:
Round up: Corpus stylistics – more than the sum of its parts?
12:30 – LUNCH
14:00 – Over to Willie
Introduction to Corpus Linguistics
What is a corpus?
• Latin corpus: ‘body’ (plural corpora)
• Put simply: a corpus is a ‘body’ of text
What is Corpus Linguistics?
• Corpus linguistics is the study of language
using a corpus or corpora
Early Corpus Linguistics
• Franz Boas
• Leonard Bloomfield
• Charles Hockett
• Zellig Harris
• ‘Corpus Linguistics’ as an anachronism
• Field Linguistics
• Boas’s studies of Native American languages
• Bloomfield’s description of Tagalog
• Hockett’s work on Potawatomi
• Harris’s emphasis on the importance of results
being derived from data
While until about 1880 investigators confined
themselves to the collection of vocabularies and
brief grammatical notes, it has become more and
more evident that large masses of texts are needed
in order to elucidate the structure of languages.
(Boas 1917: 1)
Principles of Chomskyan linguistics
• Homogeneous underlying system of
language
• Describe the language of the ideal
speaker/hearer
• Focus on linguistic competence as
opposed to linguistic performance
Corpus linguistics doesn’t mean anything. It’s like saying suppose a physicist
decides, suppose physics and chemistry decide that instead of relying on
experiments, what they’re going to do is take videotapes of things happening in the
world and they’ll collect huge videotapes of everything that’s happening and from
that maybe they’ll come up with some generalizations or insights. (Chomsky,
quoted in Andor 2004: 97)
Problems with intuition
Issue of acceptability
• I was 19 when I started university
• I were 19 when I started university
Impossibility of studying certain aspects of language
without recourse to corpus data
• Historical linguistics
• Language change/variation
• Language acquisition
…this [intuition] is a very strange notion of data. Normally one expects a scientist
to develop theories to describe and explain some phenomena which already exist,
independently of the scientist. One does not expect a scientist to make up the data
at the same time as the theory, or even to make up the data afterwards, in order to
illustrate the theory. (Stubbs 1996: 29)
The Survey of English Usage
• Instigated 1959 by Randolph Quirk at
University College London
• One million words of written and
spoken British English, made up of
200 text samples of 5000 words each
• Electronic version of the spoken data
produced in collaboration with Lund
University: the London-Lund Corpus
• Manually annotated for prosodic and
paralinguistic features
• Grammatical structures for each text
sample recorded on file cards
• Searching the corpus meant a trip to
the Survey offices to search through
filing cabinets of data!
Building the Brown corpus
• The Brown Corpus
• Built by Nelson Francis and Henry
Kučera at Brown University, USA
• One million words of written
American English (1961), made up of
500 text samples of 2000 words each
• Enabled frequency measures of words
• Confirmed Zipf’s law
• The most frequent word in a corpus is
approximately twice as frequent as
the second most frequent, and three
times as frequent as the third most
frequent, etc.
• Frequency is inversely proportional to
rank
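The rank-frequency relationship above can be checked with a few lines of Python. This is only a minimal sketch: the function names `rank_frequency` and `zipf_expected` and the toy sentence are invented for illustration, and a text this small cannot confirm the law, which only emerges in large corpora such as Brown.

```python
from collections import Counter
import re

def rank_frequency(text):
    """Return (rank, word, count) tuples, most frequent word first."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words).most_common()
    return [(rank, word, count)
            for rank, (word, count) in enumerate(counts, start=1)]

def zipf_expected(top_count, rank):
    """Frequency predicted by Zipf's law: f(rank) = f(1) / rank."""
    return top_count / rank

# Toy text, invented for illustration only.
sample = "the cat and the dog and the bird saw the cat"
table = rank_frequency(sample)
top_count = table[0][2]
for rank, word, count in table[:3]:
    # Observed count next to the count Zipf's law would predict.
    print(rank, word, count, round(zipf_expected(top_count, rank), 2))
```

Replacing `sample` with the text of a full corpus gives a table in which observed and predicted counts can be compared rank by rank.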
Extending the Brown family
• 1970-78: LOB
• Built by Geoffrey Leech and
colleagues at Lancaster University
• One million words of written British
English (1961), made up of 500 text
samples of 2000 words each
• FROWN: Written American English
from 1992
• FLOB: Written British English from
1991
• BE06: Written British English from
early years of 21st century
• LOBalike: Written British English from
2011
Making sense of meaning
• COBUILD project initiated at Birmingham in 1980 - resulted in the Bank of English
• English Lexical Studies 1963: Sinclair, Susan Jones and Robert Daley analysed a small
corpus of spoken and written English to investigate the relationship between words
and meaning
• Meaning is best seen as a property of words in combination
• Builds on J. R. Firth’s concept of collocation
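Collocation can be approximated by counting which words occur within a few tokens of a node word. The following is a rough sketch, not the method used by Sinclair and colleagues or by any particular tool: the function name `collocates`, the `span` parameter and the sample sentences are assumptions made for illustration, and raw counts stand in for the association measures that concordancers usually report.

```python
from collections import Counter
import re

def collocates(text, node, span=4):
    """Count words occurring within `span` tokens either side of `node`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            counts.update(left + right)
    return counts

# Invented sentences, for illustration only.
sample = ("He did not commit murder. They commit suicide rarely. "
          "She will commit to marriage. Few people commit murder.")
top = collocates(sample, "commit", span=1)
print(top.most_common(3))
```

A narrow span picks up immediate neighbours such as the object of the verb; widening it brings in looser associations along with more noise.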
Bart: You're up to something, aren't ya?
Homer: No! I'm just going out to commit certain deeds.
s. <p/> A39 57 A39 58 <h_><p_>The write way to commit murder<p/> A39 59 <p_><quote_>"Advice and inform
of God is manifested. <tf_>Kill, D03 78 rob and commit adultery<tf/> are all deeds forbidden in the D03
0 of a religious sect who orders his followers to commit suicide.<p/> D11 131 <p_>"God, permitting the mir
bility of episcopal ordination"<quote/> would not commit the D17 47 Methodist Church to the view that th
7 article. Take care though: don't let your words commit an editor to E10 98 using a specific picture, w
theory and deconstruction is such as to G67 189 commit the reasoner to defending certain values.<p/> G67
ithin the Service about offenders who continue to commit H09 191 crime while on bail<p/> H09 192 <p_>Whil
4m ($45m).<p/> H27 148 <p_>However, it would only commit itself to a forecast of H27 149 maintained sales
democracy from collapse, but this was to J41 142 commit <quote_>"a common fallacy in social thought which
45 163 the effort levels that they are willing to commit. Let contracts J45 164 with regard to effort be
2 <p_>Her cheeks flushed crimson and he strove to commit to memory P08 53 the lovely colour as the blood
222 never took the slot, although he did briefly commit to an ROTC A10 223 program before putting his na
1894.<p/> A26 13 <p_>Commissioners hesitated to commit themselves after one of the A26 14 monument's c
ote/> <quote_>"Cold Feet A32 243 - Why Men Won't Commit"<quote/> and <quote_>"Letting Go and Moving A32
B13 92 addressed men who use drugs or those who commit adultery, and who B13 93 get AIDS and other ven
ceeds rational basis. Since urban blacks B17 61 commit more crime proportionately (although not numerica
us consequences. Mr. C12 185 Deng was hounded to commit suicide in 1966 and his criticism is now C12 186
nd Jodie squabble C13 199 because he's afraid to commit to marriage.<p/> C13 200 <p_>Social issues, too,
ue <quote_>"is to do something about it, i.e., to commit oneself D03 187 to a way of life ..."<quote/><p/
n objective theistic D03 192 statement; it is to commit oneself to living life and to D03 193 understand
to be silly and trivial, because I don't want to commit D06 180 an overt, nonrational act and I don't wa
WN:E28\><h_><p_>SANITATION<p/> E28 2 <p_>HOW TO COMMIT BIOCIDE<p/> E28 3 <p_>In the strictest sense, s
an <tf|>offensive F04 52 position. That is, to commit to an aggressive daily-action plan F04 53 desig
form drives 15 F11 31 percent of its victims to commit suicide. (For a list of symptoms, F11 32 see 'A
, artificial persons make decisions that F37 23 commit other people. At the same time, the power to spea
act G22 13 open to us now would be unjust is to commit ourselves to avoiding G22 14 it. But what of pa
H08 57 exploiting the Gulf war as a pretext to commit terrorism.<p/> H08 58 <p_>While we can be proud
p/> H09 52 <p_>First, we must get the people who commit crimes out of the H09 53 community, and we must
<p_>And, it increases penalties for criminals who commit gun H09 69 offenses.<p/> H09 70 <p_>We have no
rease the penalties on those who use such guns to commit H09 120 crimes.<p/> H09 121 <p_>Mr. President, I
y requiring grantees H26 155 in most programs to commit their own funds for a portion of the H26 156 cos
to do with its value; to think so is to J30 27 commit a genetic fallacy. After I wrote this, I came acr
ereas J43 34 disengaged delinquents are free to commit a variety of illegal J43 35 activities, such fr
on a particular illegal J43 38 possibility. Why commit anti-gay violence versus rape or armed J43 39 r
_>Hitler understandably regarded people who could commit such J56 150 acts against Britain as his natural
rt with this J58 131 their so natural Right, but commit onely<&|>sic! the Administration J58 132 of such
lives K23 172 were before us. Rarely did anyone commit suicide. Here, hundreds of K23 173 people sit, w
asked Michael. <quote_>"Did you want P17 102 to commit suicide?"<quote/><p/> P17 103 <p_><quote_>"Oh, no
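Concordance lines like those above follow the key word in context (KWIC) layout: a fixed window of text either side of each occurrence of the node word. Here is a minimal sketch of how such lines could be produced. The function name `kwic`, the `width` parameter and the sample text (adapted from two of the lines above, with markup removed) are assumptions for illustration; tools such as AntConc add sorting, tag handling and much more.

```python
import re

def kwic(text, node, width=30):
    """Return (left context, node, right context) for each hit of `node`."""
    lines = []
    pattern = re.compile(r"\b" + re.escape(node) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Pad both sides so the node word lines up in a single column.
        lines.append((left.rjust(width), match.group(), right.ljust(width)))
    return lines

# Sample text adapted from the concordance above, markup removed.
sample = ("Kill, rob and commit adultery are all deeds forbidden. "
          "He was afraid to commit to marriage.")
for left, node, right in kwic(sample, "commit", width=25):
    print(left, node, right)
```

Because the contexts are cut by character count rather than at word boundaries, words at the edges are truncated, just as in the lines above.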
Advances in annotation: the British National Corpus (BNC)
• Currently, one of the best contemporary UK English corpora
• 100 million words from the early 1990s
• Represents a wide range of both spoken and written modern British
English
• Written data
– 90 million words
– Includes extracts from newspapers, academic books, popular fiction, letters
and university essays
• Spoken data
– 10 million words
– Includes demographic data and context-governed data
• The demographic part
– Transcripts of about 900 everyday unscripted spoken conversations
• The context-governed part
– Spoken language collected in public contexts – e.g. radio phone-ins,
government meetings, classroom interactions
Advances in annotation: Wmatrix
Looking ahead
• Development of tools and technologies
• Corpus techniques increasingly used in other disciplines
• Interdisciplinarity
• Multimodal corpora (e.g. Headtalk, Knight et al. 2008)
• Corpus Linguistics and Geographical Information Systems: extracting
place-names from a corpus, searching for their semantic collocates and creating maps
that allow users to visualise how concepts such as war and money are distributed
geographically (Gregory and Hardie 2011)
References
• Andor, J. (2004) ‘The master and his performance: an interview with
Noam Chomsky’, Intercultural Pragmatics 1(1): 93-111.
• Boas, F. (1917) ‘Introduction’, International Journal of American
Linguistics 1(1): 1-8. [Reprinted in Boas, F. (1940) Race, Language
and Culture, pp. 199-210. The Free Press; New York.]
• Gregory, I. and Hardie, A. (2011) ‘Visual GISting: bringing together
corpus linguistics and Geographical Information Systems’, Literary
and Linguistic Computing 26(3): 297-314.
• Knight, D., Adolphs, S., Tennent, P. and Carter, R. (2008) ‘The
Nottingham Multi-Modal Corpus: a demonstration’, Proceedings of
the 6th Language Resources and Evaluation Conference, Palais des
Congrès Mansour Eddahbi, Marrakech, Morocco, 28-30th May.
• Stubbs, M. (1996) Text and Corpus Analysis. Oxford: Blackwell.