Language Corpora:
Seeing Real English Grammar
What’s wrong with these examples?
Conjunctions:
Marsha ordered a double latte, for she had a long night
ahead of her.
He is always boasting; however, no one seems to mind.
Sentence types:
The wolf wailed in an awful way.
The jolly Santa smiled cheerfully.
Subordinate clauses:
Joan grimaced noticeably when Eric began his speech.
Lorry’s father loves his garage in which he builds models
of prehistoric animals.
The Problem
• Linguistics aims to describe real language,
not rules made up by ‘language police’
• Even good grammar textbooks cannot
represent real language. (They’d weigh a
ton!)
• Even good grammar textbooks tend to
‘bend’ the language to get the grammar
rules across.
Problem with Textbook Examples
• Stilted language
Marsha ordered a double latte, for she had a long night ahead of
her.
• Mixing of genres in a single sentence
Lorry’s father loves his garage in which he builds models of prehistoric animals.
– “loves his garage”: conversational
– “in which he builds models of prehistoric animals”: written
One Solution
• Look at large amounts of real language - the
corpus linguistics approach
– Enabled by computers with large memory
capacity
– (But the Oxford English Dictionary was built
on the same principle)
Types of Corpora
• a corpus is a collection of written or spoken
language
– Charles Dickens’ A Christmas Carol
– The New York Times online
– The Santa Barbara corpus of spoken English
• a representative corpus includes samples
from the various types of language usage
– The Brown Corpus
– The British National Corpus
– MICASE
The Brown Corpus: 1st representative corpus
• The Brown corpus consists of 500 text samples
• Each sample consists of just over 2,000 words
• Types of language usage include:
A. PRESS: REPORTAGE (44 texts)
B. PRESS: EDITORIALS (27 texts)
C. PRESS: REVIEWS (17 texts)
D. RELIGION (17 texts)
E. SKILL AND HOBBIES (36 texts)
F. POPULAR LORE (48 texts)
G. BELLES-LETTRES (75 texts)
H. MISCELLANEOUS: GOVT (30 texts)
J. LEARNED (80 texts)
K. FICTION: GENERAL (29 texts)
L. FICTION: MYSTERY (24 texts)
M. FICTION: SCIENCE (6 texts)
N. FICTION: ADVENTURE (29 texts)
P. FICTION: ROMANCE (29 texts)
R. HUMOR (9 texts)
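To see how the corpus is balanced across types of language usage, the category counts can be tabulated and converted to shares. The following is a minimal sketch: the text counts are copied from the list above, while the dictionary name and the helper function are hypothetical names chosen for illustration.

```python
# Text counts per category, copied from the slide's list.
BROWN_CATEGORIES = {
    "Press: reportage": 44,
    "Press: editorials": 27,
    "Press: reviews": 17,
    "Religion": 17,
    "Skill and hobbies": 36,
    "Popular lore": 48,
    "Belles-lettres": 75,
    "Miscellaneous: govt": 30,
    "Learned": 80,
    "Fiction: general": 29,
    "Fiction: mystery": 24,
    "Fiction: science": 6,
    "Fiction: adventure": 29,
    "Fiction: romance": 29,
    "Humor": 9,
}


def category_shares(counts):
    """Return each category's share of all texts, as a percentage."""
    total = sum(counts.values())
    return {name: 100 * n / total for name, n in counts.items()}


total_texts = sum(BROWN_CATEGORIES.values())
shares = category_shares(BROWN_CATEGORIES)

# Print the categories from largest to smallest share.
for name, pct in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<22}{BROWN_CATEGORIES[name]:>4} texts {pct:5.1f}%")
print("Total texts:", total_texts)
```

The counts add up to the 500 text samples stated on the slide, which is a quick sanity check on the list.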
A Simple Example
• Shifting word meaning
go to: http://chss.montclair.edu/linguistics/corpus.tutorial.htm
More Sample Corpus Applications
• Co-occurrence Restrictions
• Part of Speech Identification
• More POS: -ly words
• Intransitive sentences with good vs. well
• Syntactic Construction: the passive
Use the right corpus for your query
For a query about . . . look at . . .
• current standard English: an up-to-date, written corpus
• current everyday usage: an up-to-date, spoken corpus
• frequency of a word/phrase: a large corpus
• a single author’s writing: Project Gutenberg
• word pair comparisons: a concordancer that lists collocates
Copyright Issues
• Current Copyright Law:
– For works created after January 1, 1978, copyright
protection will endure for the life of the author plus an
additional 70 years.
– For pre-1978 works still in their original or renewal
term of copyright, the total term is extended to 95
years from the date that copyright was originally
secured.
• This means that most 20th century literature is unavailable on
the web, except when permission has been obtained to put
it there. For example, Sylvia Plath’s poems are available.
Know your data collection.
Some collections won’t have enough examples.
• Sample topic: “Looking at a form like
progress, button, or butter . . .”
              Brown         Times (1/95)   Health
# of words    1.3 million   3.5 million    200,000
progress      120           268            12
button        13            51             1
butter        27            65             13
Know your data collection.
Its size may be important
Doc            Words       Raw count for true   Normed count for true (per 100,000 words)
Starr Report   98,856      28                   27.67
Times 1/95     3,567,629   518                  14
Even though 'true' occurs many more times in a month of
The Times than in the Starr Report, it occurs more often
per 100,000 words in the Starr Report.
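The normed count is the raw count divided by the size of the collection, scaled to a common base (here 100,000 words). A minimal sketch of that arithmetic follows. The function name is hypothetical, and the word and raw counts are the ones given on the slide. Computed directly from those counts, the rates come out near 28.3 and 14.5, close to the slide's 27.67 and 14 but not identical; the slide does not say how its figures were rounded or counted.

```python
def normed_count(raw_count, corpus_size, per=100_000):
    """Scale a raw frequency to occurrences per `per` words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_count * per / corpus_size


# (raw count of 'true', total words), as given on the slide.
docs = {
    "Starr Report": (28, 98_856),
    "Times 1/95": (518, 3_567_629),
}

rates = {name: normed_count(raw, size) for name, (raw, size) in docs.items()}

for name, rate in rates.items():
    raw, size = docs[name]
    print(f"{name:<14} raw={raw:>4}  words={size:>9,}  per 100,000={rate:.2f}")
```

Normalizing this way is what makes collections of very different sizes comparable: the larger raw count belongs to The Times, but the higher rate belongs to the Starr Report.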
Know your tools
• A concordancer
• A collocate list
• A part of speech tagger
• Searching
• Browsing
• Sorting
Hong Kong Web Concordancer
• If you ask for a word that has many hits in the data,
it will give you the first 2001 hits
• you can search for prefixes, suffixes, etc.
– “Search string: equal to, starts with, ends with, contains”
• you can search for phrases: go to, was seen
• you can sort the output
– by word to the left of the hit (good if you’re looking for
specifiers -- determiners, auxiliaries, etc.)
– or word to the right (good if you’re looking for
complements)
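The search modes and sort options described above can be sketched in a few lines. This is an illustration of the general idea only, not the Hong Kong Web Concordancer's actual implementation: the tokenizer, the function names, and the sample text (two of the textbook sentences from the opening slide) are all assumptions made for the example.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


# The four search-string modes listed on the slide.
MATCHERS = {
    "equal to": lambda tok, s: tok == s,
    "starts with": lambda tok, s: tok.startswith(s),
    "ends with": lambda tok, s: tok.endswith(s),
    "contains": lambda tok, s: s in tok,
}


def search(tokens, string, mode="equal to"):
    """Return the positions of every token matching `string` under `mode`."""
    match = MATCHERS[mode]
    return [i for i, tok in enumerate(tokens) if match(tok, string)]


def sort_hits(tokens, hits, side="left"):
    """Sort hit positions by the word immediately to the left or right."""
    offset = -1 if side == "left" else 1

    def neighbor(i):
        j = i + offset
        return tokens[j] if 0 <= j < len(tokens) else ""

    return sorted(hits, key=neighbor)


text = ("The jolly Santa smiled cheerfully. "
        "Joan grimaced noticeably when Eric began his speech.")
tokens = tokenize(text)

ly_hits = search(tokens, "ly", mode="ends with")
ly_words = [tokens[i] for i in ly_hits]
by_left = [tokens[i] for i in sort_hits(tokens, ly_hits, side="left")]
by_right = [tokens[i] for i in sort_hits(tokens, ly_hits, side="right")]
print(ly_words, by_left, by_right)
```

Note that an "ends with -ly" search returns the adjective jolly alongside the adverbs, which is why sorting by neighboring words helps: hits preceded by a determiner or followed by a noun group together and are easy to separate out.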
HK Concordancer gives collocates
(words in the neighborhood of the keyword)
Concordances for was seen = 5
1 erms; even marriage and the family was seen as a contractual arrangement. It i
2 faire was a conscious policy. Law was seen as an emanation of the "sovereign
3 ent of each of the sample children was seen in the home. The parent was asked
4 Hearst changed to concern when it was seen that he had strong support in many
5 , another change in muscle nuclei was seen, usually occurring in fibers that
Right collocates for 'was seen'
as 2
in 1
that 1
usually 1
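A concordance and a right-collocate list like the ones above can be approximated with a short script. This is a minimal sketch, not the concordancer's own code: the function names and the simple tokenizer are assumptions, and the snippets are the five lines above with the truncated words at the edges dropped.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def find_phrase(tokens, phrase):
    """Return the start position of every occurrence of `phrase`."""
    target = tokenize(phrase)
    n = len(target)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == target]


def kwic(tokens, phrase, width=4):
    """Build keyword-in-context lines with `width` words on each side."""
    n = len(tokenize(phrase))
    lines = []
    for i in find_phrase(tokens, phrase):
        left = " ".join(tokens[max(0, i - width):i])
        right = " ".join(tokens[i + n:i + n + width])
        lines.append(f"{left:>30} [{' '.join(tokens[i:i + n])}] {right}")
    return lines


def right_collocates(tokens, phrase):
    """Count the word immediately to the right of each hit."""
    n = len(tokenize(phrase))
    hits = find_phrase(tokens, phrase)
    return Counter(tokens[i + n] for i in hits if i + n < len(tokens))


snippets = [
    "even marriage and the family was seen as a contractual arrangement.",
    "Law was seen as an emanation of the",
    "of each of the sample children was seen in the home. The parent was asked",
    "Hearst changed to concern when it was seen that he had strong support in many",
    "another change in muscle nuclei was seen, usually occurring in fibers that",
]

concordance = []
collocates = Counter()
for snippet in snippets:
    toks = tokenize(snippet)  # each snippet is searched on its own
    concordance.extend(kwic(toks, "was seen"))
    collocates.update(right_collocates(toks, "was seen"))

for line in concordance:
    print(line)
for word, count in collocates.most_common():
    print(word, count)
```

Because the tokenizer strips punctuation, "seen," in the fifth line still counts as a hit, so "usually" is tallied as a right collocate just as in the list above.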
References
Ball, Catherine. Concordances and Corpora.
Biber, Douglas, Susan Conrad, and Randi Reppen. 1998. Corpus Linguistics:
Investigating Language Structure and Use. Cambridge University Press.
U.S. Copyright Office. Frequently Asked Questions.
(http://www.copyright.gov/faq.html#q46)

Using Corpora to Teach Grammar