LIN 3098 – Corpus Linguistics
Lecture 5
Albert Gatt

In this lecture…
• Corpora and the Lexicon
  • uses of corpora in lexicography
• Counting words
  • lemmatisation and other issues
  • types versus tokens
  • word frequency distributions in corpora

Part 1
Corpora and lexicography

Why corpora are useful
• Lexicographic work has long relied on contextual cues to identify meanings.
  • e.g. Samuel Johnson used examples from literature to exemplify uses of a word.
• Corpora make this procedure much easier
  • not only to provide examples, but
  • to actually identify the meanings of a word given its context
  • definitions of word meanings should therefore be more precise if they are based on large amounts of data

Specific applications
• Grammatical alternations of words
  • E.g. verb diathesis alternations: Atkins and Levin (1995) found that verbs such as quiver and quake have both intransitive and transitive uses. (see Lecture 1)
  • E.g. uses of prepositions such as on, with…
• Regional variations in word use
  • relying on corpora which include gender/region/dialect/date information

Specific applications - II
• Identification of occurrences of a specific homograph, e.g. house (Verb)
  • examination of the contexts in which it occurs
  • relies on POS tagging
• Keeping track of changes in a language through a monitor corpus
• Identifying how common a word is, through frequency counts.
  • many dictionaries include such information now
  • this shall be our starting point

Part 2
Counting words in corpora: types versus tokens

Running example
Throughout this lecture, reference is made to data from a corpus of Maltese texts: ca.
51,000 words, all from Maltese-language newspapers, covering various topics and article types

How to count words: types versus tokens
• token = any occurrence of a word in the corpus (repeated occurrences are each counted)
• type = a distinct word in the corpus (all occurrences of the same word are grouped together as representatives of a single type)
• Example: I spoke to the chap who spoke to the child
  • 10 tokens
  • 7 types (I, spoke, to, the, chap, who, child)

More on types and tokens
• The number of tokens in the corpus is an estimate of overall corpus size
  • Maltese corpus: 51,000 tokens
• The number of types is an estimate of vocabulary size
  • gives an idea of the lexical richness of the corpus
  • Maltese corpus: 8193 types

Type/token ratio
• A (rough!) way of measuring the amount of variation in the vocabulary of the corpus:
  (no. of types)/(no. of tokens)
• Roughly, can be interpreted as the “rate at which new types are introduced, as a function of number of tokens”

Difficult decisions - I
• Do we distinguish upper- and lowercase words?
  • is New in New York the same as new in new car?
  • but what of New in New cars are expensive? (sentence-initial caps)
  • in practice, it’s not straightforward to distinguish the two accurately, but it can be done

Difficult decisions - II
• What about morphological variants?
  • man – men: one type or two?
  • go – went: one type or two?
• If we map all morphological (inflectional) variants to a single type (lemmatisation), our counts will be cleaner.
  • depends on availability of automated methods to do this
• Maltese also presents problems with variants of the definite article (ir-, is-, ix- etc)
  • ir-raġel (DEF-man): one token or two?

Difficult decisions - III
• Do numbers count? e.g. is 1,500 a word?
  • numbers may artificially inflate frequency counts
  • one approach is to treat all numbers as tokens of a single type “NUMBER” or “###”
• Punctuation can compromise frequency counts
  • a computer will treat “woman!” as different from “woman”
  • punctuation needs to be stripped
  • problematic for languages that rely on non-alphabetic symbols: Maltese ‘l (“to”) vs l- (“the”)

Part 2
Representing word frequencies

Raw frequency lists (data from Maltese)
A simple list, pairing each word with its frequency

word                        frequency
aħħar (“last”)              97
jkun (“be.IMPERF.3SG”)      96
ukoll (“also”)              93
bħala (“as”)                91
dak (“that.SGM”)            86
tat- (“of.DEF”)             86

Frequency ranks
• Word counts can get very big.
  • the most frequent word in the Maltese corpus occurs 2195 times (and the corpus is small)
• Raw frequency lists can be hard to process.
• Useful to represent words in terms of rank:
  • count the words
  • sort by frequency (most frequent first)
  • assign a rank to the words: rank 1 = most frequent, rank 2 = next most frequent, …

Rank-frequency list example (data from Maltese)

rank    frequency
1       2195
2       2080
3       1277
4       1264

rank = rank of the type, according to frequency; frequency = number of times the type occurs

Frequency spectrum (data from Maltese)
A representation that shows, for each frequency value, the number of different types that occur with that frequency.

frequency    types
1            4382
2            1253
3            661
4            356

Normalised frequency counts
• A raw frequency for a word isn’t necessarily informative.
  • E.g. it is difficult to compare the frequency of a word in corpora of different sizes.
• We often take a “normalised” count.
  • typical to divide the frequency by the corpus size and multiply by some constant, such as 10,000 or 1,000,000
  • this gives the frequency of the word “per ten thousand” or “per million” words, rather than a raw count.

Type/token ratio revisited
• (no. of types)/(no. of tokens)
• Another way of estimating the “vocabulary richness” of a corpus, instead of just looking at vocabulary size.
• E.g. if a corpus consists of 1000 words, and there are 400 types, then the TTR is 40%

Type/token ratio
• Ratio varies enormously depending on corpus size!
• If the corpus is 1000 words, it’s easy to see a TTR of, say, 40%.
• With 4 million words, it’s more likely to be in the region of 2%.
• Reasons:
  • vocab size grows with corpus size
  • but in large corpora, many of the tokens are repeated occurrences of the same types

Standardised type/token ratio
• One way to account for TTR variations due to corpus size is to compute an average TTR for chunks of a constant size.
• Example:
  • compute the TTR for every 1000 words of running text
  • then, take an average over all the 1000-word chunks
• This is the approach used, for example, in WordSmith.

Part 3
Frequency distributions, or “few giants, many midgets”

Non-linguistic case study
• Suppose we are interested in measuring people’s height.
  • population = adult, male/female, European
  • sample: N people from the relevant population
  • measure height of each person in the sample
• Results:
  • person 1: 1.6 m
  • person 2: 1.5 m
  • …

Measures of central tendency
• Given the height of individuals in our sample, we can calculate some summary statistics:
  • mean (“average”): sum of all heights in sample, divided by N
  • mode: most frequent value
  • median: the middle value (when the values are sorted)
• What are your expectations?

The data (example)

person    height (cm)
1         135
2         159
3         160
4         160
5         180

• Mean: 158.8cm
  • This is the expected value in the long run.
  • If our sample is good, we would expect that most people would have a height at or around the mean.
• Mode: 160cm
• Median: 160cm

Plotting height/frequency
Observations:
1. Extreme values are less frequent.
2. Most people fall at or near the mean
3. Mode is approximately the same as the mean
4. Bell-shaped curve (“normal” distribution)

Plotting height/frequency
• This shape characterises the Normal Distribution.
• A “bell curve”
• Quite typical for a lot of data sampled from humans (but not all data)

What about language?
Typical observations about word frequencies in corpora:
1. there are a few words with extremely high frequency
2. there are many more words with extremely low frequency
3.
the mean is not a good indicator: most words will have an actual value that is very far above or below the mean

A closer look at the Maltese data
• Out of 51,000 tokens:
  • 8016 tokens belong to just the 5 most frequent types (the types at ranks 1 to 5)
  • ca. 15% of our corpus size is made up of only 5 different words!
• Out of 8193 types:
  • 4382 are hapax legomena, occurring only once (bottom ranks)
  • 1253 occur only twice
  • …
• In this data, the mean won’t tell us very much.
  • it hides huge variations!

Ranks and frequencies (Maltese)

rank     frequency
1        2195
2        2080
3        1277
…
2298     1
2299     1
…

• Among top ranks, frequency drops very dramatically
• Among bottom ranks, frequency drops very gradually

General observations
In corpora:
• there are always a few very high-frequency words, and many low-frequency words
• among the top ranks, frequency differences are big
• among bottom ranks, frequency differences are very small

So what are the high-frequency words?
• Top-ranked words in the Maltese data:
  • li (“that”), l- (DEF), il- (DEF), u (“and”), ta’ (“of”), tal- (“of the”)
• Bottom-ranked words:
  • żona (“zone”) f = 1
  • yankee f = 1
  • żwieten (“Zejtun residents”) f = 1
  • xortih (“luck.POSS-3SGM”) f = 1
  • widnejhom (“ear.POSS-3PL”) f = 1

Zipf’s law
• George K. Zipf (1902 – 1950) proposed a mathematical model for describing frequency data:
  • Frequency decreases with rank.
  • More precisely, frequency is roughly inversely proportional to rank.
• We can plot this in a chart:
  • Y-axis = frequency
  • X-axis = rank
  • each dot on the chart represents the lexical item (type) at a given rank

How Zipf’s law pans out (Maltese data)
[Figure: frequency (y-axis, 0 to 2500) plotted against rank (x-axis, 0 to 9000). Annotations: a few high-frequency, low-rank words; hundreds of low-frequency, high-rank words.]

Zipf’s law cross-linguistically
• Empirical work has shown that the Zipfian distribution is observable:
  • independent of the language
  • irrespective of corpus size (for reasonably large corpora)
• The bigger your corpus:
  • the bigger your vocabulary size (no.
types)
  • the more words of frequency 1 (hapax legomena)

Why? Some reasons
• If words occurred completely at random, every word would be equally likely.
  • Our plot would be completely flat: all words at all ranks would have the same frequency.
• Language is absolutely non-random: the occurrence of words is governed by:
  • syntax
  • author/speaker intentions
  • ...
• Some words are the basic “skeleton” for our sentences. They are the most frequent.

Implications
• Traditional measures of central tendency (mean etc) are not very useful.
• No two corpora can be directly compared if they are of different size:
  • vocab size increases with corpus size
  • most of the vocab is made up of hapax legomena
  • most of the corpus size (no. tokens) is made up of a few, very frequent types, typically function words.

Summary
• We’ve introduced some of the uses of corpora for lexicography.
• Focused today on word frequencies, especially Zipf’s law
  • looked at some of the implications
• Next up: collocations and why they’re useful

References
Baroni, M. (2007). Distributions in text. In A. Lüdeling and M. Kytö (eds.), Corpus linguistics: An international handbook. Berlin: Mouton de Gruyter.
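
Worked sketch: counting in code
The counting measures from this lecture (tokens, types, type/token ratio, rank-frequency list, frequency spectrum, normalised frequency and standardised TTR) can be tried out with a few lines of Python. This is a minimal sketch, not the procedure used for the Maltese corpus: all function names are made up for illustration, the tokeniser is deliberately naive (lowercasing, whitespace splitting, stripping surrounding punctuation, no lemmatisation), and dropping the trailing partial chunk in the standardised TTR is an assumption of the sketch.

```python
from collections import Counter

def tokenize(text):
    """Naive tokeniser (an assumption of this sketch): lowercase, split on
    whitespace, strip surrounding punctuation. No lemmatisation."""
    tokens = []
    for raw in text.lower().split():
        word = raw.strip('.,;:!?"()“”')
        if word:
            tokens.append(word)
    return tokens

def type_token_ratio(tokens):
    """(no. of types)/(no. of tokens)."""
    return len(set(tokens)) / len(tokens)

def rank_frequency(tokens):
    """List of (rank, word, frequency), most frequent word at rank 1."""
    ranked = Counter(tokens).most_common()
    return [(rank, word, freq) for rank, (word, freq) in enumerate(ranked, start=1)]

def frequency_spectrum(tokens):
    """Map each frequency value to the number of types with that frequency."""
    type_counts = Counter(tokens)
    return dict(sorted(Counter(type_counts.values()).items()))

def per_million(freq, corpus_size):
    """Normalised frequency: raw frequency / corpus size * 1,000,000."""
    return freq / corpus_size * 1_000_000

def standardised_ttr(tokens, chunk_size=1000):
    """Mean TTR over consecutive chunks of chunk_size tokens.
    The trailing partial chunk is dropped (an assumption of this sketch)."""
    starts = range(0, len(tokens) - chunk_size + 1, chunk_size)
    ratios = [type_token_ratio(tokens[i:i + chunk_size]) for i in starts]
    if not ratios:
        raise ValueError("text is shorter than one chunk")
    return sum(ratios) / len(ratios)

# The example sentence from the types-versus-tokens slide.
tokens = tokenize("I spoke to the chap who spoke to the child")
print(len(tokens))                 # 10 tokens
print(len(set(tokens)))            # 7 types
print(type_token_ratio(tokens))    # 0.7
print(frequency_spectrum(tokens))  # {1: 4, 2: 3}
```

On the example sentence, the spectrum says that 4 types occur once and 3 types occur twice. With real data, `per_million(2195, 51000)` would express the top-ranked Maltese word's count as a per-million figure that can be compared across corpora of different sizes.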