Do we still need corpora
(now that we have the Web)?
Silvia Bernardini
University of Bologna, Italy
Postgraduate Conference in
Corpus linguistics
22 May 2008
The corpus
A collection of texts assumed to be representative of a given language,
dialect, or other subset of a language, to be used for linguistic analysis.
(Francis 1992 [1982]: 17)
A collection of naturally-occurring language text, chosen to characterize a
state or variety of a language. (Sinclair 1991:171)
A closed set of texts in machine-readable form established for general or
specific purposes by previously defined criteria. (Engwall 1992:167)
Finite-sized body of machine-readable text, sampled in order to be
maximally representative of the language variety under consideration.
(McEnery and Wilson 1996:23)
A collection of (1) machine-readable (2) authentic texts […] which is (3)
sampled to be (4) representative of a particular language or language
variety. (McEnery et al. 2006:5)
The Web
• A mine of language data of unprecedented
richness (Lüdeling et al. 2007)
• A fabulous linguists’ playground (Kilgarriff
and Grefenstette 2003)
• [a] cheerful anarchy (Sinclair 2004)
• A helluva lot of text, stored on
computers… (Leech 1992:106)
Is the Web a corpus? Yes!
The definition of corpus should be broad. We
define a corpus simply as “a collection of texts”. If
that seems too broad, the one qualification we
allow relates to the domains and contexts in which
the word is used […]: A corpus is a collection of
texts when considered as an object of language or
literary study. The answer to the question “Is the
web a corpus?” is yes.
Kilgarriff and Grefenstette (2003:334)
Is the Web a corpus? No!
The cheerful anarchy of the Web thus places a
burden of care on a user, and slows down the
process of corpus building. The organisation and
discipline has to be put in by the corpus builder.
[…] users of a corpus assume that there is a
consistency of selection, processing and
management of the texts in the corpus.
Corpora should be designed and constructed
exclusively on external criteria.
(Sinclair 2005)
This talk
• The Web and the corpus
– Disambiguating the WaC acronym
– Where the Web wins out
– Where the corpus holds its ground
• Web as Corpus initiatives @ Forlì
– The BootCaT way
– The WaCky! way
• Open issues and ways forward
Web as Corpus?
(The Web corpus “proper”)
The Web as a corpus surrogate
The Web as a corpus supermarket
The mega-corpus (or mini-Web)
The Web as a corpus surrogate
• Googleology…
• e.g.: Keller and Lapata (2003)
– Predicate-argument bigrams
– adj-noun, noun-noun, verb-noun
– not attested in the BNC
“Web counts correlate reliably with [human plausibility] judgments,
for all three types of predicate-argument bigrams tested, both seen
and unseen. For the seen bigrams, […] the Web frequencies
correlate better with judged plausibility than corpus frequencies”
(ibid: 481).
• … is bad science
“Working with commercial search engines makes us develop
workarounds. We become experts in the syntax and constraints of
Google, Yahoo!, Altavista, and so on. We become ‘googleologists’”
(Kilgarriff 2007:147)
• Unreplicable
– Véronis (2005): 5 billion occurrences of "the" disappeared overnight
– Kilgarriff (2007:148): “queries are sent to different computers, at
different points in the update cycle, and with different data in their
caches”
• Uncontrollable
– Asterisk treated as a placeholder for one word or more than one word
– Punctuation and capitalisation disregarded (even in phrase searches)
– Hit counts are per page, not per occurrence
– Ranking criteria and result sorting (popularity, geographic relevance, …)
• Linguistically naïve
– No morphosyntactic annotation
• 36 queries to extract fulfill + obligation (Keller and Lapata 2003)
• Impossible to extract fulfill + NOUN
– Unsophisticated query language
• No sub-string matching
• No span options
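Because commercial engines offer no lemmatised search, every inflected combination has to be queried literally. A minimal sketch of how the query set explodes; the word-form and determiner lists are illustrative assumptions, not Keller and Lapata's exact expansion scheme:

```python
from itertools import product

# Hypothetical inflection/determiner lists: one plausible way a bigram
# like "fulfill + obligation" explodes into many literal phrase queries
# when the engine has no lemma or part-of-speech search. The expansion
# is deliberately naive and over-generates ungrammatical strings.
verb_forms = ["fulfil", "fulfils", "fulfilled", "fulfilling",
              "fulfill", "fulfills"]
determiners = ["", "the ", "an "]          # bare, definite, indefinite
noun_forms = ["obligation", "obligations"]

queries = [f'"{v} {d}{n}"' for v, d, n in
           product(verb_forms, determiners, noun_forms)]

print(len(queries))   # 6 verbs x 3 determiners x 2 nouns = 36
print(queries[0])     # "fulfil obligation"
```

With these (assumed) lists the product happens to yield 36 literal queries for a single verb-noun pair; `fulfill + NOUN` (any noun) cannot be expanded this way at all.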
SE post-processors?
• e.g. WebCorp, KWiCFinder
– Wildcards and tamecards
– Concordance output
– Collocation
• Not a solution, really
– Slow
– Same limits as SE
The Web as a corpus supermarket
• Selecting and downloading texts
– General or specialized
– Can be automatised (infra)
• e.g. (general):
– Leeds Internet corpora (Sharoff 2006)
• English, Chinese, Finnish, French, German, Italian, Japanese
• Lemmatised and pos-tagged
• Indexed with the CWB and searchable online (CQP)
– Fletcher’s WaC (Fletcher 2007)
• ~500M words of English (AU, CA, GB, IE, NZ, US)
• will be pos-tagged
• “Traditional” corpus =>
– Replicable results
– Control over corpus contents
• In principle
– Control over search methods
– Linguistically sophisticated searches
• Compromise between Web and corpus =>
– Relying on SE (Google, LiveSearch)
– Size
– Up-to-dateness
– Understanding of corpus contents/structure
– Variety of corpus contents
– Noise
The mega-corpus/miniweb
• Baroni (2007): Effort spent by NLP community in
developing Google-skills would be better spent building
our own Google-sized corpora
• None available so far, but:
– WebCorp (Renouf et al. 2007)
– The WaCky! effort (infra)
• Ultimate objective, build a linguist’s search
engine for the Web
Where the Web wins out
• Up-to-dateness
• Size
• Convenience
– Cost
– Ease of collection
– Under-resourced languages
• Web-specific genres
• Reference purposes
Where the corpus holds its ground
• Selection on external criteria
– Cf.: a collection of pieces of language text in
electronic form, selected according to external
criteria to represent, as far as possible, a
language or language variety as a source of
data for linguistic research (Sinclair 2005)
• Register/genre control
• Representativeness and documentation
• Pre- or non-Web genres
e.g.: McEnery et al. 2007
• Collocation information for learners’ dictionaries
• “Help”: Full or bare infinitive?
– Varieties of English, language change, syntactic environment
• Acquisition of grammatical morphemes
– Learner language
• Swearing in modern British English
– writing vs. speaking
– sociolinguistic variables
• Conversation vs. formal speech in AmEng
• Aspect marking in English-Chinese translation
– Parallel corpora
– Cf. Resnik and Smith (2003)
Two approaches to
the Web as corpus
The BootCaT way
Select initial seeds (terms)
Query SE for random seed combinations
Retrieve pages and format as text (corpus)
Extract new seeds via corpus comparison
• Designed for translation students
• Also used for reference corpus building (e.g. Leeds Internet Corpora)
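The query-generation step above can be sketched in a few lines. This mock (hypothetical function name, no real search-engine call, seeds from the wine-tasting example) shows only how random seed combinations become queries:

```python
import random

def make_queries(topic, seeds, n_queries=10, tuple_size=3, rng=None):
    """Mimic BootCaT's first step: build search-engine queries from
    random combinations of seed terms (here a fixed topic word plus a
    random triple of seeds, as in the wine-tasting example)."""
    rng = rng or random.Random(0)  # seeded for a reproducible sketch
    queries = []
    for _ in range(n_queries):
        combo = rng.sample(seeds, tuple_size)  # sample without replacement
        queries.append(" ".join([topic] + combo))
    return queries

seeds = ["rich", "unfiltered", "attractive", "stylish", "sour",
         "meager", "harsh", "spritzy", "dumb", "tobacco", "watery",
         "grapey", "hazy", "nouveau", "spicy", "vinous", "fleshy",
         "puckery", "sharp", "nutty"]
for q in make_queries("wine", seeds, n_queries=3):
    print(q)
```

In the real pipeline each query is sent to a search engine, the returned pages are cleaned into a corpus, and new seeds are extracted by comparing that corpus against a reference corpus.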
BootCaT pros…
• Implemented in perl as a set of simple
command-line scripts
• Freely available and documented
• Integrated into the Sketch Engine pipeline
• Community effort
– WebBootCaT
– JBootCaT
An example: wine tasting
Automatic query generation
wine rich unfiltered attractive
wine stylish "malolactic fermentation" sour
wine meager harsh spritzy
wine dumb tobacco direct
wine watery grapey tears
wine hazy breed nouveau
wine spicy flat body
wine vinous spritzy unfined
wine fleshy cigarbox easy
wine puckery sharp nutty
“vanilla” collocates (span = 1R) in the BootCaT wine tasting corpus
(English, 1.5M words)
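Extracting span = 1R collocates (words immediately to the right of the node) needs only a token stream. A minimal stand-in, on a toy corpus rather than the actual 1.5M-word one:

```python
from collections import Counter

def collocates_1r(tokens, node):
    """Count words occurring immediately to the right (span = 1R)
    of the node word."""
    return Counter(right for left, right in zip(tokens, tokens[1:])
                   if left == node)

# Toy text standing in for the wine-tasting corpus.
text = ("the wine shows vanilla oak and vanilla spice with "
        "hints of vanilla oak on the finish")
counts = collocates_1r(text.split(), "vanilla")
print(counts.most_common())   # [('oak', 2), ('spice', 1)]
```

A real setup would rank collocates by an association measure (MI, log-likelihood) rather than raw frequency; raw counts keep the sketch short.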
…and BootCaT cons
• Relies on SEs => same limits (cf. supra)
– …and Google no longer gives out API keys
• Not really an option for very large corpus
building projects
A more ambitious alternative
The WaCky way
• Aim: produce very large (~2bn words)
web-derived corpora for several languages
• Collaborative effort, using existing open
tools and making the tools developed publicly
available
• WaCky corpora currently available:
– deWaC, itWaC, ukWaC, frWaC
The WaCky pipeline
• Submit random word combinations to Google
and obtain list of URLs (seeding)
• Crawling (Heritrix)
• Code removal and boilerplate stripping
• Language filtering
• Near-duplicate detection
• Tokenization, POS-tagging and lemmatisation
• Indexing and querying
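Near-duplicate detection, one of the trickier stages above, is commonly done by comparing word n-gram "shingles". A sketch assuming 5-word shingles and a Jaccard-overlap threshold (the actual WaCky settings and algorithm may differ):

```python
def shingles(text, n=5):
    """Word n-gram 'shingles' of a document; n=5 is a common choice
    for near-duplicate detection."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def near_duplicates(a, b, n=5, threshold=0.5):
    """Flag two documents as near-duplicates when the Jaccard overlap
    of their shingle sets exceeds the threshold."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa or not sb:
        return False
    jaccard = len(sa & sb) / len(sa | sb)
    return jaccard > threshold

doc1 = "the quick brown fox jumps over the lazy dog near the river bank"
doc2 = "the quick brown fox jumps over the lazy dog near the river bend"
doc3 = "corpus linguistics studies language on the basis of large text collections"
print(near_duplicates(doc1, doc2))  # True: one word differs
print(near_duplicates(doc1, doc3))  # False: no shared shingles
```

At Web scale, exhaustive pairwise comparison is infeasible; production pipelines hash or sample the shingles (e.g. MinHash-style techniques) instead.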
An example: constructing ukWaC
• Seeding: mid-frequency content words (BNC);
words from spoken text (BNC); vocabulary list
for foreign learners
• Crawl limited to UK domain and html
• Processing
– Only files between 5 and 200 kB kept
– Perfect duplicates discarded
– Code, boilerplate, files with unconnected text and
pornographic pages removed
– Near-duplicates removed
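The size filter and the perfect-duplicate step can be sketched as follows; hash-based duplicate detection is an assumption about the implementation, not a description of the actual tooling:

```python
import hashlib

def filter_documents(docs, min_bytes=5 * 1024, max_bytes=200 * 1024):
    """Sketch of two ukWaC-style filters: keep only files between
    5 kB and 200 kB, and drop perfect duplicates, detected here by
    hashing the whitespace-normalised text."""
    seen = set()
    kept = []
    for doc in docs:
        size = len(doc.encode("utf-8"))
        if not (min_bytes <= size <= max_bytes):
            continue  # too small or too large
        normalised = " ".join(doc.split())
        digest = hashlib.md5(normalised.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # perfect duplicate of an earlier document
        seen.add(digest)
        kept.append(doc)
    return kept

# A tiny page, two identical mid-sized pages, one distinct page.
docs = ["tiny page", "x " * 4000, "x " * 4000, "y " * 5000]
print(len(filter_documents(docs)))  # 2: tiny file and duplicate dropped
```

Boilerplate stripping, language filtering and near-duplicate removal then run on what survives these cheap first-pass filters.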
ukWaC: Details and size
2,000 seed word pairs
6,528 seed URLs
351 GB raw crawl size
19 GB after document filtering
5.69 M documents after filtering
12 GB after near-duplicate cleaning
2.69 M documents after near-duplicate cleaning
30 GB size with annotation
1,914,150,197 tokens
3,798,106 types
Further info and availability:
A wacky example
Results for wacky + NOUN (frequency > 2), Baroni et al. (submitted)
• ukWaC: 3 ideas, 2 roles, 2 photo, 2 items, 2 humour, 2 characters
• 71 world, 44 ideas, 43 wigglers, 42 wiggler, 28 characters, 27 sense,
22 comedy, 21 stuff, 21 races, 20 things, 19 idea, 15 humour, 13 games,
12 race, 11 backy, 10 baccy, 10 fun, 10 game, 10 inventions, 10 names,
10 uses
WaC: What the future holds
• Have WaC corpora replaced “traditional” corpora?
– Not really…
• Challenges
– Cleaning techniques
– Web-tuned annotation tools
– Indexing and querying systems
– (Automatic) text classification
Approaches to Web text classification
• Biber and Kurjian (2007)
– Search engine categories not well defined for
purposes of linguistic analysis
• Google directory
– Multidimensional analysis
• text type approach
– Register approach
• future work
Approaches to Web text classification
• Sharoff (forthcoming)
– Genre typology based on EAGLES
• “Communicative intentions”
• Discussion, information, instruction, propaganda,
recreation, regulations, reporting
– SVMs to automatically categorise texts in
Web corpus
– Classifiers trained on manually-classified texts
• BNC + subset of Web corpus
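The workflow (train on manually classified texts, then label the Web corpus automatically) can be illustrated with a toy classifier. A nearest-centroid stand-in replaces the SVMs here purely to keep the sketch dependency-free; genre labels and training snippets are invented:

```python
from collections import Counter
import math

def vectorise(text):
    """Bag-of-words count vector for a text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def train_centroids(labelled):
    """Sum the count vectors of each genre's training texts.
    (Sharoff uses SVMs; nearest-centroid is only a minimal stand-in.)"""
    centroids = {}
    for genre, text in labelled:
        centroids.setdefault(genre, Counter()).update(vectorise(text))
    return centroids

def classify(text, centroids):
    """Assign the genre whose centroid is most similar to the text."""
    vec = vectorise(text)
    return max(centroids, key=lambda g: cosine(vec, centroids[g]))

training = [
    ("instruction", "first preheat the oven then mix the flour and stir"),
    ("instruction", "press the button then wait and restart the device"),
    ("reporting", "the minister said yesterday that talks had failed"),
    ("reporting", "officials reported that the meeting ended without agreement"),
]
centroids = train_centroids(training)
print(classify("mix the ingredients then bake in the oven", centroids))
```

The real setting differs in scale and features (SVMs over richer feature sets, trained on the BNC plus a manually classified subset of the Web corpus), but the train-then-label shape is the same.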
WaC challenges
• Representativeness
Without representativeness,
whatever is found to be true
of a corpus, is simply true of
that corpus – and cannot be
extended to anything else
(Leech 2007:135)
WaC challenges
• Documentation
Compilers make the best corpus they can
in the circumstances, and their proper
stance is to be detailed and honest about
the contents. From their description of the
corpus, the research community can
judge how far to trust their results, and
future users of the same corpus can
estimate its reliability for their purposes.
(Sinclair 2005)
Thank you
Silvia Bernardini
University of Bologna, Italy
