Text-based typology
Corpora, corpora of elicited texts and
parallel corpora
(based on STUF 2007)
МД
1
Pros as compared to
questionnaires




Contextualization of examples
Naturalistic discourse
Intralinguistic variation
Potentially, makes up for grammar gaps
2
Frog stories

(Mercer Mayer)
3
Pear stories



W. Chafe et al.
A six-minute film shot at UC Berkeley in 1975
Widely used in cross-linguistic research
  e.g. the referential density project
4
Referential density (Bickel 2003)

Relative frequency of overt NPs:
Via Nichols 2014
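A minimal sketch of how such a ratio could be computed, assuming referential density is taken as the number of overt argument NPs divided by the number of available argument positions. The function name and the clause annotations are invented for illustration and are not data from Bickel 2003.

```python
def referential_density(clauses):
    """Ratio of overt argument NPs to available argument positions.

    `clauses` is a list of (overt_nps, argument_positions) pairs,
    one pair per clause of a retold story.
    """
    overt = sum(o for o, _ in clauses)
    slots = sum(s for _, s in clauses)
    if slots == 0:
        raise ValueError("no argument positions annotated")
    return overt / slots

# Hypothetical annotation of a four-clause retelling (not real data)
toy_clauses = [(2, 2), (1, 2), (0, 1), (1, 3)]
rd = referential_density(toy_clauses)
```

A value near 1 would mean that almost every argument position is filled by an overt NP; a low value would point to frequent zero anaphora.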
5
Cons of elicited corpora
Not directly comparable
  speakers focus on and omit different events
  mostly quantitative results
Require massive linguistic effort
  limited data for each language
Any alternative?
  Parallel corpora
6
Massively parallel texts

Harry Potter
  Including subtitles (76, 21)
Biblical translations
  Pater Noster in 1,300 languages, 400 full texts, 1,000 gospels
Marxist texts
  State and Revolution: 71 translations in 36 languages
Legal databases:
  Proceedings of the European Parliament
  Universal Declaration of Human Rights (329)
UNESCO online database of literary translations (1.5 million items)
Andersen, Le Petit Prince, …
Cysouw and Wälchli 2007
7
Comparability (easy counts)

Parallel corpora:
  roughly comparable number of sentences (from 1,663 to 1,528 for Petit Prince)
Elicited texts:
  pear stories in the same language vary from 29 to 119 sentences (Bickel 2003 via Nichols 2014)
‘Free’ corpora:
  not applicable…
8
Comparability (methodology)

Comparison by intension
  definition of a phenomenon
  browsing grammars
Comparison by extension
  linguistic structures used for expressing a contextualized situation
  truly functional
Wälchli 2007
9
Extensional typology in
parallel corpora

data we work with may be linguistically different but semantically identical
  cf. much looser identity in elicited texts
  rather, they are “defined as a selection of places in the parallel texts”
they may reflect linguistic variation
  at points where one language uses the same construction, another language uses several
10
Parallel corpora support
conventional typology

Newmeyer against Stassen
  Classical Greek, Latin and Tibetan have the ‘exceed’ type comparative, contra Stassen 1985
Wälchli supports Stassen
  A study of parallel corpora does not show the ‘exceed’ but the ‘separative’ construction
  Parallel corpora reflect dominant patterns – exactly where the typology’s primary interests lie
But they also numerically reflect variation or competition between dominant patterns, rather than provide a yes-or-no typology
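The idea of a numerical profile instead of a yes-or-no classification can be sketched as follows. This is only an illustration: the function names, the language labels and the codings are hypothetical, and the sketch assumes that each selected place in the parallel text has already been coded for construction type.

```python
from collections import Counter

def construction_profile(codings):
    """Relative frequency of each construction type over the sampled places."""
    counts = Counter(codings)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def dominant_pattern(codings):
    """Most frequent construction type and its share of the sampled places."""
    profile = construction_profile(codings)
    label = max(profile, key=profile.get)
    return label, profile[label]

# Hypothetical codings of the same five comparative passages in two languages
toy = {
    "lang_A": ["separative", "separative", "separative", "exceed", "separative"],
    "lang_B": ["exceed", "separative", "exceed", "exceed", "locational"],
}
summary = {lang: dominant_pattern(codes) for lang, codes in toy.items()}
```

The dominant label gives the conventional type, while the share shows how strongly competing constructions are represented in the same passages.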
11
Case studies, among others:








Wälchli 2005: co-compounds
Auwera et al. 2004: epistemic poss. in Slavic
Wälchli 2006: ‘again’
Wälchli 2001: motion events
Wälchli & Zúñiga 2006: motion events ‘again’
Stolz 2004: total reduplication
Stolz et al. 2005: comitatives and instrumentals
Stolz et al.: absolute possessives
12
Stolz 2003, 2004
Le Petit Prince - quantitative
‘avec’-cline
Total-reduplication-cline
Does this require
parallel corpora?
13
Stolz 2003, 2004
Le Petit Prince – qualitative?
Puis il s’épongea le front avec un mouchoir à carreaux rouges.
Then he mopped his forehead with a handkerchief decorated with red squares.
Zatim obrise čelo rupčičem s crvenim kvadratima.
Wells with a rusty pulley – ornative or a
separate category?
14
Pitfalls: data analysis

Easier than raw texts
  we know what was intended and where to look
  still, like any grammatical analysis by a non-expert, subject to mistakes
Alignment issues
Anyway, same or easier than with elicited texts

Wälchli 2007
15
Pitfalls: sample bias
Europe overrepresented, convenience sampling:
  Europe > IE > other families
In his study of comitatives, Stolz ended up with an areal rather than a sample-based study
16
Pitfalls: style/variant choice

Standard language bias
‘Hagiolect’ effects
  ‘The sinners will-Evid not enter the heaven’
Style incomparability
  Better include texts reporting speech
  Bible translations are stylistically diverse
Purism
Wälchli 2007
17
Wälchli 2007
Pitfalls: translation bias
“Incommensurability” of linguistic structures: some languages think differently…
  Australian languages prefer absolute over relative frame of reference
  In Australian Gospels, occurrences of the absolute frame of reference (AFR) are found but are significantly less frequent than in natural discourse from this area
“Inert” construction – a construction that tends to be imported from the source language
18
Case study:
MVC in ‘bring’ and ‘run’ events
Bible-based, Bernhard Wälchli
Multi-verb construction: clauses that contain
more than one lexical verb
BRING and RUN events may be described as
MVC or “solitarizing” verbs
19
BRING and RUN events (Wälchli)
Examples:
Minnin ti-bouay la ban mouin. (Haitian Creole)
lead little-boy def give I
Ač-i-ne man pat-ăm-a il-se kil-ĕr. (Chuvash)
child-ps3-dat/acc I.gen to-poss1sg-dat take-conv come-imp2pl
‘… bring him unto me.’ (solitarizing)
Data usually unavailable from grammars…
20
BRING and RUN events (Wälchli)
Is there any correlation between the choice
of either construction for encoding the two
events?
21
BRING and RUN events (Wälchli)
             BRING: Solit                BRING: MVC
RUN: Solit   Dinka, Navajo, Russian      Ainu, Ewe, Khasi
RUN: MVC     English, Guarani, Maltese   Choctaw, Chuvash, Khoekhoe
22
BRING and RUN events (Wälchli)



165 languages (Eurasia over-represented)
18 BRING events, six RUN events
Correlation between MVC in BRING and RUN is highly significant (Fisher’s test)
             BRING: Solit   BRING: MVC
RUN: Solit   65             12
RUN: MVC     46             42
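The significance claim can be checked with a short sketch, reading the counts on this slide as a 2×2 table of 65, 12, 46 and 42 languages (165 in total). The function below is an illustrative pure-Python version of a two-sided Fisher's exact test, not necessarily the exact procedure used in the study; the result does not depend on which variable is taken as rows.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables at least as unlikely as the observed one
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))

# Counts as read from the slide: solitarizing vs. MVC for RUN and BRING
odds_ratio = (65 * 42) / (12 * 46)
p_value = fisher_exact_two_sided(65, 12, 46, 42)
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.2g}")
```

An odds ratio well above 1 means that languages using MVC for one event type tend to use it for the other as well, even though 58 of the 165 languages fall in the two inconsistent cells.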
23
BRING and RUN events (Wälchli)

Is a language consistently MVC vs. solitarizing?

Surely not – then, is this a typological parameter at all?
24
BRING and RUN events (Wälchli)

But: the distribution is bimodal
25
BRING and RUN events (Wälchli)

If we only consider LOW and HIGH, fewer (14)
languages are inconsistent
26
Case study: demonstratives
Potter-based, Federica da Milano 2007

Distance-oriented systems
  this near – that far
Person-oriented systems
  this with us – that far from us
Is this a real distinction, or are these two subtypes of something more general?
27
Demonstratives (da Milano)

48 stimuli (da Milano 2005)
  Also include reciprocal orientation of the locutors: face to face, face to back, side by side
83 occurrences of deictic demonstratives in “… and the Chamber of Secrets”

28
Demonstratives (da Milano)
‘Tie that round the bars,’ said Fred, throwing
the end of a rope to Harry.
‘Przywiąż to do kraty’, powiedział Fred,
rzucając Harry’emu koniec liny.
29
Demonstratives (da Milano)
One term systems:
French – cela, ça (ceci not used)
German – der/die/das (dieser, jener not used)
30
Demonstratives (da Milano)
Two term systems:
Unmarked vs. proximal – Scandinavian, English, Northern Italian
Unmarked vs. distal – Polish, Russian, Czech, Hungarian, Modern Greek
Dyad-oriented – Catalan
31
Demonstratives (da Milano)
Three term systems: proximal, medial, distal
Dual-anchored – medial (close to addressee or medium distance)
  Spanish (este~ese~aquel)
  Basque (hau~hori~hura)
Addressee-anchored – medial is close to addressee only – not verified on Harry Potter
  Portuguese (este~esse~aquele)
  Also Sardinian and Tuscan
32
Demonstratives (da Milano)
da Milano then proceeds to build a similar typology for adverbs; her conclusions are as follows:
  The map of adverbs is by and large isomorphic to the map of pronouns
  Levinson 2004 confirmed: “perhaps one can hazard the generalizations that speaker-centered degrees of distance are usually (more) fully represented in the adverbs than the pronominals”
  “It has turned out to be fruitful to use parallel texts as a control test of data obtained through the questionnaire. The results from the parallel texts mainly confirmed the prior typological generalizations.”
33
‘Free’ corpora!


No translations – no risk of inert categories, closer to naturalistic discourse
Massive amounts of texts
  Usually – literary
Vast playground for quantitative analysis
34
‘Free’ corpora!
Examples:
Combinatorial statistics for property words
  Lexical typology by LexTyp
Comparative occurrences
  May be useful – cf. temperature domain
35
Comparison: texts in typology

Free corpora:
  No ‘meaning’ identity
  Massive collections: almost all kinds of phenomena
  But a shift towards intensional typology
  Natural discourse
Elicited texts:
  Weak ‘meaning’ identity
  Massive effort for transcription, poor collections
  Only frequent phenomena
  Natural discourse (with provisos)
Parallel corpora:
  Strong ‘meaning’ identity
  Natural written discourse (with provisos)
38
Summary (obvious):

Corpora have their limitations and cannot replace conventional methods –
but can go hand in hand with them
39