ELAR
and Digital Archiving
for Documentation of
Endangered Languages
David Nathan
Endangered Languages Archive
SOAS University of London
LingDy
Feb 15, 2013
What is a digital language archive?
 a trusted repository created and maintained
by an institution with a commitment to the
long-term preservation of archived material
 has policies and processes for acquiring,
cataloguing, preserving, disseminating, and
migrating (updating formats)
 a platform for building and supporting
relationships between data providers and
data users
General archiving functions
 advise
 acquire
 preserve
 add value
 provide access
 develop trust
Why is language archiving different?
 what is a language?
 unlike business data, it is not
conventionalised (like $, age, year of
publication etc) – what and how to code?
 varying and competing expectations
And endangered languages archiving?
 extremely diverse context – languages,
cultures, communities, individuals, projects
 typical source - fieldworkers
 typical materials - documentation
 difficult for archive staff to manage
 sensitivities and restrictions
What can a language archive offer?
 Security - keep your electronic materials safe
 Preservation - store your materials for the long
term
 Discovery - help others to find out about your
materials, and you to find out about users
 Protocols - respect and implement sensitivities,
restrictions
 Sharing - share results of your work, if
appropriate
 Acknowledgement - create citable
acknowledgement
 Mobilisation - create usable language materials
 Quality and standards - advice for assuring
your materials are of the highest quality and
There are different kinds of language
archives
 from local to global - different coverage,
contexts, methods, collection policies
 consider placing your materials in more
than one …
 there are also sites for aggregating
different archives’ holdings, eg Virtual
Language Observatory, OLAC
Why digital?
 preservation: digitisation is the only way
that audio and video (non-symbolic
material) can be preserved for the future …
because it can be copied and transmitted
with zero loss
 also good for cataloguing, sharing,
dissemination, repurposing
Digital disadvantages
 digital data is fragile and ephemeral
 cost (human, equipment, maintenance)
 requires strategy and luck to get right
 preservation depends on file and data formats
 depend on tools and software
 some formats require particular software
(can we archive the software?)
 formats: prefer standard, stable, open,
explicit, long-lasting
What do depositors have to do?
 select and contact an archive
 prepare materials
 select
 structure
 suitable encodings and formats
 complete metadata, metadocumentation,
agreements
 send materials to archive(s)
 work with archive during curation etc
 ongoing management, updating,
OAIS model
 OAIS archives define three types of
‘packages’
ingestion, archive, dissemination:
[Diagram: Producers, Ingestion, Archive, Dissemination, Designated communities]
ELAR - architecture
 reduced boundaries between depositors,
users and archive:
 users add, update content; negotiate access
[Diagram: Producers, Archive and Users; labels: request & contribute, edit, give access]
Redefining the digital EL archive
 a platform for developing and conducting
relationships between knowledge
producers and knowledge users – a social
networking archive
 level the playing field between researchers
and community members/other
stakeholders
 encourage, recognise and cater for
diversity
Data management and archiving
 use good data management practices
whether or not you plan to archive
materials
 document decisions, steps, conventions,
structures, encodings
 appropriate and conventional data
encoding methods (e.g. Unicode)
 be explicit and consistent
 plan for flowing data, working with
others, across different systems (cf Bird
and Simons, ‘Seven Dimensions of
Portability’)
Users and potential users
 depositors – deposit, access or update
materials
 speakers and their descendants
 other researchers - comparative/historical
linguists, typologists, theoreticians,
anthropologists, historians, musicologists
etc etc
 other “stakeholders”, eg educationalists,
funders
 journalists and the wider public
ELAR facts and figures
 archived collections: ~200
 online (published) collections: 150
 average collection size about 80 GB
 online data bundles: ~25,000
 online bundles access: unrestricted 10,000, restricted 15,000
 total number of files held: around 200,000
 total volume of files held: around 10 TB
 registered users: ~800
 annual number of website "hits": 230,000
ELAR facts and figures – users
 increasing number of community members,
including Aleut (Canada), Tai-Ahom, Wadar
(India), Burushaski (Pakistan), Serrano,
Cahuilla, Arapaho (USA), Iraqi Jewish (Iraq),
Saami (Finland), Wabena (Tanzania), Torwali
(Pakistan), Hani, Bai (China), Irish
 comments: “I found your site while looking up
my grandmother, and i found her on your site
speaking our language. and i would love for
my children her great grandchildren to hear
our language coming from her".
 many interdisciplinary researchers, particularly
archivists and anthropologists
Our task
 … to preserve and disseminate
documentation of endangered languages
Why is this important?
 over 50% of the world’s 7000 languages:
 are endangered
 likely to cease to be spoken this
century
 little or nothing known about the
majority of them
 language documentations and the
archives that support, preserve, and
disseminate them, will become the means
of transmission of many languages
A perfect storm?
 documentation methods expose sensitivities & vulnerabilities
 documentation performed by and for linguists and “others”
 “big data” – resources channeled to analysis, broader audiences
 “open data” – push for unmoderated access
Protocol
 the sensitivities and access restrictions
associated with EL resources
 need to be discussed, collected and
recorded in the field
Protocol and access control
 principles:
 granularity – file, bundle or collection
 access is a relation between object and
user
 protocol values can be changed over
time
 ELAR’s URCS system
 User
 Researcher
 Community member
ELAR’s protocol values
 U – resource available to all
registered users
 R – resource available to users
registered as researchers
 C – resource available to users
endorsed as members of relevant
language community
 S – resource available to users who
have been given individual access
rights for that resource
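The four protocol values can be read as a rule relating a resource to a user. A minimal sketch in Python, assuming hypothetical field names (`is_researcher`, `communities`, `granted`) that are illustrative only and not ELAR's actual data model:

```python
from dataclasses import dataclass, field


@dataclass
class User:
    # All field names are illustrative assumptions, not ELAR's schema.
    name: str
    is_researcher: bool = False
    communities: set = field(default_factory=set)  # communities the user is endorsed for
    granted: set = field(default_factory=set)      # resource IDs with individual access rights


@dataclass
class Resource:
    resource_id: str
    protocol: str        # one of "U", "R", "C", "S"
    community: str = ""  # relevant language community, used for "C"


def can_access(user: User, resource: Resource) -> bool:
    """Return True if a registered user may access the resource."""
    if resource.protocol == "U":
        return True  # available to all registered users
    if resource.protocol == "R":
        return user.is_researcher
    if resource.protocol == "C":
        return resource.community in user.communities
    if resource.protocol == "S":
        return resource.resource_id in user.granted
    raise ValueError(f"unknown protocol value: {resource.protocol!r}")
```

Because access is computed from the pair (user, resource), changing a protocol value or a user's status later changes the outcome without touching the resource itself.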
Subscription application: formal
User xx has just applied for access to restricted material in the
deposit solega-107128. The following message was attached to
the application:
"Hi [depositor],
Please delegate me for access to the material on Solegas."
Subscription response: formal
This email is to inform you that user xx's application for access to
restricted material in the deposit musgrave2007tulehu has just
been approved. The depositor included the following note to the
user:
"The researcher is known to me personally and I know that his
interest is legitimate."
Subscription application: “curious”
User xx has just applied for access to restricted material in the
deposit budd2008beirebo. The following message was attached to
the application:
"I'm xx. I like to learn Bislama language, but never heard what it
sounds like. Am very curious "
Subscription application: establish credentials and reason
User xx has just applied for access to restricted material
in the deposit verstraete2010paman. The following message was
attached to the application:
"I am currently doing my masters in Linguistics and I'm
researching on an endangered language in Malaysia. I would like
to see a sample of the data from the fieldwork since I'm not use to
this yet. I hope that I can gain more understanding in carrying out
the fieldwork."
Subscription response: rejected, with reason
This email is to inform you that user xx's application for access to
restricted material in the deposit verstraete2010paman has just
been rejected. The depositor included the following note:
"Dear xx,
I am sorry we cannot give you access to this deposit. The
Lamalama community has asked us to restrict access to
community members.
With best wishes,
[depositor]"
Subscription response: offering further help
This email is to inform you that user xx’s application for access to
restricted material in the deposit caballero2009raramuri has just
been approved. The depositor included the following note to the
user:
"Please let me know if you're looking for any specific materials or
if you have any questions."
Response: further info and offer to meet
This email is to inform you that user xx's application for access to
restricted material in the deposit kunbarlang-389 has just been
approved. The depositor included the following note to the user:
"Hi xx
I've approved your access to this collection, but you should know
that there is an update in the material I've just deposited, with
much more information on both music and texts. I'd be happy to
give you access to that when it is processed.
Next time I come to London (October or November this year) I'd
be happy to meet up if you would like to discuss."
What can you archive (at ELAR)?
 media - audio, video
 graphics - images, scans
 texts - fieldnotes, grammars, description,
analysis
 structured data - aligned and annotated
transcriptions, databases, lexica
 metadata, metadocumentation - contextual
information about the materials, both
structured and unstructured
Archive objects
See bundles at ELAR
 an “object” could be a file, a set of files, a
directory, or a set of files with their
relationships explicitly defined
 like other archives, ELAR uses a set
principle, which we call “bundles” (like DoBeS’
sessions)
Archive objects
[Diagram: hierarchy of ELAR, Collection, Bundle, File]
What is required to make a deposit?
 resource(s) for an endangered language
 it could be just one file
 catalogue / metadata
 deposit form view
 existing deposits can also be updated,
added to, and metadata added/modified
Archive material should be selected
 example: Depositor’s question: How much
video can I archive?
 answer: ...
How can I deliver data?
 hard disks
 we return them
 we also send them out
 flash cards and USB sticks
 email
 good for samples for evaluation
 OK for most text materials
 Dropbox etc
 a web upload facility may be provided one
day
 we can download from your server
What about CDs and DVDs?
 we have found CDs, and
especially DVDs, to be
very unreliable
 DVD fail rate > 10%
 cause confusion as files
are allocated to fit on
disks, not according to
corpus structure
 create a lot of work for depositors and for
ELAR
Express yourself - Metadata
 metadata is
 data about data containers
 data about data
 its functions
• for identification, management, retrieval
of data
• provides the context and understanding
of that data
 carries those understandings into the
future, and to others
Express yourself - Metadata
 metadata reflects the knowledge and
practices of data providers
 … and therefore defines and constrains
audiences and usages for the data
 all value-adding to recordings of events
(annotations, transcriptions, translations,
glosses, comments, interpretations, part
of speech tagging etc) can be considered
metadata
 data and metadata lie on a spectrum and
depend on how they are used rather than
being absolutely different things
Express yourself - Metadata
 distinguish between
 metadata scheme (eg set of
categories) and
 the way that scheme is expressed
relational
filename: sessions.xls
ID  audio         transcription
1   TRS00065.wav  bjt_02.txt
2   TRS00066.wav  krs_43.txt
tagged
filename: sessions.xml
<sessions>
  <session id="1">
    <audio>TRS00065.wav</audio>
    <transcription>bjt_02.txt</transcription>
  </session>
  <session id="2">
    <audio>TRS00066.wav</audio>
    <transcription>krs_43.txt</transcription>
  </session>
</sessions>
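The same scheme can be moved from the relational expression to the tagged one mechanically. A sketch using Python's standard library, with the two session rows taken from the example; the function name `rows_to_xml` is an assumption for illustration:

```python
import xml.etree.ElementTree as ET

# Relational expression: one row per session.
rows = [
    {"id": "1", "audio": "TRS00065.wav", "transcription": "bjt_02.txt"},
    {"id": "2", "audio": "TRS00066.wav", "transcription": "krs_43.txt"},
]


def rows_to_xml(rows):
    """Express the same categories and values as tagged (XML) text."""
    root = ET.Element("sessions")
    for row in rows:
        session = ET.SubElement(root, "session", id=row["id"])
        ET.SubElement(session, "audio").text = row["audio"]
        ET.SubElement(session, "transcription").text = row["transcription"]
    return ET.tostring(root, encoding="unicode")


xml_text = rows_to_xml(rows)
```

The categories (ID, audio, transcription) stay the same; only their expression changes.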
Express yourself - Metadata
 example
 you could choose categories from
OLAC, IMDI etc schemes or formulate
your own
 this would be a scheme of logical
categories (speaker, location, date etc)
 you could express these in different
language(s)
 you could structure the categories and
values in different ways, eg as
spreadsheet, database, XML
Express yourself - Metadata
 you need to choose
 a set of metadata categories applying
across whole collection
+
 metadata categories that apply to
particular types of objects (eg
transcriptions, video), or to individual
objects
+
 ways of expressing and encoding all that
metadata
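One possible way to combine these levels is to layer them, with more specific values overriding more general ones. The override rule is an assumption of this sketch, not something the archive prescribes, and every name and value below is hypothetical:

```python
# Hypothetical metadata at three levels of specificity.
collection_md = {"language": "xx", "rights": "U"}
type_md = {
    "video": {"format": "MPEG4"},
    "transcription": {"format": "plain text", "encoding": "UTF-8"},
}
object_md = {
    "bjt_02.txt": {"rights": "C"},
}


def effective_metadata(filename, object_type):
    """Layer collection, type and object metadata; later layers win."""
    merged = dict(collection_md)
    merged.update(type_md.get(object_type, {}))
    merged.update(object_md.get(filename, {}))
    return merged
```

Stating collection-wide values once and recording only the exceptions per object keeps the metadata explicit and consistent.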
Example
 Ju|’hoan (Biesele)
Potential sources of metadata
 deposit form
 spreadsheets
 MS Word tables, CSV etc
 IMDI and OLAC XML files
 custom XML
 notes, correspondence and reports
 filenames
 direct input to ELAR interface
 audio files
 images (/captions)
 meta-metadata files
A survey
 we collected information from about 50
ELAR deposits
About 80% of most frequently occurring categories can be
mapped to OLAC
count  term              OLAC
20     language          Subject.language
17     date              Date
17     description       Description
16     id                Identifier
16     speaker           Contributor
16     title             Title
15     format            Format
13     type              Type
12     creator           Creator
12     file name         Identifier
12     notes             -
11     rights            Rights
10     duration          Coverage
9      content           Description
9      contributor       Contributor
9      name              Contributor
9      relation          Relation
8      age               -
8      comment           -
8      genre             Type.linguistic
8      subject.language  Subject.language
7      date recorded     Date
7      document 1        -
7      gender            -
7      place             Coverage
6      directory         Identifier
5      location          Coverage
5      rec_date          Date
5      recorder          Contributor
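A mapping like this can be applied as a simple lookup. The sketch below uses only the mapped pairs listed in the survey; the function name `to_olac` and the handling of unmapped terms (kept aside as local fields) are assumptions for illustration:

```python
# Depositor term -> OLAC category, as listed in the survey.
OLAC_MAP = {
    "language": "Subject.language", "date": "Date", "description": "Description",
    "id": "Identifier", "speaker": "Contributor", "title": "Title",
    "format": "Format", "type": "Type", "creator": "Creator",
    "file name": "Identifier", "rights": "Rights", "duration": "Coverage",
    "content": "Description", "contributor": "Contributor", "name": "Contributor",
    "relation": "Relation", "genre": "Type.linguistic",
    "subject.language": "Subject.language", "date recorded": "Date",
    "place": "Coverage", "directory": "Identifier", "location": "Coverage",
    "rec_date": "Date", "recorder": "Contributor",
}


def to_olac(record):
    """Split a depositor's record into OLAC-mapped fields and leftover local fields."""
    mapped, local = {}, {}
    for term, value in record.items():
        target = OLAC_MAP.get(term.strip().lower())
        if target:
            # Several local terms may map to one OLAC category, so collect values in a list.
            mapped.setdefault(target, []).append(value)
        else:
            local[term] = value
    return mapped, local
```

Terms without an OLAC equivalent (such as age or gender) are not discarded; they remain available as local metadata.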
Depositors also add categories such as:
 detailed locations
 metadata in Spanish
 indigenous genres and titles (eg of songs)
 parents’ and spouse’s mother tongues, birthplaces
 number of children, their language competence
 L2, L3 and competencies
 languages heard
 clan/moiety
 occupation
… more metadata:
 date left home country
 photos (/captions) of consultants, field
sessions etc
 equipment
 microphone
 workflow status
 naming and organisational codes and
principles
 recorder/linguist experience level
 biography and project description (“metadocumentation”)
What is the distribution?
Term frequency   Number of terms
20               1
17               2
16               3
15               1
13               1
12               3
11               1
10               1
9                4
8                4
7                4
6                1
5                3
4                5
3                17
2                51
1                613
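The shape of this distribution can be summarised with a little arithmetic. A sketch in Python, using the frequency figures exactly as listed; the variable names are illustrative:

```python
# (term frequency, number of terms with that frequency), from the survey figures.
distribution = [
    (20, 1), (17, 2), (16, 3), (15, 1), (13, 1), (12, 3), (11, 1), (10, 1),
    (9, 4), (8, 4), (7, 4), (6, 1), (5, 3), (4, 5), (3, 17), (2, 51), (1, 613),
]

total_terms = sum(count for _, count in distribution)  # distinct terms overall
once = dict(distribution)[1]                           # terms that occur only once
share_once = once / total_terms
```

With the figures as listed, this gives 715 distinct terms, of which 613 (about 86%) occur only once: a long tail of categories that are specific to individual depositors.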
[Chart: frequency of each metadata term (vertical axis 0 to 25); horizontal-axis labels include language, speaker, creator, duration, subject.language, place, recorder and location, the rest are not recoverable]
A visualisation
Discussion and conclusions
 for endangered language documentation,
the metadata framework is to be
discovered, not predefined (cf Jeff
Wallman, TBRC)
MD and resource discovery
 “discovery” is not neutral:
 what is emphasized/distilled?
 who gains?
 who does the work?
 MD is also about the distribution of labor
and resources
MD and users
 MD is more responsible for the form,
presentation, and usage of documentation
than generally acknowledged
 MD should be equally accessible to and
relevant for community members – it may
even be more relevant to them than any
“linguistic” data
Common metadata standards
 OLAC: Open Language Archives Community:
Date, Title, Identifier, Creator, Contributor, Language, Subject.language,
Description, Format, Type, Rights, Coverage, Relation
 IMDI: ISLE Metadata Initiative
more categories, software specific
 ELAR: for endangered language
documentation, metadata framework is to
Types of metadata
 people metadata – creator’s / participants’
details
 descriptive metadata – content of data
 administrative metadata – eg. who did what
when, relationships between objects, IPR
and permissions
 structural metadata – how collection and its
objects are organised, associated, formatted
 preservation metadata – character
encoding, file format
 access and usage protocols
Examples
 example - XLS
 example - XML
 example – key
 example – key XML
 example – summary and requests
 example - notes
Meta-documentation
 Nathan (2010): “think of metadata as metadocumentation, the documentation of your
data itself, and the conditions (linguistic,
social, physical, technical, historical,
biographical) under which it was produced.
Such meta-documentation should be as rich
and appropriate as the documentary
materials themselves.”
Meta-documentation
 identity of stakeholders involved, and their
roles
 attitudes of language consultants, towards
their languages and towards the
documenter and documentation project
 relationships with consultants and
community (Good 2010 mentions what he
called ‘the 4 Cs’: ‘contact, consent,
compensation, culture’);
 goals and methodology of researcher,
including research methods and tools,
Meta-documentation
 project and researcher biography:
knowledge and experience of the
researcher and consultants (eg.
researcher’s knowledge at beginning of
project, what training researcher and
consultants received)
 for funded projects: grant application,
reports, email communications
 agreements entered into – formal or
informal (eg. Memorandum of
Understanding, compensation
Formats/encoding
 format choices at these levels:
 representation of information
 representation of characters
 how characters are assembled into files
(file formats)
Characters
 use UTF-8 (an encoding of Unicode, ISO/IEC 10646)
 take care when using characters outside ASCII
(common US keyboard characters) – these
can break if UTF-8 is not used
 distinguish character encoding and fonts (a
font is simply a set of images for a
“character set”)
 something may be coded perfectly in
UTF-8 but there is no suitable font
applied
 some fonts may display special
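What "breaking" looks like can be shown in a few lines. A sketch in Python; the sample word is hypothetical and chosen only because it contains characters outside ASCII:

```python
word = "ŋəmə"  # hypothetical word with characters outside ASCII

utf8_bytes = word.encode("utf-8")

# Read back with the right encoding: the text survives intact.
round_trip = utf8_bytes.decode("utf-8")

# Read back with the wrong encoding: the characters "break" into other symbols.
broken = utf8_bytes.decode("latin-1")

# ASCII cannot represent these characters at all.
try:
    word.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

None of this involves a font: the stored bytes identify the characters, and a font only determines whether and how they are drawn on screen.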
File formats
 audio
 WAV
 (what if original is not WAV??)
 resolution: 16 bit, 44.1 kHz, stereo or
better
 video
 changing frequently
 MPEG4 or MTS/H.264/AVCHD
 aspect, resolution: depends on project
 get advice from archive before depositing
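The audio minimum can be checked before depositing. A sketch using Python's standard `wave` module; the helper names `check_wav` and `make_test_wav` are hypothetical, and only bit depth and sample rate are tested here (not the channel count):

```python
import io
import wave


def check_wav(fileobj, min_bits=16, min_rate=44100):
    """Return a list of problems found; an empty list means the minimum is met."""
    problems = []
    with wave.open(fileobj, "rb") as wav:
        bits = wav.getsampwidth() * 8
        rate = wav.getframerate()
        if bits < min_bits:
            problems.append(f"bit depth {bits} < {min_bits}")
        if rate < min_rate:
            problems.append(f"sample rate {rate} Hz < {min_rate} Hz")
    return problems


def make_test_wav(bits, rate, channels=2):
    """Build a one-second silent WAV in memory, for demonstration only."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(bits // 8)
        wav.setframerate(rate)
        wav.writeframes(b"\x00" * (bits // 8) * channels * rate)
    buf.seek(0)
    return buf
```

In practice `check_wav` would be given a path or an open file from the corpus rather than a generated test file.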
File formats
 images
 TIFF **OR** original from device
 resolution: archive quality is 300dpi or
better
File formats
 text
 best is plain text
 PDF/A often acceptable, may pose
problems
 if MS-Word or ODF, check with archive
 structured data (spreadsheets, databases)
 original format should be supplied
 provide a preservable derivative as well
(eg csv, PDF)
 common linguistic software (ELAN,
Transcriber, Toolbox, Praat etc)
Can I still use MS Word?
 ELAR no longer accepts MS Word files
 but Word is still useful
 quicker to type up
 useful tables, functions, macros etc
 solutions
 think “text only”
 tables as spreadsheets (are they bad
too?)
 (advanced) complex materials
formatted as styles, then export as
My cells have multiple values!
 example: keywords
 this is probably OK, as keywords are
atomic
 just consistently use a suitable delimiter
 e.g. use comma - if data values cannot
have commas
 ELAR recommends double pipe “||”
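A delimiter convention is easy to apply consistently in code. A sketch in Python using the double pipe; the function names are illustrative:

```python
DELIMITER = "||"


def split_cell(cell):
    """Split a multi-valued cell into a list of trimmed, non-empty values."""
    return [value.strip() for value in cell.split(DELIMITER) if value.strip()]


def join_values(values):
    """Join values into one cell, refusing values that contain the delimiter."""
    for value in values:
        if DELIMITER in value:
            raise ValueError(f"value contains the delimiter: {value!r}")
    return DELIMITER.join(values)
```

For example, `split_cell("hunting || song, ritual")` keeps "song, ritual" as a single keyword, which a comma delimiter could not do.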
My cells have multiple values!
 example: speakers in a recording
 speakers are probably not atomic – they
have other attributes
 create a separate “speakers” sheet
 give each speaker an ID (number or
initials)
 use the IDs in the original sheet, with
delimiter (implements one to many)
 (advanced) or make another sheet to
associate recordings with speakers
(implements many to many)
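The sheets described here can be sketched as plain data structures. All names, IDs and values below are hypothetical; the audio filenames and the "||" delimiter are reused purely for illustration:

```python
# "speakers" sheet: one row per speaker, keyed by ID, holding the other attributes.
speakers = {
    "S1": {"name": "Speaker One", "gender": "f"},
    "S2": {"name": "Speaker Two", "gender": "m"},
}

# Original sheet: speaker IDs in one cell, separated by a delimiter (one to many).
recordings = [
    {"audio": "TRS00065.wav", "speakers": "S1||S2"},
    {"audio": "TRS00066.wav", "speakers": "S2"},
]

# (advanced) association sheet: one row per recording-speaker pair (many to many).
links = [
    (rec["audio"], speaker_id)
    for rec in recordings
    for speaker_id in rec["speakers"].split("||")
]


def speakers_in(audio):
    """Names of all speakers linked to a recording."""
    return [speakers[sid]["name"] for a, sid in links if a == audio]


def recordings_of(speaker_id):
    """All recordings linked to a speaker."""
    return [a for a, sid in links if sid == speaker_id]
```

Each speaker's attributes are stored once, so a correction to a speaker's details applies to every recording that refers to that ID.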
Standards
 we have already mentioned some
standards – UTF-8, WAV etc
 there are other relevant standards, eg
 ISO 639-3 (codes for language names)
 metadata systems
 you can also establish project-local
standards, eg
 to handle special characters (eg \e =
schwa)
 data field names
 document them! – for your usage and for
correspondence to wider standards
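A documented project-local convention can double as a conversion table to the wider standard. A sketch in Python: only `\e` = schwa comes from the example above; the `\N` entry and the function name are hypothetical:

```python
# Project-local conventions, documented in one place.
LOCAL_TO_UNICODE = {
    "\\e": "\u0259",  # \e = schwa (from the example)
    "\\N": "\u014b",  # \N = eng (hypothetical extra entry)
}


def to_unicode(text):
    """Replace project-local codes with the corresponding Unicode characters."""
    for code, char in LOCAL_TO_UNICODE.items():
        text = text.replace(code, char)
    return text
```

Keeping the table alongside the data means the archive, or a later user, can make the same conversion without guessing.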
THANK YOU!
www.elar-archive.org
David Nathan