Archiving
LingDy
16 Feb 2012
TUFS, Tokyo
David Nathan
Endangered Languages Archive
Hans Rausing Endangered Languages
Project
SOAS, University of London
What is an archive?
What is a digital language archive?
 a trusted repository created and maintained
by an institution with a commitment to the
long-term preservation of archived material
 has policies and processes for materials
acquisition, cataloguing, preservation,
dissemination, migration to new digital
formats
 a platform for building and conducting
relationships between data providers and
data users
Why is language archiving different?
 what is a language?
 the data is not conventionalised (like $,
age, year of publication etc) – what and
how to code?
 varying and competing expectations
And endangered languages archiving?
 extremely diverse context – languages,
cultures, communities, individuals, projects
 typical source - fieldworkers
 typical materials - documentation
 difficult for archive staff to manage
 sensitivities and restrictions
 extremely high priority
What can a language archive offer?
 Security - keep your electronic materials safe
 Preservation - store your materials for the long
term
 Discovery - help others to find out about your
materials, and you to find out about users
 Protocols - respect and implement sensitivities,
restrictions
 Sharing - share results of your work, if appropriate
 Acknowledgement - create citable
acknowledgement
 Mobilisation - create usable language materials
 Quality and standards - advice for assuring your
materials are of the highest quality and robust
standards
Different kinds of language archives
 different contexts, systems, methods,
collection policies
 you should consider placing your materials
in more than one …
Why digital?
 preservation: digitisation is the only way
that audio and video (non-symbolic
material) can be preserved for the future …
because it can be copied and transmitted
with zero loss
 cataloguing, sharing, dissemination,
repurposing
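The "zero loss" property can be checked in practice by comparing checksums of an original file and its copy. The following is a minimal sketch, assuming SHA-256 is an acceptable checksum; the function names `sha256_of` and `copies_match` are illustrative and are not part of any archive's tooling.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large audio/video files do not fill memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(original, copy):
    """True if the two files have the same digest (bit-for-bit identical)."""
    return sha256_of(original) == sha256_of(copy)
```

Recording the digest alongside each file would also allow a later check that a stored copy has not changed.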
Digital disadvantages
 digital data is fragile and ephemeral
 cost (human, equipment, maintenance)
 requires strategy and luck to get right
 preservation depends on file and data formats
 depends on tools and software
 depends on formats (prefer standard, open,
explicit, long-lasting)
 materials may have to be converted and
migrated
 some formats require particular software (can
we archive the software?)
What is archiving of language materials?
 preparing materials
 selecting
 structuring
 suitable encodings and formats
 well-documented
 depositing them in a suitable archive(s)
 curation and accession by the archive
 ongoing management, dissemination
 new focus on form, presentation and user
interaction/feedback
Users and potential users
 depositors – deposit, access or update
materials
 speakers and their descendants (“majority
of users of Berkeley Language Center
archive are community members”)
 other researchers - comparative/historical
linguists, typologists, theoreticians,
anthropologists, historians, musicologists
etc.
 other “stakeholders”, e.g. educationalists
 journalists and the wider public
Archives networks and bodies
 foundation concepts and technologies from
 library initiatives, e.g. D-LIB http://www.dlib.org/
 OAI (Open Archives Initiative)
 OAIS (Open Archival Information System)
(NASA and space agencies incl JAXA)
 Open Language Archives Community
(OLAC)
 Digital Endangered Languages and Musics
Archives Network (DELAMAN)
 ELAR, DOBES, ANLC, Paradisec, EMELD,
LACITO, AIATSIS, AMPM (Maori)
Citation examples
 from Heidi Johnson of AILLA
Collection:
Sherzer, Joel. "Kuna Collection." The Archive of the
Indigenous Languages of Latin America:
www.ailla.utexas.org. Media: audio, text, image. Access:
0% restricted.
File/resource:
Sherzer, Joel (Researcher). (1970). "Report of a curing
specialist." Kuna Collection. Archive of the Indigenous
Languages of Latin America: www.ailla.utexas.org. Type:
transcription&translation. Media: text. Access: public.
Resource ID: CUK001R001.
Endangered Languages Archive (ELAR)
 one of 3 programs of the Hans Rausing
Endangered Languages Project
 develop policies, preservation
infrastructure, cataloguing and
dissemination, facilities, training, advice,
materials development and publishing
ELAR facts and figures
 archived collections: 110
 online (published) collections: 50
 average collection size: about 60 GB
 online data bundles: 9523
 total number of files held: around 200,000
 total volume of files held: around 10 TB
 online data bundles with unrestricted access: 5298
 registered users: >500
 annual downloads: >1,000
 annual number of website "hits": 230,000
ELAR facts and figures – user accounts
 increasing number of community members,
including Aleut (Canada), Tai-Ahom, Wadar
(India), Burushaski (Pakistan), Serrano,
Cahuilla, Arapaho (USA), Iraqi Jewish (Iraq),
Saami (Finland), Wabena (Tanzania), Torwali
(Pakistan), Hani, Bai (China), Irish
 comments: “I found your site while looking up
my grandmother, and i found her on your site
speaking our language. and i would love for
my children her great grandchildren to hear
our language coming from her”.
 many interdisciplinary researchers, particularly
archivists and anthropologists
Archiving and data management
 most data-related issues are really part of
linguistic data/corpus management
 there are now few data-related issues that
are archive-specific
 metadata formats
 video
 presentation/exhibition of material
What can you archive (at ELAR)?
 media - sound, video
 graphics - images, scans
 texts - fieldnotes, grammars, description,
analysis
 structured data - aligned and annotated
transcriptions, databases, lexica
 metadata - contextual information about
the materials, structured and unstructured
Archive objects
See bundles at ELAR
 an “object” could be a file, a set of files, a
directory, or a set of files with their
relationships explicitly defined
 these are often called “sessions” or
“bundles”
 they should be made explicit
 through metadata
 our future catalogue system will provide
the ability for depositors to directly
create, label and update bundles
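One way to picture a bundle whose relationships are "explicitly defined" is as a small record listing its files plus a set of stated relations between them. The sketch below is illustrative only: the class name `Bundle`, its field names, the relation label "transcribes" and the file names are all assumptions, not ELAR's catalogue format.

```python
from dataclasses import dataclass, field


@dataclass
class Bundle:
    """A set of files whose relationships are stated explicitly."""
    bundle_id: str
    title: str
    files: list = field(default_factory=list)      # file names in the bundle
    relations: list = field(default_factory=list)  # (source, relation, target)

    def relate(self, source, relation, target):
        """Record a relationship between two files that belong to this bundle."""
        for name in (source, target):
            if name not in self.files:
                raise ValueError(f"{name} is not part of bundle {self.bundle_id}")
        self.relations.append((source, relation, target))


# Hypothetical example: a recording and the annotation file that transcribes it.
bundle = Bundle("B001", "Example session", ["rec01.wav", "rec01.eaf"])
bundle.relate("rec01.eaf", "transcribes", "rec01.wav")
```

The same information could equally be kept as rows in a metadata spreadsheet; the point is only that the grouping and the relations are written down rather than implied by file names.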
Archive material should be selected
 example: Depositor’s question: How much
video can I archive?
 answer: ...
What is required to make a deposit?
 resource(s) for an endangered language
 it could be just one file
 inventory / metadata
 deposit form view
 existing deposits can also be updated,
added to, and metadata added/modified
How can I deliver data?
 hard disks
 we return them
 we send them out
 email
 good for samples for evaluation
 OK for most text materials
 Dropbox etc
 flash cards and USB sticks
 a web upload facility may be
provided one day
 we download from your server
What about CDs and DVDs?
 we have found CDs, and
especially DVDs, to be
very unreliable
 DVD fail rate > 10%
 cause confusion as files
are allocated to fit on
disks, not according to
corpus structure
 create a lot of work for
depositors and for ELAR
Protocol
 the sensitivities and access restrictions
associated with EL resources
 need to be discussed, collected and
recorded in the field
 global protocol (the overall, typical value) is
entered into the deposit form
 specific protocol (for files, bundles) is
entered via metadata (or any other explicit
way)
Protocol and access control
 principles:
 granularity – file, bundle or collection
 access is a relation between object and user
 protocol values can be changed over time
 ELAR’s URCS system
 User
 Researcher
 Community member
 Subscriber
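The principles above (granularity, access as a relation between object and user, values that change over time) can be sketched as a lookup in which the most specific protocol wins. This is a hypothetical illustration and does not describe how URCS itself works: the function names, the table layout, the deny-by-default rule and the collection, bundle and file names are all assumptions; only the four role names come from the slide.

```python
def allowed_roles(protocols, collection, bundle=None, file=None):
    """Return the roles allowed for an object, checking the most specific level first."""
    candidates = (
        (collection, bundle, file),   # file-level protocol
        (collection, bundle, None),   # bundle-level protocol
        (collection, None, None),     # collection-level (global) protocol
    )
    for key in candidates:
        if key in protocols:
            return protocols[key]
    return set()  # nothing recorded: deny by default (an assumption)


def can_access(protocols, role, collection, bundle=None, file=None):
    """Access is a relation between an object and a user's role."""
    return role in allowed_roles(protocols, collection, bundle, file)


# Hypothetical protocol table keyed by (collection, bundle, file).
protocols = {
    ("coll1", None, None): {"User", "Researcher", "Community member", "Subscriber"},
    ("coll1", "b2", None): {"Researcher", "Community member"},
    ("coll1", "b2", "song.wav"): {"Community member"},
}
```

Because the table is plain data, changing a protocol value later is just an update to one entry, for example opening `song.wav` to researchers by replacing its set of roles.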
“I have images”
 what kinds of images?
 what are their sources?
 what is their documentation value? what
role do they play in the collection?
 … these should be reflected in the data
structures/metadata
Metadata for images
 at least captions
 what else?
 …
 …
 …
 …
 in what form?
 narrative
 tabular fields
 keywords
Integrating images into metadata
 get a list of image files
 open a command (DOS) window
 change to the directory containing the images
 type “dir > list.txt”
 open the text file (in Notepad++ or MS Word)
 change the font to Courier
 get a “vertical selection” of the file names
 (or use a file listing utility!)
 paste into a spreadsheet
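As an alternative "file listing utility", a few lines of Python can produce a clean list of file names without the vertical-selection step. This is a minimal sketch: the set of image suffixes and the function names `list_image_files` and `write_listing` are assumptions chosen for illustration.

```python
from pathlib import Path

# Assumed set of image file extensions; adjust to match your own materials.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".gif"}


def list_image_files(directory):
    """Return the sorted names of image files directly inside a directory."""
    return sorted(
        entry.name
        for entry in Path(directory).iterdir()
        if entry.is_file() and entry.suffix.lower() in IMAGE_SUFFIXES
    )


def write_listing(directory, listing_path):
    """Write one file name per line, ready to paste into a spreadsheet column."""
    names = list_image_files(directory)
    Path(listing_path).write_text("\n".join(names) + "\n", encoding="utf-8")
    return names
```

The resulting text file contains only file names, one per line, so it can be pasted straight into a spreadsheet column.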
Integrating images into metadata
 make a new sheet for images
 paste in image file list (see previous)
 add an ID column
 type “1” in first cell
 select from first to last cell in ID column
 Edit>Fill>Series>OK
 add other columns
 now you can refer to your images
anywhere!
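The same sheet can also be generated directly as a CSV file that any spreadsheet program opens. The sketch below assumes column names `id`, `filename` and `caption`, and a function name `write_image_sheet`; all of these are illustrative choices, not an ELAR requirement.

```python
import csv


def write_image_sheet(filenames, csv_path):
    """Write an images sheet with sequential IDs starting at 1."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["id", "filename", "caption"])  # assumed column names
        for image_id, name in enumerate(filenames, start=1):
            # The caption is left empty here, to be filled in by hand later.
            writer.writerow([image_id, name, ""])
```

Once each image has a stable ID, other sheets can refer to the image by that ID instead of repeating the file name.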
Using spreadsheet to access data
 you can turn a filename into a link to
access files directly from a spreadsheet
 have the filename in cells
 use the formula
=HYPERLINK(file, "Message")
 examples
=HYPERLINK("E:\archiving\images\"&A2, "click here")
=HYPERLINK(A1&A2, "click here")
=HYPERLINK(A1&A2, A2)
My cells have multiple values!
 example: keywords
 this is probably OK, as keywords are
atomic
 just consistently use a suitable delimiter
 e.g. use comma - if data values cannot have
commas
 ELAR recommends double pipe “||”
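To show why a consistent delimiter matters, here is a minimal sketch of reading and writing multi-valued cells with the double pipe. The function names `split_cell` and `join_values` and the sample keywords are assumptions for illustration.

```python
DELIMITER = "||"  # the double pipe recommended on the slide


def split_cell(cell, delimiter=DELIMITER):
    """Split a multi-valued cell into a list of trimmed, non-empty values."""
    return [value.strip() for value in cell.split(delimiter) if value.strip()]


def join_values(values, delimiter=DELIMITER):
    """Join values into one cell, refusing values that contain the delimiter."""
    for value in values:
        if delimiter in value:
            raise ValueError(f"value contains the delimiter: {value!r}")
    return delimiter.join(values)


# A keyword may safely contain a comma because the comma is not the delimiter.
keywords = split_cell("fishing || canoe, outrigger||song")
```

With a comma as the delimiter, the second keyword above would have been split in two, which is the case the slide warns about.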
My cells have multiple values!
 example: speakers in a recording
 speakers are probably not atomic – they have
other attributes
 create a separate “speakers” sheet
 give each speaker an ID (number or initials)
 use the IDs in the original sheet, with delimiter
(implements one to many)
 (advanced) or make another sheet to associate
recordings with speakers (implements many to
many)
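The two options above can be sketched with plain Python structures standing in for the sheets. Everything here is invented placeholder data: the speaker IDs, names, file names and the helper names `speakers_for` and `recordings_for` are assumptions, not a prescribed layout.

```python
# "speakers" sheet: one row per speaker, keyed by ID (placeholder data).
speakers = {
    "S1": {"name": "Speaker One", "role": "narrator"},
    "S2": {"name": "Speaker Two", "role": "interviewer"},
}

# Original sheet: each recording lists speaker IDs in one delimited cell.
recordings = [
    {"file": "rec01.wav", "speakers": "S1||S2"},
    {"file": "rec02.wav", "speakers": "S2"},
]


def speakers_for(recording, speakers, delimiter="||"):
    """Look up the full speaker rows for the IDs listed in a recording's cell."""
    ids = [i.strip() for i in recording["speakers"].split(delimiter) if i.strip()]
    return [speakers[i] for i in ids]


# (advanced) association sheet: one row per recording/speaker pair.
recording_speakers = [("rec01.wav", "S1"), ("rec01.wav", "S2"), ("rec02.wav", "S2")]


def recordings_for(speaker_id, links):
    """Return every recording linked to a speaker in the association sheet."""
    return [recording for recording, sid in links if sid == speaker_id]
```

The association sheet makes it as easy to ask "which recordings feature this speaker?" as "who speaks in this recording?", at the cost of one extra sheet to maintain.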
Expressing “Relation” in spreadsheets
 one column is usually insufficient
 a “relationship” has two parts
 the target of the relationship
 description of the relationship
 how would this work for images?
How can I tell if it’s Unicode?
 use a browser or Notepad++
 paste text in
 examine the encoding (before and after)
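For checking many files at once, a rough test can also be scripted. The sketch below only distinguishes a few common cases (a byte order mark, or bytes that decode cleanly as UTF-8); the function name `guess_unicode_encoding` and its return labels are assumptions, and a result of `None` means only "not recognised as Unicode by this simple check". Plain ASCII text is reported as UTF-8, since ASCII is a subset of it.

```python
import codecs


def guess_unicode_encoding(data):
    """Return a rough guess for raw bytes: 'utf-8-sig', 'utf-16', 'utf-8' or None."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # UTF-8 with a byte order mark
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"     # a BOM suggests UTF-16 (UTF-32 is not handled here)
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return None         # probably a legacy 8-bit encoding
    return "utf-8"
```

To check a file, read it in binary mode, for example `guess_unicode_encoding(open(path, "rb").read())`, where `path` is the file to examine.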
Can I still use MS Word?
 ELAR no longer accepts MS Word files
 but Word is still useful
 quicker to type up
 useful tables, functions, macros etc
 solutions
 think “text only”
 tables as spreadsheets (are they bad too?)
 (advanced) format complex materials using
styles, then export as marked-up text
 PDF/A – but not a perfect solution
End