The Basic Language Resource Kit (BLARK)
Steven Krauwer
Utrecht Institute of Linguistics UiL OTS /
ELSNET
Hamburg, 22-11-2004
[email protected]
Overview
• The BLARK Enterprise
• How to arrive at it
• The Dutch Language Union approach
• Refining the concept
• Defining a BLARK
• Main beneficiaries
• References
• Concluding remarks
The BLARK Enterprise
• Define the minimal set of language resources needed to carry out any precompetitive R&D and professional education at all for a language (the Basic Language Resource Kit or BLARK)
• Determine for each language which components
are already available
• Make a priority plan to complete the BLARK for
each language
• Ensure funding to get the work done
What are the components of a BLARK?
• Lexicons (monolingual, multilingual, …)
• Corpora (language, speech; annotated,
unannotated; mono- and multilingual;
mono- and multimodal; …)
• Tools (annotation, exploration, …)
• Modules (lemmatizers, parsers, speech recognizers, TTS, transcribers, translation, …)
• …
What makes the BLARK
Enterprise special?
• The idea is to make a common generic BLARK
definition, in principle applicable to all languages
• The common definition will be based on experience with different languages, and will avoid reinventing the wheel
• The common definition will ensure
interoperability and interconnectivity (especially
for multilingual or cross-lingual applications)
Other benefits
• Experience from other languages will help in making cost estimates
• Adoption of a BLARK common to all
languages may help in persuading funders
to support the creation of the BLARK
• Adoption of a common BLARK may
facilitate porting of knowledge and
expertise between languages
Words of caution
• A BLARK definition will evolve over time, as new applications, application environments and technologies emerge
• A BLARK definition should be seen as a template
rather than a dictate, as different languages may
have different specific requirements
• BLARK completion priorities may differ from
language to language (on e.g. economic, social or
political grounds)
How to define a BLARK
and assign priorities
• Methodology proposed by the Dutch Language
Union [DLU] (Binnenpoorte et al, LREC 2002):
– Identify a number of typical applications
– Determine for each of them which technologies (modules) are needed to build them (-, +, ++, +++)
– Identify for each module which resources it requires (-, +, ++, +++)
– Assign the highest priority to the resources that support the most applications (see the sketch below)
Proposed DLU priorities
for NLP
1. treebank
2. robust parsers
3. tokenisation and named entity recognition
4. semantic annotations for the treebank
5. translation equivalents
6. evaluation benchmarks
Proposed DLU priorities
for speech
1. automatic speech recognition
2. application-specific speech corpora
3. multi-media speech corpora
4. tools for transcription of speech data
5. speech synthesis
6. benchmarks for evaluation
Next steps by DLU
• Make a survey of what exists and to what extent it
is available (0-9 availability score)
• Assign priorities (not just to resources but also to an infrastructure for maintenance and distribution)
• Secure funding from Dutch and Flemish
government for a national programme
• Issue calls for proposals for collaborative resource projects (1st call closed Nov 2 2004)
Refining the concept
• Items not really covered by the DLU teams:
– definition vs specification
– availability
– quality
– quantity
– standards
– support
• Addressed in the NEMLAR project
Definition / specification
• It is not enough to say ‘a written language corpus’; what about:
– size (types, tokens)
– encoding
– annotation
– text types
– representativity
– domains
• i.e. we need full specs (sketched below)
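A minimal sketch of what such a full spec could look like as a record; all field names and example values are invented, not prescribed by the BLARK work:

```python
from dataclasses import dataclass, field

@dataclass
class CorpusSpec:
    """One possible 'full spec' record for a written corpus.

    Field names and types are illustrative assumptions, not a standard.
    """
    name: str
    size_tokens: int              # size in tokens
    size_types: int               # size in types (distinct word forms)
    encoding: str                 # e.g. "UTF-8"
    annotation: list[str]         # annotation layers, e.g. ["POS", "lemma"]
    text_types: list[str]         # e.g. ["news", "fiction"]
    representative_of: str        # what the sample is meant to represent
    domains: list[str] = field(default_factory=list)

# Invented example: a large unannotated general-language corpus.
spec = CorpusSpec(
    name="written-general",
    size_tokens=100_000_000,
    size_types=1_000_000,
    encoding="UTF-8",
    annotation=[],
    text_types=["news", "web"],
    representative_of="contemporary written language",
)
```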
Availability
• DLU: 0-9 scale, very impressionistic
• Our proposal: 3 dimensions
– accessibility
– cost
– modifiability
• to each we assign a penalty score (lower is better)
Accessibility
• 3 classes, with associated penalties:
– (3) existing, but only company-internal
– (2) existing and freely usable for precompetitive research
– (1) existing and freely usable for all R&D
Cost
• 4 cost categories:
– (4) price over 10 keuro
– (3) price between 1 and 10 keuro
– (2) price between 100 and 1000 euro
– (1) less than 100 euro
Modifiability
• 3 categories
– (3) black box: you get it as is, but you cannot change or even inspect its internals
– (2) glass box: you cannot change it, but you can see what is inside
– (1) open resource: freely manipulable
Comments on availability
• we can now express availability as a three-digit score (accessibility, cost, modifiability), which should be rather easy to assign objectively (see the sketch below)
• the lowest scores are the best
• if the accessibility score is 3, the other
scores don’t mean very much
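A minimal sketch of how the three penalties could be combined (hypothetical helper functions; the cost thresholds follow the categories above, everything else is an assumption):

```python
def cost_penalty(price_euro: float) -> int:
    """Map a price in euro to the 1-4 cost category above."""
    if price_euro > 10_000:
        return 4
    if price_euro >= 1_000:
        return 3
    if price_euro >= 100:
        return 2
    return 1

def availability(accessibility: int, price_euro: float, modifiability: int) -> str:
    """Return the three-digit availability score; lower digits are better."""
    return f"{accessibility}{cost_penalty(price_euro)}{modifiability}"

# A resource free for all R&D (1), priced at 500 euro (2), glass box (2):
print(availability(1, 500, 2))  # -> 122
```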
Quality
• We distinguish two types of quality: absolute (i.e. an inherent property of the resource) and relative (i.e. in relation to how you want to use it):
• Absolute: standard-compliance and soundness
• Relative: task-relevance and environment-relevance
Standard-compliance
• criterion: to what extent is the resource
based on a common standard (formal or de
facto)
• possible values (penalty based):
– (3) no standard
– (2) standard, but not fully compliant
– (1) standard and fully compliant
Soundness
• criterion: to what extent is the resource
based on well-defined specifications
• values:
– (3) no specifications provided
– (2) specs provided, but not fully compliant
– (1) specs provided, fully compliant
Task-relevance
• criterion (relative): to what extent is the resource suited for a specific task X
• values (3 binary values):
– contains all information needed for X (yes/no)
– has the proper size for X (yes/no)
– based on a relevant selection of items for X (yes/no)
Environment-relevance
• criterion: to what extent is the resource
interoperable with its environment (other
resources)
• values (3 binary values):
– information matches (yes/no)
– size matches (yes/no)
– selection matches (yes/no)
Comments on quality
• We can now express absolute quality objectively
in terms of a pair of scores (standard-compliance,
soundness); this score can be assigned by the
provider
• and relative quality (for our own purposes) in terms of two triples of yes/no answers (task-relevance, environment-relevance); this score can only be assigned by the user
• other attributes may be added as long as they can
be objectively assigned
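As an illustration, the two kinds of quality score could be recorded as simple data records; all field names below are assumptions, not part of the proposal:

```python
from dataclasses import dataclass

@dataclass
class AbsoluteQuality:
    """Assigned by the provider; penalty values, lower is better."""
    standard_compliance: int  # 3 = no standard, 2 = partial, 1 = full
    soundness: int            # 3 = no specs, 2 = partial, 1 = full

@dataclass
class RelativeQuality:
    """Assigned by the user, for one task X and one environment."""
    # task-relevance triple
    has_needed_information: bool
    has_proper_size: bool
    has_relevant_selection: bool
    # environment-relevance triple
    information_matches: bool
    size_matches: bool
    selection_matches: bool

absolute = AbsoluteQuality(standard_compliance=1, soundness=2)
relative = RelativeQuality(True, True, False, True, False, True)
```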
Quantity
• The DLU team did not try to formulate any
quantitative requirements
• We have tried to do this in the context of the
NEMLAR project, see below for our tentative
figures
• Statistical approaches can swallow any amount of
resources, and minimal figures are very hard to
find
• Our figure-finding exercise has been very much example-driven
Standards
• Very few formal standards exist (cf. Romary & Ide at the LREC 2004 workshop; Monachini et al., 2003)
• Evolving de facto standards include:
– Bottom-up work by committees (TEI)
– Top-down actions:
• Projects aiming at standards (e.g. EAGLES, ISLE)
• Example-setting R&D projects (e.g. WordNet, SpeechDat, Multext)
• Our position: any standard is better than no
standard at all
Defining a BLARK
• Work carried out in the context of the
NEMLAR project (www.nemlar.org), aimed
at Arabic resources
• Work described here is based on the project deliverables (see site), summarized in an article by Maegaard, Krauwer, Choukri and Damsgaard presented at the NEMLAR conference in Cairo (Sep 2004)
Approach adopted
• Same strategy as Dutch Language Union
(applications => modules => resources)
• But with different results because of differences in
social/economic situation and in language
structure
• Results follow, in terms of global definitions and tentative size indications (no specs provided at this stage; the project is still ongoing)
• Feedback is welcome!
Written resources (1)
• Lexicon:
– For all components: 40 000 stems with POS &
morphology
– For sentence boundary detection: list of conjunctions
and other sentence starters/stoppers
– For named entity recognition: 50 000 human proper
names
– For semantic analysis: same 40 000, with
subcategorization, shallow lexical semantic info;
possibly a WordNet
Written resources (2)
• Bi-/Multilingual lexicon
– Same size as monolingual
• Thesauri, ontologies, wordnets:
– Thesaurus subtree with ca 200-300 nodes for
each domain
– Ontologies and wordnets ideally same size as
lexicon
Written resources (3)
• Corpora:
– For term extraction: 100 million words unannotated
– For small applications: 0.5 million words annotated
– For statistical POS tagger: 1-3 million (ann)
– Sentence boundary: 0.5-1.5 million (ann)
– Named entity (stat based): 1.5 million (ann)
– Term extraction: 100 million (ann)
– Co-reference resolution: 1 million (ann)
– WSD: 2-3 million (ann)
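For survey bookkeeping, these figures could be recorded as data; a minimal sketch (the dictionary layout is an assumption, the figures are the ones listed above, tuples denote ranges in millions of words):

```python
written_corpora_million_words = {
    "term extraction (unannotated)":        100,
    "small applications (annotated)":       0.5,
    "statistical POS tagger (annotated)":   (1, 3),
    "sentence boundary (annotated)":        (0.5, 1.5),
    "named entity, stat based (annotated)": 1.5,
    "term extraction (annotated)":          100,
    "co-reference resolution (annotated)":  1,
    "WSD (annotated)":                      (2, 3),
}
```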
Written resources (4)
• Multilingual corpora:
– For alignment: 0.5 million (tagged)
• Multimodal corpora:
– For OCR (printed): ??
– For OCR (hand-written): ??
Spoken resources (1)
• Acoustic data:
– For dictation: 50-100 speakers, 20 min each, fully
transcribed, plus 10 speakers for testing
– For telephony: 500 speakers uttering 50 different sentences (SpeechDat, OrienTel based)
– For embedded speech recognition: data similar to
Speecon
– For broadcast news transcription: 50-100 hours well-annotated, plus 1000 hours of non-transcribed data; should come with 300 million words of non-annotated written text
Spoken resources (2)
• Acoustic data (cont’d):
– For conversational speech: data similar to
CallHome/CallFriend from LDC
– For speaker recognition: 500 speakers for training, 3
minutes each, transcribed, plus 100 speakers for testing
– For language/dialect identification: data similar to
CallFriend, or from Broadcast News (esp for variants of
Arabic)
– For speech synthesis: male and female speakers, 15
hours, using a read text, phonetically balanced
– For formant synthesis: same as above, with hand-labelled formants
Spoken resources (3)
• Multimodal corpora:
– For lip movement reading: similar to M2VTS, with
some 50 faces
• Written corpora for speech technologies:
– General: 300 million words unannotated, preferably
broadcast news or other press and media sources
– For phonetic lexicon and language models: 1-5 million
words, annotated
– For Arabic: vowelized and non-vowelized corpus
What next? (1)
• Check the definition and quantification for completeness and consistency, and correct where necessary
• Try to provide specs for every single item
• Try to differentiate between general and Arabic-specific elements in definitions and specs
What next? (2)
• For each language:
– Take the BLARK definition and specs
– Adapt to local conditions
– Make a survey of what exists and what has to
be made
– Find the funds and build the BLARK for your
language
Prescriptive / descriptive
• Prescriptive:
– the BLARK definition tells you which
ingredients you need
– the specification tells you what they should
look like
• Descriptive:
– a BLARK instantiation comes with a
description of its components
Main beneficiaries (1)
• academic and industrial researchers:
material to try out ideas and conduct pilot
studies
• industrial developers: only for generic
activities, since specific applications require
more user and domain orientation
• educators: material for experimental work
by students in labs
Main beneficiaries (2)
• probably not the major languages in Europe (English, French, German), as they are pretty well covered anyway
• mostly the languages that are not supported
by a strong market (because of small size or
poor economy)
References
• Binnenpoorte et al. at LREC 2002 (see also www.elsnet.org/dox/lrec2002-binnenpoorte.pdf)
• ELRA Newsletter, vol. 3, no. 2, 1998 (see also www.elsnet.org/blark.html)
• NEMLAR: see www.nemlar.org for
– Arabic BLARK Report
– NEMLAR presentation at Cairo conference
• Romary & Ide at LREC 2004 (see also www.elsnet.org/lrec2004-roadmap/RomaryIde.ppt)
Concluding remarks
• The BLARK aims at providing a common
definition of the notion ‘minimal set of resources’
• It should help language communities come closer to a level playing field, in spite of market forces
• It should facilitate porting of expertise
• It is necessarily dynamic, as technologies evolve
rapidly
Thanks!
Contact:
[email protected]