Language Identification and IT
Peter Constable and Gary Simons
SIL International
[email protected]
[email protected]
www.sil.org
Language identification
The use of identificational codes for tagging
information objects to indicate the language in
which the information is expressed
<body xml:lang="en">
17th International Unicode Conference
San Jose, CA September 2000
Language identification
Not considering automated language detection
Considering only language identifiers, not
identifiers for paralinguistic notions, such as
writing system or locale
About the Ethnologue
SIL Ethnologue
• catalogue of all modern languages in the world
• lists over 6,800 living languages
• result of decades of research
• system of three-letter codes
• http://www.sil.org/ethnologue
About the Ethnologue
Existing user base for Ethnologue codes:
• SIL
• UNESCO
• Linguistic Data Consortium (850+ agencies)
• The Linguist List (12,500 individual linguists)
• The Endangered Language Fund
• others
Linguistic diversity
Number of languages:
Europe: 237
Asia: 2202
Africa: 2062
Americas: 1020
Pacific: 1312
Motivation for this paper
Languages covered by standards
• ISO 639-x covers approx. 400 languages
• existing needs go much further: over 6,800
languages
• immediate need among linguists and other
researchers for use in XML
Five issues
Change
Categorization
Inadequate definition
Scale
Documentation
The need for language identifiers
Language-specific processing
• spell-checking
• sorting
• morphological parsing
• speech recognition/synthesis
• language-specific typographic behaviour
• etc.
The need for language identifiers
Language-specific processing
• choosing appropriate resources
Los eventos deportivos pra la juventud
ህ ጩቦአፈ ዸድ ማዽ ጸመቂትወይቴ።
The need for language identifiers
Two distinct issues:
• identify the language
• apply the specific processing for that language
The need for language identifiers
Language detection
• identify language by inspection of data itself
• available only for a few languages
• not practical for searching large corpora (e.g. the
Internet)
• doesn’t work on short text segments
She said, “chat”.
The need for language identifiers
Language-specific processing
• in general, must tag information objects to indicate
language
• identifiers are needed to distinguish every
language
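As a minimal sketch of how such tags drive processing, the Python below walks an XML fragment, lets child elements inherit xml:lang from their parents, and looks up a resource per tag. The sample markup, the "fr" tag on the quoted word, and the SPELLING_DICTIONARIES table with its file names are assumptions made for illustration, not part of the original slides.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical table of language-specific resources, keyed by language tag.
SPELLING_DICTIONARIES = {"en": "english.dic", "es": "spanish.dic"}


def tagged_text(element, inherited=None):
    """Yield (language tag, text) pairs; children inherit xml:lang."""
    lang = element.get(XML_LANG, inherited)
    if element.text and element.text.strip():
        yield lang, element.text.strip()
    for child in element:
        yield from tagged_text(child, lang)
        # Text after a child's end tag belongs to the parent element.
        if child.tail and child.tail.strip():
            yield lang, child.tail.strip()


# Illustrative document: the tags, not the text, say which language is which.
doc = ET.fromstring(
    '<body xml:lang="en">She said, <q xml:lang="fr">chat</q>.'
    '<p xml:lang="es">Los eventos deportivos</p></body>'
)

for lang, text in tagged_text(doc):
    dictionary = SPELLING_DICTIONARIES.get(lang, "(no dictionary available)")
    print(f"{lang}: {text!r} -> {dictionary}")
```

The one-word quotation is the case where detection by inspection has nothing to go on; with a tag present, choosing the resource is a plain lookup.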
Issue #1: change
Languages are constantly changing
Implications:
• systems of language tags cannot be static
• the speech variety (varieties) denoted by a tag is
time-bound
“English” c. 1700 A.D. ≠ “English” c. 2000 A.D.
Issue #2: categorization
Typical question: Are Serbian and Croatian the
same language, or different languages?
Operational definitions of language
• many different ways to formulate a definition
• different definitions create different categorizations
• different categorizations serve different purposes
Issue #3: inadequate definition
Existing systems do not consistently employ a
single operational definition
• ISO 639-2: codes for “languages” and for groups
of languages
nav = Navajo
ath = Athapascan languages
• ISO 639-2: some “languages” are groups of
languages
que = “Quechua” (47 distinct languages)
Issue #3: inadequate definition
Consistent use of a single definition in a given
namespace is beneficial
“Requiring a single definition imposes too
much constraint on users”
• users may legitimately have different requirements
• but no control results in confusion, especially
when thousands of identifiers are added
Issue #4: Scale
Number of languages exceeds the coverage of existing
systems by an order of magnitude (6,800 vs. 400)
Existing systems do not scale well
Issue #4: Scale
ISO 639-x
• slow process unable to cope with large volume of
requests
• minimal attestation (50 documents) not
appropriate for lesser-known languages
• mnemonic codes (impossible for thousands of
languages)
• confusion due to inconsistent definition
Issue #4: Scale
RFC 1766
• process unable to cope with large volume of
requests
• confusion due to inconsistent definition
• unclear how to create tags
Issue #5: documentation
Existing systems: can’t tell what codes denote
• ISO 639-x: language, or group of languages?
ara, “Arabic”: Standard only? all variants?
• ISO 639-x: which of several alternate possibilities?
bin, “Bini”
= dial. of Yoruba (Nigeria; 20,000,000)
= dial. of Anyin (Côte d'Ivoire; 810,000)
= alt. name for Edo (Nigeria; 1,000,000)
= alt. name for Pini (Australia; dying)
Issue #5: documentation
• ISO 639-x: 2- vs. 3-letter codes
st, “Sesotho”
= nso, “Sotho, Northern”?
= sot, “Sotho, Southern”?
= both?
to, “Tonga”
= tog, “Tonga (Nyasa)”?
= ton, “Tonga (Tonga Islands)”?
Solving these problems
Requirements of an adequate system:
• able to scale
• able to deal with change, track history of change
• use a single operational definition for a given
namespace
• apply definition consistently within a namespace
• complete, maintained, online documentation
What the Ethnologue offers
Scale: already there
• enumeration of languages
• set of three-letter codes
Change: careful management
• no re-use of codes
• have begun recording revision history
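The two change-management rules above (no re-use of codes, a recorded revision history) can be sketched as a small data structure. This is a hypothetical illustration, not a description of how the Ethnologue is actually maintained: the CodeRegistry class, its method names, and the revision numbers are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class CodeRegistry:
    """Hypothetical registry: codes are never reassigned; changes are logged."""
    active: dict = field(default_factory=dict)    # code -> language name
    retired: dict = field(default_factory=dict)   # code -> name it last denoted
    history: list = field(default_factory=list)   # (revision, action, code, name)

    def assign(self, revision, code, name):
        # A code that was ever used, active or retired, cannot be used again.
        if code in self.active or code in self.retired:
            raise ValueError(f"code {code!r} has already been used")
        self.active[code] = name
        self.history.append((revision, "assign", code, name))

    def retire(self, revision, code):
        # Retiring keeps the code on record so it stays unavailable.
        name = self.active.pop(code)
        self.retired[code] = name
        self.history.append((revision, "retire", code, name))


registry = CodeRegistry()
registry.assign(1, "hop", "Hopi")
for entry in registry.history:
    print(entry)
```

Keeping retired codes in their own table means an old tag in archived data can still be resolved to what it once denoted, which is the point of tracking history.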
What the Ethnologue offers
Definition: single definition, applied quite
consistently
• definition: primary criterion of mutual nonintelligibility as a basis for identifying candidates
for separate literacy, literature
• all categories are of the same type; no language
families, groups, writing systems
What the Ethnologue offers
Documentation
• extensive information maintained for every
language
• new site will provide various reports
• alternate names, location, population, etc.
• related ISO codes, relationship
• return Ethnologue data given an ISO code
• evaluating possibilities for returning results as
XML
Integration with RFC 1766, XML
Ethnologue codes immediately available using
“x-”
“Hopi”:
<body xml:lang="x-hop">
<body xml:lang="x-sil-hop">
• private-use tags not ultimately satisfactory
Integration with RFC 1766, XML
Register thousands of new tags with IANA
• process would not be able to cope
• problems devising that many tags
• create considerable confusion in the single
namespace
Integration with RFC 1766, XML
Register “i-sil-” to specify a namespace
maintained by a particular agency
<body xml:lang="i-sil-hop">
• deals with scale
• creates a namespace with a particular definition
that is consistently applied
• avoids confusion of having a single namespace for
all needs
• allows alternate namespaces
Integration with RFC 1766, XML
Possible refinement: define primary tag “n-”
<body xml:lang="n-sil-hop">
• first sub-tag identifies a registered namespace of
identifiers
• each namespace provides its own operational
definition(s)
• “i-” usage more consistent (languages only)
• “i-” specifies a privileged namespace (doesn’t
require “n-”)
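A rough sketch of how software might take apart the tag forms discussed in these slides ("x-", "i-", and the proposed "n-"). The function name, the ParsedTag fields, and the rule that the first of several sub-tags after "x-" or "i-" names a namespace are assumptions for illustration; "n-" is a proposal in this talk, not part of RFC 1766.

```python
from typing import NamedTuple, Optional


class ParsedTag(NamedTuple):
    kind: str                 # "private", "registered", "namespaced", "standard"
    namespace: Optional[str]  # e.g. "sil", when one is present
    code: str                 # the remaining language identifier


def parse_language_tag(tag: str) -> ParsedTag:
    """Split a tag such as 'n-sil-hop' into kind, namespace, and code."""
    parts = tag.lower().split("-")
    primary, rest = parts[0], parts[1:]
    if primary == "n":
        # Proposed form: the first sub-tag names a registered namespace.
        if len(rest) < 2:
            raise ValueError(f"'n-' tag needs a namespace and a code: {tag!r}")
        return ParsedTag("namespaced", rest[0], "-".join(rest[1:]))
    if primary in ("x", "i"):
        if not rest:
            raise ValueError(f"tag has no sub-tag: {tag!r}")
        kind = "private" if primary == "x" else "registered"
        # Assumed convention: with two or more sub-tags, the first is a namespace.
        if len(rest) > 1:
            return ParsedTag(kind, rest[0], "-".join(rest[1:]))
        return ParsedTag(kind, None, rest[0])
    # Anything else is passed through whole as a standard tag (e.g. "en").
    return ParsedTag("standard", None, tag.lower())


for example in ("x-hop", "x-sil-hop", "i-sil-hop", "n-sil-hop", "en"):
    print(example, parse_language_tag(example))
```

Under this reading, "i-sil-hop" and "n-sil-hop" resolve to the same namespace and code; the difference is only whether the namespace is privileged under "i-" or declared through "n-".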
Conclusions
Language identifiers required for language-specific
processing
Immediate need for thousands of new language identifiers;
in particular, for use in XML
Five problem areas need to be considered in any system
SIL Ethnologue codes address all five problems
Revising RFC 1766 to add a namespace mechanism can
support this and would offer many benefits