Eliciting Features from Minor Languages
Alison Alvarez [email protected]
Lori Levin [email protected]
Robert Frederking [email protected]
Erik Peterson [email protected]
Jeff Good (MPI Leipzig) [email protected]
Max Planck Institute for Evolutionary Anthropology
Deutscher Platz 6
04103 Leipzig
Language Technologies Institute, Carnegie Mellon University
5000 Forbes Avenue, Pittsburgh, PA 15217
Overview
This research is part of the AVENUE Machine Translation
Project. AVENUE is supported by the US National Science
Foundation, NSF grant number IIS-0121-631.
In the field of machine translation, fully aligned and tagged
translation corpora are considered to be one of the most valuable
resources for automatically training translation systems.
However, among minority languages such resources are hard to
find. It is possible to overcome this obstacle by using techniques
inspired by field linguistics, that is, by drawing on bilingual
informants to translate and align given sentences. Field linguists
have relied on questionnaires that have remained relatively static
over a number of years. We want the flexibility to change the
questionnaire to reflect different semantic domains, different
goals for machine translation systems, different levels of detail,
etc. We also want the questionnaire to be available in multiple
languages. For example, we would want a version of the
questionnaire in Spanish for use by Latin American minority
language speakers. We also want flexibility in lexical selection in
order to avoid cultural bias and to choose appropriate lexical
items for the major language. This paper will look at methods for
specifying the scope and depth of an elicitation corpus as well as
methods for quick design and implementation of elicitation
corpora.
The resulting corpus can also be used as a test suite to explore existing
machine translation systems or to design far-reaching corpora for
studying low-resource languages.
Our Goals
1. Tools for semi-automated corpus design:
• Test suite for MT
• Structured corpus for input to machine learning

Feature Specification
<feature>
  <feature-name>np-my-number</feature-name>
  <value>
    <value-name>num-sg</value-name>
  </value>
  <value>
    <value-name>num-pl</value-name>
  </value>
  <value>
    <value-name>num-dual</value-name>
  </value>
  <note>
    Notes for analysis of data:
    CS, 2.1.2.4.1 page 38, seem
    to imply that some
    combinations of numbers are
    more expected than others
  </note>
</feature>

Feature Structure Design
A control language is used to define the size and scope of the set of feature structures that will
be used by GenKit to generate the corpus.
((subj ((np-my-general-type pronoun-type common-noun-type)
        (np-my-person person-first person-second person-third)
        (np-my-number num-sg num-pl)
        (np-my-biological-gender bio-gender-male bio-gender-female)
        (np-my-function fn-predicatee)))
 {[(predicate ((np-my-general-type common-noun-type)
               (np-my-definiteness definiteness-minus)
               (np-my-person person-third)
               (np-my-function predicate))) (c-my-copula-type role)]
  [(predicate ((adj-my-general-type quality-type)))
   (c-my-copula-type attributive)]
  [(predicate ((np-my-general-type common-noun-type)
               (np-my-person person-third)
               (np-my-definiteness definiteness-plus)
               (np-my-function predicate))) (c-my-copula-type identity)]}
 (c-my-secondary-type secondary-copula) (c-my-polarity #all)
 (c-my-function fn-main-clause)(c-my-general-type declarative)
 (c-my-speech-act sp-act-state) (c-v-my-grammatical-aspect gram-aspect-neutral)
 (c-v-my-lexical-aspect state) (c-v-my-absolute-tense past present future)
 (c-v-my-phase-aspect durative))
Callouts in the figure:
• “Multiply out by these lists of values”: features that list several values, such as (np-my-person person-first person-second person-third)
• “Disjoint set of copula types and their predicates”: the bracketed alternatives inside { }
• “Use all values of polarity”: (c-my-polarity #all)
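The "multiply out" behaviour of the control language can be illustrated with a small Python sketch. This is not GenKit or the project's own expander: the dictionary encoding, the function names `multiply_out` and `expand`, and the assumption that `#all` for polarity expands to `polarity-positive` and `polarity-negative` are all hypothetical. Only the feature and value names come from the poster, and only two of the clause-level features are included to keep the example short.

```python
from itertools import product

# Subject spec: each feature lists the values to multiply out.
SUBJ = {
    "np-my-general-type": ["pronoun-type", "common-noun-type"],
    "np-my-person": ["person-first", "person-second", "person-third"],
    "np-my-number": ["num-sg", "num-pl"],
    "np-my-biological-gender": ["bio-gender-male", "bio-gender-female"],
    "np-my-function": ["fn-predicatee"],
}

# Disjoint alternatives: each copula type is tied to its own predicate spec.
ALTERNATIVES = [
    ({"np-my-general-type": ["common-noun-type"],
      "np-my-definiteness": ["definiteness-minus"],
      "np-my-person": ["person-third"],
      "np-my-function": ["predicate"]}, "role"),
    ({"adj-my-general-type": ["quality-type"]}, "attributive"),
    ({"np-my-general-type": ["common-noun-type"],
      "np-my-person": ["person-third"],
      "np-my-definiteness": ["definiteness-plus"],
      "np-my-function": ["predicate"]}, "identity"),
]

# Clause-level features (subset). The two polarity values are an ASSUMED
# expansion of "#all"; the poster only shows polarity-positive.
CLAUSE = {
    "c-my-polarity": ["polarity-positive", "polarity-negative"],
    "c-v-my-absolute-tense": ["past", "present", "future"],
}


def multiply_out(spec):
    """Yield one dict per combination of values (Cartesian product)."""
    names = list(spec)
    for combo in product(*(spec[name] for name in names)):
        yield dict(zip(names, combo))


def expand(subj, alternatives, clause):
    """Combine every subject with every alternative and clause setting."""
    for s in multiply_out(subj):
        for pred_spec, copula in alternatives:
            for p in multiply_out(pred_spec):
                for c in multiply_out(clause):
                    yield {"subj": s, "predicate": p,
                           "c-my-copula-type": copula, **c}


structures = list(expand(SUBJ, ALTERNATIVES, CLAUSE))
# 24 subjects x 3 copula alternatives x 6 clause settings
print(len(structures))  # 432
```

Even this reduced specification yields several hundred feature structures, which is why a control language for limiting size and scope matters.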
2. A user interface for producing high quality, word-aligned
parallel corpora (Elicitation Tool)
Feature Structures
((subj ((np-my-general-type pronoun-type) (np-my-person person-third)
        (np-my-number num-sg) (np-my-biological-gender bio-gender-male)
        (np-my-function fn-predicatee)(np-my-animacy anim-human)
        (np-my-info-function info-neutral)(np-d-my-distance-from-speaker distance-neutral)
        (np-pronoun-reflexivity reflexivity-n/a)(np-my-emphasis emph-no-emph)
        (np-my-semantic-class NEED_VALUES)(np-pronoun-exclusivity exclusivity-n/a)
        (np-pronoun-antecedent-function antecedent-n/a)))
 (predicate ((np-my-general-type common-noun-type) (np-my-person person-third)
             (np-my-function predicate)(np-my-animacy anim-human)
             (np-my-info-function info-neutral)
             (np-d-my-distance-from-speaker distance-neutral)
             (np-pronoun-reflexivity reflexivity-n/a)(np-my-emphasis emph-no-emph)
             (np-my-number num-sg)(np-my-semantic-class NEED_VALUES)
             (np-pronoun-exclusivity exclusivity-n/a)
             (np-pronoun-antecedent-function antecedent-n/a)))
 (c-my-copula-type role) (c-my-secondary-type secondary-copula)
 (c-my-polarity polarity-positive) (c-my-function fn-main-clause)
 (c-my-general-type declarative)(c-my-speech-act sp-act-state)
 (c-v-my-grammatical-aspect gram-aspect-neutral) (c-v-my-lexical-aspect state)
 (c-v-my-absolute-tense past)(c-v-my-phase-aspect durative)
 (c-my-imperative-degree imp-degree-n/a)(c-my-ynq-type ynq-n/a)
 (c-my-actor's-sem-role actor-semrole-neutral)(c-my-minor-type minor-n/a)
 (c-my-headedness-rc rc-head-n/a)(c-my-answer-type ans-n/a)
 (c-my-restrictivess-rc rc-restrictive-n/a)(c-my-focus-rc focus-n/a)
 (c-my-actor's-status actor-neutral)(c-my-gaps-function gap-n/a)
 (c-my-relative-tense relative-n/a))
The elicitation tool provides a simple interface for bilingual informants with no linguistic
training and limited computer skills to translate and word-align a corpus in some source
language. The output of the elicitation tool is a text file containing triplets of eliciting sentence,
elicited sentence, and alignment. The elicitation tool can produce bilingual glossaries based on
the aligned corpus. It also has a simple "auto-align" option to add alignments for unambiguous
word pairs in the same file.
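The poster does not specify the tool's file format or its auto-align rule, so the following Python sketch works purely in memory and rests on stated assumptions: the `Triplet` class, the `glossary` and `auto_align` functions, and the reading of "unambiguous" as "a source word that has been aligned to exactly one target word in the corpus" are all hypothetical. The alignment links in the example are illustrative, not project data.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Triplet:
    eliciting: list                               # source-language tokens
    elicited: list                                # informant's translation tokens
    alignment: set = field(default_factory=set)   # (source index, target index)


def glossary(corpus):
    """Collect, for each source word, the target words it was aligned to."""
    gloss = defaultdict(set)
    for t in corpus:
        for i, j in t.alignment:
            gloss[t.eliciting[i].lower()].add(t.elicited[j].lower())
    return gloss


def auto_align(corpus):
    """Add links for source words with exactly one known translation.

    A link is added only when that translation occurs exactly once in the
    elicited sentence, so ambiguous cases are left for the informant.
    """
    gloss = glossary(corpus)
    unambiguous = {s: next(iter(ts)) for s, ts in gloss.items() if len(ts) == 1}
    for t in corpus:
        for i, word in enumerate(t.eliciting):
            target = unambiguous.get(word.lower())
            if target is None:
                continue
            hits = [j for j, w in enumerate(t.elicited) if w.lower() == target]
            if len(hits) == 1:
                t.alignment.add((i, hits[0]))


corpus = [
    # Hand-aligned sentence (illustrative links): I-watashi, was-deshita, teacher-sensei
    Triplet("I was a teacher".split(), "watashi wa sensei deshita".split(),
            {(0, 0), (1, 3), (3, 2)}),
    # Not yet aligned
    Triplet("I am a teacher".split(), "watashi wa sensei desu".split()),
]
auto_align(corpus)
print(sorted(corpus[1].alignment))  # [(0, 0), (3, 2)]
```

The second sentence picks up the links for "I" and "teacher" from the first, while "am" and "desu" stay unaligned because nothing in the corpus supports that pairing yet.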
Mapping
[Figure: a feature structure, the corresponding English sentence, and the elicited Japanese translation. The figure breaks the structure into the groups (person first)(num sg), (person third)(animacy human)(identifiability -), and (tense past).]
((Subj((person first)(num sg)
(animacy human)(head-token-1 I)))
(Obj((person third)
(animacy human)(identifiability -)…))
(tense past)…)
I was a teacher
watashi wa sensei deshita
Minimal Pair Linking
Translation/Alignment
Sentence Selection
Feature Detection
Feature structures are multi-level sets of feature-value pairs that reflect the grammatical structures intended for elicitation. When paired with an English grammar
and lexicon, the full feature structure listed under “Feature Structures” generates ‘He was a teacher.’
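To make "multi-level sets of feature-value pairs" concrete, here is a minimal Python sketch that reads the parenthesized notation into nested dictionaries. The functions `tokenize`, `parse`, and `to_dict` are hypothetical helpers written for this illustration, and the input is an abbreviated excerpt of the poster's structure, not the full one.

```python
import re


def tokenize(text):
    """Split into parentheses and whitespace-delimited atoms."""
    return re.findall(r"[()]|[^\s()]+", text)


def parse(tokens):
    """Parse one expression into nested Python lists (consumes tokens)."""
    tok = tokens.pop(0)
    if tok != "(":
        return tok
    items = []
    while tokens[0] != ")":
        items.append(parse(tokens))
    tokens.pop(0)  # discard the closing ")"
    return items


def to_dict(pairs):
    """Turn a list of [feature, value] pairs into a dict, recursing into
    values that are themselves lists of pairs."""
    out = {}
    for name, value in pairs:
        out[name] = to_dict(value) if isinstance(value, list) else value
    return out


# Abbreviated excerpt of the poster's feature structure.
text = """
((subj ((np-my-general-type pronoun-type) (np-my-person person-third)
        (np-my-number num-sg) (np-my-biological-gender bio-gender-male)))
 (predicate ((np-my-general-type common-noun-type) (np-my-number num-sg)))
 (c-my-copula-type role) (c-v-my-absolute-tense past))
"""
fs = to_dict(parse(tokenize(text)))
print(fs["subj"]["np-my-person"])   # person-third
print(fs["c-v-my-absolute-tense"])  # past
```

The top level holds clause features, while `subj` and `predicate` each hold their own set of pairs, which is the two-level nesting the poster describes.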
[Figure: a minimal pair of feature structures that differ in tense (past vs. present), each with its elicited translation. Comparison marks in the figure: === ≠]
((Subj((person first)(num sg)
(animacy human)(head-token-1 I)))
(Obj((person third)
(animacy human)(identifiability -)…))
(tense past)…)
watashi wa sensei deshita
((Subj((person first)(num sg)
(animacy human)(head-token-1 I)))
(Obj((person third)
(animacy human)(identifiability -)…))
(tense present)…)
watashi wa sensei desu
Difference Detection
The Elicitation Tool
3. Automated learning of morpho-syntax for low-resource
languages
“I was a teacher”
Watashi wa sensei deshita
“I am a teacher”
Watashi wa sensei desu
[Figure: feature groups of the present-tense structure: Subj (person first)(num sg); Obj (person third)(animacy human)(identifiability -)…; (tense present)]
Substitution mismatch
Difference is found on ME
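The minimal pair above ("I was a teacher" / "I am a teacher") suggests how difference detection might work. The poster does not describe the project's actual learning procedure, so this is a naive Python sketch under stated assumptions: the dictionary encoding and the functions `flatten`, `feature_diff`, and `token_diff` are hypothetical, and the token comparison simply assumes both translations have the same number of words.

```python
def flatten(fs, prefix=()):
    """Map each feature path (a tuple of names) to its atomic value."""
    flat = {}
    for name, value in fs.items():
        if isinstance(value, dict):
            flat.update(flatten(value, prefix + (name,)))
        else:
            flat[prefix + (name,)] = value
    return flat


def feature_diff(fs_a, fs_b):
    """Return the feature paths whose values differ between two structures."""
    a, b = flatten(fs_a), flatten(fs_b)
    return {path: (a.get(path), b.get(path))
            for path in sorted(a.keys() | b.keys())
            if a.get(path) != b.get(path)}


def token_diff(sent_a, sent_b):
    """Position-wise word differences (assumes equal-length sentences)."""
    words_a, words_b = sent_a.split(), sent_b.split()
    return [(i, x, y) for i, (x, y) in enumerate(zip(words_a, words_b)) if x != y]


past = {
    "Subj": {"person": "first", "num": "sg", "animacy": "human"},
    "Obj": {"person": "third", "animacy": "human", "identifiability": "-"},
    "tense": "past",
}
present = {**past, "tense": "present"}

print(feature_diff(past, present))
# {('tense',): ('past', 'present')}
print(token_diff("watashi wa sensei deshita", "watashi wa sensei desu"))
# [(3, 'deshita', 'desu')]
```

When the only feature that changes is tense and the only word that changes is the final one, the pair is evidence that the deshita/desu contrast marks tense in the elicited language.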