Towards automatic enrichment and analysis of
linguistic data for low-density languages
Fei Xia
University of Washington
Joint work with William Lewis and Dan Jinguji
Motivation: theoretical linguistics
• For a particular language (e.g., Yaqui), find the
answers for the following questions:
– What is word order: SVO, SOV, VSO, ….?
– Does it have double-object construction?
– Can a coordinated phrase be discontinuous? (e.g.,
“NP1 Verb and NP2”)
– ….
• We want to know the answers for hundreds of
languages.
Motivation: computational linguistics
• For a particular language, we want to build
– a Part-of-speech tagger and a parser
• Common approach: create a treebank
– a MT system
• Common approach:
– collect parallel data
– test translation divergence (Dorr, 1994; Fox, 2002; Hwa et al., 2002)
Main ideas
• Projecting structures from a resource-rich
language (e.g., English) to a low-density
language.
• Tapping the large body of Web-based
linguistic data using the ODIN dataset
Structure projection
• Previous work
– (Yarowsky & Ngai, 2001): POS tags and NP boundaries
– (Xi & Hwa, 2005): POS tags
– (Hwa et al., 2002): dependency structures
– (Quirk et al., 2005): dependency structures
• Our work:
– Projecting both dependency structures and phrase structures
– It does not require a large amount of parallel data or hand-aligned data.
– It can be applied to hundreds of languages.
Outline
• Background: IGT and ODIN
• Data enrichment
– Word alignment
– Structure projection
– Grammar extraction
• Experiments
• Conclusion and future work
Background: IGT and ODIN
Interlinear Glossed Text (IGT)
Rhoddodd yr athro lyfr i’r bachgen ddoe
Gave-3sg the teacher book to-the boy yesterday
‘The teacher gave a book to the boy yesterday’
(Bailyn, 2001)
ODIN
• Online Database of Interlinear text
• Storing and indexing IGT found in scholarly
documents on the Web
• Searchable by language name, language family,
concept/gram, etc.
• Current size
– 36,439 instances
– 725 languages
Data Enrichment
The goal
• Original IGT: three lines
• Enriched IGT:
– English phrase structure (PS), dependency
structure (DS)
– Source PS and DS
– Word alignment between source and English
translation
Three steps
• Parse the English translation
• Align the source sentence and its English
translation
• Project the English PS and DS onto the
source side
Step 1: Parsing the English translation
The teacher gave a book to the boy yesterday
Step 2: Word alignment
Source-gloss alignment
Gloss-translation alignment
Heuristic word aligner
Gave-3sg the teacher book to-the boy yesterday
The teacher gave a book to the boy yesterday
The aligner aligns two words if they have the same
root form.
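A minimal sketch of such a same-root matcher (the `root` stemmer below is a crude hypothetical stand-in for a real morphological analyzer, not the actual implementation):

```python
def root(word):
    """Crude root extraction: lowercase, keep only the part before the
    first '-' (gloss affix boundary), and strip a few common English
    suffixes. A stand-in for a real stemmer."""
    w = word.lower().split('-')[0]
    for suffix in ('ing', 'ed', 's'):
        if w.endswith(suffix) and len(w) > len(suffix) + 2:
            return w[:-len(suffix)]
    return w

def heuristic_align(gloss, trans):
    """Link gloss position i to translation position j when roots match."""
    return [(i, j) for i, g in enumerate(gloss)
                   for j, t in enumerate(trans) if root(g) == root(t)]

gloss = 'Gave-3sg the teacher book to-the boy yesterday'.split()
trans = 'The teacher gave a book to the boy yesterday'.split()
links = heuristic_align(gloss, trans)
```

On the Welsh example this links “Gave-3sg” to “gave” and “to-the” to “to”, but leaves “a” unaligned: high precision, low recall.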
Limitation of heuristic word aligner
1SG pig-NNOM.SG grasp-PST and cat-NNOM.SG
I caught the pig and the cat
Statistical word aligner
• GIZA++ package (Och and Ney, 2000)
– It implements the IBM models (Brown et al., 1993)
– Widely used in statistical MT field
• Parallel corpus formed by the gloss and
translation lines of all the IGT examples in
ODIN.
Improving word aligner
• Train both directions (gloss→trans, trans→gloss) and combine the results
• Split words in the gloss line into morphemes
1SG pig-NNOM.SG grasp-PST and cat-NNOM.SG
→ 1SG pig -NNOM -SG grasp -PST and cat -NNOM -SG
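A sketch of such a morpheme splitter, assuming (as the examples suggest) that “-” and “.” mark affix boundaries in the gloss:

```python
import re

def split_gloss(line):
    """Split each gloss token into a root plus '-'-prefixed affix tags,
    so 'pig-NNOM.SG' becomes ['pig', '-NNOM', '-SG']."""
    tokens = []
    for tok in line.split():
        parts = re.split(r'[-.]', tok)
        tokens.append(parts[0])
        tokens.extend('-' + p for p in parts[1:] if p)
    return tokens
```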
Improving word aligner (cont)
Pedro-NOM Goyo-ACC yesterday horse-ACC steal-PRFV say-PRES
Pedro says Goyo has stolen the horse yesterday .
Add (x,x) sentence pairs:
(Pedro, Pedro)
(Goyo, Goyo)
…..
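A sketch of this augmentation, assuming the bitext is a list of (gloss tokens, translation tokens) pairs with the gloss already split into morphemes (the example data below is hypothetical):

```python
def add_identity_pairs(bitext):
    """Append one-word (x, x) sentence pairs for every word that occurs on
    both sides of some pair, e.g. names like 'Pedro' that the gloss copies
    verbatim; this nudges the statistical aligner toward aligning them."""
    shared = set()
    for gloss, trans in bitext:
        shared |= set(gloss) & set(trans)
    return bitext + [([w], [w]) for w in sorted(shared)]

bitext = [('Pedro -NOM Goyo -ACC yesterday horse -ACC steal -PRFV say -PRES'.split(),
           'Pedro says Goyo has stolen the horse yesterday .'.split())]
augmented = add_identity_pairs(bitext)
```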
Step 3: Projecting structures
• Projecting DS
– Previous work:
• (Hwa et al., 2002)
• (Quirk et al., 2005)
• Projecting PS
Projecting phrase structure
Projecting PS
• Copy the English PS and remove all the
unaligned English words
• Replace English words with corresponding
source words
• Starting from the root, reorder children of
each node.
• Attach unaligned source words
Starting with English PS
The teacher gave a book to the boy yesterday
Replacing English words
Reordering children
Calculating phrase spans
“Reordering” NP and VP
Removing VP
Removing a node in PS
After removing VP
Reordering VBD and NP
Removing NP
Merging IN and DT
Before “reordering”
After reordering
Reordering two children of x: y1 and y2
Let Si be the phrase span of yi:
• S1 and S2 don’t overlap: reorder the two nodes according to their spans.
• S1 ⊂ S2: remove y2
• S1 ⊃ S2: remove y1
• S1 and S2 overlap, and neither is a strict
subset of the other: remove both nodes.
If y1 and y2 are leaf nodes, merge them.
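The case analysis above can be sketched as a small decision function, with each phrase span given as a (lo, hi) interval of source-word positions (a hypothetical encoding; “removing” a node means replacing it by its children):

```python
def reorder_or_remove(s1, s2):
    """Return the action for sibling nodes y1, y2 whose phrase spans are
    the intervals s1 = (lo1, hi1) and s2 = (lo2, hi2)."""
    (lo1, hi1), (lo2, hi2) = s1, s2
    if hi1 < lo2:
        return 'keep order'    # disjoint, y1 precedes y2
    if hi2 < lo1:
        return 'swap'          # disjoint, y2 precedes y1
    if s1 != s2 and lo2 <= lo1 and hi1 <= hi2:
        return 'remove y2'     # S1 is a strict subset of S2
    if s1 != s2 and lo1 <= lo2 and hi2 <= hi1:
        return 'remove y1'     # S2 is a strict subset of S1
    return 'remove both'       # overlap, neither strictly contains the other
```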
Attaching unaligned source words
Information that can be extracted
from enriched IGT
• Grammars for source language
• Transfer rules
• Examples with interesting properties (e.g.,
crossing dependencies)
Grammars
S → VBD NP NP PP NP
NP → DT NN
NP → NN
PP → IN+DT NN
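Such rules can be read off a projected tree by counting the production at each internal node. A sketch, with trees as nested (label, child, …) tuples, leaves as plain strings, and a hypothetical projected tree for the Welsh example:

```python
from collections import Counter

def extract_cfg(tree, counts=None):
    """Count non-lexical CFG productions in a projected phrase structure."""
    if counts is None:
        counts = Counter()
    label, children = tree[0], tree[1:]
    rhs = tuple(c[0] for c in children if isinstance(c, tuple))
    if rhs:  # skip lexical rules like NN -> 'athro'
        counts[label, rhs] += 1
        for child in children:
            if isinstance(child, tuple):
                extract_cfg(child, counts)
    return counts

# Hypothetical projected tree for the Welsh example
welsh = ('S', ('VBD', 'Rhoddodd'),
              ('NP', ('DT', 'yr'), ('NN', 'athro')),
              ('NP', ('NN', 'lyfr')),
              ('PP', ('IN+DT', "i'r"), ('NN', 'bachgen')),
              ('NP', ('NN', 'ddoe')))
rules = extract_cfg(welsh)
```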
Examples of crossing dependencies
Inepo kow-ta
bwuise-k into mis-ta
1SG pig-NNOM.SG grasp-PST and cat-NNOM.SG
I caught the pig and the cat
(Martinez Fabian, 2006)
Outline
• Background: IGT and ODIN
• Data enrichment
• Experiments
• Conclusion and future work
Experiments
• Test on a small set of IGT examples for
seven languages:
– SVO: German (GER) and Hausa (HUA)
– SOV: Korean (KKN) and Yaqui (YAQ)
– VSO: Irish (GLI) and Welsh (WLS)
– VOS: Malagasy (MEX)
Test set
Numbers in the last row come from the Ethnologue
(Gordon, 2005)
Human annotators checked system output and corrected
- English DS
- word alignment
- source DS
Heuristic word aligner
→ High precision, low recall.
Statistical word aligner: training data
When gloss words are
not split into morphemes
When gloss words are split into morphemes
A significant improvement: 0.812 → 0.909
When (x,x) pairs are added to training data
Adding (x,x) pairs: 0.909 → 0.919
Combining two word aligners: 0.919 → 0.928
Projection results
Oracle results with perfect English DS
and/or word alignment
Potential improvement: 81.45 → 90.64
Remaining errors
• Oracle result: 90.64
• Manually checked 43 errors in German
data:
– 26 (60.5%) due to translation divergence
(e.g., head switching)
– 8 (18.6%) due to mistakes of the projection
heuristics
– 9 (20.9%) due to non-exact translation
An example of non-exact translation
der Antrag des oder der Dozenten
the petition of-the.SG or of-the.PL docent.MSC
the petition of the docent
(Daniels, 2001)
Extracted CFG for Yaqui
• S → NP VP            49/77
• S → VP               9/77
• VP → NP Verb         23/95
• VP → Verb            17/95
• VP → NP NP Verb      2/95
• VP → NP Verb CC NP   2/95
→ Yaqui looks like an SOV language
Extracted CFGs
Conclusion
• We present a methodology for projecting structure (DS
and PS) from English onto source data.
• Applied to seven languages with promising results:
– Word alignment:
94.03
– Source DS:
81.45
– Source DS (oracle): 90.64
• From enriched data, we extract CFGs and examples of
crossing dependencies.
Future direction: theoretical linguistics
• For a particular language (e.g., Yaqui), find the
answers for the following questions:
– What is word order: SVO, SOV, VSO, ….?
– Does it have double-object construction?
– Can a coordinated phrase be discontinuous? (e.g.,
“NP1 Verb and NP2”)
– ….
• Our plan:
– Improve current algorithms
– Test our system on more languages
Future direction: computational linguistics
For a particular language, we want to build
• a Part-of-speech tagger and a parser
– Our plan: use enriched data as “seed” and
experiment with prototype-driven learning strategies
(Haghighi and Klein, 2006)
• a MT system
– Our plan:
• Use enriched data as “seed”, as in (Quirk and Corston-Oliver,
2006)
• Test translation divergence automatically for dozens or even
hundreds of languages.
Thank you
Backup slides
Structural queries on the source side
• Find examples of double objects
• Find examples of long distance wh-movements
• Determine the word order between
– subject and VP
– noun and relative clause
– verb and PP
Need to know the structure of the source
sentence
Gloss-translation alignment
• Both are in “English”
1SG pig-NNOM.SG grasp-PST and cat-NNOM.SG
I caught the pig and the cat
• We experimented with two word aligners
Combining two word aligners
(1) Combining the alignment output: union, intersection,
refined.
(2) Add the aligned pairs produced by the heuristic word aligner
to the training data
(3) Modify the heuristic word aligner so that two words are
aligned if
– they have the same root form, or
– they are good translations according to the translation model
produced by GIZA++
(3) yields a modest gain: (0.914, 0.919) → 0.928
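A sketch of option (1), with alignments as sets of (i, j) word-index links; the “refined” case here is a simplified version of the usual refined symmetrization heuristic (the full version also considers neighboring links), and the example alignments are hypothetical:

```python
def combine(a1, a2, method='union'):
    """Combine two word alignments (sets of (i, j) index links)."""
    if method == 'union':
        return a1 | a2
    if method == 'intersection':
        return a1 & a2
    # 'refined': start from the intersection, then add links from the
    # union whose words are still unaligned on both sides
    combined = set(a1 & a2)
    for i, j in sorted((a1 | a2) - combined):
        if all(i != x for x, _ in combined) and all(j != y for _, y in combined):
            combined.add((i, j))
    return combined

a1 = {(0, 0), (1, 1), (2, 3)}   # hypothetical gloss->trans output
a2 = {(0, 0), (1, 2), (4, 4)}   # hypothetical trans->gloss output
```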
Projecting DS
• Copy the English DS and remove all the
unaligned English words
• Replace English words with corresponding
source words
• Remove duplicates if any
• Attach unaligned source words
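The first three steps can be sketched over a child → head map, assuming word tokens are made unique (a hypothetical encoding, not the paper’s data structures); attaching unaligned source words is left out:

```python
def project_ds(parent, align):
    """Project an English DS onto the source side.
    parent: dict mapping each English word to its head ('ROOT' at the top).
    align:  dict mapping English words to source words (None if unaligned)."""
    def lifted_head(w):
        # climb past unaligned English words (this removes them from the tree)
        h = parent[w]
        while h != 'ROOT' and align.get(h) is None:
            h = parent[h]
        return h
    edges = set()                        # a set drops duplicate edges
    for w in parent:
        if align.get(w) is None:
            continue                     # drop unaligned English words
        h = lifted_head(w)
        head = 'ROOT' if h == 'ROOT' else align[h]
        if head != align[w]:             # avoid self-loops from many-to-one links
            edges.add((head, align[w]))
    return edges

# Yaqui example: 'Inepo kow-ta bwuise-k into mis-ta'
# ('the1'/'the2' distinguish the two English determiners)
parent = {'caught': 'ROOT', 'I': 'caught', 'pig': 'caught',
          'the1': 'pig', 'and': 'pig', 'cat': 'and', 'the2': 'cat'}
align = {'I': 'Inepo', 'caught': 'bwuise-k', 'pig': 'kow-ta',
         'and': 'into', 'cat': 'mis-ta', 'the1': None, 'the2': None}
deps = project_ds(parent, align)
```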
Starting with English DS
The teacher gave a book to the boy yesterday
Replacing English words with source words
Removing duplicates
Attaching unaligned source words
The heuristics described in (Quirk et al., 2005):
Summary of the DS projection algorithm
Links between the two structures
Links between two structures
Links between two structures
One can extract transfer rules, treelets etc.
Are MEX and GLI VOS or VSO?
• MEX:
– S → VP …     90/102
– S → Verb …   11/102
• GLI:
– S → VP NP …  22/41
– S → Verb …   19/41