Source Language Adaptation
for Resource-Poor Machine Translation
Pidong Wang, National University of Singapore
Preslav Nakov, QCRI, Qatar Foundation
Hwee Tou Ng, National University of Singapore
EMNLP-CoNLL 2012, July 12, 2012, Jeju, Korea

Introduction
Overview
 Statistical Machine Translation (SMT) systems
Need large sentence-aligned bilingual corpora (bi-texts).
 Problem
Such training bi-texts do not exist for most languages.
 Idea
Adapt a bi-text from a related resource-rich language.
Idea & Motivation
 Idea: reuse bi-texts from related resource-rich
languages to improve resource-poor SMT
 Related languages have
 overlapping vocabulary (cognates)
 e.g., casa (‘house’) in Spanish, Portuguese
 similar
 word order
 syntax
Resource-rich vs. Resource-poor Languages
 Related EU – non-EU languages
  Swedish – Norwegian
  Bulgarian – Macedonian
 Related EU languages
  Spanish – Catalan
  Czech – Slovak
  Irish – Scottish Gaelic
  Standard German – Swiss German
 Related languages outside Europe
  MSA – Dialectal Arabic (e.g., Egyptian, Gulf, Levantine, Iraqi)
  Hindi – Urdu
  Turkish – Azerbaijani
  Russian – Ukrainian
  Malay – Indonesian
We will explore two of these pairs: Malay – Indonesian and Bulgarian – Macedonian.
Our Main Focus:
Improving
Indonesian-English SMT
Using Malay-English
Malay vs. Indonesian
~50% exact word overlap
Malay
 Semua manusia dilahirkan bebas dan samarata dari segi
kemuliaan dan hak-hak.
 Mereka mempunyai pemikiran dan perasaan hati dan
hendaklah bertindak di antara satu sama lain dengan
semangat persaudaraan.
Indonesian
 Semua orang dilahirkan merdeka dan mempunyai martabat
dan hak-hak yang sama.
 Mereka dikaruniai akal dan hati nurani dan hendaknya
bergaul satu sama lain dalam semangat persaudaraan.
from Article 1 of the Universal Declaration of Human Rights
Malay Can Look “More Indonesian”…
~75% exact word overlap
Malay
 Semua manusia dilahirkan bebas dan samarata dari segi kemuliaan dan hak-hak.
 Mereka mempunyai pemikiran dan perasaan hati dan hendaklah bertindak di antara satu sama lain dengan semangat persaudaraan.
"Indonesian": Malay post-edited to look Indonesian (by an Indonesian speaker)
 Semua manusia dilahirkan bebas dan mempunyai martabat dan hak-hak yang sama.
 Mereka mempunyai pemikiran dan perasaan dan hendaklah bergaul satu sama lain dalam semangat persaudaraan.
from Article 1 of the Universal Declaration of Human Rights
We attempt to do this automatically: adapt Malay to look Indonesian. Then, use it to improve SMT…
Method at a Glance
Step 1: Adaptation
 Adapt the large Malay-English bi-text into an "Indonesian"-English bi-text (the available Indonesian-English bi-text is small, i.e., resource-poor).
Step 2: Combination
 Combine the Indonesian-English bi-text with the adapted "Indonesian"-English bi-text and train Indonesian-English SMT on the result.
Note that we have no Malay-Indonesian bi-text!
Step 1:
Adapting Malay-English
to “Indonesian”-English
Word-Level Bi-text Adaptation:
Overview
Given a Malay-English sentence pair
1. Adapt the Malay sentence to “Indonesian”
• Word-level paraphrases
• Phrase-level paraphrases
• Cross-lingual morphology
2. Pair the adapted "Indonesian" sentence with the English side of the Malay-English sentence pair
Thus, we generate a new “Indonesian”-English sentence pair.
Example (Malay): KDNK Malaysia dijangka cecah 8 peratus pada tahun 2010.
Replace each Malay word with its weighted Indonesian paraphrase options (a confusion network), then decode using a large Indonesian LM to obtain n-best "Indonesian" adaptations (sketched below).
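A minimal sketch of this decoding step, under simplified assumptions: the per-word Indonesian options and their weights below are toy values, the language model is a stand-in, and paths are enumerated exhaustively only for illustration (a real decoder would use dynamic programming over the confusion network).

    # Sketch: pick n-best "Indonesian" adaptations of a Malay sentence by scoring
    # all paths through a word-level confusion network with an Indonesian LM.
    # Toy options/weights and a dummy LM; exhaustive enumeration for illustration only.
    import itertools, math

    def adapt_nbest(options, lm_logprob, n=3):
        """options: one list of (indonesian_word, weight) per Malay token."""
        scored = []
        for path in itertools.product(*options):
            words = [w for w, _ in path]
            score = sum(math.log(p) for _, p in path) + lm_logprob(words)
            scored.append((score, " ".join(words)))
        return sorted(scored, reverse=True)[:n]

    # Toy confusion network for "KDNK Malaysia dijangka cecah ...":
    options = [[("PDB", 0.6), ("KDNK", 0.4)],
               [("Malaysia", 1.0)],
               [("diperkirakan", 0.7), ("dijangka", 0.3)],
               [("mencapai", 0.8), ("cecah", 0.2)]]
    dummy_lm = lambda words: 0.0   # a real system would query a large Indonesian LM here
    for score, sentence in adapt_nbest(options, dummy_lm):
        print(round(score, 2), sentence)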
Pair each n-best "Indonesian" adaptation with the English counterpart:
Malaysia’s GDP is expected to reach 8 per cent in 2010.
Thus, we generate a new “Indonesian”-English bi-text.
Word-Level Adaptation:
Extracting Paraphrases
 Indonesian translations for Malay words: pivoting over English
  [Diagram omitted: the words of a Malay sentence are aligned to an English sentence in the ML-EN bi-text; English words are in turn aligned to Indonesian words in the IN-EN bi-text, yielding Indonesian paraphrase candidates for each Malay word.]
 Weights: computed by pivoting over the shared English translations (see the formula below)
Note: we have no Malay-Indonesian bi-text, so we pivot.
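The slide does not spell out the paraphrase weights; a standard way to score a pivoted Malay-Indonesian paraphrase pair (an assumption here, not necessarily the exact formula used) is to sum over the shared English translations, using the lexical translation probabilities estimated from the two word-aligned bi-texts:

    p(\mathrm{in} \mid \mathrm{ml}) \;=\; \sum_{\mathrm{en}} p(\mathrm{in} \mid \mathrm{en}) \, p(\mathrm{en} \mid \mathrm{ml})

where the sum runs over English words en aligned to the Malay word ml in the ML-EN bi-text and to the Indonesian word in in the IN-EN bi-text.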
Word-Level Adaptation:
Issue 1
IN-EN bi-text is small, thus:
 Unreliable IN-EN word alignments → bad ML-IN paraphrases
 Solution:
 improve IN-EN alignments using the ML-EN bi-text
 concatenate: IN-EN*k + ML-EN
» k ≈ |ML-EN| / |IN-EN|
 run word alignment on the concatenation
 keep the alignments for one copy of IN-EN only (see the sketch below)
Works because of cognates between Malay and Indonesian.
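A minimal sketch of this concatenation trick, assuming plain one-sentence-per-line files (the file names and the choice of aligner are hypothetical; any standard word aligner would be run on the concatenated corpus, after which only the alignment lines of the first IN-EN copy are kept):

    # Sketch: build IN-EN * k + ML-EN for word alignment, keeping track of how many
    # alignment lines belong to the first IN-EN copy. File names are hypothetical.
    def read_lines(path):
        with open(path, encoding="utf-8") as f:
            return [line.rstrip("\n") for line in f]

    in_src, in_tgt = read_lines("in.txt"), read_lines("en_of_in.txt")   # small IN-EN bi-text
    ml_src, ml_tgt = read_lines("ml.txt"), read_lines("en_of_ml.txt")   # large ML-EN bi-text

    k = max(1, round(len(ml_src) / len(in_src)))    # k is roughly |ML-EN| / |IN-EN|

    with open("concat.src", "w", encoding="utf-8") as src_out, \
         open("concat.tgt", "w", encoding="utf-8") as tgt_out:
        for _ in range(k):                          # k copies of the IN-EN bi-text first
            src_out.write("\n".join(in_src) + "\n")
            tgt_out.write("\n".join(in_tgt) + "\n")
        src_out.write("\n".join(ml_src) + "\n")     # then one copy of ML-EN
        tgt_out.write("\n".join(ml_tgt) + "\n")

    # After word-aligning concat.src/concat.tgt, keep only the first len(in_src)
    # alignment lines, i.e., the alignments of one IN-EN copy.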
Word-Level Adaptation:
Issue 2
IN-EN bi-text is small, thus:
 Small IN vocabulary for the ML-IN paraphrases
 Solution:
 Add cross-lingual morphological variants:
 Given ML word: seperminuman
 Find ML lemma: minum
 Propose all known IN words sharing the same lemma:
»
diminum, diminumkan, diminumnya, makan-minum,
makananminuman, meminum, meminumkan, meminumnya,
meminum-minuman, minum, minum-minum, minum-minuman,
minuman, minumanku, minumannya, peminum, peminumnya,
perminum, terminum
Note: the IN variants are from a larger monolingual IN text (see the sketch below).
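A minimal sketch of this expansion, under simplified assumptions: the hand-written lemma table below stands in for real Malay/Indonesian morphological analysis, and the Indonesian vocabulary would come from the large monolingual Indonesian text.

    # Sketch: expand a Malay word into Indonesian candidates that share its lemma.
    # The toy lemma table replaces a real morphological analyzer.
    from collections import defaultdict

    def build_lemma_index(indonesian_vocab, lemmatize):
        """Map each lemma to the Indonesian surface forms seen in monolingual text."""
        index = defaultdict(set)
        for word in indonesian_vocab:
            index[lemmatize(word)].add(word)
        return index

    def cross_lingual_variants(ml_word, lemmatize, in_lemma_index):
        """Indonesian candidates for a Malay word, via the shared lemma."""
        return sorted(in_lemma_index.get(lemmatize(ml_word), set()))

    toy_lemmas = {"seperminuman": "minum", "minuman": "minum", "diminum": "minum",
                  "peminum": "minum", "minum": "minum"}
    lemmatize = lambda w: toy_lemmas.get(w, w)

    in_index = build_lemma_index(["minum", "minuman", "diminum", "peminum"], lemmatize)
    print(cross_lingual_variants("seperminuman", lemmatize, in_index))
    # -> ['diminum', 'minum', 'minuman', 'peminum']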
Word-Level Adaptation:
Issue 3
Word-level pivoting
 Ignores context, and relies on LM
 Cannot drop/insert/merge/split/reorder words
 Solution:
Phrase-level pivoting
 Build ML-EN and EN-IN phrase tables
 Induce ML-IN phrase table (pivoting over EN)
 Adapt the ML side of ML-EN to get “IN”-EN bi-text:
» using Indonesian LM and n-best “IN” as before
 Also, use cross-lingual morphological variants
- Models context better: not only the Indonesian LM, but also phrase-internal context.
- Allows more word operations, e.g., insertion and deletion (a sketch of the phrase-table pivoting follows below).
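A minimal sketch of inducing a Malay-"Indonesian" phrase table by pivoting over shared English phrases. Flat dictionaries stand in for real Moses-style phrase tables, the scores are simplified to a single forward probability per entry, and the example phrases and numbers are illustrative only.

    # Sketch: pivot ML-EN and EN-IN phrase tables into an ML-"IN" phrase table.
    # ml2en[ml][en] ~ p(en|ml); en2in[en][in] ~ p(in|en); both are toy tables.
    from collections import defaultdict

    def pivot_phrase_table(ml2en, en2in):
        ml2in = defaultdict(lambda: defaultdict(float))
        for ml, en_probs in ml2en.items():
            for en, p_en_given_ml in en_probs.items():
                for in_phrase, p_in_given_en in en2in.get(en, {}).items():
                    # p(in|ml) = sum over shared English phrases of p(in|en) * p(en|ml)
                    ml2in[ml][in_phrase] += p_in_given_en * p_en_given_ml
        return ml2in

    ml2en = {"dijangka cecah": {"expected to reach": 0.8, "expected to hit": 0.2}}
    en2in = {"expected to reach": {"diperkirakan mencapai": 0.7},
             "expected to hit":   {"diperkirakan mencapai": 0.5}}
    print(dict(pivot_phrase_table(ml2en, en2in)["dijangka cecah"]))
    # prints roughly {'diperkirakan mencapai': 0.66}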
Step 2:
Combining
IN-EN + “IN”-EN
Combining IN-EN and “IN”-EN bi-texts
 Simple concatenation: IN-EN + “IN”-EN
 Balanced concatenation: IN-EN * k + “IN”-EN
 Sophisticated phrase table combination: (Nakov and Ng,
EMNLP 2009), (Nakov and Ng, JAIR 2012)
 Improved word alignments for IN-EN
 Phrase table combination with extra features (see the sketch below)
Preslav Nakov and Hwee Tou Ng. Improved Statistical Machine Translation for Resource-Poor Languages Using Related Resource-Rich Languages. EMNLP 2009.
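A minimal sketch of a phrase-table combination with origin features, in the spirit of the Nakov & Ng approach cited above; the exact features and their values here are assumptions, with 1.0/0.5 used instead of 0 so that the features remain usable in a log-linear model.

    # Sketch: merge the IN-EN and adapted "IN"-EN phrase tables, preferring the
    # original IN-EN scores and appending three origin features per entry.
    def combine_phrase_tables(in_en, adapted):
        """Each table maps (src, tgt) phrase pairs to a list of feature scores."""
        combined = {}
        for pair in set(in_en) | set(adapted):
            in_both = pair in in_en and pair in adapted
            scores = list(in_en[pair] if pair in in_en else adapted[pair])
            origin = [1.0 if in_both else 0.5,        # seen in both tables
                      1.0 if pair in in_en else 0.5,  # seen in IN-EN
                      1.0 if pair in adapted else 0.5]  # seen in "IN"-EN
            combined[pair] = scores + origin
        return combined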
Experiments & Evaluation
Data (token counts)
 Translation data (for IN-EN)
  IN2EN-train: 0.9M
  IN2EN-dev: 37K
  IN2EN-test: 37K
  EN monolingual: 5M
 Adaptation data (for ML-EN → "IN"-EN)
  ML2EN: 8.6M
  IN monolingual: 20M
Isolated Experiments:
Training on “IN”-EN only
[Bar chart omitted: BLEU scores of systems trained on "IN"-EN only; values shown: 14.50, 18.67, 19.50, 20.06, 20.63, 20.89, 21.24.]
System combination using MEMT (Heafield and Lavie, 2010)
Combined Experiments:
Training on IN-EN + “IN”-EN
[Bar chart omitted: BLEU when training on IN-EN plus a second bi-text, comparing the ML2EN baseline (IN-EN combined with the original ML-EN) against our method (IN-EN combined with the adapted "IN"-EN), under three combination schemes: simple concatenation, balanced concatenation, and phrase table combination. Baseline scores shown: 18.49, 19.79, 20.10; our method: 21.55, 21.62, 21.64.]
Experiments: Improvements
[Bar chart omitted: overall BLEU progression across the systems: 14.50, 18.67, 20.10, 21.24, 21.64.]
Application to Other Languages & Domains
 Improve Macedonian-English SMT by adapting
Bulgarian-English bi-text
 Adapt BG-EN (11.5M words) to “MK”-EN (1.2M words)
Data: OPUS movie subtitles.
[Bar chart omitted, BLEU: BG2EN (A) 27.33; WordParaph+morph (B) 27.97; PhraseParaph+morph (C) 28.38; system combination of A+B+C 29.05.]
Conclusion
Conclusion & Future Work
 Adapt bi-texts from related resource-rich languages, using
 confusion networks
 word-level & phrase-level paraphrasing
 cross-lingual morphological analysis
 Achieved:
+6.7 BLEU over ML2EN
+2.6 BLEU over IN2EN
+1.5-3.0 BLEU over comb(IN2EN,ML2EN)
 Future work
 add split/merge as word operations
 better integrate word-level and phrase-level methods
 apply our methods to other languages & NLP problems
Thank you!
Supported by the Singapore National Research Foundation under its International Research
Centre @ Singapore Funding Initiative and administered by the IDM Programme Office.
Further Analysis
Paraphrasing
Non-Indonesian Malay Words Only
Paraphrasing only the Malay words that do not exist in Indonesian performs worse; so, we do need to paraphrase all words.
Human Judgments
Is the adapted sentence better Indonesian
than the original Malay sentence?
100 random sentences
Morphology yields worse top-3 adaptations
but better phrase tables, due to coverage.
Reverse Adaptation
Idea:
Adapt the dev/test Indonesian input to "Malay", then translate with a Malay-English system.
Input to SMT:
- “Malay” lattice
- 1-best “Malay” sentence from the lattice
Adapting the dev/test input is worse than adapting the training bi-text.
So, we need both the n-best adaptations and the Indonesian LM.
Related Work
Related Work (1)
 Machine translation between related languages
 E.g.
 Cantonese–Mandarin (Zhang, 1998)
 Czech–Slovak (Hajič & al., 2000)
 Turkish–Crimean Tatar (Altintas & Cicekli, 2002)
 Irish–Scottish Gaelic (Scannell, 2006)
 Bulgarian–Macedonian (Nakov & Tiedemann, 2012)
 We do not translate (no training data), we “adapt”.
Related Work (2)
 Adapting dialects to standard language (e.g., Arabic)
(Bakr & al., 2008; Sawaf, 2010; Salloum & Habash, 2011)
 manual rules
 Normalizing Tweets and SMS
(Aw & al., 2006; Han & Baldwin, 2011)
 informal text: spelling, abbreviations, slang
 same language
Related Work (3)
 Adapt Brazilian to European Portuguese (Marujo & al. 2011)
 rule-based, language-dependent
 tiny improvements for SMT
 Reuse bi-texts between related languages (Nakov & Ng, 2009)
 no language adaptation (just transliteration)