Current Trends in MT
Andy Way
NCLT, School of Computing,
Dublin City University,
Dublin 9, Ireland
www.nclt.dcu.ie/mt/
Overview of Talk
•Current Trends
•From EACL-06 to ACL-07
•Topics
•Country of Origin
•Ongoing and Future Work at DCU
•Other Important Research
•Future General Directions
•Increased convergence within MT
•Increased convergence between MT
and rest of NLP
•Concluding Remarks
NCLT, Dublin, April 2007
Current Trends
EACL-06 MT Track
featured 24 papers in
a number of areas:
• SMT: 8
• Evaluation: 5
• Word Alignment: 4
• Applications: 2
• Lexicon/WSD: 1
• RBMT: 1
• EBMT: 1
• Corpus Building: 1
• Hybrid MT: 1
Current Trends: Country of Origin
Of the 24 MT papers:
• 18 (75%) were from Europe
•6 from UK
•6 from Spain
•3 from Germany
•1 each from Romania, Italy & Ireland
• 6 (25%) were from N. America (5 from USA)
• 0 were from Asia
Current Trends: Success Rates (by Country)
Of the 24 MT papers, 7 (29%) were accepted
(general EACL acceptance rate 19.7%: 52/264)
•2 from USA (out of 5)
•2 from Germany (out of 3)
•1 from UK (out of 6)
•1 from Romania (out of 1)
•1 from Canada (out of 1)
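The two rates quoted on this slide follow directly from the counts given. A minimal sketch that recomputes them and checks that the per-country acceptances add up to 7; the helper name `rate` and the dictionary layout are illustrative assumptions, not part of the talk:

```python
# (accepted, submitted) per country, copied from the slide (EACL-06 MT track).
# The slide only lists countries with at least one accepted paper.
eacl06_by_country = {
    "USA": (2, 5),
    "Germany": (2, 3),
    "UK": (1, 6),
    "Romania": (1, 1),
    "Canada": (1, 1),
}

def rate(accepted, submitted):
    """Acceptance rate as a percentage."""
    return 100.0 * accepted / submitted

mt_rate = rate(7, 24)         # MT track: 7 of 24 papers accepted
overall_rate = rate(52, 264)  # EACL-06 overall: 52 of 264 papers accepted

# Sanity check: the per-country acceptances should account for all 7 papers.
accepted_total = sum(a for a, _ in eacl06_by_country.values())

print(f"MT track: {mt_rate:.0f}%, overall: {overall_rate:.1f}%")
print(f"Accepted papers accounted for: {accepted_total}")
```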
Current Trends: Success Rates (by Topic)
Of the 7 accepted MT papers
•2 were on SMT (out of 8)
•2 were on word alignment (out of 4)
•2 were on evaluation (out of 5)
•1 was on hybrid MT (out of 1)
Current Trends
ACL-07 MT Track
features 67 papers in
a number of areas:
• SMT: 29
• Word Alignment: 10
• Evaluation: 9
• Lexicon/WSD: 6
• Tree-to-String: 4
• RBMT: 3
• EBMT: 2
• Corpus Building: 2
• Hybrid MT: 1
• Applications: 1
Current Trends
ACL-07 SMT Track
features 29 papers in
a number of areas:
• General Issues: 11
• Reordering: 5
• Parsing/Structure: 5
• Phrases: 3
• LM: 2
• Decoding: 2
• Sent. Alignment: 1
Current Trends: Summary of Themes
Of the 67 MT papers:
• 54 (80%) involve corpus-based MT
• 9 (13%) involve evaluation
• 3 (4%) involve RBMT
Current Trends: Country of Origin
Of the 67 MT papers:
• 32 (48%) are from Asia
• 19 (28%) are from N. America (18 from USA)
• 16 (24%) are from Europe
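The shift in regional shares between the two conferences can be recomputed from the counts on the two "Country of Origin" slides. A minimal sketch; the dictionary layout and the helper name `shares` are illustrative assumptions:

```python
# MT-track paper counts per region, copied from the EACL-06 and ACL-07 slides.
papers = {
    "EACL-06": {"Europe": 18, "N. America": 6, "Asia": 0},
    "ACL-07": {"Europe": 16, "N. America": 19, "Asia": 32},
}

def shares(counts):
    """Percentage share of each region, rounded to whole percent."""
    total = sum(counts.values())
    return {region: round(100 * n / total) for region, n in counts.items()}

share_by_conf = {conf: shares(counts) for conf, counts in papers.items()}

for conf, by_region in share_by_conf.items():
    print(conf, by_region)
```

On these figures, Europe's share falls from 75% to 24% while Asia's rises from 0% to 48%.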
Current Trends: Country of Origin
Of the 32 papers
from Asia:
• China: 20
• Taiwan: 3
• Japan: 3
• India: 2
• Hong Kong: 1
• Korea: 1
• Thailand: 1
• Singapore: 1
Current Trends: Country of Origin
Of the 16 papers
from Europe:
• Spain: 3
• Ireland: 3
• UK: 2
• Germany: 2
• France: 1
• Italy: 1
• Denmark: 1
• Turkey: 1
• Czech Rep.: 1
• Hungary: 1
Change 06-07 (by Topic)
[Bar chart comparing 2006 and 2007 by topic; y-axis 0 to 45. Topics: SMT, Eval, W.A., Apps, Lex, RBMT, EBMT, Corpus, Hybrid, Tree-to-String]
Change 06-07 (by Country)
[Bar chart comparing 2006 and 2007 by country; y-axis 0 to 35. Countries: UK, Spain, USA, Germany, Italy, Ireland, Canada, China, Taiwan, Japan, India, Rest Asia, Rest EU]
Current Trends: Success Rates (by Country)
• Of the 67 MT papers, 17 were accepted
(25.4%; overall acceptance rate 22.4%),
from the following countries:
• USA: 8 (out of 18)
• China: 3 (out of 20)
• Ireland: 2 (out of 3)
• UK: 2 (out of 2)
• Canada: 1 (out of 1)
• Singapore: 1 (out of 1)
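The per-country figures can be turned into acceptance rates, and the totals show how many submissions came from countries with no accepted paper. A minimal sketch using only the counts on this slide; the variable names are illustrative assumptions:

```python
# (accepted, submitted) per country, copied from the slide (ACL-07 MT track).
acl07_by_country = {
    "USA": (8, 18),
    "China": (3, 20),
    "Ireland": (2, 3),
    "UK": (2, 2),
    "Canada": (1, 1),
    "Singapore": (1, 1),
}
TOTAL_SUBMITTED = 67  # all MT papers
TOTAL_ACCEPTED = 17   # all accepted MT papers

# Acceptance rate per listed country, in percent.
rates = {c: 100.0 * a / s for c, (a, s) in acl07_by_country.items()}

listed_accepted = sum(a for a, _ in acl07_by_country.values())
listed_submitted = sum(s for _, s in acl07_by_country.values())

# Submissions from countries that do not appear in the list of acceptances.
unlisted_submitted = TOTAL_SUBMITTED - listed_submitted

for country, pct in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{country}: {pct:.1f}%")
print(f"MT track overall: {100.0 * TOTAL_ACCEPTED / TOTAL_SUBMITTED:.1f}%")
print(f"Submissions from countries with no acceptance: {unlisted_submitted}")
```

Since the six listed countries account for all 17 acceptances, the other 22 submissions came from countries with no accepted paper.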
Current Trends: Success Rates (by Topic)
• Of the 17 successful MT papers:
•3 were on language modelling/decoding
•2 were on evaluation
•2 were on word alignment
•2 were on reordering
•1 was on word-sense disambiguation
•1 was on tree-to-string models
•1 was on SMT via pivot languages
•1 was on multi-parallel corpora
•1 was on hybrid MT
•1 was on transductive learning
Consequences of these Trends
•The ‘system’ is at breaking point
•Do we need a pre-selection phase?
•As in many other areas, a ‘new world order’ is
emerging
•There is very little internal QA as yet
•Standard of English and basic structure is lacking
•But … they’re doing OK already, and they’ll
improve!
•Relatively few ‘world centres’ in MT at present
•Despite massive increase in MT use, big
decrease in teaching of MT – paradox!
Ongoing Work in DCU
• Integrating Syntax into SMT
– Supertag translation and target language models
– Adding source language information
– Tree-to-Tree Translation (DOT, LFG-DOT: also
tree-to-string models), inc. porting monolingual
parsing techniques to the bilingual case
• Applications
– Automatic Translation of DVD subtitles
– Sign-Language MT
– Large-Scale Open Evaluation (inc. parallel
computation)
• New Language Pairs, Corpora etc.
System Development
System            Lang. Pairs              #Sent. Pairs
Gaijin ‘97        EN-DE                    1836
wEBMT ‘03         FR-EN                    219,000 (Penn-II NPs & VPs)
TMI-04            FR-EN                    203,000
ACL-05            FR-EN                    322,000
MaTreX OpenLab    ES-EN                    958,000
MaTreX NIST-06    Chinese-EN, Arabic-EN    3,000,000
Ongoing Work in DCU (cont’d)
• Dependency- (and Semantically) Marked-Up
Corpora
• New models of Word Alignment
• New integrated models of subtree/substring
alignment
• New dependency-based Evaluation metrics
• New Decoders
– EBMT
– Memory-Based
• Open-Source Components
Ongoing Work in DCU (cont’d)
Collaborative work:
• Tilburg (Memory-based Decoding)
• Donostia (Basque MT)
• Aachen (Sign-Language MT)
• Amsterdam (Integrating Syntax & SMT)
• St. Andrew’s (DOT)
• Edinburgh (SMT)
• CMU (Hybrid SMT—EBMT)
Future Work in DCU
• Spoken Language Translation
Future Work in DCU
• MT via SMS
• Automatic Interpreting
• Enhanced hybrid models
• Scalability
• Tuning MT to text type & genre
• MT using Pivot languages (‘triangulation’)
• Better quality phrases (cf. CONLL
monolingual chunking shared task)
• …
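The slide does not say how triangulation would be carried out. One common formulation, assumed here purely for illustration, combines a source-to-pivot and a pivot-to-target phrase table by summing over pivot phrases: p(t | s) = Σ_p p(p | s) · p(t | p). A minimal sketch with made-up toy probabilities; the function name `triangulate` and the nested-dictionary table layout are hypothetical:

```python
from collections import defaultdict

def triangulate(src_to_pivot, pivot_to_tgt):
    """Combine two phrase tables through a pivot language.

    src_to_pivot[s][p] holds p(p | s); pivot_to_tgt[p][t] holds p(t | p).
    Returns table[s][t] = sum over pivot phrases p of p(p | s) * p(t | p).
    """
    table = defaultdict(lambda: defaultdict(float))
    for s, pivots in src_to_pivot.items():
        for p, prob_p_given_s in pivots.items():
            # Pivot phrases missing from the second table contribute nothing.
            for t, prob_t_given_p in pivot_to_tgt.get(p, {}).items():
                table[s][t] += prob_p_given_s * prob_t_given_p
    return {s: dict(targets) for s, targets in table.items()}

# Toy Spanish-English and English-French tables (invented numbers).
es_to_en = {"casa": {"house": 0.8, "home": 0.2}}
en_to_fr = {
    "house": {"maison": 0.9, "domicile": 0.1},
    "home": {"maison": 0.6, "foyer": 0.4},
}

es_to_fr = triangulate(es_to_en, en_to_fr)
print(es_to_fr)
```

Marginalising over the pivot assumes the target phrase depends on the source phrase only through the pivot phrase, which is a simplification.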
Future General Directions
• Corpus Building (integrating syntax,
semantics … discourse …)
– cf. data size vs. data quality …
– Filtering/pruning training data (‘safe’ alignments)
• Word Alignment
• Language Modelling
• Decoding
• Evaluation Methods
• Large-scale Open Evaluations
• Further Convergence between models
Dekai Wu’s 3D MT Space
Convergence between MT and Rest of NLP
• For some time now, not many MT
researchers have been doing syntax, and vice versa.
• With move (back) to trees instead of
strings:
– Reconnect with wealth of tree automata
literature
– Get lots of implemented algorithms for
free!
Concluding Remarks
So … there’s plenty for us still to do!
Two worries:
• MT R&D seems to be at an all-time
high, yet we’re not teaching MT any
more.
• Most (S)MT people come from different
backgrounds, but huge danger that
some people are merely reinventing the
wheel …
Thanks!
The end
beginning!