What’s New in
Statistical Machine Translation
Kevin Knight
USC/Information Sciences Institute
USC/Computer Science Department
Machine Translation
美国关岛国际机场及其办公室均接获一
名自称沙地阿拉伯富商拉登等发出的电
子邮件,威胁将会向机场等公众地方发
动生化袭击後,关岛经保持高度戒备。
The U.S. island of Guam is maintaining a high
state of alert after the Guam airport and its offices
both received an e-mail from someone calling
himself the Saudi Arabian Osama bin Laden and
threatening a biological/chemical attack against
public places such as the airport.
Thousands of Languages Are Spoken
Language            Speakers
MANDARIN            885,000,000
SPANISH             332,000,000
ENGLISH             322,000,000
BENGALI             189,000,000
HINDI               182,000,000
PORTUGUESE          170,000,000
RUSSIAN             170,000,000
JAPANESE            125,000,000
GERMAN               98,000,000
WU (China)           77,175,000
JAVANESE             75,500,800
KOREAN               75,000,000
FRENCH               72,000,000
VIETNAMESE           67,662,000
TELUGU               66,350,000
YUE (China)          66,000,000
MARATHI              64,783,000
TAMIL                63,075,000
TURKISH              59,000,000
URDU                 58,000,000
MIN NAN (China)      49,000,000
JINYU (China)        45,000,000
GUJARATI             44,000,000
POLISH               44,000,000
ARABIC               42,500,000
UKRAINIAN            41,000,000
ITALIAN              37,000,000
XIANG (China)        36,015,000
MALAYALAM            34,022,000
HAKKA (China)        34,000,000
KANNADA              33,663,000
ORIYA                31,000,000
PANJABI              30,000,000
SUNDA                27,000,000
Source: Ethnologue
Recent Progress in Statistical MT
2002:
insistent Wednesday may recurred her trips to Libya tomorrow for flying
Cairo 6-4 ( AFP ) - An official announced today in the Egyptian lines company for flying Tuesday is a company "insistent for flying" may resumed a consideration of a day Wednesday tomorrow her trips to Libya of Security Council decision trace international the imposed ban comment.
2003:
Egyptair Has Tomorrow to Resume Its Flights to Libya
Cairo 4-6 (AFP) - Said an official at the Egyptian Aviation Company today that the company egyptair may resume as of tomorrow, Wednesday its flights to Libya after the International Security Council resolution to the suspension of the embargo imposed on Libya.
2005:
news broadcast → foreign language speech recognition → English translation → searchable archive
Warren Weaver (1947)
Ciphertext:
ingcmpnqsnwf cv fpn owoktvcv hu ihgzsnwfv rqcffnw cw owgcnwf kowazoanv ...
Step 1: guess that cipher letter "n" stands for "e".
Step 2: guess that "fpn" is "the", so "f" stands for "t" and "p" for "h"; fill these in everywhere.
Step 3: try "cv" = "of". The word "owoktvcv" would then end in "fof", so the guess is withdrawn.
Step 4: try "cv" = "is". The word "owoktvcv" now ends in "sis".
Result:
decipherment is the analysis of documents written in ancient languages ...
Warren Weaver (1947)
Can this be
computerized?
The non-Turkish guy next
to me is even deciphering
Turkish! All he needs is a
statistical table of letter-pair
frequencies in Turkish …
Collected mechanically from a
Turkish body of text, or corpus
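The letter-pair table mentioned above can be collected mechanically with a few lines of code. The sketch below is only an illustration: the function name `letter_pair_frequencies` and the one-sentence sample corpus (borrowed from the decipherment slide) are assumptions, and a real table would be built from a large body of Turkish text.

```python
from collections import Counter

def letter_pair_frequencies(corpus):
    """Return relative frequencies of adjacent letter pairs within words."""
    pairs = Counter()
    for word in corpus.lower().split():
        letters = [ch for ch in word if ch.isalpha()]
        # Count each adjacent pair of letters inside the word.
        pairs.update(a + b for a, b in zip(letters, letters[1:]))
    total = sum(pairs.values())
    return {pair: count / total for pair, count in pairs.items()}

# Hypothetical stand-in corpus; any large text in the target language would do.
sample = "decipherment is the analysis of documents written in ancient languages"
table = letter_pair_frequencies(sample)
for pair, freq in sorted(table.items(), key=lambda kv: -kv[1])[:5]:
    print(pair, round(freq, 3))
```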
“When I look at an article in Russian, I
say: this is really written in English, but it
has been coded in some strange symbols.
I will now proceed to decode.”
- Warren Weaver, March 1947
“... as to the problem of mechanical
translation, I frankly am afraid that the
[semantic] boundaries of words in
different languages are too vague ... to
make any quasi-mechanical translation
scheme very hopeful.”
- Norbert Wiener, April 1947
Spanish/English corpus
1a. Garcia and associates .
1b. Garcia y asociados .
2a. Carlos Garcia has three associates .
2b. Carlos Garcia tiene tres asociados .
3a. his associates are not strong .
3b. sus asociados no son fuertes .
4a. Garcia has a company also .
4b. Garcia tambien tiene una empresa .
5a. its clients are angry .
5b. sus clientes estan enfadados .
6a. the associates are also angry .
6b. los asociados tambien estan enfadados .
7a. the clients and the associates are enemies .
7b. los clientes y los asociados son enemigos .
8a. the company has three groups .
8b. la empresa tiene tres grupos .
9a. its groups are in Europe .
9b. sus grupos estan en Europa .
10a. the modern groups sell strong pharmaceuticals .
10b. los grupos modernos venden medicinas fuertes .
11a. the groups do not sell zenzanine .
11b. los grupos no venden zanzanina .
12a. the small groups are not modern .
12b. los grupos pequenos no son modernos .
Spanish/English corpus
Translate: Clients do not sell pharmaceuticals in Europe.
Centauri/Arcturan [Knight 97]
Your assignment, translate this to Arcturan:
farok crrrok hihok yorok clok kantok ok-yurp
1a. ok-voon ororok sprok .
1b. at-voon bichat dat .
2a. ok-drubel ok-voon anok plok sprok .
2b. at-drubel at-voon pippat rrat dat .
3a. erok sprok izok hihok ghirok .
3b. totat dat arrat vat hilat .
4a. ok-voon anok drok brok jok .
4b. at-voon krat pippat sat lat .
5a. wiwok farok izok stok .
5b. totat jjat quat cat .
6a. lalok sprok izok jok stok .
6b. wat dat krat quat cat .
7a. lalok farok ororok lalok sprok izok enemok .
7b. wat jjat bichat wat dat vat eneat .
8a. lalok brok anok plok nok .
8b. iat lat pippat rrat nnat .
9a. wiwok nok izok kantok ok-yurp .
9b. totat nnat quat oloat at-yurp .
10a. lalok mok nok yorok ghirok clok .
10b. wat nnat gat mat bat hilat .
11a. lalok nok crrrok hihok yorok zanzanok .
11b. wat nnat arrat mat zanzanat .
12a. lalok rarok nok izok hihok mok .
12b. wat nnat forat arrat vat gat .
process of elimination
cognate?
Centauri/Arcturan [Knight 97]
Your assignment, put these words in order:
{ jjat, arrat, mat, bat, oloat, at-yurp }
zero fertility
“When I look at an article in Russian, I
say: this is really written in English, but it
has been coded in some strange symbols.
I will now proceed to decode.”
- Warren Weaver, March 1947
The required statistical tables have millions of entries…?
Too much for the computers of Weaver’s day.
→ Not enough RAM!
IBM Candide Project (1988-1994)
• How to get quantities of human translation in computer readable form?
– parallel corpus
IBM’s John Cocke, inventor of CKY parsing & RISC processors
Canadian bureaucrat
IBM Candide Project
[Brown et al 93]
Training:
French/English Bilingual Text → Statistical Analysis
English Text → Statistical Analysis
Translation:
French: J’ ai si faim
Broken English: What hunger have I, Hungry I am so, I am so hungry, Have me that hunger …
English: I am so hungry
Mathematical Formulation
Given source sentence f:
$$
\begin{aligned}
\arg\max_e P(e \mid f) &= \arg\max_e \frac{P(f \mid e) \cdot P(e)}{P(f)} && \text{by Bayes Rule} \\
&= \arg\max_e P(f \mid e) \cdot P(e) && \text{P(f) same for all e}
\end{aligned}
$$
French: J’ ai si faim
→ Translation Model P(f | e) → Broken English
→ Language Model P(e) → English
Decoding algorithm: argmax_e P(e) · P(f | e)
English: I am so hungry
Language Modeling
Goal of a language model for MT: choose the better English string in pairs such as
He is on the soccer field / He is in the soccer field
Is table the on cup the / The cup is on the table
American shrine / American company
Need to make these decisions, because the translation model may not have a lot of context information!
The Classic Language Model
Word Bigrams
Process model of English:
Generate each word based only on the previous word.
P(I saw water on the table) =
P(I | START) ·
P(saw | I) ·
P(water | saw) ·
P(on | water) ·
P(the | on) ·
P(table | the) ·
P(END | table)
Probabilities can be tabulated
from an online English corpus …
just like Weaver’s Turkish case.
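The bigram process model above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the three-sentence corpus is made up, the function names are hypothetical, and no smoothing is applied, so any unseen word pair drives the sentence probability to zero.

```python
from collections import Counter

def train_bigram(sentences):
    """Tabulate counts of each word and each adjacent word pair."""
    context, pair = Counter(), Counter()
    for s in sentences:
        words = ["START"] + s.split() + ["END"]
        for prev, cur in zip(words, words[1:]):
            context[prev] += 1
            pair[(prev, cur)] += 1
    return context, pair

def sentence_prob(sentence, context, pair):
    """P(sentence) as a product of P(word | previous word), unsmoothed."""
    words = ["START"] + sentence.split() + ["END"]
    p = 1.0
    for prev, cur in zip(words, words[1:]):
        if context[prev] == 0:
            return 0.0
        p *= pair[(prev, cur)] / context[prev]
    return p

# Hypothetical toy corpus standing in for an online English corpus.
corpus = ["I saw water on the table", "the cup is on the table", "I saw the cup"]
context, pair = train_bigram(corpus)
print(sentence_prob("I saw water on the table", context, pair))
print(sentence_prob("Is table the on cup the", context, pair))
```

A practical model would add smoothing so that unseen but plausible word pairs do not get probability zero.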
Trigram Language Model
[Soricut & Marcu, 05]
Scrambled words:
to the said royal purchase plan trustco part operations of its its is international expand banking
Reorderings:
the banking trustco is said to expand its purchase part of its royal international plan operations
royal trustco said the purchase is part of its plan to expand its international banking operations
Scrambled words:
with the stressed relationship part own longstanding its its for chinese boeing , ,
Reorderings:
for its part, stressed the longstanding relationship with its own, chinese boeing
boeing, for its part, stressed its own longstanding relationship with the chinese
N-grams have a lot of semantics in them!
Translation Model?
Process model of translation:
Mary did not slap the green witch
- Source-language morphological analysis
- Source parse tree
- Semantic representation
- Generate target structure
Maria no dió una bofetada a la bruja verde
What are all the possible moves and what probability tables control those moves?
The Classic Translation Model
Word Substitution/Permutation [Brown et al., 1993]
Process model of translation:
Mary did not slap the green witch
    n(3|slap): 50k entries
Mary not slap slap slap the green witch
    P-Null: 1 entry
Mary not slap slap slap NULL the green witch
    t(la|the): 25m entries
Maria no dió una bofetada a la verde bruja
    d(j|i): 2500 entries
Maria no dió una bofetada a la bruja verde
Trainable
When only the first and last sentences are observed and the intermediate steps are hidden (?), the same four tables are still trainable!
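Classic Formula for P(f | e)
$$
P(f \mid e) = \sum_{a} \binom{m - \Phi_0}{\Phi_0} \cdot P_{\text{Null}}^{\,m - 2\Phi_0} \cdot (1 - P_{\text{Null}})^{\Phi_0} \cdot \prod_{i=0}^{l} \Phi_i! \cdot \frac{1}{\Phi_0!} \cdot \prod_{i=1}^{l} n(\Phi_i \mid e_i) \cdot \prod_{j=1}^{m} t(f_j \mid e_{a_j}) \cdot \prod_{j : a_j \neq 0} d(j \mid a_j, l, m)
$$
- sum over a: sum over alignment possibilities
- binomial coefficient and P-Null factors: NULL stuff
- n(Φi | ei): fertility
- t(fj | eaj): word translation
- d(j | aj, l, m): re-ordering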
Set parameter values so formula assigns the highest possible probability
to observed human translations. This is a 25m-dimensional search space.
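To make the formula concrete, the sketch below evaluates the term inside the sum for one fixed alignment a, following the slide's factors one by one. It is an illustration only: the function name `alignment_prob`, the dictionary layouts, and every probability value in the example are assumptions, not trained parameters.

```python
from math import comb, factorial

def alignment_prob(e, f, a, n, t, d, p_null):
    """Probability of foreign words f and alignment a given English words e.

    a[j-1] is the position (1..l) of the English word that generated f_j,
    or 0 if f_j was generated by NULL.
    """
    l, m = len(e), len(f)
    phi = [0] * (l + 1)              # phi[i] = fertility of e_i; phi[0] is NULL's
    for a_j in a:
        phi[a_j] += 1
    phi0 = phi[0]
    # NULL stuff: binomial coefficient and P-Null factors, as written on the slide.
    prob = comb(m - phi0, phi0) * p_null ** (m - 2 * phi0) * (1 - p_null) ** phi0
    # Product of fertility factorials for i = 0..l, times 1 / phi0!.
    for i in range(l + 1):
        prob *= factorial(phi[i])
    prob /= factorial(phi0)
    # Fertility table n(phi_i | e_i).
    for i in range(1, l + 1):
        prob *= n[(phi[i], e[i - 1])]
    for j in range(1, m + 1):
        a_j = a[j - 1]
        e_word = "NULL" if a_j == 0 else e[a_j - 1]
        prob *= t[(f[j - 1], e_word)]        # word translation
        if a_j != 0:
            prob *= d[(j, a_j, l, m)]        # re-ordering
    return prob

# Hypothetical parameter values for a two-word example.
e, f, a = ["the", "house"], ["la", "maison"], [1, 2]
n = {(1, "the"): 0.8, (1, "house"): 0.9}
t = {("la", "the"): 0.5, ("maison", "house"): 0.6}
d = {(1, 1, 2, 2): 0.7, (2, 2, 2, 2): 0.7}
print(alignment_prob(e, f, a, n, t, d, p_null=0.9))
```

Summing this quantity over all alignments gives P(f | e); the number of alignments grows exponentially with sentence length, so a naive sum is only feasible for very short sentences.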
Unsupervised EM Training
… la maison … la maison bleue … la fleur …
… the house … the blue house … the flower …
1. All P(french-word | english-word) equally likely.
2. “la” and “the” observed to co-occur frequently, so P(la | the) is increased.
3. “maison” co-occurs with both “the” and “house”, but P(maison | house) can be raised without limit, to 1.0, while P(maison | the) is limited because of “la” (pigeonhole principle).
4. Settling down after another iteration.
5. Inherent hidden structure revealed by EM training!
• “A Statistical MT Tutorial Workbook” (Knight, 1999). Promises free beer.
• “The Mathematics of Statistical Machine Translation” (Brown et al, 1993)
• Software: GIZA++
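The EM walk-through on the la maison / the house fragments can be reproduced with a small program. The sketch below trains word-translation probabilities only, in the style of IBM Model 1, which is a simplification of the full model with fertility, NULL, and re-ordering tables; the function name `train_word_translation` and the iteration count are assumptions.

```python
from collections import defaultdict

def train_word_translation(pairs, iterations=30):
    """EM for t(f | e) from sentence pairs, starting from uniform values."""
    f_vocab = {f for fs, _ in pairs for f in fs}
    uniform = 1.0 / len(f_vocab)
    t = defaultdict(lambda: uniform)         # all P(f | e) equally likely
    for _ in range(iterations):
        count = defaultdict(float)           # expected count of (f, e) links
        total = defaultdict(float)           # expected count of e
        for fs, es in pairs:
            for f in fs:
                z = sum(t[(f, e)] for e in es)
                for e in es:
                    c = t[(f, e)] / z        # fractional link f <- e
                    count[(f, e)] += c
                    total[e] += c
        # Re-estimate the table from the fractional counts.
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t

pairs = [("la maison".split(), "the house".split()),
         ("la maison bleue".split(), "the blue house".split()),
         ("la fleur".split(), "the flower".split())]
t = train_word_translation(pairs)
for f, e in [("la", "the"), ("maison", "the"), ("maison", "house")]:
    print(f"P({f} | {e}) = {t[(f, e)]:.3f}")
```

No word alignments are supplied; the pairing of “la” with “the” and “maison” with “house” emerges from the co-occurrence statistics alone.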
Sample Translation Probabilities
Translation Model [Brown et al 93]
e          f              P(f | e)
national   nationale      0.47
           national       0.42
           nationaux      0.05
           nationales     0.03
the        le             0.50
           la             0.21
           les            0.16
           l’             0.09
           ce             0.02
           cette          0.01
farmers    agriculteurs   0.44
           les            0.42
           cultivateurs   0.05
           producteurs    0.02
Language Model
w1     w2        P(w2 | w1)
of     the       0.13
       a         0.09
       another   0.01
       some      0.01
hong   kong      0.98
       said      0.01
       stated    0.01
Given a new French sentence f and a potential translation e:
Translation Model gives P(f | e)
Language Model gives P(e)
P(f | e) · P(e) → score for e
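The scoring rule can be illustrated on a two-word input. In the sketch below the translation probabilities are copied from the sample table, while the bigram values, the one-to-one word alignment, and the function names are assumptions made for illustration (fertility, NULL, and re-ordering are ignored).

```python
# t[e][f] = P(f | e), taken from the sample translation table.
t = {
    "the": {"le": 0.50, "la": 0.21, "les": 0.16, "l'": 0.09, "ce": 0.02, "cette": 0.01},
    "farmers": {"agriculteurs": 0.44, "les": 0.42, "cultivateurs": 0.05, "producteurs": 0.02},
}
# Hypothetical bigram values P(w2 | w1); not from the slides.
lm = {("START", "the"): 0.2, ("the", "farmers"): 0.01, ("farmers", "END"): 0.1,
      ("START", "farmers"): 0.001, ("farmers", "farmers"): 0.0001}

def tm_prob(f_words, e_words):
    """P(f | e) under a word-for-word, same-order alignment (a simplification)."""
    p = 1.0
    for f, e in zip(f_words, e_words):
        p *= t.get(e, {}).get(f, 0.0)
    return p

def lm_prob(e_words):
    """P(e) under the bigram table above."""
    words = ["START"] + e_words + ["END"]
    p = 1.0
    for prev, cur in zip(words, words[1:]):
        p *= lm.get((prev, cur), 0.0)
    return p

f = ["les", "agriculteurs"]
candidates = [["the", "farmers"], ["farmers", "farmers"]]
best_tm_only = max(candidates, key=lambda e: tm_prob(f, e))
best = max(candidates, key=lambda e: tm_prob(f, e) * lm_prob(e))
print("translation model alone prefers:", " ".join(best_tm_only))
print("P(f | e) * P(e) prefers:", " ".join(best))
```

With these numbers the translation model alone prefers the disfluent candidate, because P(les | farmers) is larger than P(les | the); the language model term corrects the choice.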
Search for Best Translation
voulez – vous vous taire !
Candidate translations considered in turn:
you – you you quiet !
you – you quiet !
quiet you – you you !
shut you – you you !
you shut !
you shut up !
Classic Decoding Algorithm
Given f, find the English string e that
maximizes P(e) · P(f | e)
NP-Complete [Knight 99].
Brown et al 93:
“In this paper, we focus on the
translation modeling problem.
We hope to deal with the
[decoding] problem in a later
paper.”
Beam Search Decoding
[Brown et al US Patent #5,477,451]
[Jelinek 69; Och, Ueffing, and Ney, 01]
start → 1st English word → 2nd English word → 3rd English word → 4th English word → end (all source words covered)
Each hypothesis keeps a link to its best predecessor.
Each partial translation hypothesis contains:
- Last English word chosen + source words covered by it
- Next-to-last English word chosen
- Entire coverage vector (so far) of source sentence
- Language model and translation model scores (so far)
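The hypothesis record listed above maps naturally onto a small data structure. The following is a simplified sketch, not the patented algorithm: it assumes each source word is translated by exactly one English word, scores with a bigram table, and keeps a fixed number of hypotheses per stack. The names `Hypothesis` and `beam_decode`, the table layouts, and the probability values marked hypothetical are all assumptions.

```python
import math
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Hypothesis:
    last_word: str                        # last English word chosen
    prev_word: Optional[str]              # next-to-last English word chosen
    coverage: frozenset                   # source positions covered so far
    score: float                          # log LM score + log TM score so far
    back: Optional["Hypothesis"] = None   # best predecessor link

def beam_decode(source, t, lm, beam_size=5, unseen=1e-6):
    """t[f][e] = P(f | e); lm[(w1, w2)] = P(w2 | w1); unseen is a fallback value."""
    stacks = [[] for _ in range(len(source) + 1)]   # stacks[k]: k source words covered
    stacks[0].append(Hypothesis("START", None, frozenset(), 0.0))
    for k in range(len(source)):
        for hyp in stacks[k]:
            for j, f in enumerate(source):
                if j in hyp.coverage:
                    continue
                for e, p_t in t.get(f, {}).items():
                    p_lm = lm.get((hyp.last_word, e), unseen)
                    stacks[k + 1].append(Hypothesis(
                        e, hyp.last_word, hyp.coverage | {j},
                        hyp.score + math.log(p_t) + math.log(p_lm), hyp))
        # Prune: keep only the best beam_size hypotheses in the next stack.
        stacks[k + 1] = sorted(stacks[k + 1], key=lambda h: h.score, reverse=True)[:beam_size]
    if not stacks[-1]:
        return []
    hyp, words = stacks[-1][0], []
    while hyp.back is not None:           # follow predecessor links back to start
        words.append(hyp.last_word)
        hyp = hyp.back
    return words[::-1]

t = {"les": {"the": 0.16, "farmers": 0.42}, "agriculteurs": {"farmers": 0.44}}
lm = {("START", "the"): 0.2, ("the", "farmers"): 0.01}   # hypothetical values
print(beam_decode(["les", "agriculteurs"], t, lm))
```

Pruning makes the search fast but inexact: a hypothesis discarded early might have led to the best complete translation.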
Classic Results
• nous avons signé le protocole . (Foreign Original)
• we did sign the memorandum of agreement . (Human Translation)
• we have signed the protocol . (MT)
• où était le plan solide ? (Foreign Original)
• but where was the solid plan ? (Human Translation)
• where was the economic base ? (MT)
the Ministry of Foreign Trade and Economic Cooperation, including foreign
direct investment 40 billion US dollars today provide data include
that year to November china actually using foreign 46.959 billion US dollars and
very slow = one page per day
Okay!
I know, so far, this talk should be called …
What’s Old in Statistical
Machine Translation!!
Further Developments
• Follow-on projects
– Hong Kong
– Aachen
– Behavior Design Corporation
• JHU Summer Workshop 1999
– Build & distribute statistical MT tools
– Create standard training & testing data
– Disseminate tutorial material
– “MT in a Day”
– Ask new questions
How Much Data Do We Need?
[Figure: quality of automatically trained machine translation system, plotted against amount of bilingual training data]
Advances in Statistical MT
2000-2004
Ready-to-Use Online Bilingual Data
[Figure: millions of words (English side), 0 to 180, by year from 1994 to 2004, for Chinese/English, Arabic/English, and French/English]
(Data stripped of formatting, in sentence-pair format, available from the Linguistic Data Consortium at UPenn).
+ European parliament data [Koehn 05]
BLEU Evaluation Metric
(Papineni et al 02)
Reference (human) translation:
The U.S. island of Guam is
maintaining a high state of alert
after the Guam airport and its
offices both received an e-mail
from someone calling himself the
Saudi Arabian Osama bin Laden
and threatening a
biological/chemical attack against
public places such as the airport .
Machine translation:
The American [?] international
airport and its the office all
receives one calls self the sand
Arab rich business [?] and so on
electronic mail , which sends out ;
The threat will be able after public
place and so on the airport to start
the biochemistry attack , [?] highly
alerts after the maintenance.
• N-gram precision (score is between 0 & 1)
What percentage of machine n-grams can be
found in the reference translation?
Gross measure over 1000 test sentences.
Not allowed to use same portion of reference
translation twice (can’t cheat by typing out
“the the the the the”)
Brevity penalty: can’t just type out single word
“the” (and get precision 1.0)
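The two ingredients described above, clipped n-gram precision and the brevity penalty, can be sketched as follows. This is a simplified single-sentence, single-reference version for illustration; the metric as described is computed in aggregate over a whole test set, and the function names here are assumptions.

```python
import math
from collections import Counter

def ngrams(words, n):
    """Multiset of the n-grams in a list of words."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def bleu(candidate, reference, max_n=4):
    c, r = candidate.split(), reference.split()
    log_precision = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(c, n), ngrams(r, n)
        total = sum(cand.values())
        # Clipping: each reference n-gram can be matched only as often as it occurs.
        matched = sum(min(count, ref[g]) for g, count in cand.items())
        if total == 0 or matched == 0:
            return 0.0
        log_precision += math.log(matched / total) / max_n
    # Brevity penalty: candidates shorter than the reference are penalized.
    bp = 1.0 if len(c) > len(r) else math.exp(1 - len(r) / len(c))
    return bp * math.exp(log_precision)

ref = "the gunman was shot to death by the police ."
print(bleu("the gunman was shot dead by the police .", ref))
print(bleu("the the the the the", ref, max_n=1))   # clipped to 2 matches out of 5
print(bleu("the", ref, max_n=1))                   # precision 1.0, heavy brevity penalty
```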
BLEU in Action
枪手被警方击毙。 (Foreign Original)
the gunman was shot to death by the police . (Reference Translation)
#1 the gunman was police kill .
#2 wounded police jaya of
#3 the gunman was shot dead by the police .
#4 the gunman arrested by police kill .
#5 the gunmen were killed .
#6 the gunman was shot to death by the police .
#7 gunmen were killed by police ?SUB>0 ?SUB>0
#8 al by the police .
#9 the ringer is killed by the police .
#10 police killed the gunman .
green = 4-gram match (good!)
red = word not matched (bad!)
BLEU Tends to Predict Human Judgments
[Figure: scatter plot of BLEU Score against Human Judgments, both axes from -2.5 to 2.5, with linear fits: Adequacy R2 = 88.0%, Fluency R2 = 90.2%]
slide from G. Doddington (NIST)
Experiment-Driven Progress
Evaluate new MT research ideas every day! (and be alerted about bugs…)
[Figure: BLEU (15 to 35) over time, Mar 1 to May 1, 2005, for ISI Syntax-Based MT, Chinese/English, NIST 2002 Test Set]
Draw Learning Curves
[Figure: BLEU score (0 to 0.35) against # of sentence pairs used in training (10k, 20k, 40k, 80k, 160k, 320k) for Swedish/English, French/English, German/English, and Finnish/English]
Experiments by Philipp Koehn
Flaws of Word-Based MT
• Can’t translate multiple English words to
one French word
• Can’t translate phrases
– “real estate”, “note that”, “interest in”
• Isn’t sensitive to syntax
– Adjectives/nouns should swap order
– Verb comes at the beginning in Arabic
• Doesn’t understand the meaning (?)
The MT Triangle
Levels on both the SOURCE and TARGET sides, from top to bottom: interlingua, logical form, syntax, words.
The MT Swimming Pool
Same levels: interlingua, logical form, syntax, words.
Commercial Rule-Based Systems
[Diagram: MT triangle with the levels interlingua, logical form, syntax, words between SOURCE and TARGET]
Knight et al 95
- meaning-based translation
- composition rules
[Diagram: MT triangle with a Language Model box]
Wu 97, Alshawi 98
- inducing syntactic structure as a by-product of aligning words in bilingual text
[Diagram: MT triangle with a Language Model box]
Yamada/Knight (01,02)
- tree/string model
- used existing target language parser
[Diagram: MT triangle with a Language Model box]
Well, these all seem like good ideas.
Which one had the most dramatic effect on
MT quality?
None of them!
Phrases
How do you translate
“real estate” into French?
interlingua
logical form
syntax
phrases
words
SOURCE
real estate
real number
dance number
dance card
memory card
memory stick
…
logical form
syntax
phrases
words
TARGET
Phrase-Based Statistical MT
Morgen | fliege | ich | nach Kanada | zur Konferenz
Tomorrow | I | will fly | to the conference | in Canada
• Foreign input segmented into phrases
– “phrase” just means “word sequence”
• Each phrase is probabilistically translated into English
– P(to the conference | zur Konferenz)
– P(into the meeting | zur Konferenz)
• Phrases are probabilistically re-ordered
See [Koehn et al, 2003] for an overview.
How to Learn the Phrase
Translation Table?
• One method: “alignment templates” [Och et al 99]
• Start with word alignment
• Collect all phrase pairs that are consistent with
the word alignment
Word Alignment Induced Phrases
Maria
Mary
did
not
slap
the
green
witch
no
dió
una bofetada a
la
bruja verde
Word Alignment Induced Phrases
Maria
no
dió
una bofetada a
la
bruja verde
Mary
did
not
slap
the
green
witch
(Maria, Mary) (no, did not) (dió una bofetada, slap) (la, the) (bruja, witch) (verde, green)
Word Alignment Induced Phrases
Maria
no
dió
una bofetada a
la
bruja verde
Mary
did
not
slap
the
green
witch
(Maria, Mary) (no, did not) (dió una bofetada, slap) (la, the) (bruja, witch) (verde, green)
(a la, the) (dió una bofetada a, slap) (bruja verde, green witch)
Word Alignment Induced Phrases
Maria
no
dió
una bofetada a
la
bruja verde
Mary
did
not
slap
the
green
witch
(Maria, Mary) (no, did not) (dió una bofetada, slap) (la, the) (bruja, witch) (verde, green)
(a la, the) (dió una bofetada a, slap) (bruja verde, green witch)
(Maria no, Mary did not) (no dió una bofetada, did not slap) (dió una bofetada a la, slap the)
Word Alignment Induced Phrases
Maria
no
dió
una bofetada a
la
bruja verde
Mary
did
not
slap
the
green
witch
(Maria, Mary) (no, did not) (dió una bofetada, slap) (la, the) (bruja, witch) (verde, green)
(a la, the) (dió una bofetada a, slap)
(Maria no, Mary did not) (no dió una bofetada, did not slap) (dió una bofetada a la, slap the)
(bruja verde, green witch) (Maria no dió una bofetada, Mary did not slap)
(a la bruja verde, the green witch) …
Word Alignment Induced Phrases
Maria
no
dió
una bofetada a
la
bruja verde
Mary
did
not
slap
the
green
witch
(Maria, Mary) (no, did not) (dió una bofetada, slap) (la, the) (bruja, witch) (verde, green)
(a la, the) (dió una bofetada a, slap)
(Maria no, Mary did not) (no dió una bofetada, did not slap) (dió una bofetada a la, slap the)
(bruja verde, green witch) (Maria no dió una bofetada, Mary did not slap)
(a la bruja verde, the green witch) …
(Maria no dió una bofetada a la bruja verde, Mary did not slap the green witch)
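The "consistent with the word alignment" criterion can be sketched as follows. This is a simplified illustration, not the alignment-template method of [Och et al 99]: for every foreign span it takes the smallest English span covering the linked words and keeps the pair only if no alignment link crosses the boundary. The alignment points below are an assumed reading of the slide (with "a" left unaligned), and the function name is hypothetical.

```python
def extract_phrases(f_words, e_words, alignment, max_len=10):
    """Collect phrase pairs consistent with a word alignment.

    alignment is a list of (foreign_index, english_index) links.
    """
    pairs = set()
    for f1 in range(len(f_words)):
        for f2 in range(f1, min(f1 + max_len, len(f_words))):
            # English positions linked to any word in the foreign span.
            linked = [e for (f, e) in alignment if f1 <= f <= f2]
            if not linked:
                continue
            e1, e2 = min(linked), max(linked)
            # Reject if an English word in the span links outside the foreign span.
            crosses = any(e1 <= e <= e2 and not (f1 <= f <= f2)
                          for (f, e) in alignment)
            if crosses:
                continue
            pairs.add((" ".join(f_words[f1:f2 + 1]),
                       " ".join(e_words[e1:e2 + 1])))
    return pairs

f_words = "Maria no dió una bofetada a la bruja verde".split()
e_words = "Mary did not slap the green witch".split()
# Assumed links: no-did, no-not, dió/una/bofetada-slap, la-the,
# bruja-witch, verde-green; "a" is unaligned.
alignment = [(0, 0), (1, 1), (1, 2), (2, 3), (3, 3), (4, 3),
             (6, 4), (7, 6), (8, 5)]
pairs = extract_phrases(f_words, e_words, alignment)
```

Because "dió" alone shares its English word with "una" and "bofetada", the pair (dió, slap) is rejected, while (dió una bofetada, slap) and (a la, the) are kept. Extending English spans over unaligned English words is left out of this sketch.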
Phrase Pair Probabilities
• A certain phrase pair (f-f-f, e-e-e) may appear
many times across the bilingual corpus.
• No EM training
• Just relative frequency:
$$P(\text{f-f-f} \mid \text{e-e-e}) = \frac{\mathrm{count}(\text{f-f-f}, \text{e-e-e})}{\mathrm{count}(\text{e-e-e})}$$
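A minimal sketch of the relative-frequency estimate, assuming the extracted phrase pairs are available as a flat list with repeats. The counts are invented for illustration, "zu der Konferenz" is a made-up variant, and the function name is hypothetical.

```python
from collections import Counter

def phrase_table(extracted_pairs):
    """Estimate P(f | e) = count(f, e) / count(e) from (f, e) phrase pairs."""
    pair_counts = Counter(extracted_pairs)
    e_counts = Counter(e for _, e in extracted_pairs)
    return {(f, e): count / e_counts[e] for (f, e), count in pair_counts.items()}

# Toy data: each tuple is one extracted occurrence of a phrase pair.
extracted = (
    [("zur Konferenz", "to the conference")] * 3
    + [("zu der Konferenz", "to the conference")]
    + [("zur Konferenz", "into the meeting")]
)
table = phrase_table(extracted)
```

No iterative training is involved: one pass over the extracted pairs is enough to fill the table.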
Phrase-Based MT
• This is currently the best way to do Statistical MT!
• What took so long to move from words to phrases?
– Missing RAM
• 25m parameters → billions of parameters
• Trick idea: build test-corpus-specific phrase table (takes 5 hours!)
• Now solved in commercial deployments
– Missing computing power
– Many competing ideas to shake out
• Koehn 03 summarizes several variations
– Empirical effectiveness even better than intuition would predict
• This is not building a ladder to the moon!
– If you can’t translate “real estate” into French, you are sunk
Advanced Training Methods
$$\arg\max_e P(e \mid f) = \arg\max_e \frac{P(e) \times P(f \mid e)}{P(f)} = \arg\max_e P(e) \times P(f \mid e)$$
Advanced Training Methods
$$\arg\max_e P(e \mid f) = \arg\max_e \frac{P(e) \times P(f \mid e)}{P(f)} = \arg\max_e P(e)^{2.4} \times P(f \mid e)$$
… works better!
Advanced Training Methods
$$\arg\max_e P(e \mid f) = \arg\max_e \frac{P(e) \times P(f \mid e)}{P(f)} = \arg\max_e P(e)^{2.4} \times P(f \mid e) \times \mathrm{length}(e)^{1.1}$$
Rewards longer hypotheses, since
these are unfairly punished by P(e)
Advanced Training Methods
$$\arg\max_e P(e)^{2.4} \times P(f \mid e) \times \mathrm{length}(e)^{1.1} \times \mathrm{FEAT}^{3.7} \cdots$$
Lots of features vote on every potential translation.
Exponential model.
Problem: How to set the exponent weights?
IDEA 1: maximize probability of the data
IDEA 2: maximize BLEU score of MT system
20.64% BLEU
WTM fixed at 1.0
17.96% BLEU
plot by Emil Ettelaie
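The exponential model can be sketched in log space, where each feature value contributes its logarithm times its weight. The weights 2.4, 1.0 and 1.1 follow the slides; the candidate translations and their feature values are invented for illustration, and the weights are simply given here, not tuned.

```python
import math

def loglinear_score(features, weights):
    """Log of the product of feature values, each raised to its weight."""
    return sum(weights[name] * math.log(value) for name, value in features.items())

weights = {"lm": 2.4, "tm": 1.0, "length": 1.1}

# Toy candidates: lm = P(e), tm = P(f | e), length = length(e).
candidates = {
    "gunman killed": {"lm": 1e-3, "tm": 0.02, "length": 2},
    "the gunman was killed by police": {"lm": 2e-4, "tm": 0.05, "length": 6},
}

scores = {e: loglinear_score(feats, weights) for e, feats in candidates.items()}
best = max(scores, key=scores.get)
```

Maximum-BLEU training would search over the entries of `weights` so that the `best` outputs on a development set score well; that search is not shown here.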
Maximum BLEU Training
• Novel algorithm developed by [Och 03]
• Opened gates to “feature hacking”
– Word-based feature to smooth phrase pair
counts (“Model1 Inverse”)
– Phrase-specific propensities to re-order
• Currently limited to ~25 features
Advances in Statistical MT
2005
Google’s Language Model
• Previously, largest language model was
trained on 1b words of English
• 20b words of news → significant impact on news translation
• 200b words of web → helpful
Maryland’s Hiero system [Chiang 05]
• Previously:
– ne mange pas → does not eat
• New phrase pairs with variables and
reordering
– ne X pas → does not X
– le X1 du X2 → X2 's X1
• Nesting
– “does not X” itself becomes an X
• CKY decoder
John
Cocke
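How phrase pairs with variables nest can be sketched with a naive recursive matcher. This is not the CKY decoder and not Hiero itself: it has no probabilities, and it applies a rule only when the rule covers the whole span. The two rules come from the slide; the lexicon entries are invented for illustration.

```python
RULES = [
    (["ne", "X1", "pas"], ["does", "not", "X1"]),
    (["le", "X1", "du", "X2"], ["X2", "'s", "X1"]),
]
# Toy word translations (assumed, not from the slides).
LEXICON = {"mange": "eat", "chat": "cat", "voisin": "neighbor"}

def match(pattern, tokens):
    """Return variable bindings if pattern covers all tokens, else None."""
    if not pattern:
        return {} if not tokens else None
    head, rest = pattern[0], pattern[1:]
    if head.startswith("X"):
        # A variable may absorb one or more tokens; try each split.
        for k in range(1, len(tokens) + 1):
            sub = match(rest, tokens[k:])
            if sub is not None:
                return {head: tokens[:k], **sub}
        return None
    if tokens and tokens[0] == head:
        return match(rest, tokens[1:])
    return None

def translate(tokens):
    """Apply the first matching rule, translating each variable recursively."""
    for source, target in RULES:
        bindings = match(source, tokens)
        if bindings is not None:
            out = []
            for symbol in target:
                out.extend(translate(bindings[symbol]) if symbol in bindings else [symbol])
            return out
    # No rule applies: fall back to word-by-word lookup.
    return [LEXICON.get(t, t) for t in tokens]

negation = translate("ne mange pas".split())
possessive = translate("le chat du voisin".split())
```

Because `translate` calls itself on whatever a variable captured, the output of one rule can fill the X of another, which is the nesting the slide describes.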
ISI’s Syntax-Based MT System
• First strong showing for an SMT system
that knows what nouns and verbs are!
• Why syntax?
“Frequent high-tech exports are bright spots for
foreign trade growth of Guangdong has made
important contributions.”
– Need much more grammatical output
– Need accurate control over re-ordering
– Need accurate insertion of function words
String Output
枪手 被 警方 击毙 .
The gunman killed by police .
Tree Output
枪手 被 警方 击毙 .
The gunman killed by police .
DT NN VBD IN NN
NPB
PP
NP-C
VP
S
Tree Output
枪手 被 警方 击毙 .
Gunman by police shot .
NN IN NN VBD
NPB
PP
NP-C
VP
S
Tree Output
枪手 被 警方 击毙 .
The gunman was killed by police .
DT NN AUX VBN IN NN
NPB
PP
NP-C
VP
S
Sample Rules Learned from Data
VP(VB(said) SBAR(IN(that) x0:S))
  0.57 → "说" "," x0
  0.09 → "说" x0
  0.02 → "他" "说" "," x0
  0.02 → "指出" "," x0
  0.02 → x0
NP(x0:NP PP(IN(from) x1:NP))
  0.27 → x1 x0
  0.15 → "来自" x1 x0
  0.06 → x1 "的" x0
  0.06 → "从" x1 x0
  0.06 → "来自" x1 "的" x0
Sample Rules Learned from Data
S(x0:NP VP(x1:VB x2:NP))
  (Chinese/English)
  0.82 → x0 x1 x2
  0.02 → x0 x1 "," x2
  (Arabic/English)
  0.54 → x0 x1 x2
  0.44 → x1 x0 x2 (subject-verb inversion)
Format is Expressive
[Figure: six example rule types: Phrasal Translation (está cantando ↔ is singing), Non-contiguous Phrases (poner x0 ↔ put x0 on), Non-constituent Phrases (hay x0 ↔ there are x0), Context-Sensitive Word Insertion (x0 ↔ the x0:NNS), Multilevel Re-Ordering, Lexicalized Re-Ordering.]
[Knight & Graehl, 2005]
Story Gets More Interesting…
MT
Applications
Automata Theory
Tree
Transducers
(Rounds 70)
Story Gets More Interesting…
Transformational
Grammar
(Chomsky 57)
MT
Linguistic Theory
Automata Theory
Tree
Transducers
(Rounds 70)
Applications
Story Gets More Interesting…
Transformational
Grammar
(Chomsky 57)
MT (05)
Compression (01)
Linguistic Theory
Applications
QA (03)
Generation (00)
Automata Theory
Tree
Transducers
(Rounds 70)
Story Gets More Interesting…
Transformational
Grammar
(Chomsky 57)
MT (05)
Compression (01)
Linguistic Theory
QA (03)
Applications
Generation (00)
Automata Theory
Tree
Transducers
(Rounds 70)
Algorithms
Efficient Transducer
Algorithms
Generic Tree
Toolkits
Summary
• Making good progress
• Algorithms + Data + Evaluation + Computers
• Interdisciplinary work
– Natural language processing
– Machine learning
– Linguistics
– Automata theory
• Hope that more people will join!
Thank you
Syntax-Based vs Phrase-Based
[Plot: BLEU (y-axis, 15 to 35) over time (x-axis: Mar 1, Apr 1, May 1, 2005), Chinese/English, NIST 2002 Test Set, with the phrase-based system labeled.]
Future PhD Theses?
“Syntax-based Language Models for Improving Statistical MT”
“Discriminative Training of Millions of Features for MT”
“Semantic Representations Induced from Multilingual EU and UN Data”
“What Makes One Language Pair More Difficult to Translate Than
Another”
“A State-of-the-Art MT System Based on Syntactic Transformations”
“New Training Methods for High-Quality Word Alignment”
+ many unpredictable ones…
Summary
• Phrase-based models are state-of-the-art
– Word alignments
– Phrase pair extraction & probabilities
– N-gram language models
– Beam search decoding
– Feature functions & learning weights
• But the output is not English
– Fluency must be improved
– Better translation of person names, organizations, locations
– More automatic acquisition of parallel data, exploitation of
monolingual data across a variety of domains/languages
– Need good accuracy across a variety of domains/languages
Available Resources
• Bilingual corpora
– 100m+ words of Chinese/English and Arabic/English, LDC (www.ldc.upenn.edu)
– Lots of French/English, Spanish/French/English, LDC
– European Parliament (sentence-aligned), 11 languages, Philipp Koehn, ISI (www.isi.edu/~koehn/publications/europarl)
– 20m words (sentence-aligned) of English/French, Ulrich Germann, ISI (www.isi.edu/natural-language/download/hansard/)
• Sentence alignment
– Dan Melamed, NYU (www.cs.nyu.edu/~melamed/GMA/docs/README.htm)
– Xiaoyi Ma, LDC (Champollion)
• Word alignment
– GIZA, JHU Workshop ’99 (www.clsp.jhu.edu/ws99/projects/mt/)
– GIZA++, RWTH Aachen (www-i6.Informatik.RWTH-Aachen.de/web/Software/GIZA++.html)
– Manually word-aligned test corpus (500 French/English sentence pairs), RWTH Aachen
– Shared task, NAACL-HLT’03 workshop
• Decoding
– ISI ReWrite Model 4 decoder (www.isi.edu/licensed-sw/rewrite-decoder/)
– ISI Pharaoh phrase-based decoder
• Statistical MT Tutorial Workbook, ISI (www.isi.edu/~knight/)
• Annual common-data evaluation, NIST (www.nist.gov/speech/tests/mt/index.htm)
Some Papers Referenced on Slides
• ACL
– [Och, Tillmann, & Ney, 1999]
– [Och & Ney, 2000]
– [Germann et al, 2001]
– [Yamada & Knight, 2001, 2002]
– [Papineni et al, 2002]
– [Alshawi et al, 1998]
– [Collins, 1997]
– [Koehn & Knight, 2003]
– [Al-Onaizan & Knight, 2002]
– [Och & Ney, 2002]
– [Och, 2003]
– [Koehn et al, 2003]
• EMNLP
– [Marcu & Wong, 2002]
– [Fox, 2002]
– [Munteanu & Marcu, 2002]
• AI Magazine
– [Knight, 1997]
• www.isi.edu/~knight
– [MT Tutorial Workbook]
• AMTA
– [Soricut et al, 2002]
– [Al-Onaizan & Knight, 1998]
• EACL
– [Cmejrek et al, 2003]
• Computational Linguistics
– [Brown et al, 1993]
– [Knight, 1999]
– [Wu, 1997]
• AAAI
– [Koehn & Knight, 2000]
• IWNLG
– [Habash, 2002]
• MT Summit
– [Charniak, Knight, Yamada, 2003]
• NAACL
– [Koehn, Marcu, Och, 2003]
– [Germann, 2003]
– [Graehl & Knight, 2004]
– [Galley, Hopkins, Knight, Marcu, 2004]
Ready-to-Use Online Bilingual Data
[Plot: millions of words (English side; y-axis, 0 to 140) by year (x-axis, 1994 to 2004) for Chinese/English, Arabic/English, and French/English.]
(Data stripped of formatting, in sentence-pair format, available
from the Linguistic Data Consortium at UPenn).
Ready-to-Use Online Bilingual Data
[Plot: millions of words (English side; y-axis, 0 to 180) by year (x-axis, 1994 to 2004) for Chinese/English, Arabic/English, and French/English.]
+ 1m-20m words for many language pairs
(Data stripped of formatting, in sentence-pair format, available
from the Linguistic Data Consortium at UPenn).
Ready-to-Use Online Bilingual Data
[Plot: millions of words (English side; y-axis, 0 to 180) by year (x-axis, 1994 to 2004) for Chinese/English, Arabic/English, and French/English, with the projection beyond 2004 marked "???" → One Billion?]
From No Data to Sentence Pairs
• Easy way: Linguistic Data Consortium (LDC)
• Really hard way: pay $$$
– Suppose one billion words of parallel data were sufficient
– At 20 cents/word, that’s $200 million
• Pretty hard way: Find it, and then earn it!
– De-formatting
– Remove strange characters
– Character code conversion
– Document alignment
– Sentence alignment
– Tokenization (also called Segmentation)
Sentence Alignment
The old man is happy. He has fished many times. His wife talks to him. The fish are jumping. The sharks await.
El viejo está feliz porque ha pescado muchas veces. Su mujer habla con él. Los tiburones esperan.
Sentence Alignment
1. The old man is happy.
2. He has fished many times.
3. His wife talks to him.
4. The fish are jumping.
5. The sharks await.
1. El viejo está feliz porque ha pescado muchas veces.
2. Su mujer habla con él.
3. Los tiburones esperan.
Sentence Alignment
1. The old man is happy. He has fished many times.
2. His wife talks to him.
3. The sharks await.
1. El viejo está feliz porque ha pescado muchas veces.
2. Su mujer habla con él.
3. Los tiburones esperan.
Note that unaligned sentences are thrown out, and
sentences are merged in n-to-m alignments (n, m > 0).
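The final step, turning an alignment into sentence pairs, can be sketched as below. It assumes some sentence aligner has already produced the alignment as a list of "beads", each a pair of index lists; that representation and the function name are hypothetical and not tied to any particular tool.

```python
def apply_alignment(src_sents, tgt_sents, beads):
    """Merge n-to-m beads into sentence pairs; drop beads with an empty side."""
    pairs = []
    for src_ids, tgt_ids in beads:
        if not src_ids or not tgt_ids:
            continue  # unaligned sentence: thrown out
        pairs.append((" ".join(src_sents[i] for i in src_ids),
                      " ".join(tgt_sents[j] for j in tgt_ids)))
    return pairs

english = ["The old man is happy.", "He has fished many times.",
           "His wife talks to him.", "The fish are jumping.",
           "The sharks await."]
spanish = ["El viejo está feliz porque ha pescado muchas veces.",
           "Su mujer habla con él.", "Los tiburones esperan."]

# Assumed aligner output: a 2-to-1 bead, a 1-to-1, an unaligned English
# sentence, and another 1-to-1.
beads = [([0, 1], [0]), ([2], [1]), ([3], []), ([4], [2])]
sentence_pairs = apply_alignment(english, spanish, beads)
```

"The fish are jumping." has no Spanish counterpart, so it does not appear in `sentence_pairs`.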
Tokenization (or Segmentation)
• English
– Input (some byte stream):
"There," said Bob.
– Output (7 “tokens” or “words”):
" There , " said Bob .
• Chinese
– Input (byte stream):
美国关岛国际机场及其办公室均接获
一名自称沙地阿拉伯富商拉登等发出
的电子邮件。
– Output:
美国 关岛国 际机 场 及其 办公
室均接获 一名 自称 沙地 阿拉 伯
富 商拉登 等发 出 的 电子邮件。
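For the English case, a regular expression is enough to sketch the idea: split off every punctuation mark as its own token. This is a simplified illustration (it would also split abbreviations and contractions), and it does not attempt Chinese, which needs a dedicated word segmenter because the input has no spaces.

```python
import re

def tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenize('"There," said Bob.')
```

On the slide's input this yields the same 7 tokens: `"`, `There`, `,`, `"`, `said`, `Bob`, `.`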
Lower-Casing
• English
– Input (7 words):
" There , " said Bob .
– Output (7 words):
" there , " said bob .
Idea of tokenizing and lower-casing:
The, the, “The, “the → the
Smaller vocabulary size.
More robust counting and learning.
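The effect on vocabulary size can be sketched by counting word types with and without lower-casing. The two sentences are invented for illustration and the function name is hypothetical.

```python
import re
from collections import Counter

def vocabulary(sentences, lowercase):
    """Count token types after tokenizing, optionally lower-casing first."""
    counts = Counter()
    for sentence in sentences:
        tokens = re.findall(r"\w+|[^\w\s]", sentence)
        counts.update(t.lower() if lowercase else t for t in tokens)
    return counts

sentences = ['The gunman fled.', '"The police saw the gunman."']
raw = vocabulary(sentences, lowercase=False)
lowered = vocabulary(sentences, lowercase=True)
```

Here `raw` keeps "The" and "the" as separate types, while `lowered` pools their counts under a single entry, so each remaining type is observed more often.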
Recent Progress in Statistical MT
• Why is that?
– Better algorithms that learn patterns from data
– More data
– Faster, cheaper computers with more RAM
– Community-wide test sets
– Novel automated evaluation methods
– Shared software tools
Three Problems for Statistical MT
• Translation model
– Given a pair of strings <f,e>, assigns P(f | e) by formula
– <f,e> look like translations → high P(f | e)
– <f,e> don’t look like translations → low P(f | e)
• Language model
– Given an English string e, assigns P(e) by formula
– good English string → high P(e)
– random word sequence → low P(e)
• Decoding algorithm
– Given a language model, a translation model, and a new
sentence f … find translation e maximizing P(e) · P(f | e)
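How the three components fit together can be sketched by choosing, from a fixed candidate list, the e that maximizes P(e) · P(f | e). A real decoder searches a vastly larger space and does not enumerate candidates like this. The probabilities are invented for illustration, and the dictionary-backed models are stand-ins.

```python
def decode(f, candidates, lm, tm):
    """Return the candidate e maximizing lm(e) * tm(f, e)."""
    return max(candidates, key=lambda e: lm(e) * tm(f, e))

# Toy language model: fluent English gets the higher P(e).
LM = {"His wife talks to him.": 1e-6, "His wife talks with he.": 1e-9}
# Toy translation model: P(f | e) for one Spanish sentence.
TM = {
    ("Su mujer habla con él.", "His wife talks to him."): 0.2,
    ("Su mujer habla con él.", "His wife talks with he."): 0.3,
}

f = "Su mujer habla con él."
best = decode(f, list(LM), lm=lambda e: LM[e], tm=lambda f_, e: TM[(f_, e)])
```

In this toy setup the translation model slightly prefers the word-for-word rendering, and the language model overrules it.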
Web Language Models
French input
She has a lot of nerve.
[20]
It has a lot of nerve.
[3]
?
[Soricut, Knight, Marcu, 02]
Used by Google in 2005 to increase performance of
their research MT system!