CS4025: Machine Translation
Background, how languages differ
MT Techniques
Controlled languages
For more info: J&M, chap 21 in 1st ed, 25 in 2nd.
Also extra notes.
Computing Science, University of Aberdeen
Machine Translation
Automatically translate texts between languages
(eg, English to Japanese)
» Or assist human translators?
One of the oldest dreams of NLP, AI, and CS
(first system in 1954).
Varieties of Machine Translation
Translating from a source language to a target
language.
(FA)MT – (fully automatic) Machine Translation
HAMT – Human Aided MT (aid before or after)
MAHT – Machine Aided Human Translation
Brief History of MT
Serious but naïve work in the 1950’s
1966 ALPAC report (speed, cost, accuracy)
terminated most research funding
“Underground” MT systems developed into
products (e.g. SYSTRAN) in the 1970’s
More MT products emerged in the 1980’s and
1990’s, though still relatively simple
MT now in everyday widespread use (e.g. for web
pages), in spite of its problems
Translation is Hard:
Language differences
Lexical
Meanings assigned to a word
» to know a person
» to know a fact
Boundaries on a scale
» friend vs acquaintance
Preferences
» sibling vs brother vs elder brother
Gaps
» Japanese has no word for privacy
Overlaps between word senses (Eng/Fr)
Syntactic differences
Morphology vs word-order
» English: John saw Jane
» Russian: John[+subject] saw Jane[+object]
Which word orders
» English: a cheap car
» French: a car cheap
Argument order (e.g. VSO/SVO/SOV languages)
» English: John likes apples
» Spanish: apples gustar John
Pragmatic differences
Zero pronouns
» Bake [] for 20 minutes
Extra distinctions
» Relative-status markers in Japanese
Cultural knowledge
» mu -> curtains of her bed, not just curtains
Translating from Chinese to English…
dai yu zi zai chuang shang gan nian bao chai you
ting jian chuang wai zhu shao xiang ye zhe shang,
yu sheng xi li, qing han tou mu, bu jue you di xia lei
lai.
Dai-yu alone on bed top think-of-with-gratitude Bao-chai
again listen to window outside bamboo tip plantain leaf of ontop rain sound sigh drop clear cold penetrate curtain not
feeling again fall down tears come
As she lay there alone, Dai-yu’s thoughts turned to Bao-chai… Then she listened to the insistent rustle of the rain on
the bamboos and plantains outside her window. The coldness
penetrated the curtains of her bed. Almost without noticing
it she had begun to cry.
Perfect Translation needs World
Knowledge
Example: Translating “it” into a language which
associates grammatical gender with nouns requires
identifying the antecedent:
» A hollow cylinder … rests on a surface … and an object is
suspended so that it …
English    German     Gender      Pronoun
Surface    Flaeche    Feminine    sie
Cylinder   Zylinder   Masculine   er
Object     Objekt     Neuter      es
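Once the antecedent of "it" has been identified, choosing the German pronoun reduces to a gender lookup. A minimal Python sketch of that lookup follows; the dictionary layout and the function name translate_it are illustrative assumptions, not part of any real MT system.

```python
# Nouns, genders and pronouns come from the slide's table;
# the data structure and function name are hypothetical.
GERMAN = {
    "surface": ("Flaeche", "feminine", "sie"),
    "cylinder": ("Zylinder", "masculine", "er"),
    "object": ("Objekt", "neuter", "es"),
}


def translate_it(antecedent):
    """Return the German pronoun for English "it", given its resolved antecedent."""
    noun, gender, pronoun = GERMAN[antecedent.lower()]
    return pronoun


for antecedent in ("surface", "cylinder", "object"):
    print(antecedent, "->", translate_it(antecedent))
```

The sketch assumes the antecedent is already known. Deciding which noun "it" refers to is the step that needs world knowledge, and it is not solved here.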
Approaches to MT
Direct Translation
No intermediate representation. Possibly
morphological analysis and simple reordering
principles
Input: [Japanese text]
After word-by-word translation
» I give PAST pen on desk John to
After word-order, det rewrite rules
» I give PAST the pen on the desk to John
After morphology
» I gave the pen on the desk to John
Direct Translation - Issues
Completely tied to a language pair
» Complete new system for each pair
Problems dealing with ambiguity:
Example (Russian-English)
» My trebuem mira
» We require world (direct translation)
» We want peace (correct translation)
Don’t need complex NLP
» used in cheap translators
Useful as a “default translation” if more complex
techniques fail
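The ambiguity problem can be illustrated with a toy word-by-word translator. This is only a sketch: the three-entry lexicon, the order of its senses, and the function name direct_translate are assumptions made for the example.

```python
# Toy Russian-English lexicon; entries and sense order are assumptions for illustration.
LEXICON = {
    "my": ["we"],
    "trebuem": ["require", "want"],
    "mira": ["world", "peace"],
}


def direct_translate(sentence):
    """Word-by-word translation: take the first listed sense of each word, ignoring context."""
    words = sentence.lower().split()
    # Words missing from the lexicon are passed through unchanged.
    return " ".join(LEXICON.get(word, [word])[0] for word in words)


print(direct_translate("My trebuem mira"))  # we require world
```

Because each word is translated in isolation, nothing in this design can prefer "want peace" over "require world"; choosing between senses needs context that a direct system does not model.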
Structural Transfer
Three steps
» parse input text (reusable)
» rewrite parse tree into parse tree of new language
(specific to language pair)
– English NP -> Det Adj N becomes French NP -> Det N Adj
» generate output text (reusable)
More in next lecture
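The middle step, rewriting the parse tree, might look like the following Python sketch. The tuple-based tree encoding and the function names transfer and leaves are assumptions for illustration, and only the single NP rule from the slide is implemented; the words themselves are left untranslated.

```python
# A tree is a (label, children) pair; a leaf is a (tag, word) pair whose second item is a string.
def transfer(tree):
    """Rewrite an English parse tree into French constituent order (one toy rule only)."""
    label, children = tree
    if isinstance(children, str):  # leaf: (part-of-speech tag, word)
        return tree
    children = [transfer(child) for child in children]
    tags = [child[0] for child in children]
    # Transfer rule: English NP -> Det Adj N becomes French NP -> Det N Adj.
    if label == "NP" and tags == ["Det", "Adj", "N"]:
        det, adj, noun = children
        children = [det, noun, adj]
    return (label, children)


def leaves(tree):
    """Collect the words of a tree from left to right."""
    label, children = tree
    if isinstance(children, str):
        return [children]
    return [word for child in children for word in leaves(child)]


english = ("NP", [("Det", "a"), ("Adj", "cheap"), ("N", "car")])
print(" ".join(leaves(transfer(english))))  # a car cheap
```

The parser and generator on either side of this step are reusable across language pairs; only the rewrite rules inside transfer are specific to English-French.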
Structural Transfer - Issues
Most popular approach (?)
» Used in Systran (Altavista translator)
n*(n-1) transfer components needed for
translation between n languages
Good for syntax, less good for words, pragmatics
» supplement with other techniques, such as statistical
translation of individual words?
Interlingua Approach
Two steps
» full analysis of input text, into a meaning (interlingua)
– eg, know into KnowFact or KnowPerson
» full generation of output text, from meaning
Can’t be done except in a small domain
Preserving ambiguity
» if target language uses same word for KnowFact or
KnowPerson, no need to disambiguate know
Interlingua Approach - Issues
Interlingua must contain all aspects of meaning
needed for all the languages (e.g. gender for
Spanish cats)
Interlingua must reflect all the different views on
how the world is made up (e.g. Japanese “yasai”
refers mostly to vegetables, but also mint but not
carrots)
For this to work, the domain must be restricted
and the languages similar
Translation between n languages only needs n
analysis components and n generation components
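To see how the two counts scale, compare the n analysis plus n generation components needed here with the n*(n-1) transfer components needed by structural transfer. The function names in this small Python sketch are assumptions for illustration.

```python
def transfer_components(n):
    """Transfer modules for n languages: one per ordered (source, target) pair."""
    return n * (n - 1)


def interlingua_components(n):
    """Interlingua modules for n languages: n analysers plus n generators."""
    return 2 * n


for n in (2, 3, 10):
    print(n, transfer_components(n), interlingua_components(n))
```

For 10 languages this gives 90 transfer components against 20 interlingua components. The two counts are equal at n = 3 and the gap widens quickly after that.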
Statistical Approach
Noisy channel model for speech recognition: look for
the sentence Sent that maximises P(Sig|Sent)*P(Sent)
MT: look for translation Sent that maximises
P(Input|Sent)*P(Sent)
» faithfulness*fluency??
» P(Sent) - estimated using bigrams/trigrams
» P(Input|Sent) - estimated by analysing a corpus of
human-translated texts
– eg, how often is know translated as savoir (know fact) and
how often as connaitre (know person)
– Also model reordering, insertions, deletions
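The search for the best Sent can be sketched in Python as below. Every probability is invented for illustration rather than estimated from a corpus, the candidate list is supplied by hand, and the alignment is assumed to be one-to-one and in order, so reordering, insertions and deletions are not modelled. All names are hypothetical.

```python
import math

# All probabilities are invented for illustration, not estimated from a corpus.
# P(english_word | french_word): the "faithfulness" (translation) model.
TRANSLATION = {
    ("I", "je"): 0.9,
    ("know", "sais"): 0.5,
    ("know", "connais"): 0.5,
    ("Mary", "Marie"): 0.9,
}
# P(word | previous_word): the "fluency" (bigram) model; "<s>" marks sentence start.
BIGRAM = {
    ("<s>", "je"): 0.2,
    ("je", "sais"): 0.05,
    ("je", "connais"): 0.03,
    ("sais", "Marie"): 0.0001,
    ("connais", "Marie"): 0.01,
}
FLOOR = 1e-9  # probability assigned to unseen events


def log_faithfulness(source, candidate):
    """log P(Input|Sent), assuming a one-to-one, in-order word alignment."""
    return sum(math.log(TRANSLATION.get((s, c), FLOOR)) for s, c in zip(source, candidate))


def log_fluency(candidate):
    """log P(Sent) under the bigram model."""
    previous = ["<s>"] + candidate[:-1]
    return sum(math.log(BIGRAM.get((p, w), FLOOR)) for p, w in zip(previous, candidate))


def best_translation(source, candidates):
    """Pick the candidate maximising P(Input|Sent) * P(Sent), computed in log space."""
    return max(candidates, key=lambda c: log_faithfulness(source, c) + log_fluency(c))


source = "I know Mary".split()
candidates = ["je sais Marie".split(), "je connais Marie".split()]
print(" ".join(best_translation(source, candidates)))  # je connais Marie
```

Both candidates are equally faithful under these toy numbers, so the language model decides: "connais Marie" is a far more likely bigram than "sais Marie", which selects the know-person reading.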
Statistical Approach - Issues
P(Input|Sent)
» Very hard to model situations where translation
reorders material, even if this has a simple
syntactic description
» How “faithful” is a proposed output sentence to
the original input text?
» Less clear what this means once we go beyond
translating individual words
» Combine with direct techniques?
MT Performance
Translating 100 sentences is trivial, the problems
are all in the scaling-up.
» Good dictionaries are key.
Three uses
» Fully automatic rough translation
– like Altavista/Systran Babelfish
» Draft translations which a human post-edits (humans can
postedit quickly as long as less than 20% of words need
to be changed)
» Tools for translators (MAHT)
Another approach to HAMT:
Controlled Languages
A controlled (simplified, basic) English is a subset
of full English.
» Limited vocabulary: repair but not fix
» Limited syntax: I ate but not I have eaten
Mainly used for technical documents
Originally intended to make manuals easier for
non-native speakers
MT works much better if input is Controlled
English
AECMA Simplified English
(Emerging) standard for commercial aerospace
industry.
Designed by academic linguists as well as
practitioners (technical authors).
AECMA: vocabulary
Fixed vocabulary (2000 words?) with additions
limited to specific areas (eg, company names).
Goal is “each word means only one thing”, and
“each concept is expressed by only one word”. No
ambiguity, no synonyms.
Example words
Above: only use to indicate physical position
» Legal: The wing is above the wheel
» Illegal: The engine temperature is above normal
» Legal: The engine temperature is more than normal
Test: use as noun only
» Legal: the system test
» Illegal: Test the circuit.
» Legal: Do a test on the circuit.
AECMA: Syntax
Rule: Forbid “unusual” English syntax
Ex: only simple past, present, future tenses
» Illegal: Any other information is to be ignored
» Legal: Ignore any other information
Ex: No gerunds
» Illegal: Changing the light is dangerous.
» Legal: It is dangerous to change the light.
AECMA: Syntax Examples (2)
Only two noun-noun modifiers
» Illegal: The aircraft door attachment bolt
» Legal: The attachment bolt of the aircraft door
Verbs and determiners must be included
» Illegal: Rotary switch to INPUT
» Legal: Set the rotary switch to INPUT
AECMA: Stylistic Rules
Sentences should be 20 words or less
Paragraphs should be 6 sentences or less.
Start warnings with a command
» Illegal: The oil used in the engine contains toxic additives
which may be absorbed through the skin.
» Legal: Do not get the oil on your skin. It is poisonous.
Controlled-Language MT
Much easier
» No problems disambiguating words
» Hard syntax is forbidden
» May also prohibit/restrict pronouns
Authors must write in Controlled English (CE)
» CE conformance checkers
A lot of commercial interest
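A conformance checker can be approximated with a few surface tests. The Python sketch below checks only the two length limits and a one-entry banned-word list built from the "repair but not fix" example; the sentence splitter is naive, and the names BANNED and check_paragraph are assumptions for illustration, not the behaviour of any real checker.

```python
import re

MAX_WORDS_PER_SENTENCE = 20
MAX_SENTENCES_PER_PARAGRAPH = 6
# Hypothetical banned-word list with approved replacements; a real vocabulary is far larger.
BANNED = {"fix": "repair"}


def check_paragraph(paragraph):
    """Return a list of rule violations found in one paragraph of text."""
    problems = []
    # Naive sentence splitter: break after ., ! or ? when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]
    if len(sentences) > MAX_SENTENCES_PER_PARAGRAPH:
        problems.append(f"paragraph has {len(sentences)} sentences")
    for sentence in sentences:
        words = re.findall(r"[A-Za-z']+", sentence)
        if len(words) > MAX_WORDS_PER_SENTENCE:
            problems.append(f"sentence has {len(words)} words: {sentence}")
        for word in words:
            if word.lower() in BANNED:
                problems.append(f"use '{BANNED[word.lower()]}' instead of '{word}'")
    return problems


print(check_paragraph("Fix the valve. Do not get the oil on your skin."))
```

Rules about syntax, such as the ban on gerunds or the limit on noun-noun modifiers, would need at least part-of-speech tagging and are outside the scope of this sketch.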