Machine Translation and Lexical
Resources Activity at IIT Bombay
Pushpak Bhattacharyya
Computer Science and Engineering
Department
Indian Institute of Technology Bombay
[email protected]
http://www.cse.iitb.ac.in/pb
Interlingua Methodology
Directly obtain the meaning of the source sentence.
Do target sentence generation from the meaning
representation.
John gave the book to Mary.
Meaning representation:
give-action:
agent: john
object: the book
receiver: mary
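A minimal sketch of the two interlingua steps, assuming the meaning representation is held in a small record and the target sentence comes from a single template. The names `Frame` and `generate_english` are hypothetical and are not part of any system described in these slides.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    """Language-independent meaning of an event (hypothetical structure)."""
    action: str
    agent: str
    obj: str
    receiver: str


def generate_english(frame: Frame) -> str:
    # Toy generator: one fixed sentence template and a tiny past-tense table.
    past = {"give": "gave"}
    verb = past.get(frame.action, frame.action + "ed")
    agent = frame.agent.capitalize()
    receiver = frame.receiver.capitalize()
    return f"{agent} {verb} {frame.obj} to {receiver}."


meaning = Frame(action="give", agent="john", obj="the book", receiver="mary")
print(generate_english(meaning))  # John gave the book to Mary.
```

A generator for another target language would read the same `Frame` and differ only in its template and lexicon.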
Competing approaches
• Direct
• Transfer based
MT Architectures: Vauquois'
triangle
State of Affairs
• Systran reports 19 different language pairs.
• Only 8 are adequate for their intended use.
• Even fewer are capable of quality written or
spoken text translation.
ENGLISH-SPANISH-ENGLISH
• ...In that Empire, the Art of Cartography attained such
Perfection that the map of a single Province occupied
the entirety of a City, and the map of the Empire, the
entirety of a Province
• ... en ese imperio, el arte de la cartografía logró tal
perfección que el mapa de una sola provincia ocupó
la totalidad de una ciudad, y el mapa del imperio, la
totalidad de una provincia
• ... in that empire, the art of the cartography obtained
such perfection that the map of a single province
occupied the totality of a city, and the map of the
empire, the totality of a province
Provided by Systran on 19/11/02
ENGLISH-KOREAN-ENGLISH
• ...In that Empire, the Art of Cartography attained such Perfection
that the map of a single Province occupied the entirety of a City,
and the map of the Empire, the entirety of a Province
• 저 제국안에, 단순한 지방의 지도가 도시
의 완전을 점유했다 고 Cartography의 예
술은 같은 얀벽,및 제국, 지방의 완전의
지도 를 달성했다
• Inside that empire, the map of the region where it is simple
occupied the perfection of the city the art of the Cartography is
same, yan it attained the map of of perfection of the wall and
empire and region
Provided by Systran on 19/11/02
UNL Based MT: the scenario
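[Figure: English, Russian, French and Hindi are each linked to UNL. Enconversion maps a natural language into UNL; deconversion maps UNL into a natural language.]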
Universal Networking
Language
Common language for computers to express
information written in natural language
(Uchida et al. 2000)
Application:
Electronic language to overcome language
barrier
Information Distribution System
UNL Example
agt(arrange, John)
obj(arrange, meeting)
plc(arrange, residence)
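The example above is a small labelled graph. One possible way to hold it in code, as a sketch, is a list of (relation, head, dependent) triples. The helper name `outgoing` is hypothetical.

```python
from collections import defaultdict

# Each edge is (relation, head, dependent), read from the example above.
edges = [
    ("agt", "arrange", "John"),
    ("obj", "arrange", "meeting"),
    ("plc", "arrange", "residence"),
]


def outgoing(edges):
    """Group edges by head node: head -> {relation: dependent}."""
    graph = defaultdict(dict)
    for rel, head, dep in edges:
        graph[head][rel] = dep
    return dict(graph)


graph = outgoing(edges)
print(graph["arrange"]["agt"])  # John
```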
Components of the UNL System
• Universal Word
• Relation Labels
• Attributes
Universal Word
[साया] "shadow(icl>darkness)"; the place was now in shadow
[लेशमात्र] "shadow(icl>iota)"; not a shadow of doubt about his guilt
[संकेत] "shadow(icl>hint)"; the shadow of the things to come
[छाया] "shadow(icl>deterrant)"; a shadow over his happiness
Universal Word
(foreign concepts)
[aput] "snow(icl>thing)";
[pukak] "snow(aoj<salt like)";
[mauja] "snow(aoj<soft, aoj<deep)";
[massak] "snow(aoj<soft)";
[mangokpok] "snow(aoj<watery)";
Relation
agt (agent) Agt defines a thing which initiates an action.
agt (do, thing)
Syntax
agt[":"<Compound UW-ID>] "(" {<UW1>|":"<Compound UW-ID>}
"," {<UW2>|":"<Compound UW-ID>} ")"
Detailed Definition
Agent is defined as the relation between:
UW1 - do, and
UW2 - a thing
where:
UW2 initiates UW1, or
UW2 is thought of as having a direct role in making UW1 happen.
Examples and readings
agt(break(icl>do), John(icl>person)) John breaks
agt(translate(icl>do), computer(icl>machine)) computer translates
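A relation written in the simple form shown in the examples can be split into its label and two UWs by locating the comma that sits outside the UWs' own parentheses. The sketch below assumes that simple form only; it does not handle the optional compound UW-ID part of the syntax, and `parse_relation` is a hypothetical name.

```python
def parse_relation(expr: str):
    """Split 'rel(UW1, UW2)' into (rel, UW1, UW2).

    Handles only simple UWs; the compound UW-ID part of the syntax
    is not supported in this sketch.
    """
    expr = expr.strip()
    open_idx = expr.index("(")
    if not expr.endswith(")"):
        raise ValueError(f"missing closing parenthesis: {expr!r}")
    label = expr[:open_idx].strip()
    body = expr[open_idx + 1:-1]

    # Find the comma at nesting depth 0; commas inside a UW's own
    # parentheses must not split the arguments.
    depth = 0
    for i, ch in enumerate(body):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            return label, body[:i].strip(), body[i + 1:].strip()
    raise ValueError(f"expected two arguments: {expr!r}")


print(parse_relation("agt(break(icl>do), John(icl>person))"))
# ('agt', 'break(icl>do)', 'John(icl>person)')
```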
Attributes
• Used to describe what is said from the
speaker's point of view.
• In particular captures number, tense,
aspect and modality information.
Example Attributes
• I see a flower
UNL: obj(see(icl>do), flower(icl>thing))
• I saw flowers
UNL: obj(see(icl>do).@past, flower(icl>thing).@pl)
• Did I see flowers?
UNL: obj(see(icl>do).@past.@interrogative,
flower(icl>thing).@pl)
• Please see the flowers.
UNL: obj(see(icl>do).@request,
flower(icl>thing).@pl.@definite)
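In the examples above, attributes are written as `.@name` suffixes on a UW. A small sketch of separating a UW from its attributes and putting them back, assuming the sequence `.@` never occurs inside the UW itself. Both function names are hypothetical.

```python
def split_attributes(uw: str):
    """Separate a UW from its '.@attr' suffixes, e.g. 'see(icl>do).@past'."""
    head, *attrs = uw.split(".@")
    return head, attrs


def with_attributes(head: str, attrs):
    """Inverse of split_attributes: re-attach the attribute suffixes."""
    return head + "".join(f".@{a}" for a in attrs)


head, attrs = split_attributes("see(icl>do).@past.@interrogative")
print(head, attrs)  # see(icl>do) ['past', 'interrogative']
```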
The Analyser Machine
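[Figure: the Enconverter applies the Analysis Rules and the Dictionary to the Node List (nodes ni-1, ni, ni+1, ni+2, ni+3), reading it through analysis windows (A) and condition windows (C), and builds the Node-net.]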
Strategy for Analysis
• Morphological Analysis
• Syntactico-Semantic Analysis
Analysis of a simple sentence
<< A Report of John’s genius reached King’s ears>>
article and noun are combined and [email protected] is added to the noun.
<<[Report ][of] John’s genius reached king’s ears>>
Right shift to put preposition with the succeeding noun.
<</Report /[of ][John’s] genius reached king’s ears>>
John’s being a possessive noun, shift right.
<</Report //of / [John’s] [genius] reached king’s ears>>
These two nouns are resolved into relation pos and first noun is deleted:
Simple sentence (continued)
<</Report /[of][genius] reached King’s ears>>
The preposition of is then combined with noun and a dynamic attribute OFRES
is added to entry of genius.
<<[Report][of genius ] reached King’s ears>>
Using the attribute OFRES these two nouns are resolved to relation mod and
the second noun is deleted.
<<[Report ][reached] King’s ears>>
Shift right again and resolve King’s ears; relation pof is generated.
<</Report /[reached][ ears]>>
Relation obj is generated here and then relation agt is generated between
Report and ears
<</reached />>
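The reduction above can be replayed as a hand-scripted trace: each step emits one relation and removes the dependent node, until only the predicate remains. This is a toy illustration, not the Enconverter's rule format. The argument order inside each relation and the final agt(reached, Report) pairing are assumptions of the sketch.

```python
nodes = ["Report", "of", "John's", "genius", "reached", "King's", "ears"]

# (relation, head, dependent); argument order is an assumption of this sketch.
steps = [
    ("pos", "genius", "John's"),
    ("mod", "Report", "genius"),
    ("pof", "ears", "King's"),
    ("obj", "reached", "ears"),
    ("agt", "reached", "Report"),
]


def run(nodes, steps):
    """Replay the steps, returning the surviving nodes and the relations."""
    remaining = [n for n in nodes if n != "of"]  # 'of' is absorbed into its noun
    relations = []
    for rel, head, dep in steps:
        relations.append(f"{rel}({head}, {dep})")
        remaining.remove(dep)  # the dependent leaves the node list
    return remaining, relations


remaining, relations = run(nodes, steps)
print(remaining)  # ['reached']
for r in relations:
    print(r)
```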
UNL as Interlingua and
Language Divergence
(Dave, Parikh, Bhattacharyya, JMT, 2003)
• Stands for the discrepancy in representation
due to the inherent characteristics of the
languages.
• Syntactic Divergence
• Lexical Semantic Divergence
Issue of free word order
जीम ने चोरी करनेवाले लड़के को लाठी से मारा।
जीम ने लाठी से चोरी करनेवाले लड़के को मारा।
चोरी करनेवाले लड़के को जीम ने लाठी से मारा।
चोरी करनेवाले लड़के को लाठी से जीम ने मारा।
लाठी से जीम ने चोरी करनेवाले लड़के को मारा।
(All five orders mean: Jim beat the thieving boy with a stick.)
• The analyser exploits the fact that Hindi postpositions stay adjacent
to their nouns (unlike the preposition stranding divergence).
• Flexibility in parsing: find the predicate and preserve it till the
end.
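The two points above can be illustrated with a toy analyser: because each postposition closes its own noun phrase, the phrases can arrive in any order, and the verb is held back until the end. The romanised tokens and the postposition-to-relation table are simplifying assumptions of this sketch, not the actual rule base.

```python
# Hypothetical, simplified mapping from Hindi postpositions to UNL relations.
POSTPOSITION_TO_RELATION = {"ne": "agt", "ko": "obj", "se": "ins"}


def analyse(tokens):
    """Return (predicate, {relation: noun phrase}), ignoring phrase order.

    Assumes the verb is the final token and every noun phrase ends in
    one of the postpositions above.
    """
    *rest, predicate = tokens  # preserve the predicate till the end
    relations, phrase = {}, []
    for tok in rest:
        if tok in POSTPOSITION_TO_RELATION:
            relations[POSTPOSITION_TO_RELATION[tok]] = " ".join(phrase)
            phrase = []
        else:
            phrase.append(tok)
    return predicate, relations


orders = [
    "jim ne chori karnevale ladke ko lathi se mara",
    "lathi se jim ne chori karnevale ladke ko mara",
    "chori karnevale ladke ko lathi se jim ne mara",
]
results = [analyse(s.split()) for s in orders]
print(results[0])
```

All three word orders produce the same predicate and the same set of relations.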
Conjunct and compound verbs
Typical Indian language phenomenon. Compound for verb-verb,
conjunct for other POS+verb.
H वह गाने लगी
E She started singing.
H चले जाओ
E Go away.
H रुक जाओ
E Stop there.
H झुक जाओ
E Bend down.
Possibility of combinatorial explosion in the lexicon. Possible
solution: wordnet?
Use of Lexical Resources
• Automatic generation of the UW to
language dictionary
(Verma and Bhattacharyya, Global Wordnet Conference,
Czech Republic, 2004)
• Universal Word generation
• Semantic attribute generation
• Heavy use of wordnets and ontologies
Wordnet and Lexical Resources
• Approximately 12000 Hindi synsets
corresponding to about 35000 root
words of Hindi.
• Approximately 7000 Hindi synsets
corresponding to about 16000 root
words of Hindi.
• Verb hierarchy of approximately
4000 unique words corresponding to
6000 senses.
WordNet Sub-Graph
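[Figure: Hindi WordNet sub-graph centred on the synset {घर, गृह} (house), linked by hypernymy, hyponymy and meronymy edges to the nodes संरचना (structure); आवास, निवास (dwelling); रसोईघर (kitchen); आँगन (courtyard); बरामदा (verandah); शयन कक्ष (bedroom); अध्ययन कक्ष (study room); अतिथि गृह (guest house); आश्रम (hermitage); झोपड़ी (hut).]
Gloss: मनुष्यों का छाया हुआ वह स्थान जो दीवारों से घेर कर बनाया जाता है (a roofed place for people, made by enclosing it with walls)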
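A sub-graph like this one can be stored as a dictionary from synset to typed edges and then walked along one relation. The sketch below uses romanised names, and the placement of individual edges is partly assumed, since the figure layout is not fully recoverable. `hypernym_chain` is a hypothetical helper.

```python
# Romanised synset names; edge placement is partly assumed from the figure.
wordnet = {
    "ghar": {
        "hypernym": ["aavaas"],
        "meronym": ["rasoighar", "aangan", "baraamdaa", "shayan kaksh"],
        "hyponym": ["atithi grih", "aashram", "jhopdi"],
    },
    "aavaas": {"hypernym": ["sanrachnaa"]},
}


def hypernym_chain(graph, synset):
    """Follow the first hypernym edge upwards until none is left."""
    chain = []
    while graph.get(synset, {}).get("hypernym"):
        synset = graph[synset]["hypernym"][0]
        chain.append(synset)
    return chain


print(hypernym_chain(wordnet, "ghar"))  # ['aavaas', 'sanrachnaa']
```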
Languages under Study
Language   Analysis Status       Generation Status
English    D- 60000, R- 5000     D- 60000, R- 400
Hindi      D- 75000, R- 5700     D- 75000, R- 6500
Marathi    D- 4000, R- 2200      D- 4000, R- 6000
Bengali    D- 500, R- 1800       D- 500, R- 2100
Conclusions
• Predicate preservation strategy used for
English, Hindi, Marathi, Bengali (Spanish
being added).
• Focus on morphology for Marathi.
• Focus on kaarak (case) system for Bengali.
• The approach is extremely hungry for lexical knowledge.
Conclusions (continued)
• Work is going on in the creation of Indian
language wordnets (Hindi and Marathi at IIT
Bombay; Dravidian at Anna University).
• Interlingua has the attractive possibility of
being used as a knowledge representation
and applied to interesting applications such as
summarization, text clustering, and
meaning-based multilingual search engines.