Natural Language Processing:
a Multilingual Word Sense
Disambiguation Perspective
Pushpak Bhattacharyya
Computer Science and Engineering Department
IIT Bombay
www.cse.iitb.ac.in/~pb
Motivating Factors
1. Ambiguity
2. Multilinguality
Ambiguity
The Crux of the problem
Stages of language processing
• Phonetics and phonology
• Morphology
• Lexical Analysis
• Syntactic Analysis
• Semantic Analysis
• Pragmatics
• Discourse
Phonetics
• Processing of speech
• Challenges
  – Homophones: bank (finance) vs. bank (river bank)
  – Near homophones: maatraa vs. maatra (Hindi)
  – Word boundary
    • aajaayenge: aa jaayenge (will come) or aaj aayenge (will come today)
    • I got [ua] plate
  – Phrase boundary
    • mtech1 students are especially exhorted to attend, as such seminars
      are integral to one's post-graduate education
  – Disfluency: ah, um, ahem, etc.
Morphology
• Word formation rules from root words
• Nouns: Plural (boy-boys); Gender marking (czar-czarina)
• Verbs: Tense (stretch-stretched); Aspect (e.g. perfective sit-had
sat); Modality (e.g. request khaanaa khaaiie)
• The first crucial step in NLP
• Languages rich in morphology: e.g., Dravidian, Hungarian,
Turkish
• Languages poor in morphology: Chinese, English
• Languages with rich morphology have the advantage of easier
  processing at the higher stages
• A task of interest to computer science: Finite State Machines for
Word Morphology
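To make the finite-state point concrete, here is a minimal sketch (not from the talk) of a finite-state analyser for a tiny fragment of English plural morphology; the toy lexicon and suffix arcs are illustrative assumptions.

# A minimal finite-state sketch of English plural analysis (illustrative only;
# the tiny lexicon and suffix rules are assumptions, not a full morphology).

LEXICON = {"boy", "mango", "dog", "czar"}

def analyse(word):
    """Return (root, features) if the word is covered by this toy FSM."""
    if word in LEXICON:                       # arc: bare stem
        return word, {"number": "singular"}
    for suffix in ("es", "s"):                # arcs: plural suffixes
        if word.endswith(suffix) and word[: -len(suffix)] in LEXICON:
            return word[: -len(suffix)], {"number": "plural"}
    return None                               # no accepting path

if __name__ == "__main__":
    for w in ("boys", "mangoes", "dog", "czarina"):
        print(w, "->", analyse(w))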
Lexical Analysis
• Essentially refers to dictionary access and
obtaining the properties of the word
e.g., dog:
  noun (lexical property)
  takes -s in plural (morph property)
  animate (semantic property)
  4-legged (semantic property)
  carnivore (semantic property)
Challenge: lexical or word sense disambiguation
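As a small illustration (not part of the slide), a lexical entry can be thought of as a record keyed by the word that bundles its lexical, morphological and semantic properties; the field names below are assumptions.

# Toy lexical entry for "dog" (field names are illustrative assumptions).
LEXICON = {
    "dog": {
        "pos": "noun",                                     # lexical property
        "plural": "dog+s",                                 # morphological property
        "semantic": ["animate", "4-legged", "carnivore"],  # semantic properties
    }
}

print(LEXICON["dog"]["pos"])   # lexical analysis = dictionary access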
Lexical Disambiguation
First step: part-of-speech disambiguation
• Dog as a noun (animal)
• Dog as a verb (to pursue)
Sense disambiguation
• Dog (as animal)
• Dog (as a very detestable person)
Needs word relationships in a context
• The chair emphasised the need for adult education
Very common in day-to-day communication
Satellite channel ad: "Watch what you want, when you want"
(two senses of watch)
e.g., ground breaking: ground-breaking ceremony vs. ground-breaking research
Technological developments bring in new terms and
additional meanings/nuances for existing terms
– Justify, as in justify the right margin (word-processing context)
– Xeroxed: a new verb
– Digital trace: a new expression
– Communifaking: pretending to talk on a mobile phone when
  you are actually not
– Discomgooglation: anxiety/discomfort at not being able to
  access the internet
– Helicopter parenting: over-parenting
Syntax Processing Stage
Structure Detection
Example: "I like mangoes"
(S (NP I) (VP (V like) (NP mangoes)))
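The bracketed structure above can be built and inspected with, for example, NLTK's Tree class (a sketch; any constituency toolkit would do):

from nltk import Tree

# Constituency structure for "I like mangoes" as drawn on the slide.
t = Tree.fromstring("(S (NP I) (VP (V like) (NP mangoes)))")
t.pretty_print()                              # ASCII rendering of the tree
print(t.label(), [st.label() for st in t])    # S ['NP', 'VP']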
Challenges in Syntactic
Processing: Structural Ambiguity
• Scope
1. The old men and women were taken to safe locations
(old men and women) vs. ((old men) and women)
2. No smoking areas will allow hookahs inside
• Preposition Phrase Attachment
• I saw the boy with a telescope
(who has the telescope?)
• I saw the mountain with a telescope
(world knowledge: mountain cannot be an instrument of
seeing)
• I saw the boy with the pony-tail
(world knowledge: pony-tail cannot be an instrument of
seeing)
Ubiquitous in newspaper headlines: “20 years later, BMC
pays father 20 lakhs for causing son’s death”
Structural Ambiguity…
• Overheard
– I did not know my PDA had a phone for 3 months
• An actual sentence in the newspaper
– The camera man shot the man with the gun when he was near
Tendulkar
• (P.G. Wodehouse, Ring in Jeeves) Jill had rubbed
ointment on Mike the Irish Terrier, taken a look at the
goldfish belonging to the cook, which had caused anxiety
in the kitchen by refusing its ant’s eggs…
• (Times of India, 26/2/08) Aid for kins of cops killed in
terrorist attacks
Headache for Parsing: Garden
Path sentences
• Garden Pathing
– The horse raced past the garden fell.
– The old man the boat.
– Twin Bomb Strike in Baghdad kill 25 (Times of India
05/09/07)
Semantic Analysis
• Representation in terms of predicate calculus, semantic nets,
  frames, conceptual dependency and scripts
• John gave a book to Mary
• Give action: Agent: John, Object: Book,
Recipient: Mary
• Challenge: ambiguity in semantic role labeling
– (Eng) Visiting aunts can be a nuisance
– (Hin) aapko mujhe mithaai khilaanii padegii
  ("you will have to treat me to sweets" or "I will have to treat you to
  sweets"; ambiguous in Marathi and Bengali too, not in Dravidian languages)
Pragmatics
• Very hard problem
• Model user intention
– Tourist (in a hurry, checking out of the hotel,
motioning to the service boy): Boy, go upstairs
and see if my sandals are under the divan. Do not
be late. I just have 15 minutes to catch the train.
– Boy (running upstairs and coming back panting):
yes sir, they are there.
• World knowledge
– WHY INDIA NEEDS A SECOND OCTOBER (ToI,
2/10/07)
Discourse
Processing of a sequence of sentences
Mother to John:
John, go to school. It is open today. Should you
bunk? Father will be very angry.
Ambiguity of open
bunk what?
Why will the father be angry?
Complex chain of reasoning and application of
world knowledge
Ambiguity of father
father as parent
or
father as headmaster
Complexity of Connected Text
John was returning from school dejected – today was
the math test
He couldn’t control the class
Teacher shouldn’t have made him
responsible
After all he is just a janitor
Multilinguality
Great Linguistic Diversity
• Major streams
  – Indo-European
  – Dravidian
  – Sino-Tibetan
  – Austro-Asiatic
• Some languages are ranked within the top 20 in the world in
  terms of the number of speakers
  – Hindi and Urdu: 5th (~500 million)
  – Bangla: 7th (~300 million)
  – Marathi: 14th (~70 million)
3 Language Formula
• Every state has to implement
  – Hindi
  – The state language (Marathi, Gujarati, Bengali, etc.)
  – English
• Big translation requirement, e.g., at the end of the financial year
Major Language Processing Initiatives
• Mostly from the Government: Ministry of IT, Ministry of
Human Resource Development, Department of Science
and Technology
• Recently a great drive from industry: NLP efforts with
  Indian languages in focus
– Google
– Microsoft
– IBM Research Lab
– Yahoo
– TCS
Technology Development in Indian
Languages (TDIL)
• Started by the Ministry of IT in 2000
• 13 resource centres across the country
• Each centre is responsible for two languages: one major
  and one minor
• For example,
– IIT Bombay: Marathi and
Konkani
– IIT Kanpur: Hindi and
Nepali
– ISI Kolkata: Bangla and
Santhaali
– Anna University: Tamil
Recent Initiatives
• NLP Association of India: 4 years old; recent efforts to make
  tools and resources freely available on the NLPAI website
• LDC-IL (like the Linguistic Data Consortium at UPenn)
• National Knowledge Commission: special drive on
translation (human and machine)
Recent Initiatives (contd)
• Consortia already set up for IL-IL MT, E-IL MT and CLIA
• SAALP: South Asian Association for Language Processing
  (formed with the SAARC countries)
Industry Scenario: English
• How to use NLP to improve search engine performance
  (precision, recall, speed)
• Google, Rediff, Yahoo, IRL, Microsoft: all have search
engine, IR, IE R & D projects outsourced from USA and
being carried out in India.
Industry Scenario: Indian Language
• English-Hindi MT is regarded as critical
• IBM Research Lab has a massive English-Hindi parallel corpus
  (news domain)
  – Statistical Machine Translation
• Microsoft India at Bangalore has opened a Multilingual
  Computing Division
• Google and Yahoo India are actively pursuing IL search engines
NLP at IIT Bombay
Center for Indian Language Technology
(CFILT)
CFILT: History
• Set up in 2002 with a generous grant from the Ministry of Communication
  and Information Technology, Govt. of India
• Preceded by 4 years of work on deep computational semantics: the
  Universal Networking Language (UNL) project of the United Nations
  – IITB is part of a large international effort on language
    technology involving research groups from 15 countries
IIT Bombay’s effort on MT and
accessory systems
What is an interlingua?
A vehicle for machine translation: source-language text (e.g., Hindi,
English, French, Chinese) is analysed into the interlingua (UNL), and
target-language text is generated from it.
UNL: a United Nations project
• Started in 1996
• 10-year program
• 15 research groups across continents
• First goal: generators
• Next goal: analysers (needs solving various ambiguity problems)
• Current active language groups
  – UNL_French (GETA-CLIPS, IMAG)
  – UNL_Hindi (IIT Bombay, with additional work on UNL_English)
  – UNL_Italian (Univ. of Pisa)
  – UNL_Portuguese (Univ. of Sao Paolo, Brazil)
  – UNL_Russian (Institute of Linguistics, Moscow)
  – UNL_Spanish (UPM, Madrid)
Dave, Parikh and Bhattacharyya, Journal of Machine Translation, 2002
PhD scholars associated with the
lab (in order of seniority)
• Manish Shrivastava (Shallow Parsing of Indian Languages)
• Smriti Singh (jointly with HSS) (Hindi Morphology): to
  complete in a year
• R. Ananthakrishnan (Machine Translation): to complete in a
  year (research accepted at ACL09, Singapore)
• Manoj Chinnakotla (Cross-Lingual Search): to complete in a year
• A. Vasudevan (Automatic Morphology Learning)
• Mitesh Khapra (Disambiguation)
• A. Balamurali (Sentiment Analysis)
Masters and Bachelor students
• Shorter association
• Average no. every year
– Masters students: 5
– DD students: 3
– B.Tech students: 4
• During overlap of entry level and graduating students: as
many as 20
Research staff
• Dictionary making: 3
• Wordnets: 5
• Translation (English-Hindi): 1
• Translation (English-Marathi): 2
• Annotation: 2
• Morphology: 2
Other staff
• Project manager (MBA): 1
– Half day
• Office Administration: 1
Important Research Themes
• Creation of Foundational Resources and Tools
(Linguistic and Computational)
• Efficient in-memory navigation on knowledge
structures like wordnets (Computational)
• Efficient Indexing for Multilingual Search
(Computational)
• Indian Language Parsing (Linguistic and
Computational)
Important Research Themes (contd)
• User Modeling for sentiment analysis, IR, QA etc.
(Cognitive Science and Computational)
• Uniform framework for Indian Language Dictionaries
(Linguistic)
• Combining Statistical and Linguistic Techniques for NLP
  tasks (Computational Linguistics)
Impact, Use and Visibility of
created resources
• Hindi Wordnet
– Free download with API under GPL
– Available from the LDC (Linguistic Data Consortium),
  UPenn: the topmost linguistic data repository in the
  world
– Commercial license purchased by Google for work
on Indian language search engine
– To be available from ELRA: language data
repository of Europe
– Available from LDC-IL: LDC of India
Impact, Use and Visibility of
created resources (continued)
• Hindi Wordnet
  – Daily references from all over the world
  – More than 3000 downloads
  – Pivot for wordnets of many Indian languages
  – Base resource used by many researchers for IL work on
    translation, summarization and cross-lingual search
Impact, Use and Visibility of
created resources (continued)
• Hindi English Dictionary (UNL framework)
– Free for research purposes under GPL
– Daily reference from all over the world
– Every day feedback from users
[Diagram: linked wordnets — the Hindi Wordnet at the centre, linked to the
Marathi, Konkani, Bengali, Punjabi, Sanskrit, Dravidian-language,
North East-language and English wordnets]
Linked wordnets
• Immense Lexical Resource
• Great benefits to machine translation, cross lingual
search
• Very useful for language teaching, pedagogy,
comparative linguistics
• Akin to EuroWordNet, but with critical differences due to
  typical Indian-language characteristics
Cross Lingual Search
CLIA is a real need
• Great language diversity in India
• Low comfort level with English
– less than 5% of the total population of about 700
million can use English effectively
• Need for critical information in large quantity and high
  quality, especially in the agriculture, health, tourism and
  education sectors
• CLIA project started in 2006; domains: tourism and health
Defining Diagram
CLIA Consortium Members
Name of Institute                          Assigned Language(s)
• IIT Bombay (Consortium Leader)           Marathi, Hindi
• IIT Kharagpur (Consortium Co-leader)     Bengali
• IIIT Hyderabad                           Telugu, Hindi
• Anna University-KBC                      Tamil
• Anna University-College of Engg          Tamil
• ISI Kolkata                              Bengali
• Jadavpur University Kolkata              Bengali
• CDAC-Pune                                Marathi, Hindi, Tamil
• CDAC-Noida                               Punjabi
• Utkal University                         --
Related consortium: E-IL MT project
• English to Indian Language MT
• Indian Languages: Hindi, Marathi, Bengali, Urdu, Oriya,
Telugu, Tamil
• Approaches: Statistical MT, Example Based MT
• Members: CDAC Pune (lead), IIT Bombay, JU, UU, IIITH, IIITA
Related consortium: IL-IL MT project
• Indian Language to Indian Language MT
• Indian Languages: Hindi, Marathi, Bengali, Punjabi,
Tamil, Telugu, Kannada
• Approach: Transfer Based
• Members: IIITH (lead), CDAC Pune, IIT Bombay, JU,
  University of Hyderabad, AU-KBC
All three projects are time-bound and result-oriented
• 2-year time frame (extension granted for 1 year)
• Strict deliverables
• For each project the budget outlay is about Rs 80 million
  (USD 2 million)
Sample Output Screen
[Screenshot: output screen when the input language is Hindi and the English
tab is selected]
A Specific work: Multilingual
WSD
WORD SENSE
DISAMBIGUATION
in a Multilingual setting
(joint work with Mitesh Khapra, Sapan Shah
and Piyush Kedia; accepted in EMNLP09,
Singapore)
Main Message
• It is possible to circumvent the problem of scarcity of
  resources by projecting parameters like sense
  distributions, corpus co-occurrences, conceptual
  distance, etc. from one language to another
Overcome Resource Scarcity
• Parallel corpora, wordnets and sense annotated corpora
are scarce resources.
• With respect to these resources, languages show
  different levels of readiness
• However, a more resource-fortunate language can help a
  less resource-fortunate language
Use projections
• The WSD method is applicable even when no sense-tagged
  corpus is available for that language
• This is achieved by projecting wordnet and corpus
parameters from another language to the language in
question.
Large-scale MT and CLIA efforts in India
need all-words, domain-specific WSD
• Consortium Project on Cross Lingual Information
Access: IITB (Lead), IITKGP, ISI, IIITH, CDAC, JU, Anna
Univ, Utkal Univ
• Consortium Project on E-IL Machine Translation: IITB,
IIITH, CDAC (Lead), JU, Amrita Univ, IIITA
• Consortium Project on IL-IL Machine Translation: IITB,
IITKGP, IIITH (Lead), CDAC, JU
Linked Wordnets
[Diagram: linked wordnets — the Hindi Wordnet at the centre, linked to the
Marathi, Konkani, Bengali, Punjabi, Sanskrit, Dravidian-language,
North East-language and English wordnets]
Word Sense Disambiguation:
background
Problem Definition
• Obtain the sense of
– A set of target words, or of
– All words (all word WSD, more difficult)
• Against a
– A sense repository (like a wordnet), or
– A thesaurus (not the same as a wordnet; it does not have
  semantic relations)
Elaboration (example word:
operation)
• Operation, surgery, surgical operation, surgical procedure, surgical process --
  (a medical procedure involving an incision with instruments; performed to
  repair damage or arrest disease in a living body; "they will schedule the
  operation as soon as an operating room is available"; "he died while
  undergoing surgery") TOPIC->(noun) surgery#1
• Operation, military operation -- (activity by a military or naval force (as a
  maneuver or campaign); "it was a joint operation of the navy and air force")
  TOPIC->(noun) military#1, armed forces#1, armed services#1, military
  machine#1, war machine#1
• Operation -- ((computer science) data processing in which the result is
  completely specified by a rule (especially the processing that results from a
  single instruction); "it can perform millions of operations per second")
  TOPIC->(noun) computer science#1, computing#1
• Mathematical process, mathematical operation, operation -- ((mathematics)
  calculation by mathematical methods; "the problems at the end of the chapter
  demonstrated the mathematical processes involved in the derivation"; "they
  were learning the basic operations of arithmetic") TOPIC->(noun)
  mathematics#1, math#1, maths#1
WSD: KNOWLEDGE BASED vs. MACHINE
LEARNING BASED vs. HYBRID APPROACHES
 Knowledge Based Approaches
   Rely on knowledge resources like WordNet, thesauri, etc.
   May use grammar rules for disambiguation.
   May use hand-coded rules for disambiguation.
 Machine Learning Based Approaches
   Rely on corpus evidence.
   Train a model using tagged or untagged corpus.
   Probabilistic/statistical models.
 Hybrid Approaches
   Use corpus evidence as well as semantic relations from WordNet.
OVERLAP BASED APPROACHES
 Require a Machine Readable Dictionary (MRD).
 Find the overlap between the features of different senses of an
  ambiguous word (sense bag) and the features of the words in its
  context (context bag).
 These features could be sense definitions, example sentences,
  hypernyms, etc.
 The features could also be given weights.
 The sense which has the maximum overlap is selected as the
  contextually appropriate sense.
LESK’S ALGORITHM
Sense bag: contains the words in the definition of a candidate sense of the
ambiguous word.
Context bag: contains the words in the definition of each sense of each
context word.
E.g., “On burning coal we get ash.”

Ash
 Sense 1: trees of the olive family with pinnate leaves, thin furrowed bark
  and gray branches.
 Sense 2: the solid residue left when combustible material is thoroughly
  burned or oxidized.
 Sense 3: to convert into ash.

Coal
 Sense 1: a piece of glowing carbon or burnt wood.
 Sense 2: charcoal.
 Sense 3: a black solid combustible substance formed by the partial
  decomposition of vegetable matter without free access to air and under the
  influence of moisture and often increased pressure and temperature that is
  widely used as a fuel for burning.

In this case Sense 2 of ash would be the winner sense.
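A simplified Lesk overlap can be sketched as below using NLTK's English WordNet (a rough sketch, not the exact weighting of the slide; tokenisation and scoring are deliberately crude):

from nltk.corpus import wordnet as wn

def simplified_lesk(word, context_words):
    """Pick the sense of `word` whose gloss and examples overlap most with
    the glosses of all senses of the context words (a crude Lesk sketch)."""
    def bag(synset):
        text = synset.definition() + " " + " ".join(synset.examples())
        return set(text.lower().split())

    context_bag = set()
    for cw in context_words:
        for s in wn.synsets(cw):
            context_bag |= bag(s)

    best, best_overlap = None, -1
    for sense in wn.synsets(word):
        overlap = len(bag(sense) & context_bag)   # size of the intersection
        if overlap > best_overlap:
            best, best_overlap = sense, overlap
    return best

print(simplified_lesk("ash", ["coal", "burning"]))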
WSD USING CONCEPTUAL DENSITY
 Select a sense based on the relatedness of that word-sense to the context.
 Relatedness is measured in terms of conceptual distance
  (i.e., how close the concept represented by the word and the concepts
  represented by its context words are).
 This approach uses a structured hierarchical semantic net (WordNet) for
  finding the conceptual distance.
 The smaller the conceptual distance, the higher the conceptual density
  (i.e., if all words in the context are strong indicators of a particular
  concept, then that concept will have a higher density).
CONCEPTUAL DENSITY (EXAMPLE)
 The dots in the figure represent the senses of the word to be
  disambiguated or the senses of the words in context.
 The CD formula will yield the highest density for the sub-hierarchy
  containing more senses.
 The sense of W contained in the sub-hierarchy with the highest CD will
  be chosen.
CONCEPTUAL DENSITY (EXAMPLE)
Example sentence: "The jury(2) praised the administration(3) and operation(8)
of Atlanta Police Department(1)."
[Figure: the senses of the context nouns (jury, administration, operation,
department, police department, government department, local department) and
their hypernyms (division, committee, body, administrative_unit) form a
lattice; the sub-hierarchy under administrative_unit obtains the highest
conceptual density (CD = 0.256 vs. CD = 0.062)]
Step 1: Make a lattice of the nouns in the context, their senses and
        hypernyms.
Step 2: Compute the conceptual density of the resultant concepts
        (sub-hierarchies).
Step 3: The concept with the highest CD is selected.
Step 4: Select the senses below the selected concept as the correct senses
        for the respective words.
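The steps can be sketched as follows over NLTK's English WordNet. The density here is a simplified proxy (candidate senses covered divided by sub-hierarchy size), not the exact Agirre–Rigau CD formula.

from nltk.corpus import wordnet as wn

def subtree_synsets(root):
    """All synsets in the hyponymy sub-hierarchy rooted at `root`."""
    return {root} | set(root.closure(lambda s: s.hyponyms()))

def conceptual_density(root, candidate_senses):
    """Simplified density proxy: fraction of the sub-hierarchy occupied by
    candidate senses of the context words (NOT the exact CD formula)."""
    sub = subtree_synsets(root)
    covered = sum(1 for s in candidate_senses if s in sub)
    return covered / len(sub)

def disambiguate(target, context_nouns):
    context_senses = [s for w in context_nouns for s in wn.synsets(w, pos=wn.NOUN)]
    best, best_cd = None, -1.0
    for sense in wn.synsets(target, pos=wn.NOUN):
        # Consider the sub-hierarchy under each hypernym of the candidate sense.
        for hyper in sense.hypernyms():
            cd = conceptual_density(hyper, context_senses)
            if cd > best_cd:
                best, best_cd = sense, cd
    return best

print(disambiguate("operation", ["jury", "administration", "department"]))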
WSD USING RANDOM WALK ALGORITHM
[Figure: a sense graph for the context "bell ring church Sunday" — one vertex
per candidate sense (S1, S2, S3) of each word, with weighted edges between
senses of different words (weights such as 0.35, 0.42, 0.46, 0.49, 0.56,
0.58, 0.63, 0.67, 0.92, 0.97)]
Step 1: Add a vertex for each possible sense of each word in the text.
Step 2: Add weighted edges using definition-based semantic similarity
        (Lesk's method).
Step 3: Apply a graph-based ranking algorithm to find the score of each
        vertex (i.e., for each word sense).
Step 4: Select the vertex (sense) which has the highest score.
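Steps 1–4 can be sketched with NetworkX's PageRank over a sense graph whose edge weights come from a plain gloss-overlap similarity (a sketch, not the exact weighting used on the slide):

import networkx as nx
from nltk.corpus import wordnet as wn

def gloss_overlap(s1, s2):
    b1 = set(s1.definition().lower().split())
    b2 = set(s2.definition().lower().split())
    return len(b1 & b2)

def random_walk_wsd(words):
    """Graph-based WSD sketch: one vertex per candidate sense, edges weighted
    by definition overlap, senses ranked by PageRank."""
    G = nx.Graph()
    candidates = {w: wn.synsets(w) for w in words}
    for w, senses in candidates.items():          # Step 1: sense vertices
        for s in senses:
            G.add_node((w, s.name()))
    wordlist = list(words)
    for i, w1 in enumerate(wordlist):             # Step 2: weighted edges
        for w2 in wordlist[i + 1:]:
            for s1 in candidates[w1]:
                for s2 in candidates[w2]:
                    ov = gloss_overlap(s1, s2)
                    if ov:
                        G.add_edge((w1, s1.name()), (w2, s2.name()), weight=ov)
    scores = nx.pagerank(G, weight="weight")      # Step 3: graph ranking
    best = {}
    for w, senses in candidates.items():          # Step 4: top-scoring sense
        if senses:
            best[w] = max(senses, key=lambda s: scores.get((w, s.name()), 0.0))
    return best

print(random_walk_wsd(["bell", "ring", "church", "Sunday"]))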
 Machine Learning Based Approaches
   Supervised Approaches
   Semi-supervised Algorithms
   Unsupervised Algorithms
SUPERVISED APPROACHES – COMPARISONS

Approach                 Average    Average         Corpus                        Average
                         Precision  Recall                                        Baseline
                                                                                  Accuracy
Naïve Bayes              64.13%     Not reported    Senseval3 – All Words Task    60.90%
Decision Lists           96%        Not applicable  Tested on a set of 12 highly  63.9%
                                                    polysemous English words
Exemplar Based           68.6%      Not reported    WSJ6 containing 191 content   63.7%
disambiguation (kNN)                                words
SVM                      72.4%      72.4%           Senseval 3 – Lexical sample   55.2%
                                                    task (used for disambiguation
                                                    of 57 words)
Perceptron trained HMM   73.74%     67.60%          Senseval3 – All Words Task    60.90%
Driving factor for our work
• No single existing solution to WSD
completely meets our requirements of
multilinguality, high domain
accuracy and good performance in
the face of not-so-large annotated
corpora
Parameters Used for WSD (1/5)
 Domain-Specific Sense Distributions
   Sense distributions for a word within a domain are different from those
    in a general corpus.

सवु िधा
   3530:~:NOUN:~:वह स्थिति स्िसमें कोई काम करने में कुछ कठिनिा या अड़चन न हो:~:"दसू रं की
    अपेक्षा आपके साि काम करने में ज्यादा सुववधा है ":~:सुववधा, सुभीिा, सुगमिा, आसानी, सहूलियि
    (ease) "It is easier to work with you than with others."
   28213:~:NOUN:~:वह सेवा िो एक संथिा या कोई उपकरण आपको दे िा है :~:"इस मोबाइि में इंटरनेट
    की भी सुववधा है ":~:सुववधा
    (facility) "This mobile also has an internet facility."

केन्द्र
   771:~:NOUN:~:ककसी वत्ृ ि या पररधध या पंस्ति के िीक बीचंबीच का बबन्द ु या भाग:~:"इस वत्ृ ि के केंद्र
    बबंद ु से िािी हुई एक रे खा खींचो":~:केंद्र_बबंद,ु केंद्र, केन्द्र_बबन्द,ु केन्द्र, मध्य_बबंद,ु मध्य-बबन्द,ु नालभ
    (middle point) "Draw a line through the center of this circle."
   28322:~:NOUN:~:वह भवन िो ककसी ववशेष काम के लिए समवपिि हो या िहााँ कोई ववशेष काम होिा
    हो:~:"वे िोग शोध के लिए एक अिग केंद्रीय भवन बनाना चाहिे हैं":~:केंद्रीय_भवन, केंद्र, केन्द्रीय_भवन,
    केन्द्र, सेंटर, सेन्टर
    (a setup for a specific task) "They want to open a separate center for undertaking research."

The sense shown in red is the most frequent sense in the Wordnet, whereas the sense shown
in green is the most frequent sense as observed in the tourism corpus.
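Learning such domain-specific sense distributions from a sense-tagged corpus is essentially counting; a minimal sketch is below (the corpus format, word/sense-id pairs, is an assumption):

from collections import Counter, defaultdict

def learn_sense_distribution(tagged_corpus):
    """P(sense | word) estimated from a sense-tagged domain corpus.
    `tagged_corpus` is assumed to be an iterable of (word, sense_id) pairs."""
    counts = defaultdict(Counter)
    for word, sense_id in tagged_corpus:
        counts[word][sense_id] += 1
    return {
        w: {s: c / sum(ctr.values()) for s, c in ctr.items()}
        for w, ctr in counts.items()
    }

# Toy example in the spirit of the slide: in a tourism corpus the
# "facility" sense (28213) of suvidhaa dominates the "ease" sense (3530).
corpus = [("suvidhaa", 28213)] * 8 + [("suvidhaa", 3530)] * 2
print(learn_sense_distribution(corpus)["suvidhaa"])   # {28213: 0.8, 3530: 0.2}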
Parameters Used for WSD (2/5)
 Dominant Concepts Within a Domain
   Corpus evidence can be used to find the dominant concepts within a domain.

      Tourism                          Health
      {place, country, city, area}     {doctor, nurse}
      {flora, fauna}                   {patient}
      {mode of transport}              {disease}
      {fine arts}                      {treatment}

   Candidate senses lying in the hierarchy of these dominant concepts can
    be given higher weightage than other candidate senses.

   सागर – Sense 1 -- खारे पानी की िह विशाल राशश जो पथ्ृ िी के स्थल भाग को चारों ओर से घेरे हुए है
    "राम ने िानरी सेना की सहायता से सागर पर सेतु का ननमााण ककया था"
    (sea) Rama had built a bridge over the sea with the help of his army of monkeys.
   सागर – Sense 2 -- ककसी विषय के ज्ञान या गुण आदि का बहुत बडा आगार "संत कबीर ज्ञान के सागर थे"
    (ocean of knowledge or qualities: metaphorical) Saint Kabir was an ocean of knowledge.
   Sense 1 should be given a higher weightage than Sense 2, as Sense 1
    belongs to the dominant concept {place}.
Parameters Used for WSD (3/5)
 Corpus co-occurrence frequency of senses
   Better than the corpus co-occurrence frequency of words.
   E.g.:
     The synset {हॉटि: hotel} has a high co-occurrence with the synset
      {भोिन, खाना: meals}
     The synset {क्षेत्र: region} has a high co-occurrence with the synset
      {प्रदे श, राज्य, प्रांि: state, province}
     The synset {समय: time} has a high co-occurrence with the synset
      {अच्छा, बठ़िया, िीक: good, appropriate, accurate}
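Sense co-occurrence statistics can be collected in the same way, by counting pairs of sense ids that occur in the same sentence of the tagged corpus (a sketch; the sentence-level window is an assumption):

from collections import Counter
from itertools import combinations

def learn_sense_cooccurrence(tagged_sentences):
    """Count how often two senses occur in the same sentence.
    `tagged_sentences` is assumed to be a list of lists of sense ids."""
    pair_counts = Counter()
    for sentence in tagged_sentences:
        for s1, s2 in combinations(sorted(set(sentence)), 2):
            pair_counts[(s1, s2)] += 1
    return pair_counts

# Toy example: the {hotel} sense (101) often co-occurs with the {meals} sense (202).
sentences = [[101, 202], [101, 202, 303], [101, 404]]
print(learn_sense_cooccurrence(sentences)[(101, 202)])   # 2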
Parameters Used for WSD (4/5)
 Conceptual distance between nouns (Agirre Eneko & German Rigau, 1996)

  [Figure: a fragment of the noun hypernymy hierarchy — वथिु_923 (thing)
   dominates क्षेत्र_2022 (region) and अमिू ि_वथिु_1897 (abstract thing); under
   region are भभू ाग_3108 (geo region) and ििीय_धरािि_25563 (water surface);
   under water surface are नदी_4430 (river) and सागर_2650 (sea); under
   abstract thing is सागर_8231 (sea, metaphor)]

   8231:~:NOUN:~:ककसी विषय के ज्ञान या गण ु आदि का बहुत बडा आगार (sea; metaphor)
   2650:~:NOUN:~:खारे पानी की िह विशाल राशश जो पथ्ृ िी के स्थल भाग को चारों ओर से घेरे हुए है
    (sea; physical)
   4430:~:NOUN:~:जल का िह प्राकृनतक प्रिाह जो ककसी पिात से ननकलकर ननश्चचत मागा से होता हुआ
    समरु या ककसी िसू री निी में गगरता है (river)
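With a wordnet loaded (here NLTK's English WordNet as a stand-in for the Hindi hierarchy), the conceptual distance between two noun senses can be approximated by their shortest path length in the hypernymy graph (a sketch, not the exact Agirre–Rigau measure):

from nltk.corpus import wordnet as wn

# Shortest hypernymy-path length as a stand-in for conceptual distance:
# the physical "sea" sense is close to "river", far from an unrelated concept.
sea = wn.synset("sea.n.01")
river = wn.synset("river.n.01")
mathematics = wn.synset("mathematics.n.01")

print(sea.shortest_path_distance(river))        # small distance: related concepts
print(sea.shortest_path_distance(mathematics))  # larger distance: unrelated concepts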
Parameters Used for WSD (5/5)
 Relations learnt from the semantic graph

  [Figure: स्िस्थ_1831 (healthy) MODIFIES_NOUN both जंतु_748 (animal) and
   आिमी_3389 (human); आिमी_3389 (human) is a HYPONYM of जंतु_748 (animal)]
How are the parameters obtained?
 From the sense-tagged corpus, learn:
   Domain-specific sense distributions of words
   Co-occurrence frequencies between senses
   Dominant concepts in the domain
 From the wordnet, learn:
   Semantic relations between senses
   Conceptual distance between senses
Synset Based Multilingual Dictionary
Adopted Multilingual Dictionary Standard

Each row of the MultiDict is a sense (concept) together with its synset
members (W1, W2, W3, ...) in every language — Hindi, Marathi, Bengali,
Oriya, Tamil, etc.

[Table of example rows:
 (sun)   Hindi: (सूय,ा सूरज, भान,ु भास्कर, प्रभाकर, दिनकर, अंशमु ान, अंशमु ाली)
         Marathi: (सूय,ा भानु, दििाकर, भास्कर, रवि, दिनेश, दिनमणी)
 (cub, lad, laddie, sonny, sonny boy)
         Hindi: (लडका, बालक, बच्चा, छोकडा, छोरा, छोकरा, लौंडा)
         Marathi: (मुलगा, पोरगा, पोर, पोरगे)
 (son, boy)
         Hindi: (पत्रु , बेटा, लडका, लाल, सुत, बच्चा, नंिन, पतू , गचरं जीि, गचरं जी)
         Marathi: (मुलगा, पुत्र, लेक, गचरं जीि, तनय)
 — entries for the other languages are analogous]

A row in the MultiDict
Cross Linkages: solve the lexical
substitution problem
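A row of such a MultiDict can be modelled as a synset id mapped to per-language member lists; a cross-linkage lookup then returns the members of the same concept in another language (a sketch; the id and the romanised words below are invented for illustration):

# Sketch of a MultiDict row keyed by a language-independent synset id
# (the id and romanised words are invented for illustration).
MULTIDICT = {
    "SYN_BOY_MALE_CHILD": {
        "hindi":   ["ladkaa", "baalak", "bachchaa"],
        "marathi": ["mulgaa", "porgaa", "por"],
    },
}

def cross_linked_words(synset_id, target_lang):
    """Lexical substitution: members of the same concept in another language."""
    return MULTIDICT[synset_id][target_lang]

print(cross_linked_words("SYN_BOY_MALE_CHILD", "hindi"))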
Scoring Function for WSD
Expressions motivated by Hopfield energy expressions (Hopfield, 1982)
Iterative WSD (IWSD)
• Algorithm 1: performIterativeWSD(sentence)
• 1. Tag all monosemous words in the sentence.
• 2. Iteratively disambiguate the remaining words in the
sentence in increasing order of their degree of polysemy.
• 3. At each stage select that sense for a word which
maximizes the score given by Equation (1)
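A sketch of the iterative procedure is below. The score function is a hedged stand-in for Equation (1): a Hopfield-style sum of a sense-prior term and pairwise association terms with the senses already fixed in the sentence; the exact weighting in the paper may differ.

def iwsd(sentence, senses_of, sense_prior, assoc):
    """Iterative WSD sketch.
    sentence    : list of words
    senses_of   : word -> list of candidate sense ids
    sense_prior : (word, sense) -> P(sense | word) in the domain
    assoc       : (sense_i, sense_j) -> association weight (co-occurrence,
                  conceptual distance, etc.); a stand-in for Equation (1)."""
    chosen = {}

    # Step 1: monosemous words are their own disambiguation.
    for w in sentence:
        if len(senses_of(w)) == 1:
            chosen[w] = senses_of(w)[0]

    def score(word, sense):
        # Hopfield-style energy: self term + interaction with fixed senses.
        total = sense_prior(word, sense)
        for other_sense in chosen.values():
            total += assoc(sense, other_sense)
        return total

    # Step 2: remaining words in increasing order of degree of polysemy.
    remaining = [w for w in sentence if w not in chosen and senses_of(w)]
    for w in sorted(remaining, key=lambda x: len(senses_of(x))):
        # Step 3: greedily pick the sense that maximises the score.
        chosen[w] = max(senses_of(w), key=lambda s: score(w, s))
    return chosen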
Greedy nature of IWSD
Other approaches tried
• Exhaustive Graph Search
• Page Rank
Projecting Parameters
Projecting Parameters (1/4)
 Statistics learnt for one language (say L1) should be reusable for
  another language (say L2).
 Sense distributions for words can be learnt by using the cross-linkages
  between synset members in L1 and L2.
   E.g., the two senses of the Marathi word अखेर and the corresponding
    cross-linked Hindi words are shown below:
      अखेर (Sense_3258)  ↔  अंत
      अखेर (Sense_2087)  ↔  दे हांि
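Projecting the sense-distribution parameter can be sketched as counting senses on the Hindi side of the cross-linkage and reading the estimate off for the Marathi word (the data structures, ids and romanised words are illustrative assumptions):

from collections import Counter

# Cross-linkage: for each sense of the Marathi word, the cross-linked Hindi
# synset member (romanised, illustrative): akher -> ant (end) / dehaant (demise)
CROSS_LINKS = {"akher": {"S_end": "ant", "S_demise": "dehaant"}}

def project_sense_distribution(marathi_word, hindi_tagged_corpus):
    """Estimate P(sense | marathi_word) by counting, in a sense-tagged *Hindi*
    corpus, how often each cross-linked Hindi word occurs in that sense."""
    links = CROSS_LINKS[marathi_word]          # sense_id -> cross-linked Hindi word
    counts = Counter()
    for hindi_word, sense_id in hindi_tagged_corpus:
        if links.get(sense_id) == hindi_word:  # synset ids are shared across languages
            counts[sense_id] += 1
    total = sum(counts.values()) or 1
    return {s: c / total for s, c in counts.items()}

hindi_corpus = [("ant", "S_end")] * 7 + [("dehaant", "S_demise")] * 3
print(project_sense_distribution("akher", hindi_corpus))  # {'S_end': 0.7, 'S_demise': 0.3}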
Projecting Parameters (2/4)

Sr.  Marathi  Synset                                     P(S|word) learnt     P(S|word) learnt from
No.  Word                                                from sense-tagged    parallel sense-tagged
                                                         Marathi corpus       Hindi corpus
1    गोड      {गोड, सरु े ि, मंिूळ, समु धरु } – sounds sweet       0.063                0.056
              {गोड, मधरु } – tastes sweet                   0.937                0.944
2    मान      {मान, ग्रीवा} – neck                           0.4                  0.36
              {प्रतिष्िा, इज्िि, आब, मान} – respect            0.6                  0.64
3    आवड     {पसंिी, आवड} – liking                         0.24                 0.21
              {आवड, हौस, गोडी, शौक} – hobby                0.76                 0.79
4    उत्िर     {उत्िर, उत्िर_भाग} – north                     0.94                 0.98
              {उत्िर, िबाब} – answer                        0.06                 0.02
5    आंबा     {आंबा, आम्रवक्षृ } – mango tree                  0.28                 0.29
              {आंबा} – mango fruit                         0.72                 0.71
Projecting Parameters (3/4)
 Co-occurrence of senses: within a domain these remain the same
  (or proportional) across languages.

Sr.  Synset                        Co-occurring Synsets        P(co-occurrence) learnt   P(co-occurrence) learnt
No.                                                            from sense-tagged         from parallel sense-
                                                               Marathi corpus            tagged Hindi corpus
1    {कानूनी, कानूनी, ववधधक}          {भवन, इमारि, वाथिु}           0.5                       0.33
                                   {संथिा}                      0.5                       0.33
2    {सीिा, लसया, िानकी}            {रामायण, रामायन}              1                         1
                                   {नातयका, हीरोइन}              1                         1
                                   {महान ्, महान, अजीम}           1                         1
                                   {ठहंद}ू                       1                         1
3    {िक्ष्मी, कमिा, नारायणी}        {प्रलसद्ध, नामी, प्रख्याि}        0.17                      0.33
                                   {सरथविी, प्रज्ञा, भारिी}        0.17                      0.33
                                   {मंठदर, मस्न्दर}               0.17                      0.33
                                   {ववष्णु, कमिेश}               0.33                      0.33
4    {क्षेत्र, इिाका, इिाका, भूखडं }     {महाराष्र}                   0.0019                    0.0017
                                   {समुदाय, समूह}                0.019                     0.012
                                   {अतिधि-गह ृ , अतिधि-भवन}      0.0019                    0.0017
                                   {यात्रा, सफ़र}                 0.0019                    0.0017
Projecting Parameters (4/4)
 Domain-specific dominant concepts: these remain the same across
  languages, as the synset ids are the same.
 Conceptual density: sense hierarchies for all languages can be copied
  from the Hindi Wordnet.
 Semantic graph distance: sense graphs for all languages can be copied
  from the Hindi Wordnet (as the sense ids are the same).
Thus all the parameters can be learnt in one language (Hindi) and used in
other languages.
Experimental Setup

Size of manually sense-tagged corpora (number of polysemous word tokens):

Language   Tourism Domain   Health Domain
Hindi      50890            29631
Marathi    32694            8540
Bengali    9435             -
Tamil      17868            -

Per-language synsets:

Language   # of synsets in MultiDict
Hindi      29833
Marathi    16600
Bengali    10732
Tamil      5727
Precision, Recall and F-scores of IWSD, PageRank and Wordnet Baseline
(values reported with and without parameter projection)

Algorithm                                            Marathi                 Bengali
                                                     P%     R%     F%        P%     R%     F%
IWSD (training on self corpora;                      81.29  80.42  80.85     81.62  78.75  79.94
  no parameter projection)
IWSD (training on Hindi and reusing                  73.45  70.33  71.86     79.83  79.65  79.79
  parameters for another language)
PageRank (training on self corpora;                  79.61  79.61  79.61     76.41  76.41  76.41
  no parameter projection)
PageRank (training on Hindi and reusing              71.11  71.11  71.11     75.05  75.05  75.05
  parameters for another language)
Wordnet Baseline                                     58.07  58.07  58.07     52.25  52.25  52.25

Table 6: Precision, Recall and F-scores of IWSD, PageRank and Wordnet Baseline. Values are
reported with and without parameter projection.
Tamil Tourism corpus, using parameters projected from Hindi

Algorithm                                 P%      R%      F%
IWSD (training on Tamil)                  89.50   88.18   88.83
IWSD (training on Hindi and
  reusing for Tamil)                      84.60   73.79   78.82
Wordnet Baseline                          65.62   65.62   65.62
Marathi Health corpus, using parameters projected from Hindi

Algorithm                                 P%      R%      F%
IWSD (training on Marathi)                84.28   81.25   82.74
IWSD (training on Hindi and
  reusing for Marathi)                    75.96   67.75   71.62
Wordnet Baseline                          60.32   60.32   60.32
CONCLUSION
 Ambiguity resolution: the crux of the problem in NLP
 Ambiguity resolution in the face of multilinguality: an
  enormous challenge
 Domain-specific WSD: more tractable
 Domain-specific sense distributions play a very important
  role and remain the same across languages
 Resources for one language can be used for the
  processing of another language: a very interesting
  possibility
URLs
• For resources
www.cfilt.iitb.ac.in
• For publications
www.cse.iitb.ac.in/~pb
Thank you
Questions and comments?