Cognate or False Friend? Ask the Web!
A Workshop on Acquisition and Management of Multilingual Lexicons
Svetlin Nakov, Sofia University "St. Kliment Ohridski"
Preslav Nakov, University of California, Berkeley
Elena Paskaleva, Bulgarian Academy of Sciences
RANLP 2007 – September 27-29, 2007, Borovets, Bulgaria
Introduction
 Cognates and false friends
 Cognates are pairs of words in different languages that sound similar and are translations of each other
 False friends are pairs of words in two languages that sound similar but differ in their meanings
 The problem
 Design an algorithm that can distinguish
between cognates and false friends
Cognates and False Friends
 Examples of cognates
 ден in Bulgarian = день in Russian (day)
 idea in English = идея in Bulgarian (idea)
 Examples of false friends
 майка in Bulgarian (mother) ≠ майка in
Russian (vest)
 prost in German (cheers) ≠ прост in
Bulgarian (stupid)
 gift in German (poison) ≠ gift in English
(present)
The Paper in One Slide
 Measuring semantic similarity
 Analyze the words' local contexts
 Use the Web as a corpus
 Similar contexts → similar words
 Context translation → cross-lingual similarity
 Evaluation
 200 pairs of words
 100 cognates and 100 false friends
 11pt average precision: 95.84%
Contextual Web Similarity
 What is local context?
 A few words before and after the target word
Same day delivery of fresh flowers, roses, and unique gift baskets
from our online boutique. Flower delivery online by local florists for
birthday flowers.
 The words in the local context of a given word are semantically related to it (see the sketch below)
 We need to exclude stop words: prepositions, pronouns, conjunctions, etc.
 Stop words appear in all contexts
 We need a sufficiently big corpus
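A minimal sketch of this context-extraction step (hypothetical code, not from the paper), assuming a naive whitespace tokenizer, an illustrative stop-word list, and text snippets already retrieved from the Web:

    from collections import Counter

    # Illustrative subset of a stop-word list (the real lists have hundreds of entries)
    STOP_WORDS = {"of", "and", "for", "by", "from", "our", "the", "a"}

    def context_counts(snippets, target, window=3):
        """Count the words appearing within +/- window positions of the target word."""
        counts = Counter()
        for snippet in snippets:
            tokens = snippet.lower().split()
            for i, token in enumerate(tokens):
                if token != target:
                    continue
                neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
                counts.update(w for w in neighbours if w not in STOP_WORDS)
        return counts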
Contextual Web Similarity
 Web as a corpus
 The Web can be used as a corpus to extract the local context for a given word
 The Web is the largest possible corpus
 Contains big corpora in any language
 Searching for a word in Google returns up to 1 000 text excerpts
 The target word is given along with its local context: a few words before and after it
 The target language can be specified
Contextual Web Similarity
 Web as a corpus
 Example: Google query for "flower"
Flowers, Plants, Gift Baskets - 1-800-FLOWERS.COM - Your Florist ...
Flowers, balloons, plants, gift baskets, gourmet food, and teddy bears
presented by 1-800-FLOWERS.COM, Your Florist of Choice for over 30
years.
Margarita Flowers - Delivers in Bulgaria for you! - gifts, flowers, roses ...
Wide selection of BOUQUETS, FLORAL ARRANGEMENTS, CHRISTMAS DECORATIONS, PLANTS, CAKES and GIFTS appropriate for various occasions. CREDIT cards acceptable.
Flowers, plants, roses, & gifts. Flowers delivery with fewer ...
Flowers, roses, plants and gift delivery. Order flowers from ProFlowers
once, and you will never use flowers delivery from florists again.
Contextual Web Similarity
 Measuring semantic similarity
 For two given words, their local contexts are extracted from the Web
 A set of words and their frequencies
 Semantic similarity is measured as the similarity between these local contexts
 Local contexts are represented as frequency vectors over a given set of words
 The cosine between the frequency vectors in Euclidean space is calculated (a minimal sketch follows below)
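A minimal sketch of the similarity computation (hypothetical code, not from the paper), assuming the two local contexts are word-to-frequency mappings and a shared word list defines the vector coordinates, as in the frequency-vector example two slides below:

    import math

    def cosine_similarity(c1, c2, word_list):
        """Cosine between the two context frequency vectors over a fixed word list."""
        v1 = [c1.get(w, 0) for w in word_list]
        v2 = [c2.get(w, 0) for w in word_list]
        dot = sum(a * b for a, b in zip(v1, v2))
        n1 = math.sqrt(sum(a * a for a in v1))
        n2 = math.sqrt(sum(b * b for b in v2))
        return dot / (n1 * n2) if n1 and n2 else 0.0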
Contextual Web Similarity
 Example of context word frequencies
word: flower
  fresh       217
  order       204
  rose        183
  delivery    165
  gift        124
  welcome      98
  red          87
  ...         ...

word: computer
  Internet    291
  PC          286
  technology  252
  order       185
  new         174
  Web         159
  site        146
  ...         ...
Contextual Web Similarity
 Example of frequency vectors
v1: flower
  #     word        freq.
  0     alias         3
  1     alligator     2
  2     amateur       0
  3     apple         5
  ...   ...          ...
  4999  zap           0
  5000  zoo           6

v2: computer
  #     word        freq.
  0     alias         7
  1     alligator     0
  2     amateur       8
  3     apple       133
  ...   ...          ...
  4999  zap           3
  5000  zoo           0

 Similarity = cosine(v1, v2)
Cross-Lingual Similarity
 We are given two words in different
languages L1 and L2
 We have a bilingual glossary G of
translation pairs {p ∈ L1, q ∈ L2}
 Measuring cross-lingual similarity:
1. We extract the local contexts of the target words from the Web: C1 ∈ L1 and C2 ∈ L2
2. We translate the context C1 into C1* using the glossary G
3. We measure the distance between C1* and C2 (a sketch of steps 2-3 follows below)
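A minimal sketch of steps 2-3 (hypothetical code, not from the paper), assuming the contexts C1 and C2 are collections.Counter objects and the glossary is a dict mapping L1 words to L2 words; context words without a glossary entry are simply dropped:

    import math
    from collections import Counter

    def translate_context(c1, glossary):
        """Step 2: map the L1 context into L2 through the glossary G."""
        c1_star = Counter()
        for word, freq in c1.items():
            if word in glossary:
                c1_star[glossary[word]] += freq
        return c1_star

    def cross_lingual_similarity(c1, c2, glossary):
        """Step 3: cosine between the translated context C1* and the L2 context C2."""
        c1_star = translate_context(c1, glossary)
        vocab = set(c1_star) | set(c2)
        dot = sum(c1_star[w] * c2[w] for w in vocab)
        n1 = math.sqrt(sum(v * v for v in c1_star.values()))
        n2 = math.sqrt(sum(v * v for v in c2.values()))
        return dot / (n1 * n2) if n1 and n2 else 0.0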
Reverse Context Lookup
 The local context extracted from the Web can contain arbitrary parasite words like "online", "home", "search", "click", etc.
 Internet terms appear in almost any Web page
 Such words are not likely to be genuinely associated with the target word
 Example (for the word flowers)
 "send flowers online", "flowers here",
"order flowers here"
 Will the word "flowers" appear in the local
context of "send", "online" and "here"?
Reverse Context Lookup
 If two words are semantically related, each should appear in the local context of the other
 Let #(x, y) = the number of occurrences of x in the local context of y
 For any word w and a word wc from its local context, we define their strength of semantic association p(w, wc) as follows:
 p(w, wc) = min{ #(w, wc), #(wc, w) }
 We use p(w, wc) as the vector coordinates when measuring semantic similarity (a sketch follows below)
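A minimal sketch of the association strength (hypothetical code, not from the paper), assuming ctx is a dict that maps each word to the Counter of its Web-extracted context words:

    def association(ctx, w, wc):
        """p(w, wc) = min{ #(w, wc), #(wc, w) }."""
        forward = ctx[w].get(wc, 0)    # #(wc, w): occurrences of wc in the context of w
        backward = ctx[wc].get(w, 0)   # #(w, wc): occurrences of w in the context of wc
        return min(forward, backward)

Parasite words such as "online" draw many target words into their own contexts but rarely appear symmetrically, so taking the minimum keeps their coordinates small.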
Web Similarity Using Seed Words
 Adaptation of the Fung & Yee '98 algorithm*
 We have a bilingual glossary G: L1 → L2 of translation pairs, and target words w1 and w2
 We search Google for co-occurrences of the target words with the glossary entries
 We compare the co-occurrence vectors (see the sketch below):
for each {p, q} ∈ G, compare
max( google#("w1 p"), google#("p w1") )
with
max( google#("w2 q"), google#("q w2") )
* P. Fung and L. Y. Yee. An IR approach for translating from nonparallel, comparable texts.
In Proceedings of ACL, volume 1, pages 414–420, 1998
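A minimal sketch of this variant (hypothetical code, not from the paper), where hit_count is a caller-supplied function (hypothetical here) returning the number of Web hits for an exact phrase query, in the spirit of the Google counts above:

    import math

    def seed_vector(word, seeds, hit_count):
        """One coordinate per glossary seed: the larger count of the two phrase orders."""
        return [max(hit_count(f'"{word} {seed}"'), hit_count(f'"{seed} {word}"'))
                for seed in seeds]

    def seed_similarity(w1, w2, glossary_pairs, hit_count):
        """Cosine between the co-occurrence vectors of w1 (with L1 seeds) and w2 (with L2 seeds)."""
        v1 = seed_vector(w1, [p for p, q in glossary_pairs], hit_count)
        v2 = seed_vector(w2, [q for p, q in glossary_pairs], hit_count)
        dot = sum(a * b for a, b in zip(v1, v2))
        n1 = math.sqrt(sum(a * a for a in v1))
        n2 = math.sqrt(sum(b * b for b in v2))
        return dot / (n1 * n2) if n1 and n2 else 0.0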
Evaluation Data Set
 We use 200 Bulgarian/Russian pairs of
words:
 100 cognates and 100 false friends
 Manually assembled by a linguist
 Manually checked in several large
monolingual and bilingual dictionaries
 Limited to nouns only
Experiments
 We tested a few modifications of our contextual Web similarity algorithm
 Using TF.IDF weighting
 Preserving the stop words
 Lemmatizing the context words
 Using different context sizes (2, 3, 4 and 5)
 Using a small and a large bilingual glossary
 Compared it with the seed words algorithm
 Compared with traditional orthographic
similarity measures: LCSR and MEDR
Experiments
 BASELINE: random
 MEDR: minimum edit distance ratio
 LCSR: longest common subsequence ratio
 SEED: the "seed words" algorithm
 WEB3: the Web-based similarity algorithm with the default parameters: context size = 3, small glossary, stop-word filtering, no lemmatization, no reverse context lookup, no TF.IDF weighting
 NO-STOP: WEB3 without stop-word removal
 WEB1, WEB2, WEB4 and WEB5: WEB3 with context size of 1, 2, 4 and 5
 LEMMA: WEB3 with lemmatization
 HUGEDICT: WEB3 with the huge glossary
 REVERSE: the "reverse context lookup" algorithm
 COMBINED: WEB3 + lemmatization + huge glossary + reverse context lookup
Resources
 We used the following resources:
 Bilingual Bulgarian/Russian glossary: 3 794 translation word pairs
 Huge bilingual glossary: 59 583 word pairs
 A list of 599 Bulgarian stop words
 A list of 508 Russian stop words
 Bulgarian lemma dictionary: 1 000 000 wordforms and 70 000 lemmata
 Russian lemma dictionary: 1 500 000 wordforms and 100 000 lemmata
Evaluation
 We order the word pairs from the test dataset by the calculated similarity
 False friends are expected to appear at the top and cognates at the bottom
 We evaluate the 11pt average precision of the obtained ordering (a sketch of the computation follows below)
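A minimal sketch of the metric (hypothetical code, not from the paper), assuming labels is the list of true classes (True = false friend) taken in the order of ascending similarity:

    def eleven_point_avg_precision(labels):
        """11-point interpolated average precision, with false friends as the relevant class."""
        total_relevant = sum(labels)
        seen = 0
        precision_at, recall_at = [], []
        for i, is_relevant in enumerate(labels, start=1):
            seen += is_relevant
            precision_at.append(seen / i)
            recall_at.append(seen / total_relevant)
        points = []
        for level in (i / 10 for i in range(11)):   # recall levels 0.0, 0.1, ..., 1.0
            candidates = [p for p, r in zip(precision_at, recall_at) if r >= level]
            points.append(max(candidates) if candidates else 0.0)
        return sum(points) / 11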
Results (11pt Average Precision)
Comparing BASELINE, LCSR, MEDR, SEED and WEB3 algorithms
Results (11pt Average Precision)
Comparing different context sizes; keeping the stop words
Results (11pt Average Precision)
Comparing different improvements of the WEB3 algorithm
Results (Precision-Recall Graph)
Comparing the recall-precision graphs of the evaluated algorithms
Results: The Ordering for WEB3
r    Candidate            BG Sense   RU Sense  Sim.    Cogn.?  Precision  Recall
1    муфта                gratis     muff      0.0085  no      100.00%    1.00%
2    багрене / багренье   mottle     gaff      0.0130  no      100.00%    2.00%
3    добитък / добыток    livestock  income    0.0143  no      100.00%    3.00%
4    мраз / мразь         chill      crud      0.0175  no      100.00%    4.00%
5    плет / плеть         hedge      whip      0.0182  no      100.00%    5.00%
…    …                    …          …         …       …       …          …
99   вулкан               volcano    volcano   0.2099  yes     81.82%     81.00%
100  година               year       time      0.2101  no      82.00%     82.00%
101  бут                  leg        rubble    0.2130  no      82.18%     83.00%
…    …                    …          …         …       …       …          …
196  финанси / финансы    finance    finance   0.8017  yes     51.28%     100.00%
197  сребро / серебро     silver     silver    0.8916  yes     50.76%     100.00%
198  наука                science    science   0.9028  yes     50.51%     100.00%
199  флора                flora      flora     0.9171  yes     50.25%     100.00%
200  красота              beauty     beauty    0.9684  yes     50.00%     100.00%
(Precision and Recall are computed at rank r, with false friends as the target class.)
Discussion
 Our approach is original because it:
 Introduces a semantic similarity measure
 Not orthographic or phonetic
 Uses the Web as a corpus
 Does not rely on any preexisting corpora
 Uses reverse-context lookup
 Significant improvement in quality
 Is applied to an original problem
 Classification of almost identically spelled true/false friends
Discussion
 Very good accuracy: over 95%
 It is not 100% accurate
 Typical mistakes involve synonyms, hyponyms, and words influenced by cultural, historical and geographical differences
 The Web as a corpus introduces noise
 Google returns only the first 1 000 results
 Google ranks news portals, travel agencies and retail sites higher than books, articles and forum posts
 The local context can contain noise
Conclusion and Future Work
 Conclusion
 An algorithm that can distinguish between cognates and false friends
 It analyzes the words' local contexts, using the Web as a corpus
 Future Work
 Better glossaries
 Automatically augmenting the glossary
 Different language pairs
Cognate or False Friend? Ask the Web!
Questions?