Sentence Similarity Measure Based on
Events and Content Words
Jian-fang Shan, Zong-tian Liu, Wen Zhou
School of Computer Engineering & Science, Shanghai University, Shanghai, China
Sixth International Conference on Fuzzy Systems and Knowledge Discovery
(FSKD '09), IEEE, 2009
Presented by Jun-Ming Chen 2/26/2010
The proposed method
Experiments and evaluation
• Similarity between sentences plays an important role in a
variety of natural language processing applications,
– such as text summarization, information extraction, and text clustering
• Research on sentence similarity calculation can be
classified into three categories:
– semantic-based
– syntax-based
– and both
• The first category:
based on co-occurring words and word statistics, basically including:
• Probabilistic Model [1], Vector Space Model (VSM) [2],
Edit Distance Method [3], N-gram Model [4], etc.
• Co-occurring-word-based methods
– fit long texts, not sentences,
– especially news headlines and briefs, because short texts may be very
similar while their co-occurring words are rare or even absent.
• The VSM, based on TF-IDF, is one of the most effective statistic-based
models, but it does not fit short texts either.
For example, the two sentences "I love pop songs" and "Mom likes classical
music" both describe a keenness for music, so there should be a high similarity between them.
– However, they share no words,
– so the co-occurring-word-based method and the VSM both compute their similarity as 0.
[1] N. Chatterjee. A statistical approach for similarity measurement between sentences for EBMT. 2001.
[2] S. J. Li, J. Zhang, X. Huang, et al. Semantic computation in a Chinese question-answering system. Journal of Computer Science and Technology, 17(6):933–939, 2002.
[3] E. S. Ristad and P. N. Yianilos. Learning string-edit distance. IEEE Trans. Pattern Analysis and Machine Intelligence, 20(5):522–532, 1998.
[4] K. Papineni, S. Roukos, T. Ward, et al. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, 2002.
The second category:
uses WordNet or HowNet as a semantic knowledge base
– the semantic similarity between sentences
is calculated by integrating the semantic similarities between words.
• However, every method above, even those considering semantics, treats a text
as a bag of words and loses sight of the semantic relations among the words in context.
– Tom gave Mary a book as birthday gift
– Mary gave Tom a book as birthday gift
• Traditional methods compute the similarity of these two sentences as 1
• But what does each sentence actually mean? (The words are identical, but a different order gives a different meaning.)
So both the word itself and its sentence element, such as subject, predicate,
accusative, etc., must be taken into consideration.
WordNet: a large lexical database of English
HowNet (知網): a lexical knowledge base
This paper proposes a novel method to calculate sentence similarity, which takes
into account both semantics and syntax:
– measuring semantic similarity based on events
– measuring syntactic similarity based on content-word sequences
2.1. The definition and expression of event
2.2. Event extraction
The definition and expression of event
– originated in Cognitive Science
– widely used in the computational linguistics literature
– Wang Shen [12] defines an event as
• "For some purpose somebody does something for someone with some
means, sometime and somewhere"
In this paper, we use the following definition
– the 5H-element event model
• who does what to whom, when, and where
– the formal expression is e = (Hs, Hp, Ha, Ht, Hl)
[12] Wang Shen. Analysis on the semantic relationship in event structures on the basis of ontology. Sino-US English Teaching, 5(3):61–70, March 2008. ISSN 1539-8072.
The definition and expression of event
The relations between the 5H elements and sentence elements are:
– Hs ↔ subject
– Hp ↔ predicate (the sentence element that makes a statement about the subject)
• John "went home"
– Ha ↔ accusative (the word in the sentence that receives the action)
• He speaks "Japanese"
– Ht ↔ time
– Hl ↔ location
Event extraction
• Given a sentence s
• Algorithm extractevent(s)
• Events are extracted from s by the following steps (a minimal sketch follows the references below):
1. Mark the sentence with POS tags and recognize named entities, such as
〈Person〉, 〈Organization〉, 〈Location〉, 〈Time〉, etc., using GATE [13].
2. Parse the sentence using the Stanford Parser [14].
3. Analyze the results of the above steps to obtain the formal event expression
of the sentence.
[13] GATE: a general architecture for text engineering. Available:
[14] The Stanford Parser. Available:
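As a rough illustration only, here is a minimal Python sketch of the extraction step, using spaCy as a stand-in for the GATE + Stanford Parser pipeline the paper actually uses; the dependency-label-to-5H mapping is our own simplification, not the paper's exact rules.

    # Minimal 5H extraction sketch; spaCy replaces GATE (NER) and the
    # Stanford Parser (syntax). The slot mapping below is a simplification.
    import spacy

    nlp = spacy.load("en_core_web_sm")

    def extractevent(sentence):
        doc = nlp(sentence)
        event = {"Hs": [], "Hp": [], "Ha": [], "Ht": [], "Hl": []}
        for tok in doc:
            if tok.dep_ in ("nsubj", "nsubjpass"):
                event["Hs"].append(tok.text)           # who
            elif tok.dep_ == "ROOT" and tok.pos_ in ("VERB", "AUX"):
                event["Hp"].append(tok.lemma_)         # does what (predicate)
            elif tok.dep_ in ("dobj", "iobj", "dative", "attr"):
                event["Ha"].append(tok.text)           # to whom / what
        for ent in doc.ents:
            if ent.label_ in ("DATE", "TIME"):
                event["Ht"].append(ent.text)           # when
            elif ent.label_ in ("GPE", "LOC", "FAC"):
                event["Hl"].append(ent.text)           # where
        return event

    print(extractevent("Tom gave Mary a book in Shanghai yesterday."))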
The proposed method
3.1. Semantic similarity between words
3.2. Semantic similarity between sentences
3.3. Syntactic similarity between sentences
3.4. Sentence similarity
(3.4 combines the semantic similarity of 3.2, which builds on 3.1, with the syntactic similarity of 3.3)
Semantic similarity between words
measure the semantic similarity between two words
– measure the distance between them in WordNet
Humans usually consider that the shorter the path from one node to
another, the more similar the words are.
– e.g., the path length between girl and miss is 1
– between apple and banana it is 3
– between girl and apple it is 12
A shared parent of two words is called a subsumer.
The least subsumer (LS) of (girl, miss, young lady) and
(boy, male child) is (person, individual).
[Figure: a part of a WordNet-style hierarchy]
Semantic similarity between words
• The path length is a simple way to compute the semantic
distance between two words.
• Wu & Palmer
– takes into account both the path length and the depth of the least subsumer
– the semantic similarity is

    S_word-sem(w1, w2) = 2 * depth(ls) / (depth(w1) + depth(w2))    (1)

depth(s):
the shortest distance from the root node/word to node/word s on the WordNet-style
hierarchical graph
ls : denotes the least subsumer of w1 and w2 (e.g., person, individual)
– worked example from the hierarchy above: S_word-sem = 2 * 3 / (4 + 4) = 0.75
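Formula (1) can be tried directly with NLTK's WordNet interface, whose built-in wup_similarity implements the same idea; a minimal sketch:

    # Wu & Palmer similarity via NLTK's WordNet (noun synsets).
    from nltk.corpus import wordnet as wn

    girl, boy = wn.synset("girl.n.01"), wn.synset("boy.n.01")
    print(girl.wup_similarity(boy))        # built-in Wu-Palmer score

    # The same score from the definitions above: the least subsumer's
    # depth versus the two words' depths measured through it.
    ls = girl.lowest_common_hypernyms(boy)[0]
    depth_ls = ls.max_depth() + 1          # NLTK counts the root as depth 1
    len1 = girl.shortest_path_distance(ls) + depth_ls
    len2 = boy.shortest_path_distance(ls) + depth_ls
    print(2 * depth_ls / (len1 + len2))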
Semantic similarity between words
• The similarity between two words is:
– 0, if they belong to different categories
– 1, if they refer to the same object (abbreviations, alias names, etc.)
– 1, if two numeric strings have the same actual value,
such as 0.5 and 50%; otherwise the similarity between them is 0
– if the two named entities do not refer to the same object but belong
to the same category, such as two person names,
• we assign an appropriate value to the similarity
Semantic similarity between sentences
• Event-based semantic similarity calculation:
– extracting events from the sentences
– measuring event-element similarity based on word similarity
– measuring event similarity based on element similarity
– measuring sentence semantic similarity based on event similarity
[Diagram: event elements → event-element similarity → event similarity → sentence semantic similarity]
Semantic similarity between sentences
Calling algorithm extractevent(s), events e1 and e2 are extracted from s1 and s2
– i ∈ {s, p, a, t, l} denotes the subscript of event element Hi
– H1i represents the element Hi of e1
– H2i represents the element Hi of e2
Measuring similarity between event elements:
– For each word h1i in H1i, we identify the word h2i in H2i
that has the highest semantic similarity to it
according to formula (1)
– Then, for each word h2i in H2i, we identify the word h1i in H1i that has the
highest semantic similarity to it
Semantic similarity between sentences
– Finally, we obtain the similarity between H1i and H2i as

    S_H(H1i, H2i) = [ Σ_{h1i ∈ H1i} Max_{h2i ∈ H2i} S_word-sem(h1i, h2i)
                    + Σ_{h2i ∈ H2i} Max_{h1i ∈ H1i} S_word-sem(h1i, h2i) ] / (|H1i| + |H2i|)    (2)

|H1i| is the number of words in element H1i
|H2i| is the number of words in element H2i
If there is only one word in H1i, and likewise in H2i,
formula (2) simplifies to formula (1)
– e.g., with two words in each element:
S_H(H1i, H2i) = [ Max(S_word-sem(word2, h2p)) + Max(S_word-sem(word3, h2p)) + the two symmetric terms ] / 4
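Formula (2) transcribes directly into Python; word_sim below stands in for S_word-sem from formula (1).

    # Formula (2): bidirectional best-match similarity between elements.
    def element_similarity(h1, h2, word_sim):
        if not h1 or not h2:
            return 0.0
        total = sum(max(word_sim(a, b) for b in h2) for a in h1)
        total += sum(max(word_sim(a, b) for a in h1) for b in h2)
        # With one word on each side this returns exactly word_sim(a, b),
        # i.e. formula (1), as noted above.
        return total / (len(h1) + len(h2))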
Semantic similarity between sentences
Measuring the similarity between e1 and e2 through the following formula:

    S_event(e1, e2) = Σ_{i ∈ {s,p,a,t,l}} w_i * S_H(H1i, H2i)    (3)

– i ∈ {s, p, a, t, l} denotes the subscript of event element Hi
– w_i is the weight of element Hi when measuring the similarity between events, subject to Σ_i w_i = 1
The similarity between events e1 and e2 is also taken as the semantic similarity between
sentences s1 and s2:

    S_sem(s1, s2) = S_event(e1, e2)
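Formula (3) is then a weighted sum over the five elements, reusing element_similarity from the previous sketch; the weights below are placeholders, chosen only to satisfy the sum-to-1 constraint.

    # Formula (3): weighted sum of element similarities over the 5H slots.
    WEIGHTS = {"Hs": 0.25, "Hp": 0.30, "Ha": 0.25, "Ht": 0.10, "Hl": 0.10}

    def event_similarity(e1, e2, word_sim):
        return sum(w * element_similarity(e1[k], e2[k], word_sim)
                   for k, w in WEIGHTS.items())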
Syntactic similarity between sentences
The sentences need to be preprocessed, including filtering stop words and
stemming. Content words,
– such as nouns, verbs, and adjectives, are preserved
1. Measurement based on the number of co-occurring content words
– co(s1, s2): the number of co-occurring content words between the
post-preprocessing sentences s1 and s2
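The slide's formula image did not survive extraction; a Dice-style normalization of the co-occurrence count is one plausible reading, shown here purely as an assumption rather than the paper's exact metric.

    # Assumed co-occurrence metric: Dice-normalized shared content words.
    def cooccurrence_similarity(words1, words2):
        if not words1 and not words2:
            return 0.0
        co = len(set(words1) & set(words2))
        return 2 * co / (len(words1) + len(words2))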
Syntactic similarity between sentences
2. Measurement based on the LCS (Longest Common Subsequence) [15]
– The longest of all common subsequences of two sequences is called the
Longest Common Subsequence (LCS)
– The subsequence need not appear contiguously in S1 or S2; its symbols only
have to appear in the same order as they do in S1 and S2
– For example, with S1 = "AAGTACC" and S2 = "ATTACCT",
the LCS of S1 and S2 is "ATACC", with length 5
According to this definition, we define a formula based on
LCS(s1, s2), the length of the longest common subsequence of s1 and s2
[15] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. The MIT Press, 1989.
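The LCS length itself is the textbook dynamic program from [15]; its O(|s1|·|s2|) cost is exactly the comparison time the conclusion proposes to reduce.

    # LCS length by dynamic programming; works on strings or word lists.
    def lcs_length(a, b):
        dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i, x in enumerate(a, 1):
            for j, y in enumerate(b, 1):
                dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
        return dp[len(a)][len(b)]

    assert lcs_length("AAGTACC", "ATTACCT") == 5   # LCS "ATACC"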
Syntactic similarity between sentences
3. Taking the above two metrics into consideration, the overall sentence
syntactic similarity is given as a weighted combination:

    S_syn(s1, s2) = λ * S_co(s1, s2) + (1 − λ) * S_LCS(s1, s2)

– λ is the weight ratio between the two metrics for calculating
the syntactic similarity
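Putting the two metrics together under the reconstructed linear combination, reusing lcs_length and cooccurrence_similarity from the sketches above; the LCS normalization and the default λ are assumptions.

    # Assumed overall syntactic similarity: weighted mix of the two metrics.
    def syntactic_similarity(words1, words2, lam=0.5):
        if not words1 or not words2:
            return 0.0
        s_lcs = 2 * lcs_length(words1, words2) / (len(words1) + len(words2))
        return lam * cooccurrence_similarity(words1, words2) + (1 - lam) * s_lcs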
Sentence similarity
• The total sentence similarity is calculated according to the
following formula:

    S(s1, s2) = μ * S_sem(s1, s2) + (1 − μ) * S_syn(s1, s2)

– the semantic similarity and the syntactic similarity are weighted by a
smoothing factor μ
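The final combination is a one-liner; the default value of mu here is an assumption and would have to be tuned, as the experiments below suggest.

    # Total similarity: semantic and syntactic parts mixed by mu.
    def sentence_similarity(s_sem, s_syn, mu=0.6):
        return mu * s_sem + (1 - mu) * s_syn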
Experiments and evaluation
The sentences used in the similarity-based sentence retrieval experiments are
the model summaries of DUC 2004
– very short manual summaries (not more than 75 bytes) of
each document in the corpus
– the corpus contains documents on 50 TDT topics
– each topic contains on average 10 documents
– the number of sentences (model manual summaries) is 2000
For each source sentence, five sentences are returned, ordered by the similarity
between the source sentence and each retrieved sentence.
Out of the initial workshop and the road-mapping effort has grown a continuing evaluation in the area of text
summarization called the Document Understanding Conferences (DUC), held from 2000 to 2007.
Topic Detection and Tracking (TDT) is a multi-site research project, now in its third phase, to develop core
technologies for news understanding systems; its corpora span TDT2–TDT5 (1998–2004).
Experiments and evaluation
• Partial experimental results are shown in Table 1. Our experimental
results show that the sentences retrieved based on similarity are
reasonable and accord with people's practical experience.
Experiments and evaluation
Though the method uses many weight ratios,
we can adjust the relevant weight ratios
according to the retrieval objective.
– For example, if we only care about
"somebody is at somewhere",
we can set the weight ratios of Hp, Ha, and Ht to zero.
Determining the weight ratios is still a key problem.
The paper proposed a novel method to
– calculate the similarity between sentences,
– combining semantics and syntax.
• Firstly, semantic similarity is calculated
based on the events
that are extracted from the sentences.
• Secondly, syntactic similarity is calculated
based on the content words
that are preserved in the sentences after preprocessing.
• In the future
– modify the parameters to obtain a better approximation
through a large amount of experiments
– adapt the method to Chinese sentences
[Figure: summative diagram of the whole calculation process]
• The LCS (Longest Common Subsequence) computation needs to
– reduce the comparison time
• The experiments should be analyzed further.
Our research
– extend our initial study of the tagging method
• how to take advantage of the integrated tag information to expand the
recommendation capacity