Topic Segmentation
Gaël Dias and Elsa Alves
Human Language Technology Interest Group
Department of Computer Science
University of Beira Interior - Portugal
http://hultig.di.ubi.pt
Ljubljana, November 18
JOTA - Faculty of Arts
1
Guidelines
- Introduction
- TextTiling
- Lexical Cohesion Profile
- DotPlotting
- Link Set Median
- Problems
- Asas
- Conclusions
- Problems and Future Work
Introduction
Topic Segmentation is the task of breaking documents into topically coherent multi-paragraph subparts.
Introduction
Topic Segmentation is important for applications like:

Information Retrieval
Indeed, one should prefer a document in which the occurrences of a query word are concentrated in one or two paragraphs, since such a concentration is likely to contain a definition of the word.

Text Summarization
Topic Segmentation is one of the three phases of Automatic Abstracting: Topic Segmentation, Sentence Extraction and Sentence Compression.
TextTiling
TextTiling (Hearst and Plaunt, 1993) is one of the most famous systems for Topic Segmentation.
The basic idea of the algorithm is to search for points in a text where the vocabulary shifts from one subtopic to another. These points are then interpreted as the boundaries of multi-paragraph units.
TextTiling is divided into 4 phases:
(1) Segmentation of the Text
(2) Cohesion Scorer
(3) Depth Scorer
(4) Boundary Selector
TextTiling: Segmentation
Sentence length can vary considerably.
Therefore the text is first divided into small
fixed units (20 words), the token sequences.
[Figure: a sample text divided into fixed 20-word token sequences; the points between consecutive sequences are the gaps.]
TextTiling: Cohesion
The cohesion scorer measures the amount of "topic continuity" or cohesion at each gap, i.e. the amount of evidence that the same subtopic is prevalent on both sides of the gap. Intuitively, gaps with low cohesion are candidate segmentation points.
In fact, we want to infer the breaking points from the distribution of the words in the text: when there is a change of topic, the words of the previous topic tend to disappear from the rest of the text.
TextTiling: Cohesion
To calculate this cohesion, TextTiling uses the cosine similarity measure, where the weight of a word is its frequency in the text:

$$S_{ij} = \cos(X_i, X_j) = \frac{\sum_{k=1}^{p} X_{ik}\,X_{jk}}{\sqrt{\sum_{k=1}^{p} X_{ik}^2}\;\sqrt{\sum_{k=1}^{p} X_{jk}^2}}$$
TextTiling: Cohesion
The similarity calculation is illustrated in the following figure, where the x-axis is the gap number.

[Figure: cohesion score plotted against gap number.]
TextTiling: Depth Scorer
The depth scorer assigns a depth score to each
gap depending on how low its cohesion score is
compared to the surrounding gaps.
If cohesion at the gap is lower than at
surrounding gaps, then the depth score is high.
Conversely, if cohesion is about the same at
surrounding gaps, then the depth score is low.
However, the situation is not so clear-cut: there may be slight shifts and big shifts, and only the latter should be considered as breaking points.
TextTiling: Depth Scorer
[Figure: cohesion scores s1, s2, s3 at gaps g1, g2, g3 over a sample text; g2 lies at the bottom of a valley.]
TextTiling: Depth Scorer
The gap depth score is computed by summing
the heights of the two sides of the valley it is
located in.
For instance, for text 1, we would have:
depth(g2)=(s1-s2)+(s3-s2).
In a text with rapid fluctuations of topic
vocabulary, only the most radical changes will be
accorded the status of segment boundaries.
TextTiling: Depth Scorer
[Figure: cohesion curves for text1 (scores s1-s3 at gaps g1-g3) and text2 (scores s1-s5 at gaps g1-g5), showing valleys of different depths.]
TextTiling: Depth Scorer
For a practical implementation, several
enhancements of the basic algorithm are
needed. First, we need to smooth cohesion
scores to address situations like in text 2.
Intuitively, the difference (s1-s2) should
contribute to the depth score of g4.
This is achieved by smoothing the scores with a low-pass filter. For example, the cohesion score $s_i$ for gap $g_i$ is replaced by $(s_{i-1} + s_i + s_{i+1})/3$.
TextTiling: Depth Scorer
[Figure: effect of smoothing on the cohesion scores at gaps g1-g3, e.g. s2 = (s1 + s2 + s3)/3.]
TextTiling: Boundary Selector
The boundary selector is the module that looks
at the depth scores and selects the gaps that are
the best segmentation points.
The module estimates the average μ and the standard deviation σ of the depth scores, and selects as boundaries all gaps whose depth score is higher than μ - c·σ, where c is some constant.
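The four phases described above can be sketched as follows. This is a minimal illustration rather than the original implementation; the valley-climbing rule for the depth score and the default constant c are assumptions.

```python
import math

def texttiling(token_seqs, c=0.5):
    """Sketch of TextTiling over pre-built token sequences.
    token_seqs: list of token lists (the fixed 20-word units).
    Returns the indices of gaps selected as segment boundaries."""
    # (2) Cohesion scorer: cosine similarity of word-frequency vectors
    # on the two sides of each gap.
    def cosine(a, b):
        dot = sum(a.count(w) * b.count(w) for w in set(a) | set(b))
        na = math.sqrt(sum(a.count(w) ** 2 for w in set(a)))
        nb = math.sqrt(sum(b.count(w) ** 2 for w in set(b)))
        return dot / (na * nb) if na and nb else 0.0

    scores = [cosine(token_seqs[i], token_seqs[i + 1])
              for i in range(len(token_seqs) - 1)]

    # Low-pass smoothing: s_i <- (s_{i-1} + s_i + s_{i+1}) / 3.
    sm = [(scores[max(i - 1, 0)] + scores[i] + scores[min(i + 1, len(scores) - 1)]) / 3
          for i in range(len(scores))]

    # (3) Depth scorer: sum of the heights of the two sides of the valley.
    depths = []
    for i, s in enumerate(sm):
        left, j = s, i
        while j > 0 and sm[j - 1] >= left:   # climb to the peak on the left
            left = sm[j - 1]
            j -= 1
        right, j = s, i
        while j < len(sm) - 1 and sm[j + 1] >= right:  # and on the right
            right = sm[j + 1]
            j += 1
        depths.append((left - s) + (right - s))

    # (4) Boundary selector: keep gaps with depth score above mu - c*sigma.
    mu = sum(depths) / len(depths)
    sigma = math.sqrt(sum((d - mu) ** 2 for d in depths) / len(depths))
    return [i for i, d in enumerate(depths) if d > mu - c * sigma]
```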
Lexical Cohesion Profile
The Lexical Cohesion Profile (LCP) has been proposed by
(Kozima, 1993).
The basic idea is that the words in a segment are linked
together via lexical cohesion relations. So, LCP records
mutual similarity of words in a sequence of text.
The similarity of words, which represents their cohesiveness, is computed over a semantic network automatically built from the LDOCE (Longman Dictionary of Contemporary English).
Lexical Cohesion Profile
The basic idea is that when a block shifting from left to right loses lexical cohesion, this should evidence a topic change.
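The block-shifting idea can be sketched as a cohesion profile. Kozima scores each window by spreading activation over the LDOCE-based semantic network; as a hedged stand-in (that network is not available here), the sketch substitutes a crude repetition ratio, so that local minima of the profile suggest topic boundaries:

```python
def cohesion_profile(words, width=6):
    """Sketch of an LCP-style profile: slide a fixed window over the text
    and score each position by how much word repetition it contains.
    (A stand-in for Kozima's semantic-network cohesiveness.)"""
    profile = []
    for i in range(len(words) - width + 1):
        window = words[i:i + width]
        # repetition ratio: 0 when all words differ, higher with repeats
        profile.append(1 - len(set(window)) / len(window))
    return profile
```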
Lexical Cohesion Profile
[Figure: some results, comparing the segmentation of a human reader with that of LCP.]
Dotplotting
The Dotplotting methodology was proposed for finding discourse boundaries by (Reynar, 1994).
The idea is based on lexical item repetition.
If a particular word appears in position x and y in a text,
then 4 points corresponding to the Cartesian product
should be plotted on the dotplot: (x,x) (x,y) (y,x) (y,y).
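The plotting rule can be sketched as follows; `dotplot_points` is a hypothetical helper that, for every word type, emits the Cartesian product of its positions:

```python
from itertools import product

def dotplot_points(words):
    """Sketch of dotplot construction: for each word type, plot the
    Cartesian product of its occurrence positions."""
    positions = {}
    for i, w in enumerate(words):
        positions.setdefault(w, []).append(i)
    points = set()
    for occ in positions.values():
        # a word at positions x and y contributes (x,x), (x,y), (y,x), (y,y)
        points.update(product(occ, occ))
    return points
```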
Dotplotting
[Figure: a dotplot for positions x and y; dense squares along the diagonal correspond to text segments, i.e. lexical cohesion based on repetition.]
Dotplotting
In order to find the |P| desired boundaries, we repeat the minimization of the following measure, where V_{a,b} represents a vector containing the word counts between positions a and b.

This technique has the advantage of comparing all possible segments with all the other ones, not just the surrounding ones.
Link Set Median Procedure
The idea of the Link Set Median (LSM) procedure, proposed by (Sardinha, 1999), is to look at the similarity between each sentence and all the sentences with which it shares lexical items.
The set of sentences with which each sentence has links forms its link set.
Link Set Median Procedure
ALGORITHM
1. Identify the links for all sentences in the text
2. Create the link sets
3. Compute the median for each link set
4. Calculate the median difference for each link set and its
predecessor
5. Compute the average (mean) median difference for the text
6. Compare each (link set) median difference to the (text) average
median difference
7. If the median difference is higher than the average, insert a
segment boundary
8. Locate the section boundaries in the text and disregard sections
starting with sentence 1
9. Compare segment and section boundaries
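Steps 1-7 above can be sketched as follows; steps 8-9, which compare the detected segments against the text's own section boundaries, are omitted. The link criterion used here (sharing at least one word) is an assumption:

```python
from statistics import mean, median

def lsm_boundaries(sentences):
    """Sketch of the Link Set Median procedure (steps 1-7).
    sentences: list of token lists. Returns boundary positions, i.e.
    indices of sentences that start a new segment."""
    # Steps 1-2: a sentence links to every other sentence it shares a word with.
    link_sets = []
    for i, s in enumerate(sentences):
        links = [j for j, t in enumerate(sentences)
                 if j != i and set(s) & set(t)]
        link_sets.append(links or [i])      # avoid empty link sets
    # Step 3: median of each link set.
    medians = [median(ls) for ls in link_sets]
    # Step 4: median difference between each link set and its predecessor.
    diffs = [abs(medians[i] - medians[i - 1]) for i in range(1, len(medians))]
    # Steps 5-7: insert a boundary where the difference exceeds the average.
    avg = mean(diffs)
    return [i + 1 for i, d in enumerate(diffs) if d > avg]
```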
Problems
- Lexical repetition shows reliability problems.
- Systems based on lexical cohesion use existing linguistic resources (dictionaries, thesauri, ontologies) that are usually available only for dominant languages like English, French or German, and as a consequence do not apply to less favored languages.
Asas
The idea of the Informative Similarity-based Topic Segmentation System (Asas), proposed by (Dias and Alves, 2005), is to look at:
(1) The importance of words in global context
(2) The importance of words in local context
(3) The global importance of words
(4) The informative similarity between sentences and blocks of sentences
(5) The selection of boundaries
Asas
[Figure: architecture of Asas; the text to segment and texts of context are fed to Asas, which outputs the segmented text.]
Asas: Global Importance
First, it is necessary to find the important words in the text to segment; only these should be taken into account. For that reason, we apply the well-known tf.idf measure introduced by (Salton, 1975), where N is the number of documents and df(w) the number of documents containing w:

$$tf.idf(w, d) = \frac{tf(w, d)}{|d|} \times \log_2 \frac{N}{df(w)}$$
Asas: Local importance
Then, it is necessary to find which words matter for the sake of segmentation. Indeed, if a word occurs in all the sentences of the text, it is of no use for the task of segmentation. For that reason, we apply the well-known tf.idf scheme at the sentence level, which we call tf.isf, where N_s is the number of sentences and sf(w) the number of sentences containing w:

$$tf.isf(w, s) = \frac{stf(w, s)}{|s|} \times \log_2 \frac{N_s}{sf(w)}$$
Asas: Local importance
[Figure: a word occurring in every sentence is useless for Topic Segmentation.]
Asas: Local Importance
Finally, the denser a word is, the more important it is for the sake of segmentation. Indeed, if a word occurs many times in a small portion of the text, it is of great use for segmentation. For that purpose, (Dias and Alves, 2005) proposed a density measure based on the distance (in words) between consecutive occurrences of a word:

$$dens(w, d) = \sum_{k=1}^{|w|-1} \frac{1}{\ln\!\left(dist(occur_k, occur_{k+1}) + e\right)}$$
Asas: Local Importance
[Figure: word density illustrated ("from moon to star"); occurrences spread across the whole text are not so strong a cue.]
Asas: Weight
So, in order to give a weight to each word in each sentence of the text to segment, we propose this simple measure:

$$weight(w, s) = tf.idf(w, d) \times tf.isf(w, s) \times dens(w, d)$$

where all measures have been normalized so that they can be combined into a single measure.
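The three measures and their combination can be sketched as follows; the normalization step mentioned above is omitted for brevity, so the returned values are unnormalized. Note that a word occurring only once gets density 0 under the formula (the sum over consecutive occurrences is empty):

```python
import math

def tf_idf(w, doc, docs):
    """tf.idf(w, d) = (tf(w, d) / |d|) * log2(N / df(w))."""
    df = sum(1 for d in docs if w in d)
    return (doc.count(w) / len(doc)) * math.log2(len(docs) / df)

def tf_isf(w, sent, sents):
    """Same scheme at the sentence level: tf.isf."""
    sf = sum(1 for s in sents if w in s)
    return (sent.count(w) / len(sent)) * math.log2(len(sents) / sf)

def dens(w, doc):
    """dens(w, d) = sum over consecutive occurrences of 1 / ln(dist + e)."""
    occ = [i for i, t in enumerate(doc) if t == w]
    return sum(1 / math.log((occ[k + 1] - occ[k]) + math.e)
               for k in range(len(occ) - 1))

def weight(w, sent, sents, doc, docs):
    # Product of the three measures (unnormalized in this sketch).
    return tf_idf(w, doc, docs) * tf_isf(w, sent, sents) * dens(w, doc)
```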
Asas: Similarity
Unlike TextTiling and LCP, we use the sentence as the basic unit of segmentation. The basic idea is to observe whether a given sentence is more similar to the previous block of sentences or to the next block of sentences.
[Figure: the sentence in focus Si is compared by similarity to Block 1 (Si-3, Si-2, Si-1) and to Block 2 (Si+1, Si+2, Si+3).]
Asas: Similarity
For that purpose, we introduce an informative similarity measure
(Dias and Alves, 2005) in order to avoid the need of lexical
databases like thesauri or dictionaries.
The informative similarity measure is based on the cosine
similarity measure but integrating any word co-occurrence
association measure in its body.
The cosine measure:

$$S_{ij} = \cos(X_i, X_j) = \frac{\sum_{k=1}^{p} X_{ik}\,X_{jk}}{\sqrt{\sum_{k=1}^{p} X_{ik}^2}\;\sqrt{\sum_{k=1}^{p} X_{jk}^2}}$$

where X_i and X_j are the vectors representing the two text units being compared, and X_ik is the weight of the word at index k in the vector X_i.
Asas: Similarity
(1) Ronaldo defeated the goalkeeper once more.
(2) Real Madrid striker scored again.
These two sentences share no word, yet they clearly deal with the same topic; word co-occurrence statistics allow the similarity measure to capture such relatedness.
Asas: Similarity
Each block and the sentence in focus are represented as vectors
of weights and the association measure used is the
Equivalence Index (Muller et al., 1997).
The informative similarity measure:

$$S_{ij} = infosimba(X_i, X_j) = \frac{\sum_{k=1}^{p} \sum_{l=1}^{p} X_{ik}\,X_{jl}\,EI(W_{ik}, W_{jl})}{\sqrt{\sum_{k=1}^{p} \sum_{l=1}^{p} X_{ik}\,X_{il}\,EI(W_{ik}, W_{il})}\;\sqrt{\sum_{k=1}^{p} \sum_{l=1}^{p} X_{jk}\,X_{jl}\,EI(W_{jk}, W_{jl})}}$$

where

$$EI(w, q) = p(w \mid q) \times p(q \mid w) = \frac{f(w, q)^2}{f(w) \times f(q)}$$
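EI and the informative similarity can be sketched as below. The frequency table `freq` and co-frequency table `cofreq` are assumed to be precomputed from the texts of context, and EI(w, w) = 1 follows from taking f(w, w) = f(w):

```python
import math

def ei(w, q, freq, cofreq):
    """Equivalence Index: EI(w, q) = f(w, q)^2 / (f(w) * f(q))."""
    if w == q:
        return 1.0
    f_wq = cofreq.get((w, q), cofreq.get((q, w), 0))
    return f_wq ** 2 / (freq[w] * freq[q])

def infosimba(xi, wi, xj, wj, freq, cofreq):
    """Sketch of the informative similarity: a cosine in which every pair
    of (possibly different) words contributes through EI.
    xi, xj: weight vectors; wi, wj: the words at each index."""
    def cross(xa, wa, xb, wb):
        return sum(xa[k] * xb[l] * ei(wa[k], wb[l], freq, cofreq)
                   for k in range(len(xa)) for l in range(len(xb)))
    denom = math.sqrt(cross(xi, wi, xi, wi)) * math.sqrt(cross(xj, wj, xj, wj))
    return cross(xi, wi, xj, wj) / denom if denom else 0.0
```

With this measure, the two example sentences about Ronaldo and the Real Madrid striker can score above zero even though they share no word, provided their words co-occur in the context corpus.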
Asas: Selection of Boundary
In order to compare the similarity between blocks and the
sentence in focus, we propose the following solution.
$$ps(X_i) = \log \frac{infosimba(X_i, X_{i-1})}{infosimba(X_i, X_{i+1})}$$

So, if X_i is more similar to X_{i-1}, ps(X_i) will be positive. On the contrary, if X_i is more similar to X_{i+1}, ps(X_i) will be negative. If it is equally similar to both blocks, ps(X_i) will be 0.
Asas: Selection of Boundary
[Figure: ps scores over a sample text, computed with the Equivalence Index.]
Asas: Selection of Boundary
Each time the ps value goes from positive to negative between two consecutive sentences, there exists a topic shift. We will call this phenomenon a downhill.

[Figure: the ps curve with downhills marked.]
Asas: Selection of Boundary
A downhill occurs whenever the value of the ps score goes from positive to negative between two consecutive sentences X_i and X_{i+1}, and its magnitude is defined as:

$$downhill(X_i, X_{i+1}) = ps(X_i) - ps(X_{i+1})$$
Asas: Selection of Boundary
Once all downhills in the text have been calculated, their mean and standard deviation are evaluated. The topic boundaries are then elected if they satisfy the following constraint:

$$downhill(X_i, X_{i+1}) \geq \bar{x} + 2\sigma$$
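Downhill detection and boundary election can be sketched as follows, assuming the ps scores have already been computed; the mean-plus-two-standard-deviations threshold follows the constraint above:

```python
import math

def select_boundaries(ps):
    """Sketch of Asas boundary selection: find downhills (ps going from
    positive to negative between consecutive sentences), then keep those
    whose magnitude is at least mean + 2 * stddev of all downhills."""
    downs = {}
    for i in range(len(ps) - 1):
        if ps[i] > 0 and ps[i + 1] < 0:
            downs[i] = ps[i] - ps[i + 1]
    if not downs:
        return []
    vals = list(downs.values())
    mu = sum(vals) / len(vals)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in vals) / len(vals))
    return [i for i, v in downs.items() if v >= mu + 2 * sigma]
```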
Asas: Flexibility
This architecture has the advantage of accepting different association measures, so that better tuning can be obtained.
Moreover, by choosing different context lengths (one word, k words, k sentences, k paragraphs, k texts) for the calculation of the association measures between two words, different applications can be obtained:
- Segmenting different news articles from a list of articles (larger context).
- Segmenting a technical text about one topic where each segment is about a subtopic (smaller context).
Conclusion
- Language-independent unsupervised Topic Segmentation system based on word co-occurrence.
- Avoids the need for existing linguistic resources such as electronic dictionaries or lexico-semantic databases such as thesauri or ontologies.
- Solves the reliability problems evidenced by systems based solely on lexical repetition.
Problems and Future Work
- Existence of three main parameters: the block size, the window size for calculating the association measure, and the topic discovery threshold.
- Exhaustive evaluation: different association measures, different similarity measures, different languages, etc.
- Comparison with systems that use linguistic resources.
- More work must be done on the automatic boundary detection algorithm. We are convinced that better algorithms may be proposed based on transforming the representation of the ps function into a graph or network.
- The system will be downloadable at http://asas.di.ubi.pt/ under the GPL License when completely tested.
The End
Thanks for your attention and patience!