Finding Similar Defects
Using Synonymous Identifier Retrieval
Norihiro Yoshida, Takeshi Hattori, Katsuro Inoue
Osaka University, Japan
Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University
Similar Code fragment

Similar code fragments are one of the factors that make software maintenance more difficult.
[Figure: source files A and B contain a code fragment CF and similar code fragments SF1 and SF2. When CF is modified, it is necessary to determine whether or not to modify SF1 and SF2 as well.]
It is necessary to develop an automatic code retrieval tool based on code similarity.
Key Idea

In many cases, code fragments that use similar identifier names (e.g., type, variable, and function names) have similar functionality.

Developers often need to inspect those code
fragments simultaneously.
It is necessary to develop an automatic code retrieval tool based on identifier similarity.
SC-Retriever: Code retrieval tool based on identifier
similarity


- Retrieves code fragments that are similar to a query code fragment
  - Based on identifier similarity (e.g., type, variable, and function names)
- Determines synonymous words in the target source files
[Figure: SC-Retriever workflow. Identifier extraction is applied to the query code fragment and to the target source files, synonymous identifier determination is applied to the target source files, and retrieval outputs the similar code fragments.]
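The identifier extraction step could look roughly like the sketch below. The slides do not describe how SC-Retriever extracts identifiers, so the regular expressions, the keyword list, and the function name `extract_identifiers` are assumptions made for illustration only. A production tool would more likely use a real C lexer.

```python
import re

# A subset of C keywords to exclude; the list is illustrative, not complete.
C_KEYWORDS = {
    "char", "int", "long", "short", "unsigned", "void", "struct", "static",
    "const", "if", "else", "for", "while", "do", "return", "sizeof", "break",
}

def extract_identifiers(code):
    """Return the set of identifiers (type, variable, function names) in a C snippet."""
    # Drop comments first, then string and character literals (rough approximation).
    code = re.sub(r"/\*.*?\*/|//[^\n]*", " ", code, flags=re.S)
    code = re.sub(r'"(?:\\.|[^"\\])*"|\'(?:\\.|[^\'\\])*\'', " ", code)
    tokens = re.findall(r"[A-Za-z_]\w*", code)
    return {t for t in tokens if t not in C_KEYWORDS}

snippet = "int add_host(struct node *n) { char *host = alloc(n->len); /* copy */ return 0; }"
print(sorted(extract_identifiers(snippet)))
```

The same function would be applied both to the query code fragment and to each function in the target source files.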
Why should we determine synonymous words in
source code?


- SC-Retriever needs to identify a set of code fragments that have similar functionality.
- Different developers often use different identifier names even when they implement the same functionality.
It is necessary to determine synonymous words in order to identify code fragments that have similar functionality.
How to determine synonymous words?

We use an automatic synonym determination technique from NLP.
- Dagan's method [1], which is based on co-occurrence relations and does not use thesauruses or dictionaries.
- The method treats words as synonymous when they often co-occur with a similar set of words in sentences.
  e.g., "Kids play soccer", "Children play soccer"
  Both "kids" and "children" co-occur with the words "play" and "soccer", so they are judged synonymous.
- Note that a threshold must be set for synonym determination.
[1] I. Dagan, L. Lee, and F. C. N. Pereira. Similarity-based models of word
cooccurrence probabilities. Machine Learning, 34(1-3):43–69, 1999.
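The co-occurrence idea can be sketched on the "kids" / "children" example as follows. This is not the exact procedure of [1] or of SC-Retriever: the slides do not say which measure is applied, so the use of Jensen-Shannon divergence, the sentence-level co-occurrence window, and the names `cooccurrence`, `js_divergence`, and `synonym_pairs` are assumptions. The threshold is treated here as an upper bound on dissimilarity, so a larger threshold accepts more pairs.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations

def cooccurrence(sentences):
    """Map each word to counts of the other words appearing in the same sentence."""
    ctx = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i, word in enumerate(words):
            for j, other in enumerate(words):
                if i != j:
                    ctx[word][other] += 1
    return ctx

def js_divergence(c1, c2):
    """Jensen-Shannon divergence (base 2, range 0..1) between two count vectors."""
    n1, n2 = sum(c1.values()), sum(c2.values())
    div = 0.0
    for key in set(c1) | set(c2):
        p, q = c1[key] / n1, c2[key] / n2
        m = (p + q) / 2
        if p > 0:
            div += 0.5 * p * math.log2(p / m)
        if q > 0:
            div += 0.5 * q * math.log2(q / m)
    return div

def synonym_pairs(sentences, th):
    """Return word pairs whose co-occurrence distributions differ by at most th."""
    ctx = cooccurrence(sentences)
    return {(a, b) for a, b in combinations(sorted(ctx), 2)
            if js_divergence(ctx[a], ctx[b]) <= th}

sentences = ["Kids play soccer", "Children play soccer"]
print(synonym_pairs(sentences, 0.1))
```

For source code, the "sentences" would be replaced by whatever unit the tool uses to define co-occurrence of identifiers (for example statements), which the slides leave unspecified.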
How to match a query code fragment with target
source files?

If code fragments in the target source files have the same or synonymous identifiers as the query identifiers, those code fragments are extracted as similar code fragments.
[Figure: identifiers in the query code fragment are matched against identifiers in the target source files. The example identifiers shown are host, node, alloc, and add.]
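The matching step can be sketched as below. The slides do not state the exact matching rule, so this sketch assumes that a target fragment matches when every query identifier, or one of its synonyms, appears in it. The function names `expand` and `retrieve`, the fragment names, and the synonym pair (host, node) are hypothetical.

```python
def expand(identifier, synonym_pairs):
    """Return the identifier together with every identifier paired with it."""
    group = {identifier}
    for a, b in synonym_pairs:
        if a == identifier:
            group.add(b)
        elif b == identifier:
            group.add(a)
    return group

def retrieve(query_ids, targets, synonym_pairs):
    """Return names of target fragments covering every query identifier or a synonym.

    targets maps a fragment name to the set of identifiers it contains.
    """
    hits = []
    for name, ids in targets.items():
        if all(expand(q, synonym_pairs) & ids for q in query_ids):
            hits.append(name)
    return hits

query = {"host", "alloc", "add"}
targets = {
    "f1": {"node", "alloc", "add"},   # uses "node" where the query uses "host"
    "f2": {"host", "free"},           # lacks "alloc" and "add"
}
print(retrieve(query, targets, {("host", "node")}))
```

Without the synonym pair, `f1` would not be retrieved, which is the situation synonym determination is meant to address.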
Case Study

Overview
- Conduct a case study with SC-Retriever and CCFinder
- Retrieve defective functions in 2 software systems
- Compare the efficiency of the retrieval

Target Systems
- Canna (90 KLOC, 2361 functions)
  - Client-server Japanese character input system
  - Ver. 3.6 involves 19 buffer overflow defects
  - Those defects exist in 18 functions
- SPARS-J (36 KLOC, 859 functions)
Experimental Steps
1. Choose defective code fragments from the Canna source code.
2. Retrieve C functions in the Canna source code.
   - SC-Retriever: we give the 3 chosen code fragments as queries.
   - CCFinder: we detect code clones of those 3 code fragments.
3. Calculate the precision, recall, and F-score from the retrieved results and the bug records.
   - The F-score is the harmonic mean of precision and recall.
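Step 3 can be sketched as follows, using the standard definitions of precision, recall, and their harmonic mean. The function name `precision_recall_f` and the function sets are hypothetical. The counts (26 retrieved functions, 13 of them among 18 defective ones) are made up for the example and are not taken from the bug records.

```python
def precision_recall_f(retrieved, defective):
    """Compute precision, recall, and F-score for a set of retrieved functions."""
    hits = len(retrieved & defective)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(defective) if defective else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f_score

# Hypothetical data: 18 defective functions, 26 retrieved, 13 of them defective.
defective = {f"defective_{i}" for i in range(18)}
retrieved = {f"defective_{i}" for i in range(13)} | {f"other_{i}" for i in range(13)}
p, r, f = precision_recall_f(retrieved, defective)
print(round(p, 2), round(r, 2), round(f, 2))
```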
Resultant precisions, recalls, and F-scores
         SC-Retriever (th = 0.1)   SC-Retriever (th = 0.2)   CCFinder
Query    Prec.  Recall  F-score    Prec.  Recall  F-score    Prec.  Recall  F-score
CFA      0.50   0.72    0.59       0.18   1.00    0.31       1.00   0.06    0.11
CFB      0.19   0.33    0.25       0.18   1.00    0.31       1.00   0.06    0.11
CFC      1.00   0.06    0.11       0.33   0.06    0.10       1.00   0.06    0.11

- We set the threshold (th) for synonym determination to 0.1 and 0.2.
  - If th is set to a high value, many synonymous words are detected.
- The F-scores of SC-Retriever are higher than those of CCFinder.
  - The recall of SC-Retriever is relatively high.
  - The precision of CCFinder is relatively high.
- The results of SC-Retriever depend on the query and on th.
Future Work

- Further case studies on defects in other software systems
- A code clone detection tool based on synonymous word determination
- A method to rank code fragments based on identifier similarity
- Other methods to determine synonymous words
  - LSI, dictionary-based, or thesaurus-based methods