Code Tuning
Chapter 25-26
Textbook & Reference
Steve McConnell. Code Complete: A Practical Handbook of
Software Construction. 2nd Edition. Microsoft Press, 2004.
Chapters 25 and 26
Jon Bentley, Programming Pearls, Second Edition, AddisonWesley, Inc., 2000.


2
Outline





Options of Performance Improvement
What Is Code Tuning?
Common Sources of Inefficiency
Code Tuning Process
Code Tuning Techniques





3
Logic
Loops
Data Transformations
Expressions
Object Creation and Flyweight/Factory Patterns
Options for Performance Improvement

Program requirements


Make sure it is a problem that needs to be solved
Boehm’s story (2000):



Program design


4
A system at TRW initially requiring subsecond response time led to a
highly complex design and an estimated cost $100M
Further analysis determined that users would be satisfied with 4
second response 90% of the time
Some problems have to be addressed at design level
Search algorithms
Options for Performance Improvement

OS interactions




Code compilation


Maybe no need to think about optimizing speed any further if
the right compiler is chosen
Hardware

5
Perhaps the OS routines are slow or fat
Working with external files, dynamic memory, or output
device probably interact with the OS
Your compiler may generate (or your libraries may invoke)
system calls you would never dream of.
Sometimes buying new hardware is the cheapest and best
way
What Is Code Tuning?
The practice of modifying correct code in ways that make
it run more efficiently


Small-scale changes that affect a single class / routine or a few
lines of code
Not the most effective or easiest or cheapest way to
improve performance
Not necessarily ‘better’ code.



6
No one, but programmers, usually cares how tight the code is
Why Is Code Tuning Appealing?

It is incredibly satisfying to reduce execution time by
tweaking a few lines!

Mastering the art of writing efficient code is a rite of
passage to becoming a serious programmer


7
Programming culture - writing micro-efficient code proves you
are cool
Like a tennis player picking up a tennis ball
Pareto Principle: 80/20 Rule
‘get 80% of the result with 20% of the effort’




Boehm (1987): 20% of a program’s routines consumes 80% of
its execution time
Knuth (1971): less than 4% of a program usually accounts for
more than 50% of its runtime.
Bentley (1988): a 1 KLOC program spent 80% of its time in a
5-line square root routine
Measure the code to find the hot spots and put your
resources into optimizing the few percent
Write most of code in an interpreted language and
rewrite the hot spots in a faster language


8
Which One Is Faster?

9
for i=1 to 10{
a[i] = i;
}
 a[1] = 1;
a[2] = 2;
a[3] = 3;
a[4] = 4;
a[5] = 5;
a[6] = 6;
a[7] = 7;
a[8] = 8;
a[9] = 9;
a[10] = 10;
Old Wives’ Tales

Reducing LOC improves the speed or size of the resulting
machine code – false


Certain operations are probably faster or smaller than others –
false



Previous Java example: 12.6 vs 3.23
No room for probably when talking about performance
The rule of the game change every time you change languages,
compilers, versions of libraries, processor, amount of memory
‘Fast’ is as important as ‘correct’ - false
10
Old Wives’ Tales - cont

You should optimize as you go – false


If you strive to write the fastest and smallest possible code
for each routine, your program will be fast and small?
Problematic!



11
It is almost impossible to identify performance bottlenecks before a
program is working completely.
Programmers are very bad at guessing which 4% accounts for 50%
of the execution time.
Focusing on optimization during initial development detracts from
achieving other program objectives.
When to Tune

Don’t optimize until you know you need to





Use a high-quality design
Make the program right
Make it modular and easily modifiable
Check performance when it is complete and correct
Complier optimizations


12
They might be more powerful than you expect
Compiler-optimized code may be faster than ‘tricky’ code
Complier optimizations




With a good optimizing compiler,
your code speed can improve 40+
percent
Many of the techniques described
in the next chapter produce gains
of only 15–30 percent.
Why not just write clear code and
let the compiler do the work?
Here are the results of a few tests
to check how much an optimizer
speeded up an insertion-sort
routine:

13
The only difference between
versions of the routine was that
compiler optimizations were
turned off for the first compile and
turned on for the second.
Compiler Optimizations - cont
sum=0;
for (row=0; row<rowCount; row++) {
for (column=0; column<columnCount; column++) {
sum = sum + martix[row][column]
} // expensive multiplications for access to 2D array
}
sum=0;
elementPointer = matrix;
lastElementPointer = matrix[rowCount-1][columnCount-1]+1;
while (elementPoint <lastElementPointer) {
sum = sum + elementPointer++;
}

No improvement - digging into the assembly code turned out
that the complier did the optimization.
14
Common Sources of Inefficiency

Input/output operations


System calls: expensive due to context switch (Saving
program states, recovering the kernel states and the
reverse)




If you have a choice of working with a file in memory vs on
disk, in a database, or across a network, use an in-memory
structure unless space is critical.
Write own services if part of the function is needed
Avoid going to the system
Work with the system vendor to make the call faster
Interpreted languages

15
Process each instruction before creating and executing
machine code
Input/output operations
Interpreted languages
16
Are They Functionally Equivalent?
for (column=0; column<MAX_COLUMNS; column++) {
for (row=0; row<MAX_ROWS; row++) {
table [row][column] = BlankTableElement();
}
}
for (row=0; row<MAX_ROWS; row++) {
for (column=0; column<MAX_COLUMNS; column++) {
table [row][column] = BlankTableElement();
}
}
1.
2.
17
Each element of table is about 4k bytes long
Pages are usually at least 4K bytes in size
Common Sources of Inefficiency - cont

Paging



An operation that causes the OS to swap pages of memory
It is much slower than an operation that works on only one
page of memory
Errors



18
Leaving debugging code turned on
Forgetting to de-allocate memory
Polling non-existent devices until timeout
Paging Example
Assume: each row is about the same page size.
for (column=0; column<MAX_COLUMNS; column++) {
for (row=0; row<MAX_ROWS; row++) {
table [row][column] = BlankTableElement();
}
} // accessing a different row causes a page fault
for (row=0; row<MAX_ROWS; row++) {
for (column=0; column<MAX_COLUMNS; column++) {
table [row][column] = BlankTableElement();
}
} // 1000 times faster on a machine with limited memory,!
19


If table has too many rows, every time the program
accesses a different row, the operating system will have to
switch memory pages.
The way the loop is structured, every single array access
switches rows, which means that every single array access
causes paging to disk.
20
Relative Costs of Common Operations
Operation
Example
C++
Java
Baseline (int assignment)
i=j
1
1
Call private routine with no para.
this.foo()
1
0.5
Object routine call
bar.foo()
2
1
Polymorphic routine call
abstractBar.foo()
2.5
2
Object reference
i=obj.num
1
1
Integer division
i=j/k
5
1.5
Floating point sqrt
x=sqrt(y)
15
4
Floating point ey
x = exp(y)
50
20
Floating point logarithm
x=log(y)
25
20
Measurements are sensitive to local machine environment and compiler.
Measurements between C++ and Java are not directly comparable.
21
Measurement

Performance aspects can be counterintuitive.


It is not worth sacrificing clarity for a performance
gamble, if it is not worth measuring to know that it is
more efficient


Experience from old machine or language does not help much.
Develop software by using well-designed code that is easy to
understand and modify.
Measurements need to be precise.
22
Code Tuning Process

If the performance is poor:



Save a working version
Measure the system to find hot spots
Determine where weak performance is from:




23
Go back if tuning is not appropriate
Tune the bottleneck identified
Measure each improvement at a time
If no improvement, revert to the code saved
Code Tuning Techniques





Logic
Loops
Data Transformations
Expressions
Object Creation
24
Logic

Stop testing when you know the answer

If (5<x) and (x<10) then … // Specification



Use short-circuit op: && vs. &
If (5<x) then if (x<10) then … // If the language does not support
short-circuit evaluation.
Search whether a negative is present
negativeInputFound = false;
for (i=0; i<count; i++) {
if (input [i]<0) {
negativeInputFound = true;
}
}
25
A better approach?


Is to stop scanning as soon as you find a negative value.
Any of these approaches would solve the problem:




26
Add a break statement after the negativeInputFound = true line.
If your language doesn't have break, emulate a break with a goto
that goes to the first statement after the loop.
Change the for loop to a while loop, and check for
negativeInputFound as well as for incrementing the loop counter
past count.
Change the for loop to a while loop, put a sentinel value in the
first array element after the last value entry, and simply check
for a negative value in the while test. After the loop terminates,
see whether the position of the first found value is in the array
or one past the end.
27
Can You Tune This Code?

Keyboard input in a word processor
Select inputCharacter
Case “+”, “=“
ProcessMathSymbol(inputCharacter)
Case “0” to “9“
ProcessDigit(inputCharacter)
Case “ ,”, “.”, “:”, “;”, “!”, “?”
ProcessPunctuation(inputCharacter)
Case “ ”
ProcessSpace(inputCharacter)
Case “A” to “Z”, “a” to “z”
ProcessAlpha(inputCharacter)
Case Else
ProcessError(inputCharacter)
End Select
28
Logic - cont

Arrange test so that the one that is fastest and most
likely to be true is performed first

If you know the likely frequency of input characters, put the
most common cases first.
Select inputCharacter
Case “A” to “Z”, “a” to “z”
ProcessAlpha(inputCharacter)
Case “ ” ProcessSpace(inputCharacter)
Case “ ,”, “.”, “:”, “;”, “!”, “?”
ProcessPunctuation(inputCharacter)
Case “0” to “9“ ProcessDigit(inputCharacter)
Case “+”, “=“ ProcessMathSymbol(inputCharacter)
Case Else ProcessError(inputCharacter)
End Select
29
Can You Tune This Code?
…
if ( ( (‘a’<=inputChar) && (inputChar <=‘z’)) ||
( (‘A’<=inputChar) && (inputChar <=‘Z’))) {
charType = CharacterType.Letter;
}
else if ( (inputChar==‘ ‘) ||(inputChar == ‘,’) ||
(inputChar==‘.‘) || (inputChar==‘!‘) || (inputChar==‘(‘) ||
(inputChar==‘)‘) || (inputChar==‘:‘) || (inputChar==‘;‘) ||
(inputChar==‘?‘) || (inputChar==‘-‘)) {
charType = CharacterType.Punctuation;
}
else if ((‘0’<=inputChar) && (inputChar <=‘9’)) {
charType = CharacterType.Digit; }
…
30
Logic - cont

Substitute table lookups for complicated expressions
(space for time)


Computing only once and storing results in a table.
Example1: Character type:




Example 2: Integer square root of integers 1..100
Lazy evaluation – avoid doing any work until needed

31
Store the type of each character in an array that’s accessed by the
character type code
charType = charTypeTable[inputChar];
For a table of 5K entries, generate the whole table at startup
time vs the small percentage used
Substitute table
Suppose you want to assign a
category number to something
based on which of three
groups— Groups A, B, and C—
it falls into:
32
33
Can You Tune This Code?
for (i=0; i<count; i++) {
if (sumType == SUMTYPE_NET) {
netSum = netSum+amount[i];
}
else {
grossSum = grossSum+ amount[i];
}
}
34
Loops -Unswitching
if (sumType == SUMTYPE_NET) {
for (i=0; i<count; i++) {
netSum = netSum+amount[i];
}
}
else {
for (i=0; i<count; i++) {
grossSum = grossSum+ amount[i];
}
}
35
Loops -Unswitching

If the decision doesn't change while the loop is executing,


36
you can unswitch the loop by making the decision outside the loop
putting loops inside the conditional rather than putting the
conditional inside the loop
Loops - Minimizing the work inside loops
for (i=0; i<rateCount; i++) {
netRate[i] = baseRate[i] * rates ->discounts->factors->net;
}
qualityDiscount = rates ->discounts->factors->net;
for (i=0; i<rateCount; i++) {
netRate[i] = baseRate[i] * qualityDiscount;
}
37
Can You Tune This Code
found = FALSE;
i=0;
while ( (!found) && (i<count) ) {
if (item[i] == testValue)
found=TRUE;
else
i++;
}
if (found) {…}

38
Compound test
Loops - Sentinel values

For a loop with a compound test, you can often save
time by simplifying the test.

For a search loop, one can use a sentinel value

39
A value that is put just past the end of the search range and
guaranteed to terminate
Loops - Sentinel values: cont
// set sentinel value, preserving the original value
initialValue = item[count];
item[count] = testValue;
i=0;
while (item[i] !=testValue) {
i++;
} // 3 tests to 1 test
if (i<count) {…}

40
Time savings: Java (44%), C#(23%) for a 100-elemenmt of
integers
Loops - Putting the busiest loop on the inside
for (column=0; column<100; column++) {
for (row =0; row<5; row++) {
sum = sum + table[row][column]
}
}




41
The outer executes much more often than the inner
Each time the loop executes, it has to initialize the loop
index, increment it on each pass through the loop, and check
it after each pass
Total number of loop executions: 100 + 100*5
Switching the inner and outer: 5 + 100*5
Loops - Strength Reduction

Replace an expensive operation (e.g. multiplication) with a
cheaper operation (e.g. addition)

for (i=0; i<=saleCount-1;i++){
commission(i) = (i+1) * revenue*basedCommission*discount
}

incrementalCommission = revenue*bassCommission*discount
cumulativeCommission = incrementalCommission
for (i=0;i<=saleCount-1;i++){
commission(i) = cumulativeCommission
cumulativeCommission = cumulativeCommission + incrementalCommission
}
42
Data Transformations


Use integers rather than floating point numbers
Use the fewest array dimensions possible

One dimensional representation of an array
for (entry=0; entry<numRows*numColumns; entries++) {
matrix [entry] =0;
}

43
Time savings: C++: 11%; Java: 47%, C#: 9%
Is Tuning Possible?
for (discountType =0; discountType <typeCount;
discountType++) {
for (discountLevel=0; discountLevel<levelCount;
discountLevel++){
rate[discountLevel] = rate[discountLevel] *
discount[discountType];
}
}
44
Data Transformations - cont

The reference to discount[discountType] doesn't change when
discountLevel changes in the inner loop.

Minimize array references
for (discountType =0; discountType <typeCount; discountType++) {
for (discountLevel=0; discountLevel<levelCount; discountLevel++){
rate[discountLevel] = rate[discountLevel] * discount[discountType];
}
}
thisDiscount = discount[discountType];
for (discountLevel=0; discountLevel<levelCount; discountLevel++){
rate[discountLevel] = rate[discountLevel] *thisDiscount;
}
45
Expressions

Use strength reduction







46
Replace multiplication with addition
Replace exponentiation with multiplication
Replace trigonometric routines with their trigonometric
identities
Replace longlong integers with longs and ints
Replace floating point with fixed-point numbers
Replace double precision with single precision
Replace integer multiplication-by-two and division-by-two with
shift operations
Is Tuning Possible?



Ax2 + Bx + C.
The letters A, B, and C are coefficients, and x is a variable.
Write code to evaluate an nth-order polynomial
47
Evaluate a polynomial – cont’
48
Evaluate a polynomial – cont’

(Ax + x)x + C
49
Is Tuning Necessary?
unsigned int log2(unsigned int x) {
return (unsigned int) (log(x)/log(2));
}
50
Expressions - cont

Initialize at compile time or precompute
const double LOG2 = 0.69314718;
…
unsigned int log2(unsigned int x) {
return (unsigned int) (log(x)/LOG2);
}

Any other ways to tune the code?
51
Expressions - cont

Be wary of system routines

log2 returned an integer value but used a floating point log()
routine to compute it
unsigned int log2(unsigned int x) {
if (x<2) return 0;
if (x<4) return 1;
if (x<8) return 2;
if (x<16) return 3;
…
if (x<2147483648) return 30;
return 31;
}
52
unsigned int log2(unsigned int x) {
unsigned int i=0;
while ((x=(x>>1))!=0) {
i++;
}
return i;
}
// hard to understand; should avoid
// unless you have a good reason
Is Tuning Possible?
payment = loanAmount / (
(1.0-Math.pow(1.0+(interestRate/12.0), -months))/
(interestRate/12.0)
);
53
Expressions - cont

Eliminate common subexpressions
payment = loanAmount / (
(1.0-Math.pow(1.0+(interestRate/12.0), -months))/
(interestRate/12.0)
);
 monthlyInterest = interestRate/12.0;
54
Can Performance Be Improved?

Draw 1000 circles with 6 different colors.

Version 1: Circle.java, Test.java
for (int i=0; i < NUMBER_OF_CIRCLES; ++i) {
Circle circle = new Circle(getRandomColor());
circle.draw(g, getRandomX(), getRandomY(), getRandomR());
//1000 object created.
}

Improvement with Flyweight


Version 2: Cricle.java, CircleFactory.java, Test.java
From Java Design Patterns at a Glance

55
http://www.javacamp.org/designPattern/
Flyweight



Intent: Use sharing to support large numbers of fine-grained objects
efficiently
The FlyweightPattern describes how to support a large number of fine
grained objects efficiently, by sharing commonalities in state.
For example, when designing a word processor application, you might
create an object for each character typed.



Each Character object might contain information such as the font face, size and
weight of each character.
The problem here is that a lengthy document might contain tens of thousands of
characters, and objects - quite a memory killer!
The Flyweight pattern addresses the problem by



56
creating a new object to store such information, which is shared by all characters
with the same formatting.
So, if I had a ten-thousand word document, with 800 characters in Bold TimesNew-Roman, these 800 characters would contain a reference to a flyweight
object that stores their common formatting information.
The key here is that you only store the information once, so memory
consumption is greatly reduced. -- TobinHarris
Flyweight Structure Summary
57
58
CircleFactory.java
class CircleFactory {
//store color
private static final HashMap circleByColor = new HashMap();
public static Circle getCircle(Color color) {
Circle circle = (Circle)circleByColor.get(color);
if(circle == null) {
circle = new Circle(color);
circleByColor.put(color, circle);
System.out.println("Creating " + color + " circle");
//see how many objects we create on command line
}
return circle;
}
}
59
Object Creation and Flyweight Pattern


Make instances of classes on the fly to improve
performance efficiently
Show a file system with directories




Java String Class is designed with Flyweight





60
No need to load all the files or directories at one loading time.
Show the upper level folders first.
If the user clicks a folder, then load its subdirectories and files.
String s1 = "hello";
String s2 = "hello"; //store in a string pool.
String s3 = new String("hello");
System.out.println(s1==s2); //true, share the memory address
System.out.println(s1==s3); //false
More about Code Tuning

Other techniques



Recoding in a low-level language
…
Code tuning is a controversial, emotional topic

61
Apply them with care if you decide to use
Descargar

Document