Maximizing Intel® Compiler
Performance using
Iterative Feedback Directed
Optimization (IFDO)
Ilya Cherny
Software Manager
Software and Services Group
Thanks to Leonid Brusencov, Sergey Ermolaev, Artem Chirtsov, Sergey Grebenkin!
October 23, 2008
Agenda
•Why IFDO
•What is IFDO
•Alternative approaches
•Experimental Results
•Future plans
SECR 2008
October 23
Maximizing Intel® Compiler
Performance using IFDO
Software and Services Group
‹#›
A Performance Experiment
• Take 3 nested loops for matrix multiplication
for(i = 0; i < N; i++) {
for(j = 0; j < N; j++) {
c[i][j] = 0;
}
}
for(k = 0; k <
>icl
–help
c[i][j] +=
..
N; k++) {
a[i][k] * b[k][j];
}
/O3
optimize for
maximum speed and enable
• Compile
with maximum
optimization
high-level optimizations
.. measure time – 20 sec
• Run and
−‘icl –O3 matmul.c’
• Compile with enforced vectorization
−‘icl -O3 –mP2OPT_vec_always=T’
• Run and measure time – 11 sec
Why don’t switch on vectorization always?
SECR 2008
October 23
Maximizing Intel® Compiler
Performance using IFDO
Software and Services Group
‹#›
A Performance Experiment II
•Switch ‘i’ and ‘j’ loops, move initialization loop outside
for(i = 0; i < N; i++)
for(j = 0; j < N; j++)
c[i][j] = 0;
for(j = 0; j < N; j++) {
for(i = 0; i < N; i++) {
for(k = 0; k < N; k++) {
c[i][j] += a[i][k] * b[k][j];
}
}
}
•Compiler with O3 optimization
•Measure run time – 18 sec
•Compile with enforced vectorization
•Measure run time – 21 sec
Thoughtless optimization loses 15%
SECR 2008
October 23
Maximizing Intel® Compiler
Performance using IFDO
Software and Services Group
‹#›
Why Optimizations Do Not Always Win?
Source code adding 4 integers
Compile to scalar code
for(i=0;i<4;i++)
A[i]=A[i]+B[i]
Compile to vector code
loop_start:
mov ebx, B[edi]
add A[edi], ebx
inc edi
loop loop_start
movdqa xmm1, A
paddd xmm1, B
movdqa A, xmm1
•Vector code could be 2-4x times faster, but..
−..what if array size is not a multiple of 4?
−..what if A is not aligned by 16 bytes?
−..what if A and B aligned differently?
•Final code has an overhead of these 3 ‘if’
SECR 2008
October 23
Maximizing Intel® Compiler
Performance using IFDO
Software and Services Group
‹#›
Heuristics for Parameterization of
Compiler Optimizations
• Compiler has hundreds of performance optimizations
• Each optimization has at least 1 Boolean parameter
• Compiler has heuristics to make a decision about each optimization
• But still code produced by the compiler is far from optimal
−Each application has unique behavior (loop counts, memory accesses)
−Each machine has unique characteristics (latencies, caches)
−It is impossible to target ALL compiler heuristics for ALL machines for ALL
applications
Compiler optimizations have considerable
performance headroom if parameterized right
SECR 2008
October 23
Maximizing Intel® Compiler
Performance using IFDO
Software and Services Group
‹#›
Agenda
•Why IFDO
•What is IFDO
•Alternative approaches
•Experimental Results
•Future plans
SECR 2008
October 23
Maximizing Intel® Compiler
Performance using IFDO
Software and Services Group
‹#›
IFDO process
1. Optimization:
Compile
executable from
the sources
2.Feedback:
Run
executable
with profiling
4.Iterative:
Repeat while
time permits
SECR 2008
October 23
Maximizing Intel® Compiler
Performance using IFDO
3.Directed:
Analyze
results, define
options for the
next
compilation
Software and Services Group
‹#›
Search Algorithms
•Implemented 6 algorithms to do search in the options space
[20]
−exhaustive search with priorities
−batch elimination
−iterative elimination
−combined elimination
−genetic algorithm
−statistical selection [23]
Search algorithms find the maximum much faster
than after 2n iterations
SECR 2008
October 23
Maximizing Intel® Compiler
Performance using IFDO
Software and Services Group
‹#›
Compiler Optimization Options
•Selected 6 undocumented options which seem to have the
highest impact
−loop vectorization
−loop fusion
−loop distribution
−loop unrolling
−data blocking
−memory prefetch
•Number of total independent option values is 13
•Search space for these 6 options has 7000 combinations
•Extracted all compiler options from the sources and just
started the runs for more than 1000 options
SECR 2008
October 23
Maximizing Intel® Compiler
Performance using IFDO
Software and Services Group
‹#›
IFDO Tool
•User defines how to build and
execute his application
application builder
•IFDO tool outputs the best
binary and all performance
data
1: Build application
IFDO_compiler
•Based on IFDO results user
can
application launcher
::User
−change compilation options or
pragmas
application executable
log_data [iteration]
−modify sources
−improve compiler
IFDO tool automates iteration process
SECR 2008
October 23
Maximizing Intel® Compiler
Performance using IFDO
Software and Services Group
‹#›
IFDO Sample Output - SPEC CPU2000 173.applu
iter stat proc
ticks
impr% noise% parameter
val parameter val
------------------------------------------------------------------------------1 ... *Total*
9571493442
+0.00
0.21
1 ... _SSOR
7335205487
+0.00
0.26 VectorizeP
1 BlockP
1
1 ... _RHS
2218915958
+0.00
0.02 VectorizeP
1 BlockP
1
2
... *Total*
6234529813
+34.86
0.01
2
2
... SSOR
... _RHS
4048766558
2168605738
+44.80
+2.27
0.01
0.05
3
3
3
... *Total*
... SSOR
... RHS
4
4
4
... *Total*
... _SSOR
... _RHS
10806270474
8657609979
2102104477
-12.90
-18.03
+5.26
0.15
0.10
0.01
5
*.. *Total*
5602632636
+41.47
0.00
5
5
*..
*..
4043669682
1542238426
+44.87
+30.50
0.02
0.06
VectorizeP
VectorizeP
1
1
BlockP
BlockP
2
2
But Intel®
Compiler
improved
since 10.0 &
12889959745
-34.67
0.06
11309558116 -54.18
0.07 only
VectorizeP
2 BlockP now!
1
SPEC 173.applu
has
8%
headroom
1563970212 +29.52
0.01 VectorizeP
2 BlockP
1
SSOR
RHS
VectorizeP
VectorizeP
3
3
BlockP
BlockP
1
1
VectorizeP
VectorizeP
1
2
BlockP
BlockP
2
1
41% performance gain!
SECR 2008
October 23
Maximizing Intel® Compiler
Performance using IFDO
Software and Services Group
‹#›
Agenda
•Why IFDO
•What is IFDO
•Alternative approaches
•Experimental Results
•Future plans
SECR 2008
October 23
Maximizing Intel® Compiler
Performance using IFDO
Software and Services Group
‹#›
Comparison to other publications
•There are 23 references to the similar papers
•No publications have compiler with procedure level granularity
•All publications were limited by about 50 options, while we
started exploring 1000 of undocumented compiler options
Our work has two novel characteristics
SECR 2008
October 23
Maximizing Intel® Compiler
Performance using IFDO
Software and Services Group
‹#›
Comparison to other tools
Manual
options
search
PGO
prof_gen/
prof_use)
Granularity
Profiling
Conclusion
-
-/+
whole application
or source changes
any profiler, but
manually
user time
consuming
+
-
basic block level
basic block
counters only
2 iterations only
PathScale
PathOpt2
-
-
whole application
whole application
IFDO
+
+
function or loop
instrumentation,
VTune
SECR 2008
October 23
Maximizing Intel® Compiler
Performance using IFDO
40% less results
not available in
the product
Software and Services Group
‹#›
Agenda
•Why IFDO
•What is IFDO
•Alternative approaches
•Experimental Results
•Future plans
SECR 2008
October 23
Maximizing Intel® Compiler
Performance using IFDO
Software and Services Group
‹#›
Search Algorithms – Performance Growth
dependency on Iteration
5
1
11
21
31
41
51
61
71
81
91
101
Performance gain, %
4
ES
3
BE
IE
CE
2
GA
SS
1
0
Number of iteration
•BE works for independent options only, but just 14 iterations
•IE and CE get 99% in 30 iterations – the most effective
All algorithms gain 2.5%-4% in CPU2000 total time
SECR 2008
October 23
Maximizing Intel® Compiler
Performance using IFDO
Software and Services Group
‹#›
Procedure vs. Application granularity
10
9
7
6
Procedure
5
Application
4
3
2
1
0
16
4
16
8
17
3
17
5
17
6
17
7
17
9
18
1
18
3
18
6
18
7
18
8
18
9
19
1
19
7
25
2
25
3
25
4
25
5
25
6
30
0
30
1
Performance gain, %
8
SPEC CPU2000 Benchmark
8 from 22 benchmarks gain from procedure level
SECR 2008
October 23
Maximizing Intel® Compiler
Performance using IFDO
Software and Services Group
‹#›
SECR 2008
is
tri
b
is
tri
b
is
tri
b
ut
io
nP
{2
}
kP
{2
}
ut
io
lo
c
nP
{3
}
ut
io
nP
{4
Fu
}
si
on
P{
2}
H
Fu
LO
si
on
Un
P{
ro
3}
llF
H
LO
ac
to
Un
rP
ro
{1
l
lF
H
}
LO
ac
t
Un
or
P{
ro
2}
llF
H
LO
ac
to
Un
rP
ro
{3
llF
}
ac
U
nr
to
ol
rP
lA
{4
nd
}
Ja
m
P{
Ve
2}
ct
or
iz
eP
Ve
{2
ct
}
or
iz
eP
{3
}
D
D
D
B
Performance increase, %
Options & Values Contribution to
Performance Increase
3
2.5
2
1.5
1
0.5
0
Options
Each option gives about 2 percent
October 23
Maximizing Intel® Compiler
Performance using IFDO
Software and Services Group
‹#›
1000 Options Impact on Performance
1000.00
900.00
800.00
Number of options
700.00
600.00
500.00
400.00
300.00
200.00
100.00
0.00
< - -80% -80% -40% -20% -10% -5% -2% -1% 0%
100%-100%-40% -20% -10% -5% -2% -1% 0% 1%
1%
2%
2% 5% 10% 20% 40% 80% >100%
5% 10% 20% 40% 80% 100%
Ranges of performance impact, %
Only 600 from 3000 options have zero impact
SECR 2008
October 23
Maximizing Intel® Compiler
Performance using IFDO
Software and Services Group
‹#›
Agenda
•Why IFDO
•What is IFDO
•Alternative approaches
•Experimental Results
•Future plans
SECR 2008
October 23
Maximizing Intel® Compiler
Performance using IFDO
Software and Services Group
‹#›
Future Plans
•Make experiments with all undocumented options
−3000 values if no impact on each other
−more than 23000 combinations!
•Implement storing of application properties
−number of FP expressions, number of loops, etc.
•Implement expert/machine learning system
−suggest options according to application properties
−may decrease number of iterations down to 1
−can substitute existing compiler heuristics?
Useful to both compiler developers and users
SECR 2008
October 23
Maximizing Intel® Compiler
Performance using IFDO
Software and Services Group
‹#›
Summary
•If performance’s critical, try at least ‘icl –O3’!
SECR 2008
October 23
Maximizing Intel® Compiler
Performance using IFDO
Software and Services Group
‹#›
Спасибо!
Thanks!
References 1-13
[1] F. Bodin, T. Kisuki, P. Knijnenburg, M. O’Boyle, and E. Rohou, “Iterative compilation in a non-linear optimization space”, In Proc.
ACM Workshop on Profile and Feedback Directed Compilation, 1998, Organized in conjunction with PACT98.
[2] K. Cooper, D. Subramanian, and L. Torczon, “Adaptive optimizing compilers for the 21st century”, J. of Supercomputing, 32(1),
2002.
[3] J. Bilmes, K. Asanovic, C. Chin, and J. Demmel, “Optimizing matrix multiply using PHiPAC: A portable, high- performance, ANSI C
coding methodology”, In Proc. ICS, pages 340-347, 1997.
[4] M. Stephenson and S. Amarasinghe, “Prediction unroll factor using supervised classification”, In ERRR/ACM International
Symposium on Code Generation and Optimization (CGO 2005), ERRR Computer Society, 2005.
[5] Yom-Toy, J. Thomson, O. Temam, A. Zaks, H. Leather, C. Miranda, M. Namolaru, E. Bonilla, Saclay, B. Mendelson, C. Williams,
Haifa, M. O’Boyle, P. Barnard, E. Ashton, E. Courtois, F. Bodin “MILEPOST GCC: machine learning based research compiler”, ARC,
International, UK, CAPS Enterprise, France, 2007.
[6] K. Hoste, L. Eeckhout, “COLE: Compiler Optimization Level Exploration”, ELIS Department, Ghent University, SingPietersnieuwstraat 41, B-9000 Gent, Belgium, 2008.
[7] K. Deb, “Multi-Objective Optimization using Evolutionary Algorithms”, Wiley, 2001.
[8] G. Fursin, J. Cavazos, M. O’Boyle, and O. Temam, “MiDataSets: Creating the Conditions for a More Realistic Evaluation of Iterative
Optimization”, ALCHEMY Group, INRIA Futurs and LRI, Paris-Sud University, France, 2007.
[9] M. Byler, M. Wolfe, J.R.B. Davies, C. Huson, and B. Leasure, “Multiple version loops.” In ICPP, 1987, pages 312-318, 2005.
[10] K. D. Cooper, M. W. Hall, and K. Kennedy, “Procedure cloning”, In Proceedings of the 1992 IEEE International Conference on
Computer Language, pages 99-105, 1992.
[11] P. Diniz and M. Rinard. “Dynamic feedback: An effective technique for adaptive computing”, In Proc. PLDI, pages 71-84, 1997.
[12] G. Fursin, C. Miranda, S. Pop, A. Cohen, O. Temam, “Practical Run-time Adaptation with Procedure Cloning to Enable Continuous
Collective Compilation”, Alchemy group, INRIA Futurs and LRI, Paris-Sud 11 University, Orsay, France, 2007.
[13] V. Bala, E. Duesterwald, and S. Banerjia, “Dynamo: A transparent dynamic optimization system”, In ACM SIGPLAN Notices,
2000.
SECR 2008
October 23
Maximizing Intel® Compiler
Performance using IFDO
Software and Services Group
‹#›
References 14-23
[14] R. H. Saavedra and D. Park, “Improving the effectiveness of software prefetching with adaptive execution”, In Conference on
Parallel Architectures and Compilation Techniques (PACT’96), 1996.
[15] M. Voss and R. Eigemann, “High-level adaptive program optimization with adapt”, In Proceedings of the Symposium on Principles
and practices of parallel programming, 2001.
[16] G. Fursin, A. Cohen, M. O’Boyle, and O. Temam, “A Practical Method For Quickly Evaluating Program Optimizations”, Institute for
Computing Systems Architecture, University of Edinburgh, UK, 2005.
[17] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder, “Automatically characterizing large scale program behavior”, In 10th
International Conference on Architectural Support for Programming Languages and Operating Systems, 2002.
[18] J. Lau, S. Schoenmackers, and B. Calder, “Transition phase classification and prediction”, In International Symposium on High
Performance Computer Architecture, 2005.
[19] Z. Pan, R. Eignmann, “Fast and Effective Orchestration of Compiler Optimizations for Automatic Performance Tuning.”,
Proceedings of the International Symposium on Code Generation and Optimization, 2006.
[20] S. Triantafyllis, M.J. Bridges, E. Raman, G. Ottoni and D. August, “A Framework for Unrestricted Whole-Program Optimization.”,
Proceedings of the 2006 ACM SIGPLAN Conference on Programming Language Design and Implementation, 2006.
[21] Z. Pan, R. Eignmann, “Fast, Automatic, Procedure-Level Performance Tuning.”, Proceedings of the 15th International Conference
on Parallel Architecture and Compilation Techniques, 2006.
[22] H. Feltl, “Ein Genetischer Algorithmus fuer das Generalized Assignment Problem”, Diplomarbeit, 2003.
[23] M. Haneda, P. Knijnenburg, H. Wijshoff “Automatic Selection of Compiler Options Using Non-Parametristic Inferential Statistics.”,
Proceedings of the 14th International Conference on Parallel Architecture and Compilation Techniques, 2005.
SECR 2008
October 23
Maximizing Intel® Compiler
Performance using IFDO
Software and Services Group
‹#›
Basic Foil with Take-Away Banner
Use Verdana Bold for Main Body Subheadings
• Use Verdana regular for main body text.
• Use charcoal gray (RGB 51 – 51 – 51) color as the default
• Text size can vary. Use these minimum recommended font sizes
−Slide title: 32 pt
−Main body subheadings: 20 pt
−Bullet points: 18 pt (with sub-bullets reducing by 2 pt each – 16 pt, 14 pt, etc)
−Tables: 14 pt
−Diagram and chart labels: 12 pt
• Emphasize with italics, bold or color (blue)
• Use text boxes to highlight content
• Primary text should be on background color, not photos, etc.
• Use bullets same color as text
Standard Take-Away Banner. Add Text Here!
SECR 2008
October 23
Maximizing Intel® Compiler
Performance using IFDO
Software and Services Group
‹#›
Descargar

Using SSG Templates