```Automatic Performance Tuning
of
Sparse-Matrix-Vector-Multiplication (SpMV)
and
Iterative Sparse Solvers
James Demmel
www.cs.berkeley.edu/~demmel/cs267_Spr09
Outline
• Motivation for Automatic Performance Tuning
• Results for sparse matrix kernels
– Sparse Matrix Vector Multiplication (SpMV)
– Sequential and Multicore results
• OSKI = Optimized Sparse Kernel Interface
• Tuning Entire Sparse Solvers
Berkeley Benchmarking and OPtimization (BeBOP)
• Prof. Katherine Yelick
• Current members
– Kaushik Datta, Mark Hoemmen, Marghoob Mohiyuddin,
Shoaib Kamil, Rajesh Nishtala, Vasily Volkov, Sam Williams, …
• Previous members
– Hormozd Gahvari, Eun-Jim Im, Ankit Jain, Rich Vuduc,
• Many results here from current, previous students
• bebop.cs.berkeley.edu
Automatic Performance Tuning
• Goal: Let machine do hard work of writing fast code
• What does tuning of dense BLAS, FFTs, signal processing,
have in common?
– Can do the tuning off-line: once per architecture, algorithm
– Can take as much time as necessary (hours, a week…)
– At run-time, algorithm choice may depend only on few
parameters (matrix dimensions, size of FFT, etc.)
• Can’t always do tuning off-line
– Algorithm and implementation may strongly depend on data
only known at run-time
– Ex: Sparse matrix nonzero pattern determines both best data
structure and implementation of Sparse-matrix-vectormultiplication (SpMV)
– Part of search for best algorithm just be done (very quickly!)
at run-time
Source: Accelerator Cavity Design Problem (Ko via Husbands)
Linear Programming Matrix
…
A Sparse Matrix You Encounter Every Day
SpMV with Compressed Sparse Row (CSR) Storage
Matrix-vector multiply kernel: y(i)

y(i) + A(i,j)*x(j)
for each row i
for k=ptr[i] to ptr[i+1] do
y[i] = y[i] + val[k]*x[ind[k]]
Only 2 flops per
2 mem_refs:
Limited by getting
data from memory
Example: The Difficulty of Tuning
• n = 21200
• nnz = 1.5 M
• kernel: SpMV
• Source: NASA
structural analysis
problem
Example: The Difficulty of Tuning
• n = 21200
• nnz = 1.5 M
• kernel: SpMV
• Source: NASA
structural analysis
problem
• 8x8 dense
substructure:
exploit this to
limit #mem_refs
Speedups on Itanium 2: The Need for Search
Mflop/s
Best: 4x2
Reference
Mflop/s
Register Profile: Itanium 2
1190 Mflop/s
190 Mflop/s
Power3 - 17%
252 Mflop/s
Power4 - 16%
820 Mflop/s
Register Profiles: IBM and Intel IA-64
122 Mflop/s
Itanium 1 - 8%
247 Mflop/s
107 Mflop/s
459 Mflop/s
Itanium 2 - 33%
1.2 Gflop/s
190 Mflop/s
Ultra 2i - 11%
72 Mflop/s
Ultra 3 - 5%
90 Mflop/s
Register Profiles: Sun and Intel x86
35 Mflop/s
Pentium III - 21%
108 Mflop/s
42 Mflop/s
50 Mflop/s
Pentium III-M - 15%
122 Mflop/s
58 Mflop/s
Another example of tuning challenges
• More complicated
non-zero structure
in general
• N = 16614
• NNZ = 1.1M
Zoom in to top corner
• More complicated
non-zero structure
in general
• N = 16614
• NNZ = 1.1M
3x3 blocks look natural, but…
• More complicated non-zero
structure in general
• Example: 3x3 blocking
– Logical grid of 3x3 cells
• But would lead to lots of “fill-in”
Extra Work Can Improve Efficiency!
• More complicated non-zero
structure in general
• Example: 3x3 blocking
–
–
–
–
Logical grid of 3x3 cells
Fill-in explicit zeros
Unroll 3x3 block multiplies
“Fill ratio” = 1.5
• On Pentium III: 1.5x speedup!
– Actual mflop rate 1.52 = 2.25
higher
Automatic Register Block Size Selection
• Selecting the r x c block size
– Off-line benchmark
• Precompute Mflops(r,c) using dense A for each r x c
• Once per machine/architecture
– Run-time “search”
• Sample A to estimate Fill(r,c) for each r x c
– Run-time heuristic model
• Choose r, c to minimize time ~ Fill(r,c) / Mflops(r,c)
Accurate and Efficient Adaptive Fill Estimation
• Idea: Sample matrix
– Fraction of matrix to sample: s  [0,1]
– Control cost = O(s * nnz ) by controlling s
• Search at run-time: the constant matters!
• Control s automatically by computing statistical confidence
intervals, by monitoring variance
• Cost of tuning
– Lower bound: convert matrix in 5 to 40 unblocked SpMVs
– Heuristic: 1 to 11 SpMVs
• Tuning only useful when we do many SpMVs
– Common case, eg in sparse solvers
Accuracy of the Tuning Heuristics (1/4)
See p. 375 of Vuduc’s thesis for matrices
NOTE: “Fair” flops used (ops on explicit zeros not counted as “work”)
Accuracy of the Tuning Heuristics (2/4)
DGEMV
Accuracy of the Tuning Heuristics (2/4)
Upper Bounds on Performance for blocked SpMV
• P = (flops) / (time)
– Flops = 2 * nnz(A)
• Upper bound on speed: Two main assumptions
– 1. Count memory ops only (streaming)
– 2. Count only compulsory, capacity misses: ignore conflicts
• Account for line sizes
• Account for matrix size and nnz
• Charge minimum access “latency” ai at Li cache & amem
– e.g., Saavedra-Barrera and PMaC MAPS benchmarks

T ime a i  Hitsi  a mem  Hitsmem
i 1

 a1  Loads  (a i 1  a i )  Missesi  (a mem  a )  Misses
i 1
Example: Bounds on Itanium 2
Example: Bounds on Itanium 2
Example: Bounds on Itanium 2
Summary of Other Performance Optimizations
• Optimizations for SpMV
–
–
–
–
–
–
–
–
Register blocking (RB): up to 4x over CSR
Variable block splitting: 2.1x over CSR, 1.8x over RB
Diagonals: 2x over CSR
Reordering to create dense structure + splitting: 2x over CSR
Symmetry: 2.8x over CSR, 2.6x over RB
Cache blocking: 2.8x over CSR
Multiple vectors (SpMM): 7x over CSR
And combinations…
• Sparse triangular solve
– Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
– A·AT·x, AT·A·x: 4x over CSR, 1.8x over RB
– A2·x: 2x over CSR, 1.5x over RB
– [A·x, A2·x, A3·x, .. , Ak·x] …. more to say later
Potential Impact on Applications: Omega3P
• Application: accelerator cavity design [Ko]
• Relevant optimization techniques
– Symmetric storage
– Register blocking
– Reordering, to create more dense blocks
• Reverse Cuthill-McKee ordering to reduce bandwidth
– Do Breadth-First-Search, number nodes in reverse order visited
• Traveling Salesman Problem-based ordering to create blocks
–
–
–
–
–
Nodes = columns of A
Weights(u, v) = no. of nz u, v have in common
Tour = ordering of columns
Choose maximum weight tour
See [Pinar & Heath ’97]
• 2.1x speedup on IBM Power 4
Source: Accelerator Cavity Design Problem (Ko via Husbands)
Post-RCM Reordering
100x100 Submatrix Along Diagonal
“Microscopic” Effect of RCM Reordering
Before: Green + Red
After: Green + Blue
“Microscopic” Effect of Combined RCM+TSP Reordering
Before: Green + Red
After: Green + Blue
(Omega3P)
Optimized Sparse Kernel Interface - OSKI
• Provides sparse kernels automatically tuned for user’s
matrix & machine
– BLAS-style functionality: SpMV, Ax & ATy, TrSV
– Hides complexity of run-time tuning
– Includes new, faster locality-aware kernels: ATAx, Akx
• Faster than standard implementations
– Up to 4x faster matvec, 1.8x trisolve, 4x ATA*x
• For “advanced” users & solver library writers
– Available as stand-alone library (OSKI 1.0.1h, 6/07)
– Available as PETSc extension (OSKI-PETSc .1d, 3/06)
– Bebop.cs.berkeley.edu/oski
How the OSKI Tunes (Overview)
Application Run-Time
Library Install-Time (offline)
1. Build for
Target
Arch.
2. Benchmark
Generated
code
variants
Benchmark
data
Matrix
from program
monitoring
History
1. Evaluate
Models
Heuristic
models
2. Select
Data Struct.
& Code
To user:
Matrix handle
for kernel
calls
Extensibility: Advanced users may write & dynamically add “Code variants” and “Heuristic models” to system.
How to Call OSKI: Basic Usage
• May gradually migrate existing apps
– Step 1: “Wrap” existing data structures
– Step 2: Make BLAS-like kernel calls
int* ptr = …, *ind = …; double* val = …; /* Matrix, in CSR format */
double* x = …, *y = …; /* Let x and y be two dense vectors */
/* Compute y = ·y + a·A·x, 500 times */
for( i = 0; i < 500; i++ )
my_matmult( ptr, ind, val, a, x, , y );
How to Call OSKI: Basic Usage
• May gradually migrate existing apps
– Step 1: “Wrap” existing data structures
– Step 2: Make BLAS-like kernel calls
int* ptr = …, *ind = …; double* val = …; /* Matrix, in CSR format */
double* x = …, *y = …; /* Let x and y be two dense vectors */
/* Step 1: Create OSKI wrappers around this data */
oski_matrix_t A_tunable = oski_CreateMatCSR(ptr, ind, val, num_rows,
num_cols, SHARE_INPUTMAT, …);
oski_vecview_t x_view = oski_CreateVecView(x, num_cols, UNIT_STRIDE);
oski_vecview_t y_view = oski_CreateVecView(y, num_rows, UNIT_STRIDE);
/* Compute y = ·y + a·A·x, 500 times */
for( i = 0; i < 500; i++ )
my_matmult( ptr, ind, val, a, x, , y );
How to Call OSKI: Basic Usage
• May gradually migrate existing apps
– Step 1: “Wrap” existing data structures
– Step 2: Make BLAS-like kernel calls
int* ptr = …, *ind = …; double* val = …; /* Matrix, in CSR format */
double* x = …, *y = …; /* Let x and y be two dense vectors */
/* Step 1: Create OSKI wrappers around this data */
oski_matrix_t A_tunable = oski_CreateMatCSR(ptr, ind, val, num_rows,
num_cols, SHARE_INPUTMAT, …);
oski_vecview_t x_view = oski_CreateVecView(x, num_cols, UNIT_STRIDE);
oski_vecview_t y_view = oski_CreateVecView(y, num_rows, UNIT_STRIDE);
/* Compute y = ·y + a·A·x, 500 times */
for( i = 0; i < 500; i++ )
oski_MatMult(A_tunable, OP_NORMAL, a, x_view, , y_view);/* Step 2 */
How to Call OSKI: Tune with Explicit Hints
• User calls “tune” routine
– May provide explicit tuning hints (OPTIONAL)
oski_matrix_t A_tunable = oski_CreateMatCSR( … );
/* … */
/* Tell OSKI we will call SpMV 500 times (workload hint) */
oski_SetHintMatMult(A_tunable, OP_NORMAL, a, x_view, , y_view, 500);
/* Tell OSKI we think the matrix has 8x8 blocks (structural hint) */
oski_SetHint(A_tunable, HINT_SINGLE_BLOCKSIZE, 8, 8);
oski_TuneMat(A_tunable); /* Ask OSKI to tune */
for( i = 0; i < 500; i++ )
oski_MatMult(A_tunable, OP_NORMAL, a, x_view, , y_view);
How the User Calls OSKI: Implicit Tuning
– Library profiles all kernel calls
– May periodically re-tune
oski_matrix_t A_tunable = oski_CreateMatCSR( … );
/* … */
for( i = 0; i < 500; i++ ) {
oski_MatMult(A_tunable, OP_NORMAL, a, x_view, , y_view);
oski_TuneMat(A_tunable); /* Ask OSKI to tune */
}
Multicore SMPs Used for Tuning SpMV
Intel Xeon E5345 (Clovertown)
AMD Opteron 2356 (Barcelona)
Sun T2+ T5140 (Victoria Falls)
Multicore SMPs with Conventional cache-based memory hierarchy
Intel Xeon E5345 (Clovertown)
AMD Opteron 2356 (Barcelona)
Sun T2+ T5140 (Victoria Falls)
Multicore SMPs with local store-based memory hierarchy
Intel Xeon E5345 (Clovertown)
AMD Opteron 2356 (Barcelona)
Sun T2+ T5140 (Victoria Falls)
Multicore SMPs with CMT = Chip-MultiThreading
Intel Xeon E5345 (Clovertown)
Sun T2+ T5140 (Victoria Falls)
AMD Opteron 2356 (Barcelona)
Intel Xeon E5345 (Clovertown)
AMD Opteron 2356 (Barcelona)
Sun T2+ T5140 (Victoria Falls)
*SPEs
only
Multicore SMPs: peak double precision flops
Intel Xeon E5345 (Clovertown)
AMD Opteron 2356 (Barcelona)
75 GFlop/s
74 Gflop/s
Sun T2+ T5140 (Victoria Falls)
19 GFlop/s
29* GFlop/s
*SPEs
only
Multicore SMPs: total DRAM bandwidth
Intel Xeon E5345 (Clovertown)
AMD Opteron 2356 (Barcelona)
10 GB/s (write)
21 GB/s
Sun T2+ T5140 (Victoria Falls)
21 GB/s (write)
51 GB/s
*SPEs
only
Multicore SMPs with Non-Uniform Memory Access - NUMA
Intel Xeon E5345 (Clovertown)
AMD Opteron 2356 (Barcelona)
Sun T2+ T5140 (Victoria Falls)
*SPEs
only
Set of 14 test matrices
• All bigger than the caches of our SMPs
2K x 2K Dense matrix
stored in sparse format
Dense
Well Structured
(sorted by nonzeros/row)
Protein
FEM /
Spheres
FEM /
Cantilever
FEM /
Accelerator
Circuit
webbase
Poorly Structured
hodgepodge
Extreme Aspect Ratio
(linear programming)
LP
Wind
Tunnel
FEM /
Harbor
QCD
FEM /
Ship
Economics Epidemiology
SpMV Performance: Naive parallelization
•
•
Out-of-the box SpMV
performance on a suite of
14 matrices
Scalability isn’t great:
8
8
128 16
Naïve
SpMV Performance: NUMA and Software Prefetching
 NUMA-aware allocation is
essential on NUMA SMPs.
 Explicit software
prefetching can boost
bandwidth and change
cache replacement
policies
 used exhaustive search
SpMV Performance: “Matrix Compression”
 Compression includes
 register blocking
 other formats
 smaller indices
 Use heuristic rather
than search
SpMV Performance: cache and TLB blocking
+Cache/LS/TLB Blocking
+Matrix Compression
+SW Prefetching
+NUMA/Affinity
Naïve
SpMV Performance: Architecture specific optimizations
+Cache/LS/TLB Blocking
+Matrix Compression
+SW Prefetching
+NUMA/Affinity
Naïve
SpMV Performance: max speedup
•
2.7x
2.9x
4.0x
35x
•
•
Fully auto-tuned SpMV
performance across the suite
of matrices
Included SPE/local store
optimized version
Why do some optimizations
work better on some
architectures?
+Cache/LS/TLB Blocking
+Matrix Compression
+SW Prefetching
+NUMA/Affinity
Naïve
Avoiding Communication in Sparse Linear Algebra
• k-steps of typical iterative solver for Ax=b or Ax=λx
– Does k SpMVs with starting vector (eg with b, if solving Ax=b)
– Finds “best” solution among all linear combinations of these k+1 vectors
– Many such “Krylov Subspace Methods”
• Conjugate Gradients, GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication in Krylov Subspace Methods
– Assume matrix “well-partitioned,” with modest surface-to-volume ratio
– Parallel implementation
• Conventional: O(k log p) messages, because k calls to SpMV
• New: O(log p) messages - optimal
– Serial implementation
• Conventional: O(k) moves of data from slow to fast memory
• New: O(1) moves of data – optimal
• Lots of speed up possible (modeled and measured)
– Price: some redundant computation
Locally Dependent Entries for
[x,Ax], A tridiagonal, 2 processors
Proc 1
Proc 2
A8x
A7x
A6x
A5x
A4x
A3x
A2x
Ax
x
Can be computed without communication
Locally Dependent Entries for
[x,Ax,A2x], A tridiagonal, 2 processors
Proc 1
Proc 2
A8x
A7x
A6x
A5x
A4x
A3x
A2x
Ax
x
Can be computed without communication
Locally Dependent Entries for
[x,Ax,…,A3x], A tridiagonal, 2 processors
Proc 1
Proc 2
A8x
A7x
A6x
A5x
A4x
A3x
A2x
Ax
x
Can be computed without communication
Locally Dependent Entries for
[x,Ax,…,A4x], A tridiagonal, 2 processors
Proc 1
Proc 2
A8x
A7x
A6x
A5x
A4x
A3x
A2x
Ax
x
Can be computed without communication
Locally Dependent Entries for
[x,Ax,…,A8x], A tridiagonal, 2 processors
Proc 1
Proc 2
A8x
A7x
A6x
A5x
A4x
A3x
A2x
Ax
x
Can be computed without communication
k=8 fold reuse of A
Remotely Dependent Entries for
[x,Ax,…,A8x], A tridiagonal, 2 processors
Proc 1
Proc 2
A8x
A7x
A6x
A5x
A4x
A3x
A2x
Ax
x
One message to get data needed to compute remotely dependent entries, not k=8
Minimizes number of messages = latency cost
Price: redundant work  “surface/volume ratio”
Remotely Dependent Entries for [x,Ax,A2x,A3x],
A irregular, multiple processors
Sequential [x,Ax,…,A4x], with memory hierarchy
v
One read of matrix from slow memory, not k=4
Minimizes words moved = bandwidth cost
No redundant work
Performance results on 8-Core Clovertown
Optimizing Communication Complexity of Sparse Solvers
• Example: GMRES for Ax=b on “2D Mesh”
– x lives on n-by-n mesh
– Partitioned on p½ -by- p½ grid of p processors
– A has “5 point stencil” (Laplacian)
• (Ax)(i,j) = linear_combination(x(i,j), x(i,j±1), x(i±1,j))
– Ex: 18-by-18 mesh on 3-by-3 grid of 9 processors
Minimizing Communication of GMRES
• What is the cost = (#flops, #words, #mess)
of k steps of standard GMRES?
GMRES, ver.1:
for i=1 to k
w = A * v(i-1)
MGS(w, v(0),…,v(i-1))
update v(i), H
endfor
solve LSQ problem with H
n/p½
• Cost(A * v) = k * (9n2 /p, 4n / p½ , 4 )
• Cost(MGS) = k2/2 * ( 4n2 /p , log p , log p )
• Total cost ~ Cost( A * v ) + Cost (MGS)
• Can we reduce the latency?
n/p½
Minimizing Communication of GMRES
• Cost(GMRES, ver.1) = Cost(A*v) + Cost(MGS)
= ( 9kn2 /p, 4kn / p½ , 4k ) + ( 2k2n2 /p , k2 log p / 2 , k2 log p / 2 )
• How much latency cost from A*v can you avoid? Almost all
GMRES, ver. 2:
W = [ v, Av, A2v, … , Akv ]
[Q,R] = MGS(W)
Build H from R, solve LSQ problem
k=3
• Cost(W) = ( ~ same, ~ same , 8 )
• Latency cost independent of k – optimal
• Cost (MGS) unchanged
• Can we reduce the latency more?
Minimizing Communication of GMRES
• Cost(GMRES, ver. 2) = Cost(W) + Cost(MGS)
= ( 9kn2 /p, 4kn / p½ , 8 ) + ( 2k2n2 /p , k2 log p / 2 , k2 log p / 2 )
• How much latency cost from MGS can you avoid? Almost all
GMRES, ver. 3:
W = [ v, Av, A2v, … , Akv ]
[Q,R] = TSQR(W) … “Tall Skinny QR”
Build H from R, solve LSQ problem
W1
W = W2
W3
W4
R1
R2
R3
R4
R12
R1234
R34
• Cost(TSQR) = ( ~ same, ~ same , log p )
• Latency cost independent of s - optimal
Minimizing Communication of GMRES
• Cost(GMRES, ver. 2) = Cost(W) + Cost(MGS)
= ( 9kn2 /p, 4kn / p½ , 8 ) + ( 2k2n2 /p , k2 log p / 2 , k2 log p / 2 )
• How much latency cost from MGS can you avoid? Almost all
GMRES, ver. 3:
W = [ v, Av, A2v, … , Akv ]
[Q,R] = TSQR(W) … “Tall Skinny QR”
Build H from R, solve LSQ problem
W1
W = W2
W3
W4
R1
R2
R3
R4
R12
R1234
R34
• Cost(TSQR) = ( ~ same, ~ same , log p )
• Oops
Minimizing Communication of GMRES
• Cost(GMRES, ver. 2) = Cost(W) + Cost(MGS)
= ( 9kn2 /p, 4kn / p½ , 8 ) + ( 2k2n2 /p , k2 log p / 2 , k2 log p / 2 )
• How much latency cost from MGS can you avoid? Almost all
GMRES, ver. 3:
W = [ v, Av, A2v, … , Akv ]
[Q,R] = TSQR(W) … “Tall Skinny QR”
Build H from R, solve LSQ problem
W1
W = W2
W3
W4
R1
R2
R3
R4
R12
R1234
R34
• Cost(TSQR) = ( ~ same, ~ same , log p )
• Oops – W from power method, precision lost!
Minimizing Communication of GMRES
• Cost(GMRES, ver. 3) = Cost(W) + Cost(TSQR)
= ( 9kn2 /p, 4kn / p½ , 8 ) + ( 2k2n2 /p , k2 log p / 2 , log p )
• Latency cost independent of k, just log p – optimal
• Oops – W from power method, so precision lost – What to do?
• Use a different polynomial basis
• Not Monomial basis W = [v, Av, A2v, …], instead …
• Newton Basis WN = [v, (A – θ1 I)v , (A – θ2 I)(A – θ1 I)v, …] or
• Chebyshev Basis WC = [v, T1(v), T2(v), …]
Speed ups on 8-core Clovertown
Conclusions
• Fast code must minimize communication
– Especially for sparse matrix computations because
communication dominates
• Generating fast code for a single SpMV
– Design space of possible algorithms must be searched at
run-time, when sparse matrix available
– Design space should be searched automatically
• Biggest speedups from minimizing communication in
an entire sparse solver
– Many more opportunities to minimize communication in
multiple SpMVs than in one
– Requires transforming entire algorithm
– Lots of open problems
EXTRA SLIDES
Quick-and-dirty Parallelism: OSKI-PETSc
• Extend PETSc’s distributed memory SpMV (MATMPIAIJ)
p0
• PETSc
– Each process stores diag
(all-local) and off-diag
submatrices
p1
• OSKI-PETSc:
p2
p3
– Each submatrix tuned
independently
OSKI-PETSc Proof-of-Concept Results
• Matrix 1: Accelerator cavity design (R. Lee @ SLAC)
– N ~ 1 M, ~40 M non-zeros
– 2x2 dense block substructure
– Symmetric
• Matrix 2: Linear programming (Italian Railways)
– Short-and-fat: 4k x 1M, ~11M non-zeros
– Highly unstructured
– Big speedup from cache-blocking: no native PETSc format
• Evaluation machine: Xeon cluster
– Peak: 4.8 Gflop/s per node
Accelerator Cavity Matrix
OSKI-PETSc Performance: Accel. Cavity
Linear Programming Matrix
…
OSKI-PETSc Performance: LP Matrix
Performance Results
•
•
•
•
Measured Multicore (Clovertown) speedups up to 6.4x
Measured/Modeled sequential OOC speedup up to 3x
Modeled parallel Petascale speedup up to 6.9x
Modeled parallel Grid speedup up to 22x
• Sequential speedup due to bandwidth, works for many
problem sizes
• Parallel speedup due to latency, works for smaller problems
on many processors
Speedups on Intel Clovertown (8 core)
Extensions
• Other Krylov methods
– Arnoldi, CG, Lanczos, …
• Preconditioning
– Solve MAx=Mb where preconditioning matrix M chosen to make
system “easier”
• M approximates A-1 somehow, but cheaply, to accelerate convergence
– Cheap as long as contributions from “distant” parts of the system
can be compressed
• Sparsity
• Low rank
Design Space for [x,Ax,…,Akx] (1/3)
• Mathematical Operation
– Keep all vectors
• Krylov Subspace Methods
– Improving conditioning of basis
• W = [x, p1(A)x, p2(A)x,…,pk(A)x]
• pi(A) = degree i polynomial chosen to reduce cond(W)
– Preconditioning (Ay=b  MAy=Mb)
• [x,Ax,MAx,AMAx,MAMAx,…,(MA)kx]
– Keep last vector Akx only
• Jacobi, Gauss Seidel
Design Space for [x,Ax,…,Akx] (2/3)
• Representation of sparse A
– Zero pattern may be explicit or implicit
– Nonzero entries may be explicit or implicit
• Implicit  save memory, communication
Explicit pattern
Implicit pattern
Explicit nonzeros
General sparse matrix Image segmentation
Implicit nonzeros
Laplacian(graph),
for graph partitioning
“Stencil matrix”
Ex: tridiag(-1,2,-1)
• Representation of dense preconditioners M
– Low rank off-diagonal blocks (semiseparable)
Design Space for [x,Ax,…,Akx] (3/3)
• Parallel implementation
– From simple indexing, with redundant flops  surface/volume ratio
– To complicated indexing, with fewer redundant flops
• Sequential implementation
– Depends on whether vectors fit in fast memory
• Reordering rows, columns of A
– Important in parallel and sequential cases
– Can be reduced to pair of Traveling Salesmen Problems
• Plus all the optimizations for one SpMV!
Summary
• Communication-Avoiding Linear Algebra (CALA)
• Lots of related work
– Some going back to 1960’s
– Reports discuss this comprehensively, not here
• Our contributions
– Several new algorithms, improvements on old ones
– Unifying parallel and sequential approaches to avoiding
communication
– Time for these algorithms has come, because of growing
communication costs
– Why avoid communication just for linear algebra motifs?
Possible Class Projects
• Come to BEBOP meetings (T 9 – 10:30, 606 Soda)
• Incorporate multicore optimizations into OSKI
• Experiment with SpMV on GPU
– Which optimizations are most effective?
• Try to speed up particular matrices of interest
– Data mining
• Experiment with new [x,Ax,…,Akx] kernel
– GPU, multicore, distributed memory
– On matrices of interest
– Experiment with new solvers using this kernel
Extra Slides
Tuning Higher Level Algorithms
• So far we have tuned a single sparse matrix kernel
• y = AT*A*x motivated by higher level algorithm (SVD)
• What can we do by extending tuning to a higher level?
• Consider Krylov subspace methods for Ax=b, Ax = lx
• Conjugate Gradients (CG), GMRES, Lanczos, …
• Inner loop does y=A*x, dot products, saxpys, scalar ops
• Inner loop costs at least O(1) messages
• k iterations cost at least O(k) messages
• Our goal: show how to do k iterations with O(1) messages
• Possible payoff – make Krylov subspace methods much
faster on machines with slow networks
• Memory bandwidth improvements too (not discussed)
• Obstacles: numerical stability, preconditioning, …
Krylov Subspace Methods for Solving Ax=b
• Compute a basis for a subspace V by doing y = A*x
k times
• Find “best” solution in that Krylov subspace V
• Given starting vector x1, V spanned by x2 = A*x1, x3
= A*x2 , … , xk = A*xk-1
• GMRES finds an orthogonal basis of V by
Gram-Schmidt, so it actually does a different set of
SpMVs than in last bullet
Example: Standard GMRES
r = b - A*x1,  = length(r), v1 = r / 
… length(r) = sqrt(S ri2 )
for m= 1 to k do
w = A*vm … at least O(1) messages
for i = 1 to m do
… Gram-Schmidt
him = dotproduct(vi , w ) … at least O(1) messages, or log(p)
w = w – h im * vi
end for
hm+1,m = length(w) … at least O(1) messages, or log(p)
vm+1 = w / hm+1,m
end for
find y minimizing length( Hk * y – e1 ) … small, local problem
new x = x1 + Vk * y
… Vk = [v1 , v2 , … , vk ]
O(k2), or O(k2 log p), messages altogether
Latency-Avoiding GMRES (1)
r = b - A*x1,  = length(r), v1 = r /  … O(log p) messages
Wk+1 = [v1 , A * v1 , A2 * v1 , … , Ak * v1 ] … O(1) messages
[Q, R] = qr(Wk+1) … QR decomposition, O(log p) messages
Hk = R(:, 2:k+1) * (R(1:k,1:k))-1 … small, local problem
find y minimizing length( Hk * y – e1 ) … small, local problem
new x = x1 + Qk * y … local problem
O(log p) messages altogether
Independent of k
Latency-Avoiding GMRES (2)
[Q, R] = qr(Wk+1)
… QR decomposition,
O(log p) messages
Easy, but not so stable way to do it:
X(myproc) = Wk+1T(myproc) * Wk+1 (myproc)
… local computation
Y = sum_reduction(X(myproc)) … O(log p) messages
… Y = Wk+1T* Wk+1
R = (cholesky(Y))T … small, local computation
Q(myproc) = Wk+1 (myproc) * R-1 … local computation
Numerical example (1)
Diagonal matrix with n=1000, Aii from 1 down to 10-5
Instability as k grows, after many iterations
Numerical Example (2)
Partial remedy: restarting periodically (every 120 iterations)
Other remedies: high precision, different basis than [v , A * v , … , Ak * v ]
Operation Counts for [Ax,A2x,A3x,…,Akx] on p procs
Problem
Per-proc cost Standard
Optimized
1D mesh
#messages
2k
2
2k
2k
#flops
5kn/p
5kn/p + 5k2
memory
(k+4)n/p
(k+4)n/p + 8k
#messages
26k
26
6kn2p-2/3 + 12knp-1/3
+ O(k)
6kn2p-2/3 + 12k2np-1/3
+ O(k3)
#flops
53kn3/p
53kn3/p + O(k2n2p-2/3)
memory
(k+28)n3/p+ 6n2p-2/3
+ O(np-1/3)
(k+28)n3/p + 168kn2p-2/3
+ O(k2np-1/3)
(tridiagonal) #words sent
3D mesh
27 pt stencil #words sent
Summary and Future Work
• Dense
– LAPACK
– ScaLAPACK
• Communication primitives
• Sparse
– Kernels, Stencils
– Higher level algorithms
• All of the above on new architectures
– Vector, SMPs, multicore, Cell, …
• High level support for tuning
– Specification language
– Integration into compilers
Extra Slides
A Sparse Matrix You Encounter Every Day
Who am I?
I am a
Big Repository
Of useful
And useless
Facts alike.
Who am I?
inbox.)
– Approx. once a month: rank all pages using connectivity structure
• Find dominant eigenvector of a matrix
– At query-time: return list of pages ordered by rank
• Matrix: A = aG + (1-a)(1/n)uuT
– Markov model: Surfer follows link with probability a, jumps to a
random page with probability 1-a
– G is n x n connectivity matrix [n  billions]
• gij is non-zero if page i links to page j
• Normalized so each column sums to 1
• Very sparse: about 7—8 non-zeros per row (power law dist.)
– u is a vector of all 1 values
– Steady-state probability xi of landing on page i is solution to x = Ax
• Approximate x by power method: x = Akx0
– In practice, k  25
Current Work
• Public software release
• Impact on library designs: Sparse BLAS, Trilinos, PETSc, …
• Integration in large-scale applications
– DOE: Accelerator design; plasma physics
– Geophysical simulation based on Block Lanczos (ATA*X; LBL)
• Systematic heuristics for data structure selection?
• Evaluation of emerging architectures
– Revisiting vector micros
• Other sparse kernels
– Matrix triple products, Ak*x
• Parallelism
• Sparse benchmarks (with UTK) [Gahvari & Hoemmen]
• Automatic tuning of MPI collective ops [Nishtala, et al.]
Summary of High-Level Themes
• “Kernel-centric” optimization
– Vs. basic block, trace, path optimization, for instance
– Aggressive use of domain-specific knowledge
• Performance bounds modeling
– Evaluating software quality
– Architectural characterizations and consequences
• Empirical search
– Hybrid on-line/run-time models
• Statistical performance models
– Exploit information from sampling, measuring
Related Work
• My bibliography: 337 entries so far
• Sample area 1: Code generation
– Generative & generic programming
– Sparse compilers
– Domain-specific generators
• Sample area 2: Empirical search-based tuning
– Kernel-centric
• linear algebra, signal processing, sorting, MPI, …
– Compiler-centric
• profiling + FDO, iterative compilation, superoptimizers, selftuning compilers, continuous program optimization
Future Directions (A Bag of Flaky Ideas)
• Composable code generators and search spaces
• New application domains
– PageRank: multilevel block algorithms for topic-sensitive search?
• New kernels: cryptokernels
– rich mathematical structure germane to performance; lots of
hardware
• New tuning environments
– Parallel, Grid, “whole systems”
• Statistical models of application performance
– Statistical learning of concise parametric models from traces for
architectural evaluation
– Compiler/automatic derivation of parametric models
Possible Future Work
• Different Architectures
– New FP instruction sets (SSE2)
– SMP / multicore platforms
– Vector architectures
• Different Kernels
• Higher Level Algorithms
– Parallel versions of kenels, with optimized communication
– Block algorithms (eg Lanczos)
– XBLAS (extra precision)
• Produce Benchmarks
– Augment HPCC Benchmark
• Make it possible to combine optimizations easily for any kernel
• Related tuning activities (LAPACK & ScaLAPACK)
Review of Tuning by Illustration
(Extra Slides)
Splitting for Variable Blocks and Diagonals
• Decompose A = A1 + A2 + … At
–
–
–
–
Detect “canonical” structures (sampling)
Split
Tune each Ai
Improve performance and save storage
• New data structures
– Unaligned block CSR
• Relax alignment in rows & columns
– Row-segmented diagonals
Example: Variable Block Row (Matrix #12)
2.1x
over CSR
1.8x
over RB
Example: Row-Segmented Diagonals
2x
over CSR
Mixed Diagonal and Block Structure
Summary
• Automated block size selection
– Empirical modeling and search
– Register blocking for SpMV, triangular solve, ATA*x
• Not fully automated
– Given a matrix, select splittings and transformations
• Lots of combinatorial problems
– TSP reordering to create dense blocks (Pinar ’97; Moon, et
al. ’04)
Extra Slides
A Sparse Matrix You Encounter Every Day
Who am I?
I am a
Big Repository
Of useful
And useless
Facts alike.
Who am I?
inbox.)
Problem Context
• Sparse kernels abound
– Models of buildings, cars, bridges, economies, …
• Historical trends
–
–
–
–
Sparse matrix-vector multiply (SpMV): 10% of peak
2x faster with “hand-tuning”
Tuning becoming more difficult over time
Promise of automatic tuning: PHiPAC/ATLAS, FFTW, …
• Challenges to high-performance
– Not dense linear algebra!
• Complex data structures: indirect, irregular memory access
• Performance depends strongly on run-time inputs
Key Questions, Ideas, Conclusions
• How to tune basic sparse kernels automatically?
– Empirical modeling and search
•
•
•
•
Up to 4x speedups for SpMV
1.8x for triangular solve
4x for ATA*x; 2x for A2*x
7x for multiple vectors
• What are the fundamental limits on performance?
– Kernel-, machine-, and matrix-specific upper bounds
• Achieve 75% or more for SpMV, limiting low-level tuning
• Consequences for architecture?
• General techniques for empirical search-based tuning?
– Statistical models of performance
•
•
•
•
•
Sparse matrix-vector multiply (SpMV) in a nutshell
Historical trends and the need for search
Automatic tuning techniques
Upper bounds on performance
Statistical models of performance
Compressed Sparse Row (CSR) Storage
Matrix-vector multiply kernel: y(i)

y(i) + A(i,j)*x(j)
for each row i
for k=ptr[i] to ptr[i+1] do
y[i] = y[i] + val[k]*x[ind[k]]
•
•
•
•
•
Sparse matrix-vector multiply (SpMV) in a nutshell
Historical trends and the need for search
Automatic tuning techniques
Upper bounds on performance
Statistical models of performance
Historical Trends in SpMV Performance
• The Data
–
–
–
–
Uniprocessor SpMV performance since 1987
“Untuned” and “Tuned” implementations
Cache-based superscalar micros; some vectors
LINPACK
SpMV Historical Trends: Mflop/s
Example: The Difficulty of Tuning
• n = 21216
• nnz = 1.5 M
• kernel: SpMV
• Source: NASA
structural analysis
problem
Still More Surprises
• More complicated non-zero
structure in general
Still More Surprises
• More complicated non-zero
structure in general
• Example: 3x3 blocking
– Logical grid of 3x3 cells
Historical Trends: Mixed News
• Observations
–
–
–
–
Good news: Moore’s law like behavior
Bad news: “Untuned” is 10% peak or less, worsening
Good news: “Tuned” roughly 2x better today, and improving
– (Not really news: SpMV is not LINPACK)
• Questions
– Application: Automatic tuning?
– Architect: What machines are good for SpMV?
• Sparse matrix-vector multiply (SpMV) in a nutshell
• Historical trends and the need for search
• Automatic tuning techniques
– SpMV [SC’02; IJHPCA ’04b]
– Sparse triangular solve (SpTS) [ICS/POHLL ’02]
– ATA*x [ICCS/WoPLA ’03]
• Upper bounds on performance
• Statistical models of performance
SPARSITY: Framework for Tuning SpMV
• SPARSITY: Automatic tuning for SpMV [Im & Yelick ’99]
– General approach
• Identify and generate implementation space
• Search space using empirical models & experiments
– Prototype library and heuristic for choosing register block size
• Also: cache-level blocking, multiple vectors
• What’s new?
– New block size selection heuristic
• Within 10% of optimal — replaces previous version
– Expanded implementation space
• Variable block splitting, diagonals, combinations
– New kernels: sparse triangular solve, ATA*x, Ar*x
Automatic Register Block Size Selection
• Selecting the r x c block size
– Off-line benchmark: characterize the machine
• Precompute Mflops(r,c) using dense matrix for each r x c
• Once per machine/architecture
– Run-time “search”: characterize the matrix
• Sample A to estimate Fill(r,c) for each r x c
– Run-time heuristic model
• Choose r, c to maximize Mflops(r,c) / Fill(r,c)
• Run-time costs
– Up to ~40 SpMVs (empirical worst case)
DGEMV
Accuracy of the Tuning Heuristics (1/4)
NOTE: “Fair” flops used (ops on explicit zeros not counted as “work”)
DGEMV
Accuracy of the Tuning Heuristics (2/4)
DGEMV
Accuracy of the Tuning Heuristics (3/4)
DGEMV
Accuracy of the Tuning Heuristics (4/4)
•
•
•
•
Sparse matrix-vector multiply (SpMV) in a nutshell
Historical trends and the need for search
Automatic tuning techniques
Upper bounds on performance
– SC’02
• Statistical models of performance
Motivation for Upper Bounds Model
• Questions
– Speedups are good, but what is the speed limit?
• Independent of instruction scheduling, selection
– What machines are “good” for SpMV?
Upper Bounds on Performance: Blocked SpMV
• P = (flops) / (time)
– Flops = 2 * nnz(A)
• Lower bound on time: Two main assumptions
– 1. Count memory ops only (streaming)
– 2. Count only compulsory, capacity misses: ignore conflicts
• Account for line sizes
• Account for matrix size and nnz
• Charge min access “latency” ai at Li cache & amem
– e.g., Saavedra-Barrera and PMaC MAPS benchmarks

T ime a i  Hitsi  a mem  Hitsmem
i 1

 a1  Loads  (a i 1  a i )  Missesi  (a mem  a )  Misses
i 1
Example: Bounds on Itanium 2
Example: Bounds on Itanium 2
Example: Bounds on Itanium 2
Fraction of Upper Bound Across Platforms
Achieved Performance and Machine Balance
• Machine balance [Callahan ’88; McCalpin ’95]
– Balance = Peak Flop Rate / Bandwidth (flops / double)
• Ideal balance for mat-vec:  2 flops / double
– For SpMV, even less
T ime a1  Loads  (ai 1  ai )  Missesi  (a mem  a )  Misses
i
• SpMV ~ streaming
– 1 / (avg load time to stream 1 array) ~ (bandwidth)
– “Sustained” balance = peak flops / model bandwidth
Where Does the Time Go?

T ime a i  Hitsi  a mem  Hitsmem
i 1
• Most time assigned to memory
• Caches “disappear” when line sizes are equal
– Strictly increasing line sizes
Execution Time Breakdown: Matrix 40
Speedups with Increasing Line Size
Summary: Performance Upper Bounds
• What is the best we can do for SpMV?
– Limits to low-level tuning of blocked implementations
– Refinements?
• What machines are good for SpMV?
• Architectural consequences?
– Example: Strictly increasing line sizes
•
•
•
•
•
•
Sparse matrix-vector multiply (SpMV) in a nutshell
Historical trends and the need for search
Automatic tuning techniques
Upper bounds on performance
Tuning other sparse kernels
Statistical models of performance
– FDO ’00; IJHPCA ’04a
Statistical Models for Automatic Tuning
• Idea 1: Statistical criterion for stopping a search
– A general search model
• Generate implementation
• Measure performance
• Repeat
– Stop when probability of being within e of optimal falls below
threshold
• Can estimate distribution on-line
• Idea 2: Statistical performance models
– Problem: Choose 1 among m implementations at run-time
– Sample performance off-line, build statistical model
Example: Select a Matmul Implementation
Example: Support Vector Classification
•
•
•
•
•
•
•
Sparse matrix-vector multiply (SpMV) in a nutshell
Historical trends and the need for search
Automatic tuning techniques
Upper bounds on performance
Tuning other sparse kernels
Statistical models of performance
Summary and Future Work
Summary of High-Level Themes
• “Kernel-centric” optimization
– Vs. basic block, trace, path optimization, for instance
– Aggressive use of domain-specific knowledge
• Performance bounds modeling
– Evaluating software quality
– Architectural characterizations and consequences
• Empirical search
– Hybrid on-line/run-time models
• Statistical performance models
– Exploit information from sampling, measuring
Related Work
• My bibliography: 337 entries so far
• Sample area 1: Code generation
– Generative & generic programming
– Sparse compilers
– Domain-specific generators
• Sample area 2: Empirical search-based tuning
– Kernel-centric
• linear algebra, signal processing, sorting, MPI, …
– Compiler-centric
• profiling + FDO, iterative compilation, superoptimizers, selftuning compilers, continuous program optimization
Future Directions (A Bag of Flaky Ideas)
• Composable code generators and search spaces
• New application domains
– PageRank: multilevel block algorithms for topic-sensitive search?
• New kernels: cryptokernels
– rich mathematical structure germane to performance; lots of
hardware
• New tuning environments
– Parallel, Grid, “whole systems”
• Statistical models of application performance
– Statistical learning of concise parametric models from traces for
architectural evaluation
– Compiler/automatic derivation of parametric models
Acknowledgements
• Undergraduate R.A.s: Attila, Ben, Jen, Jin, Michael,
Rajesh, Shoaib, Sriram, Tuyet-Linh
• See pages xvi—xvii of dissertation.
TSP-based Reordering: Before
(Pinar ’97;
Moon, et al ‘04)
TSP-based Reordering: After
(Pinar ’97;
Moon, et al ‘04)
Up to 2x
speedups
over CSR
Example: L2 Misses on Itanium 2
Misses measured using PAPI [Browne ’00]
Example: Distribution of Blocked Non-Zeros
Sparse/Dense Partitioning for SpTS
• Partition L into sparse (L1,L2) and dense LD:
 L1

 L2
 x1   b1 
    
LD  x2   b2 
• Perform SpTS in three steps:
L1 x1  b1
(2)
bˆ2  b2  L2 x1
(3) L x  bˆ
(1)
D 2
2
• Sparsity optimizations for (1)—(2); DTRSV for (3)
• Tuning parameters: block size, size of dense triangle
SpTS Performance: Power3
Summary of SpTS and AAT*x Results
• SpTS — Similar to SpMV
– 1.8x speedups; limited benefit from low-level tuning
• AATx, ATAx
– Cache interleaving only: up to 1.6x speedups
– Reg + cache: up to 4x speedups
• 1.8x speedup over register only
– Similar heuristic; same accuracy (~ 10% optimal)
– Further from upper bounds: 60—80%
• Opportunity for better low-level tuning a la PHiPAC/ATLAS
• Matrix triple products? Ak*x?
– Preliminary work
Register Blocking: Speedup
Register Blocking: Performance
Register Blocking: Fraction of Peak
Example: Confidence Interval Estimation
Costs of Tuning
Splitting + UBCSR: Pentium III
Splitting + UBCSR: Power4
Splitting+UBCSR Storage: Power4
Example: Variable Block Row (Matrix #13)
Preliminary Results (Matrix Set 2): Itanium 2
Web/IR
Dense
FEM
FEM (var)
Bio
Econ
Stat
LP
Multiple Vector Performance
– Approx. once a month: rank all pages using connectivity structure
• Find dominant eigenvector of a matrix
– At query-time: return list of pages ordered by rank
• Matrix: A = aG + (1-a)(1/n)uuT
– Markov model: Surfer follows link with probability a, jumps to a
random page with probability 1-a
– G is n x n connectivity matrix [n  3 billion]
• gij is non-zero if page i links to page j
• Normalized so each column sums to 1
• Very sparse: about 7—8 non-zeros per row (power law dist.)
– u is a vector of all 1 values
– Steady-state probability xi of landing on page i is solution to x = Ax
• Approximate x by power method: x = Akx0
– In practice, k  25
MAPS Benchmark Example: Power4
MAPS Benchmark Example: Itanium 2
Saavedra-Barrera Example: Ultra 2i
Execution Time Breakdown (PAPI): Matrix 40
Preliminary Results (Matrix Set 1): Itanium 2
Dense
FEM
FEM (var)
Assorted
LP
Tuning Sparse Triangular Solve (SpTS)
• Compute x=L-1*b where L sparse lower triangular, x
& b dense
• L from sparse LU has rich dense substructure
– Dense trailing triangle can account for 20—90% of matrix
non-zeros
• SpTS optimizations
– Split into sparse trapezoid and dense trailing triangle
– Use tuned dense BLAS (DTRSV) on dense triangle
– Use Sparsity register blocking on sparse part
• Tuning parameters
– Size of dense trailing triangle
– Register block size
Sparse Kernels and Optimizations
• Kernels
–
–
–
–
Sparse matrix-vector multiply (SpMV): y=A*x
Sparse triangular solve (SpTS): x=T-1*b
–
–
–
–
–
–
Register blocking
Cache blocking
Multiple dense vectors (x)
A has special structure (e.g., symmetric, banded, …)
Hybrid data structures (e.g., splitting, switch-to-dense, …)
Matrix reordering
y=AAT*x, y=ATA*x
Powers (y=Ak*x), sparse triple-product (R*A*RT), …
• Optimization techniques (implementation space)
• How and when do we search?
– Off-line: Benchmark implementations
– Run-time: Estimate matrix properties, evaluate performance
models based on benchmark data
Cache Blocked SpMV on LSI Matrix: Ultra 2i
A
10k x 255k
3.7M non-zeros
Baseline:
16 Mflop/s
Best block size
& performance:
16k x 64k
28 Mflop/s
Cache Blocking on LSI Matrix: Pentium 4
A
10k x 255k
3.7M non-zeros
Baseline:
44 Mflop/s
Best block size
& performance:
16k x 16k
210 Mflop/s
Cache Blocked SpMV on LSI Matrix: Itanium
A
10k x 255k
3.7M non-zeros
Baseline:
25 Mflop/s
Best block size
& performance:
16k x 32k
72 Mflop/s
Cache Blocked SpMV on LSI Matrix: Itanium 2
A
10k x 255k
3.7M non-zeros
Baseline:
170 Mflop/s
Best block size
& performance:
16k x 65k
275 Mflop/s
Inter-Iteration Sparse Tiling (1/3)
x1
t1
y1
x2
t2
y2
x3
t3
y3
x4
t4
y4
x5
t5
y5
• [Strout, et al., ‘01]
• Let A be 6x6 tridiagonal
• Consider y=A2x
– t=Ax, y=At
• Nodes: vector elements
• Edges: matrix elements aij
Inter-Iteration Sparse Tiling (2/3)
x1
t1
y1
x2
t2
y2
x3
x4
t3
t4
y3
y4
• [Strout, et al., ‘01]
• Let A be 6x6 tridiagonal
• Consider y=A2x
– t=Ax, y=At
• Nodes: vector elements
• Edges: matrix elements aij
• Orange = everything needed
to compute y1
– Reuse a11, a12
x5
t5
y5
Inter-Iteration Sparse Tiling (3/3)
x1
t1
y1
x2
t2
y2
x3
x4
t3
t4
y3
y4
• [Strout, et al., ‘01]
• Let A be 6x6 tridiagonal
• Consider y=A2x
– t=Ax, y=At
• Nodes: vector elements
• Edges: matrix elements aij
• Orange = everything needed
to compute y1
– Reuse a11, a12
x5
t5
y5
• Grey = y2, y3
– Reuse a23, a33, a43
Inter-Iteration Sparse Tiling: Issues
x1
t1
y1
x2
t2
y2
x3
t3
y3
x4
t4
y4
x5
t5
y5
• Tile sizes (colored regions)
grow with no. of iterations
and increasing out-degree
– G likely to have a few nodes
with high out-degree (e.g.,
Yahoo)
• Mathematical tricks to limit
tile size?
– Judicious dropping of edges
[Ng’01]
Summary and Questions
• Need to understand matrix structure and machine
– BeBOP: suite of techniques to deal with different sparse structures
and architectures
– Established techniques within an iteration
– Ideas for inter-iteration optimizations
– Mathematical structure of problem may help
• Questions
– Structure of G?
– What are the computational bottlenecks?
– Enabling future computations?
• E.g., topic-sensitive PageRank  multiple vector version [Haveliwala
’02]
including more complete Itanium 2 results.
Exploiting Matrix Structure
• Symmetry (numerical or structural)
– Reuse matrix entries
– Can combine with register blocking, multiple vectors, …
• Matrix splitting
– Split the matrix, e.g., into r x c and 1 x 1
• Large matrices with random structure
– E.g., Latent Semantic Indexing (LSI) matrices
– Technique: cache blocking
• Store matrix as 2i x 2j sparse submatrices
• Effective when x vector is large
• Currently, search to find fastest size
Symmetric SpMV Performance: Pentium 4
SpMV with Split Matrices: Ultra 2i
Cache Blocking on Random Matrices: Itanium
Speedup on four banded
random matrices.
Sparse Kernels and Optimizations
• Kernels
–
–
–
–
Sparse matrix-vector multiply (SpMV): y=A*x
Sparse triangular solve (SpTS): x=T-1*b
–
–
–
–
–
–
Register blocking
Cache blocking
Multiple dense vectors (x)
A has special structure (e.g., symmetric, banded, …)
Hybrid data structures (e.g., splitting, switch-to-dense, …)
Matrix reordering
y=AAT*x, y=ATA*x
Powers (y=Ak*x), sparse triple-product (R*A*RT), …
• Optimization techniques (implementation space)
• How and when do we search?
– Off-line: Benchmark implementations
– Run-time: Estimate matrix properties, evaluate performance
models based on benchmark data
Register Blocked SpMV: Pentium III
Register Blocked SpMV: Ultra 2i
Register Blocked SpMV: Power3
Register Blocked SpMV: Itanium
Possible Optimization Techniques
• Within an iteration, i.e., computing (G+uuT)*x once
– Cache block G*x
• On linear programming matrices and matrices with random
structure (e.g., LSI), 1.5—4x speedups
• Best block size is matrix and machine dependent
– Reordering and/or splitting of G to separate dense structure
(rows, columns, blocks)
• Between iterations, e.g., (G+uuT)2x
– (G+uuT)2x = G2x + (Gu)uTx + u(uTG)x + u(uTu)uTx
• Compute Gu, uTG, uTu once for all iterations
• G2x: Inter-iteration tiling to read G only once
Multiple Vector Performance: Itanium
Sparse Kernels and Optimizations
• Kernels
–
–
–
–
Sparse matrix-vector multiply (SpMV): y=A*x
Sparse triangular solve (SpTS): x=T-1*b
–
–
–
–
–
–
Register blocking
Cache blocking
Multiple dense vectors (x)
A has special structure (e.g., symmetric, banded, …)
Hybrid data structures (e.g., splitting, switch-to-dense, …)
Matrix reordering
y=AAT*x, y=ATA*x
Powers (y=Ak*x), sparse triple-product (R*A*RT), …
• Optimization techniques (implementation space)
• How and when do we search?
– Off-line: Benchmark implementations
– Run-time: Estimate matrix properties, evaluate performance
models based on benchmark data
SpTS Performance: Itanium
(See POHLL ’02 workshop paper, at ICS ’02.)
Sparse Kernels and Optimizations
• Kernels
–
–
–
–
Sparse matrix-vector multiply (SpMV): y=A*x
Sparse triangular solve (SpTS): x=T-1*b
–
–
–
–
–
–
Register blocking
Cache blocking
Multiple dense vectors (x)
A has special structure (e.g., symmetric, banded, …)
Hybrid data structures (e.g., splitting, switch-to-dense, …)
Matrix reordering
y=AAT*x, y=ATA*x
Powers (y=Ak*x), sparse triple-product (R*A*RT), …
• Optimization techniques (implementation space)
• How and when do we search?
– Off-line: Benchmark implementations
– Run-time: Estimate matrix properties, evaluate performance
models based on benchmark data
Optimizing AAT*x
• Kernel: y=AAT*x, where A is sparse, x & y dense
– Arises in linear programming, computation of SVD
– Conventional implementation: compute z=AT*x, y=A*z
• Elements of A can be reused:
 a1T 
n
 
y  a1 an    x   ak (akT x)
k 1
 aT 
 n
• When ak represent blocks of columns, can apply register
blocking.
Optimized AAT*x Performance: Pentium III
Current Directions
• Applying new optimizations
– Other split data structures (variable block, diagonal, …)
– Matrix reordering to create block structure
– Structural symmetry
• New kernels (triple product RART, powers Ak, …)
• Tuning parameter selection
• Building an automatically tuned sparse matrix library
– Extending the Sparse BLAS
– Leverage existing sparse compilers as code generation
infrastructure
– More thoughts on this topic tomorrow
Related Work
• Automatic performance tuning systems
– PHiPAC [Bilmes, et al., ’97], ATLAS [Whaley & Dongarra ’98]
– FFTW [Frigo & Johnson ’98], SPIRAL [Pueschel, et al., ’00],
UHFFT [Mirkovic and Johnsson ’00]
– MPI collective operations [Vadhiyar & Dongarra ’01]
• Code generation
– FLAME [Gunnels & van de Geijn, ’01]
– Sparse compilers: [Bik ’99], Bernoulli [Pingali, et al., ’97]
– Generic programming: Blitz++ [Veldhuizen ’98], MTL [Siek &
Lumsdaine ’98], GMCL [Czarnecki, et al. ’98], …
• Sparse performance modeling
– [Temam & Jalby ’92], [White & Saddayappan ’97], [Navarro,
et al., ’96], [Heras, et al., ’99], [Fraguela, et al., ’99], …
More Related Work
• Compiler analysis, models
– CROPS [Carter, Ferrante, et al.]; Serial sparse tiling [Strout
’01]
– TUNE [Chatterjee, et al.]
– Iterative compilation [O’Boyle, et al., ’98]
– Broadway compiler [Guyer & Lin, ’99]
– [Brewer ’95], ADAPT [Voss ’00]
• Sparse BLAS interfaces
–
–
–
–
BLAST Forum (Chapter 3)
NIST Sparse BLAS [Remington & Pozo ’94]; SparseLib++
Parallel Sparse BLAS [Fillipone, et al. ’96]
Context: Creating High-Performance Libraries
• Application performance dominated by a few
computational kernels
• Today: Kernels hand-tuned by vendor or user
• Performance tuning challenges
– Performance is a complicated function of kernel,
– Tedious and time-consuming
• Successful automated approaches
– Dense linear algebra: ATLAS/PHiPAC
– Signal processing: FFTW/SPIRAL/UHFFT
Cache Blocked SpMV on LSI Matrix: Itanium
Sustainable Memory Bandwidth
Multiple Vector Performance: Pentium 4
Multiple Vector Performance: Itanium
Multiple Vector Performance: Pentium 4
Optimized AAT*x Performance: Ultra 2i
Optimized AAT*x Performance: Pentium 4
High Precision GEMV (XBLAS)
High Precision Algorithms (XBLAS)
• Double-double (High precision word represented as pair of doubles)
– Many variations on these algorithms; we currently use Bailey’s
• Exploiting Extra-wide Registers
– Suppose s(1) , … , s(n) have f-bit fractions, SUM has F>f bit fraction
– Consider following algorithm for S = Si=1,n s(i)
• Sort so that |s(1)|  |s(2)|  …  |s(n)|
• SUM = 0, for i = 1 to n SUM = SUM + s(i), end for, sum = SUM
– Theorem (D., Hida) Suppose F<2f (less than double precision)
• If n  2F-f + 1, then error  1.5 ulps
• If n = 2F-f + 2, then error  22f-F ulps (can be  1)
• If n  2F-f + 3, then error can be arbitrary (S  0 but sum = 0 )
– Examples
• s(i) double (f=53), SUM double extended (F=64)
– accurate if n  211 + 1 = 2049
• Dot product of single precision x(i) and y(i)
– s(i) = x(i)*y(i) (f=2*24=48), SUM double extended (F=64) 
– accurate if n  216 + 1 = 65537
More Extra Slides
Awards
• Best Paper, Intern. Conf. Parallel Processing, 2004
– “Performance models for evaluation and automatic performance tuning of
symmetric sparse matrix-vector multiply”
• Best Student Paper, Intern. Conf. Supercomputing, Workshop
on Performance Optimization via High-Level Languages and
Libraries, 2003
– Best Student Presentation too, to Richard Vuduc
– “Automatic performance tuning and analysis of sparse triangular solve”
• Finalist, Best Student Paper, Supercomputing 2002
– To Richard Vuduc
– “Performance Optimization and Bounds for Sparse Matrix-vector Multiply”
• Best Presentation Prize, MICRO-33: 3rd ACM Workshop on
Feedback-Directed Dynamic Optimization, 2000
– To Richard Vuduc
– “Statistical Modeling of Feedback Data in an Automatic Tuning System”
Accuracy of the Tuning Heuristics (4/4)
DGEMV
Can Match DGEMV Performance
Fraction of Upper Bound Across Platforms
Achieved Performance and Machine Balance
• Machine balance [Callahan ’88; McCalpin ’95]
– Balance = Peak Flop Rate / Bandwidth (flops / double)
– Lower is better (I.e. can hope to get higher fraction of peak
flop rate)
• Ideal balance for mat-vec:  2 flops / double
– For SpMV, even less
Where Does the Time Go?

T ime a i  Hitsi  a mem  Hitsmem
i 1
• Most time assigned to memory
• Caches “disappear” when line sizes are equal
– Strictly increasing line sizes
Execution Time Breakdown: Matrix 40
Execution Time Breakdown (PAPI): Matrix 40
Speedups with Increasing Line Size
Summary: Performance Upper Bounds
• What is the best we can do for SpMV?
– Limits to low-level tuning of blocked implementations
– Refinements?
• What machines are good for SpMV?
• Architectural consequences?
– Help to have strictly increasing line sizes
Evaluating algorithms and machines for SpMV
• Metrics
– Speedups
– Mflop/s (“fair” flops)
– Fraction of peak
• Questions
– Speedups are good, but what is “the best?”
• Independent of instruction scheduling, selection
• Can SpMV be further improved or not?
– What machines are “good” for SpMV?
Statistical Models for Automatic Tuning
• Idea 1: Statistical criterion for stopping a search
– A general search model
• Generate implementation
• Measure performance
• Repeat
– Stop when probability of being within e of optimal falls below
threshold
• Can estimate distribution on-line
• Idea 2: Statistical performance models
– Problem: Choose 1 among m implementations at run-time
– Sample performance off-line, build statistical model
SpMV Historical Trends: Fraction of Peak
Motivation for Automatic Performance Tuning of SpMV
• Historical trends
– Sparse matrix-vector multiply (SpMV): 10% of peak or less
• Performance depends on machine, kernel, matrix
– Matrix known at run-time
– Best data structure + implementation can be surprising
• Our approach: empirical performance modeling and
algorithm search
Sparse CALA Summary
• Optimizing one SpMV too limiting
• Take k steps of Krylov subspace method
– GMRES, CG, Lanczos, Arnoldi
– Assume matrix “well-partitioned,” with modest surface-to-volume
ratio
– Parallel implementation
• Conventional: O(k log p) messages
• CALA: O(log p) messages - optimal
– Serial implementation
• Conventional: O(k) moves of data from slow to fast memory
• CALA: O(1) moves of data – optimal
– Can incorporate some preconditioners
• Need to be able to “compress” interactions between distant i, j
• Hierarchical, semiseparable matrices …
– Lots of speed up possible (measured and modeled)
• Price: some redundant computation
Motivation for Automatic Performance Tuning
• Writing high performance software is hard
– Make programming easier while getting high speed
• Ideal: program in your favorite high level language
(Matlab, Python, PETSc…) and get a high fraction of
peak performance
• Reality: Best algorithm (and its implementation) can
depend strongly on the problem, computer
architecture, compiler,…
– Best choice can depend on knowing a lot of applied
mathematics and computer science
• How much of this can we teach?
• How much of this can we automate?
Examples of Automatic Performance Tuning (1)
• Dense BLAS
– Sequential
– PHiPAC (UCB), then ATLAS (UTK) (used in Matlab)
– math-atlas.sourceforge.net/
• Fast Fourier Transform (FFT) & variations
– Sequential and Parallel
– FFTW (MIT)
– www.fftw.org
• Digital Signal Processing
– SPIRAL: www.spiral.net (CMU)
• Communication Collectives (UCB, UTK)
• Rose (LLNL), Bernoulli (Cornell), Telescoping Languages (Rice), …
• More projects, conferences, government reports, …
Potential Impact on Applications: T3P
• Application: accelerator design [Ko]
• 80% of time spent in SpMV
• Relevant optimization techniques
– Symmetric storage
– Register blocking
• On Single Processor Itanium 2
– 1.68x speedup
• 532 Mflops, or 15% of 3.6 GFlop peak
– 4.4x speedup with multiple (8) vectors
• 1380 Mflops, or 38% of peak
Examples of Automatic Performance Tuning (2)
• What do dense BLAS, FFTs, signal processing, MPI reductions
have in common?
– Can do the tuning off-line: once per architecture, algorithm
– Can take as much time as necessary (hours, a week…)
– At run-time, algorithm choice may depend only on few parameters
• Matrix dimension, size of FFT, etc.
Ultra 2i - 9%
63 Mflop/s
Ultra 3 - 5%
109 Mflop/s
SpMV Performance (Matrix #2): Generation 2
35 Mflop/s
Pentium III - 19%
96 Mflop/s
42 Mflop/s
53 Mflop/s
Pentium III-M - 15%
120 Mflop/s
58 Mflop/s
Power3 - 13%
195 Mflop/s
Power4 - 14%
703 Mflop/s
SpMV Performance (Matrix #2): Generation 1
100 Mflop/s
Itanium 1 - 7%
225 Mflop/s
103 Mflop/s
469 Mflop/s
Itanium 2 - 31%
1.1 Gflop/s
276 Mflop/s
Taking advantage of block structure in SpMV
• Bottleneck is time to get matrix from memory
• Don’t store each nonzero with index, instead store
each nonzero r-by-c block with index
– Storage drops by up to 2x, if rc >> 1, all 32-bit quantities
– Time to fetch matrix from memory decreases
• Change both data structure and algorithm
– Need to pick r and c
– Need to change algorithm accordingly
• In example, is r=c=8 best choice?
– Minimizes storage, so looks like a good idea…
Example: L2 Misses on Itanium 2
Misses measured using PAPI [Browne ’00]
Example: Sparse Triangular Factor
• Raefsky4 (structural
problem) + SuperLU +
colmmd
• N=19779, nnz=12.6 M
Dense trailing triangle:
dim=2268, 20% of total
nz
Can be as high as 90+%!
1.8x over CSR
Cache Optimizations for AAT*x
• Cache-level: Interleave multiplication by A, AT
– Only fetch A from memory once
T
1
T
…
1
n
T
n
…
a

AA  x  a  a  

a


n

T
  x   ai (ai x)

i 1

“axpy”
dot product
• Register-level: aiT to be rc block row, or diag row
Example: Combining Optimizations (1/2)
• Register blocking, symmetry, multiple (k) vectors
– Three low-level tuning parameters: r, c, v
X
k
v
*
r
c
+=
Y
A
Example: Combining Optimizations (2/2)
• Register blocking, symmetry, and multiple vectors
[Ben Lee @ UCB]
– Symmetric, blocked, 1 vector
• Up to 2.6x over nonsymmetric, blocked, 1 vector
– Symmetric, blocked, k vectors
• Up to 2.1x over nonsymmetric, blocked, k vecs.
• Up to 7.3x over nonsymmetric, nonblocked, 1, vector
– Symmetric Storage: up to 64.7% savings
How the OSKI Tunes (Overview)
• At library build/install-time
– Pre-generate and compile code variants into dynamic libraries
– Collect benchmark data
• Measures and records speed of possible sparse data structure and code
variants on target architecture
– Installation process uses standard, portable GNU AutoTools
• At run-time
– Library “tunes” using heuristic models
• Models analyze user’s matrix & benchmark data to choose optimized
data structure and code
– Non-trivial tuning cost: up to ~40 mat-vecs
• Library limits the time it spends tuning based on estimated workload
– provided by user or inferred by library
• User may reduce cost by saving tuning results for application on future
runs with same or similar matrix
Optimizations in OSKI, so far
•
Fully automatic heuristics for
– Sparse matrix-vector multiply
• Register-level blocking
• Register-level blocking + symmetry + multiple vectors
• Cache-level blocking
– Sparse triangular solve with register-level blocking and “switch-to-dense”
optimization
– Sparse ATA*x with register-level blocking
•
User may select other optimizations manually
•
“Plug-in” extensibility
– Diagonal storage optimizations, reordering, splitting; tiled matrix powers
kernel (Ak*x)
– All available in dynamic libraries
– Accessible via high-level embedded script language
– Very advanced users may write their own heuristics, create new data
structures/code variants and dynamically add them to the system
Performance Results
• Measured
– Sequential/OOC speedup up to 3x
• Modeled
– Sequential/multicore speedup up to 2.5x
– Parallel/Petascale speedup up to 6.9x
– Parallel/Grid speedup up to 22x
• See bebop.cs.berkeley.edu/#pubs
```