Lecture 6:
Linear Algebra Algorithms
Jack Dongarra, U of Tennessee
Slides are adapted from Jim Demmel, UCB's lecture on Linear Algebra Algorithms
 /2
part1: (2 points) Write
(1) code produce correct result: 2/2.
(2) code run but could not give accurate result: 1.5/2.
(3) code could not run or run but nothing: 1/2
°
°
°
°
Note: If you only submit one program which can print "processor
contribution" and "total integral" correctly, I will give 1/2
for this part and 2/2 for part2. so you total score for part1
and part2 will be 3/4.
part2: (2 points) Modify
1  (cos x ) / 2
2
your program to sum the contributions to give the total value of the integral.
(1) code produce correct result: 2/2.
(2) code run but could not give accurate result: 1.5/2.
(3) code could not run or run but nothing: 1/2
part3: (4 points) Give an expression for the efficiency
° (1) give t_p(1), t_p(p), S(p) and E(p) correctly: 4/4.
° (2) give t_p(1), t_p(p) and S(p) correctly: 3/4.
° (3) give t_p(1) and t_p(p) correctly: 2/4.
° (4) give other: 1/4.
part4: (2 points) Comment
°
°
°
°
°
°
°
0
dx
an MPI program
°
°
°
°
°
°

of the parallel algorithm for m subintervals
on the scalability of the algorithm
(1) contain sentence similar as "if work being evenly divided
and the summation can be performed in a tree fashion, the algorithm
is scalable" or give very detail discussion correctly about: (a)
t_(p), (b) S(p) and (c) E(P). 2/2
(2) give correct discussion about any of: (a) t_(p), (b) S(p)
and (c) E(P). 1/2
(3) do not make sense. 0/2
2
° If we are adding an array of numbers and have a
choice, do we add them from largest to smallest or
smallest to largest?
3
Parallel Performance Metrics
• Absolute: Elapsed (wall-clock) time = T(n)
• Speedup = S(n) = T(1) / T(n),
where T(1) is the time for the best serial implementation
=> performance improvement due to parallelism
• Parallel Efficiency = E(n) = T(1) / [n * T(n)]
• Ideal Speedup = SI(n) = n
- Theoretical limit; rarely attained
- Ignores all of real life
• These definitions apply to a fixed-problem experiment.
4
Amdahl’s Law
Amdahl’s Law places a strict limit on the speedup that can be
realized by using multiple processors. Two equivalent
expressions for Amdahl’s Law are given below:
tN = (fp/N + fs)t1
Effect of multiple processors on run time
S = 1/(fs + fp/N)
Effect of multiple processors on speedup
Where:
fs = serial fraction of code
fp = parallel fraction of code = 1 - fs
N = number of processors
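As a quick check of the two formulas above, here is a small C sketch (the function name is mine, not from the slides) that evaluates S = 1/(fs + fp/N):

```c
#include <assert.h>
#include <math.h>

/* Amdahl speedup S = 1/(fs + fp/N), with fp = 1 - fs.
   fs: serial fraction of the code, n: number of processors. */
double amdahl_speedup(double fs, int n) {
    double fp = 1.0 - fs;
    return 1.0 / (fs + fp / (double)n);
}
```

With fs = 0.01 and N = 250 this gives a speedup of only about 71.6, far below the ideal 250; as N grows without bound the speedup saturates at 1/fs = 100.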
5
Illustration of Amdahl’s Law
It takes only a small fraction of serial content in a code to degrade its
parallel performance. It is essential to determine the scaling behavior of
your code before doing production runs on large numbers of processors.
[Figure: speedup (0 to 250) vs. number of processors (0 to 250) for fp = 1.000, 0.999, 0.990, and 0.900]
6
Amdahl’s Law Vs. Reality
Amdahl’s Law provides a theoretical upper limit on parallel
speedup assuming that there are no costs for
communications. In reality, communications (and I/O) will
result in a further degradation of performance.
[Figure: speedup (0 to 80) vs. number of processors (0 to 250) for fp = 0.99: Amdahl's Law prediction vs. reality]
So what’s going on?
7
More on Amdahl’s Law
° Amdahl’s Law can be generalized to any two
processes with different speeds
° Ex.: Apply to f_processor and f_memory:
• The growing processor-memory performance gap will undermine
our efforts at achieving maximum possible speedup!
8
Gustafson’s Law
° Thus, Amdahl’s Law predicts that there is a
maximum scalability for an application, determined
by its parallel fraction, and this limit is generally not
large.
tN = (fp/N + fs)t1
S = 1/(fs + fp/N)
Effect of multiple processors on run time
Effect of multiple processors on speedup
Where:
fs = serial fraction of code
fp = parallel fraction of code = 1 - fs
N = number of processors
° There is a way around this: increase the problem
size
• bigger problems mean bigger grids or more particles: bigger
arrays
• number of serial operations generally remains constant; number
of parallel operations increases: parallel fraction increases
9
Fixed-Problem Size Scaling
• a.k.a. Fixed-load, Fixed-Problem Size, Strong Scaling,
Problem-Constrained, constant-problem size (CPS),
variable subgrid
• Amdahl Limit: SA(n) = T(1) / T(n) = 1 / ( f/n + (1 - f) )
• This bounds the speedup based only on the fraction of the code
that cannot use parallelism ( 1 - f ); it ignores all other factors
• SA --> 1 / ( 1 - f ) as n --> infinity
10
Fixed-Problem Size Scaling (Cont’d)
• Efficiency (n) = T(1) / [ T(n) * n]
• Memory requirements decrease with n
• Surface-to-volume ratio increases with n
• Superlinear speedup possible from cache effects
• Motivation: what is the largest # of procs I can use effectively and
what is the fastest time that I can solve a given problem?
• Problems:
- Sequential runs often not possible (large problems)
- Speedup (and efficiency) is misleading if processors are slow
11
Fixed-Problem Size Scaling: Examples
S. Goedecker, Achieving High Performance in Numerical Computations on RISC Workstations and Parallel Systems, International Conference on Computational Physics: PC'97, Santa Cruz, August 25-28, 1997.
12
Fixed-Problem Size Scaling Examples
13
Scaled Speedup Experiments
• a.k.a. Fixed Subgrid-Size, Weak Scaling, Gustafson scaling.
• Motivation: Want to use a larger machine to solve a larger global
problem in the same amount of time.
• Memory and surface-to-volume effects remain constant.
14
Top500 Data

Rank | Manufacturer | Computer                           | Installation Site                      | Rmax [TF/s] | Rpeak [TF/s] | Ratio Max/Peak | # Procs
  1  | IBM          | ASCI White, SP Power3              | Lawrence Livermore National Laboratory |  7.23       | 12           | 60.25          | 8192
  2  | Compaq       | AlphaServer SC ES45 1 GHz          | Pittsburgh Supercomputing Center       |  4.06       |  6.05        | 67.11          | 3024
  3  | IBM          | SP Power3 375 MHz                  | NERSC/LBNL                             |  3.05       |  5.00        | 61.00          | 3328
  4  | Intel        | ASCI Red                           | Sandia National Laboratory             |  2.38       |  3.207       | 74.21          | 9632
  5  | IBM          | ASCI Blue Pacific SST, IBM SP 604E | Lawrence Livermore National Laboratory |  2.14       |  3.868       | 55.33          | 5808
  6  | Compaq       | AlphaServer SC ES45 1 GHz          | Los Alamos National Laboratory         |  2.10       |  3.04        | 69.08          | 1536
  7  | Hitachi      | SR8000/MPP                         | University of Tokyo                    |  1.71       |  2.69        | 63.57          | 1152
  8  | SGI          | ASCI Blue Mountain                 | Los Alamos National Laboratory         |  1.61       |  2.52        | 63.89          | 6144
  9  | IBM          | SP Power3 375 MHz                  | Naval Oceanographic Office (NAVOCEANO) |  1.42       |  1.97        | 72.08          | 1336
 10  | IBM          | SP Power3 375 MHz                  | Deutscher Wetterdienst                 |  1.29       |  1.83        | 70.49          | 1280
15
Example of a Scaled Speedup Experiment
TBON on ASCI Red

Processors | NChains | Time  | Natoms | Time per Atom per PE | Time per Atom | Efficiency
    1      |   32    | 38.4  |   2368 | 1.62E-02             | 1.62E-02      | 1.000
    2      |   64    | 38.4  |   4736 | 8.11E-03             | 1.62E-02      | 1.000
    4      |  128    | 38.5  |   9472 | 4.06E-03             | 1.63E-02      | 0.997
    8      |  256    | 38.6  |  18944 | 2.04E-03             | 1.63E-02      | 0.995
   16      |  512    | 38.7  |  37888 | 1.02E-03             | 1.63E-02      | 0.992
   32      |  940    | 35.7  |  69560 | 5.13E-04             | 1.64E-02      | 0.987
   64      | 1700    | 32.7  | 125800 | 2.60E-04             | 1.66E-02      | 0.975
  128      | 2800    | 27.4  | 207200 | 1.32E-04             | 1.69E-02      | 0.958
  256      | 4100    | 20.75 | 303400 | 6.84E-05             | 1.75E-02      | 0.926
  512      | 5300    | 14.49 | 392200 | 3.69E-05             | 1.89E-02      | 0.857

[Figure: efficiency (0.44 to 1.04) vs. number of processors (0 to 600)]
16
Parallel Performance Metrics: Speedup
Relative performance:
Absolute performance:
[Figures: relative performance as speedup (0 to 60) vs. processors (0 to 60) for the T3E and O2K with their ideal-speedup lines; absolute performance as MFLOPS (0 to 2000) vs. processors (0 to 48) for the same two machines]
Speedup is only one characteristic of a program - it is
not synonymous with performance. In this comparison of two
machines the code achieves comparable speedups but one of
the machines is faster.
17
18
Improving Ratio of Floating Point Operations
to Memory Accesses
      subroutine mult(n1,nd1,n2,nd2,y,a,x)
      implicit real*8 (a-h,o-z)
      dimension a(nd1,nd2),y(nd2),x(nd1)
      do 20, i=1,n1
         t=0.d0
         do 10, j=1,n2
            t=t+a(j,i)*x(j)      **** 2 FLOPS
 10      continue
         y(i)=t
 20   continue
      return
      end
Unroll the loops
19
Improving Ratio of Floating Point Operations to Memory Accesses
c     works correctly when n1,n2 are multiples of 4
      dimension a(nd1,nd2), y(nd2), x(nd1)
      do i=1,n1-3,4
         t1=0.d0
         t2=0.d0
         t3=0.d0
         t4=0.d0
         do j=1,n2-3,4
            t1=t1+a(j+0,i+0)*x(j+0)+a(j+1,i+0)*x(j+1)+
     1            a(j+2,i+0)*x(j+2)+a(j+3,i+0)*x(j+3)
            t2=t2+a(j+0,i+1)*x(j+0)+a(j+1,i+1)*x(j+1)+
     1            a(j+2,i+1)*x(j+2)+a(j+3,i+1)*x(j+3)
            t3=t3+a(j+0,i+2)*x(j+0)+a(j+1,i+2)*x(j+1)+
     1            a(j+2,i+2)*x(j+2)+a(j+3,i+2)*x(j+3)
            t4=t4+a(j+0,i+3)*x(j+0)+a(j+1,i+3)*x(j+1)+
     1            a(j+2,i+3)*x(j+2)+a(j+3,i+3)*x(j+3)
         enddo
         y(i+0)=t1
         y(i+1)=t2
         y(i+2)=t3
         y(i+3)=t4
      enddo
      32 FLOPS
20
Summary of Single-Processor Optimization Techniques (I)
° Spatial and temporal data locality
° Loop unrolling
° Blocking
° Software pipelining
° Optimization of data structures
° Special functions, library subroutines
21
Summary of Optimization Techniques (II)
° Achieving high performance requires code restructuring;
minimizing memory traffic is the single most important goal.
° Compilers are getting better: they are good at software pipelining. But they
are not there yet: they can do loop transformations only in simple
cases, usually fail to produce optimal blocking, their heuristics for
unrolling may not match your code well, etc.
° The optimization process is machine-specific and requires detailed
architectural knowledge.
22
° Dimension A(n,n), B(n,n), C(n,n)
° A, B, C stored by column (as in Fortran)
° Algorithm 1:
• for i=1:n, for j=1:n, A(i,j) = B(i,j) + C(i,j)
° Algorithm 2:
• for j=1:n, for i=1:n, A(i,j) = B(i,j) + C(i,j)
° What is “memory access pattern” for Algs 1 and 2?
° Which is faster?
° What if A, B, C stored by row (as in C)?
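To make the answer concrete for the C (row-major) case, here is a small sketch (array and function names are mine): both loop orders compute the same sums, but only one walks memory with unit stride.

```c
#include <assert.h>
#define N 64

/* C stores 2-D arrays by row, so with R[i][j] the INNER loop should vary
   j: add_ij touches consecutive doubles, while add_ji jumps N doubles
   (typically a new cache line) on every inner iteration.  The results
   are identical -- only the memory access pattern differs. */
static double R1[N][N], R2[N][N], B[N][N], C[N][N];

void add_ij(void) {            /* unit-stride inner loop: fast in C */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            R1[i][j] = B[i][j] + C[i][j];
}

void add_ji(void) {            /* stride-N inner loop: slow in C */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            R2[i][j] = B[i][j] + C[i][j];
}

int results_match(void) {      /* self-check: same answer either way */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { B[i][j] = i; C[i][j] = j; }
    add_ij();
    add_ji();
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (R1[i][j] != R2[i][j]) return 0;
    return 1;
}
```

In column-major Fortran the roles are reversed: Algorithm 2 (i innermost) is the unit-stride version.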
23
Using a Simpler Model of Memory to Optimize
° Assume just 2 levels in the hierarchy, fast and
slow
° All data initially in slow memory
• m = number of memory elements (words) moved between fast
and slow memory
• tm = time per slow memory operation
• f = number of arithmetic operations
• tf = time per arithmetic operation < tm
• q = f/m = average number of flops per slow memory access
° Minimum possible Time = f*tf, when all data in fast
memory
° Actual Time = f*tf + m*tm = f*tf*(1 + (tm/tf)*(1/q))
° Larger q means Time closer to minimum f*tf
• Want large q
24
Simple example using memory model
° To see the results of changing q, consider this simple
computation:
   s = 0
   for i = 1, n
      s = s + h(X[i])
° Assume tf = 1, i.e., 1 Mflop/s on fast memory
° Assume moving data costs tm = 10
° Assume h takes q flops
° Assume array X is in slow memory
° So m = n and f = q*n
° Time = read X + compute = 10*n + q*n
° Mflop/s = f/t = q/(10 + q)
° As q increases, this approaches the “peak” speed
of 1 Mflop/s
25
Warm up: Matrix-vector multiplication y = y + A*x
for i = 1:n
for j = 1:n
y(i) = y(i) + A(i,j)*x(j)
[Diagram: y(i) = y(i) + A(i,:) * x(:)]
26
Warm up: Matrix-vector multiplication y = y + A*x
for i = 1:n
{read row i of A into fast memory}
for j = 1:n
y(i) = y(i) + A(i,j)*x(j)
{write y(1:n) back to slow memory}
° m = number of slow memory refs = 3n + n^2
° f = number of arithmetic operations = 2n^2
° q = f/m ~= 2
° Matrix-vector multiplication limited by slow memory speed
27
Matrix Multiply C=C+A*B
for i = 1 to n
for j = 1 to n
for k = 1 to n
C(i,j) = C(i,j) + A(i,k) * B(k,j)
[Diagram: C(i,j) = C(i,j) + A(i,:) * B(:,j)]
28
Matrix Multiply C=C+A*B(unblocked, or untiled)
for i = 1 to n
{read row i of A into fast memory}
for j = 1 to n
{read column j of B into fast memory}
for k = 1 to n
C(i,j) = C(i,j) + A(i,k) * B(k,j)
{write C(i,j) back to slow memory}
[Diagram: C(i,j) = C(i,j) + A(i,:) * B(:,j)]
29
Matrix Multiply (unblocked, or untiled)
q = ops / slow memory ref
Number of slow memory references on unblocked matrix multiply:
   m = n^3        read each column of B n times
     + n^2        read each row of A once
     + 2n^2       read and write each element of C once
     = n^3 + 3n^2
So q = f/m = 2n^3 / (n^3 + 3n^2)
          ~= 2 for large n: no improvement over matrix-vector multiply
[Diagram: C(i,j) = C(i,j) + A(i,:) * B(:,j)]
30
Matrix Multiply (blocked, or tiled)
Consider A,B,C to be N by N matrices of b by b subblocks where b=n/N is
called the blocksize
for i = 1 to N
for j = 1 to N
{read block C(i,j) into fast memory}
for k = 1 to N
{read block A(i,k) into fast memory}
{read block B(k,j) into fast memory}
C(i,j) = C(i,j) + A(i,k) * B(k,j) {do a matrix multiply on blocks}
{write block C(i,j) back to slow memory}
[Diagram: C(i,j) = C(i,j) + A(i,k) * B(k,j) on b-by-b blocks]
n is the size of the matrix, N blocks of size b; n = N*b
31
Matrix Multiply (blocked or tiled)
Why is this algorithm correct?
q = ops / slow memory ref; n is the size of the matrix, N blocks of size b; n = N*b
Number of slow memory references on blocked matrix multiply:
   m = N*n^2      read each block of B N^3 times (N^3 * n/N * n/N)
     + N*n^2      read each block of A N^3 times
     + 2n^2       read and write each block of C once
     = (2N + 2)*n^2
So q = f/m = 2n^3 / ((2N + 2)*n^2)
          ~= n/N = b for large n
So we can improve performance by increasing the blocksize b.
This can be much faster than matrix-vector multiply (q = 2).
Limit: all three blocks from A, B, C must fit in fast memory (cache), so we
cannot make these blocks arbitrarily large: 3b^2 <= M, so q ~= b <= sqrt(M/3)
Theorem (Hong, Kung, 1981): Any reorganization of this algorithm
(that uses only associativity) is limited to q = O(sqrt(M))
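The blocked algorithm above is easy to express in C. A sketch (sizes and names are mine; a real implementation would pick NB from the cache size so that 3*NB*NB doubles fit):

```c
#include <assert.h>
#define N 32
#define NB 8   /* block size b: all three b-by-b tiles must fit in cache */

/* Blocked C = C + A*B as on the slide: loop over NB-by-NB tiles and do
   a small matrix multiply on each (C-tile, A-tile, B-tile) triple. */
void matmul_blocked(const double *A, const double *B, double *C) {
    for (int i = 0; i < N; i += NB)
        for (int j = 0; j < N; j += NB)
            for (int k = 0; k < N; k += NB)
                for (int ii = i; ii < i + NB; ii++)
                    for (int jj = j; jj < j + NB; jj++) {
                        double t = C[ii * N + jj];
                        for (int kk = k; kk < k + NB; kk++)
                            t += A[ii * N + kk] * B[kk * N + jj];
                        C[ii * N + jj] = t;
                    }
}

/* self-check against the straightforward triple loop: the blocking only
   reorders additions, so the results agree exactly for integer data */
int blocked_matches_naive(void) {
    static double A[N * N], B[N * N], C1[N * N], C2[N * N];
    for (int i = 0; i < N * N; i++) {
        A[i] = i % 7;
        B[i] = i % 5;
        C1[i] = C2[i] = 1.0;
    }
    matmul_blocked(A, B, C1);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C2[i * N + j] += A[i * N + k] * B[k * N + j];
    for (int i = 0; i < N * N; i++)
        if (C1[i] != C2[i]) return 0;
    return 1;
}
```

This answers the slide's "why is this algorithm correct?" in code: the tiled loops perform exactly the same multiply-adds as the untiled loops, just in a cache-friendly order.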
32
Model
° As much as possible will be overlapped
° Dot product:
      ACC = 0
      do i = 1,n
         ACC = ACC + x(i)*y(i)
      end do
° Experiments done on an IBM RS6000/530
  • 25 MHz
  • 2 cycles to complete an FMA, which can be pipelined
    - => 50 Mflop/s peak
  • one cycle from cache
[Diagram: pipelined FMAs issued on successive cycles]
33
DOT Operation - Data in Cache
      DO 10 I = 1, n
         T = T + X(I)*Y(I)
 10   CONTINUE
° Theoretically, 2 loads for X(I) and Y(I), one
FMA operation, no re-use of data
° Pseudo-assembler:
      fp0,T
 label:
      FMA fp0,fp0,fp1,fp2
      BRANCH label
° 1 result per cycle = 25 Mflop/s
34
Matrix-Vector Product
° DOT version
      DO 20 I = 1, M
         DO 10 J = 1, N
            Y(I) = Y(I) + A(I,J)*X(J)
 10      CONTINUE
 20   CONTINUE
° From Cache = 22.7 Mflops
° From Memory = 12.4 Mflops
35
Loop Unrolling
° Speed of y = y + A^T*x, N=48 (Mflop/s):

  Depth     1     2     3     4     ...   infinity
  Speed     25    33.3  37.5  40          50
  Measured  22.7  30.5  34.3  36.5
  Memory    12.7  12.7  12.6  12.4

      DO 20 I = 1, M, 2
         T1 = Y(I  )
         T2 = Y(I+1)
         DO 10 J = 1, N
            T1 = T1 + A(I,J  )*X(J)
            T2 = T2 + A(I+1,J)*X(J)
 10      CONTINUE
         Y(I  ) = T1
         Y(I+1) = T2
 20   CONTINUE

° unroll 1: 2 loads : 2 ops per 2 cycles
° unroll 2: 3 loads : 4 ops per 3 cycles
° unroll 3: 4 loads : 6 ops per 4 cycles
° ...
° unroll n: n+1 loads : 2n ops per n+1 cycles
° problem: only so many registers
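The unroll-by-2 trick in the Fortran above carries over directly to C. A sketch (sizes and names are mine), with a self-check that the unrolled version matches the plain loop:

```c
#include <assert.h>
#define MR 8    /* rows (even, so unroll-by-2 covers all of them) */
#define NC 16   /* columns */

/* y = y + A*x with the row loop unrolled by 2, mirroring the slide's
   Fortran: x[j] is loaded once and used for two rows, so each inner
   iteration does 3 loads for 4 flops instead of 2 loads for 2 flops. */
void matvec_unroll2(const double A[MR][NC], const double *x, double *y) {
    for (int i = 0; i < MR; i += 2) {
        double t1 = y[i], t2 = y[i + 1];    /* two register accumulators */
        for (int j = 0; j < NC; j++) {
            t1 += A[i][j]     * x[j];
            t2 += A[i + 1][j] * x[j];
        }
        y[i]     = t1;
        y[i + 1] = t2;
    }
}

/* self-check against the un-unrolled reference loop */
int unroll2_matches_ref(void) {
    static double A[MR][NC];
    double x[NC], y1[MR], y2[MR];
    for (int j = 0; j < NC; j++) x[j] = j + 1;
    for (int i = 0; i < MR; i++) {
        y1[i] = y2[i] = i;
        for (int j = 0; j < NC; j++) A[i][j] = i + 2 * j;
    }
    matvec_unroll2(A, x, y1);
    for (int i = 0; i < MR; i++)
        for (int j = 0; j < NC; j++) y2[i] += A[i][j] * x[j];
    for (int i = 0; i < MR; i++)
        if (y1[i] != y2[i]) return 0;
    return 1;
}
```

Deeper unrolling adds more accumulators (t3, t4, ...) per outer iteration, until, as the slide notes, you run out of registers.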
36
Matrix Multiply
° DOT version - 25 Mflops in cache
      DO 30 J = 1, M
         DO 20 I = 1, M
            DO 10 K = 1, L
               C(I,J) = C(I,J) + A(I,K)*B(K,J)
 10         CONTINUE
 20      CONTINUE
 30   CONTINUE
37
How to Get Near Peak
° Inner loop: 4 loads, 8 operations -- optimal.
° In practice we have measured 48.1 out of a peak
of 50 Mflop/s when in cache.

      DO 30 J = 1, M, 2
         DO 20 I = 1, M, 2
            T11 = C(I,  J  )
            T12 = C(I,  J+1)
            T21 = C(I+1,J  )
            T22 = C(I+1,J+1)
            DO 10 K = 1, L
               T11 = T11 + A(I,  K)*B(K,J  )
               T12 = T12 + A(I,  K)*B(K,J+1)
               T21 = T21 + A(I+1,K)*B(K,J  )
               T22 = T22 + A(I+1,K)*B(K,J+1)
 10         CONTINUE
            C(I,  J  ) = T11
            C(I,  J+1) = T12
            C(I+1,J  ) = T21
            C(I+1,J+1) = T22
 20      CONTINUE
 30   CONTINUE
38
BLAS -- Introduction
° Clarity: code is shorter and easier to read,
° Modularity: gives programmer larger building
blocks,
° Performance: manufacturers will provide tuned
machine-specific BLAS,
° Program portability: machine dependencies are
confined to the BLAS
39
Memory Hierarchy
° Key to high performance is effective use of the
memory hierarchy
° True on all architectures
Registers -> L1 Cache -> L2 Cache -> Local Memory -> Remote Memory -> Secondary Memory
40
Level 1, 2 and 3 BLAS
° Level 1 BLAS: vector-vector operations
° Level 2 BLAS: matrix-vector operations
° Level 3 BLAS: matrix-matrix operations
41
More on BLAS (Basic Linear Algebra Subroutines)
° Industry standard interface (evolving)
° Vendors, others supply optimized implementations
° History
  • BLAS1 (1970s):
    - vector operations: dot product, saxpy (y = a*x + y), etc.
    - m = 2n, f = 2n, q ~ 1 or less
  • BLAS2 (mid 1980s):
    - matrix-vector operations: matrix-vector multiply, etc.
    - m = n^2, f = 2n^2, q ~ 2; less overhead
    - somewhat faster than BLAS1
  • BLAS3 (late 1980s):
    - matrix-matrix operations: matrix-matrix multiply, etc.
    - m >= 4n^2, f = O(n^3), so q can possibly be as large as n, so BLAS3 is
      potentially much faster than BLAS2
° Good algorithms use BLAS3 when possible (LAPACK)
° www.netlib.org/blas, www.netlib.org/lapack
42
Why Higher Level BLAS?
° Can only do arithmetic on data at the top of the
hierarchy
° Higher level BLAS lets us do this

  BLAS                  Memory Refs   Flops   Flops/Memory Refs
  Level 1 (y = y+ax)    3n            2n      2/3
  Level 2 (y = y+Ax)    n^2           2n^2    2
  Level 3 (C = C+AB)    4n^2          2n^3    n/2

Registers -> L1 Cache -> L2 Cache -> Local Memory -> Remote Memory -> Secondary Memory
43
BLAS for Performance
Intel Pentium 4 w/SSE2, 1.7 GHz
[Figure: Mflop/s (0 to 2000) vs. order of vectors/matrices (10 to 500) for Level 1, Level 2, and Level 3 BLAS; Level 3 is fastest]
° Development of blocked algorithms important for
performance
44
BLAS for Performance
IBM RS/6000-590 (66 MHz, 264 Mflop/s peak)
[Figure: Mflop/s (0 to 250) vs. order of vectors/matrices (10 to 500) for Level 1, Level 2, and Level 3 BLAS]
45
BLAS for Performance
Alpha EV 5/6 500 MHz (1 Gflop/s peak)
[Figure: Mflop/s (0 to 700) vs. order of vectors/matrices (10 to 500) for Level 1, Level 2, and Level 3 BLAS]
BLAS 3 (n-by-n matrix-matrix multiply) vs.
BLAS 2 (n-by-n matrix-vector multiply) vs.
BLAS 1 (saxpy of n vectors)
46
Fast linear algebra kernels: BLAS
° Simple linear algebra kernels such as matrix-matrix multiply
° More complicated algorithms can be built from
these basic kernels.
° The interfaces of these kernels have been
standardized as the Basic Linear Algebra
Subroutines (BLAS).
° Early agreement on standard interface (~1980)
° Led to portable libraries for vector and shared
memory parallel machines.
° On distributed memory, there is a less-standard
interface called the PBLAS
47
Level 1 BLAS
° Operate on vectors or pairs of vectors
• perform O(n) operations;
• return either a vector or a scalar.
° saxpy
• y(i) = a * x(i) + y(i), for i=1 to n.
• s stands for single precision, daxpy is for double
precision, caxpy for complex, and zaxpy for double
complex,
° sscal: y = a * x, for scalar a and vectors x, y
° sdot computes s = sum_{i=1..n} x(i)*y(i)
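The Level 1 operations are simple enough to sketch directly. Here are C versions of the double-precision variants (the small check functions are mine, for illustration; a real BLAS also takes stride arguments):

```c
#include <assert.h>

/* daxpy: y = a*x + y for scalar a and n-vectors x, y */
void daxpy(int n, double a, const double *x, double *y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* ddot: s = sum_{i=0..n-1} x[i]*y[i] */
double ddot(int n, const double *x, const double *y) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i] * y[i];
    return s;
}

/* quick self-checks on x = [1,2,3], y = [4,5,6] */
double ddot_check(void) {           /* 1*4 + 2*5 + 3*6 = 32 */
    double x[3] = {1, 2, 3}, y[3] = {4, 5, 6};
    return ddot(3, x, y);
}
double daxpy_check(void) {          /* y = 2*x + y; returns y[2] = 12 */
    double x[3] = {1, 2, 3}, y[3] = {4, 5, 6};
    daxpy(3, 2.0, x, y);
    return y[2];
}
```

Note the q ~ 1 behavior discussed earlier: daxpy does 2n flops against roughly 3n memory references, so no amount of tuning makes it run near peak from slow memory.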
48
Level 2 BLAS
° Operate on a matrix and a vector;
• return a matrix or a vector;
• O(n2) operations
° sgemv: matrix-vector multiply
• y = y + A*x
• where A is m-by-n, x is n-by-1 and y is m-by-1.
° sger: rank-one update
  • A = A + y*x^T, i.e., A(i,j) = A(i,j) + y(i)*x(j)
  • where A is m-by-n, y is m-by-1, x is n-by-1
° strsv: triangular solve
  • solves y = T*x for x, where T is triangular
49
Level 3 BLAS
° Operate on pairs or triples of matrices
• returning a matrix;
• complexity is O(n3).
° sgemm: Matrix-matrix multiplication
• C = C +A*B,
• where C is m-by-n, A is m-by-k, and B is k-by-n
° strsm: multiple triangular solve
• solves Y = T*X for X,
• where T is a triangular matrix, and X is a rectangular matrix.
50
Optimizing in practice
° Tiling for registers
• loop unrolling, use of named “register” variables
° Tiling for multiple levels of cache
° Exploiting fine-grained parallelism within the
processor
• super scalar
• pipelining
° Complicated compiler interactions
° Hard to do by hand (but you’ll try)
° Automatic optimization an active research area
• PHIPAC: www.icsi.berkeley.edu/~bilmes/phipac
• www.cs.berkeley.edu/~iyer/asci_slides.ps
• ATLAS: www.netlib.org/atlas/index.html
51
BLAS -- References
° BLAS software and documentation can
be obtained via:
• WWW: http://www.netlib.org/blas,
• (anonymous) ftp ftp.netlib.org: cd blas; get index
• email netlib@www.netlib.org with the message: send index from
blas
° Comments and questions can be
52
BLAS Papers
° C. Lawson, R. Hanson, D. Kincaid, and F. Krogh, Basic Linear
Algebra Subprograms for Fortran Usage, ACM Transactions
on Mathematical Software, 5:308--325, 1979.
° J. Dongarra, J. Du Croz, S. Hammarling, and R. Hanson, An
Extended Set of Fortran Basic Linear Algebra Subprograms,
ACM Transactions on Mathematical Software, 14(1):1--32,
1988.
° J. Dongarra, J. Du Croz, I. Duff, S. Hammarling, A Set of Level
3 Basic Linear Algebra Subprograms, ACM Transactions on
Mathematical Software, 16(1):1--17, 1990.
53
Performance of BLAS
° BLAS are specially optimized by the vendor
• Sun BLAS uses features in the Ultrasparc
° Big payoff for algorithms that can be
expressed in terms of the BLAS3 instead of
BLAS2 or BLAS1.
° The top speed of the BLAS3
° Algorithms like Gaussian elimination are organized
so that they use BLAS3
54
How To Get Performance From Commodity Processors?
° Today’s processors can achieve high-performance, but
this requires extensive machine-specific hand tuning.
° Routines have a large design space w/many parameters
• blocking sizes, loop nesting permutations, loop unrolling depths,
software pipelining strategies, register allocations, and instruction
schedules.
• Complicated interactions with the increasingly sophisticated
microarchitectures of new microprocessors.
° A few months ago no tuned BLAS for Pentium for Linux.
° Need for quick/dynamic deployment of optimized routines.
° ATLAS - Automatic Tuned Linear Algebra Software
•
PhiPac from Berkeley
55
° Do a parameter study of the operation on the
target machine; done once.
° Only generated code is the on-chip multiply
° BLAS operations are written in terms of the generated
on-chip multiply
° All transpose cases are coerced through data copy to
one case of on-chip multiply
  • Only one case generated per platform
[Diagram: C (M-by-N) = A (M-by-K) * B (K-by-N), partitioned into NB-sized blocks]
56
Code Generation Strategy
° On-chip multiply optimizes for:
  • TLB access
  • L1 cache reuse
  • FP unit usage
  • Memory fetch
° Code is iteratively generated and timed until the
optimal case is found. We try:
  • Register reuse
  • Differing NBs
  • Breaking false dependencies
  • M, N and K loop unrolling
° Takes a couple of hours to run.
500x500 Double Precision Matrix-Matrix Multiply Across Multiple Architectures
[Figure: Mflop/s (0 to 700) for vendor matrix multiply vs. ATLAS matrix multiply on: Sun Ultra2 Model 2200, Sun Darwin-270, Sun Microsparc II Model 70, SGI R10000ip27, SGI R8000ip21, SGI R5000, SGI R4600, Pentium II-266, Pentium Pro-200, Pentium MMX-150, IBM PowerPC 604e-332, IBM Power2-135, HP9000/735/125, HP PA8000 180Mhz, DEC Alpha 21164a-433]
58
500 x 500 Double Precision LU Factorization Performance Across Multiple Architectures
[Figure: MFLOPS (0 to 600) for LU w/vendor BLAS vs. LU w/ATLAS & GEMM-based BLAS on: Sun Ultra2 Model 2200, Sun Darwin-270, SGI R10000ip27, SGI R5000, Pentium II-266, Pentium Pro-200, IBM PowerPC 604e-332, IBM Power2-135, HP PA8000, DEC Alpha 21164a-433, DCG LX 21164a-533]
59
500x500 gemm-based BLAS on SGI R10000ip28
[Figure: MFLOPS (0 to 300) for DGEMM, DSYMM, DSYR2K, DSYRK, DTRMM, and DTRSM with vendor BLAS, ATLAS/SSBLAS, and reference BLAS]
60
500x500 gemm-based BLAS on UltraSparc 2200
[Figure: MFLOPS (0 to 300) for each Level 3 BLAS routine (DGEMM, DSYMM, DSYR2K, DSYRK, DTRMM, DTRSM) with vendor BLAS, ATLAS/GEMM-based BLAS, and reference BLAS]
61
Recursive Approach for Other Level 3 BLAS
° Recur down to the L1 cache block size
° Need a kernel at the bottom of the recursion
  • Use gemm-based kernel for portability
[Diagram: recursive TRMM decomposition]
62
500x500 Level 2 BLAS DGEMV
[Figure: MFLOPS (0 to 300) for vendor, ATLAS, and Fortran 77 DGEMV (NoTrans) across multiple architectures]
63
Intel PIII 550 MHz
[Figure: Mflop/s (0 to 800) vs. size (100 to 1000) for Intel BLAS vs. ATLAS, on 1 and 2 processors]
64
ATLAS
° Keep a repository of kernels for specific machines.
° Extend work to allow sparse matrix operations
° Extend work to include arbitrary code segments
° See: http://www.netlib.org/atlas/
65
Algorithms and Architecture
° The key to performance is to understand the
algorithm and architecture interaction.
° A significant improvement in performance can be
obtained by matching the algorithm to the
architecture, or vice versa.
Algorithm Issues
° Use of memory hierarchy
° Algorithm pre-fetching
° Loop unrolling
° Simulating higher precision arithmetic
Blocking
° TLB blocking - minimize TLB misses
° Cache blocking - minimize cache misses
° Register blocking - minimize load/stores
Registers -> L1 Cache -> L2 Cache -> Local Memory -> Remote Memory -> Secondary Memory
° The general idea of blocking is to get the information
to a high-speed storage and use it multiple times, so as
to amortize the cost of moving the data
° Cache blocking reduces traffic between memory and
cache
° Register blocking reduces traffic between cache and
CPU
Loop Unrolling
° Reduces data dependency delay
° Exploits multiple functional units and quad
° Gives more flexibility to compiler in scheduling
° Facilitates algorithm pre-fetching.
What’s Wrong With Speedup T1/Tp ?
° Can lead to very false conclusions.
° Speedup in isolation without taking into account the
speed of the processor is unrealistic and pointless.
° Speedup over what?
° T1/Tp
• There is usually no doubt about Tp
• Often considerable dispute over the meaning of T1
- Serial code? Same algorithm?
Speedup
° Can be used to:
• Study, in isolation, the scaling of one algorithm on one computer.
• As a dimensionless variable in the theory of scaling.
° Should not be used to compare:
• Different algorithms on the same computer
• The same algorithm on different computers.
• Different interconnection structures.
Strassen’s Algorithm for Matrix Multiply

  [ C11  C12 ]   [ A11  A12 ] [ B11  B12 ]
  [ C21  C22 ] = [ A21  A22 ] [ B21  B22 ]

Usual Matrix Multiply
  C11 = A11*B11 + A12*B21
  C12 = A11*B12 + A12*B22
  C21 = A21*B11 + A22*B21
  C22 = A21*B12 + A22*B22

Strassen’s Algorithm
  P1 = (A11 + A22)(B11 + B22)        C11 = P1 + P4 - P5 + P7
  P2 = (A21 + A22) B11               C12 = P3 + P5
  P3 = A11 (B12 - B22)               C21 = P2 + P4
  P4 = A22 (B21 - B11)               C22 = P1 - P2 + P3 + P6
  P5 = (A11 + A12) B22
  P6 = (A21 - A11)(B11 + B12)
  P7 = (A12 - A22)(B21 + B22)
Strassen’s Algorithm
° The count of arithmetic operations is:

              Multiplies   Adds   Complexity
  Regular     8            4      2 n^3 + O(n^2)
  Strassen    7            18     4.7 n^2.8 + O(n^2)

One matrix multiply is replaced by 14 matrix additions
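One level of the recursion is short enough to write out. Here is a C sketch on a 2-by-2 matrix of scalars (in the real algorithm the entries A11, B11, ... are submatrices and each product P1..P7 is a recursive call; the function name is mine):

```c
#include <assert.h>

/* One level of Strassen: 7 multiplies and 18 additions/subtractions
   instead of 8 multiplies and 4 additions. */
void strassen2x2(const double a[2][2], const double b[2][2], double c[2][2]) {
    double p1 = (a[0][0] + a[1][1]) * (b[0][0] + b[1][1]);
    double p2 = (a[1][0] + a[1][1]) * b[0][0];
    double p3 = a[0][0] * (b[0][1] - b[1][1]);
    double p4 = a[1][1] * (b[1][0] - b[0][0]);
    double p5 = (a[0][0] + a[0][1]) * b[1][1];
    double p6 = (a[1][0] - a[0][0]) * (b[0][0] + b[0][1]);
    double p7 = (a[0][1] - a[1][1]) * (b[1][0] + b[1][1]);
    c[0][0] = p1 + p4 - p5 + p7;
    c[0][1] = p3 + p5;
    c[1][0] = p2 + p4;
    c[1][1] = p1 - p2 + p3 + p6;
}

/* self-check: [[1,2],[3,4]] * [[5,6],[7,8]] = [[19,22],[43,50]] */
int strassen_check(void) {
    double a[2][2] = {{1, 2}, {3, 4}}, b[2][2] = {{5, 6}, {7, 8}}, c[2][2];
    strassen2x2(a, b, c);
    return c[0][0] == 19 && c[0][1] == 22 && c[1][0] == 43 && c[1][1] == 50;
}
```

Note that the formulas never use commutativity of multiplication, which is exactly why they remain valid when the scalars are replaced by matrix blocks.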
Strassen’s Algorithm
° In reality the use of Strassen’s Algorithm is limited
by
• Additional memory required for storing the P matrices.
• More memory accesses are needed.
Outline
° Motivation for Dense Linear Algebra
• Ax=b: Computational Electromagnetics
• Ax = λx: Quantum Chemistry
° Review Gaussian Elimination (GE) for solving Ax=b
° Optimizing GE for caches on sequential machines
• using matrix-matrix multiplication (BLAS)
° LAPACK library overview and performance
° Data layouts on parallel machines
° Parallel matrix-matrix multiplication
° Parallel Gaussian Elimination
° ScaLAPACK library overview
° Eigenvalue problem
77
Parallelism in Sparse Matrix-vector multiplication
° y = A*x, where A is sparse and n x n
° Questions
  • which processors store y[i], x[i], and A[i,j]
  • which processors compute
    y[i] = sum (from 1 to n) A[i,j] * x[j]
         = (row i of A) . x   ... a sparse dot product
° Partitioning
  • Partition index set {1,...,n} = N1 u N2 u ... u Np
  • For all i in Nk, processor k stores y[i], x[i], and row i of A
  • For all i in Nk, processor k computes y[i] = (row i of A) . x
    - "owner computes" rule: processor k computes the y[i]s it owns
° Goals of partitioning
  • balance storage (how much does each processor store?)
  • minimize communication (how much is communicated?)
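The per-row "sparse dot product" above is what each processor runs over its owned rows. A C sketch of that local kernel (using the common compressed sparse row format; names are mine, and here a "processor" is simply a contiguous range of rows):

```c
#include <assert.h>

/* y = A*x with A in compressed sparse row (CSR) form:
   rowptr[i]..rowptr[i+1]-1 indexes the nonzeros of row i,
   colind[k] is the column and val[k] the value of nonzero k.
   Each y[i] is the sparse dot product (row i of A) . x,
   which is the "owner computes" rule applied row by row. */
void csr_matvec(int n, const int *rowptr, const int *colind,
                const double *val, const double *x, double *y) {
    for (int i = 0; i < n; i++) {
        double s = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
            s += val[k] * x[colind[k]];
        y[i] = s;
    }
}

/* self-check: A = [[2,0],[1,3]], x = [1,2]  =>  y = [2,7] */
int csr_check(void) {
    int rowptr[3] = {0, 1, 3}, colind[3] = {0, 0, 1};
    double val[3] = {2, 1, 3}, x[2] = {1, 2}, y[2];
    csr_matvec(2, rowptr, colind, val, x, y);
    return y[0] == 2 && y[1] == 7;
}
```

In a distributed setting, the communication is for the x[j] values whose columns fall outside a processor's own partition, which is what the graph-partitioning discussion that follows tries to minimize.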
78
Graph Partitioning and Sparse Matrices
° Relationship between matrix and graph
[Figure: a 6-by-6 symmetric sparse matrix and the corresponding graph on nodes 1-6; each nonzero A(i,j) is an edge between nodes i and j]
° A “good” partition of the graph has
• equal (weighted) number of nodes in each part (load and storage balance)
• minimum number of edges crossing between (minimize communication)
° Can reorder the rows/columns of the matrix by putting all the
nodes in one partition together
79
More on Matrix Reordering via Graph Partitioning
° “Ideal” matrix structure for parallelism: (nearly) block diagonal
• p (number of processors) blocks
• few non-zeros outside these blocks, since these require communication
[Figure: block-diagonal matrix with diagonal blocks P0-P4 in a matrix-vector product]
80
What about implicit methods and eigenproblems?
° Direct methods (Gaussian elimination)
• Called LU Decomposition, because we factor A = L*U
• Future lectures will consider both dense and sparse cases
• More complicated than sparse-matrix vector multiplication
° Iterative solvers
• Will discuss several of these in future lectures
  - Jacobi, Successive overrelaxation (SOR), Conjugate gradient
• Most have sparse-matrix-vector multiplication in their kernel
° Eigenproblems
• Future lectures will discuss dense and sparse cases
• Also depend on sparse-matrix-vector multiplication, direct
methods
° Graph partitioning
81
Partial Differential Equations
PDEs
82
Continuous Variables, Continuous Parameters
Examples of such systems include
° Heat flow: Temperature(position, time)
° Diffusion: Concentration(position, time)
° Electrostatic or Gravitational Potential:
Potential(position)
° Fluid flow: Velocity,Pressure,Density(position,time)
° Quantum mechanics: Wave-function(position,time)
° Elasticity: Stress,Strain(position,time)
83
Example: Deriving the Heat Equation
Consider a simple problem:
° A bar of uniform material, insulated except at the ends, spanning x = 0 to x = 1
° Let u(x,t) be the temperature at position x at time t
° Heat travels from x-h to x+h at a rate proportional to the temperature differences:

  d u(x,t)         ( u(x-h,t) - u(x,t) )/h  -  ( u(x,t) - u(x+h,t) )/h
  -------- = C *  ----------------------------------------------------
     dt                                    h

° As h -> 0, we get the heat equation:

  d u(x,t)         d^2 u(x,t)
  -------- = C *  -----------
     dt               dx^2
84
Explicit Solution of the Heat Equation
° For simplicity, assume C = 1
° Discretize both time and position
° Use finite differences with u[j,i] as the heat at
  • time t = i*dt (i = 0,1,2,...) and position x = j*h (j = 0,1,...,N = 1/h)
  • initial conditions on u[j,0]
  • boundary conditions on u[0,i] and u[N,i]
° At each timestep i = 0,1,2,...
  For j = 0 to N
     u[j,i+1] = z*u[j-1,i] + (1-2*z)*u[j,i] + z*u[j+1,i]
  where z = dt/h^2
° This corresponds to
  • matrix-vector multiply (what is the matrix?)
  • nearest neighbors on the grid
[Figure: space-time grid from t=0 to t=5 over positions u[0,0] through u[5,0]]
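One explicit timestep of the update above can be sketched in C as follows (the grid size and names are mine; the two end points carry the boundary conditions and are held fixed):

```c
#include <assert.h>
#define NP 6   /* grid points 0..NP-1; ends hold boundary values */

/* One explicit timestep of the 1D heat equation:
   unew[j] = z*u[j-1] + (1-2z)*u[j] + z*u[j+1],  z = dt/h^2,
   applied at interior points, with boundary values copied through. */
void heat_step(const double *u, double *unew, double z) {
    unew[0] = u[0];
    unew[NP - 1] = u[NP - 1];
    for (int j = 1; j < NP - 1; j++)
        unew[j] = z * u[j - 1] + (1.0 - 2.0 * z) * u[j] + z * u[j + 1];
}

/* self-check: a unit spike at j=2 with z = 0.25 spreads to its
   neighbors, leaving 0.25, 0.5, 0.25 at j = 1, 2, 3 */
double heat_check(void) {
    double u[NP] = {0, 0, 1, 0, 0, 0}, un[NP];
    heat_step(u, un, 0.25);
    return un[2];
}
```

Each step reads only nearest neighbors, which is what makes the method easy to parallelize by partitioning the grid; the stability constraint z <= 0.5 discussed on the next slides limits how large dt can be.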
85
Parallelism in Explicit Method for PDEs
° Partition the space (x) into p large chunks
  • good load balance (assuming a large number of points relative to p)
  • minimized communication (only p chunk boundaries)
° Generalizes to
  • multiple dimensions
  • arbitrary graphs (= sparse matrices)
° Problem with explicit approach
  • numerical instability
  • solution blows up eventually if z = dt/h^2 > .5
  • need to make the timesteps very small when h is small: dt < .5*h^2
Instability in solving the heat equation explicitly
87
Implicit Solution
° As with many (stiff) ODEs, need an implicit method
° This turns into solving the following equation:
  (I + (z/2)*T) * u[:,i+1] = (I - (z/2)*T) * u[:,i]
° Here I is the identity matrix and T is the tridiagonal matrix

       [  2 -1             ]
       [ -1  2 -1          ]
  T =  [    -1  2 -1       ]
       [        ...        ]
       [           -1  2   ]

  Graph and “stencil”: (-1, 2, -1)
° I.e., essentially solving Poisson’s equation in 1D
88
2D Implicit Method
° Similar to the 1D case, but the matrix T is now block tridiagonal: each
row has 4 on the diagonal and -1 for each of the (up to four) nearest
neighbors of that grid point
  Graph and “stencil”: each point couples to its north, south, east, and
  west neighbors with weight -1, and to itself with weight 4
° Multiplying by this matrix (as in the explicit case) is
simply a nearest neighbor computation on the 2D grid
° To solve this system, there are several techniques
89
Algorithms for 2D Poisson Equation with N unknowns

  Algorithm        Serial      PRAM           Memory    #Procs
  Dense LU         N^3         N              N^2       N^2
  Band LU          N^2         N              N^(3/2)   N
  Jacobi           N^2         N              N         N
  Explicit Inv.    N^2         log N          N^2       N^2
  Conj. Gradients  N^(3/2)     N^(1/2)*log N  N         N
  RB SOR           N^(3/2)     N^(1/2)        N         N
  Sparse LU        N^(3/2)     N^(1/2)        N*log N   N
  FFT              N*log N     log N          N         N
  Multigrid        N           log^2 N        N         N
  Lower bound      N           log N          N

PRAM is an idealized parallel model with zero-cost communication
(see next slide for explanation)
90
Short explanations of algorithms on previous slide
° Sorted in two orders (roughly):
• from slowest to fastest on sequential machines
• from most general (works on any matrix) to most specialized (works on matrices “like” T)
° Dense LU: Gaussian elimination; works on any N-by-N matrix
° Band LU: exploit fact that T is nonzero only on sqrt(N) diagonals nearest main
diagonal, so faster
° Jacobi: essentially does matrix-vector multiply by T in inner loop of iterative
algorithm
° Explicit Inverse: assume we want to solve many systems with T, so we can
precompute and store inv(T), turning each solve into a matrix-vector multiply
• It’s still expensive!
° Conjugate Gradients: uses matrix-vector multiplication, like Jacobi, but
exploits mathematical properties of T that Jacobi does not
° Red-Black SOR (Successive Overrelaxation): Variation of Jacobi that exploits
yet different mathematical properties of T
• Used in Multigrid
° Sparse LU: Gaussian elimination exploiting particular zero structure of T
° FFT (Fast Fourier Transform): works only on matrices very like T
° Multigrid: also works on matrices like T, that come from elliptic PDEs
° Lower Bound: serial (time to print answer); parallel (time to combine N inputs)
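To make the iterative entries in the table concrete, here is a minimal Jacobi sweep for the 1D model problem T*x = b (my own sketch, not from the slides; the inner kernel is exactly the multiply by the off-diagonal part of T that the summary above describes):

```python
import numpy as np

def jacobi_1d_poisson(b, sweeps=5000):
    # Jacobi for T*x = b with T = tridiag(-1, 2, -1):
    #   x_i <- (b_i + x_{i-1} + x_{i+1}) / 2
    n = len(b)
    x = np.zeros(n)
    for _ in range(sweeps):
        left  = np.concatenate(([0.0], x[:-1]))
        right = np.concatenate((x[1:], [0.0]))
        x = (b + left + right) / 2.0
    return x

n = 10
T = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
x = jacobi_1d_poisson(b)
print(np.max(np.abs(T @ x - b)))   # residual shrinks with more sweeps
```

The slow convergence (many sweeps for even a tiny problem) is why Jacobi sits near the top of the table and multigrid near the bottom.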
91
Composite mesh from a mechanical structure
92
Converting the mesh to a matrix
93
Effects of Ordering Rows and Columns on Gaussian Elimination
94
Irregular mesh: NASA Airfoil in 2D (direct solution)
95
Irregular mesh: Tapered Tube (multigrid)
96
° John Bell and Phil Colella at LBL (see class web page for URL)
° Goal of Titanium is to make these algorithms easier to implement
in parallel
97
Computational Electromagnetics
•Developed during 1980s, driven by defense
applications
•Determine the RCS (radar cross section) of airplane
•Reduce signature of plane (stealth technology)
•Other applications are antenna design, medical
equipment
•Two fundamental numerical approaches:
•MOM, methods of moments (frequency domain), and
•Finite differences (time domain)
98
Computational Electromagnetics
- Discretize surface into triangular facets using
standard modeling tools
- Amplitudes of currents on the surface are the
unknowns
- Integral equation is discretized into a set of linear
equations
image: NW Univ. Comp. Electromagnetics Laboratory http://nueml.ece.nwu.edu/
99
Computational Electromagnetics (MOM)
After discretization the integral equation has the
form
A x = b
where
A is the (dense) impedance matrix,
x is the unknown vector of amplitudes, and
b is the excitation vector.
(see Cwik, Patterson, and Scott, Electromagnetic Scattering on the Intel Touchstone Delta,
IEEE Supercomputing ‘92, pp 538 - 542)
100
Computational Electromagnetics (MOM)
The main steps in the solution process are
Fill:
computing the matrix elements of A
Factor:
factoring the dense matrix A
Solve:
solving for one or more excitations b
Field Calc: computing the fields scattered from the
object
101
Analysis of MOM for Parallel Implementation
              Work       Parallelism        Parallel Speed
Fill          O(n**2)    embarrassing       low
Factor        O(n**3)    moderately diff.   very high
Solve         O(n**2)    moderately diff.   high
Field Calc.   O(n)       embarrassing       high
102
Results for Parallel Implementation on Delta
              Time (hours)
Fill          9.20
Factor        8.25
Solve         2.17
Field Calc.   0.12
The problem solved was for a matrix of size
48,672. (The world record in 1991.)
103
Current Records for Solving Dense Systems
Year     System Size   Machine          # Procs   Gflops (Peak)
1950's   O(100)
1995     128,600       Intel Paragon      6768      281 (  338)
1996     215,000       Intel ASCI Red     7264     1068 ( 1453)
1998     148,000       Cray T3E           1488     1127 ( 1786)
1998     235,000       Intel ASCI Red     9152     1338 ( 1830)
1999     374,000       SGI ASCI Blue      5040     1608 ( 2520)
1999     362,880       Intel ASCI Red     9632     2379 ( 3207)
2000     430,000       IBM ASCI White     8192     4928 (12000)
source: Alan Edelman http://www-math.mit.edu/~edelman/records.html
LINPACK Benchmark: http://www.netlib.org/performance/html/PDSreports.html
104
Computational Chemistry
° Seek energy levels of a molecule, crystal, etc.
• Solve Schroedinger’s Equation for energy levels = eigenvalues
• Discretize to get Ax = lBx, solve for eigenvalues l and eigenvectors x
• A and B large, symmetric or Hermitian matrices (B positive definite)
• May want some or all eigenvalues/eigenvectors
° MP-Quest (Sandia NL)
• Si and sapphire crystals of up to 3072 atoms
• Local Density Approximation to Schroedinger Equation
• A and B up to n=40000, Hermitian
• Need all eigenvalues and eigenvectors
• Need to iterate up to 20 times (for self-consistency)
° Implemented on Intel ASCI Red
• 9200 Pentium Pro 200 processors (4600 Duals, a CLUMP)
• Overall application ran at 605 Gflops (out of 1800 Glops peak),
• Eigensolver ran at 684 Gflops
• www.cs.berkeley.edu/~stanley/gbell/index.html
• Runner-up for Gordon Bell Prize at Supercomputing 98
105
Review of Gaussian Elimination (GE) for solving Ax=b
° Add multiples of each row to later rows to make A upper
triangular
° Solve resulting triangular system Ux = c by substitution
… for each column i
… zero it out below the diagonal by adding multiples of row i to later rows
for i = 1 to n-1
… for each row j below row i
for j = i+1 to n
… add a multiple of row i to row j
for k = i to n
A(j,k) = A(j,k) - (A(j,i)/A(i,i)) * A(i,k)
106
Refine GE Algorithm (1)
° Initial Version
… for each column i
… zero it out below the diagonal by adding multiples of row i to later rows
for i = 1 to n-1
… for each row j below row i
for j = i+1 to n
… add a multiple of row i to row j
for k = i to n
A(j,k) = A(j,k) - (A(j,i)/A(i,i)) * A(i,k)
° Remove computation of constant A(j,i)/A(i,i) from
inner loop
for i = 1 to n-1
for j = i+1 to n
m = A(j,i)/A(i,i)
for k = i to n
A(j,k) = A(j,k) - m * A(i,k)
107
Refine GE Algorithm (2)
° Last version
for i = 1 to n-1
for j = i+1 to n
m = A(j,i)/A(i,i)
for k = i to n
A(j,k) = A(j,k) - m * A(i,k)
° Don’t compute what we already know:
zeros below diagonal in column i
for i = 1 to n-1
for j = i+1 to n
m = A(j,i)/A(i,i)
for k = i+1 to n
A(j,k) = A(j,k) - m * A(i,k)
108
Refine GE Algorithm (3)
° Last version
for i = 1 to n-1
for j = i+1 to n
m = A(j,i)/A(i,i)
for k = i+1 to n
A(j,k) = A(j,k) - m * A(i,k)
° Store multipliers m below diagonal in zeroed entries
for later use
for i = 1 to n-1
for j = i+1 to n
A(j,i) = A(j,i)/A(i,i)
for k = i+1 to n
A(j,k) = A(j,k) - A(j,i) * A(i,k)
109
Refine GE Algorithm (4)
° Last version
for i = 1 to n-1
for j = i+1 to n
A(j,i) = A(j,i)/A(i,i)
for k = i+1 to n
A(j,k) = A(j,k) - A(j,i) * A(i,k)
° Express using matrix operations (BLAS)
for i = 1 to n-1
A(i+1:n,i) = A(i+1:n,i) / A(i,i)
A(i+1:n,i+1:n) = A(i+1:n , i+1:n )
- A(i+1:n , i) * A(i , i+1:n)
110
What GE really computes
for i = 1 to n-1
A(i+1:n,i) = A(i+1:n,i) / A(i,i)
A(i+1:n,i+1:n) = A(i+1:n , i+1:n ) - A(i+1:n , i) * A(i , i+1:n)
° Call the strictly lower triangular matrix of multipliers
M, and let L = I+M
° Call the upper triangle of the final matrix U
° Lemma (LU Factorization): If the above algorithm
terminates (does not divide by zero) then A = L*U
° Solving A*x=b using GE
• Factorize A = L*U using GE
(cost = 2/3 n3 flops)
• Solve L*y = b for y, using substitution (cost = n2 flops)
• Solve U*x = y for x, using substitution (cost = n2 flops)
° Thus A*x = (L*U)*x = L*(U*x) = L*y = b as desired
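The factor-then-substitute recipe above can be written out in a few lines. A hedged sketch without pivoting, so it assumes no zero pivots are encountered:

```python
import numpy as np

def lu_no_pivot(A):
    # In-place GE storing multipliers below the diagonal, as in the lemma
    A = A.astype(float).copy()
    n = len(A)
    for i in range(n - 1):
        A[i+1:, i] = A[i+1:, i] / A[i, i]                   # multipliers
        A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])   # rank-1 update
    L = np.tril(A, -1) + np.eye(n)   # L = I + M
    U = np.triu(A)
    return L, U

def solve_via_lu(A, b):
    L, U = lu_no_pivot(A)            # cost = 2/3 n^3 flops
    n = len(b)
    y = np.zeros(n)
    for i in range(n):               # forward substitution: L*y = b
        y[i] = b[i] - L[i, :i] @ y[:i]
    x = np.zeros(n)
    for i in reversed(range(n)):     # back substitution: U*x = y
        x[i] = (y[i] - U[i, i+1:] @ x[i+1:]) / U[i, i]
    return x

A = np.array([[4., 3., 2.], [2., 4., 1.], [1., 2., 3.]])
b = np.array([1., 2., 3.])
print(np.max(np.abs(A @ solve_via_lu(A, b) - b)))
```

The two O(n^2) substitution sweeps are cheap next to the O(n^3) factorization, so factoring once and reusing L and U for many right-hand sides b pays off.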
111
Problems with basic GE algorithm
° What if some A(i,i) is zero? Or very small?
• Result may not exist, or be “unstable”, so need to pivot
° Current computation all BLAS 1 or BLAS 2, but we know that
BLAS 3 (matrix multiply) is fastest (Lecture 2)
for i = 1 to n-1
A(i+1:n,i) = A(i+1:n,i) / A(i,i)
… BLAS 1 (scale a vector)
A(i+1:n,i+1:n) = A(i+1:n , i+1:n ) … BLAS 2 (rank-1 update)
- A(i+1:n , i) * A(i , i+1:n)
[Figure: Mflop/s vs. order of vectors/matrices (10 to 500) on an IBM
RS/6000 Power 3 (200 MHz, 800 Mflop/s peak): BLAS 3 approaches peak,
while BLAS 2 and BLAS 1 level off far below it]
112
Pivoting in Gaussian Elimination
° GE with no pivoting on
      A = [ 0  1 ]
          [ 1  0 ]
  fails completely, even though A is “easy”
° Illustrate problems in 3-decimal digit arithmetic:
      A = [ 1e-4  1 ]  and  b = [ 1 ],  correct answer to 3 places is x = [ 1 ]
          [ 1     1 ]          [ 2 ]                                     [ 1 ]
° Result of LU decomposition is
      L = [ 1           0 ] = [ 1    0 ]     … no roundoff error yet
          [ fl(1/1e-4)  1 ]   [ 1e4  1 ]
      U = [ 1e-4  1           ] = [ 1e-4   1   ]  … error in 4th decimal place
          [ 0     fl(1-1e4*1) ]   [ 0    -1e4 ]
      Check: L*U = [ 1e-4  1 ]     … (2,2) entry entirely wrong
                   [ 1     0 ]
° Algorithm “forgets” (2,2) entry, gets same L and U for all |A(2,2)|<5
° Numerical instability
° Computed solution x totally inaccurate
° Cure: Pivot (swap rows of A) so entries of L and U bounded
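The 3-digit example can be replayed in code by rounding every intermediate result to 3 significant digits. A sketch, where `fl3` simulates the toy arithmetic:

```python
def fl3(x):
    # Round to 3 significant decimal digits, simulating 3-digit arithmetic
    return float(f"{x:.2e}")

def solve_2x2(A, b):
    # GE on a 2x2 system with every operation rounded by fl3
    m   = fl3(A[1][0] / A[0][0])            # multiplier
    u22 = fl3(A[1][1] - fl3(m * A[0][1]))   # (2,2) entry of U
    y2  = fl3(b[1] - fl3(m * b[0]))         # forward substitution
    x2  = fl3(y2 / u22)                     # back substitution
    x1  = fl3(fl3(b[0] - fl3(A[0][1] * x2)) / A[0][0])
    return [x1, x2]

A, b = [[1e-4, 1.0], [1.0, 1.0]], [1.0, 2.0]
print(solve_2x2(A, b))                 # no pivoting: x1 is entirely wrong
print(solve_2x2(A[::-1], b[::-1]))     # rows swapped (pivoted): correct
```

Without pivoting the huge multiplier 1e4 wipes out the (2,2) entry and gives x = [0, 1]; after swapping the rows the multiplier is 1e-4 and the same arithmetic returns x = [1, 1].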
113
Gaussian Elimination with Partial Pivoting (GEPP)
° Partial Pivoting: swap rows so that each multiplier
|L(i,j)| = |A(j,i)/A(i,i)| <= 1
for i = 1 to n-1
find and record k where |A(k,i)| = max{i <= j <= n} |A(j,i)|
… i.e. largest entry in rest of column i
if |A(k,i)| = 0
exit with a warning that A is singular, or nearly so
elseif k != i
swap rows i and k of A
end if
A(i+1:n,i) = A(i+1:n,i) / A(i,i)
… each quotient lies in [-1,1]
A(i+1:n,i+1:n) = A(i+1:n , i+1:n ) - A(i+1:n , i) * A(i , i+1:n)
° Lemma: This algorithm computes A = P*L*U, where P is a
permutation matrix
° Since each entry of |L(i,j)| <= 1, this algorithm is considered
numerically stable
° For details see LAPACK code at www.netlib.org/lapack/single/sgetf2.f
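The GEPP loop above translates line for line into numpy. A sketch that returns the permutation as an index vector rather than an explicit matrix P:

```python
import numpy as np

def gepp(A):
    # Partial pivoting: swap rows so every multiplier satisfies |L(i,j)| <= 1
    A = A.astype(float).copy()
    n = len(A)
    perm = np.arange(n)
    for i in range(n - 1):
        k = i + np.argmax(np.abs(A[i:, i]))   # largest entry in rest of col i
        if A[k, i] == 0:
            raise ValueError("A is singular, or nearly so")
        if k != i:
            A[[i, k]] = A[[k, i]]             # swap rows i and k of A
            perm[[i, k]] = perm[[k, i]]
        A[i+1:, i] /= A[i, i]                 # each quotient lies in [-1, 1]
        A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])
    L = np.tril(A, -1) + np.eye(n)
    U = np.triu(A)
    return perm, L, U

A = np.array([[1e-4, 1.0], [1.0, 1.0]])
perm, L, U = gepp(A)
print(np.allclose(A[perm], L @ U))     # P*A = L*U
print(np.max(np.abs(L)))               # bounded by 1, as the lemma promises
```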
114
Converting BLAS2 to BLAS3 in GEPP
° Blocking
• Used to optimize matrix-multiplication
• Harder here because of data dependencies in GEPP
• Save updates to “trailing matrix” from several consecutive BLAS2
• Apply many saved updates simultaneously in one BLAS3
operation
° Same idea works for much of dense linear algebra
• Open questions remain
° Need to choose a block size b
• Algorithm will save and apply b updates
• b must be small enough so that active submatrix consisting of b
columns of A fits in cache
• b must be large enough to make BLAS3 fast
115
Blocked GEPP (www.netlib.org/lapack/single/sgetrf.f)
for ib = 1 to n-1 step b … Process matrix b columns at a time
end = ib + b-1
… Point to end of block of b columns
apply BLAS2 version of GEPP to get A(ib:n , ib:end) = P’ * L’ * U’
… let LL denote the strict lower triangular part of A(ib:end , ib:end) + I
A(ib:end , end+1:n) = LL^-1 * A(ib:end , end+1:n)
… update next b rows of U
A(end+1:n , end+1:n ) = A(end+1:n , end+1:n )
- A(end+1:n , ib:end) * A(ib:end , end+1:n)
… apply delayed updates with single matrix-multiply
… with inner dimension b
(For a correctness proof,
see on-line notes.)
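The three steps of the loop body (panel factorization, triangular solve for the next rows of U, delayed matrix-multiply update) can be sketched as follows. Pivoting is omitted here to keep the sketch short, so it assumes the panel pivots are safe:

```python
import numpy as np

def blocked_lu(A, b=2):
    # Right-looking blocked LU, processing b columns of the matrix at a time
    A = A.astype(float).copy()
    n = len(A)
    for ib in range(0, n - 1, b):
        end = min(ib + b, n)
        # BLAS2-style factorization of the panel A[ib:n, ib:end]
        for i in range(ib, min(end, n - 1)):
            A[i+1:, i] /= A[i, i]
            A[i+1:, i+1:end] -= np.outer(A[i+1:, i], A[i, i+1:end])
        # Update next b rows of U: forward-substitute with unit-lower LL
        for i in range(ib + 1, end):
            A[i, end:] -= A[i, ib:i] @ A[ib:i, end:]
        # Apply the b delayed updates with a single matrix multiply
        A[end:, end:] -= A[end:, ib:end] @ A[ib:end, end:]
    L = np.tril(A, -1) + np.eye(n)
    U = np.triu(A)
    return L, U

rng = np.random.default_rng(0)
A = rng.random((6, 6)) + 6 * np.eye(6)   # diagonally dominant: safe pivots
L, U = blocked_lu(A, b=2)
print(np.allclose(L @ U, A))
```

The final rank-b update is where the BLAS3 speed comes from; the panel and triangular-solve steps touch only O(n*b) data.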
116
Efficiency of Blocked GEPP
117
Overview of LAPACK
° Standard library for dense/banded linear algebra
• Linear systems: A*x=b
• Least squares problems: minx || A*x-b ||2
• Eigenvalue problems: Ax = lx, Ax = lBx
• Singular value decomposition (SVD): A = USVT
° Algorithms reorganized to use BLAS3 as much as
possible
° Basis of math libraries on many computers
° Many algorithmic innovations remain
• Projects available
118
Performance of LAPACK (n=1000)
119
Performance of LAPACK (n=100)
120
Parallelizing Gaussian Elimination
° parallelization steps
• Decomposition: identify enough parallel work, but not too much
• Orchestrate: communication and synchronization
• Mapping: which processors execute which threads
° Decomposition
• In BLAS 2 algorithm nearly each flop in inner loop can be done in
parallel, so with n2 processors, need 3n parallel steps
for i = 1 to n-1
A(i+1:n,i) = A(i+1:n,i) / A(i,i)
… BLAS 1 (scale a vector)
A(i+1:n,i+1:n) = A(i+1:n , i+1:n ) … BLAS 2 (rank-1 update)
- A(i+1:n , i) * A(i , i+1:n)
• This is too fine-grained, prefer calls to local matmuls instead
121
Assignment of parallel work in GE
° Think of assigning submatrices to threads, where
each thread responsible for updating submatrix it
owns
• “owner computes” rule natural because of locality
° What should submatrices look like to achieve load
balance?
122
Different Data Layouts for Parallel GE (on 4 procs)
[Figure: candidate layouts and their trade-offs:
• Column Blocked: P0 idle after first n/4 steps
• Column Block Cyclic: trades load balance and BLAS2/3 performance by
choosing b, but factorization of a block column is a bottleneck
• Row and Column Block Cyclic: can use BLAS2 or BLAS3; the winner!]
123
Blocked Partitioned Algorithms
° LU Factorization
° Cholesky factorization
° Symmetric indefinite factorization
° Matrix inversion
° QR, QL, RQ, LQ factorizations
° Form Q or Q^T*C
° Orthogonal reduction to:
• (upper) Hessenberg form
• symmetric tridiagonal form
• bidiagonal form
° Block QR iteration for nonsymmetric eigenvalue problems
Memory Hierarchy
and LAPACK
° ijk-implementations of matrix multiply:
for _ = 1:n
   for _ = 1:n
      for _ = 1:n
         a(i,j) = a(i,j) + b(i,k)*c(k,j)
      end
   end
end
° The loop order affects the order in which data are referenced; some
orders are better at keeping data in the higher levels of the memory
hierarchy.
° Applies to matrix multiply and reductions to condensed form
• May do slightly more flops
• Up to 3 times faster
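Two of the six orderings can be compared directly. A small illustration of my own (not from the slides) showing that they compute the same product while walking memory in different patterns:

```python
import numpy as np

def matmul_ijk(b, c):
    # ijk order: inner loop walks a column of c (large stride in row-major)
    n = len(b)
    a = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            for k in range(n):
                a[i, j] += b[i, k] * c[k, j]
    return a

def matmul_ikj(b, c):
    # ikj order: inner loop walks rows of a and c (unit stride in row-major)
    n = len(b)
    a = np.zeros((n, n))
    for i in range(n):
        for k in range(n):
            for j in range(n):
                a[i, j] += b[i, k] * c[k, j]
    return a

rng = np.random.default_rng(1)
b, c = rng.random((8, 8)), rng.random((8, 8))
print(np.allclose(matmul_ijk(b, c), matmul_ikj(b, c)))
```

Both perform the same flops; only the memory access pattern differs, which is exactly what the slide means by "some orders are better at keeping data in the higher levels of the memory hierarchy."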
Derivation of Blocked Algorithms
Cholesky Factorization A = U^T*U, partitioned around the jth column
(entries marked * are not needed for the jth column):

    [ A11    a_j    A13 ]   [ U11^T   0      0     ]   [ U11   u_j    U13 ]
    [ a_j^T  a_jj   *   ] = [ u_j^T   u_jj   0     ] * [  0    u_jj   *   ]
    [ A13^T  *      A33 ]   [ U13^T   *      U33^T ]   [  0     0     U33 ]

Equating coefficients of the jth column, we obtain

    a_j  = U11^T * u_j
    a_jj = u_j^T * u_j + u_jj^2

Hence, if U11 has already been computed, we can compute u_j and u_jj
from the equations:

    U11^T * u_j = a_j
    u_jj = sqrt( a_jj - u_j^T * u_j )
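The recurrence translates directly into a bordered Cholesky routine. A sketch, using an explicit forward substitution in place of a triangular-solve library call:

```python
import numpy as np

def cholesky_bordered(A):
    # Compute upper triangular U with A = U^T * U, one column at a time
    n = len(A)
    U = np.zeros((n, n))
    for j in range(n):
        # Solve U11^T * u_j = a_j by forward substitution (U11 = U[:j,:j])
        u = np.zeros(j)
        for i in range(j):
            u[i] = (A[i, j] - U[:i, i] @ u[:i]) / U[i, i]
        U[:j, j] = u
        U[j, j] = np.sqrt(A[j, j] - u @ u)   # u_jj = sqrt(a_jj - u_j^T u_j)
    return U

A = np.array([[4., 2., 2.], [2., 5., 3.], [2., 3., 6.]])
U = cholesky_bordered(A)
print(np.allclose(U.T @ U, A))
```

This is exactly the structure of the LINPACK SPOFA loop shown next: one triangular solve and one inner product per column, then a square root on the diagonal.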
126
LINPACK Implementation
° Here is the body of the LINPACK routine SPOFA which
implements the method:
      DO 30 J = 1, N
         INFO = J
         S = 0.0E0
         JM1 = J - 1
         IF( JM1.LT.1 ) GO TO 20
         DO 10 K = 1, JM1
            T = A( K, J ) - SDOT( K-1, A( 1, K ), 1, A( 1, J ), 1 )
            T = T / A( K, K )
            A( K, J ) = T
            S = S + T*T
   10    CONTINUE
   20    CONTINUE
         S = A( J, J ) - S
C     ......EXIT
         IF( S.LE.0.0E0 ) GO TO 40
         A( J, J ) = SQRT( S )
   30 CONTINUE
127
LAPACK Implementation
      DO 10 J = 1, N
         CALL STRSV( 'Upper', 'Transpose', 'Non-Unit', J-1, A, LDA,
     $               A( 1, J ), 1 )
         S = A( J, J ) - SDOT( J-1, A( 1, J ), 1, A( 1, J ), 1 )
         IF( S.LE.ZERO ) GO TO 20
         A( J, J ) = SQRT( S )
   10 CONTINUE
° This change by itself is sufficient to significantly improve the
performance on a number of machines.
° From 238 to 312 Mflop/s for a matrix of order 500 on a Pentium 4-1.7
GHz.
° However, the peak is 1,700 Mflop/s.
° This suggests that further work is needed.
128
Derivation of Blocked Algorithms
Partition A = U^T*U into 3-by-3 blocks:

    [ A11    A12    A13 ]   [ U11^T   0      0     ]   [ U11   U12   U13 ]
    [ A12^T  A22    A23 ] = [ U12^T   U22^T  0     ] * [  0    U22   U23 ]
    [ A13^T  A23^T  A33 ]   [ U13^T   U23^T  U33^T ]   [  0     0    U33 ]

Equating coefficients of the second block of columns, we obtain

    A12 = U11^T * U12
    A22 = U12^T * U12 + U22^T * U22

Hence, if U11 has already been computed, we can compute U12 as the
solution of the following equations by a call to the Level 3 BLAS
routine STRSM:

    U11^T * U12 = A12
    U22^T * U22 = A22 - U12^T * U12
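A sketch of the blocked factorization, using numpy solves in place of the STRSM/SSYRK/SPOTF2 calls (the block size nb is arbitrary):

```python
import numpy as np

def blocked_cholesky(A, nb=2):
    # Blocked A = U^T * U, processing nb columns at a time
    A = A.astype(float).copy()
    n = len(A)
    U = np.zeros((n, n))
    for j in range(0, n, nb):
        jb = min(nb, n - j)
        if j > 0:
            # U12 from U11^T * U12 = A12 (the STRSM step)
            U[:j, j:j+jb] = np.linalg.solve(U[:j, :j].T, A[:j, j:j+jb])
        # A22 - U12^T * U12 (the SSYRK step), then factor it (SPOTF2)
        S = A[j:j+jb, j:j+jb] - U[:j, j:j+jb].T @ U[:j, j:j+jb]
        U[j:j+jb, j:j+jb] = np.linalg.cholesky(S).T
    return U

A = np.array([[4., 2., 2., 1.],
              [2., 5., 3., 2.],
              [2., 3., 6., 1.],
              [1., 2., 1., 7.]])
U = blocked_cholesky(A, nb=2)
print(np.allclose(U.T @ U, A))
```

Almost all the flops land in the two block operations (the triangular solve and the rank-jb update), which is why the Level 3 variant in the table below runs so much faster than the Level 1 and 2 variants.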
129
LAPACK Blocked Algorithms
      DO 10 J = 1, N, NB
         JB = MIN( NB, N-J+1 )
         CALL STRSM( 'Left', 'Upper', 'Transpose', 'Non-Unit', J-1, JB,
     $               ONE, A, LDA, A( 1, J ), LDA )
         CALL SSYRK( 'Upper', 'Transpose', JB, J-1, -ONE, A( 1, J ), LDA,
     $               ONE, A( J, J ), LDA )
         CALL SPOTF2( 'Upper', JB, A( J, J ), LDA, INFO )
         IF( INFO.NE.0 ) GO TO 20
   10 CONTINUE
•On Pentium 4, Level 3 BLAS squeezes a lot more out of 1 proc

Intel Pentium 4, 1.7 GHz, N = 500:
Variant                     Rate of Execution
Linpack variant (L1 BLAS)    238 Mflop/s
Level 2 BLAS variant         312 Mflop/s
Level 3 BLAS variant        1262 Mflop/s
130
LAPACK Contents
° Combines algorithms from LINPACK and EISPACK
into a single package. User interface similar to
LINPACK.
° Built on the Level 1, 2, and 3 BLAS for high performance
(manufacturers optimize BLAS)
° LAPACK does not provide routines for structured problems or general
sparse matrices (i.e., sparse storage formats such as compressed-row,
-column, -diagonal, skyline ...)
LAPACK Ongoing Work
° Add functionality
• updating/downdating, divide and conquer least squares, bidiagonal bisection,
bidiagonal inverse iteration, band SVD, Jacobi methods, ...
° Move to new generation of high performance machines
• IBM SPs, CRAY T3E, SGI Origin, clusters of workstations
° New challenges
• New languages: FORTRAN 90, HP FORTRAN, ...
• Many flavors of message passing (CMMD, MPL, NX, ...), need a
standard (PVM, MPI): BLACS
° Highly varying ratio of computational speed to communication speed
° Many ways to lay out data
° Fastest parallel algorithm sometimes less stable numerically
History of Block Partitioned Algorithms
° Early algorithms involved use of small main memory
using tapes as secondary storage.
° Recent work centers on use of vector registers, level
1 and 2 cache, main memory, and “out of core”
memory.
```