Lecture 6: Linear Algebra Algorithms Jack Dongarra, U of Tennessee Slides are adapted from Jim Demmel, UCB’s Lecture on Linear Algebra Algorithms /2 Homework #3 – Grading Rules part1: (2 points) Write (1) code produce correct result: 2/2. (2) code run but could not give accurate result: 1.5/2. (3) code could not run or run but nothing: 1/2 ° ° ° ° Note: If you only submit one program which can print "processor contribution" and "total integral" correctly, I will give 1/2 for this part and 2/2 for part2. so you total score for part1 and part2 will be 3/4. part2: (2 points) Modify 1 (cos x ) / 2 2 your program to sum the contributions to give the total value of the integral. (1) code produce correct result: 2/2. (2) code run but could not give accurate result: 1.5/2. (3) code could not run or run but nothing: 1/2 part3: (4 points) Give an expression for the efficiency ° (1) give t_p(1), t_p(p), S(p) and E(p) correctly: 4/4. ° (2) give t_p(1), t_p(p) and S(p) correctly: 3/4. ° (3) give t_p(1) and t_p(p) correctly: 2/4. ° (4) give other: 1/4. part4: (2 points) Comment ° ° ° ° ° ° ° 0 dx an MPI program ° ° ° ° ° ° of the parallel algorithm for m subintervals on the scalability of the algorithm (1) contain sentence similar as "if work being evenly divided and the summation can be performed in a tree fashion, the algorithm is scalable" or give very detail discussion correctly about: (a) t_(p), (b) S(p) and (c) E(P). 2/2 (2) give correct discussion about any of: (a) t_(p), (b) S(p) and (c) E(P). 1/2 (3) do not make sense. 0/2 2 Adding numbers together ° If we are adding an array of numbers and have a choice do we add them from largest to smallest or smallest to largest? 3 Parallel Performance Metrics • Absolute: Elapsed (wall-clock) Time = T(n ) • Speedup = S(n ) = T(1) / T(n ) where T1 is the time for the best serial implementation. => Performance improvement due to parallelsm • Parallel Efficiency = E(n ) = T(1) / n T(n ) • Ideal Speedup = SI (n ) = n - Theoretical limit; obtainable rarely - Ignores all of real life • These definitions apply to a fixed-problem experiment. 4 Amdahl’s Law Amdahl’s Law places a strict limit on the speedup that can be realized by using multiple processors. Two equivalent expressions for Amdahl’s Law are given below: tN = (fp/N + fs)t1 Effect of multiple processors on run time S = 1/(fs + fp/N) Effect of multiple processors on speedup Where: fs = serial fraction of code fp = parallel fraction of code = 1 - fs N = number of processors 5 Illustration of Amdahl’s Law It takes only a small fraction of serial content in a code to degrade the parallel performance. It is essential to determine the scaling behavior of your code before doing production runs using large numbers of processors 250 fp = 1.000 speedup 200 fp = 0.999 fp = 0.990 150 fp = 0.900 100 50 0 0 50 100 150 200 250 Number of processors 6 Amdahl’s Law Vs. Reality Amdahl’s Law provides a theoretical upper limit on parallel speedup assuming that there are no costs for communications. In reality, communications (and I/O) will result in a further degradation of performance. 80 fp = 0.99 70 speedup 60 50 Amdahl's Law 40 Reality 30 20 10 0 0 So what’s going on? 50 150 100 Number of processors 200 250 7 More on Amdahl’s Law ° Amdahl’s Law can be generalized to any two processes of with different speeds ° Ex.: Apply to fprocessor and fmemory: • The growing processor-memory performance gap will undermine our efforts at achieving maximum possible speedup! 
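To make the effect concrete, here is a minimal sketch that evaluates the speedup formula S = 1/(fs + fp/N) for a few processor counts; the serial fraction fs = 0.01 is an assumed value chosen only for illustration.

program amdahl_speedup
  implicit none
  integer, parameter :: nvals(5) = (/ 4, 16, 64, 256, 1024 /)
  real    :: fs, fp, s
  integer :: k
  fs = 0.01                                  ! assumed serial fraction (illustrative only)
  fp = 1.0 - fs                              ! parallel fraction
  do k = 1, size(nvals)
     s = 1.0 / (fs + fp / real(nvals(k)))    ! S = 1/(fs + fp/N)
     print '(a,i5,a,f8.2)', ' N =', nvals(k), '   speedup =', s
  end do
  ! As N grows, S approaches 1/fs = 100 no matter how many processors are used.
end program amdahl_speedup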
8 Gustafson’s Law ° Thus, Amdahl’s Law predicts that there is a maximum scalability for an application, determined by its parallel fraction, and this limit is generally not large. tN = (fp/N + fs)t1 S = 1/(fs + fp/N) Effect of multiple processors on run time Effect of multiple processors on speedup Where: fs = serial fraction of code fp = parallel fraction of code = 1 - fs N = number of processors ° There is a way around this: increase the problem size • bigger problems mean bigger grids or more particles: bigger arrays • number of serial operations generally remains constant; number of parallel operations increases: parallel fraction increases 9 Fixed-Problem Size Scaling • a.k.a. Fixed-load, Fixed-Problem Size, Strong Scaling, Problem-Constrained, constant-problem size (CPS), variable subgrid 1 • Amdahl Limit: SA(n) = T(1) / T(n) = ---------------f/n+(1-f) • This bounds the speedup based only on the fraction of the code that cannot use parallelism ( 1- f ); it ignores all other factors • SA --> 1 / ( 1- f ) as n --> 10 Fixed-Problem Size Scaling (Cont’d) • Efficiency (n) = T(1) / [ T(n) * n] • Memory requirements decrease with n • Surface-to-volume ratio increases with n • Superlinear speedup possible from cache effects • Motivation: what is the largest # of procs I can use effectively and what is the fastest time that I can solve a given problem? • Problems: - Sequential runs often not possible (large problems) - Speedup (and efficiency) is misleading if processors are slow 11 Fixed-Problem Size Scaling: Examples S. Goedecker and Adolfy Hoisie, Achieving High Performance in Numerical Computations on RISC Workstations and Parallel Systems,International Conference on Computational Physics: PC'97 Santa Cruz, August 25-28 1997. 12 Fixed-Problem Size Scaling Examples 13 Scaled Speedup Experiments • a.k.a. Fixed Subgrid-Size, Weak Scaling, Gustafson scaling. • Motivation: Want to use a larger machine to solve a larger global problem in the same amount of time. • Memory and surface-to-volume effects remain constant. 
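The Gustafson's Law slide above repeats Amdahl's formulas but does not state the scaled-speedup expression itself; the standard form, S_scaled(N) = fs + fp*N, is used in the sketch below (it is not given on the slides), again with fs = 0.01 as an assumed illustrative value.

program scaled_speedup
  implicit none
  real    :: fs, fp, s_fixed, s_scaled
  integer :: n
  fs = 0.01                              ! assumed serial fraction (illustrative only)
  fp = 1.0 - fs
  n  = 256                               ! processor count used for the comparison
  s_fixed  = 1.0 / (fs + fp/real(n))     ! Amdahl: problem size held fixed
  s_scaled = fs + fp*real(n)             ! Gustafson: problem grows with N
  print *, 'fixed-size (Amdahl) speedup =', s_fixed    ! roughly 72, bounded by 1/fs
  print *, 'scaled (Gustafson) speedup  =', s_scaled   ! roughly 253, nearly N
end program scaled_speedup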
14 Top500 Data Rmax [TF/s] R peak [TF/s] Ratio Max/Peak # Proc 7.23 12 60.25 8192 4.06 6.05 NERSC/LBNL 3.05 5.00 61.00 3328 Sandia National Laboratory 2.38 3.207 74.21 9632 ASCI Blue Pacific SST, IBM SP 604E Lawrence Livermore National Laboratory 2.14 3.868 55.33 5808 Compaq AlphaServer SC ES45 1 GHz Los Alamos National Laboratory 2.10 3.04 69.08 1536 7 Hitachi SR8000/MPP 1.71 2.69 63.57 1152 8 SGI ASCI Blue Mountain 1.61 2.52 63.89 6144 9 IBM SP Power3 375Mhz Naval Oceanographic Office (NAVOCEANO) 1.42 1.97 72.08 1336 10 IBM SP Power3 375Mhz Deutscher Wetterdienst 1.29 1.83 70.49 Rank Manufacturer Computer Installation Site 1 IBM ASCI White SP Power3 Lawrence Livermore National Laboratory 2 Compaq AlphaServer SC ES45 1 GHz 3 IBM SP Power3 375 MHz 4 Intel ASCI Red 5 IBM 6 Pittsburgh Supercomputing Center University of Tokyo Los Alamos National Laboratory 67.11 3024 15 1280 Example of a Scaled Speedup Experiment Processors NChains 1 2 4 8 16 32 64 128 256 512 Time 32 64 128 256 512 940 1700 2800 4100 5300 38.4 38.4 38.5 38.6 38.7 35.7 32.7 27.4 20.75 14.49 Natoms 2368 4736 9472 18944 37888 69560 125800 207200 303400 392200 Time per Atom per PE 1.62E-02 8.11E-03 4.06E-03 2.04E-03 1.02E-03 5.13E-04 2.60E-04 1.32E-04 6.84E-05 3.69E-05 Time Efficiency per Atom 1.62E-02 1.000 1.62E-02 1.000 1.63E-02 0.997 1.63E-02 0.995 1.63E-02 0.992 1.64E-02 0.987 1.66E-02 0.975 1.69E-02 0.958 1.75E-02 0.926 1.89E-02 0.857 TBON on ASCI Red Efficiency 1.040 0.940 0.840 0.740 0.640 0.540 0.440 0 200 400 600 16 Parallel Performance Metrics: Speedup Relative performance: Absolute performance: 60 2000 T3E T3E Ideal O2K O2K Ideal 1750 O2K 50 MFLOPS Speedup T3E Ideal 40 30 1500 1250 1000 750 20 500 10 250 0 0 0 10 20 30 40 50 Processors 60 0 8 16 24 32 40 48 Processors Speedup is only one characteristic of a program - it is not synonymous with performance. In this comparison of two machines the code achieves comparable speedups but one of the machines is faster. 17 18 Improving Ratio of Floating Point Operations to Memory Accesses subroutine mult(n1,nd1,n2,nd2,y,a,x) implicit real*8 (a-h,o-z) dimension a(nd1,nd2),y(nd2),x(nd1) do 20, i=1,n1 t=0.d0 do 10, j=1,n2 t=t+a(j,i)*x(j) 10 continue y(i)=t **** 2 FLOPS **** 2 LOADS 20 continue return end Unroll the loops 19 Improving Ratio of Floating Point Operations to Memory Accesses c works correctly when n1,n2 are multiples of 4 dimension a(nd1,nd2), y(nd2), x(nd1) do i=1,n1-3,4 t1=0.d0 t2=0.d0 t3=0.d0 t4=0.d0 do j=1,n2-3,4 t1=t1+a(j+0,i+0)*x(j+0)+a(j+1,i+0)*x(j+1)+ 1 a(j+2,i+0)*x(j+2)+a(j+3,i+1)*x(j+3) t2=t2+a(j+0,i+1)*x(j+0)+a(j+1,i+1)*x(j+1)+ 1 a(j+2,i+1)*x(j+2)+a(j+3,i+0)*x(j+3) t3=t3+a(j+0,i+2)*x(j+0)+a(j+1,i+2)*x(j+1)+ 1 a(j+2,i+2)*x(j+2)+a(j+3,i+2)*x(j+3) t4=t4+a(j+0,i+3)*x(j+0)+a(j+1,i+3)*x(j+1)+ 1 a(j+2,i+3)*x(j+2)+a(j+3,i+3)*x(j+3) enddo y(i+0)=t1 y(i+1)=t2 y(i+2)=t3 y(i+3)=t4 enddo 32 FLOPS 20 LOADS 20 Summary of Single-Processor Optimization Techniques (I) ° Spatial and temporal data locality ° Loop unrolling ° Blocking ° Software pipelining ° Optimization of data structures ° Special functions, library subroutines 21 Summary of Optimization Techniques (II) ° Achieving high-performance requires code restructuring. Minimization of memory traffic is the single most important goal. ° Compilers are getting better: good at software pipelining. But they are not there yet: can do loop transformations only in simple cases, usually fail to produce optimal blocking, heuristics for unrolling may not match your code well, etc. 
° The optimization process is machine-specific and requires detailed architectural knowledge. 22 Optimizing Matrix Addition for Caches ° Dimension A(n,n), B(n,n), C(n,n) ° A, B, C stored by column (as in Fortran) ° Algorithm 1: • for i=1:n, for j=1:n, A(i,j) = B(i,j) + C(i,j) ° Algorithm 2: • for j=1:n, for i=1:n, A(i,j) = B(i,j) + C(i,j) ° What is “memory access pattern” for Algs 1 and 2? ° Which is faster? ° What if A, B, C stored by row (as in C)? 23 Using a Simpler Model of Memory to Optimize ° Assume just 2 levels in the hierarchy, fast and slow ° All data initially in slow memory • m = number of memory elements (words) moved between fast and slow memory • tm = time per slow memory operation • f = number of arithmetic operations • tf = time per arithmetic operation < tm • q = f/m average number of flops per slow element access ° Minimum possible Time = f*tf, when all data in fast memory ° Actual Time = f*tf + m*tm = f*tf*(1 + (tm/tf)*(1/q)) ° Larger q means Time closer to minimum f*tf • Want large q 24 Simple example using memory model ° To see results of changing q, consider simple computation s=0 ° Assume tf=1 Mflop/s on fast memory for i = 1, n ° Assume moving data is tm = 10 s = s + h(X[i]) ° Assume h takes q flops ° Assume array X is in slow memory ° So m = n and f = q*n ° Time = read X + compute = 10*n + q*n ° Mflop/s = f/t = q/(10 + q) ° As q increases, this approaches the “peak” speed of 1 Mflop/s 25 Warm up: Matrix-vector multiplication y = y + A*x for i = 1:n for j = 1:n y(i) = y(i) + A(i,j)*x(j) A(i,:) + = y(i) y(i) * x(:) 26 Warm up: Matrix-vector multiplication y = y + A*x {read x(1:n) into fast memory} {read y(1:n) into fast memory} for i = 1:n {read row i of A into fast memory} for j = 1:n y(i) = y(i) + A(i,j)*x(j) {write y(1:n) back to slow memory} ° m = number of slow memory refs = 3*n + n2 ° f = number of arithmetic operations = 2*n2 ° q = f/m ~= 2 ° Matrix-vector multiplication limited by slow memory speed 27 Matrix Multiply C=C+A*B for i = 1 to n for j = 1 to n for k = 1 to n C(i,j) = C(i,j) + A(i,k) * B(k,j) C(i,j) A(i,:) C(i,j) = + * B(:,j) 28 Matrix Multiply C=C+A*B(unblocked, or untiled) for i = 1 to n {read row i of A into fast memory} for j = 1 to n {read C(i,j) into fast memory} {read column j of B into fast memory} for k = 1 to n C(i,j) = C(i,j) + A(i,k) * B(k,j) {write C(i,j) back to slow memory} C(i,j) A(i,:) C(i,j) = + * B(:,j) 29 q=ops/slow mem ref Matrix Multiply (unblocked, or untiled) Number of slow memory references on unblocked matrix multiply m = n3 + n2 read each column of B n times read each column of A once for each i + 2*n2 read and write each element of C once = n3 + 3*n2 So q = f/m = (2*n3)/(n3 + 3*n2) ~= 2 for large n, no improvement over matrix-vector mult C(i,j) A(i,:) C(i,j) = + * B(:,j) 30 Matrix Multiply (blocked, or tiled) Consider A,B,C to be N by N matrices of b by b subblocks where b=n/N is called the blocksize for i = 1 to N for j = 1 to N {read block C(i,j) into fast memory} for k = 1 to N {read block A(i,k) into fast memory} {read block B(k,j) into fast memory} C(i,j) = C(i,j) + A(i,k) * B(k,j) {do a matrix multiply on blocks} {write block C(i,j) back to slow memory} C(i,j) A(i,k) C(i,j) = + * B(k,j) n is the size of the matrix, N blocks of size b; n = N*b 31 Matrix Multiply (blocked or tiled) Why is this algorithm correct? 
q=ops/slow mem ref n is the size of the matrix, N blocks of size b; n = N*b Number of slow memory references on blocked matrix multiply m = N*n2 read each block of B N3 times (N3 * n/N * n/N) + N*n2 read each block of A N3 times + 2*n2 read and write each block of C once = (2*N + 2)*n2 So q = f/m = 2*n3 / ((2*N + 2)*n2) ~= n/N = b for large n So we can improve performance by increasing the blocksize b Can be much faster than matrix-vector multiplty (q=2) Limit: All three blocks from A,B,C must fit in fast memory (cache), so we cannot make these blocks arbitrarily large: 3*b2 <= M, so q ~= b <= sqrt(M/3) Theorem (Hong, Kung, 1981): Any reorganization of this algorithm (that uses only associativity) is limited to q =O(sqrt(M)) 32 Model ° As much as possible will be overlapped ° Dot Product: ACC = 0 do i = x,n ACC = ACC + x(i) y(i) end do ° Experiments done on an IBM RS6000/530 • 25 MHz • 2 cycle to complete FMA can be pipelined - => 50 Mflop/s peak FMA FMA • one cycle from cache FMA cycles FMA FMA FMA 33 DOT Operation - Data in Cache Do 10 I = 1, n T = T + X(I)*Y(I) 10 CONTINUE ° Theoretically, 2 loads for X(I) and Y(I), one FMA operation, no re-use of data ° Pseudo-assembler LOAD fp0,T label: LOAD fp1,X(I) LOAD fp2,Y(I) FMA fp0,fp0,fp1,fp2 BRANCH label: Load x Load y FMA Load x Load y FMA 1 result per cycle = 25 Mflop/s 34 Matrix-Vector Product ° DOT version DO 20 I = 1, M DO 10 J = 1, N Y(I) = Y(I) + A(I,J)*X(J) 10 CONTINUE 20 CONTINUE ° From Cache = 22.7 Mflops ° From Memory = 12.4 Mflops 35 Loop Unrolling DO 20 I = 1, M, 2 T1 = Y(I ) 10 T2 = Y(I+1) Depth 1 2 3 4 DO 10 J = 1, N Speed 25 33.3 37.5 40 50 T1 = T1 + A(I,J )*X(J) Measured 22.7 30.5 34.3 36.5 T2 = T2 + A(I+1,J)*X(J) Memory 12.7 12.7 12.6 12.4 CONTINUE Y(I ) = T1 ° unroll 1: 2 loads : 2 ops per 2 cycles Y(I+1) = T2 ° unroll 2: 3 loads : 4 ops per 3 cycles ° unroll 3: 4 loads : 6 ops per 4 cycles ° … ° 3 loads, 4 flops ° unroll n: n+1 loads : 2n ops per n+1 cycles ° Speed of y=y+ATx, N=48 ° problem: only so many registers 20 CONTINUE 36 Matrix Multiply ° DOT version - 25 Mflops in cache DO 30 J = 1, M DO 20 I = 1, M DO 10 K = 1, L C(I,J) = C(I,J) + A(I,K)*B(K,J) 10 20 CONTINUE CONTINUE 30 CONTINUE 37 How to Get Near Peak ° Inner loop: DO 30 J = 1, M, 2 • 4 loads, 8 operations, optimal. 
DO 20 I + 1, M, 2 T11 = C(I, J ) T12 = C(I, J+1) T21 = C(I+1,J ) T22 = C(I+1,J+1) ° In practice we have measured 48.1 out of a peak of 50 Mflop/s when in cache DO 10 K = 1, L T11 = T11 + A(I, K) *B(K,J ) T12 = T12 + A(I, K) *B(K,J+1) T21 = T21 + A(I+1,K)*B(K,J ) T22 = T22 + A(I+1,K)*B(K,J+1) 10 CONTINUE C(I, J ) = T11 C(I, J+1) = T12 C(I+1,J ) = T21 C(I+1,J+1) = T22 20 CONTINUE 30 CONTINUE 38 BLAS -- Introduction ° Clarity: code is shorter and easier to read, ° Modularity: gives programmer larger building blocks, ° Performance: manufacturers will provide tuned machine-specific BLAS, ° Program portability: machine dependencies are confined to the BLAS 39 Memory Hierarchy ° Key to high performance in effective use of memory hierarchy ° True on all architectures Register s L1 Cache L2 Cache Local Memory Remote Memory Secondary Memory 40 Level 1, 2 and 3 BLAS ° Level 1 BLAS Vector-Vector operations + * ° Level 2 BLAS MatrixVector operations ° Level 3 BLAS MatrixMatrix operations * + * 41 More on BLAS (Basic Linear Algebra Subroutines) ° Industry standard interface(evolving) ° Vendors, others supply optimized implementations ° History • BLAS1 (1970s): - vector operations: dot product, saxpy (y=*x+y), etc - m=2*n, f=2*n, q ~1 or less • BLAS2 (mid 1980s) - matrix-vector operations: matrix vector multiply, etc - m=n2, f=2*n2, q~2, less overhead - somewhat faster than BLAS1 • BLAS3 (late 1980s) - matrix-matrix operations: matrix matrix multiply, etc - m >= 4n2, f=O(n3), so q can possibly be as large as n, so BLAS3 is potentially much faster than BLAS2 ° Good algorithms used BLAS3 when possible (LAPACK) ° www.netlib.org/blas, www.netlib.org/lapack 42 Why Higher Level BLAS? ° Can only do arithmetic on data at the top of the hierarchy ° Higher level BLAS lets us do this BLAS M emor y Flops Flops/ Ref s M emor y Ref s Level 1 3n 2n 2/ 3 y= y+ x Level 2 n2 2 n2 2 4 n2 2 n3 n/ 2 y= y+ Ax Level 3 C= C+ AB Registers L1 Cache L2 Cache Local Memory Remote Memory Secondary43 Memory BLAS for Performance Intel Pentium 4 w/SSE2 1.7 GHz Level 3 BLAS 2000 Mflop/s 1500 1000 Level 2 BLAS Level 1 BLAS 500 0 10 100 200 300 400 500 Order of vector/Matrices ° Development of blocked algorithms important for performance 44 BLAS for Performance IBM RS/6000-590 (66 MHz, 264 Mflop/s Peak) 250 Level 3 BLAS Mflop/s 200 150 Level 2 BLAS 100 50 Level 1 BLAS 0 10 100 200 300 400 500 Order of vector/Matrices 45 BLAS for Performance Mflop/s Alpha EV 5/6 500MHz (1Gflop/s peak) 700 600 500 400 300 200 100 0 Level 3 BLAS Level 2 BLAS Level 1 BLAS 10 100 200 300 400 500 Order of vector/Matrices BLAS 3 (n-by-n matrix matrix multiply) vs BLAS 2 (n-by-n matrix vector multiply) vs BLAS 1 (saxpy of n vectors) 46 Fast linear algebra kernels: BLAS ° Simple linear algebra kernels such as matrixmatrix multiply ° More complicated algorithms can be built from these basic kernels. ° The interfaces of these kernels have been standardized as the Basic Linear Algebra Subroutines (BLAS). ° Early agreement on standard interface (~1980) ° Led to portable libraries for vector and shared memory parallel machines. ° On distributed memory, there is a less-standard interface called the PBLAS 47 Level 1 BLAS ° Operate on vectors or pairs of vectors • perform O(n) operations; • return either a vector or a scalar. ° saxpy • y(i) = a * x(i) + y(i), for i=1 to n. 
• s stands for single precision, daxpy is for double precision, caxpy for complex, and zaxpy for double complex, ° sscal y = a * x, for scalar a and vectors x,y ° sdot computes s = S n i=1 x(i)*y(i) 48 Level 2 BLAS ° Operate on a matrix and a vector; • return a matrix or a vector; • O(n2) operations ° sgemv: matrix-vector multiply • y = y + A*x • where A is m-by-n, x is n-by-1 and y is m-by-1. ° sger: rank-one update • • • • A = A + y*xT, i.e., A(i,j) = A(i,j)+y(i)*x(j) where A is m-by-n, y is m-by-1, x is n-by-1, strsv: triangular solve solves y=T*x for x, where T is triangular 49 Level 3 BLAS ° Operate on pairs or triples of matrices • returning a matrix; • complexity is O(n3). ° sgemm: Matrix-matrix multiplication • C = C +A*B, • where C is m-by-n, A is m-by-k, and B is k-by-n ° strsm: multiple triangular solve • solves Y = T*X for X, • where T is a triangular matrix, and X is a rectangular matrix. 50 Optimizing in practice ° Tiling for registers • loop unrolling, use of named “register” variables ° Tiling for multiple levels of cache ° Exploiting fine-grained parallelism within the processor • super scalar • pipelining ° Complicated compiler interactions ° Hard to do by hand (but you’ll try) ° Automatic optimization an active research area • PHIPAC: www.icsi.berkeley.edu/~bilmes/phipac • www.cs.berkeley.edu/~iyer/asci_slides.ps • ATLAS: www.netlib.org/atlas/index.html 51 BLAS -- References ° BLAS software and documentation can be obtained via: • WWW: http://www.netlib.org/blas, • (anonymous) ftp ftp.netlib.org: cd blas; get index • email [email protected] with the message: send index from blas ° Comments and questions can be addressed to: [email protected] 52 BLAS Papers ° C. Lawson, R. Hanson, D. Kincaid, and F. Krogh, Basic Linear Algebra Subprograms for Fortran Usage, ACM Transactions on Mathematical Software, 5:308--325, 1979. ° J. Dongarra, J. Du Croz, S. Hammarling, and R. Hanson, An Extended Set of Fortran Basic Linear Algebra Subprograms, ACM Transactions on Mathematical Software, 14(1):1--32, 1988. ° J. Dongarra, J. Du Croz, I. Duff, S. Hammarling, A Set of Level 3 Basic Linear Algebra Subprograms, ACM Transactions on Mathematical Software, 16(1):1--17, 1990. 53 Performance of BLAS ° BLAS are specially optimized by the vendor • Sun BLAS uses features in the Ultrasparc ° Big payoff for algorithms that can be expressed in terms of the BLAS3 instead of BLAS2 or BLAS1. ° The top speed of the BLAS3 ° Algorithms like Gaussian elimination organized so that they use BLAS3 54 How To Get Performance From Commodity Processors? ° Today’s processors can achieve high-performance, but this requires extensive machine-specific hand tuning. ° Routines have a large design space w/many parameters • blocking sizes, loop nesting permutations, loop unrolling depths, software pipelining strategies, register allocations, and instruction schedules. • Complicated interactions with the increasingly sophisticated microarchitectures of new microprocessors. ° A few months ago no tuned BLAS for Pentium for Linux. ° Need for quick/dynamic deployment of optimized routines. ° ATLAS - Automatic Tuned Linear Algebra Software • PhiPac from Berkeley 55 Adaptive Approach for Level 3 ° Do a parameter study of the operation on the target machine, done once. 
° Only generated code is the on-chip multiply ° BLAS operation written in terms of the generated on-chip multiply ° All transpose cases coerced through data copy to 1 case of on-chip multiply • Only 1 case generated per platform [Diagram: on-chip multiply of NB-sized blocks, C (M x N) = C + A (M x K) * B (K x N)] 56
Code Generation Strategy ° On-chip multiply optimizes for: • TLB access • L1 cache reuse • FP unit usage • Memory fetch ° Code is iteratively generated & timed until optimal case is found. We try: • Differing NBs • M, N and K loop unrolling • Register reuse • Loop overhead minimization • Breaking false dependencies ° Takes a couple of hours to run. 57
[Figure: 500x500 Double Precision Matrix-Matrix Multiply Across Multiple Architectures, Mflop/s — ATLAS Matrix Multiply vs. Vendor Matrix Multiply on Sun Ultra2 Model 2200, Sun Darwin-270, Sun Microsparc II Model 70, SGI R10000ip27, SGI R8000ip21, SGI R5000, SGI R4600, Pentium II-266, Pentium Pro-200, Pentium MMX-150, IBM PowerPC 604e-332, IBM Power2-135, HP9000/735/125, HP PA8000 180MHz, DEC Alpha 21164a-433] 58
[Figure: 500 x 500 Double Precision LU Factorization Performance Across Multiple Architectures, MFLOPS — LU w/Vendor BLAS vs. LU w/ATLAS & GEMM-based BLAS, on machines including Sun Darwin-270, SGI R10000ip27, SGI R5000, Sun Ultra2 Model 2200, Pentium II-266, Pentium Pro-200, IBM PowerPC 604e-332, IBM Power2-135, HP PA8000, DEC Alpha 21164a-433, DCG LX 21164a-533] 59
[Figure: 500x500 gemm-based BLAS on SGI R10000ip28, MFLOPS — DGEMM, DSYMM, DSYR2K, DSYRK, DTRMM, DTRSM; Vendor BLAS vs. ATLAS/SSBLAS vs. Reference BLAS] 60
[Figure: 500x500 gemm-based BLAS on UltraSparc 2200, MFLOPS by Level 3 BLAS Routine (DGEMM, DSYMM, DSYR2K, DSYRK, DTRMM, DTRSM) — Vendor BLAS vs. ATLAS/GEMM-based BLAS vs. Reference BLAS] 61
Recursive Approach for Other Level 3 BLAS ° Recur down to L1 cache block size ° Need kernel at bottom of recursion • Use gemm-based kernel for portability [Diagram: recursive TRMM blocking] 62
[Figure: 500x500 Level 2 BLAS DGEMV across multiple architectures, MFLOPS — Vendor NoTrans vs. ATLAS NoTrans vs. F77 NoTrans] 63
[Figure: Multi-Threaded DGEMM, Intel PIII 550 MHz, Mflop/s vs. matrix size — Intel BLAS 1 proc, ATLAS 1 proc, Intel BLAS 2 proc, ATLAS 2 proc] 64
ATLAS ° Keep a repository of kernels for specific machines. ° Develop a means of dynamically downloading code ° Extend work to allow sparse matrix operations ° Extend work to include arbitrary code segments ° See: http://www.netlib.org/atlas/ 65
Algorithms and Architecture ° The key to performance is to understand the algorithm and architecture interaction. ° A significant improvement in performance can be obtained by matching algorithm to the architecture or vice versa.
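The blocked (tiled) matrix multiply analyzed earlier, and the kind of on-chip kernel ATLAS generates and times automatically, can be written as the following minimal Fortran sketch. It assumes square matrices of order n with n a multiple of the block size nb; it illustrates the technique and is not tuned code.

subroutine blocked_gemm(n, nb, a, b, c)
  ! C = C + A*B with n-by-n matrices, tiled into nb-by-nb blocks
  ! so that each block triple (3*nb*nb words) fits in cache.
  implicit none
  integer, intent(in)    :: n, nb
  real,    intent(in)    :: a(n,n), b(n,n)
  real,    intent(inout) :: c(n,n)
  integer :: i, j, k, ii, jj, kk
  do jj = 1, n, nb
     do kk = 1, n, nb
        do ii = 1, n, nb
           ! "on-chip" multiply of one block triple:
           ! C(ii.., jj..) = C(ii.., jj..) + A(ii.., kk..) * B(kk.., jj..)
           do j = jj, jj + nb - 1
              do k = kk, kk + nb - 1
                 do i = ii, ii + nb - 1
                    c(i,j) = c(i,j) + a(i,k) * b(k,j)
                 end do
              end do
           end do
        end do
     end do
  end do
end subroutine blocked_gemm

Keeping all three nb-by-nb blocks in fast memory is exactly the 3*b^2 <= M constraint from the blocked matrix-multiply analysis, giving roughly q ~= b flops per slow-memory reference.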
Algorithm Issues °Use of memory hierarchy °Algorithm pre-fetching °Loop unrolling °Simulating higher precision arithmetic Blocking ° TLB Blocking - minimize TLB misses ° Cache Blocking - minimize cache misses ° Register Blocking - minimize load/stores Registers L1 Cache L2 Cache Local Memory Remote Memory Secondary Memory ° The general idea of blocking is to get the information to a high-speed storage and use it multiple times so as to amortize the cost of moving the data ° Cache blocking reduced traffic between memory and cache ° Register blocking reduces traffic between cache and CPU Loop Unrolling ° Reduces data dependency delay ° Exploits multiple functional units and quad load/stores effectively. ° Minimizes load/stores ° Reduces loop overheads ° Gives more flexibility to compiler in scheduling ° Facilitates algorithm pre-fetching. ° What about vector computing? What’s Wrong With Speedup T1/Tp ? ° Can lead to very false conclusions. ° Speedup in isolation without taking into account the speed of the processor is unrealistic and pointless. ° Speedup over what? ° T1/Tp • There is usually no doubt about Tp • Often considerable dispute over the meaning of T1 - Serial code? Same algorithm? Speedup ° Can be used to: • Study, in isolation, the scaling of one algorithm on one computer. • As a dimensionless variable in the theory of scaling. ° Should not be used to compare: • Different algorithms on the same computer • The same algorithm on different computers. • Different interconnection structures. Strassen’s Algorithm for Matrix Multiply C 11 C 21 C 12 A11 C 22 A 21 A12 B11 A 22 B 21 Usual Matrix Multiply C 11 A11 B11 A12 B 21 C 12 A11 B12 A12 B 22 C 21 A 21 B11 A 22 B 21 C 22 A 21 B12 A 22 B 22 B12 B 22 Strassen’s Algorithm P1 ( A1 1 A 2 2 )( B1 1 B 2 2 ) C 11 P1 P4 P5 P7 P2 ( A 2 1 A 2 2 ) B1 1 C 12 P3 P5 P3 A1 1 ( B1 2 B 2 2 ) P4 A 2 2 ( B 2 1 B1 1 ) P5 ( A1 1 A1 2 ) B 2 2 P6 ( A 2 1 A1 1 )( B1 1 B1 2 ) P7 ( A1 2 A 2 2 )( B 2 1 B 2 2 ) C 21 P2 P5 C 22 P1 P3 P2 P6 Strassen’s Algorithm ° The count of arithmetic operations is: Regular M ult Add Complexit y 8 4 2 n3 +O(n2 ) 18 4 .7 n 2 .8 + O(n2 ) St r as s en 7 One matrix multiply is replaced by 14 matrix additions Strassen’s Algorithm ° In reality the use of Strassen’s Algorithm is limited by • Additional memory required for storing the P matrices. • More memory accesses are needed. Outline ° Motivation for Dense Linear Algebra • Ax=b: Computational Electromagnetics • Ax = lx: Quantum Chemistry ° Review Gaussian Elimination (GE) for solving Ax=b ° Optimizing GE for caches on sequential machines • using matrix-matrix multiplication (BLAS) ° LAPACK library overview and performance ° Data layouts on parallel machines ° Parallel matrix-matrix multiplication ° Parallel Gaussian Elimination ° ScaLAPACK library overview ° Eigenvalue problem 77 Parallelism in Sparse Matrix-vector multiplication ° y = A*x, where A is sparse and n x n ° Questions • which processors store - y[i], x[i], and A[i,j] • which processors compute - y[i] = sum (from 1 to n) A[i,j] * x[j] = (row i of A) . x … a sparse dot product ° Partitioning • Partition index set {1,…,n} = N1 u N2 u … u Np • For all i in Nk, Processor k stores y[i], x[i], and row i of A • For all i in Nk, Processor k computes y[i] = (row i of A) . x - “owner computes” rule: Processor k compute the y[i]s it owns ° Goals of partitioning • balance load (how is load measured?) • balance storage (how much does each processor store?) • minimize communication (how much is communicated?) 
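A minimal sketch of the local piece of y = A*x under the "owner computes" partitioning above: processor k stores its rows in compressed sparse row (CSR) form and forms the sparse dot product for each row it owns. The array names are illustrative, and the sketch assumes the needed entries of x have already been gathered locally.

subroutine local_spmv(nlocal, rowptr, colind, val, x, ylocal)
  ! Sparse matrix-vector product for the rows owned by this processor.
  ! CSR storage: local row i uses entries rowptr(i) .. rowptr(i+1)-1.
  implicit none
  integer, intent(in)  :: nlocal
  integer, intent(in)  :: rowptr(nlocal+1), colind(*)
  real,    intent(in)  :: val(*), x(*)      ! x holds every entry this processor needs
  real,    intent(out) :: ylocal(nlocal)
  integer :: i, jp
  do i = 1, nlocal
     ylocal(i) = 0.0
     do jp = rowptr(i), rowptr(i+1) - 1
        ylocal(i) = ylocal(i) + val(jp) * x(colind(jp))   ! (row i of A) . x
     end do
  end do
end subroutine local_spmv

The communication hidden in the "gather x" assumption is exactly what the graph-partitioning view on the following slides tries to minimize.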
78 Graph Partitioning and Sparse Matrices ° Relationship between matrix and graph 1 2 3 1 1 1 2 1 1 3 4 1 1 5 6 1 1 1 1 1 1 1 1 1 1 1 1 1 1 5 1 6 4 1 ° A “good” partition of the graph has 3 2 4 1 6 5 • equal (weighted) number of nodes in each part (load and storage balance) • minimum number of edges crossing between (minimize communication) ° Can reorder the rows/columns of the matrix by putting all the nodes in one partition together 79 More on Matrix Reordering via Graph Partitioning ° “Ideal” matrix structure for parallelism: (nearly) block diagonal • p (number of processors) blocks • few non-zeros outside these blocks, since these require communication P0 P1 = * P2 P3 P4 80 What about implicit methods and eigenproblems? ° Direct methods (Gaussian elimination) • Called LU Decomposition, because we factor A = L*U • Future lectures will consider both dense and sparse cases • More complicated than sparse-matrix vector multiplication ° Iterative solvers • Will discuss several of these in future - Jacobi, Successive overrelaxiation (SOR) , Conjugate Gradients (CG), Multigrid,... • Most have sparse-matrix-vector multiplication in kernel ° Eigenproblems • Future lectures will discuss dense and sparse cases • Also depend on sparse-matrix-vector multiplication, direct methods ° Graph partitioning 81 Partial Differential Equations PDEs 82 82 Continuous Variables, Continuous Parameters Examples of such systems include ° Heat flow: Temperature(position, time) ° Diffusion: Concentration(position, time) ° Electrostatic or Gravitational Potential: Potential(position) ° Fluid flow: Velocity,Pressure,Density(position,time) ° Quantum mechanics: Wave-function(position,time) ° Elasticity: Stress,Strain(position,time) 83 Example: Deriving the Heat Equation x-h x 0 Consider a simple problem x+h 1 ° A bar of uniform material, insulated except at ends ° Let u(x,t) be the temperature at position x at time t ° Heat travels from x-h to x+h at rate proportional to: d u(x,t) =C* dt ° As h (u(x-h,t)-u(x,t))/h - (u(x,t)- u(x+h,t))/h h 0, we get the heat equation: d u(x,t) d2 u(x,t) =C* dt dx2 84 Explicit Solution of the Heat Equation ° For simplicity, assume C=1 ° Discretize both time and position ° Use finite differences with u[j,i] as the heat at • time t= i*dt (i = 0,1,2,…) and position x = j*h (j=0,1,…,N=1/h) • initial conditions on u[j,0] • boundary conditions on u[0,i] and u[N,i] t=5 ° At each timestep i = 0,1,2,... For j=0 to N u[j,i+1]= z*u[j-1,i]+ (1-2*z)*u[j,i]+ z*u[j+1,i] where z = dt/h2 ° This corresponds to • matrix vector multiply (what is matrix?) 
• nearest neighbors on grid t=4 t=3 t=2 t=1 t=0 u[0,0] u[1,0] u[2,0] u[3,0] u[4,0] u[5,0] 85 Parallelism in Explicit Method for PDEs ° Partitioning the space (x) into p largest chunks • good load balance (assuming large number of points relative to p) • minimized communication (only p chunks) ° Generalizes to • multiple dimensions • arbitrary graphs (= sparse matrices) ° Problem with explicit approach • numerical instability • solution blows up eventually if z = dt/h 2> .5 2 • need to make the timesteps very small when h is small: dt < .5*h 86 Instability in solving the heat equation explicitly 87 Implicit Solution ° As with many (stiff) ODEs, need an implicit method ° This turns into solving the following equation (I + (z/2)*T) * u[:,i+1]= (I - (z/2)*T) *u[:,i] ° Here I is the identity matrix and T is: T= 2 -1 -1 2 -1 -1 2 -1 -1 2 -1 -1 2 Graph and “stencil” -1 2 -1 ° I.e., essentially solving Poisson’s equation in 1D 88 2D Implicit Method ° Similar to the 1D case, but the matrix T is now 4 -1 -1 4 -1 -1 4 -1 T= Graph and “stencil” -1 -1 -1 -1 -1 4 -1 -1 4 -1 -1 4 -1 -1 -1 -1 -1 -1 -1 -1 -1 4 -1 4 -1 -1 4 -1 -1 4 ° Multiplying by this matrix (as in the explicit case) is simply nearest neighbor computation on 2D grid ° To solve this system, there are several techniques 89 Algorithms for 2D Poisson Equation with N unknowns Algorithm Serial PRAM Memory #Procs ° Dense LU N3 N N2 N2 ° Band LU N2 N N3/2 N ° Jacobi N2 N N N ° Explicit Inv. N2 log N N2 N2 ° Conj.Grad. N 3/2 N 1/2 *log N N N ° RB SOR N 3/2 N 1/2 N N ° Sparse LU N 3/2 N 1/2 N*log N N ° FFT N*log N log N N N ° Multigrid N log2 N N N log N N ° Lower bound N PRAM is an idealized parallel model with zero cost communication (see next slide for explanation) 90 Short explanations of algorithms on previous slide ° Sorted in two orders (roughly): • from slowest to fastest on sequential machines • from most general (works on any matrix) to most specialized (works on matrices “like” T) ° Dense LU: Gaussian elimination; works on any N-by-N matrix ° Band LU: exploit fact that T is nonzero only on sqrt(N) diagonals nearest main diagonal, so faster ° Jacobi: essentially does matrix-vector multiply by T in inner loop of iterative algorithm ° Explicit Inverse: assume we want to solve many systems with T, so we can precompute and store inv(T) “for free”, and just multiply by it • It’s still expensive! 
° Conjugate Gradients: uses matrix-vector multiplication, like Jacobi, but exploits mathematical properies of T that Jacobi does not ° Red-Black SOR (Successive Overrelaxation): Variation of Jacobi that exploits yet different mathematical properties of T • Used in Multigrid ° Sparse LU: Gaussian elimination exploiting particular zero structure of T ° FFT (Fast Fourier Transform): works only on matrices very like T ° Multigrid: also works on matrices like T, that come from elliptic PDEs ° Lower Bound: serial (time to print answer); parallel (time to combine N inputs) 91 Composite mesh from a mechanical structure 92 Converting the mesh to a matrix 93 Effects of Ordering Rows and Columns on Gaussian Elimination 94 Irregular mesh: NASA Airfoil in 2D (direct solution) 95 Irregular mesh: Tapered Tube (multigrid) 96 Adaptive Mesh Refinement (AMR) °Adaptive mesh around an explosion °John Bell and Phil Colella at LBL (see class web page for URL) °Goal of Titanium is to make these algorithms easier to implement in parallel 97 Computational Electromagnetics •Developed during 1980s, driven by defense applications •Determine the RCS (radar cross section) of airplane •Reduce signature of plane (stealth technology) •Other applications are antenna design, medical equipment •Two fundamental numerical approaches: •MOM methods of moments ( frequency domain), and •Finite differences (time domain) 98 Computational Electromagnetics - Discretize surface into triangular facets using standard modeling tools - Amplitude of currents on surface are unknowns - Integral equation is discretized into a set of linear equations image: NW Univ. Comp. Electromagnetics Laboratory http://nueml.ece.nwu.edu/ 99 Computational Electromagnetics (MOM) After discretization the integral equation has the form A x = b where A is the (dense) impedance matrix, x is the unknown vector of amplitudes, and b is the excitation vector. (see Cwik, Patterson, and Scott, Electromagnetic Scattering on the Intel Touchstone Delta, IEEE Supercomputing ‘92, pp 538 - 542) 100 Computational Electromagnetics (MOM) The main steps in the solution process are Fill: computing the matrix elements of A Factor: factoring the dense matrix A Solve: solving for one or more excitations b Field Calc: computing the fields scattered from the object 101 Analysis of MOM for Parallel Implementation Task Work Parallelism Parallel Speed Fill O(n**2) embarrassing Factor O(n**3) moderately diff. very high Solve O(n**2) moderately diff. high O(n) embarrassing high Field Calc. low 102 Results for Parallel Implementation on Delta Task Time (hours) Fill 9.20 Factor 8.25 Solve 2 .17 Field Calc. 0.12 The problem solved was for a matrix of size 48,672. (The world record in 1991.) 103 Current Records for Solving Dense Systems Year 1950's 1995 1996 1998 1998 1999 1999 2000 System Size O(100) 128,600 215,000 148,000 235,000 374,000 362,880 430,000 Machine Intel Paragon Intel ASCI Red Cray T3E Intel ASCI Red SGI ASCI Blue Intel ASCI Red IBM ASCI White # Procs Gflops (Peak) 6768 7264 1488 9152 5040 9632 8192 281 ( 338) 1068 (1453) 1127 (1786) 1338 (1830) 1608 (2520) 2379 (3207) 4928 (12000) source: Alan Edelman http://www-math.mit.edu/~edelman/records.html LINPACK Benchmark: http://www.netlib.org/performance/html/PDSreports.html 104 Computational Chemistry ° Seek energy levels of a molecule, crystal, etc. 
• Solve Schroedinger’s Equation for energy levels = eigenvalues • Discretize to get Ax = lBx, solve for eigenvalues l and eigenvectors x • A and B large, symmetric or Hermitian matrices (B positive definite) • May want some or all eigenvalues/eigenvectors ° MP-Quest (Sandia NL) • Si and sapphire crystals of up to 3072 atoms • Local Density Approximation to Schroedinger Equation • A and B up to n=40000, Hermitian • Need all eigenvalues and eigenvectors • Need to iterate up to 20 times (for self-consistency) ° Implemented on Intel ASCI Red • 9200 Pentium Pro 200 processors (4600 Duals, a CLUMP) • Overall application ran at 605 Gflops (out of 1800 Glops peak), • Eigensolver ran at 684 Gflops • www.cs.berkeley.edu/~stanley/gbell/index.html • Runner-up for Gordon Bell Prize at Supercomputing 98 105 Review of Gaussian Elimination (GE) for solving Ax=b ° Add multiples of each row to later rows to make A upper triangular ° Solve resulting triangular system Ux = c by substitution … for each column i … zero it out below the diagonal by adding multiples of row i to later rows for i = 1 to n-1 … for each row j below row i for j = i+1 to n … add a multiple of row i to row j for k = i to n A(j,k) = A(j,k) - (A(j,i)/A(i,i)) * A(i,k) 106 Refine GE Algorithm (1) ° Initial Version … for each column i … zero it out below the diagonal by adding multiples of row i to later rows for i = 1 to n-1 … for each row j below row i for j = i+1 to n … add a multiple of row i to row j for k = i to n A(j,k) = A(j,k) - (A(j,i)/A(i,i)) * A(i,k) ° Remove computation of constant A(j,i)/A(i,i) from inner loop for i = 1 to n-1 for j = i+1 to n m = A(j,i)/A(i,i) for k = i to n A(j,k) = A(j,k) - m * A(i,k) 107 Refine GE Algorithm (2) ° Last version for i = 1 to n-1 for j = i+1 to n m = A(j,i)/A(i,i) for k = i to n A(j,k) = A(j,k) - m * A(i,k) ° Don’t compute what we already know: zeros below diagonal in column i for i = 1 to n-1 for j = i+1 to n m = A(j,i)/A(i,i) for k = i+1 to n A(j,k) = A(j,k) - m * A(i,k) 108 Refine GE Algorithm (3) ° Last version for i = 1 to n-1 for j = i+1 to n m = A(j,i)/A(i,i) for k = i+1 to n A(j,k) = A(j,k) - m * A(i,k) ° Store multipliers m below diagonal in zeroed entries for later use for i = 1 to n-1 for j = i+1 to n A(j,i) = A(j,i)/A(i,i) for k = i+1 to n A(j,k) = A(j,k) - A(j,i) * A(i,k) 109 Refine GE Algorithm (4) ° Last version for i = 1 to n-1 for j = i+1 to n A(j,i) = A(j,i)/A(i,i) for k = i+1 to n A(j,k) = A(j,k) - A(j,i) * A(i,k) ° Express using matrix operations (BLAS) for i = 1 to n-1 A(i+1:n,i) = A(i+1:n,i) / A(i,i) A(i+1:n,i+1:n) = A(i+1:n , i+1:n ) - A(i+1:n , i) * A(i , i+1:n) 110 What GE really computes for i = 1 to n-1 A(i+1:n,i) = A(i+1:n,i) / A(i,i) A(i+1:n,i+1:n) = A(i+1:n , i+1:n ) - A(i+1:n , i) * A(i , i+1:n) ° Call the strictly lower triangular matrix of multipliers M, and let L = I+M ° Call the upper triangle of the final matrix U ° Lemma (LU Factorization): If the above algorithm terminates (does not divide by zero) then A = L*U ° Solving A*x=b using GE • Factorize A = L*U using GE (cost = 2/3 n3 flops) • Solve L*y = b for y, using substitution (cost = n2 flops) • Solve U*x = y for x, using substitution (cost = n2 flops) ° Thus A*x = (L*U)*x = L*(U*x) = L*y = b as desired 111 Problems with basic GE algorithm ° What if some A(i,i) is zero? Or very small? 
• Result may not exist, or be “unstable”, so need to pivot ° Current computation all BLAS 1 or BLAS 2, but we know that BLAS 3 (matrix multiply) is fastest (Lecture 2) for i = 1 to n-1 A(i+1:n,i) = A(i+1:n,i) / A(i,i) … BLAS 1 (scale a vector) A(i+1:n,i+1:n) = A(i+1:n , i+1:n ) … BLAS 2 (rank-1 update) - A(i+1:n , i) * A(i , i+1:n) IBM RS/6000 Power 3 (200 MHz, 800 Mflop/s Peak) Peak BLAS 3 800 Mflop/s 600 400 BLAS 2 200 BLAS 1 0 10 100 200 300 400 Order of vector/Matrices 500 112 Pivoting in Gaussian Elimination ° A= [0 1] [1 0] fails completely, even though A is “easy” ° Illustrate problems in 3-decimal digit arithmetic: A = [ 1e-4 1 ] and [ 1 1 ] b = [ 1 ], correct answer to 3 places is x = [ 1 ] [2] [1] ° Result of LU decomposition is L=[ 1 [ fl(1/1e-4) U = [ 1e-4 [0 0] = [ 1 1] [ 1e4 1 ] = [ 1e-4 fl(1-1e4*1) ] [ 0 Check if A = L*U = [ 1e-4 [ 1 1] 0] … No roundoff error yet 0 ] 1 ] 1 ] -1e4 ] … Error in 4th decimal place … (2,2) entry entirely wrong ° Algorithm “forgets” (2,2) entry, gets same L and U for all |A(2,2)|<5 ° Numerical instability ° Computed solution x totally inaccurate ° Cure: Pivot (swap rows of A) so entries of L and U bounded 113 Gaussian Elimination with Partial Pivoting (GEPP) ° Partial Pivoting: swap rows so that each multiplier |L(i,j)| = |A(j,i)/A(i,i)| <= 1 for i = 1 to n-1 find and record k where |A(k,i)| = max{i <= j <= n} |A(j,i)| … i.e. largest entry in rest of column i if |A(k,i)| = 0 exit with a warning that A is singular, or nearly so elseif k != i swap rows i and k of A end if A(i+1:n,i) = A(i+1:n,i) / A(i,i) … each quotient lies in [-1,1] A(i+1:n,i+1:n) = A(i+1:n , i+1:n ) - A(i+1:n , i) * A(i , i+1:n) ° Lemma: This algorithm computes A = P*L*U, where P is a permutation matrix ° Since each entry of |L(i,j)| <= 1, this algorithm is considered numerically stable ° For details see LAPACK code at www.netlib.org/lapack/single/sgetf2.f 114 Converting BLAS2 to BLAS3 in GEPP ° Blocking • Used to optimize matrix-multiplication • Harder here because of data dependencies in GEPP ° Delayed Updates • Save updates to “trailing matrix” from several consecutive BLAS2 updates • Apply many saved updates simultaneously in one BLAS3 operation ° Same idea works for much of dense linear algebra • Open questions remain ° Need to choose a block size b • Algorithm will save and apply b updates • b must be small enough so that active submatrix consisting of b columns of A fits in cache • b must be large enough to make BLAS3 fast 115 Blocked GEPP (www.netlib.org/lapack/single/sgetrf.f) for ib = 1 to n-1 step b … Process matrix b columns at a time end = ib + b-1 … Point to end of block of b columns apply BLAS2 version of GEPP to get A(ib:n , ib:end) = P’ * L’ * U’ … let LL denote the strict lower triangular part of A(ib:end , ib:end) + I A(ib:end , end+1:n) = LL-1 * A(ib:end , end+1:n) … update next b rows of U A(end+1:n , end+1:n ) = A(end+1:n , end+1:n ) - A(end+1:n , ib:end) * A(ib:end , end+1:n) … apply delayed updates with single matrix-multiply … with inner dimension b (For a correctness proof, see on-lines notes.) 
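For reference, the "BLAS2 version of GEPP" that the blocked loop above applies to each panel can be written as the following minimal Fortran sketch (A is factored in place, multipliers are stored below the diagonal, pivot rows are recorded in ipiv). It is a teaching sketch of the GEPP pseudocode above, not the LAPACK routine sgetf2 it corresponds to.

subroutine gepp(n, a, ipiv, info)
  ! Gaussian elimination with partial pivoting: A = P*L*U, factored in place.
  implicit none
  integer, intent(in)    :: n
  real,    intent(inout) :: a(n,n)
  integer, intent(out)   :: ipiv(n), info
  integer :: i, j, k
  real    :: tmp
  info = 0
  do i = 1, n - 1
     k = i                                   ! find largest entry in rest of column i
     do j = i + 1, n
        if (abs(a(j,i)) > abs(a(k,i))) k = j
     end do
     ipiv(i) = k
     if (a(k,i) == 0.0) then
        info = i                             ! A is singular, or nearly so
        return
     end if
     if (k /= i) then                        ! swap rows i and k
        do j = 1, n
           tmp = a(i,j); a(i,j) = a(k,j); a(k,j) = tmp
        end do
     end if
     a(i+1:n,i) = a(i+1:n,i) / a(i,i)        ! each quotient lies in [-1,1]   (BLAS 1)
     do j = i + 1, n                         ! rank-1 update of trailing matrix (BLAS 2)
        a(i+1:n,j) = a(i+1:n,j) - a(i+1:n,i) * a(i,j)
     end do
  end do
  ipiv(n) = n
end subroutine gepp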
116 Efficiency of Blocked GEPP 117 Overview of LAPACK ° Standard library for dense/banded linear algebra • Linear systems: A*x=b • Least squares problems: minx || A*x-b ||2 • Eigenvalue problems: Ax = lx, Ax = lBx • Singular value decomposition (SVD): A = USVT ° Algorithms reorganized to use BLAS3 as much as possible ° Basis of math libraries on many computers ° Many algorithmic innovations remain • Projects available 118 Performance of LAPACK (n=1000) 119 Performance of LAPACK (n=100) 120 Parallelizing Gaussian Elimination ° parallelization steps • Decomposition: identify enough parallel work, but not too much • Assignment: load balance work among threads • Orchestrate: communication and synchronization • Mapping: which processors execute which threads ° Decomposition • In BLAS 2 algorithm nearly each flop in inner loop can be done in parallel, so with n2 processors, need 3n parallel steps for i = 1 to n-1 A(i+1:n,i) = A(i+1:n,i) / A(i,i) … BLAS 1 (scale a vector) A(i+1:n,i+1:n) = A(i+1:n , i+1:n ) … BLAS 2 (rank-1 update) - A(i+1:n , i) * A(i , i+1:n) • This is too fine-grained, prefer calls to local matmuls instead 121 Assignment of parallel work in GE ° Think of assigning submatrices to threads, where each thread responsible for updating submatrix it owns • “owner computes” rule natural because of locality ° What should submatrices look like to achieve load balance? 122 Different Data Layouts for Parallel GE (on 4 procs) Bad load balance: P0 idle after first n/4 steps Can trade load balance and BLAS2/3 performance by choosing b, but factorization of block column is a bottleneck Load balanced, but can’t easily use BLAS2 or BLAS3 The winner! Complicated addressing 123 Blocked Partitioned Algorithms ° Orthogonal reduction to: • (upper) Hessenberg form • symmetric tridiagonal form ° LU Factorization ° Cholesky factorization ° Symmetric indefinite factorization • bidiagonal form ° Block QR iteration for nonsymmetric eigenvalue problems ° Matrix inversion ° QR, QL, RQ, LQ factorizations ° Form Q or QTC Memory Hierarchy and LAPACK ° ijk - implementations for _ = 1: n; for _ = 1: n; fo r _ = 1: n; a i , j a i , j bi ,k c k , j Effects order in which data referenced; some better at allowing data to keep in higher levels of memory hierarchy. en d end end ° Applies for matrix multiply, reductions to condensed form • May do slightly more • Up to 3 times faster flops Derivation of Blocked Algorithms Cholesky Factorization A = UTU A 11 T aj T A13 aj a jj j U T A13 11 T T j uj T A 33 U 13 0 u jj j 0 U 11 0 0 T U 33 0 uj u jj 0 U 13 T j U 33 Equating coefficient of the jth column, we obtain a j U 11 u j T a jj u j u j u jj T 2 Hence, if U11 has already been computed, we can T compute uj and U uujj from a the equations: 11 j j u jj a jj u j u j 2 T 126 LINPACK Implementation ° Here is the body of the LINPACK routine SPOFA which implements the method: DO 30 J = 1, N INFO = J S = 0.0E0 JM1 = J - 1 IF( JM1.LT.1 ) GO TO 20 DO 10 K = 1, JM1 T = A( K, J ) - SDOT( K-1, A( 1, K ), 1,A( 1, J ), 1 ) T = T / A( K, K ) A( K, J ) = T S = S + T*T 10 CONTINUE 20 CONTINUE S = A( J, J ) - S C ...EXIT IF( S.LE.0.0E0 ) GO TO 40 A( J, J ) = SQRT( S ) 30 CONTINUE 127 LAPACK Implementation DO 10 J = 1, N CALL STRSV( 'Upper', 'Transpose', 'Non-Unit’, J-1, A, LDA, A( 1, J ), 1 ) S = A( J, J ) - SDOT( J-1, A( 1, J ), 1, A( 1, J ), 1 ) IF( S.LE.ZERO ) GO TO 20 A( J, J ) = SQRT( S ) 10 CONTINUE ° This change by itself is sufficient to significantly improve the performance on a number of machines. 
° From 238 to 312 Mflop/s for a matrix of order 500 on a Pentium 4 at 1.7 GHz. ° However, the peak is 1,700 Mflop/s. ° This suggests further work is needed. 128
Derivation of Blocked Algorithms
Partition A = U^T U by blocks of columns:
  [ A11    A12    A13 ]   [ U11^T    0       0    ] [ U11  U12  U13 ]
  [ A12^T  A22    A23 ] = [ U12^T  U22^T     0    ] [  0   U22  U23 ]
  [ A13^T  A23^T  A33 ]   [ U13^T  U23^T  U33^T   ] [  0    0   U33 ]
Equating coefficients of the second block of columns, we obtain
  A12 = U11^T U12
  A22 = U12^T U12 + U22^T U22
Hence, if U11 has already been computed, we can compute U12 and U22 from the equations
  U11^T U12 = A12            (solved for U12 by a call to the Level 3 BLAS routine STRSM)
  U22^T U22 = A22 - U12^T U12  (a smaller Cholesky factorization)
129
LAPACK Blocked Algorithms
      DO 10 J = 1, N, NB
         CALL STRSM( 'Upper', 'Left', 'Transpose', 'Non-Unit', J-1, JB, ONE, A, LDA, A( 1, J ), LDA )
         CALL SSYRK( 'Upper', 'Transpose', JB, J-1, -ONE, A( 1, J ), LDA, ONE, A( J, J ), LDA )
         CALL SPOTF2( 'Upper', JB, A( J, J ), LDA, INFO )
         IF( INFO.NE.0 ) GO TO 20
   10 CONTINUE
• On the Pentium 4, Level 3 BLAS squeezes a lot more out of 1 processor:
  Intel Pentium 4, 1.7 GHz      Rate of Execution, N = 500
  Linpack variant (L1B)         238 Mflop/s
  Level 2 BLAS variant          312 Mflop/s
  Level 3 BLAS variant          1262 Mflop/s
130
LAPACK Contents ° Combines algorithms from LINPACK and EISPACK into a single package. User interface similar to LINPACK. ° Built on the Level 1, 2 and 3 BLAS, for high performance (manufacturers optimize BLAS) ° LAPACK does not provide routines for structured problems or general sparse matrices (i.e. sparse storage formats such as compressed-row, -column, diagonal, skyline ...).
LAPACK Ongoing Work ° Add functionality • updating/downdating, divide and conquer least squares, bidiagonal bisection, bidiagonal inverse iteration, band SVD, Jacobi methods, ... ° Move to new generation of high performance machines • IBM SPs, CRAY T3E, SGI Origin, clusters of workstations ° New challenges • New languages: FORTRAN 90, HP FORTRAN, ... • Many flavors of message passing (CMMD, MPL, NX ...), need standard (PVM, MPI): BLACS ° Highly varying ratio of computational speed to communication speed ° Many ways to lay out data ° Fastest parallel algorithm sometimes less stable numerically.
History of Block Partitioned Algorithms ° Early algorithms involved use of small main memory using tapes as secondary storage. ° Recent work centers on use of vector registers, level 1 and 2 cache, main memory, and "out of core" memory.
Blocked Partitioned Algorithms ° Orthogonal reduction to: • (upper) Hessenberg form • symmetric tridiagonal form • bidiagonal form ° LU Factorization ° Cholesky factorization ° Symmetric indefinite factorization ° Block QR iteration for nonsymmetric eigenvalue problems ° Matrix inversion ° QR, QL, RQ, LQ factorizations ° Form Q or Q^T C
