Plan of lectures 1. Development of data parallel programming. Historical view of languages up to HPF. 2. Issues in translation of HPF. Simple examples. Distributed array descriptors. 3. Communication in data parallel languages. Communication patterns. Runtime libraries. 4. An “HPspmd” programming model. Motivations. Introduction to HPJava. 5. Java for HP computing. Java Grande. mpiJava. Lecture materials http://www.npac.syr.edu/projects/pcrc/HPJava/beijing.html Development of Data-Parallel Programming Bryan Carpenter NPAC at Syracuse University Syracuse, NY 13244 [email protected] Goals of this lecture Review the historical development of data-parallel programming, up to and including emergence of High Performance Fortran. Illustrate evolution of ideas by describing some major practical programming languages for data parallelism. Contents of Lecture Overview. Early data-parallel languages. ILLIAC IV CFD and DAP FORTRAN. Later SIMD languages. Need for massive parallelism: SIMD, MIMD, SPMD. Fortran 90 Connection Machine Fortran, related languages. High Performance Fortran. Multi-processing and distributed data Processor arrangements and templates, alignment Programming examples. Moore’s law and massive parallelism Microprocessors double in speed every eighteen months. To outpace growth in power of sequential computers, parallelism better offer orders of magnitude better performance. Moderate parallelism cannot give this speedup—massive parallelism or bust! Niches for massive parallelism today Dedicated clusters of commodity processors. Government labs, internet servers, Pixar Renderfarm, … Harvesting cycles of processors distributed over the internet. Two principal forms of massive parallelism: task farming, and data parallelism. Data parallelism Large data structures, typically arrays, split across computational nodes. Each node computes local patch of arrays. Typically need some intermediate results from other nodes, but communication should not dominate. Logical model of fine-grained parallelism—at level of array elements, rather than tasks. Programming languages for massive parallelism Task farming adequately expressed in conventional languages task = function call libraries handle communication + load-balancing Data parallelism can be expressed in conventional languages + MPI (say), but historically higher level languages have been quite successful Potentially avoid the problems of general concurrent programming (non-determinism, synchronization, deadlock, …) Why special languages for data-parallelism? Originally devised for Single Instruction, Multiple Data (SIMD) computers. Specialized architectures demanded special language features. New abstraction: distributed array. Central control avoided problems of concurrent programming—concentrate on expressing parallel algorithms. Advent of MIMD computers. By 90s, SIMD lost ground—general purpose microprocessors too cheap, powerful. Multiple Instruction, Multiple Data (MIMD) computers could be built from commodity components. Still want to avoid complexity of general concurrent programming. Single Program, Multiple Data (SPMD)—now a programming model rather than computer architecture. Similarities to SIMD programming suggest similar language ideas should apply. The road to HPF High Performance Fortran took ideas from SIMD languages—embedded them in a standardized language for programming MIMD computers in SPMD style. Next section of this lecture will review historically important SIMD languages— learn where the ideas came from. ILLIAC IV ILLIAC IV development started late 60s. Fully operational 1975. Illinois then NASA Ames. SIMD computer for array processing. Control Unit + 64 Processing elements. 2K words memory per PE. CU can access all memory. PEs can access local memory and communicate with neighbours. CU reads program text—broadcasts instructions to PEs. Architecture of the ILLIAC IV ILLIAC IV CFD Software problematic for ILLIAC. CFD language developed at Computational Fluid Dynamics Branch of Ames. “Fortran-like” rather than a true FORTRAN dialect. Deliberately no attempt to hide hardware peculiarities—on contrary, tried to give programmer direct access to all features of the hardware. Data types in CFD Basic types included CU REAL, CU INTEGER, PE REAL, PE INTEGER. Ordinary arrays can be in PE or CU memory: CU REAL A, B(100) PE INTEGER I PE REAL D(25), E(1000) Vector-aligned arrays in PE memory: PE INTEGER J(*) PE REAL X(*,4), Y(*,2,8) Vector aligned arrays in CFD Early language instantiation of Distributed Array concept. Only first dimension can be distributed. Extent of distributed dimension exactly 64. J(1) stored on first PE, J(2) on second PE, etc. X(1,1),…,X(1,4) on first PE, X(2,1),…, X(2,4) on second PE, etc. Parallel computation in CFD Parallel computation only in vector assignments. Subscript is *. Communication between PEs by adding shift to *: DIFP(*) = P(* + 1) – P(* - 1) Conditional vector assignment: IF(A(*) .LT. 0) A(*) = -A(*) PEs can locally access different locations: DIAG(*) = RHO(*, X(*)) Vector subscript must be in sequential dimension. Other ILLIAC languages Glypnir Algol-like rather than Fortran-like. swords (“super-words”) rather than explicit parallel arrays. IVTRAN Higher-level Fortran dialect. Scheme for mapping general arrays to PEs. Parallel loop similar to FORALL. Compiler never fully debugged. ICL DAP FORTRAN ICL DAP (Distributed Array Processor)—early commercial parallel computer. Available early 80s. Trend to many, simple PEs: 64 × 64 grid of 4096 bit-serial processors. FP performance much lower than ILLIAC, but good performance in other domains. Module of ICL 2900 mainframe computer— shared memory and other facilities with host. Programmed in DAP FORTRAN. Constrained arrays in DAP FORTRAN First one or two extents could be omitted: DIMENSION A(), BB(,) INTEGER II(,) REAL AA(,3), BBB(,,5) A is a vector. BB and II are matrices. AA an array of vectors. BBB an array of matrices. Extents of constrained dimensions 64 again (vectors of 4096 also possible). Array assignments in DAP Fortran Leave subscripts empty for array expression. + or – subscript used for nearest neighbour access: U(,) = 0.25 * (U(,-) + U(,+) + U(-,) + U(+,)) Masked assignment by using logical matrix as subscript: LOGICAL L(,) ... L(,) = BB(,) .LT. 0 B(L) = -B(,) Standardized array processing: Fortran 90 By the mid 80s SIMD and vector computers looked well-established. The Fortran 8X committee saw a need to make array assignments standard language features. Fortran 90 (as it eventually became) added many new features to support modern software engineering. It also added new “array syntax”, including array assignments, array sections, transformational intrinsics. Fortran 90 array assignments Rank-2 arrays, shape (5, 10): REAL A(5, 10), B(5, 10), C(5, 10) Array expressions involve conforming arrays: A + B, B * C, A + (B * C) Arrays conform if they have same shape. B * C, for example, stands for B(1,1) * C(1,1) . . . · · B(5,1) * C(5,1) . . . B(1,10) * C(1,10) · B(5,10) * C(5,10) Scalars in array expressions Scalars conform with any array: REAL D Expression C + D stands for: C(1,1) + D . . . · · C(5,1) + D . . . C(1,10) + D · C(5,10) + D Array sections Array subscripts can be triplets: X(1:10:3) stands for the subarray: (X(1), X(4), X(7), X(10)) Special cases of triplets: X(1:4), stride defaults to 1: (X(1), X(2), X(3), X(4)) X(:) selects whole of array dimension, as originally declared. Most useful for multi-dimensional arrays—here X(:) just equivalent to X. Array assignments Simple assignment: U(2:N-1) = 0.5 * (U(1:N-2) + U(3:N)) Right-hand-side expression must conform with left-hand-side variable. Conditional assignment: WHERE (A .LT. 0.0) A = -A The condition must be a logical array that conforms with the left-hand-side of the assignment. Array intrinsics Fortran 90 added many transformational intrinsics—operations on whole arrays. CSHIFT, EOSHIFT shift arrays in a dimension. TRANSPOSE transposes array. RESHAPE: general reshaping of array. SUM, PRODUCT, MAXVAL, MINVAL, . . . : reduction operations that take an array and return scalar or reduced-rank array. The Connection Machine Thinking Machines Corp founded 1983 to supply connectionist computers for AI. In practise, their Connection Machine series was used largely for scientific computing. CM-2 launched 1986: SIMD architecture, bit-serial PE like DAP. Up to 65536 PEs connected on a hypercube network. Floating point coprocessors could give peak performance 28 GFLOPS. Programmed initially in *LISP, then C*. By 1990 there was a CM Fortran compiler. CM Fortran Extended from FORTRAN 77. Included all the new array syntax of Fortran 90. FORALL statement. Array distribution directives. Mapping arrays in CM Fortran Arrays can be allocated on front-end (VAX, Sun,etc) or on the CM itself. By default an array is a CM array if it ever used in a F90 array assignment in the current procedure; it is a front-end array if it is only ever used in F77 style. Distributed arrays (“CM arrays”) can have any shape. Layout of CM arrays Arrays mapped to Virtual Processor sets (VP sets), one array element per VP. By default VP set has minimal shape s.t.: VP grid is large enough to hold the array, Extents of VP grid are powers of 2, Total number VPs is a multiple of number PEs. Map first array element to first VP. Disable or ignore VPs unneeded to hold elements. LAYOUTdirectives in CM Fortran Layout directive can override defaults. Serial dimensions can be elected: REAL A(10, 64, 100) CMF$ LAYOUT A(:SERIAL, :NEWS, :NEWS) A aligned with B: REAL B(64, 100) CMF$ LAYOUT B(:NEWS, :NEWS) All dimensions serial implies front-end array. LAYOUT directives can appear in procedure interface blocks. ALIGN directives in CM Fortran Can explicitly specify alignment relations: REAL V(100), B(64, 100) CMF$ ALIGN V(I) WITH B(1, I) Offset alignments also allowed: REAL C(32, 50) CMF$ ALIGN C(I, J) WITH B(I+5, J+2) More general alignments, eg transposed, not allowed. Layouts and alignments in CM Fortran FORALL Parallelism must be explicit. Eg in F90 style: U(2:N-1, 2:N-1) = & 0.25 * (U(2:N-1, 1:N-2) + U(2:N-1, 3:N) & + U(1:N-2, 2:N-1) + U(3:N, 2:N-1)) When array syntax unwieldy, can use FORALL statement: FORALL (I = 2:N-1, J = 2:N-1) & U(I, J) = 0.25 * (U(I, J-1) + U(I, J+1) + & U(I-1, J) + U(I+1, J)) Languages related to CM Fortran MasPar Fortran. Contemporary SIMD computer corporation. Language very similar to CM Fortran. Slightly different syntax for directives. *LISP. Original language of CM. Data parallel dialect of Common LISP. pvars rather than distributed arrays. C*. CM version of C, extended with syntax for distributed arrays. Introduced shape concept, similar to templates in later HPF. The High Performance Fortran Forum By early 90s, value of portable, standardized languages universally acknowledged. Goal of HPF Forum—a single language for High Performance programming. Effective across architectures—vector, SIMD, MIMD, though SPMD a focus. Supported by most major vendors then: Cray, DEC, Fujitsu, HP, IBM, Intel, MasPar, Meiko, nCube, Sun, Thinking Machines HPF 1.0 standard published 1993. Multiprocessing and Distributed Data Contemporary parallel computers are built from autonomous processors with some local memory. Processors access their local memory rapidly, or other processors memory much more slowly. Programs should be arranged to minimize non-local accesses. Processors Memory areas Excessive communication Computation divided into parallel subcomputations, each computation involving operands from multiple processors. Bad load balancing All parallel subcomputations access operands on the same processor. Ideal decomposition All operands of individual subcomputation on single processor; operands for different tasks on different processors. Placement of data and computation HPF allows intricate control over placement of array elements, through mapping directives. HPF 1.0 allows no direct control over placement of computation. Compiler should infer good place to compute, according to location of operands (eg, “owner computes” heuristic). Stages of data mapping in HPF Processor arrangements Abstract, program-level representation of processor set over which data is distributed. Directive syntax similar to Fortran array declaration: !HPF$ PROCESSORS P(10) Set P contains 10 processors. Processor arrangements can be multidimensional: !HPF$ PROCESSORS Q(4, 4) Templates Abstract array shape, to which actual data can be aligned. (Usually much finer grain than processor arrangements.) Declaration syntax: !HPF$ TEMPLATE T(50, 50, 50) Templates distributed over processor arrangements using DISTRIBUTE directive. Following examples assume: !HPF$ PROCESSORS P1(4) !HPF$ TEMPLATE T1(17) Simple block distribution !HPF$ DISTRIBUTE T1(BLOCK) ONTO P1 Simple cyclic distribution !HPF$ DISTRIBUTE T1(CYCLIC) ONTO P1 Block distribution with specified block-size !HPF$ DISTRIBUTE T1(BLOCK(6)) ONTO P1 Block-cyclic distribution !HPF$ DISTRIBUTE T1(CYCLIC(3)) ONTO P1 Distributing a multidimensional template Distribution formats can be mixed: !HPF$ PROCESSORS P2(4, 3) !HPF$ TEMPLATE T2(17, 20) !HPF$ DISTRIBUTE T2(CYCLIC, BLOCK) ONTO P2 Some dimensions may be serial, or collapsed: !HPF$ DISTRIBUTE T2(BLOCK, *) ONTO P1 LU Decomposition algorithm REAL A(N, N) DO K = 1, N – 1 DO J = K + 1, N A(K, J) = A(K, J) / A(K, K) DO I = K + 1, N A(I, J) = A(I, J) – A(I, K) * A(K, J) ENDDO ENDDO ENDDO Parallel LU Decomposition REAL A(N, N) REAL COL(N), ROW(N) DO K = 1, N – 1 COL(K:N) = A(K:N, K) A(K, K+1:N) = A(K, K+1:N) / COL(K) ROW(K+1:N) = A(K, K+1:N) FORALL (I = K+1:N, J = K+1:N) & A(I, J) = A(I, J) – COL(I) * ROW(J) ENDDO Simple align directive Create template matching principal array (A), then ALIGN the array to the template: !HPF$ TEMPLATE T(N, N) !HPF$ ALIGN A(I, J) WITH T(I, J) Aligning auxiliary arrays Want to minimize communication in largest parallel loop: FORALL (I = K+1:N, J = K+1:N) & A(I, J) = A(I, J) – COL(I) * ROW(J) Suggests replicated alignments for COL and ROW: !HPF$ ALIGN COL(I) WITH T(I, *) !HPF$ ALIGN COL(J) WITH T(*, J) Alignment of arrays in LU example Communication in LU example The statement COL(K:N) = A(K:N, K) broadcasts Kth column of A. The statement A(K, K+1:N) = A(K, K+1:N) / COL(K) needs no communication (reason for using COL(K) rather than A(K, K)). The statement ROW(K+1:N) = A(K, K+1:N) broadcasts Kth row of A. Distribution in LU example Block-wise distribution unattractive due to poor load balancing. Cyclic distribution preferable: !HPF$ PROCESSORS P(NP, NP) !HPF$ DISTRIBUTE T(CYCLIC, CYCLIC) ONTO P LU example with directives !HPF$ PROCESSORS P(NP, NP) !HPF$ TEMPLATE T(N, N) !HPF$ DISTRIBUTE T(CYCLIC, CYCLIC) ONTO P REAL A(N, N) !HPF$ ALIGN A(I, J) WITH T(I, J) REAL COL(N), ROW(N) !HPF$ ALIGN COL(I) WITH T(I, *) !HPF$ ALIGN COL(J) WITH T(*, J) DO K = 1, N – 1 COL(K:N) = A(K:N, K) A(K, K+1:N) = A(K, K+1:N) / COL(K) ROW(K+1:N) = A(K, K+1:N) FORALL (I = K+1:N, J = K+1:N) & A(I, J) = A(I, J) – COL(I) * ROW(J) ENDDO Other alignment options Permuted dimensions: DIMENSION B(N, N) !HPF$ ALIGN B(I, J) WITH T(J, I) Affine intra-dimensional alignment: DIMENSION C(N/2, N/2) !HPF$ ALIGN C(I, J) WITH T(N/2 + I, 2*J) Intra-dimensional alignment example Next Lecture: Issues in translation of High Performance Fortran Translation of simple HPF programs to SPMD (MPI) programs. Design of a runtime Distributed Array Descriptor.