Parallel Programming Yang Xianchun Department of Computer Science and Technology Nanjing University Introduction Agenda – – – – – – – – About the Course Evolution of Computing Technology Grand Challenge Problem Examples Motivations for Parallelism Parallel Computation A Road Map of Topics Parallel Computing Terminology An Illustrative Example • Control-Parallel Approach • Data-Parallel Approach • Performance Analysis About the course • Description – An introduction course to parallel programming concepts and techniques – No prior parallel computation background is necessary • Prerequisites – Working knowledge of computer systems – Adequate programming skill in C or C++ • Textbook – Harry Jordan and Gita Alaghband, “Fundamentals of Parallel Processing,” Prentice Hall, 2003. About the course (cont.) • Topics – Introduction – Parallel architectures – Data parallel computing • Fortran 90 / 95 – Shared memory computing • Pthreads • OpenMP – Message passing programming • PVM • MPI – Applications • Sorting • Numerical algorithms • Graph algorithms – Interconnection network – Performance – Models About the course (cont.) • Organization – Lectures – Programming projects • using parallel programming tools PVM and MPI on Linux / Unix machines – Homeworks – Class Tests and Final Exam (Open book) Evolution of Computing Technology • Hardware – – – – Vacuum tubes, relay memory Discrete transistors, core memory Integrated circuits, pipelined CPU VLSI microprocessors, solid state memory • Languages and Software – – – – Machine / Assembly languages Algol / Fortran with compilers, batch processing OS C, multiprocessing, timesharing OS C++ / Java, parallelizing compilers, distributed OS Evolution of Computing Technology (cont.) • The Driving Force Behind the Technology Advances – The ever-increasing demands on computing power • Scientific computing (e.g. Large-scale simulations) • Commercial computing (e.g. Databases) • 3D graphics and realistic animation • Multimedia internet applications Grand Challenge Problem Examples • Simulations of the earth’s climate • Resolution: 10 kilometers • Period: 1 year • Ocean and biosphere models: simple – Total requirements: 1016 floating-point operations per second – With a supercomputer capable of 10 Giga FLOPS, it will take 10 days to execute • Real-time processing of 3D graphics • Number of data elements: 109 (1024 in each dimension) • Number of operations per element : 200 • Update rate: 30 times per second – Total requirements: 6.4 x 1012 operations per second – With processor capable of 10 Giga IOPS, we need 640 of them Motivations for Parallelism • Conventional computers and sequential • a single CPU • a single stream of instructions • executing one instruction at a time (not completely true) – Single-CPU processor has a performance limit • Moore’s Law can’t go on forever • How to increase computing power? • Better processor design – More transistors, larger caches, advanced architectures • Better system design – Faster / larger memory, faster buses, better OS • Scale up the computer (parallelism) – Replicate hardware at component or whole computer levels – Parallel processor’s power is virtually unlimited • • • • 10 processor @ 500 Mega FLOPS each = 5 Giga FLOPS 100 processor @ 500 Mega FLOPS each = 50 Giga FLOPS 1,000 processor @ 500 Mega FLOPS each = 500 Giga FLOPS ... Motivations for Parallelism (cont.) • Additional Motivations – Solving bigger problems – Lowering cost Parallel Computation • Parallel computation means – multiple CPUs – single or multiple streams of instructions – executing multiple instructions at a time • Typical process – Breaking a problem into pieces and arranging for all pieces to be solved simultaneously on a multi-CPU computer system • Requirements – Parallel algorithms • only parallelizable applications can benefit from parallel implementation – Parallel languages and/or constructs • expressing parallelism – Parallel architectures • provide hardware support A Road Map of Topics Parallel Architectures Machine Models • • • • Vector / SIMD / MIMD design issues PRAM LogP Programming Models Parallel Algorithms • • • • • • • data parallel shared memory message passing master-slave divide-conquer control vs. Data complexity analysis Programming Languages Applications • • • • • • Fortran 90 / 95 Pthreads, OpenMP PVM, MPI Scientific computations data-intensive problems performance measurement Parallel Computing Terminology (1) • Hardware – Multicomputers • tightly networked, multiple uniform computers – Multiprocessors • tightly networked, multiple uniform processors with additional memory units – Supercomputers • general purpose and high-performance, nowadays almost always parallel – Clusters • Loosely networked commodity computers Parallel Computing Terminology (2) • Programming – Pipelining • divide computation into stages (segments) • assign separate functional units to each stage – Data Parallelism • multiple (uniform) functional units • apply same operation simultaneously to different elements of data set – Control Parallelism • multiple (specialized) functional units • apply distinct operations to data elements concurrently Parallel Computing Terminology • Performance (3) – Throughput • number of results per unit time – Speedup Time needed for the most efficient sequential algorithm S= —————————————————————————— — Time needed on a pipelined / parallel machine – Scalability • An algorithm is scalable if the available parallelism increases at least linearly with problem size • An architecture is scalable if it gives same performance per processor, as the number of processors and the size of the problem are both increased • Data-parallel algorithms tend to be more scalable than control-parallel algorithms An Illustrative Example • Problem – Find all primes less than or equal to some positive integer n • Method (the sieve algorithm) – Write down all th integers from 1 to n – Cross out from the list all multiples of 2, 3, 5, 7, … up to sqrt (n) An Illustrative Example (cont.) • sequential Implementation • Boolean array representing the integers from 1 to n • Buffer for holding current prime • Index for loop iterating through the array An Illustrative Example (cont.) • Control-Parallel Approach • Different processors strike out multiples of different primes • The boolean array and the current prime is shared; each processor has its own private copy of loop index An Illustrative Example (cont.) • Control-Parallel Approach (cont.) – Potential Problem — Race Conditions • Race 1: More than one processor may sieve multiples of the same prime – a processor reads the current prime, p, and goes off to sieve multiples of p ; later it finds a new prime and updates the current prime buffer – before the current prime is updated, another processor comes and reads p and goes off to do the same thing • Race 2: A processor may sieve multiples of a composite number – processor A start marking multiples of 2 – before it can mark any cells, processor B finds an unmarked cell, 3, and starts marking multiples of 3 – then processor C comes in and finds the next unmarked cell, 4, and starts marking multiples of 4 • These two race conditions would not cause incorrect result, but they will cause inefficiency An Illustrative Example (cont.) • Data-Parallel Approach • Each processor responsible for a unique range of the integers, it does all the striking in that range • Processor 1 is responsible for broadcasting its findings to other processors • Potential Problem – If [n/p] < sqrt(n), more than one processor need to broadcast their findings An Illustrative Example (cont.) • Performance Analysis – Sequential Algorithm • • • • • Cost of sieving multiples of 2: [(n-3)/2] Cost of sieving multiples of 3: [(n-8)/3] Cost of sieving multiples of 5: [(n-24)/5] ... For n=1,000, T=1,411 – Control-Parallel Algorithm • For p=2, n=1,000, T=706 • For p=3, n=1,000, T=499 • For p=4, n=1,000, T=499 An Illustrative Example (cont.) • Performance Analysis (cont.) – Data-Parallel Algorithm • Cost of broadcasting: k(P-1)l • Cost of striking: ([(n/p)/2]+ [(n/p)/3]+ … + [(n/p)/pk]) • For p=2, n=1,000, T≈781 • For p=3, n=1,000, T≈ 471 • For p=4, n=1,000, T≈ 337

Descargar
# Parallel Computing - Nanjing University