```Parallel Processing
A Perspective
Hardware and Software
1
Introduction to PP
 From “Efficient Linked List Ranking
Algorithms and Parentheses Matching
as a New Strategy for Parallel
Algorithm Design”, R. Halverson
 Chapter 1 – Introduction to Parallel
Processing
2
Parallel Processing Research
 1980’s – Great Deal of Research &
Publications
 1990’s – Hardware not too successful
so the research area “dies” – Why???
 Early 2000’s – Begins Resurgence?
Why??? Will it continue to be
successful this time ???
 2010 – Multicore, Graphics processing
units
3
Goal of PP
Why bother with Parallel Processing?
Goal: Solve problems faster!
In reality, faster but efficient!
Work-Optimal: parallel algorithm runs
faster than sequential algorithm in
proportion to the number of processors
used.
 Sometimes work-optimal is not
possible




4
Performance Metrics




Running time
Speedup = T1 /Tp
Work = p * Tp
Work Optimal: Work = O(T1)
 Linear speedup
 Scalable: if algorithm performance
increases linearly with the number of
processors utilized
5
PP Issues
 Processors: number, connectivity,
communication
 Memory: shared vs. local
 Data structures
 Data Distribution
 Problem Solving Strategies
6
Parallel Problems
 One Approach: Try to develop a
parallel solution to a problem without
consideration of the hardware.
 Apply: Apply the solution to the
specific hardware and determine
the extra cost, if any
 If not acceptably efficient, try again!
7
Parallel Problems
 Another Approach: Armed with the
knowledge of strategies, data
structures, etc. that work well for a
particular hardware, develop a
solution with a specific hardware in
mind.
 Third Approach: Modify a solution for
one hardware configuration for
another
8
Real World Problems
 Inherently Parallel – nature or
structure of the problem lends itself to
parallelism
 Examples
 Mowing a lawn
 Cleaning a house
 Problems are easily divided into subproblems; very little overhead
9
Real World Problems
 Not Inherently Parallel – parallelism is
possible but more complex to define
 Examples




Balancing a checkbook
Giving a haircut
Wallpapering a room
Prefix sums of an array
10
Some Computer Problems
 Are these “inherently parallel” or not?
 Processing customers’ monthly bills
 Payroll checks
 Building student grade reports from class
 Searching for an item in a linked list
 A video game program
 Searching a state driver’s license database
 Is problem hw, sw, data? Assumptions?
11
General Observations
 What characteristics make a problem
inherently parallel?
 What characteristics make a problem
difficult to parallelize?
* Consider hw, sw, data structures.
12
Payroll Problem
 Consider 10 PUs (processing unit) with
each employee’s information stored in
a row of an array, A.
 Label P0, P1,…P9
 A[100] – 0 to 99
For i = 0 to 9
Pi process A[i*10] to A[((i+1)*10)-1]
13
Code for Payroll
For i = 0 to 9
Pi process A[i*10] to A[((i+1)*10)-1]
Each PU runs a process in parallel
For each Pi , i = 0 to 9 do //separate
process
For j = 0 to 9
Process A[i*10 + j]
14
Time Complexity of Payroll
Algorithm
Consider P processors
Consider N data items
Each PC has N/P data items
Assume data is accessible &
writeable to each PC
 Time for each P: O(N/P)
 Work = P * O(N/P) = O(N)




15
Payroll Questions??
 Now we have a solution, must be
applied to hardware. Which
hardware?
 Main question: Where is the array
and how is it accessed by each
processor?
 One shared memory or many local
memories?
 Where are the results placed?
16
 Generally, in parallel algorithms, I/O
is disregarded.
 Assumption: Data is stored in the
available memory.
 Assumption: Results are written back
to memory.
 Data input and output are generally
independent of the processing
algorithm.
17
Balancing a Checkbook
 Consider same hardware & data array
 Can still distribute and process in the
same manner as the payroll
 Each block computes deposits as addition &
checks a subtraction; totals the page (10
totals)
 BUT then must combine the 10 totals to
the final total
18
Complexity of Checkbook







Consider P processors
Consider N data items
Each PC has N/P data items
Assume data is accessible & writeable
Time for each section: O(N/P)
Combination of P subtotals
Time for combining: O(P) to O(log P)
 Depends on strategy used
 Total: O(N/P + P) to O(N/P + log P)
19
Performance
Complexity - Perfect Parallel Algorithm
 If the best sequential algorithm for a
problem is O(f(x)) then the parallel
algorithm would be O(f(x)/P)
 This happens if little or no overhead
Actual Run Time
 Typically, takes 4 processors to
achieve ½ the actual run time
20
Performance Measures
 Run Time: not a practical measurement
Assume T1 & Tp are run times using 1 & p
processors, respectively
 Speedup: S = T1/Tp
 Work: W = p * Tp (aka Cost)
 If W = O(T1) the it is Work (Cost)
Optimal & achieves Linear Speedup
21
Scalability
 An algorithm is said to be Scalable
if performance increases linearly
with the number of processors
 i.e. double the processors, cut time in
half
 Implication: Algorithm sustains
good performance over a wide
range of processors.
22
Scalability
 At what point does adding more processors
stop improving the run time?
 Does adding processors ever cause the
algorithm to take more time?
 What is the optimal number of processors?
 Consider W = p * Tp = O(T1)
 Solve for p
 Want W to be optimal for different numbers
of processors
23
Models of Computation
 Two major categories
 Shared memory
 e.g. PRAM
 Fixed connection
 e.g. Hypercube
 There are numerous versions of
each
 Not all are totally realizable in hw
24
Sidenote: Models
 Distributed Computing
 Use of 2 or more separate
computers used to solve a single
problem
 A version of a network
 Clusters
 Not really a topic for this course
25
Shared Memory Model
 PRAM – parallel random access
machine
 A category with 4 variants
 EREW-CREW-ERCW-CRCW
 All communication through a
shared global memory
 Each PC has a small local memory
26
Variants of PRAM
 EREW-CREW-ERCW-CRCW
 Concurrent read: 2 or more
processors may read the same (or
different) memory location
simultaneously
 Exclusive read: 2 or more processors
may access global memory location
only if each is accessing a unique
 Similarly defined for write
27
Shared Memory Model
M
P0
M
P1
M
P2
M
P3
Shared Global Memory
28
Shared Memory
 What are some implications of
the variants in memory access of
the PRAM model?
 What is the strongest model?
29
Fixed Connection Models
 Each PU contains a Local Memory
 Distributed memory
 No shared memory
 PUs are connected through some type
of Interconnection Network
 Interconnection network defines the model
 Communication is via Message Passing
 Can be synchronous or asynchronous
30
Interconnection Networks
 Bus Network (Linear)
 Ring
 Mesh
 Torus
 Hypercube
31
Hypercube Model
 Distributed memory, message passing,
fixed connection, parallel computer
 N = 2r number of nodes
 E = r 2r-1 number of edges
 Nodes are numbered 0 – N in binary
such that any 2 nodes differing in one
bit are connected by an edge
 Dimension is r
32
Hypercube Examples
N = 2, 4
10
0
11
1
00
01
N=2
Dimension = 1
N=4
Dimension = 2
33
Hypercube Example
N=8
111
110
N=8
Dimension = 3
010
011
100
000
101
001
34
Hypercube Considerations
 Message Passing
 Communication
 Possible Delays
 Each PC has same work load
 Data Distribution
35
Consider Checkbook Problem
 How about distribution of data?
 Often initial distribution is
disregarded
 What about the combination of the
subtotals?
 Reduction is by dimension
 O(log P) = r
36
Design Strategies
used to aid in the development of
the solution to a problem
37
Extended from Sequential Use
 Extended from sequential use
 Divide-and-Conquer
Always used in parallel algorithms
Divide data vs. Divide code
 Branch-and-Bound
 Dynamic Programming
38
Developed for Parallel Use








Deterministic coin tossing
Symmetry breaking
Tree contraction
Euler Tours
All Nearest Smaller Values (ANSV)
Parentheses Matching
39
Divide-and-Conquer
 Most basic parallel strategy
 Used in virtually every parallel
algorithm
 Problem is divided into several subproblems that can be solved
independently; results of sub-problems
are combined into the final solution
 Example: Checkbook Problem
40
Dynamic Programming
 Divide & Conquer technique used when
sub-problems are not independent;
share common sub-problems
 Sub-problem solutions are stored in
table for use by other processes
 Often used for optimization problems
 Minimum or Maximum
 Fibonacci Numbers
41
Branch-and-Bound
technique
 Uses a bounding function that
allows some branches of the tree to
be pruned (i.e. eliminated)
 Example: Game programming
42
Symmetry Breaking
 Strategy that breaks a linked
structure (e.g. linked list) into disjoint
pieces for processing
 Deterministic Coin Tossing
 Using a binary representation of index,
processing
 Often used in Linked List Ranking
Algorithms
43
 Applying 2 or more algorithms to a
single problem,
 Change from one to another based on
the ratio of the problem size to the
number of processors – Threshold
 This “fine tuning” sometimes allows
for better performance
44
Tree Contraction
(aka Contraction)
 Nodes of a tree are removed;
information removed is combined with
remaining nodes
 Multiple processors are assigned to
independent nodes
 Tree is reduced to a single node which
contains the solution
 E.G. Arithmetic Expression computation;
Addition of a series of numbers
45
Euler Tour
 Create duplicate nodes in a tree or
graph with edge in opposite direction
to create a circuit
 Allows tree or graph to be processed
46
 Halverson’s area of dissertation
research
 ProblemDefinition:
 Given a linked list of elements, number
the elements in order (or in reverse
order)
 For list of length 20+
47
Applied to a wide range of problems





Euler Tours
Tree Searches
List packing
Connectivity
Tree Traversals
 Spanning Trees &
Forests
 Connected
Components
 Graph
Decomposition
48
All Nearest Smaller Values
 Given a sequence of values, for each
value x, which predecessor elements
are smaller than x
 Successfully applied to




Depth first search of interval graph
Parentheses matching
Line Packing
Triangulating a monotone polynomial
49
Parentheses Matching
 In a properly formed string of
parentheses, find the index of each
parentheses mate
 Applied to solve




Heights of all nodes in a tree
Extreme values in a tree
Lowest common ancestor
Balancing binary trees
50
Parentheses Matching
1
2 3
4
5
6 7 8 9
10 11 12
( ( ) ( ( ( ) ) )
(
12 3 2
11 10
9
8
7 6
5
4
)
)
1
51
Parallel Algorithm Design
 Identify problems and/or classes
of problems for which a particular
strategy will work
 Apply to the appropriate hardware
 Most of the paradigms have been
optimized for a variety of parallel
architectures
52
 Not a paradigm, but an operation used
in many parallel algorithms
 Provide one or more items of data to all
the processors (individual memories)
 Let P be the number of processors.
For most models, broadcast operation is
O(log P) time complexity
53
 Shared Memory (EREW)
 P0 writes for P1; P0 & P1 write for P2 &
P3; P0 – P3 write for P4 – P7
 Then each PU has a copy to be read in one
time unit
 Hypercube
 P0 sends to P1; P0 & P1 send to P2 & P3,
etc.
 Both are O(log P)
54
Remainder of this Course
 Principles & Techniques of parallel
processing & parallel algorithm
development on a variety of models
 Application using CUDA
 nVidia Graphics card processors
 Overview of new models & languages
 Student presentations
55
```