Chapter 2
Parallel Architectures
Outline
• Some chapter references
• Brief review of complexity
– Terminology for comparisons
• Interconnection networks
• Processor arrays
• Multiprocessors
• Multicomputers
• Flynn’s Taxonomy – moved to Chpt 1
2
Some Chapter References
• Selim Akl, The Design and Analysis of Parallel
Algorithms, Prentice Hall, 1989 (earlier textbook).
• G. C. Fox, What Have We Learnt from Using Real
Parallel Machines to Solve Real Problems? Technical
Report C3P-522, Cal Tech, December 1989. (Included in
part in more recent books co-authored by Fox.)
• A. Grama, A. Gupta, G. Karypis, V. Kumar, Introduction
to Parallel Computing, Second Edition, 2003 (first edition
1994), Addison Wesley.
• Harry Jordan, Gita Alaghband, Fundamentals of Parallel
Processing: Algorithms, Architectures, Languages,
Prentice Hall, 2003, Ch 1, 3-5.
• F. Thomson Leighton, Introduction to Parallel Algorithms
and Architectures: Arrays, Trees, Hypercubes, Morgan
Kaufmann Publishers, 1992.
3
References - continued
• Gregory Pfister, In Search of Clusters: The Ongoing
Battle in Lowly Parallel Computing, 2nd Edition, Ch 2. (Discusses
details of some serious problems that MIMDs incur.)
• Michael Quinn, Parallel Programming in C with MPI and
OpenMP, McGraw Hill, 2004 (current textbook), Chapter
2.
• Michael Quinn, Parallel Computing: Theory and Practice,
McGraw Hill, 1994, Ch. 1,2
• Sayed H. Roosta, Parallel Processing & Parallel
Algorithms: Theory and Computation, Springer Verlag,
2000, Chpt 1.
• Wilkinson & Allen, Parallel Programming: Techniques
and Applications, Prentice Hall, 2nd Edition, 2005, Ch 12.
4
Brief Review
Complexity Concepts Needed for Comparisons
• Whenever we define a counting function, we usually
characterize the growth rate of that function in terms of
complexity classes.
• Technical Definition: We say a function f(n) is in O(g(n))
if (and only if) there are positive constants c and n0 such
that
0 ≤ f(n) ≤ c·g(n) for all n ≥ n0
• O(n) is read as big-oh of n.
• This notation can be used to separate counting functions
into complexity classes that characterize the size of the
count.
• We can use it for any kind of counting functions such as
timings, bisection widths, etc.
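As a concrete illustration (not from the slides), the definition can be checked mechanically for a particular f, g, and a made-up pair of witness constants:

```python
# Illustration of the big-oh definition: f(n) = 3n^2 + 10n is in O(n^2),
# witnessed here by the (illustrative) constants c = 4 and n0 = 10,
# since 10n <= n^2 whenever n >= 10.
def f(n):
    return 3 * n**2 + 10 * n

def g(n):
    return n**2

c, n0 = 4, 10

# Check 0 <= f(n) <= c*g(n) over a range of n >= n0.
assert all(0 <= f(n) <= c * g(n) for n in range(n0, 10_000))
print("witness c =", c, "n0 =", n0, "holds")
```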
5
Big-Oh and Asymptotic Growth Rate
• The big-Oh notation gives an upper bound on the
(asymptotic) growth rate of a function
• The statement “f(n) is O(g(n))” means that the growth
rate of f(n) is not greater than the growth rate of g(n)
• We can use the big-Oh notation to rank functions
according to their growth rate
Assume:              f(n) is O(g(n))    g(n) is O(f(n))
g(n) grows faster          Yes                No
f(n) grows faster          No                 Yes
Same growth                Yes                Yes
6
Relatives of Big-Oh
• big-Omega
– f(n) is Ω(g(n)) if there is a constant c > 0
and an integer constant n0 ≥ 1 such that
f(n) ≥ c·g(n) ≥ 0 for n ≥ n0
Intuitively, this says that, up to a constant factor, f(n)
is asymptotically greater than or equal to g(n).
• big-Theta
– f(n) is Θ(g(n)) if there are constants c′ > 0 and c′′ > 0
and an integer constant n0 ≥ 1 such that
0 ≤ c′·g(n) ≤ f(n) ≤ c′′·g(n) for n ≥ n0
Intuitively, this says that, up to constant factors, f(n) and g(n)
are asymptotically the same.
Note: These concepts are covered in algorithm courses.
7
Relatives of Big-Oh
• little-oh
– f(n) is o(g(n)) if, for any constant c > 0, there is an
integer constant n0  0 such that 0  f(n) < cg(n) for n
 n0
Intuitively, this says f(n) is, up to a constant,
asymptotically strictly less than g(n), so f(n) ≠ (g(n)).
• little-omega
– f(n) is (g(n)) if, for any constant c > 0, there is an
integer constant n0  0 such that f(n) > cg(n) ≥ 0 for n
 n0
Intuitively, this says f(n) is, up to a constant,
asymptotically strictly greater than g(n), so f(n) ≠
(g(n)).
These are not used as much as the earlier definitions,
but they round out the picture.
8
Summary for Intuition for
Asymptotic Notation
big-Oh
– f(n) is O(g(n)) if f(n) is asymptotically less than or equal to
g(n)
big-Omega
– f(n) is Ω(g(n)) if f(n) is asymptotically greater than or
equal to g(n)
big-Theta
– f(n) is Θ(g(n)) if f(n) is asymptotically equal to g(n)
little-oh
– f(n) is o(g(n)) if f(n) is asymptotically strictly less than g(n)
little-omega
– f(n) is ω(g(n)) if f(n) is asymptotically strictly greater than g(n)
9
A CALCULUS DEFINITION OF O, Θ
(often easier to use)
Definition: Let f and g be functions defined on
the positive integers with nonnegative values.
We say g is in O(f) if and only if
lim (n → ∞) g(n)/f(n) = c
for some nonnegative real number c, i.e., the
limit exists and is not infinite.
Definition: We say f is in Θ(g) if and only if
f is in O(g) and g is in O(f)
Note: Often use L'Hopital's Rule to calculate the
limits you need.
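For intuition, the calculus definition can also be explored numerically (an illustrative sketch; the functions below are made up):

```python
# Numerically estimating lim g(n)/f(n) for the calculus definition.
# Here g(n) = 5n^2 + 7 and f(n) = n^2: the ratio tends to 5, a finite
# nonnegative constant, so g is in O(f); symmetrically f is in O(g),
# and therefore f is in Theta(g).
def ratio(g, f, n):
    return g(n) / f(n)

g = lambda n: 5 * n**2 + 7
f = lambda n: n**2

for n in (10, 1_000, 100_000):
    print(n, ratio(g, f, n))   # approaches 5.0 as n grows
```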
10
Why Asymptotic Behavior is Important
• 1) Allows us to compare counts on large sets.
• 2) Helps us understand the maximum size of
input that can be handled in a given time,
provided we know the environment in which we
are running.
• 3) Stresses the fact that even dramatic
speedups in hardware cannot overcome the
handicap of an asymptotically slow algorithm.
11
Recall: ORDER WINS OUT
(Example from Baase’s Algorithms Text)
The TRS-80
Main language support: BASIC - typically a slow running
interpreted language
For more details on TRS-80 see:
http://mate.kjsl.com/trs80/
The CRAY-YMP
Language used in example: FORTRAN- a fast running language
For more details on CRAY-YMP see:
http://ds.dial.pipex.com/town/park/abm64/CrayWWWStuff/Cfaqp1.html#TOC3
12
n          CRAY YMP with FORTRAN     TRS-80 with BASIC
           (complexity is 3n^3)      (complexity is 19,500,000n)
10         3 microsec                200 millisec
100        3 millisec                2 sec
1000       3 sec                     20 sec
2500       50 sec                    50 sec
10000      49 min                    3.2 min
1000000    95 years                  5.4 hours

microsecond (abbr µsec): one-millionth of a second.
millisecond (abbr msec): one-thousandth of a second.
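The break-even point in the table can be computed directly. Assuming both operation counts are in the same time unit, setting 3n³ = 19,500,000n gives n² = 6,500,000:

```python
# Where does the CRAY's 3n^3 count overtake the TRS-80's 19,500,000n?
# Solving 3n^3 = 19_500_000 * n gives n^2 = 6_500_000, i.e. n ~ 2550,
# matching the table row where both machines take about 50 sec.
import math

crossover = math.isqrt(19_500_000 // 3)   # floor of sqrt(6,500,000)
print(crossover)   # 2549
```

Beyond roughly n = 2550, the asymptotically slower cubic algorithm loses despite running on far faster hardware.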
13
Interconnection Networks
• Uses of interconnection networks
– Connect processors to shared memory
– Connect processors to each other
• Interconnection media types
– Shared medium
– Switched medium
• Different interconnection networks define
different parallel machines.
• The interconnection network’s properties
influence the type of algorithm used for various
machines, as the network affects how data is routed.
14
Shared versus Switched Media
a. With shared medium, one message is sent & all processors listen
b. With switched medium, multiple messages are possible.
15
Shared Medium
• Allows only one message at a time
• Messages are broadcast
• Each processor “listens” to every message
• Before sending a message, a processor must
“listen” until the medium is unused
• Collisions require resending of messages
• Ethernet is an example
16
Switched Medium
• Supports point-to-point messages
between pairs of processors
• Each processor is connected to one switch
• Advantages over shared media
– Allows multiple messages to be sent
simultaneously
– Allows scaling of the network to
accommodate the increase in processors
17
Switch Network Topologies
• View switched network as a graph
– Vertices = processors or switches
– Edges = communication paths
• Two kinds of topologies
– Direct
– Indirect
18
Direct Topology
• Ratio of switch nodes to processor nodes
is 1:1
• Every switch node is connected to
– 1 processor node
– At least 1 other switch node
Indirect Topology
• Ratio of switch nodes to processor nodes is
greater than 1:1
• Some switches simply connect to other switches
19
Terminology for Evaluating
Switch Topologies
• We need to evaluate 4 characteristics of a
network in order to help us understand their
effectiveness in implementing efficient parallel
algorithms on a machine with a given network.
• These are
– The diameter
– The bisection width
– The edges per node
– The constant edge length
• We’ll define these and see how they affect
algorithm choice.
• Then we will investigate several different
topologies and see how these characteristics
are evaluated.
20
Terminology for Evaluating
Switch Topologies
• Diameter – Largest distance between two
switch nodes.
– A low diameter is desirable
– It puts a lower bound on the complexity of
parallel algorithms that require
communication between arbitrary pairs of
nodes.
21
Terminology for Evaluating
Switch Topologies
• Bisection width – The minimum number of edges
between switch nodes that must be removed in
order to divide the network into two halves
(within 1 node, if the number of processors is
odd.)
• High bisection width is desirable.
• In algorithms requiring large amounts of data
movement, the size of the data set divided by
the bisection width puts a lower bound on the
complexity of the algorithm.
• Actually proving what the bisection width of a
network is can be quite difficult.
22
Terminology for Evaluating
Switch Topologies
• Number of edges per node
– It is best if the maximum number of edges/node is a
constant independent of network size, as this allows
the processor organization to scale more easily to a
larger number of nodes.
– Degree is the maximum number of edges per node.
• Constant edge length? (yes/no)
– Again, for scalability, it is best if the nodes and edges
can be laid out in 3D space so that the maximum
edge length is a constant independent of network
size.
23
Evaluating Switch Topologies
• Many have been proposed and analyzed. We
will consider several well known ones:
– 2-D mesh
– linear network
– binary tree
– hypertree
– butterfly
– hypercube
– shuffle-exchange
• Those in yellow have been used in commercial
parallel computers.
24
2-D Meshes
Note: Circles represent switches and squares
represent processors in all these slides.
25
2-D Mesh Network
• Direct topology
• Switches arranged into a 2-D lattice or grid
• Communication allowed only between
neighboring switches
• Torus: Variant that includes wraparound
connections between switches on edge of
mesh
26
Evaluating 2-D Meshes
(Assumes mesh is a square)
n = number of processors
• Diameter:
– Θ(n^(1/2))
– Places a lower bound on algorithms that require
processing with arbitrary nodes sharing data.
• Bisection width:
– Θ(n^(1/2))
– Places a lower bound on algorithms that require
distribution of data to all nodes.
• Max number of edges per switch:
– 4 is the degree
• Constant edge length?
– Yes
• Does this scale well?
– Yes
27
Linear Network
• Switches arranged into a 1-D mesh
• Direct topology
• Corresponds to a row or column of a 2-D mesh
• Ring: A variant that allows a wraparound
connection between the switches on the ends.
• The linear and ring networks have many
applications
• Essentially supports a pipeline in both directions
• Although these networks are very simple, they
support many optimal algorithms.
28
Evaluating Linear and Ring Networks
• Diameter
– Linear: n-1 or Θ(n)
– Ring: n/2 or Θ(n)
• Bisection width:
– Linear: 1 or Θ(1)
– Ring: 2 or Θ(1)
• Degree for switches:
– 2
• Constant edge length?
– Yes
• Does this scale well?
– Yes
29
Binary Tree Network
• Indirect topology
• n = 2^d processor nodes and 2n-1 switches,
where d = 0,1,... is the number of levels
E.g., for d = 3: 2^3 = 8 processors
on the bottom and
2n – 1 = 2(8) – 1 =
15 switches
30
Evaluating Binary Tree Network
• Diameter:
– 2 log n or O(log n).
– Note- this is small
• Bisection width:
– 1, the lowest possible number
• Degree:
– 3
• Constant edge length?
– No
• Does this scale well?
– No
31
Hypertree Network (of degree 4 and
depth 2)
(a) Front view: 4-ary tree of height 2
(b) Side view: upside down binary tree of
height d
(c) Complete network
32
Hypertree Network
• Indirect topology
• Note- the degree k and the depth d must be
specified.
• This gives from the front a k-ary tree of height d.
• From the side, the same network looks like an
upside down binary tree of height d.
• Joining the front and side views yields the
complete network.
33
Evaluating 4-ary Hypertree with
Depth d
• A 4-ary hypertree has n = 4^d processors
– The general formula for a k-ary hypertree is n = k^d
• Diameter is 2d
– shares the low diameter of a binary tree
• Bisection width = 2^(d+1)
– Note here (d = 2), 2^(d+1) = 2^3 = 8
– Large value - much better than a binary tree
• Constant edge length?
– No
• Degree = 6
34
Butterfly Network
• Indirect topology
• n = 2^d processor
nodes connected
by n(log n + 1)
switching nodes
As complicated as this
switching network appears
to be, it is really quite
simple as it admits a very
nice routing algorithm!
Wrapped Butterfly: When
top and bottom ranks are
merged into single rank.
A 2^3 = 8 processor
butterfly network with
8*4=32 switching nodes
[Figure: an 8-processor butterfly network; processors 0–7 at the top,
switches (i, j) arranged in ranks i = 0–3 and columns j = 0–7.]
35
The rows are called ranks.
Building the 2^3 Butterfly Network
• There are 8 processors.
• Have 4 ranks (i.e. rows) with 8 switches per rank.
• Connections:
– Node(i,j), for i > 0, is connected to two nodes on rank
i-1, namely node(i-1,j) and node(i-1,m), where m is
the integer found by flipping the ith most significant bit
in the binary d-bit representation of j.
– For example, suppose i = 2 and j = 3. Then node (2,3)
is connected to node (1,3).
– To get the other connection, 3 = 011 in binary. So, flip the
2nd most significant bit – i.e. 001 – and connect node(2,3) to
node(1,1) --- NOTE: There is an error on pg 32 on this
example.
• Nodes connected by a cross edge from rank i to rank i+1
have node numbers that differ only in their (i+1) bit.
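The connection rule above can be sketched in a few lines (the helper name is made up for illustration):

```python
# Generating the rank-(i-1) neighbors of butterfly switch node(i, j)
# in a 2^d-processor butterfly. The "cross" neighbor flips the i-th
# most significant of the d bits of j, i.e. bit (d - i) counting from 0
# at the least significant end.
def butterfly_neighbors(i, j, d):
    assert 1 <= i <= d
    straight = (i - 1, j)
    cross = (i - 1, j ^ (1 << (d - i)))
    return straight, cross

# The slide's example: for d = 3, node(2, 3) connects to node(1, 3)
# and node(1, 1).
print(butterfly_neighbors(2, 3, 3))   # ((1, 3), (1, 1))
```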
36
Why It Is Called a Butterfly Network
• Walk cycles such as node(i,j), node(i-1,j),
node(i,m), node(i-1,m), node(i,j) where m
is determined by the bit flipping as shown
and you “see” a butterfly:
37
Butterfly Network Routing
Send message from
processor 2 to processor 5.
Algorithm:
0 means ship left;
1 means ship right.
1) 5 = 101. Pluck off leftmost
bit 1 and send “01msg” to
right.
2) Pluck off leftmost bit 0
and send “1msg” to left.
3) Pluck off leftmost bit 1
and send “msg” to right.
Each cross edge followed
changes address by 1 bit.
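A minimal sketch of this routing rule, under the simplified model that the step from rank i-1 to rank i forces the i-th most significant bit of the current column to match the destination:

```python
# Sketch of butterfly routing: at each rank the leftmost remaining
# destination bit picks the edge, and each cross edge fixes one
# address bit, so after d steps the message is in the right column.
def butterfly_route(src, dst, d):
    """Return the column visited at each rank while routing src -> dst."""
    cur = src
    path = [cur]
    for i in range(d):                    # ranks 1..d
        bit = 1 << (d - 1 - i)            # i-th most significant bit
        cur = (cur & ~bit) | (dst & bit)  # force that bit to match dst
        path.append(cur)
    return path

# Slide example: processor 2 -> processor 5 in the 2^3 butterfly.
print(butterfly_route(2, 5, 3))   # [2, 6, 4, 5]
```

Each step changes the column number by at most one bit, matching the observation that each cross edge changes the address by 1 bit.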
38
Evaluating the Butterfly Network
with n Processors
• Diameter:
– log n
• Bisection width:
– n / 2 *(Likely error 32/2=16)
• Degree:
– 4 (even for d > 3)
• Constant edge length?
– No, edge lengths grow exponentially
as the number of ranks increases
* On pg 442, Leighton gives Θ(n / log(n)) as the bisection width. Simply remove
cross edges between two successive levels to create a bisection cut.
39
Hypercube
(also called binary n-cube)
[Figure: a 4-dimensional hypercube; 16 nodes labeled 0000–1111,
with edges joining nodes whose labels differ in exactly one bit.]
A hypercube with n = 2^d processors & switches, for d = 4
40
Hypercube (or Binary n-cube)
n = 2^d Processors
• Direct topology
• 2 x 2 x … x 2 mesh
• Number of nodes is a
power of 2
• Node addresses 0, 1,
…, n-1
• Node i is connected
to the d nodes whose
addresses differ from i
in exactly one bit
position.
• Example: node k = 0111 is
connected to 1111,
0011, 0101, and 0110
41
Growing a Hypercube
Note: For d = 4, it is called a
4-D hypercube or just a 4 cube
42
Evaluating Hypercube Network
with n = 2^d nodes
• Diameter:
– d = log n
• Bisection width:
– n/2
• Edges per node:
– log n
• Constant edge length?
– No. The length of the
longest edge increases
as n increases.
43
Routing on the Hypercube Network
• Example: Send a
message from node 2
= 0010 to node 5 =
0101
• The nodes differ in 3
bits so the shortest
path will be of length
3.
• One path is
0010 → 0110 →
0100 → 0101,
obtained by flipping
one of the differing bits
at each step.
• As with the butterfly network, bit
flipping helps you route on this
network.
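The bit-flipping idea can be coded directly. This illustrative version flips the lowest differing bit first, which yields a different (but equally short) path than the one listed above:

```python
# Hypercube routing by bit flipping: repeatedly flip one bit in which
# the current address still differs from the destination. The path
# length equals the number of differing bits (the Hamming distance).
def hypercube_route(src, dst):
    path = [src]
    cur = src
    while cur != dst:
        diff = cur ^ dst
        lowest = diff & -diff   # lowest set bit = lowest differing bit
        cur ^= lowest
        path.append(cur)
    return path

# Node 2 = 0010 to node 5 = 0101: the addresses differ in 3 bits,
# so the path has exactly 3 hops.
print([format(x, "04b") for x in hypercube_route(0b0010, 0b0101)])
```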
44
A Perfect Shuffle
• A permutation that is produced as follows is
called a perfect shuffle:
• Given a power of 2 cards, numbered 0, 1, 2, ...,
2^d - 1, write the card number with d bits. By left
rotating the bits with a wrap, we calculate the
position of the card after the perfect shuffle.
• Example: For d = 3, card 5 = 101. Left rotating
and wrapping gives us 011. So, card 5 goes to
position 3. Note that card 0 = 000 and card 7 =
111, stay in position.
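The left rotation is a one-line bit manipulation (illustrative helper name):

```python
# The perfect shuffle as a left rotation of the d-bit card number:
# shift left by one, and wrap the bit that falls off the top back
# into the lowest position.
def perfect_shuffle(card, d):
    mask = (1 << d) - 1
    return ((card << 1) | (card >> (d - 1))) & mask

# Slide example, d = 3: card 5 = 101 -> 011 = position 3;
# cards 0 (000) and 7 (111) stay in position.
print([perfect_shuffle(c, 3) for c in range(8)])   # [0, 2, 4, 6, 1, 3, 5, 7]
```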
45
Shuffle-exchange Network
with n = 2^d Processors
[Figure: an 8-node shuffle-exchange network, nodes 0–7,
with shuffle and exchange links.]
• Direct topology
• Number of nodes is a power of 2
• Nodes have addresses 0, 1, …, 2^d - 1
• Two outgoing links from node i
– Shuffle link to node LeftCycle(i)
– Exchange link between node i and node i+1
when i is even
46
Shuffle-exchange Addressing – 16
processors
[Figure: a 16-node shuffle-exchange network, nodes 0000–1111.]
No arrows on line segment means it is bidirectional.
Otherwise, you must follow the arrows.
Devising a routing algorithm for this network is
interesting and may be a homework problem.
47
Evaluating the Shuffle-exchange
• Diameter:
– 2 log n – 1
• Edges per node:
– 3
• Constant edge length?
– No
• Bisection width:
– Θ(n / log n)
– Between n/(2 log n) and 2n/log n*
48
* See Leighton pg 480
Two Problems with
Shuffle-Exchange
• Shuffle-Exchange does not expand well
– A large shuffle-exchange network does not
decompose well into smaller separate shuffle-exchange
networks.
– In a large shuffle-exchange network, a small
percentage of nodes will be hot spots
• They will encounter much heavier traffic
• Above results are in dissertation of one of
Batcher’s students.
49
Comparing Networks
(See Table 2.1)
• All have logarithmic diameter
except 2-D mesh
• 4-ary hypertree, butterfly, and hypercube have
bisection width n / 2 (likely untrue for the butterfly)
• All have constant edges per node except
hypercube
• Only 2-D mesh, linear, and ring topologies keep
edge lengths constant as network size increases
• Shuffle-exchange is a good compromise- fixed
number of edges per node, low diameter, good
bisection width.
– However, negative results on preceding slide also
need to be considered.
50
Alternate Names for SIMDs
• Recall that all active processors of a true SIMD
computer must simultaneously access the same
memory location.
• The value in the i-th processor can be viewed as
the i-th component of a vector.
• SIMD machines are sometimes called vector
computers [Jordan,et.al.] or processor arrays
[Quinn 94,04] based on their ability to execute
vector and matrix operations efficiently.
51
SIMD Computers
• SIMD computers that focus on vector
operations
– Support some vector and possibly matrix
operations in hardware
– Usually limit or provide less support for non-vector
operations involving data in the
“vector components”.
• General purpose SIMD computers
– May also provide some vector and possibly
matrix operations in hardware.
– More support for traditional type operations
(e.g., other than for vector/matrix data types).
52
Pipelined Architectures
• Pipelined architectures are sometimes
considered to be SIMD architectures
– See pg 37 of Textbook & pg 8-9 Jordan et. al.
– Vector components are entered successively into first
processor in pipeline.
– The i-th processor of the pipeline receives the output
from the (i-1)th processor.
• Normal “operations” in each processor are much
larger (coarser) in pipelined computers than in
true SIMDs
• “Pipelined” is somewhat SIMD in nature in that
synchronization is not required.
53
Why Processor Arrays?
• Historically, high cost of control units
• Scientific applications have data parallelism
54
Data/instruction Storage
• Front end computer
– Also called the control unit
– Holds and runs program
– Data manipulated sequentially
• Processor array
– Data manipulated in parallel
55
Processor Array Performance
• Performance: work done per time unit
• Performance of processor array
– Speed of processing elements
– Utilization of processing elements
56
Performance Example 1
• 1024 processors
• Each adds a pair of integers in 1 µsec (1
microsecond, one millionth of a second, or
10^-6 second.)
• What is the performance when adding two
1024-element vectors (one per
processor)?

Performance = 1024 operations / 1 µsec ≈ 1.024 × 10^9 ops/sec
57
Performance Example 2
• 512 processors
• Each adds two integers in 1 µsec
• What is the performance when adding two
vectors of length 600?
• Since 600 > 512, 88 processors must add
two pairs of integers.
• The other 424 processors add only a
single pair of integers.
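A quick calculation, mirroring the reasoning above, gives the answer:

```python
import math

# Working out Example 2: 600 additions on 512 processors, 1 usec per add.
processors, ops = 512, 600
steps = math.ceil(ops / processors)    # parallel steps needed
busy_twice = ops - processors          # processors that add two pairs
busy_once = processors - busy_twice    # processors that add one pair
time_sec = steps * 1e-6                # total elapsed time
performance = ops / time_sec           # operations per second
print(steps, busy_twice, busy_once)    # 2 88 424
print(performance)
```

The result, 600 operations in 2 µsec, is 3 × 10^8 ops/sec, well under the 10^9 ops/sec the machine is capable of: utilization, not just processor speed, determines performance.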
58
Example of a 2-D Processor
Interconnection Network in a Processor
Array
Each VLSI chip has 16 processing elements.
Each PE can simultaneously send a value to a neighbor.
PE =
processor
element
59
SIMD Execution Style
• The traditional (SIMD, vector, processor array) execution
style ([Quinn 94, pg 62], [Quinn 2004, pgs 37-43]):
– The sequential processor that broadcasts the
commands to the rest of the processors is called the
front end or control unit (or sometimes host).
– The front end is a general purpose CPU that stores
the program and the data that is not manipulated in
parallel.
– The front end normally executes the sequential
portions of the program.
• Alternatively, all PEs needing the computation can execute the steps
synchronously, avoiding the broadcast cost of distributing results
– Each processing element has a local memory that
cannot be directly accessed by the control unit or
other processing elements.
60
SIMD Execution Style
– Collectively, the individual memories of the
processing elements (PEs) store the (vector)
data that is processed in parallel.
– When the front end encounters an instruction
whose operand is a vector, it issues a
command to the PEs to perform the
instruction in parallel.
– Although the PEs execute in parallel, some
units can be allowed to skip particular
instructions.
61
Masking on Processor Arrays
• All the processors work in lockstep except those
that are masked out (by setting mask register).
• The conditional if-then-else is different for
processor arrays than sequential version
– Every active processor tests to see if its data meets
the negation of the boolean condition.
– If it does, it sets its mask bit so those processors will
not participate in the operation initially.
– Next, the unmasked processors execute the THEN
part.
– Afterwards, mask bits (for the original set of active
processors) are flipped and the newly unmasked
processors perform the ELSE part.
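A toy simulation of this masked execution, with made-up operations standing in for the THEN and ELSE parts:

```python
# Toy model of SIMD masking for "if (COND) then A else B":
# each PE holds one data value; masked PEs sit out each phase.
def simd_if_then_else(data, cond, then_op, else_op):
    # Each PE tests the negation of COND; PEs satisfying it are masked.
    mask = [not cond(x) for x in data]
    # The unmasked PEs execute the THEN part in lockstep.
    data = [then_op(x) if not m else x for x, m in zip(data, mask)]
    # Mask bits flip; the newly unmasked PEs execute the ELSE part.
    data = [else_op(x) if m else x for x, m in zip(data, mask)]
    return data

# if (x is even) then x // 2 else 3x + 1, across 6 PEs:
print(simd_if_then_else([1, 2, 3, 4, 5, 6],
                        lambda x: x % 2 == 0,
                        lambda x: x // 2,
                        lambda x: 3 * x + 1))   # [4, 1, 10, 2, 16, 3]
```

Note that the two phases run one after the other, which is exactly why roughly half the PEs can sit idle during a balanced if-then-else.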
62
if (COND) then A else B
63
SIMD Machines
• An early SIMD computer designed for vector and
matrix processing was the Illiac IV computer
– Initial development at the University of Illinois 1965-70
– Moved to NASA Ames, completed in 1972 but not
fully functional until 1976.
– See Jordan et. al., pg 7 and Wikipedia
• The MPP, DAP, the Connection Machines CM-1
and CM-2, and MasPar’s MP-1 and MP-2 are
examples of SIMD computers
– See Akl pg 8-12 and [Quinn, 94]
• The CRAY-1 and the Cyber-205 use pipelined
arithmetic units to support vector operations and
are sometimes called a pipelined SIMD
– See [Jordan, et al, p7], [Quinn 94, pg 61-2], and
[Quinn 2004, pg 37].
66
SIMD Machines
• Quinn [1994, pg 63-67] discusses the CM-2
Connection Machine (with 64K PEs) and a
smaller & updated CM-200.
• Professor Batcher was the chief architect for the
STARAN and the MPP (Massively Parallel
Processor) and an advisor for the ASPRO
– ASPRO is a small second generation STARAN used
by the Navy in the spy planes.
• Professor Batcher is best known architecturally
for the MPP, which belongs to the Smithsonian
Institution & is currently displayed at a D.C. airport.
67
Today’s SIMDs
• SIMD functionality is sometimes embedded in
sequential machines.
• Others are being built as part of hybrid
architectures.
• Some SIMD and SIMD-like features are included
in some multi/many core processing units
• Some SIMD-like architectures have been built
as special purpose machines, although some of
these could classify as general purpose.
– Some of this work has been proprietary.
– The fact that a parallel computer is SIMD or SIMD-like
is often not advertised by the company building them.
68
A Company that Recently Built an
Inexpensive SIMD
• ClearSpeed produced a COTS (commodity off
the shelf) SIMD Board
– WorldScape has developed some defense and
commercial applications for this computer.
• Not a traditional SIMD as the hardware doesn’t
tightly synchronize the execution of instructions.
– Hardware design supports efficient synchronization
• This machine is programmed like a SIMD.
• The U.S. Navy observed that these machines
process radar an order of magnitude faster than others.
• Earlier, quite a bit of information about this was
at www.wscape.com and www.clearspeed.com
69
An Example of a Hybrid SIMD
• Embedded Massively Parallel Accelerators
– Systola 1024: PC add-on board with 1024
processors
– Fuzion 150: 1536 processors
on a single chip
– Other accelerators: Decypher, Biocellerator,
GeneMatcher2, Kestrel, SAMBA, P-NAC, Splash-2,
BioScan
(This and the next three slides are due to Prabhakar R. Gudla (U
of Maryland) at a CMSC 838T Presentation, 4/23/2003.)
70
Hybrid Architecture
[Figure: sixteen Systola 1024 boards connected by a
high-speed Myrinet switch.]
– combines the SIMD and MIMD paradigms within a parallel
architecture → Hybrid Computer
71
Architecture of Systola
1024
• Instruction Systolic
Array:
– 32 × 32 mesh of
processing elements
– wavefront instruction
execution
[Figure: Systola 1024 block diagram: the ISA with its interface
processors, RAM NORTH and RAM WEST banks, controller,
program memory, and host computer bus.]
72
SIMDs Embedded in SISDs
• Intel's Pentium 4 included what they call MMX
technology to gain a significant performance
boost
• IBM and Motorola incorporated the technology
into their G4 PowerPC chip in what they call
their Velocity Engine.
• Both MMX technology and the Velocity Engine
are the chip manufacturer's name for their
proprietary SIMD processors and parallel
extensions to their operating code.
• This same approach is used by NVidia and
Evans & Sutherland to dramatically accelerate
graphics rendering.
73
Special Purpose SIMDs in the
Bioinformatics Arena
• Paracel
– Acquired by Celera Genomics in 2000
– Products include the sequence
supercomputer GeneMatcher, which has a
high throughput sequence analysis capability
• Supports over a million processors
– GeneMatcher was used by Celera in their
race with U.S. government to complete the
description of the human genome sequencing
• TimeLogic, Inc
– Has DeCypher, a reconfigurable SIMD
74
Advantages of SIMDs
• Reference: [Roosta, pg 10]
• Less hardware than MIMDs as they have only
one control unit.
– Control units are complex.
• Less memory needed than MIMD
– Only one copy of the instructions need to be stored
– Allows more data to be stored in memory.
• Less startup time in communicating between
PEs.
75
Advantages of SIMDs (cont)
• Single instruction stream and synchronization of
PEs make SIMD applications easier to program,
understand, & debug.
– Similar to sequential programming
• Control flow operations and scalar operations
can be executed on the control unit while PEs
are executing other instructions.
• MIMD architectures require explicit
synchronization primitives, which create a
substantial amount of additional overhead.
76
Advantages of SIMDs (cont)
• During a communication operation between
PEs,
– PEs send data to a neighboring PE in parallel and in
lock step
– No need to create a header with routing information
as “routing” is determined by program steps.
– the entire communication operation is executed
synchronously
– SIMDs are deterministic & have much more
predictable running time.
• Can normally compute a tight (worst case) upper bound for
the time for communications operations.
• Less complex hardware in SIMD since no
message decoder is needed in the PEs
– MIMDs need a message decoder in each PE.
77
SIMD Shortcomings
(with some rebuttals)
• Claims are from our textbook [i.e., Quinn 2004].
– Similar statements are found in [Grama, et. al].
• Claim 1: Not all problems are data-parallel
– While true, most problems seem to have a
data parallel solution.
– In [Fox, et.al.], the observation was made in
their study of large parallel applications at
national labs, that most were data parallel by
nature, but often had points where significant
branching occurred.
78
SIMD Shortcomings
(with some rebuttals)
• Claim 2: Speed drops for conditionally executed
branches
– MIMDs processors can execute multiple branches
concurrently.
– For an if-then-else statement with execution times for
the “then” and “else” parts being roughly equal, about
½ of the SIMD processors are idle during its execution
• With additional branching, the average number of
inactive processors can become even higher.
• With SIMDs, only one of these branches can be
executed at a time.
• This reason justifies the study of multiple SIMDs (or
MSIMDs).
– On many applications, any branching is quite shallow.
79
SIMD Shortcomings
(with some rebuttals)
• Claim 2 (cont): Speed drops for
conditionally executed code
– In [Fox, et.al.], the observation was made that
for the real applications surveyed, the
MAXIMUM number of active branches at any
point in time was about 8.
– The cost of the extremely simple processors
used in a SIMD is extremely low
• Programmers used to worry about ‘full utilization of
memory’ but stopped this after memory cost
became insignificant overall.
80
SIMD Shortcomings
(with some rebuttals)
• Claim 3: Don’t adapt to multiple users well.
– This is true to some degree for all parallel computers.
– If usage of a parallel processor is dedicated to an
important problem, it is probably best not to risk
compromising its performance by ‘sharing’
– This reason also justifies the study of multiple SIMDs
(or MSIMD).
– SIMD architecture has not received the attention that
MIMD has received and can greatly benefit from
further research.
81
SIMD Shortcomings
(with some rebuttals)
• Claim 4: Do not scale down well to
“starter” systems that are affordable.
– This point is arguable and its ‘truth’ is likely to
vary rapidly over time
– ClearSpeed currently sells a very economical
SIMD board that plugs into a PC.
82
SIMD Shortcomings
(with some rebuttals)
Claim 5: Requires customized VLSI for processors
and expense of control units in PCs has dropped.
• Reliance on COTS (commodity, off-the-shelf parts)
has dropped the price of MIMDs
• Expense of PCs (with control units) has dropped
significantly
• However, reliance on COTS has fueled the success
of ‘low level parallelism’ provided by clusters and
restricted new innovative parallel architecture
research for well over a decade.
83
SIMD Shortcomings
(with some rebuttals)
Claim 5 (cont.)
• There is strong evidence that the period of
continual dramatic increases in speed of PCs
and clusters is ending.
• Continued rapid increases in parallel
performance in the future will be necessary in
order to solve important problems that are
beyond our current capabilities
• Additionally, with the appearance of the very
economical COTS SIMDs, this claim no longer
appears to be relevant.
84
Multiprocessors
• Multiprocessor: multiple-CPU computer
with a shared memory
• Same address on two different CPUs
refers to the same memory location
• Avoids three cited criticisms for SIMDs
– Can be built from commodity CPUs
– Naturally support multiple users
– Maintain efficiency in conditional code
85
Centralized Multiprocessor
86
Centralized Multiprocessor
• Straightforward extension of uniprocessor
• Add CPUs to bus
• All processors share same primary memory
• Memory access time same for all CPUs
– Uniform memory access (UMA)
multiprocessor
• Also called a symmetrical multiprocessor (SMP)
87
Private and Shared Data
• Private data: items used only by a single
processor
• Shared data: values used by multiple
processors
• In a centralized multiprocessor (i.e. SMP),
processors communicate via shared data
values
88
Problems Associated with Shared Data
The cache coherence problem
• Replicating data across multiple caches reduces
contention among processors for shared data
values.
• But how can we ensure that different processors
have the same value for the same address?
• The cache coherence problem arises when an
obsolete value is still stored in a processor’s
cache.
89
Write Invalidate Protocol
Most common solution to cache coherence
1. Each CPU’s cache controller monitors
(snoops) the bus & identifies which cache
blocks are requested by other CPUs.
2. A PE gains exclusive control of data item
before performing “write”.
3. Before “write” occurs, all other copies of data
item cached by other PEs are invalidated.
4. When any other CPU tries to read a memory
location from an invalidated cache block,
– a cache miss occurs
– it has to retrieve the updated data from memory
90
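The four-step protocol above can be sketched as a toy simulation. The `Cache` and `Bus` classes, and the write-through policy, are illustrative assumptions for the sketch, not a description of real hardware:

```python
# Toy write-invalidate sketch: each cache controller "snoops" bus traffic,
# and a write first invalidates every other cached copy of the block.

class Bus:
    def __init__(self, memory):
        self.memory = memory          # shared primary memory (addr -> value)
        self.caches = []

    def broadcast_invalidate(self, addr, origin):
        # Step 3: all other cached copies of the data item are invalidated.
        for c in self.caches:
            if c is not origin:
                c.lines.pop(addr, None)

class Cache:
    def __init__(self, bus):
        self.lines = {}               # this CPU's cached copies
        self.bus = bus
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:                   # step 4: cache miss ->
            self.lines[addr] = self.bus.memory[addr] # refetch from memory
        return self.lines[addr]

    def write(self, addr, value):
        # Step 2: gain exclusive control before performing the write.
        self.bus.broadcast_invalidate(addr, origin=self)
        self.lines[addr] = value
        self.bus.memory[addr] = value  # write-through, for simplicity

bus = Bus(memory={"X": 7})
a, b = Cache(bus), Cache(bus)
a.read("X"); b.read("X")   # both CPUs cache X == 7
a.write("X", 2)            # invalidates B's copy
print(b.read("X"))         # B misses and refetches -> 2
```

Running the last three lines mirrors the slide sequence: both CPUs read 7, CPU A writes 2, and CPU B's stale copy is invalidated so its next read returns 2.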
Cache-coherence Problem
Memory
X 7
Cache
Cache
CPU A
CPU B
91
Cache-coherence Problem
Memory
Read from
memory is not a
problem.
X 7
7
CPU A
CPU B
92
Cache-coherence Problem
Memory
X 7
7
CPU A
7
CPU B
93
Cache-coherence Problem
Write to main
memory is a
problem.
Memory
X 2
7
CPU A
2
CPU B
94
Write Invalidate Protocol
A cache controller
snoops the bus to see
which cache block is
being requested by
other processors.
X 7
7
CPU A
7
CPU B
95
Write Invalidate Protocol
X 7
Intent to write X
7
CPU A
7
CPU B
Before a write
can occur, all
copies of data at
that address are
declared invalid.
96
Write Invalidate Protocol
X 7
Intent to write X
7
CPU A
CPU B
97
Write Invalidate Protocol
X 2
2
CPU A
When another
processor tries to
read from this
location in cache,
it receives a cache
miss error and will
have to refresh
from main
memory.
CPU B
98
Synchronization Required for
Shared Data
• Mutual exclusion
– Definition: At most one process can be engaged in an
activity at any time.
– Example: Only one processor can write to the same
address in main memory at the same time.
– We say that the process must mutually exclude all
others while it performs this write.
• Barrier synchronization
– Definition: Guarantees that no process will proceed
beyond a designated point (called the barrier) until
every process reaches that point.
99
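Both primitives map directly onto standard threading constructs. A minimal Python sketch (illustrative, not from the text), with a lock providing mutual exclusion and a barrier ensuring no thread proceeds until all arrive:

```python
import threading

counter = 0
lock = threading.Lock()          # mutual exclusion: at most one writer at a time
barrier = threading.Barrier(4)   # barrier synchronization for 4 "processes"
results = []

def worker(pid):
    global counter
    for _ in range(1000):
        with lock:               # only one process updates the shared value
            counter += 1
    barrier.wait()               # no process proceeds past this point early
    results.append(counter)      # after the barrier, every process sees 4000

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)                   # 4000
```

Without the lock, concurrent increments could be lost; without the barrier, a fast thread could read `counter` before the others finished.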
Distributed Multiprocessor
• Distributes primary memory among processors
• Increases aggregate memory bandwidth and
lowers average memory access time
• Allows greater number of processors
• Also called non-uniform memory access (NUMA)
multiprocessor
– Local memory access time is fast
– Non-local memory access time can vary
– Distributed memories have one logical address space
100
Distributed Multiprocessors
101
Cache Coherence
• Some NUMA multiprocessors do not
support it in hardware
– Only instructions and private data are stored
in cache
– Policy creates a large memory access time
variance
• Implementation more difficult
– No shared memory bus to “snoop”
– Directory-based protocol needed
102
Directory-based Protocol
• Distributed directory contains information
about cacheable memory blocks
• One directory entry for each cache block
• Each entry has
– Sharing status
– Which processors have copies
103
Sharing Status
• Uncached -- (denoted by “U”)
– Block not in any processor’s cache
• Shared – (denoted by “S”)
– Cached by one or more processors
– Read only
• Exclusive – (denoted by “E”)
– Cached by exactly one processor
– Processor has written to block
– Copy in memory is obsolete
104
Directory-based Protocol - step1
Interconnection Network
Directory
Directory
Directory
Local Memory
Local Memory
Local Memory
Cache
Cache
Cache
CPU 0
CPU 1
CPU 2
105
X has value 7 – step 2
Interconnection Network
Bit Vector
X U000
Directories
X 7
Memories
Caches
CPU 0
CPU 1
CPU 2
106
CPU 0 Reads X – step 3
Interconnection Network
Read Miss
X U000
Directories
X 7
Memories
Caches
CPU 0
CPU 1
CPU 2
107
CPU 0 Reads X –step 4
Interconnection Network
X S100
Directories
X 7
Memories
Caches
CPU 0
CPU 1
CPU 2
108
CPU 0 Reads X –step 5
Interconnection Network
X S100
Directories
X 7
Memories
Caches
X 7
CPU 0
CPU 1
CPU 2
109
CPU 2 Reads X – step 6
Interconnection Network
X S100
Directories
Memories
Caches
Read Miss
X 7
X 7
CPU 0
CPU 1
CPU 2
110
CPU 2 Reads X – step 7
Interconnection Network
X S101
Directories
X 7
Memories
Caches
X 7
CPU 0
CPU 1
CPU 2
111
CPU 2 Reads X – step 8
Interconnection Network
X S101
Directories
X 7
Memories
Caches
X 7
CPU 0
X 7
CPU 1
CPU 2
112
CPU 0 Writes 6 to X – step 9
Interconnection Network
Write Miss
X S101
Directories
X 7
Memories
Caches
X 7
CPU 0
X 7
CPU 1
CPU 2
113
CPU 0 Writes 6 to X – step 10
Interconnection Network
X S101
Directories
Invalidate
Memories
Caches
X 7
CPU 0
X 7
X 7
CPU 1
CPU 2
114
CPU 0 Writes 6 to X – step 11
Interconnection Network
X E100
Directories
X 7
Memories
Caches
X 6
CPU 0
CPU 1
CPU 2
115
CPU 1 Reads X – step 12
Interconnection Network
Read Miss
X E100
Directories
X 7
Memories
Caches
X 6
CPU 0
CPU 1
CPU 2
116
CPU 1 Reads X – step 13
Interconnection Network
Switch to Shared
X E100
Directories
X 7
Memories
Caches
X 6
CPU 0
CPU 1
CPU 2
117
CPU 1 Reads X – step 14
Interconnection Network
X E100
Directories
X 6
Memories
Caches
X 6
CPU 0
CPU 1
CPU 2
118
CPU 1 Reads X – step 15
Interconnection Network
X S110
Directories
X 6
Memories
Caches
X 6
CPU 0
X 6
CPU 1
CPU 2
119
CPU 2 Writes 5 to X – step 16
Interconnection Network
X S110
Directories
Memories
Caches
Write Miss
X 6
CPU 0
X 6
X 6
CPU 1
CPU 2
120
CPU 2 Writes 5 to X - step 17
Interconnection Network
Invalidate
X S110
Directories
X 6
Memories
Caches
X 6
CPU 0
X 6
CPU 1
CPU 2
121
CPU 2 Writes 5 to X –step 18
Interconnection Network
X E001
Directories
X 6
Memories
X 5
Caches
CPU 0
CPU 1
CPU 2
122
CPU 0 Writes 4 to X – step 19
Interconnection Network
Write Miss
X E001
Directories
X 6
Memories
X 5
Caches
CPU 0
CPU 1
CPU 2
123
CPU 0 Writes 4 to X – step 20
Interconnection Network
X E100
Directories
Memories
Take Away
X 6
X 5
Caches
CPU 0
CPU 1
CPU 2
124
CPU 0 Writes 4 to X – step 21
Interconnection Network
X E010
Directories
X 5
Memories
X 5
Caches
CPU 0
CPU 1
CPU 2
125
CPU 0 Writes 4 to X – step 22
Interconnection Network
X E100
Directories
X 5
Memories
Caches
CPU 0
CPU 1
CPU 2
126
CPU 0 Writes 4 to X – step 23
Interconnection Network
X E100
Directories
Creates
cache block
storage for X
Memories
Caches
X 5
X 5
CPU 0
CPU 1
CPU 2
127
CPU 0 Writes 4 to X – step 24
Interconnection Network
X E100
Directories
X 5
Memories
Caches
X 4
CPU 0
CPU 1
CPU 2
128
CPU 0 Writes Back X Block – step 25
Interconnection Network
Data Write Back
X E100
Directories
X 4 (was 5)
Memories
Caches
X 4
CPU 0
CPU 1
CPU 2
129
CPU 0 flushes cache block X
step 26
Interconnection Network
X U000
Directories
X 4
Memories
Caches
CPU 0
CPU 1
CPU 2
130
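The whole walkthrough (steps 1–26) can be replayed with a toy directory model. The `Directory` class below is an illustrative assumption, but the state/bit-vector transitions asserted in it follow the slides (U000 → S100 → S101 → E100 → S110 → E001 → E100 → U000):

```python
# Toy directory-based protocol for one block "X" shared by three CPUs.
# Sharing status: U (uncached), S (shared), E (exclusive); sharers is the
# bit vector recording which CPUs hold a copy.

class Directory:
    def __init__(self, n, value):
        self.state, self.sharers = "U", [0] * n
        self.memory = value
        self.caches = [None] * n            # per-CPU cached copy of the block

    def read(self, cpu):
        if self.state == "E":               # owner's copy is newer than memory:
            owner = self.sharers.index(1)   # switch to shared, update memory
            self.memory = self.caches[owner]
        self.state = "S"
        self.sharers[cpu] = 1
        self.caches[cpu] = self.memory
        return self.caches[cpu]

    def write(self, cpu, value):
        for i, bit in enumerate(self.sharers):
            if bit and i != cpu:
                if self.state == "E":       # "take away" dirty data from owner
                    self.memory = self.caches[i]
                self.caches[i] = None       # invalidate other copies
        n = len(self.sharers)
        self.state, self.sharers = "E", [int(i == cpu) for i in range(n)]
        self.caches[cpu] = value

    def flush(self, cpu):                   # write back and drop the block
        self.memory = self.caches[cpu]
        self.caches[cpu] = None
        self.state, self.sharers = "U", [0] * len(self.sharers)

d = Directory(3, 7)                                              # X = 7, U000
d.read(0);     assert (d.state, d.sharers) == ("S", [1, 0, 0])   # steps 3-5
d.read(2);     assert (d.state, d.sharers) == ("S", [1, 0, 1])   # steps 6-8
d.write(0, 6); assert (d.state, d.sharers) == ("E", [1, 0, 0])   # steps 9-11
d.read(1);     assert d.memory == 6 and d.sharers == [1, 1, 0]   # steps 12-15
d.write(2, 5); assert (d.state, d.sharers) == ("E", [0, 0, 1])   # steps 16-18
d.write(0, 4); assert d.memory == 5 and d.sharers == [1, 0, 0]   # steps 19-24
d.flush(0);    assert d.memory == 4 and d.state == "U"           # steps 25-26
```

Note how a write to an exclusively held block first pulls the dirty value back to memory (the "take away" of step 20) before ownership changes.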
Characteristics of Multiprocessors
• Interprocessor communication is done in the
memory interface by read and write instructions
• Memory may be physically distributed and the
reads and writes from different processors may
take different time.
• Congestion and hotspots in the interconnection
network may occur.
• Memory latency (i.e., time to complete a read or
write) may be long and variable.
• Most messages through the bus or
interconnection network are the size of a single
memory word.
• Randomization of requests may be used to
reduce the probability of collisions.
131
Multicomputers
• Distributed memory multiple-CPU
computer
• Same address on different processors
refers to different physical memory
locations
• Processors interact through message
passing
132
Typically, Two Flavors of
Multicomputers
• Commercial multicomputers
– Custom switch network
– Low latency (the time it takes to send a message).
– High bandwidth (data path width) across processors
• Commodity clusters
– Mass produced computers, switches and other
equipment
– Use low cost components
– Message latency is higher
– Communications bandwidth is lower
133
Multicomputer Communication
• Processors are connected by an interconnection
network
• Each processor has a local memory and can
only access its own local memory
• Data is passed between processors using
messages, as required by the program
• Data movement across the network is also
asynchronous
– A common approach is to use MPI to handle
message passing
134
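A minimal sketch of this style of programming, with queues standing in for the interconnection network and threads standing in for processors (the `send`/`recv` names echo the MPI style but are assumptions of this sketch, not MPI itself):

```python
# Each "processor" has private local data (no shared address space) and
# interacts only by sending and receiving messages through its mailbox.
import threading, queue

NPROCS = 3
inbox = [queue.Queue() for _ in range(NPROCS)]   # one mailbox per processor
totals = [None] * NPROCS

def send(dest, msg): inbox[dest].put(msg)
def recv(me): return inbox[me].get()             # blocks until a message arrives

def proc(me):
    local = list(range(me * 4, me * 4 + 4))      # this processor's private data
    partial = sum(local)
    if me != 0:
        send(0, partial)                         # workers message partial sums
    else:
        for _ in range(NPROCS - 1):
            partial += recv(0)                   # proc 0 collects and combines
        totals[0] = partial

threads = [threading.Thread(target=proc, args=(i,)) for i in range(NPROCS)]
for t in threads: t.start()
for t in threads: t.join()
print(totals[0])                                 # sum(range(12)) == 66
```

The distribution of `local` across processors is exactly the "data distribution" factor the next slide highlights: a poor split changes how much must travel over the network.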
Multicomputer Communications (cont)
• Multicomputers can be scaled to larger
sizes much more easily than multiprocessors.
• The amount of data transmitted
between processors has a huge impact
on performance
– The distribution of the data among the
processors is a very important factor in
the performance efficiency.
135
Message-Passing Advantages
• No problem with simultaneous access
to data.
• Allows different PCs to operate on the
same data independently.
• Allows PCs on a network to be easily
upgraded when faster processors
become available.
136
Disadvantages of
Message-Passing
• Programmers must make explicit
message-passing calls in the code
• This is low-level programming and is error
prone.
• Data is not shared but copied, which
increases the total data size.
• Data Integrity
– Difficulty in maintaining correctness of
multiple copies of data item.
137
Some Interconnection Network
Terminology (1/2)
References: Wilkinson et al. & Grama et al.
Also, earlier slides on architecture & networks.
A link is the connection between two nodes.
• A switch that enables packets to be routed
through the node to other nodes without
disturbing the processor is assumed.
• The link between two nodes can be either
bidirectional or use two unidirectional links.
• Can assume either one wire that carries one bit
or parallel wires (one wire for each bit in word).
• The above choices do not have a major impact
on the concepts presented in this course.
138
Network Terminology (2/2)
• The bandwidth is the number of bits that can be
transmitted in unit time (i.e., bits per second).
• The network latency is the time required to
transfer a message through the network.
– The communication latency is the total time
required to send a message, including
software overhead and interface delay.
– The message latency or startup time is the
time required to send a zero-length message.
• Includes software & hardware overhead, such as
– Choosing a route
– packing and unpacking the message
139
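These definitions give a simple latency model: total message time ≈ startup time (the zero-length-message latency) plus message length over bandwidth. The numbers below are made up for illustration:

```python
# Total time to send a message of L bits over a link of bandwidth B bits/s,
# with a fixed startup time (software overhead, routing, packing/unpacking).

def message_time(length_bits, bandwidth_bps, startup_s):
    return startup_s + length_bits / bandwidth_bps

B = 1e9       # 1 Gbit/s bandwidth (assumed sample value)
t0 = 50e-6    # 50 microsecond startup time (assumed sample value)

print(message_time(0, B, t0))    # zero-length message: just the startup time
print(message_time(8e6, B, t0))  # 1 MB message: startup is negligible
```

The two prints show the two regimes: short messages are dominated by startup time, long messages by bandwidth.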
Circuit Switching Message Passing
• Technique establishes a path and allows the
entire message to transfer uninterrupted.
• Similar to telephone connection that is held until
the end of the call.
• Links used are not available to other messages
until the transfer is complete.
• Latency (message transfer time): If the length of
control packet sent to establish path is small wrt
(with respect to) the message length, the latency
is essentially
– the constant L/B, where L is message length
and B is bandwidth.
140
Store-and-forward Packet
Switching
• Message is divided into “packets” of information
• Each packet includes source and destination
addresses.
• Packets cannot exceed a fixed maximum size
(e.g., 1000 bytes).
• A packet is stored in a node in a buffer until it
can move to the next node.
• Different packets typically follow different routes
but are re-assembled at the destination, as the
packets arrive.
• Movements of packets is asynchronous.
141
Packet Switching (cont)
• At each node, the destination information is
examined and used to select which node to
forward the packet to.
• Routing algorithms (often probabilistic) are used
to avoid hot spots and to minimize traffic jams.
• Significant latency is created by storing each
packet in each node it reaches.
• Latency increases linearly with the length of the
route.
142
Virtual Cut-Through Packet
Switching
• Used to reduce the latency.
• Allows packet to pass through a node
without being stored, if the outgoing link is
available.
• If the complete path is available, a message
can move immediately from source to
destination.
143
Wormhole Routing
• Alternate to store-and-forward packet
routing
• A message is divided into small units
called flits (flow control units).
• Flits are 1-2 bytes in size.
• Can be transferred in parallel on links with
multiple wires.
• Only the head flit is transferred initially; it
advances when the next link becomes available.
144
Wormhole Routing (cont)
• As each flit moves forward, the next flit can
move forward.
• The entire path must be reserved for a message
as the flits pull each other along (like cars
of a train).
• Request/acknowledge bit messages are
required to coordinate these pull-along moves.
– See Wilkinson, et. al.
• Latency: If the head flit is very small
compared to the length of the message, then the
latency is essentially the constant L/B, with L the
message length and B the link bandwidth.
145
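The two latency models can be compared numerically. The formulas below are the standard textbook approximations (an assumption in this sketch, in the spirit of Wilkinson et al.), with made-up link parameters:

```python
# Store-and-forward: the whole packet is stored at every node on the route,
# so latency grows linearly with the hop count.
def store_and_forward(L, B, hops):
    return hops * (L / B)

# Wormhole / cut-through: only the small head flit pays the per-hop cost;
# the rest of the message streams behind it.
def wormhole(L, B, hops, flit):
    return hops * (flit / B) + L / B

L = 10_000 * 8     # 10 KB message (assumed)
B = 1e9            # 1 Gbit/s links (assumed)
flit = 2 * 8       # 2-byte flits, per the slide above
for hops in (1, 5, 10):
    print(hops, store_and_forward(L, B, hops), wormhole(L, B, hops, flit))
```

At 10 hops, store-and-forward pays the full L/B ten times over, while wormhole routing stays close to the single-hop L/B, matching the "essentially the constant L/B" claim above.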
Deadlock
• Routing algorithms needed to find a path
between the nodes.
• Adaptive routing algorithms choose different
paths, depending on traffic conditions.
• Livelock is a deadlock-type situation where a
packet continues to go around the network,
without ever reaching its destination.
• Deadlock: No packet can be forwarded because
they are blocked by other stored packets waiting
to be forwarded.
146
Asymmetric Multicomputers
• Has a front-end that interacts with users
and I/O devices.
– Processors in back end are used for
computation.
– Programming similar to SIMDs (i.e., processor
arrays)
– Common with early multicomputers
• Examples of asymmetrical multicomputers
given in textbook.
147
Asymmetrical MC Advantages
• Back-end processors dedicated to parallel
computations ⇒ easier to understand, model,
and tune performance
• Only a simple back-end operating system
needed ⇒ easy for a vendor to create
148
Asymmetrical MC Disadvantages
• Front-end computer is a
single point of failure
• Single front-end computer
limits scalability of system
• Primitive operating
system in back-end
processors makes
debugging difficult
• Every application requires
development of both
front-end and back-end
program
149
Symmetric Multicomputers
• Every computer executes the same operating
system and has identical functionality.
• Users may log into any computer to edit or
compile their programs.
• Any or all computers may be involved in the
execution of their program.
• During execution of programs, every PE
executes the same program.
• When only one PE should execute an operation,
an “if” statement is used to select the PE.
150
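The "if statement selects the PE" convention is the usual SPMD (single program, multiple data) pattern; a sketch simulating four PEs with a simple loop (the ids and function names are illustrative):

```python
# Every PE runs the same program; an "if" on the PE's id selects the one
# PE that should perform a singular operation (e.g., printing the result).

def program(my_id, nprocs, log):
    # ... every PE executes the same computation here ...
    if my_id == 0:                  # only PE 0 performs this operation
        log.append("PE 0 writes the result")

log = []
for pe in range(4):                 # stand-in for 4 PEs executing the program
    program(pe, 4, log)
print(log)                          # ['PE 0 writes the result']
```

All four "PEs" executed the identical program, yet the guarded operation ran exactly once.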
Symmetric Multicomputers
151
Symmetrical MC Advantages
• Alleviate performance
bottleneck caused by
single front-end computer
• Better support for
debugging
• Every processor executes
same program
152
Symmetrical MC Disadvantages
• More difficult to maintain
illusion of single “parallel
computer”
• No simple way to balance
program development
workload among
processors
• More difficult to achieve
high performance when
multiple processes run on
each processor
– Details on next slide
153
Symmetric MC Disadvantages (cont)
• (cont.) More difficult to achieve high
performance when multiple processes run
on each processor
– Processes on same processor compete for
same resources
• CPU Cycles
• Cache space
• Memory bandwidth
– Increased cache misses
• Cache is “PE oriented” instead of “process
oriented”
154
Best Model for Commodity Cluster
• Full-Fledged operating system (e.g.,
Linux) desirable
– Feature of symmetric multicomputer
• Desirable to increase cache hits
– Favors having only a single user process on each PE
– Favors most nodes being off-limits for program
development
• Need fast network
– Keep program development users off networks and
have them access front-end by another path.
– Reserve interconnection network to usage by parallel
processes
• Overall, a mixed model may be best for
commodity clusters
155
Ideal Commodity Cluster Features
• Co-located computers
• Computers dedicated to running a single
process at a time to lower cache misses
• Accessible only from the network
– No keyboards or displays
• Users access front-end by another route.
• Identical operating systems
• Identical local disk images
• Administered as an entity
156
ParPar Cluster, A Mixed Model
• Mixed model
• Incorporates both asymmetrical and
symmetrical designs.
157
Network of Workstations
• Dispersed computers
– Typically located on user’s desks
• First priority: response time for person at
the keyboard
• Parallel jobs wait in the background and run
when spare CPU cycles are available.
• Different operating systems
• Different local images
• Checkpointing and restarting important
• Typically connected by Ethernet
– Too slow for commodity cluster usage
158
A Commodity Cluster vs
Network of Workstations
• A commodity cluster contains components of
local area networks
– Commodity computers
– Switches
• A network of workstations is a dispersed
collection of computers
– Distributed heterogeneous computers
– Located on primary users’ desks
– Unused cycles available for parallel use
– Example: SETI project
159
Summary
• Commercial parallel computers appeared
in 1980s
• Multiple-CPU computers now dominate
• Small-scale: Centralized multiprocessors
• Large-scale: Distributed memory
architectures
– Multiprocessors
– Multicomputers
160
CPSC 367: Parallel Computing