Programming Multi-Core
Processors based Embedded
Systems
A Hands-On Experience on Cavium
Octeon based Platforms
Lecture 2 (Mapping Applications to Multi-core Arch)
Course Outline


Introduction
Multi-threading on multi-core processors






Developing parallel applications
Introduction to POSIX based multi-threading
Multi-threaded application examples
Applications for multi-core processors
Application layer computing on multi-core
Performance measurement and tuning
Cavium Univ Program ©
2010
2-87
KICS, UET
Agenda for Today


Mapping applications to multi-core
applications
Parallel programming using threads


POSIX multi-threading
Using multi-threading for parallel programming
Cavium Univ Program ©
2010
2-87
KICS, UET
Mapping Applications to
Multi-Core Architectures
Chapter 2
David E. Culler and Jaswinder Pal Singh,
Parallel Computer Architecture: A
Hardware/Software Approach, Morgan
Kaufmann, 1998
Parallelization

Assumption: Sequential algorithm is given


Pieces of the job:





Sometimes need very different algorithm, but beyond scope
Identify work that can be done in parallel
Partition work and perhaps data among processes
Manage data access, communication and synchronization
Note: work includes computation, data access and I/O
Main goal: Speedup (plus low prog. effort and resource needs)
Speedup (p) = Performance(p) / Performance(1)

For a fixed problem:
Speedup (p) = Time(1) / Time(p)
Cavium Univ Program ©
2010
2-87
KICS, UET
Steps in Creating a Parallel Program
Partitioning
D
e
c
o
m
p
o
s
i
t
i
o
n
Sequential
computation

A
s
s
i
g
n
m
e
n
t
Tasks
p0
p1
p2
p3
O
r
c
h
e
s
t
r
a
t
i
o
n
p0
p1
p2
p3
Parallel
pr ogram
Pr ocesses
M
a
p
p
i
n
g
P0
P1
P2
P3
Pr ocessors
4 steps: Decomposition, Assignment, Orchestration, Mapping


Done by programmer or system software (compiler, runtime, ...)
Issues are the same, so assume programmer does it all explicitly
Cavium Univ Program © 2010
2-87
KICS, UET
Some Important Concepts



Task:

Arbitrary piece of undecomposed work in parallel computation

Executed sequentially; concurrency is only across tasks

E.g. a particle/cell in Barnes-Hut, a ray or ray group in Raytrace

Fine-grained versus coarse-grained tasks
Process (thread):

Abstract entity that performs the tasks assigned to processes

Processes communicate and synchronize to perform their tasks
Processor:


Physical engine on which process executes
Processes virtualize machine to programmer

first write program in terms of processes, then map to processors
Cavium Univ Program ©
2010
2-87
KICS, UET
Decomposition

Break up computation into tasks to be divided
among processes




Tasks may become available dynamically
No. of available tasks may vary with time
i.e., identify concurrency and decide level at
which to exploit it
Goal: Enough tasks to keep processes busy,
but not too many

No. of tasks available at a time is upper bound on
achievable speedup
Cavium Univ Program ©
2010
2-87
KICS, UET
Limited Concurrency: Amdahl’s
Law



Most fundamental limitation on parallel speedup
If fraction s of seq execution is inherently serial, speedup
<= 1/s
Example: 2-phase calculation






Time for first phase = n2/p
Second phase serialized at global variable, so time = n2
2n2
2n2
Speedup <=
or at most 2
2
2 + p2
n
2
2n
Trick: divide second
into two
+ phase
n
p



sweep over n-by-n grid and do some independent computation
sweep again and add each value to global sum
accumulate into private sum during sweep
add per-process private sum into global sum
Parallel time is n2/p + n2/p + p, and speedup at best
Cavium Univ Program ©
2010
2-87
KICS, UET
Pictorial Depiction
1
(a)
work done concurrently
n2
n2
p
(b)
1
n2/p
n2
p
1
(c)
Time
n2/p n2/p p
Cavium Univ Program ©
2010
2-87
KICS, UET
Concurrency Profiles
1,400
1,200
C o n cu r r e n cy
1,000
800
600
400
200
733
702
662
633
589
564
526
504
483
444
415
380
343
313
286
247
219
150
0
Clock cycle number

Cannot usually divide into serial and parallel part


Area under curve is total work done, or time with 1 processor
Horizontal extent is lower bound
on time (infinite processors)

fk k

k=1

k=1 fk kp

Speedup is the ratio:

Amdahl’s law applies to any overhead, not just limited concurrency
Cavium Univ Program ©
2010
2-87
, base case:
1
s + 1-s
p
KICS, UET
Assignment

Specifying mechanism to divide work up among processes




Structured approaches usually work well




Code inspection (parallel loops) or understanding of application
Well-known heuristics
Static versus dynamic assignment
As programmers, we worry about partitioning first



E.g. which process computes forces on which stars, or which rays
Together with decomposition, also called partitioning
Balance workload, reduce communication and management cost
Usually independent of architecture or prog model
But cost and complexity of using primitives may affect decisions
As architects, we assume program does reasonable job of it
Cavium Univ Program ©
2010
2-87
KICS, UET
Orchestration

Includes:





Goals





Naming data
Structuring communication
Synchronization
Organizing data structures and scheduling tasks temporally
Reduce cost of communication and synch. as seen by processors
Reserve locality of data reference (incl. data structure organization)
Schedule tasks to satisfy dependences early
Reduce overhead of parallelism management
Closest to architecture (and programming model & language)


Choices depend a lot on comm. abstraction, efficiency of primitives
Architects should provide appropriate primitives efficiently
Cavium Univ Program ©
2010
2-87
KICS, UET
Mapping

After orchestration, already have parallel program

Two aspects of mapping:


Which processes will run on same processor, if necessary
Which process runs on which particular processor


One extreme: space-sharing



OS uses the performance techniques we will discuss later
Real world is between the two


Machine divided into subsets, only one app at a time in a subset
Processes can be pinned to processors, or left to OS
Another extreme: complete resource management control to OS


mapping to a network topology
User specifies desires in some aspects, system may ignore
Usually adopt the view: process <-> processor
Cavium Univ Program ©
2010
2-87
KICS, UET
Parallelizing Computation vs.
Data

Above view is centered around computation


Partitioning Data is often a natural view too



Computation is decomposed and assigned (partitioned)
Computation follows data: owner computes
Grid example; data mining; High Performance Fortran (HPF)
But not general enough

Distinction between comp. and data stronger in many
applications



Barnes-Hut, Raytrace (later)
Retain computation-centric view
Data access and communication is part of orchestration
Cavium Univ Program ©
2010
2-87
KICS, UET
High-level Goals
Table 2.1
Steps in the Parallelization Pr
ArchitectureDependent?
Step
ocess and Their Goals
Major Performance Goals
Decomposition
Mostly no
Expose enough concurr
ency but not too much
Assignment
Mostly no
Balance workload
Reduce communication volume
Or chestration
Ye s
Reduce noninher ent communication via data
locality
Reduce communication and synchr
onization cost
as seen by the pr ocessor
Reduce serialization at shar
ed r esour ces
Schedule tasks to satisfy dependences early
Mapping
Ye s
Put r elated pr ocesses on the same pr
necessary
Exploit locality in network topology
ocessor if

High performance (speedup over sequential program)

But low resource usage and development effort

Implications for algorithm designers and architects


Algorithm designers: high-perf., low resource needs
Architects: high-perf., low cost, reduced programming effort

e.g. gradually improving perf. with programming effort may be
preferable to sudden threshold after large programming effort
Cavium Univ Program ©
2010
2-87
KICS, UET
Parallelization of An Example
Program


Motivating problems all lead to large, complex
programs
Examine a simplified version of a piece of Ocean
simulation


Iterative equation solver
Illustrate parallel program in low-level parallel
language



C-like pseudocode with simple extensions for parallelism
Expose basic comm. and synch. primitives that must be
supported
State of most real parallel programming today
Cavium Univ Program ©
2010
2-87
KICS, UET
Grid Solver Example
Expr ession for updating each interior point:
A [i,j ] = 0.2  (A [i,j ] + A [i,j – 1] + A [i – 1 , j ] +
A [i,j + 1] + A [i + 1, j ])


Simplified version of solver in Ocean simulation
Gauss-Seidel (near-neighbor) sweeps to convergence





interior n-by-n points of (n+2)-by-(n+2) updated in each sweep
updates done in-place in grid, and diff. from prev. value computed
accumulate partial diffs into global diff at end of every sweep
check if error has converged (to within a tolerance parameter)
if so, exit solver; if not, do another sweep
Cavium Univ Program ©
2010
2-87
KICS, UET
1. i n t n ;
/* size o f m atrix : (n + 2 -b y -n + 2) elem en ts* /
2. f l o a t * * A , d i f f = 0 ;
3. m a i n ( )
4. b e g i n
5.
read(n) ;
6.
A  malloc (a 2-d array of size n + 2 by n + 2 doubles);
7.
initialize(A);
/* in itialize th e m atrix A so m eh o w * /
8.
Solve (A);
/* call th e ro u tin e to so lv e equ atio n * /
/* read in pu t p aram eter: m atrix size* /
9. e n d m a i n
10. p r o c e d u r e S o l v e ( A )
11.
float **A;
/* so lv e th e eq u ation sy stem * /
/* A is an (n + 2)-by -(n + 2 ) array* /
12. b e g i n
13.
int i, j, done = 0;
14.
float diff = 0, temp;
15.
while (!done) do
/* o u term o st loo p ov er sw eep s* /
16.
diff = 0;
/* in itialize m ax im u m d ifferen ce to 0 * /
17.
for i  1 to n do
/* sw eep o v er no n bo rd er p o in ts o f grid * /
18.
for j  1 to n do
19.
temp = A[i,j];
20.
A[i,j]  0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
A [ i , j + 1 ] + A [ i + 1 , j ] ) ; /* co m p u te av erag e* /
21.
22.
23.
/* sav e o ld v alu e o f elem en t* /
diff += abs(A[i,j] - temp);
end for
24.
end for
25.
if (diff/(n*n) < TOL) then done = 1;
26.
end while
27. e n d p r o c e d u r e
Cavium Univ Program ©
2010
2-87
KICS, UET
Decomposition

Simple way to identify concurrency is to look at loop iterations






dependence analysis; if not enough concurrency, then look further
Not much concurrency here at this level (all loops sequential)
Examine fundamental dependences, ignoring loop structure
Concurrency O(n) along anti-diagonals, serialization O(n) along
diag.
Retain loop structure, use pt-to-pt synch; Problem: too many
synch ops.
Restructure loops, use global synch; imbalance and too much
synch
Cavium Univ Program ©
2010
2-87
KICS, UET
Exploit Application Knowledge
Red point
Black point

Reorder grid traversal: red-black ordering




Different ordering of updates: may converge quicker or slower
Red sweep and black sweep are each fully parallel:
Global synch between them (conservative but convenient)
Ocean uses red-black; we use simpler, asynchronous one to
illustrate
no red-black, simply ignore dependences within sweep
sequential
Cavium Univ Program
© order same as original, parallel program nondeterministic

2010
2-87
KICS, UET
Decomposition Only
/*a sequential loop*/
15. while (!done) do
16.
diff = 0;
17.
for_all i  1 to n do
/*a parallel loop nest*/
for_all j  1 to n do
temp = A[i,j];
18.
19.
20.
A[i,j]  0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21.
A[i,j+1] + A[i+1,j]);
22.
diff += abs(A[i,j] - temp);
23.
end for_all
24.
end for_all
25.
if (diff/(n*n) < TOL) then done = 1;
26. end while



Decomposition into elements: degree of concurrency n2
To decompose into rows, make line 18 loop sequential;
degree n
for_all leaves assignment left to system

but implicit global synch. at end of for_all loop
Cavium Univ Program ©
2010
2-87
KICS, UET
P0
Assignment
P1
P2
P4

Static assignments (given decomposition into rows)



Dynamic assignment


i
p
get a row index, work on the row, get a new row, and so on
Static assignment into rows reduces concurrency (from n to p)


block assignment of rows: Row i is assigned to process
cyclic assignment of rows: process i is assigned rows i, i+p, and so on
block assign. reduces communication by keeping adjacent rows
together
Let’s dig into orchestration under three programming models
Cavium Univ Program ©
2010
2-87
KICS, UET
Data Parallel Solver
1.
2.
int n, nprocs;
float **A, diff = 0;
3.
4.
5.
main()
begin
read(n); read(nprocs);
6.
7.
8.
9.
;
/*read input grid size a nd nu m ber o f pro cesses*/
A  G_MALLOC (a 2-d array of size n+2 by n+2 doubles);
initialize(A);
/* init ia liz e the m atrix A so m e ho w */
Solve (A);
/*ca ll the ro utine to so lve equat io n*/
end main
10. procedure Solve(A)
11.
float **A;
12.
begin
13.
int i, j, done = 0;
14.
float mydiff = 0, temp;
14a.
DECOMP A[BLOCK,*, nprocs];
15.
while (!done) do
16.
mydiff = 0;
17.
18.
19.
/*grid siz e (n + 2-by-n + 2) and nu m ber o f pro cesses*/
for_all i  1 to n do
/*so lve the equat io n syste m */
/*A is a n (n + 2-by-n + 2) array*/
/*o uterm o st lo o p o ver sw eeps*/
/* init ia liz e m a xim u m d iffere nce to 0 * /
/*sw eep o ver no n-bo rder po ints o f grid*/
for_all j  1 to n do
temp = A[i,j];
/*sa ve o ld va lue o f e le m e nt*/
20.
A[i,j]  0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21.
A[i,j+1] + A[i+1,j]);
/*co m pute average*/
22.
mydiff += abs(A[i,j] - temp);
23.
end for_all
24.
end for_all
24a.
REDUCE (mydiff, diff, ADD);
25.
if (diff/(n*n) < TOL) then done = 1;
26.
end while
27. end procedure
Cavium Univ Program ©
2010
2-87
KICS, UET
Shared Address Space Solver
Single Program Multiple Data (SPMD)
Processes
Solve
Solve
Solve
Solve
Sweep
T est Convergence
Assignment controlled by values of variables used as
loop bounds
Cavium Univ Program ©

2010
2-87
KICS, UET
1.
2a.
int n, nprocs;
float **A, diff;
2b.
2c.
LOCKDEC(diff_lock);
BARDEC (bar1);
/*m a trix dim e n sio n a n d nu m b er o f pr oc e ssor s to b e u sed */
/*A is glo ba l ( sha r e d) a rra y r epr e se ntin g th e grid */
/*diff is globa l ( sha r e d) m a x im u m diffe re n c e in cu rr en t
sw e ep */
/*de cla ra tio n o f lo ck to e n forc e m u tu a l e x clu sion */
/*ba rrier d ec la ra tio n for g lo ba l sy nc hr on iza tio n b etw e e n
sw e ep s*/
3.
4.
5.
6.
7.
8a.
8.
8b.
9.
main()
begin
10.
11.
procedure Solve(A)
float **A;
12.
13.
14.
14a.
14b.
begin
int i,j, pid, done = 0;
float temp, mydiff = 0;
int mymin = 1 + (pid * n/nprocs);
int mymax = mymin + n/nprocs - 1
15.
16.
16a.
17.
18.
19.
20.
21.
22.
23.
24.
25a.
25b.
25c.
25d.
25e.
while (!done) do
/*ou ter loo p o ve r a ll dia g ona l e le m e nts*/
mydiff = diff = 0;
/*set glo ba l diff to 0 (ok a y for a ll to do it) */
BARRIER(bar1, nprocs);
/*en su r e a ll r ea c h h er e be for e a n yo n e m od ifie s diff*/
for i  mymin to mymax do
/*for ea c h o f m y ro w s*/
for j  1 to n do
/*for a ll n o nb or d er ele m en ts in tha t ro w */
temp = A[i,j];
A[i,j] = 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
A[i,j+1] + A[i+1,j]);
mydiff += abs(A[i,j] - temp);
endfor
endfor
LOCK(diff_lock);
/*u p da te globa l d iff if ne c e ssa ry */
diff += mydiff;
UNLOCK(diff_lock);
BARRIER(bar1, nprocs);
/*en su r e a ll r ea c h h er e be for e c h e ck in g if d o n e */
if (diff/(n*n) < TOL) then done = 1;
/*ch e ck co n v erg e n ce ; a ll g et
sa m e a n sw er */
BARRIER(bar1, nprocs);
endwhile
end procedure
25f.
26.
27.
read(n); read(nprocs);
/*r ea d inpu t m a trix siz e a n d nu m b er o f pr o c e sse s*/
A  G_MALLOC (a two-dimensional array of size n+2 by n+2 doubles);
initialize(A);
/*initia liz e A in a n u n sp e cifie d w a y */
CREATE (nprocs–1, Solve, A);
Solve(A);
/*m a in pro c e ss b ec o m e s a w ork er to o */
WAIT_FOR_END (nprocs–1);
/*w a it for a ll c hild pr o ce sse s cr ea te d to ter m ina te */
end main
Cavium Univ Program ©
2010
/*A is en tir e n+ 2 -b y -n+ 2 sha r ed a rra y,
a s in th e se qu e ntia l pro gra m */
/*priva te va r ia ble s*/
/*a ssu m e tha t n is e xa c tly d iv isible by */
/*npr o c s for sim plic ity he re */
2-87
KICS, UET
Notes on SAS Program



SPMD: not all
Code that does the update lockstep or even necessarily
same instructions
Assignment controlled by values of variables used as loop
bounds


Done condition evaluated redundantly by identical to
sequential program


unique pid per process, used to control assignment
each process has private mydiff variable
Most interesting special operations are for synchronization


accumulations into shared diff have to be mutually exclusive
why the need for all the barriers?
Cavium Univ Program ©
2010
2-87
KICS, UET
Need for Mutual Exclusion

Code each process executes:
load the value of diff into register r1
add the register r2 to register r1
store the value of register r1 into diff
A possible interleaving:
P1
r1  diff

P2
r1  diff
r1  r1+r2
r1  r1+r2
diff  r1
diff  r1

{P1
{P2
{P1
{P2
{P1
{P2
gets 0 in its r1}
also gets 0}
sets its r1 to 1}
sets its r1 to 1}
sets cell_cost to 1}
also sets cell_cost to 1}
Need the sets of operations to be atomic (mutually exclusive)
Cavium Univ Program ©
2010
2-87
KICS, UET
Global Event Synchronization

BARRIER(nprocs): wait here till nprocs processes get here



Built using lower level primitives
Global sum example: wait for all to accumulate before using sum
Often used to separate phases of computation
Process P_1
Process P_2
Process P_nprocs
set up eqn system
set up eqn system
set up eqn system
Barrier (name, nprocs)
Barrier (name, nprocs) Barrier (name, nprocs)
solve eqn system
solve eqn system
Barrier (name, nprocs)
Barrier (name, nprocs) Barrier (name, nprocs)
apply results
apply results
Barrier (name, nprocs)
Barrier (name, nprocs) Barrier (name, nprocs)


solve eqn system
apply results
Conservative form of preserving dependences, but easy to use
WAIT_FOR_END (nprocs-1)
Cavium Univ Program ©
2010
2-87
KICS, UET
Pt-to-pt Event Synch (Not Used
Here)

One process notifies another of an event so it
can proceed



Common example: producer-consumer (bounded
buffer)
Concurrent programming
on uniprocessor:
P1
P2
semaphores
A = 1;
b: flag
Shared
address
space
parallel
programs:
a: while (flag is 0) do nothing;
= 1;
semaphores,
or use ordinary variables as flags
print A;
•Busy-waiting or
Cavium Univ Program ©
2010
spinning
2-87
KICS, UET
Group Event Synchronization

Subset of processes involved



Can use flags or barriers (involving only the
subset)
Concept of producers and consumers
Major types:



Single-producer, multiple-consumer
Multiple-producer, single-consumer
Multiple-producer, single-consumer
Cavium Univ Program ©
2010
2-87
KICS, UET
Message Passing Grid Solver


Cannot declare A to be shared array any more
Need to compose it logically from per-process
private arrays




usually allocated in accordance with the assignment of
work
process assigned a set of rows allocates them locally
Transfers of entire rows between traversals
Structurally similar to SAS (e.g. SPMD), but
orchestration different



data structures and data access/naming
communication
synchronization
Cavium Univ Program ©
2010
2-87
KICS, UET
1. int pid, n, b;
2. float **myA;
3. main()
4. begin
5.
read(n);
read(nprocs);
8a.
CREATE (nprocs-1, Solve);
8b.
Solve();
8c.
WAIT_FOR_END (nprocs–1);
9. end main
10.
11.
13.
14.
/*process id, m atrix dim en sion and num ber of
processors to be used*/
/*read in put m atrix size an d n um ber of processes*/
/*m ain process becom es a w orker too*/
/*w ait for all child processes created to term inate*/
procedure Solve()
begin
int i,j, pid, n’ = n/nprocs, done = 0;
float temp, tempdiff, mydiff = 0;
/*private variables*/
myA  malloc(a 2-d array of size [n/nprocs + 2] by n+2);
/*m y assign ed row s of A */
7. initialize(myA);
/*initialize m y row s of A , in an un specified w a y*/
6.
15. while (!done) do
16.
mydiff = 0;
/*set local diff to 0*/
16a.
if (pid != 0) then SEND(&myA[1,0],n*sizeof(float),pid-1,ROW);
16b.
if (pid = nprocs-1) then
SEND(&myA[n’,0],n*sizeof(float),pid+1,ROW);
16c.
if (pid != 0) then RECEIVE(&myA[0,0],n*sizeof(float),pid-1,ROW);
16d.
if (pid != nprocs-1) then
RECEIVE(&myA[n’+1,0],n*sizeof(float), pid+1,ROW);
/*border row s of n eigh bors have n ow been copied
in to m yA [0,*] an d m yA [n ’+ 1,*]*/
17.
18.
19.
20.
21.
22.
23.
24.
25a.
25b.
25c.
25d.
25e.
25f.
25g.
25h.
25i
for i  1 to n’ do
/*for each of m y (n on gh ost) row s* /
for j  1 to n do
/*for all n on border elem en ts in that row */
temp = myA[i,j];
myA[i,j] = 0.2 * (myA[i,j] + myA[i,j-1] + myA[i-1,j] +
myA[i,j+1] + myA[i+1,j]);
mydiff += abs(myA[i,j] - temp);
endfor
endfor
/*com m un icate local diff values an d determ in e if
don e; can be replaced by reduction and broadcast*/
if (pid != 0) then
/*process 0 h olds global total diff*/
SEND(mydiff,sizeof(float),0,DIFF);
RECEIVE(done,sizeof(int),0,DONE);
else
/*pid 0 does th is*/
for i  1 to nprocs-1 do
/*for each oth er process*/
RECEIVE(tempdiff,sizeof(float),*,DIFF);
mydiff += tempdiff;
/*accum ulate into total*/
endfor
if (mydiff/(n*n) < TOL) then
done = 1;
25j.
for i  1 to nprocs-1 do
/*for each oth er process*/
25k.
SEND(done,sizeof(int),i,DONE);
25l.
endfor
25m.
endif
26. endwhile
27. end procedure
Cavium Univ Program ©
2010
2-87
KICS, UET
Notes on Message Passing
Program


Use of ghost rows
Receive does not transfer data, send does





Communication done at beginning of iteration, so no asynchrony
Communication in whole rows, not element at a time
Core similar, but indices/bounds in local rather than global space
Synchronization through sends and receives



25b.
25c.
25i.
25k.
25m.
unlike SAS which is usually receiver-initiated (load fetches data)
Update of global diff and event synch for done condition
Could implement locks and barriers with messages
Can use REDUCE and BROADCAST library calls to simplify code
/*com m un icate local diff values an d determ in e if don e, usin g reduction an d broadcast*/
REDUCE(0,mydiff,sizeof(float),ADD);
if (pid == 0) then
if (mydiff/(n*n) < TOL) then done = 1;
endif
BROADCAST(0,done,sizeof(int),DONE);
Cavium Univ Program ©
2010
2-87
KICS, UET
Send and Receive Alternatives


Can extend functionality: stride, scatter-gather, groups
Semantic flavors: based on when control is returned

Affect when data structures or buffers can be reused at either end
Send/Receive
Synchronous
Asynchronous
Blocking asynch.


Affect event synch (mutual excl. by fiat: only one process touches data)

Affect ease of programming and performance
Synchronous messages provide built-in synch. through match


Nonblocking asynch.
Separate event synchronization needed with asynch. messages
With synch. messages, our code is deadlocked. Fix?
Cavium Univ Program ©
2010
2-87
KICS, UET
Orchestration: Summary

Shared address space






Shared and private data explicitly separate
Communication implicit in access patterns
No correctness need for data distribution
Synchronization via atomic operations on shared data
Synchronization explicit and distinct from data communication
Message passing




Data distribution among local address spaces needed
No explicit shared structures (implicit in comm. patterns)
Communication is explicit
Synchronization implicit in communication (at least in synch. case)

mutual exclusion by fiat
Cavium Univ Program ©
2010
2-87
KICS, UET
Correctness in Grid Solver
Program


Decomposition and Assignment similar in SAS and
message-passing
Orchestration is different

Data structures, data access/naming, communication,
synchronization
SAS
Msg-Passing
Explicit global data structure?
Yes
No
Assignment indept of data layout?
Yes
No
Communication
Implicit
Explicit
Synchronization
Explicit
Implicit
Explicit replication of border rows?
No
Yes
Cavium Univ Program ©
2010
2-87
KICS, UET
Programming for
Performance
Chapter 3
David E. Culler and Jaswinder Pal Singh,
Parallel Computer Architecture: A
Hardware/Software Approach, Morgan
Kaufmann, 1998
Outline
Programming techniques for performance

Partitioning for performance

Relationship of communication, data locality and architecture

Programming for performance

For each issue:


Techniques to address it, and tradeoffs with previous issues

Application to grid solver

Some architectural implications
Components of execution time as seen by processor


What workload looks like to architecture, and relate to software issues
Implications for programming models
Cavium Univ Program ©
2010
2-87
KICS, UET
Partitioning for Performance

Balancing the workload and reducing wait time at synch points
Reducing inherent communication
Reducing extra work

Even these algorithmic issues trade off:






Minimize comm. => run on 1 processor => extreme load
imbalance
Maximize load balance => random assignment of tiny tasks => no
control over communication
Good partition may imply extra work to compute or manage it
Goal is to compromise

Fortunately, often not difficult in practice
Cavium Univ Program ©
2010
2-87
KICS, UET
Load Balance and Synch Wait
Time

Limit on speedup:


Speedupproblem(p) <
Sequential Work
Max Work on any Processor
Work includes data access and other costs
Not just equal work, but must be busy at same time

Four parts to load balance and reducing synch wait time:
1.
Identify enough concurrency
2.
Decide how to manage it
3.
Determine the granularity at which to exploit it
4.
Reduce serialization and cost of synchronization
Cavium Univ Program ©
2010
2-87
KICS, UET
Identifying Concurrency

Techniques seen for equation solver:

Loop structure, fundamental dependences, new algorithms

Data Parallelism versus Function Parallelism

Often see orthogonal levels of parallelism; e.g. VLSI routing
W1
W2
W3
(a)
Wi re W 2 expands to segments
S21
S22
S23
S24
S25
S26
(b)
Segment S 23 expands to r
outes
(c)
Cavium Univ Program ©
2010
2-87
KICS, UET
Identifying Concurrency (Cont’d)

Function parallelism:








entire large tasks (procedures) that can be done in parallel
on same or different data
e.g. different independent grid computations in Ocean
pipelining, as in video encoding/decoding, or polygon
rendering
degree usually modest and does not grow with input size
difficult to load balance
often used to reduce synch between data parallel phases
Most scalable programs data parallel (per this loose
definition)
function parallelism reduces synch between data parallel
phases
Cavium Univ Program ©

2010
2-87
KICS, UET
Deciding How to Manage
Concurrency

Static versus Dynamic techniques

Static:

Algorithmic assignment based on input; won’t change

Low runtime overhead

Computation must be predictable


Preferable when applicable (except in
multiprogrammed/heterogeneous environment)
Dynamic:



Adapt at runtime to balance load
Can increase communication and reduce locality
Can increase task management overheads
Cavium Univ Program ©
2010
2-87
KICS, UET
Dynamic Assignment

Profile-based (semi-static):



Profile work distribution at runtime, and repartition dynamically
Applicable in many computations, e.g. Barnes-Hut, some graphics
Dynamic Tasking:

Deal with unpredictability in program or environment (e.g.
Raytrace)





computation, communication, and memory system interactions
multiprogramming and heterogeneity
used by runtime systems and OS too
Pool of tasks; take and add tasks until done
E.g. “self-scheduling” of loop iterations (shared loop counter)
Cavium Univ Program ©
2010
2-87
KICS, UET
Dynamic Tasking with Task
Queues


Centralized versus distributed queues
Task stealing with distributed queues




Can compromise comm and locality, and increase synchronization
Whom to steal from, how many tasks to steal, ...
Termination detection
Maximum imbalance related to size of task
All pr ocesses
insert tasks
P0 inserts
QQ
0
P1 inserts
P2 inserts
P3 inserts
Q1
Q2
Q3
Others may
steal
All r emove tasks
(a) Centralized task queue
Cavium Univ Program ©
2010
P0 r emoves
P1 r emoves
P2 r emoves
(b) Distributed task queues (one per pr
2-87
P3 r emoves
ocess)
KICS, UET
Determining Task Granularity


Task granularity: amount of work associated with a
task
General rule:



Coarse-grained => often less load balance
Fine-grained => more overhead; often more comm.,
contention
Comm., contention actually affected by assignment,
not size

Overhead by size itself too, particularly with task queues
Cavium Univ Program ©
2010
2-87
KICS, UET
Reducing Serialization


Careful about assignment and orchestration (including
scheduling)
Event synchronization

Reduce use of conservative synchronization



e.g. point-to-point instead of barriers, or granularity of pt-to-pt
But fine-grained synch more difficult to program, more synch ops.
Mutual exclusion

Separate locks for separate data




Smaller, less frequent critical sections



e.g. locking records in a database: lock per process, record, or field
lock per task in task queue, not per queue
finer grain => less contention/serialization, more space, less reuse
don’t do reading/testing in critical section, only modification
e.g. searching for task to dequeue in task queue, building tree
Stagger critical sections in time
Cavium Univ Program ©
2010
2-87
KICS, UET
Reducing Inherent
Communication



Communication is expensive!
Measure: communication to computation ratio
Focus here on inherent communication





Determined by assignment of tasks to processes
Later see that actual communication can be greater
Assign tasks that access same data to same process
Solving communication and load balance NP-hard in
general case
But simple heuristic solutions work well in practice

Applications have structure!
Cavium Univ Program ©
2010
2-87
KICS, UET
Domain Decomposition


Works well for scientific, engineering, graphics, ... applications
Exploits local-biased nature of physical problems




Information requirements often short-range
Or long-range but fall off with distance
Simple example: nearest-neighbor grid computation
Perimeter to Area comm-to-comp ratio (area to volume in 3-d)

Depends on n,p: decreases with n, increases with p
n
n
p
P0
P1
P2
P3
P4
P5
P6
P7
n
p
n
P8
P9
P10
P11
P12
P13
P14
P15
Cavium Univ Program ©
2010
2-87
KICS, UET
Domain Decomposition (Cont’d)
Best domain decomposition depends on information requirements
Nearest neighbor example: block versus strip decomposition:
n
----p
n
P0
P1
P2
P3
P4
P5
P6
P7
n
----p

P8
P9
P10
P11
P12
P13
P14
P15
Comm to comp:


n
4*√p
2*p
for block,
for strip
n
n
Retain block from here on
Application dependent: strip may be better in other cases

E.g. particle flow in tunnel
Cavium Univ Program ©
2010
2-87
KICS, UET
Finding a Domain Decomposition

Static, by inspection


Static, but not by inspection



Input-dependent, require analyzing input structure
E.g sparse matrix computations, data mining (assigning
itemsets)
Semi-static (periodic repartitioning)


Must be predictable: grid example
Characteristics change but slowly; e.g. Barnes-Hut
Static or semi-static, with dynamic task stealing

Initial decomposition, but highly unpredictable; e.g ray
tracing
Cavium Univ Program ©
2010
2-87
KICS, UET
Other Techniques

Scatter Decomposition, e.g. initial partition in Raytrace
3
12
12
12
4
3
4
3
12
4
3
4
12
12
3
4
12
3
3
3
4
3
3
4
4
3
3
4
3
4
4
3
4
3
12
12
12
12
4
3
4
3
4
Scatter decomposition
Domain decomposition
Preserve locality in task stealing
• Steal large tasks for locality, steal from same
queues,
Cavium Univ
Program ©...
2010
4
12
12
12
4
12
12
12
2-87
KICS, UET
Implications of Comm-to-Comp
Ratio




Architects examine application needs to see where to spend
money
If denominator is execution time, ratio gives average BW needs
If operation count, gives extremes in impact of latency and
bandwidth

Latency: assume no latency hiding

Bandwidth: assume all latency hidden

Reality is somewhere in between
Actual impact of comm. depends on structure and cost as well
Sequential Work
Speedup <
Max (Work + Synch Wait Time + Comm Cost)

Need to keep communication balanced across processors as well
Cavium Univ Program ©
2010
2-87
KICS, UET
Reducing Extra Work

Common sources of extra work:

Computing a good partition



Using redundant computation to avoid communication
Task, data and process management overhead


applications, languages, runtime systems, OS
Imposing structure on communication


e.g. partitioning in Barnes-Hut or sparse matrix
coalescing messages, allowing effective naming
Architectural Implications:

Reduce need by making communication and orchestration efficient
Speedup <
Sequential Work
Max (Work + Synch Wait Time + Comm Cost + Extra Work)
Cavium Univ Program ©
2010
2-87
KICS, UET
Memory-oriented View of
Performance

Multiprocessor as Extended Memory Hierarchy


Levels in extended hierarchy:




as seen by a given processor
Registers, caches, local memory, remote memory (topology)
Glued together by communication architecture
Levels communicate at a certain granularity of data transfer
Need to exploit spatial and temporal locality in
hierarchy


Otherwise extra communication may also be caused
Especially important since communication is expensive
Cavium Univ Program ©
2010
2-87
KICS, UET
Uniprocessor Optimization


Performance depends heavily on memory hierarchy
Time spent by a program
Timeprog(1) = Busy(1) + Data Access(1)


Divide by cycles to get CPI equation
Data access time can be reduced by:


Optimizing machine: bigger caches, lower latency...
Optimizing program: temporal and spatial locality
Cavium Univ Program ©
2010
2-87
KICS, UET
Extended Hierarchy


Idealized view: local cache hierarchy + single main memory
But reality is more complex



Centralized Memory: caches of other processors
Distributed Memory: some local, some remote; + network
topology
Management of levels





caches managed by hardware
main memory depends on programming model

SAS: data movement between local and remote transparent

message passing: explicit
Levels closer to processor are lower latency and higher bandwidth
Improve performance through architecture or program locality
Tradeoff with parallelism; need good node performance and
parallelism
Cavium Univ Program ©
2010
2-87
KICS, UET
Artifactual Comm. in Extended
Hierarchy

Accesses not satisfied in local portion cause communication

Inherent communication, implicit or explicit, causes transfers


Artifactual communication








determined by program
determined by program implementation and arch. interactions
poor allocation of data across distributed memories
unnecessary data in a transfer
unnecessary transfers due to system granularities
redundant communication of data
finite replication capacity (in cache or main memory)
Inherent communication assumes unlimited capacity, small
transfers, perfect knowledge of what is needed.
More on artifactual later; first consider replication-induced further
Cavium Univ Program ©
2010
2-87
KICS, UET
Communication and Replication

Comm induced by finite capacity is most fundamental artifact



View as three level hierarchy for simplicity


Like cache size and miss rate or memory traffic in uniprocessors
Extended memory hierarchy view useful for this relationship
Local cache, local memory, remote memory (ignore network
topology)
Classify “misses” in “cache” at any level as for uniprocessors





compulsory or cold misses (no size effect)
capacity misses (yes)
conflict or collision misses (yes)
communication or coherence misses (no)
Each may be helped/hurt by large transfer granularity (spatial
locality)
Cavium Univ Program ©
2010
2-87
KICS, UET
Orchestration for Performance

Reducing amount of communication:


Inherent: change logical data sharing patterns in algorithm
Artifactual: exploit spatial, temporal locality in extended
hierarchy

Techniques often similar to those on uniprocessors

Structuring communication to reduce cost

Let’s examine techniques for both...
Cavium Univ Program ©
2010
2-87
KICS, UET
Reducing Artifactual
Communication

Message passing model



Communication and replication are both explicit
Even artifactual communication is in explicit
messages
Shared address space model


More interesting from an architectural perspective
Occurs transparently due to interactions of
program and system


sizes and granularities in extended memory hierarchy
Use shared address space to illustrate issues
Cavium Univ Program ©
2010
2-87
KICS, UET
Exploiting Temporal Locality

Structure algorithm so working sets map well to hierarchy



Multiple data structures in same phase


often techniques to reduce inherent communication do well here
schedule tasks for data reuse once assigned
e.g. database records: local versus remote
Solver example: blocking
(a) Unblocked access pattern in a sweep
(b) Blocked access pattern with B = 4
•More useful when
O(nk+1) computation on O(nk) data
–many linear algebra computations (factorization, matrix multiply)
Cavium Univ Program ©
2010
2-87
KICS, UET
Exploiting Spatial Locality

Besides capacity, granularities are important:





Major spatial-related causes of artifactual communication:

Conflict misses

Data distribution/layout (allocation granularity)

Fragmentation (communication granularity)

False sharing of data (coherence granularity)
All depend on how spatial access patterns interact with data structures


Granularity of allocation
Granularity of communication or data transfer
Granularity of coherence
Fix problems by modifying data structures, or layout/alignment
Examine later in context of architectures

one simple example here: data distribution in SAS solver
Cavium Univ Program ©
2010
2-87
KICS, UET
Spatial Locality Example


Repeated sweeps over 2-d grid, each time adding 1 to
elements
Natural 2-d versus higher-dimensional array representation
Contiguity in memory layout
P0
P4
P1
P5
P2
P3
P6
P0
P4
P7
P5
P2
P3
P6
P7
P8
P8
Page straddles
partition boundaries:
difficult to distribute
memory well
P1
Page does
not straddle
partition
boundary
Cache block
straddles partition
boundary
Two-dimensional
array
Cavium Univ(a)Program
©
2010
Cache block is
within a partition
(b) Four-dimensional array
2-87
KICS, UET
Tradeoffs with Inherent
Communication

Partitioning grid solver: blocks versus rows


Blocks still have a spatial locality problem on remote data
Rowwise can perform better despite worse inherent c-to-c
Good spacial locality on
ratio
nonlocal accesses at
row-oriented boudary
Poor spacial locality on
nonlocal accesses at
column-oriented
boundary
Result depends on n and p
Cavium Univ Program ©
2010
2-87
KICS, UET
Example Performance Impact
Equation solver on SGI Origin2000
50
30

25
Rows

4D

2D
4D-rr
40
35

15



Speedup
20
Speedup
4D
45

10





5
7
9 11 13 15 17 19 21 23 25 27 29 31
2D

2D-rr















1
Number of processors
Cavium Univ Program ©
2010

20






0
3
Rows-rr



5
0
1

25
10





Rows
15

5
30





3 5
7
9 11 13 15 17 19 21 23 25 27 29 31
Number of processors
2-87
KICS, UET
Structuring Communication


Given amount of comm (inherent or artifactual), goal is to reduce cost
Cost of communication as seen by process:
C=f*(o+l+











nc/m + t - overlap)
c
B
f = frequency of messages
o = overhead per message (at both ends)
l = network delay per message
nc = total data sent
m = number of messages
B = bandwidth along path (determined by network, NI, assist)
tc = cost induced by contention per message
overlap = amount of latency hidden by overlap with comp. or comm.
Portion in parentheses is cost of a message (as seen by processor)
That portion, ignoring overlap, is latency of a message
Goal: reduce terms in latency and increase overlap
Cavium Univ Program ©
2010
2-87
KICS, UET
Reducing Overhead


Can reduce no. of messages m or overhead per
message o
o is usually determined by hardware or system
software



Program should try to reduce m by coalescing messages
More control when communication is explicit
Coalescing data into larger messages:


Easy for regular, coarse-grained communication
Can be difficult for irregular, naturally fine-grained
communication

may require changes to algorithm and extra work

coalescing data and determining what and to whom to send
Cavium Univ Program ©
2010
2-87
KICS, UET
Reducing Contention

All resources have nonzero occupancy



Effects of contention:




Memory, communication controller, network link, etc.
Can only handle so many transactions per unit time
Increased end-to-end cost for messages
Reduced available bandwidth for individual messages
Causes imbalances across processors
Particularly insidious performance problem


Easy to ignore when programming
Slow down messages that don’t even need that resource


by causing other dependent resources to also congest
Effect can be devastating: Don’t flood a resource!
Cavium Univ Program ©
2010
2-87
KICS, UET
Overlapping Communication

Cannot afford to stall for high latencies




even on uniprocessors!
Overlap with computation or communication to hide
latency
Requires extra concurrency (slackness), higher
bandwidth
Techniques:




Prefetching
Block data transfer
Proceeding past communication
Multithreading
Cavium Univ Program ©
2010
2-87
KICS, UET
Summary of Tradeoffs

Different goals often have conflicting demands

Load Balance



Communication



usually coarse grain tasks
decompose to obtain locality: not random/dynamic
Extra Work



fine-grain tasks
random or dynamic assignment
coarse grain tasks
simple assignment
Communication Cost:


big transfers: amortize overhead and latency
small transfers: reduce contention
Cavium Univ Program ©
2010
2-87
KICS, UET
Relationship between
Perspectives
Parallelization step(s)
Pr ocessor time component
Performance issue
Decomposition/
assignment/
or chestration
Load imbalance and
synchr onization
Synch wait
Decomposition/
assignment
Extra work
Busy-overhead
Decomposition/
assignment
Inher ent
communication
volume
Data-r emote
Or chestration
Artifactual
communication
and data locality
Data-local
Or chestration/
mapping
Communication
structur e
Cavium Univ Program ©
2010
2-87
KICS, UET
Summary
Busy(1) + Data(1)
Busyuseful(p)+Datalocal(p)+Synch(p)+Dateremote(p)+Busyoverhead(p)




Goal is to reduce denominator components
Both programmer and system have role to play
Architecture cannot do much about load
imbalance or too much communication
But it can:
reduce incentive for creating ill-behaved programs
(efficient naming, communication and synchronization)
 reduce artifactual communication
 provide efficient naming for flexible assignment
 allow effective overlapping of communication
Cavium Univ Program ©

2010
2-87
KICS, UET
Multi-Threading
Parallel Programming on Shared Memory
Multiprocessors Using PThread
Chapter 2
Shameem Akhtar and Jason Roberts, MultiCore Programming, Intel Press, 2006
Outline of Multi-Threading Topics

Threads




Terminology
OS level view
Hardware level threads
Threading as a parallel programming model


Types of thread level parallel programs
Implementation issues
Cavium Univ Program ©
2010
2-87
KICS, UET
Threads

Definition



Every program has at least one thread





A discrete sequence of related instructions
Executed independently of other such sequences
Initializes
Executes instructions
May create other threads
Each thread maintains its current state
OS maps a thread to hardware resources
Cavium Univ Program ©
2010
2-87
KICS, UET
System View of Threads

Thread computational model layers:



User level threads
Kernel level threads
Hardware threads
Cavium Univ Program ©
2010
2-87
KICS, UET
Flow of Threads in an Execution
Environment


Defining and preparing stage
Operating stage


Created and managed by the OS
Execution stage
Cavium Univ Program ©
2010
2-87
KICS, UET
Threads Inside the OS
Cavium Univ Program ©
2010
2-87
KICS, UET
Processors, Processes, and
Threads
A processor runs threads from one or more processes, each of
which contains one or more threads
Cavium Univ Program ©
2010
2-87
KICS, UET
Mapping Models of Threads to
Processors: 1:1 Mapping
Cavium Univ Program ©
2010
2-87
KICS, UET
Mapping Models of Threads to
Processors: M:1 Mapping
Cavium Univ Program ©
2010
2-87
KICS, UET
Mapping Models of Threads to
Processors: M:N Mapping
Cavium Univ Program ©
2010
2-87
KICS, UET
Threads Inside the Hardware
Cavium Univ Program ©
2010
2-87
KICS, UET
Thread Creation

Multiple threads inside a process




Share same address space, FDs, etc.
Operate independently
Need their own stack space
Who handles thread creation details


Not the programmer
Typically handled at system level



OS support for threads
Threading libraries
Same is true for thread management
Cavium Univ Program ©
2010
2-87
KICS, UET
Stack Layout for a Multi-Threaded
Process
Cavium Univ Program ©
2010
2-87
KICS, UET
Thread State Diagram
Cavium Univ Program ©
2010
2-87
KICS, UET
Thread Implementation

Often implemented as a thread package



Operations to create and destroy threads
Synchronization mechanisms
Approaches to implement a thread package:


Implement as a thread library to execute entirely
in user mode
Have the kernel be aware of threads and schedule
them
Cavium Univ Program ©
2010
2-87
KICS, UET
Thread Implementation (2)

Characteristics of a user level thread library


Cheap to create and destroy threads
Switching thread context can be done in just a
few instructions




Need to save and restore CPU registers only
No need to change memory maps, flush TLB, CPU
accounting, etc.
Drawback: a blocking system call will block all
threads in a process
Solution to blocking: implement thread in OS
kernel
Cavium Univ Program ©
2010
2-87
KICS, UET
Kernel Implementations of
Threads

High price to solve blocking problem

Every thread operation will require a system call




Thread creation
Thread deletion
Thread synchronization
Thread switching will now become as expensive as
process context switching
Cavium Univ Program ©
2010
2-87
KICS, UET
Kernel Implementations of Threads
(2)

Lightweight processes (LWP)






A hybrid form of user and kernel level threads
An LWP runs in the context of a (heavy-weight) process
There can be several LWPs  each with its own scheduler
and stack
System also offers a user level thread package for usual
operations (creation, deletion, and synchronization)
Assignment of a user level thread to LWP is hidden from
programmer
LWP handles the scheduling for multiple threads
Cavium Univ Program ©
2010
2-87
KICS, UET
LWP Implementation

Thread table is shared among LWPs



Protected through mutexes  no kernel intervention for LWP synch.
When an LWP finds a runnable thread  switches context to that
thread  done entirely in user space
When a thread makes a blocking system call:


OS might block one LWP
May switch to another LWP  will allow other threads to continue
Cavium Univ Program ©
2010
2-87
KICS, UET
Parallel Programming with
Threads
Overview of POSIX threads, data races
and types of synchronization
Shared Memory Programming


Several Thread Libraries
PTHREADS is the POSIX Standard




OpenMP is newer standard



Solaris threads are very similar
Relatively low level
Portable but possibly slow
Support for scientific programming on shared
memory
http://www.openMP.org
Multiple other efforts by specific vendors
Cavium Univ Program ©
2010
2-87
KICS, UET
Overview of POSIX Threads

POSIX: Portable Operating System Interface for UNIX


PThreads: The POSIX threading interface



Interface to Operating System utilities
System calls to create and synchronize threads
Should be relatively uniform across UNIX-like OS platforms
PThreads contain support for



Creating parallelism
Synchronizing
No explicit support for communication, because shared
memory is implicit; a pointer to shared data is passed to a
thread
Cavium Univ Program ©
2010
2-87
KICS, UET
POSIX Thread Creation

Signature:
int pthread_create(pthread_t *,
const pthread_attr_t *,
void * (*)(void *),
void *);

Example call:
pthread_create(&thread_id;
&thread_attribute
&thread_fun; &fun_arg);
Cavium Univ Program ©
2010
2-87
KICS, UET
POSIX Thread Creation (2)


thread_id is the thread id or handle (used to halt,
etc.)
thread_attribute various attributes




standard default values obtained by passing a NULL pointer
thread_fun the function to be run (takes and returns
void*)
fun_arg an argument can be passed to thread_fun
when it starts
errorcode will be set nonzero if the create operation
fails
Cavium Univ Program ©
2010
2-87
KICS, UET
Simple Threading Example
Compile using gcc –lpthread
void* SayHello(void *foo) {
printf( "Hello, world!\n" );
return NULL;
}
int main() {
pthread_t threads[16];
int tn;
for(tn=0; tn<16; tn++) {
pthread_create(&threads[tn], NULL, SayHello, NULL);
}
for(tn=0; tn<16 ; tn++) {
pthread_join(threads[tn], NULL);
}
return 0;
}
Cavium Univ Program ©
2010
2-87
KICS, UET
Loop Level Parallelism

Many scientific application have parallelism in loops


With threads:
… my_stuff [n][n];
Also need i & j
for (int i = 0; i < n; i++)
for (int j = 0; j < n; j++)
… pthread_create (update_cell, …,
my_stuff[i][j]);
But overhead of thread creation is nontrivial


update_cell should have a significant amount of work
1/pth if possible
Cavium Univ Program ©
2010
2-87
KICS, UET
Shared Data and Threads



Variables declared outside of main are shared
Object allocated on the heap may be shared
(if pointer is passed)
Variables on the stack are private: passing
pointer to these around to other threads can
cause problems
Cavium Univ Program ©
2010
2-87
KICS, UET
Shared Data and Threads (2)

Often done by creating a large “thread data”
struct


Passed into all threads as argument
Simple example:
char *message = "Hello World!\n";
pthread_create( &thread1,
NULL,
(void*)&print_fun,
(void*) message);
Cavium Univ Program ©
2010
2-87
KICS, UET
Setting Attribute Values

Once an initialized attribute object exists,
changes can be made. For example:

To change the stack size for a thread to 8192
(before calling pthread_create), do this:


pthread_attr_setstacksize(&my_attributes, (size_t)8192);
To get the stack size, do this:

size_t my_stack_size;
pthread_attr_getstacksize(&my_attributes,
&my_stack_size);
Slide Source: Theewara Vorakosit
Cavium Univ Program ©
2010
2-87
KICS, UET
Other Attributes
Other attributes:
 Detached state – set if no other thread will use
pthread_join to wait for this thread (improves
efficiency)
 Scheduling parameter(s) – in particular, thread
priority
 Scheduling policy – FIFO or Round Robin
 Contention scope – with what threads does this
thread compete for a CPU
 Stack address – explicitly dictate where the stack is
located
 Lazy stack allocation – allocate on demand (lazy) or
all at once, “up front”
Cavium Univ Program ©
2010
2-87
KICS, UET
Data Race Example
static int s = 0;
Thread 1
Thread 2
for i = 0, n/2-1
s = s + f(A[i])


for i = n/2, n-1
s = s + f(A[i])
Problem is a race condition on variable s in
the program
A race condition or data race occurs when:


two processors (or two threads) access the same
variable, and at least one does a write.
The accesses are concurrent (not synchronized) so
they could happen simultaneously
Cavium Univ Program ©
2010
2-87
KICS, UET
Basic Types of Synchronization:
Barrier

Barrier—global synchronization

Especially common when running multiple copies
of the same function in parallel


SPMD “Single Program Multiple Data”
simple use of barriers -- all threads hit the same
one
work_on_my_subgrid();
barrier;
read_neighboring_values();
barrier;
Cavium Univ Program ©
2010
2-87
KICS, UET
Barrier (2)


More complicated—barriers on branches (or
loops)
if (tid % 2 == 0) {
work1();
barrier
} else { barrier }
Barriers are not provided in all thread libraries
Cavium Univ Program ©
2010
2-87
KICS, UET
Creating and Initializing a
Barrier

To (dynamically) initialize a barrier, use code
similar to this (which sets the number of
threads to 3):
pthread_barrier_t b;
pthread_barrier_init(&b,NULL,3);

The second argument specifies an object
attribute; using NULL yields the default
attributes.
Cavium Univ Program ©
2010
2-87
KICS, UET
Creating and Initializing a
Barrier

To wait at a barrier, a process executes:
pthread_barrier_wait(&b);

This barrier could have been statically
initialized by assigning an initial value created
using the macro
PTHREAD_BARRIER_INITIALIZER(3)
Cavium Univ Program ©
2010
2-87
KICS, UET
Basic Types of Synchronization:
Mutexes

Mutexes—mutual exclusion aka locks


threads are working mostly independently
need to access common data structure
lock *l = alloc_and_init();
acquire(l);
access data
release(l);
Cavium Univ Program ©
2010
2-87
/* shared */
KICS, UET
Mutexes (2)

Java and other languages have lexically
scoped synchronization



similar to cobegin/coend vs. fork and join tradeoff
Semaphores give guarantees on “fairness” in
getting the lock, but the same idea of mutual
exclusion
Locks only affect processors using them:

pair-wise synchronization
Cavium Univ Program ©
2010
2-87
KICS, UET
Mutexes in POSIX Threads
To create a mutex:
#include <pthread.h>
pthread_mutex_t amutex =
PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_init(&amutex, NULL);
 To use it:
int pthread_mutex_lock(amutex);
int pthread_mutex_unlock(amutex);

Cavium Univ Program ©
2010
2-87
KICS, UET
Mutexes in POSIX Threads (2)
To deallocate a mutex
int pthread_mutex_destroy(pthread_mutex_t
*mutex);
 Multiple mutexes may be held, but can lead
to deadlock:
thread1
thread2
lock(a)
lock(b)
lock(b)
lock(a)

Cavium Univ Program ©
2010
2-87
KICS, UET
Summary of Programming with
Threads

POSIX Threads are based on OS features




Pitfalls



Can be used from multiple languages
Familiar language for most of program
Ability to shared data is convenient
Intermittent data race bugs are very nasty to find
Deadlocks are usually easier, but can also be
intermittent
OpenMP is commonly used today as an
alternative
Cavium Univ Program ©
2010
2-87
KICS, UET
Multi-Threaded
Distributed Application
Examples
Distributed Operating Systems
By Andrew S. Tanenbaum
Multithreaded Clients

Distribution transparency



Needed when a DS operates in a wide-area network
environment
Need some mechanism to hide communication latency
Multithreading on client side is useful




One connection per thread
If one thread is blocked, other can do useful work
More responsive to the user
Example: a web browser


One thread connected to a server can bring an HTML
document
Another thread connected to the same server can bring
images while the first displays the text, scroll bars, etc.
Cavium Univ Program ©
2010
2-87
KICS, UET
Multithreaded Servers (1)

A multithreaded server organized in a
dispatcher/worker model.
Cavium Univ Program ©
2010
2-87
KICS, UET
Multithreaded Servers (2)

Three ways to construct a server.
Model
Characteristics
Threads
Parallelism, blocking system
calls
No parallelism, blocking system
Single-threaded process
calls
Finite-state machine
Cavium Univ Program ©
2010
Parallelism, nonblocking system
calls
2-87
KICS, UET
Clients

Anatomy of a client process:

User interface





A major task for most clients is to interact with human users
Provide a means to interact with a remote server
An important class: Graphical User Interfaces (GUIs)
Client side software  distribution transparency
Example: X Windows system

Used to control bit-mapped devices



Monitor, keyboard, keyboard, and a pointing device
X kernel (X Server) contains hardware-specific details  device drivers
X uses an event-driven approach



Captures events from devices
Provides an interface in the form of Xlib for GUI/graphics applications
Two types of applications: normal and window manager
Cavium Univ Program ©
2010
2-87
KICS, UET
The X-Window System

The basic organization of the X Window
System
Cavium Univ Program ©
2010
2-87
KICS, UET
User Interface: Compound
Documents

Function of a user interface is more than interacting with users!



May allow multiple applications to share a single graphical window
Use that window to exchange data through user actions
Typical examples:

Drag and drop



In-place editing



Drag an icon representing a file on trash can icon
Application associated with trash can will be activated to delete file
Image within a text document in a word processor
Pointing on the image can activate a drawing tool
Compound documents notion of user interface



A collection of different documents (text, images, spreadsheets)
Seamlessly integrated through user interface
Different applications operate on different parts of the document
Cavium Univ Program ©
2010
2-87
KICS, UET
Client-Side Software for
Distribution Transparency

A possible approach to transparent replication of a remote object
using a client-side solution
Proxy replicates requests to all replicated servers

Forms a single response for the client application  replication
transparency

Failure
transparency
is also possible through client middleware
Cavium
Univ Program
©

2010
2-87
KICS, UET
Servers

Organization of a server process:


Design issues of a server
Object servers



Alternatives for invoking objects
Object adapter
General design of a server:

Iterative server



Handles all requests itself
If necessary, returns a response to the requesting user
Concurrent server


Does not handle request itself
Passes it to a separate thread or process and waits for the next
request
Cavium Univ Program ©
2010
2-87
KICS, UET
Servers: General Design Issues



Client-to-server binding using a
daemon as in DCE
Client-to-server binding using a
superserver as in UNIX
Other distinctions:


Stateless server
Stateful server
Cavium Univ Program ©
2010
2-87
KICS, UET
Key Takeaways of this Session

A wealth of knowledge exists about
developing parallel applications



On legacy parallel architectures
For high performance computing (HPC)
applications
Techniques are applicable to multi-core



Similar decomposition, assignment, orchestration,
and mapping
Shared address space programming
Wider range of applications  topic for next
session
Cavium Univ Program ©
2010
2-87
KICS, UET
Descargar

COE 590 Special Topics: Parallel Architectures