Orchestrating the Execution of Stream
Programs on Multicore Platforms
Manjunath Kudlur, Scott Mahlke
Advanced Computer Architecture Lab.
University of Michigan
Cores are the New Gates
(Shekhar Borkar, Intel)
[Figure: number of cores per chip vs. year, 1975–2010 (courtesy Gordon ’06). Single-core processors (4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Athlon, Itanium, Itanium 2) give way to homogeneous and heterogeneous multicore designs (Power4, PA8800, BCM 1480, Xbox 360, Opteron, CoreDuo, Core2Duo, Core2Quad, Xeon, Power6, Niagara, Opteron 4P, Cell, RAW, RAZA XLR, Cavium, NVIDIA G80, AMD Fusion, Larrabee, CISCO CSR-1, PicoChip, AMBRIC), heading toward hundreds of cores per chip. Programming models track the same shift, from C/C++/Java toward stream languages and frameworks (X10, Fortress, Accelerator, Ct, CUDA, CTM, Peakstream, Rapidmind, RStream).]
Stream Programming
• Programming style
  – Embedded domain
    • Audio/video (H.264), wireless (WCDMA)
  – Mainstream
    • Continuous query processing (IBM SystemS), Search (Google Sawzall)
• Stream
  – Collection of data records
• Kernels/Filters
  – Functions applied to streams
  – Input/Output are streams
  – Coarse grain dataflow
  – Amenable to aggressive compiler optimizations [ASPLOS ’02, ’06, PLDI ’03]
Compiling Stream Programs
[Figure: a stream program on one side, a multicore system on the other — Core 1 through Core 4, each with its own local memory — and a question mark over how to map one onto the other.]
• Heavy lifting
  – Equal work distribution
  – Communication
  – Synchronization
Stream Graph Modulo Scheduling (SGMS)
• Coarse grain software pipelining
  – Equal work distribution
  – Communication/computation overlap
• Target: Cell processor
  – Cores with disjoint address spaces
  – Explicit copy to access remote data
    • DMA engine independent of PEs
• Filters = operations, cores = function units
[Figure: Cell processor — eight SPEs (SPE0–SPE7), each with an SPU, a 256 KB local store, and an MFC (DMA engine), connected over the EIB to the PPE (PowerPC) and DRAM.]
Streamroller Software Overview
[Figure: Streamroller toolchain. Front ends — Stylized C, SPEX (C++ with stream extensions), and StreamIt — are lowered to a common Stream IR. Streamroller then performs profiling, analysis, scheduling (the focus of this talk), and code generation, targeting either a custom application engine, SODA (a low-power multicore processor for SDR), or general multicore platforms (Cell, Core2Quad, Niagara).]
Preliminaries
• Synchronous Data Flow (SDF) [Lee ’87]
• StreamIt [Thies ’02]
int->int filter FIR(int N) {
  int wgts[N];            // filter state: FIR coefficients

  work pop 1 push 1 {
    int i, sum = 0;
    wgts = adapt(wgts);   // coefficients updated every firing (makes the filter stateful)
    for (i = 0; i < N; i++)
      sum += peek(i) * wgts[i];
    push(sum);
    pop();
  }
}

Push and pop items from input/output FIFOs.
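For intuition, a minimal C sketch of the FIFO semantics that peek, pop, and push assume (the ring-buffer layout and names here are illustrative assumptions, not StreamIt's runtime):

#include <assert.h>

#define CAP 1024                      /* illustrative capacity; a power of two so the
                                         unsigned indices below wrap consistently */
typedef struct {
    int data[CAP];
    unsigned head, tail;              /* tail - head = number of queued items */
} fifo_t;

/* Read the i-th item past the front without consuming it. */
static int peek(const fifo_t *f, unsigned i) {
    assert(f->tail - f->head > i);
    return f->data[(f->head + i) % CAP];
}

/* Consume and return the item at the front. */
static int pop(fifo_t *f) {
    assert(f->tail != f->head);
    return f->data[f->head++ % CAP];
}

/* Append one item at the back. */
static void push(fifo_t *f, int v) {
    assert(f->tail - f->head < CAP);
    f->data[f->tail++ % CAP] = v;
}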
SGMS Overview
[Figure: SGMS pipeline across PE0–PE3. A prologue fills the pipeline; in the steady state each PE executes its chunk of a different iteration while DMA transfers between PEs overlap with computation; an epilogue drains the pipeline. With the work split evenly over four PEs, the per-iteration time drops from T1 to T4, with T1/T4 ≈ 4.]
SGMS Phases
Fission + processor assignment (load balance) → Stage assignment (causality, DMA overlap) → Code generation
Processor Assignment
• Assign filters to processors
– Goal : Equal work distribution
• Graph partitioning?
• Bin packing?
[Figure: original stream program vs. modified program. In the original graph A feeds B and C, which feed D, with work estimates A = 5, B = 40, C = 10, D = 5 (total 60); any two-PE partition leaves B alone on one PE, so speedup = 60/40 = 1.5. In the modified graph B is fissed into B1 and B2 with a split (S) and join (J); with the small split/join overhead factored in, a balanced two-PE assignment (e.g., A, S, B1, D on PE0 and B2, C, J on PE1) gives speedup = 60/32 ≈ 2.]
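As a point of reference for the "bin packing?" question above, a minimal C sketch of greedy bin packing over these work estimates (the code, names, and strategy are illustrative assumptions, not the Streamroller algorithm):

#include <stdio.h>

#define NFILTERS 4
#define NPES     2

int main(void) {
    /* Filters sorted by work, largest first (longest-processing-time heuristic). */
    const char *name[NFILTERS] = { "B", "C", "A", "D" };
    const int   work[NFILTERS] = {  40,  10,   5,   5 };
    int load[NPES] = { 0 };

    for (int f = 0; f < NFILTERS; f++) {
        int best = 0;                              /* pick the least-loaded PE */
        for (int p = 1; p < NPES; p++)
            if (load[p] < load[best]) best = p;
        load[best] += work[f];
        printf("%s -> PE%d\n", name[f], best);
    }
    for (int p = 0; p < NPES; p++)
        printf("PE%d load = %d\n", p, load[p]);
    /* Max load stays 40 (filter B is indivisible here), so speedup is only
       60/40 = 1.5 -- the motivation for fissing B before assignment. */
    return 0;
}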
Filter Fission Choices
[Figure: alternative fission choices for the example graph when targeting four PEs (PE0–PE3) — different ways of splitting the heavy filter and distributing the copies. Can the right choice reach a speedup of ~4?]
Integrated Fission + PE Assign
• Exact solution based on Integer Linear Programming (ILP)
  – Split/join overhead factored in
• Objective function
  – Minimize the maximal load on any PE
• Result
  – Number of times to “split” (fiss) each filter
  – Filter → processor mapping
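For intuition only, a simplified min-max sketch of the objective; the symbols w_f, x_{f,p}, T are assumptions of this sketch, and the fission variables and split/join costs of the full formulation (see the backup slide) are omitted:

\[
\min\; T
\quad\text{s.t.}\quad
\sum_{f} w_f \, x_{f,p} \le T \;\;\forall p,
\qquad
\sum_{p} x_{f,p} = 1 \;\;\forall f,
\qquad
x_{f,p} \in \{0,1\},
\]

where $w_f$ is the work estimate of filter $f$ and $x_{f,p} = 1$ iff filter $f$ is mapped to PE $p$.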
SGMS Phases
Fission + processor assignment (load balance) → Stage assignment (causality, DMA overlap) → Code generation
Forming the Software Pipeline
• To achieve speedup
– All chunks should execute concurrently
– Communication should be overlapped
• Processor assignment alone is insufficient information
[Figure: why processor assignment alone is not enough. With A on PE0 and B on PE1, launching both immediately is invalid (marked ✗) because iteration i of B needs the A→B data of iteration i. Instead the execution is staggered over time: PE0 runs A1, A2, A3, ..., the A→B DMA is issued one step behind, and PE1 runs B another step behind, so Bi overlaps with Ai+2.]
Stage Assignment
• Two rules, for a producer i (stage Si) feeding a consumer j (stage Sj):
  – i and j on the same PE: Sj ≥ Si
    • Preserve causality (producer-consumer dependence)
  – i and j on different PEs (e.g., PE 1 and PE 2): insert a DMA node, with SDMA > Si and Sj = SDMA + 1
    • Communication-computation overlap
• Data flow traversal of the stream graph
  – Assign stages using the two rules above (sketched in code below)
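A minimal C sketch of that traversal (data structures, names, and the topological-order assumption are illustrative, not the Streamroller implementation):

#include <string.h>

#define MAXF 64

typedef struct {
    int nfilters;            /* filters are numbered in topological (dataflow) order */
    int pe[MAXF];            /* PE chosen for each filter by processor assignment */
    int nsucc[MAXF];         /* number of consumers of each filter */
    int succ[MAXF][MAXF];    /* consumer lists */
} graph_t;

/* Assign a pipeline stage to every filter; returns the number of stages. */
int assign_stages(const graph_t *g, int stage[MAXF]) {
    int max_stage = 0;
    memset(stage, 0, g->nfilters * sizeof(int));

    for (int i = 0; i < g->nfilters; i++) {
        for (int s = 0; s < g->nsucc[i]; s++) {
            int j = g->succ[i][s];
            int min_stage = (g->pe[j] == g->pe[i])
                          ? stage[i]        /* same PE: Sj >= Si */
                          : stage[i] + 2;   /* cross-PE: DMA node at Si+1, so Sj = Si+2 */
            if (stage[j] < min_stage) stage[j] = min_stage;
            if (stage[j] > max_stage) max_stage = stage[j];
        }
    }
    return max_stage + 1;
}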
Stage Assignment Example
[Figure: stage assignment for the example. The fissed graph (A, S, B1, B2, C, J, D) is split across PE 0 and PE 1 with a DMA node on every cross-PE edge; applying the two rules lays the filters and DMA nodes out in stages 0 through 4, each DMA occupying its own stage between its producer and consumer.]
SGMS Phases
Fission + processor assignment (load balance) → Stage assignment (causality, DMA overlap) → Code generation
Code Generation for Cell
• Target the Synergistic Processing Elements (SPEs)
– PS3 – up to 6 SPEs
– QS20 – up to 16 SPEs
• One thread / SPE
• Challenge
– Making a collection of independent threads implement a
software pipeline
– Adapt kernel-only code schema of a modulo schedule
Complete Example
[Figure: the fissed example graph. A, S, and B1 run on one SPE; DMAs carry A's output to C, S's output to B2, and B1's output to J on the other SPE; further DMAs bring J's and C's results back to D.]

void spe1_work()
{
  char stage[5] = {0};
  stage[0] = 1;
  for (i = 0; i < MAX; i++) {
    if (stage[0]) {          // stage 0: A, S, B1 execute on SPE1
      A();
      S();
      B1();
    }
    if (stage[1]) {          // stage 1: no SPE1 work (AtoC, StoB2, B1toJ DMAs in flight)
    }
    if (stage[2]) {          // stage 2: B2, J, C execute on SPE2; start the DMAs that
      JtoD();                //          bring the J and C results over for D
      CtoD();
    }
    if (stage[3]) {          // stage 3: no SPE1 work (JtoD, CtoD DMAs in flight)
    }
    if (stage[4]) {          // stage 4: D executes on SPE1
      D();
    }
    barrier();               // stage-mask shifting and DMA waits as in the SPE code template
  }
}

[Figure: execution timeline with columns SPE1, DMA1, SPE2, DMA2. In each barrier-delimited step, SPE1 starts A, S, B1 for a new iteration; DMA1 carries AtoC, StoB2, B1toJ for the previous iteration; SPE2 runs B2, J, C one step later; DMA2 carries JtoD, CtoD; and D completes on SPE1 four steps behind A.]
Experiments
• StreamIt benchmarks
  – Signal processing – DCT, FFT, FMRadio, radar
  – MPEG2 decoder subset
  – Encryption – DES
  – Parallel sort – Bitonic
  – Range of 30 to 90 filters/benchmark
• Platform
– QS20 blade server – 2 Cell chips, 16 SPEs
• Software
– StreamIt to C – gcc 4.1
– IBM Cell SDK 2.1
Evaluation on QS20
[Figure: relative speedup on the QS20 with 2, 4, 8, and 16 SPEs for each benchmark (bitonic, channel, dct, des, fft, filterbank, fmradio, tde, mpeg2, vocoder, radar) and the geometric mean. Annotations: split/join overhead reduces the benefit from fission; barrier synchronization is performed once per iteration of the stream graph.]
SGMS (ILP) vs. Greedy
(MIT method, ASPLOS’06)
[Figure: relative speedup for each benchmark under three configurations — ILP partitioning, greedy partitioning, and exposed (non-overlapped) DMA.]
• Solver time < 30 seconds for 16 processors
Conclusions
• Streamroller
– Efficient mapping of stream programs to multicore
– Coarse grain software pipelining
• Performance summary
– 14.7x speedup on 16 cores
– Up to 35% better than greedy solution (11% on average)
• Scheduling framework
– Tradeoff memory space vs. load balance
• Memory constrained (embedded) systems
• Cache based system
Naïve Unfolding
[Figure: naïve unfolding. For a completely stateless stream program, the whole graph (A, B, C, D, E, F) is replicated on PE1–PE4, each PE executing a different iteration, with DMAs moving stream data between them. For a stream program with stateful nodes, state data dependences between successive copies of the stateful filters serialize those filters across the PEs, in addition to the stream data dependences.]
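A minimal C sketch of the stateless case (the names, the round-robin iteration split, and the omission of DMA/synchronization are illustrative assumptions):

#define MAX_ITER 1024          /* illustrative iteration count */

/* Filter work functions for the graph A..F (assumed to exist elsewhere). */
void A(int it); void B(int it); void C(int it);
void D(int it); void E(int it); void F(int it);

/* Naive unfolding: every PE runs the entire graph, on its own iterations. */
void pe_work(int pe_id, int num_pes)
{
    for (int it = pe_id; it < MAX_ITER; it += num_pes) {
        A(it); B(it); C(it);
        D(it); E(it); F(it);
        /* With stateful filters, the state would have to be shipped from the
           PE running iteration it-1, serializing those filters. */
    }
}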
SGMS (ILP) vs. Greedy
(MIT method, ASPLOS’06)
[Figure: relative speedup of ILP partitioning vs. greedy partitioning for each benchmark.]
• Highly dependent on graph structure
• DMA overlapped in both ILP and Greedy
• Speedup drops with exposed DMA
• Solver time < 30 seconds for 16 processors
Unfolding vs. SGMS
[Figure: relative speedup of naïve unfolding vs. SGMS for each benchmark (bitonic, channel, dct, des, fft, filterbank, fmradio, tde, mpeg2, vocoder, radar).]
Pros and Cons of Unfolding
• Pros
  – Simple
  – Good speedups for mostly stateless programs
• Cons
  – Speedup affected by filters with large state
  – All input data need to be available for iterations to begin
    • Long latency for one iteration
    • Not suitable for real time scheduling
[Figure: unfolded schedule of the graph A, B, C, D on PE 1–PE 3 — each PE runs the whole graph for a different iteration, separated by Sync + DMA phases.]
SGMS Assumptions
• Data independent filter work functions
– Static schedule
• High bandwidth interconnect
– EIB 96 Bytes/cycle
– Low overhead DMA, low observed latency
• Cores can run MIMD efficiently
– Not (directly) suitable for GPUs
Speedup Upper Bounds
[Figure: speedup upper bound for 2, 4, 8, 16, 32, and 64 processors for each benchmark. Annotation on one benchmark: 15% of the work in one stateless filter, so max speedup = 100/15 = 6.7.]
Summary
• 14.7x speedup on 16 cores
– Preprocessing steps (fission) necessary
– Hiding communication important
• Extensions
– Scaling to more cores
– Constrained memory system
• Other targets
– SODA, FPGA, GPU
Stream Programming
• Algorithmic style suitable for
– Audio/video processing
– Signal processing
– Networking (packet processing,
encryption/decryption)
– Wireless domain, software defined radio
• Characteristics
– Independent filters
    • Segregated address space
    • Independent threads of control
  – Explicit communication
  – Coarse grain dataflow
  – Amenable to aggressive compiler optimizations [ASPLOS ’02, ’06, PLDI ’03, ’08]
[Figure: MPEG decoder stream graph — bitstream parser, motion decode, zigzag unorder, inverse quant AC, inverse quant DC, 1D-DCT (row), saturate, 1D-DCT (column), bounded saturate.]
Effect of Exposed DMA
[Figure: relative speedup of SGMS vs. SGMS with exposed (non-overlapped) DMA for each benchmark.]
Scheduling via Unfolding Evaluation
• IBM SDK 2.1, pthreads, gcc 4.1.1, QS20 Blade server
• StreamIt benchmarks characteristics
Benchmark     Total filters   Stateful   Peeking   State size (bytes)
bitonic             40            1          0              4
channel             55            1         34            252
dct                 40            1          0              4
des                 55            0          0              4
fft                 17            1          0              4
filterbank          85            1         32            508
fmradio             43            1         14            508
tde                 55            1          0              4
mpeg2               39            1          0              4
vocoder             54           11          6            112
radar               57           44          0           1032
Absolute Performance
Benchmark     GFLOPS with 16 SPEs
bitonic            N/A
channel            18.72949
dct                17.2353
des                N/A
fft                1.65881
filterbank         18.41478
fmradio            17.05445
tde                6.486601
mpeg2              N/A
vocoder            5.488236
radar              30.29937
[Figure: fission configurations of one actor — the original actor (node 0_0), fissed 2x (nodes 1_0–1_3: split, two copies, join), and fissed 3x (nodes 2_0–2_4: split, three copies, join).]

Original actor:
\[ \sum_{i=1}^{P} a_{1,0,0,i} - b_{1,0} = 0 \]

Fissed 2x:
\[ \sum_{i=1}^{P} \sum_{j=0}^{3} a_{1,1,j,i} - b_{1,1} \le M\, b_{1,1} \]
\[ \sum_{i=1}^{P} \sum_{j=0}^{3} a_{1,1,j,i} - b_{1,1} - 2 \ge -M + M\, b_{1,1} \]

Fissed 3x:
\[ \sum_{i=1}^{P} \sum_{j=0}^{4} a_{1,2,j,i} - b_{1,2} \le M\, b_{1,2} \]
\[ \sum_{i=1}^{P} \sum_{j=0}^{4} a_{1,2,j,i} - b_{1,2} - 3 \ge -M + M\, b_{1,2} \]

Exactly one configuration is chosen:
\[ b_{1,0} + b_{1,1} + b_{1,2} = 1 \]
SPE Code Template
void spe_work()
{
  char stage[N] = {0, 0, ..., 0};       // bit mask to control active stages
  stage[0] = 1;                         // activate stage 0

  for (i = 0; i < max_iter + N - 1; i++) {   // go through all input items
    if (stage[N-1]) {                   // bit mask controls what gets executed
      ...
    }
    if (stage[N-2]) {
      ...
    }
    ...
    if (stage[0]) {
      Start_DMA();                      // start DMA operation
      FilterA_work();                   // call filter work function
    }

    if (i == max_iter - 1)              // last input consumed: stop activating stage 0
      stage[0] = 0;

    for (j = N-1; j >= 1; j--)          // left shift bit mask (activate more stages)
      stage[j] = stage[j-1];

    wait_for_dma();                     // poll for completion of all outstanding DMAs
    barrier();                          // barrier synchronization
  }
}