Programming High Performance
Embedded Systems:
Tackling the Performance
Portability Problem
Alastair Reid
Principal Engineer, R&D
ARM Ltd
1
Programming HP Embedded Systems
High-Performance Energy-Efficient Hardware
 Example: Ardbeg processor cluster (ARM R&D)
Portable System-level programming
 Example: SoC-C language extensions (ARM R&D)
Portable Kernel-level programming
 Example: C+Builtins
 Example: Data Parallel Language
 Merging System/Kernel-level programming
2
Mobile Consumer Electronics Trends
Mobile Application Requirements Still Growing Rapidly
 Still cameras: 2 Mpixel → 10 Mpixel
 Video cameras: VGA → HD 1080p → …
 Video players: MPEG-2 → H.264
 2D Graphics: QVGA → HVGA → VGA → FWVGA → …
 3D Gaming: > 30 Mtriangle/s, antialiasing, …
 Bandwidth: HSDPA (14.4 Mbps) → WiMax (70 Mbps) → LTE (326 Mbps)
Feature Convergence
 Phone
 + graphics + UI + games
 + still camera + video camera
 + music
 + WiFi + Bluetooth + 3.5G + 3.9G + WiMax + GPS
 +…
3
Mobile SDR Design Challenges
[Chart: peak performance (Gops, 1–1000) vs. power (0.1–100 W), with Mops/mW efficiency contours ("better power efficiency" toward higher Mops/mW). General-purpose processors (Pentium M), high-end DSPs (TI C6x), embedded DSPs and the IBM Cell are compared against the mobile SDR design region; none meets both targets.]
SDR Design Objectives for 3G and WiFi
 Throughput requirements: 40+ Gops peak throughput
 Power budget: 100mW~500mW peak power
Slide adapted from M. Woh’s ‘From Scotch to SODA’, MICRO-41, 2008
5
Energy Efficient Systems are “Lumpy”
Drop Frequency 10x
 Desktop: 2-4GHz
 Mobile: 200-400MHz
Increase Parallelism 100x
 Desktop: 1-2 cores
 Mobile: 32-way SIMD Instruction Set, 4-8 cores
Match Processor Type to Task
 Desktop: homogeneous, general purpose
 Mobile: heterogeneous, specialised
Keep Memory Local
 Desktop: coherent, shared memory
 Mobile: processor-memory clusters linked by DMA
8
Ardbeg SDR Processor
[Block diagram of the Ardbeg system and an Ardbeg PE, highlighting three features:
 1. VLIW: a sparsely connected, wide-SIMD datapath — 512-bit SIMD register file, 512-bit SIMD ALU with shuffle, SIMD shuffle network, 512-bit SIMD multiplier, SIMD predication (predicate ALU and RF) and a SIMD/scalar transfer unit, with 8, 16, 32-bit fixed point support;
 2. Scalar & AGU: scalar ALU + multiplier, scalar RF + accumulators, and AGUs with their own RF and controller;
 3. Memory: a 2-level memory hierarchy — L1 data and program memories per PE plus a shared L2 memory, with 512-bit and 1024-bit SIMD paths to memory.
 The system combines PE execution units, a control processor, an FEC accelerator and other application-specific hardware, a DMAC and peripherals over a 64-bit AMBA 3 AXI interconnect.]
Slide adapted from M. Woh’s ‘From Scotch to SODA’, MICRO-41, 2008
10
Summary of Ardbeg SDR Processor
[Chart: achieved throughput (Mbps) vs. power (Watts) for 802.11a, DVB-T/DVB-H and W-CDMA (voice, data, 2Mbps) workloads, comparing Ardbeg with SODA (180nm), Sandblaster, TigerSHARC, Pentium M and ASIC implementations.]
• Ardbeg is lower power at the same throughput
• We are getting closer to ASICs
Slide adapted from M. Woh’s ‘From Scotch to SODA’, MICRO-41, 2008
11
How do we program AMP systems?
C doesn’t provide language features to support
 Multiple processors (or multi-ISA systems)
 Distributed memory
 Multiple threads
12
Use Indirection (Strawman #1)
Add a layer of indirection
 Operating System
 Layer of middleware
 Device drivers
 Hardware support
All impose a cost in Power/Performance/Area
13
Raise Pain Threshold (Strawman #2)
Write efficient code at very low level of abstraction
Problems
 Hard, slow and expensive to write, test, debug and maintain
 Design intent drowns in sea of low level detail
 Not portable across different architectures
 Expensive to try different points in design space
14
Our Response
Extend C
 Support Asymmetric Multiprocessors
 SoC-C language raises level of abstraction
 … but take care not to hide expensive operations
15
SoC-C Overview
Pocket-Sized Supercomputers
 Energy efficient hardware is “lumpy”
 … and unsupported by C
 … but supported by SoC-C
SoC-C Extensions by Example
 Pipeline Parallelism
 Code Placement
 Data Placement
SoC-C Conclusion
16
3 steps in mapping an application
1. Decide how to parallelize
2. Choose processors for each pipeline stage
3. Resolve distributed memory issues
17
A Simple Program
int x[100];
int y[100];
int z[100];
while (1) {
  get(x);
  foo(y,x);
  bar(z,y);
  baz(z);
  put(z);
}
18
Simplified System Architecture
[Artist’s impression: a control processor, data engines with SIMD instruction sets, accelerators, and distributed memories.]
19
Step 1: Decide how to parallelize
int x[100];
int y[100];
int z[100];
while (1) {
  get(x);
  foo(y,x);    // 50% of work
  bar(z,y);    // 50% of work
  baz(z);
  put(z);
}
20
Step 1: Decide how to parallelize
int x[100];
int y[100];
int z[100];
PIPELINE {
  while (1) {
    get(x);
    foo(y,x);
    FIFO(y);
    bar(z,y);
    baz(z);
    put(z);
  }
}
PIPELINE indicates the region to parallelize
FIFO indicates boundaries between pipeline stages
21
SoC-C Feature #1: Pipeline Parallelism
Annotations express coarse-grained pipeline parallelism
 PIPELINE indicates scope of parallelism
 FIFO indicates boundaries between pipeline stages
Compiler splits into threads communicating through FIFOs
22
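As a rough illustration of what this split means (this is not SoC-C compiler output; the thread functions, fifo_t and run_pipeline names are invented for this sketch, and one possible FIFO implementation is sketched later next to the zero-copy bonus slides), the PIPELINE region above corresponds to two threads joined by a FIFO carrying y:

#include <pthread.h>

/* Kernels and FIFO from the running example (declarations only). */
typedef struct fifo fifo_t;
extern fifo_t *f;                           /* carries y between the two stages */
extern void fifo_put(fifo_t *q, const int *src);
extern void fifo_get(fifo_t *q, int *dst);
extern void get(int *x);
extern void foo(int *y, const int *x);
extern void bar(int *z, const int *y);
extern void baz(int *z);
extern void put(const int *z);

/* Stage 1: everything before FIFO(y). */
static void *stage1(void *arg) {
    (void)arg;
    int x[100], y[100];
    for (;;) {
        get(x);
        foo(y, x);
        fifo_put(f, y);      /* hand y to stage 2 */
    }
    return NULL;
}

/* Stage 2: everything after FIFO(y). */
static void *stage2(void *arg) {
    (void)arg;
    int y[100], z[100];
    for (;;) {
        fifo_get(f, y);      /* wait for y from stage 1 */
        bar(z, y);
        baz(z);
        put(z);
    }
    return NULL;
}

/* The PIPELINE region becomes two concurrently running threads. */
void run_pipeline(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, stage1, NULL);
    pthread_create(&t2, NULL, stage2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
}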
Step 2: Choose Processors
int x[100];
int y[100];
int z[100];
PIPELINE {
  while (1) {
    get(x);
    foo(y,x);
    FIFO(y);
    bar(z,y);
    baz(z);
    put(z);
  }
}
23
Step 2: Choose Processors
int x[100];
int y[100];
int z[100];
PIPELINE {
  while (1) {
    get(x);
    foo(y,x) @ P0;
    FIFO(y);
    bar(z,y) @ P1;
    baz(z) @ P1;
    put(z);
  }
}
@P indicates the processor to execute the function
24
SoC-C Feature #2: RPC Annotations
Annotations express where code is to execute
 Behaves like Synchronous Remote Procedure Call
 Does not change meaning of program
 Bulk data is not implicitly copied to processor’s local memory
25
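To make "synchronous RPC" concrete: a call annotated with @ P0 can be read as an asynchronous launch followed by a wait, which is essentially the lowering shown later in the bonus slides (Step 3b). In this sketch the runtime entry points are assumptions modelled on those slides (start_DE32_foo, semaphore_DE32), with the slides' wait() renamed rpc_wait() to avoid clashing with POSIX wait():

/* Hypothetical runtime interface (assumptions modelled on the bonus slides). */
typedef struct rpc_sem rpc_sem_t;
extern rpc_sem_t *semaphore_DE32[];                        /* one per data engine */
extern void start_DE32_foo(int pe, int *y, const int *x);  /* asynchronous launch */
extern void rpc_wait(rpc_sem_t *s);                        /* block until done    */

void foo_on_P0(int *y, const int *x) {
    /* 'foo(y,x) @ P0' behaves like a synchronous remote procedure call: */
    start_DE32_foo(0, y, x);        /* start foo on data engine 0 (P0)           */
    rpc_wait(semaphore_DE32[0]);    /* caller blocks until foo has finished      */
    /* On return, y is valid in P0's local memory (M0); SoC-C does not
       implicitly copy it anywhere else. */
}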
Step 3: Resolve Memory Issues
int x[100];
int y[100];
int z[100];
PIPELINE {
  while (1) {
    get(x);
    foo(y,x) @ P0;
    FIFO(y);
    bar(z,y) @ P1;
    baz(z) @ P1;
    put(z);
  }
}
P0 uses x → x must be in M0
P1 uses z → z must be in M1
P0 uses y → y must be in M0
P1 uses y → y must be in M1
Conflict?!
26
Hardware Cache Coherency
[Diagram: processors P0 and P1 with caches $0 and $1. Each write to x by one processor invalidates the copy of x in the other cache; each read copies x across again, so alternating writes and reads generate repeated invalidate/copy traffic.]
27
Step 3: Resolve Memory Issues
int x[100];
int y[100];
int z[100];
PIPELINE {
  while (1) {
    get(x);
    foo(y,x) @ P0;
    FIFO(y);
    bar(z,y) @ P1;
    baz(z) @ P1;
    put(z);
  }
}
Two versions: y@M0, y@M1
write y@M0 → y@M1 is invalid
read of y@M1 → Coherence error
28
Step 3: Resolve Memory Issues
int x[100];
int y[100];
int z[100];
PIPELINE {
  while (1) {
    get(x);
    foo(y,x) @ P0;
    SYNC(y) @ DMA;
    FIFO(y);
    bar(z,y) @ P1;
    baz(z) @ P1;
    put(z);
  }
}
SYNC(x) @ P copies data from one version of x to another using processor P
→ y@M0 and y@M1 are both valid
→ read of y@M1 is now valid
29
SoC-C Feature #3: Compile Time Coherency
Variables can have multiple coherent versions
 Compiler uses memory topology to determine which version
is being accessed
Compiler applies cache coherency protocol
 Writing to a version makes it valid and other versions invalid
 Dataflow analysis propagates validity
 Reading from an invalid version is an error
 SYNC(x) copies from valid version to invalid version
30
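A minimal sketch of what the compile-time protocol amounts to for the running example (the y_M0/y_M1 names and dma_memcpy are illustrative assumptions; Step 1c in the bonus slides shows the actual transformation the compiler performs):

/* Sketch only: the compiler does this splitting and checking automatically. */
extern void dma_memcpy(void *dst, const void *src, unsigned nbytes);

int y_M0[100];   /* version of y placed in memory M0 (local to P0) */
int y_M1[100];   /* version of y placed in memory M1 (local to P1) */

void coherence_example(void) {
    /* foo(y,x) @ P0 writes y_M0: y_M0 becomes valid, y_M1 becomes invalid.     */
    /* SYNC(y) @ DMA compiles to a copy from the valid to the invalid version:  */
    dma_memcpy(y_M1, y_M0, sizeof y_M0);   /* now both versions are valid       */
    /* bar(z,y) @ P1 reads y_M1: accepted because dataflow analysis sees a valid
       version here; without the SYNC, that read is a compile-time coherence error. */
}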
Compiling SoC-C
See paper:
SoC-C: efficient programming abstractions for
heterogeneous multicore systems on chip, Proceedings
of the 2008 international conference on Compilers,
architectures and synthesis for embedded systems (CASES)
2008.
(Or view ‘bonus slides’ after talk.)
31
More realistic SoC-C code
DVB-T Inner Receiver
 OFDM receiver
 20 tasks
 500-7000 cycles each
 29000 cycles total
adc_t adc;
ADC_Init(&adc,ADC_BUFSIZE_SAMPLES,adc_Re,adc_Im,13);
SOCC_PIPELINE {
ChannelEstimateInit_DVB_simd(TPS_INFO, CrRe, CrIm) @ DEd;
for(int sym = 0; sym<LOOPS; ++sym) {
cbuffer_t src_r, src_i;
unsigned len = Nguard+asC_MODE[Mode];
ADC_AcquireData(&adc,(sym*len)%ADC_BUFSIZE_SAMPLES,len,&src_r, &src_i);
align(sym_Re,&src_r,len*sizeof(int16_t)) @ DMA_512;
align(sym_Im,&src_i,len*sizeof(int16_t)) @ DMA_512;
ADC_ReleaseRoom(&adc,&src_r,&src_i,len);
RxGuard_DVB_simd(sym_Re,sym_Im,TPS_INFO,Nguard,guarded_Re,guarded_Im) @ DEa;
cscale_DVB_simd(guarded_Re,guarded_Im,23170,avC_MODE[Mode],fft_Re,fft_Im) @ DEa;
fft_DVB_simd(fft_Re,fft_Im,TPS_INFO,ReFFTTwid,ImFFTTwid) @ DEa;
SymUnWrap_DVB_simd(fft_Re,fft_Im,TPS_INFO,unwrapped_Re,unwrapped_Im) @ DEb;
DeMuxSymbol_DVB_simd(unwrapped_Re,unwrapped_Im,TPS_INFO,ISymNum,
demux_Re,demux_Im,PilotsRe,PilotsIm,TPSRe,TPSIm) @ DEb;
DeMuxSymbol_DVB_simd(CrRe,CrIm,TPS_INFO,ISymNum,
demux_CrRe,demux_CrIm,CrPilotsRe,CrPilotsIm,CrTPSRe,CrTPSIm) @ DEb;
cfir1_DVB_simd(demux_Re,demux_Im,demux_CrRe,demux_CrIm,avN_DCPS[Mode],equalized_Re,equalized_Im) @ DEc;
cfir1_DVB_simd(TPSRe,TPSIm,CrTPSRe,CrTPSIm,avN_TPSSCPS[Mode],equalized_TPSRe,equalized_TPSIm) @ DEb;
DemodTPS_DVB_simd(equalized_TPSRe,equalized_TPSIm,TPS_INFO,Pilot,TPSRe) @ DEb;
DemodPilots_DVB_simd(PilotsRe,PilotsIm,TPS_INFO,ISymNum,demod_PilotsRe,demodPilotsIm) @ DEb;
cmagsq_DVB_simd(demux_CrRe,demux_CrIm,12612,avN_DCPS[Mode],MagCr) @ DEc;
int Direction = (ISymNum & 1);
Direction ^= 1;
if (Direction) {
Error=SymInterleave3_DVB_simd2(equalized_Re,equalized_Im,MagCr,
DE_vinterleave_symbol_addr_DVB_T_N,
DE_vinterleave_symbol_addr_DVB_T_OFFSET,
TPS_INFO,Direction,sRe,sIm,sCrMag) @ DEc;
pack3_DVB_simd(sRe,sIm,sCrMag,avN_DCPS[Mode],interleaved_Re,interleaved_Im,Range) @ DEc;
} else {
unpack3_DVB_simd(equalized_Re,equalized_Im,MagCr,avN_DCPS[Mode],sRe,sIm,sCrMag) @ DEc;
Error=SymInterleave3_DVB_simd2(sRe,sIm,sCrMag,
DE_vinterleave_symbol_addr_DVB_T_N,
DE_vinterleave_symbol_addr_DVB_T_OFFSET,
TPS_INFO,Direction,interleaved_Re,interleaved_Im,Range) @ DEc;
}
ChannelEstimate_DVB_simd(interleaved_Re,interleaved_Im,Range,TPS_INFO,CrRe2,CrIm2) @ DEd;
Demod_DVB_simd(interleaved_Re,interleaved_Im,TPS_INFO,Range,demod_softBits) @ DEd;
BitDeInterleave_DVB_simd(demod_softBits,TPS_INFO,deint_softBits) @ DEd;
uint_t err=HardDecoder_DVB_simd(deint_softBits,uvMaxCnt,hardbits) @ DEd;
Bytecpy(&output[p],hardbits,uMaxCnt/8) @ ARM;
p += uMaxCnt/8;
ISymNum = (ISymNum+1) % 4;
}
ADC_Fini(&adc);
32
Parallel Speedup
[Bar chart: speedup from 0% to ~400% as the pipeline is spread over 1 to 4 data engines.]
Efficient
 → Same performance as hand-written code
Near Linear Speedup
 → Very efficient use of parallel hardware
33
What SoC-C Provides
SoC-C language features
 Pipeline to support parallelism
 Coherence to support distributed memory
 RPC to support multiple processors/ISAs
Non-features
 Does not choose boundary between pipeline stages
 Does not resolve coherence problems
 Does not allocate processors
SoC-C is concise notation to express mapping decisions
(not a tool for making them on your behalf)
34
Related Work
Language
 OpenMP: SMP data parallelism using ‘C plus annotations’
 StreamIt: Pipeline parallelism using dataflow language
Pipeline parallelism
 J.E. Smith, “Decoupled access/execute computer
architectures,” Trans. Computer Systems, 2(4), 1984
 Multiple independent reinventions
Hardware
 Woh et al., “From Soda to Scotch: The Evolution of a
Wireless Baseband Processor,” Proc. MICRO-41, Nov. 2008
35
More Recent Related Work
Mapping applications onto Embedded SoCs
 Exposing Non-Standard Architectures to Embedded Software
using Compile-Time Virtualization, CASES 2009
Pipeline parallelism
 The Paralax Infrastructure: Automatic Parallelization with a
Helping Hand, PACT 2010
36
The SoC-C Model
Program as if using SMP system
 Single multithreaded processor: RPCs provide a “Migrating Thread Model”
 Single memory: Compiler Managed Coherence handles “bookkeeping”
 Annotations change execution, not semantics
Avoid need to restructure code
 Pipeline parallelism
 Compiler managed coherence
Efficiency
 Avoid abstracting expensive operations
 → programmer can optimize and reason about them
37
Kernel Programming
38
Overview
Example: FIR filter
  y_i = Σ_{j=0}^{T−1} h_j · x_{i−j}
Hand-vectorized code
 Optimal performance
 Issues
An Alternative Approach
39
Example Vectorized Code
Very fast, efficient code
 Uses 32-wide SIMD
 Each SIMD multiply performs 32 (useful) multiplies
VLIW compiler overlaps operations
 3 vector operations per cycle
VLIW compiler performs software pipelining
 Multiplier active on every cycle
40
void FIR(vint16_t x[], vint16_t y[], int16_t h[]) {
  // Vendor-specific SIMD builtins (vqdmull, vqdmlal, vqrdmlah, vdown, vget_lane);
  // N, T and SIMD_WIDTH are assumed to be defined elsewhere.
  vint16_t v = x[0];
  for (int i = 0; i < N/SIMD_WIDTH; ++i) {
    vint16_t w = x[i+1];                    // pre-load the next input vector
    vint32L_t acc = vqdmull(v, h[0]);       // widening multiply by the first tap
    int16_t s = vget_lane(w, 0);
    v = vdown(v, s);                        // slide the window by one element
    for (int j = 1; j < T-1; ++j) {
      acc = vqdmlal(acc, v, h[j]);          // multiply-accumulate the middle taps
      s = vget_lane(w, j);
      v = vdown(v, s);
    }
    y[i] = vqrdmlah(acc, v, h[T-1]);        // last tap, then round/narrow to 16 bits
    v = w;
  }
}
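For reference, the computation being vectorized is just a direct T-tap filter. A plain scalar C version is sketched below (assumptions: it follows the sliding-window indexing of the code above, i.e. x is read forwards over a block of N outputs and must hold N+T-1 samples, and the >>15 is a crude Q15 scaling with no rounding or saturation):

#include <stdint.h>

/* Scalar reference: y[i] = sum over j of h[j] * x[i+j] for i in [0, N). */
void FIR_scalar(const int16_t *x, int16_t *y, const int16_t *h, int N, int T) {
    for (int i = 0; i < N; ++i) {
        int32_t acc = 0;
        for (int j = 0; j < T; ++j)
            acc += (int32_t)h[j] * (int32_t)x[i + j];
        y[i] = (int16_t)(acc >> 15);
    }
}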
Portability Issues
Vendor specific SIMD operations
 vqdmull, vdown, vget_lane
SIMD-width specific
 Assumes SIMD_WIDTH >= T
Doesn’t work/performs badly on
 Many SIMD architectures
 GPGPU
 SMP
41
Flexibility issues
Improve arithmetic intensity
 Merge with adjacent kernel
 E.g., if filtering input to FFT, combine with bit reversal
Parallelize task across two Ardbeg engines
 Requires modification to system-level code
42
Summary
Programming directly to the processor
 Produces very high performance code
 Kernel is not portable to other processor types
 Kernels cannot be remapped to other devices
 Kernels cannot be split/merged to improve scheduling or
reduce inter-kernel overheads
Often produces local optimum
But misses global optimum
43
(Towards)
Performance-Portable
Kernel Programming
44
Outline
The goal
Quick and dirty demonstration
References to (more complete) versions
What still needs to be done
45
An alternative approach
T 1
yi 
h
j
j0
Compiler
46
xi j
A simple data parallel language
loop(N) {
  V1 = load(a);      // V1: a0    a1    a2    a3    a4    a5    a6    a7    ...
  V2 = load(b);      // V2: b0    b1    b2    b3    b4    b5    b6    b7    ...
  V3 = add(V1,V2);   // V3: a0+b0 a1+b1 a2+b2 a3+b3 a4+b4 a5+b5 a6+b6 a7+b7 ...
  store(c,V3);       // c:  a0+b0 a1+b1 a2+b2 a3+b3 a4+b4 a5+b5 a6+b6 a7+b7 ...
}
* Currently implemented as a Haskell EDSL – adapted to C-like notation for presentation.
47
Compiling Vector Expressions
Vector Expression   | Init  | Generate        | Next
--------------------+-------+-----------------+--------
V1 = load(a);       | p1=a; | V1=vld1(p1);    | p1+=32;
V2 = load(b);       | p2=b; | V2=vld1(p2);    | p2+=32;
V3 = add(V1,V2);    |       | V3=vadd(V1,V2); |
store(c,V3);        | p3=c; | vst1(p3,V3);    | p3+=32;

Putting the pieces together:
p1=a; p2=b; p3=c;
for(i=0; i<N; i+=32) {
  V1=vld1(p1);
  V2=vld1(p2);
  V3=vadd(V1,V2);
  vst1(p3,V3);
  p1+=32; p2+=32; p3+=32;
}
48
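As a concrete instance of one possible SIMD backend, the generated loop above could be rendered with 128-bit ARM NEON intrinsics roughly as follows (a sketch, not the tool's actual output: 8 × int16 lanes instead of Ardbeg's 32, and the table's vld1/vadd/vst1 are mapped onto vld1q_s16/vaddq_s16/vst1q_s16):

#include <arm_neon.h>
#include <stdint.h>

/* c[i] = a[i] + b[i] for i in [0, N); N is assumed to be a multiple of 8. */
void vadd_s16(const int16_t *a, const int16_t *b, int16_t *c, int N) {
    const int16_t *p1 = a;                   /* "Init" column of the table   */
    const int16_t *p2 = b;
    int16_t       *p3 = c;
    for (int i = 0; i < N; i += 8) {
        int16x8_t V1 = vld1q_s16(p1);        /* "Generate" column            */
        int16x8_t V2 = vld1q_s16(p2);
        int16x8_t V3 = vaddq_s16(V1, V2);
        vst1q_s16(p3, V3);
        p1 += 8; p2 += 8; p3 += 8;           /* "Next" column                */
    }
}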
Generating datapath
[Circuit sketch: the loads, add and store become memories A and B feeding an adder whose result is written to memory C, each memory driven by its own +1 address counter.]
* Warning: this circuit does not adhere to any ARM quality standards.
50
Adding control
[Circuit sketch: enable signals gate the address counters and memory ports, and a loop counter decremented each iteration and tested against zero (nDone) sequences the loop.]
* Warning: this circuit does not adhere to any ARM quality standards.
51
Fixing timing
[Circuit sketch: the same datapath and control from the previous slide with the timing fixed.]
52
Related Work
NESL – Nested Data Parallelism (CMU)
 Cray Vector machines, Connection Machine
DpH – Generalization of NESL as a Haskell library (SLPJ++)
 GPGPU
Accelerator – Data parallel library in C#/F# (MSR)
 SMP, DirectX9, FPGA
Array Building Blocks – C++ template library (Intel)
 SMP, SSE
Thrust – C++ template library (NVidia)
 GPGPU
(Also: Parallel Skeletons, Map-reduce, etc. etc.)
53
Summary of approach
(Only) Use highly structured bulk operations
 Bulk operations → reason about vectors, not individual elements
 Simple mathematical properties → easy to optimize
Single frontend, multiple backends
 SIMD, SMP, GPGPU, FPGA, ...
(Scope for significant platform-dependent optimization)
54
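One payoff of the "simple mathematical properties" is that bulk elementwise operations can be fused, so a backend need not materialise intermediate vectors. A plain-C sketch of the idea (the add/scale operations are illustrative, not part of the language shown earlier):

/* Unfused: two bulk operations, an intermediate vector, two passes over memory. */
void add_then_scale_unfused(const int *a, const int *b, int *c, int k, int n) {
    int t[n];                                        /* temporary vector (VLA)   */
    for (int i = 0; i < n; ++i) t[i] = a[i] + b[i];  /* t = add(a, b)            */
    for (int i = 0; i < n; ++i) c[i] = k * t[i];     /* c = scale(k, t)          */
}

/* Fused: both operations are elementwise, so they can legally be combined into
   a single pass with no temporary -- the kind of rewrite a backend can apply. */
void add_then_scale_fused(const int *a, const int *b, int *c, int k, int n) {
    for (int i = 0; i < n; ++i)
        c[i] = k * (a[i] + b[i]);
}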
Breaking down boundaries
Hard boundary between system and kernel layers
 Separate languages
 Separate tools
 Separate people writing/optimizing
Need to soften boundary
 Allow kernels to be split across processors
 Allow kernels to be merged across processors
 Allow kernels A and B to agree to use a non-standard
memory layout (to run more efficiently)
(This is an open problem)
55
Tackling Performance Portability Problem
High Performance Embedded Systems
 Energy Efficient systems are “lumpy”
 The hardware is the easy bit
Two level approach
 System Programming
 Stitch kernels together
 Inter-kernel parallelism, mapping onto processors/memory
 Kernel Programming
 C+builtins efficient but inflexible, non-portable
 Simple DPL in talk, references to more substantial efforts
 Intra-kernel parallelism expressed
 Boundary must be softened
56
Fin
57
Language Design Meta Issues
Compiler only uses simple analyses
 Easier to maintain consistency between different compiler
versions/implementations
Programmer makes the high-level decisions
 Code and Data Placement
 Inserting SYNC
 Load balancing
Implementation by many source-source transforms
 Programmer can mix high- and low-level features
 90-10 rule: use high-level features when you can, low-level features when
you need to
58
Compiling SoC-C
1. Data Placement
a) Infer data placement
b) Propagate coherence
c) Split variables with multiple placement
2. Pipeline Parallelism
a) Identify maximal threads
b) Split into multiple threads
c) Apply zero copy optimization
3. RPC (see paper for details)
59
Step 1a: Infer Data Placement
int x[100];
int y[100];
int z[100];
PIPELINE {
  while (1) {
    get(x);
    foo(y,x) @ P0;
    SYNC(y) @ DMA;
    FIFO(y);
    bar(z,y) @ P1;
    baz(z) @ P1;
    put(z);
  }
}
Solve Set of Constraints
60
Step 1a: Infer Data Placement
int x[100];
int y[100];
int z[100];
PIPELINE {
  while (1) {
    get(x);
    foo(y,x) @ P0;
    SYNC(y) @ DMA;
    FIFO(y);
    bar(z,y) @ P1;
    baz(z) @ P1;
    put(z);
  }
}
Solve Set of Constraints
 Memory Topology constrains where variables could live
61
Step 1a: Infer Data Placement
int x[100] @ {M0};
int y[100] @ {M0,M1};
int z[100] @ {M1};
PIPELINE {
  while (1) {
    get(x@?);
    foo(y@M0, x@M0) @ P0;
    SYNC(y,?,?) @ DMA;
    FIFO(y@?);
    bar(z@M1, y@M1) @ P1;
    baz(z@M1) @ P1;
    put(z@?);
  }
}
Solve Set of Constraints
 Memory Topology constrains where variables could live
62
Step 1b: Propagate Coherence
int x[100] @ {M0};
int y[100] @ {M0,M1};
int z[100] @ {M1};
PIPELINE {
  while (1) {
    get(x@?);
    foo(y@M0, x@M0) @ P0;
    SYNC(y,?,?) @ DMA;
    FIFO(y@?);
    bar(z@M1, y@M1) @ P1;
    baz(z@M1) @ P1;
    put(z@?);
  }
}
Solve Set of Constraints
 Memory Topology constrains where variables could live
 Forwards Dataflow propagates availability of valid versions
63
Step 1b: Propagate Coherence
int x[100] @ {M0};
int y[100] @ {M0,M1};
int z[100] @ {M1};
PIPELINE {
  while (1) {
    get(x@?);
    foo(y@M0, x@M0) @ P0;
    SYNC(y,?,M0) @ DMA;
    FIFO(y@?);
    bar(z@M1, y@M1) @ P1;
    baz(z@M1) @ P1;
    put(z@M1);
  }
}
Solve Set of Constraints
 Memory Topology constrains where variables could live
 Forwards Dataflow propagates availability of valid versions
64
Step 1b: Propagate Coherence
int x[100] @ {M0};
int y[100] @ {M0,M1};
int z[100] @ {M1};
PIPELINE {
  while (1) {
    get(x@?);
    foo(y@M0, x@M0) @ P0;
    SYNC(y,?,M0) @ DMA;
    FIFO(y@?);
    bar(z@M1, y@M1) @ P1;
    baz(z@M1) @ P1;
    put(z@M1);
  }
}
Solve Set of Constraints
 Memory Topology constrains where variables could live
 Forwards Dataflow propagates availability of valid versions
 Backwards Dataflow propagates need for valid versions
65
Step 1b: Propagate Coherence
int x[100] @ {M0};
int y[100] @ {M0,M1};
int z[100] @ {M1};
PIPELINE {
  while (1) {
    get(x@M0);
    foo(y@M0, x@M0) @ P0;
    SYNC(y,M1,M0) @ DMA;
    FIFO(y@M1);
    bar(z@M1, y@M1) @ P1;
    baz(z@M1) @ P1;
    put(z@M1);
  }
}
Solve Set of Constraints
 Memory Topology constrains where variables could live
 Forwards Dataflow propagates availability of valid versions
 Backwards Dataflow propagates need for valid versions
66
Step 1c: Split Variables
int x[100] @ {M0};
int y0[100] @ {M0};
int y1[100] @ {M1};
int z[100] @ {M1};
PIPELINE {
  while (1) {
    get(x);
    foo(y0, x) @ P0;
    memcpy(y1,y0,…) @ DMA;
    FIFO(y1);
    bar(z, y1) @ P1;
    baz(z) @ P1;
    put(z);
  }
}
Split variables with multiple locations
Replace SYNC with memcpy
67
Step 2: Implement Pipeline Annotation
int x[100] @ {M0};
int y0[100] @ {M0};
int y1[100] @ {M1};
int z[100] @ {M1};
PIPELINE {
  while (1) {
    get(x);
    foo(y0, x) @ P0;
    memcpy(y1,y0,…) @ DMA;
    FIFO(y1);
    bar(z, y1) @ P1;
    baz(z) @ P1;
    put(z);
  }
}
Dependency Analysis
68
Step 2a: Identify Dependent Operations
int x[100] @ {M0};
int y0[100] @ {M0};
int y1[100] @ {M1};
int z[100] @ {M1};
PIPELINE {
  while (1) {
    get(x);
    foo(y0, x) @ P0;
    memcpy(y1,y0,…) @ DMA;
    FIFO(y1);
    bar(z, y1) @ P1;
    baz(z) @ P1;
    put(z);
  }
}
Dependency Analysis
 Split use-def chains at FIFOs
69
Step 2b: Identify Maximal Threads
int x[100] @ {M0};
int y0[100] @ {M0};
int y1[100] @ {M1};
int z[100] @ {M1};
PIPELINE {
  while (1) {
    get(x);
    foo(y0, x) @ P0;
    memcpy(y1,y0,…) @ DMA;
    FIFO(y1);
    bar(z, y1) @ P1;
    baz(z) @ P1;
    put(z);
  }
}
Dependency Analysis
 Split use-def chains at FIFOs
 Identify Thread Operations
70
Step 2b: Split Into Multiple Threads
int x[100] @ {M0};
int y0[100] @ {M0};
int y1a[100] @ {M1};
int y1b[100] @ {M1};
int z[100] @ {M1};
PARALLEL {
  SECTION {
    while (1) {
      get(x);
      foo(y0, x) @ P0;
      memcpy(y1a,y0,…) @ DMA;
      fifo_put(&f, y1a);
    }
  }
  SECTION {
    while (1) {
      fifo_get(&f, y1b);
      bar(z, y1b) @ P1;
      baz(z) @ P1;
      put(z);
    }
  }
}
Perform Dataflow Analysis
 Split use-def chains at FIFOs
 Identify Thread Operations
 Split into threads
71
Step 2c: Zero Copy Optimization
int x[100] @ {M0};
int y0[100] @ {M0};
int y1a[100] @ {M1};
int y1b[100] @ {M1};
int z[100] @ {M1};
PARALLEL {
  SECTION {
    while (1) {
      get(x);
      foo(y0, x) @ P0;
      memcpy(y1a,y0,…) @ DMA;
      fifo_put(&f, y1a);
    }
  }
  SECTION {
    while (1) {
      fifo_get(&f, y1b);
      bar(z, y1b) @ P1;
      baz(z) @ P1;
      put(z);
    }
  }
}
Generate Data
Copy into FIFO
Copy out of FIFO
Consume Data
72
Step 2c: Zero Copy Optimization
int x[100] @ {M0};
int y0[100] @ {M0};
int y1a[100] @ {M1};
int y1b[100] @ {M1};
int z[100] @ {M1};
PARALLEL {
  SECTION {
    while (1) {
      get(x);
      foo(y0, x) @ P0;
      memcpy(y1a,y0,…) @ DMA;
      fifo_put(&f, y1a);
    }
  }
  SECTION {
    while (1) {
      fifo_get(&f, y1b);
      bar(z, y1b) @ P1;
      baz(z) @ P1;
      put(z);
    }
  }
}
Calculate Live Range of variables passed through FIFOs
 Live Range of y1a
 Live Range of y1b
73
Step 2c: Zero Copy Optimization
int x[100] @ {M0};
int y0[100] @ {M0};
int *py1a;
int *py1b;
int z[100] @ {M1};
PARALLEL {
  SECTION {
    while (1) {
      get(x);
      foo(y0, x) @ P0;
      fifo_acquireRoom(&f, &py1a);
      memcpy(py1a,y0,…) @ DMA;
      fifo_releaseData(&f, py1a);
    }
  }
  SECTION {
    while (1) {
      fifo_acquireData(&f, &py1b);
      bar(z, py1b) @ P1;
      fifo_releaseRoom(&f, py1b);
      baz(z) @ P1;
      put(z);
    }
  }
}
Calculate Live Range of variables passed through FIFOs
Transform FIFO operations to pass pointers instead of copying data
 Acquire empty buffer
 Generate data directly into buffer
 Pass full buffer to thread 2
 Acquire full buffer from thread 1
 Consume data directly from buffer
 Release empty buffer
74
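To make the zero-copy handshake concrete, here is a minimal single-slot sketch of a FIFO with the acquireRoom/releaseData/acquireData/releaseRoom operations used above (the buffer count, element type and pthread-based synchronisation are simplifying assumptions; the real runtime allocates buffers in the consumer's local memory and is not shown in the slides):

#include <pthread.h>

#define FIFO_BUFSIZE 100

/* Single-slot zero-copy FIFO: producer and consumer exchange a pointer to the
   same buffer instead of copying its contents. */
typedef struct {
    int buf[FIFO_BUFSIZE];
    int full;                     /* 0 = empty (room available), 1 = full (data) */
    pthread_mutex_t lock;
    pthread_cond_t  cond;
} fifo_t;

/* Producer: claim an empty buffer to generate data directly into... */
void fifo_acquireRoom(fifo_t *f, int **p) {
    pthread_mutex_lock(&f->lock);
    while (f->full) pthread_cond_wait(&f->cond, &f->lock);
    *p = f->buf;
    pthread_mutex_unlock(&f->lock);
}

/* ...then hand the now-full buffer over to the consumer. */
void fifo_releaseData(fifo_t *f, int *p) {
    (void)p;                      /* only one buffer in this sketch */
    pthread_mutex_lock(&f->lock);
    f->full = 1;
    pthread_cond_signal(&f->cond);
    pthread_mutex_unlock(&f->lock);
}

/* Consumer: wait for a full buffer to consume data directly from... */
void fifo_acquireData(fifo_t *f, int **p) {
    pthread_mutex_lock(&f->lock);
    while (!f->full) pthread_cond_wait(&f->cond, &f->lock);
    *p = f->buf;
    pthread_mutex_unlock(&f->lock);
}

/* ...then return it to the producer as empty room. */
void fifo_releaseRoom(fifo_t *f, int *p) {
    (void)p;
    pthread_mutex_lock(&f->lock);
    f->full = 0;
    pthread_cond_signal(&f->cond);
    pthread_mutex_unlock(&f->lock);
}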
Step 3a: Resolve Overloaded RPC
int x[100] @ {M0};
int y0[100] @ {M0};
int *py1a;
int *py1b;
int z[100] @ {M1};
PARALLEL {
  SECTION {
    while (1) {
      get(x);
      DE32_foo(0, y0, x);
      fifo_acquireRoom(&f, &py1a);
      DMA_memcpy(py1a,y0,…);
      fifo_releaseData(&f, py1a);
    }
  }
  SECTION {
    while (1) {
      fifo_acquireData(&f, &py1b);
      DE32_bar(1, z, py1b);
      fifo_releaseRoom(&f, py1b);
      DE32_baz(1, z);
      put(z);
    }
  }
}
Replace RPC by architecture-specific call
 bar(…) @ P1 → DE32_bar(1,…)
75
Step 3b: Split RPCs
int x[100] @ {M0};
int y0[100] @ {M0};
int *py1a;
int *py1b;
int z[100] @ {M1};
PARALLEL {
  SECTION {
    while (1) {
      get(x);
      start_DE32_foo(0, y0, x);
      wait(semaphore_DE32[0]);
      fifo_acquireRoom(&f, &py1a);
      start_DMA_memcpy(py1a,y0,…);
      wait(semaphore_DMA);
      fifo_releaseData(&f, py1a);
    }
  }
  SECTION {
    while (1) {
      fifo_acquireData(&f, &py1b);
      start_DE32_bar(1, z, py1b);
      wait(semaphore_DE32[1]);
      fifo_releaseRoom(&f, py1b);
      start_DE32_baz(1, z);
      wait(semaphore_DE32[1]);
      put(z);
    }
  }
}
RPCs have two phases
 start RPC
 wait for RPC to complete
DE32_foo(0,…);
 → start_DE32_foo(0,…);
   wait(semaphore_DE32[0]);
76
Order of transformations
Dataflow-sensitive transformations go first
 Inferring data placement
 Coherence checking within threads
 Dependency analysis for parallelism
Parallelism transformations
 Obscures data and control flow
Thread-local optimizations go last
 Zero-copy optimization of FIFO operations
 Continuation passing thread implementation
77
Aside: Why hardware companies are fun
 You get to play with cool hardware
 Often before it has been debugged 
 You get to play with powerful debugging tools
 Incredible level of detail visible
 E.g., Palladium traces on next slides
78
Unoptimized task scheduling
[Palladium trace: fft on DE0 and demod on DE1; annotated intervals of 273 cycles, 968 and 195 cycles.]
79
Optimized device driver on ARM
[Palladium trace: fft on DE0 and demod on DE1; annotated intervals of 257 cycles, 968 and 155 cycles.]
80
Task scheduling hardware support
[Palladium trace: fft on DE0 and demod on DE1; annotated intervals of 303 cycles, 968, 1 cycle and 183 cycles.]
81