Static Translation of Stream
Programming to a Parallel
System
S. M. Farhad
PhD Student
Supervisor: Dr. Bernhard Scholz
Programming Language Group
School of Information Technology
University of Sydney
Uniprocessor Performance
Motivation
512
Picochip
PC102
256
Ambric
AM2045
Cisco
CSR-1
Intel
Tflops
128
64
32
# of
cores 16
Raw
Niagara
8
Broadcom 1480
4
2
1
Raza
XLR
4004
8080
8086
286
386
486
Pentium
8008
1970
1975
1980
1985
1990
Cavium
Octeon
Cell
Opteron 4P
Xeon MP
Xbox360
PA-8800 Opteron
Tanglewood
Power4
PExtreme Power6
Yonah
P2 P3 Itanium
P4
Athlon
Itanium 2
1995
2000
2005
20??
Motivation
512
256
128
64
32
# of
cores 16
8
4
For uniprocessors,
Uniprocessors:
C was:
C •Portable
is the common
machine
language
•High Performance
•Composable
•Malleable
•Maintainable
Picochip
PC102
Cisco
CSR-1
Intel
Tflops
Raw
1
8080
8086
286
386
486
Broadcom 1480
Pentium
8008
1970
1975
1980
1985
1990
Raza
XLR
Niagara
2
4004
Ambric
AM2045
Cavium
Octeon
Cell
Opteron 4P
Xeon MP
Xbox360
PA-8800 Opteron
Tanglewood
Power4
PExtreme Power6
Yonah
P2 P3 Itanium
P4
Athlon
Itanium 2
1995
2000
2005
20??
Motivation
What is the common
machine language
for multicores?
512
256
128
Picochip
PC102
Ambric
AM2045
Cisco
CSR-1
Intel
Tflops
64
32
# of
cores 16
Raw
Niagara
8
Broadcom 1480
4
2
1
Raza
XLR
4004
8080
8086
286
386
486
Pentium
8008
1970
1975
1980
1985
1990
Cavium
Octeon
Cell
Opteron 4P
Xeon MP
Xbox360
PA-8800 Opteron
Tanglewood
Power4
PExtreme Power6
Yonah
P2 P3 Itanium
P4
Athlon
Itanium 2
1995
2000
2005
20??
Common Machine Languages
Uniprocessors:
Multicores:
Common Properties
Common Properties
Single flow of control
Multiple flows of control
Single memory image
Multiple local memories
Differences:
Differences:
Number and capabilities of cores
Register Allocation
Communication Model
ISA Instruction Selection
Synchronization Model
Functional Units Instruction Scheduling
Register File
von-Neumann languages represent the
common properties and abstract away
the differences
Stream Programming Language is a
common machine language for
multicores
Properties of Stream Programs [W. Thies ‘02]
• A large (possibly infinite) amount of data
• Limited lifespan of each data item
• Little processing of each data item
• A regular, static computation pattern
• Stream program structure is relatively
constant
• A lot of opportunities for compiler
optimizations
Application of Streaming Programming
AtoD
Model of Computation
FMDemod
• Synchronous Dataflow [Lee ‘92]
– Graph of autonomous filters
– Communicate via FIFO channels
• Static I/O rates [Edward ‘87]
– Compiler decides on an order
of execution (schedule)
– Static estimation of
computation
Scatter
LPF1
LPF2
LPF3
HPF1
HPF2
HPF3
Gather
Adder
Speaker
StreamIt Language Overview [Thies ‘04]
• StreamIt is a novel
language for streaming
– Exposes parallelism and
communication
– Architecture independent
– Modular and composable
• Simple structures
composed to creates
complex graphs
filter
pipeline
may be
any StreamIt
language construct
splitjoin
splitter
parallel computation
joiner
– Malleable
• Change program behavior
with small modifications
feedback loop
joiner
splitter
Mapping of Filters to Multicores
•
•
•
•
Task Parallelism [Edward ‘87]
Fine-Grained Data Parallelism [Michael ‘06]
3-phase solution [Michael ’06]
Orchestrating the Execution of Stream Programs
[Kudlur ‘08]
11
Baseline 1: Task Parallelism
Splitter
BandPass
BandPass
Compress
Compress
Process
Process
Expand
Expand
BandStop
BandStop
Joiner
Adder
12
• Inherent task parallelism
between two processing
pipelines
• Task Parallel Model:
– Only parallelize explicit
task parallelism
– Fork/join parallelism
• Execute this on a 2 core
machine ~2x speedup over
single core
Baseline 2: Fine-Grained
Data Parallelism
Splitter
Splitter
Splitter
BandPass
BandPass
BandPass
BandPass
BandPass
BandPass
BandPass
BandPass
Joiner
Joiner
Splitter
Splitter
Compress
Compress
Compress
Compress
Process
Process
Process
Process
Compress
Compress
Compress
Compress
Joiner
Joiner
Splitter
Splitter
Process
Process
Process
Process
Joiner
Joiner
Splitter
Splitter
Expand
Expand
Expand
Expand
BandStop
BandStop
BandStop
BandStop
Expand
Expand
Expand
Expand
Joiner
Joiner
Splitter
Splitter
Joiner
BandStop
BandStop
BandStop
Adder
Adder
Joiner
13
– Example: 4 cores
• Each fission group occupies
entire machine
Joiner
Splitter
– Fiss each stateless filter N
ways (N is number of cores)
– Remove scatter/gather if
possible
• We can introduce data
parallelism
BandStop
BandStop
BandStop
BandStop
Joiner
• Each of the filters in the
example are stateless
• Fine-grained Data Parallel
Model:
3-Phase Solution [Michael ‘06]
Splitter
6
AdaptDFT
AdaptDFT
6
Joiner
RectPolar 20
Data Parallel
Splitter
Splitter
2
1
1
1
UnWrap
Unwrap
2
Diff
Diff
1
Amplify
Amplify
1
Accum
Accum
1
Data Parallel, but too little work!
Joiner
Joiner
PolarRect
20
Data Parallel
14
Target a 4 core machine
Data Parallelize
Splitter
6
AdaptDFT
AdaptDFT
6
Joiner
Splitter
RectPolar
RectPolar
RectPolar
RectPolar
20 5
Joiner
Splitter
Splitter
2
UnWrap
Unwrap
2
1
Diff
Diff
1
1
Amplify
Amplify
1
1
Accum
Accum
1
Joiner
Joiner
Splitter
RectPolar
RectPolar
RectPolar
PolarRect
20 5
Joiner
15
Target a 4 core machine
Data + Task Parallel Execution
Splitter
6
6
Cores
Joiner
Splitter
5
Joiner
Splitter
Splitter
2
2
1
1
1
1
1
1
Time
21
Joiner
Joiner
Splitter
5
RectPolar
Joiner
16
Target 4 core machine
Better Mapping
Splitter
6
6
Cores
Joiner
Splitter
5
Joiner
Splitter
Splitter
2
2
1
1
1
1
1
1
Time
16
Joiner
Joiner
Splitter
5
RectPolar
Joiner
17
Target 4 core machine
Phase 3: Coarse-Grained
Software Pipelining
Prologue
New
Steady
State
RectPolar
RectPolar
• New steady-state is free of
dependencies
• Schedule new steady-state
using a greedy partitioning
RectPolar
RectPolar
18
Greedy Partitioning [Michael ‘06]
Cores
To Schedule:
Time
16
19
Target 4 core machine
Static Translation of Stream Programs [Proposal]
• We study
– A mathematical model and algorithms to resolve
bottlenecks in stream programs
– Map actors of stream programs to processors in a
parallel systems
– Compute a schedule for each processor
• Goal is to statically optimize the throughput of a
stream program
• Assuming constant input bandwidth
Research Question: Removing the
bottleneck from the stream graph
A
A
S
B
C
B́
B
C
J
D
Filter B is the bottleneck
Original stream graph
D
Filter B is duplicated
After removing the bottleneck
Research Method
• Perform a quantitative analysis that
detects bottlenecks in the stream graph
• The bottleneck resolver duplicates actors
that impose a bottleneck.
• The process continues until the program is
bottleneck free
• Then mapping the actors to processors is
performed via Integer Linear Programming
Plan
•
•
•
•
•
•
Background study
Research question
Proposal
Implementation
Results
Publication
Question?
Descargar

Document