Portability for FPGA
Applications—Warp Processing
and SystemC Bytecode
Frank Vahid
Dept. of CS&E
University of California, Riverside
Associate Director, Center for Embedded
Computer Systems, UC Irvine
Contributing Ph.D. Students
Roman Lysecky (Ph.D. 2005, now Asst. Prof. at Univ. of
Arizona
Greg Stitt (Ph.D. 2007, now Asst. Prof. at Univ. of
Florida, Gainesville
Scotty Sirowy (current)
David Sheldon (current)
Chen Huang (current)
This research was supported in part by the National Science
Foundation, the Semiconductor Research Corporation, Intel,
Freescale, IBM, and Xilinx
Portable Applications on PCs
One binary
x86 binary
How? Why?
Pentium
Opteron
Atom
Dual Core
Multiple platforms
Frank Vahid, UC Riverside
2/64
Portable Applications on PCs


Standard software binary
Dynamic software binary
translation
x86
Binary
VLIW
x86
µP
Applications
Tools
VLIW
Binary
SW binary translation
Architectures
“Ecosystem”
Frank Vahid, UC Riverside
3/64

P
m
ot
if
d
g
g
fo
rm
ns
et
ec
tra
or
tin
S
tio
n
ei
sm
FP
ic
T
S
m
im
ig
ul
ra
at
tio
ed
n
an
ne
al
in
g
A
ve
ra
ge
n
g
in
g
m
m
in
S
gr
a
te
r
500
ro
te
i
pr
o
ur
ie
r
ic
Fo
et
tin
20
lu
s
40
C
t
25
ro
u
50
le
la
c
em
en
0
eh
ic
P
30
Speedup
30
V
G Bl
a ur
ur uss
t-A ia
de n
ls
M on
ed
K
uw i an
ah
A ara
ve
ra
ge
Speedup
60
G
en
B
re
w
itt
W FIR
av
el
et
M
ax
B
l
A en
nt d
ia
l
B ias
rig
h
R t en
ob
er
t
S s
ob
E e
m l
b
S oss
ha
rp
en
P
Meanwhile, Circuits on FPGAs Show Large
Speedups
200
79.2
15
20
10
10
5
0
Int. Symp. on FPGAs, FCCM, FPL, CODES/ISSS, ICS,
MICRO, CASES, DAC, DATE, ICCAD, RAW, …
Frank Vahid, UC Riverside
4/64
FPGAs Entering
Computing Mainstream






AMD Opteron
Intel QuickAssist
Cray, SGI
Mitrionics
IBM Cell (research)
Xilinx, Altera
SGI Altix supercomputer (UCR: 64 Itaniums plus 2 FPGA RASCs)
Xilinx Virtex II Pro. Source: Xilinx
Frank Vahid, UC Riverside
5/64
Circuits on FPGAs are Software Binaries
Microprocessor Binaries
(Instructions)
01110100...
001010010
001010010
001010010
…
…
…
…
FPGA “Binaries”
(Circuits) not hardware
"Software"
…
…
aka
"bitstream"
Bits loaded into
LUTs and SMs
Bits loaded into
program memory
"Hardware"
0010
…
Processor
Processor
0111
…
FPGA
Processor
Sep 2007 IEEE Computer
Frank Vahid, UC Riverside
6/64
“Portable Applications” + “FPGAs”


Standard software binary
Dynamic translation
x86
Binary
VLIW
x86
µP
Speedup
VLIW
Binary
SW binary translation
x86 VLIW DSP FPGA
FPGA
Applications
x86
µP
FPGA
binary
SW binary translation
Tools
Architectures
“Ecosystem”
“Warp Processing”
Frank Vahid, UC Riverside
7/64
Warp Processing
1
Initially, software binary loaded
into instruction memory
Profiler
I
Mem
µP
D$
FPGA
Software Binary
Mov reg3, 0
Mov reg4, 0
loop:
Shl reg1, reg3, 1
Add reg5, reg2, reg1
Ld reg6, 0(reg5)
Add reg4, reg4, reg6
Add reg3, reg3, 1
Beq reg3, 10, -5
Ret reg4
On-chip CAD
Frank Vahid, UC Riverside
8/64
Warp Processing
2
Microprocessor executes
instructions in software binary
Profiler
I
Mem
µP
D$
FPGA
Software Binary
Mov reg3, 0
Mov reg4, 0
loop:
Shl reg1, reg3, 1
Add reg5, reg2, reg1
Ld reg6, 0(reg5)
Add reg4, reg4, reg6
Add reg3, reg3, 1
Beq reg3, 10, -5
Ret reg4
Time Energy
On-chip CAD
Frank Vahid, UC Riverside
9/64
Warp Processing
3
Profiler monitors instructions and
detects critical regions in binary
Profiler
beq
beq
beq
beq
beq
beq
beq
beq
beq
beq
add
add
add
add
add
add
add
add
add
add
µP
I
Mem
D$
FPGA
On-chip CAD
Software Binary
Mov reg3, 0
Mov reg4, 0
loop:
Shl reg1, reg3, 1
Add reg5, reg2, reg1
Ld reg6, 0(reg5)
Add reg4, reg4, reg6
Add reg3, reg3, 1
Beq reg3, 10, -5
Ret reg4
Time Energy
Critical Loop
Detected
Frank Vahid, UC Riverside
10/64
Warp Processing
4
On-chip CAD reads in critical region
Profiler
I
Mem
µP
D$
FPGA
Software Binary
Mov reg3, 0
Mov reg4, 0
loop:
Shl reg1, reg3, 1
Add reg5, reg2, reg1
Ld reg6, 0(reg5)
Add reg4, reg4, reg6
Add reg3, reg3, 1
Beq reg3, 10, -5
Ret reg4
Time Energy
On-chip CAD
Frank Vahid, UC Riverside
11/64
Warp Processing
5
On-chip CAD decompiles critical region
into control data flow graph (CDFG)
Profiler
I
Mem
µP
D$
FPGA
Dynamic
Part.
On-chip CAD
Module (DPM)
Software Binary
Mov reg3, 0
Mov reg4, 0
loop:
Shl reg1, reg3, 1
Add reg5, reg2, reg1
Ld reg6, 0(reg5)
Add reg4, reg4, reg6
Add reg3, reg3, 1
Beq reg3, 10, -5
Ret reg4
Time Energy
Recover
loops,
reg3 := 0
arrays,
reg4 := 0
subroutines,
loop:
etc. –
reg4 := reg4 + mem[
needed to
reg2 + (reg3 << 1)]
synthesize
reg3 := reg3 + 1
if (reg3 < 10) goto loop good circuits
ret reg4
Frank Vahid, UC Riverside
12/64
Warp Processing
6
On-chip CAD synthesizes decompiled
CDFG to a custom (parallel) circuit
Profiler
I
Mem
µP
D$
FPGA
Dynamic
Part.
On-chip CAD
Module (DPM)
Software Binary
Mov reg3, 0
Mov reg4, 0
loop:
Shl reg1, reg3, 1
Add reg5, reg2, reg1
Ld reg6, 0(reg5)
Add reg4, reg4, reg6
Add reg3, reg3, 1
Beq reg3, 10, -5
Ret reg4
+
+
+
+
+
Time Energy
reg3 := 0
+ := 0+
reg4
+
...
loop:
+reg4 := reg4++ mem[
reg2 + (reg3 << 1)]
reg3 := reg3 + 1
if (reg3
. . loop
+< 10). goto
ret reg4
+
...
Frank Vahid, UC Riverside
13/64
Warp Processing
7
On-chip CAD maps circuit onto FPGA
Profiler
I
Mem
µP
D$
FPGA
Dynamic
Part.
On-chip CAD
Module (DPM)
Software Binary
Mov reg3, 0
Mov reg4, 0
loop:
Shl reg1, reg3, 1
Add reg5, reg2, reg1
Ld reg6, 0(reg5)
Add reg4, reg4, reg6
Add reg3, reg3, 1
Beq reg3, 10, -5
Ret reg4
+
+
Time Energy
+
reg3 := 0
+ := 0+ +
reg4
SM
SM
SM . . .
loop:
reg4+
+ mem[
+ CLB
+ +reg4 :=CLB
+
reg2 + (reg3 << 1)]
reg3 := reg3 + 1
SM +
ifSM
(reg3
goto
. . loop
+< 10).SM
ret reg4
+
...
Frank Vahid, UC Riverside
14/64
Warp Processing
8
On-chip CAD replaces instructions in
binary to use hardware, causing
performance and energy to “warp” by
an order of magnitude or more
Profiler
I
Mem
µP
D$
FPGA
Dynamic
Part.
On-chip CAD
Module (DPM)
>10x speedups
for some apps
Software Binary
Mov reg3, 0
Mov reg4, 0
loop:
//
instructions
Shl
reg1, reg3, that
1
interact
FPGA
Add reg5,with
reg2,
reg1
Ld reg6, 0(reg5)
Add reg4, reg4, reg6
Add reg3, reg3, 1
Beq reg3, 10, -5
Ret reg4
+
+
Time
Energy
Time Energy
Software-only
“Warped”
+
reg3 := 0
+ := 0+ +
reg4
SM
SM
SM . . .
loop:
reg4+
+ mem[
+ CLB
+ +reg4 :=CLB
+
reg2 + (reg3 << 1)]
reg3 := reg3 + 1
SM +
ifSM
(reg3
goto
. . loop
+< 10).SM
ret reg4
Warp speed, Scotty
+
...
Frank Vahid, UC Riverside
15/64
Warp Processing Challenges


Can we decompile
binaries sufficiently
for synthesis?
Can we just-in-time
(JIT) compile to
FPGAs?
Binary
Binary
Profiling & partitioning
Decompilation
Profiler
µP
I$
D$
CDFG
Binary
Updater
FPGA
On-chip
CAD
JIT FPGA
compilation
Microp
Binary
Binary
FPGA
Binary
binary
Frank Vahid, UC Riverside
16/64
Decompilation

Recover high-level information from binary: branches, loops, arrays,
subroutines, …


Adapted previous methods for processor-processor translation (UQBT)
Developed new synthesis-oriented methods (e.g., “reroll” loops,
strength “promotion”)
Original C Code
long f( short a[10] ) {
long accum;
for (int i=0; i < 10; i++) {
accum += a[i];
}
return accum;
}
Corresponding
Assembly
Mov reg3, 0
Mov reg4, 0
loop:
Shl reg1, reg3, 1
Add reg5, reg2, reg1
Ld reg6, 0(reg5)
Add reg4, reg4, reg6
Add reg3, reg3, 1
Beq reg3, 10, -5
Ret reg4
Almost Identical
Representations
Data
Flow
Analysis
Control/Data
FlowRecovery
Graph Creation
Control
Structure
Function
Recovery
Array
Recovery
long f( long
reg2
) { :=
array[10]
reg3
long f( short
long reg2
) { ) {0
long reg4 = 0; reg4 := 0
int reg3 = 0;
for
(long=reg3
int reg4
0; = 0; reg3 < 10; reg3++) {
mem[reg2
reg4 += array[reg3];
loop:
loop: + (reg3 << 1)];
}reg4 = reg4 + mem[reg2
+ <<
reg4 := reg3
reg4
+ mem[
reg1
1
reg3 << 1)];
return reg4;
reg2
+ (reg3
<< 1)]
reg5
:= reg2
+ reg1
} reg3 = reg3 + 1;reg6
reg3 := mem[reg5
reg3 + 1 + 0]
if (reg3 < 10) goto loop;
if (reg3
< 10)+goto
reg4
:= reg4
reg6loop
return reg4;
reg3 := reg3 + 1
}
if (reg3 < 10) goto loop
ret reg4
ret
reg4
Frank Vahid, UC Riverside
17/64
From C
e
era
g
Av
1
TR
CH
0
PN
CT
R
N0
1
01
ID
MN
P
BIT
Ur
l
Bre
v
Fil
FIR
erb
i
From binary
Vit
Synthesis from
decompiled binary
is competitive with
synthesis from C
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
ter
Be
am
for
me
r

Speedup
Decompilation Results vs. C
Synthesis from C
Synthesis from Decompiled Binary
Example
Cycles ClkFrq Time Area
Cycles ClkFrq Time Area %TimeOverhead %AreaOverhead
bit_correlator
258 118
2.2
15
258
118
2.2
15
0%
0%
fir
129 125
1.0 359
129
125
1.0
371
0%
3%
udiv8
281 190
1.5 398
281
190
1.5
398
0%
0%
prewitt
64516 123 524.5 2690
64516
123 524.5 4250
0%
58%
mf9
258
57
4.5 1048
258
57
4.5 1048
0%
0%
moravec
195072
66 2951.2 680
195072
70 2790.7
676
-6%
-1%
Avg:
-1%
10%
Frank Vahid, UC Riverside
18/64
Decompilation Results on Optimized H.264
Function Nam e Ins tr %Tim e Cum ulative Spe e dup
MotionComp_00
33
6.8%
1.1
InvTransf orm4x4
63
12.5%
1.1
FindHorizontalBS
47
16.7%
1.2
GetBits
51
20.8%
1.3
FindVerticalBS
44
24.7%
1.3
MotionCompChromaFullXFullY
24
28.6%
1.4
FilterHorizontalLuma 557
32.5%
1.5
FilterVerticalLuma 481
35.8%
1.6
FilterHorizontalChroma
133
39.0%
1.6
CombineCoef sZerosInvQuantScan
69
42.0%
1.7
memset
20
44.9%
1.8
MotionCompensate 167
47.7%
1.9
FilterVerticalChroma 121
50.3%
2.0
MotionCompChromaFracXFracY
48
53.0%
2.1
ReadLeadingZerosAndOne
56
55.6%
2.3
DecodeCoef f TokenNormal
93
57.5%
2.4
DeblockingFilterLumaRow
272
59.4%
2.5
DecodeZeros
79
61.3%
2.6
MotionComp_23
279
63.0%
2.7
DecodeBlockCoef Levels
56
64.6%
2.8
MotionComp_21
281
66.2%
3.0
FindBoundaryStrengthPMB
44
67.7%
3.1

Speedup
In-depth Study with Freescale
Ideal
10
9
8
7
6
5
4
3
2
1
0
From C
From binary
1
3
5
7
9
11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51
# of Functions on FPGA
Again, competitive with synthesis from C
Frank Vahid, UC Riverside
19/64
30
25
20
15
10
5
-O
3
M
icr
oB
la
ze
icr
oB
la
ze
-O
1
3
M
AR
M
-O
1
-O
AR
M
3
-O
M
IP
S
-O
1
0
IP
S

Do compiler
optimizations hurt
decompilation?
(Surprisingly)
found optimized
code synthesizes
to even better
circuits
M

Speedup when decompiled binary is
partitioned and synthesized to FPGA
Decompilation Effective Even with Compiler
Optimizations
Average Speedup of 10 Examples
Frank Vahid, UC Riverside
20/64
Decompilation
Summary: Decompilation is surprisingly
effective at recovering high-level program
structures for synthesis
Stitt et al ICCAD’02, DAC’03, CODES/ISSS’05, ICCAD’05,
FPGA’05, TODAES’06, TODAES’07
Ph.D. work of Greg Stitt (Ph.D. UCR 2007, now
Asst. Prof. at UF Gainesville)
Frank Vahid, UC Riverside
21/64
Warp Processing Challenges


Can we decompile
binaries sufficiently
for synthesis?
Can we just-in-time
(JIT) compile to
FPGAs?
Binary
Binary
Profiling & partitioning
Decompilation
Profiler
µP
I$
D$
CDFG
Binary
Updater
FPGA
On-chip
CAD
JIT FPGA
compilation
Microp
Binary
Binary
FPGA
Binary
binary
Frank Vahid, UC Riverside
22/64
Challenge: JIT Compile to FPGA
Logic synthesis
Tech. map.
Placement
Commercial tool
60 MB
Routing
9.1 s

Expand
Developed ultra-lean CAD heuristics for synthesis,
placement, routing, and technology mapping, e.g.,




on-set
dc-set
off-set
Reduce
Logic synthesis: run single expand phase
Technology mapping: bottom-up graph clustering heuristic
Placement: place critical path first, then adjacent items
Routing: use resource graph that matches switch matrix /
channel structure
Irredundant
Ultra-lean Riverside JIT FPGA tools (drawn to scale)
0.2 s
3.6MB
Penalty: 1.3-2x in performance & size
(even more might be acceptable)
Ultra-lean Riverside JIT FPGA tools on a 75MHz ARM7
1.4s
3.6MB
Frank Vahid, UC Riverside
23/64
JIT Compile to FPGA
Summary: Ultra-lean JIT FPGA compiler 
40x speedup, 20x less memory, 1.3x-2x circuit
penalty
Lysecky et al, DAC’03, ISSS/CODES’03, DATE’04, DAC’04,
DATE’05, FCCM’05, TODAES’06
Ph.D. work of Roman Lysecky (Ph.D. UCR 2005,
now Asst. Prof. at Univ. of Arizona)
Frank Vahid, UC Riverside
24/64
Warp Processing Results
Performance Speedup (Most Frequent Kernel Only)
vs. 200 MHz ARM
Average kernel speedup of 41
Profiler
80
191
130
Warp Proc.
70
60
Speedup
I$
D$
µP
FPGA
50
On-chip
CAD
40
30
20
10
1 = ARM-only
execution
Overall application speedup average is 7.4
at
Av mul
er
ag
e:
fir
m
ro
cm
pk
tfl
ow
ca
nr
d
bi r
tm
np
tb
lo
ok
tt s
pr
k
m
at
rix
id
ct
g7
2
m 1
pe
g2
ur
l
br
ev
g3
fa
x
0
Frank Vahid, UC Riverside
25/64
Warping Thread-Based Applications
Multi-core
platforms
 multithreaded
apps
for (i = 0; i < 10; i++) {
}
uP
OS schedules threads onto
accelerators (possibly dozens), in
addition to µPs
Compiler
Binary
µP
OS schedules
threads onto
available µPs
Performance
thread_create( f, i );
f()
µP
f()
µP
f()
Remaining threads
added to queue
f()
Very large speedups
possible – parallelism
at bit, arithmetic, and
now thread level too
On-chip CAD
OS
µP
FPGA
Warp
µP
f()
µP
f()
Acc.
Lib
OS invokes on-chip
CAD tools to create
accelerators for f()
Thread warping: use one core to
create accelerator for waiting threads
Frank Vahid, UC Riverside
26/64
Memory Access Synchronization (MAS)

Must deal with widely known memory bottleneck problem

FPGAs great, but often can’t get data to them fast enough
for (i = 0; i < 10; i++) {
thread_create( thread_function, a, i );
RAM
DMA
Data for dozens of
threads can
create bottleneck
a()


FPGA
….
b()
}
void f( int a[], int val )
{
int result;
for (i = 0; i < 10; i++) {
result += a[i] * val;
}
Same
....
}
array
Threaded programs exhibit unique feature: Multiple threads often
access same or overlapping data
Solution: Fetch data once, broadcast to multiple threads (MAS)
Frank Vahid, UC Riverside
27/64
Memory Access Synchronization (MAS)

Detect overlapping memory regions – “windows”

Synthesis creates active “smart buffer”


[Guo/Najjar FPGA04]
Actively fetches data, stores the reused data, delivers windows to threads
Active rather than passive component; designed for specific threads
a[0] a[1] a[2] a[3] a[4] a[5]
for (i = 0; i < 100; i++) {
thread_create( thread_function, a, i );
}
void f( int a[], int i )
{
int result;
result += a[i]+a[i+1]+a[i+2]+a[i+3];
....
}
Each thread accesses different
addresses – but addresses may
overlap
RAM
DMA
A[0-103]
………
Data streamed
to “smart
buffer”
Smart Buffer
A[0-3]
enable
f()
A[6-9]
A[1-4]
f()
………………
f()
Buffer delivers window to each thread
W/O smart buffer: 400 memory accesses
With smart buffer: 104 memory accesses
Frank Vahid, UC Riverside
28/64
Speedups from Thread Warping
38
50
40
30
20
10
0
Fir




Prewitt
Linear Moravec Wavelet Maxfilter 3DTrans N-body


Geo.
Mean
8-uP
16-uP
32-uP
64-uP
Chose benchmarks with extensive parallelism
Four core (ARM11 400 MHz) base system
Virtex IV FPGA at circuit-specific clock frequency (~100-300 MHz)
Average 130x speedup
But, FPGA uses additional area.

Avg.
4-uP
TW
Our FPGA size = ~36 ARM11s
Still 20x faster than 32-core system (and 11x faster than 64-core)
Simulation pessimistic, actual results likely better
FPGA more flexible
Frank Vahid, UC Riverside
29/64
Warp Scenarios
Warping takes time (seconds, minutes, or more) – when useful?

Long-running applications


Scientific computing, etc.
Recurring applications (save and reuse FPGA configurations)



Common in embedded systems
Might view as (long) boot phase
For networked/docked devices, CAD can occur on server (ongoing
work)
Long Running Applications
µP
Recurring Applications
µP (1st execution)
FPGA
On-chip CAD
µP FPGA
On-chip CAD
Time
Single-execution speedup
Time
Speedup
Frank Vahid, UC Riverside
30/64
Applications
Why Dynamic?
Tools

Static good, but hiding FPGA opens technique to all sw
platforms

Architectures
“Ecosystem”
Standard languages/tools/binaries
Static Compiling to FPGAs
Specialized Language
Specialized Compiler
Binary
Netlist
Dynamic Compiling to FPGAs
Any Language
Any Compiler
Binary
FPGA
FPGA
µ
P
µ
P
On-chip CAD
Frank Vahid, UC Riverside
31/64
Synthesis-Friendly
Applications

Coding style impacts
synthesis results
Ideal Speedup (Zero-time Hw Execution)
10
9
Speedup After Rewrite (C Partitioning)
7
Speedup from C Partititioning
6
5
4
3
2
1
51
49
47
45
43
41
39
37
35
33
31
29
27
25
23
21
19
17
15
13
11
9
7
5
3
0
1
Speedup
8
Number of Functions in Hardware
Frank Vahid, UC Riverside
32/64
Synthesis-Friendly
Application Coding Guidelines
Coding Guidelines
Conversion to Constants (CC)
Conversion to Explicit Data Flow (CEDF)
Conversion to Fixed Point (CF)
Conversion to Explicit Memory Accesses (CEMA)
Constant Input Enumeration (CIE)
Loop Rerolling (LR)
Conversion to Explicit Control Flow (CECF)
Function Specialization (FS)
Algorithmic Specialization (AS)
Pass-By-Value Return (PVR)
Frank Vahid, UC Riverside
33/64
Conversion to Explicit Control Flow (CECF)


Problem: Function pointers
may prevent static control
flow analysis
Guideline: Don’t use
function pointers. Replace
with if-else, static calls

Makes possible targets
explicit
Synthesis unlikely to
determine possible targets of
function pointer
enum Target { FUNC1, FUNC2, FUNC3 };
void f( int
(*fp)
(int) fp
) {) {
enum
Target
}
}
.....
for (i=0; i < 10; i++) {
a[i]
fp(i);
if (fp===
FUNC1)
} a[i] = f1(i);
else if (fp == FUNC2)
a[i] = f2(i);
else
Synthesized
Hardware
a[i] = f3(i);
}
?
Synthesized Circuit
a[i]
f1(i) f2(i) f3(i)
fp
3x1
a[i]
Frank Vahid, UC Riverside
34/64
Speedups from Synthesis-Friendly Coding
Guidelines
Ideal Speedup (Zero-time Hw Execution)
10
9
Speedup After Rewrite (C Partitioning)
8
7
Simple
guidelines
increased
speedup to
6.5x
Speedup from C Partititioning
6
5
4
3
2
1
51
49
47
45
43
41
39
37
35
33
31
29
27
25
23
21
19
17
15
13
11
9
7
5
3
0
1

10 guidelines
For ~1,000 line benchmark: 5-6 changes typical, tens of minutes each
Speedup

Number of Functions in Hardware
Frank Vahid, UC Riverside
35/64
Speedups from Synthesis-Friendly Coding
Guidelines
573 842
20
15
Sw
Hw /sw w ith original code
Hw /sw w ith guidelines
10
5
0
g3fax

jpeg
brev
fir
mpeg2
Original C code (Powerstone, Mediabench)


crc
Original average speedups with FPGA: 2.6x (excludes brev)
Refined C code with guidelines


Average speedup: 8.4x (excludes brev)
Guidelines led to 3.5x improvement of speedup
Frank Vahid, UC Riverside
36/64
“Spatial” Algorithms for FPGAs

Example – Count patterns

Sequential algorithm


Hash table
10s cycles per pattern
int patterns[1,000];
int counts[1,000];
while (1) {
WaitForPattern();
CurrPattern = X;
hash = HashFct(CurrPattern);
item = Find(patterns,
CurrPattern, hash);
if (item) {
counts[item]++;
}
}

Spatial algorithm


Pipelined stages
Essence is the connectivity of
components, not the
sequencing of instructions
bus
CurrPattern
logic
count
pattern
Level 1
Level 2
logic
count
pattern
.
.
.
count
pattern
logic
Level m
Frank Vahid, UC Riverside
37/64
Spatial Algorithms for FPGAs

Current pattern
Spatial algorithm 2

Pipelined binary tree
logic
Level 1
Level 2
logic
logic
Level 3
1 Count
Memory
1 pattern
2 Count
2 patterns
Memory
2 patterns
4 Count
Memory
4 patterns
4 patterns
.
.
.
logic
Level n
2n Count
Memory
2n patterns
2n patterns
.
.
.
Frank Vahid, UC Riverside
38/64
Example
Possible patterns pre-stored in
binary search tree circuit
Current pattern
48
73
logic
Memory
1 pattern
logic
Memory
2 patterns
logic
Memory
4 patterns
Level 1
Level 2
Level 3
.
.
.
logic
Level n
Stage 1
Memory
2n patterns
.
.
.
Stage 2
Stage 3
Stage 4
Frank Vahid, UC Riverside
39/64
Example
Current pattern
23
48
logic
Memory
1 pattern
logic
Memory
2 patterns
logic
Memory
4 patterns
Level 1
Level 2
Level 3
.
.
.
logic
Level n
Memory
2n patterns
.
.
.
Stage 1
73
Stage 2
Stage 3
Stage 4
Frank Vahid, UC Riverside
40/64
Example
Current pattern
75
logic
Memory
1 pattern
logic
Memory
2 patterns
logic
Memory
4 patterns
Level 1
Level 2
23
Level 3
.
.
.
logic
Level n
Memory
2n patterns
.
.
.
Stage 1
48
Stage 2
73
Stage 3
Stage 4
Frank Vahid, UC Riverside
41/64
Example
Current pattern
11
logic
Memory
1 pattern
logic
Memory
2 patterns
logic
Memory
4 patterns
Level 1
Level 2
75
Level 3
.
.
.
logic
Level n
Memory
2n patterns
.
.
.
Stage 1
23
Stage 2
48
Stage 3
73
Stage 4
1
Frank Vahid, UC Riverside
42/64
Example
Current pattern
logic
Memory
1 pattern
logic
Memory
2 patterns
logic
Memory
4 patterns
Level 1
Level 2
11
Level 3
.
.
.
logic
Level n
Memory
2n patterns
.
.
.
Stage 1
75
Stage 2
1
23
Stage 3
48
Stage 4
1
1
Frank Vahid, UC Riverside
43/64
Study of Spatial Algorithms in FCCM
Year
Application
Type
2001
2001
2001
2001
2002
2002
2002
2002
2002
2002
2003
2003
2003
2003
2003
2004
2004
2004
2004
2004
2004
2004
2004
2005
2005
2005
2005
2005
2005
2006
2006
2006
2006
2006
2006
3D Vec. Normalization
Efficient CAM
Automated Sensor
Regular Expression
Hyperspectral Image
Machine Vision
RC4
Set Covering
Template Matching
Triangle Mesh
Congruential Sieves
Content Scanning
F.P and Square Root
Gaussian Noise
TRNG
3D FDTD Method
Deep Packet Filter
Online Floating Point
Molecular Dynamics
Pattern Matching
Seismic Migration
Software Deceleration
V.M Window
Data Mining
Cell Automata
Particle Graphics
Radiosity
Transient Waves
Road Traffic
All Pairs Shortest Path
Apriori Data Mining
Molecular Dynamics
Gaussian Elimination
Radiation Dose
Random Variates
Spatial
-Temporal
Spatial
Spatial
Spatial
Temporal
Spatial
Spatial
Spatial
Temporal
Temporal
Spatial
Spatial
-Spatial
--Spatial
Spatial
Spatial
--Spatial
Temporal
Spatial
Temporal
Spatial
Temporal
Spatial
Spatial
Spatial
Spatial
Temporal
Spatial

FCCM 2001-2006





70 papers describing fast
application on FPGA
Examined 35 in depth (every other
one)
6 used device-specific features
9 represented expected
synthesized circuit from the
obvious sequential algorithm
20 were spatially-oriented
applications

e.g., earlier pipelined binary tree
Frank Vahid, UC Riverside
44/64
Portable Spatial Applications?

Current portable microprocessor binaries – sequential


How support spatial constructs


Extensions for threads, processes, ...
Ports, connections, timing model
.....


www.systemc.org

Adds libraries and macros, still standard C++
Sequential and spatial constructs
Compiling links in the simulation kernel


Self-executing simulation
Intended for SoC simulation
Frank Vahid, UC Riverside
45/64
Bytecode

Modern portability approach

Java, C#
Compiler
Virtual Machine (VM): Program
that executes bytecode
bytecode
May JIT compile to native
architecture
VM
Pentium
VM
Opteron
VM
Atom
Frank Vahid, UC Riverside
46/64
SystemC Bytecode?
SystemC
Compiler
SystemC
bytecode
VM
Pentium
VM
FPGA
VM
Opteron
+
FPGA
Frank Vahid, UC Riverside
47/64
UCR SystemC Bytecode and Compiler
class EDGE_DETECTOR : public sc_module {
//signal declarations
…
SystemC
EDGE_DETECTOR() {
SC_method(mainComp);
sensitive << dataReady;
SC_method(getPixel);
sensitive << clock.pos();
void getPixel(){
…
dataReady.write(1);
}
void mainComp(){
int i, j;
for(i = 0; i < 3; i++){
for(j = 0; j < 3; j++){
sumX = sumX + mem.read()*GX[i][j]
}
}
…
edge.write(sumX + sumY)
}
--header
signal clock : 1
signal reset : 1
signal memory_in : 32
signal fb_data : 32
signal leds : 4
UCR’s
SystemC-tobytecode
compiler
process(clock)
READ $1 memory_in
ADD $2 $0 3
ADD $3 $2 $1
WRITE $3 s1
ADDI $1 $0 1
WRITE $1 dataReady
END
process(dataReady)
READ $5 val6
SW $5 24($0)
READ $5 val7
…
ADDI $10 $0 0
ADDI $7 $0 0
ADDI $13 $0 8
…
END
UCR’s
SystemC
bytecode
Spatial
Constructs
MIPS-like
sequential
instructions
Frank Vahid, UC Riverside
48/64
SystemC Bytecode for FPGAs

Demo
Frank Vahid, UC Riverside
49/64
SystemC Bytecode Emulator
Emulator
Input
Memory
Output
Memory
UART
Buttons
LEDs
Main
Processor
SystemC
bytecode
Instruction
Memory
Read Signal
Memory
Write Signal
Memory
FPGA
USB
Interface
Bytecode uploadable via USB drive
Accelerators speedup emulation
Frank Vahid, UC Riverside
50/64
SystemC Bytecode Accelerators
Emulator

Implementation

Input
Memory
Output
Memory
UART
Buttons
LEDs
Main
Processor
SystemC
bytecode



Instruction
Memory
Read Signal
Memory

USB
Interface

Write Signal
Memory

MIPS-like multicycle RISC
datapath
100 MHz Clock
~33 Million Instr/Sec
Communicates to core emulator
memory mapped registers
Area: ~5000 slices
# of accelerators limited to # of
masters allowed on bus
~1200 lines of VHDL
Accelerator
Accelerator 1
Accelerator 2
FPGA
Accelerator 3
Bus,
start,
load
logic
Register File
RISC
Datapath
Local Mem
Frank Vahid, UC Riverside
51/64
Dynamic SystemC Accelerator Management

Emulator
Input
Memory
Output
Memory
UART
Buttons
LEDs
Main
Processor
SystemC
bytecode
Instruction
Memory
Read Signal
Memory

Involves online algorithms
USB
Interface
Write Signal
Memory
5
4.5
4
(ms)
Accelerator 1
Accelerator 2
FPGA

Only a limited number of
SystemC accelerators can fit
on an FPGA fabric
Dynamically map processes
to accelerators based on
process usage
Accelerator 3
42
11
44
12
43
10
Virtual machine
Big FPGA/no com
Big FPGA
Static preloaded
Greedy
AG
3.5
3
2.5
2
1.5
1
0.5
0
Random
Biased
Sequence
Periodic
Image Filter
Example
Frank Vahid, UC Riverside
52/64
Just-in-Time Synthesis
Emulator
Input
Memory
Output
Memory
UART
Buttons
LEDs
Main
Processor
Send SystemC bytecode
to synthesis server
Instruction
Memory
Read Signal
Memory
Write Signal
Memory
Accelerator 1
Accelerator 2
FPGA
SystemC
bytecode
Accelerator 3
Dynamically reconfigure
some or all of the FPGA
FPGA Specific
Bitstream
Possible to even perform
synthesis on-chip – “warp
processing” (previous UCR work)
Frank Vahid, UC Riverside
53/64
Transmuting Coprocessors

Demo
Frank Vahid, UC Riverside
57/64
FPGA is a Size-Limited Coprocessing Resource
App executions change. Must decide which coprocessors should be
FPGA-resident at a given time – transmuting coprocessors
Speedup with
previous appsUser device
Upload app
profile info
A software update
A coprocessor update
Server
Internet
User app profile info
FPGA
Memory
Bus
FPGA
implements μP
coprocessors
DMA
I/O
DOOM: 23sec
DOOM: 6sec
23sec
Blowfish:
Blowfish: 6sec
0001001010
0001001010
0100100101
0100100101
Select coproc.
set, generate
new FPGA
bitstream
CP library
CP selection
CP placement
New FPGA binary
Send back new
bitstream, reprogram FPGA
Frank Vahid, UC Riverside
58/64
Transmuting Coprocessor Demo

Three image filters:



Blur filter (S/L): Blur the image
Sobel filter (S/L): Find the edge of the
image
Emboss filter(S/L): Emboss the image
120
100
80
Time
60
30x
40
120x
20

Platform:



Virtex 2P(XC2VP30): PPC + Coprocessors
PPC Frequency: 100Mhz
Coproc. Frequency: 50Mhz
0
MP
Small CP
Large CP
Size(slice)
Small
Large
Blur
30
120
Sobel
228
912
Emboss
81
324
Frank Vahid, UC Riverside
59/64
Demo architecture
UART Push button
Image
BRAM
PLB
PPC

Image (128*128 pixels and 24bit color):
24 BRAMs

Soft version: Read (Image
BRAM)Execution (PPC)Write
(Display BRAM)

Coprocessor version: Read (Image
BRAM)Execution(Coproc)Write
(Display BRAM)

Dock: send the profile information
through UART.
Peripherals
Coproc
Instruction
BRAM
EDK
Interface
to external
Display
BRAM
VGA
control
ISE
VGA display
Frank Vahid, UC Riverside
60/64
Coprocessor configurations








Microprocessor only
Small blur+ small sobel
Small blur + small emboss
Small sobel + small emboss
Large blur
Large sobel
Large emboss
Choose the configuration
according to app profile info.
PPC
Peripherals
Sobel(s)
Blur
(S)
Emboss(L)
Sobel
Blur (L)
(L)
Sobel(S)
Emboss(s)
Memory
Virtex2P
Coprocessor
region
Frank Vahid, UC Riverside
61/64
Dynamic Enables Expandable Logic Concept
RAM
RAM
Expandable
ExpandableRAM
Logic– –System
Warp tools detect
detects
duringinvisibly
start, adapt
amountRAM
of FPGA,
improves
performance
invisiblyhardware.
application
to use less/more
DMA
CacheCache
FPGA
FPGA
FPGA
FPGA
Profiler
µP µP
Warp
Tools
Expandable Logic
Expandable RAM
uP
Performance
Frank Vahid, UC Riverside
63/64
Summary



FPGAs entering mainstream
Portability of applications is important
Dynamic binary translation to FPGAs – Warp
processing


Shown feasible; Extensive future work
Trends towards FPGA ubiquity



Microprocessor binaries need extensions for spatial
constructs
One approach: SystemC bytecode and virtual machine
Can also be warped for circuit-speed
http://www.cs.ucr.edu/~vahid/pubs
Frank Vahid, UC Riverside
64/64
Descargar

Slide 1