Warp Processing -- Dynamic Transparent
Conversion of Binaries to Circuits
Frank Vahid
Professor
Department of Computer Science and Engineering
University of California, Riverside
Associate Director, Center for Embedded Computer Systems, UC Irvine
Work supported by the National Science Foundation, the Semiconductor Research
Corporation, Xilinx, Intel, Motorola/Freescale
Contributing Students: Roman Lysecky (PhD 2005, now asst. prof. at U. Arizona), Greg Stitt (PhD 2006), Kris
Miller (MS 2007), David Sheldon (3rd yr PhD), Ryan Mannion (2nd yr PhD), Scott Sirowy (1st yr PhD)
Outline

- FPGAs
  - Overview
  - Hard to program --> Binary-level partitioning
- Warp processing
- Techniques underlying warp processing
- Overall warp processing results
- Directions and Summary
FPGAs

- FPGA -- Field-Programmable Gate Array
  - Off-the-shelf chip, evolved in the early 1990s
  - Implements a custom circuit just by downloading a stream of bits ("software")
  - Basic idea: an N-address memory can implement N-input combinational logic
    - The memory is called a lookup table, or LUT (see the sketch below)
    - (Note: no "gate array" inside)
- FPGA "fabric"
  - Thousands of small (~3-input) LUTs -- larger LUTs are inefficient
  - Thousands of switch matrices (SMs) for programming interconnections
  - Possibly additional hard-core components, like multipliers, RAM, etc.
  - CAD tools automatically map the desired circuit onto the FPGA fabric
  - Implement a particular circuit just by downloading particular bits

[Figure: a 4x2 memory acting as a 2-input lookup table for functions F and G, and an FPGA fabric of LUTs and switch matrices implementing a circuit of multipliers and an adder]
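To make the LUT idea concrete, here is a minimal C sketch (an illustration added for this writeup, not from the original slides): a 3-input LUT is modeled as an 8-entry truth table, and "programming" the FPGA amounts to filling in those bits.

#include <stdio.h>
#include <stdint.h>

/* A 3-input LUT is just an 8-entry truth table: the three inputs form an
   address, and the stored bit at that address is the output. */
typedef struct { uint8_t table[8]; } Lut3;

/* "Download" configuration bits for the function f(a,b,c) = a AND (b OR c). */
static void program_and_or(Lut3 *lut) {
    for (int a = 0; a <= 1; a++)
        for (int b = 0; b <= 1; b++)
            for (int c = 0; c <= 1; c++)
                lut->table[(a << 2) | (b << 1) | c] = (uint8_t)(a & (b | c));
}

static int lut_eval(const Lut3 *lut, int a, int b, int c) {
    return lut->table[(a << 2) | (b << 1) | c];
}

int main(void) {
    Lut3 lut;
    program_and_or(&lut);
    printf("f(1,0,1) = %d\n", lut_eval(&lut, 1, 0, 1));  /* prints 1 */
    return 0;
}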
FPGAs: "Programmable" like
Microprocessors -- Download Bits
Microprocessor
Binaries
Configurable
logic block
(CLB) -- (LUT
plus flip-flops)
FPGA Binaries
a
b
01110100...
001010010
001010010
001010010
…
…
…
…
…
…
x
a
b
c
Bits loaded into
LUTs, CLBs, and
SMs
Bits
loaded
into
program
memory
11
01
SM
SM
0010
…
Processor
Processor
0010
…
SM
SM
CLB
CLB
11
y
CLB
CLB
10
addr
0
1
1
0
1
0
0
1
x
y
01
FPGA
00
11
11
...
c
0
0
0
1
0
1
1
1
SM
SM
00
01
01
...
SM (Switch Matrix)
1
a
or
b
a
11
0
SM
SM
SM
SM
SM
SM
b
Frank Vahid, UC Riverside
a
or
b
4/57
FPGAs as Coprocessors

- Coprocessor -- accelerates an application kernel by implementing it as a circuit
- ASIC coprocessors are known to speed up many application kernels, with energy advantages too (e.g., Henkel'98, Rabaey'98, Stitt/Vahid'04)
- FPGA coprocessors also give speedup/energy benefits (Stitt/Vahid IEEE D&T'02, IEEE TECS'04)
  - Con: more silicon (~20x), ~4x performance overhead (Rose FPGA'06)
  - Pro: platform fully programmable -- shorter time-to-market, smaller non-recurring engineering (NRE) cost, low-cost devices available, late changes (even in-product)
FPGAs as Coprocessors Surprisingly Competitive to ASIC

- FPGA coprocessors give 34% energy savings versus the ASIC coprocessors' 48% (Stitt/Vahid IEEE D&T'02, IEEE TECS'04)
- A jet isn't as fast as a rocket, but it sure beats driving

[Chart: % energy versus software-only execution (0%-120%) for ASIC and FPGA coprocessors across Powerstone (PS_), MediaBench (MB_), and NetBench (NB_) benchmarks]
FPGA -- Why (Sometimes) Better than Microprocessor

C Code for Bit Reversal:
x = (x >> 16)               | (x << 16);
x = ((x >> 8) & 0x00ff00ff) | ((x << 8) & 0xff00ff00);
x = ((x >> 4) & 0x0f0f0f0f) | ((x << 4) & 0xf0f0f0f0);
x = ((x >> 2) & 0x33333333) | ((x << 2) & 0xcccccccc);
x = ((x >> 1) & 0x55555555) | ((x << 1) & 0xaaaaaaaa);

Compilation produces the binary:
sll $v1[3],$v0[2],0x10
srl $v0[2],$v0[2],0x10
or  $v0[2],$v1[3],$v0[2]
srl $v1[3],$v0[2],0x8
and $v1[3],$v1[3],$t5[13]
sll $v0[2],$v0[2],0x8
and $v0[2],$v0[2],$t4[12]
or  $v0[2],$v1[3],$v0[2]
srl $v1[3],$v0[2],0x4
and $v1[3],$v1[3],$t3[11]
sll $v0[2],$v0[2],0x4
and $v0[2],$v0[2],$t2[10]
...

- Processor: requires between 32 and 128 cycles
- Hardware for bit reversal (wires connecting the original x value to the bit-reversed x value): requires only 1 cycle, a speedup of 32x to 128x
- In general, FPGAs win because of concurrency, from the bit level to the task level
FPGAs: Why (Sometimes) Better than Microprocessor

C Code for FIR Filter:
for (i=0; i < 128; i++)
  y[i] += c[i] * x[i];

- Processor: thousands of instructions, several thousand cycles
- Hardware for the FIR filter: parallel multipliers feeding an adder tree -- only ~7 cycles, a speedup of more than 100x (a sketch of this tree-structured computation follows)
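The slide's hardware diagram shows multipliers feeding an adder tree. The C sketch below (an added illustration, not the slides' code, assuming N = 128 and the sum-of-products form at the heart of an FIR output) shows why the latency can be so low: all multiplies happen in one level, followed by log2(128) = 7 levels of pairwise additions instead of 128 sequential accumulations.

#include <stdio.h>

#define N 128

/* Sequential form: one multiply-accumulate per iteration (~N cycles). */
static long fir_sequential(const int c[N], const int x[N]) {
    long acc = 0;
    for (int i = 0; i < N; i++)
        acc += (long)c[i] * x[i];
    return acc;
}

/* Tree form: the structure a synthesized circuit exploits -- all N
   products in one "level", then log2(N) = 7 levels of pairwise adds. */
static long fir_tree(const int c[N], const int x[N]) {
    long level[N];
    for (int i = 0; i < N; i++)                   /* parallel multiplies */
        level[i] = (long)c[i] * x[i];
    for (int width = N; width > 1; width /= 2)    /* 7 adder levels      */
        for (int i = 0; i < width / 2; i++)
            level[i] = level[2 * i] + level[2 * i + 1];
    return level[0];
}

int main(void) {
    int c[N], x[N];
    for (int i = 0; i < N; i++) { c[i] = i % 7; x[i] = i % 5; }
    printf("%ld %ld\n", fir_sequential(c, x), fir_tree(c, x));  /* same result */
    return 0;
}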
FPGAs are Hard to Program

- Synthesis from hardware description languages (HDLs)
  - VHDL, Verilog
  - Great for parallelism
  - But non-standard languages, manual partitioning
  - SystemC is a good step
- C/C++ partitioning compilers
  - Use a language subset
  - Growing in importance
  - But a special compiler limits adoption
    - 100 software writers for every CAD user
    - Only about 15,000 CAD seats worldwide; millions of compiler seats

[Figure: a special profiling compiler takes the application, produces a microprocessor binary plus a netlist, and an FPGA binary is generated by synthesis, technology mapping, and place & route, targeting a processor + FPGA platform]
Binary-Level Partitioning Helps

- Binary-level partitioning: partition and synthesize starting from the software binary (Stitt/Vahid, ICCAD'02)
  - Traditional partitioning is done before compilation; a binary-level partitioner is instead a less disruptive back-end tool that takes the binary from a standard compiler and produces a modified binary plus a netlist (then synthesis, technology mapping, place & route) for the processor + FPGA platform
- Advantages
  - Any compiler, any language, multiple sources, assembly/object support, legacy code support
  - Better incorporation into the toolflow
  - Recent commercial product: CriticalBlue [www.criticalblue.com]
- Disadvantage
  - Quality loss due to lack of high-level language constructs? (More later)
Warp Processing

- Observation: Dynamic binary recompilation to a different microprocessor architecture is a mature commercial technology
  - e.g., modern Pentiums translate x86 to VLIW
- Question: If we can recompile binaries to FPGA circuits offline (binary-level partitioning), can we recompile binaries to FPGA circuits dynamically?
Warp Processing Idea

1. Initially, the software binary is loaded into instruction memory.

Software Binary:
Mov reg3, 0
Mov reg4, 0
loop:
Shl reg1, reg3, 1
Add reg5, reg2, reg1
Ld  reg6, 0(reg5)
Add reg4, reg4, reg6
Add reg3, reg3, 1
Beq reg3, 10, -5
Ret reg4

[Architecture: microprocessor (µP) with instruction memory, data cache (D$), a profiler, an FPGA, and on-chip CAD]
2. The microprocessor executes the instructions in the software binary, consuming time and energy.
3. The profiler monitors the executing instructions and detects critical regions in the binary -- here, the critical loop is detected. (A rough software model of this profiling step is sketched below.)
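The real warp profiler is a small non-intrusive hardware monitor; the following C sketch is only a rough software model of the idea (the function names and the backward-branch heuristic are illustrative assumptions): count how often each backward-branch target executes and report the hottest target as the critical loop header.

#include <stdio.h>

#define MAX_LOOPS 64

/* One counter per observed backward-branch target (loop header). */
struct loop_counter { unsigned addr; unsigned long count; };
static struct loop_counter loops[MAX_LOOPS];
static int num_loops = 0;

/* Called (conceptually) whenever a taken branch is observed. */
static void profile_branch(unsigned branch_pc, unsigned target_pc) {
    if (target_pc >= branch_pc)          /* only backward branches mark loops */
        return;
    for (int i = 0; i < num_loops; i++)
        if (loops[i].addr == target_pc) { loops[i].count++; return; }
    if (num_loops < MAX_LOOPS)
        loops[num_loops++] = (struct loop_counter){ target_pc, 1 };
}

/* Return the loop header with the highest count: the critical region. */
static unsigned hottest_loop(void) {
    unsigned best_addr = 0;
    unsigned long best = 0;
    for (int i = 0; i < num_loops; i++)
        if (loops[i].count > best) { best = loops[i].count; best_addr = loops[i].addr; }
    return best_addr;
}

int main(void) {
    for (int i = 0; i < 1000; i++) profile_branch(0x120, 0x100);  /* hot loop  */
    for (int i = 0; i < 10; i++)   profile_branch(0x220, 0x200);  /* cold loop */
    printf("critical loop header: 0x%x\n", hottest_loop());
    return 0;
}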
4. The on-chip CAD reads in the critical region.
5. The on-chip CAD (the dynamic partitioning module, DPM) decompiles the critical region into a control/data flow graph (CDFG):

reg3 := 0
reg4 := 0
loop:
reg4 := reg4 + mem[reg2 + (reg3 << 1)]
reg3 := reg3 + 1
if (reg3 < 10) goto loop
ret reg4
6. The on-chip CAD synthesizes the decompiled CDFG into a custom (parallel) circuit -- here, a parallel tree of adders that sums the array elements.
7. The on-chip CAD maps the circuit onto the FPGA's CLBs and switch matrices.
8. The on-chip CAD replaces the critical loop in the binary with instructions that interact with the FPGA, causing performance and energy to "warp" by an order of magnitude or more: the "warped" time and energy are a small fraction of the software-only values. (A hedged sketch of what the updated binary effectively does is given below.)
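The effect of the binary update can be pictured with the C sketch below: the kernel that used to run as the loop shown above is instead handed to the FPGA through a small register interface, and the processor waits for the result. The register layout, names, and the software stand-in for the circuit are all illustrative assumptions, not the actual warp interface.

#include <stdint.h>
#include <stdio.h>

/* A software stand-in for the FPGA's memory-mapped interface.
   The register layout and names are illustrative assumptions only. */
struct fpga_regs {
    volatile uintptr_t arg_addr;    /* kernel argument: array address */
    volatile uint32_t  arg_n;       /* kernel argument: element count */
    volatile uint32_t  start, done; /* control / handshake            */
    volatile uint32_t  result;      /* kernel result                  */
};

/* The original software kernel: the critical loop found by the profiler. */
static uint32_t sum_array_sw(const uint16_t *a, int n) {
    uint32_t accum = 0;
    for (int i = 0; i < n; i++)
        accum += a[i];
    return accum;
}

/* Stand-in for the synthesized circuit, so this sketch actually runs. */
static void fpga_model(struct fpga_regs *r) {
    const uint16_t *a = (const uint16_t *)r->arg_addr;
    r->result = sum_array_sw(a, (int)r->arg_n);
    r->done = 1;
}

/* What the updated binary effectively does instead of running the loop. */
static uint32_t sum_array_warped(struct fpga_regs *r, const uint16_t *a, int n) {
    r->arg_addr = (uintptr_t)a;     /* pass the array address */
    r->arg_n = (uint32_t)n;         /* pass the length        */
    r->done = 0;
    r->start = 1;                   /* kick off the circuit   */
    fpga_model(r);                  /* hardware runs; software would just poll */
    while (r->done == 0)            /* wait for completion    */
        ;
    return r->result;
}

int main(void) {
    uint16_t a[10] = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 };
    struct fpga_regs regs = { 0 };
    printf("%u %u\n", (unsigned)sum_array_sw(a, 10),
                      (unsigned)sum_array_warped(&regs, a, 10));
    return 0;
}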
Warp Processing Idea

- Likely multiple microprocessors per chip, serviced by one on-chip CAD block

[Architecture: several µPs sharing a profiler, instruction memory, data cache, FPGA, and a single on-chip CAD block]
Warp Processing: Trend Towards Processor/FPGA Programmable Platforms

- FPGAs with hard-core processors (Xilinx Virtex-II Pro, source: Xilinx; Altera Excalibur, source: Altera)
- FPGAs with soft-core processors (Xilinx Spartan, source: Xilinx)
- Computer boards with FPGAs (Cray XD1, source: FPGA Journal, Apr'05)
Warp Processing: Trend Towards Processor/FPGA Programmable Platforms

- Programming is a key challenge
  - Soln 1: Compile a high-level language to custom binaries that use both the microprocessor and the FPGA (traditional partitioning, done at compile time with profiling and CAD tools)
  - Soln 2: Use standard microprocessor binaries and dynamically recompile (warp) them
    - Cons: less high-level information when compiling, less optimization
    - Pros:
      - Available to all software developers, not just specialists
      - Data-dependent optimization
      - Most importantly, standard binaries enable an "ecosystem" among tools, architectures, and applications
- The standard binary (and ecosystem) concept is presently absent in FPGAs and other new programmable platforms
Warp Processing Steps (On-Chip CAD)

Binary --> Profiling & partitioning --> Decompilation --> Synthesis --> JIT FPGA compilation (technology mapping, placement, and routing) --> Binary updater --> updated microprocessor binary + FPGA binary

[Architecture: µP with I$ and D$, a profiler, the FPGA, and the on-chip CAD]
Warp Processing -- Profiling and Partitioning

- Applications spend much of their time in a small amount of code
  - The 90-10 rule
  - We observed a 75-4 rule for MediaBench and NetBench
- Developed an efficient hardware profiler (Gordon-Ross/Vahid, CASES'04, IEEE Trans. on Computers '06)
- Partitioning is straightforward: try the most critical code first

[Chart: % execution time vs. % size of program, showing execution time concentrated in a few percent of the code]
Warp Processing -- Decompilation

- Synthesis from a binary has a key challenge
  - High-level information (e.g., loops, arrays) is lost during compilation
  - Direct translation of assembly to a circuit incurs huge overheads
  - Need to recover the high-level information

[Chart: speedup, energy, and size overhead of a microprocessor/FPGA solution WITHOUT decompilation vs. the microprocessor alone, for g3fax, adpcm, crc, des, engine, jpeg, summin, v42, and their average]
Warp Processing -- Decompilation

- Solution: recover the high-level information from the binary -- decompilation
  - Extensive previous work exists (for different purposes); we adapted it and also developed new decompilation methods

Original C Code:
long f( short a[10] ) {
  long accum;
  for (int i=0; i < 10; i++) {
    accum += a[i];
  }
  return accum;
}

Corresponding Assembly:
Mov reg3, 0
Mov reg4, 0
loop:
Shl reg1, reg3, 1
Add reg5, reg2, reg1
Ld  reg6, 0(reg5)
Add reg4, reg4, reg6
Add reg3, reg3, 1
Beq reg3, 10, -5
Ret reg4

Decompilation steps: control/data flow graph creation, data flow analysis, function recovery, control structure recovery, array recovery.

Recovered code:
long f( short array[10] ) {
  long reg4 = 0;
  for (long reg3 = 0; reg3 < 10; reg3++) {
    reg4 += array[reg3];
  }
  return reg4;
}

Almost identical representations.
New Decompilation Method: Loop Rerolling

- Problem: Compiler unrolling of loops (to expose parallelism) causes synthesis problems
  - Huge input (slow), can't unroll to the desired amount, can't use advanced loop methods (loop pipelining, fusion, splitting, ...)
- Solution: a new decompilation method, loop rerolling
  - Identify the unrolled iterations and compact them into one iteration

Original loop:
for (int i=0; i < 3; i++)
  accum += a[i];

After compiler loop unrolling (in the binary):
Ld  reg2, 100(0)
Add reg1, reg1, reg2
Ld  reg2, 100(1)
Add reg1, reg1, reg2
Ld  reg2, 100(2)
Add reg1, reg1, reg2

After loop rerolling:
for (int i=0; i < 3; i++)
  reg1 += array[i];
Loop Rerolling: Identify Unrolled Iterations

- Find consecutively repeating instruction sequences

Original C Code:
x = x + 1;
for (i=0; i < 2; i++)
  a[i] = b[i] + 1;
y = x;

Unrolled form: x = x + 1;  a[0] = b[0]+1;  a[1] = b[1]+1;  y = x;

Binary, with each instruction mapped to a character:
Add r3, r3, 1  => B
Ld  r0, b(0)   => A
Add r1, r0, 1  => B
St  a(0), r1   => C
Ld  r0, b(1)   => A
Add r1, r0, 1  => B
St  a(1), r1   => C
Mov r4, r3     => D

String representation: BABCABCD

- Build a suffix tree (derived from bioinformatics techniques) and find consecutive repeating substrings: adjacent nodes with the same substring
- Unrolled loop found: 2 unrolled iterations, each iteration = ABC (Ld, Add, St)

(A brute-force sketch of this repeat-detection step appears below.)
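A minimal C sketch of the repeat-detection step (a brute-force scan used here for clarity; the actual tool uses a suffix tree, as noted above): map each instruction to a character, then find a substring that repeats back-to-back, which marks the unrolled loop body.

#include <stdio.h>
#include <string.h>

/* Find the longest substring that repeats consecutively at least twice;
   return its start index (or -1) and report its length and repeat count. */
static int find_unrolled_body(const char *s, int *body_len, int *reps) {
    int n = (int)strlen(s), best_start = -1;
    *body_len = 0; *reps = 0;
    for (int start = 0; start < n; start++) {
        for (int len = 1; start + 2 * len <= n; len++) {
            int r = 1;
            while (start + (r + 1) * len <= n &&
                   memcmp(s + start, s + start + r * len, (size_t)len) == 0)
                r++;
            if (r >= 2 && len * r > (*body_len) * (*reps)) {
                best_start = start; *body_len = len; *reps = r;
            }
        }
    }
    return best_start;
}

int main(void) {
    /* A=Ld, B=Add, C=St, D=Mov -- the string from the slide: BABCABCD. */
    const char *s = "BABCABCD";
    int len, reps, start = find_unrolled_body(s, &len, &reps);
    if (start >= 0)
        printf("unrolled body '%.*s' at index %d, %d iterations\n",
               len, s + start, start, reps);   /* 'ABC' at index 1, 2 iterations */
    return 0;
}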
Warp Processing -- Decompilation

- Study: synthesis after decompilation is often quite similar to synthesis from the original C code
  - Almost identical performance, small area overhead (FPGA 2005)

                 Synthesis from C code          Synthesis after decompiling binary
Example          Cycles  ClkFrq  Time   Area    Cycles  ClkFrq  Time   Area    %TimeOverhead  %AreaOverhead
bit_correlator      258     118   2.19    15       258     118  2.186    15        0%             0%
fir                 129     125   1.03   359       129     125  1.032   371        0%             3%
udiv8               281     190   1.48   398       281     190  1.479   398        0%             0%
prewitt           64516     123    525  2690     64516     123  524.5  4250        0%            58%
mf9                 258      57    4.5  1048       258      57  4.503  1048        0%             0%
moravec          195072      66   2951   680    195072      70   2791   676       -6%            -1%
Avg:                                                                               -1%            10%
2. Deriving high-level constructs from binaries

- Recent study of decompilation robustness in the presence of compiler optimizations and across instruction sets (DATE'04, ICCAD'05)
- Energy savings of 77%/76%/87% for MIPS/ARM/MicroBlaze

[Table: normalized software vs. hardware/software energy and speedup (S) for ten examples (FIR Filter, Beamformer, Viterbi, Crc, Des, Summin, Brev, BITMNP01, IDCTRN01, PNTRCH01) compiled for MIPS, ARM, and MicroBlaze at -O1 and -O3]
Decompilation is Effective Even with High Compiler-Optimization Levels

[Chart: average speedup of the 10 examples for MIPS, ARM, and MicroBlaze at -O1 and -O3]

- Speedups are similar on MIPS for -O1 and -O3 optimizations
- Speedups are similar on ARM for -O1 and -O3 optimizations; ARM's complex instructions didn't hurt synthesis
- Speedups are much larger on MicroBlaze: MicroBlaze is a slower microprocessor, and its -O3 optimizations were very beneficial to the synthesized hardware

Publication: New Decompilation Techniques for Binary-level Co-processor Generation. G. Stitt, F. Vahid. IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov. 2005.
Decompilation Effectiveness: In-Depth Study

- Performed an in-depth study with Freescale
  - H.264 video decoder (H.264 gives better quality, or smaller files, than MPEG-2, using more computation)
  - Highly optimized proprietary code, not reference code -- a huge difference
- Research question: Is synthesis from binaries competitive on highly optimized code?
- A several-month study
Optimized H.264

- Larger than most benchmarks
  - H.264: 16,000 lines
  - Previous work: 100 to several thousand lines
- Highly optimized
  - H.264: many man-hours of manual optimization
  - 10x faster than the reference code used in previous works
- Different profiling results
  - Previous examples: ~90% of the time in several loops
  - H.264: ~90% of the time in ~45 functions -- harder to speed up

Function Name                     Instr   %Time   Cumulative Speedup
MotionComp_00                        33    6.8%          1.1
InvTransform4x4                      63   12.5%          1.1
FindHorizontalBS                     47   16.7%          1.2
GetBits                              51   20.8%          1.3
FindVerticalBS                       44   24.7%          1.3
MotionCompChromaFullXFullY           24   28.6%          1.4
FilterHorizontalLuma                557   32.5%          1.5
FilterVerticalLuma                  481   35.8%          1.6
FilterHorizontalChroma              133   39.0%          1.6
CombineCoefsZerosInvQuantScan        69   42.0%          1.7
memset                               20   44.9%          1.8
MotionCompensate                    167   47.7%          1.9
FilterVerticalChroma                121   50.3%          2.0
MotionCompChromaFracXFracY           48   53.0%          2.1
ReadLeadingZerosAndOne               56   55.6%          2.3
DecodeCoeffTokenNormal               93   57.5%          2.4
DeblockingFilterLumaRow             272   59.4%          2.5
DecodeZeros                          79   61.3%          2.6
MotionComp_23                       279   63.0%          2.7
DecodeBlockCoefLevels                56   64.6%          2.8
MotionComp_21                       281   66.2%          3.0
FindBoundaryStrengthPMB              44   67.7%          3.1
C vs. Binary Synthesis on Optimized H.264

[Chart: speedup vs. the number of functions in hardware (1 to 51) for C partitioning and binary partitioning; the two curves are nearly identical]

- Binary partitioning is competitive with source-level (C) partitioning
  - Speedups compared to ARM9 software: binary 2.48, C 2.53
  - Decompilation recovered nearly all of the high-level information needed for partitioning and synthesis
Warp Processing -- Synthesis

- ROCM -- Riverside On-Chip Minimizer
  - Standard register-transfer synthesis
  - Logic synthesis -- make it lean
    - Combines approaches from Espresso-II [Brayton et al., 1984][Hassoun & Sasao, 2002] and Presto [Svoboda & White, 1979]
    - Cost/benefit analysis of the operations (expand, reduce, irredundant over the on-set, dc-set, and off-set)
  - Result
    - A single expand phase instead of multiple iterations
    - Eliminates the need to compute the off-set, reducing memory usage
    - On average only 2% larger than the optimal solution
Warp Processing -- JIT FPGA Compilation

- Hard: routing is extremely compute- and memory-intensive
- Solution: jointly design the CAD tools and the FPGA architecture
  - Cost/benefit analysis
  - Highly iterative process
Warp-Targeted FPGA Architecture (DATE'04)

- A CAD-specialized configurable logic fabric
- Simplified switch matrices
  - Directly connected to the adjacent CLB
  - All nets are routed using only a single pair of channels
  - Allows efficient routing -- routing is by far the most time-consuming on-chip CAD task
- Simplified CLBs
  - Two 3-input, 2-output LUTs
  - Each CLB connected to the adjacent CLB to simplify routing of carry chains
- Currently being prototyped by Intel (scheduled for the 2006 Q3 shuttle)

[Figure: a switch matrix with channels 0-3 and 0L-3L, and a CLB with two LUTs (inputs a-f, outputs o1-o4) connected to the adjacent CLBs]
Warp Processing -- Technology Mapping

- ROCTM -- Technology Mapping/Packing
  - Decompose the hardware circuit into a DAG; nodes correspond to basic 2-input logic gates (AND, OR, XOR, etc.)
  - Hierarchical bottom-up graph clustering algorithm (a simplified sketch follows this slide)
    - Breadth-first traversal combining nodes to form single-output LUTs
    - Combine LUTs with common inputs to form the final 2-output LUTs
    - Pack LUTs in which the output from one LUT is an input to a second LUT

Flow: Logic Synthesis -> Technology Mapping/Packing -> Placement -> Routing (JIT FPGA Compilation)

Dynamic Hardware/Software Partitioning: A First Approach, DAC'03
A Configurable Logic Fabric for Dynamic Hardware/Software Partitioning, DATE'04
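As a rough illustration of the packing idea (a greedy simplification written for this writeup, not the actual ROCTM algorithm): walk the gate DAG bottom-up and merge a 2-input gate into its single-fanout predecessor whenever the combined cluster still fits in a 3-input LUT.

#include <stdio.h>

#define MAX_NODES 32
#define PI(k) (-1 - (k))   /* encode primary input k as a negative id */

struct node {
    int in0, in1;          /* gate inputs: a node index or PI(k)       */
    int fanout;            /* number of gates reading this gate        */
    int absorbed;          /* 1 if packed into its successor's LUT     */
    int nin;               /* distinct external inputs of this LUT     */
    int ins[3];
};

static int add_input(struct node *n, int sig) {
    for (int i = 0; i < n->nin; i++)
        if (n->ins[i] == sig) return 1;       /* already an input         */
    if (n->nin == 3) return 0;                /* would exceed 3-input LUT  */
    n->ins[n->nin++] = sig;
    return 1;
}

/* Try to absorb single-fanout predecessor p into v's LUT. */
static int try_merge(struct node *nodes, int v, int p, int other_in) {
    struct node merged = nodes[p];            /* start from p's input set  */
    if (!add_input(&merged, other_in)) return 0;
    merged.in0 = nodes[v].in0;
    merged.in1 = nodes[v].in1;
    merged.fanout = nodes[v].fanout;
    merged.absorbed = 0;
    nodes[v] = merged;
    nodes[p].absorbed = 1;
    return 1;
}

int main(void) {
    /* Netlist of 2-input gates in topological order:
       n0 = a & b;  n1 = n0 | c;  n2 = n1 ^ d;  (a..d are primary inputs) */
    struct node nodes[MAX_NODES] = {
        { PI(0), PI(1), 1, 0, 0, {0} },
        { 0,     PI(2), 1, 0, 0, {0} },
        { 1,     PI(3), 0, 0, 0, {0} },
    };
    int n = 3;

    for (int v = 0; v < n; v++) {             /* seed each LUT's input set */
        nodes[v].nin = 0;
        add_input(&nodes[v], nodes[v].in0);
        add_input(&nodes[v], nodes[v].in1);
    }
    for (int v = 0; v < n; v++) {             /* bottom-up greedy packing  */
        int a = nodes[v].in0, b = nodes[v].in1;
        if (a >= 0 && nodes[a].fanout == 1 && try_merge(nodes, v, a, b)) continue;
        if (b >= 0 && nodes[b].fanout == 1) try_merge(nodes, v, b, a);
    }
    for (int v = 0; v < n; v++)
        if (!nodes[v].absorbed)
            printf("LUT rooted at n%d uses %d input(s)\n", v, nodes[v].nin);
    return 0;
}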
Warp Processing -- Placement

- ROCPLACE -- Placement
  - Dependency-based positional placement algorithm
    - Identify the critical path, placing critical nodes in the center of the configurable logic fabric
    - Use dependencies between the remaining CLBs to determine their placement
    - Attempt to use adjacent-CLB routing whenever possible
Warp Processing -- Routing

- ROCR -- Riverside On-Chip Router (Dynamic FPGA Routing for Just-in-Time FPGA Compilation, DAC'04)
  - Requires much less memory than VPR because the resource graph is smaller
  - 10x faster execution time than VPR (timing-driven)
  - Produces circuits with critical paths 10% shorter than VPR (routability-driven)

[Charts: memory usage (KB) and execution time (s) of VPR (routability-driven), VPR (timing-driven), and ROCR across standard place-and-route benchmarks (alu4, apex2, apex4, bigkey, des, diffeq, dsip, elliptic, ex5p, frisc, misex3, s298, seq, tseng, ...)]
Experiments with Warp Processing

- Warp processor
  - ARM/MIPS plus our fabric
  - Riverside on-chip CAD tools map the critical region to the configurable fabric
  - Requires less than 2 seconds on a lean embedded processor to perform synthesis and JIT FPGA compilation
- Traditional HW/SW partitioning (for comparison)
  - ARM/MIPS plus a Xilinx Virtex-E FPGA
  - Software manually partitioned using VHDL
  - VHDL synthesized using Xilinx ISE 4.1
Warp Processors: Performance Speedup (Most Frequent Kernel Only)

- Average kernel speedup of 41x, vs. 21x for the Virtex-E
- The warp-targeted fabric's (WCLA) simplicity results in faster hardware circuits

[Chart: kernel speedup over software-only execution for warp processing vs. Xilinx Virtex-E across embedded benchmarks; a few kernels reach 80-191x]
Warp Processors: Performance Speedup (Overall, Multiple Kernels)

- Assuming a 100 MHz ARM, with the fabric clocked at the rate determined by synthesis
- Average overall speedup of 7.4x
- Energy reduction of 38% - 94%

[Chart: overall application speedup over software-only execution for the same benchmarks]
Warp Processors - Results: Execution Time and Memory Requirements

Tool                            Execution time   Memory
Xilinx ISE                      9.1 s            60 MB
DPM (on-chip CAD)               0.2 s            3.6 MB
DPM (on-chip CAD, 75 MHz ARM7)  1.4 s            3.6 MB
Direction: Coding Guidelines for Partitioning?

[Chart: speedup vs. the number of functions in hardware on H.264 for the ideal speedup (zero-time hardware execution), C partitioning, and binary partitioning]

- The in-depth H.264 study led to a question: Why aren't the speedups (from binary or C) closer to the "ideal" speedup (zero execution time per function)?
- We thus examined dozens of benchmarks in more detail
  - Are there simple coding guidelines that result in better speedups when kernels are synthesized to circuits?
Synthesis-Oriented Coding Guidelines

- Pass by value-return
  - Declare a local array and copy in all the data needed by a function (makes the lack of aliases explicit) -- see the sketch after this slide
- Function specialization
  - Create a function version having frequent parameter values as constants

Original:
void f(int width, int height) {
  ....
  for (i=0; i < width; i++)
    for (j=0; j < height; j++)
      ...
  ...
}

Rewritten:
void f_4_4() {
  ....
  for (i=0; i < 4; i++)
    for (j=0; j < 4; j++)
      ...
  ...
}

The bounds are now explicit, so the loops are unrollable.
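The slides show code for function specialization but not for pass by value-return; a hedged illustration of that guideline (the names and sizes below are made up) might look like this: copying the needed data into local arrays makes the absence of aliasing explicit, so the synthesis tool can parallelize the loop.

#include <string.h>

#define N 16

/* Original: the tool must assume dst and src may alias, serializing accesses. */
void scale_orig(int *dst, const int *src) {
    for (int i = 0; i < N; i++)
        dst[i] = src[i] * 3;
}

/* With pass by value-return: copy in, compute on locals, copy out.
   The lack of aliases is now explicit, so the iterations can run in parallel. */
void scale_pvr(int *dst, const int *src) {
    int local_src[N], local_dst[N];
    memcpy(local_src, src, sizeof local_src);   /* pass "by value"  */
    for (int i = 0; i < N; i++)
        local_dst[i] = local_src[i] * 3;
    memcpy(dst, local_dst, sizeof local_dst);   /* "return" results */
}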
Synthesis-Oriented Coding Guidelines

- Algorithmic specialization
  - Use parallelizable hardware algorithms when possible
- Hoisting and sinking of error checking
  - Keep error checking out of loops to enable unrolling
- Lookup table avoidance
  - Use expressions rather than lookup tables

Original:
int clip[512] = { . . . };
void f() {
  ...
  for (i=0; i < 10; i++)
    val[i] = clip[val[i]];
  ...
}

Rewritten:
void f() {
  ...
  for (i=0; i < 10; i++)
    if (val[i] > 255) val[i] = 255;
    else if (val[i] < 0) val[i] = 0;
  ...
}

The comparisons can now be parallelized (one comparator pair per element).
Synthesis-Oriented Coding Guidelines

- Use explicit control flow
  - Replace function pointers with if statements and static function calls

Original:
void (*funcArray[]) (char *data) = { func1, func2, . . . };
void f(char *data) {
  ...
  funcPointer = funcArray[i];
  (*funcPointer) (data);
  ...
}

Rewritten:
void f(char *data) {
  ...
  if (i == 0)
    func1(data);
  else if (i == 1)
    func2(data);
  ...
}
Coding Guideline Results on H.264

[Chart: speedup vs. the number of functions in hardware for the ideal speedup (zero-time hardware execution), the speedup after the rewrite (C and binary partitioning), and the original C and binary partitioning]

- Simple coding guidelines made a large improvement
  - The rewritten software is only ~3% slower than the original
  - And binary partitioning is still competitive with C partitioning
    - Speedups: binary 6.55, C 6.56
    - The small difference is caused by switch statements that used indirect jumps
Coding Guideline Results on Other Benchmarks

- Studied the guidelines further on standard benchmarks (g3fax, mpeg2, jpeg, brev, fir, crc)
  - Further synthesis speedups (again, independent of the C vs. binary issue)
  - More guidelines to be developed
- As compute platforms incorporate FPGAs, might these guidelines become mainstream?

[Charts: hardware/software speedup with the original code vs. with the guidelines, and the performance and size overhead of the rewritten software, for the benchmarks above]
Direction: New Applications -- Image Processing

[Chart: speedup for image-processing kernels (FIR, Prewitt, Wavelet, MaxBlend, Antialias, Brighten, Roberts, Sobel, Emboss, Sharpen, Blur, Gaussian, Burt-Adelson, Median, Kuwahara, and the average)]

- 32x average speedup compared to a microprocessor with a 10x faster clock
- Exploits the parallelism in image processing (see the sketch below)
  - Window operations contain much fine-grained parallelism
  - And each pixel can be determined in parallel
- Performance is memory-bandwidth limited
  - Warp processing can output a pixel per cycle for each pixel that can be fetched from memory per cycle
  - Faster memory will further improve performance
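To illustrate why window operations parallelize so well (an added sketch, not the benchmark code): every output pixel of a 3x3 window operation depends only on a small input neighborhood, so a circuit can evaluate all nine products and the sum in parallel and emit one pixel per cycle, limited only by how fast pixels arrive from memory.

#include <stdio.h>

#define W 8
#define H 8

/* 3x3 window operation (e.g., an edge-detect kernel). Each output pixel
   depends only on its input neighborhood, so all nine multiplies and the
   adds can be done in parallel in hardware, one output pixel per cycle. */
static void window3x3(const int in[H][W], int out[H][W], const int k[3][3]) {
    for (int y = 1; y < H - 1; y++)
        for (int x = 1; x < W - 1; x++) {
            int acc = 0;
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++)
                    acc += k[dy + 1][dx + 1] * in[y + dy][x + dx];
            out[y][x] = acc;
        }
}

int main(void) {
    int in[H][W], out[H][W] = {{0}};
    const int k[3][3] = { {-1,-1,-1}, {-1, 8,-1}, {-1,-1,-1} };
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            in[y][x] = (x + y) % 4;
    window3x3(in, out, k);
    printf("out[3][3] = %d\n", out[3][3]);
    return 0;
}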
Direction: Applications with Process-Level Parallelism

[Chart: speedup for applications with process-level parallelism (placement, vehicle routing, clustering, sorting, Fourier transform, protein motif detection, seismic migration, simulated annealing, ...), with some applications reaching 200-500x]

- Parallel code provides further speedup
  - Average 79x speedup compared to a desktop microprocessor
- Use the FPGA to implement tens or hundreds of processors
  - Can also exploit instruction-level parallelism
  - Warp tools will have to detect coarse-grained parallelism
Summary

- Showed the feasibility of warp technology
  - Application kernels can be dynamically mapped to an FPGA by a reasonable amount of on-chip compute resources
- Tremendous potential applicability
  - Presently investigating: embedded (w/ Freescale), desktop (w/ Intel), server (w/ IBM)
- Radically new FPGA applications may be possible
  - Neural networks that rewire themselves? Network routers whose queuing structure changes based on traffic patterns?
  - If the technology exists to synthesize circuits dynamically, what can we do with that technology?