Improving Embedded System Software
Speed and Energy using
Microprocessor/FPGA Platform ICs
Frank Vahid
Associate Professor
Dept. of Computer Science and Engineering
University of California, Riverside
Also with the Center for Embedded Computer Systems at UC Irvine
http://www.cs.ucr.edu/~vahid
This research has been supported by the National Science Foundation, NEC,
Trimedia, and Triscend
Frank Vahid, UC Riverside
1
General Purpose vs. Special Purpose

Standard
tradeoff
Amazing to think this came from
wolves
Oct. 14, 2002, Cincinnati, Ohio -physician at Cincinnati Children’s
Hospital Medical Center report duct
tape effective at treating warts.
Frank Vahid, UC Riverside
2
General Purpose vs. Single Purpose Processors

total = 0
for i = 1 to N
loop
total += M[i]
end loop
Designers have long known
that:


General-purpose processors are
flexible
General purpose
Single-purpose processors are
Controller
Datapath
Control
logic and
State
register
Register
file
fast
IR
PC
General
ALU
OR
Single purpose
Controller
Control
logic
Datapath
i
total
State
register
+
Data
memory
Program
memory
Data
memory
Assembly
code for:
total = 0
for i =1 to …
ENIAC, 1940’s
Its flexibility was the big deal
Flexibility
Design cost
Time-to-market
Frank Vahid, UC Riverside
Performance
Power efficiency
Size
3
Mixing General and Single Purpose Processors

A.k.a. Hardware/software
partitioning


Pixel coprocessor
D2A
lens
Microcontrolle
r
JPEG codec
coprocessor, accelerator, peripheral,
etc.
Multiplier/Accumulato
r
DMA controller
Software: general-purpose processors


CCD preprocessor
A2D
Hardware: single-purpose processors

Digital camera chip
CCD
Memory controller
Display control
ISA bus interface
UART
LCD control
Though hardware underneath!
Especially important for embedded
systems


Computers embedded in devices
(cameras, cars, toys, even people)
Speed, cost, time-to-market, power,
size, … demands are tough
Flexibility
Speed
Energy
Sw only
Frank Vahid, UC Riverside
Hw Only
Hw/Sw
4
How is Partitioning Done for Embedded Systems?

Partitioning into hw and sw
blocks done early



During conceptual stage
Sw design done separately from
hw design
Attempts since late 1980s to
automate not yet successful




Informal spec
Partitioning manually is reasonably
straightforward
Spec is informal and not machine
readable
Sw algorithms may differ from hw
algorithms
No compelling need for tools
Frank Vahid, UC Riverside
System Partitioning
Sw spec
Sw design
Processor
Hw spec
Hw design
ASIC
5
New Platforms Invite New Efforts in Hw/Sw
Partitioning

New single-chip platforms
contain both general-purpose
processor and an FPGA
Processor + FPGA

Informal spec
FPGA: Field-programmable
gate array



Programmable just like software
 Flexible
Intended largely to implement
single-purpose processors
Can we perform a later
partitioning to improve the
software too?
System Partitioning
Sw spec
Sw design
Hw spec
Hw design
Partitioning
Processor + FPGA
Frank Vahid, UC Riverside
ASIC
6
Commercial Single-Chip Microprocessor/FPGA
Platforms

Triscend E5: based on
8-bit 8051 CISC core
(2000)



10 Dhrystone MIPS at
40MHz
up to 40K logic gates
Cost only about $4
Configurable logic
Triscend E5 chip
8051 processor plus
other peripherals
Frank Vahid, UC Riverside
Memory
7
Single-Chip Microprocessor/FPGA Platforms

Atmel FPSLIC


Field-Programmable
System-Level IC
Based on AVR 8-bit
RISC core



20 Dhrystone MIPS
5k-40k logic gates
$5-$10
Courtesy of Atmel
Frank Vahid, UC Riverside
8
Single-Chip Microprocessor/FPGA Platforms


Triscend A7 chip
(2001)
Based on ARM7 32bit RISC processor



54 Dhrystone MIPS at
60 MHz
Up to 40k logic gates
$10-$20 in volume
Courtesy of Triscend
Frank Vahid, UC Riverside
9
Single-Chip Microprocessor/FPGA Platforms




Altera’s Excalibur EPXA 10
(2002)
ARM (922T) hard core
200 Dhrystone MIPS at 200
MHz
~200k to ~2 million logic
gates
Source: www.altera.com
Frank Vahid, UC Riverside
10
Single-Chip Microprocessor/FPGA Platforms


Xilinx Virtex II Pro
(2002)
PowerPC based






Config.
logic

420 Dhrystone MIPS at
300 MHz
1 to 4 PowerPCs
4 to 16 gigabit
transceivers
12 to 216 multipliers
Millions of logic gates
200k to 4M bits RAM
204 to 852 I/O
$100-$500 (>25,000
units)
• 622 Mbps to 3.125 Gbps
PowerPCs

Up to 16 serial transceivers
Courtesy of Xilinx
Frank Vahid, UC Riverside
11
Single-Chip Microprocessor/FPGA Platforms


Why wouldn’t future microprocessor chips include
some amount of on-chip FPGA?
One argument against – area



Lots of silicon area taken up by FPGA
 FPGA about 20-30 times less area
efficient than custom logic
FPGA used to be for prototyping, too
big for final products
But chip trends imply that FPGAs will
be O.K. in final products…
Frank Vahid, UC Riverside
12
How Much is Enough?
Perhaps a bit small
Frank Vahid, UC Riverside
13
How Much is Enough?
Reasonably sized
Frank Vahid, UC Riverside
14
How Much is Enough?
Probably plenty big for most of us
Frank Vahid, UC Riverside
15
How Much is Enough?
More than typically necessary
Frank Vahid, UC Riverside
16
How Much Custom Logic is Enough?
IC package
IC
1993: ~ 1 million logic transistors
Perhaps a bit small
8-bit processor: 50,000 tr.
Pentium: 3 million tr.
MPEG decoder: several million tr.
Frank Vahid, UC Riverside
17
How Much Custom Logic is Enough?
1996: ~ 5-8 million logic transistors
Reasonably sized
Frank Vahid, UC Riverside
18
How Much Custom Logic is Enough?
1999: ~ 10-50 million logic transistors
Probably plenty big for most of us
Frank Vahid, UC Riverside
19
How Much Custom Logic is Enough?
2002: ~ 100-200 million logic transistors
More than typically necessary
Frank Vahid, UC Riverside
20
How Much Custom Logic is Enough?
1993: 1 M
2008: >1 BILLION logic transistors
Perhaps very few people
could design this
Frank Vahid, UC Riverside
21
Very Few Companies Can Design High-End ICs
Design productivity gap
10,000
100,000
1,000
10,000
Logic transistors per 100
10
chip
(in millions)
1
1000
Gap
IC capacity
10
0.1
0.01
Productivity
(K) Trans./Staff-Mo.
1
productivity
0.001

100
0.1
0.01
Source:
ITRS’99
Designer productivity growing at slower rate


1981: 100 designer months  ~$1M
2002: 30,000 designer months  ~$300M
Frank Vahid, UC Riverside
22
Single-Chip Platforms with On-Chip FPGAs

So, big FPGAs on-chip are O.K., because mainstream
designers couldn’t have used all that silicon area anyways
Becoming out of reach of
mainstream designers

Frank Vahid, UC Riverside
But, couldn’t
designers use
custom logic
instead of FPGAs
to make smaller
chips and save
costs?
23
Shrinking Chips

Yes, but there’s a limit

Chips becoming pin limited
A football huddle can only
get so small
Shrink
Pads
connecting
to external
pins
This area will exist
whether we use it
all or not
Frank Vahid, UC Riverside
24
Trend Towards Pre-Fabricated Platforms: ASSPs
ASSP: application specific
standard product





Domain-specific prefabricated IC
e.g., digital camera IC
ASIC: application specific IC
ASSP revenue > ASIC
ASSP design starts > ASIC

Unique IC design



Ignores quantity of same IC
ASIC design starts decreasing
Due to strong benefits of
using pre-fabricated devices
Source: Gartner/Dataquest September’01

Frank Vahid, UC Riverside
25
Microprocessor/FPGA Platforms


Trends point towards such
platforms increasing in
popularity
Can we automatically
partition the software to
utilize the FPGA?

For improved speed and
energy
Frank Vahid, UC Riverside
26
Automatic Hardware/Software Partitioning

Since late 1980s – goal has been spec in, hw/sw out

But no successful commercial tool yet. Why?
// From MediaBench’s JPEG codec
GLOBAL(void)
jpeg_fdct_ifast (DCTELEM * data)
{
DCTELEM tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7;
“Spec”
DCTELEM tmp10, tmp11, tmp12, tmp13;
Ideal
Partitioner
DCTELEM z1, z2, z3, z4, z5, z11, z13;
DCTELEM *dataptr;
int ctr;
SHIFT_TEMPS
/* Pass 1: process rows. */
dataptr = data;
for (ctr = DCTSIZE-1; ctr >= 0; ctr--) {
tmp0 = dataptr[0] + dataptr[7];
Software
Software
tmp7 = dataptr[0] - dataptr[7];
Compilation
tmp1 = dataptr[1] + dataptr[6];
Hardware
Synthesis
…
Processor
// Thousands of lines like this in dozens of files
Frank Vahid, UC Riverside
ASIC/FPGA
27
Why No Successful Tool Yet?

Most research has focused
on extensive exploration




Roots in VLSI CAD
Decompose problem into
fine-grained operations
Apply sophisticated
partitioning algorithms
Examples


Min-cut, dynamic
programming, simulated
annealing, tabu-search,
genetic evolution, etc.
“Spec”
1000s of nodes
(like circuit
partitioning)
Partitioner
Is this overkill?
Frank Vahid, UC Riverside
28
We Really Only Need Consider a Few Loops – Due
to the 90-10 Rule
Recent appearance of embedded benchmark suites




Enables analysis  understanding of the real problem
We’ve examined UCLA’s MediaBench, Netbench, Motorola’s Powerstone
Currently examining EEMBC (embedded equivalent of SPEC)
UCR loop analysis tools based on SimpleScalar and Simics
// From MediaBench’s JPEG codec
GLOBAL(void)
jpeg_fdct_ifast (DCTELEM * data)
{
DCTELEM tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7;
DCTELEM tmp10, tmp11, tmp12, tmp13;
DCTELEM z1, z2, z3, z4, z5, z11, z13;
DCTELEM *dataptr;
int ctr;
SHIFT_TEMPS
/* Pass 1: process rows. */
1
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
1
2
3
4
5
6
7
8
9
10

dataptr = data;
Assigned each
loop a number,
sorted by
fraction of
contribution to
total execution
time
for (ctr = DCTSIZE-1; ctr >= 0; ctr--) {
tmp0 = dataptr[0] + dataptr[7];
tmp7 = dataptr[0] - dataptr[7];
tmp1 = dataptr[1] + dataptr[6];
…
Frank Vahid, UC Riverside
29
The 90-10 Rule Holds for Embedded Systems
6
7
8
7
8
9
5
6
10
4
5
9
3
4
10
2
3
8
8
1
7
7
2
6
6
1
5
5
9
4
4
10
3
10
2
3
9
1
8
2
7
8
1
6
7
9
5
6
10
4
5
10
3
4
9
2
3
9
10
10
1
8
9
2
7
8
1
6
4
4
7
3
3
5
2
2
6
1
1
5
1
2
3
4
5
6
7
8
9
10
1
2
3
4
5
6
7
8
9
10
1
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
1
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
% executio n time
100%
90%
80%
70%
60%
50%
40%
30%
20%
10%
0%
% size o f pro gram
1
2
3
4
5
6
7
Frank Vahid, UC Riverside
8
9
10
In fact, the
most frequent
loop alone
took 50% of
time, using
1% of code
30
So Need We Only Consider the First Few Loops?
Not Necessarily
What if programs were self-similar w.r.t. 90-10 rule?



Remove most frequent loop – 90-10 rule still hold?
Intuition might say yes – remove loop, and we have another program.
1
1
0.9
0.9

0.8
% Remaining
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
1
2
3
4
5
6
7
8
9
Execution Time
% Remaining
Execution Time
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
10
1
2
3
4
5
Loop
6
7
8
9
10
Loop
So we need
only speedup
the first few
loops

1000
Speedup
Speedup
1000
500
0
500

0
0
1
2
3
4
5
Loop
6
7
8
9
10
0
1
2
3
4
5
6
Loop
Frank Vahid, UC Riverside
7
8
9
10
After that,
speedups are
limited
Good from
tool
perspective!
31
Manually Partitioned Several PowerStone
Benchmarks onto Triscend A7 and E5 Chips

Used multimeter and timer to measure
performance and power

E5 IC
Obtained good speedups and energy
savings by partitioning software among
microprocessor and on-chip FPGA
Benchmark
PS_g3fax
PS_crc
PS_brev
Time orig
Time sw/hw
11.47
7.44
10.92
4.51
9.84
3.28
Average:
Benchmark
PS_g3fax
PS_crc
PS_brev
Time orig
Time sw/hw
15.16
7.11
10.64
4.64
17.81
1.81
Average:
A7 results
Sp.
Porig
Psw/hw Eorig
Esw/hw
E sav
1.5 1.320 1.332
15.140
9.910
35%
2.4 1.320 1.320
14.414
5.953
59%
3.0 1.332 1.344
13.107
4.408
66%
2.3
Average:
53%
E5 results
Sp.
Porig
Psw/hw Eorig
Esw/hw
E sav
2.1 0.252 0.270
3.820
1.920
50%
2.3 0.207 0.225
2.202
1.044
53%
9.8 0.252 0.270
4.488
0.489
89%
4.8
Average:
Frank Vahid, UC Riverside
Triscend A7
development board
64%
32
Simulation-Based Results for More Benchmarks
(Quicker than physical implementation, results matched reasonably well)
Example
PS_g3fax
PS_crc
PS_summin
PS_brev
PS_matmul
PS_g3fax
PS_adpcm
PS_crc
PS_des
PS_engine
PS_jpeg
PS_summin
PS_v42
PS_brev
MB_g721
MB_adpcm
MB_pegwit
NB_dh
NB_md5
NB_tl
Archit
Cycles orig
Cycles sw
Cycles hw
8051
19,675,456
10,812,544
176,562
8051
291,196
180,224
7,168
8051
109,821,892
20,394,080
384,416
8051
330,064
305,768
1,360
8051
119,420
101,576
2,560
MIPS
15,600,000
4,720,000
599,000
MIPS
113,000
29,300
5,440
MIPS
5,040,000
3,480,000
460,800
MIPS
142,000
70,700
15,100
MIPS
915,000
145,000
28,100
MIPS
7,900,000
646,000
171,000
MIPS
2,920,000
1,270,000
266,000
MIPS
3,850,000
846,000
216,000
MIPS
3,566
2,499
138
MIPS
838,230,002
457,674,179 9,985,261
MIPS
32,894,094
32,866,110 1,183,260
MIPS
42,752,919
33,276,287 2,167,651
MIPS 1,793,032,157 1,349,063,192 45,156,767
MIPS
5,374,034
3,046,881
289,877
MIPS
57,412,470
29,244,221 2,479,552
Loop
Sp.
61
25
53
225
40
8
5
8
5
5
4
5
4
18
46
28
15
30
11
12
Average:
30
Total
Clkhw Sp. Psw
Phw
25
2.2 0.05 0.032
25
2.5 0.05 0.028
25
1.2 0.05 0.033
25 12.9 0.05 0.034
25
5.9 0.05 0.035
100
1.4 0.07 0.111
100
1.3 0.07 0.181
100
2.5 0.07 0.061
100
1.6 0.07 0.197
100
1.1 0.07 0.082
100
1.1 0.07 0.092
100
1.5 0.07 0.111
100
1.2 0.07 0.102
100
3.0 0.07 0.107
100
2.1 0.07 0.152
42 11.6 0.07 0.130
50
3.1 0.07 0.170
69
3.5 0.07 0.121
47
1.8 0.07 0.251
58
1.8 0.07 0.059
3.2
Frank Vahid, UC Riverside
Eorig
Esw/hw
ESav
0.1142 0.05408
53%
0.0017 0.00071
58%
0.6376 0.53657
16%
0.0019 0.00015
92%
0.0007 0.00012
82%
0.0265 0.02163
18%
0.0002 0.00018
6%
0.0086 0.00379
56%
0.0002 0.00019
20%
0.0016 0.00146
6%
0.0134 0.01360
-1%
0.0050 0.00375
24%
0.0065 0.00605
7%
0.0000 0.00000
62%
1.4250 0.75035
47%
0.0559 0.00821
85%
0.0727 0.03241
55%
3.0482 1.00547
67%
0.0091 0.00722
21%
0.0976 0.05930
39%
Speedup
of 3.2
and
energy
savings
of 34%
obtained
with only
10,500
gates
(avg)
Average: 34%
33
Looking at Multiple Loops per Benchmark
Manually created several partitioned versions of each benchmarks
Most speedup gained with first 20,000 gates
Surprisingly few gates!


27.2
5 .0
4 .0
S pe e dup

G7 2 1( M B )
A DP CM ( M B )
P EGWIT( M B )
3 .0
DH( NB )
M D5( NB )
TL ( N B )
URL( NB )
2 .0
2 .0 5 a t 9 0 ,0 0 0
1 .0
0
5 ,0 0 0
1 0 ,0 0 0
1 5 ,0 0 0
2 0 ,0 0 0
2 5 ,0 0 0
Ga te s



Stitt, Grattan and Vahid, Field-programmable Custom Computing Machines (FCCM) 2002
Stitt and Vahid, IEEE Design and Test, Dec. 2002
J. Villarreal, D. Suresh, G. Stitt, F. Vahid and W. Najjar, Design Automation of Embedded
Systems, 2002 (to appear).
34
Frank Vahid, UC Riverside
Ideal Speedups for Different Architectures

Varied loop speedup ratio (sw time / hw time of loop itself) to see impact
of faster microprocessor or slower FPGA: 30, 20, 10 (base case), 5 and 2

Loop speedups of 5 or more work fine for first few loops, not hard to achieve
10
9
8
7
6
5
4
3
2
1
0 1 2 3 4 5 6 7 8 9 100 1 2 3 4 5 6 7 8 9 100 1 2 3 4 5 6 7 8 9 100 1 2 3 4 5 6 7 8 9 100 1 2 3 4 5 6 7 8 9 10
10
9
8
7
6
5
4
3
2
1
0 1 2 3 4 5 6 7 8 9 100 1 2 3 4 5 6 7 8 9 100 1 2 3 4 5 6 7 8 9 100 1 2 3 4 5 6 7 8 9 100 1 2 3 4 5 6 7 8 9 10
Frank Vahid, UC Riverside
35
Ideal Energy Savings for Different Architectures

Varied loop power ratio (FPGA power / microprocessor power) to account
for different architectures – 2.5, 2.0, 1.5 (base case), 1.0

Energy savings quite resilient to variations
1
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10
1
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10
Frank Vahid, UC Riverside
36
How is Automated Partitioning Done?
Previous data
obtained
manually
Informal spec
System Partitioning
Sw spec
Hw spec
Sw design
Hw design
Partitioning
Processor + FPGA
ASIC
Frank Vahid, UC Riverside
37
Source-Level Partitioning
SW Source
_______
_______
_______
Front-end converts code into intermediate
format, such as SUIF (Stanford University
Intermediate Format)
Compiler
Front-End
Intermediate format explored for hardware
candidates
Hw/Sw
Partitioning
Assembly &
object files
Compiler
Back-End
Hw source
Assembler
& Linker
Synthesis
Binary
Netlists
Processor
FPGA
Binary is generated from assembling
and linking. Hw source is generated
and synthesized into netlist
Frank Vahid, UC Riverside
38
Problems with Source-Level Partitioning

Though technically superior, source-level partitioning

Disrupts standard commercial tool flow significantly



Requires special compiler (ouch!)
Multiple source languages, changing source languages
How deal with library code, assembly code, object code
Compiler
Front-end
C Source
C++ Source
Java Source
C SUIF
Compiler
C++ SUIF
Compiler
?
Frank Vahid, UC Riverside
39
Binary Partitioning
SW Source
_______
_______
_______
Assembly &
object files
Compilation
Source code is first compiled and linked in
order to create a binary.
Assembler
& Linker
Binary
Candidate hardware regions (a few small,
frequent loops) are decompiled for
partitioning
Hw/Sw
Partitioning
Hw source
Updated
Binary
Synthesis
HDL is generated and synthesized, and
binary is updated to use hardware
Netlists
Processor
FPGA
Frank Vahid, UC Riverside
40
Binary-Level Partitioning Results (ICCAD’02)
• Source-Level
• Average speedup, 1.5
• Average energy
savings, 27%
• Average 4,361 gates
• Binary-Level
• Average speedup, 1.4
• Average energy savings,
13%
• Large area overhead
averaging 10,325 gates
Frank Vahid, UC Riverside
41
Binary Partitioning Could Eventually Lead to
Dynamic Hw/Sw Partitioning

Dynamic software
optimization gaining interest



e.g., HP’s Dynamo
What better optimization than
moving to FPGA?
Add component on-chip:






Proces
sor
Mem
D$
I$
Mem
Profile
r
DMA
Detects most frequent sw loops
Decompiles a loop
Proc.
Config. Logic
Performs compiler optimizations
Synthesizes to a netlist
Places and routes the netlist
 Self-improving IC
onto (simple) FPGA
 Can be invisible to designer
Updates sw to call FPGA
 Appears as efficient processor
 HARD! Much future work.
Frank Vahid, UC Riverside
42
Conclusions

Hardware/software partitioning can significantly improve
software speed and energy


Successful commercial tool still on the horizon





Single-chip microprocessor/FPGA platforms, increasing in popularity,
make such partitioning even more attractive
Binary-level partitioning may help in some cases
Source-level can yield massive parallelism (Profs. Najjar/Payne)
Future dynamic hw/sw partitioning possible?
Distinction between sw/hw continually being blurred!
Many people involved:



Greg Stitt, Roman Lysecky, Shawn Nematbakhsh, Dinesh Suresh, Walid
Najjar, Jason Villarreal, Tom Payne, several others…
Support from NSF, Triscend, and soon SRC…
Exciting new directions!
Frank Vahid, UC Riverside
43
Descargar

Document