A First-step Towards an
Architecture Tuning Methodology
for Low Power
Greg Stitt, Frank Vahid*, Tony Givargis
Roman Lysecky
Dept. of Computer Science & Engineering
University of California, Riverside
Department of IP
Management
Conexant
Newport Beach
*also with the Center for Embedded Computer Systems,
UC Irvine
This work was supported by the National Science Foundation under grants CCR-9811164 and CCR-9876006, and by a Design Automation Conference graduate scholarship.
Introduction: advent of cores
 In the past, board-level embedded systems were built using discrete ICs
 Today, single-IC systems are increasingly being built using IPs (intellectual property), a.k.a. “cores”
Hard core: layout
Firm core: structure (HDL)
Soft core: synthesizable behavior (HDL)
 “System-on-a-chip” (SOC)




[Figure: a board with discrete Processor, Memory, and Peripheral ICs, contrasted with a single IC whose Processor, Mem, and Peripheral are IP cores taken from a core library (PeripheralA, PeripheralB, ProcessorX)]
Introduction: embedded systems
 SOCs implementing an embedded system have a unique feature
 Each implements a particular application
 Thus, the processor may execute a single fixed program that never changes
 Unlike desktop systems, which execute a variety of programs
 Examples: digital camera, automobile cruise-controller
 We can exploit this fixed-program feature
 For example, by using mask-programmed ROM
 But much more can be done
Introduction: architecture tuning
 Architecture tuning
 A way to exploit the fixed-program feature of embedded systems
 First, do architecture design for the particular application
 Then, “tune” the core-based system architecture to the particular application program, before IC fabrication
 Goals: better performance, power, size
[Figure: tuning flow. A fixed program and a core library (PeripheralA, PeripheralB, ProcessorX) feed architecture design, which yields processor and peripheral HDL plus the program; architecture tuning then produces tuned cores (HDL), which go to fabrication as a single IC containing the processor, peripheral, and program]
Introduction: architecture tuning
 Examples of tuning optimizations
 Memory hierarchy: no cache, L1 cache, L1+L2 cache
 Cache organization: size, associativity, line size
 Bus structure, data/address encoding
 Microprocessor optimizations
 Internal small-loop table
 Controller partitioning
 Datapath shortcuts
 Register file copies
Introduction: Tuning is a special case
of Y-Chart iteration
 Philips/TriMedia approach of simultaneously developing
architecture and its applications
[Figure: Y-chart with Architecture, Applications, Mapping, Analysis, and Numbers; one part of the chart is labeled “Our focus”]
Problem description
 Focus of this work:
 Tuning a microcontroller to its program
 Goal is reduced power without performance loss
 Restrict tuning to maintain exact instruction set
compatibility
 No instructions may be added or deleted
 Thus, no modification to software development environment
 Also, no problems with porting software to/from other versions
of the microcontroller
 Instruction set incompatibility can be a show stopper
Previous work
 Application-specific instruction-set processors
[Fisher99]
 Customize a microprocessor to its application(s)
 e.g., Tensilica
 Customized instruction-set, requiring customized tools
 Tuning compiler to architecture [Tiwari et al 94]
 Architectural description languages to inform compiler of
architecture features [Halambi et al 99]
 Tuning cache and cache/bus organization to application [Givargis et al 99]
Tuning environment
 Currently for the 8051 microcontroller
 Starts from VHDL synthesizable model of 8051 (soft core)
 Uses Synopsys synthesis, simulation and power analysis
 Uses 8051 instruction-set simulator
 Uses numerous scripts
 Goal of the environment
 Understand how power is being consumed for a particular
application, so that modifications to the architecture (or
application) can be made to minimize that power
 Three main tools
 Architectural view
 Instruction-set view
 Program/data memory view
Tuning environment: architectural
view tool
[Figure: architectural view tool flow]
Microprocessor soft core -> RT-synthesizer -> Microprocessor structure
Program binary -> ROM generator -> ROM entity
Microprocessor structure + ROM entity -> Simulator and power analyzer -> “Flat” power data
“Flat” power data -> Structural hierarchical power data translator and xdu display

Sample display:
Module     Power (mW)
ROM        1.04
ALU        1.62
RAM        1.42
CTRL       2.69
DECODER    0.07
Total      7.66
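The per-module figures from the architectural view can be ranked to see which module dominates. The following is only an illustrative sketch: the dictionary transcribes the numbers shown on the slide, and the name `rank_modules` is hypothetical, not part of the tuning environment. Note that the five listed modules sum to 6.84 mW, less than the reported 7.66 mW total, so shares are taken against the reported total.

```python
# Per-module power (mW) transcribed from the architectural view on the slide.
module_power_mw = {
    "ROM": 1.04,
    "ALU": 1.62,
    "RAM": 1.42,
    "CTRL": 2.69,
    "DECODER": 0.07,
}
# Reported total; larger than the sum of the listed modules (6.84 mW).
total_mw = 7.66


def rank_modules(power, total):
    """Return (module, mW, share of total), largest consumer first."""
    ranked = sorted(power.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, mw, mw / total) for name, mw in ranked]


ranking = rank_modules(module_power_mw, total_mw)
for name, mw, share in ranking:
    print(f"{name:8s} {mw:5.2f} mW  {share:6.1%}")
```

Sorting by absolute power rather than by share keeps the ranking valid even if the reported total includes consumers that the view does not list.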
Tuning environment: instruction-set
view tool
[Figure: instruction-set view tool flow]
Binaries to exercise instruction 1, 2, 3, ... -> ROM generator -> ROM entity
Microprocessor structure + ROM entity -> Simulator and power analyzer -> Flat power data for instruction 1, 2, 3, ...
Flat power data -> Power data collector, structural power data translator, and xdu display

Sample display:
Instruction   Power (mW)
ADDC_1        7.340834
ADD_1         7.350741
ANL_1         6.631394
CLR_1         3.76228
CPL_1         5.481627
DA            5.28897
DEC_1         5.368807
DIV           7.716592
INC_1         4.662862
MOVC_1        6.078014
MOVC_2        5.021021
MOV_1         5.577664
MOV_2         6.164267
MUL           5.522886
NOP           4.900275
ORL_1         6.954121
POP           8.103867
PUSH          8.7116
Tuning environment: program/data
memory view tool
[Figure: program/data memory view tool flow]
Per-instruction power data + Program binary -> Instruction-set simulator -> Program/data memory access frequencies and power
Program/data memory access frequencies and power -> Program hierarchy power translator and xdu display

Sample display, program memory:
Addr    Ins     Freq   Pwr       Freq*Pwr
00000   LJMP    1      0         0
00003   MOV_9   108    5.46067   589.752
00005   MOV_9   108    5.46067   589.752
00007   MOV_9   108    5.46067   589.752
00009   MOV_9   108    5.46067   589.752
00011   RET     108    0         0
00012   MOV_9   27     5.46067   147.438
00014   MOV_9   27     5.46067   147.438
00016   MOV_9   27     5.46067   147.438
00018   MOV_9   27     5.46067   147.438
00020   MOV_4   27     4.83507   130.547
00022   LCALL   27     0         0

Sample display, data memory:
Addr    Purpose   Accesses
00128   P0        1311
00129   SP        70317
00130   DPL       31189
00131   DPH       7977
00144   P1        161
00208   PSW       413527
00224   ACC       360949
00240   B         2598
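The Freq*Pwr column of the program memory view is the product of each instruction's execution frequency and its per-instruction power. The sketch below reproduces that weighting from the values on the slide and also sums it per mnemonic. The function names are hypothetical, and the per-mnemonic totals are an added convenience, not something the slide shows.

```python
from collections import defaultdict

# (address, instruction, execution frequency, per-instruction power in mW),
# transcribed from the program memory view on the slide.
PROGRAM_ROWS = [
    (0, "LJMP", 1, 0.0),
    (3, "MOV_9", 108, 5.46067),
    (5, "MOV_9", 108, 5.46067),
    (7, "MOV_9", 108, 5.46067),
    (9, "MOV_9", 108, 5.46067),
    (11, "RET", 108, 0.0),
    (12, "MOV_9", 27, 5.46067),
    (14, "MOV_9", 27, 5.46067),
    (16, "MOV_9", 27, 5.46067),
    (18, "MOV_9", 27, 5.46067),
    (20, "MOV_4", 27, 4.83507),
    (22, "LCALL", 27, 0.0),
]


def freq_times_power(rows):
    """Weight each instruction's power by how often it executes."""
    return [(addr, ins, freq * pwr) for addr, ins, freq, pwr in rows]


def totals_by_instruction(rows):
    """Sum the Freq*Pwr weights per instruction mnemonic."""
    totals = defaultdict(float)
    for _addr, ins, freq, pwr in rows:
        totals[ins] += freq * pwr
    return dict(totals)


weights = freq_times_power(PROGRAM_ROWS)
totals = totals_by_instruction(PROGRAM_ROWS)
for addr, ins, weight in weights:
    print(f"{addr:05d} {ins:6s} {weight:9.3f}")
```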
Tuning environment
[Figure: the three tools, their inputs and outputs]
Inputs: Program binary, Microprocessor core

Tool                               Time        Output
Program/data memory view tool      seconds     Program power data
Architectural view tool            1 hour      Architecture power data
Instruction-set power view tool    1 day       Instruction-set power data
Design flow using the tuning
environment
[Figure: design flow]
Change application -> Run program/data memory view tool
Change architecture -> Run architecture view tool -> Run instruction-set view tool
Satisfied? No: make another change. Yes: DONE
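The flow above is an iterate-until-satisfied loop: make a change, re-run the relevant view tools, and stop once the result is acceptable. A minimal sketch of that loop follows. It makes several assumptions that are not in the slides: changes are tried one at a time and kept only if power drops (a greedy policy), `measure_mw` stands in for re-running the view tools, and the change name and the 7.0 mW target are made up. The only measured values used are the 7.67 mW and 7.27 mW reported for the sample optimization.

```python
def tune(baseline_mw, candidates, measure_mw, target_mw):
    """Greedy sketch of the change / measure / "Satisfied?" loop.

    candidates: ordered list of change names to try (hypothetical labels).
    measure_mw: callable taking the list of applied changes and returning
                total power in mW (stands in for re-running the view tools).
    """
    applied, current = [], baseline_mw
    for change in candidates:
        if current <= target_mw:      # "Satisfied?" -> Yes -> DONE
            break
        trial = measure_mw(applied + [change])
        if trial < current:           # keep only changes that reduce power
            applied.append(change)
            current = trial
    return applied, current


def fake_measure(changes):
    """Stand-in for the view tools, using the reported before/after pair."""
    return 7.27 if "acc_as_ctrl_register" in changes else 7.67


# target_mw=7.0 is an arbitrary illustrative goal, not a figure from the slides.
applied, final_mw = tune(7.67, ["acc_as_ctrl_register"], fake_measure, target_mw=7.0)
print(applied, final_mw)
```

In practice the cost of each measurement differs widely between tools (seconds to a day), so a real loop would likely run the fast program/data memory view far more often than the other two.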
Sample tuning optimization
 Observation
 RAM consumes much power
 Address 224 accessed frequently
 Possible tuning optimization
 Replace this RAM location by a register inside the CTRL module
 Steps
 Modify VHDL model
 Run all three view tools
 Results
 Power reduction: 7.67 to 7.27 mW

Architectural view:
Module     Power (mW)
ROM        1.04
ALU        1.62
RAM        1.42
CTRL       2.69
DECODER    0.07
Total      7.66

Data memory view:
Addr    Purpose   Accesses
00128   P0        1311
00129   SP        70317
00130   DPL       31189
00131   DPH       7977
00144   P1        161
00208   PSW       413527
00224   ACC       360949
00240   B         2598
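The observation and result on this slide can be checked with a little arithmetic: rank the data memory locations by access count, and express the reported power change as a fraction. The sketch below does both, using only numbers from the slide. The function names are hypothetical. The two most-accessed locations in this table are PSW (208) and ACC (224), and the reported drop from 7.67 to 7.27 mW is roughly a 5% saving.

```python
# (address, purpose, access count) transcribed from the data memory view.
DATA_ROWS = [
    (128, "P0", 1311),
    (129, "SP", 70317),
    (130, "DPL", 31189),
    (131, "DPH", 7977),
    (144, "P1", 161),
    (208, "PSW", 413527),
    (224, "ACC", 360949),
    (240, "B", 2598),
]


def most_accessed(rows, n=2):
    """Return the n rows with the highest access counts."""
    return sorted(rows, key=lambda row: row[2], reverse=True)[:n]


def relative_reduction(before_mw, after_mw):
    """Fractional power saving when going from before_mw to after_mw."""
    return (before_mw - after_mw) / before_mw


top = most_accessed(DATA_ROWS)
saving = relative_reduction(7.67, 7.27)
print(top)
print(f"{saving:.1%}")
```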
Some recent data
 Applied the tuning environment to a particular application
 Converted two frequently accessed RAM locations to registers
 15% total power savings
 Introduced datapath shortcuts for the two most common register-to-register moves of the application, thus bypassing the ALU
 10% total power savings
 Partitioned the controller into two, with a small one implementing the frequently executed instructions
 10-15% power savings, but we expect much more with a better partitioning of the design
Conclusions
 Described an environment for tuning a microprocessor
to its application for low power
 Full instruction set compatibility
 Multiple views help find power hogs
 Fully automated
 Focus is now on developing tuning optimizations
 Controller partitioning, small-loop table, datapath shortcuts,
register-file copies, etc.
 Investigate possibility of automating tuning optimizations,
develop more general tuning methodology
 Environment for the 8051 is available on the web:
 http://www.cs.ucr.edu/~dalton