Architectural Exploration:
Area-Power tradeoff in 802.11a
transmitter design
Arvind
Computer Science & Artificial Intelligence Lab
Massachusetts Institute of Technology
February 16, 2007
http://csg.csail.mit.edu/6.375/
L05-1
This lecture has two
purposes
Illustrate how area-power tradeoff can
be studied at a high-level for a realistic
design

Example: 802.11a transmitter
Illustrate some features of BSV




Static elaboration
Combinational circuits
Simple synchronous pipelines
Valid bits as the Maybe type in BSV
No prior understanding of 802.11a is
necessary to follow this lecture
February 16, 2007
http://csg.csail.mit.edu/6.375/
L05-2
Bluespec: Two-Level Compilation
Bluespec
(Objects, Types,
Higher-order functions)
Lennart Augustsson
@Sandburst 2000-2002
• Type checking
• Massive partial evaluation
and static elaboration
Level 1 compilation
Now we call this
Guarded Atomic
Actions
Rules and Actions
(Term Rewriting System)
• Rule conflict analysis
• Rule scheduling
Level 2 synthesis
Object code
(Verilog/C)
February 16, 2007
James Hoe & Arvind
@MIT 1997-2000
http://csg.csail.mit.edu/6.375/
L05-3
Static Elaboration
At compile time



Inline function calls and datatypes
Instantiate modules with specific parameters
Resolve polymorphism/overloading
Software
Toolflow: source
elaborate
w/params
Hardware
source Toolflow:
compile
run w/
params
run1
February 16, 2007
.exe
run2
run w/
params
run1
run3
run1
…
design1 design2 design3
run1.1
run1
run1
…
http://csg.csail.mit.edu/6.375/
run2.1
run1
run1
…
run3.1
run1
run1
…
L05-4
802.11a Transmitter Overview
headers
24
Uncoded
bits
Controller
data
Scrambler
Interleaver
Mapper
Cyclic
Extend
IFFT
IFFT Transforms 64 (frequency domain)
complex numbers into 64 (time domain)
complex numbers
February 16, 2007
Encoder
Must produce
one OFDM
symbol every
4 msec
Depending
upon the
transmission
rate,
consumes 1,
2 or 4 tokens
to produce
one OFDM
symbol
One OFDM symbol
(64 Complex Numbers)
http://csg.csail.mit.edu/6.375/
accounts for 85% area
L05-5
Preliminary results
[MEMOCODE 2006] Dave, Gerding, Pellauer, Arvind
Design
Block
Controller
Scrambler
Conv. Encoder
Interleaver
Mapper
IFFT
Cyc. Extender
Lines of
Code (BSV)
49
40
113
76
112
95
23
Relative
Area
0%
0%
0%
1%
11%
85%
3%
Complex arithmetic libraries constitute another 200
lines of code
February 16, 2007
http://csg.csail.mit.edu/6.375/
L05-6
Combinational IFFT
in0
out0
in1
Bfly4
Bfly4
in3
x16
in4
Bfly4
…
Bfly4
Bfly4
…
t0
t1
t2
t
February 16,32007
+
-
-
*
+
+
*
-
*j
-
out3
out4
…
out63
+
*
…
out2
Bfly4
in63
*
Bfly4
out1
Permute_3
Bfly4
Permute_2
Permute_1
in2
Bfly4
All numbers are complex
and represented as two
sixteen bit quantities.
Fixed-point arithmetic is
used to reduce area,
power, ...
http://csg.csail.mit.edu/6.375/
L05-7
4-way Butterfly Node
t0
t1
t2
t3
k0
k1
k2
k3
*
+
+
*
-
-
*
+
+
*
-
*i
-
function Vector#(4,Complex) Bfly4
(Vector#(4,Complex) t,
Vector#(4,Complex) k);
BSV has a very strong notion of types


February 16, 2007
Every expression has a type. Either it is declared by
the user or automatically deduced by the compiler
The compiler verifies that the type declarations are
compatible
http://csg.csail.mit.edu/6.375/
L05-8
BSV code: 4-way Butterfly
function Vector#(4,Complex) Bfly4
(Vector#(4,Complex) t,
Vector#(4,Complex) k);
Vector#(4,Complex) m = newVector(),
y = newVector(),
z = newVector();
m[0] = k[0] * t[0]; m[1] = k[1] * t[1];
m[2] = k[2] * t[2]; m[3] = k[3] * t[3];
y[0] = m[0] + m[2]; y[1] = m[0] – m[2];
y[2] = m[1] + m[3]; y[3] = i*(m[1] – m[3]);
z[0] = y[0] + y[2]; z[1] = y[1] + y[3];
z[2] = y[0] – y[2]; z[3] = y[1] – y[3];
return(z);
endfunction
Note: Vector does not mean storage
February 16, 2007
http://csg.csail.mit.edu/6.375/
*
+
+
*
-
-
*
+
+
*
m
*i
-
y
z
Polymorphic code:
works on any type
of numbers for
which *, + and have been defined
L05-9
Combinational IFFT
in0
in1
…
x16
Bfly4
Bfly4
…
Bfly4
Bfly4
…
Bfly4
in63
out1
Permute_3
in4
Bfly4
Bfly4
Permute_2
in3
Bfly4
Bfly4
Permute_1
in2
out0
out2
out3
out4
…
out63
stage_f function
repeat it three times
February 16, 2007
http://csg.csail.mit.edu/6.375/
L05-10
BSV Code: Combinational
IFFT
function SVector#(64, Complex) ifft
(SVector#(64, Complex) in_data);
//Declare vectors
SVector#(4,SVector#(64, Complex)) stage_data =
replicate(newSVector);
stage_data[0] = in_data;
for (Integer stage = 0; stage < 3; stage = stage + 1)
stage_data[stage+1] = stage_f(stage,stage_data[stage]);
return(stage_data[3]);
The for loop is unfolded and stage_f
is inlined during static elaboration
Note: no notion of loops or procedures during execution
February 16, 2007
http://csg.csail.mit.edu/6.375/
L05-11
BSV Code: Combinational
IFFT- Unfolded
function SVector#(64, Complex) ifft
(SVector#(64, Complex) in_data);
//Declare vectors
SVector#(4,SVector#(64, Complex)) stage_data =
replicate(newSVector);
stage_data[0] = in_data;
for
(Integer stage
= 0; stage < 3; stage = stage + 1)
stage_data[1]
= stage_f(0,stage_data[0]);
stage_data[stage+1]
= stage_f(stage,stage_data[stage]);
stage_data[2]
= stage_f(1,stage_data[1]);
stage_data[3] = stage_f(2,stage_data[2]);
return(stage_data[3]);
Stage_f can be inlined now; it could have been inlined
before loop unfolding also.
Does the order matter?
February 16, 2007
http://csg.csail.mit.edu/6.375/
L05-12
Bluespec Code for stage_f
function SVector#(64, Complex) stage_f
(Bit#(2) stage, SVector#(64, Complex) stage_in);
begin
for (Integer i = 0; i < 16; i = i + 1)
begin
Integer idx = i * 4;
let twid = getTwiddle(stage, fromInteger(i));
let y = bfly4(twid, stage_in[idx:idx+3]);
stage_temp[idx]
= y[0]; stage_temp[idx+1] = y[1];
stage_temp[idx+2] = y[2]; stage_temp[idx+3] = y[3];
end
//Permutation
for (Integer i = 0; i < 64; i = i + 1)
stage_out[i] = stage_temp[permute[i]];
end
return(stage_out);
February 16, 2007
http://csg.csail.mit.edu/6.375/
L05-13
Architectural Exploration
February 16, 2007
http://csg.csail.mit.edu/6.375/
L05-14
Design Alternatives
Reuse a block over multiple cycles
f
f
g
f
g
we expect:
Throughput to decrease – less parallelism
Area to decrease – reusing a block
Energy/unit work ?
more on power
issues later
February 16, 2007
The clock needs to run faster for the
same throughput  hyper-linear
increase in energy
http://csg.csail.mit.edu/6.375/
L05-15
Combinational IFFT
Opportunity for reuse
in0
in1
…
x16
Bfly4
Bfly4
…
Bfly4
Bfly4
…
Bfly4
in63
out1
Permute_3
in4
Bfly4
Bfly4
Permute_2
in3
Bfly4
Bfly4
Permute_1
in2
out0
out2
out3
out4
…
out63
Reuse the same circuit three times
February 16, 2007
http://csg.csail.mit.edu/6.375/
L05-16
Circular pipeline: Reusing the
Pipeline Stage
in0
Bfly4
…
in3
Bfly4
in4
in63
February 16, 2007
out3
out4
out63
Stage
Counter
Permute_3
16 Bfly4s can be shared
but not the three
permutations. Hence the
need for muxes
out2
…
Permute_2
…
out1
64, 4-way
Muxes
in2
Permute_1
in1
out0
http://csg.csail.mit.edu/6.375/
L05-17
Superfolded circular pipeline:
Just one Bfly-4 node!
in0
in4
Permute_3
February 16, 2007
http://csg.csail.mit.edu/6.375/
out2
out3
out4
…
Permute_2
in63
Index
Counter
0 to 15
4, 16-way
DeMuxes
…
out1
64, 4-way
Muxes
in3
Bfly4
Permute_1
in2
4, 16-way
Muxes
in1
out0
out63
Stage
Counter
0 to 2
L05-18
Algorithmic Improvements
in0
in1
…
x16
Bfly4
Bfly4
…
Bfly4
Bfly4
…
Bfly4
in63
out1
Permute_3
in4
Bfly4
Bfly4
Permute_2
in3
Bfly4
Bfly-4
Permute_1
in2
out0
out2
out3
out4
…
out63
1. All the three permutations can be made identical
 more saving in area in the folded case
2. One multiplication can be removed from Bfly-4
February 16, 2007
http://csg.csail.mit.edu/6.375/
L05-19
Area improvements because
of change in Algorithm
Design
Combinational
Simple Pipe
Folded Pipe
February 16, 2007
Old Area
(mm2)
4.69
5.14
5.89
http://csg.csail.mit.edu/6.375/
New Area
(mm2)
4.91
5.25
3.97
L05-20
Which design consumes the least
energy to transmit a symbol?
Can we quickly code up all the
alternatives?

single source with parameters?
Not practical in traditional hardware
description languages like Verilog/VHDL
5-minute break to stretch you legs
February 16, 2007
http://csg.csail.mit.edu/6.375/
L05-21
Pipelining a block
f1
C
f2
f3
inQ
P
Combinational
outQ
f1
f2
Pipeline
f3
inQ
outQ
f
FP
inQ
Clock? C < P  FP
Clock:
February 16, 2007
outQ
Area? FP < C < P
Area:
http://csg.csail.mit.edu/6.375/
Folded
Pipeline
Throughput? FP < C < P
Throughput:
L05-22
Synchronous pipeline
f1
f2
f3
x
inQ
sReg1
sReg2
rule sync-pipeline (True);
inQ.deq();
sReg1 <= f1(inQ.first());
sReg2 <= f2(sReg1);
outQ.enq(f3(sReg2));
endrule
This is real IFFT code; just
replace f1, f2 and f3 with stage_f
code
February 16, 2007
outQ
This rule can fire only if
- inQ has an element
- outQ has space
Atomicity: Either all or
none of the state
elements inQ, outQ,
sReg1 and sReg2 will be
updated
http://csg.csail.mit.edu/6.375/
L05-23
Stage functions f1, f2 and f3
function f1(x);
return (stage_f(1,x));
endfunction
function f2(x);
return (stage_f(2,x));
endfunction
The stage_f
fucntion is
given on
slide 12
function f3(x);
return (stage_f(3,x));
endfunction
February 16, 2007
http://csg.csail.mit.edu/6.375/
L05-24
Problem: What about
pipeline bubbles?
f1
f2
f3
x
inQ
sReg1
sReg2
rule sync-pipeline (True);
inQ.deq();
sReg1 <= f1(inQ.first());
sReg2 <= f2(sReg1);
outQ.enq(f3(sReg2));
endrule
outQ
Red and Green tokens
must move even if there
is nothing in the inQ!
Also if there is no token in
sReg2 then nothing
should be enqueued in
the outQ
Modify the rule to deal with these conditions
February 16, 2007
http://csg.csail.mit.edu/6.375/
Valid bits or
the Maybe type
L05-25
The Maybe type data in the
pipeline
typedef union tagged {
void Invalid;
data_T Valid;
} Maybe#(type data_T);
data
valid/invalid
Registers contain Maybe
type values
rule sync-pipeline (True);
if (inQ.notEmpty())
begin sReg1 <= Valid f1(inQ.first()); inq.deq(); end
else sReg1 <= Invalid;
case (sReg1) matches
tagged Valid .sx1: sReg2 <= Valid f2(sx1);
tagged Invalid:
sReg2 <= Invalid;
case (sReg2) matches
tagged Valid .sx2: outQ.enq(f3(sx2));
endrule
February 16, 2007
http://csg.csail.mit.edu/6.375/
L05-26
Folded pipeline
The same code will work
for superfolded pipelines
by changing n and stage
function f
f
x
inQ
stage
sReg
outQ
rule folded-pipeline (True);
if (stage==1)
begin sxIn= inQ.first(); inQ.deq(); end
else
sxIn= sReg;
notice stage
sxOut = f(stage,sxIn);
is a dynamic
if (stage==n) outQ.enq(sxOut);
parameter
else sReg <= sxOut;
now!
stage <= (stage==n)? 1 : stage+1;
endrule
February 16, 2007
http://csg.csail.mit.edu/6.375/
L05-27
802.11a Transmitter Synthesis
results (Only the IFFT block is changing)
The same
source
code
IFFT Design
Area
(mm2)
Throughput
Latency
(CLKs/sym)
Min. Freq
Required
Pipelined
5.25
04
1.0 MHz
Combinational
4.91
04
1.0 MHz
Folded
(16 Bfly-4s)
3.97
04
1.0 MHz
Super-Folded
(8 Bfly-4s)
3.69
06
1.5 MHz
SF(4 Bfly-4s)
2.45
12
3.0 MHz
SF(2 Bfly-4s)
1.84
24
6.0 MHz
SF (1 Bfly4)
1.52
48
12 MHZ
All these
designs
were done
in less than
24 hours!
TSMC .18 micron; numbers reported are before place and route.
February 16, 2007
http://csg.csail.mit.edu/6.375/
L05-28
Why are the areas so similar
Folding should have given a 3x
improvement in IFFT area
BUT a constant twiddle allows lowlevel optimization on a Bfly-4 block

February 16, 2007
a 2.5x area reduction!
http://csg.csail.mit.edu/6.375/
L05-29
Summary
It is essential to do architectural
exploration for better (area, power,
performance, ...) designs.
It is possible to do so with new design
tools and methodologies, i.e., Bluespec
Better and faster tools for estimating
area, timing and power would
dramatically increase our capability to
do architectural exploration.
February 16, 2007
http://csg.csail.mit.edu/6.375/
L05-30
Bluespec Learnings
How to write highly parameterized
combinational codes
How to write rules for simple
synchronous pipelines
Effect of dynamic vs static values on
generated circuits
Using Maybe types to express
valid/invalid data
Thanks
February 16, 2007
http://csg.csail.mit.edu/6.375/
L05-31
Backup slides
February 16, 2007
http://csg.csail.mit.edu/6.375/
L05-32
Function f for the folded pipeline is
the same stage_f function but ...
function SVector#(64, Complex) stage_f
(Bit#(2) stage, SVector#(64, Complex) stage_in);
begin
for (Integer i = 0; i < 16; i = i + 1)
begin
Integer idx = i * 4;
let twid = getTwiddle(stage, fromInteger(i));
let y = bfly4(twid, stage_in[idx:idx+3]);
stage_temp[idx]
= y[0]; stage_temp[idx+1] = y[1];
stage_temp[idx+2] = y[2]; stage_temp[idx+3] = y[3];
end
will cause a
//Permutation
mux to be
for (Integer i = 0; i < 64; i = i + 1)
generated
stage_out[i] = stage_temp[permute[i]];
end
return(stage_out);
February 16, 2007
http://csg.csail.mit.edu/6.375/
L05-33
Folded pipeline:
stage function f
getTwiddle
getTwiddle2
twid
getTwiddle3
The rest of
stage_f, i.e.
Bfly-4s and
permutations
(shared)
stage
sx
February 16, 2007
http://csg.csail.mit.edu/6.375/
L05-34
Function f for the Superfolded
pipeline (One Bfly-4 case)
f will be invoked for 48 dynamic
values of stage


February 16, 2007
each invocation will modify 4
numbers in sReg
after 16 invocations a permutation
would be done on the whole sReg
http://csg.csail.mit.edu/6.375/
L05-35
Code for the Superfolded
pipeline stage function
function SVector#(64, Complex) f
(Bit#(6) stage, SVector#(64, Complex) stage_in);
begin
let idx = stage `mod` 16;
let twid = getTwiddle(stage `div` 16, idx);
let y = bfly4(twid, stage_in[idx:idx+3]);
stage_temp = stage_in;
stage_temp[idx]
= y[0];
stage_temp[idx+1] = y[1];
stage_temp[idx+2] = y[2];
stage_temp[idx+3] = y[3];
One Bfly-4 case
for (Integer i = 0; i < 64; i = i + 1)
stage_out[i] = stage_temp[permute[i]];
end
return((idx == 15) ? stage_out: stage_temp);
February 16, 2007
http://csg.csail.mit.edu/6.375/
L05-36
Experimental Results
Nirav Dave, Mike Pellauer, Steve Gerding,
Arvind
MEMOCODE 2006
February 16, 2007
http://csg.csail.mit.edu/6.375/
L05-37
Expressing these designs in
Bluespec was easy
All these designs
were done in less
than one day!
Combinational
Pipelined
Folded (16 Bfly-4s)
Designers were
experts in
Bluespec
Super-Folded (8 Bfly-4s)
Area and power
estimates?
Super-Folded (2 Bfly-4s)

Super-Folded (4 Bfly-4s)
Super-Folded (1 Bfly-4)
The same source code
February 16, 2007
http://csg.csail.mit.edu/6.375/
L05-38
Bluespec Tool flow
Bluespec SystemVerilog source
Blueview
Bluespec Compiler
Verilog 95 RTL
C
Bluesim
Cycle
Accurate
Verilog sim
VCD output
Legend
files
Bluespec tools
3rd party tools
February 16, 2007
Sequence Design
PowerTheater
Debussy
Visualization
RTL synthesis
gates
Power
estimatio
n tool
http://csg.csail.mit.edu/6.375/
FPGA
L05-39
802.11a Transmitter Synthesis
results (Only the IFFT block is changing)
IFFT Design
Area
(mm2)
Symbol
Latency
(CLKs)
Throughput
Latency
(CLKs/sym)
Min. Freq
Required
Average
Power
(mW)
Pipelined
5.25
12
04
1.0 MHz
4.92
Combinational
4.91
10
04
1.0 MHz
3.99
Folded
(16 Bfly-4s)
3.97
12
04
1.0 MHz
7.27
Super-Folded
(8 Bfly-4s)
3.69
15
06
1.5 MHz
10.9
SF(4 Bfly-4s)
2.45
21
12
3.0 MHz
14.4
SF(2 Bfly-4s)
1.84
33
24
6.0 MHz
21.1
SF (1 Bfly4)
1.52
57
48
12 MHZ
34.6
TSMC .18 micron; numbers reported are before place and route.
(DesignCompiler), Power
numbers are from Sequence PowerTheater
http://csg.csail.mit.edu/6.375/
L05-40
February 16, 2007
Power can be reduced further
Right now all blocks in the
transmitter run on the same clock
 if we run IFFT faster then all other
blocks also run faster
Bluespec has facilities for Multiple
Clock Domains and the design can
be easily modified to run the
earlier blocks at a lower clock rate
February 16, 2007
http://csg.csail.mit.edu/6.375/
L05-41
Descargar

Bluespec technical deep dive - Massachusetts Institute of