Embedded Computer Architecture
ASIP
Application Specific Instruction-set Processor
5KK73
Bart Mesman and Henk Corporaal
Application domain specific processors (ADSP or ASIP)
[Figure: processor classes placed along the trade-off between flexibility and efficiency: programmable CPU, programmable DSP, application domain specific processor, application specific processor.]
10/4/2015
Embedded Computer Architecture
H.Corporaal and B. Mesman
Application domain specific processors (ADSP or ASIP)
takes a well defined application domain as a starting point
• exploits characteristics of the domain (computation kernels)
• still programmable within the domain
e.g. MPEG2 coding uses 8*8 DCT transform, DECT, GSM etc ...
GP implementation (appl. domain -> GP)
• performance: clock speed + ILP
• flexible dev. (new apps.)
• problems: manual design, large effort
ADSP implementation (appl. domain -> ADSP)
• ILP, DLP, tuning to domain
• cost effective (high volume)
- specification
- design time and effort
=> synthesized cores
www.adelantetech.com
Columns: Part, Description, Clock (MHz), Size (gates), ROM (Kbyte), RAM (Kbyte)

Speech Components
ADPCM: Full duplex ITU-T G.726 compliant and 40 kbit/s speech-compression encoder/decoder.
    Clock 4; size 5,100; ROM 1.3; RAM 0.128
ADPCM-16: Full duplex 16 Channel ITU-T G.726 compliant 16, 24, 32 and 40 kbit/s speech-compression encoder/decoder.
    Clock 32; size 10,200; ROM 1.3; RAM 2.048
IW-ASR Speech Recognition: Template-based speaker-dependent, isolated-word automatic speech recognition.
    Clock 1.3; size 9,000; ROM 6; RAM approx. 1 kbyte/word
G.723.1: Low bit-rate ITU-T G.723.1 compliant speech-compression at 6.3 kbit/s; can be combined with G.723.1A.
    Clock 20; size 24,000; ROM 22; RAM 2.3
G.723.1A: Extended version of G.723.1 to reduce bit rate by a silence compression scheme. Uses voice activity detection and comfort-noise generation. Fully compliant with Annex A of speech-compression standard CODEC G.723.1. Yields no additional hardware cost.
    Clock 20; size 24,000; ROM 22; RAM 2.3
Speech Synthesis: Phrase-concatenated speech synthesis.
    Depends on compression requirements

Telecommunications
Echo Cancellation: High-performance echo-cancellation and suppression processor.
    Clock 4; size 6,000; ROM 2.80; RAM 0.15
DTMF: Full-duplex DTMF transceiver.
    Clock 2; size 4,000; ROM 1.00; RAM 0.15
Caller-ID: On-hook and off-hook caller line identification. Includes DTMF and V.23.
    Size 6,000; ROM 2.10; RAM 0.15
Reed-Solomon: Full-duplex Reed-Solomon codec.
    Size 7,000; ROM 3.75; RAM 0.15
Viterbi Decoder: Configurable rate, code and constraint-length. Configurable traceback depth. Supports soft & hard decision making. Supports code puncturing.
    Clock depending on throughput; size 5,000 to 9,000; ROM/RAM ---
V.23 modem: ITU-T V23 compliant 1200 baud FSK modem.
    Size 6,000; ROM 0.80; RAM 0.15

Other
Pink Noise Generator: Low-ripple pink noise filter with filter characteristic of -3 ± 0.08 dB per octave over the bandwidth 20 Hz to 20 kHz.
    Size 4,000; ROM 0.10; RAM 0.10
CCIR 656/601 Digital video converter: CCIR to raw-video data and vice versa.
    Clock 3 (?); size 1,500; ROM none; RAM none
Design process
3 phases
1. exploration
2. hw design (layout) + processing
3. design appl. sw

Exploration loop (fast, accurate and early feedback):
• inputs: application(s) and an instance of the processor model (e.g. VLIW with shared RFs) with its parameters
• SW (code generation) -> estimations: cycles/alg, occupation
• HW design -> estimations: nsec/cycle, area, power/instr
• OK? no: change the instance and repeat; yes: more appl.?
• more appl.? yes: repeat with the next application; no: go to phase 2
ASIP/VLIW architectures: list scheduling
[Figure: list scheduling example. A data-flow graph of multiplications (*) and additions (+) is scheduled step by step onto the IPB, MULT, ALU and OPB resources. Each step shows the candidate list, the conflict & priority computation, and the scheduled operation.]
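As a rough illustration of the list-scheduling idea named in the slide title, here is a minimal sketch in C. It is not the scheduler used in the course tools. The machine model (one ALU and one MULT, unit latency), the `op_t` record and the function name `list_schedule` are all assumptions made for this example.

```c
/* Hypothetical machine: one unit of each kind, every operation takes one cycle. */
enum { FU_ALU = 0, FU_MULT = 1, FU_KINDS = 2 };
#define MAX_PREDS 2

typedef struct {
    int fu;               /* functional-unit kind the operation needs */
    int npreds;           /* number of producer operations */
    int pred[MAX_PREDS];  /* indices of the producers */
    int priority;         /* larger = more urgent, e.g. path length to the sink */
} op_t;

/* Fills cycle[i] with the issue cycle of operation i and returns the
 * schedule length in cycles. The dependence graph must be acyclic. */
int list_schedule(const op_t *ops, int n, int *cycle)
{
    int scheduled = 0;
    int t = 0;

    for (int i = 0; i < n; i++)
        cycle[i] = -1;                      /* -1 = not scheduled yet */

    while (scheduled < n) {
        int pick[FU_KINDS];
        for (int k = 0; k < FU_KINDS; k++)
            pick[k] = -1;

        /* Build the candidate list and resolve resource conflicts by priority. */
        for (int i = 0; i < n; i++) {
            if (cycle[i] >= 0)
                continue;                   /* already scheduled */
            int ready = 1;
            for (int p = 0; p < ops[i].npreds; p++) {
                int c = cycle[ops[i].pred[p]];
                if (c < 0 || c >= t)
                    ready = 0;              /* producer result not available yet */
            }
            if (!ready)
                continue;
            int k = ops[i].fu;
            if (pick[k] < 0 || ops[i].priority > ops[pick[k]].priority)
                pick[k] = i;                /* highest priority wins the unit */
        }

        /* Commit the winners of this cycle. */
        for (int k = 0; k < FU_KINDS; k++) {
            if (pick[k] >= 0) {
                cycle[pick[k]] = t;
                scheduled++;
            }
        }
        t++;
    }
    return t;
}
```

For y = a*b + c*d on this machine, the two multiplications compete for the single MULT unit, so the addition cannot issue before the third cycle.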
Application examples (1)
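[Figure: FIR filter with five taps. A delay line (Z^-1 elements) holds x4 ... x0; each sample is multiplied by its coefficient c4 ... c0 and the products are summed into y.]

    #define NTAPS 4                      /* filter order: NTAPS+1 taps (x0..x4) */

    int fir(int in)
    {
        int i;
        static int state[NTAPS + 1];     /* delay line, newest sample at index NTAPS */
        static int coeff[NTAPS + 1];     /* coefficients c0..c4 */
        int out[NTAPS + 1];              /* running partial sums */

        state[NTAPS] = in;
        out[0] = state[0] * coeff[0];
        for (i = 1; i < NTAPS + 1; i++) {
            out[i] = out[i-1] + state[i] * coeff[i];
            state[i-1] = state[i];       /* shift the delay line */
        }
        return out[NTAPS];
    }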
Application examples (1)
.L1000006:
    sll    $3,  $2,  2          # R3 = R2 << 2            (R3 = i-1)
    addu   $14, $15, $3         # R14 = R15 + R3
    lw     $24, 0($14)          # R24 = load(*R14)        (R24 = coeff[i-1])
    addiu  $12, $6,  -4         # R12 = R6 - 4
    addu   $11, $12, $3         # R11 = R12 + R3
    lw     $13, 0($11)          # R13 = load(*R11)        (R13 = state[i-1])
    nop
    mult   $24, $13             # HI/LO = R24 * R13
    addu   $25, $sp, $3         # R25 = sp + R3
    lw     $9,  -4($25)         # R9 = load(R25-4)        (R9 = out[i-1])
    addiu  $2,  $2,  1          # R2 = R2 + 1             (i = i+1)
    mflo   $13                  # R13 = move from low mpy reg
    addu   $10, $9,  $13        # R10 = R9 + R13          (R10 = out[i])
    sw     $10, 0($25)          # mem(*R25) = R10
    addu   $25, $7,  $3         # R25 = R7 + R3
    sw     $24, 0($25)          # mem(*R25) = R24
    slti   $24, $2,  10         # R24 = (R2 < 10)
    bne    $24, $0,  .L1000006  # loop while R24 != 0
    addiu  $15, $7,  -4         # R15 = R7 - 4 (branch delay slot)

19 instructions per tap!!
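To make the contrast with the 19-instruction loop body concrete, the same filter can be written so that each tap is one multiply-accumulate plus a delay-line shift. That pair is what a domain-specific datapath would typically implement directly. This is only a sketch: the `fir_t` structure and the name `fir_step` are assumptions for the example, and the coefficients are passed in explicitly rather than kept in static arrays.

```c
#include <stdint.h>

#define NTAPS 4   /* filter order: NTAPS+1 taps, as in the slide's figure */

typedef struct {
    int32_t state[NTAPS + 1];   /* delay line, newest sample at index NTAPS */
    int32_t coeff[NTAPS + 1];   /* filter coefficients */
} fir_t;

/* Processes one input sample and returns one output sample. */
int32_t fir_step(fir_t *f, int32_t in)
{
    int32_t acc = 0;

    f->state[NTAPS] = in;
    for (int i = 0; i <= NTAPS; i++) {
        acc += f->state[i] * f->coeff[i];     /* one MAC per tap */
        if (i > 0)
            f->state[i - 1] = f->state[i];    /* shift the delay line */
    }
    return acc;
}
```

Feeding a unit impulse returns the coefficients one per call, newest-tap coefficient first, which is a quick way to check the tap ordering.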
Application examples (2)
Bit level operations: finite field arithmetic

    temp1 = input << 1
    temp2 = if (bit(input,7) == 1) then 29 else 0
    out   = temp1 exor temp2

On a RISC processor:

             r1 = LB input             Load byte
             r2 = SLL r1               Shift left logical
             r3 = ANDI r1, mask        AND immediate
             r4 = ADDI r3, -1          ADD immediate
             BNE (r4 != r0) nonzero    Branch on != to nonzero
             nop
             r5 = XORI(r2, 29)         Exclusive or immediate
             J common                  Jump
             nop
    nonzero: r5 = XOR(r2, r0)          Exclusive OR
    common:  …

10 instructions!!
Very simple in hardware
[Figure: hardware implementation. Inputs in[0] ... in[7] are wired to outputs out[0] ... out[7] through three exor gates.]
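The three-line pseudocode on this slide translates almost literally into C. The sketch below assumes an 8-bit field element; the function name `gf256_mul2` is made up for the example. The constant 29 (0x1D) is taken from the slide.

```c
#include <stdint.h>

/* Multiplies an 8-bit field element by 2: shift left by one and, if the
 * bit shifted out was set, reduce by XOR-ing with the constant 29. */
uint8_t gf256_mul2(uint8_t in)
{
    uint8_t temp1 = (uint8_t)(in << 1);              /* temp1 = input << 1 */
    uint8_t temp2 = (in & 0x80u) ? 29u : 0u;         /* bit 7 set -> 29, else 0 */
    return (uint8_t)(temp1 ^ temp2);                 /* out = temp1 exor temp2 */
}
```

In hardware the shift is pure wiring and the conditional XOR needs only a few gates. This is the point of the "very simple in hardware" remark.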
Application examples (2)
Bit level operations : DES example
Bits 27, 26, 25, 23, 22 and 20 of the source register ($2) are gathered into bits 7 6 5 4 3 2 of the destination register ($24).

    srl   $13, $2,  20
    andi  $25, $13, 1
    srl   $14, $2,  21
    andi  $24, $14, 6
    or    $15, $25, $24
    srl   $13, $2,  22
    andi  $14, $13, 56
    or    $25, $15, $14
    sll   $24, $25, 2
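The nine-instruction sequence for the DES example can be mirrored in C to check which bits end up where. This is a sketch that follows the shift and mask constants of the listing (20/1, 21/6, 22/56, then a left shift by 2). The function name `des_gather_bits` is an assumption.

```c
#include <stdint.h>

/* Gathers source bits 20, 22..23 and 25..27 into destination bits 2..7. */
uint32_t des_gather_bits(uint32_t src)
{
    uint32_t low  = (src >> 20) & 0x01u;   /* bit 20      -> bit 0     */
    uint32_t mid  = (src >> 21) & 0x06u;   /* bits 22..23 -> bits 1..2 */
    uint32_t high = (src >> 22) & 0x38u;   /* bits 25..27 -> bits 3..5 */
    return (low | mid | high) << 2;        /* final position: bits 2..7 */
}
```

A fixed bit permutation like this costs nothing but wires in a dedicated functional unit, while a general-purpose instruction set needs a shift and a mask per group of adjacent bits.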
Application examples (2)
Bit level operations : A5 example (GSM encryption)
Bits 18, 17, 16 and 13 of register $5 are XORed together; the result is a single bit in $13.

    srl   $24, $5,  18
    srl   $25, $5,  17
    xor   $8,  $24, $25
    srl   $9,  $5,  16
    xor   $10, $8,  $9
    srl   $11, $5,  13
    xor   $12, $10, $11
    andi  $13, $12, 1
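The A5 listing computes the feedback bit of a linear feedback shift register. A C sketch of that computation follows. The tap positions 18, 17, 16 and 13 come from the listing. The 19-bit register width in `lfsr_step` and both function names are assumptions for illustration.

```c
#include <stdint.h>

/* XOR of bits 18, 17, 16 and 13, as computed by the srl/xor/andi sequence. */
uint32_t lfsr_feedback(uint32_t r)
{
    return ((r >> 18) ^ (r >> 17) ^ (r >> 16) ^ (r >> 13)) & 1u;
}

/* One shift step: the feedback bit enters at bit 0.
 * Assumes a 19-bit register (bits 0..18). */
uint32_t lfsr_step(uint32_t r)
{
    return ((r << 1) | lfsr_feedback(r)) & 0x7FFFFu;
}
```

Eight instructions produce one bit here, whereas the equivalent hardware is a three-gate XOR tree.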
ASIP/VLIW architectures: feedback
[Figure: feedback views offered by the tools]
• resource load
• architecture view
• cycle-count
• bus load
• life-time analysis
Low power aspects
[Figure: Mistral2 flow with an Implementation Independent Design Database, an Estimation Database (area, speed, power), and the resulting Architecture + Estimation.]

EXU          ACTIVITY    AREA   POWER
alu_1           20%       261     105
acs_asu_1       83%      2382    3816
or_asu_1        10%       611     122
romctrl_1       16%        65      21
acu_1           36%       294     205
ipb_1           20%       107      43
opb_1           11%       163      35
ctrl                     1864    3597
total                    5747    7944
GSM viterbi decoder : default solution
cycle count: 13750

EXU          ACTIV     AREA    POWER
alu_1         96%      3469    46196
romctrl_1     48%        39      259
acu_1         26%       327     1209
ipb_1          5%       131      105
opb_1         23%      1804     5801
ctrl                   9821   135035
total                 15591   188605

• controller responsible for 70% of power consumption
  – maximum resource-sharing
  – heavy decision-making : “main” loop with 16 metrics-computations per iteration
• EXU-numbers include registers for local storage
GSM viterbi decoder : no loop-folding
cycle count: 14247

EXU          ACTIV     AREA    POWER
alu_1         92%      3411    45073
romctrl_1     45%        39      255
acu_1         25%       294     1087
ipb_1          5%       107       86
opb_1         22%      1661     5340
ctrl                   4919    70087
total                 10431   121928

• area down by 33%
• power down by 35%
• next step: reduce # of program-steps with second ALU
GSM viterbi decoder : 2 ALU’s
cycle count: 9739

EXU          ACTIV     AREA    POWER
alu_1         69%      1797    12248
alu_2         65%      1393     8916
romctrl_1     67%        39      255
acu_1         37%       294     1087
ipb_1          8%       149      119
opb_1         33%      2136     6871
ctrl                   8957    87235
total                 14766   116731

• cycle count down 30%
• area up 42%
• power down by 5%
• next step: introduce ASU to reduce ALU-load
GSM viterbi decoder : 1 x ACS-ASU
    func ACS ( M1, M2, d ) MS, MS8 =
    begin
      MS  = if ( M1+d > M2-d ) -> ( M1+d ) || ( M2-d ) fi;
      MS8 = if ( M1-d > M2+d ) -> ( M1-d ) || ( M2+d ) fi;
    end;

cycle count: 1930

EXU          ACTIV     AREA    POWER
alu_1         20%       261      105
acs_asu_1     83%      2382     3816
or_asu_1      10%       611      122
romctrl_1     16%        65       21
acu_1         36%       294      205
ipb_1         20%       107       43
opb_1         11%       163       35
ctrl                   1864     3597
total                  5747     7944

• cycle count down 5X
• power down 20X !
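The ACS (add-compare-select) function that the ASU implements can be read as two "pick the larger sum" selections. The C sketch below mirrors the func ACS definition on this slide. The `acs_result` struct and the lower-case names are assumptions, since C cannot return two values directly.

```c
typedef struct {
    int ms;    /* MS  in the slide: larger of M1+d and M2-d */
    int ms8;   /* MS8 in the slide: larger of M1-d and M2+d */
} acs_result;

/* Add-compare-select on two path metrics m1, m2 and a branch metric d. */
acs_result acs(int m1, int m2, int d)
{
    acs_result r;
    r.ms  = (m1 + d > m2 - d) ? (m1 + d) : (m2 - d);
    r.ms8 = (m1 - d > m2 + d) ? (m1 - d) : (m2 + d);
    return r;
}
```

On the ALU-only datapath each call costs four additions, two comparisons and two selections spread over several program steps. The ASU collapses that into a single operation, which fits the drop in cycle count reported above.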
GSM viterbi decoder : 4 x ACS-ASU
cycle count: 425

EXU          ACTIV     AREA    POWER
alu_1         94%       243       97
acs_asu_1     95%      1041      420
acs_asu_2     95%      1041      420
acs_asu_3     95%      1041      420
acs_asu_4     95%      1041      420
split_asu_1   47%        90       18
or_asu_1      47%       592      118
romctrl_1     28%        48        6
acu_1         98%       212       85
ipb_1         23%        60        6
opb_1         50%       369       80
ctrl                   1306      555
total                  7084     2645

• cycle count down another 5X
• area up 23%
• power down another 3X !
GSM viterbi example : summary
[Figure: bar chart (vertical scale 0 to 20000) of power, area and cycles for the five design points: default, no loop-folding, 2 ALU, 1 ACS, 4 ACS. Annotation: 72x !]
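The overall gains can be checked from the totals and cycle counts quoted on the default-solution and 4 x ACS-ASU slides. The arithmetic below uses only those numbers. Reading the chart's 72x annotation as the power ratio is an assumption, based on it being the only ratio of that size.

```latex
\begin{align*}
\text{cycles:} \quad & \frac{13750}{425} \approx 32 \\
\text{area:}   \quad & \frac{15591}{7084} \approx 2.2 \\
\text{power:}  \quad & \frac{188605}{2645} \approx 71
\end{align*}
```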
Discussion: phase 3
[Figure: the exploration phase loop (application(s), processor model, SW code generation, HW design, OK?, more appl.?) ends by freezing the processor model. New application(s) are then compiled for the frozen model with SW (code generation) until OK.]
Application software development: constraint driven compilation
[Figure: VLIW template with register files RF1 to RF4, functional units FU1 to FU4, flags, instruction registers IR1 to IR4, instruction memory and control.]
Discussion: problems with VLIWs
code size and instruction bandwidth
• code compaction = reduce code size after scheduling
possible compaction ratio ?
e.g. p0 = 0.9 and p1 = 0.1
information content (entropy) = $-\sum_i p_i \log_2 p_i = 0.47$ bits
maximum compression factor ≈ 2
• control parallelism during scheduling = switch between
different processor models (10% of code = 90% runtime)
• architecture
reduce number of control bits for operand addresses
e.g. 128 reg (TM) -> 28 bits/issue slot for addresses only
=> use stacks and fifos
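The two numbers on this slide can be reproduced by hand. The derivation assumes a two-symbol source with probabilities p0 = 0.9 and p1 = 0.1 that is stored at one bit per symbol before compaction. For the register example it assumes four register-address fields per operation, each addressing 128 registers. The field count is not stated on the slide.

```latex
\begin{align*}
H &= -\left(0.9 \log_2 0.9 + 0.1 \log_2 0.1\right)
   \approx 0.9 \cdot 0.152 + 0.1 \cdot 3.322
   \approx 0.47 \text{ bits per symbol} \\
\text{maximum compression factor} &= \frac{1}{H} \approx 2.1 \\
\text{address bits per issue slot} &= 4 \cdot \log_2 128 = 4 \cdot 7 = 28
\end{align*}
```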
GPU basics
• Synthetic objects are represented with a bunch of
triangles (3d) in a language/library like OpenGL
or DirectX plus texture
• Triangles are represented with 3 vertices
• A vertex is represented with 4 coordinates with
floating-point precision
• Objects are transformed between coordinate
representations
• Transformations are matrix-vector multiplications
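As a small illustration of the last two bullets, a vertex with 4 floating-point coordinates can be transformed by one 4x4 matrix-vector multiplication. The type and function names below are assumptions for the sketch; they are not OpenGL or DirectX API calls.

```c
typedef struct { float v[4]; } vec4;        /* x, y, z, w */
typedef struct { float m[4][4]; } mat4;     /* row-major 4x4 matrix */

/* y = A * x: each output coordinate is a 4-element dot product. */
vec4 mat4_mul_vec4(const mat4 *a, vec4 x)
{
    vec4 y;
    for (int r = 0; r < 4; r++) {
        y.v[r] = 0.0f;
        for (int c = 0; c < 4; c++)
            y.v[r] += a->m[r][c] * x.v[c];
    }
    return y;
}

/* Builds a translation matrix; with w = 1 it moves a point by (tx, ty, tz). */
mat4 mat4_translation(float tx, float ty, float tz)
{
    mat4 t = {{
        { 1.0f, 0.0f, 0.0f, tx },
        { 0.0f, 1.0f, 0.0f, ty },
        { 0.0f, 0.0f, 1.0f, tz },
        { 0.0f, 0.0f, 0.0f, 1.0f }
    }};
    return t;
}
```

The fourth coordinate is what lets a translation be expressed as a matrix product, so a chain of transformations can be collapsed into one matrix and applied to every vertex.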
GPU DirectX 10 pipeline
NVIDIA GeForce 6800 3D Pipeline
GeForce 8800 GPU
330 Gflops, 128 processors with 4-way SIMD
GPU: Why more general-purpose programmable?
• All transformations are shading
• Shading is all matrix-vector multiplications
• Computational load varies heavily between
different sorts of shading
• Programmable shaders allow dynamic resource
allocation between shaders
Result:
• Modern GPUs are serious competitors for general-purpose processors!
[Figure: VLIW instruction encoding for eight operations A ... H. Classical encoding: every issue slot is stored, so many nops (n) are fetched. Velocity encoding: only the operations are stored, with one bit per operation:
  fully serial            A B C D E F G H = 0 0 0 0 0 0 0 0
  mixed serial/parallel   A B C D E F G H = 1 1 0 1 0 0 1 0
  fully parallel          A B C D E F G H = 1 1 1 1 1 1 1 0]
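One way to read the per-operation bits in the encoding figure is sketched below. It assumes that a 1 means "the next operation issues in the same cycle" and a 0 closes the current group. That interpretation is consistent with the fully serial and fully parallel patterns, but it is an assumption, as is the function name.

```c
/* Assigns an issue cycle to each operation from its parallel bit and
 * returns the number of cycles needed. A trailing 1 leaves the last
 * group open; it is still counted as one cycle. */
int decode_parallel_bits(const int *pbit, int n, int *cycle_of)
{
    int cycle = 0;

    for (int i = 0; i < n; i++) {
        cycle_of[i] = cycle;
        if (!pbit[i])
            cycle++;              /* a 0 bit ends the current group */
    }
    return (n > 0 && pbit[n - 1]) ? cycle + 1 : cycle;
}
```

Under this reading the mixed pattern 1 1 0 1 0 0 1 0 issues A, B, C together, then D, E, then F, then G, H. In all three cases only eight operation slots are stored, where the classical encoding also stores the nops.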
Conclusions
• ASIPs provide efficient solutions for well-defined application
domains (2 orders of magnitude higher efficiency).
• The methodology is interesting for IP creation.
• The key problem is retargetable compilation.
• A (distributed) VLIW model is a good compromise between
HW and SW.
• Although an automatic process can generate a default
solution, the process usually is interactive and iterative for
efficiency reasons. The key is fast and accurate feedback.
• GPUs are ASIPs