Architectural Enhancements for Efficient
Operand Transport in Multimedia Systems
ECE7102 Class Presentation
Date: 2006. 4. 13
Hongkyu Kim
[email protected]
Overview
• Introduction
• Characterization and modeling of operand usage
and transport
• Dynamic execution technique exploiting regular
operand transport patterns in multimedia
– Instruction cluster mapping on the inter-ALU network
for general-purpose domain
– Dynamic SIMDization for application-specific domain
• Summary
2/40
Interconnect Complexity
S to ra g e
S to ra g e
S to ra g e
S to ra g e
FU
In te rco n n e ct
FU
FU
FU
FU
FU
In te rco n n e ct
FU
FU
FU
FU
S to ra g e
S to ra g e
S to ra g e
S to ra g e
• Exponential increase of chip capacity  More devices
• Exponential decrease of feature size  Interconnect limitation
J.D. Meindl, Interconnect Opportunities for Gigascale Integration, IEEE MICRO, vol. 23, no. 4, pp.28-35,
May/June 2003.
3/40
Interconnect Bottleneck
• Disparity between wire delay and gate delay
100
Relative Delay
1/α2
10
1/α2
1
α
0.1
250
180
130
90
65
45
42
Process Technology Node (nm)
ITRS 2002 Documents, http://public.itrs.net/Files/2002Update/Home.pdf.
4/40
Problem Statement
• High-performance interconnect
– Interconnect organizations
– Interconnect technologies
• Why architectural responses are limited?
– Compatibility with old ISAs
• Sequentially-specified operations
• Restricted register file-based operand namespace
– ILP mechanisms
• Operand bypass network, register renaming, and instruction
scheduling
• Poorly scaling broadcast buses
5/40
Research Objective and Approach
• Objective
Reduce latency of operand transport for multimedia
– Development of dynamic execution techniques
– Development of low-cost operand bypass networks
• Approach summary
B a c k g ro u n d w o rk
G e n e ra l a p p ro a c h
A p p lic a tio n -s p e c ific a p p ro a c h
D y n a m ic e x e c u tio n te c h n iq u e
A n a ly s is o f o p e ra n d s
E xa m in e o p e ra n d u sa g e p ro p e rtie s
E xp lo re th e im p a ct o f a rch ite ctu ra l
te ch n iq u e s o n th e o p e ra n d tra n sp o rt
T e c h n o lo g y m o d e l-b a s e d
e v a lu a tio n o n ta rg e t p la tfo rm s
G E N E S Y S
S im p le S ca la r
In stru ctio n clu ste rin g
R e co g n itio n o f re g u la r o p e ra n d tra n sp o rt p a tte rn s
E fficie n t e xe cu tio n u n it
C lu s te r m a p p in g o n
in te r-A L U n e tw o rk
B a sic in stru ctio n clu ste rin g
R a w clu ste r m a p p in g
L o ca l o p e ra n d m a p p in g o n
d e d ica te d in te r-A L U p a th
O p tim izin g o p e ra n d tra n s p o rt
fo r m u ltim e d ia s y s te m s
R e g u la r p a tte rn re co g n itio n
C lu ste r re o rg a n iza tio n
F u n ctio n re m a p p in g
D yn a m ic S IM D iza tio n
6/40
Overview
• Introduction
• Characterization and modeling of operand usage
and transport
• Dynamic execution technique exploiting regular
operand transport patterns in multimedia
– Instruction cluster mapping on the inter-ALU network
for general-purpose domain
– Dynamic SIMDization for application-specific domain
• Summary
7/40
Motivation and Approach
• Motivation
– Shift of microarchitectural design focus
Operand computation  Operand communication
– Recognizing and understanding of operand usage and
transport properties  Efficiently controlling operand traffic
• Approach summary
– Operand usage characteristics
• How often operands are used  Examine temporal property
• Where operands are used  Examine spatial property
– Operand transport properties
• What accounts for the majority of communication needs
Explore the impact of architectural techniques on the operand
transport
8/40
Operand Usage Analysis
• General terms
– Operands: values in registers, memory locations, or memory
addresses
– Operand transport: buffering and delivery of operands to FUs
• Operands’ temporal characteristics
– Which inst. consumes operands after they are produced
– Metrics: Degree of use, Age, Lifetime
• Operands’ spatial characteristics
– From/to which FU operands are moved in the execution model
– Metrics: Degree of functionality, Transport pattern
9/40
Operand Transport Analysis
• Operand transport model
G lo b a l S to ra g e
tra n s rd _ g lo b a l
tra n s rd _ b y p a s s
tra n s w r_ g lo b a l
B yp a ss N e tw o rk
L o ca l
S to ra g e
L o ca l
S to ra g e
L o ca l
S to ra g e
L o ca l
S to ra g e
L o ca l
S to ra g e
FU
FU
FU
FU
FU
L o ca l
S to ra g e
L o ca l
S to ra g e
L o ca l
S to ra g e
L o ca l
S to ra g e
L o ca l
S to ra g e
FU
FU
FU
FU
FU
10/40
Preliminary Results
• Operand usage properties (MediaBench average)
d e g re e o f u s e
0
1
age
1
life tim e
1
d e g re e o f
fu n c tio n a lity
0
0%
2
2
1(same)
20%
2
3~5
3~5
>5
6~10
>10
1(different)
40%
60%
3 >3
80%
>1
100%
H. Kim, D. Wills, and L. Wills, “Empirical analysis of operand usage and transport in multimedia
applications,” Proc of the International Workshop on System-on-Chip for Real-Time
Applications, pp. 168-171, July 2004.
11/40
Preliminary Results (cntd.)
• Operand transport pattern (MediaBench average)
integer  integer
43.0%
Others
8.1%
integer  branch
14.9%
ld/st  ld/st
6.6%
ld/st  integer
13.8%
integer  ld/st
13.6%
12/40
Preliminary Results (cntd.)
• Effective architectural techniques on operand
transport
–
–
–
–
Storage hierarchy: local buffering
Dedicated transport network
Lifetime detection: compile-time/run-time
Smart instruction steering
13/40
Overview
• Introduction
• Characterization and modeling of operand usage
and transport
• Dynamic execution technique exploiting regular
operand transport patterns in multimedia
– Instruction cluster mapping on the inter-ALU network
for general-purpose domain
– Dynamic SIMDization for application-specific domain
• Summary
14/40
Motivation and Approach
• Motivation
Multimedia applications
– Operand movement is highly regular
– Most operands are short lived, transient operands
Develop dynamic execution technique exploiting regular
operand distribution patterns and local properties
• Approach summary
– Instruction clustering: dynamic instruction grouping
– Recognition of regular operand transport pattern
– Efficient execution unit: reduce transport latency
15/40
Related Work
• Solutions for multimedia processing
– Multimedia-specific ISA extensions
• Exploit data-level parallelism at subword level
• General-purpose domain: Intel’s MMX and SSE, AMD’s 3DNow!,
Sun’s VIS, IBM’s Altivec
• Application-specific signal processing domain: Analog Device’s
TigerShark, Trimedia
– Vectorization and retargeting
• Manual assembly coding
• Hand-optimization: in-lined assembly code, library routines
• Automatic vectorization: compiler/retargeting technology
16/40
Related Work (cntd.)
• Solutions for reducing operand transport complexity
– Communication-aware execution
• Network-connected tile architecture: RAW, GPA
• Transport triggered architecture: MOVE
– Resource partitioning: Clustered architectures
• Heterogeneous: decoupled architecture
• Commercial: DEC Alpha21264
• Academia: Multicluster, Palacharla’s, PEWs, ILDP, CTCP
– Dynamic optimizations
• Fill unit: reform instructions in H/W, and cache them
• Small-scale dependence collapsing: combine dependences
among multiple instructions  macro instruction
17/40
Related Research Landscape
M u ltim e d ia p ro c e s s in g :
in d e p e n d e n t c o m p u ta tio n
R e g u la r p a tte rn o f
d e p e n d e n t in s tru c tio n s
C o m m u n ic a tio n -a w a re e x e c u tio n :
e ffic ie n t o p e ra n d tra n s p o rt
B in a ry -c o m p a tib ility ,
ru n -tim e o p tim iz a tio n
D yn a m ic e xe cu tio n te ch n iq u e e xp lo itin g re g u la r
o p e ra n d tra n sp o rt p a tte rn s in m u ltim e d ia
S te e rin g b u rd e n o ff
th e c ritic a l p a th
R e s o u rc e p a rtitio n in g :
C le v e r in s tru c tio n s te e rin g
L a rg e r, m o re g e n e ra l
in s tru c tio n g ro u p in g
D y n a m ic o p tim iz a tio n s :
in s tru c tio n g ro u p in g , s m a ll-s c a le
d e p e n d e n c e c o lla p s in g
18/40
Research Methodology
A p p lica tio n co d e
(C so u rce )
In stru ctio n tra ce
g cc
cro ss-co m p ile r
C lu ste r fo rm a tio n
lo g ic
P IS A b in a ry
C lu ste r sto ra g e
(ca ch e )
In stru ctio n stre a m
M a tch e d ?
E xe cu tio n p la tfo rm
N
In stru ctio n q u e u e
Y
C lu ste r q u e u e
N o rm a l
e xe cu tio n u n it
C lu ste r
e xe cu tio n u n it
19/40
Dynamic Instruction Clustering
• Instruction Cluster
– A connected subgraph of instructions joined by local operands
– Dataflow graph  Dependence edge classification
 Instruction grouping
• Dependence edge types
– External: produced/consumed by previous/next blocks
– Non-clusterable: operands from/to memory
– Local: produced and consumed within the same block
20/40
Instruction Clustering Example
• Color conversion block in JPEG encoder
0 : lb u r4 , 0 (r9 )
0 , 1 (r9 )
1 : lb u r5
2 : lb u r6 , 2 (r9 )
3 : s ll r4 , r4 , 0 x2
3
4 : a d d u r4 , r4 , r1 0
5 : s ll r5 , r5 , 0 x2
6 : a d d u4 r5 , r5 , r1 0
7 : lw r2 , 0 (r4 )
87: lw r3
1 7, 1 0 2245(r5 ) 8
9 : s ll r6 , r6 , r1 0
1 0 : a d d u r6 , r6 , r1 0
1 1 : a d d u r2 , r2 , r3
11
1 2 : lw r3 , 2 0 4 8 (r6 )
1 3 : a d d u r7 , r2 5 , r8
1 4 : a d d u r2 , r2 , r3
1 5 : s ra r2 , r2 , 0 x 1 0
1
6 : s b r2 , 0 (r7 )
15
13
21
1 7 : lw r2 , 3 0 7 2 (r4 )
1 8 : lw r3 , 4 0 9 6 (r5 )
16
24
1
5
6
18
19
22
23
1 9 : a d d u r2 , r2 , r3
2 0 : lw r3 , 5 1 2 02(r6 )
2 1 : a d d u r7 , r1 5 , r8
2 2 : a d d u r2 , r2 , r3
9
27
2 3 : s ra r2 , r2 , 0 x 1 0
2 4 : s b r2 , 0 (r7 )
0 )
2 5 : lw r2 , 5 1 2 01 (r4
2 6 : lw r3 , 6 1 4 4 (r5 )
2 72 :6 a d d iu
1 2r9 , r9
2 0, 3 2 9
2 8 : a d d u r2 , r2 , r3
2 9 : lw r3 , 7 1 6 8 (r6 )
3 0 : a d d u r7 , r1 2 , r8
28
31
3 1 : a d d iu r8 , r8 , 1
3 2 : a d d u r2 , r2 , r3
3 3 :2 s ra r2 , r2 , 0 x 1 0 3 5
3 4 : s b r2 , 0 (r7 )
3 5 : s ltu r2 , r8 , r1 6
33
30
36
3 6 : b n e r2 , r0 , 0 x 4 1 2 1 8 8
x te rn a l
E xte
Local
N onc lu ste
s te ra b le
clu
In s tru c tio n
C lu s te r
34
21/40
Overview
• Introduction
• Characterization and modeling of operand usage
and transport
• Dynamic execution technique exploiting regular
operand transport patterns in multimedia
– Instruction cluster mapping on the inter-ALU network
for general-purpose domain
– Dynamic SIMDization for application-specific domain
• Summary
22/40
Implementation Example - I
• Raw cluster execution on inter-ALU network
– Focus on intermediate, short-lived operands
• Local operands: inter-ALU dedicated bypass network
• Others: traditional global bypass network
– Organization
• Instruction cluster formation
• Cluster queue and scheduling
• Cluster execution: inter-ALU network
H. Kim, D. Wills, and L. Wills, “Reducing operand communication overhead using instruction
clustering for multimedia applications,” Proc of 7th International Symposium on Multimedia,
December 2005.
23/40
Cluster Queue and Scheduling
• Organization of cluster queue
– Single entry per cluster (2D)
– Ready flag for local operands are always set
– Issue pointer for each entry, in-order issue
C lu s te r q u e u e
C o n v e n tio n a l in s tu c tio n q u e u e
H ead
T a il
H ead
T a il
I0
I1
I2
I3
d e p th
w id th
C 0 :I0
C 1 :I0
C 2 :I0
C 0 :I1
C 1 :I1
C 2 :I1
C 0 :I2
C 1 :I2
C 1 :I3
Issu e
p o in te r
2
0
1
24/40
Cluster Execution Unit
• Cluster mapping on inter-ALU network
– Local operands: dedicated bypass network
– Others: traditional global bypass network
In stru ctio n clu ste r
n e tw o rk A L U
co l 0
co l 1
co l 2
ro w 0
I0
I1
I6
ro w 1
I2
I4
In s tru c tio n
D e p th
0
1
2
I0
I2
I1
I4
I3
I3
ro w 2
3
I5
ro w 3
4
co l 3
I5
I6
25/40
Experimental Setup
• Simulation Environment
– SimpleScalar sim-outorder simulator
– MediaBench application programs
• Processor Configurations
Queues
FU resources
Operand
bypass
(latency)
8-way
24 instruction queue,
8 cluster queue,
16 load/store queue
4 integer ALUs,
1 (4x4) network ALU,
2 integer MULs,
2 floating ALUs
1 floating MUL,
2 memory ports
Local (0),
pass-through (1),
Global (1)
16-way
48 instruction queue,
16 cluster queue,
32 load/store queue
8 integer ALUs,
2 (4x4) network ALUs,
2 integer MULs,
2 floating ALUs
1 floating MUL,
2 memory ports
Local (0),
pass-through (1),
Global (max 3)
26/40
a v e ra g e
ra w d a u d io
ra w c a u d io
m peg2encode
3 2 e ntrie s
1 2 8 e ntrie s
5 1 2 e ntrie s
m peg2decode
70%
g721encode
80%
g721decode
e p ic u n
e p ic
d jp e g
c jp e g
c lu s te re d in s t./to ta l c o m m ite d in s t.
Experimental Result
• Dynamic instruction coverage
6 4 e ntrie s
2 5 6 e ntrie s
1 K e ntrie s
60%
50%
40%
30%
20%
10%
0%
27/40
0.0
8-way
average
rawdaudio
rawcaudio
mpeg2encode
mpeg2decode
g721encode
g721decode
epicun
epic
djpeg
cjpeg
0.8
average
rawdaudio
rawcaudio
mpeg2encode
1.2
mpeg2decode
g721encode
g721decode
epicun
epic
djpeg
cjpeg
average dependence edge per inst.
Experimental Result (cntd.)
• Operand transport types
1.4
global
pass-through
local
1.0
0.6
0.4
0.2
59.5%
57.8%
11.0%
10.6%
29.5%
31.5%
16-way
28/40
8 -w a y
a ve ra g e
ra w d a u d io
ra w ca u d io
m p e g 2 e n co d e
m p e g 2 d e co d e
g 7 2 1 e n co d e
g 7 2 1 d e co d e
e p icu n
e p ic
d jp e g
cjp e g
a ve ra g e
ra w d a u d io
ra w ca u d io
m p e g 2 e n co d e
m p e g 2 d e co d e
g 7 2 1 e n co d e
g 7 2 1 d e co d e
e p icu n
e p ic
d jp e g
cjp e g
IP C s p e e d u p
Experimental Result (cntd.)
• IPC speedup
80%
70%
60%
50%
40%
30%
20%
10%
0%
1 6 -w a y
29/40
Summary
• Summary of approach
– Dynamically group dependent instructions into clusters
– Store regular operand transport patterns
– Execute them on inter-ALU network where intermediate values
are propagated among ALUs w/o/ using global buses
• Summary of results (MediaBench average)
– Dynamic instruction coverage
57.3%
@ 256 entry cluster cache
– Shortest transport rate
8-way
30%
16-way
32%
– IPC speedup
8-way
16.2% 16-way
35.2%
30/40
Overview
• Introduction
• Characterization and modeling of operand usage
and transport
• Dynamic execution technique exploiting regular
operand transport patterns in multimedia
– Instruction cluster mapping on the inter-ALU network
for general-purpose domain
– Dynamic SIMDization for application-specific domain
• Summary
31/40
Implementation Example - II
• Data parallel execution using dynamic SIMDization
– Observation (Image processing applications)
• Operand movement w/in a loop iteration is highly regular
• Small # of inner loops covers most of execution time
– Focus on regular operand transport pattern between
iterations of innermost loop
• Stride prediction: break loop-carried dependences  dataparallel execution
• Operand lifetime detection  operand traffic control
– Organization
• Instruction cluster formation
• SIMD instruction queue and scheduling
• SIMD PE array
32/40
Dynamic Instruction Clustering
• External dependence edge types
– External-input: serving only as input
– External-output: serving only as output
– External-updated: serving as both input and output
• Parallel and non-parallel region detection
– p-cluster: producing no external-updated output and not
having unpredicted external-updated input
– np-cluster
33/40
Instruction Clustering Example
• Image convolution code in TI’s IMGLIB
r1 1
IC 0
r8
r1 5
r9 r1 0
r1 3
IC 1
0
2
1
3
8
13
4
9
14
6
11
16
7
12
20
IC 2
IC 3
21
IC 4
5
10
15
17
18
19
IC 5
r6
r5
r4
r3
e x te rn a l-in p u t = {r1 0 , r1 1 , r1 3 , r1 5 }
e x te rn a l-o u tp u t = {r2 , r3 , r4 , r5 , r6 , r7 }
e x te rn a l-u p d a te d = {r8 , r9 }
r9 r7
r8
r2
p -c lu s te rs = {IC 0 , IC 1 , IC 2 , IC 3 }
n p -c lu s te rs = {IC 4 , IC 5 }
34/40
SIMD Execution Unit
• Cluster scheduling on SIMD PE array
8 [0 :3 ]
PE0
20
30
PE1
21
31
PE2
22
32
PE3
23
33
0
1
1 3 [0 :3 ]
160
2
(a ) p - c lu ste r s c h e d u lin g
t
0
200
210
161
201
211
162
202
212
163
203
213
3
4
4
1
2
t
(b ) n p - c lu ste r s c h e d u lin g
35/40
SIMD Execution Unit (cntd.)
• Operand transport model
S c a la r re s o u rc e s
P
c o n ve n tio n a l IL P p ro c e s s o r
PE
P
IL P
S IM D
P
e xte rn a l- in p u t
P
e xte rn a l- o u tp u t
P
lo c a l
e xte rn a l- u p d p a te d
36/40
Summary of Approach
• Dynamic parallelization
– Detect regular operand transport pattern on externalupdated
– Compute stride  predict external-update values
• Optimizing operand transport
– Identify the lifetime of operands
– Remove needless communication  localize transport
• Execute the clusters on 1-D mesh SIMD PE array
37/40
Overview
• Introduction
• Characterization and modeling of operand usage
and transport
• Dynamic execution technique exploiting regular
operand transport patterns in multimedia
– Instruction cluster mapping on the inter-ALU network
for general-purpose domain
– Dynamic SIMDization for application-specific domain
• Summary
38/40
Summary
• Characterization and modeling of operand
– Examine the operand usage properties
– Explore the impact of architectural techniques on the operand
transport
• Development of a dynamic execution technique
– Instruction clustering
– Recognition of regular operand transport pattern
– Efficient execution unit
39/40
Thank you. Any questions?
40/40
Descargar

Partitioning of Long Trace