Advanced Topics in Pipelining: SMT and Single-Chip Multiprocessors
Priya Govindarajan
CMPE 200
Introduction

Researchers have proposed two alternative microarchitectures that exploit multiple threads of control:
- simultaneous multithreading (SMT) [1]
- chip multiprocessors (CMP) [2]
CMP vs. SMT

- Why software and hardware trends will favor the CMP microarchitecture.
- Conclusions on the performance results from a comparison of simulated superscalar, SMT, and CMP microarchitectures.
SMT Discussion Outline

- Introduction
- Multithreading (MT)
- Approaches to multithreading
- Motivation for introducing SMT
- Implementation of an SMT CPU
- Performance estimates
- Architectural abstraction
Introduction to SMT

- SMT processors augment wide (issuing many instructions at once) superscalar processors with hardware that allows the processor to execute instructions from multiple threads of control concurrently.
- Instructions are dynamically selected and executed from many active threads simultaneously.
- This gives higher utilization of the processor's execution resources.
- It provides latency tolerance when a thread stalls due to cache misses or data dependencies.
- When multiple threads are not available, however, the SMT simply looks like a conventional wide-issue superscalar.
Introduction to SMT

- SMT uses the insight that a dynamically scheduled processor already has many of the hardware mechanisms needed to support the integrated exploitation of TLP through multithreading.
- Multithreading can be built on top of an out-of-order processor by adding per-thread register renaming and program counters, and by providing the capability for instructions from multiple threads to commit.
Multithreading: Exploiting Thread-Level Parallelism

Multithreading
- Allows multiple threads to share the functional units of a single processor in an overlapping fashion.
- The processor must duplicate the independent state of each thread (register file, a separate PC, page table).
- Memory can be shared through the virtual memory mechanisms, which already support multiprocessing.
- Hardware support is needed for switching between threads.
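As a software analogy to this state split, the sketch below uses plain Python threads: each thread's local variables stand in for its private PC and register file, while all threads share one address space (the list), with a lock standing in for synchronization support. This is an illustration only, not a model of the hardware.

```python
import threading

# Shared address space: every thread sees the same list, much as hardware
# thread contexts share memory through the virtual memory system.
shared = [0] * 4
lock = threading.Lock()

def worker(tid):
    # Private per-thread state: locals stand in for the duplicated
    # PC and register file each hardware thread context needs.
    local_value = tid * 10
    with lock:                      # sharing still requires synchronization
        shared[tid] = local_value

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(shared)  # [0, 10, 20, 30]
```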
Multithreading…

Two main approaches to multithreading:
- Fine-grained multithreading
- Coarse-grained multithreading
Fine-grained vs. Coarse-grained Multithreading

Fine-grained:
- Switches between threads on each instruction, causing interleaving.
- Interleaves in round-robin order, skipping any threads that are stalled.

Coarse-grained:
- Switches threads only on costly stalls.
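The two policies can be contrasted with a toy scheduling simulation (the latencies, thread counts, and cycle budget below are made up for illustration): fine-grained interleaving hides short stalls by issuing from another thread each cycle, while the coarse-grained policy simply absorbs any stall below its switch threshold.

```python
def fine_grained(threads, cycles):
    """Issue one instruction per cycle from the next ready thread,
    round-robin, skipping any thread that is currently stalled."""
    pcs = [0] * len(threads)          # per-thread program counters
    ready_at = [0] * len(threads)     # cycle at which each thread is ready
    done = 0
    for cycle in range(cycles):
        for off in range(len(threads)):
            t = (cycle + off) % len(threads)
            if pcs[t] < len(threads[t]) and ready_at[t] <= cycle:
                ready_at[t] = cycle + threads[t][pcs[t]]  # op plus its stall
                pcs[t] += 1
                done += 1
                break
    return done

def coarse_grained(threads, cycles, costly=3):
    """Stay on one thread, absorbing short stalls; switch threads only on
    a costly stall (pipeline-refill cost is omitted for simplicity)."""
    pcs = [0] * len(threads)
    cur, done, cycle = 0, 0, 0
    while cycle < cycles:
        if pcs[cur] >= len(threads[cur]):
            if all(pc >= len(t) for pc, t in zip(pcs, threads)):
                break                       # all threads finished
            cur = (cur + 1) % len(threads)
            continue
        lat = threads[cur][pcs[cur]]
        pcs[cur] += 1
        done += 1
        if lat >= costly:
            cycle += 1                      # issue, then switch away
            cur = (cur + 1) % len(threads)
        else:
            cycle += lat                    # short stall is absorbed
    return done

# Three threads; latency 1 = normal op, 2 = a short stall that the
# coarse-grained policy cannot hide (illustrative numbers only).
workload = [[1, 2, 1, 2, 1, 2] for _ in range(3)]
print(fine_grained(workload, 20), coarse_grained(workload, 20))  # 18 14
```

In 20 cycles the fine-grained machine finishes all 18 instructions because every short stall is covered by another thread, while the coarse-grained machine completes only 14.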
Fine-grained Multithreading

Advantages
- Hides throughput losses that arise from both short and long stalls.

Disadvantages
- Slows down the execution of an individual thread, since a thread that is ready to execute without stalls will be delayed by instructions from other threads.
Coarse-grained Multithreading

Advantages
- Relieves the need to make thread switching essentially free, and is much less likely to slow down the execution of an individual thread.
Coarse-grained Multithreading

Disadvantages
- Throughput losses, especially from shorter stalls.
- This is because a coarse-grained processor issues instructions from a single thread; when a stall occurs, the pipeline must be emptied or frozen.
- A new thread that begins executing after the stall must fill the pipeline before its instructions can complete.
Simultaneous Multithreading

SMT is a variation on multithreading that uses the resources of a multiple-issue, dynamically scheduled processor to exploit TLP at the same time it exploits ILP.

Why?
- Modern multiple-issue processors often have more functional-unit parallelism available than a single thread can effectively use.
- With register renaming and dynamic scheduling, multiple instructions from independent threads can be issued, since there are no dependences among them.
Basic Out-of-order Pipeline
SMT Pipeline
Challenges for an SMT Processor

- Dealing with the larger register file needed to hold multiple contexts.
- Maintaining low overhead on the clock cycle, particularly in instruction issue and completion.
- Ensuring that cache conflicts caused by the simultaneous execution of multiple threads do not cause significant performance degradation.
SMT

SMT will significantly enhance multistream performance across a wide range of applications, without significant hardware cost and without major architectural changes.
Instruction Issue
Reduced function-unit utilization due to dependences.

Superscalar Issue
Superscalar leads to more performance, but lower utilization.

Simultaneous Multithreading
Maximum utilization of function units by independent operations.

Fine-Grained Multithreading
Interleaving leaves no empty issue slots, but intra-thread dependences still limit performance.
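The utilization contrast in these issue diagrams can be approximated with a small Monte Carlo sketch (hypothetical numbers: a 4-wide machine where each thread supplies 0-2 independent instructions per cycle as a crude stand-in for its ILP). One thread rarely fills the machine; four threads together fill most slots.

```python
import random

WIDTH = 4  # issue slots per cycle

def utilization(num_threads, cycles=1000, seed=0):
    """Fraction of issue slots filled when each thread offers 0-2
    independent instructions per cycle (an illustrative ILP limit)."""
    rng = random.Random(seed)
    filled = 0
    for _ in range(cycles):
        slots = WIDTH
        for _ in range(num_threads):           # fill slots thread by thread
            issue = min(slots, rng.randint(0, 2))
            filled += issue
            slots -= issue
            if slots == 0:
                break
    return filled / (WIDTH * cycles)

print(f"superscalar, 1 thread: {utilization(1):.2f}")
print(f"SMT, 4 threads:        {utilization(4):.2f}")
```

With these made-up parameters a single thread fills about a quarter of the slots, while four threads fill most of them, mirroring the empty-slot pictures above.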
Architectural Abstraction

- 1 CPU with 4 thread processing units (TPUs)
- Shared hardware resources
System Block Diagram

Changes for SMT
- Basic pipeline: unchanged
- Replicated resources:
  - Program counters
  - Register maps
- Shared resources:
  - Register file (size increased)
  - Instruction queue
  - First- and second-level caches
  - Translation buffers
  - Branch predictor

Multithreaded Applications Performance
Single-Chip Multiprocessor

- CMPs use relatively simple single-thread processor cores to exploit only moderate amounts of parallelism within any one thread, while executing multiple threads in parallel across multiple processor cores.
- If an application cannot be effectively decomposed into threads, CMPs will be underutilized.
Comparing Alternative Architectures

Superscalar architecture
- Issues up to 12 instructions per cycle.

SMT architecture
- 8 separate PCs; executes instructions from 8 different threads concurrently.
- Multi-banked caches.

Chip multiprocessor architecture
- 8 small 2-issue superscalar processors; depends on TLP.
SMT and Memory

- Places large demands on memory.
- SMT requires more bandwidth from the primary cache (multithreading allows more loads and stores in flight).
- To allow this, the design uses 128-Kbyte caches.
- Requires a complex MESI (modified, exclusive, shared, invalid) cache-coherence protocol.
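For illustration, the MESI transitions of a single cache line can be sketched as a small state function. This is a simplified subset under stated assumptions (no bus transactions, write-backs, or invalidation acknowledgments are modeled; real protocols handle all of these):

```python
# States: M(odified), E(xclusive), S(hared), I(nvalid).
# A minimal, illustrative subset of MESI transitions for one cache line,
# keyed by the local/remote event; `shared` says whether another cache
# already holds a copy of the line.

def next_state(state, event, shared=False):
    if event == "local_read":
        if state == "I":
            return "S" if shared else "E"   # read miss: E if no other copy
        return state                         # M/E/S hit: state unchanged
    if event == "local_write":
        return "M"                           # write gains exclusive ownership
    if event == "remote_read":
        return "S" if state in ("M", "E") else state  # downgrade, supply data
    if event == "remote_write":
        return "I"                           # another cache takes ownership
    raise ValueError(event)

s = "I"
s = next_state(s, "local_read")        # I -> E (no sharers)
s = next_state(s, "local_write")       # E -> M
s = next_state(s, "remote_read")       # M -> S (write back, share)
s = next_state(s, "remote_write")      # S -> I
print(s)  # I
```

Tracking and sequencing these transitions across many threads' simultaneous accesses is part of what makes the SMT memory system complex.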
CMP and Memory

- The eight cores are independent and integrated with their individual pairs of caches (another form of clustering), which leads to a high-frequency design for the primary cache system.
- The small cache size and the tight connection to these caches allow single-cycle access.
- Needs only a simpler coherence scheme.
Quantitative Performance: CPU Cores

To keep the processor's execution units busy, SMT features:
- advanced branch prediction
- register renaming
- out-of-order issue
- non-blocking data caches
which make it inherently complex.
CMP Approach: Simple Hardware

SMT
- The number of registers increases.
- The number of ports on each register file must increase.

CMP solution
- Exploit ILP using more processors instead of large issue widths within a single processor.
SMT Approach

- Longer cycle times.
- Long, high-capacitance I/O wires span the large buffers, queues, and register files.
- Extensive use of multiplexers and crossbars to interconnect these units adds more capacitance.
- The associated delays dominate the delay along the CPU's critical path.
- The cycle-time impact of these structures can be mitigated by careful design: deep pipelining, and breaking the structures into small, fast clusters of closely related components connected by short wires.
- But deep pipelining increases branch misprediction penalties, and clustering tends to reduce the ability of the processor to find and exploit instruction-level parallelism.
CMP Solution

- A short cycle time can be targeted with relatively little design effort, since the hardware is naturally clustered: each of the small CPUs is already a very small, fast cluster of components.
- Since the OS allocates a single software thread of control to each processor, the partitioning of work among the "clusters" is natural and requires no hardware to dynamically allocate instructions to different clusters.
- Heavy reliance on software to direct instructions to clusters limits the amount of ILP a CMP can exploit, but allows the clusters within the CMP to be small and fast.
SMT and CMP

- From an architectural point of view, the SMT processor's flexibility makes it superior.
- However, the need to limit the effects of interconnect delays, which are becoming much slower than transistor gate delays, will also drive billion-transistor chip design.
- Interconnect delays will force the microarchitecture to be partitioned into small, localized processing elements.
- CMP is much more promising because it is already partitioned into individual processing cores.
- Because these cores are relatively simple, they are amenable to speed optimization and can be designed relatively easily.
Compiler Support for SMT and CMP

- Programmers must find TLP in order to maximize CMP performance.
- SMT also requires programmers to explicitly divide code into threads to get maximum performance, but unlike CMP, it can dynamically find more ILP if TLP is limited.
- With multithreaded OSs, these problems should prove less daunting.
- Having all eight of the CPUs on a single chip allows designers to exploit TLP even when threads communicate frequently.
Performance Results

A comparison of the three architectures indicates that a multiprocessor on a chip will be the easiest to implement while still offering excellent performance.
Disadvantages of CMP

- When code cannot be multithreaded, only one processor can be targeted to the task.
- However, a single 2-issue processor on a CMP is only moderately slower than the superscalar or SMT, since applications with little thread-level parallelism also tend to lack ILP.
Conclusions on CMP

- CMP is a promising candidate for a billion-transistor architecture.
- It offers superior performance using simple hardware.
- For code that can be parallelized into multiple threads, the small CMP cores will perform comparably or better.
- It is easier to design and optimize.
- SMTs use resources more efficiently than CMPs, but more execution units can be included in a CMP of similar area, since less die area need be devoted to wide-issue logic.










References

- D. Tullsen, S. Eggers, and H. Levy. "Simultaneous Multithreading: Maximizing On-Chip Parallelism." Proc. 22nd Ann. Int'l Symp. on Computer Architecture (ISCA), ACM Press, New York, 1995, pp. 392-403.
- J. Borkenhagen, R. Eickemeyer, and R. Kalla. "A Multithreaded PowerPC Processor for Commercial Servers." IBM Journal of Research and Development, Vol. 44, No. 6, Nov. 2000, pp. 885-898.
- J. Lo, S. Eggers, J. Emer, H. Levy, R. Stamm, and D. Tullsen. "Converting Thread-Level Parallelism into Instruction-Level Parallelism via Simultaneous Multithreading." ACM Transactions on Computer Systems, 15(2), Aug. 1997.
- K. Olukotun, B. A. Nayfeh, L. Hammond, K. Wilson, and K. Chang. "The Case for a Single-Chip Multiprocessor." Proc. 7th Int'l Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Cambridge, Massachusetts, Oct. 1996, pp. 2-11.
- L. Hammond, B. A. Nayfeh, and K. Olukotun. "A Single-Chip Multiprocessor." IEEE Computer, Sept. 1997.
- M. Gulati and N. Bagherzadeh. "Performance Study of a Multithreaded Superscalar Microprocessor." Proc. 2nd Int'l Symp. on High-Performance Computer Architecture (HPCA), Feb. 1996, pp. 291-301.
- K. Park, S.-H. Choi, Y. Chung, W.-J. Hahn, and S.-H. Yoon. "On-Chip Multiprocessor with Simultaneous Multithreading." http://etrij.etri.re.kr/etrij/pdfdata/22-04-02.pdf
- B. A. Nayfeh, L. Hammond, and K. Olukotun. "Evaluation of Design Alternatives for a Multiprocessor Microprocessor." Proc. 23rd Ann. Int'l Symp. on Computer Architecture (ISCA), May 1996, pp. 67-77.
- L. Hammond, B. A. Hubbert, M. Siu, M. K. Prabhu, M. Chen, and K. Olukotun. "The Stanford Hydra CMP." IEEE Micro, Vol. 20, No. 2, Mar./Apr. 2000.
- S. Eggers, J. Emer, H. Levy, J. Lo, R. Stamm, and D. Tullsen. "Simultaneous Multithreading: A Platform for Next-Generation Processors." IEEE Micro, Sept./Oct. 1997, pp. 12-18.
- V. Krishnan and J. Torrellas. "Hardware and Software Support for Speculative Execution of Sequential Binaries on a Chip Multiprocessor." Proc. ACM Int'l Conf. on Supercomputing (ICS '98), June 1998, pp. 85-92.
- goethe.ira.uka.de/people/ungerer/proc-arch/EUROPAR-tutorial-slides.ppt
- http://www.acm.uiuc.edu/banks/20/6/page4.html
- Simultaneous Multithreading home page: http://www.cs.washington.edu/research/smt/
The Stanford Hydra Chip Multiprocessor

Kunle Olukotun
The Hydra Team
Computer Systems Laboratory
Stanford University
Technology Architecture

Transistors are cheap, plentiful and fast



Wires are cheap, plentiful and slow



Moore’s law
100 million transistors by 2000
Wires get slower relative to transistors
Long cross-chip wires are especially slow
Architectural implications

- Plenty of room for innovation
- Single-cycle communication requires localized blocks of logic
Exploiting Program Parallelism

[Figure: levels of parallelism (instruction, loop, thread, process) plotted against grain size, from 1 to 1M instructions]
Hydra Approach

- A single-chip multiprocessor architecture composed of simple, fast processors
- Multiple threads of control
  - Exploits parallelism at all levels
- Memory renaming and thread-level speculation
  - Makes it easy to develop parallel programs
- Keep the design simple by taking advantage of single-chip integration
Outline

- Base Hydra architecture
- Performance of the base architecture
- Speculative thread support
- Speculative thread performance
- Improving speculative thread performance
- Hydra prototype design
- Conclusions
The Base Hydra Design

[Block diagram: four CPUs, each with its own L1 instruction cache, L1 data cache, and memory controller, connected by a 64-bit write-through bus and a 256-bit read/replace bus to a shared on-chip L2 cache, a Rambus memory interface to DRAM main memory, and an I/O bus interface; bus arbitration is centralized.]

- Single-chip multiprocessor
- Four processors
- Separate primary caches
- Write-through data caches to maintain coherence
- Shared 2nd-level cache
- Low-latency interprocessor communication (10 cycles)
- Separate read and write buses
Hydra vs. Superscalar

[Chart: speedup of Hydra (4 x 2-way issue) vs. a 6-way issue superscalar on compress, eqntott, m88ksim, apsi, MPEG2, applu, swim, tomcatv, OLTP, and pmake]

- ILP only: the superscalar is 30-50% better than a single Hydra processor
- ILP and fine-grained threads: the superscalar and Hydra are comparable
- ILP and coarse-grained threads: Hydra is 1.5-2x better
Problem: Parallel Software

- Parallel software is limited
  - Hand-parallelized applications
  - Auto-parallelized dense-matrix FORTRAN applications
- Traditional auto-parallelization of C programs is very difficult
  - Threads have data dependencies, so they must synchronize
  - Pointer disambiguation is difficult and expensive
Solution: Data Speculation

- Data speculation enables parallelization without regard for data dependencies
  - Loads and stores follow the original sequential semantics
  - Speculation hardware ensures correctness
  - Synchronization is added only for performance
  - Loop parallelization is now easily automated
- Other ways to parallelize code
  - Break code into arbitrary threads (e.g. speculative subroutines)
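The idea can be sketched in a few lines of Python (a purely software model with made-up helper names; real thread-level speculation does this in hardware with read bits and write buffers): run loop iterations optimistically against a snapshot of memory, then commit them in order, re-executing any iteration that read a location an earlier iteration turned out to write.

```python
def run_speculative(mem, iterations):
    """Each iteration is (src, dst, fn): it reads mem[src] and writes
    fn(mem[src]) to mem[dst]. Execute all iterations optimistically
    against a snapshot, then commit in order, re-running any iteration
    whose read was violated by an earlier iteration's write."""
    snapshot = dict(mem)
    results = []
    for src, dst, fn in iterations:          # optimistic "parallel" pass
        results.append((src, dst, fn, fn(snapshot[src])))
    restarts = 0
    committed = set()                        # locations written so far
    for src, dst, fn, val in results:        # in-order commit
        if src in committed:                 # RAW violation across iterations
            restarts += 1
            val = fn(mem[src])               # re-execute with committed value
        mem[dst] = val
        committed.add(dst)
    return restarts

mem = {"a": 1, "b": 10, "c": 100}
# iteration 0: b = a + 1 ; iteration 1: c = b + 1  (dependence through b)
restarts = run_speculative(mem, [("a", "b", lambda x: x + 1),
                                 ("b", "c", lambda x: x + 1)])
print(mem, restarts)  # {'a': 1, 'b': 2, 'c': 3} 1
```

The final memory state matches the sequential execution (b = 2, then c = 3), and the one restart is the price of speculating past the dependence rather than synchronizing on it up front.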
Data Speculation Requirements I

- Forward data between parallel threads
- Detect violations when reads occur too early

Data Speculation Requirements II

[Figure: after a violation, the speculative writes of the offending iterations are trashed; after a successful iteration, its writes are retired, in order, to permanent state.]

- Safely discard bad state after a violation
- Correctly retire speculative state

Data Speculation Requirements III

- Maintain multiple "views" of memory
Hydra Speculation Support

[Block diagram: the base Hydra design extended with speculation bits in each L1 data cache, a CP2 speculation-control interface on each CPU, and per-CPU speculation write buffers (#0-#3) in front of the shared on-chip L2 cache.]

- The write bus and L2 buffers provide forwarding
- "Read" L1 tag bits detect violations
- "Dirty" L1 tag bits and write buffers provide backup
- Write buffers reorder and retire speculative state
- Separate L1 caches with pre-invalidation and smart L2 forwarding
Speculative Reads

[Figure: a speculative CPU ("me", CPU #i) reads while the non-speculative "head" CPU and speculative earlier/later CPUs hold write buffers A-D between the L1 and L2 caches.]

- L1 hit: the read bits are set
- L1 miss: the L2 and the write buffers are checked in parallel; the newest bytes written to a line are pulled in by priority encoders on each byte (priority A-D)
Speculative Writes

- A CPU writes to its L1 cache and write buffer
- "Earlier" CPUs invalidate our L1 and cause RAW hazard checks
- "Later" CPUs just pre-invalidate our L1
- The non-speculative write buffer drains out into the L2
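A minimal software model of one such per-CPU buffer (the class and method names are invented for this sketch) shows the two fates of speculative state: it drains into the L2 when the thread becomes non-speculative, or it is discarded on a violation, and in either case the L2 never observes bad state.

```python
class SpecWriteBuffer:
    """Per-CPU speculative write buffer: stores wait here until the thread
    becomes non-speculative, then drain to the shared L2; on a violation
    the buffered state is simply discarded (the L2 never saw it)."""
    def __init__(self):
        self.buffer = {}              # addr -> value, speculative state

    def store(self, addr, value):
        self.buffer[addr] = value     # speculative store stays buffered

    def load(self, addr, l2):
        # The newest speculative value wins; otherwise fall back to the L2.
        return self.buffer.get(addr, l2.get(addr))

    def retire(self, l2):
        l2.update(self.buffer)        # drain buffered writes into the L2
        self.buffer.clear()

    def squash(self):
        self.buffer.clear()           # throw away bad speculative state

l2 = {0x100: 1}
wb = SpecWriteBuffer()
wb.store(0x100, 42)
assert wb.load(0x100, l2) == 42 and l2[0x100] == 1   # L2 still untouched
wb.squash()                          # violation: the speculative write vanishes
assert wb.load(0x100, l2) == 1
wb.store(0x100, 7)
wb.retire(l2)                        # successful thread: state becomes permanent
print(l2[0x100])  # 7
```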
Speculation Runtime System

- Software handlers
  - Control speculative threads through the CP2 interface
  - Track the order of all speculative threads
  - Exception routines recover from data dependency violations
  - Adds more overhead to speculation than hardware, but is more flexible and simpler to implement
  - Complete description in "Data Speculation Support for a Chip Multiprocessor" (ASPLOS '98)
Creating Speculative Threads

- Speculative loops
  - for and while loop iterations
  - Typically one speculative thread per iteration
- Speculative procedures
  - Execute the code after a procedure call speculatively
  - Procedure calls generate a speculative thread
- Compiler support
Base Speculative Thread Performance

[Chart: speedup on 4 processors for compress, eqntott, grep, m88ksim, wc, ijpeg, mpeg2, alvin, cholesky, ear, simplex, and sparse1.3]

- Entire applications
- GCC 2.7.2 -O2
- 4 single-issue processors
- Accurate modeling of all aspects of Hydra
Improving the Speculative Runtime System

- Procedure support adds overhead to loops
  - Threads are not created sequentially
  - Dynamic thread scheduling is necessary
  - Start and end of loop: 75 cycles
  - End of iteration: 80 cycles
- Performance
  - The best-performing speculative applications use loops
  - Procedure speculation often lowers performance
Improved Speculative Performance

[Chart: speedup of the optimized RTS vs. the base RTS on the same benchmarks]

- Improves the performance of all applications
- Most improvement for applications with fine-grained threads
- Eqntott uses procedure speculation
Optimizing Parallel Performance

- Cache-coherent shared memory
  - No explicit data movement
  - 100+ cycle communication latency
  - Need to optimize for data locality
  - Look at cache misses (MemSpy, Flashpoint)
- Speculative threads
  - No explicit data independence
  - Frequent dependence violations limit performance
Feedback and Code Transformations

- Feedback tool
  - Collects violation statistics (PCs, frequency, work lost)
  - Correlates read and write PC values with source code
- Synchronization
  - Synchronize frequently occurring violations
  - Use non-violating loads
- Code motion
Code Motion

Rearrange reads and writes to increase parallelism:
- Delay reads and advance writes
- Create local copies to allow data forwarding

[Figure: in iteration i, the write of x is advanced and the rest of the iteration reads a local copy x', so that iteration i+1's read of x can be forwarded sooner.]
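In source terms the transformation looks like this sketch (Python, with illustrative variable names and a trivially small serial window): both versions compute the same values, but the second produces the accumulator's new value at the top of the iteration and keeps a private copy of the old value for the rest of it, so a speculative next iteration can receive the forwarded value sooner.

```python
def before(data):
    # The write of acc happens at the END of iteration i, so under TLS
    # iteration i+1's read of acc is exposed to a violation for longer.
    acc, out = 0, []
    for x in data:
        out.append(acc)       # read acc early in the iteration
        acc = acc + x         # write acc late in the iteration
    return out

def after(data):
    # Advance the write and keep a private copy (the "x'" of the figure):
    # the new value exists at the top of the iteration and can be
    # forwarded to the next speculative iteration immediately.
    acc, out = 0, []
    for x in data:
        new_acc = acc + x     # write advanced to the start
        out.append(acc)       # remaining reads use the private old copy
        acc = new_acc
    return out

print(before([1, 2, 3, 4]), after([1, 2, 3, 4]))  # [0, 1, 3, 6] twice
```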
Optimized Speculative Performance

[Chart: speedup of the base, optimized-RTS, and manually transformed versions on the same benchmarks]

- Base performance
- Optimized RTS with no manual intervention
- Violation statistics used to manually transform code
Size of Speculative Write State

- The maximum size determines the write buffer size needed for maximum performance
- A non-head processor stalls when its write buffer fills up
- Small write buffers (< 64 lines) will achieve good performance

Maximum number of lines of write state (32-byte cache lines):

  compress   24      ijpeg     32      cholesky    4
  eqntott    40      mpeg      56      ear        82
  grep       11      alvin    158      simplex    14
  m88ksim    28      wc         8
Hydra Prototype

- Design based on the Integrated Device Technology (IDT) RC32364
Conclusions

- Hydra offers a new way to design microprocessors
  - A single-chip MP exploits parallelism at all levels
  - Low-overhead support for speculative parallelism
  - Provides high performance on applications with medium- to large-grain parallelism
  - Allows a performance-optimization migration path for difficult-to-parallelize fine-grained applications
Hydra Team

- Team: Monica Lam, Lance Hammond, Mike Chen, Ben Hubbert, Manohar Prahbu, Mike Siu, Melvyn Lim, and Maciek Kozyrczak (IDT)
- URL: http://www-hydra.stanford.edu