Multicore Computing - Evolution
ECE 4100/6100
Performance Scaling

[Chart: MIPS (log scale, 0.01 to 10,000,000) versus year (1970-2020), tracing the 8086, 286, 386, and 486 through the Pentium, Pentium Pro, and Pentium 4 architectures.]

Source: Shekhar Borkar, Intel Corp.
Intel

• Homogeneous cores
• Bus-based on-chip interconnect
• Shared memory
• Traditional I/O

[Die photo annotations: classic OOO core - reservation stations, issue ports, schedulers, etc.; large, shared, set-associative caches with prefetch, etc.]

Source: Intel Corp.
IBM Cell Processor

• Heterogeneous multicore
• High speed I/O
• High bandwidth, multiple buses

[Die photo annotations: classic (stripped down) core; co-processor accelerators.]

Source: IBM
AMD Au1200 System on Chip

• Embedded processor
• Custom cores
• On-chip I/O
• On-chip buses

Source: AMD
PlayStation 2 Die Photo (SoC)

[Die photo; annotated region: floating point MACs.]

Source: IEEE Micro, March/April 2000
Multi-* is Happening
Source: Intel Corp.
Intel's Roadmap for Multicore

[Roadmap diagram, 2006-2008, adapted from Tom's Hardware. Mobile, desktop, and enterprise lines each move from single-core (SC) parts with 512KB-2MB of cache through dual-core (DC) parts with 2-16MB (some shared), toward quad-core (QC) parts with 4-16MB shared and an eight-core (8C) part with 12MB shared cache at 45nm.]

• Drivers are
– Market segments
– More cache
– More cores

Source: Adapted from Tom's Hardware
Distillation Into Trends
• Technology Trends
– What can we expect/project?
• Architecture Trends
– What are the feasible outcomes?
• Application Trends
– What are the driving deployment scenarios?
– Where are the volumes?
Technology Scaling

[Diagram: MOSFET cross-sections (gate, source, drain, body) with oxide thickness tox and channel length L shrinking between generations.]

• 30% scaling down in dimensions → doubles transistor density
• Power per transistor: P = C·Vdd²·f + Vdd·Ist + Vdd·Ileak
– Vdd scaling → lower power
• Transistor delay = Cgate·Vdd / Isat
– Cgate, Vdd scaling → lower delay
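A quick numeric sanity check of these two relations, in Python; the C, Vdd, f, and Isat values are made-up illustrative numbers, not process data:

```python
# Dynamic switching power per transistor: P_dyn = C * Vdd^2 * f.
def dynamic_power(c, vdd, f):
    return c * vdd**2 * f

# Transistor delay: Cgate * Vdd / Isat.
def transistor_delay(c_gate, vdd, i_sat):
    return c_gate * vdd / i_sat

# Scaling Vdd from 1.2V to 1.0V at fixed C and f cuts dynamic power
# quadratically: (1.0/1.2)^2, roughly a 31% reduction.
p_old = dynamic_power(c=1.0, vdd=1.2, f=1.0)
p_new = dynamic_power(c=1.0, vdd=1.0, f=1.0)
print(round(p_new / p_old, 3))  # 0.694

# Lower Cgate and Vdd also shorten delay (faster transistors).
print(transistor_delay(c_gate=1.0, vdd=1.2, i_sat=2.0))  # 0.6
```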
Fundamental Trends

High Volume Manufacturing   2004   2006   2008   2010   2012   2014   2016   2018
Technology Node (nm)        90     65     45     32     22     16     11     8
Integration Capacity (BT)   2      4      8      16     32     64     128    256
Delay = CV/I scaling        0.7    ~0.7   >0.7   (delay scaling will slow down)
Energy/Logic Op scaling     >0.35  >0.5   >0.5   (energy scaling will slow down)
Bulk Planar CMOS            High Probability  ->  Low Probability
Alternate, 3G etc.          Low Probability   ->  High Probability
Variability                 Medium       High        Very High
ILD (K)                     ~3     <3     (reduce slowly towards 2-2.5)
RC Delay                    1      1      1      1      1      1
Metal Layers                6-7    7-8    8-9    (0.5 to 1 layer per generation)

Source: Shekhar Borkar, Intel Corp.
Moore’s Law
Source: Intel Corp.
• How do we use the increasing number of transistors?
• What are the challenges that must be addressed?
Impact of Moore's Law To Date

• Increase Frequency → deeper pipelines
• Manage Power → clock gating, activity minimization
• Increase ILP → concurrent threads, branch prediction and SMT
• Push the Memory Wall → larger caches

[Annotated IBM Power5 die photo. Source: IBM]
Shaping Future Multicore
Architectures
• The ILP Wall
– Limited ILP in applications
• The Frequency Wall
– Not much headroom
• The Power Wall
– Dynamic and static power dissipation
• The Memory Wall
– Gap between compute bandwidth and memory bandwidth
• Manufacturing
– Non-recurring engineering costs
– Time to market
The Frequency Wall
• Not much headroom left in the stage to stage times
(currently 8-12 FO4 delays)
• Increasing frequency leads to the power wall
Vikas Agarwal, M. S. Hrishikesh, Stephen W. Keckler, Doug Burger. Clock rate versus IPC: the end of the road for
conventional microarchitectures. In ISCA 2000
Options
• Increase performance via parallelism
– On chip this has been largely at the instruction/data level
• The 1990s through 2005 was the era of instruction
level parallelism
– Single instruction multiple data/Vector parallelism
• MMX, SSE, Vector Co-Processors
– Out Of Order (OOO) execution cores
– Explicitly Parallel Instruction Computing (EPIC)
• Have we exhausted options in a thread?
The ILP Wall - Past the Knee of the Curve?

[Sketch: performance versus design "effort" across scalar in-order, moderate-pipe superscalar/OOO, and very-deep-pipe aggressive superscalar/OOO designs. It made sense to go superscalar/OOO - good ROI; beyond that there is very little gain for substantial effort.]

Source: G. Loh
The ILP Wall
• Limiting phenomena for ILP extraction:
– Clock rate: at the wall each increase in clock rate has a
corresponding CPI increase (branches, other hazards)
– Instruction fetch and decode: at the wall more instructions
cannot be fetched and decoded per clock cycle
– Cache hit rate: poor locality can limit ILP and it adversely
affects memory bandwidth
– ILP in applications: the serial fraction of applications
• Reality:
– Limit studies cap IPC at 100-400 (using ideal processor)
– Current processors have IPC of only 1-2
The ILP Wall: Options
• Increase granularity of parallelism
– Simultaneous Multi-threading to exploit TLP
• TLP has to exist - otherwise poor utilization results
– Coarse grain multithreading
– Throughput computing
• New languages/applications
– Data intensive computing in the enterprise
– Media rich applications
The Memory Wall

[Chart: performance (log scale, 1-1000) versus time. CPU performance ("Moore's Law") grows at ~60%/yr while DRAM performance grows at ~7%/yr; the processor-memory performance gap grows about 50% per year.]
The Memory Wall

[Sketch: average access time versus year, rising as the gap grows.]

• Increasing the number of cores increases the demanded memory bandwidth
• What architectural techniques can meet this demand?
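One way to quantify the pressure is the standard average-access-time relation, AMAT = hit time + miss rate × miss penalty. A small Python sketch with assumed cycle counts (not numbers from this slide) shows how contention-inflated miss penalties drag down average access time:

```python
# Average memory access time (AMAT) = hit_time + miss_rate * miss_penalty.
def amat(hit_time_cycles, miss_rate, miss_penalty_cycles):
    return hit_time_cycles + miss_rate * miss_penalty_cycles

# A single core: 2-cycle cache hit, 2% miss rate, 200-cycle DRAM access.
base = amat(hit_time_cycles=2, miss_rate=0.02, miss_penalty_cycles=200)

# More cores contending for the same memory channels can inflate the
# effective miss penalty; here it is doubled for illustration.
contended = amat(hit_time_cycles=2, miss_rate=0.02, miss_penalty_cycles=400)

print(base, contended)  # 6.0 10.0
```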
The Memory Wall

[Die photos: IBM Power5 (CPU0, CPU1) and AMD dual-core Athlon FX, with large on-die caches.]

• On-die caches are both area intensive and power intensive
– The StrongARM dissipates more than 43% of its power in caches
– Caches incur huge area costs
• Larger caches never deliver the near-universal performance boost offered by frequency ramping (Source: Intel)
The Power Wall

P = C·Vdd²·f + Vdd·Ist + Vdd·Ileak

• Power per transistor scales with frequency but also scales (quadratically) with Vdd
– Lower Vdd can be compensated for with increased pipelining to keep throughput constant
– Power per transistor is not the same as power per area → power density is the problem!
– Multiple units can be run at lower frequencies to keep throughput constant, while saving power
Leakage Power Basics

• Sub-threshold leakage: I_sub = K1·W·e^(-Vth/(n·k·T))·(1 - e^(-V/(k·T)))
– Increases with lower Vth, higher T, larger W
• Gate-oxide leakage: I_ox = K2·W·(V/Tox)²·e^(-α·Tox/V)
– Increases with lower Tox, higher W
– High-K dielectrics offer a potential solution
• Reverse-biased pn junction leakage: I_pn = J_leakage,pn·(e^(qV/(k·T)) - 1)·A
– Very sensitive to T, V (in addition to diffusion area)
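To see the direction of the sub-threshold trend, here is a qualitative Python sketch; K1, n, and the device values are arbitrary illustrative constants, and only the exponential dependence follows the formula above:

```python
import math

# Sub-threshold leakage: I_sub ~ K1 * W * exp(-Vth / (n*k*T)).
# (The (1 - e^(-V/kT)) factor is ~1 for V >> kT and is omitted here.)
def i_sub(w, vth, t_kelvin, k1=1.0, n=1.5):
    kt = 8.617e-5 * t_kelvin  # Boltzmann constant in eV/K, times T
    return k1 * w * math.exp(-vth / (n * kt))

nominal = i_sub(w=1.0, vth=0.30, t_kelvin=300.0)
low_vth = i_sub(w=1.0, vth=0.25, t_kelvin=300.0)  # lower threshold voltage
hot     = i_sub(w=1.0, vth=0.30, t_kelvin=360.0)  # higher temperature

# Both lowering Vth and raising T increase leakage exponentially.
print(low_vth > nominal, hot > nominal)  # True True
```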
The Current Power Trend

[Chart: power density (W/cm², log scale) versus year (1970-2010). From the 4004, 8008, 8080, 8085, 8086, 286, 386, and 486 up through the Pentium and P6, power density climbs past "hot plate" levels toward those of a nuclear reactor, a rocket nozzle, and ultimately the sun's surface.]

Source: Intel Corp.
Improving Power/Performance

P = C·Vdd²·f + Vdd·Ist + Vdd·Ileak

• Consider a constant die size and decreasing core area each generation = more cores/chip
– Effect of lowering voltage and frequency → power reduction
– Increasing cores/chip → performance increase
→ Better power performance!
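The argument can be made concrete with a tiny Python sketch; the core counts, voltages, and frequencies are illustrative assumptions, not measured values. Replacing one fast core with two slower, lower-voltage cores keeps aggregate throughput while cutting dynamic power:

```python
# Dynamic power for n identical cores: P = n * C * Vdd^2 * f.
def dyn_power(n_cores, c, vdd, f):
    return n_cores * c * vdd**2 * f

# One core at nominal (Vdd=1.0, f=1.0) vs two cores at 0.8*Vdd, half f.
p_one = dyn_power(n_cores=1, c=1.0, vdd=1.0, f=1.0)
p_two = dyn_power(n_cores=2, c=1.0, vdd=0.8, f=0.5)

throughput_one = 1 * 1.0   # cores * frequency, as a crude throughput proxy
throughput_two = 2 * 0.5   # same aggregate throughput

print(round(p_two / p_one, 2))  # 0.64: same throughput at ~2/3 the power
```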
Accelerators

[Chart: MIPS (log scale, 1.E+02 to 1.E+06) versus year (1995-2015), comparing general-purpose (GP) MIPS at 75W with TCP/IP offload engine (TOE) MIPS at ~2W. The offload engine shown is 2.23 mm x 3.54 mm, 260K transistors.]

Opportunities: network processing engines, MPEG encode/decode engines, speech engines

Source: Shekhar Borkar, Intel Corp.
Low-Power Design Techniques

• Circuit and gate level methods
– Voltage scaling
– Transistor sizing
– Glitch suppression
– Pass-transistor logic
– Pseudo-nMOS logic
– Multi-threshold gates

Two decades worth of research and development!

• Functional and architectural methods
– Clock gating
– Clock frequency reduction
– Supply voltage reduction
– Power down/off
– Algorithmic and software techniques
The Economics of Manufacturing

• Where are the costs of developing the next generation processors?
– Design costs
– Manufacturing costs
• What type of chip-level solutions do the economics imply?
• Assessing the implications of Moore's Law is an exercise in mass production
The Cost of An ASIC

Example: design with 80M transistors in 100nm technology - estimated cost $85M-$90M*

• Cost and risk rising to unacceptable levels
• Top cost drivers
– Verification (40%)
– Architecture design (23%)
– Embedded software design
• 1400 man-months (SW)
• 1150 man-months (HW)
– HW/SW integration: 12-18 months

*Handel H. Jones, "How to Slow the Design Cost Spiral," Electronics Design Chain, September 2002, www.designchain.com
The Spectrum of Architectures

[Diagram: a spectrum running from customization fully in software (microprocessors, compilation, software development) to customization fully in hardware (custom ASICs, synthesis, hardware development). Moving toward hardware means decreasing customization and increasing design NRE effort and time to market. Along the spectrum: microprocessors; fixed + variable ISA architectures; tiled architectures; polymorphic computing; FPGAs; structured ASICs; custom ASICs. Named examples include Tensilica, Stretch Inc., PACT, PICOChip, RAW, TRIPS, SM, MONARCH, Xilinx, Altera, LSI Logic, and Leopard Logic.]
Interlocking Trade-offs

[Diagram: Power, Frequency, ILP, and Memory linked by interlocking trade-offs - dynamic power, leakage power, bandwidth, and dynamic penalties.]
Multi-core Architecture Drivers

• Addressing ILP limits
– Multiple threads
– Coarse grain parallelism → raise the level of abstraction
• Addressing frequency and power limits
– Multiple slower cores across technology generations
– Scaling via increasing the number of cores rather than frequency
– Heterogeneous cores for improved power/performance
• Addressing memory system limits
– Deep, distributed cache hierarchies
– OS replication → shared memory remains dominant
• Addressing manufacturing issues
– Design and verification costs
→ Replication → the network becomes more important!
Parallelism
Beyond ILP

• Performance is limited by the serial fraction

[Diagram: the parallelizable portion of a program splits across 1, 2, 3, or 4 CPUs, while the serial portion takes the same time regardless.]

• Coarse grain parallelism in the post-ILP era
– Thread, process and data parallelism
• Learn from the lessons of the parallel processing community
– Revisit the classifications and architectural techniques
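The serial-fraction limit in the figure is Amdahl's law; a short Python sketch (the 10% serial fraction is an illustrative assumption):

```python
# Amdahl's law: speedup(N) = 1 / (s + (1 - s)/N), s = serial fraction.
def speedup(serial_fraction, n_cpus):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cpus)

# Even with unlimited CPUs, a 10% serial fraction caps speedup below 10x.
for n in (1, 2, 3, 4, 1000):
    print(n, round(speedup(0.10, n), 2))
```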
Flynn’s Model
• Flynn’s Classification
– Single instruction stream, single data stream (SISD)
• The conventional, word-sequential architecture including
pipelined computers
– Single instruction stream, multiple data stream (SIMD)
• The multiple ALU-type architectures (e.g., array processor)
– Multiple instruction stream, single data stream (MISD)
• Not very common
– Multiple instruction stream, multiple data stream (MIMD)
• The traditional multiprocessor system
M.J. Flynn, “Very high speed computing systems,” Proc. IEEE, vol. 54(12), pp. 1901–1909, 1966.
SIMD/Vector Computation

[Figures: IBM Cell SPE organization and SPE pipeline diagram (Source: IBM); Cray vector architecture (Source: Cray).]

• SIMD and Vector models are spatial and temporal analogs of each other
• A rich architectural history dating back to 1953!
SIMD/Vector Architectures
• VIRAM - Vector IRAM
– Logic is slow in a DRAM process
– Put a vector unit in the DRAM and provide a port between a traditional
processor and the vector IRAM, instead of putting a whole processor in DRAM
Source: Berkeley Vector IRAM
MIMD Machines

[Diagram: four nodes, each a processor + cache (P+C) with a directory (Dir) and local memory, connected by an interconnection network.]

• Parallel processing has catalyzed the development of several generations of parallel processing machines
• Unique features include the interconnection network, support for system-wide synchronization, and programming languages/compilers
Basic Models for Parallel Programs
• Shared Memory
– Coherency/consistency are driving concerns
– Programming model is simplified at the expense
of system complexity
• Message Passing
– Typically implemented on distributed memory
machines
– System complexity is simplified at the expense of
increased effort by the programmer
Shared Memory Model

[Diagram: CPU0 writes X and CPU1 reads X through a shared main memory.]

• That's basically it…
– need to fork/join threads, synchronize (typically locks)
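A minimal sketch of this model in Python, with threads standing in for CPU0 and CPU1; the variable x plays the role of location X, and the lock is the "synchronize" step:

```python
import threading

x = 0                      # shared location "X" in main memory
lock = threading.Lock()

def writer():              # CPU0: Write X
    global x
    with lock:
        x = 42

def reader(out):           # CPU1: Read X
    with lock:
        out.append(x)

result = []
t0 = threading.Thread(target=writer)
t1 = threading.Thread(target=reader, args=(result,))
t0.start(); t0.join()      # fork/join: writer finishes before reader starts
t1.start(); t1.join()
print(result[0])  # 42
```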
Message Passing Protocols

[Diagram: CPU0 performs a Send; CPU1 performs a Recv.]

• Explicitly send data from one thread to another
– need to track IDs of other CPUs
– broadcast may need multiple sends
– each CPU has its own memory space
• Hardware: send/recv queues between CPUs
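The same exchange under message passing, sketched in Python with threads as the CPUs and a Queue standing in for the hardware send/recv queue (names and values are illustrative):

```python
import queue
import threading

q01 = queue.Queue()        # send/recv queue from CPU0 to CPU1

def cpu0():
    local = 7              # CPU0's own memory space
    q01.put(local * 2)     # explicit Send

def cpu1(out):
    out.append(q01.get())  # Recv: blocks until a message arrives

received = []
t1 = threading.Thread(target=cpu1, args=(received,))
t0 = threading.Thread(target=cpu0)
t1.start(); t0.start()
t0.join(); t1.join()
print(received[0])  # 14
```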
Shared Memory Vs. Message Passing

• Shared memory doesn't scale as well to larger numbers of nodes
– communications are broadcast based
– the bus becomes a severe bottleneck
• Message passing doesn't need a centralized bus
– can arrange the multiprocessor like a graph
• nodes = CPUs, edges = independent links/routes
– can have multiple communications/messages in transit at the same time
Two Emerging Challenges

• Programming Models and Compilers? (figure source: Intel Corp.)
• Interconnection Networks (figure source: IBM)