Lecture on
High Performance Processor Architecture
(CS05162)
Technology, Applications, Frontier
Problems and Research Directions
An Hong
[email protected]
Fall 2009
School of Computer Science and Technology
University of Science and Technology of China
Outline
 The Challenges and Opportunities of the Multi-Core/Many-Core System
− The state of the art of multi-core and many-core
− What are multi-core and many-core?
− Why multi-core and many-core?
− Problems with performance, technology, power, frequency and programming
− What is the fundamental issue that multiple cores face?
The big trend in CPU chip architecture: single-core -> multicore & manycore
Multicore Products Nowadays
 Lots of dual-core products now:
− Intel: Pentium D and Pentium Extreme Edition, Core Duo, Woodcrest, Montecito
− IBM POWER5, 6, 7
− AMD Opteron / Athlon 64
− Sun UltraSPARC IV.
 Systems with more than two cores are here with more
coming:
− IBM Cell (asymmetric).
 1 core PowerPC plus 8 “synergistic processing elements”.
− Sun Niagara
 8 cores, 4 hardware threads per core.
− nVidia
 General Purpose Computation on Graphics Processors (GPGPU)
− Intel expects to produce 16- or even 32-core chips within a
decade.
Architecture of Dual-Core Chips
 Intel Core Duo
− Two physical cores in a package
− Each with its own execution resources
− Each with its own L1 cache
 32 KB instruction and 32 KB data
− Both cores share the L2 cache
 2 MB, 8-way set associative; 64-byte line size
 10 clock cycles latency; write-back update policy
[Figure: block diagram of the package: two execution cores, each with its own FP unit and L1 cache, sharing one L2 cache and the 667 MHz system bus (5333 MB/s).]
 AMD Opteron
− Separate 1 MB L2 caches
− Improvements for memory affinity and thread affinity
(A short sketch of why the 64-byte line size matters to software follows below.)
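The cache parameters above have direct software consequences. As a hedged illustration (added here, not on the original slide), the C++ sketch below pads two per-thread counters so they land on separate 64-byte cache lines, so that the two cores' private L1 caches do not repeatedly invalidate each other's copy of the same line (false sharing). The struct and loop counts are illustrative only.

```cpp
// Minimal C++11 sketch: keeping per-core counters on separate 64-byte
// cache lines so the two cores' private L1 caches do not keep
// invalidating each other's copy of the same line (false sharing).
#include <cstdint>
#include <iostream>
#include <thread>

struct alignas(64) PaddedCounter {      // 64-byte line size, as on the slide
    std::uint64_t value = 0;
};

int main() {
    PaddedCounter counter[2];           // one counter per core/thread
    auto work = [&counter](int id) {
        for (std::uint64_t i = 0; i < 50000000ULL; ++i)
            counter[id].value++;        // touches only its own cache line
    };
    std::thread t0(work, 0), t1(work, 1);
    t0.join();
    t1.join();
    std::cout << counter[0].value + counter[1].value << "\n";
}
```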
Intel Multi-core Plan
Cell from IBM and Sony
Niagara from SUN
GPU Fundamentals: The Modern Graphics Pipeline
[Figure: the modern graphics pipeline. The application on the CPU submits 3D vertices; the GPU transforms and lights them into 2D vertices (programmable vertex processor), assembles primitives into screen-space triangles, rasterizes them into fragments (pre-pixels), shades the fragments (programmable pixel processor), and writes the final pixels (color, depth). Graphics state drives the vertex, geometry and fragment processors; video memory holds textures, and render-to-texture feeds results back into the pipeline.]
Sea Change in Chip Design
 Intel 4004 (1971): 4-bit processor, 2312 transistors, 0.4 MHz, 10 micron PMOS, 11 mm² chip
 RISC II (1983): 32-bit, 5-stage pipeline, 40,760 transistors, 3 MHz, 3 micron NMOS, 60 mm² chip
 A 125 mm² chip in 0.065 micron CMOS holds the equivalent of 2312 copies of RISC II + FPU + I-cache + D-cache
− RISC II shrinks to ~0.02 mm² at 65 nm CMOS
− Caches via DRAM or 1-transistor SRAM (www.t-ram.com)?
− Proximity communication via capacitive coupling at > 1 TB/s? (Ivan Sutherland @ Sun / Berkeley)
Processor is the new transistor?
Era of multi-core and many-core is coming
 Transistor count is still rising
 Clock speed has stopped increasing
 Issues
− Heat and power
− Complexity
− Hard to exploit ILP
 Intel’s multi-core and many-core roadmap:
Borkar, Dubey, Kahn, et al. “Platform 2015.” Intel
White Paper, 2005.
Multicore vs. Manycore
 Multicore: 2X cores every 2 years ⇒ ≈ 64 cores in 8 years
 Manycore: 8X to 16X the core count of multicore
[Figure: projected cores per chip (log scale, 1 to 1000) versus year, 2003–2015, doubling from a few cores toward 64–512; the region labeled “Automatic Parallelization, Thread Level Speculation” covers only the lowest core counts.]
Technology Trends: Microprocessor Capacity
2X transistors per chip every 1.5 years, known as “Moore’s Law”.
Microprocessors have become smaller, denser, and more powerful.
Not just processors: bandwidth, storage, etc.
Gordon E. Moore (co-founder of Intel in 1968) observed in 1965 that the number of transistors doubled roughly every 18 months; the observation became known as “Moore’s Law”.
Moore’s Law is a law of resources, not a law of performance
 The new Moore’s Law predicts that the number of cores per chip will double every 18–24 months
− 2007: 8 cores
 e.g., IBM Cell with 9 cores, Sun (Niagara) with 8 cores
− 2009: 16 cores
− 2011: 32 cores
− 2013: 64 cores
− 2015: 128 cores
− 2021: 1K cores
 For this new Moore’s Law to matter, we must answer: who needs this many cores? Does the performance the chip actually delivers double as the core count doubles? How can we make that happen? (An illustrative calculation follows below.)
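One standard way to frame the last question is Amdahl's law (added here as an illustrative calculation, not from the original slide). With parallel fraction p and N cores, the speedup is

\[
S(N) = \frac{1}{(1-p) + p/N},
\qquad
S(64)\big|_{p=0.95} \approx 15.4,
\qquad
S(128)\big|_{p=0.95} \approx 17.4 .
\]

So if even 5% of the work stays serial, doubling from 64 to 128 cores buys only about 13% more speedup; delivered performance does not double unless the serial fraction also shrinks.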
Moore’s Law Still Holds
 No Exponential is Forever,
but perhaps we can Delay it Forever
Opportunities for microprocessor design in nanometer-scale processes
 Billions of transistors on a single chip
− Single-core, single-thread CPU => System on a Chip, CMP, “supercomputer on a chip”, ...
 Large amounts of on-chip memory; logic and storage built together
− CPU separated from memory => PIM, IRAM
 Many device types on one die (P- and N-channel MOS transistors, PNP and NPN bipolar transistors), floating-gate devices, fuses and antifuses integrated on the same substrate
− ASIC implementation => FPGA implementation (separating front-end and back-end design)?
Challenges for microprocessor design in nanometer-scale processes
 Power consumption
− Deeply pipelined designs
− Many power-hungry structures
 Wire delays
− Wire delay now exceeds gate delay, so on-chip communication is very expensive (one estimate: at 35 nm, a signal can cross only about 1% of the chip area in one clock cycle).
− Wire delay bounds the maximum useful chip area; processor designs that buy performance by adding area have no future.
− The instruction set must support direct data communication, and core execution should move from control-flow driven to data-flow driven, minimizing logic and wire delay on the critical path.
− A single processor cannot be too large, so it can only run fine-grained threads.
Challenges for microprocessor design in nanometer-scale processes
 I/O bandwidth (pins)
− The number of I/O pins per chip cannot grow as fast as the transistor count, so growth in on-chip compute capability is mismatched with growth in off-chip bandwidth
 From 180 nm to 35 nm, the ratio of transistors to signal pins grows by a factor of 45.
 More transistors => more processor cores per chip
 Pin counts cannot scale proportionally => the chip’s communication bandwidth to the rest of the system cannot scale proportionally => multiple cores must share the same memory or communication path.
 Limited off-chip bandwidth will in practice cap the useful number of on-chip processors (even though more transistor resources are available).
− Find the balance, under a fixed bandwidth budget, among the number of cores, core complexity, the memory access patterns of different applications, and on-chip memory capacity
Challenges for microprocessor design in nanometer-scale processes
 The “memory wall” (processor-to-memory performance gap)
− Microprocessor performance: 60% each year
− Dynamic RAM: 7% each year
− Given current trends in on-chip parallelism, we will soon reach the point where a processor can issue hundreds or even thousands of instructions in the time it takes to fetch a single value into on-chip memory. Processor and memory therefore need to sit on the same chip, with as much of the application’s working set as possible kept on chip.
 Parallelizing serial programs (parallelism inside a program)
− More mutually independent work must be discovered and executed in parallel on the cores
− The high cost of design and verification means a single future design must cover many application functions
− Future instruction sets must be polymorphic, able to run different applications by using the execution and storage units in different modes.
Summary of technology trends
 Software: the system environment developers are about to face
− On-chip compute capability: > 1 TFLOPS
− Heterogeneous chips: CPUs + GPUs, CPUs + NPUs, CPUs + DSPs, CPUs + application-specific accelerators
− 16+ cores, 3+ levels of memory hierarchy, 100+ hardware threads
 Hardware: for general-purpose multicore the open issues are performance and scalability; for special-purpose multicore, generality and programmability
− General-purpose multicore (homogeneous)
 No application-specific high-performance cores; the core design is reusable, load balancing is easy, resource utilization is high, programmability is good
 Poor scalability
− Special-purpose multicore (heterogeneous)
 Application-specific high-performance cores; (relatively) good scalability
 Core designs are not reused, load balancing is hard, resource utilization is low, programmability is poor
Summary of technology trends
 Hardware scalability: how do we balance compute resources, memory hierarchy, communication bandwidth, and performance/power so that continued doubling of the core count actually improves application performance?
− Compute resources: can they be configured for different applications?
− Memory hierarchy: can data placement and movement be managed by software?
− On-chip and package bandwidth: can bandwidth allocation be managed by software, guaranteeing data delivery bandwidth to compute resources at every level?
− Performance/power and performance/area: at 45 nm–22 nm, meet physical design constraints of 1 GHz, 100 W and 200 mm² while delivering 1 TFLOPS of application-usable performance
 Software programmability: how do we write software for very large scale on-chip parallel systems and provide productivity for application development?
− Portability: keep the parallel programming model unchanged across manycore generations whose core count doubles, i.e., keep software portable while gaining performance as the core count grows.
− Adaptability: support parallel programming for a broad range of applications.
− Usability: be usable by people who are not computer scientists.
− Reusability of software components: support the full life cycle of parallel systems and application software.
Summary of technology trends
Architectural efficiency (performance/power, performance/area) + programming efficiency
 The most central problem: programming productivity
− How should manycore architectures and their operating systems support parallel software development?
− Can we make performance scale for both regular and irregular computations and provide pervasive parallel computing capability?
− How do we escape the reactive pattern of building the hardware first and patching the parallel software environment afterwards, so that writing efficient, correct parallel programs on manycore platforms becomes simple?
− How do we make the delivered performance of parallel programs scale as the core count doubles?
What is the fundamental issue that multiple cores face?
 Writing correct and efficient parallel programs is very
difficult and time consuming
Parallel programming
= Decomposition of computation in tasks +
Assignment of tasks to threads +
Orchestration of data access, comm, synch. +
Mapping threads to cores
 Designing microarchitectures that are easy to implement, scalable, and adaptive is very challenging
Multiple Cores Arch.
= Computing Arch. + Communication Arch.
= Cores/Processors Arch. +
On-chip Memory Arch. +
On-chip Interconnection and I/O Arch.
4 Steps in creating and executing a parallel program on a multiprocessor system architecture
[Figure: partitioning. A sequential computation is decomposed into tasks; the tasks are assigned to processes (p0–p3); orchestration turns them into a parallel program; and the processes are mapped onto processors (P0–P3).]
 Decomposition of the computation into tasks
 Assignment of tasks to processes
 Orchestration of data access, communication and synchronization
 Mapping of processes to processors (a minimal code sketch of these four steps follows)
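A minimal C++ sketch of the four steps for a parallel sum (added as an illustration; the sizes, the reduction, and the use of one thread per task are arbitrary choices, not part of the original slide):

```cpp
// Minimal C++11 sketch of the four steps for a parallel sum.
// Illustrative only: a real code would also worry about data layout,
// load balance and explicit thread/core affinity.
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    std::vector<double> data(1 << 20, 1.0);
    const unsigned P = std::max(1u, std::thread::hardware_concurrency());

    // 1. Decomposition: split the index range into P independent tasks.
    // 2. Assignment: give task t to thread t.
    std::vector<double> partial(P, 0.0);
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < P; ++t) {
        threads.emplace_back([&, t] {
            std::size_t chunk = data.size() / P;
            std::size_t begin = t * chunk;
            std::size_t end   = (t + 1 == P) ? data.size() : begin + chunk;
            // 3. Orchestration: each task writes only its own slot, so the
            //    only synchronization needed here is the final join.
            partial[t] = std::accumulate(data.begin() + begin,
                                         data.begin() + end, 0.0);
        });
    }
    for (auto& th : threads) th.join();

    // 4. Mapping: threads are placed on cores by the OS scheduler here;
    //    affinity APIs could pin them explicitly.
    std::cout << std::accumulate(partial.begin(), partial.end(), 0.0) << "\n";
}
```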
The UC Berkeley view: seven critical questions
A view from UC Berkeley: seven critical questions
for 21st century parallel computing
7 Questions for Parallelism
 Applications:
− 1. What are the apps?
− 2. What are kernels of apps?
 Hardware:
− 3. What are the HW building blocks?
− 4. How to connect them?
 Programming Model & Systems Software:
− 5. How to describe apps and kernels?
− 6. How to program the HW?
 Evaluation:
− 7. How to measure success?
Par Lab Research Overview
Easy to write correct programs that run efficiently on manycore
[Figure: the Par Lab software stack. Applications (Personal Health, Image Retrieval, Hearing/Music, Speech, Parallel Browser) sit on the dwarfs; a productivity layer built around a Composition & Coordination Language (C&CL) with its compiler/interpreter, parallel libraries and parallel frameworks; an efficiency layer with efficiency languages, sketching, autotuners, legacy code, communication & synchronization primitives, schedulers and efficiency-language compilers; correctness support (static verification, type systems, directed testing, dynamic checking, debugging with replay); OS libraries & services, legacy OS and hypervisor; all running on Intel multicore/GPGPU and the RAMP manycore emulator.]
7 Questions for Parallelism
 Applications:
− 1. What are the apps?
− 2. What are kernels of apps?
 Hardware:
− 3. What are the HW building blocks?
− 4. How to connect them?
 Programming Model & Systems Software:
− 5. How to describe apps and kernels?
− 6. How to program the HW?
 Evaluation:
− 7. How to measure success?
Apps and Kernels Tower: What are the problems?
 Who needs 100s of cores?
− Failure of imagination? (CS education?)
− Need compelling apps that use 100s of cores
 What about parallel benchmarks?
− Few examples (e.g., SPLASH, NAS)
 Optimized to old models, languages, architectures…
 How do we invent the parallel systems of the future when tied to old code and the programming models of the past?
Can we find patterns that are widely used?
 Look for common patterns of communication and
computation
− 1.Embedded Computing (EEMBC benchmark)
− 2.Desktop/Server Computing (SPEC2006)
− 3.Data Base / Text Mining Software
 Advice from Jim Gray of Microsoft and Joe Hellerstein of UC Berkeley
− 4.Games/Graphics/Vision
− 5.Machine Learning
 Advice from Mike Jordan and Dan Klein of UC Berkeley
− 6.High Performance Computing (Original “7 Dwarfs”)
 Result: 13 “Dwarfs”
Dwarf Popularity (Red = Hot, Blue = Cool)
13 Dwarfs (so far)
 1. Finite State Machine
 2. Combinational Logic
 3. Graph Traversal
 4. Structured Grids
 5. Dense Linear Algebra
 6. Sparse Linear Algebra
 7. Spectral Methods (FFT)
 8. Dynamic Programming
 9. N-Body Methods
 10. MapReduce
 11. Back-track/Branch & Bound
 12. Graphical Model Inference
 13. Unstructured Grids
 Claim: parallel architectures, languages, compilers, … must do at least these well to do future parallel apps well
 Note: MapReduce is embarrassingly parallel; perhaps FSM is embarrassingly sequential? (A compact sketch of one dwarf follows.)
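To make the notion of a dwarf concrete, here is a minimal, hedged sketch of one of them, dense linear algebra (dwarf 5), reduced to its core pattern; the matrix size and values are illustrative only, and the dwarf is defined by the pattern, not by this particular code:

```cpp
// Dwarf #5 (dense linear algebra) boiled down to its core pattern:
// a triply nested loop over dense, regularly indexed data.
#include <iostream>
#include <vector>

int main() {
    const int n = 256;
    std::vector<double> A(n * n, 1.0), B(n * n, 2.0), C(n * n, 0.0);

    for (int i = 0; i < n; ++i)          // regular, data-independent loops:
        for (int k = 0; k < n; ++k)      // easy to tile, vectorize and
            for (int j = 0; j < n; ++j)  // parallelize across cores
                C[i * n + j] += A[i * n + k] * B[k * n + j];

    std::cout << C[0] << "\n";           // n * 1.0 * 2.0 = 512
}
```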
Application-Driven Research vs. CS Solution-Driven Research
 Drill down on 4 app areas to guide research agenda
 Dwarfs to represent broader set of apps to guide
research agenda
 Dwarfs help break through traditional interfaces
− Benchmarking, multidisciplinary conversations, target for
libraries, and parallelizing parallel research
7 Questions for Parallelism
 Applications:
− 1. What are the apps?
− 2. What are kernels of apps?
 Hardware:
− 3. What are the HW building blocks?
− 4. How to connect them?
 Programming Model & Systems Software:
− 5. How to describe apps and kernels?
− 6. How to program the HW?
 Evaluation:
− 7. How to measure success?
How do we program the HW?
What are the problems?
 For parallelism to succeed, must provide productivity,
efficiency, and correctness simultaneously for scalable
hardware
− Can’t make SW productivity even worse!
− Why do it in parallel if efficiency doesn’t matter?
− Correctness is usually treated as an orthogonal problem
− Productivity slows if code is incorrect or inefficient
 Most programmers not ready to produce correct parallel
programs
− IBM SP customer escalations: concurrency bugs worst, can take
months to fix
Handling application diversity:
hierarchically decompose and exploit the parallelism inherent in applications
 The parallelism inherent in most applications is tangled: it contains parts that are easy to vectorize/stream or to run as threads, as well as parts that can only run serially.
 An application carrying arbitrarily complex parallelism can be decomposed hierarchically into parts that parallelize easily and parts that must run serially, and each part can be mapped onto a suitable structure.
[Figure: serial code calling parallel components (e.g., parallel FFT or matrix libraries); parallel code calling serial components (e.g., MapReduce); and hierarchical compositions of both cases, built from serial components and parallel components. A minimal sketch of the first case follows.]
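As a hedged illustration of the first case, serial driver code calling an off-the-shelf parallel library component (C++17 parallel algorithms; on GCC/Clang this typically requires a conforming standard library and linking against TBB, which is an assumption of this sketch):

```cpp
// Serial driver code that calls an off-the-shelf parallel component.
// C++17 parallel algorithms; illustrative only.
#include <execution>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    std::vector<double> x(1 << 20, 0.5), y(1 << 20, 2.0);

    // The caller is ordinary serial code; the parallelism lives inside
    // the library component, just like calling a parallel FFT or BLAS.
    double dot = std::transform_reduce(std::execution::par,
                                       x.begin(), x.end(), y.begin(), 0.0);

    std::cout << "dot = " << dot << "\n";   // expected (1 << 20) * 0.5 * 2.0
}
```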
How do we describe apps and kernels?
 Observation: Use Dwarfs. Dwarfs are of 2 types
 Algorithms in the dwarfs can either be implemented as:
− Compact parallel computations within a traditional library
 Dense matrices, sparse matrices, spectral methods, combinational logic, finite state machines
− Compute/communicate pattern implemented as a pattern/framework
 MapReduce, graph traversal, graphical models, dynamic programming, backtracking/B&B, N-body, (un)structured grids
 Computations may be viewed at multiple levels: e.g., an FFT library may be built by instantiating a MapReduce framework, mapping 1D FFTs and then transposing (a generalized reduce); a framework sketch follows
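A hedged sketch of the framework idea: the map/reduce pattern is written once and applications plug in their own serial pieces. The map_reduce name and the sequential loop inside it are illustrative only; a real framework would run the map phase on worker threads (as in the four-step sketch earlier), and an FFT library could instantiate it with per-row 1D FFTs as the map function.

```cpp
// A tiny "pattern as framework" sketch: the framework owns the
// map/reduce structure, applications only supply serial plug-in pieces.
#include <iostream>
#include <vector>

template <class In, class Out, class MapFn, class CombineFn>
Out map_reduce(const std::vector<In>& items, Out identity,
               MapFn map_fn, CombineFn combine) {
    // A real framework would apply map_fn over chunks on worker threads
    // and then combine the per-thread partial results; it is written
    // serially here for brevity, without changing the interface.
    Out acc = identity;
    for (const In& x : items) acc = combine(acc, map_fn(x));
    return acc;
}

int main() {
    std::vector<int> samples{3, 1, 4, 1, 5, 9};
    // Application-side "serial components": square each item, sum them.
    int result = map_reduce(samples, 0,
                            [](int x) { return x * x; },
                            [](int a, int b) { return a + b; });
    std::cout << result << "\n";   // 133
}
```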
Handling the diversity of programming models and languages:
separate program performance from programming productivity
 Program performance/efficiency: delivered by the ~20% of “new” programmers who can program in parallel
− They can use several parallel programming models and languages (existing and newly developed)
− They are proficient at writing parallel programs targeted at multicore hardware
− They can apply appropriate parallelization methods (algorithms, structures, programming), find an application’s performance bottlenecks, expose its inherent parallelism, and map it effectively onto the execution hardware
− They design parallel frameworks and libraries that optimize the commonly used core computations
 Patterns/frameworks: parallel software structures for common core computations
 Libraries: optimized parallel code for common core computations
 Programming productivity: lets the ~80% of “old” programmers who do not program in parallel do so easily
− They may keep writing programs with traditional programming models and languages
− They need not be experts in writing parallel programs for multicore hardware
− But they must learn to call parallel components (frameworks and libraries) to build applications
− And to use tools that assist parallel program development
Ensuring Correctness
 Productivity Layer:
− Enforce independence of tasks using decomposition
(partitioning) and copying operators
− Goal: Remove concurrency errors (nondeterminism from
execution order, not just low level data races)
 E.g., the race-free operations “atomic delete” + “atomic insert” do not compose into an “atomic replace”; higher-level properties are needed, and this is not solved with transactions (or locks) alone (see the sketch below)
 Efficiency Layer: Check for subtle concurrency bugs
(races, deadlocks, and so on)
− Mixture of verification and automated directed testing
− Error detection on framework and libraries; some techniques
applicable to third-party software
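A hedged C++ sketch of the composition problem above (class and method names are illustrative): each operation on the shared table is race-free on its own, yet erase-then-insert is not an atomic replace. Here the fix is one critical section; when the two operations live inside separately locked components whose locks cannot simply be merged, that fix is unavailable, which is why higher-level properties are needed.

```cpp
// Why race-free operations do not compose into atomic ones (C++11).
#include <map>
#include <mutex>
#include <string>

class Table {
    std::map<std::string, int> data_;
    std::mutex m_;
public:
    // Each operation is individually race-free (atomic w.r.t. the lock).
    void atomic_erase(const std::string& k) {
        std::lock_guard<std::mutex> g(m_);
        data_.erase(k);
    }
    void atomic_insert(const std::string& k, int v) {
        std::lock_guard<std::mutex> g(m_);
        data_[k] = v;
    }
    // NOT an atomic replace: between erase and insert another thread can
    // observe the key missing, or insert its own value that we then clobber.
    void broken_replace(const std::string& k, int v) {
        atomic_erase(k);
        atomic_insert(k, v);
    }
    // The composite must be expressed as one critical section.
    void atomic_replace(const std::string& k, int v) {
        std::lock_guard<std::mutex> g(m_);
        data_.erase(k);
        data_[k] = v;
    }
};

int main() {
    Table t;
    t.atomic_insert("x", 1);
    t.broken_replace("x", 2);   // fine single-threaded, racy under concurrency
    t.atomic_replace("x", 3);
}
```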
7 Questions for Parallelism
 Applications:
− 1. What are the apps?
− 2. What are kernels of apps?
 Hardware:
− 3. What are the HW building blocks?
− 4. How to connect them?
 Programming Model & Systems Software:
− 5. How to describe apps and kernels?
− 6. How to program the HW?
 Evaluation:
− 7. How to measure success?
Hardware Tower:
What are the problems?
 Multicore (2, 4…) vs. Manycore (64, 128…)?
 How can novel architectural support improve
productivity, efficiency, and correctness for scalable
hardware?
− Efficiency instead of performance to capture energy as well as
performance
 Also, power, design and verification costs, low yield,
higher error rates
HW Solution: Small is Beautiful
 Expect modestly pipelined (5- to 9-stage) CPUs, FPUs, vector and SIMD PEs
− Small cores are not much slower than large cores
 Parallelism is the energy-efficient path to performance: P ≈ C·V²·f (see the worked example below)
− Lowering threshold and supply voltages lowers energy per operation
 Redundant processors can improve chip yield
− Cisco Metro: 188 CPUs + 4 spares; Sun Niagara sells 6- or 8-CPU parts
 Small, regular processing elements are easier to verify
 One size fits all?
− Amdahl’s Law ⇒ heterogeneous processors
− Special function units to accelerate popular functions
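An illustrative worked example (added here; the specific voltage and frequency figures are assumptions, not from the slide) of the dynamic-power relation the slide abbreviates as CV²F. Replacing one core at (V, f) with two cores each at (0.8 V, f/2) keeps the same aggregate operation rate but cuts dynamic power:

\[
P_{\mathrm{dyn}} \approx C\,V^{2}\,f,
\qquad
\frac{P_{\text{two cores}}}{P_{\text{one core}}}
= \frac{2\,C\,(0.8V)^{2}\,(f/2)}{C\,V^{2}\,f}
= 0.64 .
\]

That is roughly a third less power for the same throughput, provided the workload parallelizes and the supply voltage can in fact be lowered that far (threshold voltage sets the floor).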
HW features supporting Parallel SW
 Want Composable Primitives, Not Packaged Solutions
− Transactional Memory is usually a Packaged Solution
 Partitions
 Fast barrier synchronization & atomic fetch-and-op (see the sketch after this list)
 Active messages plus user-level event handling
− Used by parallel language runtimes to provide fast communication,
synchronization, thread scheduling
 Configurable Memory Hierarchy (Cell v. Clovertown)
− Can configure on-chip memory as cache or local store
− Programmable DMA to move data without occupying CPU
− Cache coherence: Mostly HW but SW handlers for complex cases
− Hardware logging of memory writes to allow rollback
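To make "composable primitives" concrete, a minimal sketch that combines an atomic fetch-and-add with a barrier to build a small reduction-then-read step (C++20 std::atomic and std::barrier; added here purely as an illustration, not taken from any particular hardware proposal, and it requires a C++20 toolchain):

```cpp
// Composable primitives: atomic fetch-and-op + barrier (C++20).
#include <atomic>
#include <barrier>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    const int P = 4;
    std::atomic<long> sum{0};
    std::barrier sync(P);                 // simple reusable barrier

    std::vector<std::thread> workers;
    for (int t = 0; t < P; ++t) {
        workers.emplace_back([&, t] {
            sum.fetch_add(t + 1, std::memory_order_relaxed);  // fetch-and-op
            sync.arrive_and_wait();       // all contributions are now in
            if (t == 0)                   // one thread reads the result
                std::cout << "sum = " << sum.load() << "\n";  // 1+2+3+4 = 10
        });
    }
    for (auto& w : workers) w.join();
}
```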
Partitions
 Partition: a hardware-isolated group
− The chip is divided into hardware-isolated partitions, under the control of supervisor software
− User-level software has almost complete control of the hardware inside its partition
− Power-of-2 size, naturally aligned
[Figure: InfiniCore chip with a 16x16 tile array]
7 Questions for Parallelism
 Applications:
− 1. What are the apps?
− 2. What are kernels of apps?
 Hardware:
− 3. What are the HW building blocks?
− 4. How to connect them?
 Programming Model & Systems Software:
− 5. How to describe apps and kernels?
− 6. How to program the HW?
 Evaluation:
− 7. How to measure success?
Measuring Success:
What are the problems?
 1. Only companies can build HW, and it takes years
 2. Software people don’t start working hard until hardware arrives
− 3 months after HW arrives, SW people list everything that must be fixed, then we all wait 4 years for the next iteration of HW/SW
 3. How do we quickly get 100–1000-CPU systems into the hands of researchers so they can innovate in algorithms, compilers, languages, OS, architectures, … ASAP?
 4. Can we avoid waiting 4 years + 3 months between HW/SW iterations?
Build Academic Manycore from FPGAs
 Since ~10 CPUs will fit in one Field Programmable Gate Array (FPGA), a 1000-CPU system from 100 FPGAs?
− 8 simple 32-bit “soft core” RISCs at 100 MHz in 2004 (Virtex-II)
− FPGA generations every 1.5 yrs: 2X CPUs, 1.2X clock rate
 The HW research community does the logic design (“gate shareware”) to create an out-of-the-box manycore
− E.g., a 1000-processor, standard-ISA, binary-compatible, 64-bit, cache-coherent supercomputer @ 150 MHz/CPU in 2007
− Ideal for heterogeneous chip architectures
− RAMPants: 10 faculty at Berkeley, CMU, MIT, Stanford, Texas, and Washington
 “Research Accelerator for Multiple Processors” as a
vehicle to lure more researchers to parallel challenge
and decrease time to parallel salvation
768 CPU “RAMP Blue” (Wawrzynek, Krasnov,… at
Berkeley)
 768 = 12 32-bit RISC cores / FPGA, 4
FGPAs/board, 16 boards, $10k/bd
− Simple MicroBlaze soft cores @ 90 MHz
 Full star-connection between modules
− 1008 node RAMP Blue soon (21 boards)
 NASA Advanced Supercomputing (NAS)
Parallel Benchmarks (all class S)
− UPC versions (C plus shared-memory abstraction) CG, EP,
IS, MG; DEMO?
 RAMPants creating HW & SW for many-core
community using next gen FPGAs
− Chuck Thacker & Microsoft designing next boards
− 3rd party to manufacture and sell boards: 1H08
Why do we care about RAMP?
 Traditional simulators proved to be inefficient
− Four years and many millions of dollars to prototype a new
architecture in hardware
− Software engineers are ineffective with simulators until the new
hardware actually shows up
− Feedback from software engineers cannot impact the immediate next generation
 RAMP uses Field-Programmable Gate Arrays (FPGAs) to emulate highly parallel architectures at hardware speeds. It is a practical approach that modularizes the model and separates the functional and timing aspects of the simulation
− Fast enough
− Easy to “tape out” a design everyday
− Can include research features that would be impractical or
impossible to include in real hardware systems
The Stanford Pervasive Parallelism Laboratory
Vision
[Figure: the Stanford PPL stack. Applications (virtual worlds, autonomous vehicles, financial services) are written in domain-specific languages (rendering, physics, scripting, probabilistic, analytics DSLs) on top of a Parallel Object Language and a Common Parallel Runtime (explicit/static and implicit/dynamic), running on a hardware architecture with SIMD, OOO and threaded cores, scalable interconnects, partitionable hierarchies, scalable coherence, isolation & atomicity, and pervasive monitoring.]
Summary: A Berkeley View 2.0
 Our industry has bet its future on parallelism (!)
 Goal: Productive, Efficient, Correct Parallel Programs while
doubling number of cores every 2 years (!)
 Try Apps-Driven vs. CS Solution-Driven Research
− Laptops/handhelds and datacenters as the modern client/server
 13 dwarfs as a lingua franca, anti-benchmarks
 Composition is critical to parallel computing success
− Composition is an open problem for transactional memory
 Productivity layer for ≈90% of today’s programmers
− Use the C&C Language to reuse experts’ code
 Efficiency layer for ≈10% of today’s programmers
− Create libraries, frameworks, … for use in the productivity layer
 Autotuners over Parallelizing Compilers
 OS & HW: Composable Primitives over CS Solutions
Rethink the factors that shaped traditional architectures and innovate across the whole system?
The Computing Problem
 Aspect 1: computation representation
− How to create a static representation for the desired computation?
 HLL, compilers, instruction set, etc.
 Aspect 2: program execution ! ! !
− How is the dynamic computation recreated and performed?
 Program Execution model and microarchitecture
 Aspect 3: Interface
− How are “program-external” interactions performed?
 OS and run-time environment
Can tens of billions of transistors on a chip (fundamentally) change Aspect 2? How?
How might changes in Aspect 2 affect Aspects 1 and 3?
Program Execution (Aspect 2)
 Program Sequencing
− Sequence through static representation of program to create
dynamic stream of operations
− Q: How to create the dynamic sequence of operations from the
static representation?
 Operation Execution
− Execute operations in the dynamic stream
 Determine dependence relationships
 Schedule operations for execution
 Execute operations
 Communicate values
− Q: How to perform the effects of the operation?
Higher performance means speeding up the above
Impact on Aspect 1
 What does the static program representation look like?
 How does the static representation impact the software used to create it?
− e.g., compilers, HLL, ISA
Impact on Aspect 3
 Operating systems allow a program to interact with
external entities
 The existing operating-system mindset is sequential execution, i.e., a single sequencer with little or no speculation
 The operating-system mindset will have to change to adapt to multiple-sequencer, heavily speculative hardware
− what does it mean to handle multiple exceptions
simultaneously?
− what does it mean to handle exceptions speculatively?
Observations and reflections on the state and direction of chip architecture
 New application characteristics: irregularity
− Control flow is hard or impossible to predict
− Data access patterns are hard or impossible to predict
 Program execution model:
− Must support exploiting parallelism at multiple levels and of multiple types (ILP, TLP, DLP)
 Microarchitecture model: single-chip multiprocessor, CMP (homogeneous or heterogeneous)
 Design and implementation: reconfigurable computation/memory/interconnect
 Programming language: easy to write programs in and to compile
− Preserving sequential semantics is important!
 Compiler: recognize the multiple types of parallelism in an application and parallelize automatically
 Operating system: run-time support environment and libraries
Which of these must change? Which are hard to change?
Understanding the current state of chip architecture
 Application complexity => diverse architectures, higher NRE cost, higher system complexity
− Application-specific chips: narrow performance targets, little design reuse
− Single-chip heterogeneity: larger area, poor resource utilization
 Architectural complexity => growing size, power and cost (design, verification, manufacturing)
− ILP wall, memory wall, latency wall, power wall: single-processor performance now grows only about 20% per year
− Architectural imbalance: compute, memory, communication and I/O speeds are increasingly mismatched
 Programming complexity => parallel programs are hard to develop; software efficiency and correctness are hard to guarantee
− Serial ISAs: no support for parallel computing
− A single programming model: lacks the expressiveness for complex parallel computations and struggles to support heterogeneity or systems with tens to hundreds of thousands of processors
 Complexity of system use and management => as systems scale up, reliability drops
− A petascale system may have more than 10,000 processors; a 10-petascale system more than 100,000
− The mean time between failures (MTBF) of the whole system drops noticeably
Judgments about where chip architecture is heading
 Today: centralized control; future: distributed control
− Centralized-control architectures: superscalar, VLIW, DSP, ...
− Distributed-control architectures: RAW, Smart Memories, TRIPS, ...
 Today: a single programming model; future: mixed programming models
− Serial programming model: supports only the exploitation of ILP
− Parallel programming models: support exploiting ILP, DLP and TLP, so that more independent work is discovered and run in parallel on many cores
 Today: physical machines; future: virtual machines
− Physical machine: a particular architecture bound to a particular OS and software development environment
− Virtual machine: system and process virtual machine techniques ease the difficulty of using the system
 Today: multicore; future: manycore
− Multicore (2007: 8 cores, 2009: 16 cores, 2011: 32 cores) vs. manycore (2013: 64 cores, 2015: 128 cores, 2021: 1K cores)
− The new Moore’s Law predicts core counts doubling every 18–24 months, but does delivered performance double with the core count? How can it be made to?
 Today: homogeneous and heterogeneous multicore; future: reconfigurable multicore
− The high cost of design and verification means a single future design must cover many application functions
− General-purpose multicore chips: homogeneous vs. heterogeneous vs. reconfigurable
General-purpose chips: homogeneous vs. heterogeneous vs. reconfigurable
 Homogeneous multicore: built from cores of the same type
− Homogeneous: superscalar multicore, VLIW multicore, ...
− No application-oriented high-performance cores: bad
− Easy load balancing, good generality: good
− Single core design reused, simple hardware: good
 Heterogeneous multicore: built from several different types of cores
− Heterogeneous multicore: general-purpose cores + special-purpose cores (vector, stream, DSP, network-processing cores)
− Application-oriented high-performance cores: good
− Hard to load balance, poor generality: bad
− Core designs not reused, complex hardware, poor resource utilization: bad
 Reconfigurable multicore: built from polymorphic cores that can be configured into homogeneous or heterogeneous arrangements
− Before customization: homogeneous, mode-less cores (a coarse-grained FPGA-like fabric), chip function undetermined
− After customization: homogeneous or heterogeneous cores with modes, tailored to the application
[Figure: example core arrays built from superscalar (SS) cores, network-processing (NP) and DSP cores, and polymorphic cores (P) configured into different modes (PI, PT, PD).]
Research question: can a single chip deliver a general-purpose, high-efficiency microprocessor that supports a broad range of applications?
 Application diversity: desktop, scientific and engineering computing, multimedia, networking, mobile computing, transaction processing, embedded, ...
 Architectural diversity: superscalar/VLIW, vector/stream, multithreaded, multicore/manycore, ...
 Diversity of programming models and languages: OpenMP, MPI, HPF, UPC, ...
 Operating-system diversity: Linux, Solaris, Unix, mixed, BSD-based, Mac OS, Windows, ...
 Functional diversity: desktop, server, mobile, embedded, ...
 High efficiency: performance, programmability, portability, reliability
Supporting diversity while achieving high efficiency is the main challenge!
Today: diversity (complex, inefficient) => Future: diversity (simple, efficient)
Key architectural question
Can single-chip multiprocessors (CMPs) be designed to be easier to use efficiently than today’s MPs?