CS267/E233
Applications of Parallel
Computers
Lecture 1: Introduction
James Demmel
[email protected]
www.cs.berkeley.edu/~demmel/cs267_Spr06
01/16/2006
CS267-Lecture 1
1
Outline
• Introduction
• Large important problems require powerful computers
Even computer games
• Why powerful computers must be parallel processors
Including your laptop
• Why writing (fast) parallel programs is hard
• Principles of parallel computing performance
• Structure of the course
01/16/2006
CS267-Lecture 1
2
Why we need
powerful computers
01/16/2006
CS267-Lecture 1
3
Units of Measure in HPC
• High Performance Computing (HPC) units are:
- Flop: floating point operation
- Flops/s: floating point operations per second
- Bytes: size of data (a double precision floating point number is 8)
• Typical sizes are millions, billions, trillions…
Mega
Mflop/s = 106 flop/sec
Mbyte = 220 = 1048576 ~ 106 bytes
Giga
Tera
Peta
Exa
Zetta
Gflop/s = 109 flop/sec
Tflop/s = 1012 flop/sec
Pflop/s = 1015 flop/sec
Eflop/s = 1018 flop/sec
Zflop/s = 1021 flop/sec
Gbyte = 230 ~ 109 bytes
Tbyte = 240 ~ 1012 bytes
Pbyte = 250 ~ 1015 bytes
Ebyte = 260 ~ 1018 bytes
Zbyte = 270 ~ 1021 bytes
Yotta
Yflop/s = 1024 flop/sec
Ybyte = 280 ~ 1024 bytes
• See www.top500.org for current list of fastest machines
01/16/2006
CS267-Lecture 1
4
Simulation: The Third Pillar of Science
•
Traditional scientific and engineering paradigm:
1) Do theory or paper design.
2) Perform experiments or build system.
•
Limitations:
-
•
Too difficult -- build large wind tunnels.
Too expensive -- build a throw-away passenger jet.
Too slow -- wait for climate or galactic evolution.
Too dangerous -- weapons, drug design, climate
experimentation.
Computational science paradigm:
3) Use high performance computer systems to simulate the
phenomenon
- Base on known physical laws and efficient numerical
methods.
01/16/2006
CS267-Lecture 1
5
Some Particularly Challenging Computations
• Science
-
Global climate modeling
Biology: genomics; protein folding; drug design
Astrophysical modeling
Computational Chemistry
Computational Material Sciences and Nanosciences
• Engineering
-
Semiconductor design
Earthquake and structural modeling
Computation fluid dynamics (airplane design)
Combustion (engine design)
Crash simulation
• Business
- Financial and economic modeling
- Transaction processing, web services and search engines
• Defense
- Nuclear weapons -- test by simulations
- Cryptography
01/16/2006
CS267-Lecture 1
6
Economic Impact of HPC
• Airlines:
- System-wide logistics optimization systems on parallel systems.
- Savings: approx. $100 million per airline per year.
• Automotive design:
- Major automotive companies use large systems (500+ CPUs) for:
- CAD-CAM, crash testing, structural integrity and
aerodynamics.
- One company has 500+ CPU parallel system.
- Savings: approx. $1 billion per company per year.
• Semiconductor industry:
- Semiconductor firms use large systems (500+ CPUs) for
- device electronics simulation and logic validation
- Savings: approx. $1 billion per company per year.
• Securities industry:
- Savings: approx. $15 billion per year for U.S. home mortgages.
01/16/2006
CS267-Lecture 1
7
$5B World Market in Technical Computing
1998 1999 2000 2001 2002 2003
100%
Other
Technical Management and
Support
Simulation
90%
Scientific Research and R&D
80%
Mechanical
Design/Engineering Analysis
70%
Mechanical Design and
Drafting
60%
Imaging
50%
Geoscience and Geoengineering
40%
Electrical Design/Engineering
Analysis
Economics/Financial
30%
Digital Content Creation and
Distribution
20%
Classified Defense
10%
Chemical Engineering
Biosciences
0%
Source: IDC 2004, from NRC Future of Supercomputing Report
01/16/2006
CS267-Lecture 1
8
Global Climate Modeling
• Problem
Problem is to compute:
f(latitude, longitude, elevation, time) 
temperature, pressure, humidity, wind velocity
• Approach:
- Discretize the domain, e.g., a measurement point every 10 km
- Devise an algorithm to predict weather at time t+dt given t
• Uses:
- Predict major events,
e.g., El Nino
- Use in setting air
emissions standards
Source: http://www.epm.ornl.gov/chammp/chammp.html
01/16/2006
CS267-Lecture 1
9
Global Climate Modeling Computation
• One piece is modeling the fluid flow in the atmosphere
- Solve Navier-Stokes equations
- Roughly 100 Flops per grid point with 1 minute timestep
• Computational requirements:
- To match real-time, need 5 x 1011 flops in 60 seconds = 8 Gflop/s
- Weather prediction (7 days in 24 hours)  56 Gflop/s
- Climate prediction (50 years in 30 days)  4.8 Tflop/s
- To use in policy negotiations (50 years in 12 hours)  288 Tflop/s
• To double the grid resolution, computation is 8x to 16x
• State of the art models require integration of
atmosphere, ocean, sea-ice, land models, plus possibly
carbon cycle, geochemistry and more
• Current models are coarser than this
01/16/2006
CS267-Lecture 1
10
High Resolution
Climate Modeling on
NERSC-3 – P. Duffy,
et al., LLNL
A 1000 Year Climate Simulation
• Demonstration of the
Community Climate Model
(CCSM2)
• A 1000-year simulation
shows long-term, stable
representation of the
earth’s climate.
• 760,000 processor hours
used
• Temperature change
shown
• Warren Washington and Jerry Meehl, National Center for
Atmospheric Research; Bert Semtner, Naval Postgraduate
School; John Weatherly, U.S. Army Cold Regions Research and
Engineering Lab Laboratory et al.
• http://www.nersc.gov/news/science/bigsplash2002.pdf
01/16/2006
CS267-Lecture 1
12
Climate Modeling on the Earth Simulator System
 Development of ES started in 1997 in order to make a
comprehensive understanding of global environmental
changes such as global warming.
 Its construction was completed at the end of February,
2002 and the practical operation started from March 1,
2002
 35.86Tflops (87.5% of the peak performance) is achieved in the
Linpack benchmark (world’s fastest machine from 2002-2004).
 26.58Tflops was obtained by a global atmospheric circulation
code.
01/16/2006
CS267-Lecture 1
13
Astrophysics: Binary Black Hole Dynamics
• Massive supernova cores collapse to black holes.
• At black hole center spacetime breaks down.
• Critical test of theories of gravity –
General Relativity to Quantum Gravity.
• Indirect observation – most galaxies
have a black hole at their center.
• Gravity waves show black hole directly
including detailed parameters.
• Binary black holes most powerful
sources of gravity waves.
• Simulation extraordinarily complex –
evolution disrupts the spacetime !
01/16/2006
CS267-Lecture 1
14
01/16/2006
CS267-Lecture 1
15
Heart
Simulation
• Problem is to compute blood flow in the heart
• Approach:
- Modeled as an elastic structure in an incompressible fluid.
- The “immersed boundary method” due to Peskin and McQueen.
- 20 years of development in model
- Many applications other than the heart: blood clotting, inner ear,
paper making, embryo growth, and others
- Use a regularly spaced mesh (set of points) for evaluating the fluid
• Uses
- Current model can be used to design artificial heart valves
- Can help in understand effects of disease (leaky valves)
- Related projects look at the behavior of the heart during a heart
attack
- Ultimately: real-time clinical work
01/16/2006
CS267-Lecture 1
16
Heart Simulation Calculation
The involves solving Navier-Stokes equations
- 64^3 was possible on Cray YMP, but 128^3 required for accurate
model (would have taken 3 years).
- Done on a Cray C90 -- 100x faster and 100x more memory
- Until recently, limited to vector machines
- Needs more features:
- Electrical model of the
heart, and details of
muscles, E.g.,
- Chris Johnson
- Andrew McCulloch
- Lungs, circulatory
systems
01/16/2006
CS267-Lecture 1
17
Heart Simulation
Animation of lower portion of the heart
01/16/2006
CS267-Lecture 1
Source: www.psc.org
18
Parallel Computing in Data Analysis
• Finding information amidst large quantities of data
• General themes of sifting through large, unstructured data
sets:
- Has there been an outbreak of some medical condition in a
community?
- Which doctors are most likely involved in fraudulent
charging to medicare?
- When should white socks go on sale?
- What advertisements should be sent to you?
• Data collected and stored at enormous speeds (Gbyte/hour)
- remote sensor on a satellite
- telescope scanning the skies
- microarrays generating gene expression data
- scientific simulations generating terabytes of data
- NSA analysis of telecommunications
01/16/2006
CS267-Lecture 1
19
Why powerful
computers are
parallel
01/16/2006
CS267-Lecture 1
20
Tunnel Vision by Experts
• “I think there is a world market for maybe five
computers.”
- Thomas Watson, chairman of IBM, 1943.
• “There is no reason for any individual to have
a computer in their home”
- Ken Olson, president and founder of Digital Equipment
Corporation, 1977.
• “640K [of memory] ought to be enough for
anybody.”
- Bill Gates, chairman of Microsoft,1981.
Slide source: Warfield et al.
01/16/2006
CS267-Lecture 1
21
Technology Trends: Microprocessor Capacity
Moore’s Law
2X transistors/Chip Every 1.5 years
Called “Moore’s Law”
Microprocessors have
become smaller, denser,
and more powerful.
01/16/2006
Gordon Moore (co-founder of
Intel) predicted in 1965 that the
transistor density of
semiconductor chips would
double roughly every 18
months.
Slide source: Jack Dongarra
CS267-Lecture 1
22
Impact of Device Shrinkage
• What happens when the feature size (transistor size) shrinks
by a factor of x ?
• Clock rate goes up by x because wires are shorter
- actually less than x, because of power consumption
• Transistors per unit area goes up by x2
• Die size also tends to increase
- typically another factor of ~x
• Raw computing power of the chip goes up by ~ x4 !
- of which x3 is devoted either to parallelism or locality
01/16/2006
CS267-Lecture 1
23
Microprocessor Transistors per Chip
• Growth in transistors per chip
• Increase in clock rate
100,000,000
1000
10,000,000
1,000,000
i80386
i80286
100,000
R3000
R2000
100
Clock Rate (MHz)
Transistors
R10000
Pentium
10
1
i8086
10,000
i8080
i4004
1,000
1970 1975 1980 1985 1990 1995 2000 2005
Year
01/16/2006
CS267-Lecture 1
0.1
1970
1980
1990
2000
Year
24
But there are limiting forces: Increased
cost and difficulty of manufacturing
•
Moore’s 2nd law
(Rock’s law)
Demo of
0.06
micron
CMOS
01/16/2006
CS267-Lecture 1
25
More Limits: How fast can a serial computer be?
1 Tflop/s, 1
Tbyte sequential
machine
r = 0.3
mm
• Consider the 1 Tflop/s sequential machine:
- Data must travel some distance, r, to get from memory
to CPU.
- To get 1 data element per cycle, this means 1012 times
per second at the speed of light, c = 3x108 m/s. Thus
r < c/1012 = 0.3 mm.
• Now put 1 Tbyte of storage in a 0.3 mm x 0.3 mm area:
- Each bit occupies about 1 square Angstrom, or the
size of a small atom.
• No choice but parallelism
01/16/2006
CS267-Lecture 1
26
Performance on Linpack Benchmark
www.top500.org
01/16/2006
CS267-Lecture 1
27
01/16/2006
CS267-Lecture 1
28
Why writing (fast) parallel
programs is hard
01/16/2006
CS267-Lecture 1
29
Principles of Parallel Computing
• Finding enough parallelism (Amdahl’s Law)
• Granularity
• Locality
• Load balance
• Coordination and synchronization
• Performance modeling
All of these things makes parallel programming
even harder than sequential programming.
01/16/2006
CS267-Lecture 1
30
“Automatic” Parallelism in Modern Machines
• Bit level parallelism
- within floating point operations, etc.
• Instruction level parallelism (ILP)
- multiple instructions execute per clock cycle
• Memory system parallelism
- overlap of memory operations with computation
• OS parallelism
- multiple jobs run in parallel on commodity SMPs
Limits to all of these -- for very high performance, need user
to identify, schedule and coordinate parallel tasks
01/16/2006
CS267-Lecture 1
31
Finding Enough Parallelism
• Suppose only part of an application seems parallel
• Amdahl’s law
- let s be the fraction of work done sequentially, so
(1-s) is fraction parallelizable
- P = number of processors
Speedup(P) = Time(1)/Time(P)
<= 1/(s + (1-s)/P)
<= 1/s
• Even if the parallel part speeds up perfectly
performance is limited by the sequential part
01/16/2006
CS267-Lecture 1
32
Overhead of Parallelism
• Given enough parallel work, this is the biggest barrier to
getting desired speedup
• Parallelism overheads include:
- cost of starting a thread or process
- cost of communicating shared data
- cost of synchronizing
- extra (redundant) computation
• Each of these can be in the range of milliseconds
(=millions of flops) on some systems
• Tradeoff: Algorithm needs sufficiently large units of work
to run fast in parallel (I.e. large granularity), but not so
large that there is not enough parallel work
01/16/2006
CS267-Lecture 1
33
Locality and Parallelism
Conventional
Storage
Proc
Hierarchy
Cache
L2 Cache
Proc
Cache
L2 Cache
Proc
Cache
L2 Cache
L3 Cache
L3 Cache
Memory
Memory
Memory
• Large memories are slow, fast memories are small
• Storage hierarchies are large and fast on average
• Parallel processors, collectively, have large, fast cache
- the slow accesses to “remote” data we call “communication”
• Algorithm should do most work on local data
01/16/2006
CS267-Lecture 1
potential
interconnects
L3 Cache
Processor-DRAM Gap (latency)
CPU
“Moore’s Law”
Processor-Memory
Performance Gap:
(grows 50% / year)
DRAM
DRAM
7%/yr.
100
10
1
µProc
60%/yr.
1980
1981
1982
1983
1984
1985
1986
1987
1988
1989
1990
1991
1992
1993
1994
1995
1996
1997
1998
1999
2000
Performance
1000
Time
01/16/2006
CS267-Lecture 1
35
Load Imbalance
• Load imbalance is the time that some processors in the
system are idle due to
- insufficient parallelism (during that phase)
- unequal size tasks
• Examples of the latter
- adapting to “interesting parts of a domain”
- tree-structured computations
- fundamentally unstructured problems
• Algorithm needs to balance load
01/16/2006
CS267-Lecture 1
36
Measuring
Performance
01/16/2006
CS267-Lecture 1
37
Improving Real Performance
Peak Performance grows exponentially,
a la Moore’s Law

In 1990’s, peak performance increased 100x;
in 2000’s, it will increase 1000x
1,000
But efficiency (the performance relative to
the hardware peak) has declined

was 40-50% on the vector supercomputers
of 1990s
now as little as 5-10% on parallel
supercomputers of today
Close the gap through ...


Mathematical methods and algorithms that
achieve high performance on a single
processor and scale to thousands of
processors
More efficient programming models and tools
for massively parallel supercomputers
01/16/2006
CS267-Lecture 1
100
Teraflops

Peak Performance
Performance
Gap
10
1
Real Performance
0.1
1996
2000
2004
38
Performance Levels
• Peak advertised performance (PAP)
- You can’t possibly compute faster than this speed
• LINPACK
- The “hello world” program for parallel computing
- Solve Ax=b using Gaussian Elimination, highly tuned
• Gordon Bell Prize winning applications performance
- The right application/algorithm/platform combination plus years of work
• Average sustained applications performance
- What one reasonable can expect for standard applications
When reporting performance results, these levels are
often confused, even in reviewed publications
01/16/2006
CS267-Lecture 1
39
Performance on Linpack Benchmark
www.top500.org
01/16/2006
CS267-Lecture 1
40
Performance Levels (for example on NERSC3)
• Peak advertised performance (PAP): 5 Tflop/s
• LINPACK (TPP): 3.05 Tflop/s
• Gordon Bell Prize winning applications performance :
2.46 Tflop/s
- Material Science application at SC01
• Average sustained applications performance: ~0.4
Tflop/s
- Less than 10% peak!
01/16/2006
CS267-Lecture 1
41
Course
Organization
01/16/2006
CS267-Lecture 1
42
Who is in the class?
• This class is listed as both a CS and Engineering class
• Normally a mix of CS, EE, and other engineering and
science students
• This class seems to be about:
- 21 grads + 6 undergrads
- 40% CS
- 30% EE
- 30% Other (BioPhys, BioStat, Civil, Mechanical, Nuclear)
• For final projects we encourage interdisciplinary teams
- This is the way parallel scientific software is generally built
01/16/2006
CS267-Lecture 1
43
First Assignment
• Home page will have details.
- Fill out class survey, applications for computer accounts
• Find an application of parallel computing and build a web page
describing it.
- Choose something from your research area.
- Or from the web or elsewhere.
• Create a web page describing the application.
- Describe the application and provide a reference (or link)
- Describe the platform where this application was run
- Find peak and LINPACK performance for the platform and its rank on the
TOP500 list
- Find performance of your selected application
- What ratio of sustained to peak performance is reported?
- Evaluate project: How did the application scale, I.e. was speed roughly
proportional to the number of processors? What were the major difficulties in
obtaining good performance? What tools and algorithms were used?
• Send us (Jim and Rajesh) the link (we will publish a list online)
• Due next week, Thursday (1/25)
01/16/2006
CS267-Lecture 1
44
Rough Schedule of Topics
• Introduction
• Parallel Programming Models and Machines
- Shared Memory and Multithreading
- Distributed Memory and Message Passing
- Data parallelism
• Sources of Parallelism in Simulation
• Tools
-
Languages (UPC)
Performance Tools
Visualization
Environments
• Algorithms
-
Dense Linear Algebra
Partial Differential Equations (PDEs)
Particle methods
Load balancing, synchronization techniques
Sparse matrices
• Applications: biology, climate, combustion, astrophysics, …
• Project Reports
01/16/2006
CS267-Lecture 1
45
Reading Materials
• Some on-line texts:
- Demmel’s notes from CS267 Spring 1999, which are similar to 2000
and 2001. However, they contain links to html notes from 1996.
- http://www.cs.berkeley.edu/~demmel/cs267_Spr99/
- Simon’s notes from Fall 2002
- http://www.nersc.gov/~simon/cs267/
- Ian Foster’s book, “Designing and Building Parallel Programming”.
- http://www-unix.mcs.anl.gov/dbpp/
• Potentially useful texts:
- “Sourcebook for Parallel Computing”, by Dongarra, Foster, Fox, ..
- A general overview of parallel computing methods
- “Performance Optimization of Numerically Intensive Codes” by Stefan
Goedecker and Adolfy Hoisie
- This is a practical guide to optimization, mostly for those of you
who have never done any optimization
• Reports on Supercomputing (see web page)
01/16/2006
CS267-Lecture 1
46
Requirements
• Fill out on-line account request for Millennium machine.
- See course web page for pointer
• Fill out class survey
- Handout in class, or see course web page for pointer
• Build a web page
- Every week or two students will report explorations, ideas, proposed
work, and work to the TA via an organized webpage
- There will be 3-4 programming assignments geared towards hands-on
experience, interdisciplinary teams.
- There will be a Final Project
- Teams of 2-3, interdisciplinary is best.
- Interesting applications or advance of systems.
- Presentation (poster session)
- Conference quality paper
01/16/2006
CS267-Lecture 1
47
What you should get out of the course
In depth understanding of:
• When is parallel computing useful?
• Understanding of parallel computing hardware options.
• Overview of programming models (software) and tools.
• Some important parallel applications and the algorithms
• Performance analysis and tuning
01/16/2006
CS267-Lecture 1
48
Administrative
• Information
Instructors:
- Jim Demmel, 737 Soda, [email protected]
- TA:Rajesh Nishtala, 575 Soda, [email protected]
• Accounts – fill out forms
- Submit online form for Millennium account
- See web page for pointer:
- NERSC account forms will be available later from Rajesh
• Lecture notes are based on previous semester notes:
- Jim Demmel, David Culler, David Bailey, Bob Lucas, Kathy Yelick
and Horst Simon
• Most class material and lecture notes are at:
- http://www.cs.berkeley.edu/~demmel/cs267_Spr06
01/16/2006
CS267-Lecture 1
49
Extra slides
01/16/2006
CS267-Lecture 1
50
Transaction Processing
(mar. 15, 1996)
25000
other
Throughput (tpmC)
20000
Tandem Himalaya
IBM PowerPC
15000
DEC Alpha
SGI PowerChallenge
HP PA
10000
5000
0
0
20
40
60
80
100
120
Processors
• Parallelism is natural in relational operators: select, join, etc.
• Many difficult issues: data partitioning, locking, threading.
01/16/2006
CS267-Lecture 1
51
SIA Projections for Microprocessors
1000
100
Feature Size
(microns)
10
Transistors per
chip x 106
1
0.1
2010
2007
2004
2001
1998
0.01
1995
Feature Size
(microns) & Million
Transistors per chip
Compute power ~1/(Feature Size)3
Year of Introduction
based on F.S.Preston, 1997
01/16/2006
CS267-Lecture 1
52
Much of the Performance is from Parallelism
Thread-Level
Parallelism?
Instruction-Level
Parallelism
Bit-Level
Parallelism
Name
01/16/2006
CS267-Lecture 1
53
Performance on Linpack Benchmark
www.top500.org
100000
Earth Simulator
10000
ASCI White
ASCI Red
1000
Rmax
max Rmax
mean Rmax
min Rmax
100
System
10
n
04
03
Ju
ec
D
n
03
02
Ju
ec
D
n
02
01
Ju
ec
D
n
01
00
Ju
ec
D
n
00
99
Ju
ec
D
n
99
98
Ju
ec
D
n
98
97
Ju
ec
D
n
97
96
Ju
ec
D
n
96
95
Ju
ec
D
n
95
94
Ju
ec
94
D
n
Ju
ec
D
Ju
n
93
93
1
0.1
Nov 2004: IBM Blue Gene L, 70.7 Tflops Rmax
01/16/2006
CS267-Lecture 1
54
Descargar

CS267: Introduction