Adaptive MPI: Intelligent runtime
strategies and performance prediction via
simulation
Laxmikant Kale
http://charm.cs.uiuc.edu
Parallel Programming Laboratory
Department of Computer Science
University of Illinois at Urbana-Champaign
10/3/2015
[email protected]
1
PPL Mission and Approach
• To enhance Performance and Productivity in
programming complex parallel applications
– Performance: scalable to thousands of processors
– Productivity: of human programmers
– Complex: irregular structure, dynamic variations
• Approach: application-oriented yet CS-centered research
– Develop enabling technology, for a wide collection of apps.
– Develop, use and test it in the context of real applications
• How?
  – Develop novel parallel programming techniques
  – Embody them in easy-to-use abstractions
  – So application scientists can use advanced techniques with ease
  – Enabling technology: reused across many apps
Develop abstractions in context of full-scale applications
Protein Folding
Quantum Chemistry
(QM/MM)
Molecular Dynamics
Computational Cosmology
Parallel Objects,
Adaptive Runtime System
Libraries and Tools
Crack Propagation
Rocket Simulation
Dendritic Growth
Space-time meshes
The enabling CS technology of parallel objects and intelligent
runtime systems has led to several collaborative applications in CSE
Migratable Objects (aka Processor Virtualization)
• Programmer: [over]decomposition into virtual processors
• Runtime: assigns VPs to processors
  – Enables adaptive runtime strategies
• Implementations: Charm++, AMPI

Benefits
• Software engineering
  – Number of virtual processors can be independently controlled
  – Separate VPs for different modules
• System implementation
  – Message-driven execution
    • Adaptive overlap of communication
    • Predictability: enables automatic out-of-core execution
    • Asynchronous reductions
  – Dynamic mapping
    • Heterogeneous clusters: vacate, adjust to speed, share
    • Automatic checkpointing
    • Change the set of processors used
    • Automatic dynamic load balancing
    • Communication optimization
[Figure: user view of virtual processors mapped onto real processors]
Outline
• Adaptive MPI
• Load balancing
• Fault tolerance
• Projections: performance analysis
• Performance prediction: BigSim
AMPI: MPI with Virtualization
• Each virtual process implemented as a user-level thread
embedded in a Charm object
[Figure: MPI "processes" implemented as virtual processes (user-level migratable threads), mapped onto real processors]
Making AMPI work
• Multiple user-level threads per processor:
– Problem: global variables
– Solution 1: automatic: switch the GOT pointer at context switch
  • Available on most machines
– Solution 2: manual: replace global variables
– Solution 3: automatic: via compiler support (AMPIzer)
• Migrating Stacks
– Use isomalloc technique (Mehaut et al)
– Use memory files and mmap()
• Heap data:
– Isomalloc heaps
– OR user supplied pack/unpack functions for the heap data
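Solution 2 (manually replacing globals) can be sketched as follows; the struct and function names are illustrative, not AMPI's actual code:

```c
#include <stdlib.h>

/* Illustrative sketch: instead of a file-scope global such as
   `int iteration;` that all user-level threads on a processor would
   share, every former global moves into a per-thread state struct,
   allocated once per virtual processor and passed explicitly. */
typedef struct {
    int iteration;
    double residual;
} ThreadGlobals;

ThreadGlobals *thread_globals_new(void) {
    return calloc(1, sizeof(ThreadGlobals));  /* zero-initialized state */
}

/* Each former use of a global becomes an access through the struct. */
void step(ThreadGlobals *g) {
    g->iteration++;        /* was: iteration++ */
    g->residual *= 0.5;    /* was: residual *= 0.5 */
}
```

With this transformation, two user-level threads on the same processor no longer interfere through shared globals, so they can be scheduled and migrated independently.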
ELF and global variables
• The Executable and Linking Format (ELF)
– Executable has a Global Offset Table (GOT) containing global data
– GOT pointer is stored in the %ebx register
– Switch this pointer when switching between threads
– Supported on Linux, Solaris 2.x, and more
• Integrated in Charm++/AMPI
  – Invoked via the compile-time option -swapglobals
Adaptive overlap and modules
SPMD and Message-Driven Modules
(From A. Gursoy, Simplified Expression of Message-Driven Programs and
Quantification of Their Impact on Performance, Ph.D. thesis, Apr 1994;
and "Modularity, Reuse, and Efficiency with Message-Driven Libraries," Proc. of the Seventh
SIAM Conference on Parallel Processing for Scientific Computing, San Francisco, 1995.)
Benefit of Adaptive Overlap
[Figure: execution time (s) vs. number of processors for AMPI(1) and AMPI(8), log-log scale]
Problem setup: 3D stencil calculation of size 240^3 run on Lemieux.
Shows AMPI with virtualization ratios of 1 and 8.
Comparison with Native MPI
• Performance
  – Slightly worse without optimizations
  – Being improved
• Flexibility
  – Works when only a small number of PEs is available
  – Meets special requirements of algorithms
[Figure: execution time (s) vs. number of processors for native MPI and AMPI]
Problem setup: 3D stencil calculation of size 240^3 run on Lemieux.
AMPI runs on any # of PEs (e.g. 19, 33, 105); native MPI needs a cube #.
AMPI Extensions
• Automatic load balancing
• Non-blocking collectives
• Checkpoint/restart mechanism
• Multi-module programming
Load Balancing in AMPI
• Automatic load balancing: MPI_Migrate()
– Collective call informing the load balancer that the thread is
ready to be migrated, if needed.
Load Balancing Steps
[Figure: load-balancing timeline: regular timesteps, instrumented timesteps, detailed/aggressive load balancing (object migration), then refinement load balancing]
[Figure: processor utilization against time on 128 and 1024 processors, showing aggressive and refinement load-balancing steps]
On 128 processors, a single load-balancing step suffices, but
on 1024 processors, we need a “refinement” step.
Shrink/Expand
• Problem: Availability of computing platform may change
• Fitting applications on the platform by object migration
[Figure: time per step for the million-row CG solver on a 16-node cluster; an additional 16 nodes become available at step 600]
Optimized All-to-all “Surprise”
[Figure: completion time vs. computation overhead for a 76-byte all-to-all on Lemieux across processor counts, comparing Mesh, Hypercube, and 3D Grid strategies with MPI]
The CPU is free during most of the time taken by a collective operation.
[Figure: AAPC completion time (ms) and radix-sort completion time (s) vs. message size (100 B to 8 KB), comparing Mesh and Direct strategies on 16 processors]
This observation led to the development of asynchronous collectives, now supported in AMPI.
Asynchronous Collectives
• Our implementation is asynchronous
  – Collective operation posted
  – Test/wait for its completion
  – Meanwhile, useful computation can utilize the CPU

MPI_Request req;
MPI_Ialltoall( … , &req);
/* other computation */
MPI_Wait(&req, MPI_STATUS_IGNORE);
Fault Tolerance
Motivation
• Applications need fast, low cost and scalable fault
tolerance support
• As machines grow in size
– MTBF decreases
– Applications have to tolerate faults
• Our research
  – Disk-based checkpoint/restart
  – In-memory double checkpointing/restart
  – Sender-based message logging
  – Proactive response to fault prediction (impending-fault response)
Checkpoint/Restart Mechanism
• Automatic Checkpointing for AMPI and Charm++
– Migrate objects to disk!
– Automatic fault detection and restart
– Now available in distribution version of AMPI and Charm++
• Blocking Co-ordinated Checkpoint
– States of chares are checkpointed to disk
– Collective call MPI_Checkpoint(DIRNAME)
• The entire job is restarted
– Virtualization allows restarting on different # of processors
– Runtime option
• > ./charmrun pgm +p4 +vp16 +restart DIRNAME
• Simple but effective for common cases
In-memory Double Checkpoint
• In-memory checkpoint
  – Faster than disk
• Coordinated checkpoint
  – Simple collective call: MPI_MemCheckpoint(void)
• Double checkpointing
  – Each object maintains 2 checkpoints:
    • on the local physical processor
    • on a remote “buddy” processor
• For jobs with large memory
  – Use local disks!
• User can decide what makes up useful state
[Figure: checkpoint overhead (s) vs. problem size (MB) on 32 processors with 1.5 GB memory each, comparing double in-memory (Myrinet), double in-disk (Myrinet), and NFS disk]
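The buddy pairing can be illustrated with simple modular arithmetic; the formula below is an assumption for illustration, not necessarily the mapping Charm++ actually uses:

```c
/* Illustrative buddy assignment for double checkpointing: each PE keeps
   its second checkpoint copy on a partner PE half the machine away, so a
   single node failure never destroys both copies. (This exact formula is
   an assumption, not Charm++'s actual scheme.) */
int buddy_of(int pe, int npes) {
    return (pe + npes / 2) % npes;
}
```

For an even processor count this pairing is symmetric: a PE is its own buddy's buddy, which simplifies recovery bookkeeping.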
Restart
• A “Dummy” process is created:
– Need not have application data or checkpoint
– Necessary for runtime
– Starts recovery on all other Processors
• Other processors:
– Remove all chares
– Restore checkpoints lost on the crashed PE
– Restore chares from local checkpoints
• Load balance after restart
10/3/2015
[email protected]
23
Restart Performance
[Figure: restart performance with 10 crashes on 128 processors, checkpointing every 10 timesteps]
Scalable Fault Tolerance
Motivation: When a processor out of 100,000 fails, all 99,999
shouldn’t have to run back to their checkpoints!
• How?
  – Sender-side message logging
  – Asynchronous checkpoints on buddy processors
  – Latency tolerance mitigates costs
  – Restart can be sped up by spreading out objects from the failed processor
• Only the failed processor’s objects recover from checkpoints, playing back their messages, while others “continue”
• Current progress
  – Basic scheme implemented and tested in simple programs
  – General-purpose implementation in progress
[Figure: execution time (s) with increasing number of faults]
Recovery Performance
[Figure: execution time (s) vs. number of faults]
Execution time with increasing number of faults on 8 processors
(checkpoint period 30 s)
Projections:
Performance visualization
and analysis tool
An Introduction to Projections
• Performance Analysis tool
for Charm++ based
applications.
• Automatic trace
instrumentation.
• Post-mortem visualization.
• Multiple advanced features
that support:
– Data volume control
– Generation of additional user data.
Trace Generation
• Automatic instrumentation by runtime system
• Detailed
– In the log mode each event is recorded in full detail (including
timestamp) in an internal buffer.
• Summary
– Reduces the size of output files and memory overhead.
– Produces a few lines of output data per processor.
– This data is recorded in bins corresponding to intervals of size
1 ms by default.
• Flexible
– APIs and runtime options for instrumenting user events and
data generation control.
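The binning done in summary mode amounts to mapping each event's timestamp to a fixed-size interval; a minimal sketch (the names and the clamping behavior are assumptions, not Projections' actual code):

```c
#define BIN_US 1000  /* 1 ms bins, the default interval size */

/* Map an event timestamp (in microseconds) to its summary bin,
   clamping overflow into the last bin. Illustrative sketch only. */
int bin_index(double t_us, int nbins) {
    int idx = (int)(t_us / BIN_US);
    return idx < nbins ? idx : nbins - 1;
}
```

Recording one accumulated value per bin instead of one record per event is what keeps summary-mode output to a few lines per processor.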
The Summary View
• Provides a view of the overall utilization of the application.
• Very quick to load.
Graph View
• Features:
  – Selectively view entry points.
  – Convenient means to switch between axis data types.
Timeline
• The most detailed view in Projections.
• Useful for understanding critical path issues or unusual
entry point behaviors at specific times.
Animations
Time Profile
• Identified a portion of CPAIMD (a quantum chemistry code) that ran too early, via the Time Profile tool.
• Solved by prioritizing entry methods.
Overview: one line for each processor, time on X-axis
White: busy, Black: idle, Red: intermediate
A boring but good-performance overview
An interesting but pathetic overview
Stretch Removal
Histogram views: number of function executions vs. their granularity
(note: log scale on Y-axis)
Before optimizations: over 16 large stretched calls.
After optimizations: about 5 large stretched calls, the largest of them much smaller, and
almost all calls take less than 3.243 ms.
Miscellaneous Features: Color Selection
• Colors are automatically supplied by default.
• We allow users to select their own colors and save them.
• These colors can then be restored the next time Projections loads.
User APIs
• Controlling trace generation
– void traceBegin()
– void traceEnd()
• Tracing user events
  – int traceRegisterUserEvent(char *, int)
  – void traceUserEvent(char *)
  – void traceUserBracketEvent(int, double, double)
  – double CmiWallTimer()
• Runtime options:
– +traceoff
– +traceroot <directory>
– Projections mode only:
• +logsize <# entries>
• +gz-trace
– Summary mode only:
• +bincount <# of intervals>
• +binsize <interval time quanta (us)>
Performance Prediction
Via
Parallel Simulation
BigSim: Performance Prediction
• Extremely large parallel machines are already here or about to be available:
  – ASCI Purple (12k processors, 100 TF)
  – Bluegene/L (64k processors, 360 TF)
  – Bluegene/C (1M processors, 1 PF)
• How to write a petascale application?
• What will the performance be like?
• Would existing parallel applications scale?
– Machines are not there
– Parallel Performance is hard to model without actually running
the program
Objectives and Simulation Model
• Objectives:
– Develop techniques to facilitate the development of
efficient peta-scale applications
– Based on performance prediction of applications on
large simulated parallel machines
• Simulation-based performance prediction:
  – Focus on Charm++ and AMPI programming models
  – Performance prediction based on PDES (parallel discrete event simulation)
  – Supports varying levels of fidelity
    • processor prediction, network prediction
  – Modes of execution: online and post-mortem
Blue Gene Emulator/Simulator
• Actually: BigSim, for simulation of any large
machine using smaller parallel machines
• Emulator:
– Allows development of programming environment and
algorithms before the machine is built
– Allowed us to port Charm++ to real BG/L in 1-2 days
• Simulator:
– Allows performance analysis of programs running on large
machines, without access to the large machines
– Uses Parallel Discrete Event Simulation
Architecture of BigNetSim
Simulation Details
• Emulate large parallel machines on smaller existing
parallel machines – run a program with multi-million
way parallelism (implemented using user-threads).
– Consider memory and stack-size limitations
• Ensure time-stamp correction
• Emulator layer API is built on top of machine layer
• Charm++/AMPI implemented on top of emulator, like
any other machine layer
• Emulator layer supports all Charm++ features:
  – Load balancing
  – Communication optimizations
Performance Prediction
• Usefulness of performance prediction:
– Application developer (making small modifications)
• Difficult to get runtimes on huge current machines
• For future machines, simulation is the only possibility
• Performance debugging cycle can be considerably reduced
• Even approximate predictions can identify performance
issues such as load imbalance, serial bottlenecks,
communication bottlenecks, etc
– Machine architecture designer
• Knowledge of how target applications behave on it, can help
identify problems with machine design early
• Record traces during parallel emulation
• Run trace-driven simulation (PDES)
Performance Prediction (contd.)
• Predicting time of sequential code:
– User supplied time for every code block
– Wall-clock measurements on simulating machine can be used via a
suitable multiplier
– Hardware performance counters to count floating point, integer, branch
instructions, etc
• Cache performance and memory footprint are approximated by
percentage of memory accesses and cache hit/miss ratio
– Instruction level simulation (not implemented)
• Predicting network performance:
  – No contention: time based on topology and other network parameters
  – Back-patching: modifies communication time using the amount of communication activity
  – Network simulation: modeling the network entirely
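The wall-clock-measurement-with-multiplier approach to predicting sequential block time amounts to scaling times observed on the simulating machine; a minimal sketch (the function name is illustrative, and the multiplier is a user-supplied ratio between the two machines' speeds):

```c
/* Predict how long a sequential code block would take on the target
   machine by scaling a wall-clock measurement taken on the simulating
   machine. Sketch only; in practice the multiplier is supplied per code
   block or derived from hardware performance counters. */
double predict_block_time(double measured_s, double target_multiplier) {
    return measured_s * target_multiplier;
}
```

For example, a block measured at 2.0 s on the host with a multiplier of 0.4 (target assumed 2.5x faster) would be charged 0.8 s of simulated time.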
Performance Prediction Validation
• 7-point stencil program with 3D decomposition
  – Run on 32 real processors, simulating 64, 128, ... PEs
• NAMD benchmark: Apo-Lipoprotein A1 dataset with 92k atoms, running for 15 timesteps
  – For large processor counts, cache and memory effects make the predicted value diverge from the actual value
Performance on Large Machines
• Problem:
– How to predict performance of
applications on future machines?
(E.g. BG/L)
– How to do performance tuning
without continuous access to large
machine?
• Solution:
  – Leverage virtualization
  – Develop machine emulator
  – Simulator: accurate time modeling
  – Run program on “100,000 processors” using only hundreds of processors
• Analysis:
– Use performance viz. suite
(projections)
[Figure: molecular dynamics benchmark ER-GRE: 36,573 atoms, 1.6 million objects, 8-step simulation, 16k BG processors]
Projections: Performance visualization
Network Simulation
• Detailed implementation of interconnection networks
• Configurable network parameters
  – Topology / routing
  – Input / output VC selection
  – Bandwidth / latency
  – NIC parameters
  – Buffer / message size, etc.
• Support for hardware
collectives in network layer
Higher level programming
• Orchestration language
– Allows expressing global control flow in a Charm++ program
– HPF like flavor, but Charm++-like processor virtualization, and
explicit communication
• Multiphase Shared Arrays
– Provides a disciplined use of shared address space
– Each array can be accessed only in one of the following modes:
• ReadOnly, Write-by-One-Thread, Accumulate-only
– Access mode can change from phase to phase
– Phases delineated by per-array “sync”
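The access-mode discipline can be sketched conceptually as follows; this is not the real MSA API, just an illustration of mode checking with phases delineated by sync:

```c
#include <assert.h>

/* Conceptual sketch of Multiphase Shared Array access modes (not the
   real MSA API): an array is in exactly one mode per phase, and a sync
   marks the phase boundary where the mode may change. */
typedef enum { READ_ONLY, WRITE_BY_ONE, ACCUMULATE } Mode;

typedef struct {
    double data[4];
    Mode mode;
} MSArray;

/* Phase boundary: switch the array to the next phase's access mode. */
void msa_sync(MSArray *a, Mode next) { a->mode = next; }

void msa_accumulate(MSArray *a, int i, double v) {
    assert(a->mode == ACCUMULATE);  /* legal only in accumulate-only phases */
    a->data[i] += v;
}

double msa_read(const MSArray *a, int i) {
    assert(a->mode == READ_ONLY);   /* legal only in read-only phases */
    return a->data[i];
}
```

Restricting each phase to one mode is what makes the shared address space disciplined: accumulate-only phases commute, so no data races are possible within a phase.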
Other projects
• Faucets
  – Flexible cluster scheduler
  – Resource management across clusters
  – Multi-cluster applications
• ParFUM: parallel framework for unstructured meshes
• Load balancing strategies
• Communication optimization
• POSE: parallel discrete event simulation
• Invite collaborations:
  – Virtualization of other languages and libraries
  – New load balancing strategies
  – Applications
Some Active Collaborations
• Biophysics: molecular dynamics (NIH, ...)
  – Long-standing (1991-), Klaus Schulten, Bob Skeel
  – Gordon Bell award in 2002
  – Production program used by biophysicists
• Quantum chemistry (NSF)
  – QM/MM via the Car-Parrinello method
  – Roberto Car, Mike Klein, Glenn Martyna, Mark Tuckerman
  – Nick Nystrom, Josep Torrelas, Laxmikant Kale
• Material simulation (NSF)
  – Dendritic growth, quenching, space-time meshes, QM/FEM
  – R. Haber, D. Johnson, J. Dantzig, +
• Rocket simulation (DOE)
  – DOE-funded ASCI center
  – Mike Heath, +30 faculty
• Computational cosmology (NSF, NASA)
  – Simulation
  – Scalable visualization
• Others
  – Simulation of plasma
  – Electromagnetics
Summary
• We are pursuing a broad agenda aimed at productivity and performance in parallel programming
  – Intelligent runtime system for adaptive strategies
  – Charm++/AMPI are production-level systems
  – Support dynamic load balancing and communication optimizations
  – Performance prediction capabilities based on simulation
  – Basic fault tolerance and performance viz tools are part of the suite
  – Application-oriented yet computer-science-centered research
• Workshop on Charm++ and Applications: Oct 18-20, UIUC
• http://charm.cs.uiuc.edu