Runtime Optimizations
via Processor Virtualization
Laxmikant Kale
Parallel Programming Laboratory
Dept. of Computer Science
University of Illinois at Urbana Champaign
http://charm.cs.uiuc.edu
Outline
• Where we stand
  – Predictable behavior very important for performance
• What is virtualization
  – Charm++, AMPI
• Principle of persistence
• Consequences:
  – Software engineering:
    • Natural unit of parallelism
    • Data vs. code abstraction
    • Cohesion/coupling: MPI vs. objects
  – Message-driven execution:
    • Adaptive overlap
    • Handling jitter
    • Out-of-core execution
    • Caches / controllable SRAM
  – Flexible and dynamic mapping:
    • Vacating
    • Accommodating speed variation
    • Shrink/expand
  – Measurement-based load balancing
  – Communication optimizations
  – Learning algorithms
• New uses of compiler analysis:
  – Apparently simple
  – Threads, global variables
  – Minimizing state at migration
  – Border fusion
October 9, 2015
Processor virtualization
Technical Approach
• Seek optimal division of labor between “system” and
programmer:
– Decomposition done by programmer, everything else automated
– Develop standard library of reusable parallel components
[Diagram: division of labor (decomposition, mapping, scheduling, expression of parallelism), with MPI, HPF, and Charm++ placed by degree of specialization]
Object-based Decomposition
• Idea:
– Divide the computation into a large number of pieces
• Independent of number of processors
• Typically larger than number of processors
– Let the system map objects to processors
• This is “virtualization++”
  – Language and runtime support for virtualization
  – Exploitation of virtualization to the hilt
Object-based Parallelization
User is only concerned with interaction between objects.
[Diagram: user view (interacting objects) vs. system implementation (objects mapped to processors)]
Realizations: Charm++ and AMPI
• Charm++:
– Parallel C++ with Data Driven Objects (Chares)
– Object Arrays/ Object Collections
– Object Groups:
• Global object with a “representative” on each PE
– Asynchronous method invocation
• Prioritized scheduling
– Information sharing abstractions: readonly variables, tables, ...
– Mature, robust, portable (http://charm.cs.uiuc.edu)
• AMPI:
– Adaptive MPI, with many MPI threads on each processor
Chare Arrays
• Elements are data-driven objects
• Elements are indexed by a user-defined data type: [sparse] 1D, 2D, 3D, tree, ...
• Send messages to an index; receive messages at an element
• Reductions and broadcasts across the array
• Dynamic insertion, deletion, migration -- and everything still has to work!
Object Arrays
• A collection of data-driven objects (aka chares),
– With a single global name for the collection, and
– Each member addressed by an index
– Mapping of element objects to processors handled by the
system
User’s view: A[0] A[1] A[2] A[3] ... A[..]
Object Arrays
• A collection of chares,
  – with a single global name for the collection, and
  – each member addressed by an index
  – Mapping of element objects to processors handled by the system
User’s view: A[0] A[1] A[2] A[3] ... A[..]
System view: elements (e.g., A[0], A[3]) are placed on particular processors, and are free to migrate
Comparison with MPI
• Advantage: Charm++
– Modules/Abstractions are centered on application data
structures,
• Not processors
– Several other…
• Advantage: MPI
– Highly popular, widely available, industry standard
– “Anthropomorphic” view of processor
• Many developers find this intuitive
• But mostly:
– There is no hope of weaning people away from MPI
– There is no need to choose between them!
Adaptive MPI
• A migration path for legacy MPI codes
– Gives them the dynamic load balancing capabilities of Charm++
• AMPI = MPI + dynamic load balancing
• Uses Charm++ object arrays and migratable threads
• Minimal modifications to convert existing MPI programs
– Automated via AMPizer
• Bindings for
– C, C++, and Fortran90
AMPI: MPI “processes” implemented as virtual processors (user-level migratable threads), many per real processor
[Diagram: virtual MPI processes mapped onto real processors]
II: Consequences of virtualization
• Better Software Engineering
• Message Driven Execution
• Flexible and Dynamic mapping to processors
Modularization
• “Number of processors” decoupled from logical units
  – E.g., octree nodes for particle data
  – No artificial restriction on the number of processors
    • Such as a cube or a power of 2
• Modularity:
  – Software engineering: cohesion and coupling
  – MPI’s “are on the same processor” is a bad coupling principle
  – Objects liberate you from that:
    • E.g., solid and fluid modules in a rocket simulation
Rocket Simulation
• Large collaboration headed by Mike Heath
– DOE supported center
• Challenge:
– Multi-component code, with modules from independent
researchers
– MPI was common base
• AMPI: new wine in old bottle
– Easier to convert
– Can still run original codes on MPI, unchanged
Rocket simulation via virtual processors
[Diagram: many Rocflo, Rocface, and Rocsolid virtual processors spread across the physical processors]
AMPI and Roc*: Communication
[Diagram: communication among Rocflo, Rocface, and Rocsolid modules, routed between virtual processors]
Message Driven Execution
[Diagram: on each processor, a scheduler picks messages from its message queue]
Adaptive Overlap via Data-driven Objects
• Problem:
– Processors wait for too long at “receive” statements
• Routine communication optimizations in MPI
  – Move sends up and receives down
  – Sometimes: use irecvs, but be careful
• With data-driven objects
  – Adaptive overlap of computation and communication
  – No object or thread holds up the processor
  – No need to guess which message is likely to arrive first
Adaptive overlap and modules
[Figure: SPMD and message-driven modules]
(From A. Gursoy, Simplified Expression of Message-Driven Programs and Quantification of Their Impact on Performance, Ph.D. thesis, Apr. 1994.)
Modularity and Adaptive Overlap
“Parallel Composition Principle: For effective
composition of parallel components, a
compositional programming language should
allow concurrent interleaving of component
execution, with the order of execution constrained
only by availability of data.”
(Ian Foster, Compositional parallel programming languages, ACM
Transactions on Programming Languages and Systems, 1996)
Handling OS Jitter via MDE
• MDE encourages asynchrony
– Asynchronous reductions, for example
– Only data dependence should force synchronization
• One benefit:
– Consider algorithm with N steps
• Each step has different load balance
• Loose dependence between steps
– (on neighbors, for example)
– Sum-of-max (MPI) vs max-of-sums (MDE)
• OS jitter:
  – Causes random processors to add delays in each step
  – Handled automatically by MDE
Virtualization/MDE leads to predictability
• Ability to predict:
  – Which data is going to be needed and
  – Which code will execute
  – Based on the ready queue of object method invocations
• So, we can:
  – Prefetch data accurately
  – Prefetch code if needed
    • Out-of-core execution
    • Caches vs. controllable SRAM
[Diagram: scheduler (S) and message queue (Q) on each processor]
Flexible Dynamic Mapping to Processors
• The system can migrate objects between processors
– Vacate workstation used by a parallel program
– Dealing with extraneous loads on shared workstations
– Shrink and Expand the set of processors used by an app
• Adaptive job scheduling
• Better System utilization
– Adapt to speed difference between processors
• E.g., cluster with 500 MHz and 1 GHz processors
• Automatic checkpointing
– Checkpointing = migrate to disk!
– Restart on a different number of processors
Load Balancing with AMPI/Charm++
Turing cluster has processors with different speeds

Phase          16P3     16P2     8P3,8P2 w/o LB   8P3,8P2 w. LB
Fluid update   75.24    97.50    96.73            86.89
Solid update   41.86    52.50    52.20            46.83
Pre-Cor Iter   117.16   150.08   149.01           133.76
Time Step      235.19   301.56   299.85           267.75
Principle of Persistence
• Parallel analog of principle of locality
– Heuristics
– Holds for most CSE applications
• Once the application is expressed in terms of interacting objects:
  – Object communication patterns and computational loads tend to persist over time
  – In spite of dynamic behavior
    • Abrupt but infrequent changes
    • Slow and small changes
Utility of the principle of persistence
• Learning / adaptive algorithms
• Adaptive Communication libraries
– Each to all individualized sends
• Performance depends on many runtime characteristics
• Library switches between different algorithms
• Measurement based load balancing
Dynamic Load Balancing For CSE
• Many CSE applications exhibit dynamic behavior
– Adaptive mesh refinement
– Atom migration
– Particle migration in cosmology codes
• Objects allow RTS to remap them to balance load
– Just move objects away from overloaded processors to
underloaded processors
Just??
Measurement Based Load Balancing
• Based on Principle of persistence
• Runtime instrumentation
– Measures communication volume and computation time
• Measurement-based load balancers
  – Use the instrumented database periodically to make new decisions
– Many alternative strategies can use the database
• Centralized vs distributed
• Greedy improvements vs complete reassignments
• Taking communication into account
• Taking dependences into account (More complex)
Load balancer in action
Automatic Load Balancing in Crack Propagation
[Chart: number of iterations per second vs. iteration number. Throughput drops when (1) elements are added, then recovers after (2) the load balancer is invoked and (3) chunks are migrated.]
“Overhead” of Multipartitioning
[Chart: time (seconds) per iteration vs. number of chunks per processor, from 1 to 2048]
Optimizing for Communication Patterns
• The parallel-objects Runtime System can observe,
instrument, and measure communication patterns
– Communication is from/to objects, not processors
– Load balancers can use this to optimize object placement
– Communication libraries can optimize
• By substituting most suitable algorithm for each
operation
• Learning at runtime
V. Krishnan, MS Thesis, 1996
Example: All to all on Lemieux
[Chart: all-to-all time (ms) vs. number of processors, up to about 2500, comparing MPI with Converse]
The Other Side: Pipelining
• A sends a large message to B, whereupon B computes
– Problem: B is idle for a long time, while the message gets
there.
– Solution: Pipelining
• Send the message in multiple pieces, triggering a
computation on each
• Objects make this easy to do
• Example:
  – Ab initio computations using the Car-Parrinello method
  – Multiple 3D FFT kernels
Recent collaboration with: R. Car, M. Klein, G. Martyna, M. Tuckerman, N. Nystrom, J. Torrellas
Effect of Pipelining
[Chart: time (seconds) vs. objects per processor (0 to 45), for multiple concurrent 3D FFTs on 64 processors of Lemieux. V. Ramkumar (PPL)]
Control Points: learning and tuning
• The RTS can automatically optimize the degree of
pipelining
– If it is given a control point (knob) to tune
– By the application
Controlling pipelining between a pair of objects:
S. Krishnan, PhD Thesis, 1994
Controlling degree of virtualization:
Orchestration Framework: Ongoing PhD thesis
Example: Molecular Dynamics in NAMD
• Collection of [charged] atoms, with bonds
• Newtonian mechanics
• At each time-step
– Calculate forces on each atom
  • Bonds
  • Non-bonded: electrostatic and van der Waals
– Calculate velocities and advance positions
• 1 femtosecond time step; millions of steps needed!
• Thousands of atoms (1,000 - 500,000)
Collaboration with K. Schulten, R. Skeel, and coworkers
NAMD performance using virtualization
• Written in Charm++
• Uses measurement based load balancing
• Object level performance feedback
– using “projections” tool for Charm++
– Identifies problems at source level easily
– Almost suggests fixes
• Attained unprecedented performance
Grainsize analysis
Grainsize distribution
[Histogram: number of objects (0 to 1000) vs. grainsize in milliseconds (1 to 43); the “Problem” marker indicates objects with too much work]
Solution: split compute objects that may have too much work, using a heuristic based on the number of interacting atoms
Integration overhead analysis
[Timeline view highlighting integration]
Problem: integration time had doubled from the sequential run
Improved Performance Data
Speedup on ASCI Red
[Chart: speedup vs. number of processors, up to 2500]
Published in SC2000: Gordon Bell Award finalist
Namd2 Speedup on Lemieux
[Chart: NAMD2 speedup vs. number of processors (4 to 2046) on Lemieux, for apoa1 (cutoff and PME) and atpase (cutoff and PME)]
Role of compilers
• New uses of compiler analysis
– Apparently simple, but then again, data flow analysis must
have seemed simple
– Supporting threads
– Shades of global variables
– Minimizing state at migration
– Border fusion
– Split-phase semantics (UPC).
– Components (separately compiled)
• Compiler – RTS collaboration needed!
Summary
• Virtualization as a magic bullet
– Logical decomposition, better software eng.
– Flexible and dynamic mapping to processors
• Message driven execution:
– Adaptive overlap, modularity, predictability
• Principle of persistence
– Measurement based load balancing,
– Adaptive communication libraries
• Future:
  – Compiler support
  – Realize the potential:
    • Strategies and applications
More info: http://charm.cs.uiuc.edu
Component Frameworks
• Seek optimal division of labor between “system” and
programmer:
– Decomposition done by programmer, everything else automated
– Develop standard library of reusable parallel components
[Diagram: division of labor (decomposition, mapping, scheduling, expression of parallelism), with MPI, HPF, Charm++, and domain-specific frameworks placed by degree of specialization]