“Dynamo: A Transparent Dynamic
Optimization System”
V. Bala, E. Duesterwald, and S. Banerjia, PLDI 2000
Presented by:
David Carrillo
CS 295: Run-time Techniques for Efficient and Reliable Program Execution.
Guoqing Xu
Motivation
• Recent developments decrease effectiveness of compiler
techniques
• Software shipped as collection of DLLs not single executable
• Object oriented languages use delayed binding
• Dynamic code generation
• System vendors have to rely on software vendors to apply powerful static compiler optimizations
• Still, the trend is to offload complexity from hardware to the compiler
• CISC vs. RISC
• Superscalar vs. VLIW (EPIC)
Motivation (contd.)
• Legacy binaries may not take advantage of modern compiler/chip innovations
• Two opposite developments
• Static compiler taking on increased performance burden
• Obstacles to static compiler analysis increasing
• Choices
1. Go for complex specialized compilers
2. Analyze code dynamically and do optimizations (Run-time
optimization)
Dynamo’s Choice
Overview
• Dynamo is a dynamic “native-to-native” optimization
system
• Input: an executing native instruction stream
• No involvement of Software Vendor
• Transparent operation
• Input instruction stream to Dynamo may be
• Statically created binary
• Dynamically generated binary (e.g., by a JIT or emulator)
• Present work focuses on static binary case
Overview (contd.)
• Dynamo works like a Virtual Machine
• Can observe execution behavior
• Tries to identify hot instruction sequences (traces)
• After identification of hot traces
• Generates optimized fragments for those traces
• Stored into fragment cache (a software code cache)
• Links successive traces (trace tree).
• Subsequent occurrences of a hot trace
• Stops interpreting
• Executes corresponding fragment from “fragment cache”
• Upon fragment exit, Dynamo resumes interpreting code (a toy sketch of this loop follows)
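A minimal, illustrative sketch of this interpret/execute loop in C. The helper names (lookup_fragment, execute_fragment, interpret_until_branch), the address type, and the one-entry toy cache are assumptions for illustration, not Dynamo’s actual interfaces:

/* dispatch.c -- toy sketch of Dynamo's interpret/execute loop.
 * The helpers are hypothetical stand-ins, not the paper's API.  */
#include <stdint.h>
#include <stdio.h>

typedef uint32_t addr_t;

typedef struct fragment {
    addr_t entry;   /* original start-of-trace address          */
    addr_t exit;    /* where control re-enters the interpreter  */
} fragment_t;

/* Toy fragment cache: a single fragment covering address 100. */
static fragment_t cache[] = { { 100, 200 } };

static fragment_t *lookup_fragment(addr_t pc) {
    return (pc == cache[0].entry) ? &cache[0] : NULL;
}

static addr_t execute_fragment(fragment_t *f) {
    printf("native: fragment @%u\n", (unsigned)f->entry);
    return f->exit;                 /* exit stub hands back a PC */
}

static addr_t interpret_until_branch(addr_t pc) {
    printf("interp: block @%u\n", (unsigned)pc);
    return pc + 4;                  /* pretend the next block follows */
}

int main(void) {
    addr_t pc = 92;
    for (int steps = 0; steps < 6; steps++) {
        fragment_t *f = lookup_fragment(pc);
        pc = f ? execute_fragment(f)           /* hot: run from cache */
               : interpret_until_branch(pc);   /* cold: interpret     */
    }
    return 0;
}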
Overview (contd.)
• Overheads associated
• Interpreting code is very slow
• Identification of hot traces and keeping track
• Generation of optimized fragments
• Advantages of approach
• Working set of the application gradually materializes in the fragment cache
• Optimized using run-time information (expected to be much faster
than original code)
• Beneficial if
• Performance benefits from reuse of fragments offset overheads
• Assumes that major working set of application is small
Dynamo: System Architecture
[Figure: Dynamo’s system architecture flowchart.]
Identification of Hot Traces
• Uses the MRET (Most Recently Executed Tail) scheme
• Interpreter associates counters with possible start-of-trace points
• Increments whenever these points executed
• Tries to do minimal profiling
• Assumes that the following instructions are also hot
• Estimates the end-of-trace point using some conditions
• When count exceeds a threshold
• Stops interpreting
• Records the instruction sequence until an end-of-trace condition
• This gives the hot trace (see the sketch below)
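A toy sketch of the counter-and-threshold mechanism. The direct-mapped counter table, the threshold value, and the function names are assumptions for illustration; the paper tunes its own threshold:

/* trace_select.c -- sketch of MRET-style hot-trace selection:
 * counters on candidate start-of-trace points, a threshold, then
 * trace recording.  Names and constants are illustrative.        */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define HOT_THRESHOLD 50        /* made-up value                   */
#define TABLE_SIZE    1024

typedef uint32_t addr_t;

static struct { addr_t pc; unsigned count; } counters[TABLE_SIZE];

/* Called by the interpreter at each candidate start-of-trace point.
 * Returns true when the point just became hot, i.e. the interpreter
 * should record the trail it executes next.                        */
static bool bump_counter(addr_t pc)
{
    unsigned slot = pc % TABLE_SIZE;      /* toy direct-mapped table */
    if (counters[slot].pc != pc) {        /* 1st visit: allocate     */
        counters[slot].pc = pc;
        counters[slot].count = 0;
    }
    return ++counters[slot].count == HOT_THRESHOLD;
}

int main(void)
{
    addr_t loop_head = 0x4000;
    for (int i = 0; i < 60; i++) {
        if (bump_counter(loop_head))
            printf("0x%x is hot: record instructions until an "
                   "end-of-trace condition (e.g., a backward taken "
                   "branch or a trace-length limit)\n",
                   (unsigned)loop_head);
    }
    return 0;
}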
Trace Optimization and Fragment Generation
• Selected hot trace is converted to an intermediate representation (IR)
• IR close to underlying machine instruction set
• Performs optimizations on the IR
• Removes unconditional branches
• Other optimizations on branch instructions
• For fragment generation
• Trace might be split into multiple fragments due to branches
• Register allocator tries to preserve original mappings
• Stubs are inserted for exits from the fragment (sketched below)
• Fragment linking couples contiguous fragments (highest performance gain)
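A sketch of the fragment/exit-stub bookkeeping, using made-up structures (fragment_t, exit_stub_t) rather than Dynamo’s real emitter, which generates PA-RISC machine code:

/* fragment.c -- illustrative bookkeeping for a fragment and its
 * exit stubs; every off-trace branch is retargeted to a stub that
 * saves context and returns control to Dynamo.                    */
#include <stdint.h>
#include <stdio.h>

typedef uint32_t addr_t;

typedef struct {
    addr_t target;   /* original off-trace branch target          */
    int    linked;   /* 0: stub returns control to Dynamo         */
} exit_stub_t;

typedef struct {
    addr_t      entry;      /* original start-of-trace address */
    exit_stub_t stubs[8];   /* one stub per cold exit          */
    int         nstubs;
} fragment_t;

static void add_exit_stub(fragment_t *f, addr_t cold_target)
{
    f->stubs[f->nstubs++] = (exit_stub_t){ cold_target, 0 };
}

int main(void)
{
    /* The trace A, C, D, G, I, J, E from the figures: cold
     * branches to B, F, H become stubs (addresses arbitrary). */
    fragment_t f = { .entry = 0x0A, .nstubs = 0 };
    add_exit_stub(&f, 0x0B);   /* B */
    add_exit_stub(&f, 0x0F);   /* F */
    add_exit_stub(&f, 0x11);   /* H */
    for (int i = 0; i < f.nstubs; i++)
        printf("stub -> 0x%02x: save context, return to Dynamo\n",
               (unsigned)f.stubs[i].target);
    return 0;
}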
The Trace
[Figure: speculative trace construction. Control enters from the interpreter at block A; the recorded hot trace is A, C, D, (call), G, I, J, (return), E. Cold exits to B, F, and H become stubs (B*, F*, H*) that lead back to the interpreter.]
Effects of Fragment Construction
[Figure: the resulting fragment, a linearized sequence A, C, D, G, I, J, E entered from the interpreter, with exit stubs B*, F*, H* going back to the interpreter.]
• Hot path is linearized
• Eliminates branches
• Creates superblock
• Applies local optimization
• Cold-path branches remain
• Targets are stubs
• Send control to interpreter
• Path includes call & return
• Jumps not branches
• Interprocedural effects
• Indirect branches
• Speculate on target
• Fall back on a hash table of branch targets (see the sketch below)
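A sketch of this indirect-branch handling: first compare against the target recorded when the trace was built (speculation), then fall back to a hash table of translated targets. The table layout and function names are assumptions:

/* indirect.c -- illustrative indirect-branch dispatch inside a
 * fragment: speculation hit, hash-table fallback, or exit.      */
#include <stdint.h>
#include <stdio.h>

typedef uint32_t addr_t;

#define TBL 256
static addr_t table_key[TBL];     /* original target  */
static addr_t table_val[TBL];     /* fragment address */

static addr_t lookup(addr_t target)
{
    unsigned h = target % TBL;
    return table_key[h] == target ? table_val[h] : 0;
}

/* Inlined guard at the trace's indirect-branch site. */
static addr_t dispatch_indirect(addr_t actual, addr_t predicted,
                                addr_t predicted_fragment)
{
    if (actual == predicted)          /* speculation hit: stay on trace */
        return predicted_fragment;
    addr_t frag = lookup(actual);     /* fall back to the hash table    */
    if (frag)
        return frag;
    return 0;                         /* miss: exit to the interpreter  */
}

int main(void)
{
    table_key[0x123 % TBL] = 0x123;
    table_val[0x123 % TBL] = 0x9000;
    printf("hit:   0x%x\n", (unsigned)dispatch_indirect(0x500, 0x500, 0x8000));
    printf("table: 0x%x\n", (unsigned)dispatch_indirect(0x123, 0x500, 0x8000));
    printf("miss:  0x%x\n", (unsigned)dispatch_indirect(0x777, 0x500, 0x8000));
    return 0;
}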
Sources of Improvement
Many small things contribute to make Dynamo profitable
• Linearization eliminates branches
• Improves TLB & I-cache behavior
• 2 passes of local optimization
• Redundancy elimination, copy propagation, constant folding, simple
strength reduction, loop unrolling, loop invariant code motion, redundant
load removal
• Keep in mind that “local” includes interprocedural in Dynamo
• Focuses on optimizations hard to do at compile time, at least without profile information; aims not to compete with the static compiler
Nevertheless, engineering detail makes a difference
Fragment Linking
Block B meets the “start of trace” condition (it is the target of an exit from a fragment)
[Figure: the trace A, C, D, G, I, J, E sits in the fragment cache with exit stubs B*, F*, H*.]
What happens if another path becomes hot? (Say A, B, D, G, I, J, E)
Fragment Linking
When B*’s counter reaches the hot threshold:
• Builds a fragment for the new trace
[Figure: a second fragment B, D, G, I, J, E is added to the cache alongside A, C, D, G, I, J, E, with its own exit stubs A*, F*, H*.]
Fragment Linking
When the counter reaches hot:
• Builds a fragment
• Links exit A→B to the new fragment
• Links exit E→A to the old fragment
[Figure: the two fragments now jump directly to each other inside the cache; only the remaining stubs exit to the interpreter. A sketch of this link patching follows.]
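A sketch of link patching under these assumptions: each exit record carries either the original target address (control goes back to Dynamo via the stub) or a direct pointer to another fragment. Structure and function names are illustrative:

/* linking.c -- illustrative fragment linking: patching an exit so
 * control flows cache-to-cache instead of through the interpreter. */
#include <stdio.h>
#include <stddef.h>

typedef struct fragment fragment_t;

typedef struct {
    unsigned    target;   /* original branch target address         */
    fragment_t *link;     /* NULL: exit through stub back to Dynamo */
} exit_t;

struct fragment {
    const char *name;
    exit_t      exits[2];
};

/* "Linking": overwrite the stub's branch with a direct jump. */
static void link_exit(exit_t *e, fragment_t *to) { e->link = to; }

int main(void)
{
    fragment_t fa = { "A,C,D,G,I,J,E", { { 0x0B, NULL }, { 0x0A, NULL } } };
    fragment_t fb = { "B,D,G,I,J,E",   { { 0x0A, NULL }, { 0,    NULL } } };

    link_exit(&fa.exits[0], &fb);  /* exit A->B enters the new fragment */
    link_exit(&fb.exits[0], &fa);  /* exit E->A re-enters the old one   */

    printf("%s exits straight into %s\n", fa.name, fa.exits[0].link->name);
    printf("%s exits straight into %s\n", fb.name, fb.exits[0].link->name);
    return 0;
}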
Fragment Linking
What if B* held a redundant op?
• Have LIVE on entry to B
• Can test LIVE sets for both the exit from A and the entry to B
• May show the op is dead …
• Link-time redundancy elimination (see the sketch below)
[Figure: the same linked fragments as on the previous slide.]
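A sketch of the link-time liveness test, modeling register sets as bit masks; the set encoding and names are assumptions for illustration:

/* linktime_re.c -- sketch of link-time redundancy elimination: a
 * compensation op parked in exit stub B* defines some register;
 * if that register is not LIVE on entry to fragment B, the op is
 * dead and can be dropped when the exit is linked.               */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

typedef uint32_t regset_t;            /* bit i set => register i */

/* Can an op that only defines `defs` be deleted at link time? */
static bool op_dead_at_link(regset_t defs, regset_t live_in_target)
{
    return (defs & live_in_target) == 0;
}

int main(void)
{
    regset_t live_entry_B = (1u << 3) | (1u << 5);   /* r3, r5 live  */
    regset_t op_defs      = (1u << 7);               /* op writes r7 */

    if (op_dead_at_link(op_defs, live_entry_B))
        printf("op in B* is dead: drop it when linking A->B\n");
    else
        printf("op still needed: keep it on the linked path\n");
    return 0;
}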
Fragment Cache Management
• Need to flush out cold entries regularly
• Mechanism for replacement can’t take too much overhead
• Schemes such as LRU might not be feasible
• Dynamo’s approach
• Looks for a sharp increase in the fragment creation rate
• Indicates significant change in program working set
• Complete fragment cache flush triggered
The paper gives an example justifying this assumption; a toy version of the heuristic follows.
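A toy version of the flush heuristic. The sampling window and spike factor are made up; the paper does not publish exact constants:

/* cache_mgmt.c -- sketch of the preemptive-flush heuristic: watch
 * the fragment creation rate, and when it spikes sharply (a likely
 * working-set change), flush the whole fragment cache instead of
 * doing fine-grained replacement.                                 */
#include <stdio.h>

#define SPIKE_FACTOR 4     /* "sharp increase" ratio (made up) */

static unsigned prev_rate = 0;   /* 0 = warm-up, no history yet */

/* Called once per sampling window with the number of fragments
 * created in that window.                                      */
static int should_flush(unsigned created)
{
    int flush = prev_rate && created > SPIKE_FACTOR * prev_rate;
    prev_rate = created;
    return flush;
}

int main(void)
{
    unsigned samples[] = { 5, 6, 4, 30, 7 };   /* spike at window 3 */
    for (int i = 0; i < 5; i++)
        if (should_flush(samples[i]))
            printf("window %d: rate spike (%u fragments) -> flush "
                   "entire fragment cache\n", i, samples[i]);
    return 0;
}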
Signal handling in Dynamo
• Signals/interrupts could be of two types
• Asynchronous
• Synchronous
• Asynchronous signals are queued at arrival time and processed at the end of the current fragment’s execution (sketched after this slide)
• Synchronous signal arrival on optimized code can create
problems
• Can’t be postponed
• Might expect the original program context, which no longer exists due to optimizations
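A sketch of deferred delivery for asynchronous signals, using only standard C signal(); the queue, its size, and the fragment-exit delivery hook are assumptions for illustration:

/* signals.c -- deferring asynchronous signals: the handler only
 * queues the signal number; delivery happens at the next fragment
 * exit, where a consistent program context exists.                */
#include <signal.h>
#include <stdio.h>

#define QMAX 32
static volatile sig_atomic_t queue[QMAX];
static volatile sig_atomic_t qlen = 0;

/* Async-signal-safe: just record the signal number. */
static void enqueue_signal(int sig)
{
    signal(sig, enqueue_signal);   /* re-arm (some platforms reset) */
    if (qlen < QMAX)
        queue[qlen++] = sig;
}

/* Called by the runtime at each fragment exit, where the original
 * program context can be materialized safely.                     */
static void deliver_pending_signals(void)
{
    while (qlen > 0)
        printf("delivering deferred signal %d\n", (int)queue[--qlen]);
}

int main(void)
{
    signal(SIGINT, enqueue_signal);   /* queue instead of handling */
    raise(SIGINT);
    raise(SIGINT);
    deliver_pending_signals();        /* ...at fragment exit       */
    return 0;
}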
Signal handling in Dynamo (contd.)
• Conservative Optimizations
• Constant propagation, constant folding, copy propagation, etc
• Allow precise signal context to be reconstructed
• Aggressive Optimizations
• Dead code removal, code sinking, loop invariant code motion, etc.
• Can’t guarantee that synchronous signals can be handled properly
Results
Figure: speedup of +O2-optimized PA-8000 binaries running on Dynamo, relative to the identical binaries running standalone. The contributions from Dynamo inlining due to trace selection, conservative trace optimization, and aggressive trace optimization are shown. Dynamo bails out to direct native execution on go and vortex.
Results
• Dynamo runs faster than native code in the average case
• Overhead averages 1.5% (Spec95 on HP PA-8000)
• Most significant source of speedup is trace-selection.
• Most of the overhead is spent in trace selection
• Interpret, bump target counters, test against threshold
• Optimization & code generation have minor cost
Results
The performance gain shrinks versus highly optimized binaries.
Explanation of results
• Gains from trace selection are due to
• Selecting a trace and forming fragments out of it
• Partial procedure inlining and improved layout in cache
• Large improvements in
• perl, compress, li and m88ksim
• Have stable working sets
• No performance improvements on
• go, ijpeg and vortex
• Startup time is a non-negligible fraction of the total runtime in these programs
• Don’t run long enough to recoup Dynamo’s startup overhead
• Lack of a stable working set.
Remarks
• Dynamo project
• Started in 1996
• Tries to achieve dynamic performance improvement
• Novel native-to-native online optimization approach
• Complements the static compiler
• Very influential paper
• 279 citations tracked in the ACM Digital Library
Remarks
• Starting up
• Single call to library routine
• It copies the stack, creates space for the interpreter’s state, and begins Dynamo’s execution
• This model simplifies program initialization and grants Dynamo local access to the binary’s memory space
• Counter management
• 1st time at branch target allocates & initializes a counter
• 2nd & subsequent times bumps that counter
• Optimization & code generation recycles counter’s space
• With a cheap breakpoint mechanism, Dynamo could dispense with the interpreter and use breakpoints to invoke itself (similar to a debugger)
Caveats
• Runs on the PA-8000 (introduced in November 1995)
• Very simple instructions; easier to write an optimizer for than CISC (recall the Itanium fiasco)
• The chip lacks most of the hardware optimization features common in modern chips
• Low performance gain on already highly optimized binaries
• Poor handling of interrupts may cause problems in I/O-intensive scenarios; the experiments do not test this scenario
• Posterior documents from the authors market Dynamo as an
interpreter for emulators
References
• Dynamo: A Transparent Dynamic Optimization System
• Vasanth Bala, Evelyn Duesterwald, and Sanjeev Banerjia
• Proceedings of the ACM SIGPLAN 2000 Conference on Programming Language Design and Implementation (PLDI 2000)
• Ars Technica report on Dynamo
• http://www.arstechnica.com/reviews/1q00/dynamo/dynamo-1.html
Slides based on previous slides by:
• Dynamic Optimization as typified by the Dynamo System
• Keith D. Cooper & Linda Torczon, COMP 512, Rice University
• Mayank Agarwal, CS 812: High Level Design and Modeling, student presentation