MAMA: Mostly Automatic Management of
Atomicity
Christian DeLozier, Joseph Devietti, Milo M. K. Martin
University of Pennsylvania
March 2nd, 2014
Start with a serial problem
Christian DeLozier - 2014
2
Find and express the parallelism
Christian DeLozier - 2014
3
Coordinate the parallel execution (synchronization)
Christian DeLozier - 2014
4
Don’t mess up!
Christian DeLozier - 2014
5
Is there another way to do this?
• Programmer currently has to:
1. Express the parallelism (Hard)
2. Coordinate the parallelism (Hard)
• Alternative:
1. Programmer expresses the parallelism
2. Machine handles coordination
Christian DeLozier - 2014
6
Coordinating Parallel Execution
• Atomicity vs. Ordering
• Types of concurrency bugs [Lu et al., ASPLOS 2008]
• Atomicity: Locks, transactions
• Ordering: Barriers, fork/join, blocking on a queue, etc.
• Atomicity constraints are more common than
ordering constraints
• Difficult to infer ordering constraints
Christian DeLozier - 2014
7
Mostly Automatic Management of Atomicity
• Toward automatically providing atomicity for
parallel programs
• Program either executes atomically –
or deadlocks
• Protect every shared variable with its own lock
• Restore progress and performance when
necessary (with help from the programmer)
Christian DeLozier - 2014
8
Related Work
• Automatic Parallelization
• [Bernstein, IEEE Transactions 1966]
•…
• Data Centric Synchronization
• [Vaziri et. al, POPL 2006]
• [Ceze et. al, HPCA 2007]
• Transactional Memory
• [Herlihy and Moss, ISCA 1993]
•…
Christian DeLozier - 2014
9
Lock-Based Atomic Sections
• What lock do we acquire?
• When do we acquire the lock?
• When should we release the lock?
Christian DeLozier - 2014
10
What lock do we acquire?
• Associate a lock with each variable
• Trade-off between parallelism and overhead
• Coarse-grained vs. Fine-grained
• Coarse-grained: 1 lock per object, 1 lock per array
• Fine-grained: 1 lock per field, 1 lock per array element
• Mutex vs. Reader-writer lock
Christian DeLozier - 2014
11
MAMA Prototype
• Uses fine-grained locking
• More parallelism
• Especially for arrays
• Optimization: Divide arrays into N chunks, 1 lock per
chunk
• Uses reader-writer locks
• More parallelism
• Read sharing is common
Christian DeLozier - 2014
12
Lock-Based Atomic Sections
• What lock do we acquire?
• One reader-writer lock per variable (fine-grained)
• When do we acquire the lock?
• Acquire before the first dynamic access
• When should we release the lock?
Christian DeLozier - 2014
13
When should we release the lock?
• Simple case: After the owning thread has exited
T1
Write
A
Exited
T2
Write
A
T1
T2
Exit
Write
A
Christian DeLozier - 2014
14
When should we release the lock?
• When the owning thread is waiting for another
thread to make progress (e.g. join, barrier)
T1
Joined
T2
Spawn T2
Write
A
Write
A
T1
T2
Join T2
Exit
Write
A
Christian DeLozier - 2014
15
When should we release the lock?
• Other deadlocks cannot be safely broken
T1
Write
B
T2
Write
A
Write
B
Write B
Write
A
T1
• Need help from the programmer
T2
Write
A
• Trusted annotations to sanction breaking a deadlock
• MAMA_release(object)
• Also used to improve performance when threads are
over-serialized
Christian DeLozier - 2014
16
Lock-Based Atomic Sections
• What lock do we acquire?
• One reader-writer lock per variable (fine-grained)
• When do we acquire the lock?
• Acquire before the first dynamic access
• When should we release the lock?
• At thread exit
• When waiting for another thread to make progress
• Or, at programmer sanctioned program points
Christian DeLozier - 2014
17
What can deadlocks tell us?
• When a thread cannot acquire a lock:
• Perform distributed deadlock detection
[Bracha and Toueg, Distributed Computing 1987]
T1
T2
void f(){
A = 1;
B = 2;
}
void g(){
B = 1;
A = 2;
}
Write B
T1
T2
Write A
Christian DeLozier - 2014
18
MAMA Prototype
• Implemented as a RoadRunner tool [Flanagan and
Freund, PASTE 2010]
• Dynamic instrumentation for Java byte-code
• Evaluated on the Java Grande benchmarks and
selected DaCapo benchmarks
• Running on one socket (8 cores) of a 4 socket
Nehalem system with 128 GB RAM
• Removed all synchronized blocks and
java.util.concurrent constructs from benchmarks
• Ensure that MAMA is providing all of the atomicity
Christian DeLozier - 2014
19
Evaluating MAMA
• Can we execute parallel programs correctly?
• How many annotations need to be added for
progress and performance?
• How is the performance of the program affected?
• Does MAMA permit thread to execute in parallel?
Christian DeLozier - 2014
20
Annotation Burden
Benchmark
Lines of Code
Progress
Annotations
Performance
Annotations
crypt
314
0
0
lufact
461
1
4
lusearch
124105
0
4
matmult
187
0
0
moldyn
487
3
0
montecarlo
1165
0
28
pmd
60062
0
4
series
180
0
0
sor
186
1
0
sunflow
21970
1
3
xalan
172300
0
0
Christian DeLozier - 2014
21
Performance
Normalized Runtime
RR-ser
MAMA-par
MAMA-ser
23x
10x
9x
8x
7x
6x
5x
4x
3x
2x
1x
0x
• MAMA incurs overhead due to locking and serial execution
• But, MAMA still allows some parallel execution as
compared to serialization
Christian DeLozier - 2014
22
Performance Breakdown
Percent Runtime
Running
Blocked
Deadlock
Acquire
Check
100%
90%
80%
70%
60%
50%
40%
30%
20%
10%
0%
• Many benchmarks have significant portions that run in parallel
• Checking whether or not a lock is already owned incurs
significant overhead on some benchmarks
Christian DeLozier - 2014
23
Normalized Memory Usage
Memory Usage
MAMA-par
40x
30x
20x
10x
0x
• Fine-grained locking incurs significant memory overheads
• Could be optimized to save space via chunking arrays or
decreasing the size of the lock
Christian DeLozier - 2014
24
Future Directions
• Does this approach apply to other languages?
• How do we test programs running with MAMA?
• Find uncommon deadlocks
• Gain more confidence in trusted annotations
• How can we reduce the performance overheads?
• How can we infer ordering constraints?
Christian DeLozier - 2014
25
MAMA
• Provides atomicity for parallel programs
• Some help via annotations from programmer
• A step toward programming without worrying
about atomicity
• Programmer expresses parallelism
• Machine provides atomicity automatically
Christian DeLozier - 2014
26
Thank you for listening!
Christian DeLozier - 2014
27
Descargar

Document