Munin, Clouds and
Treadmarks
Distributed Shared Memory course
Taken from a presentation of:
Maya Maimon
(University of Haifa, Israel).
1
Introduction
There are two fundamental models for
parallel programming:
Shared memory model
Direct extension of the conventional uniprocessor model
wherein each processor is provided with the abstraction
that there is but a single memory in the machine.
A update to shared data therefore becomes visible to all
the processors in the system.
Distributed memory model.
each processor has a private memory to which no other
processor has direct access.
communication by explicit message passing.
2
The Goal
Distributed memory machines are easier to build.
The shared memory programming model is, however,
more attractive since the message passing paradigm
requires them to explicitly partition data and manage
communication.
Using a programming model that supports a global
address space, an applications programmer can focus
on algorithmic development rather than on managing
partitioned data sets and communicating values.
A distributed shared memory (DSM) system provides
a shared memory programming model on distributed
memory machine.
3
Distributed Shared Memory
The system consists of the same hardware
as distributed memory machine, with the
addition of a software layer that provides the
abstraction of a single shared memory.
In practice, each memory remains physically
independent, and all communication takes place
through explicit message passing performed by
the DSM software layer.
DSM systems combine the best features of
the two methods. They support the
convenient shared memory programming
model on distributed memory hardware,
which is more scalable and less expensive to
build.
4
Software Distributed Shared Memory
Node 1
Node 2
Node n
Mem
Mem
Mem
network
distributed shared memory
page based, permissions, …
single system image,
shared virtual address space, …
5
The challenge
Performance
What is the source of the overhead in
DSM?
The large amount of communication
that is required to maintain consistency.
In another words, the maintenance of
the shared memory abstraction.
6
False Sharing
False sharing is a situation in which two
or more processes access different
variables within a page and at least one
of the accesses is a write.
If only one process is allowed to write to a
page at a time, false sharing leads to
unnecessary communication, called the
“ping-pong” effect.
7
Understanding False Sharing
A
w(x)
w(x)
p
p
B
A
w(x)
p
p
r(y)
w(x)
x
y
r(y)
w(x)
p
p
r(y)
w(x)
x
y
page p1
B
r(y)
r(y)
r(y)
page p2
8
Techniques for reducing the
amount of communication
1) Software release consistency
2) Multiple consistency protocol
3) Write shared protocol
4) an update with timeout mechanism
These techniques have been incorporated
in the Munin DSM system.
9
The need for Relaxed Consistency
Schemes
In any implementation of Sequential Consistency
there should be some global control mechanism.
If processor write to shared data ,all the other should
know about it ,before it can write again.
Either of writes or reads require memory
synchronization operations.
In most implementation writes require some kind
of memory synchronization:
A
w(x)
w(y)
w(x)
B
10
1) Relaxed Consistency Schemes
The Relaxed Consistency Schemes are
designed to allow less memory
synchronization operations.
Writes can be delayed, aggregated, eliminated.
This results in less communication and therefore
higher performance.
barrier
A
w(x)
w(y)
w(x)
B
11
The idea
Release consistency exploits the fact
that programmers use synchronization
to separate accesses to shared variables
by different threads.
The system then only needs to
guarantee that memory is consistent at
(select) synchronization points.
12
False Sharing in Relaxed
Consistency Schemes
False sharing has much smaller overhead in
relaxed consistency models.
The overhead induced by false sharing can be
further reduced by the the usage of multiplewriter protocols.
Multiple-writer protocols allow multiple
processes to simultaneously modify their local
copy of a shared page.
The modifications are merged at certain points of
execution.
13
Release Consistency
[Gharachorloo et al. 1990, DASH]*
(*)
Introduces a special type of variables, called
synchronization variables or locks.
Locks cannot be read or written to. They can
be acquired and released. For a lock L those
operations are denoted by acquire(L) and
release(L) respectively
We will say that a process that acquired a
lock L but has not released it, holds the lock
L.
No more than one process can hold a lock L.
One process holds the lock while others wait.
K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J.L. Hennessy. Memory consistency
and event ordering in scalable shared-memory multiprocessors. In Proceedings of the 17th Annual
International Symposium on Computer Architecture, pages 15--26. IEEE, May 1990.
14
Using Release and Acquire to define
execution-flow synchronization primitives
Let a set of processes release tokens by reaching the
operation release in their program order.
Let another set (possibly with overlap) acquire those
tokens by performing acquire operation, where
acquire can proceed only when all tokens have already
arrived from all releasing processes.
2-way synchronization = lock-unlock, 1 release, 1
acquire
n-way synchronization = barrier, n releases, n
acquires
PARC’s synch = k-way synchronization
15
Formal Definition of Release
Consistency
Conditions for Release Consistency:
(A) Before a read or write access is allowed to perform
with respect to any other processor, all previous acquire
accesses must be performed, and
(B) Before a release access is allowed to perform with
respect to any other processor, all previous read or
write accesses must be performed, and
(C) acquire and release accesses are sequentially
consistent.
16
Understanding RC
From this point all processes must
see the value 1 in X
w(x)1
A
B
r(x)0
rel(L1)
r(x)?
r(x)1
acq(L1)
r(x)1
t
It is undefined what
value is read here. It
can be any value
written by some
process. Here it can
be 0 or 1.
1 must be read
according to
rule (B), but
the
programmer
can not be sure
of it
Programmer is sure
that this will return
1 according to rules
(C) and (A)
17
Acquire and Release
release serves as a memory-synch operation, or a
flush of the local modifications to the attention of all
other processes.
According to the definition, the acquire and
release operations are not only used for
synchronization of execution, but also for
synchronization of memory, i.e. for propagation of
writes from/to other processes.
This allows to overlap the two expensive kinds of
synchronization.
This turns out also simpler on the programmer
from semantic point of view.
18
Acquire and Release (cont.)
A release followed by an acquire of the same
lock guarantees to the programmer that all writes
previous to the release will be seen by all reads
following the acquire.
The idea is to let the programmer decide which
blocks of operations need be synchronized, and put
them between matching pair of acquire-release
operations.
In the absence of release/acquire pairs, there is
no assurance that modifications will ever propagate
between processes.
19
Implementing RC
The first implementation was proposed by the
inventors of RC and is called DASH.
DASH combats memory latency by pipelining
writes to shared memory.
P1
w(x)
w(y)
w(z)
y
z
x
P2
ack
ack
rel(L)
stalled
ack
• The processor is stalled only when executing a
release, at which time it must wait for all its
previous writes to perform.
20
Implementing RC (cont.)
It is important to reduce the number of
messages exchanges, because every
message has additional fixed overhead,
independent of its size.
Another implementation of RC, called Munin
reduces the number of messages by buffering
writes until a release.
P1
w(x)
w(y)
w(z)
rel(L)
x,y,z
Ack(x,y,z)
P2
21
Eager Release Consistency
[Carter et al. 1991, Munin]*
Implementation of Release Consistency (not a new
memory model).
Postpone sending modifications to the next
release.
Upon a release send all accumulated
modifications to all caching processes.
No memory-synchronization operations on an
acquire.
Upon a miss (no local caching of the variable) get
latest modification from latest modifier (need
some more control to store its identity, no big
(*) John B. Carter, John K. Bennett, and Willy Zwaenepoel. Implementation and Performance of MUNIN. In
deal).
Proceedings of the 13th ACM Symposium on Operating Systems Principles, pages 152--164, October
1991.
22
Understanding ERC
A
apply changes
r(x)0 r(x)0
acq(L1) r(y)1
x,y
B
w(x)1 w(y)1
r(z)0
rel(L1)
x,y
C
acq(L2) w(z)1
apply
changes
apply changes
r(z)1
z
r(z)1
z
r(x)1
apply changes
rel(L2)
t
• Release operation does not complete (is not performed) until
the acknowledgements from all the processes are received.
23
Release Vs.
Sequential consistency
Ordinary reads and writes can be buffered
or pipelined between synchronization points.
2. Ordinary reads and writes following a
release do not have to be delayed for the
release to complete.
3. An acquire access does not have to delay for
previous ordinary reads and writes to
complete.
1.
24
2) Write-Shared Protocol
(Supporting Multiple Writers in ERC)
Modifications are detected by twinning.
When writing to unmodified page, its twin is
created.
When releasing, the final copy of a page is
compared to its twin.
The resulting difference is called a diff.
Twinning and diffing not only allow multiple
writers, but also reduce communication.
Sending a diff is cheaper than sending an entire
page.
25
Twinning and Diffing
write P
twin
writable working copy
release:

diff
26
3)Multiple Consistency
Protocol
The use of a single protocol for all shared
data leads to a situation where some
programs can be handled effectively by a
given DSM system, while others cannot.
Types of access patterns that should be
supported:
1.
2.
3.
4.
5.
Conventional shared
Read only
Migratory
Write shared
synchronization
27
Update-based vs. Invalidatebased
In update-based protocols the
modifications are sent whereas in
invalidate-based protocol only
notifications of modifications are sent.
Update-based
Invalidate-based
P1
w(x)1 w(y)2 rel(L)
P1
“I changed x and y”
P2
w(x)1 w(y)2 rel(L)
x:=1
y:=2
P2
28
Update-Based vs. InvalidateBased (cont.)
Invalidations are smaller than the updates.
The bigger the coherency unit the bigger is the
difference.
In invalidation-based schemes there can be
significant overhead due to access misses.
P1
P2
w(x)1 w(y)2 rel(L)
inv(x)
inv(y) acq(L)
get(x)
r(x)
x=1
get(y)
y=2
r(y)
29
4)update with timeout
mechanism
The problem: Updates to a data item are
propagated to all of its replicas, including
those that are no longer being used.
In DSM systems, this problem becomes even
more severe, because the main memories of
the nodes in which the replicas are kept are
very large and it takes a long time before a
page gets replaced, if at all.
Solution: invalid the old copies.
30
Reducing the Number of
Messages
In DASH and Munin systems all processes (or all
processes that cache the page) see the updates of a
process.
Consider the following example of execution in
Munin:
P1
P2
P3
P4
w(x) rel(L)
acq(L) w(x) rel(L)
acq(L) w(x) rel(L)
acq(L) r(x)
• There are many unneeded messages. In DASH even more.
• This problem exists in invalidation-based schemes as well.
31
Reducing the Number of Messages
(cont.)
• Logically, however it suffices to update each processor’s
copy only when it acquires L.
P1
P2
P3
P4
w(x) rel(L)
acq(L) w(x) rel(L)
acq(L) w(x) rel(L)
acq(L) r(x)
• Therefore, a new algorithm, called Lazy Release
Consistency (LRC) for implementing RC was proposed.
• LRC is aimed at reducing both the number of messages
and the amount of data exchanged.
32
Lazy Release Consistency
[Keleher et al., Treadmarks 1992]*
• The idea is to postpone sending of modifications
until a remote processor actually needs them.
• Invalidate-based protocol
• The BIG advantage: no need to get modifications
that are irrelevant, because they are already masked
by newer ones.
• NOTE: implements a slightly more relaxed
memory model than RC!
(*)
P. Keleher, A. L. Cox, S. Dwarkadas, and W. Zwaenopol. Treadmarks: Distributed shared memory on
standard workstations and operating systems. In Proceedings of the 1994 Winter Usenix Conference,
pages 115--132, Jan. 1994.
33
Lazy release consistency
Enforces consistency at the time of an acquire.
Causes fewer messages to be sent. At the time of a
lock release, Munin sends messages to all processors
who cache data modified by the releasing processor.
In contrast, in lazy release consistency, consistency
messages only travel between the last releaser and
the new acquirer.
But :
Somewhat more complicated
After a release, Munin can forget about all
modifications the releasing processor made prior
to the release.
34
Multiple writer protocol
in Treadmarks
The problem with a single writer protocol.
False sharing can cause singlewriter protocols
to perform even worse.
multiplewriter protocols allow multiple writers
to simultaneously modify the same page, with
consistency traffic deferred until a later time,
in particular until synchronization occurs.
Using diff.
35
Multiple writer protocol
in Treadmarks
The primary benefit of using diffs is
they can be used to implement multiplewriter protocols,
but they can also significantly reduce
overall bandwidth requirements because
diffs are typically much smaller than a
page.
36
Formal Definition of Lazy
Release Consistency
Conditions for Lazy Release Consistency:
(A) Before a read or write access is allowed to perform
with respect to any other process, all previous acquire
accesses must be performed with respect to that other
process, and
(B) Before a release access is allowed to perform with
respect to any other process, all previous read or write
accesses must be performed with respect to that other
process, and
(C) acquire and release accesses are sequentially
consistent.
37
Understanding the LRC Memory
Model
w(x)1
A
B
C
rel(L1)
r(x)0
r(x)?
r(x)?
acq(L1)
r(x)1
r(x)0
r(x)?
r(x)?
acq(L2)
r(x)?
t
• It is guaranteed that the acquirer of the same lock sees the
modification that precede the release in program order.
38
CLOUDS
Clouds is a distributed operating system that
integrates a set of loosely coupled machines
into a conceptually centralized system.
The system is composed of:
Compute servers
Data servers
User workstations
Compute server and data server are logical
designations
39
The Object Oriented Model
structures a software system as a set of
objects.
Object consists data and operations.
Objects respond to messages.
Sending msg msg
To object 1
object1
In the end,the method sends
reply to the msg sender.
Object 1 responds
by executing a method
The method read or update data
and may sends messages to
other objects
40
The Object Thread Model
Clouds has a similar structure as object thread,
implemented at the operating system level.
Clouds objects are largegrained encapsulations of
code and data which model distinct, persistent virtual
address spaces.
The contents of an object is an instance of a class a
class is a compiled program module.
Unlike virtual address spaces in conventional
operating systems, the contents of a Clouds object
are longlived. That is, a Clouds object exists forever
and survives system crashes and shutdowns (like a
file) until explicitly deleted.
41
Execution
A thread is a logical path of execution that
executes code in objects, traversing objects
as it executes.
Unlike processes in a conventional operating
system, a Clouds thread is not bound to a
single address space, since each object
represents a distinct address space.
42
Created by
user or
A program
Execution example
Thread execute
entry point
in a object
thread
Access to a persistent
data in the object
Returns to the calling
object
May invoke operations
In other objects
The thread temporarily
leaves the calling
objects enters the called
object
43
Clouds…
Clouds uses objects as the abstraction of storage and
threads as the abstraction of computation.
Clouds is a general purpose operating system. It is
intended to support all types of languages and
applications, distributed or not. All applications can
view the system as a monolith, but distributed
applications may choose to view the system as
loosely coupled compute and data servers.
Each compute facility in Clouds has access to all
resources in the system.
44
Distributed Computations
A novel feature of Clouds that makes
distributed programming simpler is the
automatic migration of objects via
Clouds object invocation and distributed
shared memory (DSM) mechanisms.
45
Synchronous invocation
Computing
Server A
O1
Creates thread
to invoke
O1
Thread1
Executing thread 1
O1 Data and code
are demand as a
page to A
Or,other node (B) do the
If the system
to invoke
invokeB
computation
thechose
thread
on A ,thethe
page
of O2 are
to perform
operation
Ecexecute
on O2
thread 1
on
other
nodethe
and
are from
demand
(will
demad
page
indata
O1 invoke O2
Data Server
to
node
A.
server to B),than the results are
returned to A.
O1
46
Synchronous invocation
This invocation mechanism is called
synchronous, because execution in the
current environment of the invoking
thread waits until the invoked operation
completes and results are returned.
47
Asynchronous invocation
When thread t1 executing in object 1
asynchronously invokes an operation of object 2 , a
new thread t2 is created which executes the invoked
operation.
Asynchronous invocation returns a variable p which
can be used as a place holder for object return value
of any type.
Thread t1 can continue with execution in object 1
At a later time, thread t1 can wait for the completion
of the operation being executed by t2 by executing a
claim call on p. When claim p returns, t2 has
completed and the results returned by the invoked
operation are available to t1 through p.
48
More details…
The differences from RPC
Objects migration
In Clouds , a DSM coherence protocol is used to
ensure that the data in an object is seen by
concurrent threads in a consistent fashion even if
they are executing on different compute servers.
Since several threads can simultaneously execute in
an object, it is necessary to coordinate access to the
object data. This is handled by Clouds object
programmers using Clouds system synchronization
mechanisms such as semaphores and locks. Through
these mechanisms, threads may synchronize their
actions regardless of where they execute.
49
advantages
An application programmer does not have to
be aware of distribution.
A distributed program can be written as a
centralized program where processes
communicate and synchronize using shared
memory.
The degree of distribution or concurrency
does not have to be decided at the time the
program is written.
Clouds treats computation and storage
orthogonally. Second, Clouds uses a single
level persistent store for all objects.
50
Summary
Munin-reduced communication mainly
by release consistency.
Treadmarkslazy release consistency
Clouds- distributed computation and
data .
51
Descargar

Munin, Clouds and ThreadMark