Communication Models for Parallel
Computer Architectures
• Two distinct models have been proposed for how
CPUs in a parallel computer system should communicate.
 In the first model, all CPUs share a common physical memory.
• This kind of system is called a multiprocessor or shared
memory system.
 In the second design, each CPU has its own private memory.
• Such a design is called a multicomputer or distributed
memory system.
 Consider a program to find all of the objects in a bitmap image.
• One copy of the image is kept in memory.
• Each CPU runs a single process which inspects one section of
the image.
• Some objects occupy multiple sections, so it is essential that
each process have access to the entire image.
 Example multiprocessors include:
• Sun Enterprise 10000
• Sequent NUMA-Q
• SGI Origin 2000
• HP/Convex Exemplar
 In a multicomputer solving the same problem, each
CPU has a section of the image in its local memory.
• If the CPUs need to follow an object across the border, they
must request the information from a neighboring CPU.
 This is done via message passing.
 Programming multicomputers is more difficult than
programming multiprocessors, but they are easier and
cheaper to build.
• Building a multicomputer with 10,000 CPUs is
straightforward; building a comparable multiprocessor is not.
 Example multicomputers include:
• IBM SP/2
• Intel/Sandia Option Red
• Wisconsin COW
 Much research focuses on hybrid systems combining
the best of both worlds.
 Shared memory might be implemented at a higher level
than the hardware.
• The operating system might simulate a shared memory by
providing a single system-wide paged shared address space.
 This approach is called DSM (Distributed Shared Memory).
Shared Memory
• Each machine has its own virtual memory and its own page tables.
• When a CPU does a LOAD or STORE on a page it does not
have, a trap to the OS occurs.
• The OS locates the page and asks the CPU currently holding it
to unmap the page and send it over the interconnection network.
• When it arrives, the page is mapped in and the faulting
instruction restarted.
 A third possibility is to have a user-level runtime
system implement a form of shared memory.
Shared Memory
 The programming language provides a shared
memory abstraction implemented by the
compiler and runtime system.
• The Linda model is based on the abstraction of a
shared space of tuples.
 Processes can input a tuple from the shared tuple space or
output a tuple to the shared tuple space.
• The Orca model allows shared generic objects.
 Processes can execute object-specific methods on shared objects.
 When a change occurs to the internal state of some object,
it is up to the runtime system to simultaneously update all
copies of the object.
Interconnection Networks
• Multicomputers are held together by
interconnection networks which move packets
between CPUs and memory.
 The CPUs and memory modules of multiprocessors are
also interconnected.
 Interconnection networks consist of:
• CPUs and memory modules
• Interfaces
• Links
• Switches
Interconnection Networks
 The links are the physical channels over which
bits move. They can be
• electrical or optical fiber
• serial or parallel
• simplex, half-duplex, or full duplex
 The switches are devices with multiple input
ports and multiple output ports.
• When a packet arrives at an input port on a switch,
some bits are used to select the output port to which
the packet is sent.
 An interconnection network consists of switches and
wires connecting them.
 The following slide shows an example.
• Each switch has four input ports and four output ports.
• In addition, each switch has some CPUs and interconnect circuitry attached to it.
• The job of the switch is to accept packets arriving on any input
port and send each one out on the correct output port.
• Each output port is connected to an input port of another
switch by a parallel or serial line.
 Several switching strategies are possible.
• In circuit switching, before a packet is sent, the entire path
from the source to the destination is reserved in advance.
 All ports and buffers are claimed, so that when transmission
starts, all necessary resources are guaranteed to be available and
the bits can move at full speed from the source, through the
switches to the destination.
• In store-and-forward packet switching, no advance
reservation is needed.
 The source sends a complete packet to the first switch where it is
stored in its entirety.
 The switches may need to buffer packets if an output port is busy.
Communication Methods
 When a program is split up into pieces, the pieces
(processes) often need to communicate with one another.
 This communication can be done in one of two ways:
• shared variables
• explicit message passing
• Logical sharing of variables is possible even on a multicomputer.
• Message passing is easy to implement on a multiprocessor by
simply copying from the sender to the receiver.
Communication Methods
Message Passing Modes
• Messaging systems can be either persistent
or transient
 Are messages retained when the senders and/or
receivers stop executing?
• Can also be either synchronous or asynchronous
 Blocking vs. non-blocking
Persistent Communication
• Persistent communication of letters back in the days of the
Pony Express.
Persistence and Synchronicity in Communication
Persistent asynchronous communication
Persistent synchronous communication
Persistence and Synchronicity in Communication
Transient asynchronous communication
Receipt-based transient synchronous communication
Persistence and Synchronicity in Communication
Delivery-based transient synchronous communication at message delivery
Response-based transient synchronous communication
Remote Procedure Call
• Developed by Birrell and Nelson (1984).
 In multicomputer (message-passing) systems the client code for
copying a file is quite different from the normal
centralized (uniprocessor) code.
 Let’s make the client-server request-reply look
like a normal procedure call and return.
 Notice that getchar in the centralized version
turns into a read system call. The following is
for Unix:
• read looks like a normal procedure to its caller.
Remote Procedure Call
• read (the library routine the user calls) is user-mode code.
• read manipulates registers and then does a trap to
the kernel.
• After the trap, the kernel manipulates registers and
then runs a C-language routine, and lots of work gets
done (drivers, disks, etc.).
• After the I/O, the process gets unblocked, the kernel
read code manipulates registers, and returns. The
user-mode read then manipulates registers and returns
to the original caller.
 Let’s do something similar with request reply:
Remote Procedure Call
 User (client) does a subroutine call to getchar
(or read).
• Client knows nothing about messages.
 We link in a user-mode program called the
client stub (analogous to the user-mode read routine).
• This takes the parameters to read and converts them
to a message (marshalls the arguments).
• Sends the message to the machine containing the server,
directed to the server stub.
• Does a blocking receive (of the reply message).
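• A minimal sketch in C of what such a client stub might do for read: the message layout, the OP_READ opcode, and the (omitted) net_send/net_receive calls are invented for illustration; in practice the stub is generated automatically from an interface description.

/* Hypothetical sketch of a client stub marshalling a remote read() request.
 * The wire format and opcode are invented; a real stub is generated from
 * an IDL file and would then send the message and block on the reply.    */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define OP_READ 3                 /* assumed opcode for the read operation */

/* Marshal a read(fd, nbytes) request into a flat byte buffer. */
static size_t marshal_read(uint8_t *msg, int32_t fd, int32_t nbytes)
{
    size_t off = 0;
    int32_t op = OP_READ;
    memcpy(msg + off, &op, sizeof op);         off += sizeof op;
    memcpy(msg + off, &fd, sizeof fd);         off += sizeof fd;
    memcpy(msg + off, &nbytes, sizeof nbytes); off += sizeof nbytes;
    return off;                                 /* total message length */
}

/* Unmarshal the same request (what the server stub would do on arrival). */
static void unmarshal_read(const uint8_t *msg, int32_t *op,
                           int32_t *fd, int32_t *nbytes)
{
    size_t off = 0;
    memcpy(op, msg + off, sizeof *op);          off += sizeof *op;
    memcpy(fd, msg + off, sizeof *fd);          off += sizeof *fd;
    memcpy(nbytes, msg + off, sizeof *nbytes);
}

int main(void)
{
    uint8_t msg[64];
    size_t len = marshal_read(msg, 4, 1024);

    /* A real client stub would now do net_send(server, msg, len) and then
     * a blocking net_receive() for the reply message.                     */
    int32_t op, fd, nbytes;
    unmarshal_read(msg, &op, &fd, &nbytes);
    printf("marshalled %zu bytes: op=%d fd=%d nbytes=%d\n",
           len, (int)op, (int)fd, (int)nbytes);
    return 0;
}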
Remote Procedure Call
 The server stub is linked with the server.
• It receives the message from the client stub.
• Unmarshalls the arguments and calls the server (as a local procedure call).
 The server procedure does what it does and
returns (to the server stub).
• Server knows nothing about messages
 Server stub now converts this to a reply
message sent to the client stub.
• Marshalls the result.
Remote Procedure Call
 Client stub unblocks and receives the reply.
• Unmarshalls the result.
• Returns to the client.
 Client believes (correctly) that the routine it
calls has returned just like a normal procedure call.
Passing Value Parameters
• Steps involved in doing remote computation through RPC
Remote Procedure Call
 Heterogeneity: Machines have different data representations (e.g., byte order).
• How can we handle these differences in RPC?
 Have conversions between all possibilities.
 Done during marshalling and unmarshalling.
• Adopt a standard and convert to/from it.
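• A small sketch in C of the byte-order half of this problem: a 32-bit value written by a little-endian machine (such as the Pentium) must have its bytes inverted before a big-endian machine (such as the SPARC) can interpret it correctly. The swap routine and sample value are for illustration only.

/* Minimal sketch of the byte-order problem from the Pentium/SPARC example.
 * Sender and receiver must agree on a wire format; here the receiving stub
 * swaps bytes explicitly during unmarshalling.                             */
#include <stdio.h>
#include <stdint.h>

static uint32_t swap32(uint32_t x)              /* reverse the four bytes */
{
    return (x >> 24) | ((x >> 8) & 0x0000FF00u)
         | ((x << 8) & 0x00FF0000u) | (x << 24);
}

int main(void)
{
    uint32_t value = 5;                         /* value as the sender stored it */
    uint32_t naive = swap32(value);             /* what a receiver with the
                                                   opposite byte order would see */
    printf("sender meant %u, naive receiver sees %u, after inversion %u\n",
           value, naive, swap32(naive));
    return 0;
}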
Passing Value Parameters
Original message on the Pentium
The message after receipt on the SPARC
The message after being inverted. The little numbers in
boxes indicate the address of each byte
Remote Procedure Call
 Pointers: Avoid them for RPC!
• Can put the object pointed to into the message itself
(assuming you know its length).
• Convert call-by-reference to copy-in/copy-out
 If we have in or out parameters (instead of in-out), we can
eliminate one of the copies
• Change the server to handle pointers in a special way
 e.g., via a callback to the client stub to dereference them
Registering and Name Binding
 As we said before, we can use a name server.
 This permits the server to move using the
following process.
• deregister from the name server
• move
• reregister
 This is sometimes called dynamic binding.
Registering and Name Binding
 The client stub calls the name server (binder)
the first time to get a handle to use for subsequent calls.
• There can be a callback from the binder to the client stub
if the server deregisters, or we could simply have any
attempt to use the stale handle fail, so that the client stub
will go to the binder again.
How does a programmer
create a program with RPC?
• uuidgen generates a unique identifier for the interface.
• Include it in an IDL (Interface Definition
Language) file and describe the interface for
the RPC in the file as well.
• Write the client and server code
• Client and server stubs are generated from
the IDL file automatically
• Link things together and run on the desired machines.
Writing a Client and a Server
• The steps in writing a client and a server in DCE RPC.
Processor Allocation
• Processor Allocation
 Decide which processes should run on which processors.
 Could also be called process allocation.
 We assume that any process can run on any processor.
Processor Allocation
 Often the only difference between different
processors is:
• CPU speed
• CPU speed and amount of memory
 What if the processors are not homogeneous?
• Assume that we have binaries for all the different architectures.
 What if not all machines are directly connected?
• Send the process via intermediate machines.
Processor Allocation
• If we have only PowerPC binaries, restrict the
process to PowerPC machines.
• If we need machines very close for fast
communication, restrict the processes to a group of
close machines.
 Can you move a running process or are
processor allocations done at process creation time?
• Migratory allocation algorithms vs. non-migratory.
Processor Allocation
 What is the figure of merit, i.e. what do we
want to optimize in order to find the best
allocation of processes to processors?
• Similar to CPU scheduling in centralized operating systems.
 Minimizing response time is one possibility.
Processor Allocation
• We are not assuming all machines are equally fast.
 Consider two processes. P1 executes 100 million
instructions, P2 executes 10 million instructions.
 Both processes enter the system at time t=0.
 Consider two machines: A executes 100 MIPS, B executes 10 MIPS.
 If we run P1 on A and P2 on B, each takes 1 second, so the
average response time is 1 sec.
 If we run P1 on B and P2 on A, P1 takes 10 seconds and P2 takes 0.1
sec., so the average response time is 5.05 sec.
 If we run P2 then P1, both on A, they finish at times 0.1 and 1.1,
so the average response time is 0.6 seconds!
Processor Allocation
 Minimize response ratio.
• Response ratio is the time to run on some machine
divided by time to run on a standardized
(benchmark) machine, assuming the benchmark
machine is unloaded.
• This takes into account the fact that long jobs should
take longer.
 Maximize CPU utilization
 Throughput
• Jobs per hour
• Weighted jobs per hour
Processor Allocation
 If the weighting is CPU time, we get CPU utilization.
 This is one way to justify CPU utilization from a user-centric viewpoint.
• Design issues
 Deterministic vs. Heuristic
• Use deterministic for embedded applications, when
all requirements are known a priori.
 Patient monitoring in hospital
 Nuclear reactor monitoring
 Centralized vs. distributed
• We have a tradeoff of accuracy vs. fault tolerance
and bottlenecks.
Processor Allocation
 Optimal vs. best effort
• Optimal normally requires off line processing.
• Similar requirements as for deterministic.
• Usual tradeoff of system effort vs. result quality.
 Transfer policy
• Does a processor decide to shed jobs just based on its
own load or does it have (and use) knowledge of
other processors' loads?
• Also called local vs. global
• Usual tradeoff of system effort (gather data) vs.
result quality.
Processor Allocation
• Location policy
 Sender vs. receiver initiated.
• Sender initiated - uploading programs to a compute server
• Receiver initiated - downloading Java applets
 Look for help vs. look for work.
 Both are done.
Processor Allocation
• Implementation issues
 Determining local load
• Normally use a weighted mean of recent loads with
more recent weighted higher.
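• A minimal sketch in C, assuming an exponential moving average is used as the weighting scheme; the smoothing factor ALPHA and the sample load readings are arbitrary values chosen for illustration.

/* Exponential moving average as one way to weight recent loads more
 * heavily; ALPHA and the sample values are made up for illustration.  */
#include <stdio.h>

#define ALPHA 0.5                     /* weight given to the newest sample */
#define NSAMPLES 5

int main(void)
{
    double samples[NSAMPLES] = { 0.2, 0.8, 1.5, 1.0, 0.3 };  /* recent loads */
    double load = samples[0];

    for (int i = 1; i < NSAMPLES; i++)
        load = ALPHA * samples[i] + (1.0 - ALPHA) * load;  /* newer counts more */

    printf("estimated local load = %.3f\n", load);
    return 0;
}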
Processor Allocation
• Example algorithms
• Min-cut deterministic algorithm
 Define a graph with processes as nodes and IPC
traffic as arcs
 Goal: Cut the graph (i.e., some arcs) into pieces
so that
• All nodes in one piece can be run on one processor, subject to
 Memory constraints
 Processor completion times
• Values on the cut arcs are minimized
Processor Allocation
• Minimize the max
 minimize the maximum traffic for a process pair
• Minimize the sum
 minimize total traffic
• Minimize the sum to/from a piece
 don't overload a processor
• Minimize the sum between pieces
 minimize traffic for each processor pair
 Tends to get hard as you get more realistic
Processor Allocation
• Up-down centralized algorithm
 Centralized table that keeps "usage" data for each
user; the users are defined to be the workstation
owners. Call this the score for the user.
 The goal is to give each user a fair share.
 When user requests a remote job, if a
workstation is available it is assigned.
 For each process a user has running remotely,
the user's score increases by a fixed amount
each time interval.
Processor Allocation
 When a user has an unsatisfied request pending (and
none being satisfied), the score decreases (it can go negative).
 If no requests are pending and none are being satisfied,
the score is bumped towards zero.
 When a processor becomes free, assign it to a
requesting user with the lowest score.
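• A rough sketch in C of the up-down bookkeeping described above; the point values, the users, and the tick loop are invented for illustration and do not reproduce the published algorithm's exact constants.

/* Up-down score bookkeeping: scores rise while a user runs work remotely,
 * fall while a request goes unsatisfied, and drift toward zero when idle.
 * A freed workstation goes to the pending user with the lowest score.     */
#include <stdio.h>

#define NUSERS      3
#define UP_POINTS   1   /* per remote process running, per interval        */
#define DOWN_POINTS 1   /* per interval with an unsatisfied pending request */

struct user {
    int score;           /* can go negative                          */
    int running_remote;  /* processes currently running remotely     */
    int pending;         /* 1 if a request is pending and unsatisfied */
};

static void tick(struct user *u)
{
    if (u->running_remote > 0)
        u->score += UP_POINTS * u->running_remote;
    else if (u->pending)
        u->score -= DOWN_POINTS;
    else if (u->score > 0)            /* idle: bump back toward zero */
        u->score--;
    else if (u->score < 0)
        u->score++;
}

static int pick_winner(struct user *u, int n)   /* lowest-score pending user */
{
    int best = -1;
    for (int i = 0; i < n; i++)
        if (u[i].pending && (best < 0 || u[i].score < u[best].score))
            best = i;
    return best;
}

int main(void)
{
    struct user users[NUSERS] = {
        { 0, 2, 0 },   /* user 0: two remote processes running */
        { 0, 0, 1 },   /* user 1: waiting, nothing satisfied    */
        { 0, 0, 1 },   /* user 2: waiting, nothing satisfied    */
    };
    for (int t = 0; t < 5; t++)
        for (int i = 0; i < NUSERS; i++)
            tick(&users[i]);
    users[2].score += 3;             /* pretend user 2 used more in the past */
    printf("free workstation goes to user %d\n", pick_winner(users, NUSERS));
    return 0;
}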
Processor Allocation
• Hierarchical algorithm
 Goal - assign multiple processors to a job
 Quick idea of algorithm
• Processors arranged in a tree
• Requests go up the tree until a subtree has enough resources
• Request is split and parts go back down the tree
 Arrange processors in a hierarchy (tree)
• This is a logical tree independent of how physically connected
• Each node keeps (imperfect) track of how many available
processors are below it.
 If a processor can run more than one process, the algorithm must be
more sophisticated and must keep track of how many processes can be
allocated (without overload) in the subtree below.
Processor Allocation
• If a new request appears in the tree, the current node sees if it
can be satisfied by the processors below (plus itself).
 If so, do it.
 If not, pass the request up the tree
 Actually, since machines may be down or the data on availability
may be out of date, you try to find somewhat more processors than requested
• Once a request has gone high enough to be satisfied, the
current node splits the request into pieces and sends each piece
to appropriate child.
• What if a node dies?
 Promote one of its children say C
 Now C's children are peers with the previous peers of C
Processor Allocation
 If this is considered too unbalanced, we can promote one of C's
children to take C's place.
• How can we decide which child C to promote?
 Peers of dead node have an election
 Children of dead node have an election
 Parent of dead node decides
• What if the root dies?
 Must use children since no peers or parent
 If we want to use peers, then we do not have a single root
 I.e. the top level of the hierarchy is a collection of roots that
communicate. This is a forest, not a tree
 What if multiple requests are generated simultaneously?
Processor Allocation
• Gets hard fast, as information gets stale and race
conditions and deadlocks become possible.
• Distributed heuristic algorithm
 Goal - find a lightly loaded processor to migrate the job to
 Send a probe to a random processor
 If the remote load is low, ship the job
 If the remote load is high, try another random probe
 If k probes (k is a parameter of the implementation) all say
the load is too high, give up and run the job locally.
 Modelled analytically and seen to work fairly well
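• A toy simulation in C of this random-probe heuristic; the machine loads, the load threshold, and k are made-up parameters, and a real implementation would query remote machines over the network rather than an array.

/* Random-probe load sharing: probe up to K random machines; ship the job
 * to the first one whose load is below THRESHOLD, else run it locally.   */
#include <stdio.h>
#include <stdlib.h>

#define NMACHINES 16
#define THRESHOLD 3      /* a load below this counts as "lightly loaded" */
#define K         4      /* give up after this many probes               */

int main(void)
{
    int load[NMACHINES];
    srand(42);
    for (int i = 0; i < NMACHINES; i++)
        load[i] = rand() % 8;                  /* pretend per-machine loads */

    int target = -1;
    for (int probe = 0; probe < K; probe++) {
        int m = rand() % NMACHINES;            /* probe a random machine */
        if (load[m] < THRESHOLD) { target = m; break; }
    }

    if (target >= 0)
        printf("ship the job to machine %d (load %d)\n", target, load[target]);
    else
        printf("all %d probes were busy; run the job locally\n", K);
    return 0;
}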
 General goal is to have processes that
communicate frequently run simultaneously
 If they don’t and we use busy waiting for
messages, we will have a huge disaster.
 Even if we use context switching, we may have
a small disaster, as only one message transfer
can occur per scheduling time slot
 Co-scheduling (a.k.a. gang scheduling).
Processes belonging to a job are scheduled together.
• Time slots are coordinated among the processors.
• Some slots are for gangs; other slots are for regular processes.
Taxonomy of Parallel Computers
• Although many researchers have tried to
come up with a taxonomy of parallel
computers, the only one which is widely
used is that of Flynn (1972).
• This classification is based on two concepts
 instruction streams
• corresponding to a program counter
 data streams
• consisting of a set of operands
Taxonomy of Parallel Computers
Memory Semantics
 Even though all multiprocessors present the
CPUs with the image of a single shared address
space, often there are many memory modules
present, each holding some portion of the
physical memory.
• The CPUs and memories are often interconnected
by a complex interconnection network.
• Several CPUs may be attempting to read a memory
word at the same time several other CPUs are
attempting to write the same word.
• Multiple copies of some blocks may be in caches.
Memory Semantics
 One view of memory semantics is to view it as
a contract between the software and the
memory hardware.
• The rules are called consistency models, and many
different ones have been proposed and implemented.
• For example, suppose that CPU 0 writes the value 1
to some memory word and a little later CPU 1 writes
the value 2 to the same word.
• Now CPU 2 reads the word and gets the value 1.
• Is this an error?
Memory Semantics
 The simplest model is strict consistency.
• With this model, any read to a location x, always returns the
value of the most recent write to x.
• This model is great for programmers, but almost impossible to implement.
 The next best model is called sequential consistency.
• The basic idea is that in the presence of multiple read and write
requests, some interleaving of all the requests is chosen by the
hardware (nondeterministically), but all CPUs see the same order.
Memory Semantics
 A looser consistency model, but one that is
easier to implement on large multiprocessors, is
processor consistency. It has two properties:
• Writes by any CPU are seen by all CPUs in the
order they were issued.
• For every memory word, all CPUs see all writes to it
in the same order.
• If CPU 1 issues writes with values 1A, 1B, and 1C
to some memory location in that sequence, then all
other processors see them in that order too.
• Every memory word has an unambiguous value
after several CPUs write to it and stop.
Memory Semantics
 Weak consistency does not even guarantee that writes
from a single CPU are seen in that order.
• One CPU might see 1A before 1B and another CPU might see
1A after 1B.
• However, to add some order, weakly consistent memories have
one or more synchronization variables.
 When a synchronization is executed, all pending writes are
finished and no new ones are started until all the old ones are
done and the synchronization itself is done.
 In effect a synchronization “flushes the pipeline” and brings the
memory to a stable state with no operations pending.
 Time is divided into epochs delimited by the synchronizations.
Memory Semantics
 Weak consistency has the problem that it is
quite inefficient because it must finish off all
pending memory operations and hold all new
ones until the current ones are done.
 Release consistency improves matters by
adopting a model akin to critical sections.
• The idea behind this model is that when a process
exits a critical region it is not necessary to force all
writes to complete immediately. It is only necessary
to make sure that they are done before any process
enters the critical region again.
Memory Semantics
 In this model, the synchronization operation
offered by weak consistency is split into two
different operations.
• To read or write a shared data variable, a CPU must
first do an acquire operation on the synchronization
variable to get exclusive access to the shared data.
• When it is done, the CPU does a release operation
on the synchronization variable to indicate that it is finished.
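• A small sketch of the acquire/release idea expressed with C11 atomics and two threads; this mirrors the model at the programming-language level rather than at the multiprocessor-hardware level described above.

/* Acquire/release sketch: the writer updates shared data and then performs
 * a release; the reader performs an acquire and only then reads the data.  */
#include <stdio.h>
#include <pthread.h>
#include <stdatomic.h>

static int shared_data;                  /* protected by the flag below        */
static atomic_int ready = 0;             /* plays the synchronization variable */

static void *writer(void *arg)
{
    (void)arg;
    shared_data = 42;                                        /* ordinary write */
    atomic_store_explicit(&ready, 1, memory_order_release);  /* "release"      */
    return NULL;
}

static void *reader(void *arg)
{
    (void)arg;
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                                                    /* "acquire" (spin) */
    printf("reader sees shared_data = %d\n", shared_data);   /* guaranteed 42    */
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, writer, NULL);
    pthread_create(&b, NULL, reader, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}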
UMA Bus-Based SMP
• The simplest multiprocessors are based on a
single bus.
 Two or more CPUs and one or more memory
modules all use the same bus for communication.
 If the bus is busy when a CPU wants to read
memory, it must wait.
 Adding more CPUs results in more waiting.
 This can be alleviated by having a private cache
for each CPU.
UMA Bus-Based SMP
Snooping Caches
 With caches a CPU may have stale data in its
private cache.
 This problem is known as the cache coherence
or cache consistency problem.
 This problem can be controlled by algorithms
called cache coherence protocols.
• In all solutions, the cache controller is specially
designed to allow it to eavesdrop on the bus,
monitoring all bus requests and taking action in
certain cases.
• These devices are called snooping caches.
Snooping Caches
MESI Cache Coherence
 When a protocol has the property that not all writes go
directly through to memory (a bit is set instead and the
cache line is eventually written to memory) we call it a
write-back protocol.
 One popular write-back protocol is called the MESI protocol.
• It is used by the Pentium II and other CPUs.
• Each cache entry can be in one of four states:
 Invalid - the cache entry does not contain valid data
 Shared - multiple caches may hold the line; memory is up to date
MESI Cache Coherence
 Exclusive - no other cache holds the line; memory is up to date
 Modified - the entry is valid; memory is invalid; no copies exist
• Initially all cache entries are invalid
• The first time memory is read, the cache line is marked E
• If some other CPU reads the data, the first CPU sees this on the
bus, announces that it holds the data as well, and both entries
are marked S (shared)
• If one of the CPUs writes the cache entry, it tells all other
CPUs to invalidate their entries (I) and its entry is now in the
M (modified) state.
MESI Cache Coherence
• If some other CPU now wants to read the modified
line from memory, the cached copy is sent to
memory, and all CPUs needing it read it from
memory. They are marked as S.
• If we write to an uncached line and write-allocate is in
use, we will load the line, write to it, and mark it as M.
• If write-allocate is not in use, the write goes directly
to memory and the line is not cached anywhere.
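• A toy C encoding of these MESI transitions for a single cache line; a real cache controller also issues the bus transactions (invalidations, write-backs) that this sketch only notes in comments.

/* MESI state machine for one cache line, driven by local accesses and by
 * snooped bus traffic. Only the state is tracked here.                    */
#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;
typedef enum { LOCAL_READ, LOCAL_WRITE, SNOOP_READ, SNOOP_WRITE } event_t;

static mesi_t next_state(mesi_t s, event_t e, int others_have_copy)
{
    switch (e) {
    case LOCAL_READ:
        if (s == INVALID)                 /* miss: E if alone, S otherwise  */
            return others_have_copy ? SHARED : EXCLUSIVE;
        return s;                         /* hits do not change the state   */
    case LOCAL_WRITE:
        return MODIFIED;                  /* others are told to invalidate  */
    case SNOOP_READ:
        /* another CPU reads the line: M or E drops to S; a modified line
         * is written back to memory first.                                 */
        return (s == INVALID) ? INVALID : SHARED;
    case SNOOP_WRITE:
        return INVALID;                   /* another CPU wrote: copy is stale */
    }
    return s;
}

static const char *name[] = { "I", "S", "E", "M" };

int main(void)
{
    mesi_t s = INVALID;
    s = next_state(s, LOCAL_READ, 0);  printf("after first read:         %s\n", name[s]);
    s = next_state(s, SNOOP_READ, 1);  printf("after another CPU reads:  %s\n", name[s]);
    s = next_state(s, LOCAL_WRITE, 1); printf("after local write:        %s\n", name[s]);
    s = next_state(s, SNOOP_WRITE, 1); printf("after another CPU writes: %s\n", name[s]);
    return 0;
}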
MESI Cache Coherence
UMA Multiprocessors Using
Crossbar Switches
 Even with all possible optimizations, the use of
a single bus limits the size of a UMA
multiprocessor to about 16 or 32 CPUs.
• To go beyond that, a different kind of
interconnection network is needed.
• The simplest circuit for connecting n CPUs to k
memories is the crossbar switch.
 Crossbar switches have long been used in telephone switching exchanges.
 At each intersection is a crosspoint - a switch that can be
opened or closed.
 The crossbar is a nonblocking network.
UMA Multiprocessors Using
Crossbar Switches
Sun Enterprise 10000
 An example of a UMA multiprocessor based on
a crossbar switch is the Sun Enterprise 10000.
• This system consists of a single cabinet with up to
64 CPUs.
• The crossbar switch is packaged on a circuit board
with eight plug-in slots on each side.
• Each slot can hold up to four UltraSPARC CPUs
and 4 GB of RAM.
• Data is moved between memory and the caches on a
16 × 16 crossbar switch.
• There are four address buses used for snooping.
Sun Enterprise 10000
UMA Multiprocessors Using
Multistage Switching Networks
 In order to go beyond the limits of the Sun
Enterprise 10000, we need to have a better
interconnection network.
 We can use 2 × 2 switches to build large
multistage switching networks.
• One example is the omega network.
• The wiring pattern of the omega network is called
the perfect shuffle.
• The labels of the memory modules can be used for routing
packets through the network.
• The omega network is a blocking network.
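• A small C simulation of destination-tag routing through an 8 × 8 omega network built from 2 × 2 switches with perfect-shuffle wiring; the network size and the sample source/destination pairs are arbitrary.

/* Destination-tag routing through an omega network: at each stage the next
 * bit of the destination address selects the upper (0) or lower (1) output
 * of the 2x2 switch; after log2(N) stages the packet arrives at its memory. */
#include <stdio.h>

#define K 3                     /* log2 of the network size */
#define N (1 << K)              /* 8 CPUs and 8 memories    */

/* Perfect shuffle: rotate the K address bits left by one position. */
static int shuffle(int line)
{
    return ((line << 1) | (line >> (K - 1))) & (N - 1);
}

static void route(int src, int dst)
{
    int line = src;
    printf("CPU %d -> memory %d:", src, dst);
    for (int stage = 0; stage < K; stage++) {
        line = shuffle(line);                    /* wiring into this stage  */
        int bit = (dst >> (K - 1 - stage)) & 1;  /* dest bit picks the port */
        line = (line & ~1) | bit;                /* upper (0) or lower (1)  */
        printf("  stage %d: switch %d out %d", stage, line >> 1, bit);
    }
    printf("  => line %d\n", line);              /* always equals dst       */
}

int main(void)
{
    route(2, 6);
    route(5, 3);
    return 0;
}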
UMA Multiprocessors Using
Multistage Switching Networks
NUMA Multiprocessors
 To scale to more than 100 CPUs, we have to
give up uniform memory access time.
 This leads to the idea of NUMA (NonUniform
Memory Access) multiprocessors.
• They share a single address space across all the
CPUs, but unlike UMA machines local access is
faster than remote access.
• All UMA programs run without change on NUMA
machines, but the performance is worse.
 When the access time to the remote machine is not hidden
(by caching) the system is called NC-NUMA.
NUMA Multiprocessors
 When coherent caches are present, the system is called CC-NUMA.
 It is also sometimes known as hardware DSM since it is
basically the same as software distributed shared memory
but implemented by the hardware using a small page size.
• One of the first NC-NUMA machines was the
Carnegie Mellon Cm*.
 This system was implemented with LSI-11 CPUs (the
LSI-11 was a single-chip version of the DEC PDP-11).
 A program running out of remote memory took ten times
as long as one using local memory.
 Note that there is no caching in this type of system so
there is no need for cache coherence protocols.
NUMA Multiprocessors
Cache Coherent NUMA
 Not having a cache is a major handicap.
 One of the most popular approaches to building
large CC-NUMA (Cache Coherent NUMA)
multiprocessors currently is the directory-based multiprocessor.
• Maintain a database telling where each cache line is
and what its status is.
• The database is kept in special-purpose hardware that
responds in a fraction of a bus cycle.
Cache Coherent NUMA
DASH Multiprocessor
 The first directory-based CC-NUMA
multiprocessor, DASH (Directory
Architecture for SHared Memory), was built
at Stanford University as a research project.
• It has heavily influenced a number of commercial
products such as the SGI Origin 2000
• The prototype consists of 16 clusters, each one
containing a bus, four MIPS R3000 CPUs, 16 MB
of global memory, and some I/O equipment.
• Each CPU snoops on its local bus, but not on any
other buses, so global coherence needs a different mechanism.
DASH Multiprocessor
 Each cluster has a directory that keeps track of which
clusters currently have copies of its lines.
 Each cluster in DASH is connected to an interface that
allows the cluster to communicate with other clusters.
• The interfaces are connected in a rectangular grid.
• A cache line can be in one of three states: UNCACHED, SHARED, or MODIFIED.
• The DASH protocols are based on ownership and invalidation.
DASH Multiprocessor
• At every instant each cache line has a unique owner.
 For UNCACHED or SHARED lines, the line’s home
cluster is the owner
 For MODIFIED lines, the cluster holding the one and only
copy is the owner.
• Requests for a cache line work their way out from
the cluster to the global network.
• Maintaining memory consistency in DASH is fairly
complex and slow.
• A single memory access may require a substantial
number of packets to be sent.
Sequent NUMA-Q
 The DASH was an important project, but it was never a
commercial system.
 As an example of a commercial CC-NUMA
multiprocessor, consider the Sequent NUMA-Q 2000.
• It uses an interesting and important cache coherence protocol
called SCI (Scalable Coherent Interface).
• The NUMA-Q is based on the standard quad board sold by
Intel containing four Pentium Pro CPU chips and up to 4 GB
of RAM.
 The caches on each quad board are kept coherent by using the MESI protocol.
Sequent NUMA-Q
 Each quad board is extended with an IQ-Link
board plugged into a slot designed for network controllers.
• The IQ-Link primarily implements the SCI protocol.
• It holds 32 MB of cache, a directory for the cache, a
snooping interface to the local quad board bus and a
custom chip called the data pump that connects it
with other IQ-Link boards.
 It pumps data from the input side to the output side,
keeping data aimed at its node and passing other data along the ring.
 Together all the IQ-link boards form a ring.
Sequent NUMA-Q
Distributed Shared Memory
 A collection of CPUs sharing a common paged
virtual address space is called DSM
(Distributed Shared Memory).
• When a CPU accesses a page in its own local RAM,
the read or write just happens without any further action.
• If the page is in a remote memory, a page fault is generated.
• The runtime system or OS sends a message to the
node holding the page to unmap it and send it over.
• Read-only pages may be shared.
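• A minimal user-level sketch of this fault-and-fetch cycle on Unix, using mprotect and a SIGSEGV handler; the "remote" fetch is faked by filling the page locally, so only the protect/fault/map-in/restart skeleton is meant to be realistic.

/* User-level DSM skeleton: pages start out inaccessible; touching one
 * faults, the handler "fetches" it (here: just fills it locally as a
 * stand-in for asking the remote owner), maps it in, and the faulting
 * LOAD or STORE is restarted automatically when the handler returns.   */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <signal.h>
#include <unistd.h>
#include <sys/mman.h>

static char *region;                 /* the "shared" address space */
static long  pagesize;

static void fault_handler(int sig, siginfo_t *info, void *ctx)
{
    (void)sig; (void)ctx;
    char *page = (char *)((uintptr_t)info->si_addr & ~(uintptr_t)(pagesize - 1));
    /* A real DSM system would here ask the node owning the page to unmap
     * it and send its contents over the interconnection network.          */
    mprotect(page, (size_t)pagesize, PROT_READ | PROT_WRITE);
    memset(page, 'A', (size_t)pagesize);    /* pretend these bytes arrived */
}

int main(void)
{
    pagesize = sysconf(_SC_PAGESIZE);
    region = mmap(NULL, (size_t)(4 * pagesize), PROT_NONE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED) return 1;

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = fault_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    /* This LOAD faults, the handler maps the page in, and it restarts. */
    printf("first byte of 'remote' page 2: %c\n", region[2 * pagesize]);
    return 0;
}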
Distributed Shared Memory
 Pages, however, are an unnatural unit for
sharing, so other approaches have been tried.
 Linda provides processes on multiple machines
with a highly structured distributed shared memory.
• The memory is accessed through a small set of
primitive operations that can be added to existing
languages such as C and FORTRAN.
• The unifying concept behind Linda is that of an
abstract tuple space.
• Four operations are provided on tuples:
Distributed Shared Memory
• out, puts a tuple into the tuple space
• in, retrieves a tuple from the tuple space.
 The tuples are addressed by content, rather than by name.
• read is like in but it does not remove the tuple from the tuple space.
• eval causes its parameters to be evaluated in parallel and the
resulting tuple to be deposited in the tuple space.
 Various implementations of Linda exist on multicomputers.
• Broadcasting and directories are used for distributing the tuples.
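• A toy, single-process stand-in for a tuple space in C: tuples here are just (name, integer) pairs, matching is by name only, and nothing blocks or runs on another machine, so this only illustrates the flavour of out, in, and read.

/* Toy tuple space: out() adds a tuple, in() matches by content and removes
 * it, read() matches without removing. Real Linda tuples are typed and
 * variable-arity, and in()/read() block until a match appears.            */
#include <stdio.h>
#include <string.h>

#define MAX_TUPLES 100

struct tuple { char name[32]; int value; int used; };
static struct tuple space[MAX_TUPLES];

static void ts_out(const char *name, int value)          /* out(): add a tuple */
{
    for (int i = 0; i < MAX_TUPLES; i++)
        if (!space[i].used) {
            strncpy(space[i].name, name, sizeof space[i].name - 1);
            space[i].value = value;
            space[i].used = 1;
            return;
        }
}

static int ts_find(const char *name)                     /* match by content */
{
    for (int i = 0; i < MAX_TUPLES; i++)
        if (space[i].used && strcmp(space[i].name, name) == 0)
            return i;
    return -1;
}

static int ts_in(const char *name, int *value)           /* in(): remove + read */
{
    int i = ts_find(name);
    if (i < 0) return 0;                                  /* real in() blocks */
    *value = space[i].value;
    space[i].used = 0;
    return 1;
}

static int ts_read(const char *name, int *value)          /* read(): leave tuple */
{
    int i = ts_find(name);
    if (i < 0) return 0;
    *value = space[i].value;
    return 1;
}

int main(void)
{
    int v;
    ts_out("count", 7);
    if (ts_read("count", &v)) printf("read -> %d (tuple stays)\n", v);
    if (ts_in("count", &v))   printf("in   -> %d (tuple removed)\n", v);
    printf("second in matches: %s\n", ts_in("count", &v) ? "yes" : "no");
    return 0;
}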
Distributed Shared Memory
 Orca uses full-blown objects rather than tuples
as the unit of sharing.
 Objects consist of internal state plus operations
for changing the state.
 Each Orca method consists of a list of (guard,
block-of-statements) pairs.
• A guard is a Boolean expression that does not
contain any side effects, or the empty guard, which
is simply true.
• When an operation is invoked, all of its guards are
evaluated in an unspecified order.
Distributed Shared Memory
• If all of them are false, the invoking process is
delayed until one becomes true.
• When a guard is found that evaluates to true, the
block of statements following it is executed.
• Orca has a fork statement to create a new process on
a user-specified processor.
• Operations on shared objects are atomic and
sequentially consistent.
• Orca integrates shared data and synchronization in a
way not present in page-based DSM systems.
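• A rough C analogue of one guarded operation, using a mutex and condition variable: the guard (value > 0) is re-evaluated under the lock and the caller is delayed until it becomes true, after which the statement block runs atomically. Real Orca also replicates the object and keeps all copies consistent.

/* Guarded operation sketch: take() is enabled only when its guard holds;
 * callers whose guards are all false are delayed, as in the Orca model.  */
#include <stdio.h>
#include <pthread.h>
#include <unistd.h>

static int value;                          /* the shared object's state */
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;

static void op_increment(void)             /* guard: true (always enabled) */
{
    pthread_mutex_lock(&m);
    value++;
    pthread_cond_broadcast(&c);            /* guards elsewhere may now hold */
    pthread_mutex_unlock(&m);
}

static int op_take(void)                   /* guard: value > 0 */
{
    pthread_mutex_lock(&m);
    while (!(value > 0))                   /* guard false: delay the caller */
        pthread_cond_wait(&c, &m);
    int result = value--;                  /* guarded block, run atomically */
    pthread_mutex_unlock(&m);
    return result;
}

static void *producer(void *arg)
{
    (void)arg;
    sleep(1);                              /* make the consumer wait a bit */
    op_increment();
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);
    printf("take() returned %d\n", op_take());  /* blocks until guard holds */
    pthread_join(t, NULL);
    return 0;
}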
Distributed Shared Memory