DISTRIBUTED COMPUTING
Fall 2007
ROAD MAP: OVERVIEW
• Why are distributed systems interesting?
• Why are they hard?
GOALS OF DISTRIBUTED SYSTEMS
Take advantage of cost/performance difference between
microprocessors and shared memory multiprocessors
Build systems:
1. with a single system image
2. with higher performance
3. with higher reliability
4. for less money than uniprocessor systems
In wide-area distributed systems, information and work are
physically distributed, implying that computing needs
should be distributed. Besides improving response time,
this contributes to political goals such as local control
over data.
WHY SO HARD?
A distributed system is one in which each
process has imperfect knowledge of the
global state.
Reasons: Asynchrony and failures
We discuss problems that these two features
raise and algorithms to address these
problems.
Then we discuss implementation issues for
real distributed systems.
ANATOMY OF A DISTRIBUTED
SYSTEM
A set of asynchronous computing devices
connected by a network. Normally, no
global clock.
Communication is either through messages
or shared memory. Shared memory is
usually harder to implement.
ANATOMY OF A DISTRIBUTED SYSTEM (cont.)
[Figures: each processor has its own clock; processors are connected by an arbitrary network, or alternatively by a broadcast medium.]
Special protocols will be possible for the broadcast medium.
COURSE GOALS
1. To help you understand which system
assumptions are important.
2. To present some interesting and useful
distributed algorithms and methods of
analysis, then have you apply them
under challenging conditions.
3. To explore the sources for distributed
intelligence.
BASIC COMMUNICATION PRIMITIVE:
MESSAGE PASSING
Paradigm:
– Send message to destination
– Receive message from origin
Nice property: can make distribution
transparent, since it does not matter
whether destination is at a local computer
or at a remote one (except for failures).
Clean framework: “Paradigms for Process Interaction in
Distributed Programs,” G. R. Andrews, ACM Computing
Surveys 23:1 (March 1991) pp. 49-90.
BLOCKING (SYNCHRONOUS) VS.
NON-BLOCKING (ASYNCHRONOUS)
COMMUNICATION
For sender: Should the sender wait for the
receiver to receive a message or not?
For receiver: When arriving at a reception
point and there is no message waiting,
should the receiver wait or proceed?
Blocking receive is normal (i.e., receiver
waits).
[Figures: blocking send — the sender performs no computation until the receiver’s ACK arrives; non-blocking send — the sender continues computing, and an ACK may or may not be expected.]
REMOTE PROCEDURE CALL
Client calls the server using a call of the form server(in
parameters; out parameters). The call can appear anywhere that a
normal procedure call can.
Server returns the result to the client.
Client blocks while waiting for response from server.
[Figure: client calls server; server returns the result while the client blocks.]
RENDEZVOUS FACILITY
– One process sends a message to another process and blocks at
least until that process accepts the message.
– The receiving process blocks when it is waiting to accept a
request.
Thus, the name: Only when both processes are ready for the data
transfer, do they proceed.
We will see examples of rendezvous interactions in CSP and Ada.
[Figure: the sender’s send blocks until the receiver’s accept; an “accepted” reply releases the sender.]
Beyond send-receive: Conversations
Needed when a continuous connection is
more efficient and/or only some of the data
is available at a time.
Bob and Alice: Bob initiates, Alice responds,
then Bob, then Alice, …
But what if Bob wants Alice to send messages
as they arrive without Bob’s doing more than
an ack?
Send-only or receive-only mode.
Others?
SEPARATION OF CONCERNS
Separation of concerns is the software
engineering principle that each
component should have a single small job
to do so it can do it well.
In distributed systems, there are at least
three concerns having to do with remote
services: what to request, where to do it,
how to ask for it.
IDEAL SEPARATION
• What to request: application programmer
must figure this out, e.g. access customer
database.
• Where to do it: application programmer
should not need to know where, because
this adds complexity + if location changes,
applications break.
• How to ask for it: want a uniform interface.
WHERE TO DO IT:
ORGANIZATION OF CLIENTS AND SERVERS
A service is a piece of work to do. Will be done by a server.
A client who wants a service sends a message to a service broker
for that service. The server gets work from the broker and
commonly responds directly to the client. A server is a process.
More basic approach: Each server has a port from which it can
receive requests.
Difference: In client-broker-server model, many servers can offer
the same service. In direct client-server approach, client must
request a service from a particular server.
[Figure: many clients send requests to a service broker, which distributes the work among many servers.]
ALTERNATIVE: NAME SERVER
A service is a piece of work to do. Will be done by a
server. Name Server knows where services are done
Example: Client requests the address of a server from the
Name Server and then communicates directly with that
server.
Difference: Client-server communication is direct, so may
be more efficient.
[Figure: clients obtain a server’s address from the name server, then communicate with the server directly.]
HOW TO ASK FOR IT:
OBJECT-BASED
• Encapsulation of data behind functional
interface.
• Inheritance is optional but interface is the
contract.
• So need a technique for both synchronous
and asynchronous procedure calls.
REFERENCE EXAMPLE:
CORBA OBJECT REQUEST
BROKER
• Send operation to ORB with its
parameters.
• ORB routes operation to proper site for
execution.
• Arranges for response to be sent to you
directly or indirectly.
• Operations can be “events” so can allow
interrupts from servers to clients.
SUCCESSORS TO CORBA
Microsoft Products
• COM: allow objects to call one another in
a centralized setting: classes + objects of
those classes. Can create objects and
then invoke them.
• DCOM: COM + Object Request Broker.
• ActiveX: DCOM for the Web.
SUCCESSORS TO CORBA
Java RMI
• Remote Method invocation (RMI): Define
a service interface in Java.
• Register the server in RMI repository, i.e.,
an object request broker.
• Client may access Server through
repository.
• Notion of distributed garbage collection
SUCCESSORS TO CORBA
Enterprise Java Beans
• Beans are again objects but can be
customized at runtime.
• Support distributed transaction notion
(later) as well as backups.
• So transaction notion for persistent
storage is another concern it is nice to
separate.
REDUCING BUREAUCRACY:
automatic registration
• SUN also developed an abstraction
known as JINI.
• New device finds a lookup service (like an
ORB), uploads its interface, and then
everyone can access.
• No need to register.
• Requires a trusted environment.
COOPERATING DISTRIBUTED SYSTEMS:
LINDA
• Linda supports a shared data structure called a tuple
space.
• Linda tuples, like database system records, consist of
strings and integers. We will see that in the matrix
example below.
[Figure: processes communicating through a shared tuple space.]
LINDA OPERATIONS
The operations are out (add a tuple to the space); in
(read and remove a tuple from the space); and
read (read but don’t remove a tuple from the tuple
space).
A pattern-matching mechanism is used so that
tuples can be extracted selectively by specifying
values or data types of some fields.
in (“dennis”, ?x, ?y, ….)
– gets tuple whose first field contains “dennis,”
assigns values in second and third fields of the
tuple to x and y, respectively.
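The out/in/read operations can be sketched in a few lines. This is a minimal single-process illustration, not a real Linda runtime: the class name TupleSpace and the method names out, in_, and read are invented here, and a formal like ?x is written as a Python type.

```python
# Minimal single-process sketch of a Linda-style tuple space.
# TupleSpace, out, in_, read are invented names, not a real Linda API.
class TupleSpace:
    def __init__(self):
        self.tuples = []

    def out(self, *tup):
        """Add a tuple to the space."""
        self.tuples.append(tuple(tup))

    def _match(self, pattern, tup):
        # A field matches if the pattern gives an equal concrete value,
        # or a type (standing in for a "?x" formal) the field belongs to.
        if len(pattern) != len(tup):
            return False
        return all(f == v if not isinstance(f, type) else isinstance(v, f)
                   for f, v in zip(pattern, tup))

    def in_(self, *pattern):
        """Read and remove the first matching tuple."""
        for tup in self.tuples:
            if self._match(pattern, tup):
                self.tuples.remove(tup)
                return tup
        return None  # a real implementation would block instead

    def read(self, *pattern):
        """Read a matching tuple without removing it."""
        for tup in self.tuples:
            if self._match(pattern, tup):
                return tup
        return None

ts = TupleSpace()
ts.out("dennis", 3, 4)
print(ts.in_("dennis", int, int))  # ('dennis', 3, 4); formals ?x ?y as types
```

A real Linda in blocks until a matching tuple appears; this sketch simply returns None.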
EXAMPLE: MATRIX MULTIPLICATION
There are two matrices A and B. We store A’s rows and
B’s columns as tuples.
(“A”, 1, A’s first row), (“A”, 2, A’s second row) ….
(“B”, 1, B’s first column), (“B”, 2, B’s second column) ….
(“Next”, 15)
There is a global counter called Next in the range 1 ..
number of rows of A x number of columns of B.
A process performs an “in” on Next, records the value,
and performs an “out” on Next+1, provided Next is still
in its range.
Convert Next into the row number i and column number j
such that Next = i × (total number of columns) + j.
ACTUAL MULTIPLICATION
First find i and j.
in (“Next”, ?temp);
out (“Next”, temp +1);
convert (temp, i, j);
Given i and j, a process just reads the values
and outputs the result.
read (“A”, i, ?row_values)
read (“B”, j, ?col_values)
out (“result”, i, j, Dotproduct(row_values, col_values)).
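The counter-and-convert scheme above can be sketched for a 2×2 case. A plain dict stands in for the tuple space, and a single worker() function is invented for illustration; this single-process version has none of the real concurrency.

```python
# Sketch of the Linda matrix multiply for a 2x2 case. A plain dict stands in
# for the tuple space (single process, so no real Linda semantics).
A = {1: [1, 2], 2: [3, 4]}     # ("A", i, row) tuples
B = {1: [5, 7], 2: [6, 8]}     # ("B", j, column) tuples
n_rows, n_cols = 2, 2
space = {"Next": 0}            # ("Next", counter) tuple, 0-based here
results = {}

def dot(xs, ys):
    return sum(x * y for x, y in zip(xs, ys))

def worker():
    while True:
        temp = space["Next"]               # in ("Next", ?temp)
        if temp >= n_rows * n_cols:        # counter out of range: stop
            return
        space["Next"] = temp + 1           # out ("Next", temp + 1)
        i, j = divmod(temp, n_cols)        # convert(temp, i, j)
        # read row i of A, column j of B; out the result tuple
        results[(i + 1, j + 1)] = dot(A[i + 1], B[j + 1])

worker()
print(results)  # {(1, 1): 19, (1, 2): 22, (2, 1): 43, (2, 2): 50}
```

With a real tuple space, many workers would run this same loop in parallel; the in on "Next" serializes the counter updates.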
LINDA IMPLEMENTATION OF
SHARED TUPLE SPACE
The implementers assert that the work
represented by the tuples is large enough
so that there is no need for shared
memory hardware.
The question is how to implement out, in,
and read (as well as inp and readp).
BROADCAST IMPLEMENTATION 1
Implement out by broadcasting the argument of out to all sites.
(Use a negative acknowledgement protocol for the broadcast.)
To implement read, perform the read from the local memory.
To implement in, perform a local read and then attempt to delete
the tuple from all other sites.
If several sites perform an in, only one site should succeed.
One approach is to have the site originating the tuple decide
which site deletes.
Summary: good for reads and outs, not so good for ins.
BROADCAST IMPLEMENTATION 2
Implement out by writing locally.
Implement in and read by a global query.
(This may have to be repeated if the data
is not present.)
Summary: better for out. Worse for read.
Same for in.
COMMUNICATION REVIEW
Basic distributed communication when no
shared memory: send/receive.
Location transparency: broker or name server
or tuple space.
Synchrony and asynchrony are both useful
(e.g. real-time vs. informational sensors).
Other mechanisms are possible
COMMUNICATION BY SHARED
MEMORY: beyond locks
Framework: Herlihy, Maurice. “Impossibility and Universality Results for
Wait-Free Synchronization,” ACM SIGACT-SIGOPS Symposium
on Principles of Distributed Computing (PODC), 1988.
In a system that uses mutual exclusion, it is possible that one
process may stop while holding a critical resource and
hang the entire system.
It is of interest to find “wait-free” primitives, in which no
process ever waits for another one.
The primitive operations include test-and-set, fetch-and-add,
and fetch-and-cons.
Herlihy shows that certain operations are strictly more
powerful than others for wait-free synchronization.
CAN MAKE ANYTHING WAIT-FREE
(at a time price)
Don’t maintain the data structure at all. Instead, just keep
a history of the operations.
enq(x)
put enq(x) on end of history list (fetch-and-cons)
end enq(x)
deq
put deq on end of history list (fetch-and-cons)
“replay the history” and figure out what to return
end deq
Not extremely practical: the deq takes
O(number of deq’s + number of enq’s) time.
Suggestion is to have certain operations reconstruct the
state in an efficient manner.
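The history-only queue above can be sketched directly. This is a single-threaded illustration of the idea (the names history, enq, and deq are invented); the fetch-and-cons step is modeled by appending an operation record.

```python
# Sketch of the "keep only a history" queue: enq/deq append an operation
# record (the fetch-and-cons step); deq replays the history for its answer.
history = []

def enq(x):
    history.append(("enq", x))

def deq():
    history.append(("deq",))
    # Replay every operation up to (and including) this deq; the value of
    # the last deq in the replay is the answer for this call.
    queue, result = [], None
    for op in history:
        if op[0] == "enq":
            queue.append(op[1])
        else:
            result = queue.pop(0) if queue else None
    return result

enq(10)
enq(20)
print(deq())  # 10
print(deq())  # 20
```

The replay makes the cost visible: each deq walks the whole history, which is the O(number of deq’s + number of enq’s) time mentioned above.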
GENERAL METHOD: COMPARE-AND-SWAP
Compare-and-swap takes two values: v and v’. If the
register’s current value is v, it is replaced by v’,
otherwise it is left unchanged. The register’s old
value is returned.
temp := compare-and-swap (register, 0, i)
if register = 0 then register := i
else register is unchanged
Use this primitive to perform atomic updates to a
data structure.
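The semantics in the box can be written out as a small sketch. A real compare-and-swap is a hardware primitive; here a one-element list stands in for the register, and the function name is invented.

```python
# Sketch of compare-and-swap semantics (single-threaded illustration;
# a real CAS is an atomic hardware instruction).
def compare_and_swap(cell, expected, new):
    """If cell holds `expected`, store `new`; always return the old value."""
    old = cell[0]
    if old == expected:
        cell[0] = new
    return old

register = [0]
i = 7
temp = compare_and_swap(register, 0, i)   # succeeds: register was 0
print(temp, register[0])                  # 0 7
temp = compare_and_swap(register, 0, 99)  # fails: register is now 7
print(temp, register[0])                  # 7 7
```

The usual atomic-update pattern: read the old value, compute the new one, then CAS; if the returned value differs from the one read, another process got there first, so retry.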
In the following figure, what should the compare-and-swap do?
PERSISTENT DATA STRUCTURES AND WAIT-FREEDOM
[Figure: a tree-shaped persistent data structure with a “current” pointer. One node added, one node removed; to establish the change, swing the current pointer. The old tree remains available.]
Important point: If the process doing the change should abort, then no other process is affected.
Logical Level is Nice, but…
• We have talked about some programming
constructs one can use above a
communications infrastructure.
• Understanding that infrastructure will be
necessary to understand performance and
fault tolerance considerations.
• Our discussion of that will come from Joe
Conron’s lecture notes datacomessence.ppt
ORDER-PRESERVING BROADCAST
PROTOCOLS ON BROADCAST NET
Framework: Chang, Jo-Mei. “Simplifying
Distributed Database Systems Design by Using a
Broadcast Network,” ACM SIGMOD, June 1984.
• Proposes a virtual distributed system that implements ordered atomic
broadcast and failure detection.
• Shows that this makes designing the rest of system easier.
• Shows that implementing these two primitives isn’t so hard.
Paradigm: find an appropriate intermediate level of
abstraction that can be implemented and that facilitates
the higher functions.
Build Facilities that use Broadcast Network.
Implement Atomic Broadcast Network.
RATIONALE
• Use property of current networks, which
are naturally broadcast, although not so
reliable.
• Common tasks of distributed systems:
Send same information to many sites
participating in a transaction (update all
copies); reach agreement (e.g. transaction
commitment).
DESCRIPTION OF ABSTRACT MACHINE
Services and assurances it provides:
• Atomic broadcast: failure atomicity. If a message is
received by an application program at one site, it will be
received at all operational sites.
• System-wide clock and all messages are timestamped in
sequence. This is the effective message order.
Assumptions: Failures are fail-stop, not malicious. So,
for example, the token site will not lie about messages
or sequence numbers.
Network failures require extra memory.
CHANG SCHEME
Tools: Token-passing scheme + positive
acknowledgments + negative
acknowledgements.
[Figure: the sender broadcasts a message; the token site increments its counter and replies with an ack carrying the counter, which commits the message.]
BEAUTY OF NEGATIVE
ACKNOWLEDGMENT
How does a site discover that it hasn’t received a
message?
Non-token site knows that it has missed a
message if there is a gap in the counter values
that it has received. In that case, it requests that
information from the token site (negative ack).
Overhead: one positive acknowledgment per
broadcast message vs. one acknowledgment
per site per message in naïve implementation.
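The gap test itself is one line. The sketch below illustrates it with an invented helper name; the returned counters are the ones a non-token site would request from the token site via negative acks.

```python
# Sketch of gap detection at a non-token site: any counter value missing
# from the received sequence triggers a negative ack to the token site.
def missing_counters(received):
    """Counter values to request from the token site (invented helper)."""
    return [c for c in range(1, max(received) + 1) if c not in received]

print(missing_counters({1, 2, 5, 6}))  # [3, 4]: NACK these to the token site
```

A gap at the end of the sequence is only discovered when a later message (or the token) arrives, which is why the token transfer below must carry the current counter.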
TOKEN TRANSFER
Token transfer is a standard message. The
target site must acknowledge. To become a
token site, the target site must guarantee
that it has received all messages since the
last time it was a token site.
A failure at a non-token site is detected when that site
fails to accept token responsibility.
[Figure: the token site sends “here is the token”; the target replies “I can take it” and becomes the new token site.]
REVISIT ASSUMPTIONS
Sites do not lie about their state (i.e., no
malicious sites; could use authentication).
Sites tell you when they fail (e.g. through
redundant circuitry) or by not responding.
If there is a network partition, then no negative
ack would occur, so must keep message m
around until everyone has acquired the token
after m was sent.
LAMPORT Time, Clocks paper
• What is the proper notion of time for Distributed
Systems?
• Time Is a Partial Order
• The Arrow Relation
• Logical Clocks
• Ordering All Events using a tie-breaking Clock
• Achieving Mutual Exclusion Using This Clock
• Correctness
• Criticisms
• Need for Physical Clocks
• Conditions for Physical Clocks
• Assumptions About Clocks and Messages
• How Do We Achieve Physical Clock Goal?
ROAD MAP: TIME ACCORDING TO
LAMPORT
Languages &
Constructs for
Synchronization
How to model time
in distributed systems
TIME
Assuming there are no failures, the most
important difference between distributed
systems and centralized ones is that
distributed systems have no natural notion of
global time.
– Lamport was the first who built a theory
around accepting this fact.
– That theory has proven to be surprisingly
useful, since the partial order that Lamport
proposed is enough for many applications.
WHAT LAMPORT DOES
1. Paper (reference on next slide) describes a
message-based criterion for obtaining a time
partial order.
2. It converts this time partial order to a total order.
3. It uses the total order to solve the mutual
exclusion problem.
4. It describes a stronger notion of physical time
and gives an algorithm that sometimes achieves
it (depending on quality of local clocks and
message delivery).
NOTIONS OF TIME IN DISTRIBUTED SYSTEMS
Lamport, L. “Time, Clocks, and the Ordering of
Events in a Distributed System,”
Communications of the ACM, vol. 21, no. 7 (July
1978).
– Distributed system consists of a collection of distinct
processes, which are spatially separated. (Each
process has a unique identifier.)
– Communicate by exchanging messages.
– Messages arrive in the order they are sent. (Could be
achieved by hand-shaking protocol.)
– Consequence: Time is partial order in distributed
systems. Some events may not be ordered.
THE ARROW (partial order) RELATION
We say A happens before B or A  B, if:
1. A and B are in the same process and A
happens before B in that process (Assume
processes are sequential.)
2. A is the sending of a message at one process
and B is the receiving of that message at
another process, then A  B.
3. There is a C such that A  C and C  B.
In the jargon,  is an irreflexive partial ordering.
LOGICAL CLOCKS
Clocks are a way of assigning a number to
an event. Each process has its own clock.
For now, clocks will have nothing to do with
real time, so they can be implemented by
counters with no actual timing mechanism.
Clock condition: For any events A and B, if A
→ B, then C(A) < C(B).
IMPLEMENTING LOGICAL CLOCKS
• Each process increments its local clock
between any two successive events.
• Each process puts its local time on each
message that it sends.
• Each process changes its clock C to C’
when it receives message m having
timestamp T. Require that C’> max(C, T).
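The three rules above fit in a small class. This is a sketch (the class and method names are invented), using the common choice C' = max(C, T) + 1 to satisfy C' > max(C, T).

```python
# Sketch of the logical-clock rules (class and method names invented).
class LamportClock:
    def __init__(self, pid):
        self.pid = pid
        self.time = 0

    def local_event(self):
        self.time += 1              # tick between successive events
        return self.time

    def send(self):
        self.time += 1
        return self.time            # timestamp carried on the message

    def receive(self, msg_time):
        # C' > max(C, T): jump past both local clock and message timestamp.
        self.time = max(self.time, msg_time) + 1
        return self.time

p, q = LamportClock("p"), LamportClock("q")
t = p.send()          # p's clock: 1
print(q.receive(t))   # q jumps to max(0, 1) + 1 = 2
```

For the total order of the next slide, compare (time, pid) pairs: the pid breaks ties between events with equal clock values.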
IMPLEMENTATION OF LOGICAL CLOCKS
[Figure: a message timestamped 13 arrives at a process whose clock reads 8; the receiver’s clock jumps to 14 because of the timestamp on the received message. A second message timestamped 13 arrives at a process whose clock reads 18; the receiver’s clock is unaffected by the message timestamp, since 18 already exceeds it, and simply advances to 19.]
ORDERING ALL EVENTS
We want to define a total order .
Suppose two events occur in the same
process, then they are ordered by the first
condition.
Suppose A and B occur in different processes,
i and j. Use process ids to break ties.
LC(A) = A|i, i.e. A concatenated with i.
LC(B) = B|j.
The total ordering  is called Lamport clock.
ACHIEVING MUTUAL EXCLUSION
USING THIS CLOCK
Goals:
1. Only one process can hold the resource at
a time.
2. Requests must be granted in the order in
which they are made.
Assumption: Messages arrive in the order
they are sent. (Remember, this can be
achieved by handshaking.)
ALGORITHM FOR MUTUAL EXCLUSION
1. To request the resource, Pi sends the message “request resource” to
all other processes along with Pi’s local Lamport timestamp T. It also
puts that message on its own request queue.
2. When a process receives such a request, it acknowledges the
message (unless it has already sent a message to Pi timestamped
later than T).
3. Releasing the resource is analogous to requesting, but doesn’t require
an acknowledgement.
[Figure: Pi sends REQUEST to Pj and Pk and queues it locally; Pj and Pk acknowledge (no ack is needed from a process that already sent a later-stamped message); Pi executes, then sends RELEASE to all.]
USING THE RESOURCE
Process Pi starts using the resource when:
i. its own request on its local request queue has the
earliest Lamport timestamp T (consistent with ⇒);
ii. it has received a message (either an
acknowledgement or some other message) from
every other process with a timestamp larger than T.
iii. (Always) On receiving a release from Pj, remove Pj’s
request from the queue.
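Conditions (i) and (ii) can be sketched as a single predicate over Pi’s local state. The function name, the (timestamp, pid) encoding of requests, and the last_heard map are all invented for this illustration; ties in (i) are broken by process id, per the total order.

```python
# Sketch of Pi's entry test: its own request must be earliest in the local
# queue (ties broken by pid), and a later-stamped message must have arrived
# from every other process. Names and data shapes are invented.
def may_enter(my_request, request_queue, last_heard, my_pid):
    ts, pid = my_request
    # (i) earliest (timestamp, pid) pair in the local request queue
    if min(request_queue) != (ts, pid):
        return False
    # (ii) a message with a larger timestamp from every other process
    return all(t > ts for p, t in last_heard.items() if p != my_pid)

queue = [(5, "i"), (7, "j")]            # pending requests
heard = {"i": 9, "j": 8, "k": 6}        # largest timestamp heard per process
print(may_enter((5, "i"), queue, heard, "i"))  # True
```

Because messages arrive in order, condition (ii) guarantees no earlier-stamped request is still in flight toward Pi.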
CORRECTNESS
Theorem: Mutual exclusion and first-requested,
first-served are achieved.
Proof
Suppose Pi and Pj are both using the resource at the
same time and have timestamps Ti and Tj.
Suppose Ti < Tj. Then Pj must have received i’s request,
since it has received at least one message with a
timestamp greater than Tj from Pi and since messages
arrive in the order they are sent. But then Pj would not
execute its request. Contradiction.
First-requested, first-served: If Pi requests the resource
before Pj (in the → sense), then Ti < Tj, so Pi will win.
CRITICISMS
• Many messages. If only one process is
using the resource, it still must send
messages to many other processes.
• If one process stops, then all processes
hang (no wait-freedom; could we achieve it?)
Is there a Wait-Free Variant?
• Modify resource locally and then send to
everyone. If everyone accepts the new
value, then new resource value is good.
• Difficulty: what do you do if you don’t hear
from someone? Could there be a partition
of processes?
• Is this even possible?
NEED FOR PHYSICAL CLOCKS
Time as a partial order is the most frequent assumption in distributed
systems; however, it is sometimes important to have a physical notion of
time.
Example: Going outside the system. Person X starts a program A, then calls
Y on the telephone, who then starts program B. We would like A → B.
But that may not be true for Lamport clocks, because they are sensitive only
to inter-computer messages. Physical clocks try to account for event
orderings that are external to the system.
[Figure: X starts A, then calls Y by telephone; Y receives the call and starts B. We would like starts A → starts B, but this may not hold with → as so far defined.]
CONDITIONS FOR PHYSICAL CLOCKS
• Suppose u is the smallest time through internal or
external means that one process can be informed
of an event occurring at another process. That is, u
is the smallest transmission time.
(Distance/speed of light?)
• Suppose we have a global time t (all processes are
in same frame of reference) that is unknown to any
process.
Goal for physical clocks: Ci(t + u) > Cj(t) for any i, j.
This ensures that if A happens before B, then the
clock time for B will be after the clock time for A.
ASSUMPTIONS ABOUT CLOCKS AND
MESSAGES
1. Clock drift. In one unit of global time, Ci
will advance between 1-k and 1+k time
units. (k << 1)
2. A message can be sent in some minimum
time v with a possible additional delay of
at most e.
HOW DO WE ACHIEVE PHYSICAL
CLOCK GOAL?
• Can’t always do so, e.g., can’t synchronize
quartz watches using the U.S. post office.
• Basic algorithm: Periodically (to be
determined), each process sends out
timestamped messages.
• Upon receiving a message from Pi
timestamped Ti, process Pj sets its own
timestamp to max(Ti + v, Tj).
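The receive rule is one line; here it is as a sketch (the function name is invented). Since the message took at least v time units, the sender’s clock reads at least Ti + v by the time the message arrives, and the local clock must never run backwards.

```python
# Sketch of the physical-clock receive rule with minimum transit time v.
def on_receive(local_time, msg_timestamp, v):
    # The message took at least v units, so the sender's clock is at least
    # msg_timestamp + v by now; never move the local clock backwards.
    return max(msg_timestamp + v, local_time)

print(on_receive(100, 104, 2))  # 106: jump forward past Ti + v
print(on_receive(100, 95, 2))   # 100: already ahead, unchanged
```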
WHAT ALGORITHM ACCOMPLISHES
Simplifying to the essence of the idea, suppose
there are two processes i and j and i sends a
message that arrives at global time t.
After possibly resetting its timestamp, process j
ensures that
Cj(t) ≥ Ci(t) + v − (e+v)×(1+k)
That is, since i sent its message at local time Ti, i’s
clock may have advanced by up to (e+v)×(1+k) time units,
to Ti + (e+v)×(1+k). At the least, Cj(t) ≥ Ti + v.
How good can synchronization be, given e, v, k?
ROAD MAP: SOME FUNDAMENTAL
PROTOCOLS
Languages &
Constructs for
Synchronization
Time in
distributed
systems
Protocols built on
asynchronous networks
(achieve global state)
PROTOCOLS
Asynchrony and distributed centers of processing
give rise to various problems:
1. Find a spanning tree in a network.
2. When does a collection of processes terminate?
3. Find a consistent state of a distributed system, i.e.,
some analogue to a photographic “snapshot”.
4. Establish a synchronization point. This will allow us
to implement parallel algorithms that work in
rounds on a distributed asynchronous system.
5. Find the shortest path from a given node s to every
other node in the network.
MODEL
Links are bidirectional. A message traverses
a single link.
All nodes have distinct ids. Each node knows
its immediate neighbors.
Messages incur arbitrary but finite delay.
FIFO discipline on links, i.e., messages are
received in the order they are sent.
PRELIMINARY: ESTABLISH A SPANNING TREE
Some node establishes itself as the leader (e.g. node x
establishes a spanning tree for its broadcasts, so x is root).
That node sends out a “request for children” to its neighbors
in the graph.
When a node n receives “request for children” from node m:
if m is the first node that sent n this message, then n
responds “ACK and you’re my parent” and sends “request for
children” to its other neighbors; else n responds “ACK”.
Each node except the root has one parent, and every node is
in the tree. A leaf is a node that received only ACKs from the
neighbors to which it sent requests.
TERMINATING THE SPANNING TREE
A node that determines that it is a leaf sends up an
“I’m all done” message to its parent.
Each non-root parent sends an “I’m all done”
message to its parent once it has received such a
message from all its children.
When the root receives “I’m all done” from its
children, then it is done.
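The request/ACK exchange above amounts to a breadth-first traversal from the root, since each node’s parent is the first neighbor it hears from. A single-process simulation (the function name and example graph are invented):

```python
# Sketch of "request for children": a BFS over the graph stands in for the
# message exchange, recording each node's parent (first requester wins).
from collections import deque

def build_spanning_tree(graph, root):
    parent = {root: None}
    frontier = deque([root])
    while frontier:
        n = frontier.popleft()
        for m in graph[n]:            # n sends "request for children"
            if m not in parent:       # first such request m has seen
                parent[m] = n         # m replies "you're my parent"
                frontier.append(m)
    return parent

g = {"x": ["a", "b"], "a": ["x", "b"], "b": ["x", "a", "c"], "c": ["b"]}
print(build_spanning_tree(g, "x"))  # {'x': None, 'a': 'x', 'b': 'x', 'c': 'b'}
```

With real asynchronous messages the tree shape depends on delivery order; any node's first incoming request becomes its parent, so some spanning tree always results.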
BROADCAST WITH FEEDBACK
A given node s would like to pass message X
to all other nodes in the network and be
informed that all nodes have received the
message.
Algorithm:
Construct the spanning tree and then send X along
the tree.
Have each node send an acknowledgement to its
parent after it has received an acknowledgement from
all of its children.
PROCESS TERMINATION
Def: Detect the completion of a collection of non-interacting tasks, each of which is
performed on a distinct processor.
When a leaf finishes its computation, it sends an “I am terminated” message to its
parent.
When an internal node completes and has received an “I am terminated” message
from all of its children, it sends such a message to its parent.
When the root completes and has received an “I am terminated” message from all
of its children, all tasks have been completed.
DISTRIBUTED SNAPSHOTS
Intuitively, a snapshot is a freezing of a
distributed computation at the “same time.”
Given a snapshot, it is easy to detect stable
conditions such as deadlock.
(A deadlock condition doesn’t go away. If a
deadlock held in the past and nothing has
been done about it, then it still holds. That
makes it stable.)
FORMAL NOTION OF SNAPSHOT
Assume that each processor has a local clock, which is
incremented after the receipt of and processing of each
incoming message, e.g., a Lamport clock. (Processing may
include transmitting other messages.)
A collection of local times {tk | k ∈ N}, where N denotes the set of
nodes, constitutes a snapshot if each message received
by node j from node i prior to tj was sent by i prior to ti.
A message sent by i before ti but not received by j before tj is
said to be in transit.
The correctness criterion is that no message sent after the
snapshot be received before the snapshot.
(Such a thing could never happen if the snapshot time at
every site were a single global time.)
DISTRIBUTED SNAPSHOTS
[Figure: three situations for a message from i to j, with snapshot times ti and tj. Situation I: sent before ti and received before tj — OK. Situation II: sent after ti but received before tj — BAD; this is what the protocol must prevent. Situation III: sent before ti and received after tj — the message is in transit, which is OK.]
ALGORITHM
Node i enters its snapshot time either spontaneously or upon receipt of a
“flagged” message, whichever comes first. In either case, it sends out a
“flagged” token to all neighbors and advances the clock to what
becomes its snapshot time ti.
Messages sent later are after the snapshot.
This algorithm allows each node to determine when all messages in transit
have been received.
That is, when a node receives a flagged token from all its neighbors, then it
has received all messages in transit.
[Figure: state diagram — whether spontaneously or on receipt of a flagged token, the node sends a flagged token to all neighbors and sets its snapshot time.]
SNAPSHOT PROTOCOL IS CORRECT
Remember that we must prevent a node i
from receiving a message before its
snapshot time, ti, that was sent by a node j
after its snapshot time, tj.
But any message sent after the snapshot will
follow the flagged token at the receiving site
because of the FIFO discipline on links. So,
bad case cannot happen.
SYNCHRONIZER
It is often much easier to design a distributed
protocol when the underlying system is
synchronous.
In synchronous systems, computation
proceeds in “rounds”. Messages are sent at
the beginning of the round and arrive before
the end of the round. The beginning of each
round is determined by a global clock.
A synchronizer enables a protocol designed
for a synchronous system to run an
asynchronous one.
PROTOCOL FOR SYNCHRONIZER
1. Round manager broadcasts “round n begin.” Each
node transmits the messages of the round.
2. Each node then sends its flagged token to all
neighbors and records that time as the snapshot
time. (Flagged tokens are numbered to distinguish
the different rounds.)
3. Each node receives messages along each link until
it receives the flagged token, meaning the neighbor
has sent everything.
4. Nodes perform termination back to manager after
they have received a token from all neighbors (not
just children). Termination involves messages on
spanning tree rooted at round manager.
MINIMUM-HOP PATHS
Task is to obtain the paths with the smallest number of links
from a given node s to each other node in the network.
Suppose the network is synchronous. In the first round s
sends to its neighbors. In the second round, the neighbors
send to their neighbors. And so it continues.
When node i receives id s for the first time, it designates the
link on which s arrived as the first link on the shortest
path to s.
Use the synchronization protocol to simulate rounds.
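The round-by-round spread can be simulated in one process: round r reaches exactly the nodes r hops from s. A sketch with invented names:

```python
# Sketch of min-hop distances computed round by round, as a synchronous
# system (or one running under a synchronizer) would compute them.
def min_hop(graph, s):
    dist = {s: 0}
    frontier = [s]
    r = 0
    while frontier:
        r += 1
        nxt = []
        for n in frontier:
            for m in graph[n]:
                if m not in dist:    # first time m hears of s
                    dist[m] = r      # the link from n is m's first hop to s
                    nxt.append(m)
        frontier = nxt
    return dist

g = {"s": ["a", "b"], "a": ["s", "c"], "b": ["s", "c"], "c": ["a", "b"]}
print(min_hop(g, "s"))  # {'s': 0, 'a': 1, 'b': 1, 'c': 2}
```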
[Figure: an example network over nodes A, B, C, D, E, F with source S, and one possible resulting min-hop tree.]
MINIMUM-HOP PATHS
The round by round approach takes a long
time. Can you think of an asynchronous
approach that takes less time?
How do you know when you’re done?
Group Protocol
• All nodes start at distance infinity except S
which is at distance 0.
• S sends a message. All receivers of S’s
message are at distance 1.
• When a node receives a message that it is
at distance k and k < that node’s current
distance, the node sends k+1 to all its
neighbors.
Group Protocol Termination
• Invariant: If for all nodes n the neighbors of
n have distance at least dist(n) – 1, then
we’re done.
Anna’s Group Protocol Termination
• Each time a node x receives a message m
from y, it acks. Now x sends messages to
all its neighbors (if distance has gone
down). When x receives a reply from all its
neighbors, then it replies to y that x is
done updating. When s receives all
replies, it should broadcast everyone is
done.
• Dennis doesn’t think this is done.
Group Protocol Termination
• A node will terminate if it gets a message
from all its neighbors
CLUSTERING
K-means clustering:
Choose k centroids from the set of points at random.
Then assign each point to the cluster of the nearest centroid.
Then recompute the centroid of each cluster and start over.
Why does this converge?
A Lyapunov function is a function of the state of an algorithm
that decreases whenever the state changes and that is
bounded from below.
With sequential k-means the sum of the distances always
decreases.
Can't get lower than zero.
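A 1-D sketch makes the Lyapunov argument concrete: the total squared distance, recomputed after each assign/recompute step, never increases. The function name and example data are invented for this illustration.

```python
# 1-D k-means sketch showing the Lyapunov argument: the per-round total
# squared distance is nonincreasing and bounded below by zero.
def kmeans_1d(points, centroids, rounds=10):
    costs = []
    for _ in range(rounds):
        # Assign each point to the cluster of the nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            j = min(range(len(centroids)),
                    key=lambda j: (p - centroids[j]) ** 2)
            clusters[j].append(p)
        # Recompute each centroid as its cluster's mean.
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
        costs.append(sum(min((p - m) ** 2 for m in centroids)
                         for p in points))
    return centroids, costs

_, costs = kmeans_1d([1.0, 2.0, 10.0, 11.0], [0.0, 5.0])
assert all(a >= b for a, b in zip(costs, costs[1:]))  # Lyapunov: nonincreasing
print(costs[-1])  # 1.0
```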
WHY DOES DISTANCE DECREASE?
Well, when you readjust the mean, the total distance decreases for that cluster.
When you reassign, each point’s distance can only get smaller.
So every step reduces the total distance.
How do we do this for a distributed asynchronous system?
What if you have rounds?
What if you don’t?
SETI-at-Home Style Projects?
SETI stands for search for extra-terrestrial intelligence. It
consists of testing radio signal receptions for some regularity.
Ideal distributed system project: master sends out work. Servers
do work.
Servers may crash. What to do? Work queue trick: send work
description to server and put job at bottom of queue. If server
completes, then remove work description from queue.
Servers may be dishonest. What to do?
BROADCAST PROTOCOLS
Often it is important to send a message to a
group of processes in an all-or-nothing
manner. That is, either all non-failing
processes should receive the message or
none should.
This is called atomic broadcast
Assumptions:
1. fail-stop processes
2. messages are received from one process to
another in the order they are sent
ATOMIC (UNORDERED) BROADCAST PROTOCOL
• Application: Update all copies of a replicated data item
• Initiator: Send message m to all destination processes
• Destination process: When receiving m for the first
time, send it to all other destinations
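The all-or-nothing property of the flooding protocol can be sketched by simulating a fail-stop initiator. The function name and the reached_before_crash parameter are invented for this illustration: the initiator crashes after m reached only that set, and the relays still carry m everywhere.

```python
# Sketch of unordered atomic broadcast by flooding: the initiator fail-stops
# after m reached only `reached_before_crash`, but because each destination
# relays m on first receipt, every operational process still gets it.
from collections import deque

def flood_after_crash(processes, initiator, reached_before_crash):
    received = set(reached_before_crash)
    pending = deque(received)
    while pending:
        p = pending.popleft()
        for q in list(processes - {initiator} - received):
            received.add(q)     # q receives m for the first time...
            pending.append(q)   # ...and will relay it in turn
    return received

procs = {"p1", "p2", "p3", "p4"}
print(sorted(flood_after_crash(procs, "p1", {"p2"})))  # ['p2', 'p3', 'p4']
```

Either m reached at least one correct process (and the relays deliver it to all), or it reached none; that is exactly failure atomicity.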
Fault-Tolerant Broadcasts
• Reference: “A Modular Approach to Fault-Tolerant
Broadcasts and Related Problems,” Vassos
Hadzilacos and Sam Toueg.
• Describes reliable broadcast, FIFO
broadcast, causal broadcast and ordered
broadcast.
Stronger Broadcasts
• FIFO broadcast: Reliable broadcast that
guarantees that messages broadcast by the
same sender are received in the order they
were broadcast.
• A bit more precise: If a process broadcasts
a message m before it broadcasts a
message m’, then no correct process
accepts m’ unless it has previously
accepted m. (Might buffer a message
before accepting it.)
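One standard way to implement FIFO delivery (an assumed mechanism, not spelled out in the slides) is per-sender sequence numbers: a receiver buffers message number n from sender s until it has accepted numbers 0..n-1 from s.

```python
class FifoReceiver:
    def __init__(self):
        self.next_seq = {}      # sender -> next sequence number to accept
        self.buffer = {}        # (sender, seq) -> buffered message
        self.accepted = []      # messages accepted, in order

    def receive(self, sender, seq, msg):
        self.buffer[(sender, seq)] = msg
        n = self.next_seq.get(sender, 0)
        while (sender, n) in self.buffer:   # accept as many as possible
            self.accepted.append(self.buffer.pop((sender, n)))
            n += 1
        self.next_seq[sender] = n

r = FifoReceiver()
r.receive("A", 1, "m'")    # arrives out of order: buffered, not accepted
r.receive("A", 0, "m")     # now both can be accepted, in FIFO order
```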
Problems with FIFO
• Network news application, where users
distribute their articles with FIFO Broadcast.
User A broadcasts an article.
• User B, at a different site, accepts that article
and broadcasts a response that can only be
understood by a user who has already seen
the original article.
• User C accepts B’s response before
accepting the original article from A and so
misinterprets the response.
Causal Broadcast
• Causal broadcast: If the broadcast of m
causally precedes the broadcast of m’ (in
the sense of Lamport ordering), then m
must be accepted everywhere before m’
• Does this solve the previous problem?
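One standard implementation of causal delivery (not spelled out in the slides) uses vector clocks: a message from process i is delayed until the receiver has accepted all of i's earlier messages and everything that causally precedes this one.

```python
class CausalReceiver:
    def __init__(self, n):
        self.V = [0] * n            # V[k]: messages accepted from process k
        self.pending = []
        self.accepted = []

    def _deliverable(self, sender, T):
        # T is the sender's vector timestamp on the message.
        return (T[sender] == self.V[sender] + 1 and
                all(T[k] <= self.V[k] for k in range(len(T)) if k != sender))

    def receive(self, sender, T, msg):
        self.pending.append((sender, T, msg))
        progress = True
        while progress:             # accept everything that became deliverable
            progress = False
            for item in list(self.pending):
                s, t, m = item
                if self._deliverable(s, t):
                    self.pending.remove(item)
                    self.accepted.append(m)
                    self.V[s] += 1
                    progress = True

# The network-news scenario: A (process 0) posts; B (process 1) replies.
c = CausalReceiver(2)
c.receive(1, [1, 1], "B's response")   # depends on A's article: buffered
c.receive(0, [1, 0], "A's article")    # now both accepted, in causal order
```

So yes, this solves the network-news problem: the response is never accepted before the article it answers.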
Problems with Causal Broadcast
• Consider a replicated database with two
copies of a bank account x residing in
different sites. Initially, x has a value of 100.
A user deposits 20, triggering a broadcast of
“add 20 to x” to the two copies of x.
• At the same time, at a different site, the
bank initiates a broadcast of “add 10 percent
interest to x”. Not causally related, so
Causal Broadcast allows the two copies of x
to accept these update messages in
different orders.
THE NEED FOR ORDERED BROADCAST
In the causal protocol, it is possible for two updaters
at different sites to send their messages in a
different order to the various processes, so the
sequence won’t be consistent.
[Diagram: updaters U1 and U2 each broadcast a message (m and m’); sites A
and B accept the two messages in different orders.]
Total Order (Atomic Broadcast)
• If correct processes p and q both accept
messages m and m’, then p accepts m
before m’ if and only if q accepts m before
m’.
DALE SKEEN’S ORDERED
BROADCAST PROTOCOL
Idea is to assign each broadcast a global logical
timestamp and deliver messages in the order of
timestamps.
• As before, the initiator sends message m to all receiving processes
(perhaps only a subset of all processes)
• Receiver process marks m as undelivered (keeps m in buffer)
and sends a proposed timestamp that is larger than any
timestamp that the site has already proposed or received
Timestamps are made unique by attaching the site’s
identifier as low-order bits.
Time advances at each process based on Lamport
clocks.
SKEEN’S ORDERED BROADCAST
PROTOCOL (cont.)
1. Initiator sends message m to all receivers.
2. Each receiver replies with a proposed timestamp (e.g., 17) based on its
   local Lamport time, buffering m as undelivered.
3. The initiator takes the maximum of the proposals (e.g., 29) and sends it
   back as the final timestamp.
4. Each receiver replaces its proposed timestamp for m by the final
   timestamp, waits until the final timestamp for m is the minimum of all
   proposed or final timestamps in its buffer, then accepts m and forgets
   its timestamp.
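The steps above can be sketched as a single-threaded simulation (message passing elided; names are ours). Two concurrent broadcasts interleave their proposal rounds, yet every site accepts them in the same final-timestamp order.

```python
class Site:
    def __init__(self, site_id):
        self.site_id = site_id
        self.clock = 0
        self.buffer = {}                 # msg -> (timestamp, is_final)

    def propose(self, msg):
        self.clock += 1
        ts = (self.clock, self.site_id)  # site id = tie-breaking low-order bits
        self.buffer[msg] = (ts, False)
        return ts

    def finalize(self, msg, final_ts):
        self.clock = max(self.clock, final_ts[0])   # Lamport-clock advance
        self.buffer[msg] = (final_ts, True)

    def accept_deliverable(self):
        out = []
        while self.buffer:
            msg, (ts, final) = min(self.buffer.items(), key=lambda kv: kv[1][0])
            if not final:                # smallest timestamp only proposed:
                break                    # must wait
            del self.buffer[msg]
            out.append(msg)              # final AND minimal: accept
        return out

sites = [Site(1), Site(2)]
final_m = max(s.propose("m") for s in sites)     # initiator of m takes the max
final_m2 = max(s.propose("m'") for s in sites)   # initiator of m' takes the max
for s in sites:
    s.finalize("m", final_m)
    s.finalize("m'", final_m2)
orders = [s.accept_deliverable() for s in sites]
```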
CORRECTNESS
Theorem: m and m’ will be accepted in same
order at all common sites.
Proof steps:
– Every two final timestamps are different.
– If TS(m) < TS(m’) then, since TS(m) is the maximum of the proposals for
m, every proposed timestamp for m is at most TS(m) < TS(m’); so m’ is
never the minimum while m is buffered, and m is accepted first.
Is the protocol overly complex?
Find an example showing that changing the
Skeen protocol in any one of the following
ways would yield an incorrect protocol.
1. The timestamps at different sites could be the
same.
2. The initiator chooses the minimum (instead of the
maximum) proposed timestamp as the final
timestamp.
3. Sites accept messages as soon as they become
deliverable.
ROAD MAP: COMMIT PROTOCOLS
• Recovery of data following fail-stop failures.
• All-or-nothing commitment of transactions: avoid partial updates.
• Failure model: fail-stop.
THE NEED
Scenario: Transaction manager
(representing user) communicates with
several database servers.
Main problem is to make the commit atomic
(i.e., either all sites commit the transaction
or none do).
NAÏVE (INCORRECT) ALGORITHM
[Diagram: TM sends Commit to the servers. One replies Done — its inventory
is incremented. Another replies No! — its cash is not decremented.
Result: inconsistent state.]
TWO-PHASE COMMIT: PHASE 1
– Transaction manager asks all servers whether they can
commit.
– Upon receipt, each able server saves all updates to
stable storage and responds yes.
If server cannot say yes (e.g., because of a concurrency
control problem), then it says no. In that case, it can
immediately forget the transaction. Transaction
manager will abort the transaction at all sites.
[Diagram: TM sends Prepare to each server; each server replies Yes.]
TWO-PHASE COMMIT: PHASE 2
– If all servers say yes, then transaction manager writes a commit
record to stable storage and tells them all to commit, but if some say
no or don’t respond, transaction manager tells them all to abort.
– Upon receipt, the server writes the commit record and then sends an
acknowledgement. The transaction manager is done when it
receives all acknowledgements.
If a database server fails during first step, all abort.
If a database server fails during second step, it can consult the
transaction manager to see whether it should commit.
[Diagram: TM sends Commit to each server; each server replies Ack.]
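The two-phase decision logic above can be sketched as follows (an in-memory stand-in: lists play the role of stable storage, and method calls play the role of messages; all names are ours).

```python
class Server:
    def __init__(self, can_prepare=True):
        self.can_prepare = can_prepare
        self.log = []                    # stand-in for stable storage

    def prepare(self):
        if self.can_prepare:
            self.log.append("prepared")  # updates saved to stable storage
            return True
        return False                     # says no; forgets the transaction

    def commit(self):
        self.log.append("commit")

    def abort(self):
        self.log.append("abort")

def two_phase_commit(tm_log, servers):
    # Phase 1: ask every server whether it can commit.
    votes = [s.prepare() for s in servers]
    if all(votes):
        tm_log.append("commit")          # commit record hits stable storage
        for s in servers:                # Phase 2: tell everyone to commit
            s.commit()
        return "committed"
    for s in servers:                    # some said no: abort everywhere
        s.abort()
    return "aborted"
```

Note that the TM writes its commit record before sending any commit message, exactly as the Q&A below requires for recovery.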
ALL OF TWO-PHASE COMMIT
Message flow: Transaction Manager → Server: Prepare; Server → TM: Yes;
TM → Server: Commit; Server → TM: Done.

States of server:
Active → (Prepare) → Ready to Commit → (Commit) → Committed
Active → (cannot prepare) → Aborted
QUESTIONS AND ANSWERS
Q: What happens if the transaction manager fails?
A: A database server that said yes in the first phase but has
received neither a commit nor abort instruction must wait
until the transaction manager recovers. It is said to be
blocked.
Q: How does a recovering transaction manager know whether
it committed a given transaction before failing?
A: The transaction manager must write a commit T record to
stable storage after it receives yes’s from all database
servers on behalf of T and before it sends any commit
messages to them.
Q: Is there any way to avoid having a database server block
when the transaction manager fails?
A: A database server may consult other database servers who
have participated in the transaction, if it knows who they are.
OPTIMIZATION FOR READ-ONLY
TRANSACTIONS
Read-only transactions.
Suppose a given server has done only reads
(no updates) for a transaction.
• Instead of responding to the transaction manager
that it can commit, it responds READ-only
• The transaction manager can thereby avoid sending
that server a commit message
THREE-PHASE COMMIT
A non-blocking protocol, assuming that:
• A process fail-stops and does not recover during
the protocol
• The network delivers messages from A to B in the
order they were sent
• Live processes respond within the timeout period
Non-blocking = surviving servers can decide
what to do.
PROTOCOL
Transaction Manager (Initiator) → Server (Agent): Willing?
Server → TM: Willing-Yes
TM → Server: Prepare
Server → TM: OK
TM → Server: Committed
Server → TM: Done
STATES OF SERVER ASSUMING FIRST
TM DOES NOT FAIL
Active → (send Willing-Yes) → Willing → (receive Prepare) →
Ready to Commit → (receive Committed) → Committed
Active → (says No) → Aborted
INVARIANTS while first TM active
• No server can be in the willing state while any other
server (live or failed) is in the committed state
• No server can be in the aborted state while any other
server (live or failed) is in the ready-to-commit state
Active, Willing: some servers may have aborted; no one has committed.
Ready to Commit, Committed: no one has aborted; some may have committed.
CONTRAST WITH TWO-PHASE
COMMIT
Active → (some may have aborted) → Ready to Commit → (some may have
committed) → Committed
In the ready-to-commit state a two-phase-commit server can rule out
neither outcome at the other servers — which is why it can block.
RECOVERY IN THREE-PHASE COMMIT
after first TM fails or slows down too much
What the newly elected TM does:
• If any server has aborted: send ABORT.
• If any server is ready-to-commit or committed: send COMMITTED.
• If all surviving servers are willing:
  – and all servers are alive: send ABORT;
  – and some are dead: send PREPARE, then COMMITTED.
ROAD MAP: KNOWLEDGE LOGIC
AND CONSENSUS
Commitment
fail-stop
site only
Knowledge Logic:
consensus.
Fail-stop.
Network perhaps.
EXAMPLE: COORDINATED ATTACK
Forget about computers. Think about a pair of allied generals A and B.
They have previously agreed to attack simultaneously or not at all. Now,
they can only communicate via carrier pigeon (or some other unreliable
medium).
Suppose general A sends the message to B
“Attack at Dawn”
Now, general A won’t attack alone. A doesn’t know whether B has received
the message. B understands A’s predicament, so B sends an
acknowledgment.
“Agreed”
[Diagram: A sends “Attack” to B; B sends “Agreed” back to A.]
WILL IT EVER END?
[Diagram: A sends “ack your ack”; B replies “ack your ack to my ack”;
and so on.]
IT NEVER ENDS
Theorem: Assume that communication is unreliable. Any
protocol that guarantees that if one of the generals
attacks, then the other does so at the same time, is a
protocol in which necessarily neither general attacks.
Have you ever had this problem when making an
appointment by electronic mail?
[Diagram: A proposes “10 AM?”; B replies “OK”; A still wonders: but will
he show up?]
BACK TO COMPUTERS
While ostensibly about military matters, the Two Generals
problem and the Byzantine Agreement problem should
remind you of the commit problem.
• In all three problems, there are two possibilities: commit
(attack) and abort (don’t attack).
• In all three problems, all sites (generals) must agree.
• In all three problems, always aborting (not attacking) is not an
interesting solution.
The theorem shows that no non-blocking commit protocol
is possible when the network can drop messages.
Corollary: If the decision must be made within a fixed time
period, then unbounded network delays prevent the
sites from ever committing.
ROAD MAP: KNOWLEDGE LOGIC
AND TRANSMISSION
Knowledge Logic:
consensus
Failures of all types
Knowledge Logic and
Transmission.
Common knowledge is
unnecessary.
BASIC MODEL FOR
KNOWLEDGE LOGIC
• Each processor is in some local state.
That is, it knows some things.
• The global state is just the set of all local
states.
• Two global states are indistinguishable to
a processor if the processor has the same
local state in both global states.
SOME USEFUL NOTATION FOR SUCH PROBLEMS
Ki – agent i knows.
CG – common knowledge among group G
A statement x is common knowledge if
1. Every agent knows x: ∀i Ki x.
2. Every agent knows that every other agent knows x: ∀i ∀j Ki Kj x.
3. Every agent knows that every other agent knows that every other agent
knows x: ∀i ∀j ∀k Ki Kj Kk x.
and so on.
I know x.
You know that I know x.
You know that I know
that you know x…
…
EXAMPLES
In coordinated attack problem, when A sends his message.
KA “A says attack at dawn”
When B receives that, then
KBKA “A says attack at dawn”
However, it is false that KAKBKA “A says attack at dawn”
This is remedied when A receives the first acknowledge, at
which point
KAKBKA “A says attack at dawn”
However, it is false that
KBKAKBKA “A says attack at dawn”
More knowledge but never common knowledge.
EXAMPLE: RELIABLE AND
BOUNDED TIME COMMUNICATION
If A knows that B will receive any
message that A sends within one minute
of A’s sending it, then if A sends
“Attack at dawn”
A knows that within two minutes
CA,B “A says attack at dawn”
CONCLUSIONS
• Common knowledge is unattainable in
systems with unreliable communication (or
with unbounded delay)
• Common knowledge is attainable in
systems with reliable communication in
bounded time
APPLYING KNOWLEDGE TO SEQUENCE
TRANSMISSION PROTOCOLS
Problem: The two processes are the sender and the receiver.
Sender S has an input tape with an infinite sequence of data
elements (0,1, blank). S tries to transmit these to receiver
R. R writes these onto the output tape.
Correctness: Output tape should contain a prefix of input tape
even in the face of errors (safety condition).
Given a sufficiently long correct transmission, output tape
should make progress (liveness condition).
[Diagram: sender S reading symbols from its infinite input tape; receiver
R writing symbols to its output tape.]
MODEL
• Messages are kept in order
• Sender and receiver are synchronous. This
implies that sending a blank conveys information.
Three possible types of errors:
– Deletion errors: either a 0 or a 1 is sent, but a blank
is received.
– Mutation errors: a 0 (resp. 1) is sent, but a 1 (resp. 0)
is received. Blanks are transmitted correctly.
– Insertion errors: a blank is sent, but a 0 or 1 is
received.
Question: Can we handle all three error types?
POSSIBLE ERROR TYPES
If all error types are present, then a sent sequence
can be transformed to any other sequence of the
same length. So receiver R can gain no
information about messages that sender S actually
transmitted.
For any two of the three, the problem is solvable.
To show this, we will extend the transmission
alphabet to consist of blank, 0, 1, ack, ack2, ack3.
Eventually, we will encode these into 0s, 1s and
blanks.
ERROR TYPE: DELETION ALONE
So a 1,0, or any acknowledgement can become a blank.
Suppose the input for S is 0,0,1…
For any symbol y, we want to achieve that the sender
knows that the receiver has received (knows) symbol y.
Denote this Ks Kr (y).
Imagine the following protocol: If S doesn’t receive an
acknowledgement, then it resends the symbol it just
sent. If S receives an acknowledgement, S sends the
next symbol on its tape.
Scenario: S sends y, R sends ack, S sends next symbol y’.
Is there a problem?
GOAL OF PROTOCOL
Yes, there is a problem. Look at this from R’s point of view. It may be that
y’ = y.
R doesn’t know whether S is resending y (because it didn’t receive R’s
acknowledgement) or S is sending a new symbol.
So, R needs more knowledge. Specifically, R must know that S received
its acknowledgement. S must know that R knows this.
We need Ks Kr Ks Kr y. To get this, S sends ack2 to R. Then R sends
ack3 to S.
S → R: y       (now Kr y)
R → S: ack     (now Ks Kr y)
S → R: ack2    (now Kr Ks Kr y)
R → S: ack3    (now Ks Kr Ks Kr y)
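A minimal simulation of this exchange over a deletion-only channel (our own model, not from the slides): a deleted message is simply resent, and each successful delivery adds one level to the knowledge ladder.

```python
def exchange(drops):
    """drops[t] is True if the t-th transmission is deleted (assumed model).
    The current message is resent until it gets through; each delivery
    advances the protocol one step: y, ack, ack2, ack3."""
    ladder = ["Kr y", "Ks Kr y", "Kr Ks Kr y", "Ks Kr Ks Kr y"]
    knowledge = []
    level = 0
    for dropped in drops:
        if level == len(ladder):
            break                      # exchange complete
        if not dropped:
            knowledge.append(ladder[level])
            level += 1                 # delivery: move to the next message
        # else: the sender times out and resends the same message
    return knowledge
```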
EXERCISE
Suppose that the symbol after y is y’ and y’  y.
Then can S send y’ as soon as it receives ack to y?
(Assume R has a way of knowing that it received
y and y’ correctly.)
S → R: y       (R: Kr y)
R → S: ack     (S: Ks Kr y)
S → R: y’      (R: Kr y’; Kr Ks Kr y)
R → S: ack …
ENCODING PROTOCOL IN 0’s and 1’s
Symbols sent by S:          Symbols sent by R:
  blank → 11                  ack  → 0
  0     → 00                  ack3 → 1
  1     → 01
  ack2  → 10
WHAT DO WE WANT FROM AN
ENCODING?
1. Unique decodability. If e(x) is received
uncorrupted, then the recipient knows that it is
uncorrupted and that it is an encoding of x.
2. Corruption detectability. If e(x) is corrupted,
the recipient knows that it is.
Thus, receiver knows when it receives good
data and when it receives a garbled
message.
ENCODING FOR DELETIONS AND MUTATIONS
Recall that mutation means that a 0 can become a 1 or vice versa.
Encoding (b is blank):

Symbols sent by S:          Symbols sent by R:
  blank → bbb1                ack  → 1b
  0     → 1bbb                ack3 → b1
  1     → b1bb
  ack2  → bb1b

Note: the same extended-alphabet protocol will work. Any insertion would
result in two non-blank characters, and a mutation can only change a 1
to a 0.
Exercise for you
• Suppose we had deletion errors (1
becomes blank or 0 becomes blank) and
insertion errors (blank becomes 0 or blank
becomes 1), but no mutation errors (so 1
never becomes 0 and 0 never becomes
1).
• Now try mutation and insertion.
Self-Stabilizing Systems
• A distributed system is self-stabilizing if,
when started from an arbitrary initial
configuration, it is guaranteed to reach a
legitimate configuration as execution
progresses, and once a legitimate
configuration is achieved, all subsequent
configurations remain legitimate.
Self-Stabilizing Systems
(using invariants)
• There is an invariant I which implies a
safety condition S.
• When failures occur, S is maintained
though I may not be.
• However when the failures go away, I
returns.
• http://theory.lcs.mit.edu/classes/6.895/fall02/papers/Arora/masking.pdf
Self-Stabilizing Systems
components
• A corrector returns a program from a state
satisfying only S to one satisfying I: e.g., error-correcting
codes, exception handlers, database recovery.
• A detector sees whether there is a
problem: e.g. acceptance tests, watchdog
programs, parity ...
Self-Stabilizing Systems
example
• Error model: messages may be dropped.
• Message sending protocol called the alternating
bit protocol, which we explain in stages.
• Sender sends a message, Receiver
acknowledges if message is received and
uncorrupted (can use checksum).
• Sender sends next message.
Alternating Bit Protocol
continued
• If Sender receives no ack, then it resends.
• But: what if the receiver has received the
message and the ack got lost?
• In that case, the receiver thinks of the
resent copy as a new message.
Alternating Bit Protocol
-- we’ve arrived.
• Solution 1: Send a sequence number with the
message so receiver knows whether a message
is new or old. Receiver still acks.
• But: This number increases as the log of the
number of messages.
• Better: Send the low order bit of the sequence
number. This is the alternating bit protocol.
• Invariant: Output equals what was sent perhaps
without the last message.
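The protocol above can be sketched as a simulation (our own model; the channel's drops are random but seeded, purely for illustration). The sender resends until acked; the receiver uses the alternating bit to tell a new message from a resent duplicate.

```python
import random

def run_abp(data, drop):
    """drop() returns True when the channel loses the next transmission."""
    output = []
    expected = 0                       # bit the receiver expects next
    for i, item in enumerate(data):
        bit = i % 2                    # low-order bit of the sequence number
        while True:                    # sender resends until acked
            if not drop():             # (bit, item) reaches the receiver
                if bit == expected:
                    output.append(item)    # new message: write it out
                    expected ^= 1
                # else: duplicate (the ack was lost); ignore it but re-ack
                if not drop():         # ack(bit) reaches the sender
                    break
            # otherwise the sender times out and resends (bit, item)
    return output

random.seed(7)
lossy = lambda: random.random() < 0.3  # drops about 30% of transmissions
```

The invariant holds throughout: the output equals what was sent, perhaps without the message currently in flight.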
Why is this Self-Stabilizing?
• Safety: output is a prefix of what was sent
even in the face of failures (provided
checksums are sufficient to detect
corruption).
• Invariant: (Output equals what was sent
perhaps without the last message) is a
strong liveness guarantee.
Exercise: what do sender/receiver
know at each stage of protocol
• When sender receives ack?
• When receiver receives a message with
low order bit?
MQ Series
a reliable transmission protocol
• Idea is to guarantee that a message that is
sent is received exactly once.
• Sender deposits a message at a queue
manager and the queue manager ensures
that the message arrives exactly once at
the receiver.
• How can this be made to work?
• Assume: fail-stop failures (stable storage
does not fail).
MQ series
basic idea
• It is necessary to coordinate MQ series
and the receiver.
• What protocol does this sound like?
• What should we require about message
delay/timeout?
MQ series
answers
• Need a commit protocol to get this kind of
coordination.
• MQ series supports the “XA” interface
(every component can do two phase
commit)
• No particular requirement on messages,
but there could be blocking.
ROAD MAP: TECHNIQUES FOR
REAL-TIME SYSTEMS
Fault Tolerance
Clock
Synchronization
Operating System
Scheduling
Real-Time
SYSTEMS THAT CANNOT OR SHOULD NOT WAIT
Time-sharing operating environments: concern for
throughput.
Want to satisfy as many users as possible.
Soft real-time systems (e.g., telemarketing): concern for
statistics of response time.
Want only a few disgruntled customers.
Firm real-time systems (e.g., obtain ticker information on
Wall Street): concern to meet as many deadlines as
possible.
If you miss, you lose the deal.
Hard real-time systems (e.g., airplane controllers):
requirements to meet all deadlines.
If you miss, then airplane may crash.
DISTINCTIVE CHARACTERISTICS OF REAL-TIME SYSTEMS
1. Predictability is essential – For hard real-time systems the time to run
   a routine must be known in advance.
   Implication: Much programming is done in assembly language. Changes are
   done by “patching” machine code.
2. Fairness is considered harmful – We do not want an ambulance to wait
   for a taxi.
   Implication: Messages must be prioritized; FIFO queues are bad.
3. Preemptibility is essential – An emergency condition must be able to
   override a low-priority task immediately.
   Implication: Task switching must be fast, so processes must reside in
   memory.
4. Scheduling is of major concern – The time budget of an application is as
   important as its monetary budget. Meeting time constraints is more than
   just a matter of fast hardware.
   Implication: We must look at the approaches to scheduling.
SCHEDULING APPROACHES
1. Cyclic executive – Divide processor time into
endlessly repeating cycles where each cycle is
some fixed length, say 1 second. During a
cycle some periodic tasks may occur several
times, others only once. Gaps allow sporadic
tasks to enter.
2. Rate-monotonic – Give tasks priority based on
the frequency with which they are requested.
3. Earliest-deadline first – Give the highest
priority to the task with the earliest deadline.
CYCLIC EXECUTIVE STRATEGY
[Cycle layout: Gap | T1 | T3 | T4 | Gap | T1 | T1 | T2 | Gap]

A cycle design containing sub-intervals of different lengths. During a
sub-interval, either a periodic task runs or a gap is permitted for
sporadic tasks to run. Note that task T1 runs three times during each
cycle. In general, different periodic tasks may have to run with
different frequencies.
Earliest Deadline First
• Algorithm is the title. Execute the task with
the earliest deadline.
• Theorem: if any scheduler can schedule
all tasks by their deadlines, then earliest
deadline first can do so.
• How might you prove this?
Earliest Deadline First
• Suppose that an optimal scheduler does
task t1 before t2 even though both tasks
are in the system and t2 has an earlier
deadline than t1. Both complete by their
deadlines.
• Earliest deadline first would do t2 first, so
clearly t2 finishes even earlier. But now t1
finishes when t2 would have. This was
early enough for t2 and hence for t1.
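The optimality claim can be exercised with a small unit-time simulator (our own example task sets, not from the slides): EDF schedules any feasible set, and reports a miss on an infeasible one.

```python
def edf_meets_all(tasks):
    """tasks: list of (release, compute, deadline) in unit time slices.
    Returns True iff EDF finishes every task by its deadline."""
    remaining = [c for (_, c, _) in tasks]
    finish = [None] * len(tasks)
    for t in range(max(d for (_, _, d) in tasks)):
        ready = [i for i in range(len(tasks))
                 if tasks[i][0] <= t and remaining[i] > 0]
        if not ready:
            continue
        i = min(ready, key=lambda i: tasks[i][2])   # earliest deadline runs
        remaining[i] -= 1
        if remaining[i] == 0:
            finish[i] = t + 1
    return all(f is not None and f <= tasks[i][2]
               for i, f in enumerate(finish))
```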
Earliest Deadline First
-- overload issue
• What happens on overload (i.e. not
enough time to finish everything)?
• Change the model: firm real time. Get
value if you finish a task by its deadline but
lose nothing if you don’t.
Earliest Deadline First
-- pure alg does badly
• If you have k tasks where each takes 1
second and they have deadlines 1 sec, 2
sec, …, k sec, you can do them all.
• What if we add a new task that takes
1/1000 sec and has a deadline of 0.5 seconds?
• We execute new task but no other task
completes by its deadline.
Earliest Deadline First
-- heuristic partial solution
• Never execute a task having negative
slack time (no way it can finish on time).
• But this doesn’t necessarily lead to an
optimal solution.
• Sometimes you are faced with two tasks,
each of which could finish if you
terminated the other.
Earliest Deadline First
-- value density lower bound
• Suppose tasks have values and
computation times. Value density =
value/computation time.
• If all tasks have same value density, but
different computation times, cannot
guarantee better than a factor of ½ of
optimal score.
• Can you think of an example?
Earliest Deadline First
-- value density scenario
• No slack time for any task
• Task 1 lasts 1 sec, value of 1
• Task 2 arrives at 1-epsilon and lasts 2
secs, value of 2.
• Task 3 arrives at 1 and lasts epsilon
seconds.
• Do we abandon task 1 for task 2 or do we
do task 3?
Earliest Deadline First
-- value density outcome 1
• If we drop task 1 in favor of task 2, then
there might be more and more epsilon
length tasks until time 3 – 2 epsilon when
a task of length 2 arrives.
• Optimal algorithm would have done task 1,
all the little epsilons, and then the final 2
second task. Optimal: 5 – 2 epsilon;
Switcher: 2
Earliest Deadline First
-- value density outcome 2
• If we don’t drop task 1 in favor of task 2,
then there might be no more tasks.
• Optimal algorithm would have done task 2,
whereas your algorithm did task 1 plus
epsilon. Optimal: 2; Non-switcher: 1
Earliest Deadline First
-- competitive factor
• At time 1-epsilon, your scheduler can’t
know what to do, so an optimal
(clairvoyant) scheduler may get at least a
factor of 2 more.
• Meeting this competitive factor was the
PhD work of Gilad Koren here at NYU.
• See the D-over algorithm in the pdf notes.
• Squash club puzzle.
RATE MONOTONIC ASSIGNMENT
Three tasks:

Task   Period   Compute Time
T1     100      20
T2     150      40
T3     350      100
Rate monotonic would say that T1 should get highest
priority (because its period is smallest and rate is
highest), then T2, then T3.
Assume that all tasks are perfectly preemptable.
As the following figure shows, all tasks meet their
deadlines. What happens if T3 is given highest priority
“because it is the most important”?
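A small simulator (our own sketch; unit time slices, deadline = end of period, full preemption) checks both claims on the table above: rate-monotonic priorities meet every deadline over the hyperperiod, while giving T3 top priority makes T1 miss immediately.

```python
def simulate(tasks, horizon, priority):
    """tasks: (period, compute) pairs; priority: task indices, highest first."""
    remaining = [0] * len(tasks)
    for t in range(horizon):
        for i, (period, compute) in enumerate(tasks):
            if t % period == 0:
                if remaining[i] > 0:
                    return False        # previous job missed its deadline
                remaining[i] = compute  # release a new job
        for i in priority:              # run the highest-priority ready task
            if remaining[i] > 0:
                remaining[i] -= 1
                break
    return True

tasks = [(100, 20), (150, 40), (350, 100)]          # from the table above
rate_monotonic = sorted(range(3), key=lambda i: tasks[i][0])
```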
EXAMPLE OF RATE MONOTONIC SCHEDULING
[Timeline 0–500 for T1, T2, T3: T3 is interrupted repeatedly, completes
its first instance, and completes earlier in its second cycle.]
Use of rate monotonic scheduler (higher rate gets higher
priority) ensures that all tasks complete by their deadlines.
Notice that T3 completes earlier in its cycle the second time,
indicating that the most-difficult-to-meet situation is the
very initial one.
CASE STUDY
A group is designing a command and control system.
Interrupts arrive at different rates; however, the maximum rate of each
interrupt is predictable.
Computation time of task associated with each interrupt is
predictable.
First implementation uses Ada and a special purpose
operating system. The operating system handled
interrupts in a round-robin fashion.
That is, first the OS checked for interrupts for task 1, then task
2, and so on.
System did not meet its deadlines, yet was grossly
underutilized (about 50%).
FIRST DECISION
Management decided that the problem was
Ada.
Do you think they were right? (Assume that
they could have shortened each task by 10%
and that the tasks and times are those of the
three-task system given previously.)
CASE STUDY – SOLUTIONS
Switching from Ada probably would not have helped.
Consider using round-robin for the three task system given
before. If task T3 is allowed to run to completion, then it will
prevent task T1 from running for 100 time units (or 90 with
the time improvement). That is not fast enough.
Change scheduler to give priority to task with smallest period,
but tasks remain non-preemptable.
Helps, but not enough since the T3-T1 conflict would still
prevent T1 from completing.
Change tasks so longer tasks are preemptable.
This would solve the problem in combination with rate
monotonic priority assignment. (Show this.)
Motto: Look first at the scheduler.
PRIORITY INVERSION AND PRIORITY
INHERITANCE (T1 highest, T3 lowest)
[Timeline: T3 acquires the lock; T2 preempts T3 before T3 releases the
lock; T1 preempts and then waits for the lock.]
Definitions
• Priority inversion: a lower priority task T2
effectively blocks a higher one T1 because
T2 prevents T3 from releasing a lock.
• Priority inheritance: the idea that if T3
holds a lock and T1 waits for the lock, then
T3 should “inherit” the priority of T1 while
T3 holds the lock.
SPECIAL CONSIDERATIONS FOR
DISTRIBUTED SYSTEMS
Since communication is unpredictable, most distributed non-shared
memory real-time systems do no dynamic task allocation. Tasks
are pre-allocated to specific processors.
Example: oil refineries where each chemical process is controlled by a
separate computer.
Message exchange is limited to communicating data (e.g., in sensor
applications) or status (e.g. time out messages). Messages must
be prioritized and some messages should be datagrams.
Example: Command and control system has messages that take priority
over all other messages, e.g., “hostilities have begun.”
Special processor architectures are possible that implement a global
clock (hence require real-time clock synchronization) and
guaranteed message deliveries.
Example application: airplane control with a token-passing network.
OPEN PROBLEMS
Major open problem is to combine real-time
algorithms with other needs, e.g., high performance
network protocols and distributed database
technology.
• What is the place of carrier-sense detection circuits in
a real-time system?
– If exponential back-off is used, then no guarantee is possible.
(See text.)
– However, a tree-based conflict protocol, e.g., based on a site’s
identifier, can guarantee message transmission.
• How should deadlocks be handled in a real-time
transaction system?
– Aborting an arbitrary transaction is unacceptable.
– Aborting a low priority transaction may be acceptable.
Spatial Systems
• From a working paper by Gerard Lelann
for the European Space Agency.
“Problemes de communication et de
coordination dans les systemes spatiaux”
(problems of communication and coordination in space systems)
• Issues for space: distributed computing
(ground, space probe), delays, real-time
control, failures including transient ones.
Need for Research
• Most real-time work ignores failures
(Lelann’s own protocol for LANs is an
exception)
• Most fault-tolerant work ignores real-time
and assumes connected networks (but
Earth-Mars takes 8.5 to 40 minutes and
message drops are frequent).
• Distributed-systems work generally ignores real-time too.
Hardware Failures
• Alpha particles can cause multiple bit flips
and may vary (earth’s “suburbs” have
fewer such particles than interplanetary
space)
• Reed Solomon codes are used
Case Study: Mars Pathfinder
• Pathfinder lands on Mars.
• Silence.
• Ground control reboots.
• It transmits a while, then silence.
• This repeats over a few days.
• The mission is practically useless.
Case Study: Mars Pathfinder
• High priority task A and low priority task C share
semaphore X.
• When C holds X a task B of intermediate priority
enters but doesn’t need X.
• Net result: B runs ahead of C, so C never finishes (and never
releases X).
• Every 125 ms a task N enters. It is not supposed to see any other task
still running. When it sees C, it resets the system and awaits orders
from Earth.
• Priority inheritance would have fixed this.
Case Study: Mars Spirit
• Rover lands and also is silent.
• Workers on earth detect the problem so
they load some files.
• No transactional guarantee however, so
net result is only some of the files are
loaded.
• System in an inconsistent state.
• How would you do this in a wait-free
manner?
Case Study: Mars Spirit
• Send processes that write the replacements for files f1, f2, …, fk
into new files f1new, f2new, …, fknew. Then have a single process
rename those files en masse. Only the last step has to be
transactional.
• Transmission is easier than consensus.
COMPONENTS OF SECURITY
• Authentication – Proving that you are who
you say you are.
• Access Rights – Giving you the information
for which you have clearance.
• Integrity – Protecting information from
unauthorized modification.
• Prevention of Subversion – Guard against
Replay attacks, Trojan Horse attacks, Covert
Channel analysis attacks…
AUTHENTICATION AND ZERO
KNOWLEDGE PROOFS
The parable of the Amazing Sand Counter:
Person S makes the following claim:
• You fill a bucket with sand. I can tell, just by looking at it,
how many grains of sand there are. However, I won’t tell
you.
• You may test me, if you like, but I won’t answer any
question that will teach you anything about the number of
grains in the bucket.
• The test may include your asking me to leave the room.
What do you do?
SAND MAGIC
The Amazing Sand Counter claims to know how
many grains of sand there are in a bucket just
by looking at it.
How can you put him to the test?
AUTHENTICATING THE AMAZING
SAND COUNTER
Answer:
1. Tester tells S to leave the room.
2. Tester T removes a few grains from the bucket,
counts them, and keeps them in T’s pocket.
3. T asks S to return and say how many grains have
been removed.
4. T repeats until convinced or until T shows that S
lies.
Are there any problems left? Can the tester use
the Amazing Sand Counter’s knowledge to
masquerade as the Amazing Sand Counter?
MIGHT A COUNTERFEIT AMAZING
SAND COUNTER SUCCEED?
Can the tester use the Amazing Sand
Counter’s knowledge to masquerade as
the Amazing Sand Counter?
REPLAY ATTACKS AND TIME
Tester T can use a replay technique:
1. T claims to U that T is an Amazing Sand Counter.
2. U presents a bucket to T.
3. T removes the bucket and shows it to S, pretending to
engage S in yet another test.
4. U asks T to leave the room and removes some sand.
5. T returns, but asks for a little time.
6. T then shows the bucket to S.
7. S says how many grains have been removed.
8. T repeats what S says to U.
To prevent this, a site must distinguish replayed
messages from current ones, perhaps by
signatures.
Crypto puzzle 1: Fagin/Vardi
management dilemma
Sometimes one doesn’t need zero knowledge,
but just the answer to a yes/no question.
An employee E complains to boss B1 about
some person P1 and to boss B2 about P2.
B1 and B2 confer and want to determine whether
P1 = P2. If not, though, neither wants to reveal
its Pi, nor ask any more of E.
Is there a good non-computational technique to
figure this out? Assume the set of possible
people is just the list of people in E’s group
and this is known to both B1 and B2.
Crypto puzzle 1: Hints
Paper cups.
Pieces of paper.
Pencils.
Remember: We just need to know if the
same person is being complained about.
PUBLIC-KEY ENCRYPTION
Motivation: Eliminate need for shared keys when
keeping secrets.
The new idea is to put some information in a public
place.
Here are the details:
• Associated with each user u is an encryption function E_u
and a decryption function D_u.
• For all messages m, E_u(m) is unintelligible even knowing
E_u. However, D_u(E_u(m)) = E_u(D_u(m)) = m.
• E_u is public.
• Given E_u and E_u(m), it is computationally infeasible to figure out D_u.
USING PUBLIC-KEY ENCRYPTION
• To send a message m to u that only u can read:
Send E_u(m)
• For user t to sign and send a message m to u, t
can:
Send D_t(E_u(m)).
The receiver u can prove that t sent this
message, because only t knows D_t.
Further, only u (and t) can read the message.
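A toy numeric illustration of the E_u / D_u properties (textbook RSA with tiny, well-known example primes; real keys are enormous and use padding, so this only shows that E and D commute):

```python
p, q = 61, 53
n = p * q                       # 3233; phi = 60 * 52 = 3120
e = 17                          # public exponent: part of E_u
d = 2753                        # private exponent: D_u (17 * 2753 = 1 mod 3120)

def E(m):                       # public: anyone can compute this
    return pow(m, e, n)

def D(c):                       # private: only u knows d
    return pow(c, d, n)
```

Because D(E(m)) = E(D(m)) = m, the same key pair supports both secrecy (send E_u(m)) and signatures (send D_t applied to the message).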
SSL (Secure Socket Layer)
• Effect: client knows it is talking to a certain
server; client remains anonymous to
server; communication is secure.
• Useful for purchases, even anonymous
ones.
SSL Method -- simplified
• Client uses the server's well-known public key
to encrypt its request to communicate, so
nobody can eavesdrop. The client also
generates a private session key and sends it
encrypted the same way.
• Server decrypts the session key with its private
key, and communication then proceeds under
that session key.
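The simplified handshake can be sketched as follows, with toy RSA standing in for the server's real certificate and a SHA-256-derived keystream standing in for the real symmetric cipher:

```python
# Hedged sketch: toy RSA server key (tiny primes; a real server presents
# a certificate) plus a SHA-256 keystream as the symmetric cipher.
import hashlib, secrets

p, q, e = 61, 53, 17
n = p * q
d = pow(e, -1, (p - 1) * (q - 1))          # server's private exponent

def stream_xor(key: bytes, data: bytes) -> bytes:
    # Derive a keystream from the session key; XOR both encrypts and decrypts.
    ks = hashlib.sha256(key).digest()
    return bytes(b ^ ks[i % len(ks)] for i, b in enumerate(data))

# Client: generate a session key and send it under the server's public key.
session_key = secrets.randbelow(n - 2) + 2
hello = pow(session_key, e, n)             # E_server(session_key)

# Server: recover the session key with its private key.
server_key = pow(hello, d, n)
assert server_key == session_key

# Both sides now encrypt traffic under the shared session key.
key_bytes = session_key.to_bytes(4, "big")
ciphertext = stream_xor(key_bytes, b"GET /cart")
assert stream_xor(key_bytes, ciphertext) == b"GET /cart"
```

Note the asymmetry: the server never learns who the client is, which is why anonymous purchases work.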
One Way Functions
• Given a function like triple, if I tell you that 15
has been produced by triple(x), you can infer
that x is 5.
• By contrast, given certain hash functions, if I tell
you h(x) it is very hard to infer x. Such (hard to
find the inverse) functions are called one-way.
• Ex: Given x, compute x² and take the middle 20
digits. Given those digits, it is hard to find x.
• There is a standard one-way hash function
called SHA-1.
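Both examples are easy to try with Python's hashlib; the middle-digits function is the slide's toy illustration, not a secure hash:

```python
# The slide's toy one-way function: square x and keep the middle 20 digits,
# alongside SHA-1 from Python's standard hashlib.
import hashlib

def middle_digits(x: int, k: int = 20) -> str:
    s = str(x * x)
    start = max((len(s) - k) // 2, 0)
    return s[start:start + k]

x = 123456789012345
digits = middle_digits(x)       # easy to compute forward...
assert len(digits) == 20        # ...but recovering x from it requires search

h = hashlib.sha1(b"my password").hexdigest()
assert len(h) == 40             # 160 bits as hex; inverting is infeasible
```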
One Way Functions for Spies
• You have a bunch of spies. They are good
people, very trustworthy.
• They go into enemy territory. When they return
they must say something to the guards so they
don’t get shot. For security, passwords can’t be
reused.
• Also guards might go into a bar and be tempted
to reveal secrets.
• Could we use one way functions in some way?
Should I get a file from this site?
• Suppose you want to download some file (say a
special kind of player) from a web site.
• You want to be sure that web site is to be
trusted.
• Otherwise, you might be getting a “Trojan horse”
(something that looks good but can do you
harm).
• You believe that trusted entities will keep their
private keys secret.
Some Performance Issues
(from Radu Sion)
• Pentium 4, 3.6 GHz, 1 GB of RAM
• RSA sign: 261/sec; RSA verify: 5324/sec
• Private key (3DES): 26 MB/sec
• Collision-free hashing: 200 MB/sec
Secure File System Protocol
(David Mazieres)
• Basically, the downloader uses an SSL
interaction with the server that downloader
trusts.
• That is, downloader encrypts a session
private key using that server’s public key.
• So, nobody else can know what
downloader requested.
• Not yet solved: how does downloader
know this server is validated by author?
Secure File System Protocol
key insight
• Each file name is of the form:
/sfs/nyu/cs5349874628/dennis/foobar
The funny number is the SHA-1 hash of the
public key of the proper server.
So, this is the proper server for the file iff the
public key of the server Pk when hashed
gives cs5349874628.
Downloader checks this before sending the
SSL message which will use Pk.
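The check can be sketched in a few lines; the path layout and key bytes below are simplified stand-ins for the real SFS formats:

```python
# Hedged sketch: illustrative path format, not the actual SFS encoding.
import hashlib

def host_id(public_key: bytes, digits: int = 10) -> str:
    # SFS-style "funny number": a prefix of SHA-1(public key), here in hex.
    return hashlib.sha1(public_key).hexdigest()[:digits]

def path_matches(path: str, public_key: bytes) -> bool:
    # /sfs/<site>/<host_id>/... : the component after the site name
    # must equal the hash of the server's public key.
    parts = path.split("/")
    return len(parts) > 3 and parts[3] == host_id(public_key)

server_pk = b"-----toy server public key-----"
path = f"/sfs/nyu/{host_id(server_pk)}/dennis/foobar"
assert path_matches(path, server_pk)              # proper server
assert not path_matches(path, b"some other key")  # wrong key fails the check
```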
Secure File System Protocol
questions
• What if you lie about your public key?
• How do I discover these strange names?
Secure File System Protocol
answers
• What if you lie about your public key?
– You can, but then you won’t have the
associated private key, so you won’t be able to
respond properly to my SSL request.
• How do I discover these strange names?
– There does have to be one site you trust
which has names you believe.
Secure File System Protocol
summary
• Downloader goes to trusted name server site.
• Gets name (including funny encrypted part) and
maybe other information, e.g. hash of the
contents.
• Goes to a server that allegedly holds that file. If
server’s public key when hashed equals the
funny part of the name, then that server is
legitimate.
• So, downloader engages that server in SSL and
downloads the file confidentially.
Spy Border puzzle (McCarthy/Rabin)
• Spies go across border.
• When they return, they don’t want to be
shot so they want to give a password.
• Spies are clever and professional but
border guards have been known to have
loose tongues, so conventional password
might get leaked.
• What to do? (Hint: one-way functions)
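One standard answer, sketched here with SHA-1, is Lamport's hash chain: the spy secretly picks x0 and gives the guards h^N(x0); on trip i the spy reveals h^(N-i)(x0), which the guards hash once and compare. A leaked password is good for at most one crossing, because going further back in the chain would require inverting h.

```python
# A hedged sketch of Lamport's hash chain over SHA-1.
import hashlib

def h(x: bytes) -> bytes:
    return hashlib.sha1(x).digest()

def chain(x: bytes, n: int) -> bytes:
    for _ in range(n):
        x = h(x)
    return x

N = 100
x0 = b"spy's secret seed"
guard_state = chain(x0, N)            # guards are given h^N(x0) up front

for trip in range(1, 4):
    password = chain(x0, N - trip)    # spy reveals h^(N-trip)(x0)
    assert h(password) == guard_state # guards hash once and compare
    guard_state = password            # then store the revealed value
```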
Courier Problem
• Capture a courier and you may change history
(e.g., Hannibal and his brother Mago).
• So want to send n couriers with parts of a
message. If majority of couriers arrive, then can
reconstruct message. Any minority reveal
nothing.
• How to do this?
(Hint: start with three couriers.
Hint 2: polynomial of degree 2 can be
determined by three points.)
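The polynomial hint is Shamir secret sharing: a degree-(t-1) polynomial over a prime field is determined by any t points, so put the message in the constant term and give each courier one point. A sketch with three couriers, any two of whom suffice:

```python
# Hedged sketch of Shamir secret sharing over a prime field.
import random

P = 2**127 - 1                        # a Mersenne prime, large enough here

def split(secret: int, n: int = 3, t: int = 2):
    # Random polynomial of degree t-1 with the secret as constant term.
    coeffs = [secret] + [random.randrange(P) for _ in range(t - 1)]
    def f(x):
        return sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P
    return [(x, f(x)) for x in range(1, n + 1)]   # one share per courier

def combine(shares):
    # Lagrange interpolation at x = 0 recovers the constant term.
    secret = 0
    for xi, yi in shares:
        num, den = 1, 1
        for xj, _ in shares:
            if xj != xi:
                num = num * (P - xj) % P
                den = den * (xi - xj) % P
        secret = (secret + yi * num * pow(den, -1, P)) % P
    return secret

shares = split(31337, n=3, t=2)
assert combine(shares[:2]) == 31337    # any majority (two couriers) suffices
assert combine(shares[1:]) == 31337
```

Fewer than t shares are consistent with every possible secret, so a captured minority reveals nothing.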
Assuring honesty in SETI-at-home
• I want a site to execute a function f on x1,
x2, …, xn.
• How do I know it’s doing so?
• Possibility: Secretly compute some of the
values yourself, e.g. f(xi) and f(xj), and check
the site’s answers for i and j against your own.
• Still some possibility of cheating, e.g. once
those two indices are found, could give
guesses for other f(xk)s.
Better Solution
• Also send some value y such that for NO j
is it the case that y = f(xj). (Don’t tell of
course)
• Now the function producer must state that
y is not the result of f on any input.
• Function producer could guess this, but
then risks getting caught.
• Ref: Radu Sion
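Both checks can be sketched together; f, the inputs, and the ringer value below are toy stand-ins chosen for illustration:

```python
# Hedged sketch: spot checks plus a "ringer" value y that is not f of any input.
import random

def f(x: int) -> int:
    return x * x + 1              # the outsourced function (illustrative)

xs = list(range(1, 21))
spot = random.sample(range(len(xs)), 2)        # secretly checked indices
expected = {i: f(xs[i]) for i in spot}         # precomputed spot checks
ringer = -1                                    # never equals f(x) for these xs

def honest_worker(inputs):
    return [f(x) for x in inputs]

answers = honest_worker(xs)
assert all(answers[i] == v for i, v in expected.items())  # spot checks pass
assert ringer not in answers   # worker can truthfully report y is absent
```

A cheater who fabricates outputs risks both failing a spot check and wrongly claiming (or denying) the ringer.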