Fault Tolerance in an
Event Rule Framework
for Distributed Systems
Hillary Caituiro Monge
Contents

1. Introduction
2. Related Works
3. Overview of the Event Rule Framework (ERF)
4. Overview of the Fault Tolerant CORBA
5. Design of the Fault Tolerant ERF (FT-ERF)
6. Performance Analysis
7. Conclusions
1. Introduction

Justification
- Distributed Systems (DS)
- Fault Tolerance (FT)
- Reactive Components (RC)
- The Event Rule Framework (ERF)
- Motivation

Objectives
Distributed Systems (DS)

A DS is a collection of software components distributed among
processors of heterogeneous platforms (e.g., PDAs, workstations,
laptops, servers, mainframes).

The purposes of a DS are:
- Sharing resources and workload, and
- Maximizing availability.

The design goals of DSs are:
- Transparency,
- Scalability,
- Reliability, and
- Performance.
Fault Tolerance (FT)
FT is the ability of a system to continue
operating as expected, despite internal or
external failures.
 DSs are prone to failures.

 Some
faults can be detected.
 Some others cannot be detected.

FT of a DS can be improved through the
redundancy, i.e. replication of its hardware
or software components.
5
Reactive Components (RC)

Reactive Components:
- React to external stimuli (i.e., events).
- Initiate actions.

An RC can be:
- Asynchronous or synchronous.
- Non-deterministic or deterministic.

A reactive component that is both asynchronous and non-deterministic
is an ANDRC.
The Event Rule Framework (ERF)

An example of a DS framework having ANDRCs is ERF (Event/Rule
Framework).

ERF:
- Developed at the Center for Computing Research and Development of
  the University of Puerto Rico – Mayagüez Campus.
- It is an event-rule framework for developing distributed systems.

In ERF, events and rules are used as abstractions for specifying
system behavior.
Motivation (1/2)

Achieving fault tolerance in ANDRCs is a challenge.

In non-deterministic components:
- The output could be different, even if the same sequence of
  stimuli is input with the same initial state.

Since the component is asynchronous:
- Timing assumptions are not valid.

Moreover, the behavior of ANDRCs fulfills Heisenberg's uncertainty
principle.
Motivation (2/2)

Existing fault-tolerance techniques:
- Failure detectors
  - Timing assumptions
  - Synchronous or semi-synchronous systems
- State transfer protocols
  - Deterministic systems
  - Very intrusive
- Duplicate detection and suppression mechanisms
  - Sequencers
Objectives (1/2)

This research is about the use of active and semi-active replication
techniques for achieving fault tolerance in ERF, a framework that
uses ANDRCs.

Active replication technique:
- All replicated components accept third-party incoming events.
- A middle-tier component is in charge of:
  - Event multicasting.
  - Detecting and suppressing duplicated events.
Objectives (2/2)

Semi-active replication technique:
- All replicated components accept third-party incoming events.
- Only one ("the leader") is able to post events.
- Backup replicas listen to the leader to make a consistent
  production of events.
- Each replicated component is in charge of the detection and
  suppression of duplicated events.
2. Related Works

- Generic support of FT in DSs
- FT Event-Based DSs
Generic support of FT in DSs

Comparison against the OMG Fault-Tolerant CORBA Standard:

                             IRL                    Eternal         OGS             SENSEI
  OMG FT-CORBA compliant     Yes                    Yes             Soon            No
  Design style               Non-intrusive          Non-intrusive   Non-intrusive   Non-intrusive
  Interoperability           Free                   Expensive       Free            Free
  Replication logic          Centralized with       Group toolkit   Group toolkit   Group toolkit
  implementation             passive replication
  Asynchronism support       Yes                    No              No              No
  Non-determinism support    No                     No              No              No
FT Event-Based DSs

In the compared event-based DSs (NODS, ISEE, YEAST, and Framework
RS2.7), fault tolerance is either not directly supported or is
provided through external tools (Watchd and Libft).
3. Overview of the Event Rule Framework (ERF)

Model
- Event Model
- Rule Model
- Behavioral Model

Components
- Event Channel
- RUBIES

Architecture of ERF-CORBA
Model

Event Model
- ERF provides the event abstraction to represent significant
  occurrences in a distributed system (e.g., a flood alert system).
- The base class Event defines the structure and behavior applicable
  to all types of events.

package erf;

import erf.lang.*;
import java.io.Serializable;

public class Event implements Serializable
{
    /* Attributes */
    public String id = "";
    public TimeValue ttl;
    public TimeValue daytime;
    public DistributedObject producer;

    /* Methods */
    public TimeValue t() {...}
    public TimeValue ts() {...}
    public TimeValue ttl() {...}
    public void setttl(long tv) {...}
    public DistributedObject getProducer() {...}
    public void setProducer(DistributedObject producer) {...}
    public void sett(long tv) {...}
    public String pName() {...}
    public boolean isDead() {...}
    public String getTypeName() {...}
}

Figure 3.2 Java definition of the class Event
Model

Rule Model
- In ERF, the behavior of a DS is defined in terms of rules.
- A rule is an algorithm that is triggered when events in the event
  set match a rule's event pattern.

[package <package_specification>]
rule <rule_id>
[priority <priority_number>]
on <trigger_events>
[use <usage_specification>]
[if <condition> then <actions> [else <alternative_actions>]]
[do <unconditional_actions>]

Figure 3.5 Syntax of rule definition language (RDL)
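To make the grammar concrete, here is a hypothetical rule written in the RDL syntax above. The package, rule name, event types, and condition are all invented for a flood-alert scenario; only the keyword structure comes from Figure 3.5.

```
package flood.alerts
rule RaiseFloodAlert
priority 1
on GageLevelReport g1, GageLevelReport g2
if g1.level > g2.level then post(FloodAlert)
```

Each keyword in square brackets in the grammar (package, priority, the if/then part) is optional; on and the rule id are the mandatory core.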
Model

Behavioral Model
- Defines how rules are triggered and evaluated upon the occurrence
  of events.
- Evaluation of rules needs to be made periodically because RUBIES
  receives events constantly.
- The evaluation of rules is performed based on rule priority.
Components (1/2)

Event Channel
- Is a middleware distributed component.
- It allows sending events to consumers.
- It allows receiving events from producers.
- Events are treated as objects.
Components (2/2)

Rule Based Intelligent Event Service (RUBIES)
- Is the main component of ERF.
- It is an engine that handles events through the evaluation of
  rules.
- RUBIES is a distributed component.
- It is registered to the event channel both as a consumer and as a
  producer.
Architecture of ERF-CORBA

Figure 3.8 Architecture of ERF-CORBA (classes: RUBIES, RuleCompiler,
StructuredEvent, EventChannel, CORBAEventChannel,
StructuredPushConsumer, StructuredPushSupplier,
StructuredProxyPushConsumer, StructuredProxyPushSupplier)
4. Overview of the Fault Tolerant CORBA (FT-CORBA)

Fault Tolerant CORBA (FT-CORBA)
- Replication Management
- Fault Management
- Logging and Recovery Management
Fault Tolerant CORBA

- Adopted by OMG in 2000.
- Commitments rather than a single solution.
- Full interoperability among different products.
- It provides support for applications that require high levels of
  reliability, with minimal modifications.

This research was designed to be compliant with this standard.
Replication Management

- Replication management covers a Fault Tolerant Domain.
- It is done through the Replication Manager component, which
  inherits from the Property Manager, Object Group Manager, and
  Generic Factory components.

Figure 4.3 Hierarchy of the Replication Management
(ReplicationManager inherits from PropertyManager,
ObjectGroupManager, and GenericFactory)
Fault Management

- It includes the Fault Notification, Fault Detection, and Fault
  Analysis services.
- The Fault Notifier sends reports to its consumers.
- The Fault Detectors are connected to replicas or hosts and report
  faults to the Fault Notifier.
- The Fault Analyzer analyzes faults and produces reports to the
  Fault Notifier.

Figure 4.8 Architecture of Fault Management (FaultNotifier,
FaultDetector, FaultAnalyzer, StructuredPushConsumer,
SequencePushConsumer, PullMonitorable)
Logging and Recovery Management

Logging mechanism
- Logs the state of the primary member.

Recovery mechanism
- Acts on failures or for new members.
- Recovers from the log to the new primary.

Consistency must be controlled by the infrastructure.
5. Design of the Fault Tolerant ERF (FT-ERF)

- Scalability and Fault Tolerance Problems in ERF-CORBA
- Architecture of Scalable and Fault Tolerant ERF
- Architecture of Fault-Tolerant ERF-CORBA
- EID Uniqueness
- Event and Pattern Equality Rules
- Pattern Management
- Active Replication
- Semi-Active Replication
Scalability and Fault Tolerance Problems in ERF-CORBA

Figure 5.1 Two possible points of scalability and fault-tolerance
problems in ERF: (a) the size of the rules database; (b) a crash of
RUBIES.
Architecture of Scalable and Fault Tolerant ERF

RUBIES instances are arranged along two axes: a distribution
dimension (N rule-base partitions δ1...δN) and a replication
dimension (M replicas per partition), giving replicas (γij, δi) for
i = 1..N, j = 1..M.

Figure 5.3 Architecture of Scalable and fault-tolerant ERF
Architecture of Fault-Tolerant ERF-CORBA

Figure 5.3 Architecture of FT ERF-CORBA (classes: FTRUBIES,
FTRUBIESServant, FTRUBIESIntOperations, RUBIES, RuleCompiler,
ReplicationManager, PropertyManager, ObjectGroupManager, ObjectGroup,
GenericFactory, FaultDetector, FaultNotifier, PullMonitor,
Monitorable, PullMonitorable, Updateable, ServerRequestInterceptor,
ClientRequestInterceptor, ServerIOGRSupport, ClientIOGRSupport,
CORBAEventChannel)
EID Uniqueness (1/2)

- Each event in the system needs to be uniquely identified by an
  event identifier (EID).
- EID uniqueness must be guaranteed in different contexts: local,
  replication group, and system.
- The use of sequencers is an option to achieve EID uniqueness:
  - Each replica starts a sequencer.
  - But this is only valid with deterministic components.
EID Uniqueness (2/2)

- Events can be identified by their history.
  - Each event is produced due to an event pattern.
- Such history includes:
  - The list of previous events that triggered the event.
  - The function or rule that caused its production.

Figure 5.5 Conceptual View of the Event Unique Identification
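A minimal sketch of this idea, deriving an identifier from an event's history (the producing rule, its production order, and the EIDs of the triggering events) instead of from a sequencer. The class and method names are assumptions for illustration; they are not part of ERF.

```java
import java.util.List;

// Hypothetical sketch: a history-derived EID. Two replicas that fire the
// same rule on the same triggering events compute the same identifier,
// with no shared sequencer and no determinism assumption.
final class HistoryEid {
    // EID = producing rule + production order + EIDs of triggering events.
    static String eid(String rule, int order, List<String> triggeringEids) {
        return rule + "#" + order + "(" + String.join(",", triggeringEids) + ")";
    }
}
```

Because the identifier is a pure function of the history, it is stable across replicas, which is what the equality rules on the next slides rely on.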
Event Equality Rule

Two events are equal if:
- Both are of the same Type.
- Both were produced due to the same Rule.
- Both have the same Order of production in the time when the Rule
  was triggered.
- Both have the same Pattern.
Pattern Equality Rule

Two event patterns are equal if:
- Both have the same number of events.
- Both have events in the same order.
- Two events at the same position are equal if the Event Equality
  Rule holds, as previously defined.
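The two equality rules are mutually recursive: event equality compares patterns, and pattern equality compares events pairwise. A self-contained sketch of both checks (the `Ev` class and field names are invented for illustration; ERF's real Event class was shown in Figure 3.2):

```java
import java.util.List;

// Hypothetical sketch of the Event and Pattern Equality Rules.
final class Ev {
    final String type;      // event Type
    final String rule;      // Rule that produced the event
    final int order;        // Order of production when the rule fired
    final List<Ev> pattern; // triggering events (empty for external events)

    Ev(String type, String rule, int order, List<Ev> pattern) {
        this.type = type; this.rule = rule; this.order = order; this.pattern = pattern;
    }

    // Event Equality Rule: same type, same rule, same order, equal patterns.
    static boolean eventsEqual(Ev a, Ev b) {
        return a.type.equals(b.type)
            && a.rule.equals(b.rule)
            && a.order == b.order
            && patternsEqual(a.pattern, b.pattern);
    }

    // Pattern Equality Rule: same length, pairwise-equal events in order.
    static boolean patternsEqual(List<Ev> p, List<Ev> q) {
        if (p.size() != q.size()) return false;
        for (int i = 0; i < p.size(); i++)
            if (!eventsEqual(p.get(i), q.get(i))) return false;
        return true;
    }
}
```

The recursion bottoms out at externally produced events, whose pattern is empty, so the comparison always terminates.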
Pattern Management (1/2)

Rules use a pattern management framework to prevent events being
triggered more than once for a given event pattern.

In this framework, patterns are defined in terms of:
- Source events (i.e., events that cause rules to trigger), and
- Target events (i.e., events that are produced by rules).
Pattern Management (2/2)

The framework has three main components for pattern management:
- Pattern Manager to manage patterns of events.
- Pattern to store patterns of events.
- Indexer to organize patterns of events.

Figure 5.6 Architecture of Pattern Management (Rule, Event, Pattern,
PatternManager, Indexer; each Pattern has source events and target
events)
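A minimal sketch of the "trigger at most once per pattern" idea: before firing, a rule offers its (rule, source-event pattern) pair to the manager, which indexes pairs it has already seen. The class shape and key encoding are assumptions; ERF's actual PatternManager/Indexer split is shown in Figure 5.6.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of pattern management: a rule consults the manager
// before producing target events, so a given source-event pattern can
// trigger it at most once.
final class PatternManager {
    private final Set<String> seen = new HashSet<>(); // acts as the Indexer

    // Index a pattern by rule id plus the EIDs of its source events.
    private static String key(String ruleId, List<String> sourceEventIds) {
        return ruleId + "|" + String.join(",", sourceEventIds);
    }

    // Returns true only the first time this (rule, pattern) pair is offered;
    // repeated offers are rejected, suppressing duplicate triggering.
    boolean tryRegister(String ruleId, List<String> sourceEventIds) {
        return seen.add(key(ruleId, sourceEventIds));
    }
}
```
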
Active Replication (AR)

- For systems with tight time constraints.
- All replicas are running at the same time:
  - All are accepting events.
  - All are sending events.
- So, duplicated events circulate in the system.
- Therefore, it is crucial:
  - To detect and suppress duplicated events.
  - To deliver a unique reply.
  - To keep consistency.
  - To keep fault tolerance transparent.
AR: Pattern Naming

- For duplicated-events detection and suppression.
- It is a centralized mid-tier component that, through an analysis of
  an event's history, detects if the event has already been
  delivered.
- It relies on two primitives:
  - Event binding: register an event.
  - Pattern solving: resolve whether an equivalent event was already
    delivered.
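The two primitives can be sketched as a map from a history-derived pattern name to the id of the event that was delivered under that name. This is an illustrative reduction (the names `bind`/`resolve` and the map-based storage are assumptions), not the actual Pattern Naming implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the two pattern-naming primitives: "bind" registers
// a delivered event under its pattern name; "resolve" answers whether an
// equivalent event was already delivered, returning its id if so.
final class PatternNaming {
    private final Map<String, String> delivered = new HashMap<>();

    // Event binding: first binding wins; later equivalents are duplicates.
    void bind(String patternName, String eventId) {
        delivered.putIfAbsent(patternName, eventId);
    }

    // Pattern solving: id of the already-delivered equivalent, or null.
    String resolve(String patternName) {
        return delivered.get(patternName);
    }
}
```

A replica's event is forwarded only when `resolve` returns null; otherwise it is suppressed and the previously delivered id is reused, which is how all replicas converge on a unique reply.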
AR: Pattern Naming

Figure 5.9 Architecture of the Pattern Naming (PatternContext,
PatternName, EventHandler, EventChannel, CORBAEventChannel, FTRUBIES)
Semi-Active Replication (SAR)

- For systems with relatively loose time constraints.
- All replicas are running at the same time.
- Only the primary is able to reply to clients:
  - When the primary fails, a new member is selected.
  - When a backup member fails, it is released from the group.
- Failure detectors are used to detect failures in group members:
  - There is a time delay (seconds) before the selection of a new
    primary.
SAR: Production Controller

- For duplicated-events detection and suppression.
- It is distributed within each replica.
- The following algorithm is executed on backup members:

On incoming event P from the primary:
    If queue BQ holds an equivalent event B for the event P then
        Update B.id with P.id across the entire system
        Remove P
    Else
        Enqueue P in PQ

On produced event B from the backup:
    If queue PQ holds an equivalent event P for the event B then
        Update B.id with P.id across the entire system
        Remove P
    Else
        Enqueue B in BQ

On failure, if the backup replica is elected as new primary:
    Post all events of the queue BQ
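The algorithm above can be sketched in Java. Events are reduced here to an equivalence key (standing in for the Event/Pattern Equality Rules) plus a mutable id; the class layout and names are assumptions for illustration, not the thesis implementation:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;

// Hypothetical sketch of the backup-side production controller.
final class ProductionController {
    static final class Ev {
        String id;        // globally visible event id (may be overwritten)
        final String key; // equivalence key; equal keys = equivalent events
        Ev(String id, String key) { this.id = id; this.key = key; }
    }

    final Deque<Ev> bq = new ArrayDeque<>(); // BQ: events produced locally
    final Deque<Ev> pq = new ArrayDeque<>(); // PQ: events from the primary

    private static Ev take(Deque<Ev> q, String key) {
        for (Iterator<Ev> it = q.iterator(); it.hasNext(); ) {
            Ev e = it.next();
            if (e.key.equals(key)) { it.remove(); return e; }
        }
        return null;
    }

    // On incoming event P from the primary.
    void onPrimaryEvent(Ev p) {
        Ev b = take(bq, p.key);
        if (b != null) b.id = p.id; // update B.id with P.id, remove P
        else pq.add(p);             // no local match yet: enqueue P in PQ
    }

    // On event B produced locally by the backup.
    void onBackupEvent(Ev b) {
        Ev p = take(pq, b.key);
        if (p != null) b.id = p.id; // update B.id with P.id, remove P
        else bq.add(b);             // primary has not sent it yet: hold B
    }

    // On failure, if elected as new primary: post all events still in BQ.
    Deque<Ev> onElectedPrimary() { return bq; }
}
```

Matching in either direction (primary-first or backup-first) drains both queues, so on election BQ holds exactly the locally produced events the failed primary never confirmed.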
SAR: Production Controller

Figure 5.14 Architecture of the Production Controller
(ProductionController, Updater, Event, Pattern, Rule, PatternManager,
FaultNotifier, EventChannel, CORBAEventChannel, FTRUBIES)
6. Performance Analysis

- Objectives
- Methodology
  - Test Scenarios
  - Test Procedure
- Test Results
Objectives

Measure the execution time of fault-tolerant ERF using active and
semi-active replication techniques for:
- An increasing number of replicas.
- An increasing number of failures.
- An increasing workload.

Compare the execution time of:
- Active versus semi-active replication techniques.
- Failure-free versus failure execution scenarios.
- Fault-tolerant versus non-fault-tolerant execution.
Test Scenarios: Services distribution

Figure 7.1 UML deployment diagram of the test environment: ten
numbered hosts (parcha, melon, toronja, quenepa, pajuil, acerola,
guineo, chironja, quayaba, jobo) running factories and fault-tolerant
RUBIES replicas, together with the name server, replication manager,
fault detectors, CORBA event channels, HTTP server, and the test
application on china, ece, sorelsvr1, and adaselsvr1. (The domain for
all computers is ece.uprm.edu.)
Test Scenarios: Failure schedule: First scenario

- Six workstations.
- 3 to 8 replicas.
- 193 rules.
- Failure schedule defined by power set F, where:
  - n is the number of replicas
  - f(p=n) = ∞
  - f(p=1...n-1) = p*T/n determines the time of the failure
    - p is the position of the replica in the subset
    - T is the arithmetic average of the execution time of ten
      failure-free runs with n replicas.
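The schedule function is easy to make concrete: each replica p in 1..n-1 is killed at time p*T/n, and the last replica (p = n) never fails. A small sketch of the computation (class and method names are invented; the formula is the one above):

```java
// Hypothetical sketch of the failure schedule f for n replicas:
// f(p=n) = ∞ (the last replica survives), f(p=1..n-1) = p*T/n.
final class FailureSchedule {
    static double failureTime(int p, int n, double t) {
        if (p == n) return Double.POSITIVE_INFINITY;
        return p * t / n; // failures are spread evenly over [0, T)
    }
}
```

For example, with n = 6 replicas and T = 60 000 ms, replicas 1..5 fail every 10 seconds and replica 6 runs to completion.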
Test Scenarios: Failure schedule: Second scenario

- Ten workstations.
- Ten replicas.
- 193 rules.
- Failure schedule defined by set G, where:
  - n is the number of replicas
  - g(p=n) = ∞
  - g(p=1...n-1) = p*T/n determines the time of the failure
    - p is the position of the replica in the subset
    - T is the arithmetic average of the execution time of ten
      failure-free runs with n replicas.
Test Scenarios: Failure schedule: Third scenario

- Ten workstations.
- Ten replicas.
- Six rule sets of 6, 12, 24, 48, 96, and 193 rules each time.
- The failure schedule was given by the function G(n) defined for the
  second scenario.
Test Scenarios: Test application

- Client consumer/producer of the event channel.
- It starts by sending two events of the GageLevelReport type to
  start the test.
- It ends its execution when an event of the TestEventEnd type
  arrives.
- Measures the execution time:
  - Starting just after the second event is posted, and
  - Ending just after an event of the TestEventEnd type arrives.
Methodology: Test Procedure

The procedure consisted of three major steps:
- Clear the environment;
- Launch the infrastructure; and
- Run the test application.

The results are the arithmetic mean of 10 runs on each test case.
The arithmetic mean of the standard deviation was 1.46%.
Figure 7.4 Cost of fault tolerant ERF execution using active
replication technique with increasing number of replicas (3 to 8) and
number of failures (0 to 7). (Execution time in milliseconds, 0 to
140 000.)
Figure 7.5 Cost of fault tolerant ERF execution using semi-active
replication technique with increasing number of replicas (3 to 8) and
number of failures (0 to 7). (Execution time in milliseconds, 0 to
140 000.)
Figure 7.6 Impact of (n-1)-failures execution over failure-free
execution in the active replication technique, for 3 to 10 replicas.
(Execution time in milliseconds, 0 to 160 000.)
Figure 7.7 Impact of (n-1)-failures execution over failure-free
execution in the semi-active replication technique, for 3 to 10
replicas. (Execution time in milliseconds, 0 to 160 000.)
Figure 7.8 Comparison between active and semi-active replication
techniques on the failure-free scenario, for 3 to 10 replicas.
(Execution time in milliseconds, 0 to 160 000.)
Figure 7.9 Comparison between active and semi-active replication
techniques on the nine-failures scenario, for 3 to 10 replicas.
(Execution time in milliseconds, 0 to 160 000.)
Figure 7.10 Workload impact (0 to 3000 events) on the time of the
failure-free execution and of the nine-failures execution using the
active replication technique; overhead with respect to the
non-fault-tolerant execution and to the pure RUBIES execution.
Figure 7.11 Workload impact (0 to 3000 events) on the time of the
failure-free execution and of the nine-failures execution using the
semi-active replication technique; overhead with respect to the
non-fault-tolerant execution and to the pure RUBIES execution.
7. Conclusions

The performance results of the implementation of the active and
semi-active replication techniques show linear time curves for an
increasing number of replicas, failures, and workload.

Therefore:
- The proposed solutions were proved to be feasible, and
- Their performance results were proved to be acceptable.

Additionally:
- The active replication technique has better overall performance
  than the semi-active replication technique.
Research Contributions

Active Replication for Asynchronous, Non-deterministic Reactive
Components
- Tight time constraints.
- The replication logic is in a centralized component.
- Advantages:
  - It is not significantly affected by an increasing number of
    replicas, failures, or workload.
  - Both clients and replicas need not be aware of the replication
    mechanism.
  - It can be used for large distributed systems.
- Disadvantage:
  - Relies on a centralized component.
Research Contributions

Semi-active Replication for Asynchronous, Non-deterministic Reactive
Components
- Loose time constraints.
- The replication logic is distributed.
- Advantages:
  - The replication mechanism is distributed.
- Disadvantages:
  - Relies on failure detectors.
Questions and/or Comments?