Networked computer systems are prevalent in most aspects of modern society, and we have become dependent on such systems to perform many critical tasks. Making these systems dependable is therefore an important goal. However, dependability issues are often neglected during system development because of the complexity of the techniques involved.
A common technique for improving the dependability characteristics of a system is to replicate its critical components, so that the functions they perform are repeated by multiple replicas. Replicas are often distributed geographically and connected through a network, so that the failure of one replica is independent of the others. However, the network is itself a potential source of failures: nodes can become temporarily disconnected from each other, which introduces an array of new problems.
The majority of previous projects have focused on providing middleware libraries aimed at simplifying the development of dependable distributed systems, whereas the pivotal deployment and operational aspects of such systems have received very little attention. This thesis extends previous work and emphasizes the deployment and operational aspects, where the gain in terms of improved dependability is likely to be the greatest.
The main contribution of this dissertation is an architecture for autonomous replication management, aimed at improving the dependability characteristics of systems through a self-managed fault treatment mechanism that adapts to network dynamics and changing requirements. Consequently, the architecture also improves the deployment and operational aspects of systems and reduces the need for human interaction. The architecture has been implemented as a proof-of-concept prototype by extending the Jgroup object group system.
In addition, this work includes several supporting contributions: (i) an architecture for dynamic protocol composition that avoids the delays of event processing in intermediate layers of a strictly vertical protocol stack; (ii) adaptive protocol selection on a per-method/invocation basis, made possible by annotating server methods with the replication protocol to be used; (iii) client-side membership handling, aimed at improving the load balancing and failover properties of systems when exposed to failures; (iv) online upgrade management of operational services, implemented as an extension to the replication management architecture.
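As an illustration of contribution (ii), the following minimal Java sketch shows how a replication protocol could be attached to individual server methods with an annotation and looked up at invocation time. It is a sketch under stated assumptions, not the Jgroup API: the annotation `Replication`, the `Protocol` enum and its values, the `AccountService` interface, and the helper `protocolFor` are all hypothetical names introduced here for illustration.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

public class ProtocolSelectionSketch {

    /** Hypothetical protocol names; assumptions, not taken from Jgroup. */
    enum Protocol { ANYCAST, MULTICAST, ATOMIC }

    /** Hypothetical annotation naming the protocol for one server method. */
    @Retention(RetentionPolicy.RUNTIME) // must be visible to reflection at run time
    @Target(ElementType.METHOD)
    @interface Replication {
        Protocol value();
    }

    /** Hypothetical replicated service interface. */
    interface AccountService {
        @Replication(Protocol.ANYCAST)  // read-only: any single replica may answer
        int getBalance();

        @Replication(Protocol.ATOMIC)   // update: all replicas apply it in the same order
        void deposit(int amount);

        void ping();                    // unannotated: falls back to a default
    }

    /** Returns the protocol declared on the method, or the fallback if none is declared. */
    static Protocol protocolFor(Method method, Protocol fallback) {
        Replication declared = method.getAnnotation(Replication.class);
        return declared != null ? declared.value() : fallback;
    }

    public static void main(String[] args) throws NoSuchMethodException {
        Method read = AccountService.class.getMethod("getBalance");
        Method write = AccountService.class.getMethod("deposit", int.class);
        Method ping = AccountService.class.getMethod("ping");

        // A dispatcher would consult this result before sending each invocation.
        System.out.println(read.getName() + " uses " + protocolFor(read, Protocol.MULTICAST));
        System.out.println(write.getName() + " uses " + protocolFor(write, Protocol.MULTICAST));
        System.out.println(ping.getName() + " uses " + protocolFor(ping, Protocol.MULTICAST));
    }
}
```

Keeping the protocol choice in method metadata means that the invocation layer can select a protocol per call without the server implementation containing any protocol-specific code.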
Finally, the dissertation provides an extensive experimental evaluation of the fault treatment capabilities of the autonomous replication management architecture, with emphasis on testing complex failure scenarios. The first experiment examines the ability of clients to maintain correct membership when servers crash and recover. The second experiment investigates the behavior of services when exposed to multiple nearly coincident node crash failures. In conjunction with this experiment, a novel technique has been developed to estimate various service dependability characteristics. The third experiment evaluates the recovery performance of a system deployed in a wide area network. In this experiment, multiple nearly coincident reachability changes are injected to simulate network partitions separating the service replicas.
To support the experimental evaluation, a set of generic tools has also been developed to aid the execution and analysis of the experiments.