Reliable and Dynamically Reconfigurable Distributed Systems

The purpose of this research is to create operating system support and a programming environment for developing and maintaining long-running parallel and distributed applications that evolve continually. Many such applications, including automated manufacturing, computer-aided control systems, and scientific computation, execute for a long time and are developed incrementally, so changes may have to be incorporated while the application is running. Supporting changes at runtime requires efficient dynamic reconfiguration facilities and an extensible programming environment. The facilities consist of toolkits and runtime mechanisms for generating correct reconfiguration plans and for maintaining consistency during both normal execution and exception-handling activities. The toolkits support generalized techniques for concurrency control, recovery, and reconfiguration that exploit partial-order application semantics. They are implemented on top of a library that supplies common routines for behavior analysis, conflict analysis, and consistency restoration. The programming environment is based on a scalable software architecture for specifying and verifying complex application behavior in a form that the toolkits can also readily analyze. The scope of the research includes building the system-level toolkits, building an end-user environment for developing applications that use them, algorithm design, prototyping, and evaluation of the efficiency and usefulness of the facilities.

The system uses a hierarchical state machine model as its underlying formalism because the model is well suited to analyzing dependencies among interacting operations and to computing plans for maintaining consistency during failure recovery and reconfiguration. This research will enhance our understanding of the fundamental principles for maintaining consistency in distributed and parallel systems, yielding an approach that is more general than existing techniques such as transactions (including semantics-based transactions). Transaction-based approaches require failure atomicity and serializability to be preserved, which restricts how concurrent tasks may interact. In contrast, our approach analyzes dependencies automatically and restores applications to correct intermediate states. Reconfiguration facilities based on this approach are the underlying mechanisms for building tools for other purposes, including (1) interactive parallel program steering and control, (2) adaptive transaction processing, (3) performance tuning through dynamic selection of efficient implementations, (4) mobile distributed systems, (5) fault-tolerant computing, and (6) load balancing.
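
The contrast with transaction-style recovery can be illustrated with a minimal sketch. It is not the project's toolkit or algorithm: the function names, the dictionary encoding of the partial order, and the manufacturing workflow are all hypothetical, chosen only to show how dependency analysis can limit recovery to the operations that actually depend on a failed one, leaving the application in an intermediate state rather than rolling everything back.

```python
from collections import defaultdict, deque


def affected_operations(depends_on, failed):
    """Return the failed operation plus every operation that transitively depends on it.

    depends_on maps each operation to the operations it directly depends on
    (a hypothetical encoding of the application's partial order).
    """
    # Invert the map: dependents[x] holds the operations that directly depend on x.
    dependents = defaultdict(set)
    for op, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(op)

    # Breadth-first search over the inverted edges, starting at the failure.
    affected = {failed}
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for op in dependents[current]:
            if op not in affected:
                affected.add(op)
                queue.append(op)
    return affected


def recovery_plan(depends_on, completed, failed):
    """Split the completed operations into those to undo and those to keep."""
    affected = affected_operations(depends_on, failed)
    undo = sorted(op for op in completed if op in affected)
    keep = sorted(op for op in completed if op not in affected)
    return undo, keep


# Hypothetical workflow: "pack" needs both the painted part and the stock check.
depends_on = {
    "cut": [],
    "drill": ["cut"],
    "paint": ["drill"],
    "inspect_stock": [],
    "pack": ["paint", "inspect_stock"],
}
completed = ["cut", "drill", "paint", "inspect_stock"]

# Suppose "drill" is later found to be faulty.
undo, keep = recovery_plan(depends_on, completed, "drill")
print("undo:", undo)
print("keep:", keep)
```

In this sketch only `drill` and `paint` are undone, while `cut` and `inspect_stock` are preserved. A scheme that enforced failure atomicity over the whole workflow would discard all four completed operations.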