Fault-Tolerant Support for Real-Time Collaborative Editing Systems

Groupware systems allow physically dispersed teams to collaborate on common tasks across distance and/or time. In a real-time groupware system, all users must be present at their respective sites at the same time, whereas a non-real-time groupware system lets users work on common tasks at different times. Real-time collaborative editing systems, which enable groups of geographically distributed users to view and edit a shared document simultaneously, make groupware applications more practical. This benefit is even more pronounced when users can run real-time collaborative editing systems over the Internet.

In real-time collaborative editing systems, the main issues are good responsiveness, support for unconstrained collaboration, and tolerance of failed processes. If a real-time collaborative editor is to be used effectively over the Internet, the system must tolerate client and link failures, because the quality of Internet connections is unpredictable. There are two main approaches to improving fault tolerance: replication and persistence. In replication-based solutions, components are replicated and the system ensures that all replicas process the same messages in the same order; if any one replica fails, the others can continue. Persistence-based solutions rely on checkpointing, which periodically saves the states of components so that the system can recover after a failure. Checkpoint recovery may be preferable for small problems if local disks are available, but wide-area replication outperforms checkpoint recovery for larger-grain problems.
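The replication idea above can be illustrated with a minimal sketch. It assumes a deterministic replica of a plain-text document and a hypothetical operation format `(kind, position, argument)`; neither the `Replica` class nor the `broadcast` helper comes from the original work. The point is only that replicas which apply the same operations in the same order end in the same state, so either one can take over if the other fails.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """A deterministic replica of a shared text document (hypothetical)."""
    text: str = ""
    applied: int = 0  # number of operations applied so far

    def apply(self, op: tuple) -> None:
        # Assumed operation format: ("insert", pos, string) or ("delete", pos, count).
        kind, pos, arg = op
        if kind == "insert":
            self.text = self.text[:pos] + arg + self.text[pos:]
        elif kind == "delete":
            self.text = self.text[:pos] + self.text[pos + arg:]
        else:
            raise ValueError(f"unknown operation: {kind}")
        self.applied += 1


def broadcast(replicas, ops):
    """Deliver every operation to every replica in one agreed order."""
    for op in ops:
        for replica in replicas:
            replica.apply(op)


ops = [("insert", 0, "hello"), ("insert", 5, " world"), ("delete", 0, 1)]
primary, backup = Replica(), Replica()
broadcast([primary, backup], ops)

# Both replicas hold the same document, so the backup can continue alone.
print(primary.text == backup.text)
```

A real system would also need an ordering protocol to agree on the delivery order across sites; this sketch simply assumes that order is given.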

In a real-time collaborative editing system, clients should be able to rejoin the system after client or link failures. One basic requirement is that existing users can continue their work while a crashed client rejoins the group. Thus, the group’s current state must be transferred to the rejoining client even in the presence of failures. From the client’s perspective, it should be able to rejoin the collaborative editing system without starting from the very beginning. Starting from scratch can cause a substantial delay, which is unnecessary if an efficient approach is applied.

In this research, we devise a new, efficient approach to support crash recovery in real-time collaborative editing systems. To provide fault-tolerant support for the server, we have developed a primary-backup server scheme that tolerates a single server failure. If the primary server crashes, the backup server automatically continues to serve the clients without a server restart. This research also introduces the notion of a local final state, which records the final state of a client. To protect clients against crashes, the local final state is stored on permanent storage at each client site. If a client is disconnected because of a server crash or a link failure, it can rejoin the collaborative editing system by loading its local final state. To synchronize the client with the current state of the whole system, the server then resends some operations according to this local final state. How to determine which operations the server should resend is the main issue discussed in our research.
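One simple way to picture the resend step is sketched below. This is not the algorithm from the research itself, which is not detailed on this page. It assumes that the server assigns a total order to operations through sequence numbers and that the local final state records the last sequence number it reflects. The names `LocalFinalState`, `Server`, `ops_to_resend`, and `rejoin` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class LocalFinalState:
    """Client-side snapshot (assumed layout): document plus last applied sequence number."""
    document: str
    last_seq: int


@dataclass
class Server:
    """Keeps an ordered log of operations (assumption: one global sequence)."""
    log: list = field(default_factory=list)  # entries are (seq, op)
    next_seq: int = 1

    def record(self, op: tuple) -> int:
        seq = self.next_seq
        self.log.append((seq, op))
        self.next_seq += 1
        return seq

    def ops_to_resend(self, last_seq: int) -> list:
        # Resend only what the client's snapshot has not yet seen.
        return [(seq, op) for seq, op in self.log if seq > last_seq]


def apply_op(document: str, op: tuple) -> str:
    kind, pos, arg = op
    if kind == "insert":
        return document[:pos] + arg + document[pos:]
    if kind == "delete":
        return document[:pos] + document[pos + arg:]
    raise ValueError(f"unknown operation: {kind}")


def rejoin(state: LocalFinalState, server: Server) -> LocalFinalState:
    """Bring a reloaded local final state up to date with the server."""
    document, last_seq = state.document, state.last_seq
    for seq, op in server.ops_to_resend(state.last_seq):
        document = apply_op(document, op)
        last_seq = seq
    return LocalFinalState(document, last_seq)


server = Server()
for op in [("insert", 0, "abc"), ("insert", 3, "def"), ("delete", 0, 1)]:
    server.record(op)

# The client crashed after saving a snapshot that reflects only operation 1.
saved = LocalFinalState(document="abc", last_seq=1)
recovered = rejoin(saved, server)
print(recovered)
```

Under these assumptions the client receives only the missed operations rather than the whole document. Concurrent operations from several clients would need more than a single counter to order correctly.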

Current Work

Currently, we are addressing fault-tolerance issues in real-time collaborative systems. We present an efficient recovery algorithm that makes real-time collaborative systems more reliable. The traditional way to recover a crashed client site is to transmit the system’s final state, which includes the content of the document and the locking table, from the server site. However, if the final state is large, the recovery latency becomes significant, which may make users impatient. In our new approach, each client site maintains a local final state, which is generated periodically. As a result, if a client or link fails, the client can rejoin the collaborative editing system by loading its local final state instead of obtaining the state from the remote server, which may cause a noticeable delay. Our algorithm maintains consistency between the local state and the remote state. The interval between a client joining and leaving the system is an important metric addressed in our paper, and we derive an equation to determine it. Given this interval, system performance can be improved by determining an optimal frequency for generating the local final state.
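The periodic generation of a local final state could look roughly like the sketch below. It makes several assumptions that do not come from the paper: the state is a JSON file holding the document and a sequence number, "periodically" is approximated by "every N operations", and the helper names (`save_local_final_state`, `load_local_final_state`, `PeriodicSnapshotter`) are invented for illustration. The file is written to a temporary name and then renamed, so a crash during the write leaves the previous snapshot intact.

```python
import json
import os
import tempfile


def save_local_final_state(path, document, last_seq):
    """Atomically write the snapshot (assumed JSON layout) to permanent storage."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump({"document": document, "last_seq": last_seq}, f)
            f.flush()
            os.fsync(f.fileno())  # make sure the data reaches the disk
        os.replace(tmp_path, path)  # atomic rename over the old snapshot
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_local_final_state(path):
    """Return (document, last_seq), or None if no snapshot exists yet."""
    try:
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
    except FileNotFoundError:
        return None
    return data["document"], data["last_seq"]


class PeriodicSnapshotter:
    """Saves a snapshot every `every_n_ops` operations (assumed policy)."""

    def __init__(self, path, every_n_ops):
        if every_n_ops < 1:
            raise ValueError("every_n_ops must be at least 1")
        self.path = path
        self.every_n_ops = every_n_ops
        self.ops_since_snapshot = 0

    def on_operation(self, document, seq):
        """Call after each applied operation; returns True if a snapshot was written."""
        self.ops_since_snapshot += 1
        if self.ops_since_snapshot >= self.every_n_ops:
            save_local_final_state(self.path, document, seq)
            self.ops_since_snapshot = 0
            return True
        return False
```

The `every_n_ops` parameter stands in for the snapshot frequency discussed above: a smaller value means fewer operations to resend after a failure but more disk writes during normal editing.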

Future studies in this work will focus on evaluating the performance of our new approach. We will investigate how several factors affect system performance, including the data volume of the system’s final state and the frequency of generating the local final state. The approach proposed in this paper can also be applied to other kinds of collaborative systems. We also plan to investigate a mechanism that provides basic fault-tolerant services in collaborative systems.

Publications

Xiao Qin, "Delayed Consistency Model for Distributed Interactive Systems with Real-time Continuous Media," Journal of Software, Vol. 13, No. 6, pp. 1029-1039, June 2002, China.
Xiao Qin and Chengzheng Sun, "Recovery Support for Internet-based Real-Time Collaborative Editing Systems," in Proceedings of the 1st International Conference on Computer Networks and Mobile Computing (ICCNMC), October 16-19, 2001, IEEE Press.
Xiao Qin, "Fault-Tolerant Support in Real-Time Collaborative Editing Systems," IEEE Distributed Systems Online, Vol. 2, No. 1, 2001, IEEE Computer Society, USA.
Xiao Qin and Chengzheng Sun, "Efficient Recovery Algorithm in Real-Time and Fault-Tolerant Collaborative Editing Systems," in ACM CSCW'2000 Workshop on Collaborative Editing Systems, December 3, 2000, Philadelphia, Pennsylvania, USA.
