
Improving Network Performance through Task Duplication for Parallel Applications on Clusters

Xiao Qin

Department of Computer Science
New Mexico Institute of Mining and Technology
801 Leroy Place, Socorro, New Mexico 87801-4796

While data replication is widely used in clusters to provide fault tolerance, it can heavily stress communication networks and degrade the overall performance of parallel applications. The degradation is particularly severe for disk-write-intensive applications. As a result, managing data duplication for parallel applications running on clusters is a significant and urgent challenge. This paper presents the design, implementation, and evaluation of TUFF, a network-aware task duplication management system in which redundant data is regenerated by duplicate tasks rather than replicated directly over the network. In addition, TUFF improves the availability of parallel applications, because it allows two replicas of each I/O-intensive task to be executed on two different nodes. We have implemented TUFF and evaluated it using extensive simulations under a diverse set of workload conditions. Experimental results show that TUFF improves the overall performance of parallel applications running on clusters by reducing network resource consumption.
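
The trade-off between replicating data and duplicating tasks can be illustrated with a toy traffic model. The sketch below is not TUFF's algorithm; it rests on assumptions that do not come from the paper: replicating a task's output sends `output_bytes` over the network, while running a duplicate task on a second node sends only `input_bytes`. The `Task` class and all function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical task description used only for this illustration."""
    name: str
    input_bytes: int   # data the task reads before it can run
    output_bytes: int  # data the task writes to disk


def replication_traffic(tasks):
    """Network bytes if every task's output is copied to a second node."""
    return sum(t.output_bytes for t in tasks)


def duplication_traffic(tasks):
    """Network bytes if every task is rerun on a second node.

    Assumption: only the task's input has to be shipped to that node;
    the redundant output is regenerated locally by the duplicate task.
    """
    return sum(t.input_bytes for t in tasks)


def choose_strategy(task):
    """Pick the cheaper option for one task under the assumed model."""
    if task.input_bytes < task.output_bytes:
        return "duplicate"
    return "replicate"


# Made-up workload: one write-intensive task and one read-intensive task.
tasks = [Task("sort", 10, 200), Task("filter", 300, 20)]

for t in tasks:
    print(t.name, choose_strategy(t))
print("replication traffic:", replication_traffic(tasks))
print("duplication traffic:", duplication_traffic(tasks))
```

Under this model a write-intensive task, whose output is much larger than its input, is the case where duplication saves the most network bandwidth. That matches the abstract's emphasis on disk-write-intensive applications.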

Proceedings of the 24th IEEE International Performance, Computing, and Communications Conference (IPCCC 2005), pp. 35-42, Phoenix, Arizona, April 7-9, 2005.