1 Research Interests

My research has primarily focused on modeling and improving the reliability of parallel and distributed storage systems and of wireless networks. My current research projects are summarized below.


1.1 Mathematical Reliability Models for Disk Systems

This research is funded by the U.S. National Science Foundation, and I serve as senior personnel responsible for developing and validating reliability models. Many energy conservation techniques have been proposed to achieve high energy efficiency in disk systems. Unfortunately, growing evidence shows that energy-saving schemes in disk drives usually have a negative impact on the reliability of storage systems. Existing reliability models are inadequate for estimating the reliability of parallel disk systems equipped with energy conservation techniques. To solve this problem, I proposed a mathematical model, called MINT, to evaluate the reliability of a parallel disk system in which energy-saving mechanisms are implemented. In this research, I focused on modeling the reliability impact of two well-known energy-saving techniques: the Popular Disk Concentration technique (PDC) and the Massive Array of Idle Disks (MAID). I started by investigating how PDC and MAID affect the utilization and power-state transition frequency of each disk in a parallel disk system. I then modeled the annual failure rate of each disk as a function of the disk’s utilization, power-state transition frequency, and operating temperature, because these parameters are key reliability-affecting factors in addition to disk age. Next, I derived the reliability of the parallel disk system from the annual failure rates of its disks. Finally, I used MINT to study the reliability of a parallel disk system equipped with the PDC and MAID techniques. Experimental results show that PDC is more reliable than MAID when the disk workload is low, whereas MAID is more reliable than PDC under relatively high I/O load.

After building the MINT model, I developed a trace-driven simulation to validate it. Validating the accuracy of the MINT model is a major challenge.
It is impractical to run a 1000-disk array for years merely to find out how many of the disks fail over the testing period. Even if I could conduct such field testing, a sample of 1000 disks would still be too small, let alone the years and money the test would require. The MINT model consists of two major sub-models: the Access Rate to Utilization model and the Utilization to AFR (Annual Failure Rate) model, which is based on maintenance data from the Google report that has already been validated. I therefore decided to validate the Access Rate to Utilization sub-model by designing a simulator that reproduces data movements according to access patterns from a full month of Berkeley Web Traces. The simulation and the mathematical model showed similar trends, which supports the validity of the MINT model.
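
The step from per-disk annual failure rates to a system-level estimate can be illustrated with a minimal Python sketch. It is not the MINT formulation itself: it assumes that disk failures are independent and that the array has no redundancy, so the system survives a year only if every disk does. The function names and any AFR values passed in are hypothetical.

```python
from math import prod


def system_survival_probability(disk_afrs):
    """Probability that no disk fails within one year.

    Assumption (not taken from MINT): failures are independent and the
    array has no redundancy, so all disks must survive.
    """
    for afr in disk_afrs:
        if not 0.0 <= afr <= 1.0:
            raise ValueError("each AFR must lie in [0, 1]")
    return prod(1.0 - afr for afr in disk_afrs)


def expected_annual_failures(disk_afrs):
    """Expected number of disk failures per year (sum of per-disk AFRs)."""
    return sum(disk_afrs)
```

Under these assumptions, shifting load onto a few disks raises their AFRs and lowers the product, which is why the per-disk effects of PDC and MAID have to be modeled before the system-level figure can be computed.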


1.2 Energy-Efficient Parallel and Distributed Storage Systems

I have played a secondary role in the work on energy-efficient parallel and distributed I/O systems. This research is closely related to my main research work, in which I play the primary role. The approaches we proposed in this project are relevant to I/O-intensive applications and can improve the energy efficiency and performance of parallel disk systems. To substantially reduce the penalties incurred by disk spin-up and spin-down operations, we developed a novel approach to conserving energy in parallel I/O systems with write buffer disks, which accumulate small writes using a log file system. Data sets buffered in the log file system can be transferred to the target data disks in batches. The buffer disks thus serve the majority of incoming write requests and reduce the number of disk spin operations by keeping data disks in standby for long periods. Interestingly, the write buffer disks not only achieve high energy efficiency in parallel I/O systems but also shorten the response times of write requests.
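
The buffering idea can be sketched in a few lines of Python. This is an illustrative toy model rather than the system described above: the class name, the flush threshold, and the assumption that each target disk touched by a flush costs exactly one spin-up are all hypothetical.

```python
from collections import defaultdict


class WriteBufferDisk:
    """Toy model of a buffer disk that logs small writes and flushes in batches."""

    def __init__(self, flush_threshold):
        if flush_threshold < 1:
            raise ValueError("flush_threshold must be at least 1")
        self.flush_threshold = flush_threshold
        self.log = []        # append-only log of (target_disk, block) pairs
        self.spin_ups = 0    # data-disk spin-ups caused by flushes

    def write(self, target_disk, block):
        """Append a write to the log; flush once the log reaches the threshold."""
        self.log.append((target_disk, block))
        if len(self.log) >= self.flush_threshold:
            return self.flush()
        return {}

    def flush(self):
        """Group logged writes by target disk and hand them over in one batch.

        Assumption: each target disk is in standby and is spun up once per flush.
        """
        batches = defaultdict(list)
        for target_disk, block in self.log:
            batches[target_disk].append(block)
        self.spin_ups += len(batches)
        self.log.clear()
        return dict(batches)
```

In this toy model, four small writes alternating between two data disks trigger two spin-ups at flush time, whereas serving each write directly could wake a standby disk up to four times.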


2 Research Plans

2.1 Future Directions for the Short Term

My short-term interests will concentrate on the following directions, which extend my past and current research on analytical reliability models for parallel storage systems.


Fault Tolerance Analysis for RAID Storage Systems

Although the MINT model is adequate to quantify the reliability of energy-efficient disk arrays, it is insufficient to analyze energy-aware RAID systems. I plan to investigate a more sophisticated model that can capture changes in data access patterns and in striped data placement. To reduce power, a conventional RAID system cannot simply rely on caching and powering off disks during idle time, because its disk parallelism keeps all disks spinning even under a light load. The energy consumption of a RAID system can instead be reduced by varying the number of powered-on disks via gear shifting or by switching among sets of disks (e.g., Power-Aware Redundant Array of Inexpensive Disks). However, when the number of active disks in the system changes, the RAID level changes accordingly, which affects the reliability of the system. As a further extension of my current research, I plan to investigate how RAID levels behave under gear shifting and the movement of striped data, in combination with the input data access patterns.

Predictive Reliability Models for Storage Systems

Reliability evaluation of a disk system indicates the present reliability of the system. If we can also predict the future reliability of the system, maintenance expenses can be reduced because disks will be replaced on time. The risk that a disk crashes before being replaced will diminish, and new disks will need to be purchased less frequently. The goal of this research is to build predictive reliability models that forecast the reliability of storage systems based on data access patterns and that provide disk maintenance suggestions. Furthermore, such a strategy can be integrated with load-balancing schemes to ensure that disks reaching the end of their lifetimes are assigned lighter workloads and that the data on disks that are likely to fail are backed up.


2.2 Future Directions for the Long Term

Energy-Aware Storage Systems in Data Centers

Distributed file systems are becoming the de facto method of data storage for the new generation of data centers (e.g., web applications run by companies like Google, Amazon, and Yahoo!). Distributed storage mechanisms are preferred over traditional relational database systems for several reasons, including scalability, availability, and performance. However, the energy consumption issue needs to be addressed carefully in data centers. For example, a 360-TFLOPS supercomputer (e.g., IBM Blue Gene/L) with traditional processors needs 2,329.60 kW to operate. This energy requirement is approximately equal to the combined energy consumption of 22,000 US households. In addition, the heat dissipated by large-scale clusters requires cooling equipment (e.g., air conditioners) to control the temperature in supercomputers and data centers. The trends in power and cooling delivery and cost highlight the need for power and thermal management support in data centers. I plan to explore schemes that exploit platform power management (e.g., processor frequency scaling, prefetching, caching, data management, and load balancing) for data centers.


Reliability-Aware Parallel Virtual File System (PVFS) in High-Performance Computing

PVFS, a popular network clustering file system, brings state-of-the-art parallel I/O concepts to production parallel systems. It is designed to scale to petabytes of storage and to provide access rates of hundreds of GB/s. While working on a PVFS-related research project, I realized that energy saving may not be a central issue for high-performance computing (HPC) systems. One major reason is that energy-efficiency schemes usually work against the main goal of an HPC system, which is performance. Fault tolerance, however, plays an important role in HPC systems, since even a minor defect may cause catastrophic data loss across the entire system. Hence, I plan to develop fault-tolerance mechanisms for PVFS in order to enhance availability.


Information Assurance and Security in Cloud Computing

Providing confidentiality, integrity, authenticity, privacy, and availability of information is essential for the normal operation of cloud computing systems. Hence, information assurance and security are critical issues. In my long-term research, I will emphasize authorization and authentication schemes for cloud computing systems.