Xiao Qin's Research

Auburn University

Final Report

MINT: Mathematical Reliability Models for Energy-Efficient Parallel Disk Systems




Our Mission: The MINT project aims to develop mathematical reliability models for fault-tolerant, energy-aware disk systems. Reliability models, which are used to estimate reliability, have been important tools in the design and development of fault-tolerant computer systems. In the past decade, a variety of practical and useful reliability models have been constructed for disk systems. However, most of these models were developed for disk systems that are not energy efficient, which makes it difficult to apply them to energy-aware disk systems. Therefore, the overall objective of this project is to address the mathematical underpinnings of modeling the reliability of energy-efficient parallel disk systems, in which fault tolerance and energy-saving techniques are seamlessly integrated to conserve energy without sacrificing reliability.


1. Research and Education Activities

1.1 MINT: A Reliability Modeling Architecture for Energy-Efficient Parallel Disks

We developed the MINT reliability modeling architecture for energy-efficient parallel disk systems. The MINT architecture is composed of a single-disk reliability model, a system-level reliability model, and three reliability-affecting factors - temperature, disk state transition frequency (hereinafter referred to as frequency), and utilization. Many energy-saving schemes inherently affect reliability-related factors like disk utilization and transition frequency. Given an energy optimization mechanism, MINT first translates data access patterns into two of the reliability-affecting factors - frequency and utilization. The single-disk reliability model then derives each individual disk's annual failure rate from utilization, power-state transition frequency, age, and temperature. Each disk's reliability is used as input to the system-level reliability model, which estimates the annual failure rate of the parallel disk system.
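The data flow described above can be sketched in Python. This is a minimal illustration, not the MINT model itself: the function names, the trace format (a list of `(state, duration_hours)` pairs), the choice to count only transitions into or out of standby, and the independence assumption in the system-level step are all assumptions made for this example. The single-disk model that maps the factors to an annual failure rate is deliberately left out and treated as an input.

```python
from math import prod

HOURS_PER_YEAR = 8760.0


def utilization_and_frequency(trace):
    """Derive two reliability-affecting factors from one disk's power-state trace.

    trace: list of (state, duration_hours) pairs, where state is one of
    "active", "idle", or "standby" (hypothetical format).
    Returns (utilization, transitions_per_year). A transition is counted
    whenever the disk enters or leaves the standby state.
    """
    total = sum(duration for _, duration in trace)
    busy = sum(duration for state, duration in trace if state == "active")
    transitions = sum(
        1
        for (prev, _), (cur, _) in zip(trace, trace[1:])
        if (prev == "standby") != (cur == "standby")
    )
    return busy / total, transitions * HOURS_PER_YEAR / total


def system_afr(disk_afrs):
    """Probability that at least one disk fails within a year.

    Assumes disks fail independently, which is a simplification
    introduced for this sketch.
    """
    return 1.0 - prod(1.0 - afr for afr in disk_afrs)
```

In this sketch the per-disk annual failure rates passed to `system_afr` would come from whatever single-disk model is plugged in between the two functions.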


1.2 A Reliability Model of Energy-Efficient Parallel Disk Systems with Data Mirroring

Conservation of energy in parallel disk systems has a strong impact on the cost of cooling equipment and backup power generation, because a significant amount of energy is consumed by parallel disks in high-performance computing centers. Although a wide range of energy conservation techniques have been developed for disk systems, research on reliability analysis for energy-efficient parallel disk systems is still in its infancy. In this part of the research, we used a Markov process to develop a quantitative reliability model for energy-efficient parallel disk systems using data mirroring. With the new model in place, we developed a reliability analysis tool to efficiently evaluate the reliability of fault-tolerant parallel disk systems with two power modes. More importantly, the reliability model makes it possible to strike a good compromise between energy efficiency and reliability in energy-efficient and fault-tolerant parallel disk systems.
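To illustrate the kind of Markov analysis involved, the sketch below computes the mean time to data loss of a single mirrored pair using the textbook three-state chain (both disks healthy, one disk failed, data lost). It is not the project's model: it ignores power modes entirely, and the constant `fail_rate` and `repair_rate` parameters are assumptions made for this example.

```python
def mttdl_mirrored_pair(fail_rate, repair_rate):
    """Mean time to data loss for one mirrored pair (same time unit as the rates).

    State 0: both disks healthy; left at rate 2 * fail_rate.
    State 1: one disk failed; repaired at repair_rate, or the second
             disk fails at fail_rate, which loses the data.
    """
    time_in_state_0 = 1.0 / (2.0 * fail_rate)
    time_in_state_1 = 1.0 / (fail_rate + repair_rate)
    # Probability that a visit to state 1 ends in data loss rather than repair.
    p_loss = fail_rate / (fail_rate + repair_rate)
    # Each 0 -> 1 cycle takes time_in_state_0 + time_in_state_1 on average,
    # and the expected number of cycles until data loss is 1 / p_loss.
    return (time_in_state_0 + time_in_state_1) / p_loss
```

The result agrees with the closed form (3 * fail_rate + repair_rate) / (2 * fail_rate ** 2) for this chain. A model with two power modes would need additional states and mode-dependent rates.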


1.3 Improving Energy Efficiency and Reliability of Parallel I/O Systems with Disk Mirroring

Numerous energy-saving techniques have been developed to aggressively reduce energy dissipation in parallel disks. However, many existing energy conservation schemes have substantial adverse impacts on the reliability of disks. To remedy this deficiency, we address the problem of making tradeoffs between energy efficiency and reliability in parallel disk systems with data mirroring. Among the several factors affecting disk reliability, the two most important, disk utilization and age, are the focus of this study. We build a mathematical reliability model to quantify the impacts of disk age and utilization on the failure probabilities of mirrored disk systems. In light of the reliability model, we propose the novel concept of a safe utilization zone, within which energy dissipation in disks can be reduced without degrading reliability. We developed an approach to improving both the reliability and the energy efficiency of disk systems through disk mirroring and utilization control that forces disk drives to operate in safe utilization zones. This is the first utilization-based control scheme that seamlessly integrates reliability with energy-saving techniques in the context of fault-tolerant systems. Experimental results show that our approach can significantly improve reliability while achieving high energy efficiency for disk systems under a wide range of workload situations.
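One way to picture a safe utilization zone is as the range of utilizations for which a reliability model predicts an acceptable failure rate. The Python sketch below is a simplified illustration under that reading: the function names, the `(utilization, afr)` sample format, and the `max_afr` threshold are assumptions of this example, and it assumes the acceptable utilizations form one contiguous range.

```python
def safe_utilization_zone(samples, max_afr):
    """Return (low, high) utilization bounds whose predicted AFR is acceptable.

    samples: iterable of (utilization, afr) pairs produced by some
    reliability model (hypothetical input format).
    Returns None when no sampled utilization meets the threshold.
    """
    safe = sorted(u for u, afr in samples if afr <= max_afr)
    if not safe:
        return None
    return safe[0], safe[-1]


def clamp_to_zone(target_utilization, zone):
    """Pull a requested utilization back inside the safe zone."""
    low, high = zone
    return min(max(target_utilization, low), high)
```

A utilization controller could call `clamp_to_zone` before redistributing load between a disk and its mirror, so that neither drive is pushed outside the zone.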


1.4 Modeling the Reliability of the MAID (Massive Arrays of Idle Disks) Technique

The MAID (Massive Arrays of Idle Disks) technique - developed by Colarelli and Grunwald - aims to reduce the energy consumption of large disk arrays while maintaining acceptable I/O performance. MAID relies on temporal data locality to place replicas of active files on a subset of cache disks, thereby allowing other disks to spin down. We model the reliability behavior of parallel disk systems coupled with the MAID technique. With the new model in place, the reliability of disk systems with MAID can be quantitatively evaluated. More importantly, the reliability model makes it possible to strike a good compromise between energy efficiency and reliability in energy-efficient and fault-tolerant parallel disk systems.


1.5 Modeling the Reliability of the PDC (Popular Data Concentration) Technique

The PDC (Popular Data Concentration) technique proposed by Pinheiro and Bianchini migrates frequently accessed data to a subset of disks in a disk array. In a parallel disk system with the PDC technique, the most popular files are stored on the leftmost disk, while the least popular files are stored on the rightmost disk. PDC can rely on file popularity and migration to conserve energy in disk arrays, because several network servers exhibit I/O loads with highly skewed data access patterns. Migrating popular files to a subset of disks skews the disk I/O load towards this subset, offering the other disks more opportunities to be switched to standby to conserve energy. To avoid performance degradation of disks storing popular data, PDC migrates data onto a disk only until its load approaches the maximum bandwidth of the disk.
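The placement rule described above can be sketched as a greedy pass over files sorted by popularity. This is an illustrative Python sketch, not the published PDC algorithm: the function name, the use of a single expected-load number per file, and the `max_disk_load` cap are assumptions of this example.

```python
def pdc_placement(file_loads, max_disk_load, num_disks):
    """Assign files to disks from most to least popular.

    file_loads: dict mapping file name -> expected I/O load (any unit).
    max_disk_load: load cap per disk, standing in for its maximum bandwidth.
    Returns (disks, loads): per-disk file lists and per-disk total load.
    Disk 0 plays the role of the leftmost (hottest) disk.
    """
    disks = [[] for _ in range(num_disks)]
    loads = [0.0] * num_disks
    current = 0
    by_popularity = sorted(file_loads.items(), key=lambda kv: kv[1], reverse=True)
    for name, load in by_popularity:
        # Move right once the current disk would exceed its cap. An empty
        # disk always accepts a file, and the last disk takes any overflow.
        while (current < num_disks - 1 and loads[current] > 0
               and loads[current] + load > max_disk_load):
            current += 1
        disks[current].append(name)
        loads[current] += load
    return disks, loads
```

Because the pass never moves back to the left, popularity decreases monotonically from the first disk to the last, and any disks left empty are candidates for standby.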


1.7 Improving Energy Efficiency of Secure Disk Systems without Modifying Security Mechanisms

Improving the energy efficiency of security-aware storage systems is challenging, because security and energy efficiency are often two conflicting goals. The first step toward making the best tradeoffs between high security and energy efficiency is to profile encryption algorithms to determine whether storage systems can produce energy savings when security mechanisms are in place. We focus on encryption algorithms rather than other types of security services, because encryption algorithms are usually computation-intensive. In this study, we used the XySSL libraries and profiled operations of several test problems using Conky - a lightweight, highly configurable system monitor. Using our profiling techniques, we concluded that although 3DES is much slower than AES encryption, security-aware storage systems are more likely to save energy when using 3DES than when using AES. The CPU is the bottleneck in 3DES, allowing us to take advantage of dynamic power management schemes to conserve energy at the disk level. After profiling several hash functions, we noticed that the CPU is not the bottleneck for any of these functions, indicating that it is difficult to leverage the dynamic power management technique to conserve the energy of a single disk where hash functions are implemented for integrity checking.


1.6 Improving Energy Efficiency and Security for Disk Systems

Improving security and minimizing power consumption are crucial for large-scale data storage systems. Although a handful of studies have focused on data security and energy efficiency, most of the existing approaches have concentrated on only one of these two metrics. In this paper, we present a new approach to integrating power optimization with security services to enhance the security of energy-efficient large-scale storage systems. In our approach, we make use of the dynamic disk speed control technique for power management, or DRPM, to conserve energy. Numerous security services like confidentiality, integrity, and authentication can be provided to secure storage systems. In this study, we develop two efficient ways of integrating confidentiality services with the dynamic disk speed control technique. The first strategy - security-aggressive in nature - focuses on improving storage system security, with less emphasis on energy conservation. The second strategy - energy-aggressive in nature - gives higher priority to energy conservation than to security optimization. Our experimental results show that the energy-aggressive approach provides better energy savings than the security-aggressive approach. However, the quality of security achieved by the security-aggressive scheme is higher than that of the energy-aggressive approach. Moreover, the empirical results show that the energy savings yielded by the two approaches become more pronounced as the data size increases. The findings also illustrate that the response time of the security-aggressive approach is more sensitive to data size than that of the energy-aggressive scheme.


1.8 Energy-Efficient Processing for Write Requests

In this part of the study, we focused on both large and small disk write requests issued to parallel disk systems. While large write requests are issued directly to data disks, small write requests are sent to an active buffer disk. Our previous study confirmed that the seek times of small disk requests dominate disk I/O processing times. Therefore, we made use of a log file system to reduce the seek time of most write requests to zero. The seek times of write requests handled by buffer disks are zero unless the buffer disks are in the process of moving data to data disks or responding to read requests.
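The routing policy above can be sketched as follows. This Python class is a hypothetical illustration only: the class name, the byte threshold that separates small from large writes, and the in-memory lists standing in for disks are all assumptions of this example. The buffer disk is modeled as an append-only log, which is what removes seeks from small writes.

```python
class WriteRouter:
    """Send small writes to a buffer-disk log and large writes to data disks."""

    def __init__(self, small_threshold_bytes):
        self.small_threshold_bytes = small_threshold_bytes
        self.buffer_log = []        # append-only log on the buffer disk
        self.data_disk_writes = []  # writes that reached the data disks

    def write(self, block_id, data):
        """Route one write and report where it went."""
        if len(data) < self.small_threshold_bytes:
            # Appending to the log needs no seek on the buffer disk.
            self.buffer_log.append((block_id, data))
            return "buffer"
        self.data_disk_writes.append((block_id, data))
        return "data"

    def flush(self):
        """Move logged writes to the data disks, keeping the newest copy of each block."""
        latest = dict(self.buffer_log)  # later log entries overwrite earlier ones
        self.data_disk_writes.extend(latest.items())
        self.buffer_log.clear()
        return len(latest)
```

While `flush` is running, the buffer disk is busy moving data, which corresponds to the case in which the seek time of incoming small writes is no longer zero.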


1.9 Energy-aware Prefetching for Parallel Disk Systems

This study focused on increasing the reliability of energy-efficient parallel disk systems by virtue of a reduction in power state transitions. This design goal can be achieved by making use of a buffer disk facility. Specifically, this study aimed at reducing the energy consumed by these systems, and compared two different disk management approaches to reducing energy consumption. Our energy-efficient prefetching strategy prefetches popular data sets into buffer disks, with the desired consequence of reducing the total energy consumption of a large-scale parallel disk system. Our design depends on the observation that a small percentage of the data is frequently accessed by some data-intensive applications. The prefetching scheme was designed to place a small amount of frequently accessed data into buffer disks, thereby reducing energy consumption.
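A simple way to picture the selection of popular data for the buffer disks is a frequency count over an access log. The sketch below is an assumption-laden illustration rather than the project's prefetching scheme: the function name, the log format (a list of data set names), and the greedy fit against a single `buffer_capacity` are all choices made for this example.

```python
from collections import Counter


def select_prefetch_set(access_log, sizes, buffer_capacity):
    """Pick the most frequently accessed data sets that fit in the buffer disk.

    access_log: list of data set names, one entry per access.
    sizes: dict mapping data set name -> size (same unit as buffer_capacity).
    Data sets are considered from most to least accessed; one that does not
    fit in the remaining space is skipped.
    """
    counts = Counter(access_log)
    chosen, used = [], 0
    for name, _ in counts.most_common():
        if used + sizes[name] <= buffer_capacity:
            chosen.append(name)
            used += sizes[name]
    return chosen
```

Requests for the chosen data sets can then be served by the buffer disks, which gives the data disks longer idle periods and fewer power state transitions.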


1.10 Modeling Reliability of Energy-Efficient Parallel Disks

Many energy conservation techniques have been proposed to achieve high energy efficiency in disk systems. Unfortunately, growing evidence shows that energy-saving schemes in disk drives usually have negative impacts on the reliability of storage systems. Existing reliability models are inadequate for estimating the reliability of parallel disk systems equipped with energy conservation techniques. To solve this problem, we propose a mathematical model - called MINT - to evaluate the reliability of a parallel disk system in which energy-saving mechanisms are implemented. In this paper, we focus on modeling the reliability impacts of two well-known energy-saving techniques - the Popular Data Concentration (PDC) technique and the Massive Arrays of Idle Disks (MAID) technique. We started this research by investigating how PDC and MAID affect the utilization and power-state transition frequency of each disk in a parallel disk system. We then modeled the annual failure rate of each disk as a function of the disk’s utilization, power-state transition frequency, and operating temperature, because these parameters are key reliability-affecting factors in addition to disk age. Next, the reliability of a parallel disk system is derived from the annual failure rate of each disk in the system. Finally, we used MINT to study the reliability of a parallel disk system equipped with the PDC and MAID techniques. Experimental results show that PDC is more reliable than MAID when the disk workload is low. In contrast, the reliability of MAID is higher than that of PDC under relatively high load.


1.11 Improving Reliability of Energy-Efficient Parallel Storage Systems by Disk Swapping

The Popular Data Concentration (PDC) technique and the Massive Arrays of Idle Disks (MAID) technique are two effective energy-saving schemes for parallel disk systems. The goal of PDC and MAID is to skew I/O load towards a few disks so that other disks can be transitioned to low-power states to conserve energy. I/O load skewing techniques like PDC and MAID inherently affect the reliability of parallel disks, because disks storing popular data tend to have higher failure rates than disks storing cold data. To achieve good tradeoffs between energy efficiency and disk reliability, we first present a reliability model to quantitatively study the reliability of energy-efficient parallel disk systems equipped with the PDC and MAID schemes. Then, we propose a novel strategy - disk swapping - to improve disk reliability by alternating disks storing hot data with disks holding cold data. We demonstrate that our disk-swapping strategies not only can increase the lifetime of cache disks in MAID-based parallel disk systems, but also can improve the reliability of PDC-based parallel disk systems.
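The idea of alternating hot and cold disks can be sketched as a role swap driven by accumulated wear. This Python function is a hypothetical illustration, not the disk-swapping strategies evaluated in the project: the function name, the "hot"/"cold" role labels, and the generic `wear` metric (for example, accumulated busy hours) are assumptions of this example, and the data migration that a real swap requires is not modeled.

```python
def swap_hot_and_cold(roles, wear):
    """Swap the most-worn hot disk with the least-worn cold disk.

    roles: dict mapping disk id -> "hot" or "cold".
    wear: dict mapping disk id -> accumulated wear metric (hypothetical).
    Returns a new roles dict; the input is left unchanged. No swap happens
    unless the chosen cold disk has less wear than the chosen hot disk.
    """
    hot = [d for d, role in roles.items() if role == "hot"]
    cold = [d for d, role in roles.items() if role == "cold"]
    new_roles = dict(roles)
    if not hot or not cold:
        return new_roles
    most_worn_hot = max(hot, key=lambda d: wear[d])
    least_worn_cold = min(cold, key=lambda d: wear[d])
    if wear[least_worn_cold] < wear[most_worn_hot]:
        new_roles[most_worn_hot] = "cold"
        new_roles[least_worn_cold] = "hot"
    return new_roles
```

Calling such a function periodically spreads the hot-data workload across disks over time instead of concentrating it on the same drives.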


1.12 Mini Conference in the Advanced Operating Systems Class

A mini-conference model was used to motivate and educate graduate students to conduct research projects in the disciplines of storage systems, energy-efficient computing, and prefetching/caching for file systems. By the end of the Spring 2010 semester, when the Comp7500 Advanced Operating Systems class was taught, each graduate student was required to write a research paper and submit it to a mini-conference. All the student papers were reviewed, and each student gave a 20-minute presentation followed by a 5-minute question-and-answer session. The PI also gave constructive comments and suggestions on each student's research project. Through this mini-conference model, the graduate students taking the Comp7500 class improved their presentation and communication skills. After we receive feedback from the graduate students, we will formally evaluate this class next semester.