Xiao Qin's Software

Auburn University

Software

HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters




Publication

If you use our HDFS-HC module in your research, please cite our paper and software in your publications. We greatly appreciate the credit.

This HDFS-HC tool is based on our paper "Improving MapReduce Performance via Data Placement in Heterogeneous Hadoop Clusters" by J. Xie, S. Yin, X.-J. Ruan, Z.-Y. Ding, Y. Tian, J. Majors, and X. Qin, published in Proc. 19th Int'l Heterogeneity in Computing Workshop, Atlanta, Georgia, April 2010. [PDF | PPT]

Introduction

HDFS-HC is a software module for rebalancing data load in heterogeneous Hadoop clusters. This data placement tool is integrated into the Hadoop distributed file system (HDFS) to initially distribute a large data set across multiple nodes in accordance with the computing capacity of each node.
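The idea of placing data in proportion to computing capacity can be illustrated with a small shell sketch. This is not the HDFS-HC implementation: the function name split_blocks and the integer weights (relative node speeds) are hypothetical and chosen only for illustration. Blocks left over after integer division go to the first node listed.

```shell
#!/bin/sh
# split_blocks TOTAL WEIGHT...
# Illustrative only: prints how many of TOTAL blocks each node would hold
# if blocks were shared in proportion to WEIGHT (a hypothetical integer
# describing each node's relative speed).
split_blocks() (
    total=$1; shift
    sum=0
    for w in "$@"; do sum=$((sum + w)); done
    [ "$sum" -gt 0 ] || { echo "weights must sum to a positive number" >&2; exit 1; }

    # First pass: count the blocks handed out after rounding down.
    assigned=0
    for w in "$@"; do assigned=$((assigned + total * w / sum)); done
    leftover=$((total - assigned))

    # Second pass: print the shares; the first node absorbs the leftover.
    out=""
    for w in "$@"; do
        share=$((total * w / sum + leftover))
        leftover=0
        out="$out${out:+ }$share"
    done
    echo "$out"
)
```

For example, split_blocks 60 3 2 1 splits 60 blocks among three nodes whose assumed relative speeds are 3, 2, and 1.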

If you are not very familiar with the Hadoop system, please visit the MapReduce overview, which covers the background of MapReduce and how to install Hadoop in a cluster. The purpose of this document is to help you install and use the HDFS-HC data placement tool in a heterogeneous Hadoop cluster.

Do you have questions on HDFS-HC?

For questions, please contact Xiao Qin at xqin@auburn.edu or Jiong Xie at jzx0009@auburn.edu.

Supported Platforms


Required Software

Required software for Linux and Windows includes:


Additional requirements for Windows include:


Download

You can download the HDFS-HC tool here. Note that this is a recent Hadoop package in which the HDFS-HC module is integrated.

Installation

Required software for Linux and Windows includes:


Additional requirements for HDFS-HC include:

Hadoop Startup

To start your Hadoop cluster, you will need to start both HDFS and the Map-Reduce cluster.

Format a new distributed filesystem:
$ bin/hadoop namenode -format

Start the HDFS with the following command, run on the designated NameNode:
$ bin/start-dfs.sh

The bin/start-dfs.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the NameNode and starts the DataNode daemon on all the listed slaves.
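Before starting the daemons, it can help to preview which hosts the slaves file names. The sketch below assumes the file lists one hostname per line and may contain blank lines and # comments; the function name list_slaves is hypothetical and is not part of Hadoop.

```shell
#!/bin/sh
# list_slaves FILE
# Prints one host per line from FILE, dropping "#" comments,
# trailing whitespace, and blank lines.
list_slaves() {
    sed -e 's/#.*$//' -e 's/[[:space:]]*$//' -e '/^$/d' "$1"
}
```

A typical call would be list_slaves "${HADOOP_CONF_DIR}/slaves", run on the NameNode.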

Start Map-Reduce with the following command, run on the designated JobTracker:
$ bin/start-mapred.sh

The bin/start-mapred.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the JobTracker and starts the TaskTracker daemon on all the listed slaves.
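The two start commands can be wrapped in a small script, sketched here under the assumption that the NameNode and the JobTracker run on the same host and that the script is launched from the Hadoop installation directory. The helper names run and start_cluster and the DRY_RUN variable are hypothetical. The format step is left out on purpose, since it applies only to a new filesystem.

```shell
#!/bin/sh
# run CMD...
# Executes the command, or only prints it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# start_cluster
# Starts HDFS first, then Map-Reduce; stops at the first failure.
start_cluster() {
    run bin/start-dfs.sh && run bin/start-mapred.sh
}
```

Setting DRY_RUN=1 before calling start_cluster prints the commands in order without starting any daemon.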

Hadoop Shutdown

Stop HDFS with the following command, run on the designated NameNode:
$ bin/stop-dfs.sh

The bin/stop-dfs.sh script consults the ${HADOOP_CONF_DIR}/slaves file on the NameNode and stops the DataNode daemon on all the listed slaves.

Stop Map/Reduce with the following command, run on the designated JobTracker:
$ bin/stop-mapred.sh

The bin/stop-mapred.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the JobTracker and stops the TaskTracker daemon on all the listed slaves.

The Hadoop daemon log output is written to the ${HADOOP_LOG_DIR} directory (defaults to ${HADOOP_HOME}/logs).
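The default described above maps directly onto shell parameter expansion. The following sketch resolves the log directory the same way; the function name hadoop_log_dir is hypothetical.

```shell
#!/bin/sh
# hadoop_log_dir
# Prints ${HADOOP_LOG_DIR} if it is set and non-empty,
# otherwise falls back to ${HADOOP_HOME}/logs.
hadoop_log_dir() {
    echo "${HADOOP_LOG_DIR:-${HADOOP_HOME}/logs}"
}
```

For instance, ls -t "$(hadoop_log_dir)" lists the log files with the most recent first.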

Browse the web interface for the NameNode and the JobTracker; by default they are available at:

Balancer

Runs the cluster balancing utility. An administrator can simply press Ctrl-C to stop the rebalancing process. See Rebalancer for more details. Other commands are listed under Hadoop commands.
Usage: hadoop balancer [-threshold <threshold>]

COMMAND_OPTION            Description
-threshold <threshold>    Percentage of disk capacity. This overrides the default threshold. The threshold is a percentage in the range (0%, 100%), with a default value of 10%. It sets the target for deciding whether the cluster is balanced.
-threshold 0              Percentage of disk capacity. When the threshold is 0, the balancer invokes our HDFS-HC balance module.
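The threshold rules in the table can be checked before the balancer is launched. The sketch below only builds the command line and does not run it; the function name balancer_cmd is hypothetical. It accepts 0 (the value that selects the HDFS-HC module according to the table) and any number strictly between 0 and 100.

```shell
#!/bin/sh
# balancer_cmd THRESHOLD
# Prints the balancer command line for THRESHOLD, or fails when
# THRESHOLD is not a number in the range [0, 100).
# Per the table above, 0 selects the HDFS-HC module and the other
# values set the balancing threshold.
balancer_cmd() {
    case $1 in
        ''|.|*[!0-9.]*|*.*.*)
            echo "threshold must be a non-negative number" >&2
            return 1 ;;
    esac
    awk -v t="$1" 'BEGIN { exit !(t + 0 >= 0 && t + 0 < 100) }' || {
        echo "threshold must be 0 or lie between 0 and 100" >&2
        return 1
    }
    echo "bin/hadoop balancer -threshold $1"
}
```

Calling balancer_cmd 0 prints the command that requests HDFS-HC placement, and balancer_cmd 10 prints the command for the default threshold.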

References

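If you use our HDFS-HC module in your research, please cite the following paper:

J. Xie, S. Yin, X.-J. Ruan, Z.-Y. Ding, Y. Tian, J. Majors, and X. Qin, "Improving MapReduce Performance via Data Placement in Heterogeneous Hadoop Clusters," Proc. 19th Int'l Heterogeneity in Computing Workshop, Atlanta, Georgia, April 2010.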


Copyright and Disclaimer

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.


Acknowledgments

This software is based upon work supported by the US National Science Foundation under Grants CCF-0845257 (CAREER), CNS-0917137 (CSR), CNS-0757778 (CSR), CCF-0742187 (CPA), CNS-0831502 (CyberTrust), CNS-0855251 (CRI), OCI-0753305 (CI-TEAM), DUE-0837341 (CCLI), and DUE-0830831 (SFS), as well as Auburn University under a startup grant and a gift (Number 2005-04-070) from the Intel Corporation.