Software
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
Publication
If you use our HDFS-HC module in your research, please cite our paper and software in your publications. We would greatly appreciate the credit.
This HDFS-HC tool is based on our paper "Improving MapReduce Performance via Data Placement in Heterogeneous Hadoop Clusters" by J. Xie, S. Yin, X.-J. Ruan, Z.-Y. Ding, Y. Tian, J. Majors, and X. Qin, published in Proc. 19th Int'l Heterogeneity in Computing Workshop, Atlanta, Georgia, April 2010. [PDF | PPT]
Introduction
HDFS-HC is a software module for rebalancing data load in heterogeneous Hadoop clusters. This data placement tool was integrated into the Hadoop distributed file system (HDFS) to initially distribute a large data set to multiple nodes in accordance with the computing capacity of each node.
If you are not familiar with the Hadoop system, please visit the MapReduce overview, which covers the background of MapReduce and how to install Hadoop on a cluster. The purpose of this document is to help you install and use the HDFS-HC data placement tool in a heterogeneous Hadoop cluster.
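To make "in accordance with the computing capacity of each node" concrete, the sketch below splits a number of file blocks across nodes in proportion to a per-node capacity weight. It is only an illustration: the function name `split_blocks`, the integer weights (higher means a faster node), and the rounding rule (blocks lost to integer division go to the first nodes) are assumptions, and the placement algorithm actually used by HDFS-HC is described in the paper and may differ.

```shell
#!/bin/sh
# split_blocks TOTAL WEIGHT...
# Prints one line per node: the number of blocks assigned to that node.
# WEIGHT is a hypothetical positive integer capacity weight (higher = faster).
split_blocks() {
    total=$1
    shift
    sum=0
    for w in "$@"; do
        sum=$((sum + w))
    done
    assigned=0
    shares=""
    for w in "$@"; do
        share=$((total * w / sum))   # integer floor of the proportional share
        shares="$shares $share"
        assigned=$((assigned + share))
    done
    leftover=$((total - assigned))
    # Hand out the blocks lost to rounding, one per node, starting with the first.
    for share in $shares; do
        if [ "$leftover" -gt 0 ]; then
            share=$((share + 1))
            leftover=$((leftover - 1))
        fi
        echo "$share"
    done
}

# Example: 100 blocks across three nodes with weights 3, 2 and 1.
split_blocks 100 3 2 1   # prints 51, 33 and 16 on separate lines
```

With equal weights the split degenerates to an even distribution, which is the behavior a homogeneous cluster would expect.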
Do you have questions on HDFS-HC?
For questions please contact Xiao Qin at xqin@auburn.edu or Jiong Xie at jzx0009@auburn.edu
Supported Platforms
Required Software
Required software for Linux and Windows includes:
Additional requirements for Windows include:
Download
You can download the HDFS-HC tool here. Note that the download is a recent Hadoop package with the HDFS-HC module already integrated.
Installation
Required software for Linux and Windows includes:
Additional requirements for HDFS-HC include:
Hadoop Startup
To start your Hadoop cluster, you need to start both HDFS and the Map/Reduce cluster.
Format a new distributed filesystem:
$ bin/hadoop namenode -format
Start the HDFS with the following command, run on the designated NameNode:
$ bin/start-dfs.sh
The bin/start-dfs.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the NameNode and starts the DataNode daemon on all the listed slaves.
Start Map/Reduce with the following command, run on the designated JobTracker:
$ bin/start-mapred.sh
The bin/start-mapred.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the JobTracker and starts the TaskTracker daemon on all the listed slaves.
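The startup steps above can be collected into one small wrapper. This is a sketch rather than part of HDFS-HC: the function names `run` and `start_cluster` and the `DRY_RUN` variable are hypothetical, and the script assumes the NameNode and the JobTracker are the same host and that it is run from the Hadoop installation directory. Formatting is left out on purpose, because `namenode -format` is meant for a new filesystem and should not be repeated on every start.

```shell
#!/bin/sh
# run CMD...: print the command, then execute it unless DRY_RUN=1
# (DRY_RUN is a hypothetical switch used only by this sketch).
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-0}" != "1" ]; then
        "$@"
    fi
}

# start_cluster: start HDFS first, then Map/Reduce; stop at the first failure.
start_cluster() {
    run bin/start-dfs.sh || return 1
    run bin/start-mapred.sh || return 1
}

# Preview the commands in a subshell without starting any daemon.
(DRY_RUN=1; start_cluster)
```

Calling `start_cluster` without `DRY_RUN=1` runs the two scripts for real. If the NameNode and JobTracker are on different hosts, each script has to be run on its own designated node instead.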
Hadoop Shutdown
Stop HDFS with the following command, run on the designated NameNode:
$ bin/stop-dfs.sh
The bin/stop-dfs.sh script consults the ${HADOOP_CONF_DIR}/slaves file on the NameNode and stops the DataNode daemon on all the listed slaves.
Stop Map/Reduce with the following command, run on the designated JobTracker:
$ bin/stop-mapred.sh
The bin/stop-mapred.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the JobTracker and stops the TaskTracker daemon on all the listed slaves.
The Hadoop daemon log output is written to the ${HADOOP_LOG_DIR} directory (defaults to ${HADOOP_HOME}/logs).
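When a daemon fails to start or stop, the log directory is the first place to look. The sketch below resolves the directory with the same fallback described above and lists the most recently modified files in it. The helper names `log_dir` and `recent_logs` are hypothetical.

```shell
#!/bin/sh
# log_dir: print the daemon log directory, falling back to ${HADOOP_HOME}/logs
# when HADOOP_LOG_DIR is unset or empty.
log_dir() {
    echo "${HADOOP_LOG_DIR:-${HADOOP_HOME}/logs}"
}

# recent_logs [N]: list the N most recently modified files in the log
# directory (default 5). Prints nothing if the directory does not exist.
recent_logs() {
    ls -t "$(log_dir)" 2>/dev/null | head -n "${1:-5}"
}
```

For example, `recent_logs 3` prints the names of the three newest log files, which can then be inspected with `tail`.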
Browse the web interface for the NameNode and the JobTracker; by default they are available at:
Balancer
The balancer command runs the cluster balancing utility. An administrator can press Ctrl-C at any time to stop the rebalancing process. See Rebalancer for more details. Other commands can be found on Hadoop commands.
Usage: hadoop balancer [-threshold <threshold>]
| COMMAND_OPTION | Description |
| --- | --- |
| -threshold <threshold> | Percentage of disk capacity. This overrides the default threshold. The threshold is a percentage in the open range (0%, 100%), with a default value of 10%, and sets the target for deciding whether the cluster is balanced. |
| -threshold 0 | Special value. When the threshold is 0, the balancer invokes our HDFS-HC balance module. |
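The two rows of the table correspond to invocations like the ones built below. This is a sketch: the helper `balancer_cmd` is hypothetical and only prints the command line so that it can be checked before it is run from the Hadoop installation directory. For simplicity it accepts whole-number thresholds only, which is a restriction of the sketch and not of the balancer. The special meaning of 0 is taken from the table above and applies to the HDFS-HC build.

```shell
#!/bin/sh
# balancer_cmd THRESHOLD: print the balancer command line for a threshold.
# Accepts a non-negative integer below 100; 0 selects HDFS-HC per the table.
balancer_cmd() {
    case "$1" in
        ''|*[!0-9]*)
            echo "threshold must be a non-negative integer" >&2
            return 1
            ;;
    esac
    if [ "$1" -ge 100 ]; then
        echo "threshold must be below 100" >&2
        return 1
    fi
    echo "bin/hadoop balancer -threshold $1"
}

balancer_cmd 10   # rebalancing with the 10% default stated explicitly
balancer_cmd 0    # rebalancing through the HDFS-HC module
```

Printing the command instead of running it keeps the sketch safe to try on a machine without Hadoop; pass the printed line to the shell once it looks right.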
References
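If you use our HDFS-HC module to conduct your research, please cite the following paper:
J. Xie, S. Yin, X.-J. Ruan, Z.-Y. Ding, Y. Tian, J. Majors, and X. Qin, "Improving MapReduce Performance via Data Placement in Heterogeneous Hadoop Clusters," Proc. 19th Int'l Heterogeneity in Computing Workshop, Atlanta, Georgia, April 2010.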
Copyright and Disclaimer
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.
Acknowledgments
This software is based upon work supported by the US National Science Foundation under Grants CCF-0845257 (CAREER), CNS-0917137 (CSR), CNS-0757778 (CSR), CCF-0742187 (CPA), CNS-0831502 (CyberTrust), CNS-0855251 (CRI), OCI-0753305 (CI-TEAM), DUE-0837341 (CCLI), and DUE-0830831 (SFS), as well as Auburn University under a startup grant and a gift (Number 2005-04-070) from the Intel Corporation.