
HDFS-HC2: A Data Placement Module for Heterogeneous Hadoop Clusters




Publication

If you use our HDFS-HC module to conduct your research, please cite our paper and software in your publications. We would greatly appreciate it if you gave us credit.

This HDFS-HC tool is based on our paper - Improving MapReduce Performance through Data Placement in Heterogeneous Hadoop Clusters - by J. Xie, S. Yin, X.-J. Ruan, Z.-Y. Ding, Y. Tian, J. Majors, and X. Qin, published in Proc. 19th Int'l Heterogeneity in Computing Workshop, Atlanta, Georgia, April 2010. [PDF | PPT]

For information on HDFS-HC2, please refer to the report - Analysis of Data Placement Strategy based on Computing Power of Nodes on Heterogeneous Hadoop Clusters - by Sanket Reddy Chintapalli. [PDF | PPT | Source Code]

Introduction

HDFS-HC is a software module for rebalancing the data load in heterogeneous Hadoop clusters. This data placement tool was integrated into the Hadoop distributed file system (HDFS) to initially distribute a large data set across multiple nodes in accordance with the computing capacity of each node. For example, a node with twice the computing capacity of another node is assigned roughly twice as many data blocks.

If you are not very familiar with the Hadoop system, please visit the MapReduce overview, which introduces the background of MapReduce and how to install Hadoop on a cluster. The purpose of this document is to help you install and use the HDFS-HC data placement tool in a heterogeneous Hadoop cluster.

Do you have questions on HDFS-HC2?

For questions, please contact Sanket Reddy Chintapalli at szc0060@auburn.edu, Jiong Xie at jzx0009@auburn.edu, or Xiao Qin at xqin@auburn.edu.

Supported Platforms

GNU/Linux is supported as a development and production platform.

Win32 is supported as a development platform, but distributed operation has not been well tested on it, so it is not recommended for production use.

Required Software

Required software for Linux and Windows include:

1. Java 1.6.x or higher must be installed.

2. ssh must be installed and sshd must be running to use the Hadoop scripts that manage remote Hadoop daemons.

Additional requirements for Windows include:

1. Cygwin - required for shell support in addition to the required software above.

Download

You can download the HDFS-HC tool here. Note that this is a recent Hadoop HDFS project file into which the HDFS-HC module has been integrated.

Extracting the tar Archive

After downloading the tarball (i.e., hdfs-hc2.tar.gz), you can run the command below to extract the tar archive:

tar -xzvf hdfs-hc2.tar.gz
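If you want to double-check the archive before or after extracting, tar can list its contents without unpacking them:

tar -tzf hdfs-hc2.tar.gz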

Installation of Hadoop

You need to install Hadoop before working on HDFS-HC2. The required software is the same as that listed in the Required Software section above.


Hadoop Startup

To start your Hadoop cluster, you will need to start both the HDFS and the Map/Reduce daemons.

Format a new distributed filesystem:
$ bin/hadoop namenode -format

Start the HDFS with the following command, run on the designated NameNode:
$ bin/start-dfs.sh

The bin/start-dfs.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the NameNode and starts the DataNode daemon on all the listed slaves.
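For reference, the slaves file is a plain-text list of worker hostnames, one per line; the two hostnames below simply reuse the example nodes that appear later on this page:

hpxeon01
jedi05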

Start Map-Reduce with the following command, run on the designated JobTracker:
$ bin/start-mapred.sh

The bin/start-mapred.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the JobTracker and starts the TaskTracker daemon on all the listed slaves.

Hadoop Shutdown

Stop HDFS with the following command, run on the designated NameNode:
$ bin/stop-dfs.sh

The bin/stop-dfs.sh script consults the ${HADOOP_CONF_DIR}/slaves file on the NameNode and stops the DataNode daemon on all the listed slaves.

Stop Map/Reduce with the following command, run on the designated JobTracker:
$ bin/stop-mapred.sh

The bin/stop-mapred.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the JobTracker and stops the TaskTracker daemon on all the listed slaves.

The Hadoop daemon log output is written to the ${HADOOP_LOG_DIR} directory (defaults to ${HADOOP_HOME}/logs).
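For example, to follow the NameNode log on the master node (the hadoop-*-namenode-*.log pattern reflects Hadoop's standard log file naming; adjust it for your cluster):

$ tail -f ${HADOOP_HOME}/logs/hadoop-*-namenode-*.log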

HDFS-HC2 (a.k.a. CRBalancer)

To run the CRBalancer utility, follow the instructions below.

Instructions

Before getting started, please read 'How to Contribute to Hadoop Projects' at the following link - How to Contribute

How to Interpret Source Files

There are two files and a folder in the package that you need to be concerned with.

1. JAR file - hadoop-hdfs-2.3.0.jar

2. Script - hdfs

3. Hadoop HDFS project folder - hadoop-hdfs.tar.gz

You can learn more about the project folder by looking at the BUILDING.txt file in the above link. It clearly describes how to compile, build, and modify the source code.

Steps to run the CRBalancer

1. Replace the hadoop-hdfs-2.3.0.jar file at the following path - HADOOP_HOME_PATH/share/hadoop/hdfs

2. Replace the script file hdfs at the following path - HADOOP_HOME_PATH/bin/hdfs

3. Run the script with the following parameters; a combined sketch of all three steps appears after the command template below.

hdfs crbalancer -file {full path to computation ratio file} -namenodename {hostname of the NameNode} -port {port number to access the NameNode}
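Putting the three steps together, a minimal shell session might look like the following sketch. It assumes the extracted jar and script sit in the current directory and that HADOOP_HOME_PATH points at a Hadoop 2.3.0 installation; the step 3 values are taken from the Example Command further below.

export HADOOP_HOME_PATH=/usr/local/hadoop   # adjust to your installation
# Step 1: drop in the HDFS-HC2 jar
cp hadoop-hdfs-2.3.0.jar $HADOOP_HOME_PATH/share/hadoop/hdfs/
# Step 2: replace the hdfs launcher script and keep it executable
cp hdfs $HADOOP_HOME_PATH/bin/hdfs
chmod +x $HADOOP_HOME_PATH/bin/hdfs
# Step 3: run the CRBalancer against the NameNode
$HADOOP_HOME_PATH/bin/hdfs crbalancer -file /user/sanket/crmap.txt -namenodename hpxeon01 -port 54310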

What should the computation ratio configuration file look like?

hostname1 ratio
hostname2 ratio
hostname3 ratio
...
hostnameN ratio

Note: The ratios are calculated by placing the entire data set on a single node and finding the least common multiple of the measured values. Place the ratio file in HDFS using hadoop fs -put <local file> HDFS_DIRECTORY_PATH/filename

Example:

hpxeon01 0.36
jedi05 0.54
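As a minimal sketch, the example file above could be written locally and staged in HDFS like this (the destination /user/sanket/crmap.txt matches the example command below):

cat > crmap.txt <<'EOF'
hpxeon01 0.36
jedi05 0.54
EOF
hadoop fs -put crmap.txt /user/sanket/crmap.txt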

Example Command:

hdfs crbalancer -file /user/sanket/crmap.txt -namenodename hpxeon01 -port 54310

How to View Source Code

1. Open the Hadoop HDFS project folder.

2. Navigate to the path hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/crbalancer.

3. You will find three files: CRBalancer.java, CRBalancingPolicy.java, and CRNamenodeConnector.java.

4. CRBalancer.java is the main program, which makes the decision to transfer data among nodes based on computing power.

5. CRBalancingPolicy.java calculates and stores the space occupied by each node; it is used to determine the total space a node currently occupies, which aids in the transfer decision.

6. CRNamenodeConnector.java is used to connect to the NameNode in order to get information about the DataNodes.

7. In each program, I have marked the places where the code differs from the original balancer, which balances nodes based on space utilization rather than computing utilization.
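To browse these files after extracting the project folder, a session along these lines should work (the path is the one given in step 2 above):

tar -xzvf hadoop-hdfs.tar.gz
cd hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/crbalancer
ls
# expected: CRBalancer.java  CRBalancingPolicy.java  CRNamenodeConnector.java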


References

If you use our HDFS-HC module to conduct your research, please cite the following paper:

J. Xie, S. Yin, X.-J. Ruan, Z.-Y. Ding, Y. Tian, J. Majors, and X. Qin, "Improving MapReduce Performance through Data Placement in Heterogeneous Hadoop Clusters," Proc. 19th Int'l Heterogeneity in Computing Workshop, Atlanta, Georgia, April 2010.


Copyright and Disclaimer

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.


Acknowledgments

This software is based upon work supported by the US National Science Foundation under Grants CCF-0845257 (CAREER), CNS-0917137 (CSR), CNS-0757778 (CSR), CCF-0742187 (CPA), CNS-0831502 (CyberTrust), CNS-0855251 (CRI), OCI-0753305 (CI-TEAM), DUE-0837341 (CCLI), and DUE-0830831 (SFS), as well as Auburn University under a startup grant and a gift (Number 2005-04-070) from the Intel Corporation.