Purpose

HDFS-HC is a tool for rebalancing the data load in a heterogeneous Hadoop cluster. Its data placement mechanism for the Hadoop Distributed File System (HDFS) initially distributes a large data set to multiple nodes in accordance with the computing capacity of each node.

For questions, please contact Jiong Xie at JZX0009@auburn.edu.

If you are not very familiar with Hadoop, please visit the MapReduce overview, which explains what MapReduce is and how to set up a Hadoop installation.

The purpose of this document is to help you install and use this rebalance module.

Pre-requisites

Supported Platforms

Required Software

Required software for Linux and Windows includes:

  1. Java™ 1.6.x, preferably from Sun, must be installed.
  2. ssh must be installed and sshd must be running to use the Hadoop scripts that manage remote Hadoop daemons.

Additional requirements for Windows include:

  1. Cygwin - Required for shell support in addition to the required software above.

Download

To get a Hadoop distribution with the HDFS-HC module, download a recent Hadoop package that includes the module; it can be used directly. Alternatively, download the HDFS-HC Java file.

Installation

Installing a Hadoop cluster typically involves unpacking the software on all the machines in the cluster. The steps are exactly the same as for installing native Hadoop. Another good tutorial is provided by Dr. Michael G. Noll.

Typically one machine in the cluster is designated as the NameNode and another machine as the JobTracker, exclusively. These are the masters. The rest of the machines in the cluster act as both DataNode and TaskTracker. These are the slaves.

General Configuration

Now check your configuration:

1.  Java configuration. Make sure /usr/lib/jvm/java-6-sun is listed at the top of /etc/jvm.
2.  Check whether the SSH service is working.
3.  Edit hadoop-env.sh. The only required environment variable to configure for Hadoop in this tutorial is JAVA_HOME. Open <HADOOP_INSTALL>/conf/hadoop-env.sh and set JAVA_HOME to your Java installation directory.
4.  Update the configuration settings in core-site.xml (hadoop.tmp.dir, fs.default.name), mapred-site.xml (mapred.job.tracker) and hdfs-site.xml (dfs.replication).
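
Steps 3 and 4 above can be sketched as a small shell script. This is only a minimal sketch, not the authors' setup: it writes into a scratch directory instead of <HADOOP_INSTALL>/conf, the write_site helper is hypothetical, and the host name master, the ports 54310 and 54311, the directory /app/hadoop/tmp and the replication factor 2 are placeholder assumptions to replace with your own values.

```shell
# Scratch directory standing in for <HADOOP_INSTALL>/conf (assumption).
CONF_DIR=$(mktemp -d)

# Step 2 (not run here): "ssh localhost" should log in without errors.

# Step 3: JAVA_HOME is the only required variable in hadoop-env.sh.
echo 'export JAVA_HOME=/usr/lib/jvm/java-6-sun' > "$CONF_DIR/hadoop-env.sh"

# Hypothetical helper: write one site file from name/value pairs.
write_site() {
  file=$1; shift
  {
    echo '<?xml version="1.0"?>'
    echo '<configuration>'
    while [ "$#" -ge 2 ]; do
      printf '  <property>\n    <name>%s</name>\n    <value>%s</value>\n  </property>\n' "$1" "$2"
      shift 2
    done
    echo '</configuration>'
  } > "$CONF_DIR/$file"
}

# Step 4: all values below are placeholders, not recommended settings.
write_site core-site.xml hadoop.tmp.dir /app/hadoop/tmp fs.default.name hdfs://master:54310
write_site mapred-site.xml mapred.job.tracker master:54311
write_site hdfs-site.xml dfs.replication 2

cat "$CONF_DIR/core-site.xml"
```

After checking the generated files, copy them into <HADOOP_INSTALL>/conf/ on every machine in the cluster.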
 
Additional configuration
1.  Build a computing ratio file, computing-ratio.xml, in the <HADOOP_INSTALL>/conf/ directory.
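
The format of computing-ratio.xml is not described in this document, so no sample file is given here. The idea behind it, placing data in proportion to each node's computing capacity, can be sketched as follows. The node names, the weights and the convention that a larger weight means a faster node are all assumptions made for illustration; they are not taken from HDFS-HC.

```shell
# Hypothetical capacity weights: a larger number means a faster node.
total_blocks=60

# Give each node a share of the blocks proportional to its weight.
# Integer truncation may leave a few blocks unassigned when the
# division is not exact; this sketch does not redistribute them.
shares=$(printf '%s\n' 'nodeA 3' 'nodeB 2' 'nodeC 1' |
  awk -v total="$total_blocks" '
    { name[NR] = $1; weight[NR] = $2; sum += $2 }
    END {
      for (i = 1; i <= NR; i++)
        printf "%s %d\n", name[i], total * weight[i] / sum
    }')

echo "$shares"
```

With these illustrative weights, nodeA would hold half of the 60 blocks, nodeB a third and nodeC a sixth.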

Hadoop Startup

To start a Hadoop cluster you will need to start both the HDFS and the Map/Reduce cluster.

Format a new distributed filesystem:
$ bin/hadoop namenode -format

Start the HDFS with the following command, run on the designated NameNode:
$ bin/start-dfs.sh

The bin/start-dfs.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the NameNode and starts the DataNode daemon on all the listed slaves.
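
For illustration, the slaves file is a plain list with one host name per line. A minimal sketch, assuming three hypothetical hosts named slave1 to slave3 and a scratch directory in place of ${HADOOP_CONF_DIR}:

```shell
# Scratch directory standing in for ${HADOOP_CONF_DIR} (assumption).
SLAVES_DIR=$(mktemp -d)

# One DataNode/TaskTracker host per line; host names are hypothetical.
printf '%s\n' slave1 slave2 slave3 > "$SLAVES_DIR/slaves"

# Count the hosts the start and stop scripts would contact.
slave_count=$(grep -c . "$SLAVES_DIR/slaves")
echo "slaves listed: $slave_count"
```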

Start Map/Reduce with the following command, run on the designated JobTracker:
$ bin/start-mapred.sh

The bin/start-mapred.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the JobTracker and starts the TaskTracker daemon on all the listed slaves.

Hadoop Shutdown

Stop HDFS with the following command, run on the designated NameNode:
$ bin/stop-dfs.sh

The bin/stop-dfs.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the NameNode and stops the DataNode daemon on all the listed slaves.

Stop Map/Reduce with the following command, run on the designated JobTracker:
$ bin/stop-mapred.sh

The bin/stop-mapred.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the JobTracker and stops the TaskTracker daemon on all the listed slaves.

Balancer

Runs a cluster balancing utility. An administrator can simply press Ctrl-C to stop the rebalancing process. See Rebalancer for more details. Other commands can be found in Hadoop commands.

Usage: hadoop balancer [-threshold <threshold>]

COMMAND_OPTION
    Description

-threshold <threshold>
    Percentage of disk capacity. This overrides the default threshold. The threshold parameter is a percentage in the range (0%, 100%) with a default value of 10%. It sets the target for deciding whether the cluster is balanced.

-threshold 0
    Percentage of disk capacity. When the threshold is 0, the HDFS-HC balance module is invoked.
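
The threshold test used by the stock Hadoop balancer can be sketched as follows: a node counts as balanced when its disk utilization differs from the cluster utilization by no more than the threshold. The node names and utilization figures are hypothetical, and the sketch assumes nodes of equal disk capacity so that the cluster utilization is simply the mean. It illustrates only the default criterion, not what the HDFS-HC module does when the threshold is 0.

```shell
# Threshold in percent (10 is the balancer default).
threshold=10

# Hypothetical per-node disk utilization in percent.
report=$(printf '%s\n' 'node1 82' 'node2 55' 'node3 43' |
  awk -v t="$threshold" '
    { name[NR] = $1; used[NR] = $2; sum += $2 }
    END {
      avg = sum / NR   # cluster utilization, assuming equal capacities
      for (i = 1; i <= NR; i++) {
        diff = used[i] - avg
        if (diff > t)        state = "over"
        else if (diff < -t)  state = "under"
        else                 state = "balanced"
        printf "%s %s\n", name[i], state
      }
    }')

echo "$report"
```

Here the mean utilization is 60%, so node1 is over-utilized, node3 is under-utilized and node2 is within the threshold.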

 
 
 

The Hadoop daemon log output is written to the ${HADOOP_LOG_DIR} directory (defaults to ${HADOOP_HOME}/logs).

Browse the web interface for the NameNode and the JobTracker; by default they are available at:

 

If you need to cite the HDFS-HC papers, please use the following:

Improving MapReduce Performance via Data Placement in Heterogeneous Hadoop Clusters. J. Xie, S. Yin, X.-J. Ruan, Z.-Y. Ding, Y. Tian, J. Majors, and X. Qin, Proc. 19th Int'l Heterogeneity in Computing Workshop, Atlanta, Georgia, April 2010.

If you need to cite the HDFS-HC distribution site, please use the following:

Jiong Xie, "HDFS-HC: Improving MapReduce Performance through Data Placement in Heterogeneous Hadoop Clusters (Version 1.0)", http://www.eng.auburn.edu/_xqin/software/hdfs-hc/. April 2010.

 

Copyright and Disclaimer

Acknowledgment

The HDFS-HC module environment is based on the implementation described in [1].

References

 

Sponsors

This project has been generously supported by the NSF.

This material is based upon work supported by the US National Science Foundation under Grants CCF-0845257 (CAREER), CNS-0917137 (CSR), CNS-0757778 (CSR), CCF-0742187 (CPA), CNS-0831502 (CyberTrust), CNS-0855251 (CRI), OCI-0753305 (CI-TEAM), DUE-0837341 (CCLI), and DUE-0830831 (SFS), as well as Auburn University under a startup grant and a gift (Number 2005-04-070) from the Intel Corporation.