Xiao Qin's Research

Auburn University

QoSec Project

A Middleware Approach to Teaching Computer Security (2009 - )



Project 1: Introduction To Hadoop


Project Description

Hadoop is widely used for processing large data sets and for running processing-intensive applications in parallel. Hadoop is designed to run on a cluster of nodes, but it can also run on a single machine for debugging and testing purposes. When running on a single machine, Hadoop uses separate Java processes to simulate the nodes of a cluster. While a single machine will not give the performance of a cluster, this is an excellent way to try out a Hadoop application before consuming the runtime of a cluster. This project will show how to set up a single-node Hadoop cluster.


Resources

1. Tom White, Hadoop: The Definitive Guide, O’Reilly Media, 2009. Chapters 1-3.

2. Website of Michael G. Noll:
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/

3. Hadoop Download: http://www.apache.org/dyn/closer.cgi/hadoop/core



System Requirements

1. Ubuntu version 8.04 or later.

2. Sun Java 6

3. SSH installed
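
On a stock Ubuntu system of this era, the last two requirements can usually be satisfied through apt; the package names below are typical but may vary by release (on newer releases you may first need to enable the partner repository for Sun Java):

$ sudo apt-get install sun-java6-jdk
$ sudo apt-get install ssh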



Project Tasks

This project is an exploration of Hadoop on Ubuntu Linux. You are encouraged to change some of the values and see what they do. Be creative, and make sure you understand why these tasks are important.

1. (XX points) First, create a “hadoop” user, configure SSH, and disable IPv6:

(a) Create the “hadoop” user:
$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hadoop

(b) Configure SSH for the “hadoop” user that was just created:
user@ubuntu:~$ su - hadoop
hadoop@ubuntu:~$ ssh-keygen -t rsa -P ""

The second command creates an RSA key pair with an empty passphrase. Generally, using an empty passphrase is not recommended, but in this case it is needed to unlock the key without your interaction (you don’t want to enter a passphrase every time Hadoop interacts with its nodes). Next, you have to enable SSH access to your local machine with this newly created key:

hadoop@ubuntu:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

The final step is to test the SSH setup by connecting to your local machine as the hadoop user. This step is also needed to save your local machine’s host key fingerprint to the hadoop user’s known_hosts file.

hadoop@ubuntu:~$ ssh localhost

When it asks “Are you sure you want to continue connecting?”, type yes.

(c) Disable IPv6: To disable IPv6 on Ubuntu 10.04 LTS (the procedure may differ slightly on other releases), open /etc/sysctl.conf in the editor of your choice and add the following lines to the end of the file:

#disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

You must reboot the machine for the changes to take effect. You can check whether IPv6 is enabled on your machine with the following command:

$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6

A return value of 0 means IPv6 is enabled; a value of 1 means it is disabled (that’s what we want).
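
If a reboot is inconvenient, the standard sysctl reload command applies the settings from /etc/sysctl.conf immediately:

$ sudo sysctl -p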

2. (XX points) Installation of Hadoop.

(a) Installation:
You have to download Hadoop from the Apache Download Mirrors (link listed in the resources) and extract the contents of the Hadoop package to a location of your choice; this tutorial uses /usr/local/hadoop. Make sure to change the owner of all the files to the hadoop user and group, for example:

$ cd /usr/local
$ sudo tar xzf hadoop-0.20.2.tar.gz
$ sudo mv hadoop-0.20.2 hadoop
$ sudo chown -R hadoop:hadoop hadoop

(b) Set up the Hadoop environment.
• In file conf/hadoop-env.sh:

The only required environment variable we have to configure for Hadoop in this tutorial is JAVA_HOME. Open conf/hadoop-env.sh in the editor of your choice (if you used the installation path in this tutorial, the full path is /usr/local/hadoop/conf/hadoop-env.sh) and set the JAVA_HOME environment variable to the Sun JDK/JRE 6 directory.

Change:
# The java implementation to use. Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
To:
# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/java-6-sun

(c) Configuring *-site.xml files:
In this section, we will configure the directory where Hadoop will store its data files, the network ports it listens to, etc. Our setup will use Hadoop’s Distributed File System, HDFS, even though our little cluster only contains our single local machine.

You can leave the settings below as is, with the exception of the hadoop.tmp.dir property, which you have to change to a directory of your choice. We will use the directory /app/hadoop/tmp in this tutorial. Hadoop’s default configurations use hadoop.tmp.dir as the base temporary directory both for the local file system and for HDFS, so don’t be surprised if you see Hadoop creating the specified directory automatically on HDFS at some later point.
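
The settings referred to above live in three files under conf/. The snippet below is a sketch of the standard single-node configuration for Hadoop 0.20, following the Noll tutorial listed in the resources; place each <property> element between the <configuration> tags of the named file:

In conf/core-site.xml:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system.</description>
</property>

In conf/mapred-site.xml:

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce JobTracker runs at.</description>
</property>

In conf/hdfs-site.xml:

<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication; 1 is sufficient on a single node.</description>
</property>

Now we create the base temporary directory and set the required ownerships and permissions: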

$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hadoop:hadoop /app/hadoop/tmp
$ sudo chmod 750 /app/hadoop/tmp

If you forget to set the required ownerships and permissions, you will see a java.io.IOException when you try to format the NameNode in the next section.

3. (XX points) Formatting the NameNode:

The first step in starting up your Hadoop installation is formatting the Hadoop filesystem, which is implemented on top of the local filesystem of your “cluster” (which contains only your local machine if you followed this tutorial). You need to do this the first time you set up a Hadoop cluster. Do not format a running Hadoop filesystem, as you will lose all the data currently in the cluster (in HDFS).

To format the filesystem (which simply initializes the directory specified by the dfs.name.dir property), run the command:
hadoop@ubuntu:~$ /usr/local/hadoop/bin/hadoop namenode -format
The output will look like this:
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop namenode -format
10/05/08 16:59:56 INFO namenode.NameNode: STARTUP_MSG:

/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = ubuntu/127.0.1.1
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 0.20.2
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/
common/branches/branch-0.20 -r 911707; compiled by ’chrisdo’
on Fri Feb 19 08:07:34 UTC 2010
************************************************************/

10/05/08 16:59:56 INFO namenode.FSNamesystem:
fsOwner=hadoop,hadoop
10/05/08 16:59:56 INFO namenode.FSNamesystem:
supergroup=supergroup
10/05/08 16:59:56 INFO namenode.FSNamesystem:
isPermissionEnabled=true
10/05/08 16:59:56 INFO common.Storage:
Image file of size 96 saved in 0 seconds.
10/05/08 16:59:57 INFO common.Storage: Storage directory
.../hadoop-hadoop/dfs/name has been successfully formatted.
10/05/08 16:59:57 INFO namenode.NameNode: SHUTDOWN_MSG:

/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1
************************************************************/

hadoop@ubuntu:/usr/local/hadoop$

4. (XX points) Starting a Single-Node Cluster:

Run the command:
hadoop@ubuntu:~$ /usr/local/hadoop/bin/start-all.sh

This will start up a NameNode, a SecondaryNameNode, a DataNode, a JobTracker, and a TaskTracker on your machine. A nifty tool for checking whether the expected Hadoop processes are running is jps (part of Sun’s Java since v1.5.0).

hadoop@ubuntu:/usr/local/hadoop$ jps
2287 TaskTracker
2149 JobTracker
1938 DataNode
2085 SecondaryNameNode
2349 Jps
1788 NameNode

If there are any errors, examine the log files in the /usr/local/hadoop/logs/ directory.

To stop the cluster, run the command:

hadoop@ubuntu:~$ /usr/local/hadoop/bin/stop-all.sh

5. (XX points) Running an example Hadoop application: For this step, go to Michael G. Noll’s website listed in the resources at the beginning of this document and click on the link labelled “Running A MapReduce Job”. Download the three text files from Project Gutenberg and run the “WordCount” application as described on his website; a sketch of the commands involved follows below. You are finished when you reach the section labelled “Hadoop Web Interfaces”.
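
For orientation, the commands involved look roughly like the following (a sketch that assumes the downloaded texts sit in /tmp/gutenberg and that you are using the hadoop-0.20.2-examples.jar shipped with this release; follow the website for the exact steps):

hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -copyFromLocal /tmp/gutenberg /user/hadoop/gutenberg
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop-0.20.2-examples.jar wordcount /user/hadoop/gutenberg /user/hadoop/gutenberg-output
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -cat /user/hadoop/gutenberg-output/part-r-00000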



Submission

You need to submit a detailed lab report describing what you have done and what you have observed. You also need to explain any observations that are interesting or surprising.