Xiao Qin's Research

Auburn University

QoSec Project

A Middleware Approach to Teaching Computer Security (2009 - )



Project 3: Cluster Performance


Project Description

Hadoop is at its best when working with large data sets. This project shows how to run a program on large data sets and how to examine the results of those runs.


Resources

1. Hadoop: The Definitive Guide. Author: Tom White. O’Reilly Media, 2009.

2. Input files are located on BlackBoard (i.e. 1MB.txt, 64MB.txt, 512MB.txt, 1GB.txt, 2GB.txt, 4GB.txt, 8GB.txt)



System Requirements

1. Ubuntu version 8.04 or later.

2. Sun Java 6

3. SSH installed

4. Apache Hadoop installed



Project Tasks

In this project you will run a Hadoop program on large data sets. You will then organize and chart the results to see what Hadoop can really do.

1. (XX points) To begin, write a script that executes wordcount on each of the input files. Make sure the script outputs the information needed to calculate how long each execution takes. Refer to Project 1 to refamiliarize yourself with wordcount if necessary.
Once the script is finished, submit your job to the cluster. Proofread the script carefully first so that you do not have to run it again.

2. (XX points) Organize the data you collected (Excel is recommended). Use tables and charts to show your findings.

3. (XX points) Record all of your observations regarding the data and draw conclusions that your findings support. Be clear and descriptive.
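
As a starting point for Task 1, the timing step could be sketched roughly as follows. This is a minimal sketch in Python 3, not a required solution: the jar name `hadoop-examples.jar`, the HDFS directories `input` and `output`, the results file `timings.csv`, and the helper names are all assumptions made for illustration, and the script assumes the input files have already been copied into HDFS and that the `hadoop` command is on your PATH.

```python
import csv
import shutil
import subprocess
import sys
import time

# Assumed values; none of these come from the assignment. Adjust to your cluster.
EXAMPLES_JAR = "hadoop-examples.jar"  # hypothetical path to the Hadoop examples jar
INPUT_DIR = "input"                   # hypothetical HDFS directory holding the input files
OUTPUT_DIR = "output"                 # hypothetical HDFS parent directory for the results
INPUT_FILES = ["1MB.txt", "64MB.txt", "512MB.txt", "1GB.txt",
               "2GB.txt", "4GB.txt", "8GB.txt"]


def build_wordcount_command(name, jar=EXAMPLES_JAR):
    """Return the argument list for one wordcount run on the named input file."""
    # Each run gets its own output directory; Hadoop refuses to start a job
    # whose output directory already exists.
    return ["hadoop", "jar", jar, "wordcount",
            f"{INPUT_DIR}/{name}", f"{OUTPUT_DIR}/{name}.out"]


def time_command(cmd):
    """Run cmd, wait for it to finish, and return (elapsed_seconds, exit_code)."""
    start = time.perf_counter()
    result = subprocess.run(cmd)
    return time.perf_counter() - start, result.returncode


def main(results_path="timings.csv"):
    """Time wordcount on every input file and record one CSV row per run."""
    with open(results_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["input_file", "elapsed_seconds", "exit_code"])
        for name in INPUT_FILES:
            elapsed, code = time_command(build_wordcount_command(name))
            writer.writerow([name, f"{elapsed:.2f}", code])
            f.flush()  # keep finished rows even if a later run fails
            print(f"{name}: {elapsed:.2f} s (exit code {code})")


if __name__ == "__main__":
    if shutil.which("hadoop") is None:
        print("hadoop not found on PATH; nothing was run")
    else:
        main()
```

The sketch measures wall-clock time around each job, which includes job startup overhead as well as the map and reduce phases. Writing the timings to a CSV file makes them easy to open in Excel for Task 2.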



Submission

You need to submit a ReadMe (including your script and command line output) as well as a detailed lab report (including your data collection and conclusions). You also need to explain any observations that are interesting or surprising.