PFDA Lab 2
MapReduce Median / Standard Deviation
1) Generate some random data
a) Create a working directory called MedianStdDev with a subfolder input
b) Create the following shell script called gendata.sh
#!/bin/bash
echo "******* Usage ******"
echo "./gendata.sh NumRows OutputFile"
echo "E.g.: ./gendata.sh 3000 sample.txt"
echo "..."
echo "******* Output $1 lines to $2 *******"
# set counter
count=1
# zap output file
> "$2"
# Loop
while [ $count -le $1 ]
do
# Generate a random length (0-255), then that many random alphanumeric characters
randomnumber=`od -A n -t d -N 1 /dev/urandom`
randomtext=`cat /dev/urandom | tr -cd "[:alnum:]" | head -c $randomnumber`
# Generate a random number
randomnumber=`od -A n -t d -N 1 /dev/urandom`
# Output to file (the sed strips the spaces that od emits)
echo "$count,$randomtext,$randomnumber" | sed -e "s: *::g" >> "$2"
# Increment counter
count=$(($count + 1))
# Print a progress dot every 500 lines
if [ $(($count % 500)) -eq 0 ]
then
echo -n "."
fi
done
echo "******* Output complete *******"
c) Use gendata.sh to generate some random data into a file called sample1.txt and copy
this file into the input folder.
NOTE: You may have to make the script executable with: $ chmod +x gendata.sh
2) The data generated by gendata.sh is produced as a comma delimited file containing three
fields per line. The first field is a counter commencing at 1. The second field is a random set
of characters – this data is not of fixed length. The third field is a random integer
between 0 and 255.
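To make the categorisation concrete, a line such as 17,aB9xQ,142 would be keyed by 5 (the length of the second field) with value 142. The helper below is a sketch only (the class and method names are illustrative, not part of the required solution):

```java
// Sketch: parse one gendata.sh line into (category, value), where the
// category is the length of the random-text field.
public class LineParser {
    // Returns {lengthOfField2, valueOfField3} for a "count,text,number" line
    public static int[] parse(String line) {
        String[] fields = line.split(",", -1);   // -1 keeps empty fields
        int category = fields[1].length();       // field 2: random text
        int value = Integer.parseInt(fields[2]); // field 3: 0-255
        return new int[] { category, value };
    }

    public static void main(String[] args) {
        int[] kv = parse("17,aB9xQ,142");
        System.out.println(kv[0] + " -> " + kv[1]); // prints "5 -> 142"
    }
}
```

Note that the text field can be empty (when the random length is 0), so the parser must tolerate a line like 1,,0.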
Your programming task is to use the Hadoop MapReduce framework to process this data to
compute a set of median and standard deviation values. To do this you should:
a) Categorise the numerical values in field 3 according to the length of the string of
characters in field 2;
b) Calculate the median of the categorised field 3 values;
c) Calculate the standard deviation of the categorised field 3 values.
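The reduce-side arithmetic for steps b) and c) can be sketched in plain Java, independently of Hadoop. This is one reasonable reading of the task (a population standard deviation over each category's values); the class and method names are illustrative:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch: median and population standard deviation of one category's values.
public class MedianStdDev {
    public static double median(List<Integer> values) {
        List<Integer> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        int n = sorted.size();
        if (n % 2 == 1) {
            return sorted.get(n / 2);            // odd count: middle element
        }
        return (sorted.get(n / 2 - 1) + sorted.get(n / 2)) / 2.0; // even: mean of middle pair
    }

    public static double stdDev(List<Integer> values) {
        double mean = 0;
        for (int v : values) mean += v;
        mean /= values.size();
        double sumSq = 0;
        for (int v : values) sumSq += (v - mean) * (v - mean);
        return Math.sqrt(sumSq / values.size()); // population standard deviation
    }

    public static void main(String[] args) {
        System.out.println(median(java.util.Arrays.asList(1, 2, 3, 4))); // prints 2.5
        System.out.println(stdDev(java.util.Arrays.asList(1, 3)));       // prints 1.0
    }
}
```

In the actual reducer the values arrive as an Iterable, so they must be copied into a list before sorting, because Hadoop may reuse the underlying objects between iterations.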
3) Create a new project in Eclipse called medianStdDev
4) Add the required Hadoop libraries to your project. To do this right-click on the project and
navigate to: Build Path > Configure Build Path > Libraries > Add External JARs
The required jar files can be found in the following directories:
/usr/local/hadoop/share/hadoop/common
/usr/local/hadoop/share/hadoop/common/lib
/usr/local/hadoop/share/hadoop/hdfs
/usr/local/hadoop/share/hadoop/mapreduce
/usr/local/hadoop/share/hadoop/yarn
5) Create classes for the following:
a) A driver class containing the main method
b) A mapper class containing the map method
c) A reducer class containing the reduce method
d) An (optional) tuple class (implementing the Writable interface) to use for outputting the
reduce data
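One possible shape for the optional tuple class is sketched below. The Hadoop plumbing is omitted so the sketch stays self-contained: the real class would implement org.apache.hadoop.io.Writable, adding write(DataOutput) and readFields(DataInput). The fields and toString() would stay the same, since the default TextOutputFormat calls toString() when writing the output file:

```java
// Sketch of the optional tuple class (Writable methods omitted).
public class MedianStdDevTuple {
    private double median;
    private double stdDev;

    public void setMedian(double median) { this.median = median; }
    public void setStdDev(double stdDev) { this.stdDev = stdDev; }
    public double getMedian() { return median; }
    public double getStdDev() { return stdDev; }

    @Override
    public String toString() {
        // Tab-separated, matching the key<TAB>value layout of part-r-00000
        return median + "\t" + stdDev;
    }
}
```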
6) Create a runMedianStdDev.sh script to run the MapReduce job as done for previous
examples.
7) View the content of the MapReduce output file part-r-00000