PFDA Lab 2
MapReduce Median / Standard Deviation
1) Generate some random data
a) Create a working directory called MedianStdDev with a subfolder input
b) Create the following shell script called gendata.sh
#!/bin/bash
echo "******* Usage ******"
echo "./gendata.sh NumRows OutputFile"
echo "E.g.: ./gendata.sh 3000 sample.txt"
echo "..."
echo "******* Output $1 lines to $2 *******"
# set counter
count=1
# zap output file
> "$2"
# Loop
while [ $count -le $1 ]
do
# Generate a random length (0-255), then that many random alphanumeric characters
randomnumber=`od -A n -t d -N 1 /dev/urandom`
randomtext=`cat /dev/urandom | tr -cd "[:alnum:]" | head -c $randomnumber`
# Generate a random number
randomnumber=`od -A n -t d -N 1 /dev/urandom`
# Output to file (the sed strips the spaces that od emits)
echo "$count,$randomtext,$randomnumber" | sed -e "s: *::g" >> "$2"
# Increment counter
count=$(($count + 1))
# Print a progress dot every 500 lines
if [ $(($count % 500)) -eq 0 ]
then
echo -n "."
fi
done
echo "******* Output complete *******"
c) Use gendata.sh to generate some random data into a file called sample1.txt and copy
this file into the input folder.
NOTE: You may have to make the script executable with: $ chmod +x gendata.sh
2) The data generated by gendata.sh is produced as a comma delimited file containing three
fields per line. The first field is a counter commencing at 1. The second field is a random set
of characters – this data is not of fixed length. The third field is a random integer
between 0 and 255.
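To make the categorisation concrete, a line such as 17,aB9xQ,142 would be keyed by 5 (the length of the second field) with value 142. The helper below is a sketch only (the class and method names are illustrative, not part of the required solution):

```java
// Sketch: parse one gendata.sh line into (category, value), where the
// category is the length of the random-text field.
public class LineParser {
    // Returns {lengthOfField2, valueOfField3} for a "count,text,number" line
    public static int[] parse(String line) {
        String[] fields = line.split(",", -1);   // -1 keeps empty fields
        int category = fields[1].length();       // field 2: random text
        int value = Integer.parseInt(fields[2]); // field 3: 0-255
        return new int[] { category, value };
    }

    public static void main(String[] args) {
        int[] kv = parse("17,aB9xQ,142");
        System.out.println(kv[0] + " -> " + kv[1]); // prints "5 -> 142"
    }
}
```

Note that the text field can be empty (when the random length is 0), so the parser must tolerate a line like 1,,0.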
Your programming task is to use the Hadoop MapReduce framework to process this data to
compute a set of median and standard deviation values. To do this you should:
a) Categorise the numerical values in field 3 according to the length of the string of
characters in field 2;
b) Calculate the median of the categorised field 3 values;
c) Calculate the standard deviation of the categorised field 3 values.
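The reduce-side arithmetic for steps b) and c) can be sketched in plain Java, independently of Hadoop. This is one reasonable reading of the task (a population standard deviation over each category's values); the class and method names are illustrative:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch: median and population standard deviation of one category's values.
public class MedianStdDev {
    public static double median(List<Integer> values) {
        List<Integer> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        int n = sorted.size();
        if (n % 2 == 1) {
            return sorted.get(n / 2);            // odd count: middle element
        }
        return (sorted.get(n / 2 - 1) + sorted.get(n / 2)) / 2.0; // even: mean of middle pair
    }

    public static double stdDev(List<Integer> values) {
        double mean = 0;
        for (int v : values) mean += v;
        mean /= values.size();
        double sumSq = 0;
        for (int v : values) sumSq += (v - mean) * (v - mean);
        return Math.sqrt(sumSq / values.size()); // population standard deviation
    }

    public static void main(String[] args) {
        System.out.println(median(java.util.Arrays.asList(1, 2, 3, 4))); // prints 2.5
        System.out.println(stdDev(java.util.Arrays.asList(1, 3)));       // prints 1.0
    }
}
```

In the actual reducer the values arrive as an Iterable, so they must be copied into a list before sorting, because Hadoop may reuse the underlying objects between iterations.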
3) Create a new project in Eclipse called medianStdDev
4) Add the required Hadoop libraries to your project. To do this right-click on the project and
navigate to: Build Path > Configure Build Path > Libraries > Add External JARs
The required jar files can be found in the following directories:
/usr/local/hadoop/share/hadoop/common
/usr/local/hadoop/share/hadoop/common/lib
/usr/local/hadoop/share/hadoop/hdfs
/usr/local/hadoop/share/hadoop/mapreduce
/usr/local/hadoop/share/hadoop/yarn
5) Create classes for the following:
a) A driver class containing the main method
b) A mapper class containing the map method
c) A reducer class containing the reduce method
d) An (optional) tuple class (implementing the Writable interface) to use for outputting the
reduce data
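One possible shape for the optional tuple class is sketched below. The Hadoop plumbing is omitted so the sketch stays self-contained: the real class would implement org.apache.hadoop.io.Writable, adding write(DataOutput) and readFields(DataInput). The fields and toString() would stay the same, since the default TextOutputFormat calls toString() when writing the output file:

```java
// Sketch of the optional tuple class (Writable methods omitted).
public class MedianStdDevTuple {
    private double median;
    private double stdDev;

    public void setMedian(double median) { this.median = median; }
    public void setStdDev(double stdDev) { this.stdDev = stdDev; }
    public double getMedian() { return median; }
    public double getStdDev() { return stdDev; }

    @Override
    public String toString() {
        // Tab-separated, matching the key<TAB>value layout of part-r-00000
        return median + "\t" + stdDev;
    }
}
```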
6) Create a runMedianStdDev.sh script to run the MapReduce job as done for previous
examples.
7) View the content of the MapReduce output file part-r-00000