Assignment title: Information
Programming for Data Analytics
Practical Test 1
Sem. 2, 2015-2016
For this Practical Test you are required to process a set of data using the Hadoop MapReduce
virtual machine environment that you have set up on your laptop.
You must complete each of the following tasks:
1) Generate 10,000 lines of random data using the gendata.sh bash script. This script is available
for download on Moodle. The script takes two parameters: the first parameter is the number
of lines of random data to generate, the second parameter is the name of the file to which the
data is written.
Each row of output contains the following fields:

Field    Type       Description
Field 1  Integer    Row counter
Field 2  Date-Time  Timestamp of when the row of data was created
Field 3  Character  Single uppercase character [A-Z] representing the record category
Field 4  String     A user name string. This field can contain one of three messages.
Field 5  String     A group name string. This field can contain one of two messages.
Field 6  Integer    A random integer [0-255]
Field 7  String     A string of randomly generated alphanumeric characters
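As a starting point, the field layout above can be turned into a small parsing helper. This is a minimal sketch only: the field delimiter is not stated in this brief, so the comma used here is an assumption that should be checked against the actual gendata.sh output, and the class name, method names, and sample row are all hypothetical.

```java
final class RowFields {
    // Assumption: fields are comma-separated. Change this to match gendata.sh.
    static final String DELIMITER = ",";
    static final int FIELD_COUNT = 7;

    /** Splits one row into its seven trimmed fields, or returns null if malformed. */
    static String[] split(String line) {
        if (line == null) {
            return null;
        }
        // The -1 limit keeps trailing empty fields so the count check is reliable.
        String[] parts = line.split(DELIMITER, -1);
        if (parts.length != FIELD_COUNT) {
            return null;
        }
        for (int i = 0; i < parts.length; i++) {
            parts[i] = parts[i].trim();
        }
        return parts;
    }

    /** Field 3: the single-character record category. */
    static char category(String[] fields) {
        return fields[2].charAt(0);
    }

    /** Field 6: the random integer in the range 0 to 255. */
    static int value(String[] fields) {
        return Integer.parseInt(fields[5]);
    }

    public static void main(String[] args) {
        // Hypothetical sample row, not real gendata.sh output.
        String sample = "1,2016-02-01 10:15:00,K,alice,staff,200,a8Xk2Lq";
        String[] fields = split(sample);
        System.out.println(category(fields) + " " + value(fields));
    }
}
```

Returning null for a malformed row lets a map method skip bad input instead of failing the whole job.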
2) MapReduce Java Programming Task 1:
Process the data you have generated from step 1) to find the average of Field 6 values
grouped by Field 3 values.
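The logic of this task can be sketched in plain Java before it is moved into Hadoop classes. The sketch below does not use the Hadoop API: it only imitates the map, group, and reduce stages in memory, assumes comma-separated rows (an assumption, since the delimiter is not stated here), and uses hypothetical class and method names and made-up sample rows.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class AverageByCategory {
    /** Map stage: emit (Field 3, Field 6) for one row, or null if the row is malformed. */
    static Map.Entry<String, Integer> map(String line) {
        String[] f = line.split(",", -1); // assumption: comma-separated fields
        if (f.length != 7) {
            return null;
        }
        try {
            return new AbstractMap.SimpleEntry<>(f[2].trim(), Integer.parseInt(f[5].trim()));
        } catch (NumberFormatException e) {
            return null;
        }
    }

    /** Reduce stage: average all Field 6 values collected for one category. */
    static double reduce(List<Integer> values) {
        long sum = 0;
        for (int v : values) {
            sum += v;
        }
        return (double) sum / values.size();
    }

    /** Imitates the framework: map every row, group by key, then reduce each group. */
    static Map<String, Double> run(List<String> lines) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (String line : lines) {
            Map.Entry<String, Integer> kv = map(line);
            if (kv != null) {
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<String, Double> averages = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : groups.entrySet()) {
            averages.put(e.getKey(), reduce(e.getValue()));
        }
        return averages;
    }

    public static void main(String[] args) {
        // Hypothetical rows for illustration only.
        List<String> rows = Arrays.asList(
                "1,2016-02-01 10:15:00,A,alice,staff,10,abc",
                "2,2016-02-01 10:15:01,A,bob,admin,20,def",
                "3,2016-02-01 10:15:02,B,alice,staff,5,ghi");
        System.out.println(run(rows));
    }
}
```

One design point to keep in mind when porting this to Hadoop: an average of partial averages is not generally the overall average, so if a combiner is used it should pass on a sum and a count rather than an average.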
3) MapReduce Java Programming Task 2:
Process the data you have generated from step 1) to find all distinct combinations of Field
2, Field 4, and Field 5 values.
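One common way to find distinct combinations in MapReduce is to emit the combination itself as the key, so that grouping by key collapses the duplicates. The plain-Java sketch below imitates that idea in memory without the Hadoop API. It assumes comma-separated rows and joins the three fields with a tab; the delimiter, the class and method names, and the sample rows are all assumptions for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class DistinctCombinations {
    /** Map stage: build a composite key from Fields 2, 4 and 5, or null if malformed. */
    static String map(String line) {
        String[] f = line.split(",", -1); // assumption: comma-separated fields
        if (f.length != 7) {
            return null;
        }
        return f[1].trim() + "\t" + f[3].trim() + "\t" + f[4].trim();
    }

    /**
     * Imitates group-by-key plus a reduce stage that writes each key once.
     * A sorted set stands in for the framework's shuffle and sort.
     */
    static Set<String> run(List<String> lines) {
        Set<String> distinct = new TreeSet<>();
        for (String line : lines) {
            String key = map(line);
            if (key != null) {
                distinct.add(key);
            }
        }
        return distinct;
    }

    public static void main(String[] args) {
        // Hypothetical rows for illustration only.
        List<String> rows = Arrays.asList(
                "1,2016-02-01 10:15:00,K,alice,staff,200,abc",
                "2,2016-02-01 10:15:00,Q,alice,staff,17,xyz",
                "3,2016-02-01 10:15:01,K,bob,admin,3,qqq");
        for (String combination : run(rows)) {
            System.out.println(combination);
        }
    }
}
```

In a Hadoop version of this sketch, the reduce method would ignore the values for each key and write the key a single time.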
NOTE:
i. Create a separate Eclipse project for each of the programming tasks.
ii. For each of the programming tasks you should create appropriate classes and methods. At a
minimum, you must include a main driver method, a map method, and a reduce method.
iii. Your submission must include:
a) the source code for each project;
b) the result of each analysis;
c) the sample data that you processed.
iv. Zip all the required elements into a single file and submit it via the Moodle submission
link.