Tutorial: Hadoop Streaming MapReduce with AWS EMR

1) Introduction

In this tutorial we will utilise the Hadoop Streaming functionality provided by AWS EMR (Amazon Web Services Elastic MapReduce). Hadoop Streaming allows us to write map and reduce functions in Python (or any other programming language that can read data from STDIN and write data to STDOUT). Our goal is to count the total number of occurrences of each character contained in a sample piece of text. To perform this task we will create two Python programs, mapper.py and reducer.py. The text we will analyse is Voltaire's Micromegas (free to download from http://www.gutenberg.org/cache/epub/30123/pg30123.txt).

The other AWS services we will utilise are S3 and EC2. S3 will be used to:
- store the input text that we will analyse
- store the mapper.py and reducer.py Python code files
- store the output from the analysis
- store log and error data produced during the analysis process

EC2 will be used to create the compute infrastructure that hosts the Hadoop cluster environment and processes the input. This infrastructure will be created automatically for us as part of configuring and creating the EMR Hadoop cluster.

2) Create the mapper.py and reducer.py Python programs

a) Using IDLE or an editor of your choice, create the following two Python programs:
   i) mapper.py
   ii) reducer.py
b) Note: in order to use the built-in aggregate reducer, the keys output by the map function must be prepended with the string 'LongValueSum:'.

3) Preparation

a) Sign in to AWS (use Google Chrome as your browser).
b) Create a folder named after your student Id in the ie.ncirl.mscda bucket. E.g., if your student Id is x99999999, create a folder named x99999999 in the ie.ncirl.mscda bucket. Note: please use this folder for storing any S3 data resources you will use for this or future tutorials/project work.
c) Create the following subfolder structure under the folder that has been created using your student Id (use the Create Folder option):

      charcount
         input
         code

   Note: do not create an output or a logs folder! These will be created automatically by AWS EMR. The MapReduce task will fail to execute successfully if the output folder already exists at this stage.
d) Download the text from http://www.gutenberg.org/cache/epub/30123/pg30123.txt and store it locally on your PC as Micromegas.txt.
e) Upload the Micromegas.txt file to the s3://ie.ncirl.mscda/<studentId>/charcount/input folder on S3, where <studentId> should be replaced with the name of the folder you created in step b).
f) Upload the mapper.py file to the s3://ie.ncirl.mscda/<studentId>/charcount/code folder on S3, where <studentId> should be replaced with the name of the folder you created in step b).
g) Upload the reducer.py file to the s3://ie.ncirl.mscda/<studentId>/charcount/code folder on S3, where <studentId> should be replaced with the name of the folder you created in step b).

4) Cluster Set-up and Job Execution

The next step is to create and configure an AWS EMR cluster.

a) Navigate to AWS Elastic Map Reduce and click Create Cluster.
b) When creating your cluster, please choose the EU (Ireland) region.
c) Cluster Configuration: provide a cluster name and specify a location for the log files. Replace the x99999999 in this example with your own student Id.
d) Software Configuration: you can remove the Pig and Hive installation configuration options.
e) Hardware Configuration / Security And Access / Bootstrap Actions: accept the defaults.
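The uploads in steps e) to g) are done through the S3 web console in this tutorial, but they can also be scripted. The sketch below is an assumption, not part of the handout: it uses the boto3 library (which must be installed, with AWS credentials configured locally) and the hypothetical student Id x99999999.

```python
# upload_inputs.py -- sketch of steps e)-g) using boto3. This is NOT part
# of the tutorial (which uses the S3 web console); boto3 and working AWS
# credentials are assumed.

BUCKET = 'ie.ncirl.mscda'

def s3_key(student_id, subfolder, filename):
    # Build the object key, e.g. 'x99999999/charcount/input/Micromegas.txt'.
    return '%s/charcount/%s/%s' % (student_id, subfolder, filename)

if __name__ == '__main__':
    import boto3  # assumed installed: pip install boto3
    s3 = boto3.client('s3')
    student_id = 'x99999999'  # replace with your own student Id
    s3.upload_file('Micromegas.txt', BUCKET, s3_key(student_id, 'input', 'Micromegas.txt'))
    s3.upload_file('mapper.py', BUCKET, s3_key(student_id, 'code', 'mapper.py'))
    s3.upload_file('reducer.py', BUCKET, s3_key(student_id, 'code', 'reducer.py'))
```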
f) Steps: we will add a Streaming Program step. Proceed by clicking Configure and add. Ensure that the Action on failure is set to Terminate cluster.

   The step parameters are as follows (replace the x99999999 with your own student Id):

      Parameter Name        Parameter Value
      Mapper                s3://ie.ncirl.mscda/x99999999/charcount/code/mapper.py
      Reducer               s3://ie.ncirl.mscda/x99999999/charcount/code/reducer.py
      Input S3 location     s3://ie.ncirl.mscda/x99999999/charcount/input
      Output S3 location    s3://ie.ncirl.mscda/x99999999/charcount/output

   Ensure you specify that the cluster will terminate after it has finished processing. Proceed by clicking Add and then Create Cluster.

5) Results

The analysis will take a number of minutes to perform as the required resources are provisioned in EC2. The status of the cluster and the progress of the job can be monitored from the EMR console. The output from the analysis is stored in the s3://ie.ncirl.mscda/x99999999/charcount/output location in a set of files named part-00000, part-00001, part-00002, etc. Download these files and review the output.