Tutorial: Hadoop Streaming MapReduce with AWS EMR
1) Introduction
In this tutorial we will utilise the Hadoop Streaming functionality provided by AWS EMR
(Amazon Web Services Elastic MapReduce). Hadoop Streaming allows us to write our map
and reduce functions in Python (or any other programming language that can read data
from STDIN and write data to STDOUT).
Our goal will be to count the total number of occurrences of each character contained in a
sample piece of text. To perform this task we will create two Python programs, mapper.py
and reducer.py. The text we will analyse is Voltaire's Micromegas (free to download from
http://www.gutenberg.org/cache/epub/30123/pg30123.txt). The other AWS services we will
utilise are S3 and EC2.
S3 will be used to:
- store the input text that we will analyse
- store the mapper.py and reducer.py Python code files
- store the output from the analysis
- store log and error data produced during the analysis process
EC2 will be used to create the required compute infrastructure that will host the Hadoop
cluster environment and process the input. This infrastructure will be automatically created
for us as part of configuring and creating the EMR Hadoop cluster.
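The flow just described can be sketched as a small, self-contained simulation of the three
phases the streaming job performs: map, shuffle/sort, and reduce. This is illustrative only
(the function name is mine, not part of EMR); in the real job the mapper and reducer run as
separate processes connected by Hadoop's shuffle/sort.

```python
# A toy in-memory simulation of the streaming job's three phases.
# Illustrative only; in the real job mapper.py and reducer.py are
# separate processes connected by Hadoop's shuffle/sort.
def simulate_charcount(text):
    # Map phase: emit one ('LongValueSum:<ch>', 1) pair per character.
    pairs = [('LongValueSum:%s' % ch, 1)
             for line in text.splitlines() for ch in line]
    # Shuffle/sort phase: Hadoop sorts records by key between map and reduce.
    pairs.sort()
    # Reduce phase: sum the counts for each key.
    totals = {}
    for key, count in pairs:
        totals[key] = totals.get(key, 0) + count
    return totals


print(simulate_charcount('aab'))
# {'LongValueSum:a': 2, 'LongValueSum:b': 1}
```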
2) Create the mapper.py and reducer.py Python programs
a) Using IDLE or an editor of your choice create the following two Python programs
i) mapper.py
ii) reducer.py
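The original listing for mapper.py is not reproduced here, but a minimal sketch of what it
might look like is shown below. The map_line helper name is my own; the mapper reads text
from STDIN and emits one tab-separated (key, 1) record per character, with the key prefixed
by 'LongValueSum:' so the built-in aggregate reducer can sum the counts.

```python
#!/usr/bin/env python
# mapper.py - a sketch, not the original listing. Reads text from STDIN
# and emits one 'LongValueSum:<ch>\t1' record per character; the prefix
# lets the built-in aggregate reducer sum the counts.
import sys


def map_line(line):
    """Return one 'LongValueSum:<ch>\t1' record per character in line."""
    return ['LongValueSum:%s\t1' % ch for ch in line.rstrip('\n')]


if __name__ == '__main__':
    for line in sys.stdin:
        for record in map_line(line):
            print(record)
```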
b) Note: In order to use the built-in aggregate reducer, the keys output by the map function
must be prepended with the string 'LongValueSum:'.
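Likewise, a minimal sketch of what reducer.py might look like is shown below (the
reduce_sorted helper name is my own). It relies on the fact that Hadoop Streaming delivers
the mapper output to the reducer already sorted by key, so all records for a given
character arrive as one contiguous run.

```python
#!/usr/bin/env python
# reducer.py - a sketch, not the original listing. Sums the per-key
# counts arriving on STDIN; Hadoop Streaming delivers the records
# already sorted by key, so equal keys are adjacent.
import sys


def reduce_sorted(lines):
    """Yield '<key>\t<total>' for each run of equal keys in sorted input."""
    current_key, total = None, 0
    for line in lines:
        key, _, value = line.rstrip('\n').partition('\t')
        if key == current_key:
            total += int(value)
        else:
            if current_key is not None:
                yield '%s\t%d' % (current_key, total)
            current_key, total = key, int(value)
    if current_key is not None:
        yield '%s\t%d' % (current_key, total)


if __name__ == '__main__':
    for record in reduce_sorted(sys.stdin):
        print(record)
```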
3) Preparation
a) Sign-in to AWS (use Google Chrome for browser)
b) Please create a folder named after your student Id in the ie.ncirl.mscda bucket. E.g., if
your student Id is x99999999, then create a folder named x99999999 in the ie.ncirl.mscda
bucket. Note: Please use this folder for storing any S3 data resources you will use for this
or future tutorials/project work.
c) Create the following subfolder structure under the folder that has been created using
your student Id (use the Create Folder option):
charcount
charcount/input
charcount/code
Note: Do not create an output or a logs folder! These will be created automatically by
AWS EMR. The MapReduce task will fail to execute successfully if the output folder already
exists at this stage.
d) Download the text from http://www.gutenberg.org/cache/epub/30123/pg30123.txt and
store it locally on your PC as Micromegas.txt.
e) Upload the Micromegas.txt file to the
s3://ie.ncirl.mscda/<studentId>/charcount/input folder on S3,
where <studentId> should be replaced with the name of the
folder you created in step b).
f) Upload the mapper.py file to the
s3://ie.ncirl.mscda/<studentId>/charcount/code folder on S3,
where <studentId> should be replaced with the name of the
folder you created in step b).
g) Upload the reducer.py file to the
s3://ie.ncirl.mscda/<studentId>/charcount/code folder on S3,
where <studentId> should be replaced with the name of the
folder you created in step b).
4) Cluster Set-up and Job Execution
The next step is to create and configure an AWS EMR cluster.
a) Navigate to AWS Elastic MapReduce and click Create Cluster.
b) When creating your cluster please choose the EU (Ireland) region.
c) Cluster Configuration – provide a cluster name and specify a location for the log
files. Replace the x99999999 in this example with your own student Id.
d) Software Configuration – you can remove the Pig and Hive installation configuration
options.
e) Hardware Configuration/Security And Access/Bootstrap Actions – accept the defaults.
f) Steps – we will add a Streaming Program step. Proceed by clicking Configure and add.
Ensure that the Action on failure is set to Terminate cluster.
Note: The step parameters are as follows (replace the x99999999 with your own student
Id):
Parameter Name      Parameter Value
Mapper              s3://ie.ncirl.mscda/x99999999/charcount/code/mapper.py
Reducer             s3://ie.ncirl.mscda/x99999999/charcount/code/reducer.py
Input S3 location   s3://ie.ncirl.mscda/x99999999/charcount/input
Output S3 location  s3://ie.ncirl.mscda/x99999999/charcount/output
Ensure you specify that the cluster will terminate after it has finished processing:
Proceed by clicking Add and then Create Cluster
5) Results
The analysis will take several minutes to perform, as the required EC2 resources must
first be provisioned. The status of the cluster and the progress of each step can be
monitored from the EMR console.
The output from the analysis is stored in the
s3://ie.ncirl.mscda/x99999999/charcount/output
location in a set of files named part-00000, part-00001,
part-00002 etc.
Download these files and review the output.