Assignment title: Information
Programming for Data Analytics
Project
2017
DEADLINE: Monday 24th April 2017 @ 23:55
You are required to identify and carry out a series of analyses (i.e., at least two) of a large dataset (or a
collection of large datasets) utilising appropriate programming languages and programming
environments.
Your project must incorporate the following elements:
1. Utilisation of a MapReduce environment for some part of the analysis
2. Source dataset(s) should be stored in appropriate database(s) prior to processing by MapReduce
3. Post-MapReduce processing dataset(s) should be stored in appropriate database(s)
4. Programmatically accessing the MapReduce source data
5. Programmatically storing the MapReduce output data
6. Follow-up analysis on the MapReduce output data
For example, you may initially utilise MySQL to store a dataset and then your MapReduce processing
would utilise the MySQL database as an input source. After processing the data through MapReduce you
may then store the data in HBase or MongoDB. Following that you may use Python’s NumPy/Pandas/
Matplotlib to conduct further analysis of the MapReduce output data (e.g., statistical analysis), and
generate data visualisation plots for better presentation of results.
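The flow described above can be sketched in a few lines of Python. This is only an illustrative sketch, not a required or recommended design: it assumes an in-memory SQLite database as a stand-in for MySQL and for the output store, plain Python functions as a stand-in for a real MapReduce environment such as Hadoop, and a small made-up set of temperature readings. All table, function, and variable names are hypothetical.

```python
import sqlite3
import statistics
from collections import defaultdict

# Hypothetical sample records: (city, temperature reading).
ROWS = [("Dublin", 11.0), ("Cork", 12.5), ("Dublin", 9.0),
        ("Galway", 10.0), ("Cork", 13.5)]


def load_source(conn):
    """Store the source dataset in a database before MapReduce (element 2)."""
    conn.execute("CREATE TABLE readings (city TEXT, temp REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", ROWS)


def mapper(row):
    """Map phase: emit one (key, value) pair per input record."""
    city, temp = row
    yield city, temp


def shuffle(pairs):
    """Shuffle phase: group all values by key, as a framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Reduce phase: collapse the values for one key into their mean."""
    return key, sum(values) / len(values)


def run_pipeline():
    conn = sqlite3.connect(":memory:")
    load_source(conn)

    # Element 4: read the MapReduce input programmatically from the database.
    source = conn.execute("SELECT city, temp FROM readings").fetchall()
    pairs = [pair for row in source for pair in mapper(row)]
    results = [reducer(k, v) for k, v in sorted(shuffle(pairs).items())]

    # Elements 3 and 5: store the MapReduce output programmatically.
    conn.execute(
        "CREATE TABLE city_means (city TEXT PRIMARY KEY, mean_temp REAL)")
    conn.executemany("INSERT INTO city_means VALUES (?, ?)", results)

    # Element 6: follow-up analysis on the stored MapReduce output.
    means = [m for (m,) in conn.execute("SELECT mean_temp FROM city_means")]
    overall = statistics.mean(means)
    conn.close()
    return results, overall


if __name__ == "__main__":
    per_city, overall_mean = run_pipeline()
    print(per_city, overall_mean)
```

In an actual project the mapper and reducer would run inside the chosen MapReduce environment, and the follow-up step would typically use NumPy, Pandas, and Matplotlib rather than the standard library.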
PROJECT REPORT:
The results of your analysis should be included in a project report. The project report should discuss the
programming and data handling challenges that you encountered and the means and mechanisms you
implemented to overcome these challenges.
The report should be around 3000 words in length (excluding references) and must follow the IEEE format
(see the link below), with appropriate referencing and academic style.
http://www.ieee.org/conferences_events/conferences/publishing/templates.html
The report should provide the following:
1. A description of the underlying dataset(s)
2. A description of the objective of the analysis; the analysis should answer a novel question
3. A description of the data processing activities carried out
4. Algorithms to process the dataset in a MapReduce environment
5. Presentation of results by making appropriate use of figures, tables, etc.
6. Discussion of the rationale and justification for the choices you have made in terms of data
processing, programming language choice, and algorithms that you have implemented
SUBMISSION:
Your submission must include your project report document along with any programming code and
system configuration elements.
The final report must be submitted to Moodle (Turnitin) before the deadline. Late submissions will not
be accepted and will be treated as absent.
MARKING SCHEME:
The project carries 60% of the overall marks, distributed as follows:
Abstract: 5%
• Executive summary of the project objectives and achievements
• Note: Look at abstracts in your literature review to get an idea of what makes a good/bad abstract
Introduction: 10%
• Objective statement
• Motivation of the problem
• Relevance of chosen topic
• Elicitation of an appropriately formed research question
Related Work: 10%
• Summarise relevant (academic) works that addressed similar problems or guided your
decisions
• Critical evaluation (i.e. go beyond just a summary of the works)
Methodology: 40%
• Description of datasets and justification for choosing them
• Descriptions and justifications of the data processing activities carried out
• Justifications for the choice of technologies used (i.e., programming languages, databases,
etc.)
• Description and justifications of the implemented MapReduce algorithms
Results: 25%
• Presentation of results by making appropriate use of figures, tables, etc.
• Evidence of how the project objectives were met
Conclusions and Future Work: 10%
• Discussion of the implications and limitations of the research findings
• Future work directions to be explored