Assignment title: Information
Programming for Data Analytics
Project
Sem. 2, 2015-2016
DEADLINE: Wednesday 20th April 2016 @ 17:00
You are required to identify and carry out a series of analyses (i.e., at least two) of a large dataset (or a collection
of large datasets) utilising appropriate programming languages and programming environments.
Your project must incorporate the following elements:
1. Utilisation of a MapReduce environment for some part of the analysis
2. Source dataset(s) should be stored in appropriate database(s) prior to processing by MapReduce
3. Post-MapReduce processing dataset(s) should be stored in appropriate database(s)
4. Programmatically accessing the MapReduce source data
5. Programmatically storing the MapReduce output data
6. Follow-up analysis on the MapReduce output data
For example, you may initially utilise MySQL to store a dataset and then your MapReduce processing would
utilise the MySQL database as an input source. After processing the data through MapReduce you may then
store the data in HBase or MongoDB. Following that, you may use Python's NumPy, pandas, and Matplotlib to
conduct further analysis of the MapReduce output data (e.g., statistical analysis), and generate data
visualisation plots for better presentation of results.
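As a rough illustration only, the six required elements can be sketched end to end in a single Python script. This is not a prescribed solution. It makes several assumptions that are not part of the brief: the standard-library sqlite3 module stands in for MySQL and for the output store, the map, shuffle, and reduce steps run in-process rather than on a real MapReduce cluster, and the table names, function names, and temperature readings are all made-up toy values.

```python
import sqlite3
import statistics
from collections import defaultdict


def load_source(conn):
    """Element 2: store the source dataset in a database before processing."""
    conn.execute("CREATE TABLE readings (station TEXT, temperature REAL)")
    # Hypothetical toy data, invented purely for this sketch.
    rows = [("dublin", 11.0), ("cork", 13.0), ("dublin", 15.0),
            ("galway", 10.0), ("cork", 12.0), ("galway", 14.0)]
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    conn.commit()


def mapper(record):
    """Map step: emit one (key, value) pair per input record."""
    station, temperature = record
    yield station, temperature


def reducer(key, values):
    """Reduce step: collapse all values for one key into their mean."""
    values = list(values)
    return key, sum(values) / len(values)


def run_mapreduce(records):
    """Element 1: an in-process stand-in for a MapReduce job."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # shuffle: group values by key
    return [reducer(key, values) for key, values in sorted(groups.items())]


def run_pipeline():
    conn = sqlite3.connect(":memory:")
    load_source(conn)
    # Element 4: programmatically access the MapReduce source data.
    records = conn.execute(
        "SELECT station, temperature FROM readings").fetchall()
    output = run_mapreduce(records)
    # Elements 3 and 5: programmatically store the MapReduce output.
    conn.execute("CREATE TABLE station_means "
                 "(station TEXT PRIMARY KEY, mean_temperature REAL)")
    conn.executemany("INSERT INTO station_means VALUES (?, ?)", output)
    conn.commit()
    # Element 6: follow-up analysis on the stored MapReduce output.
    means = [row[0] for row in conn.execute(
        "SELECT mean_temperature FROM station_means ORDER BY station")]
    conn.close()
    return output, statistics.mean(means), statistics.stdev(means)


if __name__ == "__main__":
    output, overall_mean, spread = run_pipeline()
    print(output)
    print(overall_mean, spread)
```

In an actual project the in-process run_mapreduce function would be replaced by a job submitted to a real MapReduce environment, and the follow-up step would more likely use NumPy, pandas, and Matplotlib than the statistics module.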
PROJECT REPORT:
The results of your analysis should be included in a project report. The project report should discuss the
programming and data handling challenges that you encountered and the means and mechanisms you
implemented to overcome these challenges.
The report should be around 3000 words in length (excluding references) and should follow the IEEE format (see
footnote 1), with appropriate referencing and academic style.
The report should provide the following:
1. A description of the underlying dataset(s)
2. A description of the objective of the analysis; the analysis should answer a novel question
3. A description of the data processing activities carried out
4. Algorithms to process the dataset in a MapReduce environment
5. Presentation of results by making appropriate use of figures, tables, etc.
6. Discussion of the rationale and justification for the choices you have made in terms of data processing,
programming language choice, and algorithms that you have implemented
SUBMISSION:
Your submission must include your project report document along with any programming code and system
configuration elements.
The final report must be submitted to Moodle (Turnitin) before the deadline. Late submissions will not be
accepted and will be treated as absent.
1 http://www.ieee.org/conferences_events/conferences/publishing/templates.html
MARKING SCHEME:
The project carries 60% of the overall marks, distributed as follows:
Abstract: 5%
Executive summary of the project objectives and achievements
Note: Look at abstracts in your literature review to get an idea of what makes a good or bad abstract
Introduction: 10%
Objective statement
Motivation of the problem
Relevance of chosen topic
Elicitation of an appropriately formed research question
Related Work: 10%
Summarise relevant (academic) works that addressed similar problems or guided
your decisions
Critical evaluation (i.e. go beyond just a summary of the works)
Methodology: 40%
Description of datasets and justification of choosing them
Descriptions and justifications of the data processing activities carried out
Justifications for the choice of technologies used (i.e., programming languages, databases, etc.)
Description and justifications of the implemented MapReduce algorithms
Results: 25%
Presentation of results by making appropriate use of figures, tables, etc.
Evidence of how the project objectives were met
Conclusions and Future Work: 10%
Discussion of the implications and limitations of the research findings
Future work directions to be explored