Assignment title: Information
Programming for Data Analytics
Repeat Project
Sem. 1, 2016-2017
DEADLINE: 1st December @ 23:55
You are required to identify and carry out a series of analyses (i.e., at least two) of a large dataset (or a collection of large datasets) utilising appropriate programming languages and programming environments. Your project must incorporate the following elements:
1. Demonstration of a suitable review of the literature covering both the domain and the techniques
2. Utilisation of a MapReduce environment for some part of the analysis
3. Source dataset(s) should be stored in appropriate database(s) prior to processing by MapReduce
4. Post-MapReduce processing dataset(s) should be stored in appropriate database(s)
5. Programmatically accessing the MapReduce source data
6. Programmatically storing the MapReduce output data
7. Follow-up analysis on the MapReduce output data
For example, you may initially utilise MySQL to store a dataset, and your MapReduce processing would then utilise the MySQL database as an input source. After processing the data through MapReduce, you may store the output in HBase or MongoDB. Following that, you may use Python's NumPy, pandas, and Matplotlib to conduct further analysis of the MapReduce output data (e.g., statistical analysis) and to generate data visualisation plots for better presentation of results.
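The example workflow above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not a required design: the standard-library sqlite3 module stands in for both the MySQL source and the HBase/MongoDB sink so that the sketch runs without external services, the map, shuffle, and reduce steps run in-process rather than on a Hadoop cluster, and the table names, column names, and sample readings are all hypothetical.

```python
import sqlite3
import statistics
from collections import defaultdict

# Assumption: sqlite3 stands in for the MySQL source and the HBase/MongoDB
# sink; the tables, columns, and sample rows below are hypothetical.


def load_source(conn):
    """Store the (hypothetical) source dataset in a database table."""
    conn.execute("CREATE TABLE readings (city TEXT, temp REAL)")
    rows = [
        ("Dublin", 11.0),
        ("Cork", 12.5),
        ("Dublin", 13.0),
        ("Cork", 11.5),
        ("Galway", 10.0),
    ]
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    conn.commit()


def mapper(record):
    """Map step: emit one (key, value) pair per source record."""
    city, temp = record
    yield city, temp


def shuffle(pairs):
    """Group values by key, as a MapReduce framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Reduce step: compute the mean value for one key."""
    return key, sum(values) / len(values)


def run_pipeline():
    source = sqlite3.connect(":memory:")
    sink = sqlite3.connect(":memory:")
    load_source(source)

    # Programmatically access the MapReduce source data.
    records = source.execute("SELECT city, temp FROM readings")
    pairs = (pair for record in records for pair in mapper(record))
    results = [reducer(k, v) for k, v in sorted(shuffle(pairs).items())]

    # Programmatically store the MapReduce output data.
    sink.execute("CREATE TABLE city_means (city TEXT PRIMARY KEY, mean_temp REAL)")
    sink.executemany("INSERT INTO city_means VALUES (?, ?)", results)
    sink.commit()

    # Follow-up analysis on the stored MapReduce output.
    query = "SELECT mean_temp FROM city_means ORDER BY city"
    means = [row[0] for row in sink.execute(query)]
    return results, statistics.mean(means), statistics.pstdev(means)


if __name__ == "__main__":
    per_city, overall_mean, spread = run_pipeline()
    print(per_city, overall_mean, spread)
```

In an actual submission, the in-process map and reduce functions would be replaced by jobs running in a MapReduce environment, and the follow-up step would typically use NumPy, pandas, and Matplotlib rather than the statistics module.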
PROJECT REPORT:
The results of your analysis should be included in a project report. The project report should discuss the programming and data handling challenges that you encountered and the means and mechanisms you implemented to overcome these challenges.
The report should be around 5000 words in length (excluding references), should follow the IEEE format [1], and should use appropriate referencing and academic style. The report should provide the following:
1. A discussion of the core literature of the domain and the techniques applied, demonstrating a clear understanding of the significance of the research topic
2. A description of the underlying dataset(s)
3. A description of the objective of the analysis; the analysis should answer a novel question
4. A description of the data processing activities carried out
5. Algorithms to process the dataset in a MapReduce environment
6. Presentation of results by making appropriate use of figures, tables, etc.
7. Discussion of the rationale and justification for the choices you have made in terms of data processing, programming language choice, and algorithms that you have implemented
SUBMISSION:
Your submission must include your project report document along with any programming code and system configuration elements.
[1] http://www.ieee.org/conferences_events/conferences/publishing/templates.html
MARKING SCHEME:
This project covers all learning outcomes of a repeat assessment
• Abstract: 5%
o Executive summary of the project objectives and achievements
o Note: Look at abstracts in your lit review to get an idea of what makes a good/bad abstract
• Introduction: 10%
o Objective statement
o Motivation of the problem
o Relevance of chosen topic
o Elicitation of an appropriately formed research question
• Literature Review and Related Work: 25%
o Summarise relevant (academic) works that addressed similar problems or guided your decisions
o Critical evaluation (i.e. go beyond just a summary of the works)
• Methodology: 30%
o Description of the datasets and justification for choosing them
o Descriptions and justifications of the data processing activities carried out
o Justifications for the choice of technologies used (i.e., programming languages, databases, etc.)
o Description and justifications of the implemented MapReduce algorithms
• Results: 20%
o Presentation of results by making appropriate use of figures, tables, etc.
o Evidence of how the project objectives were met
• Conclusions and Future Work: 10%
o Discussion of the implications and limitations of the research findings
o Future work directions to be explored