Assignment title: Information


Programming for Data Analytics Project 2017

DEADLINE: Monday 24th April 2017 @ 23:55

You are required to identify and carry out a series of analyses (i.e., at least two) of a large dataset (or a collection of large datasets) utilising appropriate programming languages and programming environments. Your project must incorporate the following elements:

1. Utilisation of a MapReduce environment for some part of the analysis
2. Source dataset(s) should be stored in appropriate database(s) prior to processing by MapReduce
3. Post-MapReduce processing dataset(s) should be stored in appropriate database(s)
4. Programmatically accessing the MapReduce source data
5. Programmatically storing the MapReduce output data
6. Follow-up analysis on the MapReduce output data

For example, you may initially utilise MySQL to store a dataset, and your MapReduce processing would then utilise the MySQL database as an input source. After processing the data through MapReduce, you may store the output in HBase or MongoDB. Following that, you may use Python’s NumPy/Pandas/Matplotlib to conduct further analysis of the MapReduce output data (e.g., statistical analysis) and to generate data visualisation plots for better presentation of results.

PROJECT REPORT: The results of your analysis should be included in a project report. The project report should discuss the programming and data handling challenges that you encountered and the means and mechanisms you implemented to overcome these challenges. The report should be around 3000 words in length (excluding references), must follow the IEEE format (see the template link below), and must use appropriate referencing and academic style.

http://www.ieee.org/conferences_events/conferences/publishing/templates.html

The report should provide the following:

1. A description of the underlying dataset(s)
2. A description of the objective of the analysis; the analysis should answer a novel question
3. A description of the data processing activities carried out
4. Algorithms to process the dataset in a MapReduce environment
5. Presentation of results by making appropriate use of figures, tables, etc.
6. Discussion of the rationale and justification for the choices you have made in terms of data processing, programming language choice, and algorithms that you have implemented

SUBMISSION: Your submission must include your project report document along with any programming code and system configuration elements. The final report must be submitted to Moodle (Turnitin) before the deadline. Late submissions will not be accepted and will be treated as absent.

MARKING SCHEME: The project carries 60% of the overall marks, distributed as follows:

Abstract: 5%
• Executive summary of the project objectives and achievements
• Note: Look at abstracts in your lit review to get an idea of what makes a good/bad abstract

Introduction: 10%
• Objective statement
• Motivation of the problem
• Relevance of chosen topic
• Elicitation of an appropriately formed research question

Related Work: 10%
• Summarise relevant (academic) works that addressed similar problems or guided your decisions
• Critical evaluation (i.e., go beyond just a summary of the works)

Methodology: 40%
• Description of datasets and justification for choosing them
• Descriptions and justifications of the data processing activities carried out
• Justifications for the choice of technologies used (i.e., programming languages, databases, etc.)
• Description and justifications of the implemented MapReduce algorithms

Results: 25%
• Presentation of results by making appropriate use of figures, tables, etc.
• Evidence of how the project objectives were met

Conclusions and Future Work: 10%
• Discussion of the implications and limitations of the research findings
• Future work directions to be explored
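
ILLUSTRATIVE SKETCH: As a rough illustration of the example pipeline described under the required project elements (source database, then MapReduce, then output database, then follow-up analysis), the Python sketch below walks through elements 4, 5 and 6 in miniature. It is not a model solution and rests on several assumptions: two in-memory sqlite3 databases stand in for the source store (e.g., MySQL) and the output store (e.g., HBase or MongoDB); a plain in-process map/shuffle/reduce stands in for a real MapReduce environment, which your project must still use; and the table names, schema, sample rows and function names are all hypothetical.

```python
import sqlite3
import statistics
from collections import defaultdict


def load_source(conn):
    """Create and fill a small source table (hypothetical schema and rows)."""
    conn.execute("CREATE TABLE readings (station TEXT, temp_c REAL)")
    rows = [("A", 10.0), ("A", 14.0), ("B", 20.0), ("B", 22.0), ("B", 24.0)]
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    conn.commit()


def map_phase(records):
    """Map: emit one (key, value) pair per input record."""
    for station, temp_c in records:
        yield station, temp_c


def shuffle(pairs):
    """Shuffle: group all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's values to (key, count, mean)."""
    for key, values in sorted(groups.items()):
        yield key, len(values), sum(values) / len(values)


def run_pipeline():
    source = sqlite3.connect(":memory:")  # assumption: stand-in for the source database
    output = sqlite3.connect(":memory:")  # assumption: stand-in for the output database
    load_source(source)

    # Element 4: read the MapReduce input programmatically.
    records = source.execute("SELECT station, temp_c FROM readings")
    results = list(reduce_phase(shuffle(map_phase(records))))

    # Element 5: store the MapReduce output programmatically.
    output.execute(
        "CREATE TABLE station_stats (station TEXT, n INTEGER, mean_c REAL)"
    )
    output.executemany("INSERT INTO station_stats VALUES (?, ?, ?)", results)
    output.commit()

    # Element 6: follow-up analysis on the stored output.
    means = [
        row[0]
        for row in output.execute(
            "SELECT mean_c FROM station_stats ORDER BY station"
        )
    ]
    summary = {"stations": len(means), "mean_of_means": statistics.mean(means)}

    source.close()
    output.close()
    return results, summary
```

Calling run_pipeline() returns the reduced rows together with a small summary dictionary. In an actual submission, the map and reduce functions would run inside your chosen MapReduce environment, and the follow-up step would more likely use NumPy/Pandas/Matplotlib than the standard-library statistics module used here to keep the sketch dependency-free.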