CP3403/CP5634 Assignment Due: 25th May 2017 5pm.

Assignment - Data Mining Practice and Analysis

Due date: Thursday, 25th May 2017 5pm

Aims
• Become familiar with some well-known data mining techniques in order to understand their working principles;
• Apply data mining techniques to domain-specific datasets;
• Review cutting-edge data mining techniques to gain a good overview of current data mining technology.

Requirements (Tasks)

This assignment consists of the following procedural steps.

Step 1 : Set up a scenario (either by imagining a realistic business situation or by using an actual analysis problem) in which you are given a domain-specific dataset and asked to analyze it. The purpose of the analysis might be to understand (get an overview of, or learn about) the given data, or to solve a specific analytical problem, depending on the scenario you made up.

Step 2 : Find and obtain your own domain-specific dataset that fits the scenario you made up. The dataset can be unique or publicly available. Some public datasets are available from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/). Also refer to the Resources folder of our LearnJCU subject site for more sources.

Step 3 : Choose appropriate data mining techniques (algorithms). See more details for each option in Step 4 below.

** Note: The order of the above three steps can be changed. For example, you may find an interesting dataset first and then set up a specific data mining scenario that fits the analysis of the chosen dataset. **

Step 4 : You can select either of two options for this assignment.

• Option (1) - Programming-intensive Assignment
- Once you have your own domain-specific dataset and have chosen a data mining algorithm, you need to design and implement the chosen algorithm in your preferred programming language. This assignment can be done either as a group (two students at maximum) or as an individual.
If you work as a group, group members must contribute equally to the group work. Also, all group members must participate in the presentation.
- A series of preprocessing steps will be required at this stage. The preprocessing procedure should be designed carefully (consider what kind of processing is required, how, and why) to make your data ready to be fed to your program. Some parts of this preprocessing procedure can be included in your program as part of a “pre-data-mining module”.
- Your final program must be a stand-alone data mining tool designed for your own data analysis purpose. Your program is expected to include the following modules (and may include more sub-modules if needed):
1) pre-data-mining module - designed for the necessary preprocessing and for getting the data ready to be fed to the next module (the data-mining module). You don’t need to include all required preprocessing in this module. It is assumed that some initial preprocessing (e.g. cleaning noisy data) can be done externally using other software tools (e.g. Excel or Weka).
2) data-mining module - implements the chosen data mining algorithm. You can directly borrow the algorithm from a popular existing data mining method, or you can design your own algorithm (by amending an existing one).
3) post-mining module - presents/reports the output produced by the previous modules. The result can be a simple text report or, additionally, a non-text visualization (e.g. graph, chart or diagram).
- This programming-intensive assignment still requires an analysis. Try to find all the patterns you can detect with your implemented algorithm. Compare and contrast the results obtained with your chosen preprocessing scheme and algorithm against those obtained with other existing algorithms or other preprocessing methods.
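The three-module structure listed above could be organised roughly as in the sketch below. It is only an illustration under stated assumptions, not a required design: k-means is used merely as an example algorithm, the function names (`pre_mining`, `k_means`, `post_mining`, `run_pipeline`) are hypothetical, and `TOY_ROWS` is made-up data standing in for your own dataset.

```python
import random
from collections import Counter


def pre_mining(rows):
    """Pre-data-mining module: drop incomplete rows, then min-max scale each column."""
    complete = [r for r in rows if all(v is not None for v in r)]
    columns = list(zip(*complete))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in complete
    ]


def _sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=20, seed=0):
    """Data-mining module: plain k-means (Lloyd's algorithm), used here only as an example."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: _sq_dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def post_mining(labels):
    """Post-mining module: simple text report of cluster sizes."""
    sizes = Counter(labels)
    return "\n".join(f"cluster {c}: {n} records" for c, n in sorted(sizes.items()))


def run_pipeline(rows, k=2):
    """Chain the three modules: preprocess, mine, report."""
    data = pre_mining(rows)
    labels, _ = k_means(data, k)
    return labels, post_mining(labels)


# Hypothetical toy data: two obvious groups plus one incomplete record.
TOY_ROWS = [
    (1.0, 2.0), (1.2, 1.9), (0.9, 2.1),
    (None, 5.0),
    (8.0, 9.0), (8.2, 9.1), (7.9, 8.8),
]

if __name__ == "__main__":
    _, report = run_pipeline(TOY_ROWS, k=2)
    print(report)
```

Keeping each module behind its own function makes it straightforward to swap in a different preprocessing scheme or algorithm when running comparison experiments.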
Note: in particular, to compare the result of your program with that of another existing algorithm, you can use existing data mining tools (e.g. Weka) to obtain the result of the other algorithm.

• Option (2) - Analysis-intensive Assignment
- Once you have chosen your own domain-specific dataset, you need to design your own data mining analysis scheme. This analysis scheme can consist of multiple procedural steps:
1) Set up a strategy for preprocessing your data. A series of preprocessing steps will be required and needs to be designed carefully (consider what kind of processing is required, how, and why). You may include multiple different preprocessing schemes for the comparison analysis.
2) Set up a strategy for data mining. You need to select one data mining area (clustering, classification, or association rule mining) of your choice and select AT LEAST ONE existing data mining algorithm in your chosen area. For example, if you chose clustering as your data mining area, you can apply two algorithms, DBSCAN and K-means, and compare the two results. Alternatively, you can design a combined algorithm that applies multiple algorithms from the same or different data mining areas in series. Your strategy can also apply different parameters to one algorithm. Another possible strategy is to apply multiple preprocessing (attribute selection) schemes to one algorithm.
- You can choose one data mining tool (e.g. Weka) to analyze your chosen dataset. Apply the data mining strategy you set up to your chosen (preprocessed) data using the data mining tool and try to find all the patterns you can detect.
- Do various comparison experiments, either by applying different data mining algorithms (or strategies) to the same chosen dataset or by applying the same algorithm to differently preprocessed datasets.
- Critically analyze the experimental results and discuss/demonstrate why a chosen algorithm (strategy) is superior/inferior to another algorithm (strategy).

Step 5 :
- You need to give a short presentation (15~20 minutes) based on your chosen algorithm (strategy) and experimental tests, and you also need to write a scientific paper as an experimental report.
• The presentation must include a good overview of your project, aims/objectives, reasons for your choices, a brief overview of the strategy/algorithm you chose, findings, a comparison including experimental results, and a conclusion. (We will not run in-class presentations; you need to record your presentation and upload the video clip somewhere the marker can view it, e.g. YouTube, GoogleDrive etc.)
• You need to write a research report of a minimum of 10~15 pages in length on your project to summarise your algorithm and experimental results. The report should cover all the topics listed above for the presentation, but in more detail. CP5634 students need to add one additional section to the report for a brief (mini) literature review of the data mining methods (strategy, algorithm and/or preprocessing methods) chosen for the project. Please refer to the following link if you need further guidance on a “literature review”: http://www-public.jcu.edu.au/libcomp/assist/training/JCUPRD_026326
- The research paper must follow the generally accepted format of a research article, consisting of: introduction; related work (a brief review of the methodologies, i.e. the algorithm/strategy used); a summarized description of your experimental settings and procedures (description of data, justification of the chosen data mining area, justification of the chosen algorithm, preprocessing details, etc.); comparison; discussion; issues; conclusion; possible future work; and a list of references.
(You may add more sections if needed.)
- In addition to the general components listed above, the report for the “Programming-intensive option” should include a summary of your program (program structure, implementation details, a summarized algorithm for the main modules etc., including code if necessary).
- For the “Analysis-intensive option”, a more in-depth analysis of the investigation and experimental comparison made through the project is required.

Submission
• Report submission due: 25th May 2017 5pm
• You need to submit your final report as a single document file (MS Word or PDF format) to LearnJCU. Your report should include the link to access your presentation.
• For the “Programming-intensive option”, you need to submit the source code and executable file of your program along with your report. Please make a zip file including all necessary files (report document and program files).

Useful links
• http://www.kdnuggets.com/
• http://www.cs.waikato.ac.nz/ml/weka/
• http://mlearn.ics.uci.edu/MLRepository.html
• http://kdd.ics.uci.edu/
• http://www.sigkdd.org/
Writing Skills: http://www-public.jcu.edu.au/learningskills/resources/wsonline/
Scientific Report Writing:
http://unilearning.uow.edu.au/report/2b.html
http://writing.wisc.edu/Handbook/ScienceReport.html
and more on the Web - please search!

Marking Criteria

REQUIREMENT (TASK) | CRITERIA | MARKS | COMMENTS
Data Mining Practice (10%) | Dataset choice - appropriateness and complexity | _/2.5 |
Data Mining Practice (10%) | Relevance of chosen data mining area and algorithm (structured experimental setting) | _/2.5 |
Data Mining Practice (10%) | Preprocessing - appropriateness and complexity | _/2.5 |
Data Mining Practice (10%) | Scientific/Technical quality (novelty & innovation): [Programming-intensive option] program design/implementation/completeness; [Analysis-intensive option] comparative result analysis | _/2.5 |
Written Assignment (20%) | Readability & (written) presentation | ___/5 |
Written Assignment (20%) | Analysis of results and conclusion | ___/5 |
Written Assignment (20%) | Structure & Organisation | ___/5 |
Written Assignment (20%) | [Programming-intensive option only] Scientific & Technical quality (description of design, structure, components, algorithms etc.) | ___/5 |
Written Assignment (20%) | [Analysis-intensive option only] Scientific & Technical quality (summary of comparison analysis) | ___/5 |
Presentation (Video-Clip) (CP3403, CP5634 - 10%) | Visual Aids | _/2 |
Presentation (Video-Clip) (CP3403, CP5634 - 10%) | Information Communication | _/2 |
Presentation (Video-Clip) (CP3403, CP5634 - 10%) | Presentation Gestures | _/2 |
Presentation (Video-Clip) (CP3403, CP5634 - 10%) | Length of Presentation & Team work | _/2 |
Presentation (Video-Clip) (CP3403, CP5634 - 10%) | Delivery | _/2 |
Total Marks | | __/40 |

Data Mining Practice (10%)

Data mining practice (10 marks)
• Dataset complexity (2.5 marks)
• Relevance of chosen data mining suites and algorithms to the chosen data (2.5 marks)
• Preprocessing (2.5 marks)
• Innovation and novelty (2.5 marks)

Criterion | Exceeds Standards (5) | Meets Standards (3) | Does not Meet Standards (1) | No Evidence (0)
Data complexity | Exhibiting excellent level of complexities including dimensionality, dirtiness, heterogeneity and complications | Exhibiting good level of complexities including dimensionality, dirtiness, heterogeneity and complications | Exhibiting low level of complexities including dimensionality, dirtiness, heterogeneity and complications | Irrelevant to the study
Relevance | Excellent choice of data mining suites and algorithms for the chosen dataset | Good choice of data mining suites and algorithms for the chosen dataset | Almost no relevance of the choice of data mining suites and algorithms for the chosen dataset | Irrelevant to the study
Preprocessing | Excellent preprocessing conducting most required cleaning for the main data mining practices | Good preprocessing conducting some required cleaning for the main data mining practices | Almost no preprocessing conducted | Irrelevant to the study
Innovation and novelty | Excellent finding of most hidden (previously unknown) patterns | Good finding of some hidden (previously unknown) patterns | Almost no finding of hidden (previously unknown) patterns | Irrelevant to the study

Written Report (20%)

Criterion | Exceeds Standards (5) | Meets Standards (3) | Does not Meet Standards (1) | No Evidence (0)
Readability and presentation | Excellent flow and presentation covering all required sections | Good flow and presentation covering all or most required sections | Flow and presentation covering almost no required sections | Irrelevant to the study
Scientific & technical quality | Excellent mining with excellent contributions | Good mining with some contributions | Almost no mining and contribution | Irrelevant to the study
Analysis of results and conclusions | Critically analyse results with excellent explanation | Critically analyse results with good explanation | Almost no critical analysis | Incomplete and/or unfocused
Structure & organisation | Excellent organisation of the paper covering all required sections | Good organisation of the paper covering almost all required sections | Organisation of the paper covering almost no required sections | Irrelevant to the study

Presentation (10%)

Criterion | Exceeds Standards (5) | Meets Standards (3) | Does not Meet Standards (1) | No Evidence (0)
Visual aids | Excellent use of graphs and figures; excellent support of visual components to support the flow of the talk | Good use of graphs and figures; good support of visual components to support the flow of the talk | Almost no use of graphs and figures; little support of visual components to support the flow of the talk | Irrelevant to the study
Information communication | Excellent communication of information; most information is accurate; fluency of presentation | Good communication of information; some information is accurate | Almost no communication of information; little information is accurate | Irrelevant to the study
Presentation gestures | Excellent eye contact and animated gestures | Good eye contact and animated gestures | Almost no eye contact and animated gestures | Irrelevant to the study
Length of presentation & Team work | Finishes within 1 minute (before or after) of the given time limit | Finishes within 2 minutes (before or after) of the given time limit | Finishes within 3 minutes (before or after) of the given time limit | Irrelevant to the study
Delivery | Smooth delivery | Delivery with minor issues | Delivery with major issues | Irrelevant to the study