CP3403/CP5634 Assignment
Due: 25th May 2017 5pm.
Assignment - Data Mining Practice and Analysis
Due date: Thursday, 25th May 2017 5pm
Aims
• Become familiar with some well-known data mining techniques in order to understand
their working principles;
• Apply data mining techniques to domain-specific datasets;
• Review cutting-edge data mining techniques to gain a good overview of current data
mining technology.
Requirements (Tasks)
The whole task of this assignment consists of the following procedural steps.
Step 1 :
Set up a scenario (either an imagined but realistic business situation or an actual
analysis problem) in which you are given a domain-specific dataset and asked to
analyze it. The purpose of the analysis might be to understand (get an overview of
or learn about) the given data, or to solve a specific analytical problem, depending
on the scenario you made up.
Step 2 :
Find and obtain your own domain-specific dataset that fits the scenario you made up.
The dataset could be unique or publicly available. Some public datasets are available
from the UCI machine learning repository (http://archive.ics.uci.edu/ml/). Also refer
to the Resources folder of our LearnJCU subject site for more sources.
Step 3 :
Choose appropriate data mining techniques (algorithms) – see more details for each
option in Step 4 below.
** Note: The order of the above three steps can be changed. For
example, you may find an interesting dataset first and then set up a specific data-mining
scenario which fits the analysis of the chosen dataset. **
Step 4 :
You can select either of two options for this assignment.
• Option (1) – Programming-intensive Assignment
- Once you have your own domain-specific dataset and have chosen a data mining
algorithm, you need to design and implement the chosen algorithm in
your preferred programming language.
This assignment can be worked either as a group (two students at maximum)
or as an individual. If you work as a group, then group members must equally
contribute to the group work. Also, all group members must participate in the
presentation.
- A series of preprocessing steps will be required at this stage. The preprocessing
procedure should be designed carefully (consider what kind of processing
will be required, how, and why) to make your data ready to be fed to your
program. Some parts of this preprocessing procedure can be included in your
program as part of the “pre-data-mining module”.
- Your final program must be a stand-alone data-mining tool designed for
your own purpose of data analysis. It is expected that your program will
include the following modules (and may include more sub-modules if
needed):
1) pre-data-mining module – designed for the necessary preprocessing and for
getting the data ready to be fed to the next module (the data-mining
module). You don’t need to include all the required preprocessing in this
module. It is assumed that some initial preprocessing (e.g. cleaning
noisy data) can be done externally using other software tools (e.g. Excel
or Weka).
2) data-mining module – implements the chosen data mining algorithm.
You can directly borrow the algorithm from a popular existing data
mining method, or you can design your own algorithm (by amending an
existing one).
3) post-mining module – presents/reports the output
produced by the previous modules. The result can be presented as a
simple text report, or additionally in a non-text visual form (e.g.
graph, chart or diagram).
- This programming-intensive assignment still requires an analysis. Try to find
all the patterns you can detect with your implemented algorithm. Compare and
contrast the results obtained using your chosen preprocessing scheme and
algorithm with those obtained using other existing algorithms or other
preprocessing methods.
Note: to compare the results of your program with those of
other existing algorithms, you can use existing data mining tools
(e.g. Weka) to obtain the results for the other algorithms.
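The three modules described for Option (1) could be organised roughly as follows. This is only a minimal sketch in Python: the function names (`preprocess`, `k_means`, `report`), the choice of k-means as the data mining algorithm, the min-max scaling step and the toy rows are all illustrative assumptions, not requirements of this assignment.

```python
import math


def preprocess(rows):
    """Pre-data-mining module (assumed design): drop rows with missing
    values, then min-max scale every column to the range [0, 1]."""
    complete = [r for r in rows if all(v is not None for v in r)]
    columns = list(zip(*complete))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(r, lows, highs))
        for r in complete
    ]


def k_means(points, k, iterations=20):
    """Data-mining module: plain k-means, seeded with the first k points
    (a simplification chosen to keep the sketch deterministic)."""
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return labels, centroids


def report(labels, centroids):
    """Post-mining module: a simple text report, one line per cluster."""
    lines = []
    for j, centre in enumerate(centroids):
        coords = ", ".join(f"{v:.2f}" for v in centre)
        lines.append(f"cluster {j}: {labels.count(j)} rows, centroid ({coords})")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical toy rows; None marks a missing value.
    raw = [(1.0, 2.0), (1.5, 1.8), (None, 3.0),
           (8.0, 8.0), (9.0, 9.5), (1.2, 2.2)]
    data = preprocess(raw)
    labels, centroids = k_means(data, k=2)
    print(report(labels, centroids))
```

Keeping each module as a separate function makes it easier to swap in a different preprocessing scheme or algorithm later for the comparison analysis.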
• Option (2) – Analysis-intensive Assignment
- Once you have chosen your own domain-specific dataset, you need to design
your own data-mining analysis scheme. This analysis scheme can consist of
multiple procedural steps:
1) Set up a strategy for preprocessing your data.
A series of preprocessing steps will be required and needs to be designed
carefully (consider what kind of processing will be required, how, and
why). You may include multiple different preprocessing schemes for the
comparison analysis.
2) Set up a strategy for data mining.
You need to select one data mining area (clustering, classification or
association rule mining) of your choice and select AT LEAST ONE
existing data mining algorithm in your chosen data mining area. For
example, if you chose clustering as your data mining area, you can apply
two algorithms, DBSCAN and k-means, and compare the two results.
Alternatively, you can design a combined algorithm which applies multiple
algorithms from the same or different data mining areas in series. Your
strategy can also be designed to apply different parameters to one
algorithm. Another strategy is to apply multiple
preprocessing (attribute selection) schemes to one algorithm.
- You can choose one data mining tool (e.g. Weka) to analyze your chosen
dataset. Apply the data-mining strategy you set up to your chosen (preprocessed)
data using the data mining tool and try to find all the patterns you
can detect.
- Run various comparison experiments, either by applying different data mining
algorithms (or strategies) to the same dataset or by applying the same
algorithm to differently preprocessed datasets.
- Critically analyze the experimental results and discuss/demonstrate why a chosen
algorithm (strategy) is superior or inferior to other algorithms (strategies).
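The comparison experiments described for Option (2) would normally be run in a tool such as Weka, but the idea can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: it uses a k-nearest-neighbour classifier with leave-one-out evaluation, compares raw against min-max scaled attributes, and varies the parameter k. The function names and the toy data are hypothetical, and the data were deliberately built so that the second attribute has a much larger scale than the first.

```python
import math
from collections import Counter


def min_max_scale(rows):
    """Preprocessing scheme: rescale every attribute to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(r, lows, highs))
        for r in rows
    ]


def knn_predict(train_x, train_y, query, k):
    """Predict the majority class among the k nearest training rows."""
    nearest = sorted(range(len(train_x)),
                     key=lambda i: math.dist(train_x[i], query))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(xs, ys, k):
    """Hold out each row in turn, train on the rest, and score the hits."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if knn_predict(train_x, train_y, xs[i], k) == ys[i]:
            correct += 1
    return correct / len(xs)


def run_experiments(xs, ys, ks=(1, 3)):
    """Same algorithm, two preprocessing schemes, several parameter values."""
    variants = {"raw": xs, "min-max": min_max_scale(xs)}
    return {(name, k): leave_one_out_accuracy(data, ys, k)
            for name, data in variants.items()
            for k in ks}


if __name__ == "__main__":
    # Hypothetical toy data: the first attribute separates the classes,
    # the second is large-scale noise.
    xs = [(1.0, 100), (1.1, 900), (1.2, 500),
          (2.0, 120), (2.1, 880), (2.2, 520)]
    ys = ["a", "a", "a", "b", "b", "b"]
    for (name, k), accuracy in run_experiments(xs, ys).items():
        print(f"{name:8s} k={k}: accuracy {accuracy:.2f}")
```

Tabulating results per (preprocessing scheme, parameter) pair in this way gives a direct basis for discussing why one strategy is superior or inferior to another.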
Step 5 :
- You need to give a short presentation (15 to 20 minutes) based
on your chosen algorithm (strategy) and experimental tests, and you also need
to write a scientific paper as an experimental report.
• The presentation must include a good overview of your
project, aims/objectives, reasons for your choices, a brief overview of the
strategy/algorithm you chose, findings, comparisons including
experimental results, and a conclusion. (We will not run in-class
presentations; you need to record your presentation and upload the
video clip somewhere the marker can view it, e.g. YouTube,
Google Drive etc.)
• You need to write a research report paper of a minimum of 10 to 15 pages in
length on your project to summarise your algorithm and experimental
results. The report should cover all the topics listed above for the
presentation, but in more detail. CP5634 students need to
add one additional section to the report containing a brief (mini) literature
review of the data mining methods (strategy, algorithm and/or
preprocessing methods) chosen for the project. Please refer to the
following link for further guidance on writing a “literature review”:
http://www-public.jcu.edu.au/libcomp/assist/training/JCUPRD_026326
- The research paper must follow the generally accepted format of a research
article, consisting of an introduction, related work (a brief review of the
methodologies, i.e. the algorithm/strategy used), a summarized description of your experimental
settings and procedures (description of data, justification of chosen data
mining area, justification of chosen algorithm, preprocessing details, etc.),
comparison, discussion, issues, conclusion, possible future work and a list of
references. (You may add more sections if needed.)
- In addition to the general components listed above, the report for the
“Programming-intensive option” should include a summary of your
program (including the program structure, implementation details, a
summarized algorithm for the main modules etc., including code if necessary).
- For “Analysis-intensive option”, it is required to include a more in-depth
analysis on the investigation and experimental comparison made through the
project.
Submission
• Report submission deadline: 25th May 2017 5pm
• You need to submit your final report as a single document file (MS Word or PDF
format) to LearnJCU. Your report should include the link to access your
presentation.
• For the “Programming-intensive option”, you need to submit the source code and
executable file of your program together with your report. Please make a zip file
including all necessary files (report document and program files).
Useful links
• http://www.kdnuggets.com/
• http://www.cs.waikato.ac.nz/ml/weka/
• http://mlearn.ics.uci.edu/MLRepository.html
• http://kdd.ics.uci.edu/
• http://www.sigkdd.org/
Writing Skills: http://www-public.jcu.edu.au/learningskills/resources/wsonline/
Scientific Report Writing:
http://unilearning.uow.edu.au/report/2b.html
http://writing.wisc.edu/Handbook/ScienceReport.html
and more on the Web. Please search!
Marking Criteria
REQUIREMENT (TASK) | CRITERIA | MARKS | COMMENTS
Data Mining Practice (10%) | Dataset choice – appropriateness and complexity | _/2.5 |
Data Mining Practice (10%) | Relevance of chosen data mining area and algorithm (structured experimental setting) | _/2.5 |
Data Mining Practice (10%) | Preprocessing – appropriateness and complexity | _/2.5 |
Data Mining Practice (10%) | Scientific/Technical quality (novelty & innovation): [Programming intensive option] Program design/implementation/completeness; [Analysis intensive option] Comparative result analysis | _/2.5 |
Written Assignment (20%) | Readability & (written) presentation | ___/5 |
Written Assignment (20%) | Analysis of results and conclusion | ___/5 |
Written Assignment (20%) | Structure & Organisation | ___/5 |
Written Assignment (20%) | [Programming-intensive option only] Scientific & Technical quality (description about design, structure, components, algorithms etc.) | ___/5 |
Written Assignment (20%) | [Analysis-intensive option only] Scientific & Technical quality (summary of comparison analysis) | ___/5 |
Presentation (Video-Clip) (CP3403, CP5634 - 10%) | Visual Aids | _/2 |
Presentation (Video-Clip) (CP3403, CP5634 - 10%) | Information Communication | _/2 |
Presentation (Video-Clip) (CP3403, CP5634 - 10%) | Presentation Gestures | _/2 |
Presentation (Video-Clip) (CP3403, CP5634 - 10%) | Length of Presentation & Team work | _/2 |
Presentation (Video-Clip) (CP3403, CP5634 - 10%) | Delivery | _/2 |
Total Marks | | __/40 |
Data Mining Practice (10%)
Data mining practice is worth 10 marks:
• Dataset complexity (2.5 marks)
• Relevance of chosen data mining suites and algorithms to the chosen data (2.5 marks)
• Preprocessing (2.5 marks)
• Innovation and novelty (2.5 marks)
Criterion | Exceeds Standards (5) | Meets Standards (3) | Does not Meet Standards (1) | No Evidence (0)
Data complexity | Exhibiting excellent level of complexities including dimensionality, dirtiness, heterogeneity and complications | Exhibiting good level of complexities including dimensionality, dirtiness, heterogeneity and complications | Exhibiting low level of complexities including dimensionality, dirtiness, heterogeneity and complications | Irrelevant to the study
Relevance | Excellent choice of data mining suites and algorithms for the chosen dataset | Good choice of data mining suites and algorithms for the chosen dataset | Almost no relevance of the choice of data mining suites and algorithms for the chosen dataset | Irrelevant to the study
Preprocessing | Excellent preprocessing conducting most required cleaning for the main data mining practices | Good preprocessing conducting some required cleaning for the main data mining practices | Almost no preprocessing conducted | Irrelevant to the study
Innovation and novelty | Excellent finding of most hidden (previously unknown) patterns | Good finding of some hidden (previously unknown) patterns | Almost no finding of hidden (previously unknown) patterns | Irrelevant to the study
Written Report (20%)
Criterion | Exceeds Standards (5) | Meets Standards (3) | Does not Meet Standards (1) | No Evidence (0)
Readability and presentation | Exhibiting flow and presentation covering all required sections | Good flow and presentation covering all or most required sections | Flow and presentation covering almost no required sections | Irrelevant to the study
Scientific & technical quality | Excellent mining with excellent contributions | Good mining with some contributions | Almost no mining and contribution | Irrelevant to the study
Analysis of results and conclusions | Critically analyse results with excellent explanation | Critically analyse results with good explanation | Almost no critical analysis | Incomplete and/or unfocused
Structure & organisation | Excellent organisation of the paper covering all required sections | Good organisation of the paper covering almost all required sections | Organisation of the paper covering almost no required sections | Irrelevant to the study
Presentation (10%)
Criterion | Exceeds Standards (5) | Meets Standards (3) | Does not Meet Standards (1) | No Evidence (0)
Visual aids | Excellent use of graphs and figures; excellent support of visual components to support the flow of the talk | Good use of graphs and figures; good support of visual components to support the flow of the talk | Almost no use of graphs and figures; little support of visual components to support the flow of the talk | Irrelevant to the study
Information communication | Excellent communication of information; most information is accurate; fluency of presentation | Good communication of information; some information is accurate | Almost no communication of information; little information is accurate | Irrelevant to the study
Presentation gestures | Excellent eye contact and animated gestures | Good eye contact and animated gestures | Almost no eye contact and animated gestures | Irrelevant to the study
Length of presentation & Team work | Finishes within 1 minute (before or after) of the given time limit | Finishes within 2 minutes (before or after) of the given time limit | Finishes within 3 minutes (before or after) of the given time limit | Irrelevant to the study
Delivery | Smooth delivery | Delivery with minor issues | Delivery with major issues | Irrelevant to the study