Data Mining
Task 1: Choose one of the following three tasks.
1. Mining association rules over distributed databases
Review the popular association-mining algorithms and propose a new algorithm to mine association
rules over distributed databases.
2. Mining classification over large databases
Review the popular classification-mining algorithms and propose a new algorithm to mine classification
over large databases in order to improve efficiency and scalability.
3. Mining clusters over large databases
Review the popular cluster-mining algorithms and propose a new algorithm to mine clusters over large
databases in order to improve performance (e.g. efficiency, scalability, and the ability to deal with
noise and outliers).
(5 written reports. Each report should have: an introduction, a description, an evaluation
(advantages and disadvantages), a summary, and a conclusion (own review), plus the new algorithm.)
review [8 marks]
proposal [5 marks]
Task 2: A database in .ARFF format has been provided for you on Studynet. Analyse this database
using the WEKA toolkit and tools introduced within this module. Produce a report explaining which tools
you used and why, what results you obtained, and what this tells you about the data. Marks will be
awarded for: variety of tools used, quality of analysis, and interpretation of the results. An extensive
report is not required (at most 4000 words), nor is a detailed explanation of the techniques employed, but
any graphs or tables produced should be described and analysed in the text. A reasonable report could
be achieved by doing a thorough analysis using three techniques. An excellent report would use at
least four tools to analyse the dataset, and provide detailed comparisons between the results.
You should perform the following steps:
1. Analyse the attributes in the data, and consider their relative importance with respect to the target
class.
[6 marks]
2. Construct graphs of classification performance against training set size for a range of classifiers
taken from those considered in the module. You may need to experiment with different training sets,
depending on what you have discovered about the data in step (1).
[10 marks]
3. Analyse the data structure/representation generated by at least three classifiers when trained on the
complete dataset. What does your analysis tell you about the dataset?
[7 marks]
4. Combine the results from the previous three steps and all your classifiers to develop a model of why
instances fall into particular classes. (Your answer to this question should be understandable by
someone who is not a specialist in data mining.)
[4 marks]
[Total 40 marks]
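
Steps 1 and 2 of Task 2 can be prototyped on a small example before running the full analysis in WEKA. The sketch below is a minimal illustration only, assuming nominal attributes and a made-up dataset. It ranks attributes by information gain with respect to the class (step 1), then measures the accuracy of a OneR-style single-attribute rule on a held-out third of the data as the training set grows (step 2). The attribute names (signal, noise, label), the helper functions, and the data are all hypothetical and are not taken from the Studynet dataset. The assessed analysis itself is expected to use WEKA.

```python
import math
import random
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Drop in class entropy after splitting the rows on one nominal attribute."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remainder


def train_one_r(rows, attribute, target):
    """OneR-style rule: predict the majority class seen for each attribute value."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[attribute]][row[target]] += 1
    # Fall back to the overall majority class for values not seen in training.
    default = Counter(row[target] for row in rows).most_common(1)[0][0]
    rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
    return lambda row: rule.get(row[attribute], default)


def learning_curve(rows, attribute, target, sizes, seed=0):
    """Accuracy on a held-out third of the data for increasing training set sizes."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 2 // 3
    pool, held_out = shuffled[:cut], shuffled[cut:]
    curve = []
    for size in sizes:
        predict = train_one_r(pool[:size], attribute, target)
        correct = sum(predict(row) == row[target] for row in held_out)
        curve.append((size, correct / len(held_out)))
    return curve


# Hypothetical toy data: "signal" determines the class, "noise" is unrelated to it.
ROWS = [
    {"signal": s, "noise": n, "label": "yes" if s == "high" else "no"}
    for s in ("high", "low")
    for n in ("a", "b", "c")
    for _ in range(10)
]

if __name__ == "__main__":
    # Step 1: rank attributes by information gain with respect to the class.
    for name in ("signal", "noise"):
        print(name, round(information_gain(ROWS, name, "label"), 3))
    # Step 2: accuracy against training set size for one simple classifier.
    for size, accuracy in learning_curve(ROWS, "signal", "label", [2, 5, 10, 20, 40]):
        print(size, round(accuracy, 3))
```

On this toy data the signal attribute carries all of the class information and the noise attribute carries none, so the ranking is easy to check by hand. Plotting the (size, accuracy) pairs for several classifiers on the same axes gives the kind of graph that step 2 asks for.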