Data Mining 1 25-Jul-2016
COSC2111 Data Mining
Assignment 1
This assignment counts for 23% of the total marks in this course.
PART 1: CLASSIFICATION 20 marks
1. This part of the assignment is concerned with the file:
/KDrive/SEH/SCSIT/Students/Courses/COSC2111/DataMining/data/other/bankbalanced.csv.
There is a description of the data in the file bank-names.txt in the same directory.
2. Run the following classifiers, with the default parameters, on this data: ZeroR, OneR,
J48 and IBk, and construct a table of the training and cross-validation errors. You can get
the training error by selecting "Use training set" as the test option. What do you
conclude from these results? (A scripted equivalent of these runs is sketched after this list.)
3. Using the J48 classifier, can you find a combination of the C and M parameter values
that minimizes the amount of overfitting? Include the results of your best five runs,
including the parameter values, in your table of results.
4. Reset J48 parameters to their default values. What is the effect of lowering the number
of examples in the training set? Include your runs in your table of results.
5. Using the IBk classifier, can you find the value of k that minimizes the amount of
overfitting? Include your runs in your table of results.
6. Try a number of other classifiers. Aside from ZeroR, which classifiers are best and
worst in terms of predictive accuracy? Include 5 runs in your table of results.
7. What are the implications of the above range of accuracies for developing a bank
application using classification techniques?
8. Compare the accuracy of ZeroR, OneR and J48. What do you conclude?
9. What golden nuggets did you find, if any?
10. [OPTIONAL] Use an attribute selection algorithm to get a reduced attribute set. How
does the accuracy on the reduced set compare with the accuracy on the full set?
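If you prefer to script these runs rather than clicking through the Explorer GUI, the experiments in questions 2, 3 and 5 can also be driven from the Weka Java API. The sketch below is illustrative only: the file path, the position of the class attribute and the J48/IBk parameter values shown are assumptions you will need to adapt to your own setup.

    import java.util.Random;
    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.lazy.IBk;
    import weka.classifiers.rules.OneR;
    import weka.classifiers.rules.ZeroR;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class Part1Runs {
        public static void main(String[] args) throws Exception {
            // Path and class attribute are assumptions -- adjust to your copy of the data.
            Instances data = DataSource.read("bankbalanced.csv");
            data.setClassIndex(data.numAttributes() - 1);

            Classifier[] classifiers = { new ZeroR(), new OneR(), new J48(), new IBk() };
            for (Classifier c : classifiers) {
                // Cross-validation error (crossValidateModel trains its own per-fold copies).
                Evaluation cv = new Evaluation(data);
                cv.crossValidateModel(c, data, 10, new Random(1));

                // Training-set error: train on all the data, then evaluate on the same data.
                c.buildClassifier(data);
                Evaluation train = new Evaluation(data);
                train.evaluateModel(c, data);

                System.out.printf("%-5s train err = %5.2f%%  CV err = %5.2f%%%n",
                        c.getClass().getSimpleName(),
                        train.errorRate() * 100, cv.errorRate() * 100);
            }

            // Question 3: vary J48's C and M (the values below are only examples).
            J48 j48 = new J48();
            j48.setConfidenceFactor(0.1f);   // -C
            j48.setMinNumObj(10);            // -M
            Evaluation e = new Evaluation(data);
            e.crossValidateModel(j48, data, 10, new Random(1));
            System.out.printf("J48 -C 0.1 -M 10  CV err = %5.2f%%%n", e.errorRate() * 100);

            // Question 5: vary IBk's k in the same way, e.g. new IBk(5).
        }
    }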
Submit: Up to two pages that describe what you did for each of the above questions
and your results and conclusions.
Run No | Classifier | Parameters | Training Error | Cross-valid. Error | Overfitting
1      | ZeroR      | None       | 30.0%          | 30.0%              | None
PART 2: NUMERIC PREDICTION 10 marks
1. Numeric Prediction of the Balance attribute in the bank data of part 1.
2. Run the following classifiers, with default parameters, on this data: ZeroR, M5P and IBk,
and construct a table of the training and cross-validation errors. You may want to turn
on "Output Predictions" to get a better sense of the magnitude of the error on each
example. What do you conclude from these results? (A scripted equivalent of these runs is sketched after this list.)
3. Explore different parameter settings for M5P and IBk. Which values give the best
performance in terms of predictive accuracy and overfitting? Include the results of the
best five runs in your table of results.
4. Investigate three other classifiers for numeric prediction and their associated parameters. Include your best five runs in your table of results. Which classifier gives the
best performance in terms of predictive accuracy and overfitting?
5. What golden nuggets did you find, if any?
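As in Part 1, these runs can also be scripted. The sketch below assumes the class attribute is named "balance" and that the file path is the same as before; both are assumptions to check against your copy of the data. For a numeric class, Weka reports MAE, RMSE and the correlation coefficient rather than an error rate.

    import java.util.Random;
    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.lazy.IBk;
    import weka.classifiers.rules.ZeroR;
    import weka.classifiers.trees.M5P;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class Part2Runs {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("bankbalanced.csv");
            // "balance" is an assumed attribute name -- check the header of the file.
            data.setClassIndex(data.attribute("balance").index());

            Classifier[] models = { new ZeroR(), new M5P(), new IBk() };
            for (Classifier m : models) {
                Evaluation eval = new Evaluation(data);
                eval.crossValidateModel(m, data, 10, new Random(1));
                System.out.printf("%-5s MAE = %.2f  RMSE = %.2f  corr = %.3f%n",
                        m.getClass().getSimpleName(),
                        eval.meanAbsoluteError(),
                        eval.rootMeanSquaredError(),
                        eval.correlationCoefficient());
            }
        }
    }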
Submit: Up to one page that describes what you did for each of the above questions
and your results and conclusions.
PART 3: CLUSTERING 10 marks
1. Clustering of the bank data of part 1.
For this part use only the attributes Age, Marital, Education and Balance.
2. Run the K-means clustering algorithm (SimpleKMeans in Weka) on this data for the following values of K:
1, 2, 3, 4, 5, 10, 20. Analyse the resulting clusters. What do you conclude? (A scripted equivalent of these runs is sketched after this list.)
3. Choose a value of K and run the algorithm with different seeds. What is the effect of
changing the seed?
4. Run the EM algorithm on this data with the default parameters and describe the output.
5. The EM algorithm can be quite sensitive to whether the data is normalized or
not. Use the Weka Normalize filter (Preprocess --> Filter --> unsupervised
--> normalize) to normalize the numeric attributes. What difference does this make to
the clustering runs?
6. The algorithm can be quite sensitive to the values of minLogLikelihoodImprovementCV,
minStdDev and minLogLikelihoodImprovementIterating. Explore the effect of
changing these values. What do you conclude?
7. How many clusters do you think are in the data? Give an English language description
of one of them.
8. Compare the use of K-means and EM for clustering tasks. Which do you think is best?
Why?
9. What golden nuggets did you find, if any?
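The clustering runs can also be scripted. In the sketch below, the file path, the attribute indices passed to the Remove filter and the seed are assumptions; check them against your copy of the data before relying on the output.

    import weka.clusterers.ClusterEvaluation;
    import weka.clusterers.EM;
    import weka.clusterers.SimpleKMeans;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Normalize;
    import weka.filters.unsupervised.attribute.Remove;

    public class Part3Runs {
        public static void main(String[] args) throws Exception {
            Instances raw = DataSource.read("bankbalanced.csv");

            // Keep only age, marital, education and balance. The indices are assumptions --
            // check the attribute order in your copy of the file.
            Remove keep = new Remove();
            keep.setAttributeIndices("1,3,4,6");
            keep.setInvertSelection(true);   // keep the listed attributes, drop the rest
            keep.setInputFormat(raw);
            Instances data = Filter.useFilter(raw, keep);

            // K-means for the requested values of K, with a fixed seed.
            for (int k : new int[] {1, 2, 3, 4, 5, 10, 20}) {
                SimpleKMeans km = new SimpleKMeans();
                km.setNumClusters(k);
                km.setSeed(1);               // rerun with different seeds for question 3
                km.buildClusterer(data);
                System.out.println("K = " + k + "  within-cluster SSE = " + km.getSquaredError());
            }

            // Normalize the numeric attributes before EM (question 5).
            Normalize norm = new Normalize();
            norm.setInputFormat(data);
            Instances normData = Filter.useFilter(data, norm);

            EM em = new EM();                // default parameters (question 4)
            // em.setMinStdDev(1e-4);        // one of the parameters question 6 asks about
            em.buildClusterer(normData);

            ClusterEvaluation ce = new ClusterEvaluation();
            ce.setClusterer(em);
            ce.evaluateClusterer(normData);
            System.out.println(ce.clusterResultsToString());
        }
    }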
Submit: Up to one page that describes what you did for each of the above questions
and your results and conclusions.
PART 4: ASSOCIATION FINDING 10 marks
1. The files supermarket1.arff and supermarket2.arff in the folder
/KDrive/SEH/SCSIT/Students/Courses/COSC2111/DataMining/data/arff
contain the same details of shopping transactions represented in two different ways. You
can use a text viewer to look at the files.
2. What is the difference in representations?
3. Load the file supermarket1.arff into Weka and run the Apriori algorithm on this data.
You will need to restrict the number of attributes and/or the number of examples. What
significant associations can you find? (A scripted equivalent is sketched after this list.)
4. Explore different possibilities of the metric type and associated parameters. What do you
find?
5. Load the file supermarket2.arff into Weka and run the Apriori algorithm on this data. What do
you find?
6. Explore different possibilities of the metric type and associated parameters. What do you
find?
7. Try the other associators. What are the differences to Apriori?
8. What golden nuggets did you find, if any?
9. [OPTIONAL] Can you find any meaningful associations in the bank data?
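The Apriori runs can be scripted as well. The file path and the parameter values below are illustrative assumptions; the commented-out line shows one way of switching the metric type (questions 4 and 6).

    import weka.associations.Apriori;
    import weka.core.Instances;
    import weka.core.SelectedTag;
    import weka.core.converters.ConverterUtils.DataSource;

    public class Part4Runs {
        public static void main(String[] args) throws Exception {
            // Point this at your copy of supermarket1.arff (path is an assumption).
            Instances data = DataSource.read("supermarket1.arff");

            Apriori apriori = new Apriori();
            apriori.setNumRules(20);               // report the top 20 rules
            apriori.setLowerBoundMinSupport(0.1);  // minimum support (illustrative value)
            apriori.setMinMetric(0.9);             // minimum confidence, the default metric
            // To rank rules by lift instead of confidence:
            // apriori.setMetricType(new SelectedTag(Apriori.LIFT, Apriori.TAGS_SELECTION));
            apriori.buildAssociations(data);
            System.out.println(apriori);
        }
    }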
Submit: Up to one page that describes what you did for each of the above questions and your
results and conclusions.
Submission instructions: Submit through Blackboard assessment tasks.
Assessment Criteria: 70% of the marks are allocated for carrying out the runs and reporting
the results. 30% of the marks are for investigative strategy and interpretation of the results.