

Statistical Learning for Data Mining

1) Consider the following decision tree to predict class 0 or 1. The number of rows in each class is shown in each node.

a) What is the predicted class for the following row of data?

   x1    x2 = diet   x3    x4 = exercise   x5 = blood pressure   Class
   2.6   healthy     1.2   high            high                  ?

b) Calculate the entropy information gain for the split at node 2. You do not need to simplify an expression with fractions and logarithms (natural log).

c) Calculate the training error rate of the tree.

d) Calculate the true positive rate for this tree and data.

e) Apply reduced-error pruning with the following validation data to decide whether the children of node 3 should be pruned.

   x1     x2 = diet   x3    x4    x5 = blood pressure   Class
   2.6    healthy     4.3   7.9   normal                0
   8.4    healthy     2.1   4.5   high                  1
   9.3    average     6.9   3.5   high                  0
   10.4   healthy     4.5   2.6   high                  0
   12.6   average     8.6   2.9   high                  0
   12.9   average     4.5   1.7   normal                1
   14.8   average     3.4   8.8   normal                1
   15.7   average     1.6   3.3   normal                1

2) Determine the first split for a regression tree applied to the following data.

   x1            x2    y
   no exercise   1.3   1
   no exercise   1.7   7
   exercise      3.7   2
   exercise      4.8   6

3) Consider the following data and the triangular kernel function

   K(x1, x0) = 3 - ||x1 - x0||^2   for 0 ≤ ||x1 - x0||^2 ≤ 3
   K(x1, x0) = 0                   for ||x1 - x0||^2 ≥ 3

where ||x1 - x0||^2 is the squared Euclidean distance between x1 and x0. Use a weighted nearest-neighbor model to predict y at x0 = (4, 6). You need to simplify your answer at least to an expression containing only numbers; that is, do not use ||x1 - x0||^2 in your answer.

   x1   x2   y
   3    7    6
   4    7    9
   3    8    12

4) Circle ALL of the following that are correct. Consider a nearest-neighbor classifier based on a Gaussian kernel with bandwidth parameter σ. As we increase σ, we expect to

a) Decrease the error rate on the training data
b) Decrease the error rate on test data
c) Increase the bias in the predictions
d) Increase the variance in the predictions
e) Obtain better generalization error

5) Circle ALL of the following that are correct.

a) In a nearest-neighbor classifier, if the threshold to assign to the positive class is increased, the sensitivity will remain the same or increase
b) If the test data in cross-validation folds are used to select parameter values, the cross-validation estimate of generalization error is expected to be optimistic
c) An optimistic error rate is expected to be a better estimate of generalization error than a pessimistic error rate
d) True positive rate and false positive rate always sum to one
e) Nearest-neighbor classifiers are particularly sensitive to the units of measure of the predictor attributes

6) A rule-based classifier will use a general-to-specific strategy to learn a rule to predict class + from the following data.

a) Assume the first conjunct in the rule is {x1 = red}. List all the possible conjuncts that could be added to this rule. Do not calculate information gain.

b) Are any additional conjuncts added to the rule {x1 = yes, x2 = red → +}? State Yes or No. Explain.

c) If the rule {x1 = c, x2 = red → +} is added to the rule set, list the row numbers that are eliminated before the next rule is computed, or state that there is not enough information to determine this.
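Hand calculations for the entropy gain in question 1b and the kernel-weighted prediction in question 3 can be checked numerically. Below is a minimal Python sketch; the gain function follows the natural-log convention stated in 1b, the kernel is the triangular kernel defined in question 3, and the node counts and data rows at the bottom are hypothetical placeholders (the real counts come from the tree figure and the tables above).

```python
import math

def entropy(counts):
    """Entropy (natural log) of a list of class counts, e.g. [n0, n1]."""
    n = sum(counts)
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)

def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    n = sum(parent)
    return entropy(parent) - sum(sum(ch) / n * entropy(ch) for ch in children)

def triangular_kernel(sq_dist):
    """The kernel from question 3: 3 - d^2 on [0, 3], and 0 beyond 3."""
    return max(3.0 - sq_dist, 0.0)

def weighted_nn_predict(X, y, x0):
    """Kernel-weighted average of y over all training rows."""
    w = [triangular_kernel(sum((a - b) ** 2 for a, b in zip(x, x0))) for x in X]
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)

# Hypothetical node counts: parent [6, 4] split into children [5, 1] and [1, 3].
print(information_gain([6, 4], [[5, 1], [1, 3]]))
# Hypothetical rows: predict at x0 = (1, 1) from two training points.
print(weighted_nn_predict([(1, 2), (2, 1)], [4.0, 8.0], (1, 1)))
```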
7) The following Weka screenshot is for the JRip rule-based classifier applied to a subset of the Breast Cancer data with 500 rows and two classes (class 2: 343 rows, class 4: 157 rows). Here 5-fold cross-validation is used, and the number of folds used for pruning is set to 4 (which implies that 1/4 of the rows are used for pruning).

a) Predict the class of the following row of data.

   Sample Code = 61789, Clump Thickness = 6, Uniformity of Cell Size = 3,
   Uniformity of Cell Shape = 8, Marginal Adhesion = 6,
   Single Epithelial Cell Size = 7, Bare Nuclei = 3, Bland Chromatin = 4,
   Normal Nucleoli = 7, Mitoses = 7, Class = ?

b) Calculate the number of rows that are used to prune the rules in each fold, or state that there is not enough information available.

c) Estimate the generalization error of this model, or state that there is not enough information available.

Instances:    500
Attributes:   11
              Sample Code
              Clump Thickness
              Uniformity of Cell Size
              Uniformity of Cell Shape
              Marginal Adhesion
              Single Epithelial Cell Size
              Bare Nuclei
              Bland Chromatin
              Normal Nucleoli
              Mitoses
              Class
Test mode:    5-fold cross-validation

=== Classifier model (full training set) ===

JRIP rules:
===========

(Uniformity of Cell Size >= 4) and (Uniformity of Cell Size >= 5) => Class=4 (121.0/3.0)
(Bare Nuclei >= 4) and (Bare Nuclei >= 6) => Class=4 (31.0/1.0)
(Single Epithelial Cell Size >= 3) and (Clump Thickness >= 6) and (Bare Nuclei >= 4) => Class=4 (5.0/0.0)
=> Class=2 (343.0/4.0)

Number of Rules : 4

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances         483               96.6    %
Incorrectly Classified Instances        17                 3.4    %
Kappa statistic                          0.9215
Mean absolute error                      0.0511
Root mean squared error                  0.1814
Relative absolute error                 11.8493 %
Root relative squared error             39.0784 %
Total Number of Instances              500

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
                 0.971    0.045    0.979      0.971   0.975      0.922  0.963     0.974     2
                 0.955    0.029    0.938      0.955   0.946      0.922  0.963     0.915     4
Weighted Avg.    0.966    0.040    0.966      0.966   0.966      0.922  0.963     0.955

=== Confusion Matrix ===

    a    b   <-- classified as
  333   10 |   a = 2
    7  150 |   b = 4

8) A data set contains 1000 instances. A decision tree is developed with 10-fold cross-validation to estimate generalization error in the usual manner. Within each fold, 1/3 of the data is used for validation. Within each fold, a tree is grown and then pruned in two different ways: (1) the unpruned tree is pruned with a pessimistic error rate based on the number of child nodes, and (2) the unpruned tree is pruned with a pessimistic error rate based on the confidence-interval approach described in class. Between these two pruned trees, the validation error rate is used to select the better one. Describe how the final model is built for this study, stating the number of instances used in each element of your answer.
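The mechanics behind question 7a, applying an ordered rule list with a default class, can be sketched as follows. This is a minimal sketch, assuming every condition is a >= threshold test as in the JRip output above; the rules are transcribed from that output, while the example row at the bottom is hypothetical. The row from part a) can be traced through the same loop.

```python
def classify(row, rules, default):
    """Apply an ordered rule list: the first rule whose conditions all
    hold fires and returns its class; otherwise the default class fires."""
    for conditions, label in rules:
        if all(row[attr] >= threshold for attr, threshold in conditions):
            return label
    return default

# The three JRip rules from question 7; the default rule predicts class 2.
rules = [
    ([("Uniformity of Cell Size", 4), ("Uniformity of Cell Size", 5)], 4),
    ([("Bare Nuclei", 4), ("Bare Nuclei", 6)], 4),
    ([("Single Epithelial Cell Size", 3), ("Clump Thickness", 6),
      ("Bare Nuclei", 4)], 4),
]

# Hypothetical row: fires the first rule (6 >= 4 and 6 >= 5), so class 4.
row = {"Uniformity of Cell Size": 6, "Bare Nuclei": 2,
       "Single Epithelial Cell Size": 2, "Clump Thickness": 5}
print(classify(row, rules, default=2))
```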