Preliminary

On the Moodle page of the course, in the Workshops section you will find a directory called Datasets & Source Code. The directory contains the Home Equity (HMEQ) dataset, both in its original form and after the missing values for the categorical variables REASON and JOB have been identified (hmeq.csv and clean hmeq.csv respectively). It also contains a file with functions that I designed for these workshops, my Rfunctions.R. This last file is updated weekly, so download the most recent version every week.

1. Save the datasets into the R directory you are using for the course.
2. Set the working directory in RStudio to that directory.
3. After downloading the most recent version of my Rfunctions.R, source it:
   > source("my Rfunctions.R")
   This command will load (and, if necessary, download and install) every required package. If the command terminates with an error please let me know.
4. Load the dataset:
   > Data <- read.csv("clean hmeq.csv")

We next assign observations to the training and test set. To ensure reproducibility, set the seed to 1 as in previous workshops.

> set.seed(1)
> train <- sample(1:nrow(Data), floor(0.7*nrow(Data)))
> Data.train <- Data[train,]
> Data.test <- Data[-train,]

Decision Trees

There are a number of different packages to build decision trees in R, reflecting the different algorithms and implementations of this family of methods. One of the most widely used is the package rpart, which stands for recursive partitioning. To load the package call:

# Loading rpart library for decision trees
> library(rpart)

The function to build decision trees is called rpart(). This function takes a minimum of three arguments:

1. formula = a formula that specifies the target variable and the inputs (predictors). The syntax of the formula is the same as that used for logistic regression models, Y ~ X1 + X2 + ... + Xq, where Y is the response (dependent variable) and the Xi are the predictors (independent variables).
2. data = the dataset used to build the tree (training set).
3. method = "class". Setting method to "class" ensures that a decision tree for classification (and not regression) is built.

Since decision trees perform variable selection during construction, we will use all the available variables as inputs. The formula BAD ~ . is shorthand notation for using all the other variables in the data frame (i.e. every variable excluding BAD) as predictors.

# Build decision tree on training data
> tr <- rpart(BAD ~ . , data = Data.train, method = "class")

In the above we left all the options of the rpart() function at their default values, and as a result the Gini index is used as the impurity measure. A very appealing property of decision trees is that they are readily visualised (see Fig. 13).

# Creates plot of tree structure
> plot(tr)
# Inserts binary test conditions at each internal node and predicted label in leaf nodes
> text(tr)

Figure 13: Plot of decision tree

In Fig. 13 all the splits are binary. This is not accidental: rpart considers only binary splits for all types of variables. Note that the plot() function sets the length of the branches to reflect the decrease in the impurity measure after each split. You will observe that the first split doesn't decrease impurity substantially but the splits at the second level (depth 2) do. It is not uncommon for the largest decrease in impurity to be observed after the first few splits in a decision tree. Notice further that as we go down the tree impurity decreases much more gradually.
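If the branch lengths chosen by plot() make the lower levels of the tree hard to read, the uniform argument of plot() draws all branches with equal length instead. A minimal sketch, assuming the tr object built above:

# Branches of equal length (uniform = TRUE) rather than lengths proportional to the impurity decrease
> plot(tr, uniform = TRUE)
# Slightly smaller labels so that the text at the lower levels does not overlap
> text(tr, cex = 0.8)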
To obtain more detailed information concerning each split of the decision tree we can print the tree object.

> tr

R prints output corresponding to each branch of the tree (see Fig. 14). R displays (in this order):

1. the number at the start of each line, which is the number (id) assigned to the node
2. the test condition associated with the node (e.g. DELINQ < 0.5)
3. the number of observations allocated to that node
4. the number of misclassified observations (assuming predictions are made based on the majority class)
5. the predicted class (0 or 1), determined as the majority class at the node
6. the proportion of observations of class 0 and 1
7. an asterisk, which indicates that the node is a leaf (terminal node) of the tree

The summary() function provides much more detailed information about the tree:

> summary(tr)

The complete output is very long, so Fig. 15 illustrates the header and the information for the first node of the tree.

Figure 14: Print out of tree
Figure 15: Output of summary() function

First we will examine in more detail the output for each node (ignoring for the moment the first table appearing right after the printout of the Call).

• Variable importance: This line summarises the importance of the attributes used in the tree (either as primary or surrogate variables). A variable may appear in the tree many times. An overall measure of variable importance is the sum of the goodness of split measures (i.e. the decrease in impurity) for each split for which it was the primary variable, plus goodness of split × (adjusted agreement) for all splits in which it was a surrogate. In the printout these are scaled to sum to 100 and the rounded values are shown, omitting any variable whose proportion is less than 1%.

• Node number: Information for each node contains:
  – The number (id) of the node and the number of observations allocated to this node (4172)
  – complexity param: The value of the complexity parameter (α in the lecture slides) for which this node would be a leaf node. In other words, the value of the complexity parameter for which a subtree in which this node is a leaf is optimal.
  – predicted class: The prediction is based on the majority class at the node
  – expected loss: The proportion of misclassified observations (misclassification error) if all observations that reach this node are predicted according to the majority class (824/4172)
  – P(node): The probability of an observation reaching this node (proportion of training set observations that reach this node)
  – class counts: The number of observations of each class that reach this node
  – probabilities: The associated probabilities of each class
  – left son/right son: The id as well as the number of observations allocated to the left and right children of the current node. The observations for which the splitting condition is true are allocated to the left child. (In the current example for the root of the tree this means that the 3336 observations at the root node satisfying DELINQ < 0.5 are allocated to the left child, which has node id 2.)
  – Primary splits: Splits with respect to the top five variables (in terms of impurity reduction). As you can verify, at each node of the tree the split with the largest improvement is selected as the splitting rule.
  – Surrogate splits: The default behaviour in rpart is to deal with missing values using surrogate splits. Surrogate splits are rules (splitting conditions) that mimic the partition induced by the primary variable at a particular node, using different variables. Thus observations for which the primary variable is missing can be assigned to the left or right child using the surrogate split. For the root node of our tree we see that the first surrogate splitting rule, CLNO < 47.5, agrees on 78.3% of the observations with the primary split, DELINQ < 0.5. By default up to 5 surrogate rules are printed, but only if their utility (i.e. their ability to mimic the split induced by the best primary split) exceeds that of the baseline rule, which is to assign observations with missing data to the child to which the majority of observations are assigned. For the root node of our tree the majority of observations is assigned to the left child (3336 out of 4172). There are only two surrogate rules that manage to mimic the split induced by DELINQ < 0.5 (marginally) better than simply assigning observations with missing values to the left child. These are CLNO < 47.5 and LOAN < 2350. The improvement of a surrogate rule over simply assigning according to the majority of the observations is called the adjusted agreement (adj in the output) and is defined as:

    adjusted agreement = (# obs. correctly assigned by surrogate − # obs. correctly assigned by majority rule) / (total # obs. − # obs. correctly assigned by majority rule)
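To make this definition concrete, the small helper below evaluates the formula for a set of counts. The function name and the surrogate count are hypothetical and only illustrate the calculation; the counts 4172 and 3336 are the root-node figures quoted above.

# Hypothetical helper (not part of rpart): adjusted agreement of a surrogate split
# n.surr = # obs. correctly assigned by the surrogate rule
# n.maj  = # obs. correctly assigned by the majority rule
# n.tot  = total # obs. at the node
> adj.agreement <- function(n.surr, n.maj, n.tot) (n.surr - n.maj) / (n.tot - n.maj)
# Illustrative call with a made-up surrogate count; 3336 and 4172 are the root-node counts from summary(tr)
> adj.agreement(n.surr = 3400, n.maj = 3336, n.tot = 4172)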
Model Selection – Tree Pruning

By default rpart() builds decision trees by assessing the penalised average tree impurity, I_α(T), defined as

    I_α(T) = Σ_{leaves t} (n_t / n) I(t) + α · size(T),

where,
• I() is the impurity measure used (by default the Gini index)
• n is the number of observations in the training set
• n_t is the number of observations allocated to leaf t
• size(T) is the number of leaves of the tree
• α is the complexity parameter (denoted as cp in the rpart package)

At each stage of tree building, rpart() selects to split the leaf that will decrease I_α(T) the most. In other words, at each step the leaf whose split will maximise the reduction in average tree impurity is selected. If splitting any of the existing leaves doesn't reduce average tree impurity by more than α then the tree stops growing (recall that rpart considers only binary splits). We can therefore view α as a parameter that determines the "cost" of adding a new split (a new "variable") to the model. Clearly the setting of this parameter is critical in determining the final tree. Higher values of α require a larger reduction of impurity after each leaf is split and therefore result in smaller trees than lower values of α. If α is very large (e.g. α → ∞) no split will be allowed, whereas if α = 0 the largest possible tree will be built (as the termination condition in this case is that the gain has to be positive, which is a very loose condition).

In fact it has been shown that there is a nested family of subtrees (i.e. each tree is part of the next) such that each is optimal for a range of values of α. This information is provided in the first table presented by the summary() function and reproduced, without the additional node information, by the function printcp(), as shown in Fig. 16.

Figure 16: Output from printcp() function
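The same table can also be printed on its own, without the long output of summary(), by calling printcp() on the fitted tree:

# Print the table of complexity parameter values and the associated subtrees for tr
> printcp(tr)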
From the first column of the table depicted in Fig. 16 we can infer the range of values of α (denoted as CP) that would cause the tree growth to be terminated after nsplit splits (reported in the second column). In this example,

• α > 0.04496403: causes no splits to be made and the decision tree contains only the root node.
• α ∈ (0.041966, 0.044964): the tree building process terminates after 2 splits (i.e. 3 leaves).
• α ∈ (0.037170, 0.041966): the tree building process stops after 3 splits (i.e. 4 leaves), etc.

To verify this, and to understand what is meant by a "sequence of nested trees", try setting the complexity penalty parameter (the cp argument) manually and inspecting the output. The output of the following commands is depicted in Fig. 17.

> tr1 <- rpart(BAD ~ . , data = Data.train, method = "class", cp = 0.042)
> plot(tr1)
> text(tr1)
> tr2 <- rpart(BAD ~ . , data = Data.train, method = "class", cp = 0.040)
> plot(tr2)
> text(tr2)

Figure 17: Nested trees as the complexity penalty parameter changes: (a) α = 0.042; (b) α = 0.040

The table in Fig. 16 also hints that the default setting for α is 0.01, as values smaller than this are not considered. We next discuss the entries in the subsequent 3 columns. The column titled "rel error" reports the error of each subtree over the entire training set relative to the error of a tree containing only the root node (recall that such a tree assigns all observations to the majority class, so this is a comparison against the naive predictor). By definition, the relative error of a tree containing only the root node (first row of the table) is always 1. As the tree becomes larger, performance on the training data can only improve (which is why we never use the training set to perform model selection), and therefore the relative error declines. The next two columns, "xerror" and "xstd", report the average and the standard deviation (respectively) of the error estimated through 10-fold cross validation. This is information which we can use to perform model selection using the 1 Standard Deviation Rule that we also applied in Logistic Regression. In this case cross-validation allows us to select a sensible value for the complexity parameter α. We next discuss how to do this.

As Fig. 16 shows, for the range of complexity parameters considered the average cross validation error keeps decreasing. So we will consider a broader range to determine when the minimum is achieved. To do this we need to set one of the control parameters of the rpart() function. This is done by modifying elements of a structure called rpart.control, which contains the control parameters for the rpart() function. Useful pre-pruning parameters of rpart.control:

cp = minimum value of the complexity parameter to be considered
minsplit = minimum number of observations that must exist in a node for a split to be allowed
maxdepth = the maximum depth a tree is allowed to reach
xval = number of cross validation partitions (K). Default: K = 10

Having defined all of the above let's proceed to find the optimal value of α through cross validation. First, we will allow the decision tree to grow to its maximum size. That is achieved when α = 0.

# Setting α = 0
> tr3 <- rpart(BAD ~ . , data = Data.train, method = "class", control = rpart.control(xval=10, cp=0))
> printcp(tr3)

Figure 18: Output of printcp() for cp=0

Fig. 18 illustrates the output of the printcp() command. Note that although larger trees always reduce the error over the entire training dataset (as shown by the "rel error" column), this is not always true with respect to the average cross validation error (as shown by the "xerror" column). Also note that the improvement in relative cross validation error as the tree becomes larger is much more modest than that of the relative error when all the data is used (another indication that performance on the training set is not a reliable estimate of generalisation performance).
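The table printed by printcp() is also stored inside the fitted object (as the cptable component), so the selection described in the next paragraph can be reproduced programmatically. A minimal sketch of the 1 Standard Deviation Rule, assuming only the tr3 object built above:

# The printcp() table is stored as a matrix with columns CP, nsplit, rel error, xerror, xstd
> cptab <- tr3$cptable
# Row with the lowest average cross validation error
> best <- which.min(cptab[, "xerror"])
# One-standard-deviation threshold
> threshold <- cptab[best, "xerror"] + cptab[best, "xstd"]
# First (i.e. smallest) subtree whose average cross validation error does not exceed the threshold
> chosen <- min(which(cptab[, "xerror"] <= threshold))
# The CP value of this row (or any value in the corresponding interval) can be passed to prune()
> cptab[chosen, ]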
From the output of printcp() we see that the lowest average cross validation error is achieved by subtree number 18, which has 55 splits (and therefore 56 leaves). The lowest average error is 0.72422 and the standard deviation of the cross validation error for that tree is 0.027252. The 1 Standard Deviation Rule says that we should select the smallest subtree (and associated value of α) whose average cross validation performance is within one standard deviation of the best average cross validation performance. In our context, this means the smallest subtree with average cross validation error less than or equal to 0.72422 + 0.027252 = 0.751472. From the figure we see that this is the subtree in the 12th row, which has 26 splits. The optimal choice of α is therefore in the range (0.00479616, 0.00539568).

We can visualise cross-validation performance and the 1 standard deviation rule using the plotcp() function (see Fig. 19).

# Creates Figure 19
> plotcp(tr3)

Figure 19: Cross validation error and model selection through 1 Standard Deviation Rule

To extract the optimal subtree use the prune() command, which takes as input the value of the complexity parameter. Any value in the interval identified above will select the same subtree.

> bestTree <- prune(tr3, cp = 0.005)

We can inspect the selected tree using the plot() and text() functions as before (see Fig. 20).

> plot(bestTree)
# cex argument controls font size, which we want to make smaller here
> text(bestTree, cex=0.6)

Figure 20: Tree selected through cross validation

After plotting the tree you can use the function path.rpart() to select nodes in the tree and see the rules that get an observation assigned to that node. To terminate this function press the Finish button at the top right of the Viewer window in RStudio.

> plot(bestTree)
# Issue the following command and then select nodes on the plot of the tree.
# The rules associated with the selected node will be printed on the console
> path.rpart(bestTree)

Selection of Impurity Measure

The default impurity measure used in rpart is the Gini index. You can modify this to use entropy by setting the optional argument parms as follows:

# Selecting entropy as impurity measure
> trE <- rpart(BAD ~ . , data = Data.train, method = "class", parms=list(split="information"))

The option parms contains optional parameters related to the splitting function. In the above command split="information" corresponds to using entropy as the impurity measure, because the gain when entropy is used to measure impurity is a quantity known as information gain. Plot and compare the tree constructed using entropy (Fig. 21) with the tree built using Gini. Assess the important variables in this tree and compare with before.

Figure 21: Tree constructed using Entropy
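As a starting point for this comparison, a minimal sketch (assuming the trE object built above) that plots the entropy-based tree and extracts its variable importance values:

# Plot the tree grown with entropy as the impurity measure
> plot(trE)
> text(trE)
# Variable importance of the entropy-based tree, to set against that reported by summary(tr)
> trE$variable.importance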
Performance Assessment

To assess performance we first need to apply the decision tree to a dataset and obtain, for each observation, the probability of belonging to the class of interest. In our case we want to estimate the probability of belonging to BAD=1. As with logistic regression, we use the predict() function to estimate these probabilities. I will apply this function to get estimated probabilities for the test set. To do this I need to set newdata = Data.test and, to ensure that the output is the estimated probability, I need to set type = "prob".

# Generate probability predictions for the test set using fully grown decision tree
> probs1 <- predict(tr3, newdata = Data.test, type = "prob")

The output of the above command is not a vector but rather a matrix. For each observation probs1 reports the probability of belonging to each value of the target variable BAD. (This is sensible because decision trees can be applied to problems with more than two classes.) To verify this do:

> head(probs1)

Ensure that the sum of the probabilities on each row is equal to 1 (since an observation must be assigned to either class 0 or class 1). To build the ROC curve we only need the probability of belonging to class 1. This corresponds to the second column of the matrix probs1.

> roc <- roc(Data.test$BAD, probs1[,2], plot=TRUE, col="red", grid=TRUE)

We want to compare this against the performance of the tree chosen through cross validation (bestTree), and the tree constructed using the default settings (tr).

# Probability predictions for pruned decision tree
> probs2 <- predict(bestTree, newdata = Data.test, type = "prob")
> roc <- roc(Data.test$BAD, probs2[,2], plot = TRUE, add = TRUE, col="blue")
# Probability predictions for default tree that uses cp=0.01
> probs3 <- predict(tr, newdata = Data.test, type = "prob")
> roc <- roc(Data.test$BAD, probs3[,2], plot = TRUE, add = TRUE, col="green")

Figure 22: ROC on test set using fully grown tree (red); pruned tree using cross validation (blue); and tree for default value of α = 0.01 (green)

In this case the fully grown tree performs better than the one selected by cross-validation and the one using the default settings.
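If the roc() function used above comes from the pROC package, the visual comparison can be quantified with pROC's auc() function, which returns the area under each curve. A minimal sketch, assuming the probability matrices created above:

# Area under the ROC curve on the test set for each of the three trees compared above
> auc(roc(Data.test$BAD, probs1[,2]))
> auc(roc(Data.test$BAD, probs2[,2]))
> auc(roc(Data.test$BAD, probs3[,2]))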
Missing Values – Non-ignorable missing

We discussed during lectures that decision tree algorithms have a number of ways to handle missing values internally (that is, without asking the user to explicitly handle them). We saw earlier how surrogate rules are used in rpart when they are more informative than the simple majority rule. Using surrogate splits is similar to assuming Missing At Random (since a surrogate split can be considered a very simple model predicting whether or not the value of the missing variable exceeds the threshold of the primary splitting rule). (Note that surrogate splits are much less sophisticated than some of the more advanced models employed by the mice library.) Neither surrogate splits nor the majority rule can account for the case when the presence of a missing value is informative about the probability of belonging to the class of interest. We discussed in a previous workshop that to accommodate this case we need to introduce missing value indicator variables. We will consider this in the case of decision trees also, and assess its importance.

# Construct missing value indicator variables
> Data.ind <- na indicator(Data)

Next build a full decision tree using the new data frame (note that only the observations assigned to the training set are used in the following command).

# Build full (i.e. no pruning) decision tree
> T <- rpart(BAD ~ . , data = Data.ind[train,], method = "class", cp=0)

The default when building a tree is to estimate the 10-fold cross-validation error, so I haven't explicitly set the optional argument control=rpart.control(xval=10) as I did earlier. Let's inspect this tree using the summary() and printcp() functions. The output of the summary() function is too long, so we can access the estimated variable importance as follows:

> T$variable.importance

Figure 23: Pruning information and variable importance for tree constructed with inclusion of missing value indicators: (a) printcp(T); (b) T$variable.importance

There are some conclusions that can be immediately drawn from Fig. 23, especially if one compares this with the information from the trees built before. First, the most important primary variable is one of the missing value indicators, more specifically the indicator of missing values for the DEBTINC variable. A number of other missing value indicators are also used in the final tree but these are not as important. It is better to postpone any definitive conclusion about these until after we have considered pruning the tree. Second, if one compares the reduction in the error (both over all the training set and the cross validation estimate) achieved by the present tree and the full tree built on the original data (Figs. 23 and 18 respectively), one observes that the most recent achieves a substantially lower error. In particular, the lowest average CV error before was 0.72422, whereas the lowest average CV error now is 0.53357 (achieved by a tree of substantially smaller size than before). Also noteworthy is the fact that a tree with 9 leaves achieves an average CV error within one standard deviation of the best performing tree, while previously the 1 standard deviation rule recommended a tree with 27 leaves. The plot of cross-validation error and the associated best tree selected through the 1 standard deviation rule is provided in Figure 24.

Figure 24: Cross validation error for decision trees built with missing value indicator variables

Figure 24 reveals clear signs of overfitting as the tree size exceeds a certain threshold. The average cross validation error of the fully grown tree is actually larger than that of a tree with 5 leaves. Setting the complexity parameter in the interval α ∈ (0.00719424, 0.01079137) produces the recommended tree.

> T1 <- prune(T, cp = 0.01)
# Plot and annotate tree
> plot(T1)
> text(T1)

• Compare this with the tree previously selected through cross validation.
• Assess the significance of each variable in this tree.

To assess the performance of the trees using the missing value indicators on the test set we construct the ROC curve. First we need to obtain the predicted probabilities for the test set data.

# Predicted probabilities on test set for fully grown tree
> probs4 <- predict(T, newdata = Data.ind[-train,], type = "prob")
# Predicted probabilities on test set for pruned tree
> probs5 <- predict(T1, newdata = Data.ind[-train,], type = "prob")

Next I will plot the ROC curves for both trees and compare them with the previous ones.
# ROC curve for fully grown tree without missing value indicators
> roc <- roc(Data.test$BAD, probs1[,2], plot = TRUE, col="red", grid=TRUE)
# ROC curve for tree selected by 1SD rule without missing value indicators
> roc <- roc(Data.test$BAD, probs2[,2], plot = TRUE, col="blue", add=TRUE)
# ROC curve for fully grown tree including missing value indicators
> roc <- roc(Data.ind[-train,]$BAD, probs4[,2], plot = TRUE, col="green", add=TRUE)
# ROC curve for tree selected by 1SD rule including missing value indicators
> roc <- roc(Data.ind[-train,]$BAD, probs5[,2], plot = TRUE, col="black", add=TRUE)

Figure 25: ROC on test set using fully grown tree with no missing value indicators (red); pruned tree using cross validation without missing value indicators (blue); fully grown tree with missing value indicators (green); pruned tree through cross validation with missing value indicators (black)
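To make the final figure easier to read, a legend can be added to the active plot with base R's legend() function. A minimal sketch whose colours and labels match the caption of Fig. 25:

# Add a legend identifying the four ROC curves
> legend("bottomright", lwd = 2,
+        col = c("red", "blue", "green", "black"),
+        legend = c("Full tree, no indicators", "Pruned tree, no indicators",
+                   "Full tree, with indicators", "Pruned tree, with indicators"))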