MSCI 331 Data Mining for Direct Marketing and Finance
Coursework 2
Student Name:
Library card number:
Date Submitted: May 2nd 2016

Introduction:
The purpose of Coursework 2 is to consider Decision Tree and Multilayer Perceptron models in addition to the Logistic Regression model recommended in the last coursework, and to decide which is the most appropriate model to enable the bank to identify customers with a high probability of being "Bad". For each type of classifier, a number of models with different variables and option settings were examined and evaluated based on misclassification rates and plots produced by SAS EM. For comparison purposes, we made use of Lift charts in terms of Cumulative % Captured Response and of ROC curves. From all the models examined, we conclude that the Logistic Regression model is the most powerful and it is therefore the overall recommended model. Furthermore, the bank recognises that no classifier will be able to identify all "Bad" customers correctly, and so we recommend how to use our model to achieve a sensitivity of 75%.

Overall Recommended Model:
The Logistic Regression gave the most powerful model for identifying the probability of customers being "Bad". In the Regression model, all given variables were entered, as well as imputed indicators for the variables CREDIT HISTORY, TIME AT ADDRESS and PROPERTY, which the preliminary analysis of the data found to contain missing values. Because all variables with missing values were class variables, the most commonly occurring non-missing level (mode) of the corresponding variable in the sample was used. Transformations and discretised variables did not improve the performance of the model.

The Stepwise method was used to give an indication of the most significant variables. This method starts with no candidate effects in the model and then systematically adds effects that are significantly associated with the target, while removing any previously added effects that are no longer significantly associated with the target. The Stepwise regression model suggested that the most significant variables were: ACCOUNT BALANCE, CREDIT HISTORY, INSTALMENT PERCENTAGE, OCCUPATION, PURPOSE, SAVINGS and M_CREDIT (imputed indicator for CREDIT HISTORY).

The figure on the left shows the Lift chart of our model in terms of Cumulative Percentage of Captured Response based on the performance of the test data set. Looking at the first 10% of customers, our model was able to capture 30% of Bad customers. Compared to the other Logistic Regression models considered, this was the highest figure, which shows the ability of our model to capture a larger percentage of defaulters at each percentage of customers targeted.

Our model does not come without reservations. As we can observe from the confusion matrix, our model still fails to identify all customers that are expected to default; more specifically, a number of the 294 customers which our model predicted to be Good turned out to be "Bad". Another possible problem may lie in the way that missing values are treated in a Logistic Regression model. By using the Replacement node, we might replace a missing value, which perhaps corresponds to an unusual case, with an imputed typical value of the variable. Thus, imputation may lead to unreliable outcomes. The bank recognises that no classifier is able to produce a model that will identify 100% of Bad customers, and therefore we aim to use our model to achieve a Sensitivity, or Captured Response, of 75%.
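For illustration only, a rough scikit-learn analogue of this recommended pipeline (mode imputation with imputed-indicator flags for the class variables, followed by logistic regression) is sketched below. The actual models were built in SAS EM, scikit-learn has no direct equivalent of the Stepwise selection described above, and the column and target names shown are assumptions rather than the exact names in the data set.

# Illustrative sketch only: the actual models were built with SAS EM's
# Replacement and Regression nodes; this is a loose scikit-learn analogue.
# Column names and the GOOD_BAD target are assumptions, not the real names.
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Class variables found to contain missing values, and (some of) the other inputs.
missing_class_vars = ["CREDIT_HISTORY", "TIME_AT_ADDRESS", "PROPERTY"]
other_inputs = ["ACCOUNT_BALANCE", "INSTALMENT_PERCENTAGE", "OCCUPATION",
                "PURPOSE", "SAVINGS"]  # plus the remaining given variables

preprocess = ColumnTransformer([
    # Mode imputation plus an imputed-indicator column for each variable,
    # mirroring the Replacement node's most-frequent-value option.
    ("impute_mode", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent", add_indicator=True)),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), missing_class_vars),
    ("encode_rest", OneHotEncoder(handle_unknown="ignore"), other_inputs),
])

model = Pipeline([("prep", preprocess),
                  ("logreg", LogisticRegression(max_iter=1000))])

# Given a DataFrame df of customers with a 0/1 target GOOD_BAD (1 = "Bad"):
# model.fit(df[missing_class_vars + other_inputs], df["GOOD_BAD"])
# p_bad = model.predict_proba(df[missing_class_vars + other_inputs])[:, 1]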
The Receiver Operating Characteristic (ROC) chart based on the test data set of our recommended model is a plot of false positives against true positives for all cut-off probabilities. For 75% Sensitivity, looking at the Cumulative % Captured Response lift chart, we should target the 45th percentile of customers. If we consider the Cumulative % Response chart, we obtain that the 45th percentile corresponds to a % Response of approximately 52%. With 300 customers in the test set, this gives:

TP = 0.52 * (0.45 * 300) = 70.2
FN = (70.2 - (70.2 * 0.75)) / 0.75 = 23.4
FP = (70.2 - (70.2 * 0.52)) / 0.52 = 64.8
TN = 300 - 70.2 - 23.4 - 64.8 = 141.6
Specificity = 141.6 / (141.6 + 64.8) = 69%

A low cut-off probability means that more customers will be identified as Bad. As can be seen from the Threshold-based chart on the left, if the cut-off point is low, there are fewer false negatives but more false positives, in which case the test is highly sensitive but not very specific, and so the accuracy rate decreases. To achieve a 75% level of sensitivity, and therefore 69% specificity, the cut-off point should be set to approximately the corresponding value read from the Threshold-based chart.

Justification of recommended model for each type of classifier

Decision Tree
Another type of classifier considered was a decision tree. The goal of a decision tree is to build a tree that will allow us to identify various target groups based on the values of a set of input variables.

Recommended Decision Tree Model
The lift chart on the right indicates that, comparing Model 2 with Model 10 (which includes only post-pruning options) and Model 11 (which includes both post- and pre-pruning options), the best model is Model 2. Model 2 includes two variables, Account Balance and Occupation. The splitting criterion is set to Entropy reduction and the rest of the options in the Basic tab are left at their defaults. The model assessment measure is Proportion misclassified and the Sub-tree option is set to Most Leaves. No treatment of missing values has been applied because decision trees can handle missing values by treating them as a separate value of each attribute.

Our model resulted in a tree with 15 leaves. Interpreting the outcome of the tree, we can observe that the strongest split occurred on the variable Account Balance, which is at the root. According to the if-then rule, for customers with an account balance lower than £1000, 53.7% are expected to be Good customers and 46.3% are expected to be Bad, while for customers with an account balance greater than £1000, 89.6% are expected to be Good and 10.4% are expected to default. The reason for this might be that the bank has stricter policies for customers with a higher account balance. It is worth noting the misclassification rates of our model, which are 0.2375 for the Training data set and 0.2000 for the Validation data set. One weakness of the model is that it includes only two variables compared to the other types of classifiers; however, when more variables were added, we had evidence of increased overfitting. A strength of the model lies in the fact that decision trees are highly flexible and can handle missing values without any imputation, which can sometimes be unreliable.

Decision Tree Models Considered
For the first set of models considered, the splitting criterion was set to Entropy reduction, which is a measure of variability and homogeneity in categorical data; the best split is therefore the one with the smallest entropy statistic.
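For illustration, the entropy statistic is easy to reproduce outside SAS EM; the short Python sketch below evaluates it for the two child nodes of the Account Balance split quoted above (the class proportions are taken from the tree output, everything else is hypothetical).

from math import log2

def entropy(proportions):
    # Entropy (in bits) of a node with the given class proportions.
    return -sum(p * log2(p) for p in proportions if p > 0)

# Child nodes of the Account Balance split reported above:
low_balance  = entropy([0.537, 0.463])   # ~1.00: almost a 50/50 mix, very impure
high_balance = entropy([0.896, 0.104])   # ~0.48: mostly Good customers, much purer

# The Entropy reduction criterion prefers the split whose (case-weighted)
# child entropies are smallest, i.e. whose leaves are most homogeneous.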
The rest of the pre-pruning settings were initially set to their default values and remained fixed in order to conclude first on our selection of variables. In the Advanced tab, we set the Model Assessment measure to Proportion misclassified, which means that the best tree is selected as the one with the lowest misclassification rate. For the sub-tree selection option, we used Most Leaves, which selects the entire decision tree. These options were selected to enable us to construct trees as large as we wished, without SAS EM limiting their size at a later stage.

For our first model, the two variables with the highest Information Values were used, namely Account Balance and Amount. This resulted in a tree with 23 leaves and a misclassification rate of 0.3033 on the Validation set. Another model with the same options was examined, but this time we removed the variable Amount, which had a suspiciously high information value, and replaced it with the variable Occupation. This second attempt gave us a smaller tree with 15 leaves and an improved misclassification rate of 0.2. We then continued by changing the splitting criterion from Entropy reduction to Gini reduction, which is a measure of purity, but we did not observe any difference in the performance of the model.

Tree pruning is an essential process which limits the number of rules, making it easier to interpret the results while achieving stability in the decision tree. In our models, both pre- and post-pruning techniques were used to select the right-sized tree using the validation data set. As part of the pre-pruning process, we set the maximum depth of the tree to 20 for Model 2 and compared it with a more complicated tree including all variables and the same maximum depth of 20. Comparing the two models, the model with only two variables clearly performed better. In an attempt to avoid overfitting and to improve the stability and interpretability of the decision tree, we set the minimum number of observations in a leaf to 50 and the maximum number of branches from a node to 100. The result was a much smaller tree with only five leaves but higher misclassification rates. As part of post-pruning, we set the Sub-tree option to Best assessment value, which selects the best subtree based on the model assessment statistic previously specified, evaluated on the validation data set, and therefore performs post-pruning in the manner of Reduced Error pruning. We tried the post-pruning options both in combination with the pre-pruning settings and without them.

Multilayer Perceptron (MLP)
The third type of classifier considered was the Neural Network, which is designed to perform nonlinear modelling in the process flow diagram.

Recommended MLP Model
According to the lift chart illustrating the performance of the key MLP models considered, we conclude that the best model is Model 1. The model includes all given variables, including imputed indicators for missing values. The number of hidden neurons is set to two under the Noiseless data option. For the Training Technique, we chose the Levenberg-Marquardt method, while Average Error was set as the Model Selection criterion. According to the performance plot, our model has a small Average Error on the Training set, shown by the blue line, but changing the weights to improve performance on the training set causes a higher error on the validation set, which is evidence of overfitting in our data.
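For illustration only, a loose scikit-learn analogue of this two-hidden-neuron MLP is sketched below; scikit-learn offers neither Levenberg-Marquardt training nor the Average Error selection criterion, so standard substitutes are shown, and X_train, y_train denote a hypothetical encoded training set.

# Illustrative sketch only: a rough scikit-learn stand-in for the SAS EM MLP.
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(
    hidden_layer_sizes=(2,),   # two hidden neurons, as under the Noiseless data option
    solver="adam",             # substitute optimiser; Levenberg-Marquardt is not available
    early_stopping=True,       # hold back part of the training data as a validation set
    validation_fraction=0.2,   #   and stop when its score stops improving
    max_iter=2000,
    random_state=1,
)
# mlp.fit(X_train, y_train)                    # hypothetical encoded training data
# p_bad = mlp.predict_proba(X_valid)[:, 1]     # predicted probability of being "Bad"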
To avoid overfitting, the vertical line stops the process at the weight vector that produced the lowest Average Error on the validation set, a process called Early Stopping. For our model, the vertical line is at the point where the Average Error is 0.409 for the Training data set and 0.454 for the Validation data set.

MLP Models Considered
The first step in our MLP model-building process was to use a Transformation node to normalise all interval variables (Duration of the loan, Credit Amount, Instalment Percentage, Age and Existing Credit), since binary and nominal variables remain unstandardised. The node applies the formula z = (x - mean) / standard deviation, so that all transformed variables have a mean of 0 and a variance of 1. Moving forward, because the MLP cannot handle missing values, we used a Replacement node which gave us imputed indicators. Otherwise, observations that contain missing values would not be used to estimate the weights during training, resulting in poorer performance of our model.

For our first model, all variables were entered, including the imputed indicators for the variables with missing values (CREDIT HISTORY, TIME AT ADDRESS and PROPERTY). There is no exact scientific approach to selecting the number of hidden neurons. Initially, we set the Network Architecture to the Noiseless data option, which has two hidden neurons compared with the other options, which produce an MLP with only one hidden neuron, in the hope that this option would better explain the variability in the target values. For the Training Technique, we chose the Levenberg-Marquardt method, which is the most reliable and thus more suitable for a hard optimisation problem such as the training of an MLP. For the first attempt, Average Error was set as the Model Selection criterion. For Model 2, the Model Selection criterion was changed to Misclassification rate, but in our case the performance of the model did not improve.

An attempt to improve the performance was also made by increasing the number of hidden neurons, and the following rule of thumb was used to decide on the appropriate number:

Number of hidden neurons = 1/2 (#inputs + #outputs)

In our case, we have 31 input variables (including the imputed indicators), so according to the rule we set the number of hidden neurons to 16. Increasing the number of hidden neurons to 12 enables the MLP to achieve a much smaller error on the training set, suggesting near-perfect prediction power, but at the same time we have evidence of increased overfitting in our data, as we can observe from the performance plot below. Because there is no variable selection method such as Stepwise available for the MLP, we considered two models including only the variables that were found to be significant by the Stepwise Logistic Regression. In the first model we used the rule of thumb and, since our Regression model suggested 8 variables, the number of hidden neurons was set to 4, while in the second model we set the option to Noiseless data.

Comparison of the best models of each type of classifier
The lift chart compares the performance of the best models of each type of classifier. According to this output, the Decision Tree underperformed both the Logistic Regression and the Multilayer Perceptron models. This may be because the tree includes only two of the given variables, compared with the other two methods, which include more variables and therefore have greater predictive power.
The performances of the Logistic Regression and the Multilayer Perceptron models are very similar, but looking at the first percentiles of customers, the Logistic Regression model has a higher Captured Response and is therefore preferred. The reason might be that for the Logistic Regression we had the Stepwise method available, which identified the most significant variables, something that was not the case with the Neural Network model.
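The Cumulative % Captured Response used for these comparisons was produced by SAS EM's Assessment node; purely as a reference, it can be reproduced for any scored test set with a short sketch such as the Python function below (illustrative only; it assumes arrays of true 0/1 labels, with 1 = "Bad", and predicted probabilities of being Bad).

import numpy as np

def cumulative_captured_response(y_true, p_bad, percentile):
    # % of all Bad customers captured in the top `percentile`% of customers,
    # ranked by predicted probability of being Bad.
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(p_bad))            # highest predicted risk first
    n_top = int(round(len(order) * percentile / 100))
    return 100 * y_true[order[:n_top]].sum() / y_true.sum()

# e.g. cumulative_captured_response(y_test, p_bad, 10) should be roughly 30
# for the recommended Logistic Regression model described above.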