Introduction
The aim of this report is to provide a bank with a logistic regression model that identifies customers at high risk of defaulting on their loans. The bank supplied data on 1000 past loans, together with a flag showing whether each customer was 'good' (repaid the loan) or 'bad' (failed to repay it). The structure of this report is as follows: part A covers the exploration of the data through a preliminary analysis using SAS Enterprise Miner, part B describes the chosen model and how it can be used to achieve sensitivities of 50%, 70% and 90%, and part C describes the range of models considered before settling on the best one.
Preliminary Investigation
The table below (extracted from the Insight node) shows the 16 predictors (inputs) used for these models, together with the model role, measurement level and type of each variable. As can be seen from the table, the variable GOOD_BAD has a model role of 'target' and indicates whether a customer defaulted on the loan or not.
After inputting the data I split it into three sets for the analysis: the training set (40% of the total data), used for building the models; the validation set (30%), used for comparing the models; and the test set (30%), used to assess the chosen model.
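For readers who want to reproduce this partition outside Enterprise Miner, a minimal sketch in Python is given below. It is only an illustration under assumptions: the file name loans.csv is hypothetical, the column name GOOD_BAD is taken from the variable list, and the stratified 40/30/30 split mirrors the partition settings described above.

```python
# Minimal sketch of a stratified 40/30/30 train/validation/test split
# (assumes a CSV export of the bank data with a GOOD_BAD target column).
import pandas as pd
from sklearn.model_selection import train_test_split

loans = pd.read_csv("loans.csv")   # hypothetical export of the 1000 past loans

# Carve off the 40% training set, stratified on the target.
train, rest = train_test_split(
    loans, train_size=0.40, stratify=loans["GOOD_BAD"], random_state=1)

# Split the remaining 60% equally into validation (30%) and test (30%).
valid, test = train_test_split(
    rest, test_size=0.50, stratify=rest["GOOD_BAD"], random_state=1)

print(len(train), len(valid), len(test))   # roughly 400 / 300 / 300
```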
The figure on the left shows the distribution of the target variable based on the metadata sample. As shown in the plot, there is a 30% probability that a new customer will be bad (in other words, will default) and a 70% probability that the customer will be good.
Visualisation of Data
To make sure the modelling techniques are well supported, the dataset needs to be analysed, modified and prepared appropriately. The first step in analysing the data is to visualise it. Below are the histograms of the five interval (continuous) variables used in this case.
[Histograms of the interval variables, split into the panels GOOD_BAD = 0 and GOOD_BAD = 1]
GOOD_BAD = 0 represents the responses of the good customers, while GOOD_BAD = 1 represents those of the bad customers. The histograms above show clear differences in shape between the two response categories for the variables install_perc, duration and age. This suggests that these three variables have discriminatory power over the target variable and can be good predictors.
The scatter plots on the left show the pairwise relationships between the five interval variables. From the figure we can conclude that there is a strong positive correlation between amount and duration, so the two variables tend to move in the same direction, although this does not hold for every observation.
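The sketch below reproduces this exploratory step on the same assumed loans DataFrame: per-class histograms of the interval variables and their correlation matrix. The exact list of interval variables is an assumption based on the names mentioned in the text.

```python
# Sketch: histograms of the interval variables split by GOOD_BAD (0 = good, 1 = bad),
# plus their pairwise correlations (uses the `loans` DataFrame from the earlier sketch).
import matplotlib.pyplot as plt

interval_vars = ["AMOUNT", "DURATION", "AGE", "INSTALL_PERCENT"]  # assumed subset

fig, axes = plt.subplots(len(interval_vars), 2, figsize=(8, 10))
for i, var in enumerate(interval_vars):
    for j, label in enumerate([0, 1]):
        axes[i, j].hist(loans.loc[loans["GOOD_BAD"] == label, var], bins=20)
        axes[i, j].set_title(f"{var}, GOOD_BAD = {label}")
plt.tight_layout()
plt.show()

# Pairwise correlations; AMOUNT vs DURATION should come out strongly positive.
print(loans[interval_vars].corr().round(2))
```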
Interactive Grouping
An additional way to analyse the data before modelling is interactive grouping. Through interactive grouping we can group variable values and collect important information about the Information Value and the Weight of Evidence (WOE) of each variable. The Information Value measures the predictive power of a variable to distinguish between the target event (good_bad = 1) and the target non-event (good_bad = 0); the higher the Information Value, the better. It is the weighted average of the WOE across the groups, where the WOE measures the relative risk of each group/class: a highly negative WOE means a high risk of default, while a highly positive WOE indicates low risk. I passed the variables to the Interactive Grouping node and set the Information Value as the commit criterion with a commit value of 0.3. The output summary listed below shows which variables have an Information Value above 0.3 and therefore have discriminatory power and are good predictors (these variables have a Keep status of YES).
Interactive Grouping Output Summary
Variable Keep Commit Type Auto-Grouped Infoval Committed Infoval Auto-Grouped Gini Committed Gini
ACCOUNT_BALANCE YES AUTO 0.73538 0.73538 39.8116 39.8116
AGE YES AUTO 0.37071 0.37071 33.7303 33.7303
AMOUNT YES AUTO 0.49163 0.49163 34.4145 34.4145
CREDIT_HISTORY NO . . . .
DURATION YES AUTO 0.33477 0.33477 31.6449 31.6449
EMPLOYED NO 0.07832 . 14.9501 .
EXIST_CREDIT NO 0.02042 . 6.8367 .
INSTALL_PERCENT YES AUTO 0.40160 0.4016 30.5074 30.5074
OCCUPATION NO . . . .
OTHER NO . . . .
PROPERTY NO 0.02311 . 6.7241 .
PURPOSE NO . . . .
RES_STATUS NO . . . .
SAVINGS NO 0.11472 . 13.549 .
TIME_AT_ADDRESS NO 0.00485 . 3.6405 .
A more detailed chart of the groups and the WOE for the variable duration is shown below. The tab shows a distribution histogram for duration with reference lines that define the groups. On the bottom left there is a bar chart showing the proportion of good customers (0's) and bad customers (1's) in each group, and on the bottom right a line plot showing the WOE value of each group. Although there is no single way to obtain a satisfactory grouping, a roughly monotonic WOE curve is preferred. For example, the WOE of group 1, which contains durations from 0 to 9, is close to 1, so customers falling into that group carry a low risk of defaulting on their loans.
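The WOE and Information Value figures reported by the Interactive Grouping node can also be computed directly; a minimal sketch is shown below. Only the first cut point (duration 0 to 9) comes from the output above; the remaining cut points are assumptions for illustration.

```python
# Sketch: Weight of Evidence (WOE) per group and Information Value (IV) for DURATION,
# treating GOOD_BAD = 1 (bad) as the event. Uses the `loans` DataFrame from earlier.
import numpy as np
import pandas as pd

def woe_iv(df, var, target, bins):
    groups = pd.cut(df[var], bins=bins)
    counts = pd.crosstab(groups, df[target])      # columns: 0 = good, 1 = bad
    dist_good = counts[0] / counts[0].sum()
    dist_bad = counts[1] / counts[1].sum()
    woe = np.log(dist_good / dist_bad)            # positive WOE = lower risk of default
    iv = ((dist_good - dist_bad) * woe).sum()     # weighted average of the WOE differences
    return woe, iv

woe, iv = woe_iv(loans, "DURATION", "GOOD_BAD", bins=[0, 9, 18, 27, 36, 72])
print(woe.round(3))
print("Information Value:", round(iv, 3))         # kept if above the 0.3 commit value
```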
Variable Transformation
From the visualisation of the data above it is clear that some interval input variables have highly skewed distributions. In such cases a small percentage of the points may exert a disproportionate influence on the fit. To overcome this, I transformed the variables with highly skewed distributions and used their logs, in order to normalise their distributions and obtain a better-fitting model. This was done for the variables amount, age and duration. Below is the distribution of the variable age before and after the transformation.
[Figure: distribution of age before (left) and after (right) the log transformation]
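Outside Enterprise Miner the same transformation can be sketched as follows; np.log1p is used here as an assumption to guard against zero values, whereas the Transform node applies a plain log.

```python
# Sketch: log-transform the right-skewed interval variables
# (uses the `loans` DataFrame from the earlier sketches).
import numpy as np

for var in ["AMOUNT", "AGE", "DURATION"]:
    loans[f"LOG_{var}"] = np.log1p(loans[var])    # log(1 + x) to avoid log(0)

# Skewness should drop noticeably after the transformation.
print(loans[["AGE", "LOG_AGE"]].skew().round(2))
```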
Missing Values
Because regression cannot handle records with missing values on its own, a Replacement node was added to substitute the missing data and obtain more accurate and reliable results. The best approach is to introduce new binary variables through the Replacement node by generating imputed indicators; these flag which values were missing in the original dataset, after which the missing values are substituted. The table on the right shows which variables have missing values: 9% of the values for CREDIT_HISTORY are missing, 13% for TIME_AT_ADDRESS and 15% for PROPERTY. For interval variables the most common replacement is the mean, but in this case no interval variables have missing values. For class variables the most popular replacement is the most frequently occurring level (the mode), which is the default in the Replacement node. Every replacement used in the models that follow imputes the missing values of the three class variables with the mode.
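A minimal sketch of this replacement step, applied to the three partitions from the earlier split, is given below. Taking the mode from the training set is my assumption of good practice; the Replacement node computes it from its own sample.

```python
# Sketch: mode imputation for the three class variables with missing values,
# plus binary "imputed" indicators flagging where a value was originally missing.
class_vars = ["CREDIT_HISTORY", "TIME_AT_ADDRESS", "PROPERTY"]

for part in (train, valid, test):                 # partitions from the earlier split
    for var in class_vars:
        part[f"M_{var}"] = part[var].isna().astype(int)      # 1 = value was missing
        part[var] = part[var].fillna(train[var].mode()[0])   # mode taken from the training set

print(train[[f"M_{v}" for v in class_vars]].mean().round(2))  # roughly 0.09, 0.13, 0.15
```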
Description of the Recommended Logistic Regression Model
After considering and trying a range of models, the one that performs best is Model 7, which consists of a Replacement node and a Regression node, as can be seen in the figure on the left. All the variables were used in this model, so replacing any missing values was a sensible way to ensure the accuracy of the results. As mentioned above, only the three class variables CREDIT_HISTORY, TIME_AT_ADDRESS and PROPERTY had missing values. They were replaced with the mode, the most common choice for class variables, and imputed indicators were generated. A regression was then run with all the variables plus the new indicators for the three class variables with missing values. The selection method was Stepwise, which starts with no candidate effects in the model and adds effects that are significantly associated with the target; it may also remove an effect already in the model that is no longer significantly associated with the target. The table on the left lists the significant variables that make up the best model, together with their coefficients and p-values.
The same table has an 'Estimate' column, which shows the coefficient of each variable with respect to the target. The coefficient measures how strongly the input and the target are associated on the log-odds scale. For example, the variable account_balance 1 has an estimate of 1.1337, meaning that each additional unit of that variable increases the log-odds of the target event by 1.1337. Because this column is on the log scale, the exponential estimate column gives a clearer picture: exp(1.1337) ≈ 3.107, so each additional unit of account_balance 1 multiplies the odds of the target event by about 3.1.
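The link between the two columns is simple exponentiation, which can be checked directly:

```python
# Sketch: the exponential estimate is exp(coefficient), i.e. an odds ratio.
import math

beta = 1.1337                    # estimate for account_balance 1 from the table
print(round(math.exp(beta), 3))  # approximately 3.107: the odds of the target event
                                 # are multiplied by about 3.1 per unit increase
```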
The figure on the right shows the confusion matrix of the recommended model. The model predicts that 307 of the customers are going to be good but the actual results show that 53 of them were bad. Compared with the range of models considered, this model performed the best in the prediction of the good customers.
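The confusion-matrix figures quoted above come from Enterprise Miner. For illustration only, the sketch below fits a rough approximation of Model 7 (a plain logistic regression with no stepwise selection, on one-hot encoded inputs) and prints its confusion matrix; it builds on the split and imputation sketches shown earlier and assumes the class variables are stored as strings or categories.

```python
# Sketch: an approximation of the recommended model outside Enterprise Miner
# and its confusion matrix on the test partition (GOOD_BAD = 1 is the event).
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

X_train = pd.get_dummies(train.drop(columns="GOOD_BAD"))
X_test = (pd.get_dummies(test.drop(columns="GOOD_BAD"))
            .reindex(columns=X_train.columns, fill_value=0))

model = LogisticRegression(max_iter=1000).fit(X_train, train["GOOD_BAD"])
pred = model.predict(X_test)                       # default 0.5 cut-off
print(confusion_matrix(test["GOOD_BAD"], pred))    # rows: actual, columns: predicted
```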
Cumulative % Captured Response curve
The chart below shows the lift chart of cumulative % captured response for the preferred model, based on the test data. This chart shows how well the model identifies the customers at risk. The blue line, called the baseline, shows that if we take 10% of the population without using a model, we capture only 10% of the customers that defaulted. The red line represents our model, which clearly performs better than the baseline: taking 10% of the population captures nearly 30% of the customers at risk (20 percentage points more than the baseline). The larger the gap between the two lines, the better the model performs compared with using no model at all. As the percentile of the population increases, the % captured response increases as well.
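The captured-response curve can be rebuilt from scored data; a sketch based on the approximate model above follows, using deciles of predicted risk with the highest-risk customers first.

```python
# Sketch: cumulative % captured response by decile of predicted default risk
# (uses `model`, `X_test` and `test` from the previous sketch).
import pandas as pd

scores = model.predict_proba(X_test)[:, 1]            # probability of GOOD_BAD = 1
scored = (pd.DataFrame({"actual": test["GOOD_BAD"].values, "score": scores})
            .sort_values("score", ascending=False)
            .reset_index(drop=True))

scored["decile"] = scored.index // (len(scored) // 10)          # ten equal bins (assumed)
captured = scored.groupby("decile")["actual"].sum().cumsum() / scored["actual"].sum()
print((captured * 100).round(1))   # the first decile should sit well above the 10% baseline
```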
ROC curve
The exhibit on the right shows the Receiver Operating Characteristic (ROC) curve for Model 7 on the test data. The ROC chart shows the trade-off between sensitivity and specificity across all possible cut-offs; the larger the area under the curve, the more sensitive and specific the model. To obtain a sensitivity of 50%, and hence a cumulative % captured response of 50%, we should target the 20th percentile, as read off the % captured response chart. From the % response chart, the 20th percentile corresponds to approximately a 72% response. These numbers are used in the formulas below to find the true/false positives, the true/false negatives and the specificity.
TP = 0.72 × (0.20 × 300) = 43.2
FN = [43.2 − (43.2 × 0.50)] / 0.50 = 43.2
FP = [43.2 − (0.72 × 43.2)] / 0.72 = 16.8
TN = total − TP − FN − FP = 300 − 43.2 − 43.2 − 16.8 = 196.8
Having calculated these quantities, we can now obtain the specificity:
TNR = 196.8 / (196.8 + 16.8) = 0.92, i.e. a specificity of 0.92 at a sensitivity level of 50%. Following the same procedure for sensitivities of 75% and 90%, we obtain the results below (a reusable sketch of this calculation follows the table):
Sensitivity 75% 90%
Specificity 0.75 0.47
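The worked calculation above generalises to any targeted percentile and sensitivity level; a small sketch of the same arithmetic is given below (the 72% response, the 20th percentile and the 300 test cases are the figures used in the example).

```python
# Sketch: specificity implied by a chosen sensitivity, given the % response
# read off the lift charts at the targeted percentile of the test population.
def specificity_from_lift(response, percentile, total, sensitivity):
    tp = response * (percentile * total)           # true positives captured
    fn = tp * (1 - sensitivity) / sensitivity      # from sensitivity = TP / (TP + FN)
    fp = tp * (1 - response) / response            # from response   = TP / (TP + FP)
    tn = total - tp - fn - fp
    return tn / (tn + fp)                          # specificity (true negative rate)

print(round(specificity_from_lift(0.72, 0.20, 300, 0.50), 2))   # approximately 0.92, as above
```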
If the cut-off point is increased, there are fewer false positives and more false negatives, so the model becomes highly specific but not sensitive enough. If the cut-off point is decreased, there are fewer false negatives and more false positives, so the model becomes highly sensitive but less specific than before. To obtain a higher sensitivity, the cut-off point therefore has to be lowered, but specificity must be taken into account as well; a trade-off between the two is recommended to achieve the best combination, as the sketch below illustrates.
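The sketch below makes this trade-off visible for the approximate model scored earlier, listing sensitivity and specificity over a few representative cut-offs.

```python
# Sketch: sensitivity and specificity as the cut-off moves
# (uses `scores` and `test` from the earlier scoring sketch).
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(test["GOOD_BAD"], scores)
for cutoff, sens, spec in zip(thresholds, tpr, 1 - fpr):
    if 0.2 <= cutoff <= 0.6:                       # a few representative cut-offs
        print(f"cut-off {cutoff:.2f}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```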
Justification of the Recommended Logistic Regression Model
Before concluding which model performs best, a range of models was considered; these models, together with the key information about each, are listed below.
Models 1-3: Regression with All variables and three different methods
After visualising and analysing the data, the first step was to run three regression models with all the interval variables, as most of them have discriminatory power. I ran one regression model with the selection method set to None (Model 1), one with the Forward method (Model 2) and one with the Stepwise method (Model 3). I chose not to use the Backward method, as it is not recommended for binary target variables. The model with the None method (Model 1) had a higher chi-squared and better lift charts than the other two.
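scikit-learn has no true stepwise procedure, but forward selection gives a rough equivalent of the Forward/Stepwise options; a hedged sketch is shown below, re-using the encoded training data from the earlier model sketch, with the number of effects to keep assumed.

```python
# Sketch: forward selection as an approximation of the Forward/Stepwise methods
# (uses `X_train` and `train` from the earlier model sketch).
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=6,        # assumed number of effects to keep
    direction="forward",
    cv=5,
)
selector.fit(X_train, train["GOOD_BAD"])
print(list(X_train.columns[selector.get_support()]))   # the selected inputs
```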
Model 4: Regression with three interval variables and the none method
Based on Model 1 and after assessing the T-scores of the variables, I chose the three interval variables with the highest T-scores: install_percent, duration and age. I ran a regression model with these three variables using the default None method (Model 4). Again Model 1 performed better, with better lift charts and a higher chi-squared.
Models 5-6: Stepwise Regression of Interval Variables plus Transformation
Based on the preliminary analysis and the distributions of the highly skewed interval variables, I added a Transform node and transformed the variables amount, age and duration, as mentioned in the preliminary part. I ran a regression model with the Stepwise method using the logs of the transformed variables together with the remaining two interval variables (Model 5). It is worth mentioning that the two most significant variables in the stepwise results were the logs of the transformed age and duration. I then repeated the same process using the None method instead of Stepwise (Model 6). Despite the transformations, neither of the two models managed to outperform the initial Model 1.
Model 7: Regression with All variables and missing values (Preferred Model)
In this model I included all the variables but, to ensure the accuracy and reliability of the results, I also used the missing-value indicators. I used the Stepwise method, and the variables that make up the best model were: account_balance, amount, credit_history, install_percent, occupation and time_at_address.
Model 8: Stepwise Regression with creation of new variables
Taking into account the T-scores of the most significant variables from Model 7, I noticed that some categories of the variables were more significant than others. From the table 'Analysis of Maximum Likelihood Estimates' on page 5, the category account_balance 1 has a higher coefficient and a higher T-score than account_balance 2. Based on the T-scores, I created new variables by separating out the most significant categories of account_balance, occupation, time_at_address and credit_history. I ran a Stepwise regression model with all the variables, including the newly created ones, but the model failed to perform better than Model 7.
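A minimal sketch of this kind of variable creation is shown below; the category code 1 for ACCOUNT_BALANCE follows the estimates table, the numeric coding is an assumption, and the same pattern would apply to the other variables mentioned.

```python
# Sketch: a binary flag isolating the most significant category of ACCOUNT_BALANCE
# (uses the `loans` DataFrame from the earlier sketches; category coding assumed).
loans["ACCOUNT_BALANCE_IS_1"] = (loans["ACCOUNT_BALANCE"] == 1).astype(int)
print(round(loans["ACCOUNT_BALANCE_IS_1"].mean(), 2))   # share of customers in that category
```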
Model 9: Stepwise Regression model with Interactive Grouping
Finally, I ran a model with interactive grouping. After running the interactive grouping I used the WOE variables for the inputs with a Keep status of YES, together with all the original variables. Although many of the most significant variables were WOE variables, the model's chi-squared and lift charts were inferior to those of Model 7.
The charts below show the cumulative % captured response of the key models on the validation data set and the test data set. Based on the validation set, Model 9 (pink line) performs best in the first percentile, but Model 7 (red line) then dominates and overall is considered the best model. Based on the test data set the decision is far easier, as Model 7 captures the most customers at risk in every percentile compared with the other models.
Another way to assess and compare the models is the chi-squared statistic.
[Charts: cumulative % captured response for the key models on the Validation Set (left) and the Test Set (right)]