Preliminary Information
• Do not model your answer on the workshop material. The objective of the workshops is to introduce you to different data mining tasks discussed in lectures, and not to give you a roadmap on how to answer the coursework. Therefore, if you simply reproduce the steps in the workshops you are very likely to make serious mistakes.
• In this coursework you will be assessed on your understanding of the data mining process, your ability to use the tools that we covered in the course correctly, and your ability to draw correct conclusions from what you observe. You will not be assessed on your ability to use R or any other software. Therefore, don’t include information about commands you used, options you set, how to draw a figure etc. You will simply be wasting valuable space.
• You are free to use any software to do the coursework. However, you can’t excuse a failure to complete a particular task on the grounds that the software you chose doesn’t offer a capability which we covered in the workshops.
• The page limit for this report is 7 pages using at least 11 point typeface. This limit is strict and it includes appendices (which I strongly recommend that you don’t use). Standard penalties apply for exceeding this limit.
• The coursework report is not a business report and there is no need to include a title page, table of contents, executive summary etc. It is important that you address all tasks/questions and that the answer to each question is clearly delineated.
• Please pay particular attention to the disclaimer at the end of the assignment that gives more details about the assessment of your report.
• This is an individual piece of assessment, and you should ensure that your report reflects your own work exclusively.
All reports go through automated software to detect plagiarism from a variety of sources (including past and current students’ reports as well as online resources, conference and journal publications etc.). The consequences of plagiarism are very serious.

Important
This coursework uses the same dataset as the first coursework. Your understanding of the dataset will clearly be informed by the work you did for the first coursework, but do not reference your first coursework in this assignment. In other words, don’t include claims like: “As I discussed in the first coursework, the important variables for this are ...”. I have no way of verifying which is your coursework (as marking is done anonymously) and therefore can’t verify whether such claims are valid or reasonable. If you want to refer to any conclusions from the first coursework you must provide a justification.

Description of the Problem and the Data
A bank has provided you with a sample of 2000 observations concerning the repayment behaviour of mortgage customers over the past two years. (For a variable description, refer to the previous coursework.) The overall objective of the bank is to develop an understanding of what influences repayment behaviour, and to use this knowledge to make future decisions. The bank faces a trade-off between, on the one hand, accepting customers to retain its share in the mortgage loan market and increase its profit through interest payments and, on the other hand, incurring losses due to customers defaulting on their debt.

The bank managers are interested in the following questions:
• How many and which are the most important variables that determine the repayment behaviour of mortgage customers?
• What is the best way to model the repayment behaviour given the data that you have? In other words, which model would you recommend for this task?
• What is the best way for the bank to use a statistical model to achieve the following goals:
  – Accept the maximum number of good customers if at least 90% of bad customers are correctly identified
  – Accept at least 75% of good customers while rejecting as many bad customers as possible
  (The two models need not be the same.)
• If the previous two goals were not specified, would the model you recommend change, and if so how? Justify your answer appropriately.
• Would the answer to the previous questions change (and how) if the bank imposed the further constraint that the model used needs to be interpretable?
• Provide a short discussion of the relative merits and disadvantages of the classification models that you developed for this specific task.

Model Development and Justification
• For each classifier type (Logistic Regression, Decision Tree, and Multi-Layer Perceptron): perform the appropriate data pre-processing and discuss how to choose “appropriate” settings for the most important parameters of the classifier (consider the choice of input variables as part of this question, in addition to other parameters specific to each classifier).
• For each classification method, develop one or a few candidate models that you think are promising before providing a final recommendation of the most appropriate model (for each question in the previous section).
• Describe the models you investigated before making your recommendation for each type of classifier. Indicate in a clear and logical fashion the steps you followed and justify the different data pre-processing options and parameter settings you tried, including details of:
  – Missing value options you tried, and why.
  – Whether and why you chose different encodings of the variables and what the outcome was.
  – How the input variables were selected for each model and why you followed this approach.
You do not need to include every possible model that you tried in detail, but you must include the results for what you consider to be the important steps in the process that led to your final recommendations.
• For each classification method, justify the recommended model(s), including appropriate performance measures. Comment on your findings and the generalisation performance of the model(s) you recommend for each type of classifier.
• Discuss the strengths and weaknesses of each model in relation to the objectives of this project. (Relate this discussion to the objectives in the previous section.)

Report Assessment
Your coursework will not be evaluated by the quality of the final model alone, or by whether you got a particular answer right. You will be primarily assessed on whether you are able to correctly justify the steps you took to complete the assignment. In other words, your report needs to document that you are able to intelligently analyse the provided data, that you draw correct conclusions from your observations, and that these conclusions lead you either to the next logical step of the data mining process, or to the revision of decisions made in previous steps of the analysis. (Refer to the flowchart of data mining stages we covered in the first lectures, and in particular to the feedback loops.)

Therefore, don’t simply present the conclusions/results of your analysis and expect to get a high mark. Reports that don’t document the steps followed and the reasons why these were chosen will receive minimal marks, even if the final answer is sensible.

Explain your reasoning clearly and in good English. Don’t provide a list of bullet points, or unstructured sentences etc. Similarly, don’t include figures or any other output from R that you don’t comment on or explain in the text. I will not assume that you know how to interpret these correctly.