Assignment title: Information


Big Data Analytics MSc Data Mining Assignment: 2016

Data Mining Assignment: Case Study of American Charity Data

Contents

1 Introduction
  1.0 Learning Outcomes
  1.1 Deadline
  1.2 Submission
  1.3 Assessment for this module
  1.4 Problem outline
  1.5 Variable details
    1.5.1 Target Variable
    1.5.2 Socio-Economic Variables for Individual's Locality (based on ZIP code)
    1.5.3 Individual's giving profile to date (the date being late 1997!)
    1.5.4 Locality Variable
  1.6 Dataset Availability
  1.7 Background and Aim of this Investigation
    1.7.1 Scenario
  1.8 Hints and Tips
    1.8.1 Moving Data Between SAS products
    1.8.2 Moving Data from Enterprise Miner
    1.8.3 Understand the variables
    1.8.4 SAS date variables and labels
  1.9 Report Format
2 Specific Analysis
    2.0.1 Initial Exploration of the Data and Data Preparation
  2.1 Formation of Two datasets
    2.1.1 - i Dataset A
    2.1.1 - ii Dataset B
    2.1.2 Questions using Unsupervised Methods
    2.1.3 Questions on Supervised Techniques
  2.2 Summary of Breakdown of Marks

1 Introduction

1.0 Learning Outcomes

This assignment will assess the following learning outcomes:

1) Apply and justify a range of data mining techniques to the extraction of business information from data.
2) Critically evaluate the validity of the techniques employed with respect to the relevant data and also to their intended use.
3) Interpret the intelligence provided by the data mining process in a practical business setting.
4) Effectively communicate the data mining process and its results to a decision maker.

1.1 Deadline

23:59 Sunday 22nd May 2016

1.2 Submission

You should submit your work TWICE on Blackboard: once to Turnitin so that it is checked for plagiarism, and secondly to an area that permits electronic marking and feedback. It should ideally be submitted as a Word document. You will find both submission areas in the assessment section of the Blackboard site.

1.3 Assessment for this module

This module will be assessed via a case study. This will involve the analysis of a dataset that is described below. In order to make this task more manageable, and to allow feedback as you progress through the module, the case study will be broken down into smaller aspects of the analyses. The marks for each aspect are shown alongside the question so that you have some idea of how much each part contributes. This should also help you decide how long to spend on each question. There are a total of 200 marks available. This will be converted to a percentage. You need 40% to pass this module.

1.4 Problem outline

The aim of this assignment is to analyse a sample of data extracted from an American national veterans' organisation. Although this is an old dataset, it still encapsulates the type of problems typical today.
The veterans' organisation is non-profit-making and provides programs and services for US veterans with spinal cord injuries or disease. The results of a recent fund-raising appeal for this organisation are provided in a dataset that is described below. This mailing was sent to a total of 3.5 million donors who were on the organisation's database as of June 1997. Everyone included in this mailing had made at least one prior donation. The mailing included a gift (or "premium") of personalised name and address labels plus an assortment of 10 note cards and envelopes. All of the donors who received this mailing were acquired through similar premium-oriented appeals.

The population for this analysis is the lapsed donors who received the June '97 renewal mailing (appeal code "97NK"). The analysis data set therefore contains a subset of the total universe who received the mailing. The analysis file includes 5000 lapsed donors who received the mailing, with responders to the mailing marked with a flag in the TARGET_B field.

The overall response rate for this direct mail promotion is 5.1%. The average donation amount among the responders is $15. The package cost (including the mail cost) is $0.68 per piece mailed.
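As a rough illustration of what these three figures imply, the expected profit per piece mailed and the break-even response rate can be worked out directly. This is a minimal sketch in Python and is not part of the original brief. The function and constant names are hypothetical, and it assumes that every responder gives exactly the $15 average.

```python
# Figures quoted in the problem outline.
RESPONSE_RATE = 0.051   # overall response rate of the mailing
AVG_DONATION = 15.00    # average donation among responders, in $
COST_PER_PIECE = 0.68   # package cost including mail cost, in $


def expected_profit_per_piece(p_respond, avg_donation=AVG_DONATION,
                              cost=COST_PER_PIECE):
    """Expected profit ($) from mailing one donor who responds with probability p_respond.

    Assumption (not from the brief): every responder gives the average donation.
    """
    return p_respond * avg_donation - cost


def break_even_rate(avg_donation=AVG_DONATION, cost=COST_PER_PIECE):
    """Response probability at which the expected donation equals the mailing cost."""
    return cost / avg_donation


if __name__ == "__main__":
    print(f"Profit per piece when mailing everyone: ${expected_profit_per_piece(RESPONSE_RATE):.3f}")
    print(f"Break-even response rate: {break_even_rate():.1%}")
```

Under these assumptions, mailing everyone returns about $0.085 per piece, and a donor is only worth mailing if their probability of responding is above roughly 4.5%.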
1.5 Variable details

The variables included for this exercise are as follows:

1.5.1 Target Variable

TARGET_B   Target Variable: Binary Indicator for Response to 97NK Mailing
           1 = Responded to mailing
           0 = Did not respond
           (Note: NK mailings are blank cards with labels)

1.5.2 Socio-Economic Variables for Individual's Locality (based on ZIP code)

HV2        Average Home Value in hundreds ($)
HU1        Percent Owner Occupied Housing Units
HU2        Percent Renter Occupied Housing Units
HU3        Percent Occupied Housing Units
HU4        Percent Vacant Housing Units
HU5        Percent Seasonal/Recreational Vacant Units
RP500      Percent Renters Paying >= $500 per Month
RP400      $500 > Percent Renters Paying >= $400 per Month
RP300      $400 > Percent Renters Paying >= $300 per Month
RP200      $300 > Percent Renters Paying >= $200 per Month
RP00       Percent Renters paying < $200 per Month
IC4        Average Family Income in hundreds
IC15       Percent Families w/ Income < $15,000

1.5.3 Individual's giving profile to date (the date being late 1997!)
NUMPROM    Lifetime number of promotions received to date
RAMNT_3    Amount ($) of gift for 96NK
RAMNT_14   Amount ($) of gift for 95NK
RAMNT_24   Amount ($) of gift for 94NK
RAMNTALL   Dollar amount of lifetime gifts to date
NGIFTALL   Number of lifetime gifts to date
MAXRAMNT   Dollar amount of largest gift to date
LASTGIFT   Dollar amount of most recent gift
LASTDATE   Date associated with the most recent gift
AVGGIFT    Average dollar amount of gifts to date
CONTROLN   Control number (unique record identifier)
AMPERGFT   Average of all gifts received in last 36 months
AVNKGIFT   Average of all gifts received in last 36 months for promotions consisting of blank cards with labels
PGIFTH     Proportion of lifetime gifts to lifetime number of promotions
MYAMNT     Total of gifts from four particular promotions:
           • 95 promotion with blank cards with labels
           • 96 promotion with Christmas cards with labels
           • 96 promotion with general greeting cards (an assortment of birthday, sympathy, blank, & get well) with labels
           • 96 promotion with calendars with stickers but without labels
SUSPECT    Estimate of the ratio of gifts to solicitations from June 1996 to June 1997; see Georges and Milley (1999)†
MAJOR      Major ($$) Donor Flag
           "Not majo" = Not a Major Donor
           "Major" = Major Donor

1.5.4 Locality Variable

State      State abbreviation (a nominal/symbolic field)

1.6 Dataset Availability

The full dataset of 5000 observations is available on the SAS server and is called charity, i.e. charity.sas7bdat. The path to this from either SAS Enterprise Guide or Enterprise Miner is:

E:\SHUUsers\!SharedData\Teresa\BD Assign

You will need to create a library to access it.

† Georges and Milley (1999) KDD'99 Competition: Knowledge Discovery Contest. http://www-cse.ucsd.edu/users/elkan/saskdd99.pdf (last accessed 2/04/05)

1.7 Background and Aim of this Investigation

You are required to analyse this data using suitable statistical software.
Although this is a real dataset, the aim is to use it in the fictitious situation described below.

1.7.1 Scenario

You work for an imaginary, small, up-and-coming data mining consultancy company. Your firm has been commissioned by the charity owning the data to undertake the following two main tasks. These scenarios are provided as guidance on what to expect later on and what to look for in the initial stages. Please see the "Specific Analysis" section for details of what is required. The two scenarios are:

a) Currently, all potential donors on their database are mailed. It would be much more efficient to mail only a subset of these, chosen by their likelihood to respond. Given the costs and potential profit, can you find a suitable model to predict who is likely to respond, together with a mailing strategy to maximise profit? You should, using your model, aim to explain why certain individuals are likely to respond.

b) Also of interest is the desire to buy in additional databases in order to allow the mailing of new potential donors not currently on the charity's database. The first stage of this is to profile existing donors (i.e. those on their existing database) and then apply these profiles to new potential donors. This will allow for the development of new types of promotion and eventually subsequent modelling. However, the first stage would just be to look for patterns in the data. Obviously this would need to be done using only socio-economic variables (since the bought-in databases would only contain this information, and would clearly not include information on most of the other variables).

This leads to a further scenario: profile potential donors on the current data set using only the list of socio-economic variables. Investigate these profiles for similarities, and explain what each profile represents.
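For scenario a), one simple way to turn predicted response probabilities into a mailing strategy is to rank donors by probability and mail down the list for as long as the cumulative expected profit keeps rising. The sketch below is illustrative only and is not a required method. It is written in Python, the function name and the example scores are hypothetical, and it assumes every responder gives the $15 average donation quoted in section 1.4.

```python
def best_mailing_depth(probabilities, avg_donation=15.00, cost_per_piece=0.68):
    """Return (n_to_mail, expected_profit) for a mail-the-top-n strategy.

    probabilities: predicted response probabilities, one per donor
    (hypothetical model output; any fitted model could supply them).
    Assumption: each responder gives avg_donation dollars.
    """
    ranked = sorted(probabilities, reverse=True)  # most likely responders first
    best_n, best_profit, running = 0, 0.0, 0.0
    for n, p in enumerate(ranked, start=1):
        running += p * avg_donation - cost_per_piece  # expected profit of donor n
        if running > best_profit:
            best_n, best_profit = n, running
    return best_n, best_profit


if __name__ == "__main__":
    scores = [0.20, 0.10, 0.05, 0.03, 0.01]  # made-up probabilities for illustration
    n, profit = best_mailing_depth(scores)
    print(f"Mail the top {n} donors for an expected profit of ${profit:.2f}")
```

Because the donors are ranked, this cut-off is the point at which the predicted probability falls below the ratio of mailing cost to average donation. In practice the choice of cut-off should be checked on data that were not used to fit the model.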
1.8 Hints and Tips

1.8.1 Moving Data Between SAS products

To process data in one place in SAS and load it into another, you need to save the processed data into a library for which you have READ AND WRITE permission. So, in either Enterprise Guide or Enterprise Miner, create a library that points to the path on the SAS server that ends with your user code. This is NOT the same as the path given in section 1.6, where the data is originally stored. For example, the path you need might be:

E:\SHUUsers\55-7573-00S_BigDataMSc\

But in general:

E:\SHUUsers\ \

(Please note that different students are enrolled on different modules, so find the one that has your user code.) It is the same path that you will have used to create an Enterprise Miner project. You may wish to create a folder within this to separate datasets from projects. For example, if you create a folder called mydat, the path would then become:

E:\SHUUsers\ \\mydat

To access this outside of SAS, open a Windows Explorer window and type \\sas94-meta-live in the address bar. Once this opens, navigate to SHUUsers, and then to the correct path ending in your user code. From there you can create new folders. \\sas94-meta-live is equivalent to E: from within SAS.

WARNING: Ensure that you do not tamper with the folders named after your Enterprise Miner or Enterprise Guide projects. When you create a project, this is the path that contains all the information about your project, the settings etc. The folders within it are created by SAS, and it will need them in this structure when you re-open your projects. Moving, deleting or overwriting any files or folders will corrupt your projects.

1.8.2 Moving Data from Enterprise Miner

If you wish to process data in Enterprise Miner and then use it subsequently in SAS Enterprise Guide or IML Studio, it is necessary to save it.
The way to do this is to add a SAS code node to your Enterprise Miner stream, following the last processing node, e.g.:

    data mylib.newcharity;    /* output goes to a library you can write to */
      set &EM_IMPORT_DATA;    /* data set passed in from the preceding node */
    run;

Here mylib is a library that points to the location on the SAS server where you have write permission (as explained above). Create a library in Enterprise Guide pointing to the same location to read in this data.

1.8.3 Understand the variables

Make sure you understand the data. For example, the variables in 1.5.2 describe where an individual lives: IC4 is not the individual's income, but the average income of people living in the same area. The same principle applies to every variable in that section. On the other hand, the variables listed in 1.5.3 relate to the individual directly; they are that person's giving and marketing profile to date.

Also note that some of the variables in 1.5.2 are percentages that should sum to 100:

HU1 + HU2 = 100
HU3 + HU4 = 100
RP00 + RP200 + RP300 + RP400 + RP500 = 100
100 - HU5 = HU6 (say), Percent Seasonal/Recreational Occupied Units

With these you will need to decide carefully which variables to use. You should not, for example, include both members of any of these pairs:

HU1 and HU2
HU3 and HU4
HU5 and HU6 (as defined above)

Nor should you include all five of RP00 to RP500. With this sort of data you will need to leave out one of the percentage variables in each group. An alternative might be to work with the log odds of these. This idea was seen, for example, in the formulation and plots of the dependent variables in logistic regression.

1.8.4 SAS date variables and labels

SAS date variables are integer values that count days from 1 January 1960: 0 is 1 January 1960, 1 is 2 January 1960, and so on. The variable LASTDATE is such a variable: it holds an integer but has a SAS date format applied to it. SAS will print it as a date, but it may be used as a numeric variable (because it is really an integer variable).
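If the data are taken outside SAS (for example, exported to a text file without the date format), LASTDATE will appear as a plain integer. The following minimal sketch shows the conversion in both directions. It is written in Python rather than SAS, the helper names are hypothetical, and it assumes the values are whole days counted from 1 January 1960, as described above.

```python
from datetime import date, timedelta

SAS_EPOCH = date(1960, 1, 1)  # the calendar date that SAS stores as 0


def sas_date_to_date(sas_value):
    """Convert a SAS date value (whole days since 1 January 1960) to a Python date."""
    return SAS_EPOCH + timedelta(days=int(sas_value))


def date_to_sas_date(d):
    """Convert a Python date back to the integer that SAS would store."""
    return (d - SAS_EPOCH).days


if __name__ == "__main__":
    for value in (0, 1, 13666):
        print(value, "->", sas_date_to_date(value).isoformat())
```

Dates before 1960 are simply negative integers, so the same arithmetic applies to them.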
Hence in SAS the variable LASTDATE may be used in analysis as it is, but it will be displayed as a date.

The SAS dataset has labels attached to the variable names. These do not alter the use of the dataset, but they mean that variable labels are often printed rather than variable names. Variable names should be used in any code.

1.9 Report Format

Although you should answer the questions below, you should bear in mind that this scenario would in real life be presented as a report. You should therefore write your solutions in a report format, as if it were to be presented to the charity. There should be section headings and sub-headings, suitably numbered. Figures relevant to your discussion should be in the main report and should be labelled, numbered and referenced in the report. Further examples of figures or other "extras" should go into an Appendix.

You should discuss all the technical and statistical details of your findings, but you should make sure that any conclusions are understandable to a non-specialist. Bear in mind that the charity is unlikely to be interested in SAS code or Enterprise Miner settings. These should only be discussed where requested. Any further detail (e.g. SAS code) should go in the Appendix.

The aim is to have a well-structured report with relevant output. The reader should not have to scroll backwards and forwards to the appendix to find results. It should look professional and use correct English in the passive voice (i.e. not the first person).

There are 20 marks (10% of the total) available for the quality of this report.

2 Specific Analysis

2.0.1 Initial Exploration of the Data and Data Preparation

1) Carry out the initial data preparation as follows:

a) Carry out a simple exploration of all the data, producing appropriate plots and summary statistics of the variables.
The aim of your exploration should be to get a 'feel' for the data and to identify any particular issues with any of the variables. Bear in mind that you intend to use this data for the techniques outlined in questions 2) through to 8) inclusive. As a result of this, discuss how each variable should be treated, e.g.

• whether they should be omitted from further analysis
• whether they have missing values coded incorrectly
• whether a transformation would be useful
• whether they should be included without further modification
• whether new variables should be created from the existing ones
• whether other modifications would be appropriate (combination of groups, removal of outliers, recoding of missing values and imputation etc.)

(18 marks)

b) Carry out any essential data cleansing, preparation etc. as you highlighted in question 1)a), but keep it simple! Potentially, the number of transformations, new variables, etc. is endless. You may suggest more sophisticated possibilities that could be carried out if you had time to investigate them further, but do not spend an excessively long time at this stage. You should bear in mind that you are preparing the data for clustering and principal component analysis in parts 2) and 3) below.

(7 marks)

Total marks available for Question 1: 25 marks

2.1 Formation of Two datasets

As a result of the analysis so far you should form two distinct datasets corresponding to the two scenarios, A and B.

2.1.1 - i Dataset A

For scenario A, this should consist of all the cleansed variables. You should have a set of variables corresponding to those listed in sections 1.5.4, 1.5.2, 1.5.3 and 1.5.1. You may also have decided to omit some variables.

2.1.1 - ii Dataset B

For scenario B, this should use the same set of suitably prepared variables as in dataset A, but only those that are derived from those listed in 1.5.2: 'Socio-Economic Variables for Individual's Locality.'
Again this might consist of a mixture of the original variables, transformations of the variables and new variables, whilst other variables may have been excluded.

2.1.2 Questions using Unsupervised Methods

USE DATASET B FOR QUESTIONS 2), 3) AND 4)

2) Using only dataset B (the cleansed version of the variables listed in 1.5.2), profile the donor list.

a) Carry out a cluster analysis of this data using only these cleansed interval variables. Your client is ideally looking to segregate the data into just a few clusters (fewer than 10).

i) Use two different methods to determine the number of clusters. How many clusters would appear to be appropriate? On what do you base this decision? (8 marks)

ii) Produce a final cluster solution for these data. Discuss the final results and the output produced. Does this suggest the clustering has been successful? (10 marks)

b) Illustrate the validity of your cluster solution by producing suitable plots of the original socio-economic variables against the clusters. Hence profile one of your clusters and interpret the factors that make it unique. (7 marks)

Total marks available for Question 2: 25 marks

3) Further validate the clusters by carrying out a principal component analysis on dataset B, the cleansed version of the variables listed in 1.5.2.

a) Clearly explain how many components you would keep and justify your choice. (5 marks)

b) Fully interpret the meaning of each component you keep. You may use any appropriate plot or printed output to achieve this. Give a name to each component. (6 marks)

c) Produce a plot of the first two components coloured by clusters. Discuss your plot and what it tells us about these data. (4 marks)

4) Use the results of 2)b) and 3) to confirm what the clusters represent. Draw any conclusions regarding your cluster solution. Explain how the charity may use your results from the clustering and the principal components.
You should bear in mind the eventual aim of this study. (10 marks)

Total marks available for Questions 3 and 4: 25 marks

2.1.3 Questions on Supervised Techniques

USE DATASET A FOR REMAINING QUESTIONS

5) Use suitable software to fit a variety of models that will predict the target variable described in 1.5.1.

a) Attempt to fit a decision tree to these data.

i) Begin by using the proportion misclassified as the criterion. For this, the Subtree: Method should be set to "Assessment" and the Subtree: Assessment Measure set to "Misclassification". You should choose the remaining settings appropriately. Explain what happens. (2 marks)

ii) View the Subtree assessment plot and explain why this has happened. Given what you know of this data set, why might this be occurring? (4 marks)

b) Fit a working decision tree. Change the setting of the Subtree: Assessment Measure to "Lift". Experiment with the settings to find an appropriate tree. Examine the results and explain why a tree is now produced:

i) Fully discuss the settings you used to grow this tree. Explain what you chose and why the settings are appropriate. Why does it seem to work? (5 marks)

ii) Discuss the tree you have produced. Explain how many leaves it has, its depth, the variables used, the purity of the nodes and the Subtree assessment plot. How well does the tree seem to perform? (9 marks)

Total marks available for Question 5: 20 marks

6) Fit two logistic regression models to the target. Use suitable methods to select the variables in your model.

a) Fully explain and justify the settings you used to create your models. (5 marks)

Pick one logistic regression model to discuss in b)-c) below:

b) For your selected model, discuss the tests given in the output for your model and fully interpret them. (9 marks)

c) Discuss the model and fully interpret the parameters and odds ratios. (6 marks)

d) Are there any reservations you have about your logistic regression models?
Explain how you might further check the model validity. (You are not required to carry out these checks.) (5 marks)

Total marks available for Question 6: 25 marks

7) Compare the regression models and the decision tree model.

a) Produce appropriate summary measures of their goodness of fit. Discuss your results. (7 marks)

b) Produce at least three suitable charts, one of which should be the ROC chart. Fully discuss any charts you produce. Illustrate what each chart means by explaining what one point on it represents. Which model would you recommend based on these charts? Based on your charts, comment on how well the models perform. (15 marks)

c) Using your results from 7)a) and 7)b), explain which model you would prefer. Fully justify your answer. (3 marks)

Total marks available for Question 7: 25 marks

8) Use the first observation in the dataset to illustrate how the predicted probability of responding is calculated. Do this for:

a) One of your logistic regression models of your choice (7 marks)

b) The decision tree (5 marks)

9) Discuss and summarise your results:

a) Discuss the relative accuracy of all three models. What reservations, if any, do you have about your models? (4 marks)

b) Use all of your results to select one particular method that you would recommend to predict responders. You should bear in mind practical as well as statistical considerations. (4 marks)

c) Use your models to give advice to the charity on who they should target with their campaigns. (6 marks)

d) Investigate other options available within Enterprise Miner and hence suggest any further work that may be useful to build an improved model in the future. This might be improvements to the existing models as well as the production of other models.
(9 marks)

Total marks available for Questions 8 and 9: 35 marks

2.2 Summary of Breakdown of Marks

Question 1      25
Question 2      25
Question 3+4    25
Question 5      20
Question 6      25
Question 7      25
Question 8+9    35
Report          20

Total marks available: 200 marks. This will be converted to a percentage.