Assignment title: Information
Big Data Analytics MSc Data Mining Assignment: 2016
page 1 of 13
Data Mining Assignment:
Case Study of American Charity Data
Contents
1 Introduction
1.0 Learning Outcomes
1.1 Deadline
1.2 Submission
1.3 Assessment for this module
1.4 Problem outline
1.5 Variable details
1.5.1 Target Variable
1.5.2 Socio-Economic Variables for Individual's Locality (based on ZIP code)
1.5.3 Individual's giving profile to date (the date being late 1997!)
1.5.4 Locality Variable
1.6 Dataset Availability
1.7 Background and Aim of this Investigation
1.7.1 Scenario
1.8 Hints and Tips
1.8.1 Moving Data Between SAS products
1.8.2 Moving Data from Enterprise Miner
1.8.3 Understand the variables
1.8.4 SAS date variables and labels
1.9 Report Format
2 Specific Analysis
2.0.1 Initial Exploration of the Data and Data Preparation
2.1 Formation of Two datasets
2.1.1 - i Dataset A
2.1.1 - ii Dataset B
2.1.2 Questions using Unsupervised Methods
2.1.3 Questions on Supervised Techniques
1 Introduction
1.0 Learning Outcomes
This assignment will assess the following learning outcomes:
1) Apply and justify a range of data mining techniques to the extraction of business
information from data.
2) Critically evaluate the validity of the techniques employed with respect to the relevant data
and also to their intended use.
3) Interpret the intelligence provided by the data mining process in a practical business
setting.
4) Effectively communicate the data mining process and its results to a decision maker.
1.1 Deadline
23:59 Sunday 22nd May 2016
1.2 Submission
You should submit your work TWICE on Blackboard: once to Turnitin, so that it is checked for
plagiarism, and a second time to an area that permits electronic marking and feedback. It should
ideally be submitted as a Word document. You will find both submission areas in the assessment
section of the Blackboard site.
1.3 Assessment for this module
This module will be assessed via a case study. This will involve the analysis of a dataset that is
described below. In order to make this task more manageable, and to allow feedback as you
progress through the module, the case study will be broken down into smaller aspects of the
analyses. The marks for each aspect are shown alongside the question so that you have some
idea how much each part contributes. This should also help you decide how long to spend on
each question. There are a total of 200 marks available. This will be converted to a percentage.
You need 40% to pass this module.
1.4 Problem outline
The aim of this assignment is to analyse a sample of data extracted from an American national
veterans' organisation. Although this is an old dataset, it still encapsulates the type of problem
that is typical today.
The veterans' organisation is non-profit-making and provides programs and services for US
veterans with spinal cord injuries or disease. The results of a recent fundraising appeal for this
organisation are provided in a dataset that is described below. This mailing was sent to a total of
3.5 million donors who were on the organisation's database as of June 1997. Everyone included in
this mailing had made at least one prior donation.
The mailing included a gift (or "premium") of personalised name and address labels plus an
assortment of 10 note cards and envelopes. All of the donors who received this mailing were
acquired through similar premium-oriented appeals.
The population for this analysis will be lapsed donors who received the June '97 renewal mailing
(appeal code "97NK"). Therefore, the analysis data set contains a subset of the total universe who
received the mailing.
The analysis file includes 5000 lapsed donors who received the mailing, with responders to the
mailing marked with a flag in the TARGET_B field. The overall response rate for this direct mail
promotion is 5.1%.
The average donation amount among the responders is $15.
The package cost (including the mail cost) is $0.68 per piece mailed.
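The figures above are enough for a rough break-even check. The following is a minimal sketch, not a prescribed method: it uses only the three figures quoted in this section, and the variable names are illustrative assumptions. It computes the expected profit per piece when everyone is mailed, and the response rate at which a mailing just covers its cost.

```sas
/* Rough break-even arithmetic using the figures quoted in section 1.4. */
/* Variable names are illustrative only.                                */
data _null_;
    response_rate  = 0.051;  /* overall response rate (5.1%)           */
    avg_donation   = 15;     /* average donation among responders ($)  */
    cost_per_piece = 0.68;   /* package cost including mail ($)        */

    /* Expected profit per piece if everyone is mailed */
    profit_per_piece = response_rate * avg_donation - cost_per_piece;

    /* Response rate at which expected donation equals cost */
    breakeven_rate = cost_per_piece / avg_donation;

    put profit_per_piece= dollar8.3 breakeven_rate= percent8.2;
run;
```

By hand, 0.051 × 15 − 0.68 is about $0.085 per piece, and 0.68 / 15 is about 4.5%, so a group of donors is only worth mailing if its expected response rate is above roughly 4.5% (assuming the $15 average donation applies to that group).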
1.5 Variable details
The variables included for this exercise are as follows:
1.5.1 Target Variable
TARGET_B Target Variable: Binary Indicator for Response to 97NK Mailing
1 = Responded to mailing
0 = Did not respond
(Note: NK mailings are blank cards with labels)
1.5.2 Socio-Economic Variables for Individual's Locality (based on ZIP code)
HV2 Average Home Value in hundreds ($)
HU1 Percent Owner Occupied Housing Units
HU2 Percent Renter Occupied Housing Units
HU3 Percent Occupied Housing Units
HU4 Percent Vacant Housing Units
HU5 Percent Seasonal/Recreational Vacant Units
RP500 Percent Renters Paying >= $500 per Month
RP400 Percent Renters Paying >= $400 and < $500 per Month
RP300 Percent Renters Paying >= $300 and < $400 per Month
RP200 Percent Renters Paying >= $200 and < $300 per Month
RP00 Percent Renters Paying < $200 per Month
IC4 Average Family Income in hundreds
IC15 Percent Families w/ Income < $15,000
1.5.3 Individual's giving profile to date (the date being late 1997!)
NUMPROM Lifetime number of promotions received to date
RAMNT_3 Amount ($'s) of gift for 96NK
RAMNT_14 Amount ($'s) of gift for 95NK
RAMNT_24 Amount ($'s) of gift for 94NK
RAMNTALL Dollar amount of lifetime gifts to date
NGIFTALL Number of lifetime gifts to date
MAXRAMNT Dollar amount of largest gift to date
LASTGIFT Dollar amount of most recent gift
LASTDATE Date associated with the most recent gift
AVGGIFT Average dollar amount of gifts to date
CONTROLN Control number (unique record identifier)
AMPERGFT Average of all gifts received in last 36 months
AVNKGIFT Average of all gifts received in last 36 months for promotions consisting of
blank cards with labels
PGIFTH Proportion of lifetime gifts to lifetime number of promotions
MYAMNT Total of gifts from four particular promotions:
95 promotion with blank cards with labels
96 promotion with Christmas cards with labels
96 promotion with general greeting cards (an assortment of
birthday, sympathy, blank, & get well) with labels
96 promotion with calendars with stickers but without labels
SUSPECT Estimate of the ratio of gifts to solicitations from June 1996 to June 1997; see
Georges and Milley (1999)†
MAJOR Major ($$) Donor Flag
"Not majo" = Not a Major Donor
"Major" = Major Donor
1.5.4 Locality Variable
State State abbreviation (a nominal/symbolic field)
1.6 Dataset Availability
The full dataset of 5000 observations is available on the SAS server and is called charity i.e.
charity.sas7bdat
The path to this from either SAS Enterprise Guide or Enterprise Miner is:
E:\SHUUsers\!SharedData\Teresa\BD Assign
† Georges and Milley (1999) KDD'99 Competition: Knowledge Discovery Contest.
http://www-cse.ucsd.edu/users/elkan/saskdd99.pdf (last accessed 2/04/05)
You will need to create a library to access it.
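As a minimal sketch of creating such a library in code: the libref name shared is an assumption (any valid name of up to eight characters would do), and the path is the one given above. The access=readonly option is added here as a precaution and is not required by the brief.

```sas
/* Assign a read-only library pointing at the shared data folder. */
/* The libref "shared" is an arbitrary, illustrative name.        */
libname shared "E:\SHUUsers\!SharedData\Teresa\BD Assign" access=readonly;

/* List the variables, types, formats and labels in the dataset */
proc contents data=shared.charity;
run;
```

In Enterprise Guide or Enterprise Miner the same library can instead be created through the point-and-click library wizard.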
1.7 Background and Aim of this Investigation
You are required to analyse this data using suitable statistical software. Although this is a real
dataset, the aim is to use it in the fictitious situation described below.
1.7.1 Scenario
You work for an imaginary, small, up-and-coming data mining consultancy company. Your firm
has been commissioned by the charity owning the data to undertake the following two main tasks.
These scenarios are provided to give you guidance as to what to expect later on and what to look
for in the initial stages. Please see the "Specific Analysis" section for details of what is required.
The two scenarios are:
a) Currently, all potential donors on their database are mailed. It would be much more efficient
to mail to only a subset of these based on their likelihood to respond. Given the costs and
potential profit, can you find a suitable model to predict who will be likely to respond
together with a mailing strategy to maximise profit? You should, using your model, aim to
explain why certain individuals are likely to respond.
b) Also of interest is the desire to buy in additional databases in order to allow the mailing of
new potential donors not currently on the charity's database. The first stage of this is to
profile existing donors (i.e. those on their existing database) and then use this information
to apply these profiles to new potential donors. This will allow for the development of new
types of promotion and eventually subsequent modelling. However, the first stage would
just be to look for patterns in the data. Obviously this would need to be done using only
Socio-economic variables (since the bought-in databases would only contain this
information, and would clearly not include information on most of the other variables). This
leads to a further scenario:
Profile potential donors on the current data set using only the list of Socio-economic
variables. Investigate these profiles for similarities, and explain what each profile
represents.
1.8 Hints and Tips:
1.8.1 Moving Data Between SAS products
To process data from one place in SAS and load it into another you need to save the processed
data into a library for which you have READ AND WRITE permission. So in either Enterprise
Guide or Enterprise Miner create a library to point to the path on the SAS server that ends with
your user code. This is NOT the same as the path, as explained in section 1.6, where the data is
originally stored.
For example the path you need might be:
E:\SHUUsers\55-7573-00S_BigDataMSc\
But in general:
E:\SHUUsers\ \
(Please note different students are enrolled on different modules – so find one that
has your user code).
It is the same path that you will have used to create an Enterprise Miner project.
You may wish to create a folder within this to separate datasets from projects. For example, if you
create a folder called mydat, the path would then become:
E:\SHUUsers\ \\mydat
To access this outside of SAS, open a Windows Explorer window and type
\\sas94-meta-live in the address bar. Once this opens, navigate to SHUUsers, and then to the
correct path ending in your user code. From there you can create new folders.
\\sas94-meta-live is equivalent to E: from within SAS.
WARNING: Ensure that you do not tamper with the folders that are named
after your Enterprise Miner or Enterprise Guide projects.
When you create a project, this is the path that contains all the
information about your project, the settings etc. The folders
within it are created by SAS, and it will need them in this
structure when you re-open your projects. Moving, deleting or
overwriting any files or folders will corrupt your projects.
1.8.2 Moving Data from Enterprise Miner
If you wish to process data in Enterprise Miner and then use it subsequently in SAS Enterprise
Guide or IML/Studio, it is necessary to save it. The way to do this is to add a SAS Code node to
your Enterprise Miner stream, following the last processing node, e.g.:
/* Save the data arriving at this SAS Code node to a permanent library */
data mylib.newcharity;
    set &EM_IMPORT_DATA;
run;
Here, mylib is a library that points to a location on the SAS server where you have write
permission (as explained above). Create a library in Enterprise Guide pointing to the same
location to read in this data.
1.8.3 Understand the variables
Make sure you understand the data. For example, the variables in 1.5.2 describe where
an individual lives: IC4 is not the individual's income, but the average income of people living in
the same area. The same principle applies to all the variables in that section. On the other hand,
the variables listed in 1.5.3 relate to the individual directly; they are their giving and marketing
profile to date.
Also note that some of the variables in 1.5.2 are percentages that should sum to 100:
HU1 + HU2 = 100,
HU3 + HU4 = 100,
and RP00 + RP200 + RP300 + RP400 + RP500 = 100.
Similarly, a complementary variable can be defined for HU5:
HU6 = 100 – HU5 (say), the Percent Seasonal/Recreational Occupied Units.
With these you will need to decide which variables to use. You should not, for example, include
both members of any of these pairs:
HU1 and HU2
HU3 and HU4
HU5 and HU6 (as defined above)
Nor should you include all five of RP00-RP500.
With this sort of data you will need to leave out one of the percentage variables in each group.
An alternative might be to work with the log odds of these percentages. This idea was seen, for
example, in the formulation and plots of the dependent variable in logistic regression.
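The log-odds idea can be sketched as follows for the HU1/HU2 pair. This is one possible approach under stated assumptions, not the required treatment: the libref shared, the output dataset name and the new variable names are all illustrative, and the 0.5 offset that keeps the logit finite at 0% and 100% is an arbitrary choice.

```sas
/* Illustrative only: libref, dataset and new variable names are assumptions. */
libname shared "E:\SHUUsers\!SharedData\Teresa\BD Assign" access=readonly;

data work.charity_logit;
    set shared.charity;

    /* Leave missing values missing rather than letting max() mask them */
    if not missing(HU1) then do;
        /* Clamp to [0.5, 99.5] so that log(p / (1 - p)) is defined;  */
        /* the 0.5 offset is an arbitrary illustrative choice         */
        p_hu1     = min(max(HU1, 0.5), 99.5) / 100;
        logit_hu1 = log(p_hu1 / (1 - p_hu1));
    end;

    /* HU2 = 100 - HU1 carries no extra information, so it is dropped */
    drop p_hu1 HU2;
run;
```

The same pattern could be repeated for the other percentage groups; whether the transformation helps is something to judge from the plots of the transformed variables.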
1.8.4 SAS date variables and labels
SAS date variables are integer values counting days from 1 January 1960: 0 is 1/1/1960, 1 is
2/1/1960 (i.e. 2 January 1960), etc. The variable LASTDATE is such a variable.
LASTDATE has a SAS date format applied to it, so SAS will print it as a date. Because it is
really an integer variable, however, it may be used in analysis as it is.
The SAS dataset has labels attached to the variable names which do not alter the use of this
dataset but mean that variable labels are often printed rather than variable names. Variable names
should be used in any code.
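The date behaviour described above can be checked with a short sketch. The first step uses date literals only; the second assumes a libref called shared pointing at the path in section 1.6, and the output dataset and new variable names are illustrative.

```sas
/* A SAS date is the number of days since 1 January 1960. */
data _null_;
    day0 = '01JAN1960'd;
    day1 = '02JAN1960'd;
    put day0= day1=;                   /* unformatted: the integers 0 and 1 */
    put day0= date9. day1= ddmmyy10.;  /* the same values shown as dates    */
run;

/* Illustrative only: libref, output dataset and new names are assumptions. */
libname shared "E:\SHUUsers\!SharedData\Teresa\BD Assign" access=readonly;

data work.charity_dates;
    set shared.charity (keep=CONTROLN LASTDATE);
    lastdate_raw = LASTDATE;       /* same value, shown as a plain integer */
    format lastdate_raw 8.;
    gift_year = year(LASTDATE);    /* calendar year of the most recent gift */
run;
```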
1.9 Report Format
Although you should answer the questions below, you should bear in mind that this scenario would
in real life be presented as a report. You should therefore write your solutions in a report format as
if it were to be presented to the charity. There should be section headings and sub-headings
suitably numbered. Figures relevant to your discussion should be in the main report and should be
labelled, numbered and referenced in the report. Further examples of figures or other "extras"
should go into an Appendix.
You should discuss all the technical and statistical details of your findings, but you should make
sure that any conclusions are understandable to a non-specialist. Bear in mind that the charity is
unlikely to be interested in SAS code or Enterprise Miner settings. These should only be
discussed where requested. Any further detail (e.g. SAS code) should go in the Appendix.
The aim is to have a well-structured report with relevant output. The reader should not have to
scroll backwards and forwards to the appendix to find results. It should look professional and use
correct English in the passive voice (i.e. not first person). There are 20 marks (10% of the total)
available for the quality of this report.
2 Specific Analysis
2.0.1 Initial Exploration of the Data and Data Preparation
1) Carry out the initial data preparation as follows:
a) Carry out a simple exploration of all the data producing appropriate plots and summary
statistics of the variables. The aim of your exploration should be to get a 'feel' for the data
and to identify any particular issues with any of the variables. Bear in mind that you intend
to use this data for the techniques outlined in questions 2) through to 8) inclusive. As a
result of this, discuss how each variable should be treated, e.g.
• whether they should be omitted from further analysis
• whether they have missing values coded incorrectly
• whether a transformation would be useful
• whether they should be included without further modification
• whether new variables should be created from the existing ones
• whether other modifications would be appropriate (combination of groups, removal
of outliers, recoding of missing values and imputation etc.)
(18 marks)
b) Carry out any essential data cleansing, preparation etc. as you highlighted in question 1)a),
but keep it simple! Potentially, the number of transformations, formation of new variables,
etc. is endless. You may suggest more sophisticated possibilities that could be carried out if
you had time to investigate them further, but do not spend an excessively long amount of
time at this stage. You should bear in mind that you are preparing the data for clustering
and principal component analysis in parts 2) and 3) below.
(7 marks)
Total marks available for Question 1: 25 marks
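A possible starting point for the exploration in 1)a) is sketched below. It is a minimal example rather than a complete answer: it assumes a libref called shared pointing at the path in section 1.6, and the choice of statistics is illustrative.

```sas
/* Illustrative only: the libref "shared" is an assumption. */
libname shared "E:\SHUUsers\!SharedData\Teresa\BD Assign" access=readonly;

/* Summary statistics and missing-value counts for all numeric variables */
proc means data=shared.charity n nmiss min mean median max;
    var _numeric_;
run;

/* Frequency tables for the target and the categorical variables, */
/* counting missing values as a category                          */
proc freq data=shared.charity;
    tables TARGET_B State MAJOR / missing;
run;
```

Note that `_numeric_` also picks up CONTROLN and LASTDATE, whose summary statistics need to be read with their meaning in mind (an identifier and a date respectively).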
2.1 Formation of Two datasets
As a result of the analysis so far you should form two distinct datasets corresponding to the
two scenarios, A and B.
2.1.1 - i Dataset A
For scenario A, this should consist of all the cleansed variables. You should have a set of
variables corresponding to those listed in sections 1.5.1, 1.5.2, 1.5.3 and 1.5.4. You may
also have decided to omit some variables.
2.1.1 - ii Dataset B
For scenario B this should use the same set of suitably prepared variables as in dataset A,
but only those that are derived from those listed in 1.5.2: 'Socio-Economic Variables for
Individual's Locality.' Again this might consist of a mixture of the original variables,
transformations of the variables and new variables whilst other variables may have been
excluded.
2.1.2 Questions using Unsupervised Methods
USE DATASET B FOR QUESTIONS 2), 3) AND 4)
2) Using only dataset B, the cleansed version of the variables listed in 1.5.2, profile the donor list.
a) Carry out a cluster analysis of this data using only these cleansed interval variables. Your
client is ideally looking to segregate the data into just a few clusters (fewer than 10).
i) Use two different methods to determine the number of clusters. How many clusters
would appear to be appropriate? On what do you base this decision?
(8 marks)
ii) Produce a final cluster solution for these data. Discuss the final results and the output
produced. Does this suggest the clustering has been successful?
(10 marks)
b) Illustrate the validity of your cluster solution by producing suitable plots of the original
socio-economic variables against the clusters. Hence profile one of your clusters and
interpret the factors that make it unique. (7 marks)
Total marks available for question 2: 25 marks
3) Further validate the clusters by carrying out a principal component analysis on dataset B, the
cleansed version of the variables listed in 1.5.2.
a) Clearly explain how many components you would keep and justify your choice.
(5 marks)
b) Fully interpret the meaning of each component you keep. You may use any appropriate
plot or printed output to achieve this. Give a name to each component.
(6 marks)
c) Produce a plot of the first two components coloured by clusters. Discuss your plot and what
it tells us about these data.
(4 marks)
4) Use the results of 2)b) and 3) to confirm what the clusters represent. Draw any conclusions
regarding your cluster solution. Explain how the charity may use your results from the clustering
and the principal components. You should bear in mind the eventual aim of this study.
(10 marks)
Total marks available for questions 3 and 4: 25 marks
2.1.3 Questions on Supervised Techniques
USE DATASET A FOR REMAINING QUESTIONS
5) Use suitable software to fit a variety of models that will predict the target variable described
in 1.5.1.
a) Attempt to fit a decision tree to these data.
i) Begin by using proportion misclassified as the criterion. For this the Subtree: Method
should be set as "Assessment" and the Subtree: Assessment Measure set as
"Misclassification". You should choose the remaining settings appropriately. Explain
what happens. (2 marks)
ii) View the Subtree assessment plot and explain why this has happened. Given what
you know of this data set why might this be occurring?
(4 marks)
b) Fit a working decision tree. Change the settings of the Subtree: Assessment Measure to
"Lift". Experiment with the settings to find an appropriate tree. Examine the results and
explain why a tree is now produced:
i) Fully discuss the settings you used to grow this tree. Explain what you chose and why
they are appropriate. Why does it seem to work?
(5 marks)
ii) Discuss the tree you have produced. Explain how many leaves it has, its depth,
variables used, purity of nodes and the Subtree assessment plot. How well does the
tree seem to perform?
(9 marks)
Total marks available for Question 5: 20 marks
6) Fit two logistic regression models to the target. Use suitable methods to select the variables in
your model.
a) Fully explain and justify the settings you used to create your models.
(5 marks)
Pick one logistic regression model to discuss in b)-c) below:
b) For your selected model, discuss the tests given in the output for your model and fully
interpret them. (9 marks)
c) Discuss the model and fully interpret the parameters and odds ratios. (6 marks)
d) Are there any reservations you have about your logistic regression models? Explain how
you might further check the model validity. (You are not required to carry out these checks).
(5 marks)
Total marks available for Question 6: 25 marks
7) Compare the regression models and the decision tree model.
a) Produce appropriate summary measures of their goodness of fit. Discuss your results.
(7 marks)
b) Produce at least three suitable charts, one of which should be the ROC chart. Fully discuss
any charts you produce. Illustrate what each chart means by explaining what one point on
it represents. Which model would you recommend based on these charts? Based on your
charts, comment on how well the models perform.
(15 marks)
c) Using your results to 7)a) and 7)b) explain which model you would prefer. Fully justify your
answer.
(3 marks)
Total marks available for Question 7: 25 marks
8) Use the first observation in the dataset to illustrate how the predicted probability of responding
is calculated. Do this for:
a) One of your logistic regression models of your choice (7 marks)
b) The decision tree (5 marks)
9) Discuss and summarise your results:
a) Discuss the relative accuracy of all three models. What reservations, if any, do you have
about your models?
(4 marks)
b) Use all of your results to select one particular method that you would recommend to predict
responders. You should bear in mind practical as well as statistical considerations.
(4 marks)
c) Use your models to give advice to the charity on who they should target with their
campaigns.
(6 marks)
d) Investigate other options available within Enterprise Miner and hence suggest any further
work that may be useful to build an improved model in the future. This might be
improvements to the existing models as well as the production of other models.
(9 marks)
Total marks available for Questions 8 and 9: 35 marks
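For question 8)a), the calculation for a logistic regression model follows the usual pattern: form the linear predictor from the fitted coefficients and the observation's values, then apply the inverse logit. The sketch below shows only the mechanics. The coefficients, the choice of predictors and the input values are all made up for illustration; they are not results from this dataset, and a fitted model would supply its own.

```sas
/* Mechanics only: every number below is hypothetical, not a fitted result. */
data _null_;
    /* Hypothetical coefficients */
    intercept  = -3.0;
    b_lastgift = -0.02;
    b_ngiftall =  0.05;

    /* Hypothetical values for one observation */
    LASTGIFT = 10;
    NGIFTALL = 8;

    /* Linear predictor (log odds of responding) */
    eta = intercept + b_lastgift * LASTGIFT + b_ngiftall * NGIFTALL;

    /* Inverse logit converts log odds to a probability */
    p = 1 / (1 + exp(-eta));

    put eta= p=;
run;
```

If any inputs were transformed or imputed before modelling, the same transformations must be applied to the observation's values before they enter the linear predictor.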
2.2 Summary of Breakdown of Marks
Question 1 25
Question 2 25
Question 3+4 25
Question 5 20
Question 6 25
Question 7 25
Question 8+9 35
Report 20
(Total marks available: 200 marks
This will be converted to a percentage)