Assignment title: Information
Learning Outcomes
This assignment will be assessed upon your ability to:
LO1. Describe and critically evaluate the role and relevance of [analytical
investigation] to the solution of business information problems.
LO2. Explain the concepts that underpin the subject area of [data mining] making
reference to main established concepts and some developing areas.
LO3. Apply concepts and justify decisions when modelling, designing and
constructing practical examples or paper descriptions of applications in this area.
1.1 Deadline:
Your completed assignment should be submitted electronically, in word format, through
the assessment areas on Blackboard, at or before 23:59 on Tuesday the 17th of
January 2017. Failure to submit the assignment on time, without a valid extension, will
result in zero marks being awarded.
1.2 Assessment Criteria
Marks are indicated next to each question.
This is the first assignment and consists of an individual piece of work that contributes
50% to the final marks of this module. It takes the form of a well-structured report of
2,500 words in length. Marks will be deducted if it is outside the 10% allowance.
1.3 Notes
You are required to submit this assignment twice once through turn-it-in, to generate
a turn-it-in report and again, through the SHU assignment submission area so that
the work can be easily marked. These will be made available to you on blackboard in
due course.
All references must follow the APA‡ format.
‡ Help on this can be found in the library resource: http://libguides.shu.ac.uk/referencing You must be
logged onto the library for this link to work).
Data Warehousing and Data Mining - Assignment 1
Academic Year 2016/17
page 2 of 9
1.4 Problem outline
For this assignment you are required to analyse a data set taken from the data mining
competition prior to the third international conference of Principles and Practices of knowledge
discovery in data bases (PKDD). This conference was held in Prague in 1999§
. One of the
challenges given for the competition was a set of datasets concerning financial transactions and
details for customers at a Czech bank. Several of these files have been combined to give
4 500 observations of various financial and personal details relating to different accounts.
Details of the variables included are given below.
Details of the relationships are shown below:
Figure 1: Database Relationships for Czech Bank Data
You are required to analyse one table resulting from a query from this database as detailed
below. Full details of the fields in this table are given below and in the appendix.
§ http://lisp.vse.cz/pkdd99/
Data Warehousing and Data Mining - Assignment 1
Academic Year 2016/17
page 3 of 9
1.5 Data Provided
The final query is saved as a SAS dataset for use in Enterprise Miner. It is called
czech16.sas7bdat. It is available on the SHU server in the path:
D:\SASMetaShare\!SharedData\DWDM. You will need to create a library to access the data.
1.6 Details of the Query and resulting data
We wish to investigate if there are any groups of accounts with similar properties. Also we wish
to determine which accounts have a loan attached to that account. For this purpose a subset
of variables (fields) are selected from the final query on the database for each account.
These variables can be seen to represent, for each account different types of credits and
withdrawals that take place:
Credits (this is where customers pay into their account):
• Cash
• Salary
• Pension
• Interest
Withdrawal (taking money out i.e. customers spend their money):
• Cash
• Insurance payment
• Credit Card
• Overdraft Penalty
• Statement Payment
• Household Payment
• Other bank withdrawal
All amounts are measured in Czech Koruna (Kč). For each of these the mean size of the
payment is calculated as the total value of all payments over the total number of payments. So
for example the variable for mean insurance payments (minsure) is calculated as the total
value of insurance payments divided by the number of insurance payments. For an account
which has made ten monthly insurance payments of £12, and then a further 10 of £15 (perhaps
their premium went up), the value of minsure would be:
10 10
minsure = 12+12+12...+12+15+15+...+15
20 =
120+150
20 =
270
20 = 13.5Kč
Data Warehousing and Data Mining - Assignment 1
Academic Year 2016/17
page 4 of 9
In addition to the mean for each of the variables above the average gap between cash credits
(mcashcrcgap) and cash withdrawals(mcashwdcgap) is also included. (These are NOT
included for the other credit/debit methods). So if an account shows a cash payment in (cash
credit) each week on the same day, then the value for mcashcrcgap would be 7 days, an
account with no cash credits or only one would have mcashcrcgap=0. Similarly, suppose an
account shows a gap of 3 days between the first cash withdrawal and the second withdrawal.
Then if subsequent cash withdrawals occur again after 5 further days, 4 further days and finally
after a further 10 days; then the value of mcashwdcgap = 3 + 5 + 4 + 10
4 =5.5 days.
Finally additional information is held about each account:
Account id, Age of primary account holder, if they have a credit card or not (with this bank),
number of days account open, if they have a loan or not, if there is a second user of the
account and the gender of the main account holder (sex). This gives the set of variables as
shown in the appendix
1.7 Questions
The bank wishes to see if different customers have similar profiles and have therefore asked
that the czech16 data be clustered. They are looking for about five clusters. You will need to
build an Enterprise Miner diagram and use appropriate nodes to answer these questions. A
typical diagram is shown in Figure 3 on page 7.
1) Since cluster analysis requires the use of fields that are symmetrical as possible you should
first investigate each of the interval fields in the czech16 data.
a) Produce suitable summary measures (means, standard deviation, skewness etc.) of the
data and fully interpret your results. (No marks will be awarded for discussing the
kurtosis). For example, what is the largest average type of credit? Which is the most
consistent? What about withdrawals? Are there any other interesting features?
(7 marks)
b) Produce suitable plots of each of these fields. Are there any unusual features to any of
your plots? By examining the actual data records can you discern why these features
might occur? Are they what you might expect?
(7 marks)
(You may initially find it best to achieve this with the “explore” feature available within
Enterprise Miner, to obtain detailed statistics use a StatExplore node with loan set as
input. Remember to reset loan back to target for building the tree in question 9) below).
Data Warehousing and Data Mining - Assignment 1
Academic Year 2016/17
page 5 of 9
2) Also investigate the remaining binary variables by producing suitable plots. Fully discuss
your results.
(2 marks)
3) Use the transform node in Enterprise Miner and the "Maximum Normal" option for interval
variables to find suitable transformations of the interval variables. The aim is to make them
as symmetrical as possible. You should ensure that in your settings, you still retain a copy of
the original variables (set both Hide and Reject to "no").
a) Explain what actual transformations the software has picked.
(3 marks)
b) Produce further plots of the transformed variables and use these to present evidence of
whether the transformations have been successful. Comment clearly on your results.
(Please note that it may not be possible to make all variables totally symmetrical).
Consequently, state for each interval variable whether subsequent analysis should use
the original (untransformed) variable or the new transformed variable. Hence explain
which set of interval variables you would use for clustering.
(6 marks)
4)
a) Clustering of these data has been carried out using code resulting in the dendrogram on
the following page (Figure 2). Fully discuss this dendrogram and explain how many
clusters you might pick. You should give all reasonable possibilities and then explain
what your final choice is likely to be.
(4 marks)
b) Use a clustering node to cluster the interval variables only. You should use the variable
set that you specified in 3)b) above. Ensure that the resulting set of interval variables
are suitably standardised. Set the “Initial Cluster Seed” options as follows:
• Seed Initialization Method MacQueen
• Minimum Radius1
• Drift during Training1
Hence obtain a CCC plot for cluster solutions of the interval variables.
(1 mark)
c) Fully discuss the CCC plot and clearly explain how many clusters you would pick.
(5 marks)
5) Change the settings of the clustering node so as to fit five clusters. Discuss the resulting
"segment size" plot and "Mean statistics" in the output.
(6 marks)
Data Warehousing and Data Mining - Assignment 1
Academic Year 2016/17
page 6 of 9
Figure 2: Dendrogram Produced Using Interval Variables Only
6) Produce suitable plots to investigate the nature of each cluster. Hence illustrate the validity
of your cluster solution by profiling one of your clusters and interpret the factors that make it
unique. Draw any conclusions regarding your cluster solution.
(8 marks)
7)
a) From within the cluster results, obtain the decision tree for your cluster solution.
Discuss this tree explaining how many leaves it has, its depth and any other relevant
features. (Little credit will be given for just writing the tree out in words.) Give two
examples of how the results of the tree support your answer to question 6) above.
(6 marks)
b) Explain the disadvantages and advantages of using this tree rather than a new tree, built
by adding a tree node to the stream.
(3 marks)
8) Use all of your results above to discuss how the cluster solution may be utilised by the bank.
How may the results of any of the analyses above be utilised to carry out further supervised
data mining? What other possible targets might be appropriate?
(8 marks)
Data Warehousing and Data Mining - Assignment 1
Academic Year 2016/17
page 7 of 9
9) Add a partition node and a tree node to your diagram. Set the field loan as the target in the
input (czech16) node. Connect the input (czech16) to the partition node and the partition
node to the tree node. A typical final Enterprise Miner diagram is shown in Figure