Assignment title: Information
FINAL EXAM – DATA MINING
ALL DATA MENTIONED IN THE 'DATA FOLDER' CAN BE LOCATED IN THE LINK GIVEN IN THIS MODULE or in the link in week 1.
DIRECTIONS: You have until the Dropbox closes to submit this exam. NO
late submissions are accepted except under very extenuating circumstances,
such as a car accident or house fire. No Incompletes are given without a
documented reason.
The problems below are, for the most part, real. Don't expect clean
answers and obvious explanations. There may be missing data, and
techniques like J48 may actually run out of attributes and NEVER
create a single homogeneous classification. Be careful. A good answer may actually
be that there is no pattern in the data. Not every dataset has neat patterns. Sometimes it's good for management to realize that their data is random, that flipping a coin or guessing is what they need to do, rather than assuming there's some
high-payoff simple rule that will make them lots of money and save time. I have not intentionally placed data like that in the exam, but the data is real and may not be well-behaved.
So when you justify your answers, write simple and persuasive English. It doesn't have to be Shakespeare quality. Just explain why you selected a technique, what you did, and what the results tell you. That is perfect.
Here's how this will be graded: There is no "right" answer I am
seeking, and I don't have a 'key'. Rather, I am going to grade HOW you went about
solving the problem and whether what you did is reasonable. If you
blindly apply a technique without any type of analysis, that's reckless
in real life as well as on the final. Instead, you should perhaps run
some statistics, clustering, etc., in an EFFORT to gather some helpful
information BEFORE trying a technique and IN SUPPORT of APPLYING that technique.
Simply giving me an 'answer' without convincing me that your answer is reasonable will not receive credit.
Ask yourself before turning in your exam: Do my answers convince me???
Think in terms of training and test sets. DON'T just run your
techniques on the entire dataset. Instead, when it makes sense to you,
divide your dataset into two pieces (not necessarily equal) and SEE
FOR YOURSELF HOW YOU'RE PERFORMING. If you train on one set of data
and then test on another - and you correctly classify the test set ->
that indicates a high confidence in the result. It's hard to criticize
that type of method.
One of the main problems you face is the "windowing" problem I talked about: use some data to train and the rest to test.
The question is how to use the data you have to mine and still get some confidence in your results. For some problems, there's lots of data. Do you need it all? How much should you train on? Test on? When are the results acceptable? When are the results possibly really misleading?
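As a rough illustration of the train/test idea above, here is a minimal Python sketch of a holdout split. The function names (train_test_split, accuracy), the 70/30 default, and the fixed seed are my own assumptions for illustration, not a required method. You are free to use WEKA or any other tool instead.

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

def accuracy(predict, test_rows):
    """Fraction of (features, label) rows where predict(features) == label."""
    if not test_rows:
        raise ValueError("test set is empty")
    correct = sum(1 for features, label in test_rows if predict(features) == label)
    return correct / len(test_rows)
```

One way to explore "how much should you train on?" is to call train_test_split with several values of train_fraction (and several seeds) and watch how the test accuracy moves. If it swings wildly from split to split, that is a sign the result may be misleading.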
EVERYONE MUST DO PROBLEM 1 AND PROBLEM 10 - and 3 additional problems of your choice for a total of 5 problems.
SELECT 3 PROBLEMS from problems 2-9 to work on. Extra credit for any others you do.
(1) Perceptron - what is the equation of the separating surface given the two classes below? PLEASE GIVE THE EQUATION, FOLLOWED BY EITHER THE PROGRAM YOU USED OR ALL THE CALCULATIONS. NO SUPPORT = NO CREDIT.
Class 1
(1,2,3)
(4,5,6)
(7,8,9)
(25,15,20)
Class 2
(-1,2,3)
(-4,-5,-6)
(7,-8,-9)
(-25,-15,-20)
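For reference, the classic perceptron learning rule can be sketched in a few lines of Python. This is only a sketch under my own assumptions: labels are coded as +1 and -1, weights start at zero, the learning rate is 1, and the toy points below are made up for illustration (they are NOT the exam data). The separating surface is then w1*x1 + w2*x2 + w3*x3 + b = 0.

```python
def train_perceptron(points, labels, learning_rate=1.0, max_epochs=1000):
    """Perceptron rule; labels must be +1 or -1. Returns (weights, bias)."""
    weights = [0.0] * len(points[0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * activation <= 0:  # misclassified, or exactly on the surface
                weights = [w + learning_rate * y * xi for w, xi in zip(weights, x)]
                bias += learning_rate * y
                mistakes += 1
        if mistakes == 0:  # a full pass with no errors: the classes are separated
            break
    return weights, bias

# Hypothetical toy points for illustration only (not the exam data).
toy_points = [(1, 1, 1), (2, 3, 1), (-1, -1, -1), (-2, -1, -3)]
toy_labels = [1, 1, -1, -1]
toy_weights, toy_bias = train_perceptron(toy_points, toy_labels)
```

Note that the loop only stops early if the classes are linearly separable. If it runs all the way to max_epochs, check the final pass for mistakes before trusting the equation.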
(2) Bigram / text processing
Given the bigram example from the week 5 homework (I am Sam, Sam I am, I do not like green eggs and ham, etc.):
(A) What word most likely follows the word 'I'? Answer: am
(B) What word most likely follows the answer to (A) above?
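Bigram counting like this can be sketched in Python as follows. The function names are my own, and the corpus here is ONLY the three sentences quoted above. The full week 5 homework corpus has more text (the "etc"), so counts on the full corpus may differ. The sketch also assumes plain whitespace tokenization with no sentence-start or sentence-end markers.

```python
from collections import Counter, defaultdict

def bigram_counts(sentences):
    """Map each word to a Counter of the words that immediately follow it."""
    follows = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for first, second in zip(words, words[1:]):
            follows[first][second] += 1
    return follows

def most_likely_next(follows, word):
    """Most frequent follower of word, or None if word never has a follower."""
    if word not in follows:
        return None
    return follows[word].most_common(1)[0][0]

# Partial corpus: only the sentences quoted in the problem statement.
corpus = ["I am Sam", "Sam I am", "I do not like green eggs and ham"]
follows = bigram_counts(corpus)
```

On this partial corpus, 'I' is followed by 'am' twice and by 'do' once, which agrees with the answer given in (A).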
(3) The movie TITANIC was a media sensation, as you no doubt know. The
movie suggested that the rich people survived and the poor did not.
The actual survival records are given as data in the DATA FOLDER.
Assume you get funded by a TV network to use data mining on the
survival data and report your findings. Was there some pattern as to
who survived and who did not?
(4) Use the k-means algorithm with 5 cluster centers to cluster the ecoli data here:
http://archive.ics.uci.edu/ml/datasets/Ecoli
What are the 5 cluster centers?
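For anyone who wants to see what k-means is doing under the hood, here is a minimal sketch of the standard (Lloyd's) iteration in Python. The function names, the optional init argument, and the toy points are my own assumptions for illustration. The toy points are not the ecoli data, and the example uses k=2 rather than 5 to keep it readable.

```python
import random

def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans(points, k, init=None, seed=0, max_iters=100):
    """Lloyd's algorithm. Returns (centers, assignments)."""
    rng = random.Random(seed)
    # Start from the given centers, or from k randomly chosen data points.
    centers = [tuple(c) for c in (init if init is not None else rng.sample(points, k))]
    assignments = None
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest center.
        new_assignments = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_assignments == assignments:  # nothing moved: converged
            break
        assignments = new_assignments
        # Update step: each center becomes the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # keep the old center if a cluster goes empty
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignments

# Hypothetical toy points for illustration only (not the ecoli data).
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_centers, toy_assignments = kmeans(toy_points, 2, init=[(0, 0), (10, 10)])
```

Keep in mind that k-means can land on different centers depending on where it starts, so it is worth running it from several starting points and saying which run you are reporting and why.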
(5) Use the nearest neighbor algorithm with the ecoli data above, and make your own test set from the data given. Submit your classification results. How well did the nearest neighbor approach work?
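A bare-bones version of nearest neighbor with a held-out test set might look like the Python sketch below. It assumes 1-NN with squared Euclidean distance and rows stored as (features, label) pairs. The function names, toy points, and labels "a" and "b" are made up for illustration and are not the ecoli classes.

```python
def nearest_neighbor_predict(train_rows, query):
    """1-NN: return the label of the training row closest to query."""
    _, best_label = min(
        train_rows,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], query)),
    )
    return best_label

def holdout_accuracy(train_rows, test_rows):
    """Fraction of test rows whose nearest training neighbor has the right label."""
    correct = sum(
        1 for features, label in test_rows
        if nearest_neighbor_predict(train_rows, features) == label
    )
    return correct / len(test_rows)

# Hypothetical toy rows for illustration only (not the ecoli data).
toy_train = [((0, 0), "a"), ((1, 0), "a"), ((5, 5), "b"), ((6, 5), "b")]
toy_test = [((0.5, 0.2), "a"), ((5.5, 5.1), "b"), ((2.0, 2.0), "b")]
toy_accuracy = holdout_accuracy(toy_train, toy_test)
```

Here the third test point sits closer to the "a" group than to its own class, so the toy accuracy is 2 out of 3. That kind of miss is exactly what you should report and try to explain.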
(6) Assume a corporation needs to know the yearly income of a customer
and does not want to annoy the customer by asking this question
directly. Instead, the desire is to ask the consumer some "innocent"
questions that can predict yearly income. The corporation gives you US
Census data and asks you if this can be used to classify people as to
whether they earn above or below $50,000 yearly. What are some
questions that might be asked given this data?
The data is found under ADULT DATABASE in the data link in the DATA FOLDER.
Assume your job is to come up with an automatic way to classify
consumers if given the same data as appears in the data file. Justify
and test your approach. What happened? How accurate were you?
(7) What can you tell me about the MUSHROOM DATA given in the data link in the DATA FOLDER? If you can build a classifier, do that. Do whatever you can with this data and tell me what you did and why. Justify your results and approach.
(8) Arrhythmia is a heart ailment. There is a database of people who either are normal or suffer from some type
of arrhythmia, given in the data link in the DATA FOLDER. Assume your job is to come up with an
automatic way to classify patients if given the same data as appears
in the data file. Justify and test your approach. What happened? How
accurate were you?
(9) Select any data set we have not done from the data link given in the DATA FOLDER and MINE THAT FOR WHATEVER YOU CAN. Your choice: whatever interests you.
(10) EVERYONE MUST DO THIS PROBLEM in addition to Problem 1 and your 3 selected problems:
Given the LABOR data in the WEKA data folder, what are the relationships among the INCREASE-FIRST YEAR, INCREASE-SECOND YEAR, and INCREASE-THIRD YEAR attributes? How do you know these relationships hold? Show me.
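One possible way (certainly not the only one) to look at the relationship between two numeric attributes is the Pearson correlation. The Python sketch below assumes missing values have been read in as None and simply skips any pair where either value is missing. The function name and the toy numbers are made up for illustration and are not taken from the LABOR data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation over the pairs where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy columns for illustration only (not the LABOR data).
toy_first = [2.0, 4.0, None, 3.0]
toy_second = [1.0, 2.0, 5.0, 1.5]
toy_r = pearson(toy_first, toy_second)
```

A correlation near +1 or -1 suggests a strong linear relationship, and one near 0 suggests none. Remember that a number alone is not a justification: say how many complete pairs it was based on and back it up with a plot or another technique.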