Assignment title: Information
FINAL EXAM – DATA MINING
PLEASE: Answer the first two questions by giving me the final weight vectors only, and attach your program at the very END of the final exam answers.
ALL DATA MENTIONED IN THE 'DATA FOLDER' CAN BE LOCATED IN THE LINK GIVEN IN THIS MODULE or in the link in week 1.
DIRECTIONS: You have until the Dropbox closes to submit this exam. NO
late submittals except in very extenuating circumstances such as a car
accident or house fire. No Incompletes are given without a documented reason.
The problems below are for the most part real. Don't expect all kinds of clean
answers and obvious explanations. There may be missing data, and
techniques like J48 may actually run out of attributes and NEVER
create a single homogeneous classification. Be careful. A good answer may actually
be that there is no pattern in the data. Not every dataset has neat patterns - sometimes it's good for management to realize that their data is random, that flipping a coin or guessing is what they need to do, and not to assume there's some
high-payoff simple rule that will make them lots of money and save time. I have not intentionally placed data like that in the exam, but the data is real and may not be well-behaved.
So when you justify your answers, write simple and persuasive English - it doesn't have to be Shakespeare quality. Just explain why you selected a technique, what you did, and what the results tell you. That is perfect.
Here's how this will be graded: There is no "right" answer I am
seeking. I don't have a 'key'. Rather, I am going to grade HOW you went about
solving the problem and whether what you did is reasonable. If you
blindly go applying a technique without any type of analysis - that's reckless
in real life as well as on the final. Instead, you should perhaps run
some statistics, clustering, etc., in an EFFORT to gather some helpful
information BEFORE trying a technique and IN SUPPORT of APPLYING that technique.
Simply giving me an 'answer' without convincing me that your answer is reasonable will not receive credit.
Ask yourself before turning in your exam: Do my answers convince me???
Think in terms of training and test sets. DON'T just run your
techniques on the entire dataset. Instead, when it makes sense to you,
divide your dataset into two pieces (not necessarily equal) and SEE
FOR YOURSELF HOW YOU'RE PERFORMING. If you train on one set of data
and then test on another - and you correctly classify the test set ->
that indicates a high confidence in the result. It's hard to criticize
that type of method.
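To make the train/test idea concrete, here is a minimal sketch in plain Python. The helper names (`train_test_split`, `accuracy`), the 70/30 split, and the toy rows are all assumptions made for illustration; they are not part of the exam data or of any required tool.

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(predict, rows):
    """Fraction of (features, label) rows where predict(features) == label."""
    if not rows:
        return 0.0
    correct = sum(1 for features, label in rows if predict(features) == label)
    return correct / len(rows)

# Toy rows invented for this sketch: one numeric feature, label by parity.
rows = [((i,), "+" if i % 2 == 0 else "-") for i in range(10)]
train, test = train_test_split(rows, test_fraction=0.3, seed=42)

# A hypothetical classifier; in practice it would be built from `train` only.
def parity_rule(features):
    return "+" if features[0] % 2 == 0 else "-"

test_accuracy = accuracy(parity_rule, test)
```

The number that matters is the accuracy on `test`, since those rows were never seen while building the rule.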
One of the main problems you face is the "windowing" problem I talked about - use some data to train and the rest to test - that is, how to use the data you have to mine and still get some confidence in your results. For some problems, there's lots of data. Do you need it all? How much should you train on? Test on? When are the results acceptable? When are the results possibly really misleading?
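One common way to approach the "how much to train on, how much to test on" questions is k-fold cross-validation, where every row is used for testing exactly once. This is a swapped-in technique offered as a sketch, not necessarily the procedure covered in class; the function name `k_fold_indices` is an assumption.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, test_idx) pairs; each row lands in exactly one test fold."""
    indices = list(range(n_rows))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, 3))
```

If the file is sorted (for example by class), shuffle the rows before cutting folds. If accuracy swings widely from fold to fold, that is a hint the results may be misleading.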
1. Consider the following Perceptron Problem:
class +:
(0 0)
(1 0)
class -:
(0 1)
(1 1)
IMPORTANT: Instead of starting the weight vector at w = (0 0), you must start the weight vector at:
w = (-101 200)
(a) what is the final weight vector and
(b) what equation does this vector represent?
2. Same setup as above but start the weight vector at:
w = (1010 35)
(a) what is the final weight vector and
(b) what equation does this vector represent?
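For reference, one way a perceptron run for questions 1 and 2 could be programmed is sketched below. Several choices here are assumptions, not given in the problem statement: no bias term, a learning rate of 1, class + encoded as +1 and class - as -1, an activation of exactly 0 counted as class +, and the points presented in the order listed. A different convention can produce a different final weight vector. Whatever the final w = (w1 w2) turns out to be, under these assumptions it describes the boundary line w1*x1 + w2*x2 = 0.

```python
# Points from the problem; the +1 / -1 encoding is an assumption of this sketch.
POINTS = [((0, 0), 1), ((1, 0), 1), ((0, 1), -1), ((1, 1), -1)]

def predict(w, x):
    """Assumed convention: an activation of exactly 0 counts as class +."""
    activation = w[0] * x[0] + w[1] * x[1]
    return 1 if activation >= 0 else -1

def train_perceptron(points, start, max_epochs=10000):
    """Cycle through the points, updating w on each mistake (learning rate 1)."""
    w = list(start)
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in points:
            if predict(w, x) != y:
                # Move w toward x for class +, away from x for class -.
                w[0] += y * x[0]
                w[1] += y * x[1]
                mistakes += 1
        if mistakes == 0:  # a full pass with no errors: stop
            return tuple(w)
    raise RuntimeError("no error-free pass within max_epochs")

final_w1 = train_perceptron(POINTS, (-101, 200))
final_w2 = train_perceptron(POINTS, (1010, 35))
```

Check the result by confirming that the final vector classifies all four points correctly under the same convention used during training.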
3. Arrhythmia is a heart ailment. There is a database of people who either are normal or suffer from some type
of arrhythmia, given on the data link in the DATA FOLDER week 1. Assume your job is to come up with an automatic way to classify patients if given the same data as appears in the data file. Justify and test your approach. What happened? How accurate were you?
4. Assume a corporation needs to know the yearly income of a customer
and does not want to annoy the customer by asking this question
directly. Instead, the desire is to ask the consumer some "innocent"
questions that can predict yearly income. The corporation gives you US
Census data and asks you if this can be used to classify people as to
whether they earn above or below $50,000 yearly. What are some
questions that might be asked given this data?
Data is found here under ADULT DATABASE in the data link in the DATA FOLDER:
Assume your job is to come up with an automatic way to classify
consumers if given the same data as appears in the data file. Justify
and test your approach. What happened? How accurate were you?
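One possible way to shortlist "innocent" questions is to rank attributes by information gain, the same measure decision-tree learners such as J48 build on. The sketch below is only an illustration: the four rows are invented toy records, not taken from the census file, and the attribute names are assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Drop in entropy of `target` after splitting rows on `attribute`."""
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder

# Invented toy rows, NOT real census records.
toy = [
    {"education": "Bachelors", "workclass": "Private", "income": ">50K"},
    {"education": "Bachelors", "workclass": "State-gov", "income": ">50K"},
    {"education": "HS-grad", "workclass": "Private", "income": "<=50K"},
    {"education": "HS-grad", "workclass": "State-gov", "income": "<=50K"},
]
gain_education = information_gain(toy, "education", "income")
gain_workclass = information_gain(toy, "workclass", "income")
```

In this toy table `education` separates the classes perfectly and `workclass` tells you nothing; on the real data the gains will be far less clear-cut, so measure them on a training portion and confirm on held-out rows.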
5. In the data folder week 1, there is a wine quality database that judges white and red wine based on some chemical tests. Assume there is a wine importer business that discovers that customers will buy wines of quality 6 and above but won't buy wines rated 5 and below. Mine the wine quality data for WHITE WINE ONLY and tell me how to determine if a wine is "bad" quality and answer this question: Should I buy $500,000 worth of the wine measured below:
6;0.33;0.18;3;0.036;5;85;0.99125;3.28;0.4;11.5
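As a starting point, the measurement line above can be parsed into named fields as sketched below. The column names and their order are an assumption: they follow the header of the publicly distributed white wine quality file, and you should confirm them against the file in the data folder. The function names are hypothetical.

```python
# Assumed column order (check against the header of the actual data file).
FEATURES = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide",
    "density", "pH", "sulphates", "alcohol",
]

SAMPLE = "6;0.33;0.18;3;0.036;5;85;0.99125;3.28;0.4;11.5"

def parse_sample(line):
    """Split a semicolon-separated line into a {feature name: value} dict."""
    values = [float(v) for v in line.split(";")]
    if len(values) != len(FEATURES):
        raise ValueError("expected %d values, got %d" % (len(FEATURES), len(values)))
    return dict(zip(FEATURES, values))

def is_sellable(quality):
    """Importer's rule from the question: quality 6 and above sells."""
    return quality >= 6

wine = parse_sample(SAMPLE)
```

The sample has 11 values and no quality score, so the leading 6 is read here as the first measurement, not as a rating. Predicting the rating is the job of whatever model you train on the labeled rows.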
6. Look at the Parkinson's data set in the folder for week 1. Parkinson's disease is a horrible disease that disables and often kills many people each year. The STATUS attribute tells whether the person has Parkinson's disease or not. Assume your company wants to automate the detection process and create a machine that will take these identical measurements on a person and then pass through the 'normal' people but refer the others, presumably with a strong indication of Parkinson's disease, to a physician. Can you create a detector based on this data that would be accurate (with respect to missed detections and false alarms)? Try to create such a detector and test it. What do you think? Is this detector worth selling or not? What would you tell a family member who got a diagnosis of Parkinson's from such a detector?
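Missed detections and false alarms can be counted separately from plain accuracy, as in the sketch below. It assumes STATUS is coded 1 for Parkinson's and 0 for healthy (verify this in the data file); the function name and the six toy labels are invented for illustration.

```python
def detection_rates(actual, predicted, positive=1):
    """Return missed-detection rate, false-alarm rate, and accuracy."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1  # sick, referred to physician
        elif a == positive:
            fn += 1  # sick, passed through: a missed detection
        elif p == positive:
            fp += 1  # healthy, referred: a false alarm
        else:
            tn += 1  # healthy, passed through
    total = tp + fp + tn + fn
    return {
        "missed_detection_rate": fn / (tp + fn) if (tp + fn) else 0.0,
        "false_alarm_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
    }

# Invented toy labels: four patients with the disease, two without.
actual = [1, 1, 1, 1, 0, 0]
predicted = [1, 1, 1, 0, 1, 0]
rates = detection_rates(actual, predicted)
```

A detector can score a high accuracy while still missing sick patients when one class dominates the data, which is why the two error rates are worth reporting on their own.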
7. Take a look at the student alcohol consumption dataset in the week 1 data folder. Student alcohol abuse is a national health problem almost everywhere in the developed world in one way or another. Tell me everything you think might be useful to policy makers (mayors, school officials, parents, anyone!) that might help them understand what this data might be telling them. What useful knowledge lurks in that data, if any? To receive credit you must explicitly detail your findings.