Assignment title: Information

FINAL EXAM – DATA MINING

ALL DATA MENTIONED IN THE 'DATA FOLDER' CAN BE LOCATED IN THE LINK GIVEN IN THIS MODULE or in the link in week 1.

DIRECTIONS: You have until the Dropbox closes to submit this exam. NO late submittals unless there are very extenuating circumstances such as a car accident or house fire. No Incompletes given without a documented reason.

The problems below are for the most part real. Don't expect all kinds of clean answers and obvious explanations. There may be missing data, and techniques like J48 may actually run out of attributes and NEVER create a single homogeneous classification. Be careful. A good answer may actually be that there is no pattern in the data. Not every dataset has neat patterns. Sometimes it's good for management to realize that their data is random, that flipping a coin or guessing is what they need to do, and that they should not assume there's some high-payoff simple rule that will make them lots of money and save time. I have not intentionally placed data like that in the exam, but the data is real and may not be well-behaved.

So when you justify your answers, write simple and persuasive English. It doesn't have to be Shakespeare quality. Just explain why you selected a technique, what you did, and what the results tell you. That is perfect.

Here's how this will be graded: There is no "right" answer I am seeking. I don't have a 'key'. Rather, I am going to grade HOW you went about solving the problem and whether what you did is reasonable. If you blindly go applying a technique without any type of analysis, that's reckless in real life as well as on the final. Instead, you should perhaps run some statistics, clustering, etc., in an EFFORT to gather some helpful information BEFORE trying a technique and IN SUPPORT of APPLYING a technique. Simply giving me an 'answer' without convincing me that your answer is reasonable will not receive credit. Ask yourself before turning in your exam: Do my answers convince me?

Think in terms of training and test sets.
DON'T just run your techniques on the entire dataset. Instead, when it makes sense to you, divide your dataset into two pieces (not necessarily equal) and SEE FOR YOURSELF HOW YOU'RE PERFORMING. If you train on one set of data and then test on another, and you correctly classify the test set, that indicates high confidence in the result. It's hard to criticize that type of method. One of the main problems you face is the "windowing" problem I talked about: use some data to train and the rest to test, so that you can mine the data you have and still get some confidence in your results. For some problems, there's lots of data. Do you need it all? How much should you train on? Test on? When are the results acceptable? When are the results possibly really misleading?

EVERYONE MUST DO PROBLEM 1 AND PROBLEM 10, plus 3 additional problems of your choice, for a total of 5 problems. SELECT 3 PROBLEMS from problems 2-9 to work on. Extra credit for any others you do.

(1) Perceptron: What is the equation of the separating surface given the two classes below? PLEASE GIVE THE EQUATION FOLLOWED BY EITHER THE PROGRAM YOU USED OR ALL THE CALCULATIONS. NO SUPPORT = NO CREDIT.

Class 1: (1,2,3) (4,5,6) (7,8,9) (25,15,20)
Class 2: (-1,2,3) (-4,-5,-6) (7,-8,-9) (-25,-15,-20)

(2) Bigram / text processing: Given the bigram example from the week 5 homework (I am Sam, Sam I am, I do not like green eggs and ham, etc.):
(A) What word most likely follows the word 'I'? Answer: am
(B) What word most likely follows the answer to (A) above?

(3) The movie TITANIC was a media sensation, as you no doubt know. The movie suggested that the rich people survived and the poor did not. The actual survival records are given as data in the DATA FOLDER. Assume you get funded by a TV network to use data mining on the survival data and report your findings. Was there some pattern as to who survived and who did not?

(4) Use the k-means algorithm with 5 cluster centers to cluster the ecoli data here: http://archive.ics.uci.edu/ml/datasets/Ecoli What are the 5 cluster centers?

(5) Use the nearest neighbor algorithm with the ecoli data above and make your own test set from the data given. Submit your classification results. How well did the nearest neighbor approach work?

(6) Assume a corporation needs to know the yearly income of a customer and does not want to annoy the customer by asking this question directly. Instead, the desire is to ask the consumer some "innocent" questions that can predict yearly income. The corporation gives you US Census data and asks you if this can be used to classify people as to whether they earn above or below $50,000 yearly. What are some questions that might be asked given this data? Data is found under ADULT DATABASE in the data link in the DATA FOLDER. Assume your job is to come up with an automatic way to classify consumers if given the same data as appears in the data file. Justify and test your approach. What happened? How accurate were you?

(7) What can you tell me about the MUSHROOM DATA given in the data link in the DATA FOLDER? If you can build a classifier, do that. Do whatever you can with this data and tell me what you did and why. Justify your results and approach.

(8) Arrhythmia is a heart ailment. There is a database of people who either are normal or suffer from some type of arrhythmia, given on the data link in the DATA FOLDER. Assume your job is to come up with an automatic way to classify patients if given the same data as appears in the data file. Justify and test your approach. What happened? How accurate were you?

(9) Select any data set we have not done in the data link given in the DATA FOLDER and MINE THAT FOR WHATEVER YOU CAN. Your choice: whatever interests you.

(10) EVERYONE MUST DO THIS PROBLEM in addition to your other 4 problems: Given the LABOR data in the WEKA data folder, what are the relationships among the INCREASE-FIRST YEAR, INCREASE-SECOND YEAR, and INCREASE-THIRD YEAR attributes? How do you know these relationships? Show me.
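
The training/test advice in the directions can be sketched in a few lines of Python. This is only a minimal illustration, assuming each record is a tuple whose last field is the class label. The function names (holdout_split, majority_label, accuracy) and the toy rows are made up for this sketch; they are not exam data and not part of WEKA. The sketch holds out a fraction of the rows, then scores a "guess the most common class" baseline on the held-out part.

```python
import random
from collections import Counter


def holdout_split(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]


def majority_label(train):
    """Most common class label in the training rows (label is the last field)."""
    return Counter(row[-1] for row in train).most_common(1)[0][0]


def accuracy(predict, test):
    """Fraction of test rows whose label equals predict(features)."""
    correct = sum(1 for row in test if predict(row[:-1]) == row[-1])
    return correct / len(test)


if __name__ == "__main__":
    # Hypothetical toy rows: (feature_1, feature_2, label). Not exam data.
    rows = [(i, i % 3, "yes" if i % 4 else "no") for i in range(40)]
    train, test = holdout_split(rows, test_fraction=0.25, seed=1)
    guess = majority_label(train)
    print("train size:", len(train), "test size:", len(test))
    print("baseline accuracy:", accuracy(lambda features: guess, test))
```

A classifier that cannot beat this baseline on the held-out rows has probably not found a real pattern, which is itself a result worth reporting.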