

Knowledge Discovery in Large Data Sets Assignment

Q1. (MIT) A Naïve Bayes classifier has been constructed with ten variables. A particular case that has to be classified has information on only eight variables. How would you use the classifier for this case?

Q2. (MIT) A dataset of 1000 cases was partitioned into a training set of 600 cases and a validation set of 400 cases. A k-Nearest Neighbors model with k = 1 had a misclassification error rate of 8% on the validation data. It was subsequently found that the partitioning had been done incorrectly: 100 cases from the training set had been accidentally duplicated and had overwritten 100 cases in the validation set. What is the misclassification error rate for the 300 cases that were truly part of the validation data?

Q3. (Stanford University) Below is a table of similarities between five points.

        P1    P2    P3    P4    P5
  P1  1.00  0.92  0.35  0.22  0.21
  P2  0.92  1.00  0.61  0.44  0.16
  P3  0.35  0.61  1.00  0.37  0.10
  P4  0.22  0.44  0.37  1.00  0.33
  P5  0.21  0.16  0.10  0.33  1.00

(a) Cluster these five points using complete linkage. Draw a dendrogram to depict the clustering you obtain.
(b) Some clustering algorithms only work on distance matrices. A standard way to convert similarity matrices S_ij to distances D_ij is as follows: [conversion formula]. What if you had used the conversion [alternative formula]? Will this result in the same clustering?
(c) Repeat (b) using group average linkage instead of complete linkage.

Q4. (Stanford University) Suppose that your friend, a postdoc in biology, asks for your help in understanding an algorithm to detect novel cancer cell types based on point data in p dimensions, where each point might represent different features of a cell. Your friend tells you: novel cell types are characterized by being isolated from the generally large group of "normal" cells in the data; however, smaller (but not tiny) clusters of cells away from the large "normal" cluster are probably not novel cancer cell types.
Based on some already published results, your friend also tells you: "The probability I mistakenly label a truly non-novel cancer cell as novel is about 1%, while the probability I label a truly novel cancer cell as novel is 90%. I also know that, in my samples, 95% of the cells are not novel cancer cells."

(a) Suppose an experiment generates 10000 cells. Write down an "expected" confusion matrix for this algorithm.
(b) If positive means novel, what is the TPR (true positive rate) of this particular algorithm?
(c) If you were given the data, i.e. a data matrix X along with labels Y indicating whether each observation is novel or not, how might you find a model that would allow you to find novel cancer cells?

Q5. (Queen's School of Computing) The problem of email spam has been almost completely solved, but it is a useful surrogate for other interesting problems. In this question, the goal is to build a spam prediction system that could be integrated into your favourite email environment. Explain the design of your complete system. You should include discussion of the following points:
• The cold start problem: you rely on the user labelling some emails as spam and, by default, labelling the rest of the emails as normal. However, at the beginning you have very few examples of spam.
• The need to update the predictor regularly, in principle after every email.
• How many classes you would use for prediction and why.
• How you would assess how well you have succeeded at the task.
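One way to prototype Q5's "update after every email" requirement is an incrementally updated (online) Naive Bayes classifier: learning a new labelled email is just a few counter increments, so no retraining pass is needed. This is a minimal sketch, not a complete system; the class name and the toy token lists below are illustrative inventions, not part of the question.

```python
from collections import defaultdict
import math

class OnlineNaiveBayes:
    """Multinomial Naive Bayes over word tokens, updated one email at a time."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing                       # Laplace smoothing
        self.class_counts = defaultdict(int)             # emails seen per class
        self.token_counts = defaultdict(lambda: defaultdict(int))
        self.totals = defaultdict(int)                   # tokens seen per class
        self.vocab = set()

    def update(self, tokens, label):
        """Incorporate one labelled email; O(len(tokens)) counter updates."""
        self.class_counts[label] += 1
        for t in tokens:
            self.token_counts[label][t] += 1
            self.totals[label] += 1
            self.vocab.add(t)

    def predict(self, tokens):
        """Return the class with the highest log-posterior for this email."""
        best, best_lp = None, -math.inf
        total_emails = sum(self.class_counts.values())
        for c in self.class_counts:
            lp = math.log(self.class_counts[c] / total_emails)  # log prior
            for t in tokens:
                num = self.token_counts[c][t] + self.smoothing
                den = self.totals[c] + self.smoothing * len(self.vocab)
                lp += math.log(num / den)                       # log likelihood
            if lp > best_lp:
                best, best_lp = c, lp
        return best

# Toy usage with made-up emails (hypothetical data):
clf = OnlineNaiveBayes()
clf.update(["win", "money", "now"], "spam")
clf.update(["meeting", "tomorrow"], "normal")
clf.update(["free", "money"], "spam")
print(clf.predict(["money", "now"]))   # classifies as "spam" on this toy data
```

A real system would still need to address the other bullets: seeding the spam class during cold start (e.g. with a shared starter corpus), choosing the class set, and evaluating on held-out mail with metrics that penalise false positives appropriately.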