Issue 18/08/2011 Revision Date 11/06/2013 Form No. ECT/AC/F.05.02
Assignment #1
Date: /5/2017 Time: PM Total Mark: 10
Student’s Name
Student’s ID
Course Name Business Intelligence
Course Code BIT405
Semester Spring 2017
Instructor’s Name Dr. Myriam Bounhas, Dr. Walaa Saber, Dr. Bilel Elayeb
Scope and Focus:
● Classification Methods
● The Naive Rule
● Naive Bayes
● k-Nearest Neighbors
Contributing to the following CLOs:
CLO #1 Describe the role of Business Intelligence in an organization.
CLO #2 Understand the data mining process and its related issues.
CLO #3 Create, evaluate and apply different intelligence models.
Questions Part A Part B Total
Point 7 5 12
Student Mark
Note: This Assignment accounts for 10% of the student’s final grade.
Part A: (7 marks)
K-Nearest Neighbors (KNN) is a supervised learning algorithm in which a new query instance is classified by a majority vote among its K nearest neighbors. The purpose of the algorithm is to classify a new object from its attributes and a set of training samples: KNN uses the classes found in the neighborhood as the prediction for the new query instance.
We consider the following data, collected from a questionnaire survey (asking people's opinions) and objective testing, with two attributes (X1: Acid Durability and X2: Strength), to classify whether a special paper tissue is Good or Bad. We use k = 3 nearest neighbors.
The factory now produces a new paper tissue that passed the laboratory tests with X1 = 5 and X2 = 9. Without running another expensive survey, can we guess the classification of this new tissue? Fortunately, the KNN algorithm can predict it. Table 1 presents the six training samples.
X1: Acid Durability (seconds) | X2: Strength (kg/square meter) | Y: Classification
9 | 9 | Bad
9 | 6 | Bad
5 | 6 | Bad
3 | 6 | Good
4 | 5 | Good
2 | 4 | Good
5 | 9 | ?
Table 1: Training data
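As a sanity check on the hand calculations below, the Euclidean distances from each training sample in Table 1 to the query instance (5, 9) can be computed with a few lines of Python (this is an illustrative sketch, not part of the required answer):

```python
import math

# Training samples from Table 1: (X1 acid durability, X2 strength, Y class)
samples = [
    (9, 9, "Bad"), (9, 6, "Bad"), (5, 6, "Bad"),
    (3, 6, "Good"), (4, 5, "Good"), (2, 4, "Good"),
]
query = (5, 9)  # the new paper tissue

# Euclidean distance d = sqrt((x1 - 5)^2 + (x2 - 9)^2) for every sample
distances = [math.sqrt((x1 - query[0]) ** 2 + (x2 - query[1]) ** 2)
             for x1, x2, _ in samples]
for (x1, x2, y), d in zip(samples, distances):
    print(f"({x1}, {x2})  class={y}  distance={d:.3f}")
```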
1) Calculate the Euclidean distance between the query instance and each training sample. Insert the values in Table 2 and show the details of your calculations. (1.5 marks, 0.25 for each value)
X1: Acid Durability (seconds) | X2: Strength (kg/square meter) | Euclidean distance to the query instance (5, 9)
9 9
9 6
5 6
3 6
4 5
2 4
Table 2: Euclidean distances between the query instance and the training samples
……………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………
…………………………………………………………………………………………………………………………………………………………………………………………………………………………
……………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………
2) Sort the distances and determine the nearest neighbors from the k smallest distances. Insert the values in Table 3. (3 marks, 0.25 for each value)
X1: Acid Durability (seconds) | X2: Strength (kg/square meter) | Euclidean distance to the query instance (5, 9) | Rank of minimum distance | Included in the 3 nearest neighbors? (Yes/No)
9 9
9 6
5 6
3 6
4 5
2 4
Table 3: Selection of the 3 nearest neighbors
3) Gather the category Y of each nearest neighbor. Insert the values in Table 4 and justify your answer. (1.5 marks, 0.25 for each value)
X1: Acid Durability (seconds) | X2: Strength (kg/square meter) | Euclidean distance to the query instance (5, 9) | Rank of minimum distance | Included in the 3 nearest neighbors? (Yes/No) | Y = category of nearest neighbor
9 9
9 6
5 6
3 6
4 5
2 4
Table 4: Categories of the 3-nearest neighbors
…………………………………………………………………………………………………………………………………………………………………………………………………………………………
…………………………………………………………………………………………………………………………………………………………………………………………………………………………
…………………………………………………………………………………………………………………………………………………………………………………………………………………………
4) Use a simple majority vote over the categories of the nearest neighbors as the prediction for the query instance. (2 marks)
…………………………………………………………………………………………………………………………………………………………………………………………………………………………
…………………………………………………………………………………………………………………………………………………………………………………………………………………………
…………………………………………………………………………………………………………………………………………………………………………………………………………………………
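The whole Part A procedure (distance, ranking, 3-NN selection, majority vote) can be sketched end to end in Python; this is an illustrative check using the Table 1 data, not a substitute for the worked answers above:

```python
import math
from collections import Counter

# Table 1 training samples: (X1, X2, Y)
samples = [
    (9, 9, "Bad"), (9, 6, "Bad"), (5, 6, "Bad"),
    (3, 6, "Good"), (4, 5, "Good"), (2, 4, "Good"),
]
query, k = (5, 9), 3

# Rank the samples by Euclidean distance to the query instance
ranked = sorted(samples, key=lambda s: math.dist(s[:2], query))

# Keep the k nearest neighbours and take a simple majority vote on Y
neighbours = ranked[:k]
votes = Counter(y for _, _, y in neighbours)
prediction = votes.most_common(1)[0][0]
print("3-NN prediction for (5, 9):", prediction)
```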
Part B: (5 marks)
Let us consider the training data below, dealing with the "eye disease" problem, to learn a Naive Bayes classifier.
The goal is to classify a new record, R11: (Pre-presbyopic, Hypermetrope, Yes, Reduced), as "NonContact", "Soft Contact", or "Hard Contact".
For this purpose you have to calculate P(NonContact | R11), P(Hard Contact | R11), and P(Soft Contact | R11).
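The Naive Bayes machinery needed here (class priors from class frequencies, conditional probabilities from per-class attribute counts, and a score P(c) × Π P(xᵢ | c) for each class) can be sketched in Python. Since the assignment's training table (records R1–R10) is not reproduced in this copy, the sketch below uses a small hypothetical dataset with the same attribute scheme purely for illustration; the actual answers must be computed from the course's table:

```python
from collections import Counter

# HYPOTHETICAL records for illustration only -- NOT the assignment's
# R1-R10 table: (age, prescription, astigmatic, tear_rate, class)
records = [
    ("Young", "Myope", "No", "Normal", "Soft Contact"),
    ("Young", "Hypermetrope", "Yes", "Reduced", "NonContact"),
    ("Pre-presbyopic", "Myope", "Yes", "Normal", "Hard Contact"),
    ("Pre-presbyopic", "Hypermetrope", "No", "Reduced", "NonContact"),
    ("Presbyopic", "Myope", "No", "Normal", "Soft Contact"),
    ("Presbyopic", "Hypermetrope", "Yes", "Reduced", "NonContact"),
]

# Class priors: P(c) = count(c) / N
n = len(records)
priors = {c: cnt / n for c, cnt in Counter(r[-1] for r in records).items()}

# Conditional probability P(attribute_i = value | class)
def conditional(i, value, cls):
    in_class = [r for r in records if r[-1] == cls]
    return sum(1 for r in in_class if r[i] == value) / len(in_class)

# Naive Bayes score for a new record: P(c) * prod_i P(x_i | c)
def score(new, cls):
    s = priors[cls]
    for i, v in enumerate(new):
        s *= conditional(i, v, cls)
    return s

new = ("Pre-presbyopic", "Hypermetrope", "Yes", "Reduced")
scores = {c: score(new, c) for c in priors}
prediction = max(scores, key=scores.get)
print(scores, "->", prediction)
```

The class with the largest score is assigned to the new record, exactly as questions 1–3 below ask you to do by hand with the real table.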
1. Compute the conditional probabilities and class priors for each class label in the training set. (3.75 marks, 0.25 for each value)
…………………………………………………………………………………………………………………………………………………………………………………………………………………………
……………………………………………………………………………………………………………
…………………………………………………………………………………………………………………………………………………………………………………………………………………………
……………………………………………………………………………………………………………
…………………………………………………………………………………………………………………………………………………………………………………………………………………………
……………………………………………………………………………………………………………
…………………………………………………………………………………………………………………………………………………………………………………………………………………………
……………………………………………………………………………………………………………
2. Compute the probability of assigning each class label to the new record. (0.75 marks, 0.25 for each value)
Class Label = Soft Contact:
…………………………………………………………………………………………………………………………………………………………………………………………………………………………
……………………………………………………………………………………………………………
Class Label = Hard Contact:
…………………………………………………………………………………………………………………………………………………………………………………………………………………………
……………………………………………………………………………………………………………
Class Label = NonContact:
…………………………………………………………………………………………………………………………………………………………………………………………………………………………
……………………………………………………………………………………………………………
3. Which class should be assigned to the new record? Justify your answer. (0.5 mark)
…………………………………………………………………………………………………………………………………………………………………………………………………………………………
……………………………………………………………………………………………………………