Assignment title: Information

Question 1 [Points 20]

a. [Points 8] Using a large training data for a community of professionals, a linear equation was developed to predict salary in terms of years of experience. It is given as: salary (in $K) = 30 + 2.5 * Experience (in years). For the given three individual professionals, determine the predicted salary and the mean squared error (MSE) over the three.

Professional#   Experience (years)   Actual Salary ($K)
1               2                    38
2               4                    38
3               6                    50

b. [Points 6] Given the following regression tree, predict the salary for the following three professionals:

Professional#   Name   Education   Experience (years)
1               John   BS          10
2               Jane   PhD         5
3               Jim    HS          25

EQ1: Salary = 15 + 2 * Experience
EQ2: Salary = 25 + 3 * Experience
EQ3: Salary = 35 + 4 * Experience
EQ4: Salary = 50 + 3 * Experience
EQ5: Salary = 60 + 4 * Experience
EQ6: Salary = 70 + 5 * Experience

[Figure: regression tree. Root node Education with branches HS, BS, and MS/PhD; two Experience nodes, one splitting on <5, 5-10, >10 and one splitting on <8, >7; leaves EQ1 through EQ6.]

c. [Points 6] Given the following rules with exception, predict the salary of the following three persons:

Professional#   Name   Education   Experience (years)
1               John   BS          10
2               Jane   PhD         5
3               Jim    HS          25

Default: salary = 40 + 3 * experience
Except if education >= BS and experience < 8 then salary = 60 + 4 * experience
Except if education = PhD then salary = 80 + 5 * experience
else if experience > 10 then salary = 50 + 5 * experience
except if education >= BS then salary = 70 + 7 * experience;

Question 2 [Points 20]

a. [Points 8] Given the following data, derive and show a table similar to Table 4.2 (for Naïve Bayes). Using this table, predict the acceptability probabilities for the instance with Price=L, Capacity=7, and Safety=M.

Instance#   Price   Capacity   Safety   Acceptability
1           L       4          High     Good
2           H       2          High     Bad
3           M       4          Medium   Good
4           L       7          High     Good
5           L       7          Low      Bad
6           H       7          High     Bad
7           M       2          Medium   Good
8           H       4          High     Good
9           L       2          Medium   Good
10          M       7          Low      Bad

Table to complete:

Price                Capacity             Safety               Acceptability
      Bad   Good           Bad   Good           Bad   Good     Bad   Good
L                    2                    L
M                    4                    M
H                    7                    H
L                    2                    L
M                    4                    M
H                    7                    H

b.
[Points 6] For the above training data (with price, capacity, safety, and acceptability), determine the root of the decision tree. Show your work.

c. [Points 6] Given the following set of training instances, find the nearest training instance for the unknown instance , and predict its health. Show your work. (Hint: Use Euclidean distance with normalized attributes)

Inst#   Age   Height   Weight   Health
1       35    5.6      175      Excellent
2       55    6.0      150      Good
3       50    5.8      200      Okay
4       65    5.5      175      Good
5       45    5.6      190      Okay

Question 3 [Points 20]

a. [Points 6] For the following three test instances, the predicted probabilities and the actual outcome (Health) were given. For each instance, determine the quadratic loss function (QLF) and informational loss function (ILF).

                                Predicted Prob.
Inst#   Age   Height   Weight   Okay   Good   Excellent   Actual
1       35    5.6      175      0.4    0.4    0.2         Okay
2       55    6.0      150      0.3    0.2    0.5         Good
3       50    5.8      200      0.1    0.6    0.3         Excellent

b. [Points 5] To test the efficacy of a new medical test to diagnose a disease, a group of 2500 volunteers was given the test. Out of these, only 1500 had the disease. The test identified 1750 as having the disease. Out of these, only 1000 really had the disease. From this, determine the sensitivity and the specificity of the diagnostic test.

c. [Points 5] Given the following data collected from a survey of customers regarding a new product, determine the ROC curve expressed as a table.

Customer#   Predicted Prob (Yes)   Actual
1           0.85                   Yes
2           0.50                   No
3           0.95                   No
4           0.99                   Yes
5           0.45                   Yes
6           0.97                   No
7           0.80                   Yes
8           0.6                    No
9           0.75                   Yes
10          0.7                    No

d. [Points 4] Given the following data with actual outcome and predicted outcome, determine the mean-absolute error and relative-absolute error.

Instance#   Actual   Predicted
1           2.5      3.0
2           4.0      4.3
3           3.5      2.5
4           5.0      3.0

Question 4 [Points 20]

a. [Points 6] Given the following two clusters (C1 and C2), determine the distance between the two clusters using (i) Single-linkage method (ii) Centroid-linkage method.
(Hint: Normalize the attributes)

Instance#   Age   GPA   Salary ($K)   Cluster#
1           25    3.8   50            C1
2           30    3.6   65            C1
3           23    4.0   40            C1
4           35    3.0   70            C2
5           55    3.3   90            C2

b. [Points 6] Given the following Bayesian network, determine the probability that the performance of the unknown candidate is predicted to be Good, Average, or Poor. Unknown candidate:

Performance
Good = 0.25   Average = 0.45   Poor = 0.3

Experience
Performance   Low   Medium   High
Good          0.2   0.3      0.5
Average       0.4   0.4      0.2
Poor          0.6   0.3      0.1

GPA
Performance   A      B      C or D
Good          0.35   0.35   0.3
Average       0.3    0.4    0.3
Poor          0.1    0.3    0.6

Salary
Performance   Experience   Low   Medium   High
Good          Low          0.4   0.4      0.2
Good          Medium       0.2   0.4      0.4
Good          High         0.1   0.3      0.6
Average       Low          0.5   0.4      0.1
Average       Medium       0.3   0.5      0.2
Average       High         0.2   0.4      0.4
Poor          Low          0.7   0.2      0.1
Poor          Medium       0.5   0.3      0.2
Poor          High         0.4   0.3      0.3

c. [Points 8] Given the following data and the rule Color=Y and Size=S and Act=S and Age=A => Inflated=Yes (i) Determine the support and accuracy of the rule. (ii) Prune the rule so its support is at least 3 and accuracy is 75%. Show your work.

Instance#   Color   Size   Act   Age   Inflated (outcome)
1           Y       S      S     A     Yes
2           Y       S      S     C     Yes
3           Y       S      D     A     Yes
4           Y       S      D     C     Yes
5           Y       L      S     A     Yes
6           Y       L      S     C     No
7           Y       L      D     A     No
8           Y       L      D     C     No
9           P       S      S     A     Yes
10          P       S      S     C     No
11          P       S      D     A     No
12          P       S      D     C     No
13          P       L      S     A     Yes
14          P       L      S     C     No
15          P       L      D     A     No
16          P       L      D     C     No

Question 5 [Points 20]

a. [Points 6] Given the following seven instances of numeric data for an attribute along with the corresponding outcome, suggest the first point that could be used to divide the range using the entropy method. Show your work. (Each instance is shown as a pair of attribute value and the outcome (T or F).)

<80, T>, <20, T>, <50, F>, <65, F>, <40, F>, <25, T>, <110, T>

b. [Points 6] Given the following code for the output Grade (A, B, C, D, F), determine (i) The Hamming distance of the code (ii) How many errors can this code correct?

Class   Class vector
A       11100111
B       00011000
C       10101010
D       01010101
F       00110011

c.
[Points 4] Show one way to transform the following two attributes: Gender (M, F) and GPA (A, B, C, D, F) into a single attribute. Justify.

d. [Points 4] A dataset with instances containing 8 attributes has been transformed using Principal Component Analysis (PCA). The results are as follows. Determine the principal components that we need to choose so as to capture a variance of at least 97%.

Component   Variance
1           61%
2           20%
3           12%
4           2%
5           1.5%
6           1.3%
7           1.2%
8           1.0%

Question 1 [Points 20]

An employer collected data from employees' background and performance, and chose three different models to represent this information. Using each of three tables, predict the estimated salary (in K) for two new applicants with the following qualifications: (i) Education=BS, GPA=3.5, Experience=2 years (ii) Education=HS, GPA=2.5, Experience=10 years.

(a) Linear model: Salary = Education * 5 + GPA * 10 + Experience * 5, where Education HS=1, BS=3, MS=4, PhD=6.

(b) Model tree: