Assignment title: Information


1. The submission must be well presented: clearly presented charts, clearly labelled question parts, well-constructed code that is easy for the reader to follow, and anything else that makes the solution easy to read.
2. All comments must be written out and explained in detail.
3. The submission must be separated into two documents: Word and R Markdown.

Question 1

In this question, you will apply cluster analysis methods to the protein data that you graphed in the visualisation part of the unit. The available data consists of the percentage of protein obtained from various food groups in European countries in 1973. The variables, which are percentages, are named for their food groups:

RedMeat, WhiteMeat, Eggs, Milk, Fish, Cereals, Starch, Nuts, Fr.Veg

The data file proteinclus.csv contains the values of these variables for 25 European countries in 1973.

(a) Standardise the variables, and apply K-means cluster analysis with centres from 2 to 10, and nstart = 10. Provide a graph of the percentage of explained variance as a function of the number of clusters. Comment on the shape of the graph. Based on the "elbow" rule, does this graph give a clear indication of the number of clusters that should be chosen? What number of clusters would you choose on this basis?

(b) Use an R command to produce 5 clusters based on this data. Provide the output of this command in your document and also provide a listing of the countries in each cluster. Determine the average values for each (unstandardised) variable in each cluster, and hence determine and tabulate the average values in each cluster of the three combined variables:

RedMeat + WhiteMeat + Fish
Eggs + Milk
Cereals + Starch + Nuts + Fr.Veg

Characterise and interpret the clusters based on the results in these tables. Can the clustering outcomes be explained by just the three combined variables listed above?

(c) To supplement your analysis in (b), choose and plot three different graphs to illustrate the division into groups determined in (b). These should be scatterplots of each observation coloured by cluster. Discuss what the graphs show.

Question 2

In this question you will apply classification methods to wine types. The available data, obtained by chemical analysis, includes the quantities of thirteen different constituents, namely:

Alcohol, Malic acid, Ash, Alkalinity of ash, Magnesium, Total phenols, Flavanoids, Nonflavanoid phenols, Proanthocyanins, Color intensity, Hue, OD280/OD315 of diluted wines, Proline

The chemical analysis was applied to 178 Italian wines, and the results are available in the file wine.csv. The wines are known to belong to one of three different classes (cultivated varieties), and the variety to which each wine belongs is given in the first column of the data file, coded as 1, 2 or 3. Make sure that the class variable is treated as a factor. Set a seed equal to 2.

(a) Divide the available data into training and test data as was done in the workshops. Use 138 observations in the training set, and reserve the remaining 40 for the test set.

(b) Use the command tree in the package tree to construct a classification tree for the wine cultivars, allowing the tree to grow large. Show the tree, including its text labels.

(c) Then use cross-validation to choose the appropriate tree size, using misclassification as the measure of leaf heterogeneity. Supply a graph of deviance against size, and based on this graph, discuss what tree size should be used.

(d) Decide what size of tree you will use, produce and show that tree, and determine the misclassification rate in the training and the test data sets.

(e) Calculate the centroid of the whole data set, and determine how your tree classifies this (made-up) data point. Comment briefly on the classification.
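As an illustration of the Question 1 workflow, the following is a minimal R sketch rather than a model answer. It assumes that "percentage of explained variance" means the ratio of between-cluster to total sum of squares reported by kmeans. Because proteinclus.csv is not reproduced here, the sketch runs on random stand-in data that only borrows the variable names; the row labels Country1 to Country25, the file layout in the commented read.csv call, the names Meat.Fish, Eggs.Milk and Plant, and the choice of RedMeat against Cereals for the plot are all assumptions made for illustration.

```r
# Stand-in data so the sketch runs on its own: random percentages under the
# assignment's column names. Replace with the real file, e.g.
#   protein <- read.csv("proteinclus.csv", row.names = 1)  # layout assumed
set.seed(1)
vars <- c("RedMeat", "WhiteMeat", "Eggs", "Milk", "Fish",
          "Cereals", "Starch", "Nuts", "Fr.Veg")
protein <- as.data.frame(matrix(runif(25 * length(vars), 0, 30), nrow = 25,
                                dimnames = list(paste0("Country", 1:25), vars)))

# Standardise each variable to mean 0 and standard deviation 1
z <- scale(protein)

# (a) K-means for 2 to 10 centres; record between-cluster SS / total SS
ks <- 2:10
pct <- sapply(ks, function(k) {
  fit <- kmeans(z, centers = k, nstart = 10)
  100 * fit$betweenss / fit$totss
})
plot(ks, pct, type = "b",
     xlab = "Number of clusters", ylab = "Explained variance (%)")

# (b) Five clusters, the countries in each, and cluster means on the
#     original (unstandardised) scale
fit5 <- kmeans(z, centers = 5, nstart = 10)
clusters <- split(rownames(protein), fit5$cluster)
means <- aggregate(protein, by = list(cluster = fit5$cluster), FUN = mean)

# Averages of the three combined variables (a mean of sums equals a sum of means)
combined <- data.frame(
  cluster   = means$cluster,
  Meat.Fish = means$RedMeat + means$WhiteMeat + means$Fish,
  Eggs.Milk = means$Eggs + means$Milk,
  Plant     = means$Cereals + means$Starch + means$Nuts + means$Fr.Veg
)

# (c) One possible scatterplot, coloured by cluster membership
plot(protein$RedMeat, protein$Cereals, col = fit5$cluster, pch = 19,
     xlab = "RedMeat", ylab = "Cereals")
```

Printing fit5, clusters, means and combined gives the command output, country listing and two tables asked for in part (b). With random stand-in data the curve in part (a) should not be expected to show a meaningful elbow.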
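The Question 2 steps could be sketched in R along the following lines, again as a hedged illustration rather than the expected solution. Because wine.csv is not reproduced here, the sketch simulates 178 rows with a class label in the first column and two hypothetical predictors, V1 and V2, in place of the thirteen constituents. The tree.control settings are one way to let the tree grow large, and picking the smallest size that attains the minimum cross-validated error is one possible rule; both are assumptions and may differ from what was done in the workshops.

```r
library(tree)  # package named in the assignment

# Stand-in for wine.csv: simulated rows with the class (1, 2, 3) in column 1.
# V1 and V2 are hypothetical predictors, not real wine measurements.
set.seed(2)
n <- 178
cls <- sample(1:3, n, replace = TRUE)
wine <- data.frame(Class = cls,
                   V1 = rnorm(n, mean = 3 * cls),
                   V2 = rnorm(n, mean = -3 * cls))
wine$Class <- factor(wine$Class)  # treat the class as a factor

# (a) 138 training observations, the remaining 40 for testing
train <- sample(n, 138)
train_df <- wine[train, ]
test_df  <- wine[-train, ]

# (b) Grow a large tree by relaxing the default stopping rules
big <- tree(Class ~ ., data = train_df,
            control = tree.control(nobs = nrow(train_df),
                                   mincut = 1, minsize = 2, mindev = 0))
plot(big)
text(big, pretty = 0)

# (c) Cross-validation with misclassification guiding the pruning
cv <- cv.tree(big, FUN = prune.misclass)
plot(cv$size, cv$dev, type = "b",
     xlab = "Tree size (terminal nodes)", ylab = "CV misclassifications")

# (d) Prune to the smallest size with the lowest cross-validated error
best <- min(cv$size[cv$dev == min(cv$dev)])
pruned <- prune.misclass(big, best = best)
if (best > 1) {  # a single-node tree cannot be plotted
  plot(pruned)
  text(pruned, pretty = 0)
}
train_err <- mean(predict(pruned, train_df, type = "class") != train_df$Class)
test_err  <- mean(predict(pruned, test_df,  type = "class") != test_df$Class)

# (e) Centroid of the predictors over the whole data set, then classify it
centroid <- as.data.frame(t(colMeans(wine[, -1])))
centroid_class <- predict(pruned, centroid, type = "class")
```

With the real data, the stand-in block would be replaced by reading wine.csv and converting its first column to a factor; the remaining steps do not depend on the number of predictors.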