Assignment title: Information


MATH 4044 – Statistics for Data Science
Assignment 2 (SP5 2016)
Due 6 November by 11pm

Instructions:
• This assignment is worth 35% of your final mark. It is due no later than 11pm on Sunday 6 November in Week 13.
• You will need to submit your assignment via learnonline, including a completed cover sheet. A partially filled-in cover sheet can be downloaded from the Assignments page on the course website.
• Please sign your cover sheet (a typewritten name is acceptable).
• The submitted assignment needs to be a single file, in either Microsoft Word (doc or docx) or pdf format, 25 pages at most excluding any appendices.
• The assignment is out of 100 marks. To achieve maximum marks for each question, you should aim to:
   o Complete the requested statistical analysis in SAS using appropriate tasks or procedures. (40%)
   o Provide and interpret only the output most relevant to the question. Do not include every piece of output produced by SAS! (40%)
   o Discuss the results in the context of the question. (20%)
• Assignments submitted late, without an extension being granted, will attract a penalty of 10 marks for each working day, or any part thereof, beyond the due date and time.

For this assignment you will continue to use data derived from Capital Bikeshare trip records from 2011 and 2012, this time analysing patterns in daily numbers of rentals by casual users.

References and Data Sources:
Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
Fanaee-T, Hadi, and Gama, Joao, 'Event labeling combining ensemble detectors and background knowledge', Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg.
Data file for this assignment:
The data file for this assignment is called daily.sas7bdat and contains daily counts of bike rentals for 2011 and 2012, derived from Capital Bikeshare trip history data, with additional weather and seasonal information. The data was downloaded from the UCI Machine Learning Repository. Variables in that file are as follows:

Variable     Description
instant      Record index
dteday       Date
season       Winter, spring, summer or fall (northern hemisphere)
Yr           0 = 2011, 1 = 2012
Month        Month (January to December)
Weekday      Day of the week (Monday to Sunday)
workingday   Working day = 1, weekend or public holiday = 0
Temp         Normalised temperature in degrees Celsius; observed temperature divided by 41 (max)
Atemp        Normalised 'feels like' temperature in degrees Celsius; values divided by 50 (max)
Hum          Normalised humidity; observed values divided by 100 (max)
Windspeed    Normalised wind speed; values divided by 67 (max)
Casual       Count of casual users
registered   Count of registered users
Count        Total count of bike rentals (casual and registered)

Assignment tasks:

Question 1 (15 marks)
Carry out a one-way analysis of variance relating casual to weekday. Use contrasts to test at least one a-priori hypothesis of your choice. Examine and comment on residuals. Also carry out appropriate post-hoc comparisons and discuss your results.
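As an aside on what the one-way ANOVA and contrast in Question 1 compute, the arithmetic can be sketched in a few lines of Python. This is an illustration only, not a substitute for the required SAS analysis: the function names one_way_anova and contrast_t are hypothetical, the counts in toy are made up and are not taken from daily.sas7bdat, and only three weekday levels are used to keep the example short.

```python
from math import sqrt
from statistics import mean


def one_way_anova(groups):
    """Return (F, df_between, df_within, ms_within) for {label: [values]}."""
    values = [x for g in groups.values() for x in g]
    grand_mean = mean(values)
    # Between-group sum of squares: group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    # Within-group sum of squares: observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups.values() for x in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    ms_within = ss_within / df_within
    f_stat = (ss_between / df_between) / ms_within
    return f_stat, df_between, df_within, ms_within


def contrast_t(groups, coefficients, ms_within):
    """Return (estimate, t) for a contrast whose coefficients sum to zero."""
    estimate = sum(c * mean(groups[k]) for k, c in coefficients.items())
    std_error = sqrt(
        ms_within * sum(c ** 2 / len(groups[k]) for k, c in coefficients.items())
    )
    return estimate, estimate / std_error


# Made-up casual counts for three days (hypothetical, for illustration only).
toy = {"Mon": [4, 5, 6], "Sat": [9, 10, 11], "Sun": [8, 9, 10]}
f_stat, df_b, df_w, mse = one_way_anova(toy)

# A-priori contrast: Monday versus the average of Saturday and Sunday.
estimate, t_stat = contrast_t(toy, {"Mon": 1.0, "Sat": -0.5, "Sun": -0.5}, mse)
print(f_stat, df_b, df_w, estimate, t_stat)
```

The F statistic is referred to an F distribution with (df_between, df_within) degrees of freedom, and the contrast t statistic to a t distribution with df_within degrees of freedom. The sketch does not compute p-values, examine residuals or run post-hoc comparisons.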
Question 2 (25 marks)
Use SAS to perform a one-way ANCOVA relating casual to weekday with atemp as a covariate, including appropriate post-hoc comparisons:
• Confirm that there is a linear relationship between the response variable and the covariate (a scatterplot and a correlation coefficient plus a comment will suffice);
• Check the two additional ANCOVA assumptions (report and comment only on the parts of the output most directly relevant to condition checking):
   o Independence of the covariate and the treatment effect (perform a one-way ANOVA test; there should be no statistically significant difference);
   o Equality of slopes (add and check significance of the interaction term);
• Report and briefly discuss your results.
Technical note: Make sure you obtain and examine Type III Sum of Squares (ss3). Also obtain estimates of 'least squares means' (lsmeans), which are means by treatment adjusted for the covariate.

Question 3 (45 marks)
(a) (15 marks) Carry out a one-way analysis of variance relating casual to season. Use contrasts to test at least one a-priori hypothesis of your choice. Also carry out appropriate post-hoc comparisons and discuss your results.
(b) (15 marks) Extend your analysis in part (a) to test whether there is evidence of interaction between season and the type of day (working day vs weekend or public holiday). Carry out appropriate post-hoc comparisons and discuss your results.
(c) (15 marks) The distribution of the number of casual users by season is actually not Normal, so a Kruskal-Wallis test may be more appropriate to relate casual to season. Carry out this test and, for post-hoc analysis, consider comparisons between summer and each of the other seasons. Discuss and compare your results to those in part (a).

Question 4 (15 marks)
Write a summary of your findings from Questions 1 to 3. Keep the technical details of the analyses that led you to these conclusions to the absolute minimum. Rather, focus on practical significance and present your findings in non-specialist terms. One page will be sufficient.
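
For the Kruskal-Wallis test named in Question 3(c), the test statistic is built from ranks rather than raw counts. The following Python sketch shows that calculation under simplifying assumptions: the helper names average_ranks and kruskal_wallis_h are hypothetical, the values in toy are made up and are not taken from daily.sas7bdat, and no tie correction is applied. It illustrates the arithmetic only and does not replace the SAS analysis the assignment asks for.

```python
def average_ranks(values):
    """Rank values from 1 to N, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Return the Kruskal-Wallis H statistic (no tie correction) for {label: [values]}."""
    labels = list(groups)
    pooled = [x for label in labels for x in groups[label]]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    weighted = 0.0
    start = 0
    for label in labels:
        n = len(groups[label])
        rank_sum = sum(ranks[start:start + n])
        weighted += rank_sum ** 2 / n
        start += n
    return 12 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)


# Made-up casual counts for three seasons (hypothetical, for illustration only).
toy = {"winter": [1, 2, 3], "summer": [7, 8, 9], "fall": [4, 5, 6]}
h_stat = kruskal_wallis_h(toy)
print(h_stat)
```

For large samples, H is compared with a chi-square distribution on (number of groups - 1) degrees of freedom. The sketch does not cover the pairwise comparisons of summer against each other season.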