Assignment title: Information


STA2300 Data Analysis S1, 16
Assignment 2
Due Date: 2 May, 2016
Weighting: 20%
Full Marks: 100

• Answering the questions in this assignment should not be your first attempt at these types of questions. It is essential that you work through practice exercises from the tutorial sheets in the study book and text book first.
• This assignment is important in providing feedback and helping to establish competency in essential skills.
• Answer all the questions. The questions are not of equal weight; some questions are worth much more than others.
• The questions relate to material in Modules 1 to 6.
• Before starting this assignment read Notes Concerning Assignments under the Introductory Material link on the StudyDesk.
• When you are asked to comment on a finding, usually a short paragraph is required.
• Do not copy/paste SPSS output into your assignment unless specifically asked to do so. In many cases the SPSS output contains much more information than is required for a correct and complete answer. In those cases just reproducing the output may not attract any marks. Make sure you report only the information from the SPSS output relevant to your answer.
• In order to obtain full marks for any question you must show all working.
• Convert your Word document to PDF before online submission. See the Introductory Material (Section 5, Assignments) for information about how to do this properly.

This assessment item consists of 6 questions.

Question 1 (14 marks)
This question uses information from the data file cgd.sav found under the Assessments link on the StudyDesk (also see cgd.txt for more details about the study and the variables measured). Make sure the variable view in SPSS is set up correctly, with all 'labels' correctly defined (with units), all 'values' assigned correctly for categorical variables and the correct 'measure' selected for all variables.
To gain insight into the onset of a serious infection after study entry, a researcher is interested in comparing the patients receiving 'gamma interferon' with those who are given 'placebo'. She therefore decides to check whether there is an association between 'Treatment Code' and 'Time period' to first infection for the patients suffering from chronic granulomatous disease (cgd).
(a) (4 marks) Use a contingency table to display the relationship between 'Treatment code' and 'Time period' to first infection for patients in this study (you should use SPSS to complete this contingency table). The title for this table should reflect the context of the study. (Note that by convention, a table title should appear above the table.)
(b) (2 marks) What proportion of patients are receiving 'gamma interferon' and have experienced a 'short time' to first infection?
(c) (2 marks) Of those who are receiving 'placebo', what proportion of patients experienced a 'long time' to first infection?
(d) (6 marks) Does there appear to be an association between 'Treatment code' and the 'Time period' to first infection for the patients suffering from chronic granulomatous disease (cgd)? Explain in fewer than 100 words, using one or more numerical examples from a conditional distribution table to support your explanation.

Question 2 (20 marks)
Consider the data in the file cgd.sav again. Use SPSS to find the answers to the following questions, but do not copy and paste SPSS output into your answer for parts (c) and (d) (make sure you always include units where appropriate).
(a) (5 marks) Display the distribution of 'Diastolic BP' at the beginning of the study (study entry) for the patients in this study using an appropriate graph. Label the axes correctly, include units of measure and provide an appropriate title.
(b) (4 marks) Using the graph in (a) only (don't refer to SPSS summary statistics), describe, in no more than 60 words, the distribution of 'Diastolic BP' for patients at the beginning of the study. Include comments on the shape, centre and spread of the distribution and the existence of outliers, if any. Do not include information from any calculations; use the graph only.
(c) (3 marks) What are the sample size, mean and standard deviation of the distribution of 'Diastolic BP' at the beginning of the study for patients in this study? (You can use SPSS to calculate them but do not copy/paste SPSS output.)
(d) (4 marks) Using SPSS, find the median and IQR of the distribution of 'Diastolic BP' at the beginning of the study for patients. (Do not copy/paste SPSS output.)
(e) (4 marks) For this distribution of 'Diastolic BP', which statistics are appropriate to measure the centre and spread? Give a reasonable explanation for your choice.

Question 3 (24 marks)
Consider the data in the file cgd.sav again. The researcher is interested in identifying whether the height of the patients can be used to effectively predict the weight of the patients in this study.
(a) (2 marks) What are the two variables you will need to include in your analysis? What type of variables are they?
(b) (4 marks) Use an appropriate graph to display the relationship between the two variables identified in part (a) for these patients. Label the axes correctly, include units of measure and provide an appropriate title.
(c) (4 marks) From the graph in part (b), describe (in no more than 30 words) the form, direction and scatter of this relationship, and identify any outliers.
(d) (4 marks) Calculate an appropriate statistic to measure the strength and direction of the relationship between the two variables for these patients. Interpret this statistic.
(e) (6 marks) Use SPSS to find the equation of the regression line which could be used to make predictions and then plot the regression line on the graph in part (b).
(f) (3 marks) Using the regression equation from part (e), predict the expected weight for a patient who has a height of 150 cm. Would you expect this to be an accurate prediction? Why?
(g) (1 mark) What proportion of the variability in weight for patients can be explained by the model, i.e. the relationship between weight and height?

Question 4 (12 marks)
Chronic granulomatous disease (cgd) is a diverse group of hereditary diseases usually developed in adolescence. Based on historical data (not the sample data in cgd.sav), the age of cgd patients is approximately normally distributed with a mean of 14 years and a standard deviation of 9 years.
(a) (2 marks) Identify the variable of interest and the unit of measurement of the variable that can be used to estimate the probabilities of the ages of cgd patients.
(b) (5 marks) Based on this normal distribution, what proportion of cgd patients are aged 25 years or more?
(c) (5 marks) Based on this normal distribution, above what age are the oldest 15% of cgd patients?

Question 5 (12 marks)
Use the news article "Cranberry Juice Can Effectively Reduce Heart Disease" (which appeared on preventdisease.com on July 1, 2015) to answer the questions that follow:

Three glasses of cranberry juice just might keep the cardiologist at bay. That's the suggestion of a small new study presented March 24 at the annual meeting of the American Chemical Society in New Orleans. Researchers from the University of Scranton suggested that nutrients found in cranberry juice can effectively reduce the risk of heart disease -- in some cases, up to 40 percent -- mostly by increasing levels of HDL, the "good" cholesterol. The juice was also shown to increase blood levels of antioxidant nutrients by up to 121 percent.
"It is one of the most important fruit juices you can drink -- with protective qualities that can make an important difference in your health, particularly your heart health," says Joe Vinson, the researcher who presented the findings. Vinson's research was fully funded by the Cranberry Institute.

Before you race out and buy that year's supply of cranberry juice, the news isn't all good. Those who drank sweetened cranberry juice -- the kind you find on most supermarket shelves -- experienced a rise in triglycerides, which are dangerous to the heart. While Vinson suggests the solution is to drink your juice artificially sweetened, not all nutritionists agree that's the best advice.

"I think the best thing you can do is eat whole fresh fruits -- and to make cranberries one of a variety of fresh fruits and vegetables you eat every day since we know that, in their natural form, with nothing added, these foods have heart-healthy qualities, without any risk of adverse effects," says Gyni Holland, a nutritionist at the New York University School of Medicine.

The study also found the amount of cranberry juice you consume is directly related to how much protection you receive. For those who had just one eight-ounce serving daily, Vinson says there was little in the way of health benefits seen in this study. Significant differences in both antioxidant levels and HDL cholesterol were not seen until two to three glasses of juice were consumed daily.

The research involved 11 women and eight men, all diagnosed with high cholesterol (on average 250 milligrams per deciliter), and none were taking any cholesterol medication. Normal cholesterol is below 200 mg/dl. Ten of the participants were assigned to drink cranberry juice containing an artificial sweetener and no added sugar, while the remaining nine drank juice sweetened with corn syrup. All the drinks contained 27 percent fruit juice, the average amount commonly found in many grocery store brands.
During the first month of the 90-day trial, each volunteer drank one daily eight-ounce serving of juice. The second month they consumed two glasses a day, and the third month three glasses daily. At the conclusion of each of the three months, Vinson measured their total cholesterol, their HDL, and their triglycerides. He also measured levels of antioxidants -- nutrients that protect our heart by blocking certain types of cell damage caused by molecules generated by smoking and pesticide exposure.

"After one month there was no change in any of the participants. At two servings a day, triglyceride levels rose marginally, but only in those drinking sweetened cranberry juice," says Vinson. However, once intake rose to two glasses daily, antioxidant levels also rose by 111 percent; when three glasses a day were consumed, Vinson reports, it climbed to a whopping 121 percent in both types of juices.

What's more, the HDL or "good" cholesterol of those drinking three glasses of either juice per day jumped up by 10 percent. "That's equal to approximately a 40 percent reduction in heart disease," he says.

According to Holland, the real message in this study still remains that eating a variety of fruits and vegetables is one of the most beneficial things you can do for your health. "I wouldn't run out and buy cranberry juice necessarily -- but I would make every effort to include cranberries along with all types of fruits and vegetables in your diet," says Holland, who also reminds us that the flesh as well as the juice of a fruit yields important health benefits.

Also important to note, says Holland, is that the study was not a controlled trial and there was virtually no attention paid to any changes in the participants' diet or exercise regimens. Moreover, she notes, they were not questioned as to any lifestyle or other changes that could have affected the study outcome.

(a) (2 marks) Is this an experimental or observational study?
In fewer than 50 words, clearly explain your choice based on the extract given above.
(b) (3 marks) For the above study identify, if appropriate,
i) the response variable(s).
ii) the factor and its levels.
iii) the sample size.
(c) (4 marks) Are the four principles of experimental design used in this study? Explain in the context of the study.
(d) (3 marks) Explain explicitly what a confounding variable is. Identify one plausible confounding variable in this study and explain why it is a confounding variable.

Question 6 (18 marks)
According to the Australian Bureau of Statistics, 25% of all Australian children aged 5-17 years are overweight or obese. A former STA2300 student takes a random sample of 20 children aged 5-17 years in Toowoomba. A particular variable of interest is the 'number of children aged 5-17 years who are overweight or obese'. Based on the above information, answer the following questions:
(a) (3 marks) What is an appropriate model to represent the variable of interest? Write down the parameters of the model, if any.
(b) (4 marks) Discuss how the conditions of the above model are satisfied in the current study.
(c) (2 marks) Find the mean and standard deviation of the number of children aged 5-17 years who are overweight or obese using the parameters of the model.
(d) (4 marks) Find the probability that at least 5 of the children aged 5-17 years are overweight or obese.
(e) (5 marks) Determine the probability that, in a random sample of 100 children aged 5-17 years in Toowoomba, 30 or more children are overweight or obese. State and check any assumptions, conditions or rules of thumb that should be considered before performing the calculations to determine this probability.
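
The calculations that Question 6 asks for can be checked with a short script. The following is a minimal sketch in Python (not part of the assignment, which expects worked answers), assuming a binomial model for the count with n = 20 and p = 0.25 and, for the sample of 100, a normal approximation without continuity correction. The function and variable names are illustrative only, and the "at least 10 expected successes and failures" rule of thumb is an assumption; use whichever rule the study book states.

```python
from math import comb, sqrt
from statistics import NormalDist


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binomial_at_least(k, n, p):
    """P(X >= k), computed as 1 - P(X <= k - 1)."""
    return 1.0 - sum(binomial_pmf(i, n, p) for i in range(k))


# Values taken from Question 6: a sample of 20 children, 25% overweight or obese.
n, p = 20, 0.25

# Mean and standard deviation of a binomial count.
mean = n * p
sd = sqrt(n * p * (1 - p))

# Probability that at least 5 of the 20 children are overweight or obese.
p_at_least_5 = binomial_at_least(5, n, p)

# Larger sample of 100 children: check the rule of thumb (assumed here to be
# at least 10 expected successes and 10 expected failures) before using a
# normal approximation to the binomial count.
n_large = 100
expected_successes = n_large * p
expected_failures = n_large * (1 - p)
rule_of_thumb_ok = expected_successes >= 10 and expected_failures >= 10

approx = NormalDist(mu=n_large * p, sigma=sqrt(n_large * p * (1 - p)))
p_at_least_30_normal = 1.0 - approx.cdf(30)  # no continuity correction
p_at_least_30_exact = binomial_at_least(30, n_large, p)  # exact binomial, for comparison

print(f"mean = {mean}, sd = {sd:.4f}")
print(f"P(X >= 5), n = 20: {p_at_least_5:.4f}")
print(f"rule of thumb satisfied: {rule_of_thumb_ok}")
print(f"P(X >= 30), n = 100, normal approximation: {p_at_least_30_normal:.4f}")
print(f"P(X >= 30), n = 100, exact binomial: {p_at_least_30_exact:.4f}")
```

The exact binomial value is printed next to the normal approximation so that the effect of the approximation can be seen; the two are not expected to agree exactly.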