HSH746 BIOSTATISTICS 1
ASSIGNMENT 3 (40% OF TOTAL MARK)
Due date: 9 June 2017

Instructions

This assignment covers all of the unit content up to and including week 11.

Please note: this assessment task must be all your own work. Please do not discuss questions and answers in detail with your fellow students.

Assignments must be submitted on-line via the assignment folder in the unit site in Deakin sync before 5 pm on 9 June 2017. Assignments must be submitted in a Microsoft Word document or an editable PDF.

There are three parts to this assignment. You should ensure that you have answered all questions in all three parts.

Some of the questions require calculations using Stata. Where you have used Stata for calculations, you should copy the Stata commands and output from the Stata results screen and paste them into your assignment so that the assessor can see how you have derived your answer. Note: this Stata output is required in addition to your answer to the question. Simply pasting in the Stata output will not be considered an adequate answer on its own.

Note that all tables and graphs in this assignment should be presented with appropriate headings and footnotes.

This assignment is worth 40% of the final mark for HSH746 and the marks allocated for each question are shown.

Clearly number your answers so that the marker can easily identify your answer with each specific question.

Students should ensure that they keep a spare copy of their work.

Student Name:
Student ID Number:

Questions

Part One

Read the following data description and answer the following questions.

The data set assignment 3 heart data.csv contains the results of a population study of heart disease and its associated risk factors. A complete description of the variables in the data set is contained in the Word document assignment 3 heart data description.
You should read this data description before attempting to answer the questions. These are synthetic data, but you may reference them in your answers as coming from assignment 3 data: heart study.

1. You have received these data as part of a population study of heart disease and its risk factors. The data were collected on paper forms and transferred to a computer file using manual data entry.

(a) What is your main task in preparing the data set for analysis? (1 mark)

(b) Perform the task you identified in part (a). Show what you did, including any relevant Stata output. (5 marks)

2. The numeric difference between a person’s systolic and diastolic blood pressure is called the pulse pressure. For example, if a person’s systolic blood pressure is 120 mmHg and their diastolic blood pressure is 80 mmHg, their pulse pressure is 40 mmHg. A pulse pressure over 60 mmHg is an indicator of high risk for heart attacks and other cardiovascular disease. Calculate an estimate and 95% confidence interval for the mean pulse pressure for this sample. What probability distribution will you use for this and why? Do the results of your calculation suggest that the true population mean pulse pressure is greater than 60? State your reason for this. Give your answers to 1 decimal place. (3 marks)

(You can read more about the pulse pressure by following this link: http://www.mayoclinic.org/diseases-conditions/high-blood-pressure/expert-answers/pulse-pressure/faq-20058189 )

Part Two

Read the following data description and answer the following questions.

The relationship between lung cancer and cigarette smoking is well known and has been extensively studied. However, cigarette smoking is also believed to be a risk factor for a number of other cancers. The aim of questions 3, 4 and 5 is to examine the relationship between cigarette smoking and kidney cancer. We will use data from an ecological study carried out in the United States of America.
The data set contains the following variables for 44 States:

cigarettes: The average number of cigarettes sold per 100 people per year in each State
kidney: The average annual number of deaths from kidney cancer per 100,000 population in each State
hi_smoke: Takes the value 1 if the average number of cigarettes sold in that State is above the national median number of cigarettes sold and 0 if it is below the median.

We will assume that the data are statistically independent. The data are stored in the data set assignment 3 cancer data.csv. A full description of this data set is given in the Word document assignment 3 cancer data description. You may reference these data in your answers as coming from assignment 3 data: cancer study.

3. First we will carry out a test of whether or not the mean of the mortality rates for States with high numbers of cigarettes sold differs from the mean of the mortality rates for States with low numbers of cigarettes sold.

(a) Do a hypothesis test of whether or not the mean number of kidney cancer deaths per 100,000 population for States with high numbers of cigarettes sold (hi_smoke = 1) differs from the mean for States with low numbers of cigarettes sold (hi_smoke = 0). As part of your answer:
State what hypothesis test you will use and why you can use it with these data.
Fully report the hypothesis test.
Report the test statistic to 1 decimal place.
(9 marks)

(b) Have we either proved or disproved that cigarettes are a cause of kidney cancer? State the reason for your answer. (2 marks)

Another way to look at the relationship between kidney cancer and cigarette smoking is to use regression analysis.

4. Use a scatterplot and a correlation to show whether or not the number of cigarettes sold per 100 people is likely to be a good candidate for a regression model predicting kidney cancer deaths. Explain your result. Report the correlation to 2 decimal places.
(3 marks)

5. Fit a regression equation using the average number of cigarettes sold per 100 people as the predictor variable and average kidney cancer deaths as the dependent variable. What is the regression equation (reported to 2 decimal places)? What does the equation mean in words? Test the hypothesis that the coefficient for the predictor variable is zero. Is this equation a good predictor of kidney cancer deaths? What do you conclude? (8 marks)

Part Three

Read the following data description and answer the following questions.

Ecological studies such as that used in questions 3, 4 and 5 are subject to potential confounding, so they cannot be regarded as strong evidence for a relationship between two variables. Another way we can assess the relationship between a cancer and smoking is by using a case-control study.

The data set assignment 3 case control data.csv contains the results of a case-control study examining the relationship between bladder cancer and smoking. It contains two variables:

case: This variable takes the value 1 if the person has bladder cancer (i.e. they are a “case”) and 0 if they do not have bladder cancer (i.e. they are a “control”)
smoker: This variable takes the value 1 if the person is a current smoker and 0 if they are not a current smoker.

The data set has 120 observations.

6. (a) Calculate the odds ratio and the associated 95% confidence interval for the relationship between being a current smoker and having bladder cancer. Present your results to 1 decimal place. (1 mark)

(b) What can you conclude from this odds ratio and confidence interval? Give a reason for your answer. (2 marks)

7. Test the hypothesis that the odds ratio for the relationship between being a current smoker and having bladder cancer is equal to 1. (4 marks) Fully report on the hypothesis test.
What conclusion can you draw about the proportion of people with bladder cancer who smoke compared to the proportion of people with bladder cancer who do not smoke from the results of your hypothesis test?

8. What can you conclude from the results of question 6 and question 7 about the strength of the relationship between smoking and bladder cancer? (2 marks)

End of assignment questions