Assignment title: Information
STA2300 Data Analysis S1, 16
Assignment 2
Due Date: 2 May, 2016
Weighting: 20%
Full Marks: 100
Answering the questions in this assignment should not be your first attempt at these types of
questions. It is essential that you work through practice exercises from the tutorial sheets in
the study book and text book first.
This assignment is important in providing feedback and helping to establish competency in
essential skills.
Answer all the questions. The questions are not of equal weight; some questions are worth
much more than others.
The questions relate to material in Modules 1 to 6.
Before starting this assignment read Notes Concerning Assignments under the Introductory
Material link on the StudyDesk.
When you are asked to comment on a finding, usually a short paragraph is required.
Do not copy/paste SPSS output into your assignment unless specifically asked to do so. In many
cases the SPSS output contains much more information than is required for a correct and
complete answer. In those cases just reproducing the output may not attract any marks. Make
sure you report only the information from the SPSS output relevant to your answer.
In order to obtain full marks for any question you must show all working.
Convert your word document to pdf before online submission. See the Introductory Material
(Section 5, Assignments) for information about how to do this properly.
This assessment item consists of 6 questions.
Question 1 (14 marks)
This question uses information from the data file cgd.sav found under the Assessments link on the
StudyDesk (also see cgd.txt for more details about the study and the variables measured). Make sure
the variable view in SPSS is set up correctly with all 'labels' correctly defined (with units), all 'values'
assigned correctly for categorical variables and the correct 'measure' selected for all variables.
In gaining insights into the onset of a serious infection from the beginning of study entry, a researcher
is interested in the patients receiving 'gamma interferon' compared with those who are given
'placebo'. As such she decides to check if there is an association between 'Treatment Code' and 'Time
period' to first infection of the patients suffering from chronic granulomatous disease (cgd).
(a) (4 marks) Use a contingency table to display the relationship between 'Treatment code' and
'Time period' to first infection for patients in this study (you should use SPSS to complete this
contingency table). The title for this table should reflect the context of the study. (Note that
by convention, a table title should appear above the table).
(b) (2 marks) What proportion of patients are receiving 'gamma interferon' and have experienced
a 'short time' to first infection?
(c) (2 marks) Of those who are receiving 'placebo', what proportion of patients experienced a
'long time' to first infection?
(d) (6 marks) Does there appear to be an association between 'Treatment code' and the 'Time
period' to first infection of the patients suffering from chronic granulomatous disease (cgd)?
Explain in fewer than 100 words, using one or more numerical examples from a conditional distribution
table to support your explanation.
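As a rough illustration of how the proportions in parts (b) to (d) differ, the sketch below builds a two-way table in Python and reads one joint proportion and one conditional proportion from it. The counts are invented for illustration and are not the cgd.sav data, and the function names are hypothetical; the assignment itself still requires SPSS.

```python
from collections import Counter

# Hypothetical (treatment, time period) counts; NOT taken from cgd.sav.
counts = Counter({
    ("gamma interferon", "short time"): 10,
    ("gamma interferon", "long time"): 40,
    ("placebo", "short time"): 30,
    ("placebo", "long time"): 20,
})


def joint_proportion(table, treatment, period):
    """Proportion of ALL patients who fall in one cell of the table."""
    grand_total = sum(table.values())
    return table[(treatment, period)] / grand_total


def conditional_proportion(table, treatment, period):
    """Proportion of patients WITHIN one treatment group who fall in `period`."""
    row_total = sum(n for (t, _), n in table.items() if t == treatment)
    return table[(treatment, period)] / row_total


joint = joint_proportion(counts, "gamma interferon", "short time")
conditional = conditional_proportion(counts, "placebo", "long time")
print(f"joint: {joint:.2f}, conditional: {conditional:.2f}")
```

The joint proportion divides by the grand total, whereas the conditional proportion divides by the total for one treatment group. Comparing the conditional distributions across treatment groups is what indicates whether an association may exist.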
Question 2 (20 marks)
Consider the data in the file cgd.sav again. Use SPSS to find the answers to the following questions, but
do not copy and paste SPSS output into your answer for parts (c) and (d) (make sure you always include
units where appropriate).
(a) (5 marks) Display the distribution of 'Diastolic BP' at the beginning of study (study entry) of the
patients in this study using an appropriate graph. Label the axes correctly, include units of
measure and provide an appropriate title.
(b) (4 marks) Using the graph in (a) only (don't refer to SPSS summary statistics), describe in no
more than 60 words, this distribution of 'Diastolic BP' for patients at the beginning of the study.
Include comments on shape, centre and spread of the distribution and the existence of
outliers, if any. Do not include information from any calculations; use the graph only.
(c) (3 marks) What are the sample size, mean and standard deviation of the distribution of 'Diastolic
BP' at the beginning of study for patients in this study? (You can use SPSS to calculate them
but do not copy/paste SPSS output).
(d) (4 marks) Using SPSS find the median and IQR of the distribution of 'Diastolic BP' at the
beginning of study for patients. (Do not copy/paste SPSS output).
(e) (4 marks) For this distribution of 'Diastolic BP', which statistics are appropriate to measure the
centre and spread? Give a reasonable explanation for your choice.
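To see why the choice in part (e) matters, the following minimal sketch computes both pairs of summary statistics for a small data set containing one high value. The readings are hypothetical, not the cgd.sav data, and packages use different quartile conventions, so the quartiles here may differ slightly from what SPSS reports for the same values.

```python
import statistics

# Hypothetical diastolic BP readings in mmHg (NOT the cgd.sav data).
# The last value is deliberately extreme to mimic an outlier.
diastolic_bp = [58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 110]

sample_size = len(diastolic_bp)
mean_bp = statistics.mean(diastolic_bp)
sd_bp = statistics.stdev(diastolic_bp)  # sample standard deviation (n - 1 divisor)

# quantiles(..., n=4) returns the three cut points Q1, median, Q3.
q1, median_bp, q3 = statistics.quantiles(diastolic_bp, n=4)
iqr_bp = q3 - q1

print(f"n = {sample_size}, mean = {mean_bp:.1f} mmHg, sd = {sd_bp:.1f} mmHg")
print(f"median = {median_bp} mmHg, IQR = {iqr_bp} mmHg")
```

In this made-up sample the single high reading pulls the mean above the median and inflates the standard deviation, while the median and IQR are unaffected by how extreme that reading is.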
Question 3 (24 marks)
Consider the data in the file cgd.sav again. The researcher is interested in identifying if height of the
patients can be used to effectively predict the weight of the patients in this study.
(a) (2 marks) What are the two variables you will need to include in your analysis? What type of
variables are they?
(b) (4 marks) Use an appropriate graph to display the relationship between the two variables
identified in part (a) for these patients. Label the axes correctly, include units of measure and
provide an appropriate title.
(c) (4 marks) From the graph in part (b), describe (in no more than 30 words) the form, direction
and scatter of this relationship, and identify any outliers.
(d) (4 marks) Calculate an appropriate statistic to measure the strength and direction of the
relationship between the two variables for these patients. Interpret this statistic.
(e) (6 marks) Use SPSS to find the equation of the regression line which could be used to make
predictions and then plot the regression line on the graph in part (b).
(f) (3 marks) Using the regression equation from part (e), predict the expected weight for a patient
who has a height of 150cm. Would you expect this to be an accurate prediction? Why?
(g) (1 mark) What proportion of the variability in weight for patients can be explained by the
model, i.e. the relationship between weight and height?
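The calculations behind parts (d) to (g) can be sketched in Python as follows. This is an illustration only: the height and weight pairs are hypothetical and not the cgd.sav data, the helper name `least_squares` is made up for this example, and the assignment expects the SPSS results.

```python
from math import sqrt

# Hypothetical (height in cm, weight in kg) pairs; NOT the cgd.sav data.
heights = [100.0, 120.0, 140.0, 160.0, 180.0]
weights = [16.0, 23.0, 35.0, 52.0, 70.0]


def least_squares(x, y):
    """Return (slope, intercept, r) for the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)  # Pearson correlation coefficient
    return slope, intercept, r


slope, intercept, r = least_squares(heights, weights)
r_squared = r ** 2  # proportion of variability in weight explained by the line
predicted_weight = intercept + slope * 150  # prediction for a height of 150 cm

print(f"weight = {intercept:.2f} + {slope:.3f} * height")
print(f"r = {r:.3f}, R^2 = {r_squared:.3f}, prediction at 150 cm = {predicted_weight:.2f} kg")
```

A height of 150 cm lies inside the range of the hypothetical heights, so the prediction is an interpolation; whether the same is true for the real data has to be checked against the scatterplot.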
Question 4 (12 marks)
Chronic granulomatous disease (cgd) is a diverse group of hereditary diseases usually developed in
adolescence. Based on historical data (not the sample data in cgd.sav), age of the cgd patients is
approximately normally distributed with a mean of 14 years and a standard deviation of 9 years.
(a) (2 marks) Identify the variable of interest and the unit of measurement of the variable that can
be used to estimate the probabilities of the ages of cgd patients.
(b) (5 marks) Based on this normal distribution, what proportion of cgd patients are aged 25 years
or more?
(c) (5 marks) Based on this normal distribution, above what age are the oldest 15% of cgd
patients?
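As a cross-check on table-based working for parts (b) and (c), a minimal sketch using Python's standard library is given below. It assumes only the stated model (mean 14 years, standard deviation 9 years); the variable names are illustrative, and it does not replace the working you are required to show.

```python
from statistics import NormalDist

# Age of cgd patients, modelled as Normal(mean = 14 years, sd = 9 years).
age = NormalDist(mu=14, sigma=9)

# Part (b): proportion aged 25 years or more is the upper-tail area at 25.
prop_25_or_more = 1 - age.cdf(25)

# Part (c): the oldest 15% lie above the 85th percentile of the distribution.
cutoff_oldest_15 = age.inv_cdf(0.85)

print(f"P(age >= 25) = {prop_25_or_more:.4f}")
print(f"Oldest 15% are older than {cutoff_oldest_15:.1f} years")
```

With these parameters the proportion comes out at roughly 0.11 and the cut-off at roughly 23.3 years. Standardising with z = (x - 14) / 9 and using a normal table should give the same answers to table accuracy.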
Question 5 (12 marks)
Use this news article, "Cranberry Juice Can Effectively Reduce Heart Disease," (which appeared on
preventdisease.com on July 1, 2015) to answer the questions that follow:
Three glasses of cranberry juice just might keep the cardiologist at bay.
That's the suggestion of a small new study presented March 24 at the annual
meeting of the American Chemical Society in New Orleans. Researchers from the
University of Scranton suggested that nutrients found in cranberry juice can
effectively reduce the risk of heart disease -- in some cases, up to 40 percent --
mostly by increasing levels of HDL, the "good" cholesterol. The juice was also
shown to increase blood levels of antioxidant nutrients by up to 121 percent.
"It is one of the most important fruit juices you can drink -- with protective
qualities that can make an important difference in your health, particularly your
heart health," says Joe Vinson, the researcher who presented the findings.
Vinson's research was fully funded by the Cranberry Institute.
Before you race out and buy that year's supply of cranberry juice, the news isn't
all good. Those who drank sweetened cranberry juice -- the kind you find on most
supermarket shelves -- experienced a rise in triglycerides, which are dangerous to
the heart.
While Vinson suggests the solution is to drink your juice artificially sweetened,
not all nutritionists agree that's the best advice.
"I think the best thing you can do is eat whole fresh fruits -- and to make
cranberries one of a variety of fresh fruits and vegetables you eat every day since
we know that, in their natural form, with nothing added, these foods have heart-healthy qualities, without any risk of adverse effects," says Gyni Holland, a
nutritionist at the New York University School of Medicine.
The study also found the amount of cranberry juice you consume is directly
related to how much protection you receive. For those who had just one eight-ounce serving daily, Vinson says there was little in the way of health benefits seen
in this study. Significant differences in both antioxidant levels and HDL cholesterol
were not seen until two to three glasses of juice were consumed daily.
The research involved 11 women and eight men, all diagnosed with high
cholesterol (on average 250 milligrams per deciliter), and none were taking any
cholesterol medication. Normal cholesterol is below 200 mg/dl.
Ten of the participants were assigned to drink cranberry juice containing an
artificial sweetener and no added sugar, while the remaining nine drank juice
sweetened with corn syrup. All the drinks contained 27 percent fruit juice, the
average amount commonly found in many grocery store brands.
During the first month of the 90-day trial, each volunteer drank one daily eight-ounce serving of juice. The second month they consumed two glasses a day, and
the third month three glasses daily. At the conclusion of each of the three months,
Vinson measured their total cholesterol, their HDL, and their triglycerides.
He also measured levels of antioxidants -- nutrients that protect our heart by
blocking certain types of cell damage caused by molecules generated by smoking
and pesticide exposure.
"After one month there was no change in any of the participants. At two servings
a day, triglyceride levels rose marginally, but only in those drinking sweetened
cranberry juice," says Vinson.
However, once intake rose to two glasses daily, antioxidant levels also rose by 111
percent; when three glasses a day were consumed, Vinson reports, it climbed to
a whopping 121 percent in both types of juices.
What's more, the HDL or "good" cholesterol of those drinking three glasses of
either juice per day jumped up by 10 percent.
"That's equal to approximately a 40 percent reduction in heart disease," he says.
According to Holland, the real message in this study still remains that eating a
variety of fruits and vegetables is one of the most beneficial things you can do for
your health.
"I wouldn't run out and buy cranberry juice necessarily -- but I would make every
effort to include cranberries along with all types of fruits and vegetables in your
diet," says Holland, who also reminds us that the flesh as well as the juice of a
fruit yields important health benefits.
Also important to note, says Holland, is that the study was not a controlled trial
and there was virtually no attention paid to any changes in the participants' diet
or exercise regimens.
Moreover, she notes, they were not questioned as to any lifestyle or other
changes that could have affected the study outcome.
(a) (2 marks) Is this an experimental or observational study? In fewer than 50 words clearly explain
your choice based on the extract given above.
(b) (3 marks) For the above study identify, if appropriate,
i) the response variable(s).
ii) the factor and its levels.
iii) the sample size.
(c) (4 marks) Are the four principles of experimental design used in this study? Explain in the
context of the study.
(d) (3 marks) Explain explicitly what a confounding variable is. Identify one plausible confounding
variable in this study and explain why it is a confounding variable.
Question 6 (18 marks)
According to the Australian Bureau of Statistics, 25% of all Australian children aged 5-17 years are overweight
or obese. A former STA2300 student takes a random sample of 20 children aged 5-17 years in
Toowoomba. A particular variable of interest is the 'number of children aged 5-17 years who are
overweight or obese'. Based on the above information answer the following questions:
(a) (3 marks) What is an appropriate model to represent the variable of interest? Write down the
parameters of the model, if any.
(b) (4 marks) Discuss how the conditions of the above model are satisfied in the current study.
(c) (2 marks) Find the mean and standard deviation of the number of children aged 5-17 years
who are overweight or obese using the parameters of the model.
(d) (4 marks) Find the probability that at least 5 of the children aged 5-17 years are overweight or
obese.
(e) (5 marks) Determine the probability that, in a random sample of 100 children aged 5-17 years
in Toowoomba, 30 or more children are overweight or obese. State and check any
assumptions, conditions or rules of thumb that should be considered before performing the
calculations to determine this probability.
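The calculations in parts (c) to (e) can be checked with a short Python sketch such as the one below. It assumes a binomial model with n = 20 and p = 0.25 for the small sample, and a normal model for the sample proportion when n = 100. The "at least 10 successes and 10 failures" threshold is an assumption here, so use whichever rule of thumb the study book states. The helper names are illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def binom_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed directly from the pmf."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))


# Parts (c) and (d): sample of 20 children, p = 0.25.
n, p = 20, 0.25
mean_count = n * p
sd_count = sqrt(n * p * (1 - p))
prob_at_least_5 = binom_at_least(5, n, p)

# Part (e): sample of 100 children, normal model for the sample proportion.
big_n = 100
# Assumed rule of thumb: expected successes and failures are both at least 10.
success_failure_ok = big_n * p >= 10 and big_n * (1 - p) >= 10
p_hat = NormalDist(mu=p, sigma=sqrt(p * (1 - p) / big_n))
prob_30_or_more_approx = 1 - p_hat.cdf(30 / big_n)  # no continuity correction
prob_30_or_more_exact = binom_at_least(30, big_n, p)  # exact binomial, for comparison

print(f"mean = {mean_count}, sd = {sd_count:.3f}, P(X >= 5) = {prob_at_least_5:.4f}")
print(f"conditions met: {success_failure_ok}")
print(f"P(30 or more of 100): approx {prob_30_or_more_approx:.4f}, exact {prob_30_or_more_exact:.4f}")
```

The normal approximation and the exact binomial value differ somewhat because the approximation here ignores the discreteness of the count; the randomisation and independence conditions still have to be argued in words from the context of the study.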