EPIDEMIOLOGY 340.600: STATA PROGRAMMING AND DATA MANAGEMENT
Assignment 3
Due date: 11:00 a.m., Wednesday, May 17, 2016 via CoursePlus dropbox
Due to constraints of final grade submission to the registrar, late assignments will not be accepted.
Overview: write a .do file which performs the tasks described below. Your .do file should be called assignment3_yourname.do (for example: assignment3_massieallan.do). Remember to write comments for full credit!
Problem 1 is to write the name of the log file correctly. The name of your log file should contain "assignment3_", your name, and then the date on which the script is run. For example, if the script is run on May 14, 2017, then the log file for Allan Massie's assignment should be called assignment3_allanmassie_20170514.log. (For full credit, denote January–September as "01"-"09" and days 1-9 as "01"-"09". However, generous partial credit will be given if the 0s are missing.)
So the start of your .do file should contain:
capture log close
[code to create macros based on today's date and time]
log using assignment3_[name]_[one or more macros].log, text replace
Note: this is a good way to automatically preserve earlier versions of a log file when you make changes to a script over time. After Assignment 3 is turned in we will release startlog.ado, which Allan Massie wrote to automatically incorporate the date into log file names.
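One possible way to build the date part of the file name is sketched below. It is an illustration under assumptions, not the required solution: the macro name today and the placeholder yourname are made up for this example, and other approaches (for instance, picking apart c(current_date) by hand) work as well.

```stata
capture log close

* c(current_date) holds today's date as text, e.g. "14 May 2017".
* date() converts it to a Stata daily date, and the %tdCCYYNNDD display
* format prints it as YYYYMMDD with zero-padded month and day.
local today : display %tdCCYYNNDD date(c(current_date), "DMY")

* "yourname" is a placeholder; substitute your own name.
log using "assignment3_yourname_`today'.log", text replace
```

Because the date is part of the name, rerunning the script on a later day starts a new log file instead of overwriting the earlier one.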
Problem 2a. Starting with an empty dataset, use rnormal() to create a dataset of 100 random numbers drawn from a normal distribution with a mean of 100 and standard deviation of 5. Calculate the mean and standard deviation of your 100 random numbers (the mean will be close to, but not exactly, 100; the standard deviation will be close to, but not exactly, 5).
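As a rough sketch of the mechanics (the variable name x and the seed value are arbitrary choices made for this example), drawing a sample and reading back its mean and standard deviation might look like this:

```stata
clear
set seed 12345                 // arbitrary seed, only to make the run reproducible
set obs 100                    // create 100 empty observations
generate x = rnormal(100, 5)   // normal draws: mean 100, standard deviation 5
summarize x                    // leaves the results in r(mean) and r(sd)
display "mean = " r(mean) ", SD = " r(sd)
```

summarize stores its results in r(), so they can be copied into local macros before the dataset is cleared for the next sample size.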
Problem 2b. Clear that dataset and create a dataset of 10,000 random numbers drawn from a normal distribution with a mean of 100 and standard deviation of 5. Calculate the mean and standard deviation of your 10,000 random numbers.
Problem 2c. Clear that dataset and create a dataset of 1,000,000 random numbers drawn from a normal distribution with a mean of 100 and standard deviation of 5. Calculate the mean and standard deviation of your 1,000,000 random numbers.
Problem 2 evaluation: print the following table:
Sample size    Sample mean       Sample standard deviation
100            [mean from 2a]    [SD from 2a]
10000          [mean from 2b]    [SD from 2b]
1000000        [mean from 2c]    [SD from 2c]
Extra credit will be given if the bottom three lines are printed using a forvalues loop from 1 to 3, instead of writing code to print each line individually!
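The loop idea can be sketched as follows. This is one possible arrangement, not the expected answer: the macro names n1-n3, mean1-mean3 and sd1-sd3, the seed, and the column positions are all assumptions made for the example. It relies on nested macro expansion, where `mean`i'' is resolved as `mean1', `mean2', `mean3' in turn.

```stata
set seed 12345                      // arbitrary seed for reproducibility

* Draw each sample and keep its results in numbered local macros
forvalues i = 1/3 {
    local n`i' = 100^`i'            // 100, 10000, 1000000
    quietly {
        clear
        set obs `n`i''
        generate x = rnormal(100, 5)
        summarize x
    }
    local mean`i' = r(mean)
    local sd`i'   = r(sd)
}

* Print the header once, then one table row per pass through the loop
display "Sample size" _col(16) "Sample mean" _col(34) "Sample standard deviation"
forvalues i = 1/3 {
    display %-11.0f `n`i'' _col(16) %9.3f `mean`i'' _col(34) %9.3f `sd`i''
}
```

Numbering the macros is what makes the loop possible: the loop index selects which stored result to print on each row.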
Problem 3: Using donors.dta, reproduce the following graph as precisely as possible:
Note that the dots for female donors are pink, and the dots for male donors are blue.
As per the subtitle, the scatter plot and regression line should be produced only for data from individuals with height > 150 cm.
The command for the linear regression line for female height is
regress don_wgt don_hgt if don_gender=="F" & don_hgt >= 150
For full credit, use results from the regress command for the regression line; do not use lfit.
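To illustrate the general technique of drawing a line from regress results, here is a sketch that uses auto.dta (a dataset shipped with Stata) as a stand-in for donors.dta. The variables weight, length and foreign, the macro names b0, b1 and xmax, and the plotting range are assumptions for this example only; the graph for the assignment needs its own variables, colors, titles and range.

```stata
* Stand-in data: weight/length/foreign play the roles of
* don_wgt/don_hgt/don_gender purely for illustration
sysuse auto, clear

regress weight length if foreign == 0 & length >= 150
local b0 = _b[_cons]      // intercept from the regression just run
local b1 = _b[length]     // slope on length

* Upper end of the x range over which the line is drawn
quietly summarize length if foreign == 0 & length >= 150
local xmax = r(max)

* Overlay the fitted line as a function plot built from the coefficients
twoway (scatter weight length if foreign == 0 & length >= 150, mcolor(pink)) ///
       (function y = `b0' + `b1'*x, range(150 `xmax') lcolor(pink)),         ///
       legend(off)
```

The coefficients are copied into local macros immediately after regress so that later estimation commands cannot replace them.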
Export your graph to a .PNG file with the name q3_yourname.png (e.g. q3_massieallan.png)
Extra credit challenge Problem 6: write a program called sampmean to plot random data drawn from a normal distribution.
sampmean takes a list of numbers representing different sample sizes. It also takes (optionally) a mean and standard deviation for the normal distribution. If we run
sampmean, at(5 20 100 1000) mean(20)
we get a graph like this:
In this example, the program generates four sets of normally distributed random numbers (one set of 5 numbers, one set of 20, one set of 100, and one set of 1000) and calculates the mean for each set. It also plots each randomly generated number (as points) and the group mean (as a red line). The group mean also appears above each group as text.
Here are some more examples. The exact output will depend on the random number seed you use.
sampmean, at(4 8 16 32 64) mean(5) sd(3)
[Figure for sampmean, at(5 20 100 1000) mean(20): group labels Mean=19.77, Mean=19.93, Mean=20.15, Mean=20.05; y-axis ticks 16 to 24; x-axis ticks 5, 20, 100, 1000]
sampmean, at(100 200 300) mean(5) sd(2) uniform
In the last example, the distribution is a uniform distribution instead of a normal distribution.
Hints:
You can use the keyword numlist (for a list of numbers) just as you do with a varlist (list of variables).
The uniform distribution from 0 to 1 has mean 0.5 and standard deviation sqrt(1/12).
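The two hints can be combined in a small sketch. The program name numdemo is hypothetical, and the defaults mean(0) and sd(1) are assumptions (the problem statement does not specify defaults). The sketch only prints each group mean and draws no graph, so it is not a solution. For the uniform case, subtracting 0.5 and dividing by sqrt(1/12) gives a variable with mean 0 and standard deviation 1, which is then rescaled to the requested mean and standard deviation.

```stata
capture program drop numdemo
program define numdemo
    * at() is a required list of numbers; mean(), sd() and uniform are optional
    syntax , at(numlist) [mean(real 0) sd(real 1) uniform]

    foreach n of numlist `at' {
        quietly {
            clear                   // note: replaces the data in memory
            set obs `n'
            if "`uniform'" != "" {
                * standardize runiform(), then rescale to the requested mean and SD
                generate x = `mean' + `sd' * (runiform() - 0.5) / sqrt(1/12)
            }
            else {
                generate x = rnormal(`mean', `sd')
            }
            summarize x
        }
        display "n = `n'" _col(15) "Mean = " %6.2f r(mean)
    }
end

set seed 12345                      // arbitrary seed
numdemo, at(100 200 300) mean(5) sd(2) uniform
```

Declaring at() as a numlist lets syntax do the parsing, so the program body can simply loop over `at' with foreach.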
If you are able to solve this problem, you can probably modify your program slightly so that it is also a
solution for problem 2. That is fine.
Note: this problem is pretty hard! We expect that few people will solve the whole thing, but we will give partial credit for a partial (working) solution. If your program only works partly, then explain in the comments, like this:
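//NOTE: my program does not display the means in the graph.
//instead it prints them to the screen.
//Also "uniform" doesn't work,
//and my program runs for only one number (not a list)
[Figure for sampmean, at(4 8 16 32 64) mean(5) sd(3): group labels Mean=3.81, Mean=6.18, Mean=5.52, Mean=4.69, Mean=5.22; y-axis ticks -5 to 15; x-axis ticks 4, 8, 16, 32, 64]
[Figure for sampmean, at(100 200 300) mean(5) sd(2) uniform: group labels Mean=4.88, Mean=4.92, Mean=5.02; y-axis ticks 2 to 8; x-axis ticks 100, 200, 300]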