Assignment title: Information
STAT7001: In-Course Assessment 2
Your second in-course assessment is worth 50 marks, representing 50% of the credit
for the course STAT7001, and takes the form of two different tasks.
You should tackle the tasks below and produce a project report for each of them,
providing a well-written presentation of your findings, including your code as an appendix so that your results are reproducible. Illustrate your report with graphs or
other output as appropriate.
While the exclusive use of R or SAS is compulsory, you are free to use a word
processor of your choice (e.g. TeX, LibreOffice, OpenOffice or Microsoft Word). You
are free to use any sources (including online ones) of R/SAS commands you may
need in addition to the ones taught in the course so far. However, if your use of such
sources is substantial - i.e., anything more than just looking up a command and/or using
the R help function - you must cite correctly and give the precise source or sources,
including the full URL in the case of online sources. The length of your reports should
not exceed six pages for task 1 and four pages for task 2, excluding the appendix.
You should submit, per group, one digital copy of the report, together with your R
and/or SAS code, via the respective Turnitin module on the STAT7001 Moodle page.
Only one group member should submit, with the mandatory consent of all group
members, expressed through digital signatures of all the group members. Please
make sure that the report itself is anonymized, that is, please make sure there is
no information in the report itself that hints at your identities. The deadline for
submission is Thursday, 24th March 2016, 6:00 pm (6 pm on the last day of term).
On R and SAS
You may use R or SAS for task 1. You must use SAS for task 2.
If you use R for task 1: You may choose to use the mlr package for benchmarking
in conjunction with own code, or opt to implement the benchmarking experiment
yourself entirely, using custom functions and control statements.
If you use SAS for task 1: Any (reasonable) implementation of the prediction
methods ensuring that predictions are made out-of-sample will be accepted.
Task 1: Prediction of the Istanbul Stock Market
The Istanbul Stock Exchange data set contains daily returns between January 5,
2009 and February 22, 2011 of the following eight national and international stock
market indices:
the Istanbul Stock Exchange index (ISE), S&P 500, DAX, FTSE 100, Nikkei 225,
Ibovespa, the MSCI European Union Index, and the MSCI Emerging Markets Index.
Obtain the data set from the Moodle page (this version differs from the usual
benchmark data set in that a "duplicate" column has been removed for the sake of simplicity).
If you are unaware of what a stock market index or a return based on an index is,
please inform yourself in order to understand what the data describes.
Your task will be to evaluate and compare different prediction strategies for predicting daily returns of the ISE index.
(a) Read the Istanbul Stock Exchange data set into R or SAS. Obtain a single data
frame/data set ISE data, where rows correspond to different days and columns
are the different stock indices plus a variable time, in days since the earliest
record (in total 9 columns/variables).
Hint: in order to convert the day column to the desired format in R, consult
the R help on POSIXct and as.POSIXct. For the conversion in SAS, consult
the lecture 6 slides.
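As a minimal sketch of the conversion in R, assuming the dates arrive as text in a column called date in year-month-day format (both the column name and the format are assumptions; the actual file may differ and may need a different format string):

```r
# Hypothetical raw input: three rows with a character date column.
# The column name `date`, the date format and the values are illustrative only.
raw <- data.frame(
  date = c("2009-01-05", "2009-01-06", "2009-01-08"),
  ISE  = c(0.01, -0.02, 0.005)
)

# Parse the text into POSIXct time stamps; a fixed time zone avoids
# daylight-saving shifts when differences are taken.
stamp <- as.POSIXct(raw$date, format = "%Y-%m-%d", tz = "UTC")

# Days elapsed since the earliest record, as a plain numeric variable.
raw$time <- as.numeric(difftime(stamp, min(stamp), units = "days"))

print(raw)
```

Note that the resulting time variable counts calendar days, so gaps such as weekends show up as jumps larger than one.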
(b) Perform an exploratory data analysis on the frame ISE data. In particular, investigate whether the stock indices are associated with the variable time. Further
investigate associations between each stock index and the same stock index
one day, two days, etc., earlier.
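One possible way to look at the association between a series and its own past values is the correlation between the series and a lagged copy of itself. The sketch below uses synthetic numbers as a stand-in for a return series; the helper name lag_cor is hypothetical, and the built-in acf function provides a closely related summary.

```r
# Correlation between a series and the same series k steps earlier.
# `lag_cor` is an illustrative helper, not part of any package.
lag_cor <- function(x, k) {
  n <- length(x)
  cor(x[-seq_len(k)], x[seq_len(n - k)])
}

set.seed(1)
x <- rnorm(200, sd = 0.01)  # synthetic stand-in for one index's daily returns

# Lag-1 to lag-5 correlations of the synthetic series.
lags <- sapply(1:5, function(k) lag_cor(x, k))
print(round(lags, 3))
```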
(c) Perform a benchmarking experiment to assess how well the Istanbul Stock
Exchange index can be predicted from other indices on the same day, with
prediction methods, error measures, and validation set-ups as follows:
The prediction methods to compare:
(i) Using as a prediction the average ISE return on the training set (as an
uninformed prediction strategy).
(ii) Ordinary least squares regression, using all stock market indices (except
ISE), but not time.
(iii) Ordinary least squares regression, using all stock market indices (except
ISE), as well as time.
The measures of prediction goodness to obtain (including suitable error
bars or confidence intervals):
(i) out-of-sample root mean squared error (RMSE)
(ii) out-of-sample mean absolute error (MAE)
(iii) the relative variants of the above, i.e., relative RMSE and relative MAE.
The validation set-ups:
(i) Using the chronologically first 80 percent of the data as training sample,
and the last 20 percent as test sample.
(ii) Five-fold cross-validation where the folds are uniformly randomly sampled
among the days.
Quantify whether and how the prediction methods differ in terms of the error
measures (e.g. by appropriate tests on the empirical distribution of residuals
or absolute residuals). In particular, quantify whether the prediction methods
are better than predicting the average index on the training set.
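The set-up in (c) can be sketched in R roughly as follows, if you choose to implement the benchmarking yourself. The data are synthetic, the names rmse, mae, d and fold are hypothetical, and the normal-approximation interval is only one of several reasonable choices for an error bar; the relative error variants are deliberately left out.

```r
# Out-of-sample error measures (illustrative helper names).
rmse <- function(y, yhat) sqrt(mean((y - yhat)^2))
mae  <- function(y, yhat) mean(abs(y - yhat))

# Synthetic stand-in data: one covariate x and a response y.
set.seed(42)
n <- 100
d <- data.frame(x = rnorm(n))
d$y <- 0.5 * d$x + rnorm(n, sd = 0.1)

# Validation set-up (i): chronologically first 80 percent for training.
n_train <- floor(0.8 * n)
train <- d[seq_len(n_train), ]
test  <- d[(n_train + 1):n, ]

# Uninformed baseline versus ordinary least squares, both fit on train only.
pred_mean <- rep(mean(train$y), nrow(test))
pred_ols  <- predict(lm(y ~ x, data = train), newdata = test)

rmse_mean <- rmse(test$y, pred_mean)
rmse_ols  <- rmse(test$y, pred_ols)

# MAE with an approximate 95% interval from the absolute residuals.
abs_res <- abs(test$y - pred_ols)
mae_ci  <- mean(abs_res) +
  c(-1, 1) * qnorm(0.975) * sd(abs_res) / sqrt(length(abs_res))

# Validation set-up (ii): uniformly random assignment of rows to five folds.
fold <- sample(rep(1:5, length.out = n))

print(c(rmse_mean = rmse_mean, rmse_ols = rmse_ols))
print(mae_ci)
```

For the cross-validation, each fold in turn would play the role of test above, with the remaining four folds as train.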
(d) The validation set-up in part (c) does not mimic prediction in the sense of
forecasting, where the stock market index is predicted from data from previous time points only. Perform a benchmark experiment comparing prediction
strategies which actually predict future ISE returns from (chronologically)
past data, with the following prediction methods, error measures, and validation set-up:
The prediction methods to compare:
(i) Using as a prediction the value of the most recent ISE return.
(ii) Using as a prediction the average of the ISE returns on the most recent
five days.
(iii) Ordinary least squares regression, using all indices (including ISE) from
the most recent day as covariates (8 in total).
(iv) Ordinary least squares regression, using all indices (including ISE) from
the most recent two days as covariates (16 in total).
Methods (iii)-(iv) are fit using the data of all consecutive two-/three-day periods
in the training set.
The measures of prediction goodness to obtain (including suitable error
bars or confidence intervals):
(i) out-of-sample root mean squared error (RMSE)
(ii) out-of-sample mean absolute error (MAE)
(iii) the relative variants of the above, i.e., relative RMSE and relative MAE.
The validation set-up:
Training/test splits are eleven consecutive days where the first ten are used for
training, the eleventh is the test data point. All 526 such periods of consecutive
days are used for testing, yielding 526 predictions for each method.
Quantify whether and how the prediction methods differ in terms of the error
measures (e.g. by appropriate tests on the empirical distribution of residuals
or absolute residuals). In particular, quantify whether the prediction methods
are better than predicting the value of the most recent ISE return.
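The rolling validation set-up in (d) could be organised along the following lines. This is a sketch on a short synthetic series covering only methods (i) and (ii); the names y, window and res are hypothetical. A series of n days yields n - 10 windows, which would match the 526 periods stated above if the data set has 536 rows.

```r
set.seed(7)
n <- 30
y <- rnorm(n, sd = 0.01)   # synthetic stand-in for the ISE returns
window <- 11               # ten training days plus one test day

starts <- seq_len(n - window + 1)  # one window per possible starting day

res <- t(sapply(starts, function(s) {
  train_idx <- s:(s + window - 2)  # the first ten days of the window
  test_idx  <- s + window - 1      # the eleventh day
  y_train   <- y[train_idx]
  c(truth = y[test_idx],
    last  = y_train[length(y_train)],   # method (i): most recent return
    mean5 = mean(tail(y_train, 5)))     # method (ii): mean of last five days
}))

# Out-of-sample MAE of the two simple strategies over all windows.
mae_by_method <- colMeans(abs(res[, c("last", "mean5")] - res[, "truth"]))
print(mae_by_method)
```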
(e) Manually implement the following variant of linear regression as a custom
prediction method: for training data $(x_1, y_1), \dots, (x_N, y_N) \in \mathbb{R}^n \times \mathbb{R}$, the regression function is
$f : \mathbb{R}^n \to \mathbb{R},\ x \mapsto \widehat{\omega}^\top x$,
where $\widehat{\omega} \in \mathbb{R}^n$ is estimated from the data as the minimizer of the sum of
absolute residuals $R(\omega) = \sum_{i=1}^{N} |\omega^\top x_i - y_i|$. That is, $\widehat{\omega} = \operatorname{argmin}_{\omega} R(\omega)$.
Include this type of linear regression as an additional prediction strategy in
your benchmarking experiments in (c) and (d).
Important: Do not use pre-existing packages or methods to find $\widehat{\omega}$, except
built-in tools for minimizing a (mathematical) function.
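One way such a custom method might be structured in R is sketched below, using the built-in general-purpose minimizer optim on the sum of absolute residuals. The function names lad_fit and lad_predict are hypothetical, the data are synthetic, and the column of ones used to obtain an intercept is an assumption that goes beyond the formula above. Since the objective is not smooth, the derivative-free Nelder-Mead method is used here; other built-in minimizers may work as well or better.

```r
# Fit: minimise the sum of absolute residuals over the coefficient vector w.
# X is an N x n numeric matrix of covariates, y a numeric vector of length N.
lad_fit <- function(X, y) {
  objective <- function(w) sum(abs(X %*% w - y))
  start <- rep(0, ncol(X))
  fit <- optim(start, objective, method = "Nelder-Mead",
               control = list(maxit = 2000))
  # Restart once from the first solution; a common precaution with Nelder-Mead.
  fit <- optim(fit$par, objective, method = "Nelder-Mead",
               control = list(maxit = 2000))
  fit$par
}

# Predict: apply the fitted coefficients to new covariates.
lad_predict <- function(w, X) drop(X %*% w)

# Synthetic check: data generated from y = 1 + 2 * x plus small noise.
set.seed(123)
x <- rnorm(100)
X <- cbind(1, x)                     # column of ones gives an intercept
y <- 1 + 2 * x + rnorm(100, sd = 0.1)

w_hat <- lad_fit(X, y)
y_hat <- lad_predict(w_hat, X)
print(w_hat)
```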
(f) Write a brief report on your prediction experiment. In the report, also interpret
your results. Discuss the different prediction methods in the context of
the exploratory findings on your data set. Also discuss to what extent the
different benchmarking scenarios quantify goodness of "prediction". At the
top of your report, include a summary of no more than 100 words.
[3+5+7+7+8+5]
Task 2: The Resistance of Constantan
Page 686 of the 8th edition of the "Rubber Bible" (the CRC Handbook of Chemistry and Physics) contains the following table of resistance in Ohms (denoted R),
at a temperature of 20 degrees Celsius, of one centimeter of Constantan wire, for varying
diameters in cm (denoted d).
R/Ohm      d/cm
0.00093    0.2588
0.00148    0.2053
0.0024     0.1628
0.0037     0.1291
0.0059     0.1024
0.0095     0.08118
0.0150     0.06438
0.024      0.05106
0.038      0.04049
0.048      0.03606
0.061      0.03211
0.096      0.02546
0.153      0.02019
0.24       0.01601
0.39       0.01270
0.98       0.00799
This task is about explaining resistance R in terms of diameter d, via a potentially
non-linear model of the type R ≈ f(d) and/or variable transformations of R and d.
(a) Perform a brief exploratory analysis to find potential candidates for the relation
and variable transformations of R and d.
(b) Investigate the goodness of several regression models to explain the data. Include your candidate models, and at least the following two models:
(i) resistance being a polynomial of degree 15 in diameter
(ii) resistance being a polynomial of degree 2 in the reciprocal of the diameter.
Obtain estimates for out-of-sample RMSE and MAE by leave-one-out cross-validation. Compare the models by investigating the distribution of residuals.
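The logic of leave-one-out cross-validation is sketched below for model (ii), using the table values above. The sketch is written in R purely to illustrate the procedure; Task 2 itself must be carried out in SAS, and the object names wire and loo_res are hypothetical.

```r
# Resistance (Ohm) and diameter (cm), copied from the table in this task.
wire <- data.frame(
  R = c(0.00093, 0.00148, 0.0024, 0.0037, 0.0059, 0.0095, 0.0150, 0.024,
        0.038, 0.048, 0.061, 0.096, 0.153, 0.24, 0.39, 0.98),
  d = c(0.2588, 0.2053, 0.1628, 0.1291, 0.1024, 0.08118, 0.06438, 0.05106,
        0.04049, 0.03606, 0.03211, 0.02546, 0.02019, 0.01601, 0.01270, 0.00799)
)

# Leave each observation out in turn, fit a degree-2 polynomial in 1/d
# on the remaining 15 points, and record the out-of-sample residual.
loo_res <- sapply(seq_len(nrow(wire)), function(i) {
  fit <- lm(R ~ I(1 / d) + I(1 / d^2), data = wire[-i, ])
  wire$R[i] - predict(fit, newdata = wire[i, ])
})

loo_rmse <- sqrt(mean(loo_res^2))
loo_mae  <- mean(abs(loo_res))
print(c(rmse = loo_rmse, mae = loo_mae))
```

The vector loo_res holds one out-of-sample residual per observation, which is also the natural input for comparing residual distributions across models.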
(c) Write a brief report on your analysis. In the report, discuss the different
models, which ones are preferable and why, and whether you can decide from
the data which model is correct. At the top of your report, include a summary
of no more than 100 words.