Assignment title: Information


STAT7001: In-Course Assessment 2

Your second in-course assessment is worth 50 marks, representing 50% of the credit for the course STAT7001, and takes the form of two different tasks. You should tackle the tasks below and produce a project report for each of them, providing a well-written presentation of your findings and including your code as an appendix so that your results are reproducible. Illustrate your report with graphs or other output as appropriate.

While the exclusive use of R or SAS is compulsory, you are free to use a word processor of your choice (e.g. TeX, LibreOffice, OpenOffice or Microsoft Word). You are free to use any sources (including online ones) of R/SAS commands you may need in addition to the ones taught in the course so far. However, if your use of such sources is substantial, i.e. anything more than just looking up a command and/or using the R help function, you must cite correctly and give the precise source or sources, including the full URL in the case of online sources.

The length of your reports should not exceed six pages for task 1 and four pages for task 2, excluding the appendix. You should submit, per group, one digital copy of the report, together with your R and/or SAS code, via the respective Turnitin module on the STAT7001 Moodle page. Only one group member should submit, with the mandatory consent of all group members, expressed through the digital signatures of all the group members. Please make sure that the report itself is anonymized, that is, that there is no information in the report itself that hints at your identities.

The deadline for submission is Thursday, 24th March 2016, 6:00 pm (6 pm on the last day of term).

On R and SAS

You may use R or SAS for task 1. You must use SAS for task 2.

If you use R for task 1: You may choose to use the mlr package for benchmarking in conjunction with your own code, or opt to implement the benchmarking experiment yourself entirely, using custom functions and control statements.

If you use SAS for task 1: Any (reasonable) implementation of the prediction methods ensuring that predictions are made out-of-sample will be accepted.

Task 1: Prediction of the Istanbul Stock Market

The Istanbul Stock Exchange data set contains daily returns between January 5, 2009 and February 22, 2011 of the following eight national and international stock market indices: the Istanbul Stock Exchange index (ISE), S&P 500, DAX, FTSE 100, Nikkei 225, Ibovespa, the MSCI European Union Index, and the MSCI Emerging Markets Index.

Obtain the data set from the Moodle page (this version differs from the usual benchmark data set in that a "duplicate" column has been removed for the sake of simplicity).

If you are unaware of what a stock market index or a return based on an index is, please inform yourself in order to understand what the data describes.

Your task will be to evaluate and compare different prediction strategies for predicting daily returns of the ISE index.

(a) Read the Istanbul Stock Exchange data set into R or SAS. Obtain a single data frame/data set ISE data, where rows correspond to different days and columns are the different stock indices plus a variable time, measured in days since the earliest record (in total 9 columns/variables).
Hint: in order to convert the day column to the desired format in R, consult the R help on POSIXct and as.POSIXct. For the conversion in SAS, consult the lecture 6 slides.

(b) Perform an exploratory data analysis on the frame ISE data. In particular, investigate whether the stock indices are associated with the variable time. Further investigate associations between each stock index and the same stock index one day, two days, etc., earlier.
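The conversion hinted at in (a) and the lagged associations asked for in (b) can be sketched in R roughly as follows. This is only an illustration under stated assumptions: the three date strings, their ISO format, and the synthetic return series `r` are made up, and `lag_cor` is a hypothetical helper name, not something taught in the course.

```r
# Hypothetical dates in ISO format; the real 'date' column may use another
# format, in which case a 'format' argument is needed in as.POSIXct().
dates <- c("2009-01-05", "2009-01-06", "2009-01-08")
t_posix <- as.POSIXct(dates, tz = "UTC")

# Time in days since the earliest record.
time_days <- as.numeric(difftime(t_posix, min(t_posix), units = "days"))

# Synthetic stand-in for one return series (NOT the ISE data).
set.seed(1)
r <- rnorm(100)

# Correlation between a series and the same series k days earlier.
lag_cor <- function(x, k) {
  n <- length(x)
  cor(x[(k + 1):n], x[1:(n - k)])
}
lag_cors <- sapply(1:3, function(k) lag_cor(r, k))
```

Plotting each index against `time_days`, or calling `acf()` on a series, would be alternative ways to look at the same questions.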

(c) Perform a benchmarking experiment to assess how well the Istanbul Stock Exchange index can be predicted from other indices on the same day, with prediction methods, error measures, and validation set-ups as follows:

The prediction methods to compare:
(i) Using as a prediction the average ISE return on the training set (as an uninformed prediction strategy).
(ii) Ordinary least squares regression, using all stock market indices (except ISE), but not time.
(iii) Ordinary least squares regression, using all stock market indices (except ISE), as well as time.

The measures of prediction goodness to obtain (including suitable error bars or confidence intervals):
(i) out-of-sample root mean squared error (RMSE)
(ii) out-of-sample mean absolute error (MAE)
(iii) the relative variants of the above, i.e., relative RMSE and relative MAE.

The validation set-ups:
(i) Using the chronologically first 80 percent of the data as training sample, and the last 20 percent as test sample.
(ii) Five-fold cross-validation where the folds are uniformly randomly sampled among the days.
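The two validation set-ups and the basic error measures can be sketched in R as below. This is a minimal sketch on synthetic data, not the ISE data: the data frame `dat`, its columns `x1`, `x2`, `y`, and the helper names `rmse` and `mae` are assumptions made for illustration. The relative error variants are left out because their exact definition is not given here.

```r
# Synthetic stand-in data (NOT the ISE data): two covariates, one response.
set.seed(42)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 0.5 * dat$x1 - 0.2 * dat$x2 + rnorm(n, sd = 0.1)

# Error measures computed from out-of-sample residuals.
rmse <- function(res) sqrt(mean(res^2))
mae  <- function(res) mean(abs(res))

# Set-up (i): chronologically first 80% for training, last 20% for testing.
n_train <- floor(0.8 * n)
train <- dat[1:n_train, ]
test  <- dat[(n_train + 1):n, ]
fit <- lm(y ~ x1 + x2, data = train)
res_ols  <- test$y - predict(fit, newdata = test)
res_mean <- test$y - mean(train$y)   # uninformed baseline

# Set-up (ii): five folds, assigned uniformly at random among the rows.
fold <- sample(rep(1:5, length.out = n))
cv_res <- numeric(n)
for (k in 1:5) {
  fit_k <- lm(y ~ x1 + x2, data = dat[fold != k, ])
  cv_res[fold == k] <- dat$y[fold == k] -
    predict(fit_k, newdata = dat[fold == k, ])
}

cv_rmse <- rmse(cv_res)
cv_mae  <- mae(cv_res)
# One simple error bar: the standard error of the MAE.
cv_mae_se <- sd(abs(cv_res)) / sqrt(n)
```

Because every row is predicted exactly once by a model that did not see it, `cv_res` holds one out-of-sample residual per day, which is what tests on the residual distribution would operate on.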

Quantify whether and how the prediction methods differ in terms of the error measures (e.g. by appropriate tests on the empirical distribution of residuals or absolute residuals). In particular, quantify whether the prediction methods are better than predicting the average index on the training set.

(d) The validation set-up in part (c) does not mimic prediction in the sense of forecasting, where the stock market index is predicted from data from the previous time points only. Perform a benchmark experiment comparing prediction strategies which actually predict future ISE returns from (chronologically) past data, with the following prediction methods, error measures, and validation set-up:

The prediction methods to compare:
(i) Using as a prediction the value of the most recent ISE return.
(ii) Using as a prediction the average of the ISE returns on the most recent five days.
(iii) Ordinary least squares regression, using all indices (including ISE) from the most recent day as covariates (8 in total).
(iv) Ordinary least squares regression, using all indices (including ISE) from the most recent two days as covariates (16 in total).
Methods (iii)-(iv) are fit using the data of all consecutive two/three day periods on the training set.

The measures of prediction goodness to obtain (including suitable error bars or confidence intervals):
(i) out-of-sample root mean squared error (RMSE)
(ii) out-of-sample mean absolute error (MAE)
(iii) the relative variants of the above, i.e., relative RMSE and relative MAE.

The validation set-up: Training/test splits are eleven consecutive days, where the first ten are used for training and the eleventh is the test data point. All 526 such periods of consecutive days are used for testing, yielding 526 predictions for each method.
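The rolling eleven-day set-up in (d) can be sketched in R for the two simplest methods, (i) and (ii). The series `ise` below is synthetic, and its length of 536 is an assumption inferred from the 526 windows stated above (536 - 10 = 526); the variable names are illustrative only.

```r
# Synthetic stand-in for the daily ISE return series (NOT the real data).
set.seed(7)
n <- 536                 # assumed length, chosen so that n - 10 = 526
ise <- rnorm(n, sd = 0.02)

n_windows <- n - 10      # number of eleven-day periods
actual    <- numeric(n_windows)
pred_last <- numeric(n_windows)   # method (i): most recent return
pred_avg5 <- numeric(n_windows)   # method (ii): mean of last five returns

for (i in seq_len(n_windows)) {
  train <- ise[i:(i + 9)]         # first ten days of the window
  actual[i]    <- ise[i + 10]     # eleventh day is the test point
  pred_last[i] <- train[10]
  pred_avg5[i] <- mean(train[6:10])
}

res_last <- actual - pred_last
res_avg5 <- actual - pred_avg5
rmse_last <- sqrt(mean(res_last^2))
rmse_avg5 <- sqrt(mean(res_avg5^2))
```

Methods (iii) and (iv) would slot into the same loop by fitting a regression on lagged rows of `train` inside each window.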

Quantify whether and how the prediction methods differ in terms of the error measures (e.g. by appropriate tests on the empirical distribution of residuals or absolute residuals). In particular, quantify whether the prediction methods are better than predicting the value of the most recent ISE return.

(e) Manually implement the following variant of linear regression as a custom prediction method: for training data $(x_1, y_1), \dots, (x_N, y_N) \in \mathbb{R}^n \times \mathbb{R}$, the regression function is

$$f : \mathbb{R}^n \to \mathbb{R}, \quad x \mapsto \widehat{\omega}^\top x,$$

where $\widehat{\omega} \in \mathbb{R}^n$ is estimated from the data as the minimizer of the sum of absolute residuals

$$R(\omega) = \sum_{i=1}^{N} \left| \omega^\top x_i - y_i \right|.$$

That is, $\widehat{\omega} = \operatorname{argmin}_{\omega} R(\omega)$. Include this type of linear regression as an additional prediction strategy in your benchmarking experiments in (c) and (d).
Important: Do not use pre-existing packages or methods to find $\widehat{\omega}$, except built-in tools for minimizing a (mathematical) function.

(f) Write a brief report on your prediction experiment. In the report, also interpret your results. Discuss the different prediction methods in the context of the exploratory findings on your data set. Also discuss to what extent the different benchmarking scenarios quantify goodness of "prediction". At the top of your report, include a summary of no more than 100 words.

[3+5+7+7+8+5]
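The least-absolute-residuals fit described in (e) could be sketched in R with the built-in minimizer `optim`, as below. This is a sketch under assumptions, not a reference solution: the function names `lad_fit` and `lad_predict`, the zero starting value, the choice of `method = "BFGS"`, and the synthetic data are all illustrative choices. Since the objective is not smooth, the optimizer may stop slightly short of the exact minimizer.

```r
# Fit by minimizing the sum of absolute residuals with a built-in optimizer.
# X: numeric matrix of covariates (one row per observation), y: responses.
lad_fit <- function(X, y) {
  X <- as.matrix(X)
  obj <- function(omega) sum(abs(X %*% omega - y))
  # Starting value and method are assumptions, not prescribed by the task.
  optim(par = rep(0, ncol(X)), fn = obj, method = "BFGS")$par
}

# Predictions are the linear function x -> omega' x.
lad_predict <- function(omega, X) as.vector(as.matrix(X) %*% omega)

# Synthetic example (NOT the ISE data). The model above has no intercept;
# a column of ones is added here to include one.
set.seed(3)
X <- cbind(1, rnorm(100))
y <- as.vector(X %*% c(0.5, 2)) + rnorm(100, sd = 0.1)

omega_hat <- lad_fit(X, y)
y_hat <- lad_predict(omega_hat, X)
```

Written as a fit/predict pair, the method can be dropped into the same training/test loops as the ordinary least squares fits.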

Task 2: The Resistance of Constantin

Page 686 of the 8th edition of the "Rubber Bible" (the CRC Handbook of Chemistry and Physics) contains the following table of resistance in Ohms (denoted R), at a temperature of 20 degrees Celsius, of one centimeter of Constantin wire, for varying diameters in cm (denoted d).

R/Ohm   0.00093  0.00148  0.0024   0.0037   0.0059   0.0095
d/cm    0.2588   0.2053   0.1628   0.1291   0.1024   0.08118

R/Ohm   0.0150   0.024    0.038    0.048    0.061    0.096
d/cm    0.06438  0.05106  0.04049  0.03606  0.03211  0.02546

R/Ohm   0.153    0.24     0.39     0.98
d/cm    0.02019  0.01601  0.01270  0.00799

This task is about explaining resistance R in terms of diameter d, via a potentially non-linear model of the type R ≈ f(d) and/or variable transformations of R and d.

(a) Perform a brief explorative analysis to find potential candidates for the relation and variable transformations of R and d.

(b) Investigate the goodness of several regression models to explain the data. Include your candidate models, and at least the following two models:
(i) resistance being a polynomial of degree 15 in diameter
(ii) resistance being a polynomial of degree 2 in the reciprocal of the diameter.
Obtain estimates for out-of-sample RMSE and MAE by leave-one-out cross-validation. Compare the models by investigating the distribution of residuals.

(c) Write a brief report on your analysis. In the report, discuss the different models, which ones are preferable and why, and whether you can decide from the data which model is correct. At the top of your report, include a summary of no more than 100 words.
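The leave-one-out logic in (b) is sketched below in R for model (ii), using the table values above. Task 2 itself must be carried out in SAS, so this is only meant to illustrate the procedure; the variable names and the use of `lm` with `I(1 / d)` terms are illustrative assumptions.

```r
# Data transcribed from the table above (16 wires).
R <- c(0.00093, 0.00148, 0.0024, 0.0037, 0.0059, 0.0095,
       0.0150, 0.024, 0.038, 0.048, 0.061, 0.096,
       0.153, 0.24, 0.39, 0.98)
d <- c(0.2588, 0.2053, 0.1628, 0.1291, 0.1024, 0.08118,
       0.06438, 0.05106, 0.04049, 0.03606, 0.03211, 0.02546,
       0.02019, 0.01601, 0.01270, 0.00799)
dat <- data.frame(R = R, d = d)

# Leave-one-out cross-validation: each row is predicted by a model
# fitted to the other 15 rows.
loo_res <- numeric(nrow(dat))
for (i in seq_len(nrow(dat))) {
  # Model (ii): polynomial of degree 2 in the reciprocal of the diameter.
  fit <- lm(R ~ I(1 / d) + I(1 / d^2), data = dat[-i, ])
  loo_res[i] <- dat$R[i] - predict(fit, newdata = dat[i, , drop = FALSE])
}

loo_rmse <- sqrt(mean(loo_res^2))
loo_mae  <- mean(abs(loo_res))
```

The vector `loo_res` holds one out-of-sample residual per wire, so the same loop with a different model formula yields residuals that can be compared across models.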