Homework Assignment #2
Due: Wednesday, October 12, 2016, 11:59 p.m.
Total marks: 100

Question 1. [25 marks]

Let $X_1, \ldots, X_n$ be i.i.d. Gaussian random variables, each having an unknown mean $\theta$ and known variance $\sigma_0^2$.

(a) [5 marks] Assume $\theta$ is itself selected from a normal distribution $\mathcal{N}(\mu, \sigma^2)$ having a known mean $\mu$ and a known variance $\sigma^2$. What is the maximum a posteriori (MAP) estimate of $\theta$?

(b) [10 marks] Assume $\theta$ is itself selected from a Laplace distribution $\mathcal{L}(\mu, b)$ having a known mean (location) $\mu$ and a known scale (diversity) $b$. Recall that the pdf for a Laplace distribution is
$$p(x) = \frac{1}{2b} \exp\!\left(-\frac{|x - \mu|}{b}\right).$$
For simplicity, assume $\mu = 0$. What is the maximum a posteriori estimate of $\theta$? If you cannot find a closed-form solution, explain how you would use an iterative approach to obtain the solution. (The resulting objective is sketched after Question 2.)

(c) [10 marks] Now assume that we have multivariate i.i.d. Gaussian random variables $X_1, \ldots, X_n$, with each $X_i \sim \mathcal{N}(\theta, \Sigma_0)$ for some unknown mean $\theta \in \mathbb{R}^d$ and known $\Sigma_0 = I \in \mathbb{R}^{d \times d}$, where $I$ is the identity matrix. Assume $\theta \in \mathbb{R}^d$ is selected from a zero-mean multivariate Gaussian $\mathcal{N}(\mu = 0, \Sigma = \sigma^2 I)$ with a known variance parameter $\sigma^2$ on the diagonal. What is the MAP estimate of $\theta$?

Question 2. [75 marks]

In this question, you will implement variants of linear regression. We will examine some of the practical aspects of implementing regression, including handling a large number of features and samples. An initial script in python has been given to you, called script_regression.py, along with associated python files. You will run on a UCI dataset for CT slices (https://archive.ics.uci.edu/ml/datasets/Relative+location+of+CT+slices+on+axial+axis), with 385 features and 53,500 samples. Baseline algorithms, including mean and random predictions, serve as sanity checks: you should be able to outperform both random predictions and the mean value of the target in the training set.

(a) [5 marks] The main linear regression class is FSLinearRegression, where FS stands for FeatureSelect. The provided implementation subselects features and then explicitly solves for $w = (X^\top X)^{-1} X^\top y$. Increase the number of selected features (including all the way up to using all the features). What do you find? How can this be remedied?

(b) [5 marks] The current code averages the error over multiple training and test sets, subsampled from the data. Modify the code to additionally report the standard error over these multiple runs.

(c) [5 marks] Now implement ridge regression, where a ridge regularizer $\lambda \|w\|_2^2$ is added to the optimization. Run this algorithm on all the features. How does the result differ from (a)? Discuss the result in a couple of sentences, for one regularization parameter, $\lambda = 0.01$. (A sketch of the regularized solve appears after part (g).)

(d) [15 marks] Imagine that the dataset size continues to grow, which causes the matrix $X \in \mathbb{R}^{n \times d}$, for $n$ samples and $d$ features, to become quite large. One option is to go back to subselecting features. In this question, you will implement a simple greedy algorithm for selecting a subset of features, often called forward greedy selection or matching pursuit (see, for example, the pseudocode in "On the Consistency of Feature Selection using Greedy Least Squares Regression", Zhang, 2009). The idea is to greedily add one new feature on each iteration, based on its correlation with the residual: the feature with the maximum absolute dot product with the residual. On each step, for the current subset $s$ of features and residual $R = X(:, s)\,w - y$, one adds the feature $i$ with maximal $|X(:, i)^\top R|$. The coefficients are then recomputed for the current subset of features, and another feature is considered. This iteration ends once the residual error is below some threshold, or once $\max_i |X(:, i)^\top R|$ is below some threshold. Implement MPLinearRegression and report its error. (A sketch of the greedy loop appears after part (g).)

(e) [15 marks] Now imagine that you want to try a different feature selection method, and you've heard all about this amazing and mysterious Lasso. The Lasso is sometimes described as an algorithm and sometimes as an objective with a least-squares loss and an $\ell_1$ regularizer. It is more suitably thought of as the objective, rather than an algorithm, as there are many algorithms for solving the Lasso. Implement an iterative solution approach that uses the soft thresholding operator (also called the shrinkage operator). Discuss the impact of the choice of regularization parameter on the error of LassoRegression. (A sketch of the soft-thresholding update appears after part (g).)

(f) [15 marks] Now imagine that your dataset has gotten even larger, to the point where dropping features is not enough. Instead of removing features, implement a stochastic optimization approach to obtaining the linear regression solution (see Section 4.5.3). Explain your implementation choices. Report the error.

(g) [15 marks] Implement batch gradient descent for linear regression. Compare stochastic gradient descent to batch gradient descent in terms of the number of times the entire training set is processed. Set the step-size to 0.01 for stochastic gradient descent, and implement a reasonable approach to selecting the step-size for batch gradient descent. Report the error versus epochs, where one epoch involves processing the training set once. Also report the error versus runtime. (Sketches of both descent loops appear below.)
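The sketches that follow are illustrative starting points, not required solutions, and any helper names in them are hypothetical rather than part of the provided script_regression.py. First, for Question 1: the MAP estimate maximizes the posterior $p(\theta \mid x_1, \ldots, x_n) \propto p(x_1, \ldots, x_n \mid \theta)\,p(\theta)$, or equivalently minimizes the negative log posterior. For part (b), with the Laplace prior and $\mu = 0$, dropping additive constants gives
$$\hat{\theta}_{\text{MAP}} = \arg\min_{\theta} \left[ \sum_{i=1}^{n} \frac{(x_i - \theta)^2}{2\sigma_0^2} + \frac{|\theta|}{b} \right].$$
The non-differentiable $|\theta|$ term is what may prevent a simple closed form and motivate an iterative method.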
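For Question 2(a) and (c): with many (correlated) features, $X^\top X$ becomes singular or badly conditioned, so the explicit solve fails or is numerically unstable, and the ridge term restores invertibility. Below is a minimal numpy sketch of the regularized solve, assuming X and y are training arrays; the function name is illustrative, and scaling conventions for $\lambda$ vary.

```python
import numpy as np

def ridge_weights(X, y, lam=0.01):
    """Solve (X^T X + lam * I) w = X^T y, the ridge closed form.

    Adding lam * I keeps the system invertible and well conditioned
    even when X^T X alone is singular, which is the failure mode
    seen when all features are included in part (a).
    """
    d = X.shape[1]
    A = X.T @ X + lam * np.eye(d)
    b = X.T @ y
    # Prefer a linear solve over forming the explicit inverse:
    # cheaper and numerically safer.
    return np.linalg.solve(A, b)
```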
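For Question 2(d), a minimal sketch of the greedy selection loop described above, again with illustrative names and placeholder stopping thresholds:

```python
import numpy as np

def matching_pursuit(X, y, max_features=10, tol=1e-4):
    """Forward greedy selection: repeatedly add the feature most
    correlated with the current residual, then refit on the subset."""
    n, d = X.shape
    selected = []
    w = np.zeros(0)
    residual = -y  # with no features selected, X(:, s) w - y = -y
    while len(selected) < max_features:
        scores = np.abs(X.T @ residual)
        scores[selected] = -np.inf  # never re-pick a chosen feature
        i = int(np.argmax(scores))
        if scores[i] < tol:         # max correlation below threshold
            break
        selected.append(i)
        Xs = X[:, selected]
        # recompute coefficients on the current subset (least squares)
        w, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        residual = Xs @ w - y
        if np.linalg.norm(residual) < tol:  # residual error threshold
            break
    return selected, w
```

Refitting with a full least-squares solve on the selected subset each iteration matches the recomputation step described in (d); cheaper incremental updates are possible but not required.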
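For Question 2(e), one standard iterative scheme built on the soft thresholding operator is iterative shrinkage-thresholding (ISTA): a gradient step on the least-squares loss followed by shrinkage. A sketch, assuming a step-size of $1/L$ where $L$ is the Lipschitz constant of the gradient:

```python
import numpy as np

def soft_threshold(v, t):
    """Shrinkage operator: sign(v) * max(|v| - t, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam=0.1, n_iters=500):
    """ISTA sketch: gradient step on 0.5 * ||Xw - y||^2, then soft
    thresholding to handle the l1 penalty lam * ||w||_1."""
    d = X.shape[1]
    w = np.zeros(d)
    # The largest eigenvalue of X^T X (squared spectral norm of X)
    # bounds the gradient's Lipschitz constant.
    L = np.linalg.norm(X, 2) ** 2
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y)
        w = soft_threshold(w - grad / L, lam / L)
    return w
```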
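For Question 2(f) and (g), minimal sketches of the two descent loops. The 0.01 step-size for stochastic gradient descent comes from part (g); the batch step-size heuristic shown (the inverse Lipschitz constant of the averaged gradient) is one reasonable choice, not the only one:

```python
import numpy as np

def sgd(X, y, step=0.01, epochs=50, seed=0):
    """Stochastic gradient descent: one sample per update,
    reshuffled each epoch; step = 0.01 as fixed in part (g)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            # gradient of 0.5 * (x_i . w - y_i)^2 with respect to w
            w -= step * (X[i] @ w - y[i]) * X[i]
    return w

def batch_gd(X, y, epochs=50):
    """Batch gradient descent on the averaged squared error, with
    step-size 1/L for L the Lipschitz constant of the gradient."""
    n, d = X.shape
    w = np.zeros(d)
    step = n / (np.linalg.norm(X, 2) ** 2)  # 1/L for (1/2n)||Xw - y||^2
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / n
        w -= step * grad
    return w
```

Counting epochs makes the comparison in (g) fair: one SGD epoch and one batch step each process the training set exactly once, even though SGD performs n parameter updates in that time.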
Homework policies:

Your assignment will be submitted as a single pdf document and a zip file with code, on Canvas. The questions must be typed, for example in LaTeX, Microsoft Word, Lyx, etc., or written legibly and scanned. Images may be scanned and inserted into the document if it is too complicated to draw them properly. All code (if applicable) should be turned in when you submit your assignment. Use Matlab, Python, R, Java or C.

Policy for late submission of assignments: unless there are legitimate circumstances, late assignments will be accepted up to 5 days after the due date and graded using the following rule:

on time: your score × 1
1 day late: your score × 0.9
2 days late: your score × 0.7
3 days late: your score × 0.5
4 days late: your score × 0.3
5 days late: your score × 0.1

For example, this means that if you submit 3 days late and get 80 points for your answers, your total number of points will be 80 × 0.5 = 40 points.

All assignments are individual, except when collaboration is explicitly allowed. All sources used in your solutions must be acknowledged, e.g., web sites, books, research papers, personal communication with people, etc. Academic honesty is taken seriously; for detailed information see the Indiana University Code of Student Rights, Responsibilities, and Conduct.

Good luck!