Homework Assignment # 2
Due: Wednesday, October 12, 2016, 11:59 p.m.
Total marks: 100
Question 1. [25 marks]
Let $X_1, \dots, X_n$ be i.i.d. Gaussian random variables, each having an unknown mean $\theta$ and
known variance $\sigma_0^2$.
(a) [5 marks] Assume $\theta$ is itself selected from a normal distribution $\mathcal{N}(\mu, \sigma^2)$ having a known
mean $\mu$ and a known variance $\sigma^2$. What is the maximum a posteriori (MAP) estimate of $\theta$?
(b) [10 marks] Assume $\theta$ is itself selected from a Laplace distribution $\mathcal{L}(\mu, b)$ having a known
mean (location) $\mu$ and a known scale (diversity) $b$. Recall that the pdf for a Laplace distribution is
$$p(x) = \frac{1}{2b} \exp\left(-\frac{|x - \mu|}{b}\right).$$
For simplicity, assume $\mu = 0$. What is the maximum a posteriori estimate of $\theta$? If you cannot find
a closed-form solution, explain how you would use an iterative approach to obtain the solution.
(c) [10 marks] Now assume that we have multivariate i.i.d. Gaussian random variables
$X_1, \dots, X_n$, with each $X_i \sim \mathcal{N}(\theta, \Sigma_0)$ for some unknown mean $\theta \in \mathbb{R}^d$ and known $\Sigma_0 = I \in \mathbb{R}^{d \times d}$,
where $I$ is the identity matrix. Assume $\theta \in \mathbb{R}^d$ is selected from a zero-mean multivariate Gaussian
$\mathcal{N}(\mu = 0, \Sigma = \sigma^2 I)$ with a known variance parameter $\sigma^2$ on the diagonal. What is the MAP
estimate of $\theta$?
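For all three parts, recall that the MAP estimate maximizes the posterior, or equivalently the log-posterior:

```latex
\hat{\theta}_{\mathrm{MAP}}
  = \arg\max_{\theta} \; p(\theta \mid X_1, \dots, X_n)
  = \arg\max_{\theta} \left[ \sum_{i=1}^{n} \log p(X_i \mid \theta) + \log p(\theta) \right],
```

so each part reduces to maximizing the Gaussian log-likelihood plus the log of the corresponding prior.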
Question 2. [75 marks]
In this question, you will implement variants of linear regression. We will be examining some
of the practical aspects of implementing regression, including for a large number of features and
samples. An initial script in python has been given to you, called script_regression.py, and
associated python files. You will be running on a UCI dataset for CT slices [1], with 385 features and
53,500 samples. Baseline algorithms, including mean and random predictions, are used to serve as
sanity checks. We should be able to outperform random predictions and the mean value of the
target in the training set.
(a) [5 marks] The main linear regression class is FSLinearRegression. The FS stands for
FeatureSelect. The provided implementation subselects features and then explicitly solves
for $w = (X^\top X)^{-1} X^\top y$. Increase the number of selected features (up to and including
all of the features). What do you find? How can this be remedied?
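For reference, a minimal sketch of the closed-form solve, assuming numpy and illustrative names (Xtrain, ytrain, features); this is not necessarily how the provided FSLinearRegression is structured:

```python
import numpy as np

def fs_least_squares(Xtrain, ytrain, features):
    """Closed-form least squares on a subset of feature columns.
    Illustrative sketch; names and structure are assumptions, not
    the provided FSLinearRegression implementation."""
    X = Xtrain[:, features]
    # Solving the normal equations directly. With many (possibly
    # collinear) features, X^T X can be singular or ill-conditioned,
    # which is the failure mode this question asks you to observe.
    return np.linalg.solve(X.T @ X, X.T @ ytrain)
```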
(b) [5 marks] The current code averages the error over multiple train/test splits subsampled
from the data. Modify the code to additionally report the standard error over these multiple
runs.
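A minimal sketch of the standard-error computation, assuming errors holds one test error per split:

```python
import numpy as np

def mean_and_standard_error(errors):
    """Mean and standard error of the mean over multiple runs.
    errors: array-like with one test error per train/test split."""
    errors = np.asarray(errors, dtype=float)
    # Standard error of the mean: sample standard deviation (ddof=1)
    # divided by the square root of the number of runs.
    return errors.mean(), errors.std(ddof=1) / np.sqrt(errors.size)
```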
(c) [5 marks] Now implement Ridge Regression, where a ridge regularizer $\lambda \|w\|_2^2$ is added to the
optimization. Run this algorithm on all the features. How does the result differ from (a)? Discuss
the result in a couple of sentences, for one regularization parameter, $\lambda = 0.01$.
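A minimal sketch of the closed-form ridge solution (whether $\lambda$ is scaled by the number of samples depends on how the objective is defined in your notes):

```python
import numpy as np

def ridge_regression(X, y, lam=0.01):
    """Closed-form ridge solution: w = (X^T X + lam * I)^{-1} X^T y.
    Sketch only; lam scaling conventions vary across texts."""
    d = X.shape[1]
    # The lam * I term makes the matrix positive definite, so the
    # solve succeeds even when X^T X alone is singular.
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```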
[1] https://archive.ics.uci.edu/ml/datasets/Relative+location+of+CT+slices+on+axial+axis
(d) [15 marks] Imagine that the dataset size continues to grow, causing the matrix $X \in \mathbb{R}^{n \times d}$,
for $n$ samples and $d$ features, to become quite large. One option is to go back to subselecting features.
In this question, you will implement a simple greedy algorithm for selecting a subset of features,
often called forward greedy selection or matching pursuit (see, for example, the pseudocode in "On the
Consistency of Feature Selection using Greedy Least Squares Regression", Zhang, 2009). The idea
is to greedily add one new feature on each iteration, based on its correlation with the residual: the
feature with the maximum absolute dot product with the residual. On each step, for the current subset
$s$ of features and residual $R = X(:, s)\,w - y$, one adds the feature $i$ with maximal $|X(:, i)^\top R|$. The
coefficients are then recomputed for the current subset of features, and another feature is considered.
This iteration ends once the residual error is below some threshold, or once $\max_i |X(:, i)^\top R|$ is below
some threshold. Implement MPLinearRegression and report the error.
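A minimal sketch of this greedy loop (the thresholds and stopping rule here are illustrative assumptions, not prescribed values):

```python
import numpy as np

def matching_pursuit(X, y, max_features=50, tol=1e-4):
    """Forward greedy selection: repeatedly add the feature with the
    largest absolute correlation with the residual, then refit on the
    selected subset. Sketch only; tune thresholds for the dataset."""
    selected = []
    w = np.zeros(0)
    residual = -y  # R = X(:, s) w - y with an empty subset s
    while len(selected) < max_features:
        correlations = np.abs(X.T @ residual)
        if selected:
            correlations[selected] = 0.0  # never re-pick a chosen feature
        i = int(np.argmax(correlations))
        if correlations[i] < tol:
            break
        selected.append(i)
        Xs = X[:, selected]
        # Recompute the coefficients on the current subset.
        w, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        residual = Xs @ w - y
        if np.linalg.norm(residual) < tol:
            break
    return selected, w
```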
(e) [15 marks] Now imagine that you want to try a different feature selection method, and you've
heard all about this amazing and mysterious Lasso. The Lasso is often described as an algorithm,
or otherwise as an objective with a least-squares loss and an $\ell_1$ regularizer. It is more suitably thought
of as the objective, rather than an algorithm, as there are many algorithms for solving the Lasso.
Implement an iterative solution approach that uses the soft thresholding operator (also called the
shrinkage operator). Discuss the impact of the choice of regularization parameter on the error of
LassoRegression.
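One such algorithm is iterative soft thresholding (ISTA); a minimal sketch, assuming the objective $\frac{1}{2n}\|Xw - y\|_2^2 + \lambda \|w\|_1$ (the exact scaling should match your notes):

```python
import numpy as np

def soft_threshold(z, tau):
    """Soft-thresholding (shrinkage) operator, applied elementwise."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def lasso_ista(X, y, lam=0.1, n_iters=500):
    """ISTA sketch for min_w (1/2n)||Xw - y||^2 + lam * ||w||_1.
    Step size is 1/L, where L is the Lipschitz constant of the
    gradient of the smooth part."""
    n, d = X.shape
    w = np.zeros(d)
    eta = n / (np.linalg.norm(X, 2) ** 2)  # 1/L, L = sigma_max(X)^2 / n
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y) / n  # gradient of the smooth part
        w = soft_threshold(w - eta * grad, eta * lam)
    return w
```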
(f) [15 marks] Now imagine that your dataset has gotten even larger, to the point where dropping
features is not enough. Instead of removing features, implement a stochastic optimization approach
to obtaining the linear regression solution (see Section 4.5.3). Explain your implementation choices.
Report the error.
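A minimal sketch of one reasonable choice (per-sample updates with a fresh shuffle each epoch; the hyperparameters are illustrative):

```python
import numpy as np

def sgd_linear_regression(X, y, stepsize=0.01, n_epochs=10, seed=0):
    """Stochastic gradient descent on the squared error, one sample
    at a time. Sketch only; one of many defensible variants."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_epochs):
        for i in rng.permutation(n):
            # Gradient of (x_i^T w - y_i)^2 / 2 with respect to w.
            w -= stepsize * (X[i] @ w - y[i]) * X[i]
    return w
```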
(g) [15 marks] Implement batch gradient descent for linear regression. Compare stochastic
gradient descent to batch gradient descent, in terms of the number of times the entire training set
is processed. Set the step-size to 0.01 for stochastic gradient descent, and implement a reasonable
approach to select the step-size for batch gradient descent. Report the error versus epochs, where
one epoch involves processing the training set once. Report the error versus runtime.
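A minimal sketch of the batch variant, with one defensible step-size heuristic (the inverse Lipschitz constant of the gradient; a line search would be another reasonable approach):

```python
import numpy as np

def batch_gd_linear_regression(X, y, n_epochs=100):
    """Batch gradient descent; one epoch = one full pass over the data.
    Sketch only; the step-size heuristic is one of several options."""
    n, d = X.shape
    w = np.zeros(d)
    stepsize = n / (np.linalg.norm(X, 2) ** 2)  # 1/L for (1/2n)||Xw - y||^2
    per_epoch_error = []
    for _ in range(n_epochs):
        grad = X.T @ (X @ w - y) / n
        w -= stepsize * grad
        per_epoch_error.append(np.mean((X @ w - y) ** 2))
    return w, per_epoch_error
```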
Homework policies:
Your assignment will be submitted as a single pdf document and a zip file with code, on
Canvas. The questions must be typed, for example in LaTeX, Microsoft Word, LyX, etc., or must
be written legibly and scanned. Images may be scanned and inserted into the document if it is too
complicated to draw them properly. All code (if applicable) should be turned in when you submit
your assignment. Use Matlab, Python, R, Java or C.
Policy for late submissions: Unless there are legitimate circumstances, late assignments will be accepted up to 5 days after the due date and graded using the following rule:
on time: your score × 1
1 day late: your score × 0.9
2 days late: your score × 0.7
3 days late: your score × 0.5
4 days late: your score × 0.3
5 days late: your score × 0.1
For example, this means that if you submit 3 days late and get 80 points for your answers, your
total number of points will be 80 × 0.5 = 40 points.
All assignments are individual, except when collaboration is explicitly allowed. All sources
used in your solutions must be acknowledged, e.g., web sites, books, research papers, personal
communication with people, etc. Academic honesty is taken seriously; for detailed information see
Indiana University Code of Student Rights, Responsibilities, and Conduct.
Good luck!