Midterm
CS536, Machine Learning, Spring 2017
March 4, 2017
This is a take-home test. You have until 11:59pm on March 10 to answer the questions and submit your test.
The test should be submitted electronically through Sakai. All work should be strictly individual. Be very clear and
precise in your answers; this does not mean being verbose: use the language of mathematics! For questions that ask
for evaluation on a specific problem, include the code (MATLAB or Python) with clear instructions on how to run the
code and reproduce your results.
Name:
ID:
Problem    Score    Max. score
1                   80
2                   100
3                   30
4                   20
Total               230
(230 = perfect grade)
Problem 1
In many machine learning problems (e.g., Bayesian learning for linear regression) one seeks to estimate the evidence
of a model family, $\Pr(D \mid \theta) = f(\theta)$, for some set of model family (hyper)parameters¹. For instance, in the case of
linear regression and Bayesian learning those were the parameters $\alpha$ and $\beta$. We then seek to maximize the evidence
by varying those parameters, in order to find the model with good generalization performance.
However, it is sometimes the case that the evidence function $f$ we seek to optimize cannot be computed analytically. More critically, it is often challenging to optimize this function. Instead, we approach the problem by taking
samples $\{\theta_k\}_{k=1}^{K}$ of the parameters of this function, evaluating its value on those samples, $\{f(\theta_k)\}_{k=1}^{K}$, and selecting
the sample with the highest evidence, $\theta^{*} = \arg\max_{\theta_k} f(\theta_k)$.
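For concreteness, a minimal sketch of this grid-based strategy; the evidence function, its one-dimensional domain, and the grid size below are hypothetical placeholders, not part of the problem:

import numpy as np

def f(theta):
    # Hypothetical stand-in for the evidence Pr(D | theta).
    return np.exp(-(theta - 1.0) ** 2)

# Regular grid G_R = {theta_k}, k = 1..K, over an assumed 1-D domain.
K = 50
grid = np.linspace(-5.0, 5.0, K)

# Evaluate the (possibly expensive) evidence at every grid point,
# then keep the sample with the highest evidence.
values = np.array([f(theta) for theta in grid])
theta_star = grid[np.argmax(values)]
print(theta_star)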
1. Suppose you approach the above parameter optimization problem using a regular grid of points $G_R = \{\theta_k\}_{k=1}^{K}$,
where $\theta \in \mathbb{R}^d$. Assume that you can evaluate the function $f(\theta)$ for any argument $\theta$ in some fixed time $\tau$, which
could be large. What are the potential issues with the strategy outlined in the paragraph above under these settings?
2. Now consider a different evaluation strategy. Instead of first constructing a regular grid of points $G_R$, you will
be given two initial values of $\theta$, call them $G_2 = \{\theta_k\}_{k=1}^{2}$. Your goal is to select the next "best" point $\theta_3$
at which this function is to be evaluated. The "tool" you have at your disposal is Bayesian linear regression
with radial basis functions placed at the points in $G_2$. In other words, you can model $f$ from the samples $G_2$
and the corresponding $F_2 = \{f(\theta_k)\}_{k=1}^{2}$. Propose a strategy to select the next best point $\theta_3$ at which to evaluate $f$.
Justify clearly the mathematical model underpinning this strategy and the set of steps (i.e., the algorithm) that you
will use to select that next point. Be precise. Generalize this to selecting $\theta_{K+1}$ given the previous evaluations $G_K$
and $F_K$.
3. Demonstrate this strategy on the problem of finding the maximum of the function $f(x) = e^{-x^2} - e^{-(x-5)^2}$
on the interval $I = [-3, 8]$. Choose two starting points, $x_1 = -0.8$ and $x_2 = 5.7$. Show the sequence of
ten subsequent choices of $x_k$; at each iteration, plot the estimated mean of your function $f$ and its
uncertainty over the given evaluation interval $I$, along with the location of the next chosen evaluation point $x_{k+1}$. A possible scaffold is sketched below.
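One possible scaffold for this demonstration, offered only as a sketch: it models $f$ by Bayesian linear regression on RBF features centered at the evaluated points and selects the next point with an upper-confidence-bound rule. The noise and prior variances, the bandwidth, the grid resolution, and the UCB constant are illustrative assumptions; any well-justified acquisition rule may replace it.

import numpy as np

def f(x):
    # Target whose maximum we seek on I = [-3, 8].
    return np.exp(-x ** 2) - np.exp(-(x - 5.0) ** 2)

def rbf_features(x, centers, s=1.0):
    # One RBF per evaluated point; the bandwidth s is an assumed choice.
    x = np.atleast_1d(x)
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2.0 * s ** 2))

alpha, beta = 1e-4, 10.0            # assumed noise / prior variances
X = np.array([-0.8, 5.7])           # the two given starting points
T = f(X)
grid = np.linspace(-3.0, 8.0, 500)  # evaluation interval I

for k in range(10):
    # Bayesian linear regression posterior over the weights w:
    #   S = (I/beta + Phi^T Phi / alpha)^{-1},  m = S Phi^T T / alpha
    Phi = rbf_features(X, X)
    S = np.linalg.inv(np.eye(len(X)) / beta + Phi.T @ Phi / alpha)
    m = S @ Phi.T @ T / alpha

    # Predictive mean and standard deviation over the interval I.
    Pg = rbf_features(grid, X)
    mean = Pg @ m
    std = np.sqrt(np.sum((Pg @ S) * Pg, axis=1) + alpha)

    # Upper-confidence-bound choice of the next evaluation point.
    x_next = grid[np.argmax(mean + 2.0 * std)]
    print(f"iteration {k + 1}: evaluate at x = {x_next:.3f}")
    X, T = np.append(X, x_next), np.append(T, f(x_next))

Per-iteration plots of mean, mean ± std, and x_next over grid (e.g., with matplotlib) complete the requested figures.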
Problem 2
In class we discussed several approaches to modeling data using Generalized Linear Models (GLMs). In the regression
case, the setting was that one seeks to find the LR model $f(x)$, defined by basis functions $\phi$, from the dataset $D =
\{(x_i, t_i)\}_{i=1}^{N}$, where the $t_i$ are noisy samples of that unknown function $f(x)$.
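Written out, the modeling assumption running through the items below (with the noise term $\epsilon$ defined in item 1) is

$$t_i = y(x_i \mid w) + \epsilon_i = \sum_{j=0}^{M} w_j \phi_j(x_i) + \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, \alpha),$$

where the range $j = 0, \dots, M$ matches the $M$-th order bases considered in item 2.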
1. Suppose that your LR model is $y(x \mid w) = w^{T} \phi(x)$ for $x \in D$ and $\Phi \in \mathcal{M}$. What is the distribution of the
function sample $f(x \mid w) = y(x) + \epsilon$, $\epsilon \sim \mathcal{N}(0, \alpha)$, for a given $x$?
2. Let $x \in \mathbb{R}$. What is the distribution of the derivatives of the function $f$ defined above, i.e., $f'(x \mid w) =
df(x \mid w)/dx$? Consider three cases of bases: (1) $M$-th order polynomial bases, (2) $M$-th order radial basis function bases centered at $\mu$ with bandwidth 1, and (3) probit bases, $\phi(x \mid \mu) = \Phi(x - \mu)$, where $\Phi(x) =
\int_{-\infty}^{x} \mathcal{N}(u \mid 0, 1)\, du$.
3. What is the joint distribution of the set of function samples $\{t_i\}_{i=1}^{N}$ given the set of corresponding inputs
$\{x_i\}_{i=1}^{N}$, i.e., $\Pr(\{t_i\}_{i=1}^{N} \mid \{x_i\}_{i=1}^{N}, w)$?
4. Suppose that your LR model has a Bayesian prior on the weights $w$, $w \sim \mathcal{N}(0, \beta I)$. What is the joint distribution
of the set of function samples $\{t_i\}_{i=1}^{N}$ given the set of corresponding inputs $\{x_i\}_{i=1}^{N}$, with $f(x) = y(x) + \epsilon$,
$\epsilon \sim \mathcal{N}(0, \alpha)$, over all possible linear models (i.e., once $w$ is integrated out), $\Pr(\{t_i\}_{i=1}^{N} \mid \{x_i\}_{i=1}^{N})$?
5. What is the distribution of the derivatives of the (integrated-out) function $f$ defined above, i.e., $f'(x) =
df(x)/dx$? Consider the three cases of bases mentioned previously.
6. Now focus on the LR with $M$-th order polynomial bases. Suppose you were given not only the set of points
$D = \{(x_i, t_i)\}_{i=1}^{N}$ but also a set of noisy derivative estimates of $f$, $D_d = \{(x_i, t'_i)\}_{i=1}^{N_d}$. What is the best
Maximum Likelihood estimate of $w$, $w_{ML}$, given both sets of samples $D$ and $D_d$?
¹I will use the term "parameters" to refer to hyperparameters. The meaning should generally be clear from the context.
7. Using $w_{ML}$, find the best predicted value of the derivative of $f$ at some arbitrary input point $x$.
8. What is the Bayesian posterior of $w$ given $D$ and $D_d$?
9. What is the optimal Bayesian prediction of the derivative $f'(x)$ at some arbitrary input point $x$, given $D$ and
$D_d$?
10. Let $f(x) = \mathrm{sinc}(\pi x) = \sin(\pi x)/(\pi x)$. Generate a set $D$ of 11 data points at equally spaced samples $x \in [0, 3]$ with
the corresponding values $t = f(x) + \epsilon$, $\epsilon \sim \mathcal{N}(0, \alpha = 0.1)$. Then generate the samples of 5 derivative data
points $D_d$, again based on equally spaced samples $x \in [0, 3]$. Let $\beta = 1$. Show the plots of the function and the
derivative predictions, along with the predicted uncertainties ($\pm 1$ STD), for 100 equally spaced points $x \in [0, 3]$
given the ML and the Bayesian prediction models for two cases: (1) when only the dataset of function values $D$
is used, and (2) when both the dataset of function values and the derivatives, $D$ and $D_d$, are used for estimation.
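A minimal data-generation sketch for item 10, assuming that $\alpha$ denotes the noise variance (so the noise standard deviation is $\sqrt{\alpha}$) and that the derivative samples sit at their own 5 equally spaced locations; the seed is an arbitrary choice for reproducibility:

import numpy as np

rng = np.random.default_rng(0)   # fixed seed for reproducibility
alpha = 0.1                      # noise variance, as given

def f(x):
    return np.sinc(x)            # np.sinc(x) = sin(pi x) / (pi x)

def f_prime(x):
    # Derivative of sin(pi x)/(pi x); its value at x = 0 is 0.
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    nz = x != 0
    out[nz] = (np.pi * x[nz] * np.cos(np.pi * x[nz])
               - np.sin(np.pi * x[nz])) / (np.pi * x[nz] ** 2)
    return out

# Dataset D: 11 equally spaced noisy function values on [0, 3].
x = np.linspace(0.0, 3.0, 11)
t = f(x) + rng.normal(0.0, np.sqrt(alpha), size=x.shape)

# Dataset D_d: 5 equally spaced noisy derivative values on [0, 3].
xd = np.linspace(0.0, 3.0, 5)
td = f_prime(xd) + rng.normal(0.0, np.sqrt(alpha), size=xd.shape)

xs = np.linspace(0.0, 3.0, 100)  # evaluation grid for the predictions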
Problem 3
Consider a two-category one-dimensional problem with class-conditional densities given by
$$p(x \mid \omega_1) = \begin{cases} 2x, & 0 \le x \le 1 \\ 0, & \text{otherwise} \end{cases} \qquad (1)$$

$$p(x \mid \omega_2) = \begin{cases} 2 - 2x, & 0 \le x \le 1 \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$
Suppose your training set consists of two points, $D = \{(x_1, +1), (x_2, -1)\}$, where $+1$ and $-1$ indicate class memberships in classes 1 and 2, respectively. Let the following decision rule be given:

$$y = f(x) = \begin{cases} +1, & \exists (x_i, +1) \in D,\ \forall (x_j, -1) \in D : (x - x_i)^2 < (x - x_j)^2 \\ -1, & \text{otherwise} \end{cases} \qquad (3)$$

A simulation sketch of this rule follows the questions below.
1. What is the expected error rate of this classifier (decision rule)?
2. What is the optimal Bayes error rate? Assume $P(\omega_1) = P(\omega_2)$.
3. Compare the two error rates. Can you intuitively explain their relationship?
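A Monte Carlo sanity check for item 1, a numerical estimate only and no substitute for the analytic derivation; it draws from densities (1) and (2) by inverse-CDF sampling and takes the expectation over both the training pair $(x_1, x_2)$ and the test point:

import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Inverse-CDF draws: p(x|w1) = 2x has CDF x^2, so x = sqrt(u);
# p(x|w2) = 2 - 2x has CDF 1 - (1 - x)^2, so x = 1 - sqrt(u).
x1 = np.sqrt(rng.random(n))          # positive training points
x2 = 1.0 - np.sqrt(rng.random(n))    # negative training points

# Test point drawn from the mixture with P(w1) = P(w2) = 1/2.
is_class1 = rng.random(n) < 0.5
x = np.where(is_class1, np.sqrt(rng.random(n)), 1.0 - np.sqrt(rng.random(n)))

# Decision rule (3): predict +1 iff x is closer to x1 than to x2.
pred_class1 = (x - x1) ** 2 < (x - x2) ** 2
print(f"estimated expected error rate: {(pred_class1 != is_class1).mean():.4f}")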
Problem 4
Consider a binary classification problem in which each observation $x_i$ is known to belong to one of two classes,
corresponding to $t = 0$ and $t = 1$, and suppose that the procedure for collecting training data is imperfect, so that
training points are sometimes mislabelled. For every data point $x_i$, instead of a value $t_i$ for the class label, we
have a value $\pi_i$ representing the probability that $t_i = 1$. Given a probabilistic model $p(t = 1 \mid \phi)$, write down
the log-likelihood function appropriate to such a data set.
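As a reference point, a minimal sketch of the ordinary Bernoulli log-likelihood for hard labels $t_i \in \{0, 1\}$; the logistic-sigmoid model of $p(t = 1 \mid \phi)$ used here is purely illustrative, and the question asks how this expression should change when each hard label $t_i$ is replaced by the probability $\pi_i$:

import numpy as np

def log_likelihood(w, Phi, t):
    # Ordinary Bernoulli log-likelihood for hard labels t_i in {0, 1}:
    #   sum_i [ t_i ln y_i + (1 - t_i) ln(1 - y_i) ],
    # where y_i = p(t_i = 1 | phi_i) is modeled here, for illustration
    # only, by a logistic sigmoid of linear features.
    y = 1.0 / (1.0 + np.exp(-Phi @ w))
    return np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))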