Assignment 3
CIS602 Data Mining, Spring 2017

Q1. (40 pts) In this question, we will use SVM for a few experiments. There are many good implementations of SVM, e.g., the MATLAB built-in functions and LIBSVM. You may use either of them, or another implementation you find yourself, to solve the following questions.

a. (20 pts) USPS is a hand-written digit database covering the ten digits 0, 1, 2, …, 9. Each digit has more than 1,000 samples, and in total there are 11,000 samples. The resolution of each image is 16×16, so the length of each vector is 256. In this question, you will use SVM for digit classification. For each digit, use the first 10, 50, or 100 images as the training data, and the rest as test data, to solve this multi-class classification problem. In each setting, you are required to use (1) a linear kernel, (2) a polynomial kernel, and (3) a Gaussian kernel. As each kernel has a few model parameters, e.g., the bandwidth σ of the Gaussian kernel and the slack penalty C common to all kernels, you will need to use cross-validation on the training set to select the best model parameters. You can use "grid search" to find the best values (a minimal grid-search sketch appears at the end of this handout). So, in total there are 9 different results (recognition accuracy) from 3 training settings and 3 kernels.

b. (20 pts) You will use the PIE database for the same tasks introduced in Q1(a). As we have fewer data samples per class in the PIE database, we will use the first 5, 10, or 15 images as the training samples in each setting. Again, you are expected to report 9 different results (recognition accuracy).

[Submissions] Source code and a report of the experimental results (in PDF)

[Hints] You may want to apply PCA, centralization, or normalization before running any learning algorithm. To better understand kernel SVM and learn how to choose kernel parameters, e.g., for the Gaussian kernel, you may want to use the following sample code for the MATLAB built-in function to illustrate the decision boundary. Type the following command in MATLAB:

openExample('stats/NonlinearClassifierWithGaussianKernelExample')

[Provided files list] USPS.mat, PIE.mat

Q2. (60 pts) We've learned Kmeans and spectral clustering in the course, and both are state-of-the-art methods for data clustering. In this question, you will implement spectral clustering yourself and compare the clustering performance of Kmeans and your implementation. For Kmeans, you may use either the MATLAB built-in function or the fast implementation included in the provided files (litekmeans). The basic steps of spectral clustering, sketched in code below, are:

a. Build the affinity graph of the given dataset;
b. Compute the graph Laplacian;
c. Compute the eigenvectors of the normalized graph Laplacian;
d. Use the rows of the first several eigenvectors (stacked as columns) as the new features;
e. Normalize each row to have unit length;
f. Run Kmeans on the row space, i.e., treating each row as a sample in the new feature space.

You will find more details in this paper and in our slides. In this question, you will run Kmeans and spectral clustering on both the USPS and PIE databases. To evaluate your results, you will report clustering accuracy, NMI, and running time. Clustering accuracy and NMI are two important clustering performance indicators (their code is already provided for this question). I have also uploaded sample code (SampleClustering.m) for Kmeans and its evaluation on the USPS dataset for your reference.
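For reference, here is a minimal MATLAB sketch of steps a–f above. It is a starting point under stated assumptions, not the required implementation: it assumes a Gaussian similarity restricted to a k-nearest-neighbor graph and the symmetric normalized Laplacian, and it leaves the bandwidth sigma, the neighbor count k, and the cluster count as parameters you must tune yourself (see the hints below).

function labels = spectral_sketch(X, nClusters, k, sigma)
% Minimal spectral clustering sketch. X is n-by-d (rows are samples);
% k is the neighbor count and sigma the Gaussian bandwidth -- both
% are assumptions to be tuned, not prescribed values.
n = size(X, 1);

% (a) Affinity graph: Gaussian similarity on the k nearest neighbors.
D2 = pdist2(X, X).^2;                       % squared pairwise distances
W  = exp(-D2 ./ (2 * sigma^2));
[~, idx] = sort(D2, 2);
mask = false(n);
for i = 1:n
    mask(i, idx(i, 2:k+1)) = true;          % column 1 is the point itself
end
W = W .* (mask | mask');                    % symmetrized kNN graph
W(1:n+1:end) = 0;                           % ensure no self-loops

% (b)-(c) Normalized graph Laplacian and its bottom eigenvectors.
d  = sum(W, 2);
Dh = spdiags(1 ./ sqrt(d + eps), 0, n, n);  % D^{-1/2}
L  = speye(n) - Dh * sparse(W) * Dh;        % L = I - D^{-1/2} W D^{-1/2}
[V, ~] = eigs(L, nClusters, 'sa');          % smallest eigenvalues

% (d)-(e) Rows of the leading eigenvectors, rescaled to unit length.
U = bsxfun(@rdivide, V, max(sqrt(sum(V.^2, 2)), eps));

% (f) Kmeans in the new feature space (litekmeans would also work here).
labels = kmeans(U, nClusters, 'Replicates', 10);
end

Running time for the report can be measured with tic/toc around the calls, and clustering accuracy and NMI come from the provided accuracy.m and nmi.m.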
So, in total there are 12 experimental results from 2 databases, 3 indicators, and 2 clustering methods.

[Submissions] Source code and a report of the experimental results (in PDF)

[Hints] You may want to apply PCA, centralization, or normalization before running any algorithm (a minimal preprocessing sketch appears below). In addition, you will need to write code to find the best model parameters for spectral clustering, e.g., the bandwidth of the Gaussian similarity and the number of neighbors used when building the affinity graph.

[Provided files list] litekmeans.mat, USPS.mat, PIE.mat, SampleClustering.m, hungarian.m, accuracy.m, nmi.m
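As referenced in the Q1 hints, below is a minimal sketch of parameter selection by grid search with 5-fold cross-validation, using the MATLAB built-in fitcecoc/templateSVM for the multi-class SVM (LIBSVM with its '-v' option is an equally valid route). The variables Xtrain, ytrain, Xtest, and ytest and the candidate grids are illustrative assumptions, not prescribed values; note that MATLAB's KernelScale plays the role of the bandwidth σ up to a constant factor.

% Grid search for the Gaussian-kernel SVM; the grids below are
% illustrative assumptions, as are Xtrain/ytrain/Xtest/ytest.
Cs     = 2.^(-5:2:15);            % candidate slack penalties C
scales = 2.^(-5:2:5);             % candidate kernel scales (~ bandwidth)
best   = struct('err', inf, 'C', NaN, 's', NaN);

for C = Cs
    for s = scales
        t  = templateSVM('KernelFunction', 'gaussian', ...
                         'BoxConstraint', C, 'KernelScale', s);
        cv = fitcecoc(Xtrain, ytrain, 'Learners', t, 'KFold', 5);
        e  = kfoldLoss(cv);       % 5-fold cross-validation error
        if e < best.err
            best = struct('err', e, 'C', C, 's', s);
        end
    end
end

% Retrain on the whole training set with the selected parameters and
% report recognition accuracy on the held-out test set.
t   = templateSVM('KernelFunction', 'gaussian', ...
                  'BoxConstraint', best.C, 'KernelScale', best.s);
mdl = fitcecoc(Xtrain, ytrain, 'Learners', t);
acc = mean(predict(mdl, Xtest) == ytest);

% For the linear kernel, grid only C ('KernelFunction','linear');
% for the polynomial kernel, also grid 'PolynomialOrder'.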
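Both hint sections mention PCA, centralization, and normalization; here is one minimal way to do this in MATLAB. The choice of 50 components and the variable names X and Xtest are illustrative assumptions; for Q1, the mean and projection must be learned on the training data only and then applied unchanged to the test data, as the last line shows.

% Illustrative preprocessing: center, project with PCA, normalize rows.
mu = mean(X, 1);                    % X is n-by-d, rows = samples
Xc = bsxfun(@minus, X, mu);         % centralization
[coeff, score] = pca(Xc);           % built-in PCA
Xp = score(:, 1:50);                % keep the first 50 components (illustrative)
Xn = bsxfun(@rdivide, Xp, max(sqrt(sum(Xp.^2, 2)), eps));  % unit-length rows

% For Q1, apply the training-set statistics to the test set:
Xtest_p = bsxfun(@minus, Xtest, mu) * coeff(:, 1:50);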