Table of Contents

Abstract
1. Introduction
2. Methods
   2.1. Dataset
   2.2. Preprocessing
   2.3. Classifiers
      2.3.1. Multinomial Logistic Regression (MLR)
      2.3.2. Support Vector Machine (SVM)
      2.3.3. Gaussian Naïve Bayes (GNB)
      2.3.4. K-Nearest Neighbors (KNN)
3. Experiments and Results
   3.1. Hardware and Software Specifications
   3.2. Constraints
   3.3. Accuracy
   3.4. Analysis
4. Discussion
5. Conclusion and Future Work
Appendix

Abstract

Classification is a supervised machine learning technique and a vital method for identifying whether there is structure in the data that can be used to predict future class labels accurately. This study compares the performance of several classifiers through accuracy, precision, recall and F-measure scores. The performance of a self-implemented Multinomial Logistic Regression classifier was compared against Support Vector Machine, Gaussian Naïve Bayes and K-Nearest Neighbors classifiers. The given training dataset was used to train and build the model for each classifier.

1. Introduction

Many types of classification methods are in wide use, such as Logistic Regression, Support Vector Machine, Naïve Bayes, K-Nearest Neighbors and more. The choice of classification method, however, depends on several factors, including the type of data, the number of records, the features of the data and many more. The aim of using a classifier is to build a model based on the experience gained from the training dataset and then predict class labels for the test dataset. The basic workflow of classification is to use the labeled data as training input, train a model that generates predictions based on the known labels and, finally, use the model to make predictions on new data.

[Figure: classification workflow. Labeled training data -> learn model -> model (classifier) -> apply model -> predicted labels]

The purpose of this study is to compare the performance of a self-implemented Multinomial Logistic Regression (MLR) classifier against several classifiers implemented with the scikit-learn external library: Support Vector Machine (SVM), Gaussian Naïve Bayes (GNB) and K-Nearest Neighbors (KNN). The performance of each classifier was evaluated and compared using several classification metrics, specifically the accuracy, precision, recall and F-measure scores. These metrics were obtained by implementing the confusion matrix and ROC (Receiver Operating Characteristic).

The Multinomial Logistic Regression classifier is preferred as the benchmark for several reasons. Multinomial Logistic Regression is Logistic Regression for multi-class problems, where the dependent variable has more than two categories and the categories are not in an ordered form. The training set contains 30 unique labels, so MLR is a suitable candidate for predicting multi-class labels for future inputs. The MLR classifier is also a probabilistic model: it predicts the class with the highest probability and allows some level of uncertainty to be taken into account. MLR offers tunable parameters, including the learning rate used by gradient descent to minimize the objective function (logistic loss) and produce an accurate logistic curve for the data. This study is important for understanding the underlying concepts and behaviors of each classifier and how prediction accuracy can be improved by changing certain parameters.
2. Methods

Four main classifiers were used in this study to build models for the given dataset, calculate accuracy scores and find the predicted class labels for the test data: MLR as the benchmark, plus SVM, GNB and KNN.

2.1. Dataset

The dataset contains data collected from the Apps Market and is divided into four files. The training_data.csv file contains 20,104 rows and 13,627 columns, with the app's name in the first column; the remaining columns contain tf-idf values, which are either zero or non-zero. The matrix dimension for this file is 20,104 × 13,627. The second file, training_desc.csv, contains the app's name and its description, with 20,104 rows in total. The training_labels.csv file contains 20,104 rows with two columns, the app's name and the label; consider this a 20,104 × 2 matrix. The last file, testing_data.csv, is used for testing, i.e. to find the predicted label for each app's name. There are 30 unique labels in the dataset, for example Medical, Sports, Shopping and so on.

2.2. Preprocessing

A certain amount of preprocessing was done before training the models with the selected classifiers (a code sketch of these steps follows the list):

• Loaded the training files into Pandas DataFrames.
• Joined the two training datasets, training_data and training_labels, on the app's name.
• Defined the feature vectors as X and the response vector (the labels) as y, and stored each in its own array.
• Mapped the labels to integer values, since the original labels use the string data type.
• Split the merged training data into training and validation sets using 5-fold cross-validation (train set: X_train and y_train; validation set: X_validation and y_validation). Note that this method runs five times slower than a conventional train/test split.
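The preprocessing steps above can be expressed roughly as follows. This is a minimal sketch rather than the report's actual code; the key-column name ("app_name") and the label-column name ("label") are assumptions, since the exact headers are not given in the report.

```python
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.preprocessing import LabelEncoder

# Load the training files (Section 2.1) into DataFrames.
data = pd.read_csv("training_data.csv")
labels = pd.read_csv("training_labels.csv")

# Join the tf-idf features and the labels on the app's name
# ("app_name" and "label" are assumed column names).
merged = data.merge(labels, on="app_name")

# Feature matrix X and response vector y.
X = merged.drop(columns=["app_name", "label"]).values
le = LabelEncoder()
y = le.fit_transform(merged["label"])  # map the 30 string labels to integers

# 5-fold cross-validation: each fold yields a train set and a validation set.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in kf.split(X):
    X_train, X_validation = X[train_idx], X[val_idx]
    y_train, y_validation = y[train_idx], y[val_idx]
    # ... train a classifier on (X_train, y_train) and score it on the validation set ...
```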
2.3. Classifiers

2.3.1 Multinomial Logistic Regression (MLR)

Multinomial Logistic Regression is Logistic Regression for multi-class problems, where the dependent variable has more than two categories. This classifier can be used when the dependent variable has multiple categories and those categories are not in an ordered form. The objective of the classifier is to predict the conditional probability P(y = k | x, W) by applying a non-linear transformation to the data in order to find a linear decision boundary. The size of each class's weight vector is penalized by adding a regularization penalty of the form λ tr(WᵀW), where W is the weight matrix and tr is the trace function. The logistic loss function (cross-entropy), used to measure the fit of the model, is defined by the formula below:

L(W) = -\frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} y_{nk} \ln(\hat{y}_{nk})

where N is the number of training examples, K is the number of classes, y_{nk} = 1 if example n is in class k and zero otherwise (a one-hot representation), and \hat{y}_{nk} is the model's predicted probability that example n is in class k. The weight matrix has dimension (K, D), where D is the dimension of the input after it is mapped into a new space by the basis function ϕ, i.e. a = ϕ(x).

This classifier is self-implemented; a minimal code sketch appears after Section 2.3.2. To build and test the model, 10,000 records were chosen for training. A 5-fold split was applied and iterated over 1,000 times for each fold; in each fold, a training set of 8,000 records and a validation set of 2,000 records were obtained. Apart from the preprocessing listed above, One-Hot Encoding of the target column is required to allow One-vs-Rest (OVR) classification, in other words, to find the probability distribution over all possible outcomes. The weights were initialized to zero at the beginning. Once all parameters were initialized, model fitting started: the logistic loss was selected as the objective function to measure the fit of the model, and the weights were adjusted in each iteration using the gradient descent method. Regularization is required to reduce overfitting, since the data contain multiple classes.

MLR is the benchmark classifier in this study. The accuracy, recall, precision and F-measure scores were obtained from the confusion matrix and then compared against the corresponding scores of the other classifiers (SVM, GNB and KNN).

2.3.2 Support Vector Machine (SVM)

The objective of SVM is to find the hyperplane that gives the maximum margin between two classes of data and then predict on which side of it new data points will land. A bigger margin (distance) is better, because the chance that a new data point lands on the correct side of the boundary is higher. In this study, the SVM model was trained using the scikit-learn external library (sklearn.svm) with a polynomial kernel, which allows the data to be mapped into a higher dimension. Given the training pairs (x_n, y_n), n = 1, 2, ..., N, the function ϕ maps each vector x_n into a higher-dimensional space, which allows the SVM to find the hyperplane with the optimal margin separating the classes. This is done by solving the following optimization problem:

\min_{w, b, \xi} \frac{1}{2} w^T w + C \sum_{n=1}^{N} \xi_n \quad \text{subject to} \quad y_n (w^T \phi(x_n) + b) \ge 1 - \xi_n, \; \xi_n \ge 0

The polynomial kernel is used to map the data into the higher dimension. Its formula is

K(x_i, x_j) = (\gamma \, x_i^T x_j + r)^d, \quad \gamma > 0

where γ, r and d are the parameters of the kernel. As the data were divided using 5-fold cross-validation, the average accuracy score was computed on the validation data of each fold. A confusion matrix was constructed for this classifier and used to calculate the accuracy, recall, precision and F-measure scores, which are compared with the performance of the benchmark classifier, MLR.
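To make the MLR description in Section 2.3.1 concrete, below is a minimal sketch of a softmax (multinomial logistic) regression classifier trained with gradient descent and an L2 (trace) penalty. It is a sketch under stated assumptions, not the report's actual implementation; the learning rate, iteration count and lambda values are placeholders.

```python
import numpy as np

def softmax(Z):
    # Numerically stable softmax over the class axis.
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def train_mlr(X, Y, lr=0.1, n_iters=1000, lam=0.01):
    """X: (N, D) feature matrix; Y: (N, K) one-hot labels. Returns weights W: (D, K)."""
    N, D = X.shape
    K = Y.shape[1]
    W = np.zeros((D, K))            # weights initialized to zero, as in Section 2.3.1
    for _ in range(n_iters):
        P = softmax(X @ W)          # predicted probabilities, shape (N, K)
        # Gradient of the cross-entropy loss plus the L2 (trace) penalty.
        grad = X.T @ (P - Y) / N + lam * W
        W -= lr * grad              # gradient descent step
    return W

def predict_mlr(X, W):
    # Predict the class with the highest probability.
    return softmax(X @ W).argmax(axis=1)
```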
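Likewise, a sketch of the polynomial-kernel SVM from Section 2.3.2, reusing the fold variables from the preprocessing sketch. The kernel parameters (degree d, gamma, coef0 r) and the penalty C shown here are illustrative, not necessarily the values used in the study.

```python
from sklearn.svm import SVC

# Polynomial kernel K(x_i, x_j) = (gamma * x_i^T x_j + coef0)^degree.
svm = SVC(kernel="poly", degree=3, gamma="scale", coef0=1.0, C=1.0)
svm.fit(X_train, y_train)                              # fit on one cross-validation fold
fold_accuracy = svm.score(X_validation, y_validation)  # accuracy on that fold's validation set
y_pred = svm.predict(X_validation)                     # predicted labels for the confusion matrix
```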
2.3.3 Gaussian Naïve Bayes (GNB)

Gaussian Naïve Bayes extends the Naïve Bayes classifier. It is good at dealing with continuous attributes, where the values of each feature within each class are assumed to follow a Gaussian distribution. The assumption behind this classifier is that each feature x_i is conditionally independent of the others given the class y. The likelihood can be computed with the formula below, where the mean and variance of each feature in each class need to be estimated:

p(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right)

This classifier was implemented using the scikit-learn external library (sklearn.naive_bayes.GaussianNB); a code sketch appears after Section 2.3.4. The accuracy, recall, precision and F-measure scores obtained through the confusion matrix were used to compare its performance with the MLR classifier.

2.3.4 K-Nearest Neighbors (KNN)

The KNN classifier is a simple classification algorithm that stores the labeled training data and does not build a classification model: during the training process, the classifier only stores the training data. It classifies new data based on a distance function, computing the distance between data points and assigning labels according to the closest labeled training points (the nearest neighbors). By looking at the classes of its nearest neighbors, the algorithm learns how it should classify the new data. The distance is calculated by the equation below:

D(x, x') = \sum_{i=1}^{d} |x_i - x'_i|, \quad \text{where } |x_i - x'_i| = 0 \text{ if } x_i = x'_i \text{ and } 1 \text{ if } x_i \ne x'_i

The scikit-learn external library (sklearn.neighbors.KNeighborsClassifier) was used to implement this algorithm. The training data were divided into 5 parts, and 1,000 rows were selected from the training dataset because of this classifier's slow runtime. The number of neighbors k selected for the training process is 5. Finally, to measure how accurate the classifier is, the accuracy score was calculated on the validation data against the predicted labels obtained during training. The performance of KNN was then compared against the benchmark classifier.
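For the GNB classifier of Section 2.3.3, the scikit-learn usage is brief; the sketch below again reuses the fold variables from the preprocessing sketch.

```python
from sklearn.naive_bayes import GaussianNB

# GaussianNB estimates a mean and a variance per feature per class,
# then applies Bayes' rule under the conditional-independence assumption.
gnb = GaussianNB()
gnb.fit(X_train, y_train)
fold_accuracy = gnb.score(X_validation, y_validation)
y_pred = gnb.predict(X_validation)
```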
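And a sketch of the KNN setup from Section 2.3.4 with k = 5 neighbors, as stated above. The "manhattan" metric is chosen here to match the sum-of-absolute-differences distance formula; this is an assumption, since scikit-learn's default metric is Euclidean and the report does not name the metric explicitly.

```python
from sklearn.neighbors import KNeighborsClassifier

# k = 5 neighbors; "manhattan" computes the sum of absolute differences.
knn = KNeighborsClassifier(n_neighbors=5, metric="manhattan")
knn.fit(X_train, y_train)               # KNN merely stores the training data here
y_pred = knn.predict(X_validation)      # distances are computed at prediction time
fold_accuracy = knn.score(X_validation, y_validation)
```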
3. Experiments and Results

3.1. Hardware and Software Specifications

Hardware: Desktop PC, 130 GB RAM
Software: Jupyter Notebook (Python v3)

3.2. Constraints

Due to memory and CPU limitations, the models were trained on 10,000 records for the MLR and GNB classifiers and on 1,000 records for the SVM and KNN classifiers.

3.3. Accuracy

The training data were split using the 5-fold cross-validation technique, and the overall accuracy was obtained by averaging the accuracy across the folds. This technique systematically creates k train/test sets and averages the results, which gives a better estimate of the out-of-sample performance; however, it runs k times slower than a conventional train/test split.

The confusion matrix describes the performance of a classification model: it shows how often the classifier is correct and how often it is incorrect (the misclassification rate) when predicting class labels. The performance metrics used in this study depend heavily on the counts obtained from the confusion matrix. Since the dataset contains multi-class labels and is split using k-fold cross-validation, the scores calculated from the confusion matrix must be summed over the folds to obtain the total scores. Several basic terms are needed to calculate the scores:

True Positive (TP): correctly predicted positive
True Negative (TN): correctly predicted negative
False Positive (FP): falsely predicted positive
False Negative (FN): falsely predicted negative

A confusion matrix table (not reproduced here) gives the detailed breakdown of correct and incorrect classifications for each label obtained by the MLR classifier. The formulas below are used to calculate the accuracy, recall, precision and F-measure scores:

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}

\text{Recall} = \frac{TP}{TP + FN}

\text{Precision} = \frac{TP}{TP + FP}

\text{F-measure} = \frac{2\,TP}{2\,TP + FP + FN}

The runtime of each classifier was also monitored to give an idea of how the models perform during training. The table below shows the scores of the classification methods used in this study, with MLR as the benchmark:

Classifier | No. of records | Accuracy | Recall | Precision | F-measure | Runtime
MLR        | 10,000         | 0.97     | 0.58   | 0.58      | 0.58      | 4m 48s
SVM        | 1,000          | 0.95     | 0.27   | 0.27      | 0.27      | 20s
GNB        | 10,000         | 0.96     | 0.42   | 0.42      | 0.42      | 2m 2s
KNN        | 1,000          | 0.94     | 0.20   | 0.20      | 0.20      | 25s
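As an illustration of how the per-fold results can be combined and the four scores computed, the sketch below sums the multi-class confusion matrices over the folds and micro-averages the per-class counts. The micro-averaging choice is an assumption; the report only states that the scores were summed over the folds.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

n_classes = 30
total_cm = np.zeros((n_classes, n_classes), dtype=int)

# Inside the 5-fold loop, accumulate each fold's confusion matrix:
#     total_cm += confusion_matrix(y_validation, y_pred, labels=range(n_classes))

# Per-class counts derived from the summed multi-class confusion matrix.
TP = np.diag(total_cm)
FP = total_cm.sum(axis=0) - TP           # column totals minus the diagonal
FN = total_cm.sum(axis=1) - TP           # row totals minus the diagonal
TN = total_cm.sum() - (TP + FP + FN)

# Micro-averaged scores, using the formulas from Section 3.3.
tp, tn, fp, fn = TP.sum(), TN.sum(), FP.sum(), FN.sum()
accuracy  = (tp + tn) / (tp + tn + fp + fn)
recall    = tp / (tp + fn)
precision = tp / (tp + fp)
f_measure = 2 * tp / (2 * tp + fp + fn)
```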
3.4. Analysis

MLR shows the best performance of the classifiers compared, with an accuracy of 97% and recall, precision and F-measure scores of 58%. The performance depends strongly on several factors. The number of folds chosen for this study is 5, so the training data are split into 5 parts, and each fold contains two sets of data, a training set and a validation set. Based on our observations, the accuracy score increases when a lower number of folds is selected.

Another factor that affects the performance of MLR is the set of tunable parameters chosen during the training process. The number of iterations determines how many times the model iterates over each fold, and the regularization strength lambda also affects the scores. Some optimizations were applied to the model, including regularization to avoid overfitting. The learning rate is also important: if it is set too high, the model converges quickly but may fail to find the optimal solution; if it is set too low, convergence takes much longer. Another way to optimize the model is to adjust the weights, which is done through the gradient descent method; the objective of training is to minimize the loss function so that the predictions become more accurate.

Due to runtime, memory and CPU limitations, the training process for SVM and KNN used only 1,000 records. Even with an external library such as scikit-learn, runtime performance was very poor on the full dataset, so we decided to use the smaller subset of 1,000 records.

GNB also performs well, with an accuracy of 96% and recall, precision and F-measure scores of 42%. GNB is a simple classifier that predicts the class based on probability, so it runs faster and returns higher performance scores than KNN and SVM.

KNN comes last of the four classifiers on these scores. It runs slowly on large datasets; its accuracy is around 94%, while its recall, precision and F-measure scores are around 20%, the lowest in this study. The number of neighbors k selected for the training process is 5. There is no specific rule for choosing k, as it depends entirely on the training data, although different performance is observed for high and low values of k. The drawback of KNN is the complexity of searching for the nearest neighbors of each data point in a large dataset, which places greater storage, CPU and memory requirements on the hardware.

SVM ranks third on these scores, well below the benchmark, with an accuracy of 95% and recall, precision and F-measure scores of 27%. SVM is a kernel-based classifier, and different kernels produce different scores. Its relatively low scores are probably because there is little relationship between the features and the class in the data, or because the model suffers from a high-bias or high-variance problem.

4. Discussion

Based on the experiments and results obtained in this study, MLR gives better results in terms of accuracy than the other classifiers. MLR is the most preferable, as it works well with multi-class variables, and there are several ways to optimize the classifier, including regularization to avoid overfitting and gradient descent to adjust the weights. KNN is not suitable for large datasets, as it has to check the class of each point's nearest neighbors, which takes a lot of time and requires greater hardware resources; however, the algorithm works well with smaller datasets. Gaussian Naïve Bayes is a good classifier for testing large datasets, as it calculates the probability of each variable given the class labels, and it works well with data that have continuous attributes. Another important classifier in this study is SVM, which maximizes the margin between classes: the greater the margin, the better the result this classifier can yield. It is a kernel-based algorithm that maps the input data into a higher-dimensional space, and different kernels give different accuracy scores.

There are several other factors that affect the performance of each classifier. These include the dimensionality of the data, the number of folds used to split the training data, and the parameters chosen: the number of iterations, learning rate and weights for the MLR classifier, the number of neighbors for the KNN classifier, and the type of kernel for the SVM classifier.

5. Conclusion and Future Work

In conclusion, Multinomial Logistic Regression stands out against the other classifiers in terms of accuracy and is more consistent than the other classifiers that were trained and tested. This classifier is a good choice when the dependent variable has multiple categories and those categories are not in an ordered form; however, the performance scores depend strongly on the parameters chosen to train the model and on the optimization methods applied. The KNN results are not very accurate, because the classifier tends to make mistakes when the training data are not linearly separable. Even though KNN is popular and simple, it has limitations such as time complexity, since it takes a lot of time to produce results and requires a lot of memory, being dependent on every data point in the training data. The Naïve Bayes classifier gives good results for datasets with independent features; as a probabilistic model, it works well for classifying data into multiple categories. SVM gives fairly low scores, probably because there is little connection between the features and the classes in the data.

The performance of the different classifiers was compared in this study based on accuracy, recall, precision and F-measure scores. For future work, the same methods can be applied to different datasets, such as physiological, pollution, cancer-diagnosis and robotics datasets. Other techniques, such as PCA (for dimensionality reduction) and SVM with different kernels, can also be used to compare performance across datasets from various fields.

Appendix

Instructions for running the code:
1. Open any Jupyter notebook.
2. Go to File -> Open. Drag and drop the "comp5318_Ass1.ipynb", "training_data.csv", "training_labels.csv" and "test_data.csv" files onto the home interface and click Upload.
3. To run a cell, press Ctrl-Enter or hit the Play button at the top.