CSCI 4144, Winter 2017
Assignment 5: Naive Bayes Classifier Implementation
(Issue: Mar 15, Due: Mar 27)

- TA: Virlla Devi Soothar ([email protected])
- Ass5 Tutorial: Mar 16, 6:00-7:30PM, CS 127
- Help Hours: Mar 21, 1:00-2:30PM, CS 233; Mar 23, 6:00-7:30PM, CS 127

1. Objectives:
   1) To gain an in-depth understanding of Bayes-based classification by simplifying the Naive Bayes text classification algorithm for relational datasets.
   2) To learn the main technical issues of the implementation.
   3) You may also gain teamwork experience (working in a group of 2 students is allowed).

2. Programming language, computer system, etc.:
   The implementation must use a production programming language, such as one from the Java or C family (not a scripting language such as R), for which a compiler is available on bluenose.

3. Data sets and interface design requirements:
   1) Your program should be able to handle data files with the following format: the first line contains the column headings (i.e., attributes), and every following line contains the values that represent a tuple.
   2) It is common practice to divide a given dataset into two subsets for training and testing the algorithm, for example using two thirds as the training set and the rest as the test set.
   3) In developing Ass5, you may use data2 as the given data set. Since the original data set contains some noise, you need to clean the data first and state in the README what the noise is and how it was handled.
   4) The interface should allow the user to choose a training data file, a test data file, and a classification target. The program then writes the classification results to an external file called 'Result.txt'. The format of the result file should be the same as that of the test file, but with an added label for the target attribute on each tuple, and with an additional row at the end of the file reporting the classification accuracy: same/total.
   Here, "same" is the number of tuples whose added label matches the label in the test file, and "total" is the size of the test data.

4. Submit your Ass5 electronically:
   1) Create a directory assign5 in your bluenose account. This directory should include the developed source code, a Makefile, and a README file. The README file should provide instructions on how to compile the code and run the program. It should also provide a brief description of the overall code architecture (the functions, their call relationships, etc.).
   2) Submit the assignment directory from your home directory with the command line: submit assign5.
   3) Do not submit any data.

5. Evaluation:
   - Your assignment will be evaluated on the overall quality of the work, including the user interface, functionality, modularity and readability of the program, and the clarity of the README file.

*Bonus: +5 by doing the following:
   1) Choose a proper dataset from one of the two data repositories:
      a. http://sci2s.ugr.es/keel/category.php?cat=clas or http://www.ics.uci.edu/~mlearn/MLRepository.html
      b. Divide the selected dataset into training (70%) and testing (30%) subsets.
   2) Run three classifiers on the same datasets, including Naïve Bayes (Ass5), ID3 (Ass4 Bonus, or from WEKA), and C4.5 (from WEKA). Provide
      a) three classification result tables, and
      b) a performance comparison table.
      For more details, please attend Ass5-Tutorial.
   3) Comment on your observations regarding the possible strengths and limitations of each method.

Plagiarism and Intellectual Honesty: (http://plagiarism.dal.ca)
Dalhousie University defines "plagiarism as the presentation of the work of another author in such a way as to give one's reader reason to think it to be one's own." Plagiarism is considered a serious academic offence which may lead to loss of credit, suspension or expulsion from the University, or even the revocation of a degree.

Good luck and have fun!
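
Illustration of the result-file requirement (Section 3, item 4): the following is only a minimal Java sketch of one way the rows of 'Result.txt' and the final same/total row could be assembled. It is not a required design. The class and method names, the tab separator, the "Classification" column heading, the "Accuracy: " prefix, and the sample attribute values are all assumptions made for this example; the handout does not specify them. The predicted labels are taken as given, so the classifier itself is not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ResultFormatSketch {

    // Builds the lines of a result file: the rows of the test file, each with
    // the predicted label appended, followed by one accuracy row "same/total".
    // Assumptions (not from the handout): tab-separated output, an added
    // column headed "Classification", and an "Accuracy: " prefix on the last row.
    static List<String> buildResultLines(String[] header, List<String[]> testRows,
                                         List<String> predicted, int targetIndex) {
        List<String> lines = new ArrayList<>();
        lines.add(String.join("\t", header) + "\tClassification");
        int same = 0;
        for (int i = 0; i < testRows.size(); i++) {
            String[] row = testRows.get(i);
            String label = predicted.get(i);
            if (row[targetIndex].equals(label)) {
                same++; // predicted label agrees with the label in the test file
            }
            lines.add(String.join("\t", row) + "\t" + label);
        }
        // "total" is the number of tuples in the test data.
        lines.add("Accuracy: " + same + "/" + testRows.size());
        return lines;
    }

    public static void main(String[] args) {
        // Hypothetical test data; the third column is the classification target.
        String[] header = {"outlook", "windy", "play"};
        List<String[]> rows = Arrays.asList(
            new String[] {"sunny", "false", "no"},
            new String[] {"rainy", "true", "no"},
            new String[] {"overcast", "false", "yes"});
        List<String> predicted = Arrays.asList("no", "yes", "yes");
        for (String line : buildResultLines(header, rows, predicted, 2)) {
            System.out.println(line);
        }
    }
}
```

In this sample, two of the three predicted labels match the labels in the test rows, so the last line printed is "Accuracy: 2/3".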