CSCI 4144, Winter 2017
Assignment 5: Naive Bayes Classifier Implementation
(Issue: Mar 15, Due: Mar 27)
- TA: Virlla Devi Soothar ([email protected])
- Ass5 Tutorial: Mar 16, 6:00-7:30PM, CS 127
- Help Hours: Mar 21, 1:00-2:30PM, CS 233; Mar 23, 6:00-7:30PM, CS 127
1. Objectives:
1) To gain an in-depth understanding of Bayes-based classification by simplifying
the Naive Bayes text classification algorithm for relational datasets.
2) To learn the main technical issues involved in the implementation.
3) You may also gain teamwork experience (working in a group of 2 students is allowed).
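As a rough illustration of objective 1), the following Java sketch shows one way a Naive Bayes classifier for categorical relational data could be organized. It is not the required design: the class name NaiveBayesSketch, the methods train and classify, and the use of add-one (Laplace) smoothing are all assumptions made for this example, and it assumes attribute values do not contain the '|' character.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Minimal categorical Naive Bayes sketch (hypothetical names, not a reference solution). */
final class NaiveBayesSketch {
    // Number of training tuples per class label.
    private final Map<String, Integer> classCounts = new HashMap<>();
    // Key "attrIndex|value|label" -> number of training tuples with that combination.
    private final Map<String, Integer> valueCounts = new HashMap<>();
    // Distinct values seen per attribute, used by the smoothing term.
    private final Map<Integer, Set<String>> attrValues = new HashMap<>();
    private int total = 0;

    /** Counts class and (attribute value, class) frequencies; target is the label column index. */
    void train(List<String[]> rows, int target) {
        for (String[] row : rows) {
            String label = row[target];
            classCounts.merge(label, 1, Integer::sum);
            total++;
            for (int i = 0; i < row.length; i++) {
                if (i == target) continue;
                valueCounts.merge(i + "|" + row[i] + "|" + label, 1, Integer::sum);
                attrValues.computeIfAbsent(i, k -> new HashSet<>()).add(row[i]);
            }
        }
    }

    /** Returns the label c that maximizes log P(c) + sum over attributes of log P(x_i | c). */
    String classify(String[] row, int target) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Integer> e : classCounts.entrySet()) {
            String label = e.getKey();
            int nc = e.getValue();
            double score = Math.log((double) nc / total);
            for (int i = 0; i < row.length; i++) {
                if (i == target) continue;
                int count = valueCounts.getOrDefault(i + "|" + row[i] + "|" + label, 0);
                int distinct = attrValues.getOrDefault(i, Collections.emptySet()).size();
                // Add-one smoothing (an assumption of this sketch) avoids log(0)
                // when a value never occurs together with a class in the training set.
                score += Math.log((count + 1.0) / (nc + distinct));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }
}
```

Log-probabilities are summed rather than multiplying raw probabilities, so that products of many small values do not underflow. The value in the target column of the tuple passed to classify is ignored.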
2. Programming language, computer system, etc.:
The implementation must use a production programming language, such as one from the
Java or C family (not a scripting language such as R), for which a compiler is
available on bluenose.
3. Data sets and interface design requirements:
1) Your program should be able to handle data files with the following format: the
first line contains column headings (i.e., attributes); and every following line
contains the values that represent a tuple.
2) It is common practice to divide a given dataset into two subsets for training
and testing the algorithm, for example using two thirds as the training set and
the rest as the test set.
3) In developing Ass5, you may use data2 as the given data set. Since the original
data set contains some noise, you need to clean the data first, and state in the
README what the noise was and how you handled it.
4) The interface should allow the user to choose a training data file, a test data file,
and a classification target. The program then writes the classification results into
an external file called 'Result.txt'. The format of the result file should be the same
as the test file, but with an added label on each tuple for the target attribute, and
with an additional row at the end of the file reporting the classification
accuracy as same/total, where "same" is the number of tuples whose assigned label
matches the label in the test file, and "total" is the size of the test data.
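The data handling described in items 1), 2) and 4) could be sketched in Java as follows. The handout does not specify a field delimiter, a name for the added label column, or the exact wording of the accuracy row, so the comma delimiter, the column name "classification", and the "Accuracy: " prefix below are assumptions, as are the class and method names.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Sketch of parsing, splitting, and result-file formatting (hypothetical names). */
final class ResultWriterSketch {
    /** Splits one line into trimmed fields; comma-separated input is an assumption. */
    static String[] parseLine(String line) {
        return line.trim().split("\\s*,\\s*");
    }

    /** Number of tuples to use for training when two thirds of n tuples are kept. */
    static int trainSize(int n) {
        return (2 * n) / 3;
    }

    /** Builds the result lines: header, each test tuple plus its assigned label, accuracy row. */
    static List<String> buildResult(String[] header, List<String[]> test,
                                    List<String> predicted, int target) {
        List<String> out = new ArrayList<>();
        out.add(String.join(",", header) + ",classification");
        int same = 0;
        for (int r = 0; r < test.size(); r++) {
            String[] row = test.get(r);
            // "same" counts tuples whose assigned label equals the label in the test file.
            if (row[target].equals(predicted.get(r))) {
                same++;
            }
            out.add(String.join(",", row) + "," + predicted.get(r));
        }
        out.add("Accuracy: " + same + "/" + test.size());
        return out;
    }

    /** Writes the lines to Result.txt in the current working directory. */
    static void write(List<String> lines) throws IOException {
        Files.write(Paths.get("Result.txt"), lines);
    }
}
```

Keeping buildResult separate from write makes the formatting easy to check without touching the file system.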
4. Submit your Ass5 electronically:
1) Create a directory assign5 in your bluenose account. This directory should
include the developed source code, Makefile, and README file. The README file
should provide instructions on how to compile the code and run the program. It
should also provide a brief description of the overall code architecture (the
functions and the call relationships, etc.).
2) Submit the assignment directory from your home directory by the command
line: submit assign5.
3) Do not submit any data.
5. Evaluation:
– Your assignment will be evaluated based upon the overall quality of the work
including user interface, functionality, modularity and readability of the program,
and the clarity of the README file.
*Bonus: +5 by doing the following:
1) Choose a proper dataset from one of the two data repositories:
a. http://sci2s.ugr.es/keel/category.php?cat=clas or
http://www.ics.uci.edu/~mlearn/MLRepository.html
b. Divide the selected dataset into training (70%) and testing (30%) subsets.
2) Run three classifiers on the same datasets, including Naïve Bayes (Ass5), ID3
(Ass4 Bonus, or from WEKA), and C4.5 (from WEKA). Provide a) three
classification result tables, and b) a performance comparison table. For more
details, please attend Ass5-Tutorial.
3) Comment on your observations regarding the possible strengths and limitations of
each method.
Plagiarism and Intellectual Honesty: (http://plagiarism.dal.ca) Dalhousie University
defines "plagiarism as the presentation of the work of another author in such a way as
to give one's reader reason to think it to be one's own." Plagiarism is considered a
serious academic offence which may lead to loss of credit, suspension or expulsion from
the University, or even the revocation of a degree.
Good luck and have fun!