Assignment title: Information


We are living in a fascinating time where almost everything is done by computers or with the help of a computer. As a result, hundreds of thousands of petabytes of information are generated every day. This data, which at a glance may seem absolutely useless, can still surprise and stun us; it just needs to be processed in the right way. The purpose of this paper is to consider and analyze some commonly used classification algorithms for mining hidden knowledge and, as a result, to offer a new algorithm with some improvements.

1. An Improved k-Nearest Neighbor Classification Using Genetic Algorithm

Source of paper: http://www.ijcsi.org/papers/7-4-2-18-21.pdf

The paper describes one of the most popular algorithms used for pattern recognition, the k-Nearest Neighbours algorithm (kNN). It presents a detailed explanation of the traditional kNN algorithm and its applications, introduces an improved Genetic kNN algorithm (GKNN), obtained by combining kNN with a Genetic Algorithm, which overcomes the limitations of traditional kNN, and finally shows a performance comparison with the traditional kNN, CART and SVM classifiers.

The traditional kNN algorithm can be summarised as: (i) specify a positive integer k along with a new sample; (ii) select the k entries from the training set which are closest to the new sample; (iii) find the most common classification among these entries; and (iv) assign that classification to the new sample.

According to the paper (N. Suguna et al., 2010), the traditional kNN algorithm has the following three limitations: (i) calculation complexity, due to the usage of all the training samples for classification; (ii) the performance depends solely on the training set; and (iii) there is no weight difference between samples. To overcome these limitations and improve the performance of the traditional kNN algorithm, a Genetic Algorithm (GA) is applied.
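The four steps of the traditional algorithm can be sketched in Python as follows (the function and variable names, and the choice of Euclidean distance, are my own illustrative assumptions):

```python
from collections import Counter
import math

def knn_classify(train, k, new_sample):
    """Classify new_sample by majority vote among its k nearest
    training entries. train is a list of (feature_vector, label)
    pairs; Euclidean distance is assumed for 'closest'."""
    # (ii) select the k entries from the training set closest to the new sample
    neighbors = sorted(
        train,
        key=lambda pair: math.dist(pair[0], new_sample)
    )[:k]
    # (iii) find the most common classification among these entries
    votes = Counter(label for _, label in neighbors)
    # (iv) assign that classification to the new sample
    return votes.most_common(1)[0][0]

train = [((1.0, 1.0), "A"), ((1.2, 0.9), "A"),
         ((5.0, 5.1), "B"), ((4.8, 5.3), "B")]
print(knn_classify(train, k=3, new_sample=(1.1, 1.0)))  # → A
```

Note that the whole training set is scanned for every new sample, which is exactly the calculation-complexity limitation the paper points out.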
Before classification, a reduced feature set is first constructed from the samples using Rough Set based Bee Colony Optimization (BeeRSAR). According to Suguna et al. (2010), the basic idea of the proposed improvement is that, instead of calculating the similarities between all the training and test samples and then choosing the k neighbours for classification, the GA chooses only k neighbours at each iteration; the similarities are calculated, the test samples are classified with these neighbours, and the accuracy is computed. As a result, N. Suguna et al. (2010) state that their experiments show the proposed method not only reduces the complexity of kNN but also improves the classification accuracy.

The obvious advantages of the kNN algorithm are:
• Simple to understand and powerful; no need to tune complex parameters to build a model
• High accuracy
• No assumptions about the data
• Can do well in practice given enough representative data

In spite of overcoming some limitations of the kNN algorithm, several remain:
• Where one category occurs much more often than another, classifying an input will be biased towards the more frequent category
• Computationally expensive; requires a lot of memory
• Finding the nearest neighbours is a large search problem
• All training data must be stored

To sum up, k-Nearest Neighbours (kNN) is a very simple, popular, efficient and effective algorithm for pattern recognition. The improvements presented by N. Suguna et al. (2010) made it more accurate and faster. At the same time, some drawbacks, such as sensitivity to a significant predominance of one category over another and the requirement for a lot of memory, still need to be solved.
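As a closing illustration, the GA-driven neighbour selection described by Suguna et al. (2010) can be sketched as follows. This is a simplified sketch, not the paper's exact method: the chromosome encoding (k training-sample indices), the inverse-distance vote, the crossover/mutation scheme and all names are my own assumptions.

```python
import math
import random
from collections import Counter

def gknn_select(train, test, k, pop_size=20, generations=30, mut_rate=0.2):
    """Hypothetical sketch of GA-based neighbour selection: each
    chromosome holds k training-sample indices, and its fitness is the
    accuracy obtained when only those k entries vote (weighted here by
    inverse distance, an illustrative choice) on each test sample."""
    rng = random.Random(42)  # fixed seed for repeatability
    n = len(train)

    def fitness(chrom):
        correct = 0
        for features, label in test:
            votes = Counter()
            for i in chrom:
                f, lbl = train[i]
                votes[lbl] += 1.0 / (math.dist(features, f) + 1e-9)
            if votes.most_common(1)[0][0] == label:
                correct += 1
        return correct / len(test)

    # initial population: random k-subsets of training indices
    pop = [rng.sample(range(n), k) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]          # selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            # crossover followed by de-duplication, keeping k indices
            child = list(dict.fromkeys(a[: k // 2] + b))[:k]
            if rng.random() < mut_rate:           # mutation
                repl = rng.randrange(n)
                if repl not in child:
                    child[rng.randrange(k)] = repl
            children.append(child)
        pop = survivors + children
    best = max(pop, key=fitness)
    return best, fitness(best)
```

The point of the sketch is the inversion the paper describes: similarities are computed only against the k candidates encoded in each chromosome, not against the whole training set, and the GA iterates towards the candidate set with the best classification accuracy.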