Assignment title: Information
Abstract:
The decision tree is one of the most important classification techniques in data mining. Human-understandable descriptions can be produced from the relationships in a given dataset and used for classification and prediction tasks [1]. Medical diagnosis and customer marketing strategies (CMS) are among the best areas in which to apply these techniques. This paper discusses the elegant decision tree, which is an improved form of SLIQ. The SLIQ algorithm requires a large number of computations, since the gini index has to be calculated for each value of every attribute to find the best split. The proposed method (the elegant decision tree) is more efficient, as the gini index is calculated only at certain intervals instead of at every value of an attribute.
1. Introduction:
Classification is a data mining technique used for prediction. There are different classification methods, such as decision trees and neural networks. The decision tree is among the most popular algorithms in data mining and is an effective method for classification. CART (Breiman et al., 1984), ID3 (Quinlan, 1986), C4.5 (Quinlan, 1993), SPRINT (Shafer et al., 1996) and SLIQ (Mehta et al., 1996) are well-known decision tree algorithms.
In 1986, Quinlan introduced the ID3 algorithm [6]. In 1993, Quinlan suggested a number of enhancements to ID3 by introducing C4.5 [7]. The main idea in ID3 is to choose a suitable attribute for each node of the decision tree. It introduced the Gain formula, which is applied to every attribute, and the attribute with the highest gain value is chosen as the splitting attribute. The drawbacks of this method are the bias of the Gain measure and the failure to handle noisy data. The Gain value of an attribute tends to be highest when the attribute has many distinct values, because the measure rewards attributes whose values vary more. For example, for an attribute such as ID_No, where every value differs from all the others, the Gain will always be higher than the Gain of any other attribute. The Gain Ratio formula was introduced to overcome this problem with the Gain formula. However, it was still difficult to handle noisy data using ID3.
SLIQ overcomes the drawbacks of ID3: the Gain ratio is replaced by the Gini index. In SLIQ, the gini index is calculated for each value of every attribute, and the gini split is then computed to find the splitting value of that attribute. The elegant decision tree is an improved version of the SLIQ algorithm. It is more efficient, and its classification accuracy is also reported to be much higher than that of the neural network classification technique [1].
2. Data Preparation:
The dataset used for testing this algorithm is Breast Cancer Wisconsin, downloaded from the UCI Machine Learning Repository website. This breast cancer database was obtained from the University of Wisconsin Hospitals, Madison, from Dr. William H. Wolberg [2].
The information about this dataset is given below:
1. Total number of instances: 699
2. Missing attribute values: 16
3. Remaining instances: 683
4. Total number of attributes: 11 (10 plus the class attribute)
5. Attribute information: the class attribute has been moved to the last column, and the data in the file follows the order shown in the attribute table below.
6. Class distribution:
Benign: 458 (65.5%)
Malignant: 241 (34.5%)
7. Attribute details:
S.No  Attribute                      Domain
----  -----------------------------  -----------------------------
 1.   Sample code number             id number
 2.   Clump Thickness                1 - 10
 3.   Uniformity of Cell Size        1 - 10
 4.   Uniformity of Cell Shape       1 - 10
 5.   Marginal Adhesion              1 - 10
 6.   Single Epithelial Cell Size    1 - 10
 7.   Bare Nuclei                    1 - 10
 8.   Bland Chromatin                1 - 10
 9.   Normal Nucleoli                1 - 10
10.   Class                          2 for benign, 4 for malignant
For data preprocessing, all the data was checked and 16 missing attribute values were found, all of them in attribute #7 (Bare Nuclei), where a missing value is denoted by '?'. The 16 affected instances were removed, leaving 683 complete instances.
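As an illustration of this preprocessing step, the following minimal Java sketch reads the UCI file (assumed here to be named breast-cancer-wisconsin.data, comma-separated, with '?' marking a missing value) and drops the incomplete records. The class and method names are illustrative and are not taken from ElegantDT.java.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class LoadWisconsin {
    // Reads the data file and keeps only complete records.
    public static List<int[]> load(String filename) throws IOException {
        List<int[]> records = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new FileReader(filename))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.isEmpty() || line.contains("?"))
                    continue; // skip blank lines and the 16 incomplete instances
                String[] fields = line.split(",");
                int[] record = new int[fields.length];
                for (int i = 0; i < fields.length; i++)
                    record[i] = Integer.parseInt(fields[i].trim());
                records.add(record);
            }
        }
        return records; // 683 complete records remain
    }
}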
3. System Design:
In the elegant decision tree, the creation of the tree is similar to that in SLIQ, and the formula for calculating the gini index is also the same. The difference lies in the time required to calculate the gini index for an attribute: the number of gini computations becomes smaller at each split. A class histogram is used for the successive candidate values of the attributes, and the histogram that gives the least gini index yields the splitting point P for the node under consideration.
Example of finding the gini index from a sample class histogram with classes X and Y:

                  X    Y
L (value < P)    x1   x2
R (value >= P)   y1   y2

P denotes the splitting value for an attribute;
x1 denotes the number of records whose attribute value is less than P and which belong to class X;
x2 denotes the number of records whose attribute value is less than P and which belong to class Y;
y1 denotes the number of records whose attribute value is greater than or equal to P and which belong to class X;
y2 denotes the number of records whose attribute value is greater than or equal to P and which belong to class Y.
Gini Index = (x1+x2)/n * [1 - {x1/(x1+x2)}^2 - {x2/(x1+x2)}^2]
           + (y1+y2)/n * [1 - {y1/(y1+y2)}^2 - {y2/(y1+y2)}^2]
where n = total number of records = x1 + x2 + y1 + y2
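As a worked example with illustrative counts, take n = 100 records with x1 = 40, x2 = 10, y1 = 5 and y2 = 45:
Gini Index = 50/100 * [1 - (40/50)^2 - (10/50)^2] + 50/100 * [1 - (5/50)^2 - (45/50)^2]
           = 0.5 * 0.32 + 0.5 * 0.18
           = 0.25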
By using this formula, the gini index is computed for each histogram of each attribute. Once the gini index of every histogram has been calculated, the attribute with the least gini index is chosen to split on, and the split value equals the splitting point P of that histogram. The stopping criterion is a gini index of zero at a node, since this implies that all data records at that node have been classified completely.
In the elegant algorithm, it is proposed that instead of calculating the gini index at each value of an attribute, the calculation is performed only at certain fixed intervals within the range of values of the attribute. For instance, the computation may be performed at intervals of 10% of the number of records at each node. The total number of computations is then significantly smaller than in SLIQ.
In SLIQ:
total number of computations per node = n × m,
where n is the number of attributes and m is the number of records at that node.
In the elegant algorithm:
total computations at a node = n × g,
where g is the number of groups formed (g = m/k, where k is the interval/group size) [1].
The number of gini computations is thus reduced at every node.
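As an illustration with the dataset described above: at the root node there are m = 683 records and n = 9 attributes to score (the sample code number and the class attribute are excluded), so SLIQ performs 9 × 683 = 6147 gini computations, whereas the elegant algorithm with an interval of 10% of the records (g = 10 groups) performs only 9 × 10 = 90.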
4. Program Architecture:
The program is written according to the algorithm given in the previous section. The main class is ElegantDT. The functions used in the program are listed below:
ElegantDT()                    Constructor for the class ElegantDT.
retreiveData(String filename)  Retrieves the data from the data file and stores it in a vector array.
sort()                         Sorts the values of each attribute.
createTree(Vector[] data)      Creates the decision tree.
construct(int, int, double)    Constructs a new node with (attr, splitval, ginival).
calc(int[], int[])             Calculates the gini index for a candidate split.
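Assuming that retreiveData returns the vector array it fills (a detail not specified above; the actual ElegantDT.java may differ), a typical call sequence would look roughly as follows:

import java.util.Vector;

public static void main(String[] args) throws Exception {
    ElegantDT dt = new ElegantDT();
    // Hypothetical wiring of the functions listed above.
    Vector[] data = dt.retreiveData("breast-cancer-wisconsin.data");
    dt.sort();           // sort the values of each attribute
    dt.createTree(data); // build and print the decision tree
}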
4.1 Algorithm:
Input: training set samples consisting of attributes A1–An and records R1–Rm. num is the number of groups into which each attribute has to be divided for the computation of the gini index.
attr stores the splitting attribute.
ginival stores the final gini index.
splitval stores the final splitting value of attribute attr.
The number of classes is assumed to be c.
// initializations
ginival = infinity // a high value
splitval = 0
attr = 0
Copy records R1–Rm from samples to data.
Maketree(data)
{
1. construct attribute lists alist1–alistn such that alisti contains all records sorted on Ai
2. // make histograms for each alisti
3. for i = 1 to n
4.     val = (alist[i][m] - alist[i][0]) / num
5.     count = 0
6.     while count < num
7.         split = alist[i][0] + count * val
8.         scan alisti to obtain the class histogram l[c] of records with Ai < split
9.         scan alisti to obtain the class histogram r[c] of records with Ai >= split
10.        // calculate gini index
11.        gini = calc(l[c], r[c])
12.        if ginival > gini
13.            ginival = gini
14.            splitval = split
15.            attr = i
16.        count++
17. construct a new node with (attr, splitval, ginival)
18. if ginival = 0
19.     return // classification complete
20. put all records with value of Aattr < splitval into data1
21. put all records with value of Aattr >= splitval into data2
22. maketree(data1)
23. maketree(data2)
}
//function to compute gini index
calc(l[c], r[c])
{
let k be the total number of records in l[c] and r[c].
// let ans be the return value
1. ans = 0, lval = 1, rval = 1
2. lsum = sum of the number of records in l[c]
3. rsum = sum of the number of records in r[c]
4. for i = 1 to c
5.     lval -= (l[i]/lsum)^2
6.     rval -= (r[i]/rsum)^2
7. lval *= lsum/k
8. rval *= rsum/k
9. ans = lval + rval
10. return(ans)
}
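The pseudocode above is language-neutral. As a concrete illustration, here is a minimal Java sketch of the grouped split search for one attribute (steps 3-16 of Maketree) together with the gini computation of calc. It assumes two arrays, values[] (the sorted values of one attribute, step 1) and classOf[] (the class index 0..c-1 of each record); the names GiniSketch, findBestSplit, values and classOf are illustrative and do not come from ElegantDT.java, and the guard for an empty histogram side is an addition not present in the pseudocode.

public class GiniSketch {
    static double ginival = Double.POSITIVE_INFINITY; // best gini so far
    static double splitval = 0;                       // best split point

    // values: sorted values of one attribute; classOf[j]: class of record j
    // (0..c-1); num: number of groups, as in the pseudocode above.
    static void findBestSplit(double[] values, int[] classOf, int c, int num) {
        int m = values.length;
        double val = (values[m - 1] - values[0]) / num;   // group width (step 4)
        for (int count = 0; count < num; count++) {       // steps 5-6, 16
            double split = values[0] + count * val;       // candidate split (step 7)
            int[] l = new int[c], r = new int[c];         // class histograms
            for (int j = 0; j < m; j++)                   // steps 8-9
                if (values[j] < split) l[classOf[j]]++;
                else r[classOf[j]]++;
            double gini = calc(l, r);                     // step 11
            if (ginival > gini) {                         // steps 12-14
                ginival = gini;
                splitval = split;
            }
        }
    }

    // Gini index of a candidate split; l[i] and r[i] count the class-i
    // records on the left and right sides (steps 1-10 of calc above).
    static double calc(int[] l, int[] r) {
        double lsum = 0, rsum = 0;
        for (int v : l) lsum += v;                        // step 2
        for (int v : r) rsum += v;                        // step 3
        double k = lsum + rsum;                           // total records
        if (lsum == 0 || rsum == 0) return 1.0;           // empty side: worst score (added guard)
        double lval = 1, rval = 1;
        for (int i = 0; i < l.length; i++) {              // steps 4-6
            lval -= (l[i] / lsum) * (l[i] / lsum);
            rval -= (r[i] / rsum) * (r[i] / rsum);
        }
        return lval * (lsum / k) + rval * (rsum / k);     // steps 7-9
    }
}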
4.2 Building the Program:
In order to run the program, two steps have to be taken (assuming that you are in the directory where the program file and the dataset are located, i.e. proj):
1- Compile the program to produce the class file by writing the following line:
javac ElegantDT.java
This line generates the file ElegantDT.class.
2- Run the program by writing the following line:
java ElegantDT
Then proceed with the program instructions.
4.3 User Interface:
The program uses a command-line interface for both input and output.
5. System Evaluation:
- Program running sample:
After running the program, it will ask you to enter the data file number.
Then it will ask you to enter the number of groups. In the elegant algorithm, this is the number of groups into which the values of each attribute are divided. For example, if you enter 3, the program splits the values of each attribute into 3 groups and calculates only three gini values. That is the advantage of this algorithm: in SLIQ it was necessary to calculate the gini for each value of every attribute.
The program then shows the names of the attributes used in the dataset.
It then pauses and waits for the Enter key to be pressed, giving the user time to read the attributes (if he/she is interested).
After the Enter key is pressed, the program creates the decision tree and shows the output on the screen.
The output of the test run contained four values of class 2 and five values of class 4; the dataset uses two classes, 2 and 4.
The same dataset was also tested with ID3; ID3 was slower in processing and its classification accuracy was lower.
The classification accuracy obtained from ID3 was 70%.
The classification accuracy obtained from this program was 100%.
6. Conclusion:
It is observed that the elegant decision tree is more efficient than SLIQ. A notable feature of this algorithm is that it uses the same gini index formula as SLIQ, yet the number of gini computations is dramatically reduced. Using this algorithm, the classification accuracy is also much higher than that of SLIQ.
7. References
[1] B. Chandra, S. Mazumdar, V. Arena and N. Parimi, "Elegant Decision Tree Algorithm for Classification in Data Mining".
[2] O. L. Mangasarian and W. H. Wolberg, "Cancer diagnosis via linear programming", SIAM News, Volume 23, Number 5, September 1990, pp. 1 & 18.
[3] Hadi, A. S., "Identifying multiple outliers in multivariate data", Journal of the Royal Statistical Society, Series B, 54(3), 1992, pp. 761-771.
[4] Breiman L., Friedman J., Olshen R. and Stone C., Classification and Regression Trees, Wadsworth, 1984.
[5] Mehta M., Agrawal R. and Rissanen J., "SLIQ: A fast scalable classifier for data mining", Proceedings of the International Conference on Extending Database Technology (EDBT), France, March 1996.
[6] Quinlan J. R., "Induction of decision trees", Machine Learning, 1:81-106, 1986.
[7] Quinlan J. R., C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.
[8] Rocke D. M. and Woodruff D. L., "Identification of outliers in multivariate data", Journal of the American Statistical Association, Vol. 91, No. 435, Theory and Methods, September 1996.
[9] Rumelhart D. E., Hinton G. E. and Williams R. J., "Learning internal representations by error propagation", in D. E. Rumelhart and J. L. McClelland (eds.), Parallel Distributed Processing, MIT Press, Cambridge, 1986.
[10] Shafer J. C., Agrawal R. and Mehta M., "SPRINT: A scalable parallel classifier for data mining", Proceedings of the 22nd International Conference on Very Large Data Bases, Mumbai, India, September 1996.