Abstract:
Advances in DNA microarray technology have attracted the interest of many scholars, and the technology holds considerable promise for cancer data classification. However, DNA microarray data usually contain thousands of irrelevant and redundant genes, which need to be eliminated to improve classification accuracy. To select the relevant genes from cancer data, this study proposes a novel feature selection technique based on a filter-wrapper approach using machine learning methods. Wrappers search over candidate feature subsets, use a learning algorithm to evaluate how useful each subset is, and return the most informative subset, which increases the accuracy of the classifiers, whereas filter methods select features from the data without any learning involved. However, compared to filters, the computational demand of wrappers is high when they are applied to cancer data. Hence, in the proposed work, the wrapper is applied after the filter in order to reduce the computational cost of the wrapper. The datasets were first pre-processed using a filter called Gain Ratio Filter with the Ranker search method, and the resulting gene subsets were then evaluated using a wrapper called Wrapper Subset Evaluator with the best-first forward selection search strategy in the WEKA machine learning workbench. The gene subset selected by the wrapper was then used to classify the cancer microarray data using five machine learning classifiers, namely Decision Tree (J48), Naïve Bayes, Sequential Minimal Optimization (SMO), Deep Learning and Bayes Net. The proposed approach was tested on five cancer microarray datasets. Accuracies of 89.69%, 95.16% and 97.04% were obtained for the Breast, Colon and Lung cancer datasets respectively, while the Leukaemia and Ovarian cancer datasets scored 100%.
According to the findings of this study, the proposed method can accurately classify the datasets using only a few informative genes, which makes it more efficient than existing classification models.
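
The filter-then-wrapper idea described above can be illustrated with a minimal Python sketch. It is not the WEKA pipeline used in the study, and it rests on several assumptions made purely for illustration: each gene is binarized at its median before the gain ratio is computed, the wrapper uses plain greedy forward selection (no best-first backtracking), and subsets are scored by the leave-one-out accuracy of a nearest-centroid classifier. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter
from statistics import median


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(values, labels):
    """Gain ratio of one gene; assumes binarization at the median."""
    cut = median(values)
    bins = [v > cut for v in values]
    split_info = entropy(bins)
    if split_info == 0:  # every sample falls in one bin: uninformative gene
        return 0.0
    n = len(labels)
    conditional = 0.0
    for b in (False, True):
        subset = [lab for bb, lab in zip(bins, labels) if bb == b]
        if subset:
            conditional += len(subset) / n * entropy(subset)
    return (entropy(labels) - conditional) / split_info


def filter_rank(X, y, top_k):
    """Filter stage: indices of the top_k genes ranked by gain ratio."""
    n_genes = len(X[0])
    scores = [gain_ratio([row[j] for row in X], y) for j in range(n_genes)]
    ranked = sorted(range(n_genes), key=lambda j: scores[j], reverse=True)
    return ranked[:top_k]


def loo_accuracy(X, y, genes):
    """Leave-one-out accuracy of a nearest-centroid classifier on `genes`."""
    correct = 0
    for i in range(len(X)):
        centroids = {}
        for label in sorted(set(y)):
            rows = [X[k] for k in range(len(X)) if k != i and y[k] == label]
            if rows:
                centroids[label] = [sum(r[g] for r in rows) / len(rows) for g in genes]
        point = [X[i][g] for g in genes]
        predicted = min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centroids[c])),
        )
        correct += predicted == y[i]
    return correct / len(X)


def wrapper_forward(X, y, candidates):
    """Wrapper stage: greedily add the gene that most improves LOO accuracy."""
    selected, best = [], 0.0
    while True:
        scored = [(loo_accuracy(X, y, selected + [g]), g)
                  for g in candidates if g not in selected]
        if not scored:
            break
        acc, gene = max(scored, key=lambda t: t[0])
        if acc <= best:  # stop when no remaining gene improves the subset
            break
        selected.append(gene)
        best = acc
    return selected, best


# Toy data (made up): 6 samples x 4 genes; gene 0 separates the classes,
# gene 2 is constant, genes 1 and 3 are mostly noise.
X = [
    [0.1, 5.0, 1.0, 0.2],
    [0.2, 1.0, 1.0, 0.9],
    [0.3, 4.0, 1.0, 0.1],
    [0.9, 5.1, 1.0, 0.8],
    [1.0, 1.1, 1.0, 0.3],
    [1.1, 4.1, 1.0, 0.7],
]
y = ["normal", "normal", "normal", "tumor", "tumor", "tumor"]

candidates = filter_rank(X, y, top_k=3)                  # cheap filter first
selected, accuracy = wrapper_forward(X, y, candidates)   # costly wrapper second
print(candidates, selected, accuracy)
```

Because the wrapper only ever sees the `top_k` genes that survive the filter, the number of classifier evaluations depends on `top_k` rather than on the thousands of genes in the raw microarray, which is the motivation for ordering the two stages this way.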