BREAST CANCER DETECTION USING FEATURE EXTRACTION
DEPARTMENT OF COMPUTER SCIENCE
GOVT GIRLS POST GRADUATE COLLEGE NO 1 ABBOTTABAD
Department of Computer Science
BREAST CANCER DETECTION USING FEATURE EXTRACTION
This research study has been conducted and reported as partial fulfillment of the requirements of the BS degree in Computer Science awarded by GGPGC No 1 Abbottabad, Pakistan.
Friday, 13 November 2018
BREAST CANCER DETECTION USING FEATURE EXTRACTION
The thesis of Ms. Hajra Bibi and Ms. Areeba Sajjad is approved in its present form for completing the requirements for the degree of BS in Computer Science.
External Examiner 1: Dr. MUHAMMAD NAWAZ
Centre for Excellence in Information Technology
Institute of Management Sciences, Peshawar
External Examiner 2: Dr. MUAZAM ALI KHAN
Department of Computer Engineering
EME College, NUST Rawalpindi
Supervisor: MA’AM AISHA SIKANDAR
Department of Computer Science
GGPGC No 1, Abbottabad
Dated: January 29, 2016
DEPARTMENT OF COMPUTER SCIENCE
GOVT GIRLS POSTGRADUATE COLLEGE NO 1 ABBOTTABAD
APPROVAL SHEET OF THE MANUSCRIPT
BS THESIS SUBMITTED BY
Name Hajra Bibi, Areeba Sajjad
Father's Name Manzoor Hussain, Sajjad ur Rehman
Date of Birth 20-11-1995, 14-08-1996
Postal Address _________________________________________________________
Permanent Address _________________________________________________________
Telephone: – ________________________________
Email: – _________________________________
BS Thesis Title: – Breast Cancer Detection Using Feature Extraction
Language in which the Thesis has been written: -English
Dr. Arif Iqbal Umar _______________________
Dr. Imran Siddiqi ______________________
Prof. Dr. Habib Ahmad (TI) ________________________
Dean Faculty of Sciences
BREAST CANCER DETECTION USING FEATURE EXTRACTION
Submitted by
Hajra Bibi, Areeba Sajjad
BS students
Research Supervisor
Ma’am Aisha Sikandar
Lecturer
Department of Computer Science
GGPGC No 1, Abbottabad
DEPARTMENT OF COMPUTER SCIENCE
GOVT GIRLS POSTGRADUATE COLLEGE NO 1 ABBOTTABAD
IN THE NAME OF ALLAH, THE COMPASSIONATE, THE
AND HIS LAST PROPHET MUHAMMAD
This thesis is dedicated to our family for their encouragement and support to achieve our goals.
First of all, we are thankful to almighty Allah for giving us the courage and strength to do this research work. With Allah’s blessings we have achieved our goals and completed this work on time. We would like to express our deepest gratitude to our supervisor, Ma’am Aisha Sikandar, for her encouragement, endless support and guidance throughout our research. It has been an honor to work under her supervision. Her enthusiasm for research motivated us during the tough times in our work. Above all, we salute her patience and care at every stage of our research. Finally, we would like to thank our families for their patience, continuous support, love and encouragement towards the completion of this research work.
Breast cancer is the most common cancer among women. Research has shown that early detection and appropriate treatment of breast cancer significantly increase the chances of survival. It has also shown that early detection of small lesions improves prognosis and leads to a significant reduction in mortality. In this setting, mammography is the best diagnostic technique for screening. However, the interpretation of mammograms is not easy because of small differences in the densities of different tissues within the image, which is especially true for dense breasts. This paper is a survey of the automated early detection of breast cancer through the analysis of mammographic images. This review should give radiologists a better understanding of typical patterns and, if the cancer is detected at an early stage, support a better prognosis and a significant decrease in mortality.
Breast cancer is the most common cancer among women. It is one of the diseases that cause a high number of deaths every year, and it is the leading cause of death among women around the world. Research has shown that early detection and appropriate treatment of most breast cancers significantly increase the chances of survival, and that early detection of small lesions improves prognosis and leads to a large reduction in mortality. Mammography is the best diagnostic technique for screening. However, the interpretation of mammograms is not straightforward because of small differences in the densities of different tissues within the image, which is especially true for dense breasts. Various alternative approaches have been developed in data mining, using machine learning techniques, for diagnosing breast cancer. Classification and data mining methods are an effective way to classify data, especially in the medical field, where they are widely used in diagnosis and analysis to support decisions. We propose a machine learning approach for breast cancer diagnosis.
The cells in our bodies all have different jobs to do. Normal cells divide in an orderly way. They die when they are worn out or damaged, and new cells take their place. Cancer occurs when cells start to grow out of control. The cancer cells keep on growing and making new cells. They crowd out normal cells, which causes problems in the part of the body where the cancer started. Cancer has been characterized as a heterogeneous disease composed of many different subtypes. The early diagnosis and prognosis of a cancer type have become a requirement in cancer research, as they can facilitate the subsequent clinical management of patients. Breast cancer is one of the most widespread diseases among women worldwide. Correct and early diagnosis is an important step in rehabilitation and treatment [1].
In breast cancer, cancer cells form in the tissues of the breast [2]. The breast is made up of lobes and ducts; each breast has 15 to 20 sections called lobes. The most common type of breast cancer begins in the cells of the ducts. Cancers that start in the lobes or lobules are other types of breast cancer. A warm, red, and swollen breast is an indicator of breast cancer. Age and health history can affect the risk of developing breast cancer [3]. To detect the different stages of breast cancer, chest X-ray, CT scan, bone scan and PET scan are widely used. According to projections by the World Health Organization, about 1.2 million women are diagnosed with breast cancer every year. In 2006, an estimated 214,460 new cases were diagnosed and at least 41,000 deaths occurred within the US [4]. Since the early years of cancer research, biologists have used traditional microscopic techniques to assess tumor behavior in breast cancer patients [5]. For the diagnosis and treatment of cancer, precise prediction of tumors is critically important.
Computer-aided detection or diagnosis (CAD) systems use computer technologies to detect abnormalities in mammograms, such as calcifications, masses, and architectural distortion, and radiologists use these results for diagnosis [6]. Such systems can play a key role in the early detection of breast cancer and help to reduce the death rate among women with breast cancer. Thus, in the past several years, CAD systems and related techniques have attracted the attention of both research scientists and radiologists. For research scientists, there are several interesting research topics in cancer detection and diagnosis systems, such as high-efficiency, high-accuracy lesion detection algorithms, including the detection of masses, architectural distortion, and bilateral asymmetry. Radiologists, on the other hand, are attracted by the effectiveness of clinical applications of CAD systems. The latest machine learning techniques are increasingly being used by biologists to obtain proper tumor information from databases. Among the existing techniques, supervised machine learning methods are the most popular in cancer diagnosis.
Figure 1: Normal parts of the breasts
The proposed research is aimed at developing a system for breast cancer detection using feature extraction.
To gain state-of-the-art knowledge of existing breast cancer detection algorithms and methods.
To elaborate the challenges in the development of SVM.
To train the proposed systems on three datasets with feature extraction and compare their results.
To propose the development of an SVM with high accuracy.
To compare the proposed algorithm with existing computational methods.
The literature on SVM will be studied to understand topological and biological constraints.
Features will be extracted from three datasets.
Both the original features and the extracted features will be normalized.
The proposed algorithm will be applied to the original, extracted and normalized feature datasets.
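The normalization step listed above can be sketched as follows. The methodology does not state which normalization is used, so min-max scaling to the range [0, 1] is an assumption made here for illustration, and the function name `min_max_normalize` is hypothetical.

```python
from typing import List


def min_max_normalize(rows: List[List[float]]) -> List[List[float]]:
    """Rescale every feature (column) to the range [0, 1].

    Assumption: min-max scaling; the thesis does not name the method.
    """
    if not rows:
        return []
    n_features = len(rows[0])
    lows = [min(r[j] for r in rows) for j in range(n_features)]
    highs = [max(r[j] for r in rows) for j in range(n_features)]
    normalized = []
    for r in rows:
        new_row = []
        for j in range(n_features):
            span = highs[j] - lows[j]
            # A constant column carries no information; map it to 0.0.
            new_row.append((r[j] - lows[j]) / span if span else 0.0)
        normalized.append(new_row)
    return normalized


# Toy feature matrix: three samples, two features (second one constant).
sample = [[1.0, 10.0], [5.0, 10.0], [3.0, 10.0]]
print(min_max_normalize(sample))
```

Scaling matters for SVM training because features with large numeric ranges would otherwise dominate the distance computations.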
1.5. Thesis Organization
The thesis is organized as follows. Chapter 2 explains the background of breast cancer diagnosis. Chapter 3 reviews the literature on different algorithms for breast cancer diagnosis. Chapter 4 describes the features of the breast cancer datasets taken from the UCI Machine Learning Repository. Chapter 5 presents the proposed algorithm for detecting breast cancer. Chapter 6 gives the results of the proposed algorithm, and Chapter 7 describes future work and final remarks.
Development of tumor:
Cells in the body grow and divide in a controlled manner, and they die after a few rounds of replication. New cells are formed from progenitor cells as required. Breast cancer is the result of a growth abnormality in the cells of the breast that changes the consistency of the breast tissue. This abnormality usually develops in the inner lining of the milk ducts or lobules. Breast cancer appears as a tumor when there is uncontrolled proliferation of breast cells.
Many researchers have applied various algorithms and techniques, such as classification, clustering, regression, artificial intelligence, neural networks, association rules, decision trees, genetic algorithms and the nearest neighbor method, to help health professionals diagnose breast cancer with improved accuracy.
An accurate classifier is the most important component of any CAD scheme that is developed to assist medical professionals in the early detection of mammographic lesions. CAD systems are designed to support radiologists in the process of visually screening mammograms, to avoid misdiagnosis caused by fatigue, eyestrain, or lack of experience. The use of an accurate CAD system for early detection could save precious lives [7].
A variety of classification techniques have been developed for breast cancer CAD systems. The accuracy of many of them was evaluated using the dataset taken from the UCI machine-learning repository. For example, Goodman, Boggess, and Watkins tried different methods that produced the following accuracies: the optimized learning vector quantization (optimized-LVQ) method reached 96.7%, the big-LVQ method reached 96.8%, and the last method they proposed, AIRS, which is based on the artificial immune system, obtained a classification accuracy of 97.2% [8].
To achieve better precision, researchers turned to data mining technologies and machine learning approaches for predicting breast cancer. Machine learning (ML) techniques can be used to develop tools that give physicians an efficient system for the early detection and diagnosis of breast tumors, which may greatly enhance the survival rate of patients [9].
In 1999, Pena-Reyes and Sipper proposed a fuzzy-genetic approach to breast cancer detection [10].
In cancer detection, not only accuracy but also time complexity is a main requirement. Researchers proposed sequential forward search and sequential backward search to select features for a multilayer perceptron neural network that categorizes tumors, in order to reduce time complexity [11]. The F-score, used to measure how well DNA viruses discriminate between classes, was adopted to select the subset of DNA viruses for detecting breast cancer with a support vector machine (SVM) [12]. Later, Akay proposed an SVM-based method combined with feature selection for diagnosing breast cancer [13]. Prasad et al. tried different combinations of SVM and heuristics to find the best features for SVM training with less time consumption [14]. These studies showed not only an improvement in detection accuracy but also a reduction in training time, because the dimension of the feature space was reduced. Despite these gains, low time efficiency remained a drawback. The K-means algorithm (an unsupervised learning algorithm) was then suggested to extract tumor features and avoid iterative training on different feature subsets. A membership function was developed to obtain a compact representation of the K-means result for training the SVM, which showed high accuracy on breast cancer diagnosis [15].
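To make the F-score based feature ranking mentioned above concrete, here is a minimal sketch. It assumes that the F-score meant is the discrimination score commonly used for SVM feature selection (between-class separation of the means divided by the sum of the within-class sample variances); the cited works may define it differently, and the function name `f_score` is hypothetical.

```python
from typing import List


def f_score(positive: List[float], negative: List[float]) -> float:
    """Discrimination score of one feature for a two-class problem.

    positive / negative hold the feature values of the two classes
    (for example malignant and benign samples). Each class needs at
    least two values so that the sample variance is defined.
    """
    n_pos, n_neg = len(positive), len(negative)
    mean_pos = sum(positive) / n_pos
    mean_neg = sum(negative) / n_neg
    mean_all = (sum(positive) + sum(negative)) / (n_pos + n_neg)

    # Numerator: how far each class mean lies from the overall mean.
    between = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2

    # Denominator: sample variance of the feature inside each class.
    var_pos = sum((x - mean_pos) ** 2 for x in positive) / (n_pos - 1)
    var_neg = sum((x - mean_neg) ** 2 for x in negative) / (n_neg - 1)
    within = var_pos + var_neg

    return between / within if within else float("inf")


# A feature that separates the classes well scores high ...
print(f_score([1.0, 2.0, 3.0], [7.0, 8.0, 9.0]))
# ... and one whose class means coincide scores zero.
print(f_score([1.0, 5.0, 9.0], [2.0, 5.0, 8.0]))
```

Features can then be sorted by this score and the top-ranked ones passed to the classifier, which is what shrinks the feature space before training.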
The support vector machine (SVM), invented by Vladimir Vapnik, is a class of machine learning algorithms that can perform pattern recognition and regression based on the theory of statistical learning and the principle of structural risk minimization [16]. It has been used extensively in the detection of diseases because of its high prediction accuracy. SVM produced a more accurate result (97.2%) than a decision tree on the Breast Cancer Wisconsin (Original) dataset. In the breast cancer detection research by Akay, SVM delivered an accuracy of 98.53% for a 50-50% training-test partition, 99.02% for a 70-30% training-test partition and 99.51% for an 80-20% training-test partition on the same dataset. The feature selection algorithm not only reduced the dimension of the features but also eliminated noisy information for prediction. Polat and Günes suggested the least square support vector machine (LS-SVM) for breast cancer detection based on the same dataset, with an accuracy of 98.53% [17]. Another SVM with a linear kernel was applied to the classification of cancer tissues based on different datasets with more than 2000 types of features [18].
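The training-test partitions quoted above (50-50%, 70-30% and 80-20%) can be illustrated with a short sketch. The helper names `split_train_test` and `accuracy`, the fixed random seed and the placeholder records are assumptions for illustration only; this is not the evaluation code of the cited studies.

```python
import random
from typing import List, Sequence, Tuple


def split_train_test(samples: Sequence, train_fraction: float,
                     seed: int = 0) -> Tuple[List, List]:
    """Shuffle the samples and cut them into a training and a test part."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps runs repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(true_labels: Sequence, predicted_labels: Sequence) -> float:
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


records = list(range(100))  # placeholder for labelled patient records
for fraction in (0.5, 0.7, 0.8):
    train, test = split_train_test(records, fraction)
    print(f"train fraction {fraction}: {len(train)} train, {len(test)} test")
```

A classifier is fitted on the training part only, and the reported accuracy is computed on the held-out test part.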
Related Work and Comparative Study
This chapter provides the details of previous work on breast cancer diagnosis. We have surveyed previous approaches to understand how breast cancer is diagnosed, and a comparative study of these approaches is given below. There are two main approaches to the identification and diagnosis of breast cancer: experimental approaches and computational approaches.
In the past, different experimental approaches have been used for the diagnosis of breast cancer, including mammography, ultrasound, Magnetic Resonance Mammography (MRM) and breast screening. Mammography was significantly more sensitive than clinical assessment in cancer diagnosis but gave a higher false positive rate (p < 0.0001). Screening is the application of a diagnostic test to an asymptomatic population with the purpose of determining whether people have or do not have a disease. Ideally, screening should detect a disease at an early stage, when it can be cured most effectively [19].
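The two quantities compared above, sensitivity and false positive rate, follow directly from the counts of a screening outcome table. The sketch below uses made-up counts purely for illustration; they are not taken from the cited study.

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Share of diseased cases that the test correctly flags."""
    return true_positives / (true_positives + false_negatives)


def false_positive_rate(false_positives: int, true_negatives: int) -> float:
    """Share of healthy cases that the test wrongly flags."""
    return false_positives / (false_positives + true_negatives)


# Hypothetical counts for one screening method (illustrative only).
tp, fn, fp, tn = 90, 10, 5, 95
print("sensitivity:", sensitivity(tp, fn))
print("false positive rate:", false_positive_rate(fp, tn))
```

A method can improve sensitivity while also raising the false positive rate, which is the trade-off described for mammography against clinical assessment.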
Breast self-examination has been encouraged as a relatively inexpensive method of screening, while a number of countries have now developed sophisticated programmes using mammography. Studies that have assessed the effectiveness of screening mammography should be considered separately from those that have assessed the efficacy of the test in diagnosing cancer in symptomatic people.
Medical imaging can be used as a non-invasive method for looking inside the body and assisting doctors in the diagnosis and treatment of breast cancer. Md. Shafiul Islam et al. explored the different medical imaging techniques used in the diagnosis of breast cancer and compared their effectiveness, advantages and disadvantages for detecting early-stage breast cancer [20], focusing mainly on X-ray mammography, ultrasound and magnetic resonance imaging (MRI). In [21], Sachin Prasad N. and Dana Houserkova give an overview of the old and new modalities used in the field of breast imaging and evaluate the role of the various modalities used in the screening and diagnosis of breast cancer. Though there are various imaging techniques, Invasive Mammogram is considered the gold standard for breast cancer detection. Although initial detection of breast cancer can be done using any one of the available imaging modalities, they do not give assurance that the detected abnormality is malignant. Treatment of the patient therefore does not start until microscopic examination of tissue from the tumor has confirmed its malignancy.
With the passage of time and advancement in technology, researchers turned to computational approaches.
Data mining (DM) is one of the steps of knowledge discovery for extracting implicit patterns from vast, incomplete and noisy data [22]. It is a field at the confluence of various disciplines that has brought statistical analysis, machine learning (ML) techniques, artificial intelligence (AI) and database management systems (DBMS) together to address issues [23].
Knowledge Discovery and Data Mining (KDD) is an interdisciplinary area focusing upon methodologies for extracting useful knowledge from data. The term Knowledge Discovery in Databases or KDD for short, refers to the broad process of finding knowledge in data, and emphasizes the “high-level” application of particular data mining methods. The unifying goal of the KDD process is to extract knowledge from data in the context of large databases.
Applications of data mining have already been proven to provide benefits to many areas of medicine, including diagnosis, prognosis and treatment [24].
Fig. 1 shows the importance of data mining in the knowledge discovery framework and how data is transformed into knowledge as the discovery process continues. Classification and clustering have been the two main issues in data mining tasks. Classification is the task of finding the common properties among a set of objects in a database and classifying them into different classes [25]. Classification problems are closely related to clustering problems, since both put similar objects into the same category. In classification problems, the label of each class is a discrete and known category, while in clustering problems the label is an unknown category [26]. Clustering problems have been thought of as unsupervised classification problems [27]. Since there are no existing class labels, the clustering process summarizes data patterns from the data set. Breast cancer diagnosis has usually been treated as a classification problem, in which the goal is to find an optimal classifier that separates benign from malignant tumors.
Fig. 1. An overview of knowledge, discovery, and data mining process (Fayyad et al., 1996).
Classification is one of the most important and essential tasks in machine learning and data mining. A lot of research has been conducted to apply data mining and machine learning to different medical datasets to classify breast cancer, and many of these studies show good classification accuracy. Vikas Chaurasia and Saurabh Pal compare the performance of supervised learning classifiers, such as Naïve Bayes, SVM with an RBF kernel, RBF neural networks, decision trees (J48) and simple CART, to find the best classifier for breast cancer datasets [28]. Their experimental results show that the SVM with an RBF kernel is more accurate than the other classifiers; it scores an accuracy of 96.84% on the Wisconsin Breast Cancer (original) dataset. Djebbari et al. consider the effect of an ensemble of machine learning techniques for predicting survival time in breast cancer.
Their technique shows better accuracy on their breast cancer data set compared to previous results [29]. S. Aruna and L. V. Nandakishore compare the performance of C4.5, Naïve Bayes, Support Vector Machine (SVM) and K-Nearest Neighbor (K-NN) to find the best classifier on WBC [30]. SVM proves to be the most accurate classifier, with an accuracy of 96.99%. Angeline Christobel Y. and Dr. Sivaprakasam achieve an accuracy of 69.23% using a decision tree classifier (CART) on breast cancer datasets [31].
The accuracy of the data mining algorithms SVM, IBK and BF Tree is compared by A. Pradesh [32]. The performance of SMO shows a higher value compared with the other classifiers. T. Joachims reaches an accuracy of 95.06% with neuro-fuzzy techniques on the Wisconsin Breast Cancer (original) dataset [33]. Liu Ya-Qin, W. Cheng, and Z. Lu experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability; bagging generates additional training data from the original set by sampling with repetition to produce multisets of the same size as the original data [34]. Delen et al. took 202,932 breast cancer patient records, which were then pre-classified into two groups, “survived” (93,273) and “not survived” (109,659). The accuracy of predicting survivability was in the range of 93% [35].
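The bagging idea described above, sampling with repetition to build multisets of the same size as the original data, can be sketched as follows. The function names, the fixed seed, the placeholder records and the use of a simple majority vote to combine models are assumptions for illustration; the cited study used the C5 algorithm as the base learner, which is not reproduced here.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence


def bootstrap_sample(records: Sequence, rng: random.Random) -> List:
    """Draw len(records) items with replacement from the original set."""
    return [rng.choice(records) for _ in range(len(records))]


def majority_vote(models: Sequence[Callable], x) -> object:
    """Combine the models trained on the bags by counting their votes."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


rng = random.Random(42)  # fixed seed so the sketch is repeatable
records = ["r1", "r2", "r3", "r4", "r5"]  # placeholder patient records
bags = [bootstrap_sample(records, rng) for _ in range(3)]
for bag in bags:
    print(bag)  # same size as the original, some records repeated
```

One base classifier would be trained per bag, and `majority_vote` would merge their predictions for a new patient record.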
Sau Loong Ang et al. noted that attempts had been made to improve Naive Bayes by introducing links or associations between the features, as in the Tree Augmented Naive Bayes (TAN). In their study, they showed the accuracy of a General Bayesian Network (GBN) learned with the hill climbing approach, which did not impose any restrictions on the structure and represented the dataset in a better way. To measure the performance of GBN against Naive Bayes and TAN, they used seven nominal datasets without missing values for comparative purposes. These nominal datasets were taken from the UCI Machine Learning Repository (Lichman, 2013) and were fed into Naive Bayes, GBN and TAN for classification with ten-fold cross validation in the WEKA software, using 286 instances each containing 10 attributes. The Naive Bayes model gave an accuracy of 71.68%, TAN 69.58% and GBN 74.47% [36].
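Ten-fold cross validation, as used in the study above, splits the instances into ten folds and lets each fold serve once as the test set. The sketch below shows only the index bookkeeping; it is not WEKA's implementation, and the function name `k_fold_indices` and the fixed seed are assumptions.

```python
import random
from typing import List, Tuple


def k_fold_indices(n_samples: int, k: int = 10,
                   seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, test_indices) pairs covering all samples."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i
                     for idx in fold]
        splits.append((train_idx, test_idx))
    return splits


# 286 instances, as in the dataset described above.
splits = k_fold_indices(286, k=10)
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

The reported accuracy is then the average of the ten test-fold accuracies, so every instance is used for testing exactly once.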
The support vector machine is a class of machine learning algorithms that can perform pattern recognition and regression based on the theory of statistical learning and the principle of structural risk minimization [37]. Vladimir Vapnik invented SVM to search for a hyperplane that separates a set of positive examples from a set of negative examples with maximum margin [38]. The margin is defined by the distance of the hyperplane to the nearest of the positive and negative examples [39]. SVM has been widely used in the diagnosis of diseases because of its high prediction accuracy. SVM generated a more accurate result (97.2%) than a decision tree on the Breast Cancer Wisconsin (Original) Dataset [40]. In research on diagnosing breast cancer, SVM provided accuracies of 98.53%, 99.02%, and 99.51% for 50–50%, 70–30%, and 80–20% training-test partitions respectively, based on the same data set, which contained five features after feature selection by a genetic algorithm [41]. In this research, the features were selected based on the rank of feature discrimination and the testing accuracy on different combinations of feature subsets using grid search and SVM, which requires high computational time and resources. In other words, to get the optimal parameter settings and feature subsets, the SVM was trained on the input iteratively until the optimal accuracy was obtained. The feature selection algorithm not only reduced the dimension of the features but also eliminated noisy information for prediction. Polat and Günes proposed the least square support vector machine (LS-SVM) for breast cancer diagnosis based on the same data set, with an accuracy of 98.53% [42]. The main difference between LS-SVM and SVM is that LS-SVM uses a set of linear equations for training instead of solving a quadratic optimization problem. By simplifying the training process in this way, the calculation became simpler; however, feature selection was not included in this research. Another SVM with a linear kernel was applied to the classification of cancer tissue based on different data sets with more than 2,000 types of features [43].
To reduce the dimension of the training set, some researchers have started to combine clustering algorithms with classifier models. Dhillon et al. used a hybrid clustering algorithm to group similar text words in order to achieve faster and more accurate training on a text classification task 44. Fuzzy C-Means, a variant of the K-means clustering algorithm, was introduced to select training samples for SVM classifier training 45. With Fuzzy C-Means, similar training samples were clustered, and a subset of the training samples in each cluster was selected for SVM classifier training.
K. Shivakami performed breast cancer prediction using a DT-SVM hybrid model. The study used the Wisconsin Breast Cancer Dataset (WBCD) from the UCI machine learning repository (UCI Repository of Machine Learning Databases). The dataset contained 699 instances taken from needle aspirates of patients' breasts, of which 458 cases belonged to the benign class and the remaining 241 cases belonged to the malignant class. Sixteen instances had missing values, and in this study all missing values were replaced by the mean of the attribute. Each record in the database had nine attributes, which were found to differ significantly between benign and malignant samples. For DT-SVM the accuracy obtained was 91% with an error rate of 2.58%. Other classification algorithms were also applied, namely IBL, SMO and Naïve Bayes. For IBL the accuracy obtained was 85.23% with an error rate of 12.63%. For SMO the accuracy was 72.56% with an error rate of 5.96%. For Naïve Bayes the accuracy obtained was 89.48% with an error rate of 9.89%. This comparative study therefore indicated that DT-SVM classified the breast cancer data better than the other algorithms 46.
Shweta Kharya et al. aimed to develop a probabilistic breast cancer prediction system using Naive Bayes classifiers that could support expert decisions with high accuracy. The system may be deployed in remote areas such as the countryside or rural regions to imitate human diagnostic expertise for the treatment of cancer. The system is user friendly and reliable because the model has already been developed. For training, the Wisconsin dataset containing 699 records with 9 medical attributes was used, and 200 records were taken for testing. The dataset had about 65.5% benign cases and 34.5% malignant cases. The accuracy was found to be 93% 47.
G. Ravi Kumar et al. used a dataset of 699 patient records, of which 499 were used for training and 200 for testing. Among them, 241 (34.5%) were reported to have breast cancer while the remaining 458 (65.5%) were non-cancerous. To validate the prediction results of six popular data mining techniques, 10-fold cross-validation was used. K-fold cross-validation is commonly used to reduce the error that comes from random sampling when comparing the accuracies of several prediction models. The entire dataset is randomly divided into k folds with the same number of instances in each fold. Training and testing are performed k times; each time, one fold is held out for testing while the remaining folds are used for training. In this study the data were divided into 10 folds, so that in each round 1 fold was used for testing and 9 folds were used for training. Applying the Naïve Bayes algorithm to the testing data gave an accuracy of 94.5%, and the same result was obtained for SVM 48.
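The fold construction described above can be sketched in a few lines. This is only an illustrative Python sketch, not the procedure used in the cited study or in WEKA; the function name `k_fold_indices` and the seed value are assumptions made for the example.

```python
import random

def k_fold_indices(n_instances, k=10, seed=1):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)          # random assignment to folds
    folds = [indices[i::k] for i in range(k)]     # k folds of near-equal size
    for i in range(k):
        test_idx = folds[i]                       # one fold held out for testing
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train_idx, test_idx

# 699 instances, as in the Wisconsin original dataset
splits = list(k_fold_indices(699, k=10))
```

Every instance is used for testing exactly once and for training in the other nine rounds, which is why the averaged accuracy depends less on one particular random split.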
C.D. Katsis et al. proposed a methodology that used a Correlation Feature Selection (CFS) procedure to rank the extracted features and an Artificial Immune Recognition System (AIRS) classifier to support breast cancer diagnosis. To evaluate the methodology, data were gathered from 53 subjects out of 4726 cases. The selected subjects presented lesions that were not highly suggestive of benignity or malignancy when evaluated on all the modalities used. In every case a biopsy was conducted, and the biopsy results were used as the gold standard to validate the methodology. The constructed dataset consisted of the features as well as the biopsy results (malignancy or benignity) for all 53 subjects. All data were collected at the University Hospital of Ioannina, Greece. The SVM technique gave an accuracy of 70.00 ± 6.33% with the full set of features and 68.92 ± 6.97% with the subset of CFS-selected features 49.
Gouda I. Salama et al. presented a comparison among the classifiers decision tree (J48), Naive Bayes (NB), Multi-Layer Perceptron (MLP), Sequential Minimal Optimization (SMO) and Instance Based for K-Nearest Neighbor (IBK) on three popular breast cancer databases (Wisconsin Breast Cancer (WBC), Wisconsin Prognosis Breast Cancer (WPBC) and Wisconsin Diagnosis Breast Cancer (WDBC)), using the confusion matrix and classification accuracy based on the 10-fold cross-validation method. They introduced a fusion at classification level between these classifiers to find the most appropriate multi-classifier method for each dataset. The experimental results showed that the fusion of J48 and MLP with PCA was superior to the other classifiers on the WBC dataset. PCA was used on the WBC dataset as a feature reduction transformation that combines a set of correlated features. An accuracy of 92.97% was achieved using Naïve Bayes as the classifier 50.
Kim W et al. applied the SVM technique to a breast cancer dataset consisting of 679 records. The data were clinical, pathologic and epidemiologic. The accuracy obtained was 99% when the feature local invasion of tumor was considered 51.
Mehmet Fatih Akay used SVM with feature selection to diagnose breast cancer. For the training and testing experiments, the WDBC dataset was taken from the University of California at Irvine (UCI) machine learning repository. The proposed method produced its highest classification accuracies (99.51%, 99.02% and 98.53% for 80–20%, 70–30% and 50–50% training-test partitions respectively) for a subset that contained five features. Other measures such as sensitivity, specificity, the confusion matrix, negative predictive value, positive predictive value and ROC curves were also used to show the performance of SVM with feature selection 52.
Diana Dumitru applied the Naive Bayes classifier to the Wisconsin Prognostic Breast Cancer (WPBC) dataset, which contains 198 patients and a binary decision class: non-recurrent events with 151 instances and recurrent events with 47 instances. The testing diagnostic accuracy, which was the main performance measure of the classifier, was about 74.24%, in line with the performance of other well-known machine learning techniques 53.
Daniele Soria et al. presented a comparison of three different machine learning classifiers, namely the Naive Bayes algorithm, the Multilayer Perceptron and the C4.5 decision tree. The C4.5 algorithm, developed by Ross Quinlan, is used to generate a decision tree and is an extension of Quinlan's earlier ID3 algorithm 54. The decision trees created by C4.5 can be used for classification, and for this reason C4.5 is often referred to as a statistical classifier 55. A Multilayer Perceptron is a feed-forward artificial neural network model that maps sets of input data onto a set of appropriate outputs. It is a modification of the standard linear perceptron that uses three or more layers of neurons (nodes) with nonlinear activation functions, and it is more powerful than the perceptron in that it can distinguish data that are not linearly separable, that is, not separable by a hyperplane. The study was motivated by the need to find an automated and robust method to validate the authors' previous classification of breast cancer markers. They had obtained six classes using agreement between different clustering algorithms. Starting from these groups, they wanted to replicate the classification while taking into account the high non-normality of the data. For this reason they started with the C4.5 and Multilayer Perceptron classifiers and then compared the results with Naïve Bayes. Surprisingly, when the dataset was reduced to ten markers, the Naive Bayes classifier performed better than C4.5. The number of instances was 663. An accuracy of 93.1% was obtained using 10 markers, and this accuracy fell to 86.9% using 25 markers.
Haowen You et al. provided a comparative analysis of several classification tools (back-propagation neural network, linear programming, Bayesian network and support vector machine) on a benchmark dataset consisting of numeric cellular shape features extracted from pre-processed Fine Needle Aspiration biopsy images of cell slides. The benchmark dataset was obtained from the UCI machine learning repository and classifies the data as malignant (M) or benign (B). The dataset was composed of 569 observations, with 357 benign and 212 malignant cases. Each observation was composed of 30 variables, and 10 of the featured variables were related to the aforementioned characteristics. Here the Naïve Bayes classifier gave an accuracy of 89.55% 56.
Andrews, Diederich and Tickle presented the first method for extracting rules from a neural network in 1966. They studied a link-based expert system in which every node of the ANN represented a mental concept. The experts showed how an ANN could be used for the refinement of rules. The algorithm was named SUBSET, and it was based on the analysis of the weights which fired a specific neuron 57.
Towell and Shavlik developed a method for extracting rules, called the subset method, which was based on the weights of the links and assumed that the activation function exhibited an almost Boolean behavior 58. Sethi and Yoo also proposed a method for extracting rules that was based on the weights of the links 59.
Setiono and Liu proposed an approach to extracting rules that was based on clustering the activation values of the hidden layer 60. Keedwell et al. proposed a system in which a genetic algorithm was used to search for rules in the input space of the ANN 61. Setiono and Leow developed a rapid method, based on the connections among hidden layers, in which the information content of the hidden layers was taken into account 62. Palade, Neagu, and Puscasu proposed a method for extracting rules from artificial neural networks (ANNs) that was based on interrupts in the propagation through the network, and in which a function was employed for inverting the ANN 63. Garcez and Broda suggested a method for extracting non-uniform rules from an artificial neural network formed from separate input units 64.
Snyders and Omlin compared the efficiency of symbolic rules extracted from a trained ANN with and without adaptive bias and obtained experimental results for a problem in molecular biology 65. Jiang, Zhou and Chen proposed a combination of artificial neural networks and rule learning. In their algorithm, an ANN ensemble was employed as a front-end process that produced many learning samples for the back-end rule learning process 66. Setiono, Leow and Zurada proposed an approach for extracting rules from ANNs that had been trained on regression problems 67. Elalfi, Haque and Elalami proposed an algorithm for extracting rules from databases by means of an ANN that had been trained using a genetic algorithm 68. In summary, most of the approaches described in the literature have one of two purposes. On the one hand, some researchers have stated that it is necessary to simplify neural networks so that the process of extracting rules becomes easier, and they favor the use of a structure and training program designed specifically for neural networks to achieve this goal. The hypothesis behind these approaches is that neural networks can help in extracting the desired rules. On the other hand, some articles have proposed algorithms that focus on clarifying the coded information present in previously trained ANNs 69. In other words, the rules produced by artificial immunity are evaluated through the use of a combination of a trained ANN and the results of subtractive clustering of the initial dataset.
K-Nearest Neighbor (KNN) classification classifies instances based on their similarity 70. It is one of the most popular algorithms for pattern recognition. It is a type of lazy learning in which the function is only approximated locally and all computation is deferred until classification. An object is classified by a majority vote of its k nearest neighbors, where k is always a positive integer. The neighbors are selected from a set of objects for which the correct classification is known. In WEKA this classifier is called IBK.
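The majority vote among neighbors can be illustrated with a minimal Python sketch. It is not WEKA's IBK implementation; the function name `knn_predict`, the Euclidean distance and the sample values are assumptions chosen only for the example.

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Lazy learning: nothing is fitted in advance; distances are computed at query time.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical two-feature samples on the 1-10 scale of the original dataset.
train_X = [(1, 1), (2, 1), (1, 2), (8, 9), (9, 8), (10, 10)]
train_y = ["benign", "benign", "benign", "malignant", "malignant", "malignant"]
prediction = knn_predict(train_X, train_y, (2, 2), k=3)
```

An odd value of k is a common choice for two-class problems because it avoids tied votes.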
Decision tree J48 implements Quinlan's C4.5 algorithm for generating a pruned or unpruned C4.5 tree 71. C4.5 is an extension of Quinlan's earlier ID3 algorithm. The decision trees generated by J48 can be used for classification. J48 builds decision trees from a set of labeled training data using the concept of information entropy. It uses the fact that each attribute of the data can be used to make a decision by splitting the data into smaller subsets.
J48 examines the normalized information gain (difference in entropy) that results from choosing an attribute for splitting the data. The attribute with the highest normalized information gain is used to make the decision, and the algorithm then recurs on the smaller subsets. The splitting procedure stops when all instances in a subset belong to the same class; a leaf node is then created in the decision tree that selects that class. It can also happen that none of the features gives any information gain. In this case J48 creates a decision node higher up in the tree using the expected value of the class.
J48 can handle both continuous and discrete attributes, training data with missing attribute values, and attributes with differing costs. Furthermore, it provides an option for pruning trees after creation.
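The splitting criterion described above can be made concrete with a small Python sketch. It assumes that "normalized information gain" means the gain ratio used by C4.5 (information gain divided by the entropy of the split itself) and that the attribute is discrete; the function names and the toy data are hypothetical, and this is not the J48 source code.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def gain_ratio(values, labels):
    """Information gain of splitting on a discrete attribute, divided by split information."""
    total = len(labels)
    subsets = {}
    for v, y in zip(values, labels):
        subsets.setdefault(v, []).append(y)       # group class labels by attribute value
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    info_gain = entropy(labels) - remainder       # reduction in entropy after the split
    split_info = entropy(values)                  # entropy of the partition itself
    return info_gain / split_info if split_info > 0 else 0.0

labels = ["benign", "benign", "malignant", "malignant"]
perfect = gain_ratio(["low", "low", "high", "high"], labels)   # separates the classes
useless = gain_ratio(["low", "high", "low", "high"], labels)   # gives no information
```

In this toy case the first attribute would be chosen for the split because its gain ratio is the highest, while the second attribute leaves the class mixture unchanged.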
Fusion of classifiers means combining multiple classifiers to obtain the best accuracy. A fused classifier is a set of classifiers whose individual predictions are combined in some way to classify new examples. Integration should improve predictive accuracy. In WEKA the class for combining classifiers is called Vote, and it offers different ways of combining the probability estimates for classification.
A confusion matrix is a visualization tool that is commonly used to present the accuracy of classifiers 72. It shows the relationship between the actual outcomes and the predicted classes. In WEKA, the SVM classifier is called SMO.
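Two common combination rules, majority voting and averaging of probability estimates, can be sketched as follows. This Python fragment only illustrates the idea and is not WEKA's Vote class; the function names and the example predictions are hypothetical.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-classifier predictions for one instance by majority vote."""
    return Counter(predictions).most_common(1)[0][0]

def average_probabilities(prob_lists):
    """Combine class-probability estimates by averaging them across classifiers."""
    n = len(prob_lists)
    return [sum(p[i] for p in prob_lists) / n for i in range(len(prob_lists[0]))]

# Hypothetical outputs of three classifiers for a single instance.
fused_label = majority_vote(["malignant", "benign", "malignant"])
# Each inner list is [P(malignant), P(benign)] from one classifier.
fused_probs = average_probabilities([[0.9, 0.1], [0.4, 0.6], [0.8, 0.2]])
```

Averaging probabilities lets a confident classifier outweigh two uncertain ones, whereas majority voting counts every classifier equally.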
We use WEKA in our research for measuring classification accuracy.
WEKA (Waikato Environment for Knowledge Analysis) is an open source tool developed at the University of Waikato, New Zealand. It is free Java software available under the General Public License. It is a data mining tool that provides an ample toolbox for solving data mining problems 73. WEKA can be applied to large sets of instances with many attributes. It implements several data mining techniques, including algorithms for classification, regression and clustering, along with a number of visualization tools.
WEKA supports many different data mining tasks such as data preprocessing and visualization, attribute selection, classification, prediction, model evaluation, clustering, and association rule mining. WEKA includes 49 different tools for pre-processing and 76 different algorithms, and clustering can be performed using eight different algorithms 74.
With respect to the related work mentioned above, our work compares the behavior of the data mining algorithm SVM and the dl4j classifier on the Wisconsin Breast Cancer (original, diagnostic, prognostic) datasets in both diagnosis and analysis to support decision making. The goal is to achieve the best accuracy with the lowest error rate in analyzing the data. To do so, we compare the efficiency and effectiveness of those approaches in terms of several criteria, including accuracy, precision, sensitivity and specificity, correctly and incorrectly classified instances, and time to build the model.
Features of Datasets:
In this chapter we discuss different features of datasets.
Datasets:
Our investigation is based on the Wisconsin Breast Cancer datasets available online at the UCI machine learning repository.
The Wisconsin Breast Cancer datasets are described below:
Dataset                                      No. of attributes   No. of instances   No. of classes
Wisconsin Breast Cancer (Original)           11                  699                2
Wisconsin Diagnostic Breast Cancer (WDBC)    32                  569                2
Wisconsin Prognostic Breast Cancer (WPBC)    34                  198                2
These three datasets have 11, 32 and 34 attributes respectively. For more accurate results we normalized these datasets using min-max normalization. A second set of datasets was prepared on the basis of extracted features. The features that we extracted include mean, minimum, maximum, median, vara, var.p, var.s, varpa, stdeva, stdev.p, stdev.s, stdevpa etc.
Wisconsin original breast cancer dataset:
1 Sample code number id number
2 Clump Thickness 1 – 10
3 Uniformity of Cell Size 1 – 10
4 Uniformity of Cell Shape 1 – 10
5 Marginal Adhesion 1 – 10
6 Single Epithelial Cell Size 1 – 10
7 Bare Nuclei 1 – 10
8 Bland Chromatin 1 – 10
9 Normal Nucleoli 1 – 10
10 Mitoses 1 – 10
11 Class 2 for benign, 4 for malignant
Wisconsin diagnostic breast cancer dataset:
1 ID number
2 Diagnosis (M = malignant, B = benign)
3 Mean Radius
4 Mean Texture
5 Mean Perimeter
6 Mean Area
7 Mean Smoothness
8 Mean Compactness
9 Mean Concavity
10 Mean Concave points
11 Mean Symmetry
12 Mean Fractal dimension
13 Radius SE
14 Texture SE
15 Perimeter SE
16 Area SE
17 Smoothness SE
18 Compactness SE
19 Concavity SE
20 Concave points SE
21 Symmetry SE
22 Fractal dimension SE
23 Worst Radius
24 Worst Texture
25 Worst Perimeter
26 Worst Area
27 Worst Smoothness
28 Worst Compactness
29 Worst Concavity
30 Worst Concave points
31 Worst Symmetry
32 Worst Fractal dimension
Wisconsin prognostic breast cancer dataset:
1 ID number
2 Outcome (R = recurrent, N = non-recurrent)
3 Time (recurrence time if R, disease-free time if N)
4 Mean Radius
5 Mean Texture
6 Mean Perimeter
7 Mean Area
8 Mean Smoothness
9 Mean Compactness
10 Mean Concavity
11 Mean Concave points
12 Mean Symmetry
13 Mean Fractal dimension
14 Radius SE
15 Texture SE
16 Perimeter SE
17 Area SE
18 Smoothness SE
19 Compactness SE
20 Concavity SE
21 Concave points SE
22 Symmetry SE
23 Fractal dimension SE
24 Worst Radius
25 Worst Texture
26 Worst Perimeter
27 Worst Area
28 Worst Smoothness
29 Worst Compactness
30 Worst Concavity
31 Worst Concave points
32 Worst Symmetry
33 Worst Fractal dimension
34 Tumor size
35 Lymph node status
In WDBC there are 569 instances with 32 attributes, which are derived from 10 real-valued features, including:
Radius: the mean of the distances from the center of the cell to the points on its perimeter. The radius of the cell is checked when detecting a cancerous tumor. A radius that differs from the normal radius indicates an abnormality that may lead to cancer.
Texture: standard deviation of gray-scale values
Smoothness: local variation in radius lengths
Compactness: perimeter^2 / area – 1.0
Concavity: severity of concave portions of the contour
Concave points: number of concave portions of the contour
Fractal dimension: coastline approximation
Proposed Methodology for Breast Cancer Diagnosis:
The method that we use for breast cancer diagnosis is SVM. An observation is classified using a trained SVM classifier. Before the SVM classifier is trained, the dataset is prepared.
Preparation of the dataset consists of the following parts:
The Wisconsin Breast Cancer datasets from the UCI Machine Learning Repository are used 75 to distinguish malignant (cancerous) from benign (non-cancerous) samples. A brief description of these datasets is presented in table 1. Each dataset consists of a number of classification patterns or instances with a set of numerical features or attributes.
Feature extraction is an important step. The features that we extracted include mean, minimum, maximum, median, vara, var.p, var.s, varpa, stdeva, stdev.p, stdev.s, stdevpa etc. We extracted all these features for all three datasets, namely the Wisconsin original, diagnostic and prognostic datasets. How many features are needed depends on the dataset. Sometimes one feature may be sufficient to build a highly accurate classifier, but on most interesting, non-trivial problems multiple features are needed. We therefore extract features in order to build a more accurate classifier.
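The statistical features named above can be computed as in the following Python sketch. It assumes that the statistics are taken across the attribute values of one instance, which the text does not state explicitly, and the example record is hypothetical. The names var.p, var.s, stdev.p and stdev.s are read here as population and sample variance and standard deviation; for purely numeric data the vara, varpa, stdeva and stdevpa variants give the same values as their sample and population counterparts.

```python
import statistics

def extract_statistics(values):
    """Summary statistics of one instance's attribute values."""
    return {
        "mean": statistics.mean(values),
        "minimum": min(values),
        "maximum": max(values),
        "median": statistics.median(values),
        "var.p": statistics.pvariance(values),    # population variance
        "var.s": statistics.variance(values),     # sample variance
        "stdev.p": statistics.pstdev(values),     # population standard deviation
        "stdev.s": statistics.stdev(values),      # sample standard deviation
    }

# Hypothetical record: nine cytological attributes on the 1-10 scale.
record = [5, 1, 1, 1, 2, 1, 3, 1, 1]
features = extract_statistics(record)
```

Each original record is thus replaced by a short vector of summary values, which forms one row of the derived "statistics" dataset.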
Min-max normalization is a normalization strategy which linearly transforms x to y = (x - min) / (max - min), where min and max are the minimum and maximum values in X, and X is the set of observed values of x. We apply normalization to both sets of datasets, i.e. the set of three datasets taken from UCI and the set of datasets with derived features. At the end of this process we had prepared 12 datasets, which are as follows:
Diagnostic
Diagnostic normalized
Diagnostic statistics
Diagnostic stats normalized
Original
Original normalized
Original statistics
Original stats normalized
Prognostic
Prognostic normalized
Prognostic statistics
Prognostic stats normalized
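The min-max transform y = (x - min) / (max - min) used to prepare the normalized datasets can be sketched in Python as follows. The function name and the handling of a constant column are assumptions made for the example; the thesis does not say how constant columns were treated.

```python
def min_max_normalize(column):
    """Linearly rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:                       # constant column: avoid division by zero
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

# Hypothetical values of one attribute across four instances.
normalized = min_max_normalize([1, 4, 10, 7])
```

The transform is applied to each attribute separately, so that attributes with large numeric ranges (such as area) do not dominate attributes with small ranges (such as smoothness).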
SVM is one of the supervised ML classification techniques that is widely applied in the field of cancer diagnosis and prognosis. SVM functions by selecting critical samples from all classes known as support vectors and separating the classes by generating a linear function that divides them as broadly as possible using these support vectors.
Support vector machine is a class of machine learning algorithms that can perform pattern recognition and regression based on the theory of statistical learning and the principle of structural risk minimization. Vladimir Vapnik invented SVM to search for a hyperplane that separates a set of positive examples from a set of negative examples with maximum margin. The margin is defined as the distance from the hyperplane to the nearest of the positive and negative examples. SVM has been widely used in the diagnosis of diseases because of its high prediction accuracy 37-39.
An SVM classifies data by finding the best hyperplane that separates all data points of one class from those of the other class. The best hyperplane for an SVM means the one with the largest margin between the two classes. Margin means the maximal width of the slab parallel to the hyperplane that has no interior data points.
The support vectors are the data points that are closest to the separating hyperplane; these points are on the boundary of the slab. The following figure illustrates these definitions, with + indicating data points of type 1 and – indicating data points of type –1.
[Figure: separating hyperplane, margin and support vectors of an SVM]
Therefore, an SVM maps an input vector into a high-dimensional space and aims to find the most suitable hyperplane that divides the dataset into classes 5. This linear classifier maximizes the distance between the decision hyperplane and the nearest data point, which is called the marginal distance, by finding the best-suited hyperplane 6.
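For a linear SVM, the hyperplane and margin described above have a simple numeric form: a point x is classified by the sign of w·x + b, and the width of the slab between the two margin boundaries w·x + b = +1 and w·x + b = -1 is 2 / ||w||. The Python sketch below only illustrates these two formulas; the weight vector, bias and points are hypothetical values, not parameters learned from the thesis data.

```python
import math

def decision_value(w, b, x):
    """Signed score w.x + b; its sign gives the predicted class (+1 or -1)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def margin_width(w):
    """Width of the slab between the boundaries w.x + b = +1 and w.x + b = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical hyperplane parameters and two support vectors with their classes.
w, b = [1.0, 1.0], -3.0
support_vectors = {(1.0, 1.0): -1, (2.0, 2.0): +1}
predicted = {x: (1 if decision_value(w, b, x) >= 0 else -1) for x in support_vectors}
width = margin_width(w)
```

Both example points lie exactly on a margin boundary (their scores are -1 and +1), which is what makes them support vectors in this toy setting.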
Working of SVM classifier:
Group = svmclassify(SVMStruct, Sample) classifies each row of the data in Sample, a matrix of data, using the information in a support vector machine classifier structure SVMStruct, created using the svmtrain function. Like the training data used to create SVMStruct, Sample is a matrix where each row corresponds to an observation or replicate, and each column corresponds to a feature or variable. Therefore, Sample must have the same number of columns as the training data, because the number of columns defines the number of features. Group indicates the group to which each row of Sample has been assigned.
Group = svmclassify(SVMStruct, Sample, 'Showplot', true) plots the Sample data in the figure created using the Showplot property with the svmtrain function. This plot appears only when the data is two-dimensional.
Input arguments:
SVMStruct: Support vector machine classifier structure created using the svmtrain function.
Sample: A matrix where each row corresponds to an observation or replicate, and each column corresponds to a feature or variable. Sample must have the same number of columns as the training data, because the number of columns defines the dimensionality of the data space.
Showplot: Describes whether to display a plot of the classification. Displays only for 2-D problems. Follow with a Boolean argument: true to display the plot, false to give no display.
Output arguments:
Group: Column vector with the same number of rows as Sample. Each entry (row) in Group represents the class of the corresponding row of Sample.
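xdata = trainset(:, 2:47);    % training features: columns 2 to 47
labels = trainset(:, 1);      % training class labels: column 1
SVMStruct = svmtrain(xdata, labels, 'ShowPlot', true);    % train the SVM classifier
Xnew = testset(:, 2:47);      % test features, same columns as the training data
testlabels = svmclassify(SVMStruct, Xnew, 'ShowPlot', true);    % predicted class of each test row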
The results that we achieved through SVM are as follows:
Dataset                        Accuracy   Error    Sensitivity   Specificity   Precision
Diagnostic statistics          0.5556     0.4444   0.6132        0.4615        0.6500
Diagnostic                     0.9882     0.0118   0.9906        0.9844        0.9906
Diagnostic normalized          0.9825     0.0175   0.9813        0.9844        0.9906
Diagnostic stats normalized    0.9591     0.0409   0.9717        0.9385        0.9626
Original normalized            0.9476     0.0524   0.9504        0.9420        0.9710
Original statistics            0.9714     0.0286   0.9926        0.9333        0.9640
Original                       0.9762     0.0238   0.9767        0.9753        0.9844
Original stats normalized      0.9524     0.0476   0.9928        0.8750        0.9384
Prognostic statistics          0.6667     0.3333   0.7857        0.3889        0.7500
Prognostic                     0.6833     0.3167   0.8333        0.4583        0.6977
Prognostic normalized          0.7000     0.3000   0.9167        0.3750        0.6875
Prognostic stats normalized    0.6780     0.3220   0.7907        0.3750        0.7727
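The five measures reported in the table are all derived from the four counts of a confusion matrix. The Python sketch below shows the formulas under the usual definitions. The counts are hypothetical: the thesis does not report them, and they were chosen only because they reproduce the rounded values of the Diagnostic row.

```python
def classification_metrics(tp, fn, tn, fp):
    """Accuracy, error, sensitivity, specificity and precision from confusion counts."""
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total
    return {
        "accuracy": accuracy,
        "error": 1 - accuracy,
        "sensitivity": tp / (tp + fn),    # true positive rate (recall)
        "specificity": tn / (tn + fp),    # true negative rate
        "precision": tp / (tp + fp),      # positive predictive value
    }

# Hypothetical counts: 170 test instances, 106 positive and 64 negative.
metrics = classification_metrics(tp=105, fn=1, tn=63, fp=1)
rounded = {name: round(value, 4) for name, value in metrics.items()}
```

With these counts the rounded values are 0.9882, 0.0118, 0.9906, 0.9844 and 0.9906, matching the Diagnostic row. Accuracy alone can hide a weak specificity, as in the prognostic rows, which is why sensitivity and specificity are reported separately.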
J48 is a recursive algorithm for generating pruned or unpruned C4.5 decision trees. Decision trees are created by the J48 algorithm using information entropy on a set of training data. Data attributes are organized into subsets, and the normalized information gain, measured by the difference in entropy, is used to evaluate these subsets and identify the optimum attributes to use as nodes in the decision tree.
J48 is a decision tree that is used for classification; it applies the information entropy concept and implements Quinlan's C4.5 algorithm for generating a pruned C4.5 tree. Decisions are made by splitting the data on each attribute into smaller subsets in order to examine the entropy differences, and the attribute with the highest normalized information gain is chosen. The splitting stops when the instances of a subset all belong to the same class, and a leaf node is then created. If no leaf node is detected, J48 creates a decision node higher up in the tree based on the expected class value 76.
5.4.1 Working of J48:
We implemented J48 on all 12 datasets in WEKA and gathered the results, which are shown in the next chapter, i.e. performance comparison.
The properties with which we ran J48 are given in the following table:
Cross validation 10 fold
Batch size 100
Confidence factor 0.25
Numdecimal places 2
Cross validation:
Cross-validation is a technique for evaluating predictive models by partitioning the original sample into a training set to train the model and a test set to evaluate it. In k-fold cross-validation, the original sample is randomly partitioned into k subsamples of equal size.
Batch size:
The number of instances to be processed together in one batch.
Confidence factor:
The confidence factor controls pruning of the tree: smaller values incur more pruning and larger values incur less.
Numdecimal places:
The number of decimal places for the output of numbers in the model.
Seed:
The seed for random data shuffling (moving the data into a different order or into different positions).
An example of a tree generated by J48 is given below:
5.4.3 Results of J48:
Dataset                        TP Rate   FP Rate   Precision   Recall   F-Measure
Diagnostic statistics          0.928     0.077     0.928       0.928    0.928
Diagnostic                     0.930     0.076     0.930       0.930    0.930
Diagnostic normalized          0.931     0.073     0.932       0.931    0.932
Diagnostic stats normalized    0.930     0.074     0.930       0.930    0.930
Original normalized            0.940     0.075     0.940       0.940    0.940
Original statistics            0.960     0.047     0.960       0.960    0.960
Original                       0.940     0.075     0.940       0.940    0.940
Original stats normalized      0.960     0.047     0.960       0.960    0.960
Prognostic statistics          0.737     0.404     0.753       0.737    0.744
Prognostic                     0.732     0.450     0.738       0.732    0.735
Prognostic normalized          0.737     0.463     0.737       0.737    0.737
Prognostic stats normalized    0.727     0.495     0.723       0.727    0.725
Performance Comparison
In order to show the efficiency and popularity of the classifiers, the following comparison tables are created.
In this work we present breast cancer diagnosis using feature extraction, with a computational method that is based upon different approaches. In future we aim to improve the developed computational method by using a hybrid approach with a varying feature set. The new computational method will incorporate the best features of both approaches in order to diagnose breast cancer using feature extraction with a high accuracy rate.
Breast cancer diagnosis using feature extraction is a challenging task. SVM works well for detecting breast cancer. The presented methods show promising results for breast cancer diagnosis using feature extraction.
1 F. Firoozbakht, I. Rezaeian, L. Porter, and L. Rueda, “Breast Cancer Subtype Identification Using Machine Learning Techniques,” IEEE, pp. 6–7, 2014.
3 www.aicr.org/assets/docs/pdf/brochures/reduce-your- risk-of-breast.pdf.
5 Daniele Soria, Jonathan M. Garibaldi, Elia Biganzoli, Ian O. Ellis, “A comparison of three different methods for classification of breast cancer data.” Machine Learning and Applications, 2008. ICMLA’08. Seventh International Conference on. IEEE, 2008.
6 M. P. Sampat, M. K. Markey, and A. C. Bovik, “Computer-aided detection and diagnosis in mammography,” in Handbook of Image and Video Processing, A.C. Bovik, Ed., 2nd ed. New York: Academic, 2005, pp. 1195–1217.
7 Abdel-zaher, A. M., & Eldeib, A. M. (2015). PT US CR. Expert Systems With Applications. https://doi.org/10.1016/j.eswa.2015.10.015
8 Goodman, D. E., Boggess, L. C., & Watkins, A. B. (2002). Artificial Immune System Classification of Multiple-class Problems. In Proc. of Intelligent Engineering Systems (pp. 179-184): ASME.
9 Wolberg, W. H., Street, W. N., & Mangasarian, O. L, “Image analysis and machine learning applied to breast cancer diagnosis and prognosis”, Analytical and Quantitative Cytology and Histology, vol. 17, pp. 77-78, 1995.
10 Pena-Reyes, C. A., & Sipper, M, “A fuzzy-genetic approach to breast cancer diagnosis”, Artificial Intelligence in Medicine, vol. 17, pp. 131-155, 1999.
11 Nezafat, R., Tabesh, A., Akhavan, S., Lucas, C., & Zia, M, “Feature selection and classification for diagnosing breast cancer”, In Proceedings of international association of science and technology for development international conference, pp. 310–313, 1998.
12 Chen, Y.-W., & Lin, C.-J, “Combining svms with various feature selection strategies”, In Feature extraction Berlin Heidelberg Springer, pp. 315–324, 2006.
13 Akay, M. F, “Support vector machines combined with feature selection for breast cancer diagnosis”, Expert Systems with Applications, vol. 36, pp. 3240–3247, 2009.
14 Prasad, Y., Biswas, K., & Jain, C, “Svm classifier based feature selection using ga, aco and pso for sirna design”, In Proceedings of the first international conference on advances in swarm intelligence, pp. 307–314, 2010.
15 Bennett, K. P., & Blue, J. A, “A support vector machine approach to decision trees”, In Proceedings of IEEE world congress on computational intelligence Anchorage, AK: IEEE, pp. 2396–240, 1998.
16 Venkatadri, M., & Lokanatha, C. R, “A review on data mining from past to the future”, International Journal of Computer Applications, vol. 15, pp. 19–22, 2011.
17 Polat, K., & Güneş, S, “Breast cancer diagnosis using least square support vector machine”, Digital Signal Processing, vol. 17, pp. 694–701, 2007.
18 Furey, T. S., Cristianini, N., Duffy, N., Bednarski, D. W., Schummer, M., & Haussler, D, “Support vector machine classification and validation of cancer tissue samples using microarray expression data”, Bioinformatics, vol. 16, pp. 906–914, 2000.
19 Silagy, C. and Haines, A. (1998) Evidence based practice in primary care, BMJ Publishing, London.
20 Md. Shafiul Islam, Naima Kaabouch and Wen Chen Hu, 2013. “A survey of medical imaging used for breast cancer detection”, IEEE,Electro/Information Technology (EIT), IEEE International Conference, pp: 1-5.
21 Sachin Prasad N. and Dana Houserkova, 2007. “The role of various modalities in breast imaging”, Biomed Pap Med Fac Univ Palacky Olomouc Czech Repub,
22 Venkatadri, M., & Lokanatha, C. R. (2011). A review on data mining from past to the future. International Journal of Computer Applications, 15, 19–22.
23 Richards G, Rayward-Smith VJ, Sonksen PH, Carey S, Weng C. Data mining for indicators of early mortality in a database of clinical records. Artif Intell Med 2001; 22:215—31.
24 Chen, M. S., Han, J., & Yu, P. S. (1996). Data mining: An overview from a database perspective. IEEE Transactions on Knowledge and Data Engineering, 8, 866–883.
25 Xu, R., & Wunsch, D. (2005). Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16, 645–678.
26 Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys (CSUR), 31, 264–323.
27 V. Chaurasia and S. Pal, “Data Mining Techniques: To Predict and Resolve Breast Cancer Survivability,” vol. 3, no. 1, pp. 10– 22, 2014.
28 Djebbari, A., Liu, Z., Phan, S., AND Famili, F. International journal of computational biology and drug design (ijcbdd). 21st Annual Conference on Neural Information Processing Systems (2008).
29 S. Aruna and L. V Nandakishore, “KNOWLEDGE BASED ANALYSIS OF VARIOUS STATISTICAL TOOLS IN DETECTING BREAST,” pp. 37–45, 2011.
30 A. C. Y, “An Empirical Comparison of Data Mining Classification Methods,” vol. 3, no. 2, pp. 24–28, 2011.
31 A. Pradesh, “Analysis of Feature Selection with Classification: Breast Cancer Datasets,” Indian J. Comput. Sci. Eng., vol. 2, no.5, pp. 756–763, 2011.
32 Thorsten J. Transductive Inference for Text Classification Using Support Vector Machines. Icml. 1999; 99:200-209. doi:10.4218/etrij.10.0109.0425.
33 L. Ya-qin, W. Cheng, and Z. Lu, “Decision tree based predictive models for breast cancer survivability on imbalanced data,” pp. 1–4, 2009.
34 D. Delen, G. Walker, and A. Kadam, “Predicting breast cancer survivability: a comparison of three data mining methods,” Artif. Intell. Med., vol. 34, pp. 113–127, 2005.
35 Sau Loong Ang, Hong Choon Ong and Heng Chin Low, “Classification Using the General Bayesian Network.” Pertanika Journal of Science & Technology 24.1 (2016).
36 Idicula-Thomas, S., Kulkarni, A. J., Kulkarni, B. D., Jayaraman, V. K., & Balaji, P. V. (2006). A support vector machine-based method for predicting the propensity of a protein to be soluble or to form inclusion body on overexpression in Escherichia coli. Bioinformatics, 22, 278–284.
40 Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20, 273–297.
41 Platt, J. (1998). Sequential minimal optimization: A fast algorithm for training support vector machines. In Advances In Kernel Methods – Support Vector Learning (pp. 185–208). Cambridge, MA, USA: MIT Press.
42 Bennett, K. P., & Blue, J. A. (1998). A support vector machine approach to decision trees. In Proceedings of IEEE world congress on computational intelligence (pp. 2396–2401). Anchorage, AK: IEEE.
43 Akay, M. F. (2009). Support vector machines combined with feature selection for breast cancer diagnosis. Expert Systems with Applications, 36, 3240–3247.
44 Polat, K., & Güneş, S. (2007). Breast cancer diagnosis using least square support vector machine. Digital Signal Processing, 17, 694–701.
45 Furey, T. S., Cristianini, N., Duffy, N., Bednarski, D. W., Schummer, M., & Haussler, D. (2000). Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 16, 906–914.
46 Dhillon, I. S., Mallela, S., & Kumar, R. (2003). A divisive information theoretic feature clustering algorithm for text classification. The Journal of Machine Learning Research, 3, 1265–1287.
47 Wang, X.-Y., Zhang, X.-J., Yang, H.-Y., & Bu, J. (2012). A pixel-based color image segmentation using support vector machine and fuzzy c-means. Neural Networks, 33, 148–159.
48 K. Sivakami, “Mining Big Data: Breast Cancer Prediction using DT-SVM Hybrid Model.” International Journal of Scientific Engineering and Applied Science (IJSEAS) -Volume-1, Issue-5, August 2015.
49 Shweta Kharya, Shika Agrawal and Sunita Soni, “Naïve Bayes Classifiers: Probabilistic Detection Model for Breast Cancer.” International Journal of Computer Applications 92.10 (2014).
50 G. Ravi Kumar, Dr G. A. Ramachandra and K. Nagamani. “An Efficient Prediction of Breast Cancer Data using Data Mining Techniques.” International Journal of Innovations in Engineering and Technology (IJIET) 2.4 (2013): 139.
51 C. D. Katsis, I. Gkogkou, C. A. Papadopulos, Y. Goletsis, P. V. Boufounou, G. Stylios, “Using artificial immune recognition systems in order to detect early breast cancer.” International Journal of Intelligent Systems and Applications 5.2 (2013): 34.
52 Gouda I. Salama, M. B. Abdelhalim and Magdy Abd-elghany Zeid, “Breast cancer diagnosis on three different datasets using multi-classifiers.” Breast Cancer (WDBC) 32.569 (2012): 2.
53 Kim W, Kim KS, Lee JE, Noh DY, Kim SW, Jung YS, Park MY, Park RW, “Development of novel breast cancer recurrence prediction model using support vector machine.” Journal of Breast Cancer 15.2 (2012): 230-238.
54 Mehmet Fatih Akay, “Support vector machines combined with feature selection for breast cancer diagnosis.” Expert Systems with Applications 36 (2009) 3240–3247.
55 Diana Dumitru, “Prediction of recurrent events in breast cancer using the Naive Bayesian classification.” Annals of the University of Craiova-Mathematics and Computer Science Series 36.2 (2009): 92-96.
56 Daniele Soria, Jonathan M. Garibaldi, Elia Biganzoli, Ian O. Ellis, “A comparison of three different methods for classification of breast cancer data.” Machine Learning and Applications, 2008. ICMLA’08. Seventh International Conference on. IEEE, 2008.
59 S. I. Gallant, Communications of the ACM 31, 152 (1988).
60 G. G. Towell and J. Shavlik, Machine Learning 13, 71 (1993).
61 I. Sethi and J. Yoo, Engineering Intelligent Systems 4, 153 (1996).
62 H. Lu, R. Setiono, and H. Liu, IEEE Transactions on Knowledge and Data Engineering 8, 957 (1996).
63 E. Keedwell, A. Narayanan, and D. Savic, Evolving rules from neural networks trained on continuous data, Evolutionary computation, Proceedings of the 2000 Congress on Evolutionary Computation (2000)
64 R. Setiono and K. Leow, Applied Intelligence 12, 15 (2000).
65 V. Palade, D. Neagu, and G. Puscasu, Rule extraction from neural networks by interval propagation, Proceedings of the Fourth IEEE International Conference on Knowledge-Based Intelligent Engineering Systems Brighton, UK (2000), pp. 217–220.
66 A. S. D. Garcez, K. Broda, and D. M. Gabbay, Applied Intelligence 125, 155 (2001).
67 S. Snyders and C. Omlin, Rule extraction from knowledge-based neural networks with adaptive inductive bias, Proceedings of the Eighth International Conference on Neural Information Processing (ICONIP) (2001), Vol. 1, pp. 143–148.
68 Y. Jiang, Z. Zhou, and Z. Chen, Rule learning based on neural network ensemble, Proceedings of the International Joint Conference on Neural Networks Honolulu (2002), pp. 1416–1420.
69 R. Setiono, W. K. Leow, and J. Zurada, IEEE Transactions on Neural Networks 13, 564 (2002).
70 A. E. Elalfi, R. Haque, and M. E. Elalami, Applied Soft Computing 4, 65 (2004).
71 E. R. Hruschka and N. F. F. Ebecken, Neurocomputing 70, 384 (2006).
72 Angeline Christobel. Y, Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
73 Ross Quinlan, (1993) C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, San Mateo, CA.
74 J. Han and M. Kamber, “Data Mining Concepts and Techniques”, Morgan Kaufmann Publishers, 2000.
75 B. Hamoud and E. Atwell, “Quran question and answer corpus for data mining with weka,” 2016 Conference of Basic Sciences and Engineering Studies (SGCAC), Khartoum, 2016, pp. 211-216.
76 Kashish Ara Shakil, Shadma Anis and Mansaf Alam, “Dengue disease prediction using weka data mining tool”.
77 Ross Quinlan, (1993) C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, San Mateo, CA.