Artificial intelligence (AI) has become an essential instrument for researchers seeking to predict the occurrence of disease accurately. Stroke is one of the major causes of death among people over 65 years of age. If we can predict whether a person is likely to experience a stroke, that person can be treated before the life-threatening event occurs; early detection and treatment can save both life and money. In this work, the Cardiovascular Health Study (CHS) dataset is used for stroke prediction. The CHS dataset is complex and contains a large amount of inconsistent and unwanted data, so understanding it and extracting the hidden knowledge is the main challenge. A highly effective predictive method is therefore needed to increase efficiency and precision.
Powerful machine learning (ML) techniques are required that can predict outcomes from data without stringent statistical assumptions. The ML techniques most commonly used to induce a predictive model from a dataset are the Support Vector Machine (SVM), the Decision Tree (DT), and the Artificial Neural Network (ANN). These three techniques are widely used in AI models for outcome prediction. SVM is a powerful supervised machine learning technique used for classification [2]. A decision tree is one of the simplest yet fairly accurate predictive techniques, and it is commonly used to derive a strategy for reaching a particular goal. ANN is a widely used ML technique; we use a feed-forward back-propagation neural network for stroke prediction. In this work, the C4.5 decision tree algorithm is used for feature selection, and principal component analysis (PCA) is used for dimension reduction.
The rest of the paper is organized as follows. Section 2 describes related work on the prediction of events. Section 3 describes the methods and techniques used in our model. Section 4 presents the outcomes obtained with the different prediction methods, and Section 5 discusses the conclusion.
II. RELATED WORK
A. Support Vector Machine (SVM)
In [3], the author used three methods, MLR, RBFNN, and SVM, to predict toxicity for two different datasets. The first dataset includes 76 compounds and their corresponding toxicity values; the second includes 146 compounds. Each dataset was split into two subsets, with 80 percent used for training and the remaining 20 percent for testing. After MLR, RBFNN, and SVM were applied, the results were compared on the basis of RMS error. The comparison showed that SVM had better classification and generalization ability than the other two methods.
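The evaluation protocol described above (an 80/20 split followed by a comparison on RMS error) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the procedure of [3]: the function names, the fixed random seed, and the placeholder predictions for `model_a` and `model_b` are all assumptions made for the example and are not results from any study.

```python
import math
import random


def split_80_20(samples, seed=0):
    """Shuffle a copy of the samples and split it 80% / 20%."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.8 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def rms_error(y_true, y_pred):
    """Root-mean-square error between observed and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


# Placeholder sample identifiers standing in for a 76-item dataset.
train, test = split_80_20(range(76))

# Illustrative observed values and predictions (made up for this sketch).
observed = [1.0, 2.0, 3.0]
predictions = {
    "model_a": [1.5, 2.5, 3.5],
    "model_b": [1.1, 2.1, 2.9],
}

# The model with the lowest RMS error on the test values is preferred.
errors = {name: rms_error(observed, pred) for name, pred in predictions.items()}
best = min(errors, key=errors.get)
```

A lower RMS error on the held-out 20 percent indicates better generalization, which is the criterion used in the comparison above.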
B. Decision Tree (DT)
In [10], a decision tree with the C4.5 algorithm is used to predict prognosis in severe traumatic brain injury. The author used the Waikato Environment for Knowledge Analysis (Weka) tool to implement the C4.5 algorithm on the Traumatic Brain Injury (TBI) dataset, which consists of 748 patient records with 18 attributes each. The generated model achieved an accuracy of 87%.
In our earlier work [1], a decision tree with the C4.5 algorithm was used to extract features from the pre-processed dataset. We used the gain ratio, a constituent function of the C4.5 algorithm, to select the best features for classification.
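To make the gain ratio criterion concrete, the sketch below computes it for a single categorical feature as information gain divided by split information, following the standard C4.5 definition. It is a simplified illustration and not the implementation used in [1]: it handles categorical values only (C4.5 also thresholds continuous attributes), and the function names and toy data are assumptions.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(values).values()
    )


def gain_ratio(feature_values, labels):
    """Gain ratio of one categorical feature with respect to the class labels."""
    total = len(labels)
    # Group the class labels by the value the feature takes.
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Expected entropy of the labels after splitting on the feature.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    # Split information penalizes features with many distinct values.
    split_info = entropy(feature_values)
    return info_gain / split_info if split_info > 0 else 0.0


# Toy data (hypothetical): one feature separates the classes, one does not.
labels = ["stroke", "stroke", "none", "none"]
informative = ["high", "high", "low", "low"]
uninformative = ["a", "b", "a", "b"]

ranked = sorted(
    {"informative": informative, "uninformative": uninformative}.items(),
    key=lambda item: gain_ratio(item[1], labels),
    reverse=True,
)
```

Ranking the features by gain ratio and keeping the highest-scoring ones is one simple way to turn this criterion into a feature selection step.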
C. Artificial Neural Network (ANN)
In [7], an artificial neural network is used to predict the stock market. The dataset used for the experiment was the real exchange rate value of the NASDAQ Stock Market index price between 2012 and 2013, with five input variables for the classification process, such as the opening price, the lowest price, and the highest price. The author used Multi-Layer Perceptron (MLP) networks, which are layered feed-forward networks, for classification and back-propagation for training. The results show an accuracy of 99% and an error of less than 2%.
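As a rough illustration of a layered feed-forward network trained with back-propagation, the sketch below implements a single-hidden-layer perceptron in plain Python. It is not the network of [7]: the layer sizes, sigmoid activations, squared-error loss, learning rate, epoch count, and the toy logical-OR data are all assumptions chosen to keep the example small.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyMLP:
    """One hidden layer, one sigmoid output; the last weight in each row is a bias."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        self.w_hidden = [
            [rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hidden)
        ]
        self.w_out = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]

    def forward(self, x):
        """Feed-forward pass: returns hidden activations and the output."""
        hidden = [
            sigmoid(sum(w * v for w, v in zip(row, x)) + row[-1])
            for row in self.w_hidden
        ]
        output = sigmoid(
            sum(w * h for w, h in zip(self.w_out, hidden)) + self.w_out[-1]
        )
        return hidden, output

    def train_step(self, x, target, lr=0.5):
        """One back-propagation update for the loss 0.5 * (output - target)^2."""
        hidden, output = self.forward(x)
        # Error signal at the output unit (chain rule through the sigmoid).
        delta_out = (output - target) * output * (1 - output)
        # Error signals propagated back to the hidden units.
        delta_hidden = [
            delta_out * self.w_out[j] * h * (1 - h) for j, h in enumerate(hidden)
        ]
        for j, h in enumerate(hidden):
            self.w_out[j] -= lr * delta_out * h
        self.w_out[-1] -= lr * delta_out
        for j, row in enumerate(self.w_hidden):
            for i, v in enumerate(x):
                row[i] -= lr * delta_hidden[j] * v
            row[-1] -= lr * delta_hidden[j]


def mean_squared_error(net, data):
    return sum((net.forward(x)[1] - t) ** 2 for x, t in data) / len(data)


# Toy training set (logical OR), used only to exercise the code.
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
net = TinyMLP(n_in=2, n_hidden=3)
loss_before = mean_squared_error(net, data)
for _ in range(2000):
    for x, target in data:
        net.train_step(x, target)
loss_after = mean_squared_error(net, data)
```

The forward pass computes the prediction layer by layer, and the backward pass reuses the stored activations to adjust every weight in proportion to its contribution to the error.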
III. METHODOLOGY
The methodology is divided into five stages, as follows:
A) Dataset collection
B) Pre-processing of dataset
C) Feature selection using DT
D) Dimension reduction using PCA
E) Classification models
A. Dataset Collection
The dataset used in this work is the Cardiovascular Health Study (CHS) dataset. CHS is a population-based longitudinal study of coronary heart disease and stroke in adults aged 65 years and older [4], and the dataset is available from the official website of the National Heart, Lung and Blood Institute (NHLBI). The CHS dataset includes more than 600 attributes for each of its 5,888 samples. More than 50% of this information is not related to stroke. TABLE 1 shows the disease type, keyword, and class present in the CHS dataset.