Performance Analysis of Machine Learning Algorithms for Disease Prediction

Sipiwe Phiri
Computer Engineering
Selçuk Üniversitesi
Konya, Turkey
Email: [email protected], [email protected]

Abstract— Breast cancer is a type of cancer that forms in the breast tissue. Worldwide, breast cancer accounts for up to 25% of all cancer cases in women. However, as a result of advancements in medical research, cancer is today considered preventable and curable in its early (primary) stages. Regrettably, a vast number of cancer patients are diagnosed in the later stages of the disease. The objective of this project is therefore to improve diagnosis accuracy and time through a performance analysis of machine learning classification algorithms: K-Nearest Neighbor, Logistic Regression, Decision Trees, Random Forest, Neural Networks and Support Vector Machines (SVM). Furthermore, the results of related works from 2013 to 2017 on machine learning algorithm performance in disease prediction are reviewed. All experiments are executed on the Wisconsin Breast Cancer dataset using Python in a Jupyter notebook (IPython).
The results of the tests carried out in this project show that Random Forest, Neural Networks and Support Vector Machines are the better-performing algorithms in terms of accuracy for breast cancer prediction.

Keywords— Benign; Breast Cancer; Classification; Decision Trees; Efficiency; k-NN; Logistic Regression; Machine Learning; Malignant; Neural Networks; Random Forest; SVM.

I. INTRODUCTION

Machine learning (ML) is an area of artificial intelligence in which computers are given the ability to learn through experience without being explicitly programmed. Evolving from studies of computational learning theory and pattern recognition in artificial intelligence, machine learning focuses on the construction of algorithms that can learn from data and make data-driven predictions based on what has been learned. As a branch of artificial intelligence, machine learning incorporates a collection of probabilistic, statistical and optimization methods that enable computers to learn from prior examples and identify hard-to-discern patterns in complex and noisy data sets.
This feature makes machine learning well suited to medical applications, particularly applications that depend on complex genomic measurements. ML algorithms such as K-Nearest Neighbor, Logistic Regression, Decision Trees, Random Forest, Neural Networks and Support Vector Machines are often used for disease prediction and diagnosis. In recent medical developments, machine learning has been applied to cancer prediction and diagnosis, which has led to life-changing medical advances.

Breast cancer is a type of cancer that originates in the breast tissue of humans and other mammals. It most commonly starts in the inner lining of the milk ducts or in the lobules that supply the ducts with milk. Surgery is widely regarded as the most effective way of eliminating the disease and increasing the chances of remission (the absence of further signs of cancer). After surgery, radiation is employed to substantially reduce the local relapse rate and improve the overall survival of patients.
Worldwide, breast cancer now accounts for approximately 25% of all cancer cases in women. Breast cancer is more than 100 times more common in women than in men. However, when it is discovered in men the outcomes tend to be poor because of delayed diagnosis. The prognosis and survival rate of breast cancer patients vary from one patient to another, as they depend heavily on the type of cancer, its stage, the treatment and the geographical location; survival rates of patients in the Western world are higher than those in developing countries.
In this project, the performance of six (6) machine learning algorithms for breast cancer diagnosis is investigated. The algorithms are analyzed in terms of their accuracy on the training and testing subsets of the Wisconsin Breast Cancer dataset.

II. PROBLEM DEFINITION

Thanks to machine learning algorithms, the medical industry is undergoing rapid development today. However, medical practitioners still lose patients due to late diagnosis, wrong prognosis or the inability to predict disease outbreaks, all of which also increase their workload. This reflects a lack of efficiency and accuracy in most healthcare systems. Amongst all cancers, breast cancer is responsible for approximately 25% of cancer cases in women worldwide. Its toll can hopefully be reduced by using high-performance, accurate machine learning algorithms to help medical practitioners deliver better services and save the lives of cancer patients through correct diagnosis and prognosis.

III. PROPOSED METHOD

To carry out the objective of this project, the following algorithms are selected for an analysis of accuracy on the Wisconsin Breast Cancer dataset:

· K-Nearest Neighbor
· Logistic Regression
· Decision Trees
· Random Forest
· Neural Networks
· Support Vector Machines

These machine learning algorithms are used to carry out performance analysis tests on the proposed dataset, which comes preloaded in Python's scikit-learn library.
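The proposed comparison could be sketched roughly as follows. This is only an illustration under stated assumptions, not the configuration used in this project: the 75/25 stratified split, the random seed, the hyperparameters (for example `max_depth=4` and `max_iter=1000`) and the use of `StandardScaler` for the scale-sensitive models are all assumed here, so the printed accuracies will not necessarily match the reported results.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Load the dataset bundled with scikit-learn as feature matrix X and labels y.
X, y = load_breast_cancer(return_X_y=True)

# Assumed split: scikit-learn's default 75% training / 25% testing, stratified by class.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Assumed settings for the six algorithms; scale-sensitive models get a scaler.
models = {
    "K-NN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Decision Tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Neural Network": make_pipeline(
        StandardScaler(), MLPClassifier(max_iter=1000, random_state=0)
    ),
    "SVM": make_pipeline(StandardScaler(), SVC()),
}

# Fit each model and record (training accuracy, testing accuracy).
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    train_acc, test_acc = results[name]
    print(f"{name:20s} train={train_acc:.3f} test={test_acc:.3f}")
```

Reporting both training and testing accuracy makes overfitting visible: a model that scores far higher on the training subset than on the testing subset has likely memorized the training data.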
IV. LITERATURE REVIEW

This literature review offers a concise summary of existing research on machine learning algorithms for disease prediction and, more specifically, for breast cancer diagnosis. According to PubMed statistics, approximately 1500 research papers have been published in relation to cancer and machine learning.
Unfortunately, most of these research papers discuss the use of machine learning techniques for the detection of cancer and not how accurately the techniques perform in cancer diagnosis. Nonetheless, the ML algorithms implemented in related work, together with their scope, are summarized below.

In their survey of machine learning algorithms for disease diagnosis, Fatima and Pasha (2017) found that Feed-forward Neural Networks provide an accuracy of 98%, compared with 96.60% for Support Vector Machines and 97.10% for Naïve Bayes. Their survey indicates that the algorithms they examined offer enhanced performance for the prognosis of diseases such as cancer, diabetes, liver disease and hepatitis.

Asri et al. (2016) implemented four machine learning algorithms, Support Vector Machines, Naïve Bayes (NB), K-Nearest Neighbor and C4.5, on the Wisconsin Breast Cancer dataset. Their aim was to compare efficiency and effectiveness in order to find the algorithm with the best accuracy, specificity and sensitivity. Support Vector Machines (SVM) outperformed the other algorithms with an accuracy of 97.13%.

Senturk et al. (2016) analyzed the performance of seven classification algorithms using the RapidMiner tool and reported that machine learning algorithms produce results that can be used for the early diagnosis and prognosis of breast cancer. Experimental testing was done using Artificial Neural Networks, Decision Trees, Discriminant Analysis, K-NN, Logistic Regression, Naïve Bayes and SVM. The authors selected SVM as the best algorithm, with the highest accuracy of 96%.

Shiny et al. (2015) implemented data mining algorithms to analyze breast cancer. The authors carried out the prediction of breast cancer using various algorithms, which were evaluated using classification matrices and lift charts. Because little training information was available to learn from, none of the models accurately classified the malignant data. However, the algorithms that yielded better results were Artificial Neural Networks (ANN), Naïve Bayes and Logistic Regression.
The worst results were obtained from the entropy tree.

In an investigation of the application of decision trees in breast cancer diagnosis, Shajahaan et al. (2013) analyzed algorithms such as Naïve Bayes, CART, ID3 and C4.5. Five important attributes were highlighted which can be considered for breast cancer prediction. The authors concluded that Random Forest, having the highest prediction accuracy, is the best classification algorithm.

Shrivastava et al. (2013) also made use of machine learning classification algorithms to classify the malignant and benign instances of the Wisconsin Breast Cancer dataset.
Using Weka for their experimental tests, the authors created classifier models using decision tree and neural network approaches. Owing to the if-then rules used to enhance the performance of decision trees, that algorithm achieved the better accuracy results.

From this review, it is concluded that the machine learning algorithms selected for this project have been shown to produce high accuracy when compared against other algorithms, with most achieving an accuracy above 90%. Hence, the K-NN, Logistic Regression, Decision Trees, Random Forest, Neural Networks and Support Vector Machine algorithms are used for experimental testing in this project.
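Several of the studies reviewed above compare algorithms by accuracy, sensitivity and specificity. As a reminder of how these three metrics relate to the counts in a binary confusion matrix, here is a minimal sketch in plain Python. The function name `classification_metrics` and the label vectors are hypothetical and purely illustrative; they are not data from any of the cited works.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, sensitivity and specificity for binary labels.

    Assumes both classes occur in y_true, so neither denominator is zero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return {
        "accuracy": (tp + tn) / len(pairs),  # share of all correct predictions
        "sensitivity": tp / (tp + fn),       # share of positives that were found
        "specificity": tn / (tn + fp),       # share of negatives that were found
    }


# Hypothetical labels: 1 = malignant (positive class), 0 = benign.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

metrics = classification_metrics(y_true, y_pred)
for metric_name, value in metrics.items():
    print(f"{metric_name}: {value:.3f}")
```

In a diagnostic setting, sensitivity is often the metric to watch most closely, since a missed malignant case (a false negative) is usually costlier than a false alarm.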
V. METHODOLOGY

In this project six machine learning algorithms are implemented and tested on the Wisconsin Breast Cancer dataset. The dataset contains 569 samples whose features are computed from digitized images of fine-needle aspirates (FNA) of breast tumors. Thirty (30) attributes describe each sample. Each sample is labelled as either a malignant or a benign tumor.
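The dataset description above can be checked directly against the copy bundled with scikit-learn. The short sketch below assumes that `sklearn.datasets.load_breast_cancer` is the "preloaded" dataset referred to in this paper; the variable names are illustrative.

```python
from collections import Counter

from sklearn.datasets import load_breast_cancer

# Load the copy of the Wisconsin Breast Cancer dataset bundled with scikit-learn.
data = load_breast_cancer()
X, y = data.data, data.target

n_samples, n_features = X.shape

# Map the integer labels back to their class names ("malignant", "benign").
class_counts = Counter(str(data.target_names[label]) for label in y)

print(f"samples: {n_samples}, attributes: {n_features}")
print(f"class counts: {dict(class_counts)}")
```

The class counts show that the two classes are not equally represented, which is one reason to look beyond plain accuracy when comparing classifiers.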
In machine learning classification, this data is fed into an algorithm together with the class to which each set of features belongs. In this process the algorithm 'learns'. After training, the algorithm is tested on new data: only the features are supplied, and the algorithm is required to predict whether each sample is of class malignant or benign. The machine learning algorithms K-Nearest Neighbor, Logistic Regression, Decision Trees, Random Forest, Neural Networks and Support Vector Machine (SVM) are chosen for the performance/accuracy analysis because they are widely used and have a strong track record of accurate results. Furthermore, each of these algorithms employs a different approach to generating classification models, which increases the probability of obtaining a prediction model with high classification accuracy.

A. K-Nearest Neighbor

B. Logistic Regression

C. Decision Trees

D. Random Forest

E. Neural Networks

F. Support Vector Machine (SVM)

VI. COMPARISON OF ML ALGORITHMS

In this section ...

VII. TESTS & RESULTS

In this section we discuss the results obtained from the experimental accuracy tests that were carried out on each of the selected ML algorithms.

A. Tests

The 1.

B. Results

The table below shows the accuracy results obtained from the experimental testing of each of the machine learning algorithms on the Wisconsin Breast Cancer dataset.

ML Algorithm           Accuracy on training set (%)   Accuracy on testing set (%)
K-NN                   94.6                           93
Logistic Regression    97.2                           96.5
Decision Tree          98.8                           95.1
Random Forest          100                            97.2
Neural Network         98.8                           97.2
SVM                    98.8                           97.2

Below is a graphical representation of the results obtained from the experimental tests performed.

VIII. CONCLUSION

For the analysis of medical information or data, a vast variety of machine learning algorithms, techniques and libraries are readily available. However, the challenge is to construct high-performance and computationally accurate classifiers for medical applications. In this project, six main machine learning algorithms, namely K-Nearest Neighbor, Logistic Regression, Decision Trees, Random Forest, Neural Networks and Support Vector Machines, are applied to the Wisconsin Breast Cancer dataset and compared in terms of performance and accuracy. In terms of accuracy on the test set, the best-performing algorithms are Random Forest, Neural Networks and Support Vector Machines.
In this analysis, Random Forest can be considered the better-performing algorithm, as it requires little to no preprocessing of the data in comparison to Neural Networks and Support Vector Machines (SVM).

IX. REFERENCES