
Performance Analysis of Machine Learning
Algorithms for Disease Prediction

 

Sipiwe Phiri

Computer Engineering

Selçuk Üniversitesi

Konya, Turkey.

Email: [email protected], [email protected]

 

 

Abstract— Breast cancer is a type of cancer that forms in breast tissue.
Worldwide, it accounts for approximately 25% of all cancer cases in women.
Thanks to advances in medical research, cancer is today considered preventable
and curable in its early (primary) stages. Regrettably, a vast number of
patients are diagnosed only in the later stages of the disease. The objective
of this project is therefore to improve diagnostic accuracy and speed through a
performance analysis of machine learning classification algorithms: K-Nearest
Neighbor, Logistic Regression, Decision Trees, Random Forest, Neural Networks
and Support Vector Machines (SVM). In addition, related work published between
2013 and 2017 on the performance of machine learning algorithms in disease
prediction is reviewed. All experiments use the Wisconsin Breast Cancer dataset
and are executed in Python in a Jupyter (IPython) notebook. The results show
that Random Forest, Neural Networks and Support Vector Machines are the
best-performing algorithms in terms of accuracy for breast cancer prediction.

Keywords— Benign; Breast Cancer; Classification; Decision Trees; Efficiency;
k-NN; Logistic Regression; Machine Learning; Malignant; Neural Networks;
Random Forest; SVM.

 

I.       INTRODUCTION

Machine learning (ML) is an area of artificial intelligence in which computers
are given the ability to learn from experience without being explicitly
programmed. Having evolved from the study of computational learning theory and
pattern recognition, machine learning focuses on the construction of algorithms
that learn from data and make data-driven predictions based on what has been
learned. It incorporates a collection of probabilistic, statistical and
optimization methods that enable computers to learn from prior examples and to
identify hard-to-discern patterns in complex and noisy data sets. This makes
machine learning well suited to medical applications, particularly those that
depend on complex genomic measurements. ML algorithms such as K-Nearest
Neighbor, Logistic Regression, Decision Trees, Random Forest, Neural Networks
and Support Vector Machines are often used for disease prediction and
diagnosis. In recent medical developments, machine learning has been applied to
cancer prediction and diagnosis, with potentially life-changing results for
patients.

Breast cancer is a type of cancer that originates in the breast tissue of
humans and other mammals. It most commonly starts in the inner lining of the
milk ducts or of the lobules that supply the ducts with milk. Surgery is widely
regarded as the most effective way of eliminating the disease and increasing
the chances of remission (the absence of further signs of cancer). After
surgery, radiation therapy is employed to reduce the local relapse rate and
improve the overall survival of patients. Worldwide, breast cancer accounts for
approximately 25% of all cancer cases in women. It is more than 100 times more
common in women than in men; however, when it is discovered in men, outcomes
tend to be poor because diagnosis is delayed. Prognosis and survival rate vary
from patient to patient, as they depend heavily on the type of cancer, its
stage, the treatment and the geographical location: survival rates in the
Western world are higher than those in developing countries.

In this project, the performance of six (6) machine learning algorithms for
breast cancer diagnosis is investigated. The algorithms are analyzed in terms
of their accuracy on the training and testing subsets of the Wisconsin Breast
Cancer dataset.

II.       PROBLEM DEFINITION

Thanks to machine learning algorithms, the medical industry is undergoing rapid
development. However, medical practitioners still lose patients to late
diagnosis, wrong prognosis or the inability to predict disease outbreaks, all
of which also increase their workload. As a result, many healthcare systems
lack efficiency and accuracy. Worldwide, breast cancer accounts for
approximately 25% of all cancer cases in women. The resulting loss of life can
hopefully be reduced by using accurate, high-performance machine learning
algorithms to help medical practitioners deliver better services and save the
lives of cancer patients through correct diagnosis and prognosis.

III.       PROPOSED METHOD

To carry out the objective of this project, the following algorithms are
selected for an analysis of accuracy on the Wisconsin Breast Cancer dataset:

·       K-Nearest Neighbor

·       Logistic Regression

·       Decision Trees

·       Random Forest

·       Neural Networks

·       Support Vector Machines

The machine learning algorithms listed above are used to carry out performance
analysis tests on the dataset, which comes preloaded in the Python library
scikit-learn.
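As an illustration of how the preloaded dataset can be accessed, the following minimal sketch loads it through scikit-learn's load_breast_cancer function and inspects its dimensions. The variable names (cancer, X, y) are chosen here for illustration and do not come from the project notebook.

```python
from sklearn.datasets import load_breast_cancer

# Load the Wisconsin Breast Cancer dataset that ships with scikit-learn.
cancer = load_breast_cancer()

X = cancer.data    # feature matrix: one row per sample, one column per attribute
y = cancer.target  # class labels encoded as integers (0 = malignant, 1 = benign)

print("Samples, attributes:", X.shape)
print("Class names:", list(cancer.target_names))
```

The returned object also exposes feature_names and a DESCR text field, which describe the thirty attributes in more detail.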

IV.       LITERATURE REVIEW

This literature review offers a concise summary of existing research on machine
learning algorithms for disease prediction and, more specifically, for breast
cancer diagnosis. According to PubMed statistics, approximately 1500 research
papers have been published on cancer and machine learning. Unfortunately, most
of these papers discuss the use of machine learning techniques for cancer
detection rather than how accurately the techniques perform in cancer
diagnosis. Nonetheless, the ML algorithms implemented in related studies,
together with the scope of each study, are described below.

In their survey of machine learning algorithms for disease diagnosis, Fatima
and Pasha (2017) found that feed-forward Neural Networks provide an accuracy of
98%, compared with 96.60% for Support Vector Machines and 97.10% for Naïve
Bayes. Their survey indicates that the algorithms they examined offer enhanced
performance for the prognosis of diseases such as cancer, diabetes, liver
disease and hepatitis.

Asri et al. (2016) implemented four machine learning algorithms, Support Vector
Machines (SVM), Naïve Bayes (NB), K-Nearest Neighbor and C4.5, on the Wisconsin
Breast Cancer dataset. Their aim was to compare efficiency and effectiveness
and to identify the algorithm with the best accuracy, specificity and
sensitivity. SVM outperformed the other algorithms with an accuracy of 97.13%.

In a performance analysis of seven classification algorithms carried out with
the RapidMiner tool, Senturk et al. (2016) report that machine learning
algorithms produce results that can be used for the early diagnosis and
prognosis of breast cancer. Experimental testing was done using Artificial
Neural Networks, Decision Trees, Discriminant Analysis, K-NN, Logistic
Regression, Naïve Bayes and SVM. The authors selected SVM as the best
algorithm, with the highest accuracy of 96%.

Shiny et al. (2015) implemented data mining algorithms to analyze breast
cancer. The authors predicted breast cancer using various algorithms, which
they evaluated using classification matrices and lift charts. Because there was
little training information to learn from, none of the models accurately
classified the malignant data. The algorithms that yielded the better results
were Artificial Neural Networks (ANN), Naïve Bayes and Logistic Regression; the
worst results were obtained from the entropy tree.

In an investigation of the application of decision trees to breast cancer
diagnosis, Shajahaan et al. (2013) analyzed algorithms such as Naïve Bayes,
CART, ID3 and C4.5. They highlighted five important attributes that can be
considered for breast cancer prediction and concluded that Random Forest,
having the highest prediction accuracy, is the best classification algorithm.

Shrivastava et al. (2013) also used machine learning classification algorithms
to classify the malignant and benign instances of the Wisconsin Breast Cancer
dataset. Using Weka for their experiments, the authors built classifier models
with decision tree and neural network approaches. Owing to the if-then rules
that enhance the performance of decision trees, the decision tree algorithm
achieved the better accuracy.

This review shows that the machine learning algorithms selected for this
project produce high accuracy when compared against other algorithms, with most
achieving an accuracy above 90%. Hence, K-NN, Logistic Regression, Decision
Trees, Random Forest, Neural Networks and Support Vector Machine algorithms are
used for experimental testing in this project.

V.       METHODOLOGY

In this project six machine learning algorithms are implemented and tested on
the Wisconsin Breast Cancer dataset. The dataset contains 569 samples, each
derived from a digitized image of a fine-needle aspirate (FNA) of a breast
tumor. Thirty (30) attributes describe each sample, and each sample is labelled
as malignant or benign. In machine learning classification, this data is fed
into an algorithm together with the class to which each set of features
belongs; in this process the algorithm ‘learns’. After training, the algorithm
is tested on new data: only the features are supplied, and the algorithm is
required to predict whether each sample is malignant or benign.
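The train-then-test procedure described above can be sketched as follows. This is a minimal illustration rather than the exact code used in the project: the 75/25 stratified split, the random seed and the choice of K-Nearest Neighbor with five neighbors are all assumptions made here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

cancer = load_breast_cancer()

# Hold out part of the data for testing. The 75/25 stratified split and the
# seed are assumptions, not settings reported in this paper.
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target,
    test_size=0.25, stratify=cancer.target, random_state=0,
)

# Training: the algorithm sees both the features and the class labels.
clf = KNeighborsClassifier(n_neighbors=5)  # illustrative hyperparameter
clf.fit(X_train, y_train)

# Testing: only the features are supplied and the classes are predicted.
predictions = clf.predict(X_test)

# Accuracy is the fraction of samples whose predicted class is correct.
train_accuracy = clf.score(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)
print(f"train accuracy: {train_accuracy:.3f}, test accuracy: {test_accuracy:.3f}")
```

Comparing the two accuracies gives a first indication of overfitting: a model that scores much higher on the training subset than on the testing subset has memorized the training data rather than generalized from it.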

The machine learning algorithms K-Nearest Neighbor, Logistic Regression,
Decision Trees, Random Forest, Neural Networks and Support Vector Machine (SVM)
are chosen for the accuracy analysis because they are widely used and have a
strong record of producing good results. Furthermore, each of these algorithms
takes a different approach to building a classification model, which increases
the probability of obtaining a prediction model with high classification
accuracy.
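A comparison of the six algorithms might be set up along the following lines. This is a sketch under stated assumptions: every hyperparameter, the random seeds, the default 75/25 split and the use of StandardScaler for the scale-sensitive models are illustrative choices made here, so the printed accuracies need not match the figures reported in this paper.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=0
)

# One entry per algorithm. All settings are illustrative assumptions; the
# scale-sensitive models are wrapped in a pipeline that standardizes features.
models = {
    "K-NN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Decision Tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Neural Network": make_pipeline(
        StandardScaler(), MLPClassifier(max_iter=1000, random_state=0)
    ),
    "SVM": make_pipeline(StandardScaler(), SVC()),
}

# Fit each model and record (training accuracy, testing accuracy).
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = (model.score(X_train, y_train), model.score(X_test, y_test))

for name, (train_acc, test_acc) in results.items():
    print(f"{name:<20} train={train_acc:.3f}  test={test_acc:.3f}")
```

Keeping the models in a single dictionary means that every algorithm is trained and scored on exactly the same split, which is what makes the accuracies comparable.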

A.     K-Nearest Neighbor

B.     Logistic Regression

C.     Decision Trees

D.     Random Forest

E.     Neural Networks

F.     Support Vector Machine (SVM)

 

VI.       COMPARISON OF ML ALGORITHMS

In this sectio

VII.       TESTS & RESULTS

In this section we
discuss the results obtained from the experimental accuracy tests that were
carried out on each of the selected ML algorithms.

A.     Tests

The

1.     
 

B.     Results

The table below depicts
the accuracy results obtained from the experimental testing of each of the
machine learning algorithms over the Wisconsin Breast Cancer dataset.

ML Algorithm           Accuracy on training set (%)   Accuracy on testing set (%)
K-NN                   94.6                           93
Logistic Regression    97.2                           96.5
Decision Tree          98.8                           95.1
Random Forest          100                            97.2
Neural Network         98.8                           97.2
SVM                    98.8                           97.2

 

Below is a graphical
representation of the results obtained from the experimental tests performed.

VIII.       CONCLUSION

A wide variety of machine learning algorithms, techniques and libraries are
readily available for the analysis of medical data. The challenge, however, is
to construct high-performance, accurate classifiers for medical applications.
In this project, six machine learning algorithms, namely K-Nearest Neighbor,
Logistic Regression, Decision Trees, Random Forest, Neural Networks and Support
Vector Machines, are compared on the Wisconsin Breast Cancer dataset in terms
of accuracy. On the test set, the best-performing algorithms are Random Forest,
Neural Networks and Support Vector Machines, each with an accuracy of 97.2%. Of
these, Random Forest can be considered the better-performing algorithm, as it
requires little to no preprocessing of the data in comparison to Neural
Networks and Support Vector Machines (SVM).
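The point about preprocessing can be illustrated with a small experiment that fits an SVM and a Random Forest on the raw features and again on standardized features. This is a hedged sketch: the split, the seeds, the default hyperparameters and the use of StandardScaler are assumptions made here, and the size of any difference will depend on those choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=0
)

# Each algorithm is evaluated with and without feature standardization.
candidates = {
    "SVM (raw features)": SVC(),
    "SVM (standardized)": make_pipeline(StandardScaler(), SVC()),
    "Random Forest (raw features)": RandomForestClassifier(random_state=0),
    "Random Forest (standardized)": make_pipeline(
        StandardScaler(), RandomForestClassifier(random_state=0)
    ),
}

# fit() returns the fitted estimator, so score() can be chained onto it.
test_accuracy = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in candidates.items()
}

for name, acc in test_accuracy.items():
    print(f"{name:<30} test accuracy = {acc:.3f}")
```

Tree-based models split on one feature at a time, so rescaling a feature does not change which splits are available; an SVM with a distance-based kernel, by contrast, is sensitive to the relative scale of the attributes.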

IX.       REFERENCES
