PREDICTING DENGUE DISEASE
DINKY KHATRI, HARSHIT WADHWA
Department of Computer Science, The NorthCap University
Gurugram, India
ABSTRACT
The main objective of this research is to use classification techniques to predict the number of dengue-prone cases in Jhelum district and the surrounding areas. We compare the performance of different classification techniques and algorithms so that readers can easily identify a suitable algorithm for building an accurate and precise predictive model. Naïve Bayes, J48 and SMO are the most suitable algorithms in terms of classification accuracy: each achieved the maximum accuracy of 100%, with 99 correctly classified instances and the lowest mean absolute errors, and Naïve Bayes also reached the maximum ROC area of 1.
1. INTRODUCTION
Dengue infection is a major disease caused by the dengue virus, which is transmitted to humans by the bite of female mosquitoes [1]. Symptoms include headache, sudden-onset fever, retro-orbital pain, joint pain, muscle pain and a rash [2]. Dengue is also known as "breakbone fever", a name that comes from the associated muscle and joint pains. Dengue infection is widespread and puts about 2.5 billion people at risk around the world. Every year about 50 million people suffer from this life-threatening disease globally [1].
According to World Health Organization research, dengue infection is divided into two major types, i.e., type 1 and type 2 [3]. The first is the classical form, called dengue fever, and the second is called dengue hemorrhagic fever (DHF). DHF is further divided into four categories: DHF1, DHF2, DHF3 and DHF4. DHF begins with a fever that lasts 3 to 7 days, with signs including plasma leakage, shock and weak pulse.
Different techniques and algorithms can be designed and used for dengue fever classification, such as the Naïve Bayes classifier, decision trees, k-nearest neighbours (KNN), multilayer perceptrons and support vector machines (SVM) [1,4,5]. These techniques are evaluated on five measures common in data mining: accuracy, precision, sensitivity, specificity and negative rate.
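The first four of these measures follow directly from the counts of true and false positives and negatives. The sketch below is only an illustration: the function names and the toy labels are hypothetical and are not taken from the dengue dataset, and the "negative rate" is left out because its definition is not given here.

```python
def confusion_counts(actual, predicted, positive="dengue"):
    """Count true/false positives and negatives for a two-class problem."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def evaluate(actual, predicted, positive="dengue"):
    """Return accuracy, precision, sensitivity and specificity as fractions."""
    tp, fp, tn, fn = confusion_counts(actual, predicted, positive)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }


# Hypothetical labels, for illustration only (not the paper's data).
actual = ["dengue", "dengue", "dengue", "healthy", "healthy", "healthy"]
predicted = ["dengue", "dengue", "healthy", "healthy", "healthy", "dengue"]
metrics = evaluate(actual, predicted)
```

With these toy labels, two of the three dengue cases and two of the three healthy cases are predicted correctly, so all four measures come out at about two thirds.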
Several researchers have worked on dengue classification, among them Tanner et al. and Tarig et al. Tanner's team used the decision tree approach, one of the best-known data mining algorithms; they classified up to 1200 patient records and found 6 important features [3]. They obtained an accuracy rate of 84%. Tarig's team used Self-Organizing Maps (SOM) and multilayer feed-forward neural networks (MFNN). They grouped patients into two sets and obtained only 70% correctness, whereas Fatimah Ibrahim et al. used multilayer perceptrons (MLP) and obtained up to 90% accuracy. Daranee et al. used the decision tree method to group dengue patients from two data sets [4]. They reported accuracy rates of 97.6% and 96.6%, respectively.
We use the following algorithms and techniques: Naïve Bayes, J48, SMO, REP Tree and Random Tree [5]. The WEKA tool was used as the data mining tool for classifying the data.
Figure 1: Symptoms of dengue disease
2. TOOL USED
2.1 WEKA
Waikato Environment for Knowledge Analysis (WEKA) is machine learning software written in Java and developed at the University of Waikato, New Zealand. It is free software licensed under the GNU General Public License. WEKA is a convenient data mining tool for comparing classification accuracy on a dataset under different algorithmic approaches [8]. Our main objective is to identify whether or not a patient is affected by dengue. A set of parameters is used to predict the fever and to compare the performance of the various classification techniques.
The Explorer interface features several panels providing access to the main components of the workbench:
· The Preprocess panel has facilities for importing data from a database, a CSV file, etc., and for pre-processing this data.
· The Classify panel enables applying classification and regression algorithms (called classifiers in WEKA) to the resulting dataset, to estimate the accuracy of the resulting predictive model, and to visualize erroneous predictions, receiver operating characteristic (ROC) curves, etc., or the model itself (if the model is amenable to visualization, e.g., a decision tree).
3. VARIOUS TERMS
1. Correctly Classified Accuracy
It shows the percentage of test instances that are correctly classified.
2. Incorrectly Classified Accuracy
It shows the percentage of test instances that are incorrectly classified.
3. Mean Absolute Error
It shows the average magnitude of the prediction errors; lower values indicate a more accurate classifier.
4. Time
It shows how much time is required to build the model used to predict the disease.
5. ROC Area
The area under the Receiver Operating Characteristic curve is a guide to the performance of a diagnostic test: excellent (0.90-1), good (0.80-0.90), fair (0.70-0.80), poor (0.60-0.70), fail (0.50-0.60).
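The mean absolute error and the ROC area can be illustrated with a short sketch. It assumes a two-class problem in which each prediction is a probability for the positive class, computes the ROC area as the fraction of positive/negative pairs that are ranked correctly, and uses the conventional rating bands with "fair" taken as 0.70-0.80. The function names and the toy scores are hypothetical; WEKA's own evaluation may differ in detail.

```python
def mean_absolute_error(labels, scores):
    """Average |score - label|, where label is 0 or 1 and score is P(class 1)."""
    return sum(abs(s - y) for y, s in zip(labels, scores)) / len(labels)


def roc_area(labels, scores):
    """ROC area as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def roc_rating(area):
    """Map a ROC area onto the conventional rating bands."""
    if area >= 0.90:
        return "excellent"
    if area >= 0.80:
        return "good"
    if area >= 0.70:
        return "fair"
    if area >= 0.60:
        return "poor"
    return "fail"


# Hypothetical labels and predicted probabilities, for illustration only.
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.1]
mae = mean_absolute_error(labels, scores)
area = roc_area(labels, scores)
rating = roc_rating(area)
```

On this scale the ROC areas reported in this paper rate as follows: 0.979 and 0.909 are excellent, 0.881 is good and 0.547 is a fail.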
4. DATASET USED
The dataset was collected from District Headquarter Hospital (DHQ) Jhelum. Different classification techniques were applied to categorize it.
5. DATASET ATTRIBUTES
Figure 2: Attributes of the dengue dataset
6. CLASSIFICATIONS
6.1 NAÏVE BAYES
The Naïve Bayes classifier is based on applying Bayes' theorem. It works as a probabilistic classifier, i.e. it predicts class membership probabilities. It is not a single algorithm but a family of algorithms based on a common principle: all Naïve Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class. We applied the Naïve Bayes algorithm to the dataset attributes using 10-fold cross-validation. The algorithm produced an accuracy of 100%, with 99 correctly classified instances. The mean absolute error is 0.0011, i.e. a small error remains. The time taken to build the model is 0 seconds and the ROC area is 1, as shown in the figure.
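The independence assumption described above can be made concrete with a minimal sketch of a Naïve Bayes classifier for categorical attributes. This is not WEKA's implementation: the class name, the add-one smoothing and the two toy attributes (fever, joint pain) are assumptions chosen for illustration, and the rows are not taken from the dengue dataset.

```python
from collections import Counter, defaultdict


class CategoricalNaiveBayes:
    """Naive Bayes for categorical attributes with add-one smoothing."""

    def fit(self, rows, labels):
        self.class_counts = Counter(labels)
        # value_counts[(class, attribute index)][value] -> count
        self.value_counts = defaultdict(Counter)
        # Distinct values seen for each attribute, used for smoothing.
        self.values = [set() for _ in range(len(rows[0]))]
        for row, label in zip(rows, labels):
            for i, value in enumerate(row):
                self.value_counts[(label, i)][value] += 1
                self.values[i].add(value)
        return self

    def predict_proba(self, row):
        total = sum(self.class_counts.values())
        scores = {}
        for label, count in self.class_counts.items():
            # Prior P(class) times the product of P(value | class).
            score = count / total
            for i, value in enumerate(row):
                seen = self.value_counts[(label, i)][value]
                score *= (seen + 1) / (count + len(self.values[i]))
            scores[label] = score
        norm = sum(scores.values())
        return {label: s / norm for label, s in scores.items()}

    def predict(self, row):
        proba = self.predict_proba(row)
        return max(proba, key=proba.get)


# Hypothetical rows of (fever, joint pain), for illustration only.
rows = [("yes", "yes"), ("yes", "yes"), ("yes", "no"),
        ("no", "no"), ("no", "no"), ("no", "yes")]
labels = ["dengue", "dengue", "dengue", "healthy", "healthy", "healthy"]
model = CategoricalNaiveBayes().fit(rows, labels)
proba = model.predict_proba(("yes", "yes"))
```

Each attribute contributes its own factor to the class score, which is exactly where the independence assumption enters; the smoothing keeps an unseen attribute value from forcing a probability of zero.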
6.2 J48 TREE
J48 is a Java implementation of the C4.5 algorithm in the Weka data mining tool. J48 builds a decision tree that predicts the target value from the attributes of the dataset. We applied this algorithm to the dengue prediction dataset using 10-fold cross-validation and analysed the resulting statistics. The algorithm correctly classified 100% of a total of 99 instances. The mean absolute error is exactly 0. The time required to build the model was 0 seconds and the ROC area achieved is 0.979, as we can see in the figure.
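C4.5, which J48 implements, chooses the attribute to split on by its gain ratio: the information gain of the split divided by the entropy of the split itself. The sketch below computes that criterion for categorical attributes only; the function names and the toy attributes are hypothetical, and pruning and numeric attributes are not covered.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy (in bits) of a list of categorical values."""
    total = len(items)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(items).values())


def gain_ratio(values, labels):
    """Information gain of splitting on `values`, divided by split information."""
    total = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Weighted entropy of the class labels remaining after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical attributes, for illustration only (not the paper's data).
labels = ["dengue", "dengue", "dengue", "healthy", "healthy", "healthy"]
fever = ["yes", "yes", "yes", "no", "no", "no"]
joint_pain = ["yes", "yes", "no", "no", "no", "yes"]
fever_ratio = gain_ratio(fever, labels)
joint_pain_ratio = gain_ratio(joint_pain, labels)
```

In this toy example the fever attribute separates the two classes perfectly and therefore has the higher gain ratio, so a C4.5-style tree would split on it first.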
6.3 SMO
SMO (sequential minimal optimization) is another method used to classify the dengue prediction dataset. It is an algorithm for training a support vector machine, which separates the classes with a maximum-margin boundary. We ran this algorithm on our dataset using 10-fold cross-validation in the Weka tool and obtained a set of statistics, which was then analysed and tabulated. We achieved 100% classification accuracy and no error, as the mean absolute error is 0. The time required to build the model is 0 seconds and the ROC area obtained is 0.909.
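The 10-fold cross-validation used for the classifiers above can be sketched as follows. The sketch shuffles the instance indices, deals them into ten folds and tests on each fold in turn while training on the other nine. The function names, the seed and the `train_and_predict` callback are hypothetical; WEKA additionally stratifies the folds by class, which this sketch does not do.

```python
import random


def k_fold_indices(n_instances, k=10, seed=1):
    """Shuffle instance indices and deal them into k folds round-robin."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(rows, labels, train_and_predict, k=10, seed=1):
    """Return the fraction of instances classified correctly over k folds.

    `train_and_predict(train_rows, train_labels, test_rows)` is a hypothetical
    callback that fits any classifier and returns predicted labels.
    """
    folds = k_fold_indices(len(rows), k, seed)
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(rows)) if i not in held_out]
        predictions = train_and_predict(
            [rows[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [rows[i] for i in test_idx],
        )
        correct += sum(p == labels[i] for p, i in zip(predictions, test_idx))
    return correct / len(rows)


# With 99 instances, nine folds hold 10 instances and one fold holds 9.
folds = k_fold_indices(99)
fold_sizes = sorted(len(fold) for fold in folds)
```

Every instance is used for testing exactly once, so the reported accuracy is an average over models that never saw the instances they were tested on.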
6.4 REP TREE
REP Tree correctly classified 74.7475% of the instances and incorrectly classified 25.2525%. The mean absolute error is 0.3655, the time taken to build the model is 0.02 seconds and the ROC area is 0.547.
6.5 RANDOM TREE
Random Tree correctly classified 87.8788% of the instances and incorrectly classified 12.1212%. The mean absolute error is 0.1853, the time taken to build the model is 0 seconds and the ROC area is 0.881, as given in the output.
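The percentages reported for REP Tree and Random Tree correspond to whole instance counts. The short check below assumes the same total of 99 instances stated for Naïve Bayes and J48; the helper names are hypothetical.

```python
TOTAL_INSTANCES = 99  # assumed: the total stated for Naive Bayes and J48


def correct_count(percent_correct, total=TOTAL_INSTANCES):
    """Convert a percentage of correctly classified instances to a count."""
    return round(percent_correct / 100 * total)


def percent_correct(count, total=TOTAL_INSTANCES):
    """Convert a count of correctly classified instances to a percentage."""
    return round(count / total * 100, 4)


rep_tree_correct = correct_count(74.7475)
random_tree_correct = correct_count(87.8788)
```

Under that assumption REP Tree classified 74 of the 99 instances correctly and Random Tree 87, which reproduces the reported 74.7475% and 87.8788%.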
7. CONCLUSION
Naïve Bayes, J48 and SMO each classified 100% of the instances correctly. J48 and SMO have the minimum mean absolute error of 0, while Naïve Bayes has an error of 0.0011. The maximum ROC area is found for Naïve Bayes (ROC = 1); the J48 Tree ROC area is 0.979 and SMO's is 0.909. The time taken to build the model is 0 seconds in all cases except REP Tree, for which it is 0.02 seconds.
A maximum ROC area indicates excellent predictive performance compared with the other algorithms. An advantage of Weka for disease prediction is that it can diagnose a disease easily even when the number of patients is huge, including very large data sets spanning hundreds of thousands of patients. Weka is a powerful data mining tool for classifying and visualizing results in medical health in order to predict disease among patients, but other tools such as Matlab can also be used to classify further data sets.
Table 1: Accuracy prediction table of different algorithms
Algorithm   | Correctly Classified Instances (%) | Incorrectly Classified Instances (%) | Mean Absolute Error | ROC Area
Naïve Bayes | 100                                | 0                                    | 0.0011              | 1
J48 Tree    | 100                                | 0                                    | 0                   | 0.979
SMO         | 100                                | 0                                    | 0                   | 0.909
REP Tree    | 74.7475                            | 25.2525                              | 0.3655              | 0.547
Random Tree | 87.8788                            | 12.1212                              | 0.1853              | 0.881
8. REFERENCES
1. Farooqi W, Ali S (2013), A Critical Study of Selected Classification Algorithms for Dengue Fever and Dengue Hemorrhagic Fever. Frontiers of Information Technology (FIT), 11th International Conference on, IEEE.
2. Farooqi W, Ali S, Abdul W (2014), Classification of Dengue Fever Using Decision Tree. VAWKUM Transaction on Computer Sciences 3: 15-22.
3. Rigau-Pérez JG, et al. (1998), Dengue and dengue haemorrhagic fever. The Lancet 19: 971-977.
4. Wikipedia, http://en.m.wikipedia.org/wiki/Dengue_fever, accessed in January 2015.
5. Waikato, http://www.cs.waikato.ac.nz/ml/weka, accessed in January 2015.
6. Kirkby R, Frank E, WEKA Explorer User Guide for version 3-4-3, November 2004.
7. Shakil KA, et al. (2015), "Dengue disease prediction using weka data mining tool", arXiv preprint arXiv:1502.05167.