Creatinine Phosphokinase: displays the level of the creatinine phosphokinase (CPK) enzyme in the blood.
Diabetes: displays whether the person has diabetes or not.
1= Yes
0= No
Ejection fraction: displays the volumetric fraction of blood ejected from the heart with each contraction.
High blood pressure: displays whether the person has high blood pressure or not.
1= Yes
0= No
Platelets: displays the platelet count in the blood.
Serum creatinine: displays the level of creatinine, a waste product of muscle metabolism, in the blood.
Serum sodium: displays the level of sodium in the blood.
Time: displays the follow-up period of the patient, in days.
Fig. 1 shows the summary of each attribute, including its minimum, maximum, mean and median.
Figure 1: Summary of Each Attribute
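A summary of this kind can be produced with base R's summary(). The sketch below is only an illustration on a small made-up data frame: `toy` and its values are hypothetical and stand in for the real dataset `df`.

```r
# Hypothetical toy data standing in for the real dataset `df`.
toy <- data.frame(
  Age          = c(45, 60, 75, 50, 65),
  serum_sodium = c(130, 137, 140, 134, 136),
  DEATH_EVENT  = c(0, 1, 1, 0, 0)
)

# summary() reports minimum, quartiles, median, mean and maximum per column.
attr_summary <- summary(toy)
print(attr_summary)

# The same statistics for a single attribute, as a named numeric vector.
age_stats <- summary(toy$Age)
print(age_stats)
```

Calling summary(df) on the full dataset would give the same statistics for every attribute at once.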
Fig. 2 shows the head of our dataset: five sample records containing the attributes mentioned above.
Figure 2: Sample Records
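Sample records like these can be listed with head(). A minimal sketch, assuming a data frame is already loaded; `toy` and its values are hypothetical and stand in for the real dataset `df`.

```r
# Hypothetical toy data standing in for the real dataset `df`.
toy <- data.frame(
  Age      = c(45, 60, 75, 50, 65, 58, 70),
  Diabetes = c(0, 1, 0, 0, 1, 1, 0),
  Time     = c(4, 6, 7, 7, 8, 8, 10)
)

# head() returns six rows by default, so ask for five explicitly.
first_five <- head(toy, n = 5)
print(first_five)
```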
The following figures show the data distribution of each attribute in our dataset. The code below was used to plot each distribution, changing only the attribute name and color as required.
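# Inspect the current plot margins, set them to 4 lines on each side, and confirm the change.
par("mar")
par(mar=c(4,4,4,4))
par("mar")
# High-density vertical lines: one line per record, with height equal to its Age value.
plot(df$Age, type="h", col="blue", ylab="Age", xlab="Record index", main="Data Distribution of Age")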
Figure 3: Age Distribution
Figure 4: Anaemia Distribution
Figure 5: Creatinine Phosphokinase Distribution
Figure 6: Diabetes Distribution
Figure 7: Ejection Fraction Distribution
Figure 8: Blood Pressure Distribution
Figure 9: Platelets Distribution
Figure 10: Sex Distribution
Figure 11: Serum Creatinine Distribution
Figure 12: Serum Sodium Distribution
Figure 13: Smoking Distribution
Figure 14: Time Distribution
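Rather than editing the attribute name and color by hand for each figure, the same plot can be drawn in a loop over the columns. This is a sketch under assumptions, not the code used for the figures above: the data frame `toy`, the helper `plot_distributions` and the color palette are all hypothetical.

```r
# Hypothetical toy data standing in for the real dataset `df`.
toy <- data.frame(
  Age       = c(45, 60, 75, 50, 65),
  Platelets = c(265000, 263358, 162000, 210000, 327000),
  Smoking   = c(0, 1, 0, 0, 1)
)

# Draw one high-density vertical-line plot per column and
# return (invisibly) the names of the columns that were plotted.
plot_distributions <- function(data, colours = c("blue", "red", "darkgreen")) {
  old_par <- par(mar = c(4, 4, 4, 4))  # set margins, remember the old ones
  on.exit(par(old_par))                # restore the old margins on exit
  for (i in seq_along(data)) {
    name <- names(data)[i]
    plot(data[[i]], type = "h",
         col = colours[(i - 1) %% length(colours) + 1],  # cycle through colours
         xlab = "Record index", ylab = name,
         main = paste("Data Distribution of", name))
  }
  invisible(names(data))
}

pdf(NULL)            # send plots to a null device; drop this line and dev.off() to view them
plotted <- plot_distributions(toy)
invisible(dev.off())
```

With the real data, plot_distributions(df) would produce one plot per attribute in column order.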
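# Inspect the current plot margins, set them to 4 lines on each side, and confirm the change.
par("mar")
par(mar=c(4,4,4,4))
par("mar")
# Loaded by the script; the base-graphics barplot() below does not depend on it.
library(ggplot2)
# Count the records in each class of the target attribute DEATH_EVENT.
Numbers<-table(df$DEATH_EVENT)
# Bar chart of the class counts, with a legend naming each class.
barplot(Numbers,main='Class Distribution',
col=c('red','orange'),legend.text=rownames(Numbers),
ylab='count')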
Figure 15: Unbalanced Class Distribution
The class distribution shows that the dataset is not balanced, so we need to balance it by generating synthetic samples of the minority class. In classification problems, most machine learning algorithms are sensitive to class imbalance, which may lead to poor results. Let’s consider an example of how unbalanced data affects a statistical model. Suppose we had 10 malignant and 90 benign samples. A machine learning model trained and validated on such a dataset could predict "benign" for every sample and still reach 90% accuracy, while failing to detect a single malignant case. An unbalanced dataset can thus bias the prediction model toward the majority class. We used SMOTE [6] to balance the class distribution. Fig. 16 shows the class distribution of the balanced dataset.
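The 10 malignant / 90 benign example can be checked numerically. The sketch below builds those made-up labels (they are not taken from our dataset), applies a trivial rule that always predicts the most frequent class, and compares its overall accuracy with its recall on the minority class. All variable names are hypothetical.

```r
# Labels from the illustrative example: 10 malignant and 90 benign samples.
labels <- factor(c(rep("malignant", 10), rep("benign", 90)))

# Class counts and proportions.
counts <- table(labels)
proportions <- prop.table(counts)

# A trivial "model" that always predicts the most frequent class.
majority_class <- names(counts)[which.max(counts)]
predictions <- factor(rep(majority_class, length(labels)),
                      levels = levels(labels))

# Overall accuracy is high, yet no malignant sample is detected.
accuracy <- mean(predictions == labels)
minority_recall <- mean(predictions[labels == "malignant"] == "malignant")

print(proportions)
print(c(accuracy = accuracy, minority_recall = minority_recall))
```

Here the accuracy is 0.9 while the minority-class recall is 0, which is why accuracy alone can be misleading on unbalanced data.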
Figure 16: Balanced Class Distribution
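The idea behind SMOTE can be sketched in a few lines of base R: each synthetic record is placed at a random point on the line segment between a minority-class record and one of its nearest minority-class neighbours. This is an illustration of the technique only, not the implementation used to produce Figure 16; the function `smote_sketch`, its parameters and the toy records are hypothetical.

```r
# Minimal SMOTE-style oversampling for numeric features (illustrative only).
# minority: numeric matrix of minority-class rows
# n_new:    number of synthetic rows to create
# k:        number of nearest neighbours to choose from
smote_sketch <- function(minority, n_new, k = 5) {
  minority <- as.matrix(minority)
  n <- nrow(minority)
  stopifnot(n >= 2)
  k <- min(k, n - 1)                       # at most n - 1 other rows exist
  distances <- as.matrix(dist(minority))   # pairwise Euclidean distances
  diag(distances) <- Inf                   # a row is never its own neighbour
  synthetic <- matrix(NA_real_, nrow = n_new, ncol = ncol(minority),
                      dimnames = list(NULL, colnames(minority)))
  for (j in seq_len(n_new)) {
    i <- sample.int(n, 1)                           # random minority record
    neighbours <- order(distances[i, ])[seq_len(k)] # its k nearest neighbours
    nn <- neighbours[sample.int(k, 1)]              # pick one of them at random
    gap <- runif(1)                                 # position along the segment
    synthetic[j, ] <- minority[i, ] + gap * (minority[nn, ] - minority[i, ])
  }
  synthetic
}

set.seed(42)
# Hypothetical minority-class records with two numeric features.
minority <- cbind(Age = c(60, 65, 70, 75),
                  ejection_fraction = c(20, 25, 30, 35))
new_rows <- smote_sketch(minority, n_new = 6, k = 2)
print(new_rows)
```

Because the synthetic values are interpolated, binary attributes such as anaemia or smoking would receive non-integer values under this simple version, so they would need separate handling.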