Heart Disease Prediction Using Machine Learning
Machine learning is used across many different fields, and healthcare is no exception. It can play a crucial role in determining whether conditions such as locomotor disorders and heart disease are present or absent. When such information is predicted well in advance, it gives physicians insight they can use to tailor each patient's diagnosis and course of treatment.

Here, we'll walk through using machine learning algorithms to identify probable heart disease in patients.
Dataset
Source: Kaggle
Link: https://www.kaggle.com/code/ayanotemitope/heart-attack-analysis-prediction/data
Problem Defined
Can we determine a patient's risk of heart disease based on clinical parameters?
Data Field
- age - years of age of the patient
- sex - gender of the patient (0 = female; 1 = male)
- cp - type of chest pain
- 0: Typical angina: chest discomfort caused by decreased blood flow to the heart
- 1: Atypical angina: chest discomfort not related to the heart
- 2: Non-anginal pain: commonly esophageal spasms (not heart related)
- 3: Asymptomatic: chest discomfort not associated with any illness
- trtbps - resting blood pressure (in mm Hg on admission to the hospital). Anything above 130-140 is typically cause for concern.
- chol - serum cholesterol in mg/dl
- serum = LDL + HDL + 0.2 * triglycerides
- above 200 is cause for concern
- fbs - (fasting blood sugar > 120 mg/dl) (1 is for true; 0 is for false)
- Diabetes is indicated by '>126' mg/dL.
- restecg - resting electrocardiographic results
- 0: Nothing to worry about
- 1: ST-T Wave abnormality
- might range from minor signs to serious issues
- signals an irregular heartbeat
- 2: Possible or definite left ventricular hypertrophy
- enlarged main pumping chamber of the heart
- thalachh - maximum heart rate achieved
- exng - Angina brought on by exercise (1 is for yes; 0 is for no)
- oldpeak - ST depression induced by exercise relative to rest; it reflects the stress on the heart during exercise, and an unhealthy heart will stress more.
- slp - the slope of the peak exercise ST segment
- 0: Upsloping: better heart rate with exercise (uncommon)
- 1: Flatsloping: minimal change (typical healthy heart)
- 2: Downsloping: signs of an unhealthy heart
- caa - number of major vessels (0-3) colored by fluoroscopy
- The doctor can see the blood flowing via a colored vessel.
- the more blood movement, the better (no clots)
- thall - thallium stress test result
- 1,3: Normal
- 6: fixed defect: Previously defective, but now ok
- 7: reversible defect: no normal blood flow when exercising
- output - whether the patient has heart disease (1 = yes; 0 = no) [the predicted attribute]
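For reference, once the data is loaded (see the implementation below), these coded values can be mapped to readable labels. A minimal sketch, assuming the column names and codes listed above:
# Map coded categorical values to readable text labels for easier inspection.
# The label dictionaries below are taken from the data dictionary above.
import pandas as pd
cp_labels = {0: "typical angina", 1: "atypical angina", 2: "non-anginal pain", 3: "asymptomatic"}
sex_labels = {0: "female", 1: "male"}
df = pd.read_csv("heart.csv")
readable = df.assign(cp=df["cp"].map(cp_labels), sex=df["sex"].map(sex_labels))
readable[["age", "sex", "cp", "output"]].head()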
Implementation using Python
Importing Libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import hvplot.pandas
from scipy import stats
%matplotlib inline
sns.set_style("whitegrid")
plt.style.use("fivethirtyeight")
Loading the Dataset
data_ = pd.read_csv("heart.csv")
data_.head()
Output:

EDA (Exploratory Data Analysis)
data_.info()
Output:

data_.shape
Output:

pd.set_option("display.float", "{:.2f}".format)
data_.describe()
Output:

data_.output.value_counts()
Output:

data_.output.value_counts().hvplot.bar(
title="Heart Disease Count", xlabel='Heart Disease', ylabel='Count',
width=600, height=400
)
Output:

# here, we will check if there is any missing value in our dataset
data_.isna().sum()
Output:

categorical_value = []
continuous_value = []
for column in data_.columns:
    if len(data_[column].unique()) <= 10:
        categorical_value.append(column)
    else:
        continuous_value.append(column)
categorical_value
Output:

patient_have_disease = data_.loc[data_['output']==1, 'sex'].value_counts().hvplot.bar(alpha=0.4)
patient_have_no_disease = data_.loc[data_['output']==0, 'sex'].value_counts().hvplot.bar(alpha=0.4)
(patient_have_no_disease * patient_have_disease).opts(
title="Heart Disease - Sex", xlabel='Sex', ylabel='Count',
width=700, height=550, legend_cols=2, legend_position='top_right'
)
Output:

patient_have_disease = data_.loc[data_['output']==1, 'cp'].value_counts().hvplot.bar(alpha=0.4)
patient_have_no_disease = data_.loc[data_['output']==0, 'cp'].value_counts().hvplot.bar(alpha=0.4)
(patient_have_no_disease * patient_have_disease).opts(
title="Heart Disease -Chest Pain Type", xlabel='Chest Pain Type', ylabel='Count',
width=700, height=550, legend_cols=2, legend_position='top_right'
)
Output:

patient_have_disease = data_.loc[data_['output']==1, 'fbs'].value_counts().hvplot.bar(alpha=0.4)
patient_have_no_disease = data_.loc[data_['output']==0, 'fbs'].value_counts().hvplot.bar(alpha=0.4)
(patient_have_no_disease * patient_have_disease).opts(
title="Heart Disease - fasting blood sugar", xlabel='fasting blood sugar > 120 mg/dl (1 = true; 0 = false)',
ylabel='Count', width=700, height=550, legend_cols=2, legend_position='top_right'
)
Output:

patient_have_disease = data_.loc[data_['output']==1, 'restecg'].value_counts().hvplot.bar(alpha=0.4)
patient_have_no_disease = data_.loc[data_['output']==0, 'restecg'].value_counts().hvplot.bar(alpha=0.4)
(patient_have_no_disease * patient_have_disease).opts(
title="Heart Disease - resting electrocardiographic results", xlabel='resting electrocardiographic results',
ylabel='Count', width=700, height=550, legend_cols=2, legend_position='top_right'
)
Output:

plt.figure(figsize=(15, 15))
for i, column in enumerate(categorical_value, 1):
    plt.subplot(3, 3, i)
    data_[data_["output"] == 0][column].hist(bins=35, color='blue', label='Have Heart Disease = NO', alpha=0.6)
    data_[data_["output"] == 1][column].hist(bins=35, color='red', label='Have Heart Disease = YES', alpha=0.6)
    plt.legend()
    plt.xlabel(column)
Output:

From the above, we can draw the following observations about heart disease:
- People with a chest pain score of 1, 2, or 3 are more likely to develop heart disease than those with a score of 0.
- People with value 1 (signals non-normal heart rhythm, can vary from moderate symptoms to serious difficulties) on their resting electrocardiogram are more likely to develop heart disease.
- Exercise-induced angina (exng): Those who score 0 (no ==> exercise-induced angina) are more likely to suffer heart disease than those who score 1 (yes ==> exercise-induced angina).
- Slope of the peak exercise ST segment (slp): people with a slope value of 2 (signs of an unhealthy heart) are more likely to develop heart disease than those with a value of 0 (better heart rate with exercise) or 1 (minimal change, typical of a healthy heart).
- Number of major vessels colored by fluoroscopy (caa): people with a value of 0 are more prone to heart problems; the more major vessels (0-3) that light up, the better the blood movement (no clots).
- Thallium stress result (thall): people with a value of 2 (fixed defect: formerly defective but now okay) are more prone to heart disease.
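These observations can also be checked numerically. A minimal sketch, assuming the data_ DataFrame and the categorical_value list built earlier:
# Disease rate within each category of every categorical feature.
for column in categorical_value:
    if column == 'output':
        continue
    print(f"\nDisease rate by {column}:")
    print(data_.groupby(column)['output'].mean().round(2))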
plt.figure(figsize=(15, 15))
for i, column in enumerate(continuous_value, 1):
    plt.subplot(3, 2, i)
    data_[data_["output"] == 0][column].hist(bins=35, color='blue', label='Have Heart Disease = NO', alpha=0.6)
    data_[data_["output"] == 1][column].hist(bins=35, color='red', label='Have Heart Disease = YES', alpha=0.6)
    plt.legend()
    plt.xlabel(column)
Output:

- Resting blood pressure (trtbps), in mm Hg on admission to the hospital: anything above 130-140 is typically cause for concern.
- A serum cholesterol level (chol) of 200 or above warrants caution.
- A person whose maximum heart rate achieved (thalachh) is greater than 140 is more likely to have heart disease.
- ST depression induced by exercise relative to rest (oldpeak) reflects the stress on the heart during exercise; an unhealthy heart will stress more.
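A quick per-class summary of the continuous features supports these observations. A minimal sketch, assuming data_ and the continuous_value list built earlier:
# Mean of each continuous feature for patients with and without heart disease.
data_.groupby('output')[continuous_value].mean().round(2)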
Max Heart Rate versus Age for Heart Disease
# Creating Different figure
plt.figure(figsize=(10, 7))
# Scattering with positive references
plt.scatter(data_.age[data_.output==1],
data_.thalachh[data_.output==1],
c="salmon")
# Scattering with negative references
plt.scatter(data_.age[data_.output==0],
data_.thalachh[data_.output==0],
c="lightblue")
# Info for ease
plt.title("Heart Disease in function of Max Heart Rate and Age")
plt.xlabel("Age - Age of the Patient")
plt.ylabel("Max Heart Rate - Maximum Heart Rate of the Patient")
plt.legend(["Disease", "No-Disease"]);
Output:

Correlation Matrix
corr_matrix = data_.corr()
fig, ax = plt.subplots(figsize=(15, 15))
ax = sns.heatmap(corr_matrix,
annot=True,
linewidths=0.5,
fmt=".2f",
cmap="YlGnBu");
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
Output:

data_.drop('output', axis=1).corrwith(data_.output).hvplot.barh(
width=800, height=600,
title="Correlation between Numeric Features and Heart Disease",
ylabel='Correlation', xlabel='Numerical Features',
)
Output:

- The output variable has the lowest correlations with fbs and chol.
- All the other variables show a noticeable correlation with the output variable.
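The exact values behind these observations can be listed directly. A minimal sketch:
# Correlations of every feature with the target, sorted by absolute strength.
corr_with_output = data_.drop('output', axis=1).corrwith(data_.output)
corr_with_output.reindex(corr_with_output.abs().sort_values(ascending=False).index)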
Processing of Data
Having examined the dataset, we now convert the categorical variables into dummy variables and scale the continuous values before training the machine learning models.
categorical_value.remove('output')
dataset = pd.get_dummies(data_, columns = categorical_value)
dataset.head()
Output:

print(data_.columns)
print(dataset.columns)
Output:

from sklearn.preprocessing import StandardScaler
ssc = StandardScaler()
scale_col = ['age', 'trtbps', 'chol', 'thalachh', 'oldpeak']
dataset[scale_col] = ssc.fit_transform(dataset[scale_col])
dataset.head()
Output:

Building Models
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
def printing_score(classifier, X_train, y_train, X_test, y_test, train=True):
    if train:
        prediction = classifier.predict(X_train)
        report = pd.DataFrame(classification_report(y_train, prediction, output_dict=True))
        print("Result - Train:\n-------------------------------------------")
        print(f"Score for Accuracy: {accuracy_score(y_train, prediction) * 100:.2f}%")
        print("-----------------------------------------------")
        print(f"Report of Classification:\n{report}")
        print("------------------------------------------------")
        print(f"Confusion Matrix: \n {confusion_matrix(y_train, prediction)}\n")
    else:
        prediction = classifier.predict(X_test)
        report = pd.DataFrame(classification_report(y_test, prediction, output_dict=True))
        print("Result - Test:\n-----------------------------------------------------")
        print(f"Score for Accuracy: {accuracy_score(y_test, prediction) * 100:.2f}%")
        print("-----------------------------------------------")
        print(f"Report of Classification:\n{report}")
        print("------------------------------------------------")
        print(f"Confusion Matrix: \n {confusion_matrix(y_test, prediction)}\n")
Here, we will split our data into two parts: a training dataset and a testing dataset.
from sklearn.model_selection import train_test_split
X = dataset.drop('output', axis=1)
y = dataset.output
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
We will try different Machine Learning models.
1. Logistic Regression
from sklearn.linear_model import LogisticRegression
logistic_regression_classification = LogisticRegression(solver='liblinear')
logistic_regression_classification.fit(X_train, y_train)
printing_score(logistic_regression_classification , X_train, y_train, X_test, y_test, train=True)
printing_score(logistic_regression_classification , X_train, y_train, X_test, y_test, train=False)
Output:

score_test = accuracy_score(y_test, logistic_regression_classification.predict(X_test)) * 100
score_train = accuracy_score(y_train, logistic_regression_classification.predict(X_train)) * 100
df_result = pd.DataFrame(data=[["Logistic Regression", score_train, score_test]],
columns=['Model', 'Training Accuracy %', 'Testing Accuracy %'])
df_result
Output:

2. Support Vector Machine (SVM)
from sklearn.svm import SVC
svm_classification = SVC(kernel='rbf', gamma=0.1, C=1.0)
svm_classification.fit(X_train, y_train)
printing_score(svm_classification, X_train, y_train, X_test, y_test, train=True)
printing_score(svm_classification, X_train, y_train, X_test, y_test, train=False)
Output:

score_test = accuracy_score(y_test, svm_classification.predict(X_test)) * 100
score_train = accuracy_score(y_train, svm_classification.predict(X_train)) * 100
result = pd.DataFrame(data=[["Support Vector Machine", score_train, score_test]],
columns=['Model', 'Training Accuracy %', 'Testing Accuracy %'])
df_result = df_result.append(result, ignore_index=True)
df_result
Output:

3. Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier
tree_classification = DecisionTreeClassifier(random_state=42)
tree_classification.fit(X_train, y_train)
printing_score(tree_classification, X_train, y_train, X_test, y_test, train=True)
printing_score(tree_classification, X_train, y_train, X_test, y_test, train=False)
Output:

score_test = accuracy_score(y_test, tree_classification.predict(X_test)) * 100
score_train = accuracy_score(y_train, tree_classification.predict(X_train)) * 100
result = pd.DataFrame(data=[["Decision Tree Classifier", score_train, score_test]],
columns=['Model', 'Training Accuracy %', 'Testing Accuracy %'])
df_result = df_result.append(result, ignore_index=True)
df_result
Output:

4. Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
random_f_classification = RandomForestClassifier(n_estimators=1000, random_state=42)
random_f_classification.fit(X_train, y_train)
printing_score(random_f_classification, X_train, y_train, X_test, y_test, train=True)
printing_score(random_f_classification, X_train, y_train, X_test, y_test, train=False)
Output:

score_test = accuracy_score(y_test, random_f_classification.predict(X_test)) * 100
score_train = accuracy_score(y_train, random_f_classification.predict(X_train)) * 100
result = pd.DataFrame(data=[["Random Forest Classifier", score_train, score_test]],
columns=['Model', 'Training Accuracy %', 'Testing Accuracy %'])
df_result = df_result.append(result, ignore_index=True)
df_result
Output:

5. XGBoost Classifier
from xgboost import XGBClassifier
xgb_classifier = XGBClassifier(use_label_encoder=False)
xgb_classifier.fit(X_train, y_train)
printing_score(xgb_classifier, X_train, y_train, X_test, y_test, train=True)
printing_score(xgb_classifier, X_train, y_train, X_test, y_test, train=False)
Output:

score_test = accuracy_score(y_test, xgb_classifier.predict(X_test)) * 100
score_train = accuracy_score(y_train, xgb_classifier.predict(X_train)) * 100
result = pd.DataFrame(data=[["XGBoost Classifier", score_train, score_test]],
columns=['Model', 'Training Accuracy %', 'Testing Accuracy %'])
df_result = df_result.append(result, ignore_index=True)
df_result
Output:

Hyperparameter Tuning of Models
1. Logistic Regression Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV
params = {"C": np.logspace(-4, 4, 20),
"solver": ["liblinear"]}
logistic_regression_classification = LogisticRegression()
logistic_regression_cv = GridSearchCV(logistic_regression_classification, params, scoring="accuracy", n_jobs=-1, verbose=1, cv=5)
logistic_regression_cv.fit(X_train, y_train)
best_params = logistic_regression_cv.best_params_
print(f"Best parameters: {best_params}")
logistic_regression_classification = LogisticRegression(**best_params)
logistic_regression_classification.fit(X_train, y_train)
printing_score(logistic_regression_classification, X_train, y_train, X_test, y_test, train=True)
printing_score(logistic_regression_classification, X_train, y_train, X_test, y_test, train=False)
Output:

score_test = accuracy_score(y_test, logistic_regression_classification.predict(X_test)) * 100
score_train = accuracy_score(y_train, logistic_regression_classification.predict(X_train)) * 100
df_result_tuned = pd.DataFrame(data=[[" Logistic Regression- Tuned", score_train, score_test]],
columns=['Model', 'Training Accuracy %', 'Testing Accuracy %'])
df_result_tuned
Output:

2. Support Vector Machine(SVM) Hyperparameter Tuning
svm_classifier = SVC(kernel='rbf', gamma=0.1, C=1.0)
params = {"C":(0.1, 0.5, 1, 2, 5, 10, 20),
"gamma":(0.001, 0.01, 0.1, 0.25, 0.5, 0.75, 1),
"kernel":('linear', 'poly', 'rbf')}
svm_cv_ = GridSearchCV(svm_classifier, params, n_jobs=-1, cv=5, verbose=1, scoring="accuracy")
svm_cv_.fit(X_train, y_train)
best_params_ = svm_cv_.best_params_
print(f"Best params: {best_params_}")
svm_classifier = SVC(**best_params_)
svm_classifier.fit(X_train, y_train)
printing_score(svm_classifier, X_train, y_train, X_test, y_test, train=True)
printing_score(svm_classifier, X_train, y_train, X_test, y_test, train=False)
Output:

score_test = accuracy_score(y_test, svm_classifier.predict(X_test)) * 100
score_train = accuracy_score(y_train, svm_classifier.predict(X_train)) * 100
result = pd.DataFrame(data=[[" Support Vector Machine-Tuned", score_train, score_test]],
columns=['Model', 'Training Accuracy %', 'Testing Accuracy %'])
df_result_tuned = df_result_tuned.append(result, ignore_index=True)
df_result_tuned
Output:

3. Decision Tree Classifier Hyperparameter Tuning
params = {"criterion":("gini", "entropy"),
"splitter":("best", "random"),
"max_depth":(list(range(1, 20))),
"min_samples_split":[2, 3, 4],
"min_samples_leaf":list(range(1, 20))
}
dtree_classifier = DecisionTreeClassifier(random_state=42)
dtree_cv = GridSearchCV(dtree_classifier, params, scoring="accuracy", n_jobs=-1, verbose=1, cv=3)
dtree_cv.fit(X_train, y_train)
best_params_ = dtree_cv.best_params_
print(f'Best_params: {best_params_}')
dtree_classifier = DecisionTreeClassifier(**best_params_)
dtree_classifier.fit(X_train, y_train)
printing_score(dtree_classifier, X_train, y_train, X_test, y_test, train=True)
printing_score(dtree_classifier, X_train, y_train, X_test, y_test, train=False)
Output:

score_test = accuracy_score(y_test, dtree_classifier.predict(X_test)) * 100
score_train = accuracy_score(y_train, dtree_classifier.predict(X_train)) * 100
result = pd.DataFrame(data=[[" Decision Tree Classifier- Tuned", score_train, score_test]],
columns=['Model', 'Training Accuracy %', 'Testing Accuracy %'])
df_result_tuned = df_result_tuned.append(result, ignore_index=True)
df_result_tuned
Output:

4. Random Forest Classifier Hyperparameter Tuning
n_estimators = [500, 900, 1100, 1500]
max_features = ['auto', 'sqrt']
max_depth = [2, 3, 5, 10, 15, None]
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
params_grid = {
'n_estimators': n_estimators,
'max_features': max_features,
'max_depth': max_depth,
'min_samples_split': min_samples_split,
'min_samples_leaf': min_samples_leaf
}
random_forest_classifier = RandomForestClassifier(random_state=42)
random_forest_cv = GridSearchCV(random_forest_classifier, params_grid, scoring="accuracy", cv=3, verbose=1, n_jobs=-1)
random_forest_cv.fit(X_train, y_train)
best_params_ = random_forest_cv.best_params_
print(f"Best parameters: {best_params_}")
random_forest_classifier = RandomForestClassifier(**best_params_)
random_forest_classifier.fit(X_train, y_train)
printing_score(random_forest_classifier, X_train, y_train, X_test, y_test, train=True)
printing_score(random_forest_classifier, X_train, y_train, X_test, y_test, train=False)
Output:

score_test = accuracy_score(y_test, random_forest_classifier.predict(X_test)) * 100
score_train = accuracy_score(y_train, random_forest_classifier.predict(X_train)) * 100
result = pd.DataFrame(data=[["Random Forest Classifier-Tuned", score_train, score_test]],
columns=['Model', 'Training Accuracy %', 'Testing Accuracy %'])
df_result_tuned = df_result_tuned.append(result, ignore_index=True)
df_result_tuned
Output:

5. XGBoost Classifier Hyperparameter Tuning
param_grid = dict(
n_estimators=stats.randint(10, 1000),
max_depth=stats.randint(1, 10),
learning_rate=stats.uniform(0, 1)
)
xgb_classifier = XGBClassifier(use_label_encoder=False)
xgboost_cv = RandomizedSearchCV(
xgb_classifier, param_grid, cv=3, n_iter=50,
scoring='accuracy', n_jobs=-1, verbose=1
)
xgboost_cv.fit(X_train, y_train)
best_params_ = xgboost_cv.best_params_
print(f"Best paramters: {best_params_}")
xgb_classifier = XGBClassifier(**best_params_)
xgb_classifier.fit(X_train, y_train)
printing_score(xgb_classifier, X_train, y_train, X_test, y_test, train=True)
printing_score(xgb_classifier, X_train, y_train, X_test, y_test, train=False)
Output:

score_test = accuracy_score(y_test, xgb_classifier.predict(X_test)) * 100
score_train = accuracy_score(y_train, xgb_classifier.predict(X_train)) * 100
result = pd.DataFrame(data=[[" XGBoost Classifier -Tuned", score_train, score_test]],
columns=['Model', 'Training Accuracy %', 'Testing Accuracy %'])
df_result_tuned = df_result_tuned.append(result, ignore_index=True)
df_result_tuned
Output:

df_result
Output:

The outcomes don't appear to have improved significantly after hyperparameter tuning, possibly because the dataset is small.
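With a dataset this small, a single train/test split can give a noisy accuracy estimate; cross-validation averages over several splits and gives a more stable picture. A minimal sketch, assuming the tuned classifiers and the full X, y defined above:
# Cross-validated accuracy (mean and std over 5 folds) for the tuned models.
from sklearn.model_selection import cross_val_score
for name, model in [("Logistic Regression", logistic_regression_classification),
                    ("Random Forest", random_forest_classifier),
                    ("XGBoost", xgb_classifier)]:
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean() * 100:.2f}% +/- {scores.std() * 100:.2f}%")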
Feature Importance According to Random Forest and XGBoost
def feature_imp(df, model):
    fi = pd.DataFrame()
    fi["feature"] = df.columns
    fi["importance"] = model.feature_importances_
    return fi.sort_values(by="importance", ascending=False)
feature_imp(X, random_forest_classifier).plot(kind='barh', figsize=(12,7), legend=False)
Output:
<AxesSubplot:>

feature_imp(X, xgb_classifier).plot(kind='barh', figsize=(12,7), legend=False)
Output:
<AxesSubplot:>
