├── Liver Enzymes Analysis.png
├── Protein Levels Analysis.png
├── Bilirubin Levels Analysis.png
├── Gender Distribution Analysis.png
├── Age vs. Enzyme Levels Analysis.png
├── Actual vs Predicted plot (Random Forest).png
├── Age Distribution by Disease Status Analysis.png
├── Correlation Between Bilirubin and Enzymes Analysis.png
├── Liver_Disease_Prediction_Using_Machine_Learning_Report.pdf
├── .github
    └── workflows
    │   └── python-package-conda.yml
├── README.md
└── Project (Indian Liver Patient).py


/Liver Enzymes Analysis.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/KumudRanjan4295/Liver_Disease_Prediction_Using_Machine_Learning/HEAD/Liver Enzymes Analysis.png
--------------------------------------------------------------------------------
/Protein Levels Analysis.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/KumudRanjan4295/Liver_Disease_Prediction_Using_Machine_Learning/HEAD/Protein Levels Analysis.png
--------------------------------------------------------------------------------
/Bilirubin Levels Analysis.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/KumudRanjan4295/Liver_Disease_Prediction_Using_Machine_Learning/HEAD/Bilirubin Levels Analysis.png
--------------------------------------------------------------------------------
/Gender Distribution Analysis.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/KumudRanjan4295/Liver_Disease_Prediction_Using_Machine_Learning/HEAD/Gender Distribution Analysis.png
--------------------------------------------------------------------------------
/Age vs. Enzyme Levels Analysis.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/KumudRanjan4295/Liver_Disease_Prediction_Using_Machine_Learning/HEAD/Age vs. Enzyme Levels Analysis.png
--------------------------------------------------------------------------------
/Actual vs Predicted plot (Random Forest).png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/KumudRanjan4295/Liver_Disease_Prediction_Using_Machine_Learning/HEAD/Actual vs Predicted plot (Random Forest).png
--------------------------------------------------------------------------------
/Age Distribution by Disease Status Analysis.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/KumudRanjan4295/Liver_Disease_Prediction_Using_Machine_Learning/HEAD/Age Distribution by Disease Status Analysis.png
--------------------------------------------------------------------------------
/Correlation Between Bilirubin and Enzymes Analysis.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/KumudRanjan4295/Liver_Disease_Prediction_Using_Machine_Learning/HEAD/Correlation Between Bilirubin and Enzymes Analysis.png
--------------------------------------------------------------------------------
/Liver_Disease_Prediction_Using_Machine_Learning_Report.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/KumudRanjan4295/Liver_Disease_Prediction_Using_Machine_Learning/HEAD/Liver_Disease_Prediction_Using_Machine_Learning_Report.pdf
--------------------------------------------------------------------------------
/.github/workflows/python-package-conda.yml:
--------------------------------------------------------------------------------
name: Python Package using Conda

on: [push]

jobs:
  build-linux:
    runs-on: ubuntu-latest
    strategy:
      max-parallel: 5

    steps:
    - uses: actions/checkout@v4
    - name: Set up Python 3.10
      uses: actions/setup-python@v3
      with:
        python-version: '3.10'
    - name: Add conda to system path
      run: |
        # $CONDA is an environment variable pointing to the root of the miniconda directory
        echo $CONDA/bin >> $GITHUB_PATH
    - name: Install dependencies
      run: |
        conda env update --file environment.yml --name base
    - name: Lint with flake8
      run: |
        conda install flake8
        # stop the build if there are Python syntax errors or undefined names
        flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
        # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
        flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
    - name: Test with pytest
      run: |
        conda install pytest
        pytest

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Liver Disease Prediction Using Machine Learning

This project uses machine learning to predict liver disease from clinical data in the **Indian Liver Patient Dataset**. It combines exploratory data analysis (EDA), classification, and regression modeling to extract healthcare insights.

## 📂 Dataset

- **Source**: [Kaggle - Indian Liver Patient Dataset](https://www.kaggle.com/datasets/uciml/indian-liver-patient-records)
- **Records**: 583
- **Features**: Age, Gender, Bilirubin levels, Liver enzymes (ALT, AST, ALP), Albumin, A/G ratio
- **Target Variable**: Liver Disease (raw `Dataset` column: 1 = Disease, 2 = No Disease)

## 🔍 Project Objectives

- Perform EDA to understand patterns in liver disease indicators
- Compare healthy vs. diseased patient data statistically and visually
- Build:
  - **SVM** model for disease classification
  - **Random Forest** model for bilirubin level prediction

## 🧪 Exploratory Data Analysis (EDA)

- Analyzed gender and age-wise distribution
- Investigated enzyme and protein level variations
- Identified strong correlations using heatmaps
- Handled missing values and encoded categorical data

## 🧠 Machine Learning Models

### 1. Support Vector Machine (SVM)
- **Task**: Binary Classification
- **Accuracy**: 72%
- **Recall (Disease)**: 88%
- **Precision**: 75%
- **F1-Score**: 0.81
- **Top Features**: AST, Albumin, ALP

### 2. Random Forest Regressor
- **Task**: Predicting Total Bilirubin
- **R² Score**: 0.68
- **MAE**: 1.92 mg/dL
- **RMSE**: 2.89 mg/dL

### 📊 Feature Importance (Top 5)
| Feature         | SVM Weight | RF Importance |
|-----------------|------------|---------------|
| AST (SGOT)      | 0.22       | 0.23          |
| Albumin         | 0.19       | 0.21          |
| ALP             | 0.15       | 0.17          |
| Age             | 0.11       | 0.09          |
| Total Proteins  | 0.08       | 0.07          |

## ⚙️ Tech Stack

- Python
- Pandas, NumPy
- Seaborn, Matplotlib
- Scikit-learn

## 🧠 Key Insights

- Males are 2.7× more likely to develop liver disease
- Peak disease incidence observed between ages 45–60
- High bilirubin and low albumin levels strongly indicate disease

## 🚀 Future Scope

- Integrate deep learning (CNNs) for image+biochemical analysis
- Deploy models as EHR-integrated decision support tools
- Incorporate federated learning for secure, cross-hospital collaboration
- Apply fairness audits to detect and mitigate bias

## 📄 Report

The full project report is included as `Liver_Disease_Prediction_Using_Machine_Learning_Report.pdf`.
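
The SVM figures listed under Machine Learning Models can be cross-checked against each other, because the F1-score is the harmonic mean of precision and recall. The sketch below uses only the precision and recall quoted in this README; the helper name `f1_from` is illustrative and is not part of the project script.

```python
def f1_from(precision: float, recall: float) -> float:
    """Return the F1-score, i.e. the harmonic mean of precision and recall."""
    if precision + recall == 0:
        # Avoid division by zero when the model predicts no positives correctly
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in this README
precision, recall = 0.75, 0.88
f1 = f1_from(precision, recall)
print(f"F1 = {f1:.2f}")  # prints "F1 = 0.81"
```

The result matches the reported F1-score of 0.81 after rounding.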

## 👤 Author

**Kumud Ranjan**
M.Tech Data Science and Engineering
Lovely Professional University

--------------------------------------------------------------------------------
/Project (Indian Liver Patient).py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
"""
Created on Sat Apr 12 12:59:26 2025

@author: bgpda
"""

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, classification_report, mean_squared_error, r2_score
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.svm import SVC, SVR

# Load the data (adjust the path to wherever the CSV is stored)
df = pd.read_csv("C:/Users/bgpda/Desktop/LPU/Python_DataScience/indian_liver_patient.csv")

# Print Data
print(df)
print(df.shape)
# Basic Information
df.info()  # info() prints its summary directly and returns None
print(df.head())
print(df.describe())

# Check for missing values
print(df.isnull().sum())

# Handle missing values (only the Albumin_and_Globulin_Ratio column has any)
# Assign the result back instead of using inplace=True on a column selection
df['Albumin_and_Globulin_Ratio'] = df['Albumin_and_Globulin_Ratio'].fillna(
    df['Albumin_and_Globulin_Ratio'].median())

# Convert Gender to numerical (Male=1, Female=0)
le = LabelEncoder()
df['Gender'] = le.fit_transform(df['Gender'])

# Rename target for clarity (raw coding is kept: 1 = disease, 2 = no disease)
df = df.rename(columns={'Dataset':'Liver_Disease'})

# Check target distribution
print(df['Liver_Disease'].value_counts())

########################################### ANALYSIS ###################################################

# 1. Gender Distribution Analysis: Examine the proportion of males and females in the dataset and their liver disease status.

gender_analysis = df.groupby(['Gender', 'Liver_Disease']).size().unstack()
# unstack() orders the columns by target value: 1 (disease), then 2 (no disease)
gender_analysis.columns = ['Disease', 'No Disease']
gender_analysis.index = ['Female', 'Male']

# DataFrame.plot creates its own figure, so pass figsize here instead of calling plt.figure first
gender_analysis.plot(kind='bar', stacked=True, figsize=(8,5))
plt.title('Liver Disease Cases by Gender')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.show()

# Findings:
# (a) Males are significantly overrepresented in the dataset
# (b) Males have a much higher incidence of liver disease compared to females


# 2. Age Distribution by Disease Status Analysis: Compare age distributions between patients with and without liver disease.

plt.figure(figsize=(10,6))
sns.violinplot(x='Liver_Disease', y='Age', data=df, split=True)
plt.title('Age Distribution by Liver Disease Status')
# Tick 0 is target value 1 (disease); tick 1 is target value 2 (no disease)
plt.xticks([0,1], ['Disease', 'No Disease'])
plt.show()

# Findings:
# (a) Patients with liver disease tend to be slightly older
# (b) The age range is similar for both groups (20-70 years)
# (c) Younger patients (<20) are more likely to not have liver disease


# 3. Bilirubin Levels Analysis: Compare bilirubin levels between healthy and diseased patients.

plt.figure(figsize=(12,5))
plt.subplot(1,2,1)
sns.boxplot(x='Liver_Disease', y='Total_Bilirubin', data=df)
plt.title('Total Bilirubin by Disease Status')

plt.subplot(1,2,2)
sns.boxplot(x='Liver_Disease', y='Direct_Bilirubin', data=df)
plt.title('Direct Bilirubin by Disease Status')
plt.tight_layout()
plt.show()

# Findings:
# (a) Both total and direct bilirubin levels are significantly higher in patients with liver disease
# (b) Many outliers in the diseased group indicate severe cases


# 4. Liver Enzymes Analysis: Compare key liver enzymes (ALT, AST, ALP) between groups.

enzymes = ['Alkaline_Phosphotase', 'Alamine_Aminotransferase', 'Aspartate_Aminotransferase']

plt.figure(figsize=(15,5))
for i, enzyme in enumerate(enzymes, 1):
    plt.subplot(1,3,i)
    sns.boxplot(x='Liver_Disease', y=enzyme, data=df)
    plt.title(f'{enzyme} by Disease Status')
plt.tight_layout()
plt.show()

# Findings:
# (a) All three enzymes show significantly higher levels in diseased patients
# (b) AST shows the most dramatic difference between groups
# (c) Extreme outliers suggest some acute liver injury cases


# 5. Protein Levels Analysis: Examine protein-related biomarkers (Total Proteins, Albumin, A/G Ratio).

proteins = ['Total_Protiens', 'Albumin', 'Albumin_and_Globulin_Ratio']

plt.figure(figsize=(15,5))
for i, protein in enumerate(proteins, 1):
    plt.subplot(1,3,i)
    sns.boxplot(x='Liver_Disease', y=protein, data=df)
    plt.title(f'{protein} by Disease Status')
plt.tight_layout()
plt.show()

# Findings:
# (a) Diseased patients have lower total proteins and albumin levels
# (b) Albumin/Globulin ratio is significantly lower in diseased patients
# (c) These findings are consistent with liver dysfunction


# 6. Correlation Between Bilirubin and Enzymes Analysis: Explore relationships between bilirubin levels and liver enzymes.

corr_vars = ['Total_Bilirubin', 'Direct_Bilirubin', 'Alamine_Aminotransferase',
             'Aspartate_Aminotransferase', 'Liver_Disease']

plt.figure(figsize=(10,8))
sns.heatmap(df[corr_vars].corr(), annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Between Bilirubin and Liver Enzymes')
plt.show()

# Findings:
# (a) Strong correlation between total and direct bilirubin (0.87)
# (b) Moderate correlation between bilirubin and liver enzymes
# (c) All biomarkers show positive correlation with liver disease
# Note: Liver_Disease is coded 1 = disease, 2 = no disease, so a negative
# coefficient in its row means higher values in diseased patients


# 7. Age vs. Enzyme Levels Analysis: Examine how enzyme levels vary with age.

plt.figure(figsize=(15,5))
sns.scatterplot(x='Age', y='Alamine_Aminotransferase', hue='Liver_Disease',
                data=df, alpha=0.6)
plt.title('Age vs ALT Levels by Disease Status')
plt.show()

# Findings:
# (a) Younger patients with disease tend to have extremely high ALT levels
# (b) Older patients generally show more moderate elevation
# (c) Healthy patients maintain low ALT levels regardless of age



######################################## Apply Models ###############################################


# Create features and targets
X = df.drop(['Liver_Disease', 'Total_Bilirubin', 'Direct_Bilirubin'], axis=1)  # Features
y_class = df['Liver_Disease']  # Classification target (1: disease, 2: no disease)
y_reg = df['Total_Bilirubin']  # Regression target

# Split data (the same row split is shared by both targets)
X_train, X_test, y_class_train, y_class_test, y_reg_train, y_reg_test = train_test_split(
    X, y_class, y_reg, test_size=0.2, random_state=42)

# Scale features (fit on the training split only, then apply to the test split)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


# Classification Models (Support Vector Machine)

svm_clf = SVC(random_state=42)
svm_clf.fit(X_train_scaled, y_class_train)
svm_pred = svm_clf.predict(X_test_scaled)

print("\nSVM Results:")
print(f"Accuracy: {accuracy_score(y_class_test, svm_pred):.2f}")
print("Classification Report:")
print(classification_report(y_class_test, svm_pred))


# Regression Models (Random Forest)

rf_reg = RandomForestRegressor(random_state=42)
rf_reg.fit(X_train_scaled, y_reg_train)
rf_reg_pred = rf_reg.predict(X_test_scaled)

print("\nRandom Forest Regressor Results:")
print(f"Mean Squared Error: {mean_squared_error(y_reg_test, rf_reg_pred):.2f}")
print(f"R-squared: {r2_score(y_reg_test, rf_reg_pred):.2f}")

# Actual vs Predicted plot (Random Forest)
plt.figure(figsize=(8,6))
plt.scatter(y_reg_test, rf_reg_pred, alpha=0.5)
# Dashed diagonal marks perfect predictions (predicted == actual)
plt.plot([y_reg_test.min(), y_reg_test.max()], [y_reg_test.min(), y_reg_test.max()], 'r--')
plt.xlabel('Actual Bilirubin')
plt.ylabel('Predicted Bilirubin')
plt.title('Actual vs Predicted Bilirubin Levels (Random Forest)')
plt.show()

--------------------------------------------------------------------------------
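
The README quotes MAE and RMSE for the bilirubin regressor, while the script above prints only MSE and R². As a minimal sketch of how the three quantities relate, the block below computes them in plain Python. The function name `regression_metrics` and the sample values are hypothetical and do not come from the project or the dataset. In the script itself, scikit-learn's `mean_absolute_error`, `mean_squared_error` and `r2_score` compute the same quantities.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for two equal-length sequences of numbers."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n   # mean absolute error
    mse = sum(e * e for e in errors) / n    # mean squared error
    rmse = math.sqrt(mse)                   # same units as the target (mg/dL here)

    # R^2 compares the residual error with the error of always predicting the mean
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    ss_res = sum(e * e for e in errors)
    r2 = 1.0 - ss_res / ss_tot if ss_tot else float("nan")
    return mae, rmse, r2


# Hypothetical bilirubin values in mg/dL, for illustration only
actual = [0.7, 1.0, 3.0, 7.3]
predicted = [1.0, 1.0, 2.0, 5.3]
mae, rmse, r2 = regression_metrics(actual, predicted)
print(f"MAE={mae:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")
```

Because RMSE squares the errors before averaging, it is never smaller than MAE, and a wide gap between the two points to a few large misses, such as the high-bilirubin outliers noted in the analysis section.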