├── README.md
└── Python_INT557.py

/README.md:
--------------------------------------------------------------------------------
# 🌍 Global Health Data Analysis and Prediction

This project explores the **Life Expectancy Dataset** to identify the key factors that influence life expectancy across the globe. It covers data preprocessing, exploratory data analysis (EDA), correlation studies, visualization, and machine learning models for regression and classification.

---

## 📁 Dataset

**Source**: [Kaggle Life Expectancy Data](https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who)

**Attributes**:
- Country, Year, Status (Developed/Developing)
- Life Expectancy, Adult Mortality, BMI, GDP, Schooling, Immunization stats (Polio, Diphtheria), Alcohol consumption, and more.

---

## 🔍 Objectives

- Handle missing values and clean the dataset.
- Perform EDA to extract trends and insights.
- Visualize key factors affecting life expectancy.
- Apply machine learning models to predict life expectancy (regression).
- Classify records as high or low life expectancy (binary classification).

---

## 🧪 Libraries Used

- `pandas`, `numpy` – Data manipulation
- `matplotlib`, `seaborn` – Visualization
- `dtale` – Interactive data exploration
- `scikit-learn` – ML modeling and preprocessing

---

## 📊 Exploratory Data Analysis

Visualizations and key insights:

1. **Distribution of Life Expectancy**: Approximately normal, with a peak around 70 years.
2. **Developed vs Developing**: Developed countries show significantly higher life expectancy.
3. **Top 10 Countries**: Bar charts of the countries with the highest and lowest average life expectancy.
4. **Correlation Heatmap**: Life expectancy correlates positively with schooling and BMI, and negatively with adult mortality.
5. **Trends Over Time**: Life expectancy generally increases over the years.
6. **Scatterplots**: Relationships between life expectancy and Schooling, GDP, BMI, and Immunization (Polio).

---

## ⚙️ Data Preprocessing

- Filled missing values with column means and with per-country (group-wise) means.
- Normalized features using `StandardScaler`.

---

## 🧠 Machine Learning

### Regression Models

| Model                          | R² Score | Mean Squared Error | Mean Absolute Error |
|--------------------------------|----------|--------------------|---------------------|
| Linear Regression              | ~0.88    | ~5.1               | ~1.8                |
| Support Vector Regressor (SVR) | ~0.86    | ~6.0               | ~1.9                |

🏆 **Best Model**: Linear Regression

### Classification Model

- **Target**: Binary classification – High Life Expectancy (>70 years) vs Low (≤70 years)
- **Model**: K-Nearest Neighbors (KNN)
- **Accuracy**: ~0.89
- **Metrics**: Confusion Matrix, Precision, Recall, F1-score

---

## 🧾 How to Run

1. Clone this repository
2. Install the dependencies:
```bash
pip install pandas seaborn matplotlib dtale scikit-learn
```

--------------------------------------------------------------------------------
/Python_INT557.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
"""
Created on Sat Apr 12 02:08:56 2025

@author: shiva
"""
import pandas as pd
import matplotlib.pyplot as plt
import dtale as dt
import seaborn as sns

# Load the dataset
df = pd.read_csv("C:\\Users\\shiva\\OneDrive\\Desktop\\Life Expectancy Data.csv")

# Open the interactive D-Tale viewer in the browser
data = dt.show(df)
data.open_browser()

# About the data
print("Initial shape:", df.shape)
df.info()  # info() prints its summary directly and returns None
print(df.describe())
print("\nMissing values per column:\n", df.isnull().sum())

# Remove all spaces from column names, so 'Adult Mortality' would become 'AdultMortality'
df.columns = df.columns.str.replace(' ', '')

# Fill these columns with their overall column mean
df.fillna({
    'Life_expectancy': df['Life_expectancy'].mean(),
    'AdultMortality': df['AdultMortality'].mean(),
    'BMI': df['BMI'].mean(),
    'Polio': df['Polio'].mean(),
    'Diphtheria': df['Diphtheria'].mean(),
    'thinness1-19years': df['thinness1-19years'].mean(),
    'thinness5-9years': df['thinness5-9years'].mean()}, inplace=True)

# Fill these columns with the mean of the same country
cols_to_fill_with_mean = [
    'Alcohol', 'Hepatitis_B', 'Total_expenditure',
    'GDP', 'Population', 'Income_composition_of_resources', 'Schooling'
]

for col in cols_to_fill_with_mean:
    df[col] = df.groupby('Country')[col].transform(lambda x: x.fillna(x.mean()))
    # A country with no observed values has a NaN mean, so fall back to the overall mean
    df[col] = df[col].fillna(df[col].mean())

# Distribution of Life Expectancy
sns.histplot(df['Life_expectancy'], kde=True, color='red')
plt.title('Distribution of Life Expectancy')
plt.xlabel('Life Expectancy')
plt.show()  # Insight: Life expectancy is approximately normally distributed with a peak around 70 years.
# Life Expectancy by Status (Developed vs Developing)
sns.boxplot(x='Status', y='Life_expectancy', data=df)
plt.title('Life Expectancy by Development Status')
plt.show()  # Insight: Developed countries have significantly higher life expectancy than developing ones.

# 10 Countries with Lowest Average Life Expectancy
bottom_countries = df.groupby('Country')['Life_expectancy'].mean().sort_values(ascending=True).head(10)
bottom_countries.plot(kind='bar', color='seagreen')
plt.title('10 Countries with Lowest Average Life Expectancy')
plt.ylabel('Life Expectancy')
plt.show()

# Correlation
cols = [
    'Life_expectancy', 'Alcohol', 'AdultMortality', 'Population',
    'Income_composition_of_resources', 'Schooling', 'BMI'
]
corr = df[cols].corr()  # Calculate the correlation matrix
sns.heatmap(corr, cmap='coolwarm', annot=True, fmt='.2f')
plt.title('Correlation Heatmap')
plt.show()

# 4. Life Expectancy vs Adult Mortality
sns.scatterplot(x='Life_expectancy', y='AdultMortality', hue='Status', data=df)
plt.title('Life Expectancy vs Adult Mortality')
plt.show()  # Insight: Higher adult mortality is strongly associated with lower life expectancy.

# 5. Trend of Life Expectancy Over the Years
sns.lineplot(x='Year', y='Life_expectancy', data=df)
plt.title('Trend of Life Expectancy Over the Years')
plt.show()  # Insight: Life expectancy has generally increased over time globally.

# 6. Top 10 Countries with Highest Life Expectancy
top_countries = df.groupby('Country')['Life_expectancy'].mean().sort_values(ascending=False).head(10)
top_countries.plot(kind='bar', color='seagreen')
plt.title('Top 10 Countries with Highest Average Life Expectancy')
plt.ylabel('Life Expectancy')
plt.show()
# Insight: Countries like Japan, Switzerland, and Australia consistently rank high in life expectancy.
# 7. Life Expectancy vs Schooling
sns.scatterplot(x='Schooling', y='Life_expectancy', hue='Status', data=df)
plt.title('Life Expectancy vs Average Schooling')
plt.show()  # Insight: More years of schooling are generally linked to higher life expectancy.

# 8. Life Expectancy vs GDP
sns.scatterplot(x='GDP', y='Life_expectancy', data=df)
plt.title('Life Expectancy vs GDP')
plt.show()  # Insight: GDP has a weak but positive correlation with life expectancy, more evident in developing countries.

# 9. Life Expectancy vs BMI
sns.scatterplot(x='BMI', y='Life_expectancy', data=df)
plt.title('Life Expectancy vs BMI')
plt.show()  # Insight: A BMI range of roughly 20-30 is associated with higher life expectancy.

# 10. Immunization Impact (Polio vs Life Expectancy)
sns.scatterplot(x='Polio', y='Life_expectancy', hue='Status', data=df)
plt.title('Polio Immunization vs Life Expectancy')
plt.show()  # Insight: Higher Polio immunization rates are generally associated with higher life expectancy.
# Apply models to the dataset
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.preprocessing import StandardScaler

# Step 1: Prepare the data (drop the target and the non-numeric columns)
X = df.drop(['Life_expectancy', 'Country', 'Status'], axis=1)
y = df['Life_expectancy']

# Step 2: Train-test split (split before scaling so the test set stays unseen)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3: Normalize the features, fitting the scaler on the training set only
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Step 4: Linear Regression
lr = LinearRegression()
lr.fit(X_train, y_train)
lr_pred = lr.predict(X_test)
lr_r2 = r2_score(y_test, lr_pred)

print("\n📘 Linear Regression Results:")
print("R² Score:", round(lr_r2, 4))
print("MSE:", round(mean_squared_error(y_test, lr_pred), 4))
print("MAE:", round(mean_absolute_error(y_test, lr_pred), 4))

# Step 5: SVR
svr = SVR(kernel='rbf')
svr.fit(X_train, y_train)
svr_pred = svr.predict(X_test)
svr_r2 = r2_score(y_test, svr_pred)

print("\n⚙️ SVR Results:")
print("R² Score:", round(svr_r2, 4))
print("MSE:", round(mean_squared_error(y_test, svr_pred), 4))
print("MAE:", round(mean_absolute_error(y_test, svr_pred), 4))

# Step 6: Final comparison by R² score
if lr_r2 > svr_r2:
    print(f"\n🏆 Best Model: Linear Regression with R² Score = {round(lr_r2, 4)}")
else:
    print(f"\n🏆 Best Model: SVR with R² Score = {round(svr_r2, 4)}")


# Classification
# Binary classification: High life expectancy (>70) vs Low (≤70)
df['Life_expectancy_class'] = (df['Life_expectancy'] > 70).astype(int)

X = df.drop(columns=['Life_expectancy', 'Life_expectancy_class', 'Country', 'Status'])
y = df['Life_expectancy_class']

# Stratify so both classes keep the same proportions in train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Fit the scaler on the training set only, then apply it to the test set
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

knn_clf = KNeighborsClassifier(n_neighbors=5)
knn_clf.fit(X_train, y_train)

y_pred = knn_clf.predict(X_test)

print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nAccuracy:", accuracy_score(y_test, y_pred))

--------------------------------------------------------------------------------
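The preprocessing and regression steps described in the README (per-country mean imputation, `StandardScaler`, Linear Regression) can also be combined in a compact form. The block below is a minimal sketch under stated assumptions, not the project's own code: the helper names `fill_with_group_mean` and `fit_scaled_regression` are hypothetical, and the `toy` table is synthetic data standing in for the real CSV. It shows two things: a fallback to the overall column mean for countries that have no observed value in a column, and a scikit-learn pipeline so that the scaler is fitted on the training rows only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def fill_with_group_mean(df, group_col, value_cols):
    """Fill NaNs with the per-group mean, then with the overall column mean.

    The second step covers groups whose values are all missing, where the
    per-group mean is itself NaN. (Hypothetical helper, for illustration.)
    """
    out = df.copy()
    for col in value_cols:
        group_means = out.groupby(group_col)[col].transform("mean")
        out[col] = out[col].fillna(group_means).fillna(out[col].mean())
    return out


def fit_scaled_regression(X, y, test_size=0.2, random_state=42):
    """Split first, then fit scaler + model on the training rows only."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    # The pipeline calls fit_transform on X_train and only transform on X_test
    model = make_pipeline(StandardScaler(), LinearRegression())
    model.fit(X_train, y_train)
    return model, r2_score(y_test, model.predict(X_test))


# Synthetic stand-in for the real CSV (values are made up for this sketch)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Country": np.repeat(["A", "B", "C"], 20),
    "Schooling": rng.normal(12, 2, 60),
    "GDP": rng.normal(5000, 1000, 60),
})
toy["Life_expectancy"] = 50 + 1.5 * toy["Schooling"] + rng.normal(0, 1, 60)
toy.loc[toy["Country"] == "C", "GDP"] = np.nan  # a country with no GDP data at all
toy.loc[0, "Schooling"] = np.nan                # a single missing value

clean = fill_with_group_mean(toy, "Country", ["Schooling", "GDP"])
model, r2 = fit_scaled_regression(clean[["Schooling", "GDP"]], clean["Life_expectancy"])
print("Remaining NaNs:", int(clean.isna().sum().sum()))
print("Test R^2:", round(r2, 3))
```

Wrapping the scaler and the model in one pipeline means the same object can be passed to `cross_val_score` or reused for SVR and KNN by swapping the final step, without refitting the scaler by hand.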