├── README.md
└── Python_INT557.py


/README.md:
--------------------------------------------------------------------------------
 1 | # 🌍 Global Health Data Analysis and Prediction
 2 | 
 3 | This project explores and analyzes the **Life Expectancy Dataset** to understand the key factors influencing life expectancy across the globe. It involves data preprocessing, exploratory data analysis (EDA), correlation studies, visualization, and machine learning modeling for regression and classification tasks.
 4 | 
 5 | ---
 6 | 
 7 | ## 📁 Dataset
 8 | 
 9 | **Source**: [Kaggle Life Expectancy Data](https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who)
10 | 
11 | **Attributes**:
12 | - Country, Year, Status (Developed/Developing)
13 | - Life Expectancy, Adult Mortality, BMI, GDP, Schooling, Immunization stats (Polio, Diphtheria), Alcohol consumption, and more.
14 | 
15 | ---
16 | 
17 | ## 🔍 Objectives
18 | 
19 | - Handle missing values and clean the dataset.
20 | - Perform EDA to extract trends and insights.
21 | - Visualize key factors affecting life expectancy.
22 | - Apply machine learning models to predict life expectancy (regression).
23 | - Classify countries based on life expectancy (binary classification).
24 | 
25 | ---
26 | 
27 | ## 🧪 Libraries Used
28 | 
29 | - `pandas`, `numpy` – Data manipulation
30 | - `matplotlib`, `seaborn` – Visualization
31 | - `dtale` – Interactive data exploration
32 | - `scikit-learn` – ML modeling and preprocessing
33 | 
34 | ---
35 | 
36 | ## 📊 Exploratory Data Analysis
37 | 
38 | Visualizations and key insights:
39 | 
40 | 1. **Distribution of Life Expectancy**: Approximately normal, peaking around 70 years.
41 | 2. **Developed vs Developing**: Developed countries show significantly higher life expectancy.
42 | 3. **Top and Bottom 10 Countries**: Bar charts of the 10 countries with the highest and the 10 with the lowest average life expectancy.
43 | 4. **Correlation Heatmap**: Life expectancy correlates positively with schooling and BMI, and negatively with adult mortality.
44 | 5. **Trends Over Time**: Life expectancy generally increases over time.
45 | 6. **Scatterplots**: Relationships between life expectancy and Adult Mortality, Schooling, GDP, BMI, and Polio immunization.
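
The correlation step can be illustrated with a minimal sketch. The values below are made up for illustration and are not taken from the dataset; the column names are assumed to match those used in the analysis script.

```python
import pandas as pd

# Made-up values for illustration; column names follow the analysis script.
toy = pd.DataFrame({
    "Life_expectancy": [55.0, 62.0, 70.0, 78.0, 81.0],
    "Schooling": [8.0, 10.0, 12.5, 14.0, 16.0],
    "AdultMortality": [310.0, 240.0, 160.0, 95.0, 70.0],
})

# Pearson correlation of every column with life expectancy,
# sorted so the strongest positive relationship comes first.
corr_with_target = (
    toy.corr()["Life_expectancy"]
    .drop("Life_expectancy")
    .sort_values(ascending=False)
)
print(corr_with_target)
```

Ranking features by their correlation with the target is a quick way to read the same information the heatmap shows, one column at a time.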
46 | 
47 | ---
48 | 
49 | ## ⚙️ Data Preprocessing
50 | 
51 | - Handled missing values using column means and group-wise means.
52 | - Standardized features (zero mean, unit variance) using `StandardScaler`.
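
The two imputation strategies and the scaling step can be sketched on a toy frame. This is an illustrative sketch under assumptions, not the project's exact code: the frame `toy` and its values are hypothetical, and the fallback to the overall mean for countries with no values at all is an added safeguard.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame; the real dataset has many more columns and rows.
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "C"],
    "GDP": [100.0, np.nan, 300.0, np.nan, np.nan, 50.0],
    "BMI": [20.0, 22.0, np.nan, 30.0, 28.0, 25.0],
})

# Column mean: every gap in BMI gets the overall BMI average.
toy["BMI"] = toy["BMI"].fillna(toy["BMI"].mean())

# Group-wise mean: gaps in GDP get that country's own average ...
overall_gdp_mean = toy["GDP"].mean()
toy["GDP"] = toy.groupby("Country")["GDP"].transform(lambda s: s.fillna(s.mean()))
# ... and a country with no GDP values at all (B here) falls back to the overall mean.
toy["GDP"] = toy["GDP"].fillna(overall_gdp_mean)

# Standardize: each column ends up with mean 0 and unit variance.
scaled = StandardScaler().fit_transform(toy[["GDP", "BMI"]])
```

Group-wise means keep an imputed value close to the country's own level, while the fallback ensures no `NaN` reaches the models, which do not accept missing values.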
53 | 
54 | ---
55 | 
56 | ## 🧠 Machine Learning
57 | 
58 | ### Regression Models
59 | 
60 | | Model                          | R² Score | Mean Squared Error (years²) | Mean Absolute Error (years) |
61 | |--------------------------------|----------|-----------------------------|-----------------------------|
62 | | Linear Regression              | ~0.88    | ~5.1                        | ~1.8                        |
63 | | Support Vector Regressor (SVR) | ~0.86    | ~6.0                        | ~1.9                        |
64 | 
65 | 🏆 **Best Model**: Linear Regression
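
For readers unfamiliar with the three metrics in the table, the sketch below computes them from their definitions. The helper `regression_metrics` and the sample values are hypothetical and only illustrate the formulas; the project itself uses the `scikit-learn` metric functions.

```python
def regression_metrics(y_true, y_pred):
    """Compute R², MSE, and MAE from their textbook definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n          # mean of squared errors
    mae = sum(abs(e) for e in errors) / n         # mean of absolute errors
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)           # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return {"r2": r2, "mse": mse, "mae": mae}

# Made-up life expectancies in years, not values from the dataset.
actual = [60.0, 70.0, 80.0]
predicted = [62.0, 69.0, 79.0]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

Because MSE squares each error, it is expressed in squared years and penalizes large misses more heavily, while MAE stays in years and is easier to read directly.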
66 | 
67 | ### Classification Model
68 | 
69 | - **Target**: Binary classification – High Life Expectancy (>70 years) vs Low (≤70 years)
70 | - **Model**: K-Nearest Neighbors (KNN)
71 | - **Accuracy**: ~0.89
72 | - **Metrics**: Confusion Matrix, Precision, Recall, F1-score
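
The labeling rule and the reported metrics can be sketched in plain Python. The helper `binary_report`, the sample life expectancies, and the predictions are all hypothetical; they only show how the 70-year threshold produces labels and how the metrics follow from the confusion matrix.

```python
def binary_report(y_true, y_pred):
    """Confusion matrix and derived metrics for labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        # Rows are true classes, columns are predicted classes (0 first).
        "confusion_matrix": [[tn, fp], [fn, tp]],
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

THRESHOLD = 70  # years; strictly greater than 70 counts as "high"

life_expectancy = [65.0, 72.5, 70.0, 81.0, 58.3, 74.1]   # made-up values
y_true = [int(v > THRESHOLD) for v in life_expectancy]    # exactly 70.0 is "low"
y_pred = [0, 1, 1, 1, 0, 0]                               # hypothetical classifier output
report = binary_report(y_true, y_pred)
print(report)
```

Accuracy alone can hide an imbalance between the two classes, which is why precision, recall, and F1-score are reported alongside it.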
73 | 
74 | ---
75 | 
76 | ## 🧾 How to Run
77 | 
78 | 1. Clone this repository
79 | 2. Install the dependencies:
80 |    ```bash
81 |    pip install pandas seaborn matplotlib dtale scikit-learn
82 |    ```


--------------------------------------------------------------------------------
/Python_INT557.py:
--------------------------------------------------------------------------------
  1 | # -*- coding: utf-8 -*-
  2 | """
  3 | Created on Sat Apr 12 02:08:56 2025
  4 | 
  5 | @author: shiva
  6 | """
  7 | import pandas as pd 
  8 | import matplotlib.pyplot as plt
  9 | import dtale as dt
 10 | import seaborn as sns
 11 | 
 12 | # Load the dataset (update this path to point to your local copy of the CSV)
 13 | df = pd.read_csv("C:\\Users\\shiva\\OneDrive\\Desktop\\Life Expectancy Data.csv")
 14 | data = dt.show(df)
 15 | data.open_browser()
 16 | 
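 17 | # Overview of the data
 18 | print("Initial shape:", df.shape)
 19 | df.info()  # prints its summary directly and returns None, so it is not wrapped in print()
 20 | 
 21 | # describe() returns a DataFrame, so it must be printed explicitly in a script
 22 | print(df.describe())
 23 | print("\nMissing values per column:\n", df.isnull().sum())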
 24 | 
 25 | df.columns = df.columns.str.replace(' ', '')
 26 | df.fillna({
 27 |     'Life_expectancy':df['Life_expectancy'].mean(),
 28 |     'AdultMortality':df['AdultMortality'].mean(),
 29 |     'BMI':df['BMI'].mean(),
 30 |     'Polio':df['Polio'].mean(),
 31 |     'Diphtheria':df['Diphtheria'].mean(),
 32 |     'thinness1-19years':df['thinness1-19years'].mean(),
 33 |     'thinness5-9years':df['thinness5-9years'].mean()}, inplace=True)
 34 | 
 35 | cols_to_fill_with_mean = [
 36 |     'Alcohol', 'Hepatitis_B', 'Total_expenditure', 
 37 |     'GDP', 'Population', 'Income_composition_of_resources', 'Schooling'
 38 | ]
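 39 | # Fill with the per-country mean first; fall back to the column mean when a country has no values at all
 40 | for col in cols_to_fill_with_mean:
 41 |     df[col] = df.groupby('Country')[col].transform(lambda x: x.fillna(x.mean())).fillna(df[col].mean())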
 42 | 
 43 | # 1. Distribution of Life Expectancy
 44 | sns.histplot(df['Life_expectancy'], kde=True, color='red')
 45 | plt.title('Distribution of Life Expectancy')
 46 | plt.xlabel('Life Expectancy')
 47 | plt.show()  # Insight: Life expectancy is approximately normally distributed, with a peak around 70 years.
 48 | 
 49 | # 2. Life Expectancy by Status (Developed vs Developing)
 50 | 
 51 | sns.boxplot(x='Status', y='Life_expectancy', data=df)
 52 | plt.title('Life Expectancy by Development Status')
 53 | plt.show()  # Insight: Developed countries have significantly higher life expectancy than developing ones.
 54 | # 3. 10 Countries with Lowest Average Life Expectancy
 55 | bottom_countries = df.groupby('Country')['Life_expectancy'].mean().sort_values(ascending=True).head(10)
 56 | bottom_countries.plot(kind='bar', color='seagreen')
 57 | plt.title('10 Countries with Lowest Average Life Expectancy')
 58 | plt.ylabel('Life Expectancy')
 59 | plt.show()
 60 | # Correlation heatmap of selected numeric columns
 61 | cols = [
 62 |     'Life_expectancy', 'Alcohol','AdultMortality', 'Population',
 63 |     'Income_composition_of_resources', 'Schooling','BMI'
 64 | ]
 65 | corr = df[cols].corr()  # Calculate correlation matrix
 66 | sns.heatmap(corr, cmap='coolwarm', annot=True, fmt='.2f')
 67 | plt.title('Correlation Heatmap')
 68 | plt.show()
 69 | # 4. Life Expectancy vs Adult Mortality
 70 | sns.scatterplot(x='Life_expectancy', y='AdultMortality', hue='Status', data=df)
 71 | plt.title('Life Expectancy vs Adult Mortality')
 72 | plt.show() #Insight: Higher adult mortality is strongly associated with lower life expectancy.
 73 | 
 74 | #5  Trend of Life Expectancy Over the Years
 75 | sns.lineplot(x='Year', y='Life_expectancy', data=df)
 76 | plt.title('Trend of Life Expectancy Over the Years')
 77 | plt.show() #Insight: Life expectancy has generally increased over time globally.
 78 |  
 79 | #6  Top 10 Countries with Highest Life Expectancy
 80 | top_countries = df.groupby('Country')['Life_expectancy'].mean().sort_values(ascending=False).head(10)
 81 | top_countries.plot(kind='bar', color='seagreen')
 82 | plt.title('Top 10 Countries with Highest Average Life Expectancy')
 83 | plt.ylabel('Life Expectancy')
 84 | plt.show()
 85 | #Insight: Countries like Japan, Switzerland, and Australia consistently rank high in life expectancy.
 86 | 
 87 | #7  Life Expectancy vs Schooling
 88 | sns.scatterplot(x='Schooling', y='Life_expectancy', hue='Status', data=df)
 89 | plt.title('Life Expectancy vs Average Schooling')
 90 | plt.show() #Insight: More years of schooling are generally linked to higher life expectancy.
 91 | 
 92 | # 8. Life Expectancy vs GDP (colored by Status so the two groups can be compared)
 93 | sns.scatterplot(x='GDP', y='Life_expectancy', hue='Status', data=df)
 94 | plt.title('Life Expectancy vs GDP')
 95 | plt.show()  # Insight: GDP has a weak but positive correlation with life expectancy, more evident in developing countries.
 96 | 
 97 | # 9. Life Expectancy vs BMI
 98 | sns.scatterplot(x='BMI', y='Life_expectancy', data=df)
 99 | plt.title('Life Expectancy vs BMI')
100 | plt.show()#Insight: There is a healthy BMI range (~20-30) that associates with higher life expectancy.
101 | 
102 | # 10. Immunization Impact (Polio vs Life Expectancy)
103 | sns.scatterplot(x='Polio', y='Life_expectancy', hue='Status', data=df)
104 | plt.title('Polio Immunization vs Life Expectancy')
105 | plt.show()  # Insight: Higher Polio immunization rates are generally associated with higher life expectancy.
106 | 
107 | 
108 | # apply model to the dataset
109 | from sklearn.model_selection import train_test_split
110 | from sklearn.linear_model import LinearRegression
111 | from sklearn.svm import SVR
112 | from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
113 | from sklearn.preprocessing import StandardScaler
114 | 
115 | # Step 1: Prepare the data
116 | X = df.drop(['Life_expectancy', 'Country', 'Status'], axis=1)
117 | y = df['Life_expectancy']
118 | 
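119 | # Step 2: Train-test split (split first so the scaler never sees the test rows)
120 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
121 | 
122 | # Step 3: Standardize the features, fitting the scaler on the training set only
123 | scaler = StandardScaler()
124 | X_train, X_test = scaler.fit_transform(X_train), scaler.transform(X_test)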
125 | 
126 | # Step 4: Linear Regression
127 | lr = LinearRegression()
128 | lr.fit(X_train, y_train)
129 | lr_pred = lr.predict(X_test)
130 | 
131 | print("\n📘 Linear Regression Results:")
132 | print("R² Score:", round(r2_score(y_test, lr_pred), 4))
133 | print("MSE:", round(mean_squared_error(y_test, lr_pred), 4))
134 | print("MAE:", round(mean_absolute_error(y_test, lr_pred), 4))
135 | 
136 | # Step 5: SVR
137 | svr = SVR(kernel='rbf')
138 | svr.fit(X_train, y_train)
139 | svr_pred = svr.predict(X_test)
140 | 
141 | print("\n⚙️ SVR Results:")
142 | print("R² Score:", round(r2_score(y_test, svr_pred), 4))
143 | print("MSE:", round(mean_squared_error(y_test, svr_pred), 4))
144 | print("MAE:", round(mean_absolute_error(y_test, svr_pred), 4))
145 | 
146 | # Step 6: Final Comparison
147 | lr_r2 = r2_score(y_test, lr_pred)
148 | svr_r2 = r2_score(y_test, svr_pred)
149 | 
150 | if lr_r2 > svr_r2:
151 |     print(f"\n🏆 Best Model: Linear Regression with R² Score = {round(lr_r2, 4)}")
152 | else:
153 |     print(f"\n🏆 Best Model: SVR with R² Score = {round(svr_r2, 4)}")
154 | 
155 | 
156 | # classification
157 | # Binary classification: High life expectancy (>70) vs Low (≤70)
158 | df['Life_expectancy_class'] = (df['Life_expectancy'] > 70).astype(int)
159 | 
160 | X = df.drop(columns=['Life_expectancy', 'Life_expectancy_class','Country','Status'])
161 | y = df['Life_expectancy_class']
162 | 
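163 | # Split first, then fit the scaler on the training rows only to avoid leaking test statistics
164 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
165 | scaler = StandardScaler()
166 | X_train, X_test = scaler.fit_transform(X_train), scaler.transform(X_test)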
167 | 
168 | from sklearn.neighbors import KNeighborsClassifier
169 | from sklearn.metrics import classification_report, confusion_matrix,accuracy_score
170 | knn_clf = KNeighborsClassifier(n_neighbors=5)
171 | knn_clf.fit(X_train, y_train)
172 | 
173 | y_pred = knn_clf.predict(X_test)
174 | 
175 | print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
176 | print("\nClassification Report:\n", classification_report(y_test, y_pred))
177 | print("\nAccuracy:", round(accuracy_score(y_test, y_pred), 4))
178 | 
179 | 


--------------------------------------------------------------------------------