├── README.md ├── file └── climate_disease_spread_model.py /README.md: -------------------------------------------------------------------------------- 1 | # 🌍 Climate-Driven-Disease-Spread-Prediction-Model 2 | 3 | **Author:** [Otutu Anslem](https://github.com/Otutu11) 4 | **Repository:** [@Otutu11](https://github.com/Otutu11) 5 | 6 | --- 7 | 8 | ## 📌 Overview 9 | 10 | This project presents a **synthetic end-to-end machine learning pipeline** that predicts **disease spread driven by climatic and environmental factors**. 11 | It combines **regression** to estimate next-week disease incidence and **classification** to detect outbreak risk, using multi-regional, weekly panel data. 12 | 13 | --- 14 | 15 | ## ⚙️ Features 16 | 17 | - 📅 Generates **synthetic dataset** (>3,000 rows) with: 18 | - Climate: temperature, humidity, rainfall, wind, NDVI 19 | - Environmental: sanitation, health access, population density 20 | - Behavioural: mobility, stagnant water index 21 | - Health: weekly disease incidence and outbreak labels 22 | - 🤖 Trains: 23 | - `RandomForestRegressor` → Predict next-week incidence (per 100k) 24 | - `GradientBoostingClassifier` → Detect outbreak risk (binary) 25 | - 📊 Outputs: 26 | - Dataset (`.csv`) 27 | - Trained models (`.joblib`) 28 | - Performance metrics, plots, and reports 29 | 30 | --- 31 | 32 | ## 📁 Project Structure 33 | 34 | . 35 | ├── climate_disease_spread_model.py 36 | ├── outputs/ 37 | │ ├── synthetic_climate_disease_dataset.csv 38 | │ ├── incidence_regressor.joblib 39 | │ ├── outbreak_classifier.joblib 40 | │ ├── regression_parity.png 41 | │ ├── regression_feature_importance.png 42 | │ ├── classification_roc.png 43 | │ ├── classification_confusion.png 44 | │ ├── classification_feature_importance.png 45 | │ ├── classification_report.txt 46 | │ ├── cv_results.txt 47 | │ └── README_Climate_Disease_Model.txt 48 | └── README.md 49 | 50 | yaml 51 | Copy code 52 | 53 | --- 54 | 55 | ## 🚀 Getting Started 56 | 57 | ### 1. Clone the repository 58 | ```bash 59 | git clone https://github.com/Otutu11/Climate-Driven-Disease-Spread-Prediction-Model.git 60 | cd Climate-Driven-Disease-Spread-Prediction-Model 61 | 2. Install dependencies 62 | bash 63 | Copy code 64 | pip install numpy pandas matplotlib scikit-learn joblib 65 | 3. Run the script 66 | bash 67 | Copy code 68 | python climate_disease_spread_model.py 69 | 📈 Usage Example 70 | python 71 | Copy code 72 | from joblib import load 73 | import pandas as pd 74 | 75 | # Load dataset 76 | df = pd.read_csv("outputs/synthetic_climate_disease_dataset.csv") 77 | 78 | # Prepare features 79 | features = [ 80 | "temp_c","humidity","rain_mm","wind_ms","ndvi", 81 | "pop_density","sanitation","vacc_coverage","health_access", 82 | "mobility_idx","stagnant_water_idx","incidence_lag1","incidence_ma3","rain_mm_ma3" 83 | ] 84 | X = df[features].values 85 | 86 | # Load models 87 | reg = load("outputs/incidence_regressor.joblib") 88 | clf = load("outputs/outbreak_classifier.joblib") 89 | 90 | # Predictions 91 | pred_incidence = reg.predict(X) 92 | pred_outbreak_prob = clf.predict_proba(X)[:, 1] 93 | 📌 Notes 94 | This project uses synthetic data purely for demonstration. 95 | 96 | Not intended for real epidemiological decision-making. 97 | 98 | Extendable to real datasets by swapping the synthetic generator section with actual data ingestion. 99 | 100 | 📜 License 101 | This project is open-source under the MIT License. 102 | Feel free to use, modify, and distribute with attribution. 103 | 104 | Maintained by: Otutu Anslem 105 | 106 | 📧 Contributions and feedback are welcome! 107 | -------------------------------------------------------------------------------- /file: -------------------------------------------------------------------------------- 1 | # file: climate_disease_spread_model.py 2 | # Purpose: Synthetic end-to-end demo for "Climate-Driven-Disease-Spread-Prediction-Model" 3 | # Author: You :) 4 | # Run: python climate_disease_spread_model.py 5 | # 6 | # What it does: 7 | # - Generates a synthetic dataset (>100 rows) linking climate, environment, mobility, and health covariates 8 | # to weekly disease incidence and outbreak occurrence. 9 | # - Trains: 10 | # * Regression (RandomForestRegressor) to predict next-week incidence per 100k 11 | # * Classification (GradientBoostingClassifier) to flag outbreak risk (binary) 12 | # - Saves: CSV dataset, trained models, plots, and text reports under outputs/ 13 | # 14 | # Notes: 15 | # - Purely synthetic; do not use for real epidemiological inference. 16 | # - Minimal deps: numpy, pandas, matplotlib, scikit-learn, joblib 17 | 18 | import os 19 | import numpy as np 20 | import pandas as pd 21 | import matplotlib.pyplot as plt 22 | 23 | from sklearn.model_selection import TimeSeriesSplit, cross_val_score 24 | from sklearn.ensemble import RandomForestRegressor, GradientBoostingClassifier, RandomForestClassifier 25 | from sklearn.metrics import ( 26 | mean_absolute_error, mean_squared_error, r2_score, 27 | roc_auc_score, roc_curve, confusion_matrix, classification_report 28 | ) 29 | from joblib import dump 30 | 31 | # ----------------------------- 32 | # 0) Setup 33 | # ----------------------------- 34 | np.random.seed(7) 35 | OUTDIR = "outputs" 36 | os.makedirs(OUTDIR, exist_ok=True) 37 | 38 | # ----------------------------- 39 | # 1) Generate synthetic time-series panel data 40 | # ----------------------------- 41 | # Settings 42 | N_REGIONS = 25 43 | N_WEEKS_PER_REGION = 120 # total rows = 3000 (>100) 44 | N = N_REGIONS * N_WEEKS_PER_REGION 45 | 46 | regions = np.repeat(np.arange(N_REGIONS), N_WEEKS_PER_REGION) 47 | week = np.tile(np.arange(N_WEEKS_PER_REGION), N_REGIONS) 48 | 49 | # Seasonality helpers 50 | # Weekly seasonality (e.g., vector-borne peaks in rainy season) 51 | season = 2 * np.pi * week / 52.0 52 | rainy_season_factor = 0.5 + 0.5 * (np.sin(season - 0.5) > 0).astype(float) 53 | 54 | # Climate variables (synthetic but plausible ranges) 55 | temp_c = 22 + 8*np.sin(season) + np.random.normal(0, 1.2, N) # ~14..30 seasonal 56 | humidity = np.clip(0.55 + 0.25*np.sin(season-0.8) + np.random.normal(0, 0.07, N), 0, 1) 57 | rain_mm = np.clip(40*rainy_season_factor + np.random.gamma(2, 5, N), 0, 300) 58 | wind_ms = np.clip(np.random.normal(3.5, 1.2, N), 0, 15) 59 | ndvi = np.clip(0.35 + 0.15*np.sin(season-1.2) + np.random.normal(0, 0.05, N), 0, 1) 60 | 61 | # Environmental/socioeconomic covariates (region-specific base + noise) 62 | pop_density_base = np.random.uniform(200, 2000, N_REGIONS) 63 | sanitation_base = np.clip(np.random.normal(0.6, 0.12, N_REGIONS), 0, 1) 64 | vacc_coverage_base = np.clip(np.random.normal(0.55, 0.10, N_REGIONS), 0, 1) 65 | health_access_base = np.clip(np.random.normal(0.5, 0.1, N_REGIONS), 0, 1) 66 | 67 | pop_density = pop_density_base[regions] + np.random.normal(0, 40, N) 68 | sanitation = np.clip(sanitation_base[regions] + np.random.normal(0, 0.03, N), 0, 1) 69 | vacc_coverage = np.clip(vacc_coverage_base[regions] + np.random.normal(0, 0.03, N), 0, 1) 70 | health_access = np.clip(health_access_base[regions] + np.random.normal(0, 0.03, N), 0, 1) 71 | 72 | # Mobility & exposure 73 | mobility_idx = np.clip(0.5 + 0.3*np.sin(season+0.5) + np.random.normal(0, 0.1, N), 0, 1) 74 | stagnant_water_idx = np.clip(0.3 + 0.5*rainy_season_factor + np.random.normal(0, 0.08, N), 0, 1) 75 | 76 | # Baseline regional risk (latent) 77 | region_risk = np.random.normal(0, 0.4, N_REGIONS)[regions] 78 | 79 | # True latent incidence (per 100k) before noise (use a simple structural formula) 80 | # Higher with temp, humidity, rainfall, stagnant water, mobility, density; lower with vacc & sanitation. 81 | latent_incidence = ( 82 | 5 83 | + 0.6*(temp_c - 22) 84 | + 1.2*(humidity - 0.5) 85 | + 0.03*rain_mm 86 | + 4.0*stagnant_water_idx 87 | + 3.0*mobility_idx 88 | + 0.004*pop_density 89 | - 3.5*vacc_coverage 90 | - 2.0*sanitation 91 | - 1.0*health_access 92 | + 0.8*region_risk 93 | ) 94 | 95 | # Add stochasticity and enforce non-negativity 96 | incidence = np.clip(latent_incidence + np.random.normal(0, 1.5, N), 0, None) 97 | 98 | # Create a lag feature for autoregression (last week's incidence by region) 99 | df = pd.DataFrame({ 100 | "region": regions, 101 | "week": week, 102 | "temp_c": temp_c, 103 | "humidity": humidity, 104 | "rain_mm": rain_mm, 105 | "wind_ms": wind_ms, 106 | "ndvi": ndvi, 107 | "pop_density": pop_density, 108 | "sanitation": sanitation, 109 | "vacc_coverage": vacc_coverage, 110 | "health_access": health_access, 111 | "mobility_idx": mobility_idx, 112 | "stagnant_water_idx": stagnant_water_idx, 113 | "incidence": incidence 114 | }) 115 | 116 | df["incidence_lag1"] = df.groupby("region")["incidence"].shift(1) 117 | df["incidence_ma3"] = df.groupby("region")["incidence"].rolling(3).mean().reset_index(level=0, drop=True) 118 | df["rain_mm_ma3"] = df.groupby("region")["rain_mm"].rolling(3).mean().reset_index(level=0, drop=True) 119 | 120 | # Fill first-lag rows with reasonable defaults 121 | for col in ["incidence_lag1", "incidence_ma3", "rain_mm_ma3"]: 122 | df[col] = df[col].fillna(df[col].median()) 123 | 124 | # Target variables 125 | # Next-week incidence (regression) — shift by -1 week within region 126 | df["incidence_next"] = df.groupby("region")["incidence"].shift(-1) 127 | # For last week per region, fill with same-week value to keep dataset size stable 128 | last_mask = df["incidence_next"].isna() 129 | df.loc[last_mask, "incidence_next"] = df.loc[last_mask, "incidence"] 130 | 131 | # Outbreak classification: top 30% of incidence_next considered an outbreak 132 | thresh = np.percentile(df["incidence_next"], 70) 133 | df["outbreak"] = (df["incidence_next"] >= thresh).astype(int) 134 | 135 | # Save CSV 136 | csv_path = os.path.join(OUTDIR, "synthetic_climate_disease_dataset.csv") 137 | df.to_csv(csv_path, index=False) 138 | 139 | # ----------------------------- 140 | # 2) Train-test split (time-aware by region) 141 | # ----------------------------- 142 | # We'll split using the last 20% of weeks as test within each region to mimic forecasting. 143 | def time_split_mask(frame, holdout_frac=0.2): 144 | test_mask = np.zeros(len(frame), dtype=bool) 145 | for r in frame["region"].unique(): 146 | idx = frame.index[frame["region"] == r] 147 | n = len(idx) 148 | split_point = int((1 - holdout_frac) * n) 149 | test_idx = idx[split_point:] 150 | test_mask[test_idx] = True 151 | return test_mask 152 | 153 | test_mask = time_split_mask(df, holdout_frac=0.2) 154 | 155 | features = [ 156 | "temp_c","humidity","rain_mm","wind_ms","ndvi", 157 | "pop_density","sanitation","vacc_coverage","health_access", 158 | "mobility_idx","stagnant_water_idx","incidence_lag1","incidence_ma3","rain_mm_ma3" 159 | ] 160 | 161 | X = df[features].values 162 | y_reg = df["incidence_next"].values 163 | y_cls = df["outbreak"].values 164 | 165 | Xtr_r, Xte_r = X[~test_mask], X[test_mask] 166 | ytr_r, yte_r = y_reg[~test_mask], y_reg[test_mask] 167 | 168 | Xtr_c, Xte_c = X[~test_mask], X[test_mask] 169 | ytr_c, yte_c = y_cls[~test_mask], y_cls[test_mask] 170 | 171 | # ----------------------------- 172 | # 3) Regression model 173 | # ----------------------------- 174 | reg = RandomForestRegressor(n_estimators=500, random_state=7, n_jobs=-1) 175 | reg.fit(Xtr_r, ytr_r) 176 | yp_r = reg.predict(Xte_r) 177 | 178 | mae = mean_absolute_error(yte_r, yp_r) 179 | rmse = mean_squared_error(yte_r, yp_r, squared=False) 180 | r2 = r2_score(yte_r, yp_r) 181 | 182 | # Parity plot 183 | plt.figure() 184 | plt.scatter(yte_r, yp_r, alpha=0.5) 185 | mn = min(yte_r.min(), yp_r.min()); mx = max(yte_r.max(), yp_r.max()) 186 | plt.plot([mn, mx], [mn, mx]) 187 | plt.xlabel("True next-week incidence (per 100k)") 188 | plt.ylabel("Predicted next-week incidence (per 100k)") 189 | plt.title(f"Regression Parity | MAE={mae:.2f} RMSE={rmse:.2f} R2={r2:.2f}") 190 | plt.tight_layout() 191 | plt.savefig(os.path.join(OUTDIR, "regression_parity.png")); plt.close() 192 | 193 | # Feature importance (regression) 194 | imp_r = pd.Series(reg.feature_importances_, index=features).sort_values() 195 | plt.figure() 196 | plt.barh(imp_r.index, imp_r.values) 197 | plt.xlabel("Importance") 198 | plt.title("Regression Feature Importances") 199 | plt.tight_layout() 200 | plt.savefig(os.path.join(OUTDIR, "regression_feature_importance.png")); plt.close() 201 | 202 | # ----------------------------- 203 | # 4) Classification model 204 | # ----------------------------- 205 | clf = GradientBoostingClassifier(random_state=7) 206 | clf.fit(Xtr_c, ytr_c) 207 | proba = clf.predict_proba(Xte_c)[:, 1] 208 | yp_c = (proba >= 0.5).astype(int) 209 | 210 | acc = (yp_c == yte_c).mean() 211 | auc = roc_auc_score(yte_c, proba) 212 | 213 | # ROC curve 214 | fpr, tpr, thr = roc_curve(yte_c, proba) 215 | plt.figure() 216 | plt.plot(fpr, tpr); plt.plot([0,1],[0,1]) 217 | plt.xlabel("False Positive Rate"); plt.ylabel("True Positive Rate") 218 | plt.title(f"ROC Curve (AUC={auc:.3f})") 219 | plt.tight_layout() 220 | plt.savefig(os.path.join(OUTDIR, "classification_roc.png")); plt.close() 221 | 222 | # Confusion Matrix 223 | cm = confusion_matrix(yte_c, yp_c) 224 | plt.figure() 225 | im = plt.imshow(cm, interpolation="nearest") 226 | plt.title("Confusion Matrix") 227 | plt.xlabel("Predicted"); plt.ylabel("True") 228 | plt.xticks([0,1], ["No Outbreak","Outbreak"]) 229 | plt.yticks([0,1], ["No Outbreak","Outbreak"]) 230 | for (i,j), v in np.ndenumerate(cm): 231 | plt.text(j, i, int(v), ha="center", va="center") 232 | plt.colorbar(im, fraction=0.046, pad=0.04) 233 | plt.tight_layout() 234 | plt.savefig(os.path.join(OUTDIR, "classification_confusion.png")); plt.close() 235 | 236 | # Classification report 237 | report = classification_report(yte_c, yp_c, target_names=["No Outbreak","Outbreak"]) 238 | with open(os.path.join(OUTDIR, "classification_report.txt"), "w") as f: 239 | f.write(report) 240 | 241 | # Surrogate RF for feature importance 242 | rf_sur = RandomForestClassifier(n_estimators=500, random_state=7, n_jobs=-1) 243 | rf_sur.fit(Xtr_c, ytr_c) 244 | imp_c = pd.Series(rf_sur.feature_importances_, index=features).sort_values() 245 | plt.figure() 246 | plt.barh(imp_c.index, imp_c.values) 247 | plt.xlabel("Importance") 248 | plt.title("Classification Feature Importances (RF surrogate)") 249 | plt.tight_layout() 250 | plt.savefig(os.path.join(OUTDIR, "classification_feature_importance.png")); plt.close() 251 | 252 | # ----------------------------- 253 | # 5) Time-series cross-validation (optional, regression R^2) 254 | # ----------------------------- 255 | tscv = TimeSeriesSplit(n_splits=5) 256 | cv_r2 = cross_val_score(reg, X, y_reg, cv=tscv, scoring="r2") 257 | with open(os.path.join(OUTDIR, "cv_results.txt"), "w") as f: 258 | f.write(f"Regression R2 (5-fold TimeSeriesSplit): mean={cv_r2.mean():.3f}, std={cv_r2.std():.3f}\n") 259 | f.write(f"Fold scores: {np.round(cv_r2, 3)}\n") 260 | 261 | # ----------------------------- 262 | # 6) Save artifacts 263 | # ----------------------------- 264 | dump(reg, os.path.join(OUTDIR, "incidence_regressor.joblib")) 265 | dump(clf, os.path.join(OUTDIR, "outbreak_classifier.joblib")) 266 | 267 | with open(os.path.join(OUTDIR, "README_Climate_Disease_Model.txt"), "w") as f: 268 | f.write( 269 | "Climate-Driven-Disease-Spread-Prediction-Model (Synthetic Demo)\n\n" 270 | "Files:\n" 271 | "- synthetic_climate_disease_dataset.csv : synthetic weekly panel dataset across regions\n" 272 | "- incidence_regressor.joblib : RandomForestRegressor for next-week incidence\n" 273 | "- outbreak_classifier.joblib : GradientBoostingClassifier for outbreak detection\n" 274 | "- regression_parity.png, regression_feature_importance.png\n" 275 | "- classification_roc.png, classification_confusion.png, classification_feature_importance.png\n" 276 | "- classification_report.txt, cv_results.txt\n\n" 277 | "Quick usage:\n" 278 | "from joblib import load\n" 279 | "import pandas as pd\n" 280 | "df = pd.read_csv('outputs/synthetic_climate_disease_dataset.csv')\n" 281 | "features = ['temp_c','humidity','rain_mm','wind_ms','ndvi','pop_density','sanitation',\n" 282 | " 'vacc_coverage','health_access','mobility_idx','stagnant_water_idx',\n" 283 | " 'incidence_lag1','incidence_ma3','rain_mm_ma3']\n" 284 | "X = df[features].values\n" 285 | "reg = load('outputs/incidence_regressor.joblib'); y_pred = reg.predict(X)\n" 286 | "clf = load('outputs/outbreak_classifier.joblib'); p_out = clf.predict_proba(X)[:,1]\n" 287 | ) 288 | 289 | # ----------------------------- 290 | # 7) Print summary to console 291 | # ----------------------------- 292 | print("=== Regression ===") 293 | print(f"MAE={mae:.3f} RMSE={rmse:.3f} R2={r2:.3f}") 294 | print("\n=== Classification ===") 295 | print(f"Accuracy={acc:.3f} ROC-AUC={auc:.3f}") 296 | print("\nClassification report:\n", report) 297 | print(f"\nTimeSeriesSplit R2: mean={cv_r2.mean():.3f} ± {cv_r2.std():.3f}") 298 | print(f"\nArtifacts saved to: {OUTDIR}/") 299 | 300 | 301 | -------------------------------------------------------------------------------- /climate_disease_spread_model.py: -------------------------------------------------------------------------------- 1 | # file: climate_disease_spread_model.py 2 | # Purpose: Synthetic end-to-end demo for "Climate-Driven-Disease-Spread-Prediction-Model" 3 | # Author: You :) 4 | # Run: python climate_disease_spread_model.py 5 | # 6 | # What it does: 7 | # - Generates a synthetic dataset (>100 rows) linking climate, environment, mobility, and health covariates 8 | # to weekly disease incidence and outbreak occurrence. 9 | # - Trains: 10 | # * Regression (RandomForestRegressor) to predict next-week incidence per 100k 11 | # * Classification (GradientBoostingClassifier) to flag outbreak risk (binary) 12 | # - Saves: CSV dataset, trained models, plots, and text reports under outputs/ 13 | # 14 | # Notes: 15 | # - Purely synthetic; do not use for real epidemiological inference. 16 | # - Minimal deps: numpy, pandas, matplotlib, scikit-learn, joblib 17 | 18 | import os 19 | import numpy as np 20 | import pandas as pd 21 | import matplotlib.pyplot as plt 22 | 23 | from sklearn.model_selection import train_test_split, TimeSeriesSplit, cross_val_score 24 | from sklearn.ensemble import RandomForestRegressor, GradientBoostingClassifier, RandomForestClassifier 25 | from sklearn.metrics import ( 26 | mean_absolute_error, mean_squared_error, r2_score, 27 | roc_auc_score, roc_curve, confusion_matrix, classification_report 28 | ) 29 | from joblib import dump 30 | 31 | # ----------------------------- 32 | # 0) Setup 33 | # ----------------------------- 34 | np.random.seed(7) 35 | OUTDIR = "outputs" 36 | os.makedirs(OUTDIR, exist_ok=True) 37 | 38 | # ----------------------------- 39 | # 1) Generate synthetic time-series panel data 40 | # ----------------------------- 41 | # Settings 42 | N_REGIONS = 25 43 | N_WEEKS_PER_REGION = 120 # total rows = 3000 (>100) 44 | N = N_REGIONS * N_WEEKS_PER_REGION 45 | 46 | regions = np.repeat(np.arange(N_REGIONS), N_WEEKS_PER_REGION) 47 | week = np.tile(np.arange(N_WEEKS_PER_REGION), N_REGIONS) 48 | 49 | # Seasonality helpers 50 | # Weekly seasonality (e.g., vector-borne peaks in rainy season) 51 | season = 2 * np.pi * week / 52.0 52 | rainy_season_factor = 0.5 + 0.5 * (np.sin(season - 0.5) > 0).astype(float) 53 | 54 | # Climate variables (synthetic but plausible ranges) 55 | temp_c = 22 + 8*np.sin(season) + np.random.normal(0, 1.2, N) # 14..30 seasonal 56 | humidity = np.clip(0.55 + 0.25*np.sin(season-0.8) + np.random.normal(0, 0.07, N), 0, 1) 57 | rain_mm = np.clip(40*rainy_season_factor + np.random.gamma(2, 5, N), 0, 300) 58 | wind_ms = np.clip(np.random.normal(3.5, 1.2, N), 0, 15) 59 | ndvi = np.clip(0.35 + 0.15*np.sin(season-1.2) + np.random.normal(0, 0.05, N), 0, 1) 60 | 61 | # Environmental/socioeconomic covariates (region-specific base + noise) 62 | pop_density_base = np.random.uniform(200, 2000, N_REGIONS) 63 | sanitation_base = np.clip(np.random.normal(0.6, 0.12, N_REGIONS), 0, 1) 64 | vacc_coverage_base = np.clip(np.random.normal(0.55, 0.10, N_REGIONS), 0, 1) 65 | health_access_base = np.clip(np.random.normal(0.5, 0.1, N_REGIONS), 0, 1) 66 | 67 | pop_density = pop_density_base[regions] + np.random.normal(0, 40, N) 68 | sanitation = np.clip(sanitation_base[regions] + np.random.normal(0, 0.03, N), 0, 1) 69 | vacc_coverage = np.clip(vacc_coverage_base[regions] + np.random.normal(0, 0.03, N), 0, 1) 70 | health_access = np.clip(health_access_base[regions] + np.random.normal(0, 0.03, N), 0, 1) 71 | 72 | # Mobility & exposure 73 | mobility_idx = np.clip(0.5 + 0.3*np.sin(season+0.5) + np.random.normal(0, 0.1, N), 0, 1) 74 | stagnant_water_idx = np.clip(0.3 + 0.5*rainy_season_factor + np.random.normal(0, 0.08, N), 0, 1) 75 | 76 | # Baseline regional risk (latent) 77 | region_risk = np.random.normal(0, 0.4, N_REGIONS)[regions] 78 | 79 | # True latent incidence (per 100k) before noise (use a simple structural formula) 80 | # Higher with temp, humidity, rainfall, stagnant water, mobility, density; lower with vacc & sanitation. 81 | latent_incidence = ( 82 | 5 83 | + 0.6*(temp_c - 22) 84 | + 1.2*(humidity - 0.5) 85 | + 0.03*rain_mm 86 | + 4.0*stagnant_water_idx 87 | + 3.0*mobility_idx 88 | + 0.004*pop_density 89 | - 3.5*vacc_coverage 90 | - 2.0*sanitation 91 | - 1.0*health_access 92 | + 0.8*region_risk 93 | ) 94 | 95 | # Add stochasticity and enforce non-negativity 96 | incidence = np.clip(latent_incidence + np.random.normal(0, 1.5, N), 0, None) 97 | 98 | # Create a lag feature for autoregression (last week's incidence by region) 99 | df = pd.DataFrame({ 100 | "region": regions, 101 | "week": week, 102 | "temp_c": temp_c, 103 | "humidity": humidity, 104 | "rain_mm": rain_mm, 105 | "wind_ms": wind_ms, 106 | "ndvi": ndvi, 107 | "pop_density": pop_density, 108 | "sanitation": sanitation, 109 | "vacc_coverage": vacc_coverage, 110 | "health_access": health_access, 111 | "mobility_idx": mobility_idx, 112 | "stagnant_water_idx": stagnant_water_idx, 113 | "incidence": incidence 114 | }) 115 | 116 | df["incidence_lag1"] = df.groupby("region")["incidence"].shift(1) 117 | df["incidence_ma3"] = df.groupby("region")["incidence"].rolling(3).mean().reset_index(level=0, drop=True) 118 | df["rain_mm_ma3"] = df.groupby("region")["rain_mm"].rolling(3).mean().reset_index(level=0, drop=True) 119 | 120 | # Fill first-lag rows with reasonable defaults 121 | for col in ["incidence_lag1", "incidence_ma3", "rain_mm_ma3"]: 122 | df[col] = df[col].fillna(df[col].median()) 123 | 124 | # Target variables 125 | # Next-week incidence (regression) — shift by -1 week within region 126 | df["incidence_next"] = df.groupby("region")["incidence"].shift(-1) 127 | # For last week per region, fill with same-week value to keep dataset size stable 128 | last_mask = df["incidence_next"].isna() 129 | df.loc[last_mask, "incidence_next"] = df.loc[last_mask, "incidence"] 130 | 131 | # Outbreak classification: top 30% of incidence_next considered an outbreak 132 | thresh = np.percentile(df["incidence_next"], 70) 133 | df["outbreak"] = (df["incidence_next"] >= thresh).astype(int) 134 | 135 | # Save CSV 136 | csv_path = os.path.join(OUTDIR, "synthetic_climate_disease_dataset.csv") 137 | df.to_csv(csv_path, index=False) 138 | 139 | # ----------------------------- 140 | # 2) Train-test split (time-aware by region) 141 | # ----------------------------- 142 | # We'll split using the last 20% of weeks as test within each region to mimic forecasting. 143 | def time_split_mask(frame, holdout_frac=0.2): 144 | test_mask = np.zeros(len(frame), dtype=bool) 145 | for r in frame["region"].unique(): 146 | idx = frame.index[frame["region"] == r] 147 | n = len(idx) 148 | split_point = int((1 - holdout_frac) * n) 149 | test_idx = idx[split_point:] 150 | test_mask[test_idx] = True 151 | return test_mask 152 | 153 | test_mask = time_split_mask(df, holdout_frac=0.2) 154 | 155 | features = [ 156 | "temp_c","humidity","rain_mm","wind_ms","ndvi", 157 | "pop_density","sanitation","vacc_coverage","health_access", 158 | "mobility_idx","stagnant_water_idx","incidence_lag1","incidence_ma3","rain_mm_ma3" 159 | ] 160 | 161 | X = df[features].values 162 | y_reg = df["incidence_next"].values 163 | y_cls = df["outbreak"].values 164 | 165 | Xtr_r, Xte_r = X[~test_mask], X[test_mask] 166 | ytr_r, yte_r = y_reg[~test_mask], y_reg[test_mask] 167 | 168 | Xtr_c, Xte_c = X[~test_mask], X[test_mask] 169 | ytr_c, yte_c = y_cls[~test_mask], y_cls[test_mask] 170 | 171 | # ----------------------------- 172 | # 3) Regression model 173 | # ----------------------------- 174 | reg = RandomForestRegressor(n_estimators=500, random_state=7, n_jobs=-1) 175 | reg.fit(Xtr_r, ytr_r) 176 | yp_r = reg.predict(Xte_r) 177 | 178 | mae = mean_absolute_error(yte_r, yp_r) 179 | rmse = mean_squared_error(yte_r, yp_r, squared=False) 180 | r2 = r2_score(yte_r, yp_r) 181 | 182 | # Parity plot 183 | plt.figure() 184 | plt.scatter(yte_r, yp_r, alpha=0.5) 185 | mn = min(yte_r.min(), yp_r.min()); mx = max(yte_r.max(), yp_r.max()) 186 | plt.plot([mn, mx], [mn, mx]) 187 | plt.xlabel("True next-week incidence (per 100k)") 188 | plt.ylabel("Predicted next-week incidence (per 100k)") 189 | plt.title(f"Regression Parity | MAE={mae:.2f} RMSE={rmse:.2f} R2={r2:.2f}") 190 | plt.tight_layout() 191 | plt.savefig(os.path.join(OUTDIR, "regression_parity.png")); plt.close() 192 | 193 | # Feature importance (regression) 194 | imp_r = pd.Series(reg.feature_importances_, index=features).sort_values() 195 | plt.figure() 196 | plt.barh(imp_r.index, imp_r.values) 197 | plt.xlabel("Importance") 198 | plt.title("Regression Feature Importances") 199 | plt.tight_layout() 200 | plt.savefig(os.path.join(OUTDIR, "regression_feature_importance.png")); plt.close() 201 | 202 | # ----------------------------- 203 | # 4) Classification model 204 | # ----------------------------- 205 | clf = GradientBoostingClassifier(random_state=7) 206 | clf.fit(Xtr_c, ytr_c) 207 | proba = clf.predict_proba(Xte_c)[:, 1] 208 | yp_c = (proba >= 0.5).astype(int) 209 | 210 | acc = (yp_c == yte_c).mean() 211 | auc = roc_auc_score(yte_c, proba) 212 | 213 | # ROC curve 214 | fpr, tpr, thr = roc_curve(yte_c, proba) 215 | plt.figure() 216 | plt.plot(fpr, tpr); plt.plot([0,1],[0,1]) 217 | plt.xlabel("False Positive Rate"); plt.ylabel("True Positive Rate") 218 | plt.title(f"ROC Curve (AUC={auc:.3f})") 219 | plt.tight_layout() 220 | plt.savefig(os.path.join(OUTDIR, "classification_roc.png")); plt.close() 221 | 222 | # Confusion Matrix 223 | cm = confusion_matrix(yte_c, yp_c) 224 | plt.figure() 225 | im = plt.imshow(cm, interpolation="nearest") 226 | plt.title("Confusion Matrix") 227 | plt.xlabel("Predicted"); plt.ylabel("True") 228 | plt.xticks([0,1], ["No Outbreak","Outbreak"]) 229 | plt.yticks([0,1], ["No Outbreak","Outbreak"]) 230 | for (i,j), v in np.ndenumerate(cm): 231 | plt.text(j, i, int(v), ha="center", va="center") 232 | plt.colorbar(im, fraction=0.046, pad=0.04) 233 | plt.tight_layout() 234 | plt.savefig(os.path.join(OUTDIR, "classification_confusion.png")); plt.close() 235 | 236 | # Classification report 237 | report = classification_report(yte_c, yp_c, target_names=["No Outbreak","Outbreak"]) 238 | with open(os.path.join(OUTDIR, "classification_report.txt"), "w") as f: 239 | f.write(report) 240 | 241 | # Surrogate RF for feature importance 242 | rf_sur = RandomForestClassifier(n_estimators=500, random_state=7, n_jobs=-1) 243 | rf_sur.fit(Xtr_c, ytr_c) 244 | imp_c = pd.Series(rf_sur.feature_importances_, index=features).sort_values() 245 | plt.figure() 246 | plt.barh(imp_c.index, imp_c.values) 247 | plt.xlabel("Importance") 248 | plt.title("Classification Feature Importances (RF surrogate)") 249 | plt.tight_layout() 250 | plt.savefig(os.path.join(OUTDIR, "classification_feature_importance.png")); plt.close() 251 | 252 | # ----------------------------- 253 | # 5) Time-series cross-validation (optional, regression R^2) 254 | # ----------------------------- 255 | tscv = TimeSeriesSplit(n_splits=5) 256 | cv_r2 = cross_val_score(reg, X, y_reg, cv=tscv, scoring="r2") 257 | with open(os.path.join(OUTDIR, "cv_results.txt"), "w") as f: 258 | f.write(f"Regression R2 (5-fold TimeSeriesSplit): mean={cv_r2.mean():.3f}, std={cv_r2.std():.3f}\n") 259 | f.write(f"Fold scores: {np.round(cv_r2, 3)}\n") 260 | 261 | # ----------------------------- 262 | # 6) Save artifacts 263 | # ----------------------------- 264 | dump(reg, os.path.join(OUTDIR, "incidence_regressor.joblib")) 265 | dump(clf, os.path.join(OUTDIR, "outbreak_classifier.joblib")) 266 | 267 | with open(os.path.join(OUTDIR, "README_Climate_Disease_Model.txt"), "w") as f: 268 | f.write( 269 | "Climate-Driven-Disease-Spread-Prediction-Model (Synthetic Demo)\n\n" 270 | "Files:\n" 271 | "- synthetic_climate_disease_dataset.csv : synthetic weekly panel dataset across regions\n" 272 | "- incidence_regressor.joblib : RandomForestRegressor for next-week incidence\n" 273 | "- outbreak_classifier.joblib : GradientBoostingClassifier for outbreak detection\n" 274 | "- regression_parity.png, regression_feature_importance.png\n" 275 | "- classification_roc.png, classification_confusion.png, classification_feature_importance.png\n" 276 | "- classification_report.txt, cv_results.txt\n\n" 277 | "Quick usage:\n" 278 | "from joblib import load\n" 279 | "import pandas as pd\n" 280 | "df = pd.read_csv('outputs/synthetic_climate_disease_dataset.csv')\n" 281 | "features = ['temp_c','humidity','rain_mm','wind_ms','ndvi','pop_density','sanitation',\n" 282 | " 'vacc_coverage','health_access','mobility_idx','stagnant_water_idx',\n" 283 | " 'incidence_lag1','incidence_ma3','rain_mm_ma3']\n" 284 | "X = df[features].values\n" 285 | "reg = load('outputs/incidence_regressor.joblib'); y_pred = reg.predict(X)\n" 286 | "clf = load('outputs/outbreak_classifier.joblib'); p_out = clf.predict_proba(X)[:,1]\n" 287 | ) 288 | 289 | # ----------------------------- 290 | # 7) Print summary to console 291 | # ----------------------------- 292 | print("=== Regression ===") 293 | print(f"MAE={mae:.3f} RMSE={rmse:.3f} R2={r2:.3f}") 294 | print("\n=== Classification ===") 295 | print(f"Accuracy={acc:.3f} ROC-AUC={auc:.3f}") 296 | print("\nClassification report:\n", report) 297 | print(f"\nTimeSeriesSplit R2: mean={cv_r2.mean():.3f} ± {cv_r2.std():.3f}") 298 | print(f"\nArtifacts saved to: {OUTDIR}/") 299 | --------------------------------------------------------------------------------