├── README.md
├── file
└── climate_disease_spread_model.py


/README.md:
--------------------------------------------------------------------------------
  1 | # 🌍 Climate-Driven-Disease-Spread-Prediction-Model
  2 | 
  3 | **Author:** [Otutu Anslem](https://github.com/Otutu11)  
  4 | **Repository:** [@Otutu11](https://github.com/Otutu11)
  5 | 
  6 | ---
  7 | 
  8 | ## 📌 Overview
  9 | 
 10 | This project presents a **synthetic end-to-end machine learning pipeline** that predicts **disease spread driven by climatic and environmental factors**.  
 11 | It combines **regression** to estimate next-week disease incidence and **classification** to detect outbreak risk, using multi-regional, weekly panel data.
 12 | 
 13 | ---
 14 | 
 15 | ## ⚙️ Features
 16 | 
 17 | - 📅 Generates **synthetic dataset** (>3,000 rows) with:
 18 |   - Climate: temperature, humidity, rainfall, wind, NDVI
 19 |   - Environmental: sanitation, health access, population density
 20 |   - Behavioural: mobility, stagnant water index
 21 |   - Health: weekly disease incidence and outbreak labels
 22 | - 🤖 Trains:
 23 |   - `RandomForestRegressor` → Predict next-week incidence (per 100k)
 24 |   - `GradientBoostingClassifier` → Detect outbreak risk (binary)
 25 | - 📊 Outputs:
 26 |   - Dataset (`.csv`)
 27 |   - Trained models (`.joblib`)
 28 |   - Performance metrics, plots, and reports
 29 | 
 30 | ---
 31 | 
 32 | ## 📁 Project Structure
 33 | 
 34 | .
 35 | ├── climate_disease_spread_model.py
 36 | ├── outputs/
 37 | │ ├── synthetic_climate_disease_dataset.csv
 38 | │ ├── incidence_regressor.joblib
 39 | │ ├── outbreak_classifier.joblib
 40 | │ ├── regression_parity.png
 41 | │ ├── regression_feature_importance.png
 42 | │ ├── classification_roc.png
 43 | │ ├── classification_confusion.png
 44 | │ ├── classification_feature_importance.png
 45 | │ ├── classification_report.txt
 46 | │ ├── cv_results.txt
 47 | │ └── README_Climate_Disease_Model.txt
 48 | └── README.md
 49 | 
 50 | yaml
 51 | Copy code
 52 | 
 53 | ---
 54 | 
 55 | ## 🚀 Getting Started
 56 | 
 57 | ### 1. Clone the repository
 58 | ```bash
 59 | git clone https://github.com/Otutu11/Climate-Driven-Disease-Spread-Prediction-Model.git
 60 | cd Climate-Driven-Disease-Spread-Prediction-Model
 61 | 2. Install dependencies
 62 | bash
 63 | Copy code
 64 | pip install numpy pandas matplotlib scikit-learn joblib
 65 | 3. Run the script
 66 | bash
 67 | Copy code
 68 | python climate_disease_spread_model.py
 69 | 📈 Usage Example
 70 | python
 71 | Copy code
 72 | from joblib import load
 73 | import pandas as pd
 74 | 
 75 | # Load dataset
 76 | df = pd.read_csv("outputs/synthetic_climate_disease_dataset.csv")
 77 | 
 78 | # Prepare features
 79 | features = [
 80 |     "temp_c","humidity","rain_mm","wind_ms","ndvi",
 81 |     "pop_density","sanitation","vacc_coverage","health_access",
 82 |     "mobility_idx","stagnant_water_idx","incidence_lag1","incidence_ma3","rain_mm_ma3"
 83 | ]
 84 | X = df[features].values
 85 | 
 86 | # Load models
 87 | reg = load("outputs/incidence_regressor.joblib")
 88 | clf = load("outputs/outbreak_classifier.joblib")
 89 | 
 90 | # Predictions
 91 | pred_incidence = reg.predict(X)
 92 | pred_outbreak_prob = clf.predict_proba(X)[:, 1]
 93 | 📌 Notes
 94 | This project uses synthetic data purely for demonstration.
 95 | 
 96 | Not intended for real epidemiological decision-making.
 97 | 
 98 | Extendable to real datasets by swapping the synthetic generator section with actual data ingestion.
 99 | 
100 | 📜 License
101 | This project is open-source under the MIT License.
102 | Feel free to use, modify, and distribute with attribution.
103 | 
104 | Maintained by: Otutu Anslem
105 | 
106 | 📧 Contributions and feedback are welcome!
107 | 


--------------------------------------------------------------------------------
/file:
--------------------------------------------------------------------------------
  1 | # file: climate_disease_spread_model.py
  2 | # Purpose: Synthetic end-to-end demo for "Climate-Driven-Disease-Spread-Prediction-Model"
  3 | # Author: You :)
  4 | # Run: python climate_disease_spread_model.py
  5 | #
  6 | # What it does:
  7 | # - Generates a synthetic dataset (>100 rows) linking climate, environment, mobility, and health covariates
  8 | #   to weekly disease incidence and outbreak occurrence.
  9 | # - Trains:
 10 | #     * Regression (RandomForestRegressor) to predict next-week incidence per 100k
 11 | #     * Classification (GradientBoostingClassifier) to flag outbreak risk (binary)
 12 | # - Saves: CSV dataset, trained models, plots, and text reports under outputs/
 13 | #
 14 | # Notes:
 15 | # - Purely synthetic; do not use for real epidemiological inference.
 16 | # - Minimal deps: numpy, pandas, matplotlib, scikit-learn, joblib
 17 | 
 18 | import os
 19 | import numpy as np
 20 | import pandas as pd
 21 | import matplotlib.pyplot as plt
 22 | 
 23 | from sklearn.model_selection import TimeSeriesSplit, cross_val_score
 24 | from sklearn.ensemble import RandomForestRegressor, GradientBoostingClassifier, RandomForestClassifier
 25 | from sklearn.metrics import (
 26 |     mean_absolute_error, mean_squared_error, r2_score,
 27 |     roc_auc_score, roc_curve, confusion_matrix, classification_report
 28 | )
 29 | from joblib import dump
 30 | 
 31 | # -----------------------------
 32 | # 0) Setup
 33 | # -----------------------------
 34 | np.random.seed(7)
 35 | OUTDIR = "outputs"
 36 | os.makedirs(OUTDIR, exist_ok=True)
 37 | 
 38 | # -----------------------------
 39 | # 1) Generate synthetic time-series panel data
 40 | # -----------------------------
 41 | # Settings
 42 | N_REGIONS = 25
 43 | N_WEEKS_PER_REGION = 120  # total rows = 3000 (>100)
 44 | N = N_REGIONS * N_WEEKS_PER_REGION
 45 | 
 46 | regions = np.repeat(np.arange(N_REGIONS), N_WEEKS_PER_REGION)
 47 | week = np.tile(np.arange(N_WEEKS_PER_REGION), N_REGIONS)
 48 | 
 49 | # Seasonality helpers
 50 | # Weekly seasonality (e.g., vector-borne peaks in rainy season)
 51 | season = 2 * np.pi * week / 52.0
 52 | rainy_season_factor = 0.5 + 0.5 * (np.sin(season - 0.5) > 0).astype(float)
 53 | 
 54 | # Climate variables (synthetic but plausible ranges)
 55 | temp_c = 22 + 8*np.sin(season) + np.random.normal(0, 1.2, N)  # ~14..30 seasonal
 56 | humidity = np.clip(0.55 + 0.25*np.sin(season-0.8) + np.random.normal(0, 0.07, N), 0, 1)
 57 | rain_mm = np.clip(40*rainy_season_factor + np.random.gamma(2, 5, N), 0, 300)
 58 | wind_ms = np.clip(np.random.normal(3.5, 1.2, N), 0, 15)
 59 | ndvi = np.clip(0.35 + 0.15*np.sin(season-1.2) + np.random.normal(0, 0.05, N), 0, 1)
 60 | 
 61 | # Environmental/socioeconomic covariates (region-specific base + noise)
 62 | pop_density_base = np.random.uniform(200, 2000, N_REGIONS)
 63 | sanitation_base = np.clip(np.random.normal(0.6, 0.12, N_REGIONS), 0, 1)
 64 | vacc_coverage_base = np.clip(np.random.normal(0.55, 0.10, N_REGIONS), 0, 1)
 65 | health_access_base = np.clip(np.random.normal(0.5, 0.1, N_REGIONS), 0, 1)
 66 | 
 67 | pop_density = pop_density_base[regions] + np.random.normal(0, 40, N)
 68 | sanitation = np.clip(sanitation_base[regions] + np.random.normal(0, 0.03, N), 0, 1)
 69 | vacc_coverage = np.clip(vacc_coverage_base[regions] + np.random.normal(0, 0.03, N), 0, 1)
 70 | health_access = np.clip(health_access_base[regions] + np.random.normal(0, 0.03, N), 0, 1)
 71 | 
 72 | # Mobility & exposure
 73 | mobility_idx = np.clip(0.5 + 0.3*np.sin(season+0.5) + np.random.normal(0, 0.1, N), 0, 1)
 74 | stagnant_water_idx = np.clip(0.3 + 0.5*rainy_season_factor + np.random.normal(0, 0.08, N), 0, 1)
 75 | 
 76 | # Baseline regional risk (latent)
 77 | region_risk = np.random.normal(0, 0.4, N_REGIONS)[regions]
 78 | 
 79 | # True latent incidence (per 100k) before noise (use a simple structural formula)
 80 | # Higher with temp, humidity, rainfall, stagnant water, mobility, density; lower with vacc & sanitation.
 81 | latent_incidence = (
 82 |     5
 83 |     + 0.6*(temp_c - 22)
 84 |     + 1.2*(humidity - 0.5)
 85 |     + 0.03*rain_mm
 86 |     + 4.0*stagnant_water_idx
 87 |     + 3.0*mobility_idx
 88 |     + 0.004*pop_density
 89 |     - 3.5*vacc_coverage
 90 |     - 2.0*sanitation
 91 |     - 1.0*health_access
 92 |     + 0.8*region_risk
 93 | )
 94 | 
 95 | # Add stochasticity and enforce non-negativity
 96 | incidence = np.clip(latent_incidence + np.random.normal(0, 1.5, N), 0, None)
 97 | 
 98 | # Create a lag feature for autoregression (last week's incidence by region)
 99 | df = pd.DataFrame({
100 |     "region": regions,
101 |     "week": week,
102 |     "temp_c": temp_c,
103 |     "humidity": humidity,
104 |     "rain_mm": rain_mm,
105 |     "wind_ms": wind_ms,
106 |     "ndvi": ndvi,
107 |     "pop_density": pop_density,
108 |     "sanitation": sanitation,
109 |     "vacc_coverage": vacc_coverage,
110 |     "health_access": health_access,
111 |     "mobility_idx": mobility_idx,
112 |     "stagnant_water_idx": stagnant_water_idx,
113 |     "incidence": incidence
114 | })
115 | 
116 | df["incidence_lag1"] = df.groupby("region")["incidence"].shift(1)
117 | df["incidence_ma3"] = df.groupby("region")["incidence"].rolling(3).mean().reset_index(level=0, drop=True)
118 | df["rain_mm_ma3"] = df.groupby("region")["rain_mm"].rolling(3).mean().reset_index(level=0, drop=True)
119 | 
120 | # Fill first-lag rows with reasonable defaults
121 | for col in ["incidence_lag1", "incidence_ma3", "rain_mm_ma3"]:
122 |     df[col] = df[col].fillna(df[col].median())
123 | 
124 | # Target variables
125 | # Next-week incidence (regression) — shift by -1 week within region
126 | df["incidence_next"] = df.groupby("region")["incidence"].shift(-1)
127 | # For last week per region, fill with same-week value to keep dataset size stable
128 | last_mask = df["incidence_next"].isna()
129 | df.loc[last_mask, "incidence_next"] = df.loc[last_mask, "incidence"]
130 | 
131 | # Outbreak classification: top 30% of incidence_next considered an outbreak
132 | thresh = np.percentile(df["incidence_next"], 70)
133 | df["outbreak"] = (df["incidence_next"] >= thresh).astype(int)
134 | 
135 | # Save CSV
136 | csv_path = os.path.join(OUTDIR, "synthetic_climate_disease_dataset.csv")
137 | df.to_csv(csv_path, index=False)
138 | 
139 | # -----------------------------
140 | # 2) Train-test split (time-aware by region)
141 | # -----------------------------
142 | # We'll split using the last 20% of weeks as test within each region to mimic forecasting.
143 | def time_split_mask(frame, holdout_frac=0.2):
144 |     test_mask = np.zeros(len(frame), dtype=bool)
145 |     for r in frame["region"].unique():
146 |         idx = frame.index[frame["region"] == r]
147 |         n = len(idx)
148 |         split_point = int((1 - holdout_frac) * n)
149 |         test_idx = idx[split_point:]
150 |         test_mask[test_idx] = True
151 |     return test_mask
152 | 
153 | test_mask = time_split_mask(df, holdout_frac=0.2)
154 | 
155 | features = [
156 |     "temp_c","humidity","rain_mm","wind_ms","ndvi",
157 |     "pop_density","sanitation","vacc_coverage","health_access",
158 |     "mobility_idx","stagnant_water_idx","incidence_lag1","incidence_ma3","rain_mm_ma3"
159 | ]
160 | 
161 | X = df[features].values
162 | y_reg = df["incidence_next"].values
163 | y_cls = df["outbreak"].values
164 | 
165 | Xtr_r, Xte_r = X[~test_mask], X[test_mask]
166 | ytr_r, yte_r = y_reg[~test_mask], y_reg[test_mask]
167 | 
168 | Xtr_c, Xte_c = X[~test_mask], X[test_mask]
169 | ytr_c, yte_c = y_cls[~test_mask], y_cls[test_mask]
170 | 
171 | # -----------------------------
172 | # 3) Regression model
173 | # -----------------------------
174 | reg = RandomForestRegressor(n_estimators=500, random_state=7, n_jobs=-1)
175 | reg.fit(Xtr_r, ytr_r)
176 | yp_r = reg.predict(Xte_r)
177 | 
178 | mae  = mean_absolute_error(yte_r, yp_r)
179 | rmse = mean_squared_error(yte_r, yp_r, squared=False)
180 | r2   = r2_score(yte_r, yp_r)
181 | 
182 | # Parity plot
183 | plt.figure()
184 | plt.scatter(yte_r, yp_r, alpha=0.5)
185 | mn = min(yte_r.min(), yp_r.min()); mx = max(yte_r.max(), yp_r.max())
186 | plt.plot([mn, mx], [mn, mx])
187 | plt.xlabel("True next-week incidence (per 100k)")
188 | plt.ylabel("Predicted next-week incidence (per 100k)")
189 | plt.title(f"Regression Parity | MAE={mae:.2f} RMSE={rmse:.2f} R2={r2:.2f}")
190 | plt.tight_layout()
191 | plt.savefig(os.path.join(OUTDIR, "regression_parity.png")); plt.close()
192 | 
193 | # Feature importance (regression)
194 | imp_r = pd.Series(reg.feature_importances_, index=features).sort_values()
195 | plt.figure()
196 | plt.barh(imp_r.index, imp_r.values)
197 | plt.xlabel("Importance")
198 | plt.title("Regression Feature Importances")
199 | plt.tight_layout()
200 | plt.savefig(os.path.join(OUTDIR, "regression_feature_importance.png")); plt.close()
201 | 
202 | # -----------------------------
203 | # 4) Classification model
204 | # -----------------------------
205 | clf = GradientBoostingClassifier(random_state=7)
206 | clf.fit(Xtr_c, ytr_c)
207 | proba = clf.predict_proba(Xte_c)[:, 1]
208 | yp_c  = (proba >= 0.5).astype(int)
209 | 
210 | acc = (yp_c == yte_c).mean()
211 | auc = roc_auc_score(yte_c, proba)
212 | 
213 | # ROC curve
214 | fpr, tpr, thr = roc_curve(yte_c, proba)
215 | plt.figure()
216 | plt.plot(fpr, tpr); plt.plot([0,1],[0,1])
217 | plt.xlabel("False Positive Rate"); plt.ylabel("True Positive Rate")
218 | plt.title(f"ROC Curve (AUC={auc:.3f})")
219 | plt.tight_layout()
220 | plt.savefig(os.path.join(OUTDIR, "classification_roc.png")); plt.close()
221 | 
222 | # Confusion Matrix
223 | cm = confusion_matrix(yte_c, yp_c)
224 | plt.figure()
225 | im = plt.imshow(cm, interpolation="nearest")
226 | plt.title("Confusion Matrix")
227 | plt.xlabel("Predicted"); plt.ylabel("True")
228 | plt.xticks([0,1], ["No Outbreak","Outbreak"])
229 | plt.yticks([0,1], ["No Outbreak","Outbreak"])
230 | for (i,j), v in np.ndenumerate(cm):
231 |     plt.text(j, i, int(v), ha="center", va="center")
232 | plt.colorbar(im, fraction=0.046, pad=0.04)
233 | plt.tight_layout()
234 | plt.savefig(os.path.join(OUTDIR, "classification_confusion.png")); plt.close()
235 | 
236 | # Classification report
237 | report = classification_report(yte_c, yp_c, target_names=["No Outbreak","Outbreak"])
238 | with open(os.path.join(OUTDIR, "classification_report.txt"), "w") as f:
239 |     f.write(report)
240 | 
241 | # Surrogate RF for feature importance
242 | rf_sur = RandomForestClassifier(n_estimators=500, random_state=7, n_jobs=-1)
243 | rf_sur.fit(Xtr_c, ytr_c)
244 | imp_c = pd.Series(rf_sur.feature_importances_, index=features).sort_values()
245 | plt.figure()
246 | plt.barh(imp_c.index, imp_c.values)
247 | plt.xlabel("Importance")
248 | plt.title("Classification Feature Importances (RF surrogate)")
249 | plt.tight_layout()
250 | plt.savefig(os.path.join(OUTDIR, "classification_feature_importance.png")); plt.close()
251 | 
252 | # -----------------------------
253 | # 5) Time-series cross-validation (optional, regression R^2)
254 | # -----------------------------
255 | tscv = TimeSeriesSplit(n_splits=5)
256 | cv_r2 = cross_val_score(reg, X, y_reg, cv=tscv, scoring="r2")
257 | with open(os.path.join(OUTDIR, "cv_results.txt"), "w") as f:
258 |     f.write(f"Regression R2 (5-fold TimeSeriesSplit): mean={cv_r2.mean():.3f}, std={cv_r2.std():.3f}\n")
259 |     f.write(f"Fold scores: {np.round(cv_r2, 3)}\n")
260 | 
261 | # -----------------------------
262 | # 6) Save artifacts
263 | # -----------------------------
264 | dump(reg, os.path.join(OUTDIR, "incidence_regressor.joblib"))
265 | dump(clf, os.path.join(OUTDIR, "outbreak_classifier.joblib"))
266 | 
267 | with open(os.path.join(OUTDIR, "README_Climate_Disease_Model.txt"), "w") as f:
268 |     f.write(
269 |         "Climate-Driven-Disease-Spread-Prediction-Model (Synthetic Demo)\n\n"
270 |         "Files:\n"
271 |         "- synthetic_climate_disease_dataset.csv : synthetic weekly panel dataset across regions\n"
272 |         "- incidence_regressor.joblib : RandomForestRegressor for next-week incidence\n"
273 |         "- outbreak_classifier.joblib : GradientBoostingClassifier for outbreak detection\n"
274 |         "- regression_parity.png, regression_feature_importance.png\n"
275 |         "- classification_roc.png, classification_confusion.png, classification_feature_importance.png\n"
276 |         "- classification_report.txt, cv_results.txt\n\n"
277 |         "Quick usage:\n"
278 |         "from joblib import load\n"
279 |         "import pandas as pd\n"
280 |         "df = pd.read_csv('outputs/synthetic_climate_disease_dataset.csv')\n"
281 |         "features = ['temp_c','humidity','rain_mm','wind_ms','ndvi','pop_density','sanitation',\n"
282 |         "            'vacc_coverage','health_access','mobility_idx','stagnant_water_idx',\n"
283 |         "            'incidence_lag1','incidence_ma3','rain_mm_ma3']\n"
284 |         "X = df[features].values\n"
285 |         "reg = load('outputs/incidence_regressor.joblib'); y_pred = reg.predict(X)\n"
286 |         "clf = load('outputs/outbreak_classifier.joblib'); p_out = clf.predict_proba(X)[:,1]\n"
287 |     )
288 | 
289 | # -----------------------------
290 | # 7) Print summary to console
291 | # -----------------------------
292 | print("=== Regression ===")
293 | print(f"MAE={mae:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")
294 | print("\n=== Classification ===")
295 | print(f"Accuracy={acc:.3f}  ROC-AUC={auc:.3f}")
296 | print("\nClassification report:\n", report)
297 | print(f"\nTimeSeriesSplit R2: mean={cv_r2.mean():.3f} ± {cv_r2.std():.3f}")
298 | print(f"\nArtifacts saved to: {OUTDIR}/")
299 | 
300 | 
301 | 


--------------------------------------------------------------------------------
/climate_disease_spread_model.py:
--------------------------------------------------------------------------------
  1 | # file: climate_disease_spread_model.py
  2 | # Purpose: Synthetic end-to-end demo for "Climate-Driven-Disease-Spread-Prediction-Model"
  3 | # Author: You :)
  4 | # Run: python climate_disease_spread_model.py
  5 | #
  6 | # What it does:
  7 | # - Generates a synthetic dataset (>100 rows) linking climate, environment, mobility, and health covariates
  8 | #   to weekly disease incidence and outbreak occurrence.
  9 | # - Trains:
 10 | #     * Regression (RandomForestRegressor) to predict next-week incidence per 100k
 11 | #     * Classification (GradientBoostingClassifier) to flag outbreak risk (binary)
 12 | # - Saves: CSV dataset, trained models, plots, and text reports under outputs/
 13 | #
 14 | # Notes:
 15 | # - Purely synthetic; do not use for real epidemiological inference.
 16 | # - Minimal deps: numpy, pandas, matplotlib, scikit-learn, joblib
 17 | 
 18 | import os
 19 | import numpy as np
 20 | import pandas as pd
 21 | import matplotlib.pyplot as plt
 22 | 
 23 | from sklearn.model_selection import train_test_split, TimeSeriesSplit, cross_val_score
 24 | from sklearn.ensemble import RandomForestRegressor, GradientBoostingClassifier, RandomForestClassifier
 25 | from sklearn.metrics import (
 26 |     mean_absolute_error, mean_squared_error, r2_score,
 27 |     roc_auc_score, roc_curve, confusion_matrix, classification_report
 28 | )
 29 | from joblib import dump
 30 | 
 31 | # -----------------------------
 32 | # 0) Setup
 33 | # -----------------------------
 34 | np.random.seed(7)
 35 | OUTDIR = "outputs"
 36 | os.makedirs(OUTDIR, exist_ok=True)
 37 | 
 38 | # -----------------------------
 39 | # 1) Generate synthetic time-series panel data
 40 | # -----------------------------
 41 | # Settings
 42 | N_REGIONS = 25
 43 | N_WEEKS_PER_REGION = 120  # total rows = 3000 (>100)
 44 | N = N_REGIONS * N_WEEKS_PER_REGION
 45 | 
 46 | regions = np.repeat(np.arange(N_REGIONS), N_WEEKS_PER_REGION)
 47 | week = np.tile(np.arange(N_WEEKS_PER_REGION), N_REGIONS)
 48 | 
 49 | # Seasonality helpers
 50 | # Weekly seasonality (e.g., vector-borne peaks in rainy season)
 51 | season = 2 * np.pi * week / 52.0
 52 | rainy_season_factor = 0.5 + 0.5 * (np.sin(season - 0.5) > 0).astype(float)
 53 | 
 54 | # Climate variables (synthetic but plausible ranges)
 55 | temp_c = 22 + 8*np.sin(season) + np.random.normal(0, 1.2, N)  # 14..30 seasonal
 56 | humidity = np.clip(0.55 + 0.25*np.sin(season-0.8) + np.random.normal(0, 0.07, N), 0, 1)
 57 | rain_mm = np.clip(40*rainy_season_factor + np.random.gamma(2, 5, N), 0, 300)
 58 | wind_ms = np.clip(np.random.normal(3.5, 1.2, N), 0, 15)
 59 | ndvi = np.clip(0.35 + 0.15*np.sin(season-1.2) + np.random.normal(0, 0.05, N), 0, 1)
 60 | 
 61 | # Environmental/socioeconomic covariates (region-specific base + noise)
 62 | pop_density_base = np.random.uniform(200, 2000, N_REGIONS)
 63 | sanitation_base = np.clip(np.random.normal(0.6, 0.12, N_REGIONS), 0, 1)
 64 | vacc_coverage_base = np.clip(np.random.normal(0.55, 0.10, N_REGIONS), 0, 1)
 65 | health_access_base = np.clip(np.random.normal(0.5, 0.1, N_REGIONS), 0, 1)
 66 | 
 67 | pop_density = pop_density_base[regions] + np.random.normal(0, 40, N)
 68 | sanitation = np.clip(sanitation_base[regions] + np.random.normal(0, 0.03, N), 0, 1)
 69 | vacc_coverage = np.clip(vacc_coverage_base[regions] + np.random.normal(0, 0.03, N), 0, 1)
 70 | health_access = np.clip(health_access_base[regions] + np.random.normal(0, 0.03, N), 0, 1)
 71 | 
 72 | # Mobility & exposure
 73 | mobility_idx = np.clip(0.5 + 0.3*np.sin(season+0.5) + np.random.normal(0, 0.1, N), 0, 1)
 74 | stagnant_water_idx = np.clip(0.3 + 0.5*rainy_season_factor + np.random.normal(0, 0.08, N), 0, 1)
 75 | 
 76 | # Baseline regional risk (latent)
 77 | region_risk = np.random.normal(0, 0.4, N_REGIONS)[regions]
 78 | 
 79 | # True latent incidence (per 100k) before noise (use a simple structural formula)
 80 | # Higher with temp, humidity, rainfall, stagnant water, mobility, density; lower with vacc & sanitation.
 81 | latent_incidence = (
 82 |     5
 83 |     + 0.6*(temp_c - 22)
 84 |     + 1.2*(humidity - 0.5)
 85 |     + 0.03*rain_mm
 86 |     + 4.0*stagnant_water_idx
 87 |     + 3.0*mobility_idx
 88 |     + 0.004*pop_density
 89 |     - 3.5*vacc_coverage
 90 |     - 2.0*sanitation
 91 |     - 1.0*health_access
 92 |     + 0.8*region_risk
 93 | )
 94 | 
 95 | # Add stochasticity and enforce non-negativity
 96 | incidence = np.clip(latent_incidence + np.random.normal(0, 1.5, N), 0, None)
 97 | 
 98 | # Create a lag feature for autoregression (last week's incidence by region)
 99 | df = pd.DataFrame({
100 |     "region": regions,
101 |     "week": week,
102 |     "temp_c": temp_c,
103 |     "humidity": humidity,
104 |     "rain_mm": rain_mm,
105 |     "wind_ms": wind_ms,
106 |     "ndvi": ndvi,
107 |     "pop_density": pop_density,
108 |     "sanitation": sanitation,
109 |     "vacc_coverage": vacc_coverage,
110 |     "health_access": health_access,
111 |     "mobility_idx": mobility_idx,
112 |     "stagnant_water_idx": stagnant_water_idx,
113 |     "incidence": incidence
114 | })
115 | 
116 | df["incidence_lag1"] = df.groupby("region")["incidence"].shift(1)
117 | df["incidence_ma3"] = df.groupby("region")["incidence"].rolling(3).mean().reset_index(level=0, drop=True)
118 | df["rain_mm_ma3"] = df.groupby("region")["rain_mm"].rolling(3).mean().reset_index(level=0, drop=True)
119 | 
120 | # Fill first-lag rows with reasonable defaults
121 | for col in ["incidence_lag1", "incidence_ma3", "rain_mm_ma3"]:
122 |     df[col] = df[col].fillna(df[col].median())
123 | 
124 | # Target variables
125 | # Next-week incidence (regression) — shift by -1 week within region
126 | df["incidence_next"] = df.groupby("region")["incidence"].shift(-1)
127 | # For last week per region, fill with same-week value to keep dataset size stable
128 | last_mask = df["incidence_next"].isna()
129 | df.loc[last_mask, "incidence_next"] = df.loc[last_mask, "incidence"]
130 | 
131 | # Outbreak classification: top 30% of incidence_next considered an outbreak
132 | thresh = np.percentile(df["incidence_next"], 70)
133 | df["outbreak"] = (df["incidence_next"] >= thresh).astype(int)
134 | 
135 | # Save CSV
136 | csv_path = os.path.join(OUTDIR, "synthetic_climate_disease_dataset.csv")
137 | df.to_csv(csv_path, index=False)
138 | 
139 | # -----------------------------
140 | # 2) Train-test split (time-aware by region)
141 | # -----------------------------
142 | # We'll split using the last 20% of weeks as test within each region to mimic forecasting.
143 | def time_split_mask(frame, holdout_frac=0.2):
144 |     test_mask = np.zeros(len(frame), dtype=bool)
145 |     for r in frame["region"].unique():
146 |         idx = frame.index[frame["region"] == r]
147 |         n = len(idx)
148 |         split_point = int((1 - holdout_frac) * n)
149 |         test_idx = idx[split_point:]
150 |         test_mask[test_idx] = True
151 |     return test_mask
152 | 
153 | test_mask = time_split_mask(df, holdout_frac=0.2)
154 | 
155 | features = [
156 |     "temp_c","humidity","rain_mm","wind_ms","ndvi",
157 |     "pop_density","sanitation","vacc_coverage","health_access",
158 |     "mobility_idx","stagnant_water_idx","incidence_lag1","incidence_ma3","rain_mm_ma3"
159 | ]
160 | 
161 | X = df[features].values
162 | y_reg = df["incidence_next"].values
163 | y_cls = df["outbreak"].values
164 | 
165 | Xtr_r, Xte_r = X[~test_mask], X[test_mask]
166 | ytr_r, yte_r = y_reg[~test_mask], y_reg[test_mask]
167 | 
168 | Xtr_c, Xte_c = X[~test_mask], X[test_mask]
169 | ytr_c, yte_c = y_cls[~test_mask], y_cls[test_mask]
170 | 
171 | # -----------------------------
172 | # 3) Regression model
173 | # -----------------------------
174 | reg = RandomForestRegressor(n_estimators=500, random_state=7, n_jobs=-1)
175 | reg.fit(Xtr_r, ytr_r)
176 | yp_r = reg.predict(Xte_r)
177 | 
178 | mae  = mean_absolute_error(yte_r, yp_r)
179 | rmse = mean_squared_error(yte_r, yp_r, squared=False)
180 | r2   = r2_score(yte_r, yp_r)
181 | 
182 | # Parity plot
183 | plt.figure()
184 | plt.scatter(yte_r, yp_r, alpha=0.5)
185 | mn = min(yte_r.min(), yp_r.min()); mx = max(yte_r.max(), yp_r.max())
186 | plt.plot([mn, mx], [mn, mx])
187 | plt.xlabel("True next-week incidence (per 100k)")
188 | plt.ylabel("Predicted next-week incidence (per 100k)")
189 | plt.title(f"Regression Parity | MAE={mae:.2f} RMSE={rmse:.2f} R2={r2:.2f}")
190 | plt.tight_layout()
191 | plt.savefig(os.path.join(OUTDIR, "regression_parity.png")); plt.close()
192 | 
193 | # Feature importance (regression)
194 | imp_r = pd.Series(reg.feature_importances_, index=features).sort_values()
195 | plt.figure()
196 | plt.barh(imp_r.index, imp_r.values)
197 | plt.xlabel("Importance")
198 | plt.title("Regression Feature Importances")
199 | plt.tight_layout()
200 | plt.savefig(os.path.join(OUTDIR, "regression_feature_importance.png")); plt.close()
201 | 
202 | # -----------------------------
203 | # 4) Classification model
204 | # -----------------------------
205 | clf = GradientBoostingClassifier(random_state=7)
206 | clf.fit(Xtr_c, ytr_c)
207 | proba = clf.predict_proba(Xte_c)[:, 1]
208 | yp_c  = (proba >= 0.5).astype(int)
209 | 
210 | acc = (yp_c == yte_c).mean()
211 | auc = roc_auc_score(yte_c, proba)
212 | 
213 | # ROC curve
214 | fpr, tpr, thr = roc_curve(yte_c, proba)
215 | plt.figure()
216 | plt.plot(fpr, tpr); plt.plot([0,1],[0,1])
217 | plt.xlabel("False Positive Rate"); plt.ylabel("True Positive Rate")
218 | plt.title(f"ROC Curve (AUC={auc:.3f})")
219 | plt.tight_layout()
220 | plt.savefig(os.path.join(OUTDIR, "classification_roc.png")); plt.close()
221 | 
222 | # Confusion Matrix
223 | cm = confusion_matrix(yte_c, yp_c)
224 | plt.figure()
225 | im = plt.imshow(cm, interpolation="nearest")
226 | plt.title("Confusion Matrix")
227 | plt.xlabel("Predicted"); plt.ylabel("True")
228 | plt.xticks([0,1], ["No Outbreak","Outbreak"])
229 | plt.yticks([0,1], ["No Outbreak","Outbreak"])
230 | for (i,j), v in np.ndenumerate(cm):
231 |     plt.text(j, i, int(v), ha="center", va="center")
232 | plt.colorbar(im, fraction=0.046, pad=0.04)
233 | plt.tight_layout()
234 | plt.savefig(os.path.join(OUTDIR, "classification_confusion.png")); plt.close()
235 | 
236 | # Classification report
237 | report = classification_report(yte_c, yp_c, target_names=["No Outbreak","Outbreak"])
238 | with open(os.path.join(OUTDIR, "classification_report.txt"), "w") as f:
239 |     f.write(report)
240 | 
241 | # Surrogate RF for feature importance
242 | rf_sur = RandomForestClassifier(n_estimators=500, random_state=7, n_jobs=-1)
243 | rf_sur.fit(Xtr_c, ytr_c)
244 | imp_c = pd.Series(rf_sur.feature_importances_, index=features).sort_values()
245 | plt.figure()
246 | plt.barh(imp_c.index, imp_c.values)
247 | plt.xlabel("Importance")
248 | plt.title("Classification Feature Importances (RF surrogate)")
249 | plt.tight_layout()
250 | plt.savefig(os.path.join(OUTDIR, "classification_feature_importance.png")); plt.close()
251 | 
252 | # -----------------------------
253 | # 5) Time-series cross-validation (optional, regression R^2)
254 | # -----------------------------
255 | tscv = TimeSeriesSplit(n_splits=5)
256 | cv_r2 = cross_val_score(reg, X, y_reg, cv=tscv, scoring="r2")
257 | with open(os.path.join(OUTDIR, "cv_results.txt"), "w") as f:
258 |     f.write(f"Regression R2 (5-fold TimeSeriesSplit): mean={cv_r2.mean():.3f}, std={cv_r2.std():.3f}\n")
259 |     f.write(f"Fold scores: {np.round(cv_r2, 3)}\n")
260 | 
261 | # -----------------------------
262 | # 6) Save artifacts
263 | # -----------------------------
264 | dump(reg, os.path.join(OUTDIR, "incidence_regressor.joblib"))
265 | dump(clf, os.path.join(OUTDIR, "outbreak_classifier.joblib"))
266 | 
267 | with open(os.path.join(OUTDIR, "README_Climate_Disease_Model.txt"), "w") as f:
268 |     f.write(
269 |         "Climate-Driven-Disease-Spread-Prediction-Model (Synthetic Demo)\n\n"
270 |         "Files:\n"
271 |         "- synthetic_climate_disease_dataset.csv : synthetic weekly panel dataset across regions\n"
272 |         "- incidence_regressor.joblib : RandomForestRegressor for next-week incidence\n"
273 |         "- outbreak_classifier.joblib : GradientBoostingClassifier for outbreak detection\n"
274 |         "- regression_parity.png, regression_feature_importance.png\n"
275 |         "- classification_roc.png, classification_confusion.png, classification_feature_importance.png\n"
276 |         "- classification_report.txt, cv_results.txt\n\n"
277 |         "Quick usage:\n"
278 |         "from joblib import load\n"
279 |         "import pandas as pd\n"
280 |         "df = pd.read_csv('outputs/synthetic_climate_disease_dataset.csv')\n"
281 |         "features = ['temp_c','humidity','rain_mm','wind_ms','ndvi','pop_density','sanitation',\n"
282 |         "            'vacc_coverage','health_access','mobility_idx','stagnant_water_idx',\n"
283 |         "            'incidence_lag1','incidence_ma3','rain_mm_ma3']\n"
284 |         "X = df[features].values\n"
285 |         "reg = load('outputs/incidence_regressor.joblib'); y_pred = reg.predict(X)\n"
286 |         "clf = load('outputs/outbreak_classifier.joblib'); p_out = clf.predict_proba(X)[:,1]\n"
287 |     )
288 | 
289 | # -----------------------------
290 | # 7) Print summary to console
291 | # -----------------------------
292 | print("=== Regression ===")
293 | print(f"MAE={mae:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")
294 | print("\n=== Classification ===")
295 | print(f"Accuracy={acc:.3f}  ROC-AUC={auc:.3f}")
296 | print("\nClassification report:\n", report)
297 | print(f"\nTimeSeriesSplit R2: mean={cv_r2.mean():.3f} ± {cv_r2.std():.3f}")
298 | print(f"\nArtifacts saved to: {OUTDIR}/")
299 | 


--------------------------------------------------------------------------------