├── synthetic_species_distribution_dataset.xlsx
├── README.md
└── file


/synthetic_species_distribution_dataset.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Nelvinebi/Species-Distribution-Modeling-with-Environmental-Variables/HEAD/synthetic_species_distribution_dataset.xlsx


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
Species Distribution Modeling with Environmental Variables (Synthetic Data)

This project demonstrates Species Distribution Modeling (SDM) using synthetic environmental data. It simulates environmental variables such as elevation, slope, temperature, precipitation, NDVI, distance to water, and human footprint, then generates species occurrence probabilities based on ecological response curves. Presence/absence data is sampled and can be used to train and evaluate machine learning models.

Features

Generates synthetic environmental variables on a grid (e.g., 20×20, 40×40).

Simulates species occurrence probabilities from ecological response curves.

Produces presence/absence labels based on habitat suitability.

Saves datasets in Excel (.xlsx) and CSV (.csv) formats.

Compatible with ML workflows (e.g., Logistic Regression, Random Forest).

Includes visualization of occurrence probability maps and presence points.
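
To make the response-curve and labelling features above concrete, here is a minimal single-predictor sketch. It is an illustration under stated assumptions rather than the project's actual pipeline: the helper name `gaussian_response`, the uniform temperature grid, and the sigmoid slope and offset are all hypothetical. Only the optimum (26 °C) and width (4 °C) mirror the temperature curve used in the script.

```python
import numpy as np


def gaussian_response(values, optimum, width):
    """Bell-shaped suitability in (0, 1], equal to 1 where values == optimum."""
    return np.exp(-0.5 * ((values - optimum) / width) ** 2)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(42)

# Hypothetical 20x20 grid with one predictor: temperature in degrees Celsius.
temperature = rng.uniform(18.0, 34.0, size=(20, 20))

# Suitability peaks at the assumed optimum and falls off on both sides.
suitability = gaussian_response(temperature, optimum=26.0, width=4.0)

# Map suitability to an occurrence probability (slope and offset are illustrative).
occurrence_prob = sigmoid(4.0 * (suitability - 0.5))

# Draw one presence/absence label per cell from its probability.
presence = (rng.uniform(size=occurrence_prob.shape) < occurrence_prob).astype(int)

print(f"mean probability: {occurrence_prob.mean():.2f}, prevalence: {presence.mean():.2f}")
```

The script applies the same idea to several predictors at once, combining their suitabilities in a weighted sum before the sigmoid.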

Installation

Clone the repository and install dependencies:

git clone https://github.com/Nelvinebi/Species-Distribution-Modeling-with-Environmental-Variables.git
cd Species-Distribution-Modeling-with-Environmental-Variables
pip install numpy pandas scikit-learn matplotlib openpyxl


Dependencies:

Python 3.8+

NumPy

Pandas

Scikit-learn

Matplotlib

OpenPyXL (for Excel export)

Usage

Run the script (stored in this repository as file; save it as sdm_synthetic.py first) to generate the synthetic dataset, train both models, and write the evaluation plots:

python sdm_synthetic.py --rows 40 --cols 40 --seed 42 --out outputs


Arguments:

--rows: Grid rows (default: 40)

--cols: Grid columns (default: 40)

--seed: Random seed for reproducibility (default: 42)

--out: Output directory (default: outputs)

The script stops with an error if the grid has 100 cells or fewer.

Example Output

outputs/sdm_synthetic_dataset.csv

outputs/sdm_synthetic_dataset.xlsx

outputs/map_true_prob.png (True probability map)

outputs/map_pred_prob_rf.png (Random Forest predicted probability map)

outputs/map_ndvi.png (NDVI map)

outputs/presence_overlay.png (Presence points overlay)

outputs/roc_curves.png, outputs/pr_curves.png (ROC and precision-recall curves)

outputs/metrics.txt (AUC, AP, accuracy, F1, and confusion matrix per model)

outputs/rf_permutation_importance.csv, outputs/rf_feature_importance.png

outputs/pdp_<feature>.png (Partial dependence plots for the top 3 features)

Dataset Columns

x, y: Normalized spatial coordinates (0 to 1)

elevation, slope: Topographic variables (normalized to 0 to 1)

temperature, precipitation: Climatic variables (°C and mm)

ndvi: Vegetation index (greenness, 0 to 1)

dist_to_water: Distance to the synthetic river, in normalized units

human_footprint: Anthropogenic pressure (normalized to 0 to 1)

occurrence_prob_true: True simulated occurrence probability

presence: 1 = species present, 0 = absent

Applications

Teaching and demonstrating SDM workflows without real ecological data.

Benchmarking ML algorithms for ecological niche modeling.

Exploring ecological response functions and habitat suitability patterns.

Author

Developed by Agbozu Ebingiye Nelvin
GitHub: https://github.com/Nelvinebi

License

This project is released under the MIT License.
See LICENSE for details.


--------------------------------------------------------------------------------
/file:
--------------------------------------------------------------------------------
#!/usr/bin/env python3
"""
Species-Distribution-Modeling-with-Environmental-Variables (Synthetic)
----------------------------------------------------------------------

Generates synthetic environmental predictors on a grid, simulates species
occurrence probabilities from ecological response curves, samples
presence/absence, trains ML models (Logistic Regression & Random Forest),
evaluates performance, and exports artifacts.

Requirements:
  - Python 3.8+
  - numpy, pandas, scikit-learn, matplotlib
  - openpyxl (optional, only needed for the Excel export)

Usage:
  python sdm_synthetic.py --rows 40 --cols 40 --seed 42 --out outputs
"""

import os
import argparse
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    roc_auc_score, roc_curve, average_precision_score, precision_recall_curve,
    confusion_matrix, accuracy_score, f1_score
)
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance, PartialDependenceDisplay


def sigmoid(z):
    """Logistic function mapping a linear predictor to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def generate_environment(rows=40, cols=40, seed=42):
    """
    Create a synthetic landscape with environmental variables on a regular grid.
    Returns a DataFrame with one row per cell and a metadata dict.
    """
    rng = np.random.default_rng(seed)

    # Grid (normalized coords)
    x = np.linspace(0, 1, cols)
    y = np.linspace(0, 1, rows)
    xx, yy = np.meshgrid(x, y)

    # Topography: rolling hills
    elev = 0.4 * np.sin(2*np.pi*xx) * np.cos(2*np.pi*yy) + 0.6*yy
    elev = (elev - elev.min()) / (elev.max() - elev.min())

    # Slope proxy from gradients
    gy, gx = np.gradient(elev)
    slope = np.sqrt(gx**2 + gy**2)
    slope = (slope - slope.min()) / (slope.max() - slope.min() + 1e-9)

    # Temperature (°C): warmer at lower elevation + latitudinal gradient
    temp = 30 - 8*elev - 3*(yy - 0.5) + 0.6*rng.normal(size=xx.shape)

    # Precipitation (mm): wetter in the "west" and at mid-elevations
    precip = 900 + 400*(1-xx) + 150*np.exp(-((elev-0.5)/0.25)**2) + 30*rng.normal(size=xx.shape)

    # NDVI-like greenness: favor moderate temp & higher precip
    ndvi = (
        0.7*np.exp(-0.5*((temp - 26)/4.5)**2) +
        0.3*(precip - precip.min())/(precip.max()-precip.min() + 1e-9)
    )
    ndvi += 0.05*rng.normal(size=ndvi.shape)
    ndvi = np.clip(ndvi, 0, 1)

    # Distance to a meandering river (simple vertical distance to curve)
    river = 0.55 + 0.12*np.sin(4*np.pi*xx)
    dist_to_water = np.abs(yy - river)  # in normalized units (0..~0.7)

    # Human footprint (urban cores) via two RBF centers
    def rbf(cx, cy, scale):
        return np.exp(-(((xx - cx)**2 + (yy - cy)**2) / (2 * scale**2)))
    human_fp = 0.9*rbf(0.25, 0.35, 0.12) + 0.8*rbf(0.75, 0.70, 0.10)
    human_fp += 0.05*rng.normal(size=human_fp.shape)
    human_fp = (human_fp - human_fp.min())/(human_fp.max()-human_fp.min() + 1e-9)

    # Assemble DataFrame
    df = pd.DataFrame({
        "x": xx.ravel(),
        "y": yy.ravel(),
        "elevation": elev.ravel(),
        "slope": slope.ravel(),
        "temperature": temp.ravel(),
        "precipitation": precip.ravel(),
        "ndvi": ndvi.ravel(),
        "dist_to_water": dist_to_water.ravel(),
        "human_footprint": human_fp.ravel(),
    })
    meta = {"rows": rows, "cols": cols}
    return df, meta


def simulate_occurrence(df, seed=42, target_prevalence=0.35):
    """
    Convert environmental variables to occurrence probability using
    simple ecological response curves, then sample presence/absence.
    """
    rng = np.random.default_rng(seed)

    # Preferences (Gaussian optima)
    temp_opt, temp_sd = 26.0, 4.0
    prec_opt, prec_sd = 1100.0, 250.0
    elev_opt, elev_sd = 0.45, 0.25

    suit_temp = np.exp(-0.5*((df["temperature"] - temp_opt)/temp_sd)**2)
    suit_prec = np.exp(-0.5*((df["precipitation"] - prec_opt)/prec_sd)**2)
    suit_elev = np.exp(-0.5*((df["elevation"] - elev_opt)/elev_sd)**2)
    suit_ndvi = df["ndvi"]  # already in 0..1
    suit_water = np.exp(-df["dist_to_water"]/0.08)  # decays with distance
    suit_human = 1 - df["human_footprint"]  # prefers low human footprint
    suit_slope = np.exp(-df["slope"]/0.25)  # gentle slopes preferred

    # Combine (weighted linear predictor)
    z = (
        1.6*suit_temp +
        1.4*suit_prec +
        1.2*suit_elev +
        1.8*suit_ndvi +
        1.0*suit_water +
        1.3*suit_human +
        0.6*suit_slope +
        0.2*rng.normal(size=len(df))
    )

    # Shift to hit target prevalence approximately
    mu = z.mean()
    def logit(p): return np.log(p/(1-p))
    bias = logit(target_prevalence) - mu
    p_occ = sigmoid(z + bias)

    y = rng.uniform(size=len(df)) < p_occ
    df = df.copy()
    df["occurrence_prob_true"] = p_occ
    df["presence"] = y.astype(int)
    return df


def evaluate_and_plot(models, X_train, X_test, y_train, y_test, feature_names, out_dir):
    """Evaluate models, save ROC/PR plots, feature importance and PDPs for RF."""
    os.makedirs(out_dir, exist_ok=True)

    # ROC & PR curves
    plt.figure(figsize=(6,5))
    for name, mdl in models.items():
        s = mdl.predict_proba(X_test)[:,1]
        fpr, tpr, _ = roc_curve(y_test, s)
        auc = roc_auc_score(y_test, s)
        plt.plot(fpr, tpr, label=f"{name} (AUC={auc:.3f})")
    plt.plot([0,1],[0,1], linestyle="--")
    plt.xlabel("False Positive Rate"); plt.ylabel("True Positive Rate")
    plt.title("ROC Curves")
    plt.legend(loc="lower right")
    plt.tight_layout(); plt.savefig(os.path.join(out_dir, "roc_curves.png"), dpi=200); plt.close()

    plt.figure(figsize=(6,5))
    for name, mdl in models.items():
        s = mdl.predict_proba(X_test)[:,1]
        prec, rec, _ = precision_recall_curve(y_test, s)
        ap = average_precision_score(y_test, s)
        plt.plot(rec, prec, label=f"{name} (AP={ap:.3f})")
    plt.xlabel("Recall"); plt.ylabel("Precision")
    plt.title("Precision-Recall Curves")
    plt.legend(loc="lower left")
    plt.tight_layout(); plt.savefig(os.path.join(out_dir, "pr_curves.png"), dpi=200); plt.close()

    # Metrics summary
    lines = []
    for name, mdl in models.items():
        s = mdl.predict_proba(X_test)[:,1]
        y_pred = (s >= 0.5).astype(int)
        auc = roc_auc_score(y_test, s)
        ap = average_precision_score(y_test, s)
        acc = accuracy_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred)
        cm = confusion_matrix(y_test, y_pred)
        lines.append(f"{name:>14} | AUC={auc:.3f} AP={ap:.3f} ACC={acc:.3f} F1={f1:.3f}\nCM=\n{cm}\n")
    with open(os.path.join(out_dir, "metrics.txt"), "w") as f:
        f.writelines(lines)

    # Permutation importance (RF only)
    if "RandomForest" in models:
        rf = models["RandomForest"]
        perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42, n_jobs=-1)
        imp = (pd.DataFrame({
            "feature": feature_names,
            "importance_mean": perm.importances_mean,
            "importance_std": perm.importances_std
        })
        .sort_values("importance_mean", ascending=False))
        imp.to_csv(os.path.join(out_dir, "rf_permutation_importance.csv"), index=False)

        plt.figure(figsize=(7,6))
        plt.barh(imp["feature"][::-1], imp["importance_mean"][::-1])
        # permutation_importance was called without `scoring`, so it uses the
        # classifier's default score, which is accuracy (not AUC).
        plt.xlabel("Permutation Importance (mean decrease in accuracy)")
        plt.title("RandomForest Feature Importance")
        plt.tight_layout(); plt.savefig(os.path.join(out_dir, "rf_feature_importance.png"), dpi=200); plt.close()

        # Partial dependence for top 3
        top3 = imp["feature"].head(3).tolist()
        for feat in top3:
            # Pass the axes explicitly; otherwise from_estimator opens a second
            # figure and the one created here is left empty and never closed.
            fig, ax = plt.subplots(figsize=(5.5,4.5))
            try:
                PartialDependenceDisplay.from_estimator(
                    rf, X_test, [feature_names.index(feat)],
                    feature_names=feature_names, grid_resolution=50, ax=ax
                )
                plt.title(f"PDP: {feat}")
                plt.tight_layout()
                fig.savefig(os.path.join(out_dir, f"pdp_{feat}.png"), dpi=200)
            except Exception as e:
                print(f"PDP failed for {feat}: {e}")
            finally:
                plt.close(fig)

    return lines


def plot_maps(meta, df, proba_pred, out_dir):
    """Save heatmaps for true suitability, predicted probability, and presence points."""
    r, c = meta["rows"], meta["cols"]
    os.makedirs(out_dir, exist_ok=True)

    def imsave(Z, title, fname, cmap=None, vmin=None, vmax=None):
        plt.figure(figsize=(6,5))
        im = plt.imshow(Z.reshape(r, c), origin="lower", vmin=vmin, vmax=vmax, cmap=cmap)
        plt.title(title); plt.colorbar(im, shrink=0.8)
        plt.tight_layout(); plt.savefig(os.path.join(out_dir, fname), dpi=200); plt.close()

    imsave(df["occurrence_prob_true"].values, "True Occurrence Probability", "map_true_prob.png")
    imsave(proba_pred, "Predicted Occurrence Probability (RF)", "map_pred_prob_rf.png")
    imsave(df["ndvi"].values, "NDVI", "map_ndvi.png", vmin=0, vmax=1)

    # Presence point overlay on true prob
    plt.figure(figsize=(6,5))
    plt.imshow(df["occurrence_prob_true"].values.reshape(r,c), origin="lower")
    pres = df["presence"].values.reshape(r,c)
    # Row/column indices of presence cells; these match imshow's pixel coordinates.
    ys, xs = np.where(pres == 1)
    plt.scatter(xs, ys, s=6, c="white", marker="o", linewidths=0.3, edgecolors="black")
    plt.title("Presence Points on True Probability")
    plt.tight_layout(); plt.savefig(os.path.join(out_dir, "presence_overlay.png"), dpi=200); plt.close()


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--rows", type=int, default=40, help="Grid rows (>10).")
    ap.add_argument("--cols", type=int, default=40, help="Grid cols (>10).")
    ap.add_argument("--seed", type=int, default=42, help="Random seed.")
    ap.add_argument("--out", type=str, default="outputs", help="Output folder.")
    args = ap.parse_args()

    os.makedirs(args.out, exist_ok=True)

    # 1) Generate environment & simulate species
    env_df, meta = generate_environment(args.rows, args.cols, seed=args.seed)
    df = simulate_occurrence(env_df, seed=args.seed, target_prevalence=0.35)

    # Ensure >100 points
    if len(df) <= 100:
        raise ValueError(f"Dataset too small ({len(df)}). Increase --rows/--cols.")

    # Save dataset
    csv_path = os.path.join(args.out, "sdm_synthetic_dataset.csv")
    xlsx_path = os.path.join(args.out, "sdm_synthetic_dataset.xlsx")
    df.to_csv(csv_path, index=False)
    try:
        df.to_excel(xlsx_path, index=False)  # requires openpyxl
    except Exception as e:
        print(f"Excel export skipped ({e}). CSV saved at {csv_path}")

    # 2) Train/test split
    FEATURES = ["x","y","elevation","slope","temperature","precipitation","ndvi","dist_to_water","human_footprint"]
    TARGET = "presence"
    X = df[FEATURES].values
    y = df[TARGET].values

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=args.seed, stratify=y
    )

    # 3) Train models
    logit = LogisticRegression(max_iter=1000)
    logit.fit(X_train, y_train)

    rf = RandomForestClassifier(
        n_estimators=500, max_depth=None, min_samples_leaf=1,
        random_state=args.seed, n_jobs=-1, class_weight=None
    )
    rf.fit(X_train, y_train)

    models = {"Logistic": logit, "RandomForest": rf}

    # 4) Evaluate & plot curves/importance
    metrics_lines = evaluate_and_plot(models, X_train, X_test, y_train, y_test, FEATURES, args.out)

    # 5) Probability map from RF over full grid (includes the training cells)
    proba_pred_rf = rf.predict_proba(X)[:,1]
    plot_maps(meta, df, proba_pred_rf, args.out)

    # 6) Print summary
    print(f"Synthetic SDM dataset saved to:\n - {csv_path}\n - {xlsx_path} (if openpyxl available)")
    print("Metrics:\n" + "".join(metrics_lines))
    print(f"Artifacts in '{args.out}': roc_curves.png, pr_curves.png, rf_permutation_importance.csv, "
          f"rf_feature_importance.png, pdp_*.png, map_true_prob.png, map_pred_prob_rf.png, "
          f"map_ndvi.png, presence_overlay.png, metrics.txt")


if __name__ == "__main__":
    main()

--------------------------------------------------------------------------------