├── synthetic_species_distribution_dataset.xlsx
├── README.md
└── file


/synthetic_species_distribution_dataset.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Nelvinebi/Species-Distribution-Modeling-with-Environmental-Variables/HEAD/synthetic_species_distribution_dataset.xlsx


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | Species Distribution Modeling with Environmental Variables (Synthetic Data)
  2 | 
  3 | This project demonstrates Species Distribution Modeling (SDM) using synthetic environmental data. It simulates environmental variables such as elevation, slope, temperature, precipitation, NDVI, distance to water, and human footprint, then generates species occurrence probabilities based on ecological response curves. Presence/absence data is sampled and can be used to train and evaluate machine learning models.
  4 | 
  5 | Features
  6 | 
  7 | Generates synthetic environmental variables on a grid (e.g., 20×20, 40×40).
  8 | 
  9 | Simulates species occurrence probabilities from ecological response curves.
 10 | 
 11 | Produces presence/absence labels based on habitat suitability.
 12 | 
 13 | Saves datasets in Excel (.xlsx) and CSV (.csv) formats.
 14 | 
 15 | Compatible with ML workflows (e.g., Logistic Regression, Random Forest).
 16 | 
 17 | Includes visualization of occurrence probability maps and presence points.
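
The response-curve idea in the features above can be illustrated with a minimal sketch. It assumes a single predictor (temperature) with a Gaussian optimum, which mirrors the shape used in the script but is not the script's full model; the helper name `gaussian_response`, the five temperature values, and the scaling factor 3.0 are illustrative choices, not taken from the repository.

```python
import numpy as np


def gaussian_response(value, optimum, width):
    """Bell-shaped suitability in (0, 1], peaking where value == optimum."""
    return np.exp(-0.5 * ((value - optimum) / width) ** 2)


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(0)

# Hypothetical temperatures (deg C) for five grid cells.
temperature = np.array([18.0, 22.0, 26.0, 30.0, 34.0])

# Suitability peaks at the optimum (26 deg C) and falls off symmetrically.
suitability = gaussian_response(temperature, optimum=26.0, width=4.0)

# Centre the score, then squash it to a probability (3.0 is an illustrative gain).
prob = sigmoid(3.0 * (suitability - suitability.mean()))

# One Bernoulli draw per cell turns probabilities into presence/absence labels.
presence = (rng.uniform(size=prob.shape) < prob).astype(int)

print(np.round(suitability, 3))
print(presence)
```

Cells near the optimum receive the highest probability, but because labels are sampled, a highly suitable cell can still be recorded as an absence.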
 18 | 
 19 | Installation
 20 | 
 21 | Clone the repository and install dependencies:
 22 | 
 23 | git clone https://github.com/Nelvinebi/Species-Distribution-Modeling-with-Environmental-Variables.git
 24 | cd Species-Distribution-Modeling-with-Environmental-Variables
 25 | pip install numpy pandas scikit-learn matplotlib openpyxl
 26 | 
 27 | 
 28 | Dependencies:
 29 | 
 30 | Python 3.8+
 31 | 
 32 | NumPy
 33 | 
 34 | Pandas
 35 | 
 36 | Scikit-learn
 37 | 
 38 | Matplotlib
 39 | 
 40 | OpenPyXL (for Excel export)
 41 | 
 42 | Usage
 43 | 
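 44 | Run the script to generate the synthetic SDM dataset, train the Logistic Regression and Random Forest models, and save evaluation plots and metrics. The repository stores the script as `file`; the command below assumes it has been saved as sdm_synthetic.py: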
 45 | 
 46 | python sdm_synthetic.py --rows 40 --cols 40 --seed 42 --out outputs
 47 | 
 48 | 
 49 | Arguments:
 50 | 
 51 | --rows: Grid rows (default: 40)
 52 | 
 53 | --cols: Grid columns (default: 40)
 54 | 
 55 | --seed: Random seed for reproducibility (default: 42)
 56 | 
 57 | --out: Output directory (default: outputs)
 58 | 
 59 | Example Output
 60 | 
 61 | outputs/sdm_synthetic_dataset.csv
 62 | 
 63 | outputs/sdm_synthetic_dataset.xlsx
 64 | 
 65 | outputs/map_true_prob.png (True probability map)
 66 | 
 67 | outputs/presence_overlay.png (Presence points overlay)
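
The exported table has one row per grid cell, flattened row by row, so a column can be folded back into a 2-D map for plotting. The sketch below shows this under stated assumptions: it builds a tiny 3 x 4 stand-in table in memory (the values are made up) and round-trips it through CSV text instead of reading the real output file.

```python
import io

import numpy as np
import pandas as pd

rows, cols = 3, 4  # tiny stand-in grid; the script defaults to 40 x 40

# Build a stand-in table with the same layout as the script's output:
# np.meshgrid followed by ravel(), i.e. row-major order.
x = np.linspace(0, 1, cols)
y = np.linspace(0, 1, rows)
xx, yy = np.meshgrid(x, y)
demo = pd.DataFrame({
    "x": xx.ravel(),
    "y": yy.ravel(),
    "occurrence_prob_true": (xx * yy).ravel(),  # made-up values
})

# Round-trip through CSV text to mimic writing and re-reading the dataset.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)
df = pd.read_csv(buffer)

# Fold the flat column back into a (rows, cols) array, e.g. for plt.imshow.
grid = df["occurrence_prob_true"].to_numpy().reshape(rows, cols)
print(grid.shape)
```

With the real output, the same reshape should apply after `pd.read_csv("outputs/sdm_synthetic_dataset.csv")`, using the `--rows` and `--cols` values passed to the script.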
 68 | 
 69 | Dataset Columns
 70 | 
 71 | x, y: Normalized spatial coordinates
 72 | 
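 73 | elevation, slope: Topographic variables, each rescaled to 0–1
 74 | 
 75 | temperature, precipitation: Climatic variables (°C and mm)
 76 | 
 77 | ndvi: Vegetation index (greenness), clipped to 0–1
 78 | 
 79 | dist_to_water: Vertical distance to the synthetic river, in normalized grid units
 80 | 
 81 | human_footprint: Anthropogenic pressure, rescaled to 0–1
 82 | 
 83 | occurrence_prob_true: True simulated occurrence probability, before presence/absence sampling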
 84 | 
 85 | presence: 1 = species present, 0 = absent
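
As a rough illustration of how these columns feed an ML workflow, the sketch below fits a logistic regression on three of the documented column names. Everything else is an assumption for the example: the data are randomly generated stand-ins, the presence rule is made up, and the `StandardScaler` step is a suggestion (the repository's script fits on raw features) that may help because precipitation is on a much larger numeric scale than NDVI.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 400

# Stand-in data using three documented column names; values are made up.
df = pd.DataFrame({
    "temperature": rng.normal(26.0, 3.0, size=n),
    "precipitation": rng.normal(1100.0, 150.0, size=n),
    "ndvi": rng.uniform(0.0, 1.0, size=n),
})

# Made-up rule for the example: greener cells are more likely to be occupied.
p = 1.0 / (1.0 + np.exp(-(4.0 * df["ndvi"] - 2.0)))
df["presence"] = (rng.uniform(size=n) < p).astype(int)

features = ["temperature", "precipitation", "ndvi"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["presence"],
    test_size=0.25, random_state=0, stratify=df["presence"],
)

# Standardize features before the linear model so no column dominates by scale.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Test AUC: {auc:.3f}")
```

To try this on the generated dataset, replace the stand-in frame with the exported CSV and extend `features` to the full column list; the AUC printed here says nothing about performance on the real synthetic data.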
 86 | 
 87 | Applications
 88 | 
 89 | Teaching and demonstrating SDM workflows without real ecological data.
 90 | 
 91 | Benchmarking ML algorithms for ecological niche modeling.
 92 | 
 93 | Exploring ecological response functions and habitat suitability patterns.
 94 | 
 95 | Author
 96 | 
 97 | Developed by Agbozu Ebingiye Nelvin
 98 | 
 99 | GitHub: https://github.com/Nelvinebi
100 | 
101 | License
102 | 
103 | This project is released under the MIT License. See LICENSE for details.
104 | 


--------------------------------------------------------------------------------
/file:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python3
  2 | """
  3 | Species-Distribution-Modeling-with-Environmental-Variables (Synthetic)
  4 | ----------------------------------------------------------------------
  5 | 
  6 | Generates synthetic environmental predictors on a grid, simulates species
  7 | occurrence probabilities from ecological response curves, samples
  8 | presence/absence, trains ML models (Logistic Regression & Random Forest),
  9 | evaluates performance, and exports artifacts.
 10 | 
 11 | Requirements:
 12 |   - Python 3.8+
 13 |   - numpy, pandas, scikit-learn, matplotlib (openpyxl optional, for Excel export)
 14 | 
 15 | Usage:
 16 |   python sdm_synthetic.py --rows 40 --cols 40 --seed 42 --out outputs
 17 | """
 18 | 
 19 | import os
 20 | import math
 21 | import argparse
 22 | import numpy as np
 23 | import pandas as pd
 24 | import matplotlib.pyplot as plt
 25 | 
 26 | from sklearn.model_selection import train_test_split
 27 | from sklearn.metrics import (
 28 |     roc_auc_score, roc_curve, average_precision_score, precision_recall_curve,
 29 |     confusion_matrix, accuracy_score, f1_score
 30 | )
 31 | from sklearn.linear_model import LogisticRegression
 32 | from sklearn.ensemble import RandomForestClassifier
 33 | from sklearn.inspection import permutation_importance, PartialDependenceDisplay
 34 | 
 35 | # Each function below builds its own np.random.default_rng(seed), so no module-level generator is needed.
 36 | 
 37 | 
 38 | def sigmoid(z):
 39 |     return 1.0 / (1.0 + np.exp(-z))
 40 | 
 41 | 
 42 | def generate_environment(rows=40, cols=40, seed=42):
 43 |     """
 44 |     Create a synthetic landscape with environmental variables on a regular grid.
 45 |     Returns a DataFrame with one row per cell and a metadata dict.
 46 |     """
 47 |     rng = np.random.default_rng(seed)
 48 | 
 49 |     # Grid (normalized coords)
 50 |     x = np.linspace(0, 1, cols)
 51 |     y = np.linspace(0, 1, rows)
 52 |     xx, yy = np.meshgrid(x, y)
 53 | 
 54 |     # Topography: rolling hills
 55 |     elev = 0.4 * np.sin(2*np.pi*xx) * np.cos(2*np.pi*yy) + 0.6*yy
 56 |     elev = (elev - elev.min()) / (elev.max() - elev.min())
 57 | 
 58 |     # Slope proxy from gradients
 59 |     gy, gx = np.gradient(elev)
 60 |     slope = np.sqrt(gx**2 + gy**2)
 61 |     slope = (slope - slope.min()) / (slope.max() - slope.min() + 1e-9)
 62 | 
 63 |     # Temperature (°C): warmer at lower elevation + latitudinal gradient
 64 |     temp = 30 - 8*elev - 3*(yy - 0.5) + 0.6*rng.normal(size=xx.shape)
 65 | 
 66 |     # Precipitation (mm): wetter in the "west" and at mid-elevations
 67 |     precip = 900 + 400*(1-xx) + 150*np.exp(-((elev-0.5)/0.25)**2) + 30*rng.normal(size=xx.shape)
 68 | 
 69 |     # NDVI-like greenness: favor moderate temp & higher precip
 70 |     ndvi = (
 71 |         0.7*np.exp(-0.5*((temp - 26)/4.5)**2) +
 72 |         0.3*(precip - precip.min())/(precip.max()-precip.min() + 1e-9)
 73 |     )
 74 |     ndvi += 0.05*rng.normal(size=ndvi.shape)
 75 |     ndvi = np.clip(ndvi, 0, 1)
 76 | 
 77 |     # Distance to a meandering river (simple vertical distance to curve)
 78 |     river = 0.55 + 0.12*np.sin(4*np.pi*xx)
 79 |     dist_to_water = np.abs(yy - river)  # in normalized units (0..~0.7)
 80 | 
 81 |     # Human footprint (urban cores) via two RBF centers
 82 |     def rbf(cx, cy, scale):
 83 |         return np.exp(-(((xx - cx)**2 + (yy - cy)**2) / (2 * scale**2)))
 84 |     human_fp = 0.9*rbf(0.25, 0.35, 0.12) + 0.8*rbf(0.75, 0.70, 0.10)
 85 |     human_fp += 0.05*rng.normal(size=human_fp.shape)
 86 |     human_fp = (human_fp - human_fp.min())/(human_fp.max()-human_fp.min() + 1e-9)
 87 | 
 88 |     # Assemble DataFrame
 89 |     df = pd.DataFrame({
 90 |         "x": xx.ravel(),
 91 |         "y": yy.ravel(),
 92 |         "elevation": elev.ravel(),
 93 |         "slope": slope.ravel(),
 94 |         "temperature": temp.ravel(),
 95 |         "precipitation": precip.ravel(),
 96 |         "ndvi": ndvi.ravel(),
 97 |         "dist_to_water": dist_to_water.ravel(),
 98 |         "human_footprint": human_fp.ravel(),
 99 |     })
100 |     meta = {"rows": rows, "cols": cols}
101 |     return df, meta
102 | 
103 | 
104 | def simulate_occurrence(df, seed=42, target_prevalence=0.35):
105 |     """
106 |     Convert environmental variables to occurrence probability using
107 |     simple ecological response curves, then sample presence/absence.
108 |     """
109 |     rng = np.random.default_rng(seed)
110 | 
111 |     # Preferences (Gaussian optima)
112 |     temp_opt, temp_sd = 26.0, 4.0
113 |     prec_opt, prec_sd = 1100.0, 250.0
114 |     elev_opt, elev_sd = 0.45, 0.25
115 | 
116 |     suit_temp = np.exp(-0.5*((df["temperature"] - temp_opt)/temp_sd)**2)
117 |     suit_prec = np.exp(-0.5*((df["precipitation"] - prec_opt)/prec_sd)**2)
118 |     suit_elev = np.exp(-0.5*((df["elevation"] - elev_opt)/elev_sd)**2)
119 |     suit_ndvi = df["ndvi"]  # already in 0..1
120 |     suit_water = np.exp(-df["dist_to_water"]/0.08)  # decays with distance
121 |     suit_human = 1 - df["human_footprint"]          # prefers low human footprint
122 |     suit_slope = np.exp(-df["slope"]/0.25)          # gentle slopes preferred
123 | 
124 |     # Combine (weighted linear predictor)
125 |     z = (
126 |         1.6*suit_temp +
127 |         1.4*suit_prec +
128 |         1.2*suit_elev +
129 |         1.8*suit_ndvi +
130 |         1.0*suit_water +
131 |         1.3*suit_human +
132 |         0.6*suit_slope +
133 |         0.2*rng.normal(size=len(df))
134 |     )
135 | 
136 |     # Shift to hit target prevalence approximately
137 |     mu = z.mean()
138 |     def logit(p): return np.log(p/(1-p))
139 |     bias = logit(target_prevalence) - mu
140 |     p_occ = sigmoid(z + bias)
141 | 
142 |     y = rng.uniform(size=len(df)) < p_occ
143 |     df = df.copy()
144 |     df["occurrence_prob_true"] = p_occ
145 |     df["presence"] = y.astype(int)
146 |     return df
147 | 
148 | 
149 | def evaluate_and_plot(models, X_train, X_test, y_train, y_test, feature_names, out_dir):
150 |     """Evaluate models, save ROC/PR plots, feature importance and PDPs for RF."""
151 |     os.makedirs(out_dir, exist_ok=True)
152 | 
153 |     # ROC & PR curves
154 |     plt.figure(figsize=(6,5))
155 |     for name, mdl in models.items():
156 |         s = mdl.predict_proba(X_test)[:,1]
157 |         fpr, tpr, _ = roc_curve(y_test, s)
158 |         auc = roc_auc_score(y_test, s)
159 |         plt.plot(fpr, tpr, label=f"{name} (AUC={auc:.3f})")
160 |     plt.plot([0,1],[0,1], linestyle="--")
161 |     plt.xlabel("False Positive Rate"); plt.ylabel("True Positive Rate")
162 |     plt.title("ROC Curves")
163 |     plt.legend(loc="lower right")
164 |     plt.tight_layout(); plt.savefig(os.path.join(out_dir, "roc_curves.png"), dpi=200); plt.close()
165 | 
166 |     plt.figure(figsize=(6,5))
167 |     for name, mdl in models.items():
168 |         s = mdl.predict_proba(X_test)[:,1]
169 |         prec, rec, _ = precision_recall_curve(y_test, s)
170 |         ap = average_precision_score(y_test, s)
171 |         plt.plot(rec, prec, label=f"{name} (AP={ap:.3f})")
172 |     plt.xlabel("Recall"); plt.ylabel("Precision")
173 |     plt.title("Precision-Recall Curves")
174 |     plt.legend(loc="lower left")
175 |     plt.tight_layout(); plt.savefig(os.path.join(out_dir, "pr_curves.png"), dpi=200); plt.close()
176 | 
177 |     # Metrics summary
178 |     lines = []
179 |     for name, mdl in models.items():
180 |         s = mdl.predict_proba(X_test)[:,1]
181 |         y_pred = (s >= 0.5).astype(int)
182 |         auc = roc_auc_score(y_test, s)
183 |         ap = average_precision_score(y_test, s)
184 |         acc = accuracy_score(y_test, y_pred)
185 |         f1 = f1_score(y_test, y_pred)
186 |         cm = confusion_matrix(y_test, y_pred)
187 |         lines.append(f"{name:>14} | AUC={auc:.3f}  AP={ap:.3f}  ACC={acc:.3f}  F1={f1:.3f}\nCM=\n{cm}\n")
188 |     with open(os.path.join(out_dir, "metrics.txt"), "w") as f:
189 |         f.writelines(lines)
190 | 
191 |     # Permutation importance (RF only)
192 |     if "RandomForest" in models:
193 |         rf = models["RandomForest"]
194 |         perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42, n_jobs=-1)
195 |         imp = (pd.DataFrame({
196 |             "feature": feature_names,
197 |             "importance_mean": perm.importances_mean,
198 |             "importance_std": perm.importances_std
199 |         })
200 |         .sort_values("importance_mean", ascending=False))
201 |         imp.to_csv(os.path.join(out_dir, "rf_permutation_importance.csv"), index=False)
202 | 
203 |         plt.figure(figsize=(7,6))
204 |         plt.barh(imp["feature"][::-1], imp["importance_mean"][::-1])
205 |         plt.xlabel("Permutation Importance (mean decrease in accuracy)")
206 |         plt.title("RandomForest Feature Importance")
207 |         plt.tight_layout(); plt.savefig(os.path.join(out_dir, "rf_feature_importance.png"), dpi=200); plt.close()
208 | 
209 |         # Partial dependence for top 3
210 |         top3 = imp["feature"].head(3).tolist()
211 |         fig_list = []
212 |         for feat in top3:
213 |             fig, ax = plt.subplots(figsize=(5.5,4.5))  # pass our own axes so from_estimator does not open a second figure
214 |             try:
215 |                 PartialDependenceDisplay.from_estimator(rf, X_test, [feature_names.index(feat)], feature_names=feature_names, grid_resolution=50, ax=ax)
216 |                 ax.set_title(f"PDP: {feat}")
217 |                 fig.tight_layout()
218 |                 fig_path = os.path.join(out_dir, f"pdp_{feat}.png")
219 |                 fig.savefig(fig_path, dpi=200); plt.close(fig)
220 |                 fig_list.append(fig_path)
221 |             except Exception as e:
222 |                 plt.close(fig)
223 |                 print(f"PDP failed for {feat}: {e}")
224 | 
225 |     return lines
226 | 
227 | 
228 | def plot_maps(meta, df, proba_pred, out_dir):
229 |     """Save heatmaps for true suitability, predicted probability, and presence points."""
230 |     r, c = meta["rows"], meta["cols"]
231 |     os.makedirs(out_dir, exist_ok=True)
232 | 
233 |     def imsave(Z, title, fname, cmap=None, vmin=None, vmax=None):
234 |         plt.figure(figsize=(6,5))
235 |         im = plt.imshow(Z.reshape(r, c), origin="lower", vmin=vmin, vmax=vmax, cmap=cmap)
236 |         plt.title(title); plt.colorbar(im, shrink=0.8)
237 |         plt.tight_layout(); plt.savefig(os.path.join(out_dir, fname), dpi=200); plt.close()
238 | 
239 |     imsave(df["occurrence_prob_true"].values, "True Occurrence Probability", "map_true_prob.png")
240 |     imsave(proba_pred, "Predicted Occurrence Probability (RF)", "map_pred_prob_rf.png")
241 |     imsave(df["ndvi"].values, "NDVI", "map_ndvi.png", vmin=0, vmax=1)
242 | 
243 |     # Presence point overlay on true prob
244 |     plt.figure(figsize=(6,5))
245 |     plt.imshow(df["occurrence_prob_true"].values.reshape(r,c), origin="lower")
246 |     yy, xx = (df["y"].values.reshape(r,c), df["x"].values.reshape(r,c))
247 |     pres = df["presence"].values.reshape(r,c)
248 |     ys, xs = np.where(pres == 1)
249 |     plt.scatter(xs, ys, s=6, c="white", marker="o", linewidths=0.3, edgecolors="black")
250 |     plt.title("Presence Points on True Probability")
251 |     plt.tight_layout(); plt.savefig(os.path.join(out_dir, "presence_overlay.png"), dpi=200); plt.close()
252 | 
253 | 
254 | def main():
255 |     ap = argparse.ArgumentParser()
256 |     ap.add_argument("--rows", type=int, default=40, help="Grid rows (rows*cols must exceed 100).")
257 |     ap.add_argument("--cols", type=int, default=40, help="Grid cols (rows*cols must exceed 100).")
258 |     ap.add_argument("--seed", type=int, default=42, help="Random seed.")
259 |     ap.add_argument("--out",  type=str, default="outputs", help="Output folder.")
260 |     args = ap.parse_args()
261 | 
262 |     os.makedirs(args.out, exist_ok=True)
263 | 
264 |     # 1) Generate environment & simulate species
265 |     env_df, meta = generate_environment(args.rows, args.cols, seed=args.seed)
266 |     df = simulate_occurrence(env_df, seed=args.seed, target_prevalence=0.35)
267 | 
268 |     # Ensure >100 points
269 |     if len(df) <= 100:
270 |         raise ValueError(f"Dataset too small ({len(df)}). Increase --rows/--cols.")
271 | 
272 |     # Save dataset
273 |     csv_path = os.path.join(args.out, "sdm_synthetic_dataset.csv")
274 |     xlsx_path = os.path.join(args.out, "sdm_synthetic_dataset.xlsx")
275 |     df.to_csv(csv_path, index=False)
276 |     try:
277 |         df.to_excel(xlsx_path, index=False)  # requires openpyxl
278 |     except Exception as e:
279 |         print(f"Excel export skipped ({e}). CSV saved at {csv_path}")
280 | 
281 |     # 2) Train/test split
282 |     FEATURES = ["x","y","elevation","slope","temperature","precipitation","ndvi","dist_to_water","human_footprint"]
283 |     TARGET = "presence"
284 |     X = df[FEATURES].values
285 |     y = df[TARGET].values
286 | 
287 |     X_train, X_test, y_train, y_test = train_test_split(
288 |         X, y, test_size=0.25, random_state=args.seed, stratify=y
289 |     )
290 | 
291 |     # 3) Train models
292 |     logit = LogisticRegression(max_iter=1000)
293 |     logit.fit(X_train, y_train)
294 | 
295 |     rf = RandomForestClassifier(
296 |         n_estimators=500, max_depth=None, min_samples_leaf=1,
297 |         random_state=args.seed, n_jobs=-1, class_weight=None
298 |     )
299 |     rf.fit(X_train, y_train)
300 | 
301 |     models = {"Logistic": logit, "RandomForest": rf}
302 | 
303 |     # 4) Evaluate & plot curves/importance
304 |     metrics_lines = evaluate_and_plot(models, X_train, X_test, y_train, y_test, FEATURES, args.out)
305 | 
306 |     # 5) Probability map from RF over full grid
307 |     proba_pred_rf = rf.predict_proba(X)[:,1]
308 |     plot_maps(meta, df, proba_pred_rf, args.out)
309 | 
310 |     # 6) Print summary
311 |     print(f"Synthetic SDM dataset saved to:\n  - {csv_path}\n  - {xlsx_path} (if openpyxl available)")
312 |     print("Metrics:\n" + "".join(metrics_lines))
313 |     print(f"Artifacts in '{args.out}': roc_curves.png, pr_curves.png, rf_feature_importance.png, "
314 |           f"pdp_*.png, map_true_prob.png, map_pred_prob_rf.png, presence_overlay.png, metrics.txt")
315 | 
316 | 
317 | if __name__ == "__main__":
318 |     main()
319 | 


--------------------------------------------------------------------------------