├── .gitattributes
├── architecture.png
├── first_place_solution_doc.docx
├── requirements.txt
├── .gitignore
└── README.md

/.gitattributes:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/architecture.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bakarys01/hydropower-kalam-winner/HEAD/architecture.png
--------------------------------------------------------------------------------
/first_place_solution_doc.docx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bakarys01/hydropower-kalam-winner/HEAD/first_place_solution_doc.docx
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
numpy
pandas
matplotlib
seaborn
scikit-learn
lightgbm
tqdm
bayesian-optimization
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
# Jupyter notebook checkpoints
.ipynb_checkpoints/

# Python cache
__pycache__/
*.pyc

# Virtual environments
venv/
.env/
env/

# Data files
*.csv
*.xlsx
*.xls
*.xlsm
*.xlsb

# Datasets folder
datasets/
datasets/*
output/
output/*

# Output files
*.pkl
*.h5
*.model

# Logs
logs/
*.log

# OS specific files
.DS_Store
Thumbs.db

# IDE specific files
.idea/
.vscode/
*.swp
*.swo
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# First Place Solution
**15 Apr 2025, 16:44 · 11 min read**

---

## 1. Overview and Objectives
**Purpose:** This document describes the winning solution for the *IBM SkillsBuild Hydropower Climate Optimisation Challenge*. The goal is to predict daily energy consumption for multiple consumer devices in Kalam, Pakistan. The solution applies comprehensive feature engineering, strategic data segmentation, and advanced ensemble modeling to handle the region's seasonal and climate-related complexities.

**Objectives and Expected Outcomes:**
1. **Accurate kWh Predictions**: Achieve low RMSE for daily consumption forecasts.
2. **Seasonality and Weather Integration**: Incorporate Pakistan's distinctive seasonal and weather factors.
3. **Robustness**: Effectively manage zero-consumption periods and device filtering.
4. **Efficient Inference Pipeline**: Offer a systematic path from raw data transformation to final predictions.

---

## 🔥 Summary
This repository contains the winning solution for the IBM SkillsBuild Hydropower Climate Optimisation Challenge. It provides a complete end-to-end pipeline: from data preprocessing and feature engineering, through model training and ensemble optimization, to inference and submission file generation.

**Key Results**
- **Private Leaderboard RMSE:** 4.312706649
- **Final Rank:** 🥇 1st Place

---

## 🧹 Data Preprocessing & Cleaning
- **Device Filtering:** Excluded consumer devices and users not present in the test set to avoid leakage and noise.
- **Aggregation:** Converted raw 5-minute readings into daily statistics (sum, mean, std, min, max) for kWh, voltage (red/blue/yellow), current, and power factor — see the sketch below.
- **Offline Period Detection:** Marked weeks with zero consumption across all days as "offline" and filtered them out to focus on meaningful usage patterns.
- **Final Dataset:** Saved as `output/filtred_online_days_enriched.csv`, containing only "online" days for the users in the test set.
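To make the aggregation step concrete, here is a minimal sketch. It is not the notebook's exact code: column names such as `device_id`, `timestamp`, `kwh`, and `voltage_red` are assumptions about the raw schema in `datasets/Data/Data.csv`; adapt them as needed.

```python
# Hedged sketch: collapse raw 5-minute readings into daily statistics.
# Column names are illustrative assumptions, not the dataset's real schema.
import pandas as pd

raw = pd.read_csv("datasets/Data/Data.csv", parse_dates=["timestamp"])

daily = (
    raw.assign(date=raw["timestamp"].dt.date)   # 5-minute stamps -> calendar day
       .groupby(["device_id", "date"])
       .agg({
           "kwh": ["sum", "mean", "std", "min", "max"],
           "voltage_red": ["mean", "min", "max"],
           "power_factor": ["mean"],
       })
)
# Flatten the MultiIndex columns: ('kwh', 'sum') -> 'kwh_sum', etc.
daily.columns = ["_".join(col) for col in daily.columns]
daily = daily.reset_index()
```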
---

## 🌦️ Feature Engineering (Comprehensive Climate Integration)
- **Climate Data Merge:** Joined daily aggregates of temperature, dewpoint, precipitation, snowfall, snow cover, and wind components.
- **Cyclical Temporal Features:** Sine/cosine encodings for day-of-week, day-of-year, month, week, and day-in-season.
- **Pakistan-Specific Features:** Flags for public holidays and Ramadan periods; season indicators tuned to Kalam's climate.
- **Advanced Trends:** Exponential moving averages, volatility, acceleration, and extreme-event indicators for temperature.
- **Heating/Cooling Metrics:** Heating degree days, cooling degree days, and temperature-dewpoint differentials.
- **Interactions:** Temperature × weekday interactions and weather × consumption effects.

---

## 🧪 Strategic Data Segmentation
The training data was divided into four temporal segments to capture distinct seasonal behaviors:

| Segment | Periods | Rationale |
| ------- | --------------------------------------- | ---------------------------------------------------- |
| **Data1** | Aug–Sep 2024 & Oct 2023 (late summer/early fall) | High consumption → model learns peak usage patterns |
| **Data2** | Nov–Dec 2023 & Jul 2024 (winter & mid-summer) | Moderate consumption periods |
| **Data3** | All other intervals | Mostly zero/minimal consumption |
| **Data4** | Entire dataset | Global model for overarching patterns |

---

## 🧠 Advanced Ensemble Modeling
1. **Seven LightGBM Configurations** per segment:
   - **Precise:** deep, conservative trees (max_depth=8).
   - **Feature-Selective:** aggressive feature sampling.
   - **Robust:** outlier-resistant (min_data_in_leaf=20).
   - **Deep Forest:** very deep trees with many estimators.
   - **Highly Regularized:** strong L1/L2 penalties.
   - **Fast Learner:** high learning rate for rapid convergence.
   - **Balanced:** tuned for the bias-variance tradeoff.
2. **Cross-Validation:** 5-fold CV to evaluate each base model.
3. **Bayesian Optimization:** Searched for optimal ensemble weights per fold instead of training a meta-model (see the sketch below).
4. **Multi-Level Blending:** Combined segment-specific ensembles with global weighting.
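To illustrate step 3, here is a hedged sketch of the weight search using the `bayesian-optimization` package from `requirements.txt`. It is not the notebook's exact code: the toy targets and predictions, the bounds, and the iteration counts are illustrative assumptions.

```python
# Hedged sketch: Bayesian search for ensemble weights that minimize RMSE.
# Toy data stands in for real out-of-fold targets and base-model predictions.
import numpy as np
from bayes_opt import BayesianOptimization

rng = np.random.default_rng(42)
y_true = rng.gamma(2.0, 3.0, size=500)                        # stand-in fold targets
preds = [y_true + rng.normal(0, s, 500) for s in (1, 2, 3)]   # stand-in base models

def neg_rmse(w1, w2, w3):
    """Blend three base models; weights are normalized to sum to 1."""
    w = np.array([w1, w2, w3])
    w = w / (w.sum() + 1e-9)
    blend = sum(wi * p for wi, p in zip(w, preds))
    return -np.sqrt(np.mean((y_true - blend) ** 2))           # maximize => minimize RMSE

optimizer = BayesianOptimization(
    f=neg_rmse,
    pbounds={"w1": (0, 1), "w2": (0, 1), "w3": (0, 1)},
    random_state=42,
)
optimizer.maximize(init_points=5, n_iter=25)
print(optimizer.max["params"])                                # best raw weights
```

In the full solution, this search would run per fold and per segment, over seven base models rather than three.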
---

## 🔗 Dataset Access via Kaggle
Due to GitHub file size limits, raw CSV/XLSX files are **not** included here. Please download them from Kaggle:

1. Visit:
   👉 [IBM SkillsBuild Hydropower Climate Optimisation (Updated)](https://www.kaggle.com/datasets/muhammadqasimshabbir/ibmskillsbuildhydropowerclimateoptimisationupdated)
2. Click **Download All**.
3. Unzip and place the files into this repo under:
```
datasets/
├── Data/
│   └── Data.csv
├── SampleSubmission.csv
└── Climate Data/
    └── Kalam Climate Data.xlsx
```

---

## 📂 Repository Structure
```
.
├── datasets/                    # Kaggle data (not committed)
├── output/                      # Generated data, models, plots
├── first_place_solution.ipynb   # Full notebook with code & docs
├── final_submission.csv         # Submission file matching leaderboard
├── requirements.txt             # Python dependencies
└── README.md                    # This documentation
```

---

## 🛠️ How to Run
1. Clone the repo.
2. Download and place the dataset as described above.
3. Create and activate a Conda/Python environment, then install the dependencies:
```bash
pip install -r requirements.txt
```
4. Open `first_place_solution.ipynb` in Jupyter/Colab.
5. Run the cells top to bottom; full training takes ~25–30 minutes.

---

## 📈 Performance Metrics
- **Public Leaderboard RMSE:** 5.454981323
- **Private Leaderboard RMSE:** 4.312706649
- **Ensemble Improvement:** Bayesian weighting improved RMSE by ~4.2%

---

## 🕒 Run Time
- **Data preprocessing & feature engineering:** ~3–5 minutes
- **Model training (4 segments × 7 configs × 5-fold CV):** ~20–25 minutes
- **Inference & submission export:** < 1 minute

---

## ❓ Additional Notes
- **Reproducibility:** `seed_everything(42)` ensures consistent results (a sketch is given at the end of this document).
- **Logging:** Comprehensive logging for each stage (ETL, training, inference).
- **Error Handling:** Fallback models in case of training errors; robust NaN-column filtering.

---

## 📞 Contact
Bakary Sidibé
✉️ bakarysidibe1995@gmail.com
🔗 [LinkedIn](https://www.linkedin.com/in/bakary-sidibe-256419111/)

---

*Thank you for reviewing this solution!*
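---

**Appendix: reproducibility helper.** A minimal sketch of what a `seed_everything` helper typically looks like; this is a hypothetical version, and the notebook's actual implementation may differ (the LightGBM models also take their own seed parameters).

```python
# Hypothetical seed_everything sketch; the notebook's real helper may differ.
import os
import random

import numpy as np

def seed_everything(seed: int = 42) -> None:
    """Pin the common sources of randomness used by the pipeline."""
    random.seed(seed)                          # Python's built-in RNG
    np.random.seed(seed)                       # NumPy's global RNG
    os.environ["PYTHONHASHSEED"] = str(seed)   # convention; hash randomization
                                               # is fixed at interpreter startup

seed_everything(42)
```
--------------------------------------------------------------------------------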